University of Illinois at Urbana-Champaign
Midterm Examination
CS410 Introduction to Text Information Systems
Professor ChengXiang Zhai
TA: Azadeh Shakery
Time: 2:00-3:15pm, Mar. 14, 2007
Place: Room 1105, Siebel Center
Name: ____________________    NetID: ____________________

1. [10 points] Evaluation

(a) [5/10 points] Suppose we have a topic (i.e., a query) with a total of 5 relevant documents in the whole collection. A system has retrieved 6 documents whose relevance status is [+, +, -, -, -, +] in the order of ranking. A "+" (or "-") indicates that the corresponding document is relevant (or non-relevant). For example, the first two documents are relevant, while the third is non-relevant, etc. Compute the precision, recall, and the (non-interpolated) mean average precision for this result.

precision = 3/6 = 0.5
recall = 3/5 = 0.6
MAP = (1/1 + 2/2 + 3/6)/5 = 0.5

(b) [3/10 points] Briefly explain why precision at k (e.g., k = 10) documents is not as good as the (mean) average precision for comparing two retrieval systems in terms of their ranking accuracy.

Precision at k documents is not sensitive to the order of the documents as long as they are among the top k. MAP is sensitive to the entire ranking and also has a recall-oriented aspect.

(c) [2/10 points] Does precision at k (e.g., k = 10) documents have any advantage over the (mean) average precision as a measure for evaluating retrieval performance?

Yes, in cases where the focus is only on the top k documents. For example, in Web search users usually look only at the first few results.
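A minimal Python sketch of these computations (the function name is illustrative, not from the exam); it reproduces precision = 0.5, recall = 0.6, and average precision = 0.5 for the ranking above.

```python
def precision_recall_ap(ranking, total_relevant):
    """Compute precision, recall, and (non-interpolated) average precision.

    ranking: list of booleans, True if the document at that rank is relevant.
    total_relevant: number of relevant documents in the whole collection.
    """
    retrieved_relevant = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranking, start=1):
        if is_relevant:
            retrieved_relevant += 1
            precision_sum += retrieved_relevant / rank  # precision at this relevant rank
    precision = retrieved_relevant / len(ranking)
    recall = retrieved_relevant / total_relevant
    avg_precision = precision_sum / total_relevant
    return precision, recall, avg_precision

# Ranking [+, +, -, -, -, +] with 5 relevant documents in the collection:
print(precision_recall_ap([True, True, False, False, False, True], 5))
# -> (0.5, 0.6, 0.5)
```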

2. [25 points] Bayes Rule

Author H and author T are co-authoring a paper in the following way:
(a) Each word is written independently.
(b) When writing a word, they first toss a coin to decide who will write the word. The coin is known to show up as HEAD 80% of the time. If the coin shows up as HEAD, author H writes the word; otherwise, author T writes the word.
(c) If it is author H's turn to write, he writes the word by simply drawing a word according to word distribution θ_H. Similarly, if it is author T's turn to write, he writes the word by drawing a word according to word distribution θ_T.

Suppose the two distributions θ_H and θ_T are defined as follows:

Word w      p(w|θ_H)    p(w|θ_T)
the         0.3         0.3
computer    0.1         0.2
data        0.1         0.1
baseball    0.2         0.1
game        0.2         0.1
...         ...         ...

(a) [6/25 points] What is the probability that they would write "the" as the first word of the paper? Show your calculation.

p(w = "the") = p(w = "the" | θ_H) p(θ_H) + p(w = "the" | θ_T) p(θ_T) = 0.8 · 0.3 + 0.2 · 0.3 = 0.3

(b) [4/25 points] What is the probability that they would write "the" as the second word of the paper?

Since each word is drawn independently, the probability of writing "the" as the second word is the same as the probability of writing "the" as the first word, i.e., 0.3.

(c) [6/25 points] Suppose we observe that the first word they wrote is "data". What is the probability that this word was written by author H (i.e., p(θ_H | w = "data"))? Show your calculation.

p(θ_H | w = "data") = p(w = "data" | θ_H) p(θ_H) / [p(w = "data" | θ_H) p(θ_H) + p(w = "data" | θ_T) p(θ_T)] = (0.1 · 0.8) / (0.1 · 0.8 + 0.1 · 0.2) = 0.8

(d) [4/25 points] Imagine that we observe a very long paper written by them (e.g., with more than 10,000 words). Among the 5 words shown in the table above (i.e., "the", "computer", "data", "baseball", "game"), which one would you expect to occur least frequently in the paper? Briefly explain why.

"data" would occur least frequently, since p(w = "data") = 0.8 · 0.1 + 0.2 · 0.1 = 0.1 is the smallest marginal probability among the five words.
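A small Python sketch of these calculations (variable and function names are my own, not from the exam); it reproduces p("the") ≈ 0.3 and p(θ_H | "data") ≈ 0.8, and confirms that "data" has the smallest marginal probability.

```python
# Two-author mixture model from Question 2: author H writes with probability 0.8.
p_H, p_T = 0.8, 0.2
theta_H = {"the": 0.3, "computer": 0.1, "data": 0.1, "baseball": 0.2, "game": 0.2}
theta_T = {"the": 0.3, "computer": 0.2, "data": 0.1, "baseball": 0.1, "game": 0.1}

def p_word(w):
    """Marginal probability of writing word w (sum over the two authors)."""
    return p_H * theta_H[w] + p_T * theta_T[w]

def posterior_H(w):
    """Posterior probability that author H wrote the observed word w (Bayes rule)."""
    return p_H * theta_H[w] / p_word(w)

print(p_word("the"))        # ≈ 0.3
print(posterior_H("data"))  # ≈ 0.8
# Least frequent word in a long paper = smallest marginal probability:
print(min(theta_H, key=p_word))  # 'data'
```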

(e) [5/25 points] Suppose we don't know θ_H, but observe a paper D known to be written solely by author H. That is, the coin somehow always showed up as HEAD when they wrote the paper. Suppose

D = "the computer data the computer game the computer data game"

and we would like to use the maximum likelihood estimator to estimate θ_H. Fill in the values for the following estimated probabilities:

Word w      p(w|θ_H)
the         0.3
computer    0.3
data        0.2
baseball    0
game        0.2

3. [10 points] Zipf's law

Assume that the frequency distribution of words in a collection of documents C roughly follows Zipf's law, r · p(w_r|C) = 0.1, where r = 1, 2, 3, ... is the rank of a word in the descending order of frequency, w_r is the word at rank r, and p(w_r|C) is the probability (relative frequency) of word w_r in the collection. What is the probability of the most frequent word in the collection? What is the probability of the second most frequent word in the collection?

From Zipf's law: p(w_r|C) = 0.1/r
Probability of the most frequent word: p(w_1|C) = 0.1/1 = 0.1
Probability of the second most frequent word: p(w_2|C) = 0.1/2 = 0.05
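A short Python sketch (helper names are my own) that reproduces the maximum likelihood estimates above from the word counts in D, and the Zipf probabilities 0.1/r.

```python
from collections import Counter

# Maximum likelihood estimate of theta_H from the observed document D:
D = "the computer data the computer game the computer data game".split()
counts = Counter(D)
vocabulary = ["the", "computer", "data", "baseball", "game"]
mle = {w: counts[w] / len(D) for w in vocabulary}
print(mle)  # {'the': 0.3, 'computer': 0.3, 'data': 0.2, 'baseball': 0.0, 'game': 0.2}

# Zipf's law with r * p(w_r|C) = 0.1, i.e. p(w_r|C) = 0.1 / r:
def zipf_probability(r):
    return 0.1 / r

print(zipf_probability(1), zipf_probability(2))  # 0.1 0.05
```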

4. [20 points] Vector Space Model

Consider the following TF-IDF retrieval formula:

score(D, Q) = Σ_{w ∈ Q ∩ D} (c(w, D) + IDF(w)) / IDF(w)

where c(w, D) is the raw count of word w in document D, and IDF(w) is the IDF of word w.

(a) [8 points] Point out at least two reasons why the formula above is unlikely to perform well empirically.

- Wrong use of IDF: a larger IDF leads to a lower term weight.
- No document length normalization.
- No TF normalization.
- No query term frequency (qtf) component.

(b) [6 points] How would IDF(w) change (i.e., increase, decrease, or stay the same) in each of the following cases: (1) adding the word w to a document; (2) making each document twice as long as its original length by concatenating the document with itself?

(1) If the document already contains w, IDF(w) will not change; otherwise it will decrease, since the number of documents containing w increases.
(2) IDF(w) does not change, since the number of documents containing w does not change.
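A brief Python sketch of this (flawed) scoring formula, assuming IDF(w) = log(N / df(w)); the names and the toy collection are illustrative. It makes the problem with the IDF usage visible: for the same raw count, a rarer (more informative) word contributes less to the score, not more.

```python
import math

def idf(word, doc_freq, num_docs):
    """IDF assumed here as log(N / df(w))."""
    return math.log(num_docs / doc_freq[word])

def score(doc_counts, query, doc_freq, num_docs):
    """The exam's formula: sum over w in Q and D of (c(w,D) + IDF(w)) / IDF(w)."""
    total = 0.0
    for w in query:
        if doc_counts.get(w, 0) > 0:
            total += (doc_counts[w] + idf(w, doc_freq, num_docs)) / idf(w, doc_freq, num_docs)
    return total

# Toy collection of 100 documents: "the" appears in 90 of them, "baseball" in 5.
doc_freq = {"the": 90, "baseball": 5}
doc_counts = {"the": 2, "baseball": 2}
print(score(doc_counts, ["the"], doc_freq, 100))       # ≈ 1 + 2/log(100/90) ≈ 20.0
print(score(doc_counts, ["baseball"], doc_freq, 100))  # ≈ 1 + 2/log(100/5)  ≈ 1.67
```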

(c) [6 points] Briefly sketch how you may use the vector space retrieval method (e.g., pivoted normalization retrieval function, Rocchio feedback method) to design a spam filter. Let S = {s_1, ..., s_n} be a set of n known spam messages, and R = {r_1, ..., r_m} a set of m regular email messages. Describe how you can build a spam filter based on S and R such that it can process any new email message e and decide whether it is spam.

A possible solution:
(1) Represent each email message as a term vector.
(2) Compute the centroids of S and R, s_c and r_c.
(3) Given a new email message e, measure the similarity between e and the centroids s_c and r_c. Report e as spam if it is more similar to s_c, and as non-spam if it is more similar to r_c.
(4) Once we get feedback from the user on whether e is spam or not, add e to S or R accordingly and update the centroids from step (2) according to the Rocchio feedback formula.
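A compact Python sketch of this centroid-based classifier (all function names and the toy messages are my own; for brevity the vectors are raw term-frequency vectors rather than pivoted-normalized weights, and the Rocchio update step is omitted).

```python
from collections import Counter
import math

def term_vector(message):
    """Represent an email message as a raw term-frequency vector."""
    return Counter(message.lower().split())

def centroid(vectors):
    """Average of a set of term vectors."""
    total = Counter()
    for v in vectors:
        total.update(v)
    return {w: c / len(vectors) for w, c in total.items()}

def cosine(u, v):
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def is_spam(message, spam_centroid, regular_centroid):
    """Report spam if the message is more similar to the spam centroid."""
    v = term_vector(message)
    return cosine(v, spam_centroid) > cosine(v, regular_centroid)

# Toy training sets S (spam) and R (regular mail):
S = [term_vector(m) for m in ["win free money now", "free prize click now"]]
R = [term_vector(m) for m in ["meeting notes for the project", "lecture schedule update"]]
print(is_spam("claim your free prize", centroid(S), centroid(R)))  # True
```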

5. [25 points] Dirichlet prior smoothing and retrieval

Suppose we have a document collection with an extremely small vocabulary of only 6 words w_1, ..., w_6. The following table shows the estimated reference language model p(w|REF) using the whole collection of documents (2nd column) and the word counts for document d_1 (3rd column) and d_2 (4th column), where c(w, d_i) is the count of word w in document d_i. Let Q = w_1 w_2 be a query.

Word    p(w|REF)    c(w, d_1)    c(w, d_2)
w_1     0.8         2            7
w_2     0.1         3            1
w_3     0.025       2            1
w_4     0.025       2            1
w_5     0.025       1            0
w_6     0.025       0            0
SUM     1.0         10           10

(a) [10 points] Suppose we do not smooth the language model for d_1 and d_2. Compute the likelihood of the query for both d_1 and d_2, i.e., p(Q|d_1) and p(Q|d_2). (Do not compute the log-likelihood.) Show your calculations. Which document would be ranked higher?

p(Q|d_1) = 2/10 · 3/10 = 0.06
p(Q|d_2) = 7/10 · 1/10 = 0.07
d_2 would be ranked higher.

(b) [10 points] Suppose we now smooth the language model for d_1 and d_2 using the Dirichlet prior smoothing method with μ = 10. Recompute the likelihood of the query for both d_1 and d_2, i.e., p(Q|d_1) and p(Q|d_2). (Do not compute the log-likelihood.) Show your calculations. Which document would be ranked higher this time?

p(Q|d_1) = [10/(10+10) · 2/10 + 10/(10+10) · 8/10] · [10/(10+10) · 3/10 + 10/(10+10) · 1/10] = 0.5 · 0.2 = 0.1
p(Q|d_2) = [10/(10+10) · 7/10 + 10/(10+10) · 8/10] · [10/(10+10) · 1/10 + 10/(10+10) · 1/10] = 0.75 · 0.1 = 0.075
d_1 would be ranked higher.

(c) [5 points] Intuitively, which document do you think should be ranked higher? d_1 or d_2? Why?

d_1 should be ranked higher, since it contains a larger number of occurrences of the more informative word w_2. From the estimated reference language model, it is clear that w_1 is a popular word and w_2 is more discriminative.
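A small Python sketch of these query-likelihood computations, with and without Dirichlet prior smoothing (the helper names are mine); it reproduces 0.06 / 0.07 unsmoothed and 0.1 / 0.075 with μ = 10.

```python
# Reference (collection) language model and document word counts from Question 5.
p_ref = {"w1": 0.8, "w2": 0.1, "w3": 0.025, "w4": 0.025, "w5": 0.025, "w6": 0.025}
d1 = {"w1": 2, "w2": 3, "w3": 2, "w4": 2, "w5": 1, "w6": 0}
d2 = {"w1": 7, "w2": 1, "w3": 1, "w4": 1, "w5": 0, "w6": 0}

def query_likelihood(query, doc, mu=0.0):
    """p(Q|d) under a unigram model; mu=0 gives the unsmoothed MLE, mu>0 gives
    Dirichlet prior smoothing: p(w|d) = (c(w,d) + mu*p(w|REF)) / (|d| + mu)."""
    doc_len = sum(doc.values())
    likelihood = 1.0
    for w in query:
        likelihood *= (doc[w] + mu * p_ref[w]) / (doc_len + mu)
    return likelihood

Q = ["w1", "w2"]
print(query_likelihood(Q, d1), query_likelihood(Q, d2))                # ≈ 0.06, 0.07
print(query_likelihood(Q, d1, mu=10), query_likelihood(Q, d2, mu=10))  # ≈ 0.1, 0.075
```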

6. [10 points] Compression and Information Theory

(a) [5 points] Write down the gamma code for the integer 11.

1110011

(b) [5 points] Efficient computation of smoothed language models. When a smoothed language model is involved in a problem, the computation generally involves a sum over all the words in the vocabulary, because smoothing assigns non-zero probability to every word. However, it is often possible to rewrite a formula so that the computation can be done much more efficiently than a brute-force sum over all words, as in the case of the query-likelihood retrieval formula. Assume that a smoothed unigram language model for document D is given by

p(w|D) = p_s(w|D)         if word w is seen in D
         α_D · p(w|C)     otherwise

where p_s(w|D) is the smoothed probability of a word seen in the document, p(w|C) is the collection language model, and α_D is a coefficient controlling the probability mass assigned to unseen words, so that all probabilities sum to one. Show that the following formula correctly computes the entropy of any document language model smoothed in this way. (Note that this formula has two parts: one part is a sum over the words in D, and the other involves the collection-model entropy, which can be pre-computed.)

H(θ_D) = Σ_{w∈D} [α_D p(w|C) log(α_D p(w|C)) − p_s(w|D) log p_s(w|D)] − α_D log α_D − α_D Σ_{w∈V} p(w|C) log p(w|C)

where V is the set of all the words in the vocabulary.

H(θ_D) = − Σ_{w∈V} p(w|D) log p(w|D)
       = − Σ_{w∈D} p_s(w|D) log p_s(w|D) − Σ_{w∉D} α_D p(w|C) log(α_D p(w|C))
       = − Σ_{w∈D} p_s(w|D) log p_s(w|D) − Σ_{w∈V} α_D p(w|C) log(α_D p(w|C)) + Σ_{w∈D} α_D p(w|C) log(α_D p(w|C))
       = Σ_{w∈D} [α_D p(w|C) log(α_D p(w|C)) − p_s(w|D) log p_s(w|D)] − α_D Σ_{w∈V} p(w|C) log(α_D p(w|C))
       = Σ_{w∈D} [α_D p(w|C) log(α_D p(w|C)) − p_s(w|D) log p_s(w|D)] − α_D log α_D Σ_{w∈V} p(w|C) − α_D Σ_{w∈V} p(w|C) log p(w|C)
       = Σ_{w∈D} [α_D p(w|C) log(α_D p(w|C)) − p_s(w|D) log p_s(w|D)] − α_D log α_D − α_D Σ_{w∈V} p(w|C) log p(w|C)
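A short Python sketch of both parts (function names and the toy model are my own): an Elias gamma encoder, and a numerical check that the rearranged entropy formula matches a direct entropy computation for a small smoothed model.

```python
import math

def elias_gamma(n):
    """Elias gamma code: unary code for the bit length, then the offset bits."""
    binary = bin(n)[2:]                          # e.g. 11 -> '1011'
    return "1" * (len(binary) - 1) + "0" + binary[1:]

print(elias_gamma(11))  # 1110011

# Toy smoothed model: vocabulary of 4 words, document contains 2 of them.
p_C = {"a": 0.4, "b": 0.3, "c": 0.2, "d": 0.1}   # collection model p(w|C)
p_s = {"a": 0.5, "b": 0.3}                       # smoothed probs of words seen in D
# alpha chosen so all probabilities sum to one (unseen words get alpha * p(w|C)):
alpha = (1 - sum(p_s.values())) / sum(p_C[w] for w in p_C if w not in p_s)

def p(w):
    return p_s[w] if w in p_s else alpha * p_C[w]

direct = -sum(p(w) * math.log(p(w)) for w in p_C)
rearranged = (sum(alpha * p_C[w] * math.log(alpha * p_C[w]) - p_s[w] * math.log(p_s[w]) for w in p_s)
              - alpha * math.log(alpha)
              - alpha * sum(p_C[w] * math.log(p_C[w]) for w in p_C))
print(abs(direct - rearranged) < 1e-9)  # True: the two forms agree
```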
