CS 646 (Fall 2016) Homework 3

Deadline: 11:59pm, Oct 31st, 2016 (EST)

Access the following resources before you start working on HW3. Download and uncompress the index file and other data from Moodle:

- index_lucene_robust04.tar.gz (about 550 MB)
- stopwords_inquery: a list of stopwords
- queries: 249 search queries
- qrels: relevance judgments for the 249 queries

The index is similar to the krovetz index used in HW2, except that this index stores document vectors. The docno and content of each document are stored in the docno and content fields, respectively. Please read the tutorial on how to access a stored document vector from a Lucene index:
https://github.com/jiepujiang/cs646_tutorials#lucene-6

Check out the Lucene starter code from GitHub:
https://github.com/jiepujiang/cs646_hw3

1 Unigram Language Model (10 points)

1.1 Log Probability (5 points)

Implement MySearcher.logpdc(String field, String docno). It should return the log probability of the document with the specified docno (external ID) according to the corpus language model of the index, P(D | Corpus):

    \log P(D \mid Corpus) = \sum_{w \in D} c(w, D) \log P(w \mid Corpus)    (1)

    P(w \mid Corpus) = \frac{c(w, Corpus)}{|Corpus|}    (2)

D is a document and w denotes each term in D. c(w, D) and c(w, Corpus) are the frequencies of the term w in D and in the corpus, respectively. |Corpus| is the length of the whole corpus (the total number of terms). You will need to retrieve the stored document vectors in order to compute logpdc.

Report the log probability of the following documents (the table provides the result for one document to help you verify your implementation):

    docno               log P(D | Corpus)
    FBIS4-41991         -8438.50779524626
    FBIS4-67701
    FT921-7107
    FR940617-0-00103
    FR941212-0-00060
    FBIS3-25118
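The following is a minimal sketch of the core computation for logpdc, assuming the external docno has already been resolved to Lucene's internal document ID (the course tutorial shows how to do that); apart from the Lucene API calls, the method shape and names are illustrative rather than the required MySearcher.logpdc signature.

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // log P(D | Corpus) for one document, computed from its stored term vector (Equations 1 and 2).
    static double logPDC(IndexReader index, String field, int docid) throws IOException {
        long corpusLength = index.getSumTotalTermFreq(field);       // |Corpus|
        Terms vector = index.getTermVector(docid, field);            // stored document vector of D
        TermsEnum terms = vector.iterator();
        double logp = 0;
        BytesRef w;
        while ((w = terms.next()) != null) {
            long cwd = terms.totalTermFreq();                         // c(w, D)
            long cwc = index.totalTermFreq(new Term(field, BytesRef.deepCopyOf(w))); // c(w, Corpus)
            logp += cwd * Math.log((double) cwc / corpusLength);     // c(w, D) * log P(w | Corpus)
        }
        return logp;
    }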

1.2 Comparing Log Probability (5 points)

In the query likelihood model (QL), we rank results by P(q | D) or log P(q | D). If log P(q | D_1) > log P(q | D_2), QL believes that D_1 is more relevant (or more similar) to the query q than D_2, and thus ranks D_1 at a higher position.

Does log P(q_1 | D) > log P(q_2 | D) also indicate that q_1 is more similar to the content of the document D than q_2? If yes, discuss why; if no, discuss how to adjust log P(q | D) to make it appropriate for comparing the relevance/similarity of a document with different queries.

2 Query Likelihood and Smoothing (40 points)

2.1 Implementation (10 points)

Implement two scoring functions:

- MySearcher.QLJMSmoothing: QL with Jelinek-Mercer smoothing:

      P(t \mid D) = (1 - \lambda) \frac{c(t, D)}{|D|} + \lambda P(t \mid Corpus)    (3)

- MySearcher.QLDirichletSmoothing: QL with Dirichlet smoothing:

      P(t \mid D) = \frac{c(t, D) + \mu P(t \mid Corpus)}{|D| + \mu}    (4)

MySearcher implements the document-at-a-time search algorithm with a replaceable scoring function (as defined in the ScoringFunction interface). After implementing the two scoring functions, you should be able to run the main function of MySearcher. It retrieves the top 1,000 results for each of the 249 queries in the Robust04 dataset using a scoring function, and it reports the P@10 and average precision (AP) of each query as well as the mean P@10 and AP over the 249 queries.

You can verify your Dirichlet smoothing by comparing your results with those produced by LuceneQLSearcher's main function. LuceneQLSearcher includes an implementation of QL with Dirichlet smoothing using a slightly faster approach optimized for term-at-a-time search.

We need to use QL with Dirichlet smoothing a lot in the follow-up questions. You can use either your own implementation or LuceneQLSearcher in those questions. They take similar input parameters, but the latter is probably faster (note that some of the experiments may take 30 minutes or longer).
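The two smoothed probabilities are small enough to sketch directly; the helpers below follow Equations 3 and 4, but the starter code's ScoringFunction interface may expect a different signature, so treat the method and parameter names (ctd = c(t, D), doclen = |D|, ptc = P(t | Corpus)) as illustrative.

    // Jelinek-Mercer smoothing, Equation (3)
    static double jelinekMercer(double ctd, double doclen, double ptc, double lambda) {
        return (1 - lambda) * (ctd / doclen) + lambda * ptc;
    }

    // Dirichlet smoothing, Equation (4)
    static double dirichlet(double ctd, double doclen, double ptc, double mu) {
        return (ctd + mu * ptc) / (doclen + mu);
    }

The log QL score of a document is then the sum of log P(t | D) over the query terms, e.g. adding c(t, q) * Math.log(dirichlet(...)) for each distinct query term t.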

2.2 Smoothing Parameters (10 points)

Divide the 249 queries into a training set (No. 301-450) and a testing set (No. 601-700). Separately plot the mean P@10 and AP of the queries in the training and the testing set for:

- QL with Jelinek-Mercer smoothing, where λ ranges from 0.1 to 0.9 with step 0.1 (set the x-axis to λ and the y-axis to P@10 or AP).
- QL with Dirichlet smoothing, where µ ranges from 500 to 5000 with step 500 (set the x-axis to µ and the y-axis to P@10 or AP).

To examine the stability of the two smoothing approaches across different query sets, we apply the best λ or µ from the training set (the value that gives the best mean P@10 or AP) to the testing set. Usually the results will be worse than those obtained with the best λ or µ for the testing set. Report the differences in mean P@10 and AP compared with those obtained with the best λ or µ for the testing set. Discuss which smoothing approach seems more effective and stable on the Robust04 dataset based on your results.

2.3 Smoothing, TF, and Document Length (20 points)

We discussed in class that the log query likelihood score can be considered a summation of each query term's score in the document, as in Equation 5, where a term's score in a document is score(t, D) = log P(t | D) (Equation 6):

    score(q, D) = \sum_{t \in q} score(t, D)    (5)

    score(t, D) = \log P(t \mid D)    (6)

For convenience, we can consider score(t, D) as a continuous and differentiable function of c_{t,D} (the frequency of t in the document D) and other variables. This makes it easier to examine the influence of c_{t,D} on score(t, D) and consequently on score(q, D).

    \frac{\partial score(t, D)}{\partial c_{t,D}} = \frac{1}{P(t \mid D)} \frac{\partial P(t \mid D)}{\partial c_{t,D}}    (7)

For Jelinek-Mercer smoothing:

    \frac{\partial score(t, D)}{\partial c_{t,D}} = \frac{1}{c_{t,D} + \frac{\lambda}{1 - \lambda} |D| P(t \mid Corpus)}    (8)

For Dirichlet smoothing:

    \frac{\partial score(t, D)}{\partial c_{t,D}} = \frac{1}{c_{t,D} + \mu P(t \mid Corpus)}    (9)

As we discussed in class, according to Equations 7, 8, and 9:

- When c_{t,D} increases, ∂score(t, D)/∂c_{t,D} decreases: thus QL penalizes repeated occurrences of the same term (regardless of which smoothing approach is employed).
- score(t, D) increases at a slower rate as c_{t,D} increases if t is a common term in the corpus (t has a higher P(t | Corpus)) compared with a less common term. Thus, QL with smoothing penalizes common terms in the corpus (similar to IDF).
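For reference, Equations 8 and 9 follow from applying Equation 7 to the smoothed P(t | D) in Equations 3 and 4; a brief sketch of the differentiation with respect to c_{t,D}:

    % Jelinek-Mercer: P(t|D) = (1-\lambda) c_{t,D}/|D| + \lambda P(t|Corpus)
    \frac{\partial score(t, D)}{\partial c_{t,D}}
      = \frac{(1-\lambda)/|D|}{(1-\lambda)\frac{c_{t,D}}{|D|} + \lambda P(t \mid Corpus)}
      = \frac{1}{c_{t,D} + \frac{\lambda}{1-\lambda} |D| P(t \mid Corpus)}

    % Dirichlet: P(t|D) = (c_{t,D} + \mu P(t|Corpus)) / (|D| + \mu)
    \frac{\partial score(t, D)}{\partial c_{t,D}}
      = \frac{1/(|D| + \mu)}{(c_{t,D} + \mu P(t \mid Corpus)) / (|D| + \mu)}
      = \frac{1}{c_{t,D} + \mu P(t \mid Corpus)}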

Now discuss the following questions:

2.3.1 According to Equations 8 and 9, how do the two approaches differ in terms of discounting repeated term occurrences? How do the length of the document and the smoothing parameters influence the discount? (10 points)

2.3.2 Analyze to what extent the two smoothing approaches penalize longer documents. Discuss, with the other conditions (c_{t,D} and P(t | Corpus)) being equal, when one smoothing approach will penalize longer documents to a greater extent than the other. (10 points)

Note that 2.3.1 asks about the influence of |D| and the smoothing parameters on ∂score(t, D)/∂c_{t,D}, while 2.3.2 asks about the influence of |D| and the smoothing parameters on score(t, D) itself.

3 Relevance Model (50 points)

3.1 Implementing RM1 (10 points)

As we introduced in Lecture 11, the relevance model RM1 [1] is a pseudo-relevance feedback approach for estimating query language models from the top k search results of QL. RM1 models the joint probability of a term t with a query q as in Equation 10, where P_QL(q | D_i) is the query likelihood score of q for the document D_i, and P_FB(t | D_i) is the probability of a feedback term t from the document D_i.

    P(t, q) \approx \sum_{i=1}^{k} P_{FB}(t \mid D_i) P_{QL}(q \mid D_i)    (10)

We can use different language models for P_QL(q | D_i) and P_FB(t | D_i). Here we use Dirichlet smoothing for both P_QL(q | D_i) and P_FB(t | D_i), but allow the smoothing parameters µ and µ_FB to be different.

    P_{QL}(q \mid D_i) = \prod_{t \in q} \frac{c(t, D_i) + \mu P(t \mid Corpus)}{|D_i| + \mu}    (11)

    P_{FB}(t \mid D_i) = \frac{c(t, D_i) + \mu_{FB} P(t \mid Corpus)}{|D_i| + \mu_{FB}}    (12)

Since P(t | q) ∝ P(t, q), we estimate a query model P(t | q) by normalizing Equation 10 as Equation 13. Here V is the vocabulary of the query language model. Theoretically, V can be set to the vocabulary of the corpus, but this is practically very inefficient.

Normally, we just set V to the terms that appear in the top k documents. Also, remember to exclude stopwords (provided in the stopwords_inquery file) from V: since we did not remove stopwords at indexing time, the document vectors include stopwords.

    P(t \mid q) = \frac{\sum_{i=1}^{k} P_{FB}(t \mid D_i) P_{QL}(q \mid D_i)}{\sum_{t_j \in V} \sum_{i=1}^{k} P_{FB}(t_j \mid D_i) P_{QL}(q \mid D_i)}    (13)

We can apply P(t | q) to the KL-divergence language model to rank search results. As we discussed in Lecture 9, the KLD language model is equivalent to a query likelihood model (using the log QL score) in which the frequencies of the query terms are replaced by their probabilities P(t | q). In that case, we can simply use LuceneQLSearcher to perform a KLD search after estimating the query model P(t | q).

    KLD(q \parallel D) = \sum_{t \in q} P(t \mid q) \log \frac{P(t \mid q)}{P(t \mid D)} \stackrel{rank}{=} -\sum_{t \in q} P(t \mid q) \log P(t \mid D)    (14)

As it is slow to process long queries, we often use only the top n highest-probability terms from P(t | q) for ranking. Note that you still need to normalize the probabilities by setting V to the whole set of terms that appear in the top k documents, rather than only the top n terms in V.

Implement RM1 in LuceneQLSearcher.estimateQueryModelRM1. RM1 takes the following input parameters:

- q: the original search query (as a list of tokenized query terms)
- µ: the Dirichlet smoothing parameter for QL (Equation 11)
- µ_FB: the feedback Dirichlet smoothing parameter (Equation 12)
- k: the number of top feedback documents
- n: the number of top feedback terms

The output of LuceneQLSearcher.estimateQueryModelRM1 is a HashMap storing the query model. LuceneQLSearcher.search can take such a query model as input and perform a QL search (the implementation is in AbstractQLSearcher).
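The sketch below shows the core of the RM1 estimation (Equations 10-13), assuming you already have the internal docids and the QL likelihoods P_QL(q | D_i) of the top k feedback documents; everything other than the Lucene API calls (the method name, the inputs, and the stopword handling) is illustrative and will need to be adapted to the starter code.

    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    // RM1: accumulate P_FB(t | D_i) * P_QL(q | D_i) over the top k feedback documents
    // (Equation 10), then normalize over the vocabulary V of those documents (Equation 13).
    static Map<String, Double> estimateRM1(IndexReader index, String field,
                                           int[] fbDocids, double[] qlScores,
                                           double mufb) throws IOException {
        long corpusLength = index.getSumTotalTermFreq(field);           // |Corpus|
        Map<String, Double> model = new HashMap<>();
        for (int i = 0; i < fbDocids.length; i++) {
            Terms vector = index.getTermVector(fbDocids[i], field);      // stored document vector of D_i
            long doclen = vector.getSumTotalTermFreq();                   // |D_i|
            TermsEnum terms = vector.iterator();
            BytesRef term;
            while ((term = terms.next()) != null) {
                String t = term.utf8ToString();
                // skip stopwords here, e.g. if (stopwords.contains(t)) continue;
                long ctd = terms.totalTermFreq();                         // c(t, D_i)
                long ctc = index.totalTermFreq(new Term(field, t));       // c(t, Corpus)
                double ptc = (double) ctc / corpusLength;                 // P(t | Corpus)
                double pfb = (ctd + mufb * ptc) / (doclen + mufb);        // P_FB(t | D_i), Equation (12)
                model.merge(t, pfb * qlScores[i], Double::sum);           // add P_FB(t | D_i) * P_QL(q | D_i)
            }
        }
        double norm = model.values().stream().mapToDouble(Double::doubleValue).sum();
        for (Map.Entry<String, Double> e : model.entrySet()) {
            e.setValue(e.getValue() / norm);                              // normalize to P(t | q), Equation (13)
        }
        return model;
    }

Only the top n terms of this model are kept when issuing a new search, but the normalization is done over the whole vocabulary of the top k documents, as required above.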

3.2 µ_FB (5 points)

Lv and Zhai [2] reported that setting µ_FB = 0 is optimal for RM3, which indicates that smoothing seems unnecessary for P_FB(t | D_i). Please verify whether this is true for RM1 on the Robust04 corpus. Following Lv and Zhai's work [2], we simply set µ = 1000, k = 10, and n = 100 in Problem 3.2.

Report the mean P@10 and AP of the 249 queries for RM1 with µ_FB ranging from 0 to 2500 with step 500. Discuss whether or not the result is consistent with Lv and Zhai's results [2], and explain why smoothing is necessary or unnecessary for P_FB(t | D_i) based on your results.

3.3 New Search vs. Re-ranking (15 points)

The default way of using a query model is to use the top n feedback terms to perform a new search, as described in 3.1. An alternative approach is to simply rerank the top m results of the original query using the full query model (including all terms in V).

Implement the reranking approach described above. You can simply use LuceneQLSearcher to retrieve the top m results, load their document vectors, and then rerank the results by the full RM1 model (keeping all query terms). Report the mean P@10 and AP of the reranking approach with m ranging from 500 to 5000 with step 500. Compare the results with those of the default approach using the top 100 feedback terms (n = 100). For both approaches, set the other parameters to µ = 1000, µ_FB = 0, and k = 10. Discuss the pros and cons of the two approaches based on your results.

3.4 RM3 (10 points)

Implement RM3, which is simply a mixture of the RM1 model and the original query's MLE model, where λ_org controls the weight of the original query:

    P_{RM3}(t \mid q) = (1 - \lambda_{org}) P_{RM1}(t \mid q) + \lambda_{org} P_{MLE}(t \mid q)    (15)

Plot the mean P@10 and AP of the 249 queries for RM3 with λ_org ranging from 0 to 1 with step 0.1. Set the other parameters to µ = 1000, µ_FB = 0, k = 10, and n = 100.
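The RM3 interpolation in Equation 15 is a one-line mixture once the RM1 model is available as a term-to-probability map; the sketch below uses illustrative names rather than the starter code's actual signatures.

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // RM3: interpolate the RM1 query model with the original query's MLE model (Equation 15).
    static Map<String, Double> estimateRM3(Map<String, Double> rm1, List<String> queryTerms,
                                           double lambdaOrg) {
        Map<String, Double> rm3 = new HashMap<>();
        // (1 - lambda_org) * P_RM1(t | q)
        for (Map.Entry<String, Double> e : rm1.entrySet()) {
            rm3.put(e.getKey(), (1 - lambdaOrg) * e.getValue());
        }
        // lambda_org * P_MLE(t | q), where P_MLE(t | q) = c(t, q) / |q|
        for (String t : queryTerms) {
            rm3.merge(t, lambdaOrg / queryTerms.size(), Double::sum);
        }
        return rm3;
    }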

3.5 RM3 vs. QL (10 points)

As we introduced in class, there are some general concerns regarding pseudo-relevance feedback:

- The approach is recall-oriented and has limited improvements on precision at top ranks. Note that average precision (AP) is a recall-oriented measure and P@10 is a precision-oriented one (we will discuss the details in Lectures 12 and 13).
- The quality of the expanded query depends on the quality of the original query: in case the original query does not perform well, the top-ranked results for the original query are mostly not relevant, and consequently the assumption behind pseudo-relevance feedback no longer holds.

Please compare RM3 (setting λ_org to the best value you found in 3.4, with µ = 1000, µ_FB = 0, k = 10, and n = 100) with QL (µ = 1000). Discuss whether or not your results suggest that RM3 has these issues on Robust04.

References

[1] V. Lavrenko and W. B. Croft. Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 120-127, 2001.

[2] Y. Lv and C. Zhai. A comparative study of methods for estimating query language models with pseudo feedback. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 1895-1898, 2009.

Appendix A: A Checklist for Your Submission

- A report including the experiment results and answers to all questions (in PDF).
- All your source code; write a brief readme file describing the purpose of the program if necessary.

Pack everything as a .zip or .tar.gz file and upload it to Moodle's HW3 submission link. If you decide to use the 5-day extension (you can only use it once during the whole semester), send an email to let the instructor and the TA know.