CS 646 (Fall 2016) Homework 3

Deadline: 11:59pm, Oct 31st, 2016 (EST)

Access the following resources before you start working on HW3. Download and uncompress the index file and other data from Moodle:

- index_lucene_robust04.tar.gz (about 550 MB)
- stopwords_inquery: a list of stopwords
- queries: 249 search queries
- qrels: relevance judgments for the 249 queries

The index is similar to the krovetz index used in HW2, except that this index stores document vectors. The docno and content of each document are stored in the docno and content fields, respectively. Please read the tutorial on how to access a stored document vector from a Lucene index:
https://github.com/jiepujiang/cs646_tutorials#lucene-6

Check out the Lucene starter code from GitHub:
https://github.com/jiepujiang/cs646_hw3

1 Unigram Language Model (10 points)

1.1 Log Probability (5 points)

Implement MySearcher.logpdc(String field, String docno). It should return the log probability of the document with the specified docno (external ID) according to the corpus language model of an index, log P(D|Corpus):

    log P(D|Corpus) = Σ_{w∈D} c(w, D) · log P(w|Corpus)    (1)
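The required implementation is in Java against the Lucene index; as a language-neutral illustration, here is a minimal Python sketch of Equation (1), using P(w|Corpus) = c(w, Corpus) / |Corpus|. The dictionaries standing in for the stored document vector and corpus statistics are hypothetical toy data, not values from the Robust04 index.

```python
import math

def log_p_doc_given_corpus(doc_vector, corpus_tf, corpus_length):
    """Log probability of a document under the corpus language model (Equation 1).

    doc_vector:    {term: frequency in the document}, i.e., a stored document vector
    corpus_tf:     {term: frequency in the whole corpus}
    corpus_length: total number of terms in the corpus (|Corpus|)
    """
    logp = 0.0
    for term, tf in doc_vector.items():
        p_w_corpus = corpus_tf[term] / corpus_length  # P(w|Corpus)
        logp += tf * math.log(p_w_corpus)             # c(w, D) * log P(w|Corpus)
    return logp

# Toy example (made-up counts):
doc = {"jaguar": 2, "car": 1}
corpus = {"jaguar": 10, "car": 50}
score = log_p_doc_given_corpus(doc, corpus, corpus_length=1000)
```

Since every P(w|Corpus) is below 1, the result is always negative, which is a quick sanity check for your Java implementation as well.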
where

    P(w|Corpus) = c(w, Corpus) / |Corpus|    (2)

D is a document and w denotes a term in D. c(w, D) and c(w, Corpus) are the frequency of the term w in D and in the corpus, respectively. |Corpus| is the length of the whole corpus (the total number of terms).

You will need to retrieve the stored document vectors in order to compute logpdc. Report the log probability of the following documents (the table provides the result of one document to help you verify your implementation).

    docno             log P(D|Corpus)
    FBIS4-41991       -8438.50779524626
    FBIS4-67701
    FT921-7107
    FR940617-0-00103
    FR941212-0-00060
    FBIS3-25118

1.2 Comparing Log Probability (5 points)

In the query likelihood model (QL), we rank results by P(q|D) or log P(q|D). If log P(q|D1) > log P(q|D2), QL believes that D1 is more relevant or more similar to the query q than D2, and thus it ranks D1 at a higher position.

Does log P(q1|D) > log P(q2|D) also indicate that q1 is more similar to the content of the document D than q2? If yes, discuss why; if no, discuss how to adjust log P(q|D) to make it appropriate for comparing the relevance/similarity of a document with different queries.

2 Query Likelihood and Smoothing (40 points)

2.1 Implementation (10 points)

Implement two scoring functions:

MySearcher.QLJMSmoothing: QL with Jelinek-Mercer smoothing:

    P(t|D) = (1 − λ) · c(t, D) / |D| + λ · P(t|Corpus)    (3)
MySearcher.QLDirichletSmoothing: QL with Dirichlet smoothing:

    P(t|D) = (c(t, D) + µ · P(t|Corpus)) / (|D| + µ)    (4)

MySearcher implements the document-at-a-time search algorithm with a replaceable scoring function (as defined in the ScoringFunction interface). After implementing the two scoring functions, you should be able to run the main function of MySearcher. It retrieves the top 1,000 results for each of the 249 queries in the Robust04 dataset using a scoring function. It reports the P@10 and average precision (AP) of each query and the mean values of P@10 and AP over the 249 queries.

You can verify your Dirichlet smoothing implementation by comparing your results with those from running LuceneQLSearcher's main function. LuceneQLSearcher includes an implementation of QL with Dirichlet smoothing using a slightly faster approach optimized for term-at-a-time search.

We need QL with Dirichlet smoothing a lot in the follow-up questions. You can use either your own implementation or LuceneQLSearcher; they take similar input parameters, but the latter is probably faster (note that some of the experiments may take 30 minutes or longer).

2.2 Smoothing Parameters (10 points)

Divide the 249 queries into a training set (No. 301–450) and a testing set (No. 601–700). Separately plot the mean P@10 and AP of queries in the training and the testing set for:

- QL with Jelinek-Mercer smoothing, where λ ranges from 0.1 to 0.9 with step 0.1 (set the x-axis to λ and the y-axis to P@10 or AP).
- QL with Dirichlet smoothing, where µ ranges from 500 to 5000 with step 500 (set the x-axis to µ and the y-axis to P@10 or AP).

To examine the stability of the two smoothing approaches across different query sets, we apply the best λ or µ from the training set (the value that gives the best mean P@10 or AP) to the testing set. Usually the results will be worse than those obtained using the best λ or µ in the testing set itself.
Report the differences in mean P@10 and AP compared with those obtained using the best λ or µ in the testing set. Discuss which smoothing approach seems more effective and stable on the Robust04 dataset based on your results.
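The scoring functions themselves must be implemented in Java against the ScoringFunction interface; the Python sketch below only illustrates the math of Equations 3 and 4, computing log P(q|D) as a sum of log term probabilities. The inputs are hypothetical: doc_vector maps a term to its in-document frequency, and p_corpus maps a term to P(t|Corpus).

```python
import math

def score_jm(query, doc_vector, doc_len, p_corpus, lam):
    """log P(q|D) with Jelinek-Mercer smoothing (Equation 3)."""
    s = 0.0
    for t in query:
        p = (1 - lam) * doc_vector.get(t, 0) / doc_len + lam * p_corpus[t]
        s += math.log(p)
    return s

def score_dirichlet(query, doc_vector, doc_len, p_corpus, mu):
    """log P(q|D) with Dirichlet smoothing (Equation 4)."""
    s = 0.0
    for t in query:
        p = (doc_vector.get(t, 0) + mu * p_corpus[t]) / (doc_len + mu)
        s += math.log(p)
    return s

# Toy example (made-up statistics):
q = ["a", "b"]
doc = {"a": 3}  # term "b" does not occur in the document
p_corpus = {"a": 0.01, "b": 0.001}
s_jm = score_jm(q, doc, doc_len=100, p_corpus=p_corpus, lam=0.5)
s_dir = score_dirichlet(q, doc, doc_len=100, p_corpus=p_corpus, mu=1000)
```

Note how smoothing keeps the score finite even for the unseen term "b": without it, c(b, D) = 0 would make log P(b|D) undefined.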
2.3 Smoothing, TF, and Document Length (20 points)

We discussed in class that the log query likelihood score can be considered as a summation of each query term's score in the document, as in Equation 5, where a term's score in a document is its log probability (Equation 6):

    score(q, D) = Σ_{t∈q} score(t, D)    (5)

    score(t, D) = log P(t|D)    (6)

For convenience, we can consider score(t, D) as a continuous and differentiable function of c_{t,D} (the frequency of t in the document D) and other variables. This makes it easier to examine the influence of c_{t,D} on score(t, D) and consequently on score(q, D):

    ∂score(t, D) / ∂c_{t,D} = (1 / P(t|D)) · ∂P(t|D) / ∂c_{t,D}    (7)

For Jelinek-Mercer smoothing:

    ∂score(t, D) / ∂c_{t,D} = 1 / ( c_{t,D} + (λ / (1 − λ)) · |D| · P(t|Corpus) )    (8)

For Dirichlet smoothing:

    ∂score(t, D) / ∂c_{t,D} = 1 / ( c_{t,D} + µ · P(t|Corpus) )    (9)

As we discussed in class, according to Equations 7, 8, and 9:

- When c_{t,D} increases, ∂score(t, D)/∂c_{t,D} decreases: thus QL penalizes repeated occurrences of the same term (regardless of which smoothing approach is employed).
- score(t, D) increases at a slower rate as c_{t,D} increases if t is a common term in the corpus, i.e., t has a higher P(t|Corpus) compared with a less common term. Thus, QL with smoothing penalizes common terms in the corpus (similar to IDF).

Now discuss the following questions:
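(Optional sanity check before answering: the closed forms in Equations 8 and 9 should match a finite-difference approximation of ∂ log P(t|D)/∂c_{t,D}. The sketch below uses made-up values for a single term.)

```python
import math

# Hypothetical values for a single term t in a single document.
c, doclen, p_c = 3.0, 150.0, 0.002  # c_{t,D}, |D|, P(t|Corpus)
lam, mu = 0.5, 1000.0
eps = 1e-6

def log_p_jm(c):
    """score(t, D) under Jelinek-Mercer smoothing (Equations 3 and 6)."""
    return math.log((1 - lam) * c / doclen + lam * p_c)

def log_p_dir(c):
    """score(t, D) under Dirichlet smoothing (Equations 4 and 6)."""
    return math.log((c + mu * p_c) / (doclen + mu))

# Central finite differences approximating the partial derivatives.
num_jm  = (log_p_jm(c + eps) - log_p_jm(c - eps)) / (2 * eps)
num_dir = (log_p_dir(c + eps) - log_p_dir(c - eps)) / (2 * eps)

# Closed forms from Equations 8 and 9.
ana_jm  = 1.0 / (c + lam / (1 - lam) * doclen * p_c)
ana_dir = 1.0 / (c + mu * p_c)
```

Varying |D|, λ, and µ in this sketch and watching how the two derivatives respond is a direct way to build intuition for 2.3.1.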
2.3.1 According to Equations 8 and 9, how do the two approaches differ in terms of discounting repeated term occurrences? How do the length of the document and the smoothing parameters influence the discount? (10 points)

2.3.2 Analyze to what extent the two smoothing approaches penalize longer documents. Discuss: with other conditions (c_{t,D} and P(t|Corpus)) being equal, when will one smoothing approach penalize longer documents to a greater extent than the other? (10 points)

Note that 2.3.1 asks about the influence of |D| and the smoothing parameters on ∂score(t, D)/∂c_{t,D}, while 2.3.2 asks about their influence on score(t, D) itself.

3 Relevance Model (50 points)

3.1 Implementing RM1 (10 points)

As we introduced in Lecture 11, the relevance model RM1 [1] is a pseudo-relevance feedback approach for estimating query language models from the top k search results of QL. RM1 models the joint probability of a term t with a query q as in Equation 10, where P_QL(q|D_i) is the query likelihood score of q for a document D_i, and P_FB(t|D_i) is the probability of a feedback term t in the document D_i:

    P(t, q) ≈ Σ_{i=1..k} P_FB(t|D_i) · P_QL(q|D_i)    (10)

We can use different language models for P_QL(q|D_i) and P_FB(t|D_i). Here we use Dirichlet smoothing for both, but allow the smoothing parameters µ and µ_FB to be different:

    P_QL(q|D_i) = Π_{t∈q} (c(t, D_i) + µ · P(t|Corpus)) / (|D_i| + µ)    (11)

    P_FB(t|D_i) = (c(t, D_i) + µ_FB · P(t|Corpus)) / (|D_i| + µ_FB)    (12)

Since P(t|q) ∝ P(t, q), we estimate a query model P(t|q) by normalizing Equation 10, as in Equation 13. Here V is the vocabulary of the query language model. Theoretically, V can be set to the vocabulary of the corpus, but this
is practically very inefficient. Normally, we just set V to the terms appearing in the top k documents. Also, remember to exclude stopwords (provided in the stopwords_inquery file) from V: since we did not remove stopwords at indexing time, the document vectors include stopwords.

    P(t|q) = Σ_{i=1..k} P_FB(t|D_i) · P_QL(q|D_i)  /  Σ_{t_j∈V} Σ_{i=1..k} P_FB(t_j|D_i) · P_QL(q|D_i)    (13)

We can apply P(t|q) to the KL-divergence (KLD) language model to rank search results. As we discussed in Lecture 9, the KLD language model is equivalent to a query likelihood model (using the log QL score) in which the frequencies of query terms are replaced by their probabilities P(t|q). In such a case, we can simply use LuceneQLSearcher to perform a KLD search after estimating the query model P(t|q):

    KLD(q‖D) = Σ_{t∈q} P(t|q) · log ( P(t|q) / P(t|D) )  =_rank  −Σ_{t∈q} P(t|q) · log P(t|D)    (14)

As it is slow to process long queries, we often use only the top n highest-probability terms from P(t|q) for ranking. Note that you still need to normalize the probabilities by setting V to the whole set of terms appearing in the top k documents, rather than only the top n words in V.

Implement RM1 in LuceneQLSearcher.estimateQueryModelRM1. RM1 takes the following input parameters:

- q: the original search query (as a list of tokenized query terms)
- µ: the Dirichlet smoothing parameter for QL (Equation 11)
- µ_FB: the feedback Dirichlet smoothing parameter (Equation 12)
- k: the number of top feedback documents
- n: the number of top feedback terms

The output of LuceneQLSearcher.estimateQueryModelRM1 is a HashMap storing the query model. LuceneQLSearcher.search can take such a query model as input and perform a QL search (the implementation is in AbstractQLSearcher).
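The required implementation lives in LuceneQLSearcher.estimateQueryModelRM1 (Java); the following is only a Python sketch of the math in Equations 10–13, under the assumption that the top-k document vectors are already loaded into dictionaries and stopwords have already been filtered out of them.

```python
import math

def estimate_rm1(query, feedback_docs, p_corpus, mu, mu_fb, n):
    """Sketch of RM1 (Equations 10-13) over in-memory document vectors.

    query:         list of tokenized query terms
    feedback_docs: list of {term: frequency} vectors for the top k QL results
    p_corpus:      {term: P(term|Corpus)}
    Returns the top n terms of the normalized query model P(t|q).
    """
    model = {}
    for doc in feedback_docs:
        doc_len = sum(doc.values())
        # P_QL(q|D_i): product of Dirichlet-smoothed term probabilities (Equation 11).
        p_ql = 1.0
        for t in query:
            p_ql *= (doc.get(t, 0) + mu * p_corpus[t]) / (doc_len + mu)
        # Accumulate P_FB(t|D_i) * P_QL(q|D_i) for every term in D_i (Equations 10, 12).
        for t in doc:
            p_fb = (doc[t] + mu_fb * p_corpus[t]) / (doc_len + mu_fb)
            model[t] = model.get(t, 0.0) + p_fb * p_ql
    # Normalize over the full vocabulary V first (Equation 13), then truncate to top n.
    norm = sum(model.values())
    model = {t: w / norm for t, w in model.items()}
    return dict(sorted(model.items(), key=lambda kv: -kv[1])[:n])

# Toy example (hypothetical vectors and corpus probabilities):
fb_docs = [{"a": 2, "b": 1}, {"a": 1, "c": 1}]
p_c = {"a": 0.1, "b": 0.05, "c": 0.05}
qmodel = estimate_rm1(["a"], fb_docs, p_c, mu=10, mu_fb=0, n=10)
```

Note that normalization happens before truncation to the top n terms, exactly as the text above requires; truncating first would change the denominator of Equation 13.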
3.2 µ_FB (5 points)

Lv and Zhai [2] reported that setting µ_FB = 0 is optimal for RM3, which indicates that smoothing seems unnecessary for P_FB(t|D_i). Please verify whether this is true for RM1 on the Robust04 corpus. Following Lv and Zhai's work [2], we simply set µ = 1000, k = 10, and n = 100 in Problem 3.2.

Report the mean P@10 and AP of the 249 queries for RM1, setting µ_FB from 0 to 2500 with step 500. Discuss whether or not the result is consistent with Lv and Zhai's results [2], and explain why smoothing is necessary or unnecessary for P_FB(t|D_i) based on your results.

3.3 New Search vs. Re-ranking (15 points)

The default way of using a query model is to use the top n feedback terms to perform a new search, as described in 3.1. An alternative approach is to simply rerank the top m results of the original query using the full query model (including all terms in V).

Implement the reranking approach described above. You can simply use LuceneQLSearcher to retrieve the top m results, load their document vectors, and then rerank the results by the full RM1 model (keeping all query terms).

Report the mean P@10 and AP of the reranking approach, setting m from 500 to 5000 with step 500. Compare the results with those of the default approach using the top 100 feedback terms (n = 100). For both approaches, set the other parameters to µ = 1000, µ_FB = 0, k = 10. Discuss the pros and cons of the two approaches based on your results.

3.4 RM3 (10 points)

Implement RM3, which is simply a mixture of RM1 and the original query's MLE model, where λ_org controls the weight of the original query:

    P_RM3(t|q) = (1 − λ_org) · P_RM1(t|q) + λ_org · P_MLE(t|q)    (15)

Plot the mean P@10 and AP of the 249 queries for RM3, setting λ_org from 0 to 1 with step 0.1. Set the other parameters to µ = 1000, µ_FB = 0, k = 10, and n = 100.

3.5 RM3 vs. QL (10 points)

As we introduced in class, there are some general concerns regarding pseudo-relevance feedback:
- The approach is recall-oriented and has limited improvements on precision at top ranks. Note that average precision (AP) is a recall-oriented measure and P@10 is a precision-oriented one (we will discuss details in Lectures 12 and 13).
- The quality of the expanded query depends on the quality of the original query: in case the original query does not perform well, the top-ranked results for the original query are mostly not relevant, and consequently the assumption behind pseudo-relevance feedback no longer holds.

Please compare RM3 (setting λ_org to the best value you found in 3.4, µ = 1000, µ_FB = 0, k = 10, and n = 100) with QL (µ = 1000). Discuss whether or not your results suggest that RM3 has these issues on Robust04.

References

[1] V. Lavrenko and W. B. Croft. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01, pages 120-127, 2001.

[2] Y. Lv and C. Zhai. A comparative study of methods for estimating query language models with pseudo feedback. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09, pages 1895-1898, 2009.

A  A Checklist for Your Submission

- A report including the experiment results and answers to all questions (in pdf).
- All your source code. Write a brief readme file describing the purpose of the program if necessary.
- Pack everything as a .zip or .tar.gz file and upload it to Moodle's HW3 submission link.

If you decide to use the 5-day extension (you can only use it once during the whole semester), send an email to let the instructor and the TA know.