CS 646 (Fall 2016) Homework 3

Deadline: 11:59pm, Oct 31st, 2016 (EST)

Access the following resources before you start working on HW3:

- Download and uncompress the index file and other data from Moodle:
  - index_lucene_robust04.tar.gz (about 550 MB)
  - stopwords_inquery: a list of stopwords
  - queries: 249 search queries
  - qrels: relevance judgments for the 249 queries

  The index is similar to the krovetz index used in HW2, except that this index stores document vectors. The docno and content of each document are stored in the docno and content fields, respectively.

- Please read the tutorial for how to access a stored document vector from a Lucene index:

- Check out the Lucene starter code from GitHub:

1 Unigram Language Model (10 points)

1.1 Log Probability (5 points)

Implement MySearcher.logpdc(String field, String docno). It should return the log probability of the document with the specified docno (external ID) according to the corpus language model of the index, P(D|Corpus):

    \log P(D \mid Corpus) = \sum_{w \in D} c(w, D) \log P(w \mid Corpus)    (1)

    P(w \mid Corpus) = \frac{c(w, Corpus)}{|Corpus|}    (2)

D is a document and w denotes each term in D. c(w, D) and c(w, Corpus) are the frequencies of the term w in D and in the corpus, respectively. |Corpus| is the length of the whole corpus (the total number of terms). You will need to retrieve the stored document vectors in order to compute logpdc.
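A minimal sketch of how logpdc could retrieve the stored document vector and the corpus statistics with the standard Lucene index APIs; the starter code's exact fields and helpers may differ, and reader and searcher below are assumed to be members of MySearcher:

    import java.io.IOException;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.TermQuery;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.util.BytesRef;

    // Sketch of MySearcher.logpdc; assumes MySearcher holds an open IndexReader (reader)
    // and IndexSearcher (searcher).
    public double logpdc(String field, String docno) throws IOException {
        // Resolve the external ID (docno) to Lucene's internal document ID.
        TopDocs hits = searcher.search(new TermQuery(new Term("docno", docno)), 1);
        int docid = hits.scoreDocs[0].doc;

        long corpusLength = reader.getSumTotalTermFreq(field);   // |Corpus|
        double logp = 0;

        // Iterate over the stored document vector of this document.
        Terms vector = reader.getTermVector(docid, field);
        TermsEnum it = vector.iterator();
        BytesRef term;
        while ((term = it.next()) != null) {
            long cwd = it.totalTermFreq();                        // c(w, D)
            long cwc = reader.totalTermFreq(new Term(field, BytesRef.deepCopyOf(term)));  // c(w, Corpus)
            logp += cwd * Math.log((double) cwc / corpusLength);  // c(w, D) * log P(w|Corpus)
        }
        return logp;
    }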

Report the log probability of the following documents (the table provides the result of one document to help you verify your implementation).

    docno      log P(D|Corpus)
    FBIS
    FBIS
    FT
    FR
    FR
    FBIS

1.2 Comparing Log Probability (5 points)

In the query likelihood model (QL), we rank results by P(q|D) or log P(q|D). If log P(q|D_1) > log P(q|D_2), QL believes that D_1 is more relevant or more similar to the query q than D_2, and thus it ranks D_1 at a higher position. Does log P(q_1|D) > log P(q_2|D) also indicate that q_1 is more similar to the content of the document D than q_2? If yes, discuss why; if no, discuss how to adjust log P(q|D) to make it appropriate for comparing the relevance/similarity of a document with different queries.

2 Query Likelihood and Smoothing (40 points)

2.1 Implementation (10 points)

Implement two scoring functions:

- MySearcher.QLJMSmoothing: QL with Jelinek-Mercer smoothing:

    P(t \mid D) = (1 - \lambda) \frac{c(t, D)}{|D|} + \lambda P(t \mid Corpus)    (3)

- MySearcher.QLDirichletSmoothing: QL with Dirichlet smoothing:

    P(t \mid D) = \frac{c(t, D) + \mu P(t \mid Corpus)}{|D| + \mu}    (4)

MySearcher implements the document-at-a-time search algorithm with a replaceable scoring function (as defined in the ScoringFunction interface).
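A minimal sketch of the two scoring functions, written against a hypothetical ScoringFunction signature that receives c(t, D), |D|, and P(t|Corpus); the actual interface in the starter code may pass different arguments:

    // Sketch only: this ScoringFunction signature is an assumption, not the starter code's interface.
    interface ScoringFunction {
        // Returns score(t, D) = log P(t|D) for one query term.
        double score(double tf, double docLength, double pwc);
    }

    class QLJMSmoothing implements ScoringFunction {
        private final double lambda;
        QLJMSmoothing(double lambda) { this.lambda = lambda; }
        public double score(double tf, double docLength, double pwc) {
            // Equation (3): P(t|D) = (1 - lambda) * c(t, D) / |D| + lambda * P(t|Corpus)
            return Math.log((1 - lambda) * tf / docLength + lambda * pwc);
        }
    }

    class QLDirichletSmoothing implements ScoringFunction {
        private final double mu;
        QLDirichletSmoothing(double mu) { this.mu = mu; }
        public double score(double tf, double docLength, double pwc) {
            // Equation (4): P(t|D) = (c(t, D) + mu * P(t|Corpus)) / (|D| + mu)
            return Math.log((tf + mu * pwc) / (docLength + mu));
        }
    }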

After implementing the two scoring functions, you should be able to run the main function of MySearcher. It retrieves the top 1,000 results for each of the 249 queries in the Robust04 dataset using a scoring function, and it reports the P@10 and average precision (AP) of each query as well as the mean P@10 and AP over the 249 queries.

You can verify your Dirichlet smoothing by comparing your results with those from running LuceneQLSearcher's main function. LuceneQLSearcher includes an implementation of QL with Dirichlet smoothing using a slightly faster approach optimized for term-at-a-time search.

We need to use QL with Dirichlet smoothing a lot in the follow-up questions. You can use either your own implementation or LuceneQLSearcher in the follow-up questions. They take similar input parameters, but the latter is probably faster (note that some of the experiments may take 30 minutes or longer).

2.2 Smoothing Parameters (10 points)

Divide the 249 queries into a training set (No. ) and a testing set (No. ). Separately plot the mean P@10 and AP of the queries in the training set and in the testing set for:

- QL with Jelinek-Mercer smoothing, where λ ranges from 0.1 to 0.9 with step 0.1 (set the x-axis to λ and the y-axis to P@10 or AP).
- QL with Dirichlet smoothing, where µ ranges from 500 to 5000 with step 500 (set the x-axis to µ and the y-axis to P@10 or AP).

To examine the stability of the two smoothing approaches across different query sets, we apply the best λ or µ from the training set (the value that gives the best mean P@10 or AP) to the testing set. Usually the results will be worse than those obtained by using the best λ or µ in the testing set. Report the differences in mean P@10 and AP compared with those obtained by using the best λ or µ in the testing set. Discuss which smoothing approach seems more effective and stable on the Robust04 dataset based on your results.
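To produce the plots in 2.2, a simple sweep over the parameter values is enough. The sketch below reuses the QLJMSmoothing and QLDirichletSmoothing classes sketched above and assumes a hypothetical evaluate(...) helper that runs a given scoring function over a query set and returns the mean P@10 and mean AP, as well as hypothetical trainingQueries and testingQueries collections; the starter code's evaluation entry points may differ:

    // Sketch only: evaluate(...), trainingQueries, and testingQueries are hypothetical.
    // evaluate returns {mean P@10, mean AP} for the given scoring function and query set.
    for (double lambda = 0.1; lambda <= 0.91; lambda += 0.1) {
        double[] train = evaluate(new QLJMSmoothing(lambda), trainingQueries);
        double[] test  = evaluate(new QLJMSmoothing(lambda), testingQueries);
        System.out.printf("JM lambda=%.1f train P@10=%.4f AP=%.4f test P@10=%.4f AP=%.4f%n",
                lambda, train[0], train[1], test[0], test[1]);
    }
    for (int mu = 500; mu <= 5000; mu += 500) {
        double[] train = evaluate(new QLDirichletSmoothing(mu), trainingQueries);
        double[] test  = evaluate(new QLDirichletSmoothing(mu), testingQueries);
        System.out.printf("Dirichlet mu=%d train P@10=%.4f AP=%.4f test P@10=%.4f AP=%.4f%n",
                mu, train[0], train[1], test[0], test[1]);
    }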

2.3 Smoothing, TF, and Document Length (20 points)

We discussed in class that the log query likelihood score can be considered as a summation over each query term's score in the document, as in Equation 5, where a term's score in the document is score(t, D) = log P(t|D), as in Equation 6:

    score(q, D) = \sum_{t \in q} score(t, D)    (5)

    score(t, D) = \log P(t \mid D)    (6)

For convenience, we can consider score(t, D) as a continuous and differentiable function of c_{t,D} (the frequency of t in the document D) and other variables. This makes it easier to examine the influence of c_{t,D} on score(t, D) and consequently on score(q, D).

    \frac{\partial\, score(t, D)}{\partial c_{t,D}} = \frac{1}{P(t \mid D)} \cdot \frac{\partial P(t \mid D)}{\partial c_{t,D}}    (7)

For Jelinek-Mercer smoothing:

    \frac{\partial\, score(t, D)}{\partial c_{t,D}} = \frac{1}{c_{t,D} + \frac{\lambda}{1 - \lambda} |D| \, P(t \mid Corpus)}    (8)

For Dirichlet smoothing:

    \frac{\partial\, score(t, D)}{\partial c_{t,D}} = \frac{1}{c_{t,D} + \mu \, P(t \mid Corpus)}    (9)
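For reference, Equations 8 and 9 follow from plugging Equations 3 and 4 into Equation 7, treating |D| as a constant with respect to c_{t,D}; a short worked derivation:

    % Jelinek-Mercer (Equation 3): P(t|D) = (1-\lambda)\,c_{t,D}/|D| + \lambda\,P(t|Corpus)
    \frac{\partial\, score(t,D)}{\partial c_{t,D}}
      = \frac{(1-\lambda)/|D|}{(1-\lambda)\,c_{t,D}/|D| + \lambda\,P(t \mid Corpus)}
      = \frac{1}{c_{t,D} + \frac{\lambda}{1-\lambda}\,|D|\,P(t \mid Corpus)}

    % Dirichlet (Equation 4): P(t|D) = (c_{t,D} + \mu\,P(t|Corpus)) / (|D| + \mu)
    \frac{\partial\, score(t,D)}{\partial c_{t,D}}
      = \frac{1/(|D| + \mu)}{(c_{t,D} + \mu\,P(t \mid Corpus))/(|D| + \mu)}
      = \frac{1}{c_{t,D} + \mu\,P(t \mid Corpus)}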

As we discussed in class, according to Equations 7, 8, and 9:

- When c_{t,D} increases, ∂score(t,D)/∂c_{t,D} decreases: thus QL penalizes repeated occurrences of the same term (regardless of which smoothing approach is employed).
- score(t, D) increases at a slower rate as c_{t,D} increases if t is a common term in the corpus (t has a higher P(t|Corpus) compared with a less common term). Thus, QL with smoothing penalizes common terms in the corpus (similar to IDF).

Now discuss the following questions:

2.3.1 According to Equations 8 and 9, how do the two approaches differ in terms of discounting repeated term occurrences? How do the length of the document and the smoothing parameters influence the discount? (10 points)

2.3.2 Analyze to what extent the two smoothing approaches penalize longer documents. With other conditions (c_{t,D} and P(t|Corpus)) being equal, when will one smoothing approach penalize longer documents to a greater extent than the other? (10 points)

Note that 2.3.1 asks about the influence of |D| and the smoothing parameters on ∂score(t,D)/∂c_{t,D}, while 2.3.2 asks about the influence of |D| and the smoothing parameters on score(t, D).

3 Relevance Model (50 points)

3.1 Implementing RM1 (10 points)

As we introduced in Lecture 11, the relevance model RM1 [1] is a pseudo-relevance feedback approach for estimating query language models from the top k search results of QL. RM1 models the joint probability of a term t with a query q as in Equation 10, where:

- P_QL(q|D_i) is the query likelihood score of q from a document D_i;
- P_FB(t|D_i) is the probability of a feedback term t from the document D_i.

    P(t, q) \approx \sum_{i=1}^{k} P_{FB}(t \mid D_i) \, P_{QL}(q \mid D_i)    (10)

We can use different language models for P_QL(q|D_i) and P_FB(t|D_i). Here we use Dirichlet smoothing for both, but allow the smoothing parameters to be different: µ and µ_FB.

    P_{QL}(q \mid D_i) = \prod_{t \in q} \frac{c(t, D_i) + \mu \, P(t \mid Corpus)}{|D_i| + \mu}    (11)

    P_{FB}(t \mid D_i) = \frac{c(t, D_i) + \mu_{FB} \, P(t \mid Corpus)}{|D_i| + \mu_{FB}}    (12)

Since P(t|q) ∝ P(t, q), we estimate a query model P(t|q) by normalizing Equation 10, as in Equation 13. Here V is the vocabulary of the query language model. Theoretically, V can be set to the vocabulary of the corpus, but this is practically very inefficient.

Normally, we just set V to the terms that appear in the top k documents. Also, remember to exclude stopwords (provided in the stopwords_inquery file) from V: since we did not remove stopwords at indexing time, the document vectors include stopwords.

    P(t \mid q) = \frac{\sum_{i=1}^{k} P_{FB}(t \mid D_i) \, P_{QL}(q \mid D_i)}{\sum_{t_j \in V} \sum_{i=1}^{k} P_{FB}(t_j \mid D_i) \, P_{QL}(q \mid D_i)}    (13)

We can apply P(t|q) to the KL-divergence language model to rank search results. As we discussed in Lecture 9, the KLD language model is equivalent to a query likelihood model (using the log QL score) in which the frequencies of query terms are replaced by their probabilities under P(t|q). In such a case, we can simply use LuceneQLSearcher to perform a KLD search after estimating the query model P(t|q).

    KLD(q \parallel D) = \sum_{t \in q} P(t \mid q) \log \frac{P(t \mid q)}{P(t \mid D)} \;\stackrel{rank}{=}\; -\sum_{t \in q} P(t \mid q) \log P(t \mid D)    (14)

As it is slow to process long queries, we often only use the top n highest-probability terms from P(t|q) for ranking. Note that you still need to normalize the probabilities by setting V to the whole set of terms that appear in the top k documents, rather than only the top n words in V.

Implement RM1 in LuceneQLSearcher.estimateQueryModelRM1. RM1 takes the following input parameters:

- q: the original search query (as a list of tokenized query terms)
- µ: the Dirichlet smoothing parameter for QL (Equation 11)
- µ_FB: the feedback Dirichlet smoothing parameter (Equation 12)
- k: the number of top feedback documents
- n: the number of top feedback terms

The output of LuceneQLSearcher.estimateQueryModelRM1 is a HashMap storing the query model. LuceneQLSearcher.search can take such a query model as input and perform a QL search (the implementation is in AbstractQLSearcher).
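A minimal sketch of the RM1 estimation, assuming hypothetical helpers searchQL(...) (runs a Dirichlet-smoothed QL search and returns the internal IDs of the top documents), getDocVector(...) (loads a stored document vector as a term-to-frequency map), getCorpusProb(...) (returns P(t|Corpus)), and docLength(...); the field and the stopword set are passed explicitly here for illustration, and the starter code's actual method names and signatures may differ:

    // Sketch only (requires java.util.* and java.io.IOException):
    // searchQL, getDocVector, getCorpusProb, and docLength are hypothetical helpers.
    public Map<String, Double> estimateQueryModelRM1(String field, List<String> q,
            double mu, double muFB, int k, int n, Set<String> stopwords) throws IOException {

        // 1. Retrieve the top k feedback documents with QL (Dirichlet smoothing, parameter mu).
        List<Integer> topDocs = searchQL(field, q, mu, k);

        Map<String, Double> joint = new HashMap<>();   // accumulates Equation (10)
        for (int docid : topDocs) {
            Map<String, Integer> vector = getDocVector(docid, field);  // term -> c(t, D_i)
            double docLen = docLength(vector);

            // P_QL(q|D_i): product of Dirichlet-smoothed probabilities of the query terms (Equation 11).
            double pql = 1.0;
            for (String t : q) {
                int tf = vector.getOrDefault(t, 0);
                pql *= (tf + mu * getCorpusProb(field, t)) / (docLen + mu);
            }

            // Accumulate P_FB(t|D_i) * P_QL(q|D_i) for every term in the document vector (Equation 12).
            for (Map.Entry<String, Integer> e : vector.entrySet()) {
                String t = e.getKey();
                if (stopwords.contains(t)) continue;   // exclude stopwords from V
                double pfb = (e.getValue() + muFB * getCorpusProb(field, t)) / (docLen + muFB);
                joint.merge(t, pfb * pql, Double::sum);
            }
        }

        // 2. Normalize over the whole feedback vocabulary V (Equation 13).
        double norm = 0;
        for (double v : joint.values()) norm += v;

        // 3. Keep only the top n terms (probabilities stay normalized over the full vocabulary V).
        List<Map.Entry<String, Double>> sorted = new ArrayList<>(joint.entrySet());
        sorted.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));
        Map<String, Double> model = new HashMap<>();
        for (int i = 0; i < Math.min(n, sorted.size()); i++) {
            model.put(sorted.get(i).getKey(), sorted.get(i).getValue() / norm);
        }
        return model;
    }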

3.2 µ_FB (5 points)

Lv and Zhai [2] reported that setting µ_FB = 0 is optimal for RM3, which indicates that smoothing seems unnecessary for P_FB(t|D_i). Please verify whether this is true for RM1 on the Robust04 corpus. Following Lv and Zhai's work [2], we simply set µ = 1000, k = 10, and n = 100 in Problem 3.2. Report the mean P@10 and AP of the 249 queries for RM1 with µ_FB ranging from 0 to 2500 with step 500. Discuss whether or not the result is consistent with Lv and Zhai's results [2] and explain why smoothing is necessary or unnecessary for P_FB(t|D_i) based on your results.

3.3 New Search vs. Re-ranking (15 points)

The default way of using a query model is to use the top n feedback terms to perform a new search, as described in 3.1. An alternative approach is to simply rerank the top m results of the original query using the full query model (including all terms in V).

Implement the reranking approach described above. You can simply use LuceneQLSearcher to retrieve the top m results, load their document vectors, and then rerank the results by the full RM1 model (keeping all query terms). Report the mean P@10 and AP of the reranking approach with m ranging from 500 to 5000 with step 500. Compare the results with those of the default approach using the top 100 feedback terms (n = 100). For both approaches, set the other parameters to µ = 1000, µ_FB = 0, k = 10. Discuss the pros and cons of the two approaches based on your results.
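A minimal sketch of the reranking approach in 3.3, reusing the same hypothetical helpers as in the RM1 sketch above and taking the full (all terms in V, normalized) RM1 model as input:

    // Sketch only: searchQL, getDocVector, getCorpusProb, and docLength are the same
    // hypothetical helpers used in the RM1 sketch above.
    public List<Integer> rerankByRM1(String field, List<String> q, Map<String, Double> queryModel,
            double mu, int m) throws IOException {

        // 1. Retrieve the top m results of the original query with QL (Dirichlet smoothing).
        List<Integer> topDocs = searchQL(field, q, mu, m);

        // 2. Re-score each result with the full query model (rank-equivalent form of Equation 14):
        //    score(q, D) = sum_t P(t|q) * log P(t|D), with Dirichlet-smoothed P(t|D).
        Map<Integer, Double> newScore = new HashMap<>();
        for (int docid : topDocs) {
            Map<String, Integer> vector = getDocVector(docid, field);
            double docLen = docLength(vector);
            double score = 0;
            for (Map.Entry<String, Double> e : queryModel.entrySet()) {
                String t = e.getKey();
                int tf = vector.getOrDefault(t, 0);
                double ptd = (tf + mu * getCorpusProb(field, t)) / (docLen + mu);
                score += e.getValue() * Math.log(ptd);
            }
            newScore.put(docid, score);
        }

        // 3. Sort the original top m results by the new scores (descending).
        List<Integer> reranked = new ArrayList<>(topDocs);
        reranked.sort((d1, d2) -> Double.compare(newScore.get(d2), newScore.get(d1)));
        return reranked;
    }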

3.4 RM3 (10 points)

Implement RM3, which is simply a mixture model of RM1 with the original query's MLE model, where λ_org controls the weight of the original query:

    P_{RM3}(t \mid q) = (1 - \lambda_{org}) \, P_{RM1}(t \mid q) + \lambda_{org} \, P_{MLE}(t \mid q)    (15)

Plot the mean P@10 and AP of the 249 queries for RM3 with λ_org ranging from 0 to 1 with step 0.1. Set the other parameters to µ = 1000, µ_FB = 0, k = 10, and n = 100.

3.5 RM3 vs. QL (10 points)

As we introduced in class, there are some general concerns regarding pseudo-relevance feedback:

- The approach is recall-oriented and has limited improvements on precision at top ranks. Note that average precision (AP) is a recall-oriented measure and P@10 is a precision-oriented one (we will discuss the details in Lectures 12 and 13).
- The quality of the expanded query depends on the quality of the original query: in case the original query does not perform well, the top-ranked results of the original query are mostly not relevant, and consequently the assumption behind pseudo-relevance feedback does not hold anymore.

Please compare RM3 (setting λ_org to the best value you found in 3.4, µ = 1000, µ_FB = 0, k = 10, and n = 100) with QL (µ = 1000). Discuss whether or not your results suggest that RM3 has these issues on Robust04.

References

[1] V. Lavrenko and W. B. Croft. Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR '01.

[2] Y. Lv and C. Zhai. A comparative study of methods for estimating query language models with pseudo feedback. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM '09.

A A Checklist for Your Submission

- A report including the experiment results and answers to all questions (in PDF).
- All your source code. Write a brief readme file describing the purpose of the program if necessary.
- Pack everything as a .zip or .tar.gz file and upload it to Moodle's HW3 submission link.
- If you decide to use the 5-day extension (you can only use it once during the whole semester), send an email to let the instructor and the TA know.
