Information Retrieval Basic IR models. Luca Bondi
1. Basic IR models
Luca Bondi
2. Previously on IR
An IR model takes a query q_i and a document d_j and produces a similarity coefficient SC(q_i, d_j); formally IRM = (D, Q, R(q_i, d_j)).
Each document is represented as a weight vector d_j = (w_{1,j}, w_{2,j}, ..., w_{M,j})^T, with w_{i,j} = 0 if term t_i does not appear in document d_j.
The weights of different terms (e.g. w_{i,j} and w_{i+1,j}) are assumed to be mutually independent.
3. Boolean model
(Outline: Boolean model, Vector space model, Probabilistic model)
4. Boolean model
The Boolean model is a simple retrieval model based on set theory and Boolean algebra.
- An index term's significance is represented by binary weights: w_{i,j} ∈ {0, 1}.
- The set of index terms for a document d_j is denoted R_{d_j}; the set of documents for an index term t_i is denoted R_{t_i}.
- Queries are defined as Boolean expressions over index terms, with operators AND, OR, NOT.
- Relevance is modelled as a binary property of documents: SC(q, d_j) = 0 or SC(q, d_j) = 1.
5. Boolean model: example
Given a set of index terms t_1, t_2, t_3 and a set of documents
d_1 = (1, 1, 1)^T, d_2 = (1, 0, 0)^T, d_3 = (0, 1, 0)^T,
calculate the set of documents for each index term:
R_{t_1} = {d_1, d_2}, R_{t_2} = {d_1, d_3}, R_{t_3} = {d_1}
6. Boolean model: example
Each query can be expressed in terms of the sets R_{t_i}:
- q = t_1: R_{t_1} = {d_1, d_2}
- q = t_1 AND t_2: R_{t_1} ∩ R_{t_2} = {d_1, d_2} ∩ {d_1, d_3} = {d_1}
- q = t_1 OR t_2: R_{t_1} ∪ R_{t_2} = {d_1, d_2} ∪ {d_1, d_3} = {d_1, d_2, d_3}
- q = NOT t_3: R_{t_3}^C = {d_1}^C = {d_2, d_3}
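The set operations above map directly onto Python sets. A minimal sketch of the toy collection (the names R, all_docs and the AND/OR/NOT helpers are mine, not from the slides):

```python
# Boolean retrieval over the toy collection of slides 5-6,
# using Python sets for the per-term document sets R_{t_i}.
R = {
    "t1": {"d1", "d2"},
    "t2": {"d1", "d3"},
    "t3": {"d1"},
}
all_docs = {"d1", "d2", "d3"}

def AND(a, b): return a & b            # set intersection
def OR(a, b):  return a | b            # set union
def NOT(a):    return all_docs - a     # complement w.r.t. the collection

print(AND(R["t1"], R["t2"]))  # q = t1 AND t2 -> {d1}
print(OR(R["t1"], R["t2"]))   # q = t1 OR t2  -> {d1, d2, d3}
print(NOT(R["t3"]))           # q = NOT t3    -> {d2, d3}
```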
7. Boolean model: queries in DNF
Each query can also be expressed in disjunctive normal form (DNF):
q = t_3 AND NOT(t_1 AND t_2)  →  R_{t_3} ∩ (R_{t_1} ∩ R_{t_2})^C
q_dnf = (t_1 ∧ ¬t_2 ∧ t_3) ∨ (¬t_1 ∧ ¬t_2 ∧ t_3) ∨ (¬t_1 ∧ t_2 ∧ t_3)
8. Boolean model: queries in DNF
q = t_3 AND NOT(t_1 AND t_2)
q_dnf = (t_1 ∧ ¬t_2 ∧ t_3) ∨ (¬t_1 ∧ ¬t_2 ∧ t_3) ∨ (¬t_1 ∧ t_2 ∧ t_3)
corresponding to the incidence vectors (1, 0, 1)^T, (0, 0, 1)^T, (0, 1, 1)^T.
Each disjunction represents an ideal set of documents. The query is satisfied by a document if that document is contained in a disjunction term.
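The equivalence between the query and its DNF can be checked exhaustively over all incidence vectors. A small sketch (function names are illustrative):

```python
from itertools import product

def q(t1, t2, t3):
    # q = t3 AND NOT(t1 AND t2)
    return bool(t3 and not (t1 and t2))

def q_dnf(t1, t2, t3):
    # (t1 ∧ ¬t2 ∧ t3) ∨ (¬t1 ∧ ¬t2 ∧ t3) ∨ (¬t1 ∧ t2 ∧ t3)
    return bool((t1 and not t2 and t3) or
                (not t1 and not t2 and t3) or
                (not t1 and t2 and t3))

# The two forms agree on every binary incidence vector.
for v in product([0, 1], repeat=3):
    assert q(*v) == q_dnf(*v)
```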
9. Boolean model: conclusions
Pros:
- Precise semantics
- Structured queries
- Intuitive for experts
- Simple and neat formalism
- Adopted by many early commercial bibliographic systems
Cons:
- No ranking: the retrieval strategy is based on a binary decision criterion
- It is not simple to translate an information need into a Boolean expression
10. Vector space model
(Outline: Boolean model, Vector space model, Probabilistic model)
11. Vector space model
Documents and queries are represented as vectors in the index-term space.
- M = number of index terms = dimensionality of the index-term space = dictionary size.
- An index term's significance is represented by a real-valued weight w_{i,j} ≥ 0 associated with the pair (t_i, d_j).
- Index terms are represented by unit vectors and form a canonical basis for the vector space. The index-term vector for term t_i is t_i = (0, ..., 1, ..., 0)^T, with the 1 in position i.
12. Vector space model
Document d_j is represented by a vector d_j, a linear combination of the index-term vectors weighted by each term's significance for the document:
d_j = sum_{i=1}^{M} w_{i,j} t_i
Query q is represented by a vector in the index-term space:
q = (w_{1,q}, w_{2,q}, ..., w_{M,q})^T
(Figure: documents d_1, d_2 and query q as vectors in the (t_1, t_2) plane.)
13. Vector space model: similarity
The similarity coefficient between a query and each document is real-valued, thus producing a ranked list of documents. It takes into consideration documents that match the query terms only partially.
The ranked answer set is more effective than the unranked answer set retrieved by the Boolean model: it better matches the user's information need.
Various measures can be used to assess the similarity between documents and query. A measure of similarity between documents should fulfil the following:
- If d_1 is near d_2, then d_2 is near d_1.
- If d_1 is near d_2, and d_2 is near d_3, then d_1 is not far from d_3.
- No document is closer to d than d itself.
14. Vector space model: similarity
Euclidean distance: length of the difference vector
d_{L2}(q, d_j) = ||q - d_j||_2 = sqrt( sum_{i=1}^{M} (w_{i,q} - w_{i,j})^2 )
It can be converted into a similarity coefficient in different ways, e.g.
SC(q, d_j) = e^{-||q - d_j||_2}  or  SC(q, d_j) = 1 / (1 + ||q - d_j||_2)
Issue of normalization: Euclidean distance applied to un-normalized vectors tends to make any large document non-relevant to most queries, which are typically short.
15. Vector space model: similarity
Cosine similarity: cosine of the angle between the two vectors
SC(q, d_j) = cos(α) = (q^T d_j) / (||q|| ||d_j||) = sum_{i=1}^{M} w_{i,q} w_{i,j} / ( sqrt(sum_{i=1}^{M} w_{i,q}^2) · sqrt(sum_{i=1}^{M} w_{i,j}^2) )
It is a similarity, not a distance: the triangle inequality holds for distances but not for similarities.
The cosine measure normalizes the result by the lengths of the vectors: given two vectors, their similarity is determined by their directions alone. For normalized vectors, the cosine similarity equals the inner product.
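The two similarity coefficients seen so far can be sketched in a few lines (function names are mine; the Euclidean variant uses the 1/(1 + distance) conversion from the previous slide):

```python
import math

def euclidean_sc(q, d):
    # SC(q, d) = 1 / (1 + ||q - d||_2)
    dist = math.sqrt(sum((qi - di) ** 2 for qi, di in zip(q, d)))
    return 1.0 / (1.0 + dist)

def cosine_sc(q, d):
    # SC(q, d) = (q . d) / (||q|| ||d||); defined as 0 for an all-zero vector
    dot = sum(qi * di for qi, di in zip(q, d))
    nq = math.sqrt(sum(qi * qi for qi in q))
    nd = math.sqrt(sum(di * di for di in d))
    return dot / (nq * nd) if nq and nd else 0.0
```

Note that cosine similarity depends only on direction: scaling a document vector leaves its score unchanged, which is exactly the length-normalization property discussed above.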
16. Vector space model: similarity
Jaccard similarity:
SC(q, d_j) = q^T d_j / (q^T q + d_j^T d_j - q^T d_j) = sum_{i=1}^{M} w_{i,q} w_{i,j} / ( sum_{i=1}^{M} w_{i,q}^2 + sum_{i=1}^{M} w_{i,j}^2 - sum_{i=1}^{M} w_{i,q} w_{i,j} )
This is an extension of Jaccard's similarity coefficient for binary vectors.
Other similarity measures: Dice's similarity coefficient, overlap coefficient.
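A sketch of the generalized Jaccard coefficient as defined above (for binary vectors it reduces to |intersection| / |union| of the two term sets; the function name is mine):

```python
def jaccard_sc(q, d):
    # SC(q, d) = (q . d) / (q . q + d . d - q . d)
    dot = sum(qi * di for qi, di in zip(q, d))
    qq = sum(qi * qi for qi in q)
    dd = sum(di * di for di in d)
    denom = qq + dd - dot
    return dot / denom if denom else 0.0

# Binary case: q = {t1, t2}, d = {t2, t3} -> intersection 1, union 3 -> 1/3
print(jaccard_sc([1, 1, 0], [0, 1, 1]))
```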
17. Vector space model: index term weighting
The Boolean model used binary weights for index terms in a document:
- terms with different discriminative power receive the same weight;
- normalization might not be enough to compensate for differences in document lengths: a longer document has more opportunity to contain components that are relevant to a query.
In the vector space model, index term weights are non-negative real values. An index term's weight should be made proportional to its importance, both in the document and in the document collection.
18. Vector space model: index term weighting
In the vector space model the index term weight is defined as
w_{i,j} = tf_{i,j} · idf_i
- tf_{i,j}: frequency of term t_i in document d_j; provides one measure of how well term t_i describes the document contents. w_{i,j} increases with the number of occurrences of t_i within d_j.
- idf_i: inverse document frequency of term t_i over the whole document collection; terms which appear in many documents are not very useful for distinguishing a relevant document from a non-relevant one. w_{i,j} increases with the rarity of t_i across the whole collection.
19. Vector space model: index term weighting
tf_{i,j}: frequency of term t_i in document d_j. Define freq_{i,j} = number of occurrences of t_i in d_j. Three possible models for tf_{i,j}:
- tf_{i,j} = freq_{i,j} (simplest model)
- tf_{i,j} = freq_{i,j} / max_i freq_{i,j} (normalized model)
- tf_{i,j} = 1 + log2(freq_{i,j}) if freq_{i,j} ≥ 1, and 0 otherwise (prevents bias toward longer documents)
20. Vector space model: index term weighting
idf_i: inverse document frequency of term t_i over the whole document collection. Define N = number of documents in the collection and n_i = number of documents containing term t_i. Then
idf_i = log2(N / n_i)
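Putting tf and idf together, a minimal tf-idf weighting sketch over the toy collection used in the example that follows (raw term frequency, base-2 logarithm; the helper name tfidf_weights is mine):

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """w_{i,j} = tf_{i,j} * idf_i with raw term frequency.
    `docs` maps a document id to its list of tokens."""
    N = len(docs)
    # n_i: number of documents containing term t_i
    n = Counter()
    for tokens in docs.values():
        for term in set(tokens):
            n[term] += 1
    idf = {t: math.log2(N / n_t) for t, n_t in n.items()}
    weights = {}
    for doc_id, tokens in docs.items():
        freq = Counter(tokens)  # tf_{i,j} = freq_{i,j}
        weights[doc_id] = {t: f * idf[t] for t, f in freq.items()}
    return weights, idf

docs = {
    "d1": "shipment of gold damaged in a fire".split(),
    "d2": "delivery of silver arrived in a silver truck".split(),
    "d3": "shipment of gold arrived in a truck".split(),
}
w, idf = tfidf_weights(docs)
# Terms in every document ("of", "in", "a") get idf = log2(3/3) = 0,
# so they carry zero weight; "silver" occurs twice in d2 -> w = 2 * log2(3)
```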
21. Vector space model: example
Given:
- Query q = "gold silver truck"
- Document collection:
  d_1 = "shipment of gold damaged in a fire"
  d_2 = "delivery of silver arrived in a silver truck"
  d_3 = "shipment of gold arrived in a truck"
- Term frequency model tf_{i,j} = freq_{i,j}
Determine the ranking of the document collection with respect to the given query, using a vector space model with the following similarity measures: Euclidean distance, cosine similarity.
22. Vector space model: example
Dictionary = { shipment, of, gold, damaged, in, a, fire, delivery, silver, arrived, truck }
N = 3 documents in the collection. For each term t_i, n_i and idf_i = log2(N / n_i):
- shipment: n = 2, idf = 0.58
- of: n = 3, idf = 0
- gold: n = 2, idf = 0.58
- damaged: n = 1, idf = 1.58
- in: n = 3, idf = 0
- a: n = 3, idf = 0
- fire: n = 1, idf = 1.58
- delivery: n = 1, idf = 1.58
- silver: n = 1, idf = 1.58
- arrived: n = 2, idf = 0.58
- truck: n = 2, idf = 0.58
Purge the dictionary, removing terms with idf_i = 0:
Dictionary = { shipment, gold, damaged, fire, delivery, silver, arrived, truck }
23. Vector space model: example
tf_{i,j} for each pair (t_i, d_j):
           shipment  gold  damaged  fire  delivery  silver  arrived  truck
tf_{i,1}      1       1      1       1       0        0        0       0
tf_{i,2}      0       0      0       0       1        2        1       1
tf_{i,3}      1       1      0       0       0        0        1       1
w_{i,j} = tf_{i,j} · idf_i for each pair (t_i, d_j):
           shipment  gold  damaged  fire  delivery  silver  arrived  truck
w_{i,1}     0.58    0.58   1.58    1.58     0        0        0       0
w_{i,2}      0       0      0       0      1.58     3.16     0.58    0.58
w_{i,3}     0.58    0.58    0       0       0        0       0.58    0.58
24. Vector space model: example
tf_{i,q} and w_{i,q} for each term t_i:
           shipment  gold  damaged  fire  delivery  silver  arrived  truck
tf_{i,q}      0       1      0       0       0        1        0       1
w_{i,q}       0      0.58    0       0       0       1.58      0      0.58
Writing the documents and the query as vectors:
d_1 = (0.58, 0.58, 1.58, 1.58, 0, 0, 0, 0)
d_2 = (0, 0, 0, 0, 1.58, 3.16, 0.58, 0.58)
d_3 = (0.58, 0.58, 0, 0, 0, 0, 0.58, 0.58)
q = (0, 0.58, 0, 0, 0, 1.58, 0, 0.58)
25. Vector space model: example
Euclidean distance as similarity coefficient:
d_{L2}(q, d_1) = 2.86, d_{L2}(q, d_2) = 2.38, d_{L2}(q, d_3) = 1.78
SC(q, d_1) = 1 / (1 + d_{L2}(q, d_1)) = 0.26
SC(q, d_2) = 1 / (1 + d_{L2}(q, d_2)) = 0.30
SC(q, d_3) = 1 / (1 + d_{L2}(q, d_3)) = 0.36
Rank: d_3 > d_2 > d_1
26. Vector space model: example
Cosine similarity as similarity coefficient:
SC(q, d_1) = q^T d_1 / (||q|| ||d_1||) = 0.08
SC(q, d_2) = q^T d_2 / (||q|| ||d_2||) = 0.82
SC(q, d_3) = q^T d_3 / (||q|| ||d_3||) = 0.33
Rank: d_2 > d_3 > d_1
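The cosine ranking above can be reproduced with a short script, using the weight vectors from slide 24 (the vectors are rounded to two decimals, so the scores match the slides only approximately):

```python
import math

d1 = [0.58, 0.58, 1.58, 1.58, 0, 0, 0, 0]
d2 = [0, 0, 0, 0, 1.58, 3.16, 0.58, 0.58]
d3 = [0.58, 0.58, 0, 0, 0, 0, 0.58, 0.58]
q  = [0, 0.58, 0, 0, 0, 1.58, 0, 0.58]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

scores = {name: cosine(q, d) for name, d in [("d1", d1), ("d2", d2), ("d3", d3)]}
ranking = sorted(scores, key=scores.get, reverse=True)
# ranking is ["d2", "d3", "d1"], matching the slide
```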
27. Vector space model: conclusions
Pros:
- The term-weighting scheme improves retrieval performance w.r.t. the Boolean model
- The partial-matching strategy allows retrieval of documents that approximate the query conditions
- Ranked output and output magnitude control
- Flexibility and an intuitive geometric interpretation
Cons:
- Assumption of independence between terms
- Impossible to formulate structured queries: no operators (OR, AND, NOT, etc.)
- Terms are axes of a vector space: even with stemming, the space may have a very large number of dimensions
28. Probabilistic model
(Outline: Boolean model, Vector space model, Probabilistic model)
29. Probabilistic model
The probabilistic model computes the similarity coefficient between queries and documents as the probability that a document will be relevant to a query: P(R | q, d_j).
Given a query q, let R_q denote the set of documents relevant to the query q (the ideal answer set). The set R_q is unknown.
30. Probabilistic model
First, generate a preliminary probabilistic description of the ideal answer set R_q, which is used to retrieve a first set of documents. The description can come from:
- relevant documents, if some are known (relevance feedback)
- prior domain knowledge
31. Probabilistic model
The probabilistic model can be used together with relevance feedback. An interaction with the user is initiated with the purpose of improving the probabilistic description of the ideal answer set: the user takes a look at the retrieved documents and decides which ones are relevant and which are not (in practice, only the top-ranked documents need to be examined). The system then uses this information to refine the description of the ideal answer set. By repeating this process many times, the description is expected to evolve towards the real description of the ideal answer set.
32. Probabilistic model: probability basics
- Complementary events: p(a) + p(¬a) = 1
- Joint probability: p(a, b) = p(a ∧ b)
- Conditional probability: p(a | b) = p(a ∧ b) / p(b) = p(a, b) / p(b)
- Marginal probability: p(a) = sum_{x ∈ {b, ¬b}} p(a | x) p(x)
- Bayes' theorem: p(a | b) p(b) = p(b | a) p(a)
- Odds: O(a) = p(a) / p(¬a) = p(a) / (1 - p(a))
33. Probabilistic model: probability ranking principle
The probabilistic model casts the IR problem in a probabilistic framework.
The probability ranking principle (PRP): "If a reference retrieval system's response to each request is a ranking of the documents in the collection in order of decreasing probability of relevance to the user who submitted the request, [...] the overall effectiveness of the system to its user will be the best that is obtainable." [Robertson and Jones, 1976]
Classic probabilistic models are also known as binary independence retrieval (BIR) models.
34. Probabilistic model: probability ranking principle
Let d_j be a document in the collection. Let R represent relevance of a document w.r.t. a given (fixed) query q, and let NR represent non-relevance.
We need to find p(R | q, d_j): the probability that document d_j is relevant given the query q. According to the PRP, documents are ranked in descending order of p(R | q, d_j).
35. Probabilistic model: BIM
Model hypothesis: the Binary Independence Model.
- The distribution of terms in relevant documents is different from the distribution of terms in non-relevant documents: terms that occur in many relevant documents, and are absent from many non-relevant documents, should be given greater importance.
- Binary (Boolean): documents are represented as binary incidence vectors of terms (remember the Boolean model?): w_{i,j} ∈ {0, 1}, with w_{i,j} = 1 if term t_i is present in document d_j and w_{i,j} = 0 otherwise.
- Independence: terms occur in documents independently of each other.
Remark: different documents can be modelled by the same vector.
36. Probabilistic model: BIM
Documents are binary incidence vectors (as in the Boolean model), and queries are binary incidence vectors as well (remember that in the Boolean model queries were logical expressions).
Given a query q, for each document d_j we need to compute p(R | q, d_j); this is replaced by a computation over the binary incidence vectors representing d_j and q.
37. Probabilistic model: BIM
Prior probabilities (independent of a specific document):
- p(R | q): probability of retrieving a relevant document
- p(NR | q): probability of retrieving a non-relevant document
- p(d_j | q, R): probability that, if a relevant document is retrieved, it is document d_j
- p(d_j | q, NR): probability that, if a non-relevant document is retrieved, it is document d_j
38. Probabilistic model: BIM
We need the posterior probability for a specific document. By Bayes' theorem:
p(R | q, d_j) = p(d_j | q, R) p(R | q) / p(d_j | q)
p(NR | q, d_j) = p(d_j | q, NR) p(NR | q) / p(d_j | q)
p(R | q, d_j) + p(NR | q, d_j) = 1
Given a query q: estimate how terms contribute to relevance, compute the probability p(R | q, d_j) of each document d_j being relevant with respect to the query q, and order documents by decreasing p(R | q, d_j).
39. Probabilistic model: BIM
Starting from the odds:
O(R | q, d_j) = p(R | q, d_j) / p(NR | q, d_j)
             = [p(d_j | q, R) p(R | q) / p(d_j | q)] / [p(d_j | q, NR) p(NR | q) / p(d_j | q)]
             = [p(R | q) / p(NR | q)] · [p(d_j | q, R) / p(d_j | q, NR)]
The first factor, O(R | q) = p(R | q) / p(NR | q), is constant for a given query; the second factor needs to be estimated for each document.
40. Probabilistic model: BIM
Using the independence assumption, for a dictionary of M terms:
p(d_j | q, R) = prod_{i=1}^{M} p(w_{i,j} | q, R)
p(d_j | q, NR) = prod_{i=1}^{M} p(w_{i,j} | q, NR)
so that
O(R | q, d_j) = O(R | q) · prod_{i=1}^{M} p(w_{i,j} | q, R) / p(w_{i,j} | q, NR)
41. Probabilistic model: BIM
Given that w_{i,j} ∈ {0, 1}, the product splits over terms present in and absent from d_j:
prod_{i=1}^{M} p(w_{i,j} | q, R) / p(w_{i,j} | q, NR)
  = prod_{i: w_{i,j}=1} p(w_i = 1 | q, R) / p(w_i = 1 | q, NR) · prod_{i: w_{i,j}=0} p(w_i = 0 | q, R) / p(w_i = 0 | q, NR)
Define:
- p_i = p(w_i = 1 | q, R): probability of term t_i appearing in a document relevant to the query
- u_i = p(w_i = 1 | q, NR): probability of term t_i appearing in a document non-relevant to the query
Assume that p_i = u_i for all the terms not occurring in the query.
42. Probabilistic model: BIM
Continuing:
O(R | q, d_j) = O(R | q) · prod_{i: w_{i,j}=1} p_i / u_i · prod_{i: w_{i,j}=0} (1 - p_i) / (1 - u_i)
Since p_i = u_i for the terms not occurring in the query, only the terms t_i occurring in the query (w_{i,q} = 1) need be considered:
O(R | q, d_j) = O(R | q) · prod_{i: w_{i,j}=1, w_{i,q}=1} p_i / u_i · prod_{i: w_{i,j}=0, w_{i,q}=1} (1 - p_i) / (1 - u_i)
(the first product runs over terms occurring both in the query and in the document; the second over terms occurring only in the query)
43. Probabilistic model: BIM
The second product can be rewritten over all query terms:
prod_{i: w_{i,j}=0, w_{i,q}=1} (1 - p_i) / (1 - u_i) = prod_{i: w_{i,j}=1, w_{i,q}=1} (1 - u_i) / (1 - p_i) · prod_{i: w_{i,q}=1} (1 - p_i) / (1 - u_i)
where the first factor runs over terms occurring both in the query and in the document, and the second over all the terms occurring in the query (document independent). Hence:
O(R | q, d_j) = O(R | q) · prod_{i: w_{i,j}=1, w_{i,q}=1} [p_i (1 - u_i)] / [u_i (1 - p_i)] · prod_{i: w_{i,q}=1} (1 - p_i) / (1 - u_i)
44. Probabilistic model: BIM
In
O(R | q, d_j) = O(R | q) · prod_{i: w_{i,j}=1, w_{i,q}=1} [p_i (1 - u_i)] / [u_i (1 - p_i)] · prod_{i: w_{i,q}=1} (1 - p_i) / (1 - u_i)
the first and last factors are constant for each query; the middle product is the only quantity that has to be estimated for ranking. Taking logarithms:
SC(q, d_j) = log2 prod_{i: w_{i,j}=1, w_{i,q}=1} [p_i (1 - u_i)] / [u_i (1 - p_i)] = sum_{i: w_{i,j}=1, w_{i,q}=1} log2( [p_i (1 - u_i)] / [u_i (1 - p_i)] )
How do we estimate p_i and u_i from data?
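The final scoring formula can be sketched directly (the function name bim_sc is mine; p and u map each query term to its estimated probabilities):

```python
import math

def bim_sc(doc_terms, query_terms, p, u):
    """SC(q, d) = sum over terms present in both q and d of
       log2( p_i (1 - u_i) / (u_i (1 - p_i)) )."""
    sc = 0.0
    for t in query_terms & doc_terms:
        sc += math.log2((p[t] * (1 - u[t])) / (u[t] * (1 - p[t])))
    return sc

# With p_i = 0.5 and u_i = 0.25, a single matching term contributes log2(3)
p = {"t": 0.5}
u = {"t": 0.25}
print(bim_sc({"t"}, {"t"}, p, u))
```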
45. Probabilistic model: BIM
For each term t_i consider the following table of document counts:
                              Relevant   Non-relevant          Total
documents containing t_i        s_i       n_i - s_i             n_i
documents not containing t_i   S - s_i   (N - S) - (n_i - s_i)  N - n_i
Total                            S         N - S                 N
Estimates:
p_i = p(w_i = 1 | q, R) = s_i / S
u_i = p(w_i = 1 | q, NR) = (n_i - s_i) / (N - S)
46. Probabilistic model: BIM
u_i is initialized as n_i / N, as non-relevant documents are approximated by the whole document collection.
p_i can be initialized in various ways:
- from relevant documents, if some are known (e.g. relevance feedback, prior knowledge)
- constant (e.g. p_i = 0.5, even odds, for any given document)
- proportional to the probability of occurrence in the collection
An iterative procedure is used to refine p_i and u_i:
1. Determine a guess of the relevant document set (the top-S ranked documents, or the user's relevance feedback).
2. Update the estimates: p_i = s_i / S and u_i = (n_i - s_i) / (N - S).
Go to 1 until convergence, then return the ranking.
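The whole iterative procedure can be sketched as follows (a minimal illustration, not the slides' exact implementation; probabilities are clamped to [eps, 1 - eps] to avoid taking the log of zero, playing the role of the ε used in the example that follows):

```python
import math

def bim_rank(docs, query, S, eps=1e-6, max_iter=20):
    """Iterative BIM ranking sketch.
    `docs`: {doc_id: set of terms}; `query`: set of terms;
    S: size of the assumed-relevant top set."""
    N = len(docs)
    n = {t: sum(t in d for d in docs.values()) for t in query}
    p = {t: 0.5 for t in query}       # even-odds initialization
    u = {t: n[t] / N for t in query}  # non-relevant ~ whole collection

    def clamp(x):
        return min(max(x, eps), 1 - eps)

    relevant = None
    for _ in range(max_iter):
        def sc(terms):
            return sum(math.log2(clamp(p[t]) / (1 - clamp(p[t])))
                       + math.log2((1 - clamp(u[t])) / clamp(u[t]))
                       for t in query & terms)
        # stable sort: ties keep the original document ordering
        order = sorted(docs, key=lambda d: -sc(docs[d]))
        top = set(order[:S])
        if top == relevant:           # relevant set unchanged: converged
            return order
        relevant = top
        s = {t: sum(t in docs[d] for d in top) for t in query}
        p = {t: s[t] / S for t in query}
        u = {t: (n[t] - s[t]) / (N - S) for t in query}
    return order
```

Run on the example of the next slides it converges after two iterations, with d_1 and d_4 as the assumed-relevant top-2 set.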
47. Probabilistic model: example
Given:
- Query with index terms t_1, t_2, t_3, i.e. incidence vector q = (1, 1, 1) over the query terms
- Document incidence vectors (restricted to the query terms):
  d_1 = (0, 0, 1), d_2 = (0, 0, 0), d_3 = (0, 1, 0), d_4 = (1, 1, 0)
- Initialize p_i = 0.5
- Consider as relevant the top-2 documents (S = 2)
48. Probabilistic model: example
Determine the ranking of the document collection with respect to the given query, using a probabilistic model under the binary independence assumption.
N = 4 documents in the collection.
Reduce the document incidence vectors, considering only the terms that appear in the query:
d_1 = (0, 0, 1), d_2 = (0, 0, 0), d_3 = (0, 1, 0), d_4 = (1, 1, 0)
Initialize u_i = n_i / N for each term t_i (recall that p_i = 0.5):
t_1: n_1 = 1, u_1 = 0.25 | t_2: n_2 = 2, u_2 = 0.5 | t_3: n_3 = 1, u_3 = 0.25
49. Probabilistic model: example
1st iteration: determine the SC for each document.
SC(q, d_1) = log2(p_3 / (1 - p_3)) + log2((1 - u_3) / u_3) = 1.59
SC(q, d_2) = 0 (d_2 contains no query terms)
SC(q, d_3) = log2(p_2 / (1 - p_2)) + log2((1 - u_2) / u_2) = 0
SC(q, d_4) = log2(p_1 / (1 - p_1)) + log2((1 - u_1) / u_1) + log2(p_2 / (1 - p_2)) + log2((1 - u_2) / u_2) = 1.59
Rank the documents and consider the top-S as relevant (in case of tie, keep the documents' ordering): d_1 > d_4 > d_3 > d_2
Relevant documents = {d_1, d_4}
50. Probabilistic model: example
1st iteration (continued): calculate s_i and update p_i and u_i.
s_1 = 1, s_2 = 1, s_3 = 1
p_1 = s_1 / S = 1/2 = 0.5, p_2 = s_2 / S = 1/2 = 0.5, p_3 = s_3 / S = 1/2 = 0.5
u_1 = (n_1 - s_1) / (N - S) = 0/2 = 0
u_2 = (n_2 - s_2) / (N - S) = 1/2 = 0.5
u_3 = (n_3 - s_3) / (N - S) = 0/2 = 0
51. Probabilistic model: example
2nd iteration: determine the SC for each document, replacing zero probabilities with a small ε.
SC(q, d_1) = log2(p_3 / (1 - p_3)) + log2((1 - u_3) / u_3) = log2((1 - ε) / ε)
SC(q, d_2) = 0
SC(q, d_3) = log2(p_2 / (1 - p_2)) + log2((1 - u_2) / u_2) = 0
SC(q, d_4) = log2(p_1 / (1 - p_1)) + log2((1 - u_1) / u_1) + log2(p_2 / (1 - p_2)) + log2((1 - u_2) / u_2) = log2((1 - ε) / ε)
Rank the documents and consider the top-S as relevant (in case of tie, keep the documents' ordering): d_1 > d_4 > d_3 > d_2
Relevant documents = {d_1, d_4}: unchanged, so convergence is reached.
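The first-iteration scores of this example can be verified numerically, assuming the reduced incidence vectors d_1 = (0,0,1), d_2 = (0,0,0), d_3 = (0,1,0), d_4 = (1,1,0) implied by the counts above:

```python
import math

# First-iteration estimates: p_i = 0.5 for all terms,
# u_1 = 1/4, u_2 = 2/4, u_3 = 1/4 (u_i = n_i / N with N = 4)
p = {"t1": 0.5, "t2": 0.5, "t3": 0.5}
u = {"t1": 0.25, "t2": 0.5, "t3": 0.25}
docs = {"d1": {"t3"}, "d2": set(), "d3": {"t2"}, "d4": {"t1", "t2"}}

def sc(terms):
    return sum(math.log2(p[t] / (1 - p[t])) + math.log2((1 - u[t]) / u[t])
               for t in terms)

scores = {d: sc(ts) for d, ts in docs.items()}
# SC(d1) = SC(d4) = log2(3) ~ 1.59, SC(d2) = SC(d3) = 0,
# matching the ranking d1 > d4 > d3 > d2 (ties keep document order)
```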
52. Probabilistic model: conclusions
Pros:
- Documents are ranked in decreasing order of their probability of being relevant
- It includes a mechanism for relevance feedback
Cons:
- The initial separation of documents into relevant and non-relevant must be guessed
- It does not take into account the frequency with which a term occurs inside a document
- The assumption of independence of index terms
53. References
- [Baeza-Yates and Ribeiro-Neto, 1999] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval, 1999
- [Manning et al., 2008] C.D. Manning, P. Raghavan and H. Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008
- A. Strehl, Relationship-based Clustering and Cluster Ensembles for High-dimensional Data Mining, 2002
More informationCan Vector Space Bases Model Context?
Can Vector Space Bases Model Context? Massimo Melucci University of Padua Department of Information Engineering Via Gradenigo, 6/a 35031 Padova Italy melo@dei.unipd.it Abstract Current Information Retrieval
More informationTerm Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationGeneric Text Summarization
June 27, 2012 Outline Introduction 1 Introduction Notation and Terminology 2 3 4 5 6 Text Summarization Introduction Notation and Terminology Two Types of Text Summarization Query-Relevant Summarization:
More informationCAIM: Cerca i Anàlisi d Informació Massiva
1 / 21 CAIM: Cerca i Anàlisi d Informació Massiva FIB, Grau en Enginyeria Informàtica Slides by Marta Arias, José Balcázar, Ricard Gavaldá Department of Computer Science, UPC Fall 2016 http://www.cs.upc.edu/~caim
More informationCS 237: Probability in Computing
CS 237: Probability in Computing Wayne Snyder Computer Science Department Boston University Lecture 13: Normal Distribution Exponential Distribution Recall that the Normal Distribution is given by an explicit
More informationMidterm Examination Practice
University of Illinois at Urbana-Champaign Midterm Examination Practice CS598CXZ Advanced Topics in Information Retrieval (Fall 2013) Professor ChengXiang Zhai 1. Basic IR evaluation measures: The following
More informationA Risk Minimization Framework for Information Retrieval
A Risk Minimization Framework for Information Retrieval ChengXiang Zhai a John Lafferty b a Department of Computer Science University of Illinois at Urbana-Champaign b School of Computer Science Carnegie
More informationLanguage Models. Hongning Wang
Language Models Hongning Wang CS@UVa Notion of Relevance Relevance (Rep(q), Rep(d)) Similarity P(r1 q,d) r {0,1} Probability of Relevance P(d q) or P(q d) Probabilistic inference Different rep & similarity
More informationSpazi vettoriali e misure di similaritá
Spazi vettoriali e misure di similaritá R. Basili Corso di Web Mining e Retrieval a.a. 2009-10 March 25, 2010 Outline Outline Spazi vettoriali a valori reali Operazioni tra vettori Indipendenza Lineare
More informationBasic Probabilistic Reasoning SEG
Basic Probabilistic Reasoning SEG 7450 1 Introduction Reasoning under uncertainty using probability theory Dealing with uncertainty is one of the main advantages of an expert system over a simple decision
More information6 Probabilistic Retrieval Models
Probabilistic Retrieval Models 1 6 Probabilistic Retrieval Models Notations Binary Independence Retrieval model Probability Ranking Principle Probabilistic Retrieval Models 2 6.1 Notations Q Q Q D rel.
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,
More informationLanguage Models. Web Search. LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing. Slides based on the books: 13
Language Models LM Jelinek-Mercer Smoothing and LM Dirichlet Smoothing Web Search Slides based on the books: 13 Overview Indexes Query Indexing Ranking Results Application Documents User Information analysis
More informationECE521 week 3: 23/26 January 2017
ECE521 week 3: 23/26 January 2017 Outline Probabilistic interpretation of linear regression - Maximum likelihood estimation (MLE) - Maximum a posteriori (MAP) estimation Bias-variance trade-off Linear
More informationThe Boolean Model ~1955
The Boolean Model ~1955 The boolean model is the first, most criticized, and (until a few years ago) commercially more widespread, model of IR. Its functionalities can often be found in the Advanced Search
More information5 10 12 32 48 5 10 12 32 48 4 8 16 32 64 128 4 8 16 32 64 128 2 3 5 16 2 3 5 16 5 10 12 32 48 4 8 16 32 64 128 2 3 5 16 docid score 5 10 12 32 48 O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Classification: Naive Bayes Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 20 Introduction Classification = supervised method for
More informationCS630 Representing and Accessing Digital Information Lecture 6: Feb 14, 2006
Scribes: Gilly Leshed, N. Sadat Shami Outline. Review. Mixture of Poissons ( Poisson) model 3. BM5/Okapi method 4. Relevance feedback. Review In discussing probabilistic models for information retrieval
More information1 Information retrieval fundamentals
CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Collection Frequency, cf Define: The total
More informationMachine Learning for natural language processing
Machine Learning for natural language processing Classification: k nearest neighbors Laura Kallmeyer Heinrich-Heine-Universität Düsseldorf Summer 2016 1 / 28 Introduction Classification = supervised method
More informationProbabilistic Latent Semantic Analysis
Probabilistic Latent Semantic Analysis Seungjin Choi Department of Computer Science and Engineering Pohang University of Science and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjin@postech.ac.kr
More informationComputational Complexity of Bayesian Networks
Computational Complexity of Bayesian Networks UAI, 2015 Complexity theory Many computations on Bayesian networks are NP-hard Meaning (no more, no less) that we cannot hope for poly time algorithms that
More informationElements of probability theory
The role of probability theory in statistics We collect data so as to provide evidentiary support for answers we give to our many questions about the world (and in our particular case, about the business
More informationText classification II CE-324: Modern Information Retrieval Sharif University of Technology
Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2016 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationLecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval
Lecture 5: Introduction to (Robertson/Spärck Jones) Probabilistic Retrieval Scribes: Ellis Weng, Andrew Owens February 11, 2010 1 Introduction In this lecture, we will introduce our second paradigm for
More informationText classification II CE-324: Modern Information Retrieval Sharif University of Technology
Text classification II CE-324: Modern Information Retrieval Sharif University of Technology M. Soleymani Fall 2017 Some slides have been adapted from: Profs. Manning, Nayak & Raghavan (CS-276, Stanford)
More informationMath 350: An exploration of HMMs through doodles.
Math 350: An exploration of HMMs through doodles. Joshua Little (407673) 19 December 2012 1 Background 1.1 Hidden Markov models. Markov chains (MCs) work well for modelling discrete-time processes, or
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More information.. CSC 566 Advanced Data Mining Alexander Dekhtyar..
.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 8: Evaluation & SVD Paul Ginsparg Cornell University, Ithaca, NY 20 Sep 2011
More information9/12/17. Types of learning. Modeling data. Supervised learning: Classification. Supervised learning: Regression. Unsupervised learning: Clustering
Types of learning Modeling data Supervised: we know input and targets Goal is to learn a model that, given input data, accurately predicts target data Unsupervised: we know the input only and want to make
More informationManning & Schuetze, FSNLP, (c)
page 554 554 15 Topics in Information Retrieval co-occurrence Latent Semantic Indexing Term 1 Term 2 Term 3 Term 4 Query user interface Document 1 user interface HCI interaction Document 2 HCI interaction
More informationPart I: Web Structure Mining Chapter 1: Information Retrieval and Web Search
Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim
More informationProbability Kinematics in Information. Retrieval: a case study. F. Crestani. Universita' di Padova, Italy. C.J. van Rijsbergen.
Probability Kinematics in Information Retrieval: a case study F. Crestani Dipartimento di Elettronica e Informatica Universita' di Padova, Italy C.J. van Rijsbergen Department of Computing Science University
More informationCS 572: Information Retrieval
CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hoffman 1 Plan for next few weeks Project 1: done (submit by Friday).
More informationBehavioral Data Mining. Lecture 2
Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events
More informationA Dichotomy. in in Probabilistic Databases. Joint work with Robert Fink. for Non-Repeating Queries with Negation Queries with Negation
Dichotomy for Non-Repeating Queries with Negation Queries with Negation in in Probabilistic Databases Robert Dan Olteanu Fink and Dan Olteanu Joint work with Robert Fink Uncertainty in Computation Simons
More informationComputer Science CPSC 322. Lecture 18 Marginalization, Conditioning
Computer Science CPSC 322 Lecture 18 Marginalization, Conditioning Lecture Overview Recap Lecture 17 Joint Probability Distribution, Marginalization Conditioning Inference by Enumeration Bayes Rule, Chain
More information3. Basics of Information Retrieval
Text Analysis and Retrieval 3. Basics of Information Retrieval Prof. Bojana Dalbelo Bašić Assoc. Prof. Jan Šnajder With contributions from dr. sc. Goran Glavaš Mladen Karan, mag. ing. University of Zagreb
More informationK. Nishijima. Definition and use of Bayesian probabilistic networks 1/32
The Probabilistic Analysis of Systems in Engineering 1/32 Bayesian probabilistic bili networks Definition and use of Bayesian probabilistic networks K. Nishijima nishijima@ibk.baug.ethz.ch 2/32 Today s
More informationMobile Robot Localization
Mobile Robot Localization 1 The Problem of Robot Localization Given a map of the environment, how can a robot determine its pose (planar coordinates + orientation)? Two sources of uncertainty: - observations
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 Vector Space Model Disadvantages:
More informationCHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS
CHAPTER 4 CLASSICAL PROPOSITIONAL SEMANTICS 1 Language There are several propositional languages that are routinely called classical propositional logic languages. It is due to the functional dependency
More information16 The Information Retrieval "Data Model"
16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues
More informationFall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26
Fall 2016 CS646: Information Retrieval Lecture 6 Boolean Search and Vector Space Model Jiepu Jiang University of Massachusetts Amherst 2016/09/26 Outline Today Boolean Retrieval Vector Space Model Latent
More informationUniversity of Illinois at Urbana-Champaign. Midterm Examination
University of Illinois at Urbana-Champaign Midterm Examination CS410 Introduction to Text Information Systems Professor ChengXiang Zhai TA: Azadeh Shakery Time: 2:00 3:15pm, Mar. 14, 2007 Place: Room 1105,
More informationLatent Semantic Analysis. Hongning Wang
Latent Semantic Analysis Hongning Wang CS@UVa Recap: vector space model Represent both doc and query by concept vectors Each concept defines one dimension K concepts define a high-dimensional space Element
More informationData Mining 2018 Logistic Regression Text Classification
Data Mining 2018 Logistic Regression Text Classification Ad Feelders Universiteit Utrecht Ad Feelders ( Universiteit Utrecht ) Data Mining 1 / 50 Two types of approaches to classification In (probabilistic)
More information