3. Basics of Information Retrieval
1 Text Analysis and Retrieval: 3. Basics of Information Retrieval
Prof. Bojana Dalbelo Bašić, Assoc. Prof. Jan Šnajder
With contributions from dr. sc. Goran Glavaš and Mladen Karan, mag. ing.
University of Zagreb, Faculty of Electrical Engineering and Computing (FER)
Academic Year 2017/2018
Creative Commons Attribution-NonCommercial-NoDerivs 3.0, v1.2
Dalbelo Bašić, Šnajder (UNIZG FER), TAR: IR Basics, AY 2017/2018, 61 slides
2 After today's lecture, you'll...
1. Understand the importance of information retrieval
2. Gain familiarity with basic principles of IR
3. Gain a basic understanding of the three main IR paradigms
4. Be aware of some readily available IR tools
3 Outline
1. What is IR?
2. Text representations in IR
3. Boolean retrieval model
4. Vector space model
5. Classic probabilistic retrieval
6. Language modeling for retrieval
7. IR tools
4 Retrieval and search
What is your first association with "information retrieval"?
What is your first association with "search" or "search engine"?
5 Retrieval and search
[figure]
6 What is IR?
Information retrieval: "The activity of obtaining information resources relevant to a user's information need from a collection of information resources." (Wikipedia)
Elements of an information retrieval system:
- Information needs (expressed by users in the form of queries)
- Information (re)sources (typically unstructured text, images, video, audio, etc.)
- A component for the efficient retrieval of relevant sources for a given expressed information need, typically from a large collection of information sources
7 Information needs
Information need: an individual's or group's desire to locate and obtain information to satisfy a conscious or unconscious need. "Needs and interests call forth information." (Robert S. Taylor: The process of asking questions, 2007)
(Un)conscious needs for information are expressed via queries:
- Words and phrases in text information retrieval (e.g., "ISIS attacks")
- Images in image content retrieval
8 Why text information retrieval?
Large repositories of unstructured information sources:
- Companies: e.g., technical documentation, business contracts
- Governments: e.g., documentation, regulations, laws (CADIAL)
- Scientific material: e.g., research articles (Google Scholar)
- Personal: e.g., books, e-mails, files
- World Wide Web: the most demanding, due to the largest scale of all
9 Text information retrieval
The focus of TAR is text information retrieval, where models differ in:
- The representation of documents and queries
- The way they determine the relevance of a document for a given query
The relevance of a document is most often given as a score (not a binary decision).
Documents are ranked according to the scores assigned for the given query.
Relevance scores usually incorporate an element of uncertainty.
11 Text representation in IR
Unstructured representation:
- Text represented as an unordered set of terms (the so-called bag-of-words representation)
- A considerable oversimplification: syntax and semantics are ignored
- Despite the oversimplification, retrieval performance is satisfactory
Weakly structured representations:
- Certain groups of terms are given more importance (the contribution of other terms is downscaled or ignored)
- Noun phrases (NPs), named entities, etc.
Structured representations:
- Virtually not used in the IR context
- Information extraction (IE) techniques are not sufficiently accurate
- IE models can be time-costly, which is not acceptable in IR
12 Unstructured document representation
Document snippet: "One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near."
Bag of words:
{(One, 1), (evening, 1), (Frodo, 2), (and, 2), (Sam, 1), (were, 1), (walking, 1), (together, 1), (in, 1), (the, 3), (cool, 1), (twilight, 1), (Both, 1), (of, 2), (them, 1), (felt, 1), (restless, 1), (again, 1), (On, 1), (suddenly, 1), (shadow, 1), (parting, 1), (had, 1), (fallen, 1), (time, 1), (to, 1), (leave, 1), (Lothlorien, 1), (was, 1), (near, 1)}
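A minimal sketch of constructing such a bag of words in Python. The whitespace tokenization and naive punctuation stripping are simplifying assumptions; real IR systems use proper tokenizers.

```python
from collections import Counter

def bag_of_words(text):
    """Split on whitespace, strip trailing punctuation, and count terms."""
    tokens = [tok.strip(".,:;!?") for tok in text.split()]
    return Counter(tok for tok in tokens if tok)

snippet = ("One evening Frodo and Sam were walking together in the cool "
           "twilight. Both of them felt restless again.")
bow = bag_of_words(snippet)
print(bow["Frodo"])  # 1 (it would be 2 over the full snippet above)
```

Counter already behaves as a multiset, which is exactly what the bag-of-words representation is.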
13 Weakly structured document representation
Document snippet: "One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near."
Bag of nouns:
{(evening, 1), (Frodo, 2), (Sam, 1), (twilight, 1), (shadow, 1), (parting, 1), (time, 1), (Lothlorien, 1)}
Bag of named-entity terms:
{(Frodo, 2), (Sam, 1), (Lothlorien, 1)}
14 Preprocessing
Two very common preprocessing steps:
- Morphological normalization: stemming or lemmatization
- Stop-word removal
Morphological normalization:
- Conflating various forms of the same word to a common form
- Important for morphologically rich languages such as Croatian
- Stemming (e.g., kućom → kuć) is more often used than lemmatization (e.g., kućom → kuća)
Stop-word removal:
- Removing semantically poor terms such as articles, prepositions, conjunctions, pronouns, etc.
- Keeping just content words: nouns, verbs, adjectives, adverbs
Both methods reduce the cardinality of the document's bag-of-words set and generally boost IR performance.
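The two steps can be sketched as follows. The stop-word list and the suffix-stripping "stemmer" below are toy placeholders of my own (assumptions for illustration); production systems use curated stop lists and real stemmers such as Porter's.

```python
# Toy examples only: a tiny stop-word set and a crude suffix stripper,
# not the lists/algorithms used by any particular IR system.
STOP_WORDS = {"the", "of", "and", "in", "on", "a", "to", "were", "was"}

def stem(word):
    """Strip a few common English suffixes (very rough approximation)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def preprocess(tokens):
    lowered = (tok.lower() for tok in tokens)
    return [stem(t) for t in lowered if t not in STOP_WORDS]

print(preprocess(["Walking", "the", "orcs"]))  # ['walk', 'orc']
```

Note how both steps shrink the token list: the stop word disappears, and inflected forms conflate to a common stem.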
15 Formalization of IR models
The basic retrieval model is a triple (f_d, f_q, r), where:
1. f_d is a function that maps documents to their retrieval representations, i.e., f_d(d) = x_d, where x_d is the retrieval representation of document d;
2. f_q is a function that maps queries to their retrieval representations, i.e., f_q(q) = x_q, where x_q is the retrieval representation of query q;
3. r is a ranking function that takes into account the document representation x_d and the query representation x_q, and associates a real number indicating the potential relevance of document d for query q:
relevance(d, q) = r(f_d(d), f_q(q)) = r(x_d, x_q)
16 Index terms and term weights
- Index terms are all the terms in the collection (the vocabulary)
- The set of all index terms: K = {k_1, k_2, ..., k_t}, where t is the number of index terms
- For each document d_j, each term k_i is assigned a weight w_ij
- The weight of an index term not appearing in the document is 0
- Document d_j is represented by the term vector [w_1j, w_2j, ..., w_tj]
- Let g be the function that computes the weights, i.e., w_ij = g(k_i, d_j)
- Different choices of the weighting function g and the ranking function r define different IR models
17 IR paradigms
Information retrieval models roughly fall into three paradigms:
1. Set-theoretic models: Boolean model, extended Boolean model
2. Algebraic models: vector space model, latent semantic indexing
3. Probabilistic models: classic probabilistic model, language model
Additionally, there are IR models that utilize link analysis algorithms (e.g., PageRank, HITS), typically used in web retrieval, where documents are (hyper)linked.
19 Boolean retrieval model
- Documents are represented as bags of words
- Term weights are all binary: w_ij ∈ {0, 1}
- w_ij = 1 iff index term k_i appears in the bag of words of document d_j
- Query q is given as a propositional logic formula over index terms: terms are connected via Boolean operators (∧, ∨) and can be negated (¬)
- Each query q can be transformed into disjunctive normal form (DNF), i.e., q = q_c1 ∨ q_c2 ∨ ... ∨ q_cn, where q_cl is the l-th conjunctive component of q's DNF
- The relevance of document d_j for query q is:
relevance(d_j, q) = 1, if there exists q_cl such that for all k_i ∈ terms(q_cl), w_ij = 1; 0, otherwise.
20 Boolean retrieval example
Assume the following set of index terms: K = {Frodo, Sam, blue, sword, orc, Mordor}
Let us have the following documents:
d_1: "Frodo stabbed the orc with the red sword"
d_2: "Frodo and Sam used the blue lamp to locate orcs"
d_3: "Sam killed many orcs in Mordor with the blue sword"
Which documents are relevant for the following queries?
q_1: (Frodo ∧ orc ∧ sword) ∨ (Frodo ∧ blue) → {d_1, d_2}
q_2: (Sam ∧ blue) ∨ (Sam ∧ orc ∧ Mordor) → {d_2, d_3}
21 Boolean retrieval: inverted index
- A data structure for computationally efficient retrieval
- An inverted file index contains, for each index term, a list of references to the documents containing it
For the collection D consisting of the following documents:
d_1: "Frodo and Sam stabbed orcs."
d_2: "Frodo stabbed the orc with the sword."
d_3: "Sam stole Frodo's sword."
we have the following inverted index:
L(Frodo) = {d_1, d_2, d_3}
L(Sam) = {d_1, d_3}
L(orc) = {d_1, d_2}
L(stab) = {d_1, d_2}
L(sword) = {d_2, d_3}
L(stole) = {d_3}
22 Boolean retrieval: inverted index (2)
- A full inverted index additionally contains the positions of each word within a document, e.g., Frodo: {(d_1, 1), (d_2, 1), (d_3, 3)}
- An inverted index allows handling Boolean queries via set intersections and set unions
q: (Frodo ∧ stab) ∨ (Sam ∧ orc)
Results are obtained by computing as follows:
rel(d, q) = (L(Frodo) ∩ L(stab)) ∪ (L(Sam) ∩ L(orc))
          = ({d_1, d_2, d_3} ∩ {d_1, d_2}) ∪ ({d_1, d_3} ∩ {d_1, d_2})
          = {d_1, d_2} ∪ {d_1}
          = {d_1, d_2}
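The computation above maps directly onto Python sets. A sketch of the slide's example (the documents are assumed to be already preprocessed into stems):

```python
from collections import defaultdict

docs = {
    "d1": ["frodo", "sam", "stab", "orc"],
    "d2": ["frodo", "stab", "orc", "sword"],
    "d3": ["sam", "stole", "frodo", "sword"],
}

# Build the inverted index: term -> set of document ids (posting list)
index = defaultdict(set)
for doc_id, terms in docs.items():
    for term in terms:
        index[term].add(doc_id)

# Evaluate (Frodo AND stab) OR (Sam AND orc) via set operations
result = (index["frodo"] & index["stab"]) | (index["sam"] & index["orc"])
print(sorted(result))  # ['d1', 'd2']
```

Real systems keep posting lists sorted and merge them with linear scans rather than hash sets, but the set algebra is the same.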
23 Boolean retrieval model
Advantages:
- Only one: simplicity (computational efficiency)
- Popular in early commercial systems (e.g., Westlaw)
Shortcomings:
- Expressing information needs as Boolean expressions is unintuitive
- A puritan model: no ranking; documents are either relevant or non-relevant
- The relative importance of index terms is ignored
Extended Boolean model: a variant of the Boolean model that accounts for the partial fulfilment of a Boolean expression
25 Vector space model
- Documents and queries are represented as vectors of index terms
- Weights are non-negative real numbers (w_ij ≥ 0):
d_j = [w_1j, w_2j, ..., w_tj]
q = [w_1q, w_2q, ..., w_tq]
- The relevance of a document for the query is estimated by computing some distance or similarity metric between the two vectors
- Distance metrics (Euclidean, Manhattan, etc.): more relevant when the distance is lower
- Similarity metrics (cosine, Dice, etc.): more relevant when the similarity is larger
26 Vector space model: distance metrics
Euclidean distance:
dis_E(d_j, q) = sqrt( Σ_{i=1}^{t} (w_ij − w_iq)^2 )
Manhattan distance:
dis_M(d_j, q) = Σ_{i=1}^{t} |w_ij − w_iq|
27 Vector space model: similarity metrics
Cosine similarity:
Cosine(d_j, q) = (d_j · q) / (‖d_j‖ ‖q‖) = Σ_{i=1}^{t} w_ij w_iq / ( sqrt(Σ_{i=1}^{t} w_ij^2) · sqrt(Σ_{i=1}^{t} w_iq^2) )
Dice similarity:
Dice(d_j, q) = 2 Σ_{i=1}^{t} w_ij w_iq / ( Σ_{i=1}^{t} w_ij + Σ_{i=1}^{t} w_iq )
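Cosine similarity is a few lines of Python; a minimal sketch over plain lists of term weights (a library like NumPy would be used in practice):

```python
import math

def cosine(d, q):
    """Cosine similarity between two equal-length term-weight vectors."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm = math.sqrt(sum(w * w for w in d)) * math.sqrt(sum(w * w for w in q))
    return dot / norm if norm else 0.0

print(cosine([1.0, 0.0, 2.0], [2.0, 0.0, 4.0]))  # ≈ 1.0: same direction
print(cosine([1.0, 0.0], [0.0, 1.0]))            # 0.0: no shared terms
```

Because cosine depends only on vector direction, two documents with the same term proportions score 1 regardless of length, which is why it is preferred over Euclidean distance for text.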
28 Vector space model: term weighting
How are the weights w_ij of index terms for documents computed? Two intuitive assumptions:
1. The relevance of an index term for a document is proportional to its frequency in the document (the term frequency component), i.e., more frequent → more relevant
2. The relevance of an index term for any document is inversely proportional to the number of documents in the collection in which it occurs (the inverse document frequency component), i.e., more common across documents → less relevant (e.g., stop words such as "the")
29 Vector space model: weighting schemes
Binary weighting scheme:
- Not really a weighting scheme; ignores the two aforementioned assumptions
- w_ij = 1 iff document d_j contains term k_i
TF-IDF weighting scheme:
- The weight is computed as the product of the term frequency component and the inverse document frequency component:
w_ij = tf(k_i, d_j) × idf(k_i, D)
The most popular local and global schemes:
tf(k_i, d_j) = freq(k_i, d_j) / max_{k ∈ d_j} freq(k, d_j)
idf(k_i, D) = log( |D| / |{d_j ∈ D : k_i ∈ d_j}| )
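The two formulas above can be sketched directly in Python (the choice of log base is an implementation detail; natural log is used here):

```python
import math
from collections import Counter

def tf_idf_weights(docs):
    """w_ij = tf(k_i, d_j) * idf(k_i, D), with tf normalized by the most
    frequent term in the document and idf = log(N / document frequency)."""
    N = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        freq = Counter(doc)
        max_f = max(freq.values())
        weights.append({t: (f / max_f) * math.log(N / df[t])
                        for t, f in freq.items()})
    return weights

docs = [["frodo", "orc", "orc"], ["sam", "orc"], ["sam", "sword"]]
w = tf_idf_weights(docs)
```

Note that a term occurring in every document gets idf = log(1) = 0 and thus weight 0, which is the scheme's built-in stop-word suppression.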
31 Probabilistic retrieval models
- View retrieval as the problem of estimating the probability of relevance given a query, document, collection, etc.
- Documents are ranked in decreasing order of this probability
The most popular models from this category:
- Classical probabilistic model
- Two-Poisson model
- BM25
- Language model
32 Probability refresher
Bayes' rule:
P(B|A) = P(A|B) P(B) / P(A)
- Allows us to compute P(B|A) using P(A|B)
- Particularly useful if the former is difficult to compute directly
Chain rule:
P(A, B, C) = P(A|B, C) P(B|C) P(C)
- Allows us to write a joint probability using conditional probabilities
33 Probability ranking principle
Probability ranking principle (Robertson, 1977): "If an IR system's response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its user will be maximal."
Probabilistic IR models need to answer the basic question: "What is the probability that a user will judge this document as relevant for this query?" (Sparck Jones et al., 2000)
How do we formalize this question?
34 Probability ranking principle (2)
Random variables:
- D = [D_1, ..., D_t, ..., D_N]: the document (set of terms)
- Q = [Q_1, ..., Q_t, ..., Q_L]: the query (set of terms)
- R ∈ {0, 1}: the relevance judgement; R = 1 if D is relevant for Q, R = 0 otherwise
The basic question now translates to estimating:
P(R = 1 | D = d, Q = q)    (1)
35 Probability ranking principle (3)
Let r be a shorthand for R = 1, and ¬r for R = 0.
We apply the logit function to probability (1):
logit(p) = log( p / (1 − p) )
This is a rank-preserving transformation, giving:
log( p(r|D, Q) / (1 − p(r|D, Q)) ) = log( p(r|D, Q) / p(¬r|D, Q) )
36 Logit
[figure: plot of the logit function]
37 Probability ranking principle (4)
We can now apply Bayes' rule:
log( p(r|D, Q) / p(¬r|D, Q) ) = log( p(D, Q|r) P(r) / ( p(D, Q|¬r) P(¬r) ) )
(A benefit of having used the logit is that P(D, Q) cancels out.)
38 Probability ranking principle (5)
Finally, we use the chain rule:
log( p(D, Q|r) P(r) / ( p(D, Q|¬r) P(¬r) ) )
  = log( p(D|Q, r) p(Q|r) P(r) / ( p(D|Q, ¬r) p(Q|¬r) P(¬r) ) )
  = log( p(D|Q, r) / p(D|Q, ¬r) ) + log( p(Q|r) P(r) / ( p(Q|¬r) P(¬r) ) )
  ∝ log( p(D|Q, r) / p(D|Q, ¬r) )    (2)
The right-hand term in the second row does not depend on D, so we can remove it without affecting the ordering (it can be interpreted as a measure of query difficulty).
Expression (2) is the heart of probabilistic retrieval models.
39 Binary Independence Model
Our first attempt at estimating expression (2):
- Documents are represented as vectors of binary random variables D = [D_1, ..., D_N], with one dimension for each term in the vocabulary V
- D_i = 1 denotes the presence of term i; D_i = 0 denotes the absence of term i
- Similarly, the query is a vector Q = [Q_1, ..., Q_L]
40 Binary Independence Model (2)
Assumption 1: given relevance, terms are statistically independent.
Now we can write the terms in (2) as follows:
p(D|Q, r) = Π_{i=1}^{|V|} p(D_i|Q, r)
p(D|Q, ¬r) = Π_{i=1}^{|V|} p(D_i|Q, ¬r)
turning expression (2) into:
log( p(D|Q, r) / p(D|Q, ¬r) ) = Σ_{i=1}^{|V|} log( p(D_i|Q, r) / p(D_i|Q, ¬r) )    (3)
41 Binary Independence Model (3)
Assumption 2: the presence of a term in a document depends on relevance only when that term is present in the query.
For a fixed query Q = q = [q_1, ..., q_L], if q_i = 0, then according to this assumption P(D_i|q, r) does not depend on relevance:
p(D_i|q, r) = p(D_i|q, ¬r), so log( p(D_i|q, r) / p(D_i|q, ¬r) ) = 0
We can thus simplify (3) by ignoring all summation terms not in q:
Σ_{i=1}^{|V|} log( p(D_i|q, r) / p(D_i|q, ¬r) ) = Σ_{t ∈ q} log( p(D_t|q, r) / p(D_t|q, ¬r) )    (4)
42 Binary Independence Model (4)
We can view expression (4) as summing weights of terms:
Σ_{t ∈ q} log( p(D_t|q, r) / p(D_t|q, ¬r) ) = Σ_{t ∈ q} w_t    (5)
Although the assumptions are often violated, these models work well.
Common practice is to approximate p(D_t|q, r) with 0.5 and p(D_t|q, ¬r) with n_t / N_d:
- n_t: the number of documents containing term t
- N_d: the total number of documents
A very similar alternative is to use IDF scores as w_t.
These probabilities can be re-estimated through relevance feedback.
43 Example
For the collection D consisting of the following documents:
d_1: "Frodo and Sam stabbed orcs."
d_2: "Sam chased the orc with the sword."
d_3: "Sam took the sword."
and the query "Sam stabbed orc", using the common-practice approximations:

                 d_1                          d_2                 d_3
t                Sam      stabbed   orcs     Sam      orc        Sam
P(D_t|q, r)      0.5      0.5       0.5      0.5      0.5        0.5
P(D_t|q, ¬r)     3/3      1/3       2/3      3/3      2/3        3/3
w_t              log 0.5  log 1.5   log 0.75 log 0.5  log 0.75   log 0.5

Summing the w_t yields d_1 as the most relevant result, followed by d_3 and d_2.
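The example can be reproduced in a few lines. A sketch that assumes the documents are already preprocessed into stems (stop words removed, "stabbed" → "stab", etc.):

```python
import math

docs = {
    "d1": {"frodo", "sam", "stab", "orc"},
    "d2": {"sam", "chase", "orc", "sword"},
    "d3": {"sam", "take", "sword"},
}
query = {"sam", "stab", "orc"}
N = len(docs)

def bim_score(doc_terms):
    """Sum w_t = log(p(D_t|q,r) / p(D_t|q,~r)) over query terms in the doc,
    approximating p(.|r) with 0.5 and p(.|~r) with n_t / N."""
    score = 0.0
    for t in query & doc_terms:
        n_t = sum(t in d for d in docs.values())
        score += math.log(0.5 / (n_t / N))
    return score

ranking = sorted(docs, key=lambda d: bim_score(docs[d]), reverse=True)
print(ranking)  # ['d1', 'd3', 'd2']
```

The scores are log 0.5625, log 0.5, and log 0.375 for d_1, d_3, and d_2 respectively, matching the slide's ordering.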
44 Two-Poisson model
- A more realistic document representation: a vector of word frequencies
- Uses the Poisson distribution to model frequencies
- Assumption: all documents are of equal length
- Can be approximated by the following expression:
Σ_{t ∈ q} [ f_{t,d} (k_1 + 1) / (k_1 + f_{t,d}) ] · w_t    (6)
where f_{t,d} is the frequency of term t in document d and k_1 a constant (typically 1 ≤ k_1 < 2)
- Higher-frequency words get boosted weights
45 BM11
- Removes the document length assumption of the two-Poisson model: matches in longer documents should be less important
- We can correct the frequency: f'_{t,d} = f_{t,d} · (l_avg / l_d)
  - l_avg: the average length of a document
  - l_d: the length of document d
  - Dampens/boosts word frequencies for documents of above/below average length
Now we can rewrite (6) as:
Σ_{t ∈ q} [ f_{t,d} (k_1 + 1) / ( k_1 (l_d / l_avg) + f_{t,d} ) ] · w_t
46 BM25
While BM11 removes the problem of assuming equal document lengths, in practice it still has problems:
- Long relevant documents get too much dampening
- Short irrelevant documents get too much boosting
To control the amount of correction, we introduce a parameter b (often set to 0.75):
Σ_{t ∈ q} [ f_{t,d} (k_1 + 1) / ( k_1 (1 − b) + k_1 (l_d / l_avg) b + f_{t,d} ) ] · w_t    (7)
This expression is the famous BM25 ranking function, which gives state-of-the-art results.
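Expression (7) can be sketched as follows; taking w_t to be the IDF score is an assumption here (one of the common choices mentioned on slide 42), and other BM25 variants use a slightly different IDF formula:

```python
import math

def bm25_score(query, doc, docs, k1=1.2, b=0.75):
    """Score a document (list of terms) for a query per expression (7),
    with w_t = log(N / n_t) as the term weight."""
    N = len(docs)
    l_avg = sum(len(d) for d in docs) / N
    score = 0.0
    for t in set(query):
        f_td = doc.count(t)
        if f_td == 0:
            continue
        n_t = sum(t in d for d in docs)          # document frequency
        denom = k1 * (1 - b) + k1 * (len(doc) / l_avg) * b + f_td
        score += (f_td * (k1 + 1)) / denom * math.log(N / n_t)
    return score

docs = [["frodo", "orc", "orc", "sword"], ["sam", "orc"], ["sam", "sword", "sword"]]
scores = [bm25_score(["orc", "sword"], d, docs) for d in docs]
```

With b = 0, the length normalization vanishes and the formula reduces to the two-Poisson approximation (6); with b = 1, it reduces to BM11.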
48 Language modeling
- Approaching the probabilistic information retrieval problem from a different perspective
- Instead of modeling the document probability given the query, we model the query probability given the document
49 Language modeling (2)
Probabilistic modeling of language:
- A sentence t is a vector of terms ⟨t_1, ..., t_n⟩
- Ideally, the probability of sentence t is:
p(t) = p(t_1) · p(t_2|t_1) · ... · p(t_i|t_1, ..., t_{i−1}) · ... · p(t_n|t_1, ..., t_{n−1})
- E.g., for t = ⟨one, ring, to, rule⟩ we have:
p(t) = p(one) · p(ring|one) · p(to|one, ring) · p(rule|one, ring, to)
- In practice, the sparseness problem makes this intractable
50 Unigram language models
- Solve the sparseness problem by completely ignoring conditioning
- The probability of sentence t under unigram model M is:
p(t|M) = p(t_1) · p(t_2) · ... · p(t_n)
- The probabilities that define the model are estimated from a collection:
p(t_i) = n_i / n_T    (8)
  - n_i: the number of times term t_i occurs in the collection
  - n_T: the total number of term occurrences in the collection
51 Bigram language models
- We simplify the conditionals by keeping only the previous word
- A better approximation of reality than unigram models
- The probability of sentence t under bigram model M is:
p(t|M) = p(t_1) · p(t_2|t_1) · ... · p(t_n|t_{n−1})
- The probabilities that define the model are estimated from a collection:
p(t_i|t_{i−1}) = n(t_{i−1}, t_i) / n(t_{i−1})
  - n(t_{i−1}, t_i): the number of times bigram (t_{i−1}, t_i) occurs in the collection
  - n(t_{i−1}): the number of times term t_{i−1} occurs in the collection
52 Example
For the collection D consisting of the following documents:
d_1: "Frodo and Sam stabbed orcs."
d_2: "Sam chased the orc with the sword."
d_3: "Sam took the sword."
Estimating a unigram language model (for the entire collection):
t_i:      Frodo   Sam    orc    chased   sword   ...
P(t_i):   1/16    3/16   2/16   1/16     2/16    ...
Estimating a bigram language model (for the entire collection):
(t_{i−1}, t_i):    (Frodo, chased)   (the, sword)   (the, orc)   ...
P(t_i|t_{i−1}):    0                 2/3            1/3          ...
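The slide's estimates can be reproduced with two Counters. Normalizing "orcs" to "orc" is an assumption about the preprocessing behind the slide's counts, and for simplicity the bigrams here run over the concatenated collection (so a few spurious bigrams straddle document boundaries, which does not affect the counts shown):

```python
from collections import Counter

tokens = ("frodo and sam stabbed orc "          # d1 ("orcs" -> "orc")
          "sam chased the orc with the sword "  # d2
          "sam took the sword").split()         # d3

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def p_unigram(t):
    return unigrams[t] / len(tokens)            # n_i / n_T

def p_bigram(t, prev):
    return bigrams[(prev, t)] / unigrams[prev]  # n(prev, t) / n(prev)

print(p_unigram("sam"))          # 3/16
print(p_bigram("sword", "the"))  # 2/3
```

Note that the 16 tokens and the counts for "sam", "the sword", and "the orc" match the tables above.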
53 Query likelihood model
- Given a document collection D and a query q:
- A language model M_d is built for each document
- Documents are scored according to the probability P(q|M_d)
- Intuition: language models corresponding to relevant documents should assign higher probability to the query
54 Query likelihood model (2)
A unigram language model of document d can be estimated by dividing the number of times term t_i occurs in d by the number of terms in d:
p(t_i|M_d) = n_{i,d} / n_d
Bigram language models are rarely applied to individual documents due to sparsity problems.
55 Smoothing language models
- The models we've considered so far assign probability 0 to queries containing terms that do not occur in the document
- We can prevent this by using smoothing techniques
- Smoothing assigns a small probability under the model even to unseen words
56 Smoothing techniques: Laplace smoothing
- Add a fixed small count (e.g., 1) to all word counts (even the unobserved ones) and renormalize to get a probability distribution:
p'(t_i|M_d) = (n_{i,d} + α) / (n_d + |V| α)
- Adds an artificial count α for each possible word in the vocabulary V
57 Smoothing techniques: Jelinek-Mercer smoothing
- Laplace smoothing assumes all unseen words are equally likely!
- Jelinek-Mercer smoothing instead builds a language model M_D of the entire document collection, and interpolates:
p'(t_i|M_d) = λ p(t_i|M_d) + (1 − λ) p(t_i|M_D)
- Words absent from a document still get some probability mass from the right-hand term
- However, the amount each word gets varies with its likelihood in the collection as a whole
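Putting the query likelihood model and Jelinek-Mercer smoothing together gives a complete retrieval scorer. A sketch over the running example collection, with λ = 0.5 as an arbitrary assumption (in practice λ is tuned):

```python
from collections import Counter

docs = {
    "d1": "frodo and sam stabbed orc".split(),
    "d2": "sam chased the orc with the sword".split(),
    "d3": "sam took the sword".split(),
}
collection = [t for d in docs.values() for t in d]
coll_counts = Counter(collection)

def p_jm(term, doc_tokens, lam=0.5):
    """Jelinek-Mercer: interpolate the document model with the collection model."""
    p_doc = doc_tokens.count(term) / len(doc_tokens)
    p_coll = coll_counts[term] / len(collection)
    return lam * p_doc + (1 - lam) * p_coll

def query_likelihood(query, doc_tokens):
    score = 1.0
    for t in query:                  # P(q|M_d) under the unigram assumption
        score *= p_jm(t, doc_tokens)
    return score

ranking = sorted(docs, key=lambda d: query_likelihood(["sam", "sword"], docs[d]),
                 reverse=True)
print(ranking)  # ['d3', 'd2', 'd1']
```

Without smoothing, d_1 would score exactly 0 (it lacks "sword"); with smoothing it still gets a small nonzero score from the collection model.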
59 IR Tools
Terrier IR platform:
- Open-source command-line IR platform
- Popular in academia; used for IR evaluations in research
- Stepwise usage: first indexing, then retrieval, and then evaluation
Lucene:
- Open-source information retrieval library written in Java (ports to many languages exist)
- Describes a document as a set of (user-definable) fields
- A very flexible solution for applications requiring full-text indexing and search
Elasticsearch:
- RESTful search engine built on top of Lucene
- Very limited NLP functionality
60 Terrier (University of Glasgow)
Implements a wide variety of IR models: VSMs, LMs, PMs.
Stepwise usage:
1. Loading and indexing the collection of documents (which need to be in the TREC format):
>> trec_setup.sh <coll-path>
>> trec_terrier.sh -i
2. Performing retrieval (must provide the collection of queries and define the retrieval model):
>> trec_terrier.sh -r -Dtrec.topics=<q-path> -Dtrec.model=<ir-model>
3. Evaluating the retrieval performance (must provide relevance judgements):
>> trec_terrier.sh -e -Dtrec.qrels=<rj-path>
61 Now you...
1. Understand the importance of information retrieval
2. Are familiar with basic principles of IR
3. Have a basic understanding of the three main IR paradigms
4. Are aware of some readily available IR tools
Language Models Hongning Wang CS@UVa Notion of Relevance Relevance (Rep(q), Rep(d)) Similarity P(r1 q,d) r {0,1} Probability of Relevance P(d q) or P(q d) Probabilistic inference Different rep & similarity
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,
More informationPart A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )
Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds
More informationInformation Retrieval and Web Search Engines
Information Retrieval and Web Search Engines Lecture 4: Probabilistic Retrieval Models April 29, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig
More informationCAIM: Cerca i Anàlisi d Informació Massiva
1 / 21 CAIM: Cerca i Anàlisi d Informació Massiva FIB, Grau en Enginyeria Informàtica Slides by Marta Arias, José Balcázar, Ricard Gavaldá Department of Computer Science, UPC Fall 2016 http://www.cs.upc.edu/~caim
More informationUniversity of Illinois at Urbana-Champaign. Midterm Examination
University of Illinois at Urbana-Champaign Midterm Examination CS410 Introduction to Text Information Systems Professor ChengXiang Zhai TA: Azadeh Shakery Time: 2:00 3:15pm, Mar. 14, 2007 Place: Room 1105,
More information16 The Information Retrieval "Data Model"
16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues
More informationLecture 3: Probabilistic Retrieval Models
Probabilistic Retrieval Models Information Retrieval and Web Search Engines Lecture 3: Probabilistic Retrieval Models November 5 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;
More informationPV211: Introduction to Information Retrieval
PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,
More informationVector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson
Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying
More informationTerm Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan
Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes
More informationOutline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting
Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,
More informationScore Distribution Models
Score Distribution Models Evangelos Kanoulas Virgil Pavlu Keshi Dai Javed Aslam Score Distributions 2 Score Distributions 2 Score Distributions 9.6592 9.5761 9.4919 9.4784 9.2693 9.2066 9.1407 9.0824 9.0110
More informationText Analytics (Text Mining)
http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS
More informationINFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review
INFO 4300 / CS4300 Information Retrieval IR 9: Linear Algebra Review Paul Ginsparg Cornell University, Ithaca, NY 24 Sep 2009 1/ 23 Overview 1 Recap 2 Matrix basics 3 Matrix Decompositions 4 Discussion
More informationCS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002
CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope
More informationLecture 13: More uses of Language Models
Lecture 13: More uses of Language Models William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 13 What we ll learn in this lecture Comparing documents, corpora using LM approaches
More informationProbabilistic Field Mapping for Product Search
Probabilistic Field Mapping for Product Search Aman Berhane Ghirmatsion and Krisztian Balog University of Stavanger, Stavanger, Norway ab.ghirmatsion@stud.uis.no, krisztian.balog@uis.no, Abstract. This
More informationvector space retrieval many slides courtesy James Amherst
vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the
More information.. CSC 566 Advanced Data Mining Alexander Dekhtyar..
.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a
More information1 Information retrieval fundamentals
CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving
More informationInformation Retrieval
Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections
More informationMaschinelle Sprachverarbeitung
Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle
More informationIntroduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model
Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationMidterm sample questions
Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts
More informationKnowledge Discovery in Data: Overview. Naïve Bayesian Classification. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..
Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar Knowledge Discovery in Data: Naïve Bayes Overview Naïve Bayes methodology refers to a probabilistic approach to information discovery
More informationA Study of the Dirichlet Priors for Term Frequency Normalisation
A Study of the Dirichlet Priors for Term Frequency Normalisation ABSTRACT Ben He Department of Computing Science University of Glasgow Glasgow, United Kingdom ben@dcs.gla.ac.uk In Information Retrieval
More informationInformation Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)
Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare
More informationCS 572: Information Retrieval
CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hoffman 1 Plan for next few weeks Project 1: done (submit by Friday).
More informationVariable Latent Semantic Indexing
Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background
More informationLecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science
Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science http://www.cs.odu.edu/ jbollen January 30, 2003 Page 1 Structure 1. IR formal characterization (a) Mathematical
More informationText Analysis. Week 5
Week 5 Text Analysis Reference and Slide Source: ChengXiang Zhai and Sean Massung. 2016. Text Data Management and Analysis: a Practical Introduction to Information Retrieval and Text Mining. Association
More informationLecture 5: Web Searching using the SVD
Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially
More informationScoring, Term Weighting and the Vector Space
Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J
More informationMachine Learning: Assignment 1
10-701 Machine Learning: Assignment 1 Due on Februrary 0, 014 at 1 noon Barnabas Poczos, Aarti Singh Instructions: Failure to follow these directions may result in loss of points. Your solutions for this
More informationSparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent
More informationANLP Lecture 22 Lexical Semantics with Dense Vectors
ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous
More informationRanked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy
Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus
More informationMidterm Examination Practice
University of Illinois at Urbana-Champaign Midterm Examination Practice CS598CXZ Advanced Topics in Information Retrieval (Fall 2013) Professor ChengXiang Zhai 1. Basic IR evaluation measures: The following
More information13 Searching the Web with the SVD
13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this
More information9 Searching the Internet with the SVD
9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this
More informationInformation Retrieval and Organisation
Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationCategorization ANLP Lecture 10 Text Categorization with Naive Bayes
1 Categorization ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Important task for both humans and machines object identification face recognition spoken word recognition
More informationANLP Lecture 10 Text Categorization with Naive Bayes
ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Categorization Important task for both humans and machines 1 object identification face recognition spoken word recognition
More informationA Study of Smoothing Methods for Language Models Applied to Information Retrieval
A Study of Smoothing Methods for Language Models Applied to Information Retrieval CHENGXIANG ZHAI and JOHN LAFFERTY Carnegie Mellon University Language modeling approaches to information retrieval are
More informationThe Naïve Bayes Classifier. Machine Learning Fall 2017
The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning
More informationLecture 12: Link Analysis for Web Retrieval
Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities
More informationThe Static Absorbing Model for the Web a
Journal of Web Engineering, Vol. 0, No. 0 (2003) 000 000 c Rinton Press The Static Absorbing Model for the Web a Vassilis Plachouras University of Glasgow Glasgow G12 8QQ UK vassilis@dcs.gla.ac.uk Iadh
More informationTerm Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze
Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either
More informationDocument indexing, similarities and retrieval in large scale text collections
Document indexing, similarities and retrieval in large scale text collections Eric Gaussier Univ. Grenoble Alpes - LIG Eric.Gaussier@imag.fr Eric Gaussier Document indexing, similarities & retrieval 1
More informationDISTRIBUTIONAL SEMANTICS
COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.
More informationInformation Retrieval. Lecture 6
Information Retrieval Lecture 6 Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces This lecture
More informationLanguage as a Stochastic Process
CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any
More informationCS47300: Web Information Search and Management
CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 Vector Space Model Disadvantages:
More informationLecture 2: Probability, Naive Bayes
Lecture 2: Probability, Naive Bayes CS 585, Fall 205 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp205/ Brendan O Connor Today Probability Review Naive Bayes classification
More informationTopic Models and Applications to Short Documents
Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text
More informationLeverage Sparse Information in Predictive Modeling
Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from
More informationMatrices, Vector Spaces, and Information Retrieval
Matrices, Vector Spaces, and Information Authors: M. W. Berry and Z. Drmac and E. R. Jessup SIAM 1999: Society for Industrial and Applied Mathematics Speaker: Mattia Parigiani 1 Introduction Large volumes
More informationGeometric and Quantum Methods for Information Retrieval
PAPER Geometric and Quantum Methods for Information Retrieval Yaoyong Li, Hamish Cunningham Department of Computer Science, University of Sheffield, Sheffield, UK {yaoyong;hamish}@dcs.shef.ac.uk Abstract
More informationName: Matriculation Number: Tutorial Group: A B C D E
Name: Matriculation Number: Tutorial Group: A B C D E Question: 1 (5 Points) 2 (6 Points) 3 (5 Points) 4 (5 Points) Total (21 points) Score: General instructions: The written test contains 4 questions
More informationInformation Retrieval
Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices
More informationPart I: Web Structure Mining Chapter 1: Information Retrieval and Web Search
Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim
More informationModern Information Retrieval
Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction
More informationVector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model
Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set
More informationData Mining Recitation Notes Week 3
Data Mining Recitation Notes Week 3 Jack Rae January 28, 2013 1 Information Retrieval Given a set of documents, pull the (k) most similar document(s) to a given query. 1.1 Setup Say we have D documents
More informationWeb Information Retrieval Dipl.-Inf. Christoph Carl Kling
Institute for Web Science & Technologies University of Koblenz-Landau, Germany Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Exercises WebIR ask questions! WebIR@c-kling.de 2 of 40 Probabilities
More informationProbabilistic Language Modeling
Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November
More informationHidden Markov Models in Language Processing
Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications
More informationLanguage Models, Smoothing, and IDF Weighting
Language Models, Smoothing, and IDF Weighting Najeeb Abdulmutalib, Norbert Fuhr University of Duisburg-Essen, Germany {najeeb fuhr}@is.inf.uni-due.de Abstract In this paper, we investigate the relationship
More informationComparing Relevance Feedback Techniques on German News Articles
B. Mitschang et al. (Hrsg.): BTW 2017 Workshopband, Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2017 301 Comparing Relevance Feedback Techniques on German News Articles Julia
More information