3. Basics of Information Retrieval

1 Text Analysis and Retrieval
3. Basics of Information Retrieval
Prof. Bojana Dalbelo Bašić
Assoc. Prof. Jan Šnajder
With contributions from Dr. sc. Goran Glavaš and Mladen Karan, mag. ing.
University of Zagreb, Faculty of Electrical Engineering and Computing (FER)
Academic Year 2017/2018
Creative Commons Attribution-NonCommercial-NoDerivs 3.0, v1.2
Dalbelo Bašić, Šnajder (UNIZG FER), TAR IR Basics, AY 2017/2018, 61 slides

2 After today's lecture, you'll...
1 Understand the importance of information retrieval
2 Gain familiarity with the basic principles of IR
3 Gain a basic understanding of the three main IR paradigms
4 Be aware of some readily available IR tools

3 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

4 Retrieval and search
What is your first association with "information retrieval"?
What is your first association with "search" or "search engine"?

5 Retrieval and search

6 What is IR?
Information retrieval
The activity of obtaining information resources relevant to a user's information need from a collection of information resources. (Wikipedia)
Elements of an information retrieval system:
Information needs (expressed by users in the form of queries)
Information (re)sources (typically unstructured text, images, video, audio, etc.)
A component for the efficient retrieval of sources relevant to a given expressed information need, typically from a large collection of information sources

7 Information needs
Information need
An information need is an individual's or group's desire to locate and obtain information to satisfy a conscious or unconscious need. Needs and interests call forth information. (Robert S. Taylor: The process of asking questions, 2007)
(Un)conscious needs for information are expressed via queries:
Words and phrases in text information retrieval (e.g., "ISIS attacks")
Images in image content retrieval

8 Why text information retrieval?
Large repositories of unstructured information sources:
Companies e.g., technical documentation, business contracts
Governments e.g., documentation, regulations, laws (CADIAL)
Scientific material e.g., research articles (Google Scholar)
Personal e.g., books, e-mails, files
World Wide Web the most demanding, due to the largest scale of all

9 Text information retrieval
The focus of TAR is text information retrieval, where models differ in:
The representation of documents and queries
The way they determine the relevance of a document for a given query
The relevance of a document is most often given as a score (and not a binary decision)
Documents are ranked according to the scores assigned for the given query
Relevance scores usually incorporate an element of uncertainty

10 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

11 Text representation in IR
Unstructured representation
Text is represented as an unordered set of terms (the so-called bag-of-words representation)
A considerable oversimplification: syntax and semantics are ignored
Despite the oversimplification, retrieval performance is satisfactory
Weakly structured representations
Certain groups of terms are given more importance (the contribution of other terms is downscaled or ignored)
Noun phrases (NPs), named entities, etc.
Structured representations
Virtually not used in an IR context
Information extraction (IE) techniques are not sufficiently accurate
IE models can be time-costly, which is not acceptable in IR

12 Unstructured document representation
Document snippet
One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.
Bag of words
{(One, 1), (evening, 1), (Frodo, 2), (and, 1), (Sam, 1), (were, 1), (walking, 1), (together, 1), (in, 1), (the, 3), (cool, 1), (twilight, 1), (Both, 1), (of, 2), (them, 1), (felt, 1), (restless, 1), (again, 1), (On, 1), (suddenly, 1), (shadow, 1), (parting, 1), (had, 1), (fallen, 1), (time, 1), (to, 1), (leave, 1), (Lothlorien, 1), (was, 1), (near, 1)}
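The bag-of-words representation above is just a term-frequency multiset; a minimal Python sketch, assuming naive whitespace tokenization with surrounding punctuation stripped (the slide does not specify a tokenizer):

```python
from collections import Counter

def bag_of_words(text):
    """Build a bag of words (term -> frequency) from raw text.

    Tokenization here is deliberately naive: split on whitespace
    and strip surrounding punctuation; real systems use proper tokenizers.
    """
    tokens = (tok.strip(".,:;!?") for tok in text.split())
    return Counter(tok for tok in tokens if tok)

snippet = ("One evening Frodo and Sam were walking together in the cool "
           "twilight. Both of them felt restless again. On Frodo suddenly "
           "the shadow of parting had fallen: the time to leave Lothlorien "
           "was near.")
bow = bag_of_words(snippet)
print(bow["Frodo"], bow["the"])  # → 2 3
```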

13 Weakly-structured document representation
Document snippet
One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.
Bag of nouns
{(evening, 1), (Frodo, 2), (Sam, 1), (twilight, 1), (shadow, 1), (parting, 1), (time, 1), (Lothlorien, 1)}
Bag of named entity terms
{(Frodo, 2), (Sam, 1), (Lothlorien, 1)}

14 Preprocessing
Two very common preprocessing steps:
Morphological normalization stemming or lemmatization
Stop-word removal
Morphological normalization
Conflating various forms of the same word to a common form
Important for morphologically rich languages such as Croatian
Stemming (e.g., kućom kuć) is more often used than lemmatization (e.g., kućom kuća)
Stop-word removal
Removing semantically poor terms such as articles, prepositions, conjunctions, pronouns, etc. (English stop words)
Keeping just the content words: nouns, verbs, adjectives, adverbs
Both methods reduce the cardinality of the document's bag-of-words set and generally boost IR performance
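A toy sketch of the two steps, with an assumed stop-word list and a deliberately crude suffix-stripping "stemmer" standing in for a real one (e.g., a Porter stemmer):

```python
def preprocess(tokens, stop_words, stem):
    """Stop-word removal followed by stemming.

    `stop_words` is a set of terms to drop; `stem` is any stemming
    function. Both are illustrative stand-ins, not real components.
    """
    return [stem(t.lower()) for t in tokens if t.lower() not in stop_words]

# Assumed toy components, for illustration only
STOP_WORDS = {"the", "and", "of", "in", "to", "a"}
toy_stem = lambda w: w[:-1] if w.endswith("s") else w  # crude plural stripping

tokens = ["Frodo", "and", "Sam", "stabbed", "orcs"]
print(preprocess(tokens, STOP_WORDS, toy_stem))  # → ['frodo', 'sam', 'stabbed', 'orc']
```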

15 Formalization of IR models
The basic retrieval model is a triple (f_d, f_q, r), where:
1 f_d is a function that maps documents to their retrieval representations, i.e., f_d(d) = x_d, where x_d is the retrieval representation of document d;
2 f_q is a function that maps queries to their retrieval representations, i.e., f_q(q) = x_q, where x_q is the retrieval representation of query q;
3 r is a ranking function that takes into account the document representation x_d and the query representation x_q, and assigns a real number indicating the potential relevance of document d for query q based on x_d and x_q:
relevance(d, q) = r(f_d(d), f_q(q)) = r(x_d, x_q)

16 Index terms and term weights
Index terms are all the terms in the collection (the vocabulary)
The set of all index terms: K = {k_1, k_2, ..., k_t}, where t is the number of index terms
For each document d_j, each term k_i is assigned a weight w_ij
The weight of index terms not appearing in the document is 0
Document d_j is represented by the term vector [w_1j, w_2j, ..., w_tj]
Let g be the function that computes the weights, i.e., w_ij = g(k_i, d_j)
Different choices of the weighting function g and the ranking function r define different IR models

17 IR paradigms
Information retrieval models roughly fall into three paradigms:
1 Set-theoretic models
Boolean model
Extended Boolean model
2 Algebraic models
Vector space model
Latent semantic indexing
3 Probabilistic models
Classic probabilistic model
Language model
Additionally, there are IR models that utilize link analysis algorithms (e.g., PageRank, HITS)
Typically used in web retrieval, where documents are (hyper)linked

18 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

19 Boolean retrieval model
Documents are represented as bags of words
Term weights are all binary: w_ij ∈ {0, 1}
w_ij = 1 iff index term k_i can be found in the bag of words of document d_j
A query q is given as a propositional logic formula over index terms
Index terms are connected via Boolean operators (∧, ∨) and can be negated (¬)
Each query q can be transformed into disjunctive normal form (DNF), i.e., q = q_c1 ∨ q_c2 ∨ ... ∨ q_cn, where q_cl is the l-th conjunctive component of q's DNF
The relevance of document d_j for query q is then given as:
relevance(d_j, q) = 1 if there exists a conjunctive component q_cl such that w_ij = 1 for every k_i ∈ terms(q_cl), and 0 otherwise
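The DNF relevance rule above can be sketched as follows; the query encoding (a list of conjunctive components, each a list of (term, positive) literals, where positive=False marks a negated term) is an assumed representation:

```python
def relevance(doc_terms, dnf_query):
    """Boolean relevance: 1 iff at least one conjunctive component of
    the DNF query is fully satisfied by the document's bag of words."""
    for component in dnf_query:
        if all((term in doc_terms) == positive for term, positive in component):
            return 1
    return 0

doc = {"Frodo", "orc", "sword"}
# q = (Frodo AND orc AND sword) OR (Frodo AND blue)
q = [[("Frodo", True), ("orc", True), ("sword", True)],
     [("Frodo", True), ("blue", True)]]
print(relevance(doc, q))  # → 1 (the first component is satisfied)
```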

20 Boolean retrieval example
Assume the following set of index terms: K = {Frodo, Sam, blue, sword, orc, Mordor}
Let us have the following documents:
d_1: Frodo stabbed the orc with the red sword
d_2: Frodo and Sam used the blue lamp to locate orcs
d_3: Sam killed many orcs in Mordor with the blue sword
Which documents are relevant for the following queries?
q_1: (Frodo ∧ orc ∧ sword) ∨ (Frodo ∧ blue) → {d_1, d_2}
q_2: (Sam ∧ blue) ∨ (Sam ∧ orc ∧ Mordor) → {d_2, d_3}

21 Boolean retrieval inverted index
A data structure for computationally efficient retrieval
An inverted file index contains, for each index term, a list of references to the documents that contain it
For the collection D consisting of the following documents:
d_1: Frodo and Sam stabbed orcs.
d_2: Frodo stabbed the orc with the sword.
d_3: Sam stole Frodo's sword.
We have the following inverted index:
L(Frodo) = {d_1, d_2, d_3}
L(Sam) = {d_1, d_3}
L(orc) = {d_1, d_2}
L(stab) = {d_1, d_2}
L(sword) = {d_2, d_3}
L(stole) = {d_3}

22 Boolean retrieval inverted index
A full inverted index additionally contains the positions of each word within a document
E.g., Frodo: {(d_1, 1), (d_2, 1), (d_3, 3)}
The inverted index allows Boolean queries to be handled via set intersections and set unions
q: (Frodo ∧ stab) ∨ (Sam ∧ orc)
Results are obtained by computing:
rel(d, q) = (L(Frodo) ∩ L(stab)) ∪ (L(Sam) ∩ L(orc))
= ({d_1, d_2, d_3} ∩ {d_1, d_2}) ∪ ({d_1, d_3} ∩ {d_1, d_2})
= {d_1, d_2} ∪ {d_1} = {d_1, d_2}
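A minimal sketch of the inverted index and the set-based query evaluation above, reusing the slide's collection (terms are assumed to be already normalized, e.g. stabbed → stab):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

# The (already normalized) collection from the slide
docs = {
    "d1": ["Frodo", "Sam", "stab", "orc"],
    "d2": ["Frodo", "stab", "orc", "sword"],
    "d3": ["Sam", "stole", "Frodo", "sword"],
}
L = build_inverted_index(docs)

# q: (Frodo AND stab) OR (Sam AND orc), via set intersection and union
result = (L["Frodo"] & L["stab"]) | (L["Sam"] & L["orc"])
print(sorted(result))  # → ['d1', 'd2']
```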

23 Boolean retrieval model
Advantages
Only one: simplicity (computational efficiency)
Popular in early commercial systems (e.g., Westlaw)
Shortcomings
Expressing information needs as Boolean expressions is unintuitive
A puritan model: no ranking, documents are either relevant or non-relevant
The relative importance of index terms is ignored
Extended Boolean model: a variant of the Boolean model that accounts for the partial fulfilment of a Boolean expression

24 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

25 Vector space model
Documents and queries are represented as vectors of index terms
Weights are real numbers ≥ 0:
d_j = [w_1j, w_2j, ..., w_tj]
q = [w_1q, w_2q, ..., w_tq]
The relevance of a document for a query is estimated by computing some distance or similarity metric between the two vectors
Distance metrics (Euclidean, Manhattan, etc.): more relevant when the distance is lower
Similarity metrics (cosine, Dice, etc.): more relevant when the similarity is higher

26 Vector space model distance metrics
Euclidean distance:
dis_E(d_j, q) = √(∑_{i=1}^t (w_ij − w_iq)²)
Manhattan distance:
dis_M(d_j, q) = ∑_{i=1}^t |w_ij − w_iq|

27 Vector space model similarity metrics
Cosine similarity:
Cosine(d_j, q) = (d_j · q) / (‖d_j‖ ‖q‖) = ∑_{i=1}^t w_ij w_iq / (√(∑_{i=1}^t w_ij²) √(∑_{i=1}^t w_iq²))
Dice similarity:
Dice(d_j, q) = 2 ∑_{i=1}^t w_ij w_iq / (∑_{i=1}^t w_ij + ∑_{i=1}^t w_iq)
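A sketch of cosine similarity over term-weight vectors as defined above (zero vectors are assigned similarity 0 by convention):

```python
import math

def cosine(d, q):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # → 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # → 0.0 (orthogonal)
```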

28 Vector space model term weighting
How are the weights w_ij of index terms in documents computed?
Two intuitive assumptions:
1 The relevance of an index term for a document is proportional to its frequency in the document (term frequency component), i.e., more frequent → more relevant
2 The relevance of an index term for any document is inversely proportional to the number of documents in the collection in which it occurs (inverse document frequency component), i.e., more common across documents → less relevant (e.g., stop words such as "the")

29 Vector space model weighting schemes
Binary weighting scheme
Not really a weighting scheme: it ignores the two aforementioned assumptions
w_ij = 1 iff document d_j contains term k_i
TF-IDF weighting scheme
The weight is computed as the product of the term frequency component and the inverse document frequency component:
w_ij = tf(k_i, d_j) · idf(k_i, D)
The most popular local and global schemes:
tf(k_i, d_j) = freq(k_i, d_j) / max_{k ∈ d_j} freq(k, d_j)
idf(k_i, D) = log(|D| / |{d_j ∈ D : k_i ∈ d_j}|)
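A sketch of the TF-IDF scheme using exactly the local and global components above; the toy collection is an assumption for illustration:

```python
import math
from collections import Counter

def tf(term, doc_counts):
    """Normalized term frequency: freq / max freq in the document."""
    if not doc_counts:
        return 0.0
    return doc_counts[term] / max(doc_counts.values())

def idf(term, collection):
    """log(|D| / number of documents containing the term)."""
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

# Assumed toy collection of bag-of-words documents
collection = [Counter(["frodo", "sword", "sword"]),
              Counter(["sam", "sword"]),
              Counter(["sam", "orc"])]
d = collection[0]
weight = tf("frodo", d) * idf("frodo", collection)  # (1/2) * log(3/1)
print(round(weight, 3))  # → 0.549
```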

30 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

31 Probabilistic retrieval models
Probabilistic retrieval models view retrieval as the problem of estimating the probability of relevance given a query, document, collection, etc.
Documents are ranked in decreasing order of this probability
The most popular models in this category:
Classical probabilistic model
Two-Poisson model
BM25
Language model

32 Probability refresher
Bayes' rule:
P(B|A) = P(A|B) P(B) / P(A)
Allows us to compute P(B|A) using P(A|B), which is particularly useful when the former is difficult to compute directly
Chain rule:
P(A, B, C) = P(A|B, C) P(B|C) P(C)
Allows us to write a joint probability using conditional probabilities

33 Probability ranking principle
Probability ranking principle (Robertson, 1977)
If an IR system's response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its user will be maximal.
Probabilistic IR models need to answer the basic question:
What is the probability that a user will judge this document as relevant for this query? (Sparck Jones et al., 2000)
How do we formalize this question?

34 Probability ranking principle (2)
Random variables:
D = [D_1, ..., D_t, ..., D_N] the document (a set of terms)
Q = [Q_1, ..., Q_t, ..., Q_L] the query (a set of terms)
R ∈ {0, 1} the relevance judgement: R = 1 if D is relevant for Q, R = 0 otherwise
The basic question now translates to: estimate
P(R = 1 | D = d, Q = q)    (1)

35 Probability ranking principle (3)
Let r be a shorthand for R = 1, and r̄ for R = 0
We apply the logit function to probability (1):
logit(p) = log(p / (1 − p))
This is a rank-preserving transformation, giving:
logit p(r|D, Q) = log(p(r|D, Q) / (1 − p(r|D, Q))) = log(p(r|D, Q) / p(r̄|D, Q))

36 Logit

37 Probability ranking principle (4)
We can now apply Bayes' rule:
log(p(r|D, Q) / p(r̄|D, Q)) = log(p(D, Q|r) P(r) / (p(D, Q|r̄) P(r̄)))
(A benefit of having used the logit is that P(D, Q) cancels out)

38 Probability ranking principle (5)
Finally, we use the chain rule:
log(p(D, Q|r) p(r) / (p(D, Q|r̄) p(r̄)))
= log(p(D|Q, r) p(Q|r) p(r) / (p(D|Q, r̄) p(Q|r̄) p(r̄)))
= log(p(D|Q, r) / p(D|Q, r̄)) + log(p(Q|r) p(r) / (p(Q|r̄) p(r̄)))
≈ log(p(D|Q, r) / p(D|Q, r̄))    (2)
The right term in the second row does not depend on D, so we can remove it without affecting the ordering (it can be interpreted as a measure of query difficulty)
Expression (2) is the heart of probabilistic retrieval models

39 Binary Independence Model
Our first attempt at estimating expression (2)
Documents are represented as vectors of binary random variables D = [D_1, ..., D_N], with one dimension for each term in the vocabulary V
D_i = 1 denotes the presence of term i, D_i = 0 its absence
Similarly, the query is a vector Q = [Q_1, ..., Q_L]

40 Binary Independence Model (2)
Assumption 1: given relevance, terms are statistically independent
Now we can write the terms in (2) as follows:
p(D|Q, r) = ∏_{i=1}^{|V|} p(D_i|Q, r)
p(D|Q, r̄) = ∏_{i=1}^{|V|} p(D_i|Q, r̄)
Turning expression (2) into:
log(p(D|Q, r) / p(D|Q, r̄)) = ∑_{i=1}^{|V|} log(p(D_i|Q, r) / p(D_i|Q, r̄))    (3)

41 Binary Independence Model (3)
Assumption 2: the presence of a term in a document depends on relevance only when that term is present in the query
For a fixed query Q = q = [q_1, ..., q_L], if q_i = 0, then according to this assumption p(D_i|q, r) does not depend on relevance:
p(D_i|q, r) = p(D_i|q, r̄), so log(p(D_i|q, r) / p(D_i|q, r̄)) = 0
We can therefore simplify (3) by ignoring all summation terms not in q:
∑_{i=1}^{|V|} log(p(D_i|q, r) / p(D_i|q, r̄)) = ∑_{t ∈ q} log(p(D_t|q, r) / p(D_t|q, r̄))    (4)

42 Binary Independence Model (4)
We can view expression (4) as summing weights of terms t ∈ q:
∑_{t ∈ q} log(p(D_t|q, r) / p(D_t|q, r̄)) = ∑_{t ∈ q} w_t    (5)
Although the assumptions are often violated, these models work well
Common practice is to approximate p(D_t|q, r) with 0.5 and p(D_t|q, r̄) with n_t / N_d
n_t number of documents containing term t
N_d total number of documents
A very similar alternative is to use IDF scores as w_t
These probabilities can be re-estimated through relevance feedback

43 Example
For the collection D consisting of the following documents:
d_1: Frodo and Sam stabbed orcs.
d_2: Sam chased the orc with the sword.
d_3: Sam took the sword.
and the query "Sam stabbed orc", using the common-practice approximations (p(D_t|q, r) = 0.5 for all terms):
d_1 (matches Sam, stabbed, orc): p(D_t|q, r̄) = 3/3, 1/3, 2/3; w_t = log 0.5, log 1.5, log 0.75; ∑ w_t = log 0.5625
d_2 (matches Sam, orc): p(D_t|q, r̄) = 3/3, 2/3; w_t = log 0.5, log 0.75; ∑ w_t = log 0.375
d_3 (matches Sam): p(D_t|q, r̄) = 3/3; w_t = log 0.5; ∑ w_t = log 0.5
This yields d_1 as the most relevant result, followed by d_3 and d_2
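The worked example can be reproduced with a short script implementing the common-practice approximations, i.e., summing w_t = log(0.5 / (n_t/N)) over the query terms present in the document (tokenization and normalization of the collection are assumed):

```python
import math

def bim_score(doc_terms, query_terms, collection):
    """Binary Independence Model score with the common-practice
    approximations: p(D_t|q,r) ≈ 0.5 and p(D_t|q,r̄) ≈ n_t/N."""
    N = len(collection)
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            n_t = sum(1 for d in collection if t in d)
            score += math.log(0.5 / (n_t / N))
    return score

# Normalized term sets for the slide's collection
collection = [{"frodo", "sam", "stab", "orc"},   # d1
              {"sam", "chase", "orc", "sword"},  # d2
              {"sam", "take", "sword"}]          # d3
query = {"sam", "stab", "orc"}
scores = [bim_score(d, query, collection) for d in collection]
ranking = sorted(range(3), key=lambda i: -scores[i])
print([f"d{i+1}" for i in ranking])  # → ['d1', 'd3', 'd2']
```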

44 Two-Poisson model
A more realistic document representation: a vector of word frequencies
Uses the Poisson distribution to model frequencies
Assumption: all documents are of equal length
Can be approximated by the following expression:
∑_{t ∈ q} (f_t,d (k_1 + 1) / (k_1 + f_t,d)) w_t    (6)
where f_t,d is the frequency of term t in document d and k_1 is a constant (typically 1 ≤ k_1 < 2)
Higher-frequency words get boosted weights

45 BM11
Removes the document-length assumption of the Two-Poisson model: matches in longer documents should be less important
We can correct the frequency: f′_t,d = f_t,d · (l_avg / l_d)
l_avg the average document length
l_d the length of document d
This dampens/boosts word frequencies for documents of above/below average length
Now we can rewrite (6) as:
∑_{t ∈ q} (f_t,d (k_1 + 1) / (k_1 (l_d / l_avg) + f_t,d)) w_t

46 BM25
While BM11 removes the problem of assuming equal document length, in practice it still has problems:
Long relevant documents get too much dampening
Short irrelevant documents get too much boosting
To control the amount of correction, we introduce a parameter b (often set to 0.75):
∑_{t ∈ q} (f_t,d (k_1 + 1) / (k_1 ((1 − b) + (l_d / l_avg) b) + f_t,d)) w_t    (7)
This expression is the famous BM25 ranking function, which gives state-of-the-art results
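A sketch of expression (7). Taking the term weight w_t to be an IDF score is an assumption here (the Binary Independence Model slide notes IDF as a common choice for w_t), as is the toy collection:

```python
import math

def bm25_score(query, doc, collection, k1=1.2, b=0.75):
    """BM25 score of `doc` for `query` per expression (7),
    with w_t = log(N / n_t) as an assumed IDF-style term weight."""
    N = len(collection)
    l_avg = sum(len(d) for d in collection) / N
    score = 0.0
    for t in query:
        f_td = doc.count(t)
        if f_td == 0:
            continue
        n_t = sum(1 for d in collection if t in d)  # document frequency
        w_t = math.log(N / n_t)
        norm = k1 * ((1 - b) + (len(doc) / l_avg) * b) + f_td
        score += (f_td * (k1 + 1)) / norm * w_t
    return score

collection = [["frodo", "sam", "stab", "orc"],
              ["sam", "chase", "the", "orc", "with", "the", "sword"],
              ["sam", "take", "the", "sword"]]
scores = [bm25_score(["stab", "orc"], d, collection) for d in collection]
print(scores.index(max(scores)))  # index of the top-ranked document
```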

47 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

48 Language modeling
Approaching the probabilistic information retrieval problem from a different perspective
Instead of modeling the document probability given the query, we model the query probability given the document

49 Language modeling
Probabilistic modeling of language
A sentence t is a sequence of terms <t_1, ..., t_n>
Ideally, the probability of sentence t is:
p(t) = p(t_1) p(t_2|t_1) ... p(t_i|t_1, ..., t_{i−1}) ... p(t_n|t_1, ..., t_{n−1})
E.g., for t = <one, ring, to, rule> we have p(t) = p(one) p(ring|one) p(to|one, ring) p(rule|one, ring, to)
In practice, the sparseness problem makes this intractable

50 Unigram language models
Solve the sparseness problem by completely ignoring the conditioning
The probability of sentence t under a unigram model M is:
p(t|M) = p(t_1) ... p(t_i) ... p(t_n)
The probabilities that define the model are estimated from a collection:
p(t_i) = n_i / n_T    (8)
n_i number of times term t_i occurs in the collection
n_T total number of term occurrences in the collection

51 Bigram language models
We simplify the conditionals by keeping only the previous word
A better approximation of reality than unigram models
The probability of sentence t under a bigram model M is:
p(t|M) = p(t_1) p(t_2|t_1) ... p(t_n|t_{n−1})
The probabilities that define the model are estimated from a collection:
p(t_i|t_{i−1}) = n(t_{i−1}, t_i) / n(t_{i−1})
n(t_{i−1}, t_i) number of times the bigram (t_{i−1}, t_i) occurs in the collection
n(t_{i−1}) number of times term t_{i−1} occurs in the collection

52 Example
For the collection D consisting of the following documents:
d_1: Frodo and Sam stabbed orcs.
d_2: Sam chased the orc with the sword.
d_3: Sam took the sword.
Estimating a unigram language model (for the entire collection):
P(Frodo) = 1/16, P(Sam) = 3/16, P(orc) = 2/16, P(chased) = 1/16, P(sword) = 2/16, ...
Estimating a bigram language model (for the entire collection):
P(chased|Frodo) = 0, P(sword|the) = 2/3, P(orc|the) = 1/3, ...
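The estimates in this example can be reproduced as follows; the tokenization and the orcs → orc normalization are assumptions needed to match the slide's counts:

```python
from collections import Counter

# Tokenized collection from the slide (lowercased, "orcs" normalized to "orc")
docs = [["frodo", "and", "sam", "stabbed", "orc"],
        ["sam", "chased", "the", "orc", "with", "the", "sword"],
        ["sam", "took", "the", "sword"]]

tokens = [t for d in docs for t in d]
unigrams = Counter(tokens)
n_T = len(tokens)  # 16 term occurrences in total

def p_unigram(t):
    return unigrams[t] / n_T

# Bigrams are counted within each document, not across document boundaries
bigrams = Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))

def p_bigram(t, prev):
    return bigrams[(prev, t)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_unigram("sam"), p_bigram("sword", "the"))  # → 3/16 and 2/3
```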

53 Query likelihood model
Given a document collection D and a query q:
A language model M_d is built for each document
Documents are scored according to the probability P(q|M_d)
Intuition: the language models corresponding to relevant documents should assign a higher probability to the query

54 Query likelihood model (2)
A unigram language model of document d can be estimated by dividing the number of times term t_i occurs in d by the number of terms in d:
p(t_i|M_d) = n_{i,d} / n_d
Bigram language models are rarely applied to individual documents due to sparsity problems

55 Smoothing language models
The models we have considered so far assign probability 0 to queries that contain terms not occurring in the document
We can prevent this by using smoothing techniques
Smoothing assigns a small probability under the model even to unseen words

56 Smoothing techniques
Laplace smoothing
Adding a fixed small count (e.g., 1) to all word counts (even the unobserved ones) and renormalizing to get a probability distribution:
p′(t_i|M_d) = (n_{i,d} + α) / (n_d + |V| α)
Adds an artificial count α for each possible word in our vocabulary V

57 Smoothing techniques
Laplace smoothing assumes all unseen words are equally likely!
Jelinek-Mercer smoothing instead builds a language model M_D of the entire document collection and interpolates:
p′(t_i|M_d) = λ p(t_i|M_d) + (1 − λ) p(t_i|M_D)
Words absent from a document will still get some probability mass from the right term
However, the amount each word gets will vary depending on its likelihood in the collection as a whole
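A sketch of query-likelihood scoring with Jelinek-Mercer smoothing as defined above; λ = 0.5 and the toy collection are arbitrary illustrative choices:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_counts, n_total, lam=0.5):
    """log P(q|M_d) under a unigram model with Jelinek-Mercer smoothing:
    each term interpolates the document model with the collection model."""
    doc_counts = Counter(doc)
    log_p = 0.0
    for t in query:
        p_doc = doc_counts[t] / len(doc)
        p_coll = collection_counts[t] / n_total
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        log_p += math.log(p)
    return log_p

docs = [["frodo", "sam", "stabbed", "orc"],
        ["sam", "chased", "orc", "sword"],
        ["sam", "took", "sword"]]
coll = Counter(t for d in docs for t in d)
n_total = sum(coll.values())
scores = [query_log_likelihood(["sam", "stabbed"], d, coll, n_total) for d in docs]
print(scores.index(max(scores)))  # the document most likely to generate the query
```

Note that without smoothing (λ = 1), the second and third documents would get probability 0 for the query, since they do not contain "stabbed".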

58 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

59 IR Tools
Terrier IR platform
Open command-line IR platform
Popular in academia, used for IR evaluations in research
Stepwise usage: first indexing, then retrieval, and then evaluation
Lucene
Open-source information retrieval library written in Java (ports to many languages exist)
Describes a document as a set of (user-definable) fields
A very flexible solution for applications requiring full-text indexing and search
Elasticsearch
RESTful search engine on top of Lucene
Docs:
Very limited NLP functionality: text-classification-made-easy-with-elasticsearch

60 Terrier (University of Glasgow)
Implements a wide variety of IR models: VSMs, LMs, PMs
Stepwise usage:
Loading and indexing the collection of documents (which needs to be in the TREC format):
>> trec_setup.sh <coll-path>
>> trec_terrier.sh -i
Performing retrieval (must provide the collection of queries and define the retrieval model):
>> trec_terrier.sh -r -Dtrec.topics=<q-path> -Dtrec.model=<ir-model>
Evaluating the retrieval performance (must provide relevance judgements):
>> trec_terrier.sh -e -Dtrec.qrels=<rj-path>

61 Now you...
1 Understand the importance of information retrieval
2 Are familiar with the basic principles of IR
3 Have a basic understanding of the three main IR paradigms
4 Are aware of some readily available IR tools


More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic

More information

6. Logical Inference

6. Logical Inference Artificial Intelligence 6. Logical Inference Prof. Bojana Dalbelo Bašić Assoc. Prof. Jan Šnajder University of Zagreb Faculty of Electrical Engineering and Computing Academic Year 2016/2017 Creative Commons

More information

Probabilistic Information Retrieval

Probabilistic Information Retrieval Probabilistic Information Retrieval Sumit Bhatia July 16, 2009 Sumit Bhatia Probabilistic Information Retrieval 1/23 Overview 1 Information Retrieval IR Models Probability Basics 2 Document Ranking Problem

More information

5 10 12 32 48 5 10 12 32 48 4 8 16 32 64 128 4 8 16 32 64 128 2 3 5 16 2 3 5 16 5 10 12 32 48 4 8 16 32 64 128 2 3 5 16 docid score 5 10 12 32 48 O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Language Models. Hongning Wang

Language Models. Hongning Wang Language Models Hongning Wang CS@UVa Notion of Relevance Relevance (Rep(q), Rep(d)) Similarity P(r1 q,d) r {0,1} Probability of Relevance P(d q) or P(q d) Probabilistic inference Different rep & similarity

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 4: Probabilistic Retrieval Models April 29, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig

More information

CAIM: Cerca i Anàlisi d Informació Massiva

CAIM: Cerca i Anàlisi d Informació Massiva 1 / 21 CAIM: Cerca i Anàlisi d Informació Massiva FIB, Grau en Enginyeria Informàtica Slides by Marta Arias, José Balcázar, Ricard Gavaldá Department of Computer Science, UPC Fall 2016 http://www.cs.upc.edu/~caim

More information

University of Illinois at Urbana-Champaign. Midterm Examination

University of Illinois at Urbana-Champaign. Midterm Examination University of Illinois at Urbana-Champaign Midterm Examination CS410 Introduction to Text Information Systems Professor ChengXiang Zhai TA: Azadeh Shakery Time: 2:00 3:15pm, Mar. 14, 2007 Place: Room 1105,

More information

16 The Information Retrieval "Data Model"

16 The Information Retrieval Data Model 16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues

More information

Lecture 3: Probabilistic Retrieval Models

Lecture 3: Probabilistic Retrieval Models Probabilistic Retrieval Models Information Retrieval and Web Search Engines Lecture 3: Probabilistic Retrieval Models November 5 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,

More information

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying

More information

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes

More information

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,

More information

Score Distribution Models

Score Distribution Models Score Distribution Models Evangelos Kanoulas Virgil Pavlu Keshi Dai Javed Aslam Score Distributions 2 Score Distributions 2 Score Distributions 9.6592 9.5761 9.4919 9.4784 9.2693 9.2066 9.1407 9.0824 9.0110

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review INFO 4300 / CS4300 Information Retrieval IR 9: Linear Algebra Review Paul Ginsparg Cornell University, Ithaca, NY 24 Sep 2009 1/ 23 Overview 1 Recap 2 Matrix basics 3 Matrix Decompositions 4 Discussion

More information

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002 CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope

More information

Lecture 13: More uses of Language Models

Lecture 13: More uses of Language Models Lecture 13: More uses of Language Models William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 13 What we ll learn in this lecture Comparing documents, corpora using LM approaches

More information

Probabilistic Field Mapping for Product Search

Probabilistic Field Mapping for Product Search Probabilistic Field Mapping for Product Search Aman Berhane Ghirmatsion and Krisztian Balog University of Stavanger, Stavanger, Norway ab.ghirmatsion@stud.uis.no, krisztian.balog@uis.no, Abstract. This

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

.. CSC 566 Advanced Data Mining Alexander Dekhtyar..

.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. .. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a

More information

1 Information retrieval fundamentals

1 Information retrieval fundamentals CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Midterm sample questions

Midterm sample questions Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts

More information

Knowledge Discovery in Data: Overview. Naïve Bayesian Classification. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Knowledge Discovery in Data: Overview. Naïve Bayesian Classification. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar Knowledge Discovery in Data: Naïve Bayes Overview Naïve Bayes methodology refers to a probabilistic approach to information discovery

More information

A Study of the Dirichlet Priors for Term Frequency Normalisation

A Study of the Dirichlet Priors for Term Frequency Normalisation A Study of the Dirichlet Priors for Term Frequency Normalisation ABSTRACT Ben He Department of Computing Science University of Glasgow Glasgow, United Kingdom ben@dcs.gla.ac.uk In Information Retrieval

More information

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare

More information

CS 572: Information Retrieval

CS 572: Information Retrieval CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hoffman 1 Plan for next few weeks Project 1: done (submit by Friday).

More information

Variable Latent Semantic Indexing

Variable Latent Semantic Indexing Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background

More information

Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science

Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science http://www.cs.odu.edu/ jbollen January 30, 2003 Page 1 Structure 1. IR formal characterization (a) Mathematical

More information

Text Analysis. Week 5

Text Analysis. Week 5 Week 5 Text Analysis Reference and Slide Source: ChengXiang Zhai and Sean Massung. 2016. Text Data Management and Analysis: a Practical Introduction to Information Retrieval and Text Mining. Association

More information

Lecture 5: Web Searching using the SVD

Lecture 5: Web Searching using the SVD Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially

More information

Scoring, Term Weighting and the Vector Space

Scoring, Term Weighting and the Vector Space Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J

More information

Machine Learning: Assignment 1

Machine Learning: Assignment 1 10-701 Machine Learning: Assignment 1 Due on Februrary 0, 014 at 1 noon Barnabas Poczos, Aarti Singh Instructions: Failure to follow these directions may result in loss of points. Your solutions for this

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus

More information

Midterm Examination Practice

Midterm Examination Practice University of Illinois at Urbana-Champaign Midterm Examination Practice CS598CXZ Advanced Topics in Information Retrieval (Fall 2013) Professor ChengXiang Zhai 1. Basic IR evaluation measures: The following

More information

13 Searching the Web with the SVD

13 Searching the Web with the SVD 13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

9 Searching the Internet with the SVD

9 Searching the Internet with the SVD 9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Categorization ANLP Lecture 10 Text Categorization with Naive Bayes

Categorization ANLP Lecture 10 Text Categorization with Naive Bayes 1 Categorization ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Important task for both humans and machines object identification face recognition spoken word recognition

More information

ANLP Lecture 10 Text Categorization with Naive Bayes

ANLP Lecture 10 Text Categorization with Naive Bayes ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Categorization Important task for both humans and machines 1 object identification face recognition spoken word recognition

More information

A Study of Smoothing Methods for Language Models Applied to Information Retrieval

A Study of Smoothing Methods for Language Models Applied to Information Retrieval A Study of Smoothing Methods for Language Models Applied to Information Retrieval CHENGXIANG ZHAI and JOHN LAFFERTY Carnegie Mellon University Language modeling approaches to information retrieval are

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

Lecture 12: Link Analysis for Web Retrieval

Lecture 12: Link Analysis for Web Retrieval Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities

More information

The Static Absorbing Model for the Web a

The Static Absorbing Model for the Web a Journal of Web Engineering, Vol. 0, No. 0 (2003) 000 000 c Rinton Press The Static Absorbing Model for the Web a Vassilis Plachouras University of Glasgow Glasgow G12 8QQ UK vassilis@dcs.gla.ac.uk Iadh

More information

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Document indexing, similarities and retrieval in large scale text collections

Document indexing, similarities and retrieval in large scale text collections Document indexing, similarities and retrieval in large scale text collections Eric Gaussier Univ. Grenoble Alpes - LIG Eric.Gaussier@imag.fr Eric Gaussier Document indexing, similarities & retrieval 1

More information

DISTRIBUTIONAL SEMANTICS

DISTRIBUTIONAL SEMANTICS COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.

More information

Information Retrieval. Lecture 6

Information Retrieval. Lecture 6 Information Retrieval Lecture 6 Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces This lecture

More information

Language as a Stochastic Process

Language as a Stochastic Process CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 Vector Space Model Disadvantages:

More information

Lecture 2: Probability, Naive Bayes

Lecture 2: Probability, Naive Bayes Lecture 2: Probability, Naive Bayes CS 585, Fall 205 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp205/ Brendan O Connor Today Probability Review Naive Bayes classification

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Matrices, Vector Spaces, and Information Retrieval

Matrices, Vector Spaces, and Information Retrieval Matrices, Vector Spaces, and Information Authors: M. W. Berry and Z. Drmac and E. R. Jessup SIAM 1999: Society for Industrial and Applied Mathematics Speaker: Mattia Parigiani 1 Introduction Large volumes

More information

Geometric and Quantum Methods for Information Retrieval

Geometric and Quantum Methods for Information Retrieval PAPER Geometric and Quantum Methods for Information Retrieval Yaoyong Li, Hamish Cunningham Department of Computer Science, University of Sheffield, Sheffield, UK {yaoyong;hamish}@dcs.shef.ac.uk Abstract

More information

Name: Matriculation Number: Tutorial Group: A B C D E

Name: Matriculation Number: Tutorial Group: A B C D E Name: Matriculation Number: Tutorial Group: A B C D E Question: 1 (5 Points) 2 (6 Points) 3 (5 Points) 4 (5 Points) Total (21 points) Score: General instructions: The written test contains 4 questions

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set

More information

Data Mining Recitation Notes Week 3

Data Mining Recitation Notes Week 3 Data Mining Recitation Notes Week 3 Jack Rae January 28, 2013 1 Information Retrieval Given a set of documents, pull the (k) most similar document(s) to a given query. 1.1 Setup Say we have D documents

More information

Web Information Retrieval Dipl.-Inf. Christoph Carl Kling

Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Institute for Web Science & Technologies University of Koblenz-Landau, Germany Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Exercises WebIR ask questions! WebIR@c-kling.de 2 of 40 Probabilities

More information

Probabilistic Language Modeling

Probabilistic Language Modeling Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

Language Models, Smoothing, and IDF Weighting

Language Models, Smoothing, and IDF Weighting Language Models, Smoothing, and IDF Weighting Najeeb Abdulmutalib, Norbert Fuhr University of Duisburg-Essen, Germany {najeeb fuhr}@is.inf.uni-due.de Abstract In this paper, we investigate the relationship

More information

Comparing Relevance Feedback Techniques on German News Articles

Comparing Relevance Feedback Techniques on German News Articles B. Mitschang et al. (Hrsg.): BTW 2017 Workshopband, Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2017 301 Comparing Relevance Feedback Techniques on German News Articles Julia

More information