3. Basics of Information Retrieval

1 Text Analysis and Retrieval
3. Basics of Information Retrieval
Prof. Bojana Dalbelo Bašić
Assoc. Prof. Jan Šnajder
With contributions from Dr. sc. Goran Glavaš and Mladen Karan, mag. ing.
University of Zagreb, Faculty of Electrical Engineering and Computing (FER)
Academic Year 2017/2018
Creative Commons Attribution-NonCommercial-NoDerivs 3.0, v1.2
Dalbelo Bašić, Šnajder (UNIZG FER), TAR IR Basics, AY 2017/2018, 61 slides

2 After today's lecture, you'll...
1 Understand the importance of information retrieval
2 Gain familiarity with the basic principles of IR
3 Gain a basic understanding of the three main IR paradigms
4 Be aware of some readily available IR tools

3 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

4 Retrieval and search
What is your first association with "information retrieval"?
What is your first association with "search" or "search engine"?

5 Retrieval and search

6 What is IR?
Information retrieval
The activity of obtaining information resources relevant to a user's information need from a collection of information resources. (Wikipedia)
Elements of an information retrieval system:
Information needs (expressed by users in the form of queries)
Information (re)sources (typically unstructured text, images, video, audio, etc.)
A component for the efficient retrieval of sources relevant to a given expressed information need, typically from a large collection of information sources

7 Information needs
Information need
An information need is an individual's or group's desire to locate and obtain information to satisfy a conscious or unconscious need. Needs and interests call forth information. (Robert S. Taylor: The process of asking questions, 2007)
(Un)conscious needs for information are expressed via queries:
Words and phrases in text information retrieval (e.g., "ISIS attacks")
Images in image content retrieval

8 Why text information retrieval?
Large repositories of unstructured information sources:
Companies e.g., technical documentation, business contracts
Governments e.g., documentation, regulations, laws (CADIAL)
Scientific material e.g., research articles (Google Scholar)
Personal e.g., books, e-mails, files
World Wide Web the most demanding, due to the largest scale of all

9 Text information retrieval
The focus of TAR is text information retrieval, where models differ in:
The representation of documents and queries
The way they determine the relevance of a document for a given query
The relevance of a document is most often given as a score (and not a binary decision)
Documents are ranked according to the scores assigned for the given query
Relevance scores usually incorporate an element of uncertainty

10 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

11 Text representation in IR
Unstructured representation
Text is represented as an unordered set of terms (the so-called bag-of-words representation)
A considerable oversimplification: syntax and semantics are ignored
Despite the oversimplification, retrieval performance is satisfactory
Weakly structured representations
Certain groups of terms are given more importance (the contribution of other terms is downscaled or ignored)
Noun phrases (NPs), named entities, etc.
Structured representations
Virtually not used in an IR context
Information extraction (IE) techniques are not sufficiently accurate
IE models can be time-costly, which is not acceptable in IR

12 Unstructured document representation
Document snippet
One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.
Bag of words
{(One, 1), (evening, 1), (Frodo, 2), (and, 1), (Sam, 1), (were, 1), (walking, 1), (together, 1), (in, 1), (the, 3), (cool, 1), (twilight, 1), (Both, 1), (of, 2), (them, 1), (felt, 1), (restless, 1), (again, 1), (On, 1), (suddenly, 1), (shadow, 1), (parting, 1), (had, 1), (fallen, 1), (time, 1), (to, 1), (leave, 1), (Lothlorien, 1), (was, 1), (near, 1)}
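The bag-of-words representation above is just a term-frequency multiset; a minimal Python sketch, assuming naive whitespace tokenization with surrounding punctuation stripped (the slide does not specify a tokenizer):

```python
from collections import Counter

def bag_of_words(text):
    """Build a bag of words (term -> frequency) from raw text.

    Tokenization here is deliberately naive: split on whitespace
    and strip surrounding punctuation; real systems use proper tokenizers.
    """
    tokens = (tok.strip(".,:;!?") for tok in text.split())
    return Counter(tok for tok in tokens if tok)

snippet = ("One evening Frodo and Sam were walking together in the cool "
           "twilight. Both of them felt restless again. On Frodo suddenly "
           "the shadow of parting had fallen: the time to leave Lothlorien "
           "was near.")
bow = bag_of_words(snippet)
print(bow["Frodo"], bow["the"])  # → 2 3
```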

13 Weakly-structured document representation
Document snippet
One evening Frodo and Sam were walking together in the cool twilight. Both of them felt restless again. On Frodo suddenly the shadow of parting had fallen: the time to leave Lothlorien was near.
Bag of nouns
{(evening, 1), (Frodo, 2), (Sam, 1), (twilight, 1), (shadow, 1), (parting, 1), (time, 1), (Lothlorien, 1)}
Bag of named entity terms
{(Frodo, 2), (Sam, 1), (Lothlorien, 1)}

14 Preprocessing
Two very common preprocessing steps:
Morphological normalization stemming or lemmatization
Stop-word removal
Morphological normalization
Conflating various forms of the same word to a common form
Important for morphologically rich languages such as Croatian
Stemming (e.g., kućom kuć) is more often used than lemmatization (e.g., kućom kuća)
Stop-word removal
Removing semantically poor terms such as articles, prepositions, conjunctions, pronouns, etc. (English stop words)
Keeping just the content words: nouns, verbs, adjectives, adverbs
Both methods reduce the cardinality of the document's bag-of-words set and generally boost IR performance
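A toy sketch of the two steps, with an assumed stop-word list and a deliberately crude suffix-stripping "stemmer" standing in for a real one (e.g., a Porter stemmer):

```python
def preprocess(tokens, stop_words, stem):
    """Stop-word removal followed by stemming.

    `stop_words` is a set of terms to drop; `stem` is any stemming
    function. Both are illustrative stand-ins, not real components.
    """
    return [stem(t.lower()) for t in tokens if t.lower() not in stop_words]

# Assumed toy components, for illustration only
STOP_WORDS = {"the", "and", "of", "in", "to", "a"}
toy_stem = lambda w: w[:-1] if w.endswith("s") else w  # crude plural stripping

tokens = ["Frodo", "and", "Sam", "stabbed", "orcs"]
print(preprocess(tokens, STOP_WORDS, toy_stem))  # → ['frodo', 'sam', 'stabbed', 'orc']
```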

15 Formalization of IR models
The basic retrieval model is a triple (f_d, f_q, r), where:
1 f_d is a function that maps documents to their retrieval representations, i.e., f_d(d) = x_d, where x_d is the retrieval representation of document d;
2 f_q is a function that maps queries to their retrieval representations, i.e., f_q(q) = x_q, where x_q is the retrieval representation of query q;
3 r is a ranking function that takes into account the document representation x_d and the query representation x_q, and assigns a real number indicating the potential relevance of document d for query q based on x_d and x_q:
relevance(d, q) = r(f_d(d), f_q(q)) = r(x_d, x_q)

16 Index terms and term weights
Index terms are all the terms in the collection (the vocabulary)
The set of all index terms: K = {k_1, k_2, ..., k_t}, where t is the number of index terms
For each document d_j, each term k_i is assigned a weight w_ij
The weight of index terms not appearing in the document is 0
Document d_j is represented by the term vector [w_1j, w_2j, ..., w_tj]
Let g be the function that computes the weights, i.e., w_ij = g(k_i, d_j)
Different choices of the weighting function g and the ranking function r define different IR models

17 IR paradigms
Information retrieval models roughly fall into three paradigms:
1 Set-theoretic models
Boolean model
Extended Boolean model
2 Algebraic models
Vector space model
Latent semantic indexing
3 Probabilistic models
Classic probabilistic model
Language model
Additionally, there are IR models that utilize link analysis algorithms (e.g., PageRank, HITS)
Typically used in web retrieval, where documents are (hyper)linked

18 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

19 Boolean retrieval model
Documents are represented as bags of words
Term weights are all binary: w_ij ∈ {0, 1}
w_ij = 1 iff index term k_i can be found in the bag of words of document d_j
A query q is given as a propositional logic formula over index terms
Index terms are connected via Boolean operators (∧, ∨) and can be negated (¬)
Each query q can be transformed into disjunctive normal form (DNF), i.e., q = q_c1 ∨ q_c2 ∨ ... ∨ q_cn, where q_cl is the l-th conjunctive component of q's DNF
The relevance of document d_j for query q is then given as:
relevance(d_j, q) = 1 if there exists a conjunctive component q_cl such that w_ij = 1 for every k_i ∈ terms(q_cl), and 0 otherwise
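The DNF relevance rule above can be sketched as follows; the query encoding (a list of conjunctive components, each a list of (term, positive) literals, where positive=False marks a negated term) is an assumed representation:

```python
def relevance(doc_terms, dnf_query):
    """Boolean relevance: 1 iff at least one conjunctive component of
    the DNF query is fully satisfied by the document's bag of words."""
    for component in dnf_query:
        if all((term in doc_terms) == positive for term, positive in component):
            return 1
    return 0

doc = {"Frodo", "orc", "sword"}
# q = (Frodo AND orc AND sword) OR (Frodo AND blue)
q = [[("Frodo", True), ("orc", True), ("sword", True)],
     [("Frodo", True), ("blue", True)]]
print(relevance(doc, q))  # → 1 (the first component is satisfied)
```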

20 Boolean retrieval example
Assume the following set of index terms: K = {Frodo, Sam, blue, sword, orc, Mordor}
Let us have the following documents:
d_1: Frodo stabbed the orc with the red sword
d_2: Frodo and Sam used the blue lamp to locate orcs
d_3: Sam killed many orcs in Mordor with the blue sword
Which documents are relevant for the following queries?
q_1: (Frodo ∧ orc ∧ sword) ∨ (Frodo ∧ blue) → {d_1, d_2}
q_2: (Sam ∧ blue) ∨ (Sam ∧ orc ∧ Mordor) → {d_2, d_3}

21 Boolean retrieval inverted index
A data structure for computationally efficient retrieval
An inverted file index contains, for each index term, a list of references to the documents that contain it
For the collection D consisting of the following documents:
d_1: Frodo and Sam stabbed orcs.
d_2: Frodo stabbed the orc with the sword.
d_3: Sam stole Frodo's sword.
We have the following inverted index:
L(Frodo) = {d_1, d_2, d_3}
L(Sam) = {d_1, d_3}
L(orc) = {d_1, d_2}
L(stab) = {d_1, d_2}
L(sword) = {d_2, d_3}
L(stole) = {d_3}

22 Boolean retrieval inverted index
A full inverted index additionally contains the positions of each word within a document
E.g., Frodo: {(d_1, 1), (d_2, 1), (d_3, 3)}
The inverted index allows Boolean queries to be handled via set intersections and set unions
q: (Frodo ∧ stab) ∨ (Sam ∧ orc)
Results are obtained by computing:
rel(d, q) = (L(Frodo) ∩ L(stab)) ∪ (L(Sam) ∩ L(orc))
= ({d_1, d_2, d_3} ∩ {d_1, d_2}) ∪ ({d_1, d_3} ∩ {d_1, d_2})
= {d_1, d_2} ∪ {d_1} = {d_1, d_2}
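A minimal sketch of the inverted index and the set-based query evaluation above, reusing the slide's collection (terms are assumed to be already normalized, e.g. stabbed → stab):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index

# The (already normalized) collection from the slide
docs = {
    "d1": ["Frodo", "Sam", "stab", "orc"],
    "d2": ["Frodo", "stab", "orc", "sword"],
    "d3": ["Sam", "stole", "Frodo", "sword"],
}
L = build_inverted_index(docs)

# q: (Frodo AND stab) OR (Sam AND orc), via set intersection and union
result = (L["Frodo"] & L["stab"]) | (L["Sam"] & L["orc"])
print(sorted(result))  # → ['d1', 'd2']
```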

23 Boolean retrieval model
Advantages
Only one: simplicity (computational efficiency)
Popular in early commercial systems (e.g., Westlaw)
Shortcomings
Expressing information needs as Boolean expressions is unintuitive
A puritan model: no ranking, documents are either relevant or non-relevant
The relative importance of index terms is ignored
Extended Boolean model: a variant of the Boolean model that accounts for the partial fulfilment of a Boolean expression

24 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

25 Vector space model
Documents and queries are represented as vectors of index terms
Weights are real numbers ≥ 0:
d_j = [w_1j, w_2j, ..., w_tj]
q = [w_1q, w_2q, ..., w_tq]
The relevance of a document for a query is estimated by computing some distance or similarity metric between the two vectors
Distance metrics (Euclidean, Manhattan, etc.): more relevant when the distance is lower
Similarity metrics (cosine, Dice, etc.): more relevant when the similarity is higher

26 Vector space model distance metrics
Euclidean distance:
dis_E(d_j, q) = √(∑_{i=1}^t (w_ij − w_iq)²)
Manhattan distance:
dis_M(d_j, q) = ∑_{i=1}^t |w_ij − w_iq|

27 Vector space model similarity metrics
Cosine similarity:
Cosine(d_j, q) = (d_j · q) / (‖d_j‖ ‖q‖) = ∑_{i=1}^t w_ij w_iq / (√(∑_{i=1}^t w_ij²) √(∑_{i=1}^t w_iq²))
Dice similarity:
Dice(d_j, q) = 2 ∑_{i=1}^t w_ij w_iq / (∑_{i=1}^t w_ij + ∑_{i=1}^t w_iq)
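A sketch of cosine similarity over term-weight vectors as defined above (zero vectors are assigned similarity 0 by convention):

```python
import math

def cosine(d, q):
    """Cosine similarity between two term-weight vectors of equal length."""
    dot = sum(wd * wq for wd, wq in zip(d, q))
    norm_d = math.sqrt(sum(w * w for w in d))
    norm_q = math.sqrt(sum(w * w for w in q))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine([1.0, 2.0, 0.0], [1.0, 2.0, 0.0]))  # → 1.0 (same direction)
print(cosine([1.0, 0.0], [0.0, 1.0]))            # → 0.0 (orthogonal)
```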

28 Vector space model term weighting
How are the weights w_ij of index terms in documents computed?
Two intuitive assumptions:
1 The relevance of an index term for a document is proportional to its frequency in the document (term frequency component), i.e., more frequent → more relevant
2 The relevance of an index term for any document is inversely proportional to the number of documents in the collection in which it occurs (inverse document frequency component), i.e., more common across documents → less relevant (e.g., stop words such as "the")

29 Vector space model weighting schemes
Binary weighting scheme
Not really a weighting scheme: it ignores the two aforementioned assumptions
w_ij = 1 iff document d_j contains term k_i
TF-IDF weighting scheme
The weight is computed as the product of the term frequency component and the inverse document frequency component:
w_ij = tf(k_i, d_j) · idf(k_i, D)
The most popular local and global schemes:
tf(k_i, d_j) = freq(k_i, d_j) / max_{k ∈ d_j} freq(k, d_j)
idf(k_i, D) = log(|D| / |{d_j ∈ D : k_i ∈ d_j}|)
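A sketch of the TF-IDF scheme using exactly the local and global components above; the toy collection is an assumption for illustration:

```python
import math
from collections import Counter

def tf(term, doc_counts):
    """Normalized term frequency: freq / max freq in the document."""
    if not doc_counts:
        return 0.0
    return doc_counts[term] / max(doc_counts.values())

def idf(term, collection):
    """log(|D| / number of documents containing the term)."""
    df = sum(1 for doc in collection if term in doc)
    return math.log(len(collection) / df) if df else 0.0

# Assumed toy collection of bag-of-words documents
collection = [Counter(["frodo", "sword", "sword"]),
              Counter(["sam", "sword"]),
              Counter(["sam", "orc"])]
d = collection[0]
weight = tf("frodo", d) * idf("frodo", collection)  # (1/2) * log(3/1)
print(round(weight, 3))  # → 0.549
```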

30 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

31 Probabilistic retrieval models
Probabilistic retrieval models view retrieval as the problem of estimating the probability of relevance given a query, document, collection, etc.
Documents are ranked in decreasing order of this probability
The most popular models in this category:
Classical probabilistic model
Two-Poisson model
BM25
Language model

32 Probability refresher
Bayes' rule:
P(B|A) = P(A|B) P(B) / P(A)
Allows us to compute P(B|A) using P(A|B), which is particularly useful when the former is difficult to compute directly
Chain rule:
P(A, B, C) = P(A|B, C) P(B|C) P(C)
Allows us to write a joint probability using conditional probabilities

33 Probability ranking principle
Probability ranking principle (Robertson, 1977)
If an IR system's response to each query is a ranking of the documents in the collection in order of decreasing probability of relevance, the overall effectiveness of the system to its user will be maximal.
Probabilistic IR models need to answer the basic question:
What is the probability that a user will judge this document as relevant for this query? (Sparck Jones et al., 2000)
How do we formalize this question?

34 Probability ranking principle (2)
Random variables:
D = [D_1, ..., D_t, ..., D_N] the document (a set of terms)
Q = [Q_1, ..., Q_t, ..., Q_L] the query (a set of terms)
R ∈ {0, 1} the relevance judgement: R = 1 if D is relevant for Q, R = 0 otherwise
The basic question now translates to: estimate
P(R = 1 | D = d, Q = q)    (1)

35 Probability ranking principle (3)
Let r be a shorthand for R = 1, and r̄ for R = 0
We apply the logit function to probability (1):
logit(p) = log(p / (1 − p))
This is a rank-preserving transformation, giving:
logit p(r|D, Q) = log(p(r|D, Q) / (1 − p(r|D, Q))) = log(p(r|D, Q) / p(r̄|D, Q))

36 Logit

37 Probability ranking principle (4)
We can now apply Bayes' rule:
log(p(r|D, Q) / p(r̄|D, Q)) = log(p(D, Q|r) P(r) / (p(D, Q|r̄) P(r̄)))
(A benefit of having used the logit is that P(D, Q) cancels out)

38 Probability ranking principle (5)
Finally, we use the chain rule:
log(p(D, Q|r) p(r) / (p(D, Q|r̄) p(r̄)))
= log(p(D|Q, r) p(Q|r) p(r) / (p(D|Q, r̄) p(Q|r̄) p(r̄)))
= log(p(D|Q, r) / p(D|Q, r̄)) + log(p(Q|r) p(r) / (p(Q|r̄) p(r̄)))
≈ log(p(D|Q, r) / p(D|Q, r̄))    (2)
The right term in the second row does not depend on D, so we can remove it without affecting the ordering (it can be interpreted as a measure of query difficulty)
Expression (2) is the heart of probabilistic retrieval models

39 Binary Independence Model
Our first attempt at estimating expression (2)
Documents are represented as vectors of binary random variables D = [D_1, ..., D_N], with one dimension for each term in the vocabulary V
D_i = 1 denotes the presence of term i, D_i = 0 its absence
Similarly, the query is a vector Q = [Q_1, ..., Q_L]

40 Binary Independence Model (2)
Assumption 1: given relevance, terms are statistically independent
Now we can write the terms in (2) as follows:
p(D|Q, r) = ∏_{i=1}^{|V|} p(D_i|Q, r)
p(D|Q, r̄) = ∏_{i=1}^{|V|} p(D_i|Q, r̄)
Turning expression (2) into:
log(p(D|Q, r) / p(D|Q, r̄)) = ∑_{i=1}^{|V|} log(p(D_i|Q, r) / p(D_i|Q, r̄))    (3)

41 Binary Independence Model (3)
Assumption 2: the presence of a term in a document depends on relevance only when that term is present in the query
For a fixed query Q = q = [q_1, ..., q_L], if q_i = 0, then according to this assumption p(D_i|q, r) does not depend on relevance:
p(D_i|q, r) = p(D_i|q, r̄), so log(p(D_i|q, r) / p(D_i|q, r̄)) = 0
We can therefore simplify (3) by ignoring all summation terms not in q:
∑_{i=1}^{|V|} log(p(D_i|q, r) / p(D_i|q, r̄)) = ∑_{t ∈ q} log(p(D_t|q, r) / p(D_t|q, r̄))    (4)

42 Binary Independence Model (4)
We can view expression (4) as summing weights of terms t ∈ q:
∑_{t ∈ q} log(p(D_t|q, r) / p(D_t|q, r̄)) = ∑_{t ∈ q} w_t    (5)
Although the assumptions are often violated, these models work well
Common practice is to approximate p(D_t|q, r) with 0.5 and p(D_t|q, r̄) with n_t / N_d
n_t number of documents containing term t
N_d total number of documents
A very similar alternative is to use IDF scores as w_t
These probabilities can be re-estimated through relevance feedback

43 Example
For the collection D consisting of the following documents:
d_1: Frodo and Sam stabbed orcs.
d_2: Sam chased the orc with the sword.
d_3: Sam took the sword.
and the query "Sam stabbed orc", using the common-practice approximations (p(D_t|q, r) = 0.5 for all terms):
d_1 (matches Sam, stabbed, orc): p(D_t|q, r̄) = 3/3, 1/3, 2/3; w_t = log 0.5, log 1.5, log 0.75; ∑ w_t = log 0.5625
d_2 (matches Sam, orc): p(D_t|q, r̄) = 3/3, 2/3; w_t = log 0.5, log 0.75; ∑ w_t = log 0.375
d_3 (matches Sam): p(D_t|q, r̄) = 3/3; w_t = log 0.5; ∑ w_t = log 0.5
This yields d_1 as the most relevant result, followed by d_3 and d_2
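The worked example can be reproduced with a short script implementing the common-practice approximations, i.e., summing w_t = log(0.5 / (n_t/N)) over the query terms present in the document (tokenization and normalization of the collection are assumed):

```python
import math

def bim_score(doc_terms, query_terms, collection):
    """Binary Independence Model score with the common-practice
    approximations: p(D_t|q,r) ≈ 0.5 and p(D_t|q,r̄) ≈ n_t/N."""
    N = len(collection)
    score = 0.0
    for t in query_terms:
        if t in doc_terms:
            n_t = sum(1 for d in collection if t in d)
            score += math.log(0.5 / (n_t / N))
    return score

# Normalized term sets for the slide's collection
collection = [{"frodo", "sam", "stab", "orc"},   # d1
              {"sam", "chase", "orc", "sword"},  # d2
              {"sam", "take", "sword"}]          # d3
query = {"sam", "stab", "orc"}
scores = [bim_score(d, query, collection) for d in collection]
ranking = sorted(range(3), key=lambda i: -scores[i])
print([f"d{i+1}" for i in ranking])  # → ['d1', 'd3', 'd2']
```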

44 Two-Poisson model
A more realistic document representation: a vector of word frequencies
Uses the Poisson distribution to model frequencies
Assumption: all documents are of equal length
Can be approximated by the following expression:
∑_{t ∈ q} (f_t,d (k_1 + 1) / (k_1 + f_t,d)) w_t    (6)
where f_t,d is the frequency of term t in document d and k_1 is a constant (typically 1 ≤ k_1 < 2)
Higher-frequency words get boosted weights

45 BM11
Removes the document-length assumption of the Two-Poisson model: matches in longer documents should be less important
We can correct the frequency: f′_t,d = f_t,d · (l_avg / l_d)
l_avg the average document length
l_d the length of document d
This dampens/boosts word frequencies for documents of above/below average length
Now we can rewrite (6) as:
∑_{t ∈ q} (f_t,d (k_1 + 1) / (k_1 (l_d / l_avg) + f_t,d)) w_t

46 BM25
While BM11 removes the problem of assuming equal document length, in practice it still has problems:
Long relevant documents get too much dampening
Short irrelevant documents get too much boosting
To control the amount of correction, we introduce a parameter b (often set to 0.75):
∑_{t ∈ q} (f_t,d (k_1 + 1) / (k_1 ((1 − b) + (l_d / l_avg) b) + f_t,d)) w_t    (7)
This expression is the famous BM25 ranking function, which gives state-of-the-art results
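A sketch of expression (7). Taking the term weight w_t to be an IDF score is an assumption here (the Binary Independence Model slide notes IDF as a common choice for w_t), as is the toy collection:

```python
import math

def bm25_score(query, doc, collection, k1=1.2, b=0.75):
    """BM25 score of `doc` for `query` per expression (7),
    with w_t = log(N / n_t) as an assumed IDF-style term weight."""
    N = len(collection)
    l_avg = sum(len(d) for d in collection) / N
    score = 0.0
    for t in query:
        f_td = doc.count(t)
        if f_td == 0:
            continue
        n_t = sum(1 for d in collection if t in d)  # document frequency
        w_t = math.log(N / n_t)
        norm = k1 * ((1 - b) + (len(doc) / l_avg) * b) + f_td
        score += (f_td * (k1 + 1)) / norm * w_t
    return score

collection = [["frodo", "sam", "stab", "orc"],
              ["sam", "chase", "the", "orc", "with", "the", "sword"],
              ["sam", "take", "the", "sword"]]
scores = [bm25_score(["stab", "orc"], d, collection) for d in collection]
print(scores.index(max(scores)))  # index of the top-ranked document
```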

47 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

48 Language modeling
Approaching the probabilistic information retrieval problem from a different perspective
Instead of modeling the document probability given the query, we model the query probability given the document

49 Language modeling
Probabilistic modeling of language
A sentence t is a sequence of terms <t_1, ..., t_n>
Ideally, the probability of sentence t is:
p(t) = p(t_1) p(t_2|t_1) ... p(t_i|t_1, ..., t_{i−1}) ... p(t_n|t_1, ..., t_{n−1})
E.g., for t = <one, ring, to, rule> we have p(t) = p(one) p(ring|one) p(to|one, ring) p(rule|one, ring, to)
In practice, the sparseness problem makes this intractable

50 Unigram language models
Solve the sparseness problem by completely ignoring the conditioning
The probability of sentence t under a unigram model M is:
p(t|M) = p(t_1) ... p(t_i) ... p(t_n)
The probabilities that define the model are estimated from a collection:
p(t_i) = n_i / n_T    (8)
n_i number of times term t_i occurs in the collection
n_T total number of term occurrences in the collection

51 Bigram language models
We simplify the conditionals by keeping only the previous word
A better approximation of reality than unigram models
The probability of sentence t under a bigram model M is:
p(t|M) = p(t_1) p(t_2|t_1) ... p(t_n|t_{n−1})
The probabilities that define the model are estimated from a collection:
p(t_i|t_{i−1}) = n(t_{i−1}, t_i) / n(t_{i−1})
n(t_{i−1}, t_i) number of times the bigram (t_{i−1}, t_i) occurs in the collection
n(t_{i−1}) number of times term t_{i−1} occurs in the collection

52 Example
For the collection D consisting of the following documents:
d_1: Frodo and Sam stabbed orcs.
d_2: Sam chased the orc with the sword.
d_3: Sam took the sword.
Estimating a unigram language model (for the entire collection):
P(Frodo) = 1/16, P(Sam) = 3/16, P(orc) = 2/16, P(chased) = 1/16, P(sword) = 2/16, ...
Estimating a bigram language model (for the entire collection):
P(chased|Frodo) = 0, P(sword|the) = 2/3, P(orc|the) = 1/3, ...
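The estimates in this example can be reproduced as follows; the tokenization and the orcs → orc normalization are assumptions needed to match the slide's counts:

```python
from collections import Counter

# Tokenized collection from the slide (lowercased, "orcs" normalized to "orc")
docs = [["frodo", "and", "sam", "stabbed", "orc"],
        ["sam", "chased", "the", "orc", "with", "the", "sword"],
        ["sam", "took", "the", "sword"]]

tokens = [t for d in docs for t in d]
unigrams = Counter(tokens)
n_T = len(tokens)  # 16 term occurrences in total

def p_unigram(t):
    return unigrams[t] / n_T

# Bigrams are counted within each document, not across document boundaries
bigrams = Counter((d[i], d[i + 1]) for d in docs for i in range(len(d) - 1))

def p_bigram(t, prev):
    return bigrams[(prev, t)] / unigrams[prev] if unigrams[prev] else 0.0

print(p_unigram("sam"), p_bigram("sword", "the"))  # → 3/16 and 2/3
```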

53 Query likelihood model
Given a document collection D and a query q:
A language model M_d is built for each document
Documents are scored according to the probability P(q|M_d)
Intuition: the language models corresponding to relevant documents should assign a higher probability to the query

54 Query likelihood model (2)
A unigram language model of document d can be estimated by dividing the number of times term t_i occurs in d by the number of terms in d:
p(t_i|M_d) = n_{i,d} / n_d
Bigram language models are rarely applied to individual documents due to sparsity problems

55 Smoothing language models
The models we have considered so far assign probability 0 to queries that contain terms not occurring in the document
We can prevent this by using smoothing techniques
Smoothing assigns a small probability under the model even to unseen words

56 Smoothing techniques
Laplace smoothing
Adding a fixed small count (e.g., 1) to all word counts (even the unobserved ones) and renormalizing to get a probability distribution:
p′(t_i|M_d) = (n_{i,d} + α) / (n_d + |V| α)
Adds an artificial count α for each possible word in our vocabulary V

57 Smoothing techniques
Laplace smoothing assumes all unseen words are equally likely!
Jelinek-Mercer smoothing instead builds a language model M_D of the entire document collection and interpolates:
p′(t_i|M_d) = λ p(t_i|M_d) + (1 − λ) p(t_i|M_D)
Words absent from a document will still get some probability mass from the right term
However, the amount each word gets will vary depending on its likelihood in the collection as a whole
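A sketch of query-likelihood scoring with Jelinek-Mercer smoothing as defined above; λ = 0.5 and the toy collection are arbitrary illustrative choices:

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection_counts, n_total, lam=0.5):
    """log P(q|M_d) under a unigram model with Jelinek-Mercer smoothing:
    each term interpolates the document model with the collection model."""
    doc_counts = Counter(doc)
    log_p = 0.0
    for t in query:
        p_doc = doc_counts[t] / len(doc)
        p_coll = collection_counts[t] / n_total
        p = lam * p_doc + (1 - lam) * p_coll
        if p == 0.0:
            return float("-inf")  # term unseen even in the collection
        log_p += math.log(p)
    return log_p

docs = [["frodo", "sam", "stabbed", "orc"],
        ["sam", "chased", "orc", "sword"],
        ["sam", "took", "sword"]]
coll = Counter(t for d in docs for t in d)
n_total = sum(coll.values())
scores = [query_log_likelihood(["sam", "stabbed"], d, coll, n_total) for d in docs]
print(scores.index(max(scores)))  # the document most likely to generate the query
```

Note that without smoothing (λ = 1), the second and third documents would get probability 0 for the query, since they do not contain "stabbed".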

58 Outline
1 What is IR?
2 Text representations in IR
3 Boolean retrieval model
4 Vector space model
5 Classic probabilistic retrieval
6 Language modeling for retrieval
7 IR tools

59 IR Tools
Terrier IR platform
Open command-line IR platform
Popular in academia, used for IR evaluations in research
Stepwise usage: first indexing, then retrieval, and then evaluation
Lucene
Open-source information retrieval library written in Java (ports to many languages exist)
Describes a document as a set of (user-definable) fields
A very flexible solution for applications requiring full-text indexing and search
Elasticsearch
RESTful search engine on top of Lucene
Docs:
Very limited NLP functionality: text-classification-made-easy-with-elasticsearch

60 Terrier (University of Glasgow)
Implements a wide variety of IR models: VSMs, LMs, PMs
Stepwise usage:
Loading and indexing the collection of documents (which needs to be in the TREC format):
>> trec_setup.sh <coll-path>
>> trec_terrier.sh -i
Performing retrieval (must provide the collection of queries and define the retrieval model):
>> trec_terrier.sh -r -Dtrec.topics=<q-path> -Dtrec.model=<ir-model>
Evaluating the retrieval performance (must provide relevance judgements):
>> trec_terrier.sh -e -Dtrec.qrels=<rj-path>

61 Now you...
1 Understand the importance of information retrieval
2 Are familiar with the basic principles of IR
3 Have a basic understanding of the three main IR paradigms
4 Are aware of some readily available IR tools


More information

Information Retrieval and Web Search

Information Retrieval and Web Search Information Retrieval and Web Search IR models: Vector Space Model IR Models Set Theoretic Classic Models Fuzzy Extended Boolean U s e r T a s k Retrieval: Adhoc Filtering Brosing boolean vector probabilistic

More information

6. Logical Inference

6. Logical Inference Artificial Intelligence 6. Logical Inference Prof. Bojana Dalbelo Bašić Assoc. Prof. Jan Šnajder University of Zagreb Faculty of Electrical Engineering and Computing Academic Year 2016/2017 Creative Commons

More information

Probabilistic Information Retrieval

Probabilistic Information Retrieval Probabilistic Information Retrieval Sumit Bhatia July 16, 2009 Sumit Bhatia Probabilistic Information Retrieval 1/23 Overview 1 Information Retrieval IR Models Probability Basics 2 Document Ranking Problem

More information

5 10 12 32 48 5 10 12 32 48 4 8 16 32 64 128 4 8 16 32 64 128 2 3 5 16 2 3 5 16 5 10 12 32 48 4 8 16 32 64 128 2 3 5 16 docid score 5 10 12 32 48 O'Neal averaged 15.2 points 9.2 rebounds and 1.0 assists

More information

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University Text Mining Dr. Yanjun Li Associate Professor Department of Computer and Information Sciences Fordham University Outline Introduction: Data Mining Part One: Text Mining Part Two: Preprocessing Text Data

More information

Language Models. Hongning Wang

Language Models. Hongning Wang Language Models Hongning Wang CS@UVa Notion of Relevance Relevance (Rep(q), Rep(d)) Similarity P(r1 q,d) r {0,1} Probability of Relevance P(d q) or P(q d) Probabilistic inference Different rep & similarity

More information

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,

More information

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 ) Part A 1. A Markov chain is a discrete-time stochastic process, defined by a set of states, a set of transition probabilities (between states), and a set of initial state probabilities; the process proceeds

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 4: Probabilistic Retrieval Models April 29, 2010 Wolf-Tilo Balke and Joachim Selke Institut für Informationssysteme Technische Universität Braunschweig

More information

CAIM: Cerca i Anàlisi d Informació Massiva

CAIM: Cerca i Anàlisi d Informació Massiva 1 / 21 CAIM: Cerca i Anàlisi d Informació Massiva FIB, Grau en Enginyeria Informàtica Slides by Marta Arias, José Balcázar, Ricard Gavaldá Department of Computer Science, UPC Fall 2016 http://www.cs.upc.edu/~caim

More information

University of Illinois at Urbana-Champaign. Midterm Examination

University of Illinois at Urbana-Champaign. Midterm Examination University of Illinois at Urbana-Champaign Midterm Examination CS410 Introduction to Text Information Systems Professor ChengXiang Zhai TA: Azadeh Shakery Time: 2:00 3:15pm, Mar. 14, 2007 Place: Room 1105,

More information

16 The Information Retrieval "Data Model"

16 The Information Retrieval Data Model 16 The Information Retrieval "Data Model" 16.1 The general model Not presented in 16.2 Similarity the course! 16.3 Boolean Model Not relevant for exam. 16.4 Vector space Model 16.5 Implementation issues

More information

Lecture 3: Probabilistic Retrieval Models

Lecture 3: Probabilistic Retrieval Models Probabilistic Retrieval Models Information Retrieval and Web Search Engines Lecture 3: Probabilistic Retrieval Models November 5 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Christopher Manning and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture;

More information

PV211: Introduction to Information Retrieval

PV211: Introduction to Information Retrieval PV211: Introduction to Information Retrieval http://www.fi.muni.cz/~sojka/pv211 IIR 6: Scoring, term weighting, the vector space model Handout version Petr Sojka, Hinrich Schütze et al. Faculty of Informatics,

More information

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson Content adapted from Hinrich Schütze http://www.informationretrieval.org Querying Corpus-wide statistics Querying

More information

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan Term Weighting and the Vector Space Model borrowing from: Pandu Nayak and Prabhakar Raghavan IIR Sections 6.2 6.4.3 Ranked retrieval Scoring documents Term frequency Collection statistics Weighting schemes

More information

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting Outline for today Information Retrieval Efficient Scoring and Ranking Recap on ranked retrieval Jörg Tiedemann jorg.tiedemann@lingfil.uu.se Department of Linguistics and Philology Uppsala University Efficient

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2017. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-017 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 1 Boolean Retrieval Thus far,

More information

Score Distribution Models

Score Distribution Models Score Distribution Models Evangelos Kanoulas Virgil Pavlu Keshi Dai Javed Aslam Score Distributions 2 Score Distributions 2 Score Distributions 9.6592 9.5761 9.4919 9.4784 9.2693 9.2066 9.1407 9.0824 9.0110

More information

Text Analytics (Text Mining)

Text Analytics (Text Mining) http://poloclub.gatech.edu/cse6242 CSE6242 / CX4242: Data & Visual Analytics Text Analytics (Text Mining) Concepts, Algorithms, LSI/SVD Duen Horng (Polo) Chau Assistant Professor Associate Director, MS

More information

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review INFO 4300 / CS4300 Information Retrieval IR 9: Linear Algebra Review Paul Ginsparg Cornell University, Ithaca, NY 24 Sep 2009 1/ 23 Overview 1 Recap 2 Matrix basics 3 Matrix Decompositions 4 Discussion

More information

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002 CS276A Text Information Retrieval, Mining, and Exploitation Lecture 4 15 Oct 2002 Recap of last time Index size Index construction techniques Dynamic indices Real world considerations 2 Back of the envelope

More information

Lecture 13: More uses of Language Models

Lecture 13: More uses of Language Models Lecture 13: More uses of Language Models William Webber (william@williamwebber.com) COMP90042, 2014, Semester 1, Lecture 13 What we ll learn in this lecture Comparing documents, corpora using LM approaches

More information

Probabilistic Field Mapping for Product Search

Probabilistic Field Mapping for Product Search Probabilistic Field Mapping for Product Search Aman Berhane Ghirmatsion and Krisztian Balog University of Stavanger, Stavanger, Norway ab.ghirmatsion@stud.uis.no, krisztian.balog@uis.no, Abstract. This

More information

vector space retrieval many slides courtesy James Amherst

vector space retrieval many slides courtesy James Amherst vector space retrieval many slides courtesy James Allan@umass Amherst 1 what is a retrieval model? Model is an idealization or abstraction of an actual process Mathematical models are used to study the

More information

.. CSC 566 Advanced Data Mining Alexander Dekhtyar..

.. CSC 566 Advanced Data Mining Alexander Dekhtyar.. .. CSC 566 Advanced Data Mining Alexander Dekhtyar.. Information Retrieval Latent Semantic Indexing Preliminaries Vector Space Representation of Documents: TF-IDF Documents. A single text document is a

More information

1 Information retrieval fundamentals

1 Information retrieval fundamentals CS 630 Lecture 1: 01/26/2006 Lecturer: Lillian Lee Scribes: Asif-ul Haque, Benyah Shaparenko This lecture focuses on the following topics Information retrieval fundamentals Vector Space Model (VSM) Deriving

More information

Information Retrieval

Information Retrieval Introduction to Information Retrieval CS276: Information Retrieval and Web Search Pandu Nayak and Prabhakar Raghavan Lecture 6: Scoring, Term Weighting and the Vector Space Model This lecture; IIR Sections

More information

Maschinelle Sprachverarbeitung

Maschinelle Sprachverarbeitung Maschinelle Sprachverarbeitung Retrieval Models and Implementation Ulf Leser Content of this Lecture Information Retrieval Models Boolean Model Vector Space Model Inverted Files Ulf Leser: Maschinelle

More information

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model

Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Introduction to Information Retrieval (Manning, Raghavan, Schutze) Chapter 6 Scoring term weighting and the vector space model Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Midterm sample questions

Midterm sample questions Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts

More information

Knowledge Discovery in Data: Overview. Naïve Bayesian Classification. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar..

Knowledge Discovery in Data: Overview. Naïve Bayesian Classification. .. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar.. Spring 2009 CSC 466: Knowledge Discovery from Data Alexander Dekhtyar Knowledge Discovery in Data: Naïve Bayes Overview Naïve Bayes methodology refers to a probabilistic approach to information discovery

More information

A Study of the Dirichlet Priors for Term Frequency Normalisation

A Study of the Dirichlet Priors for Term Frequency Normalisation A Study of the Dirichlet Priors for Term Frequency Normalisation ABSTRACT Ben He Department of Computing Science University of Glasgow Glasgow, United Kingdom ben@dcs.gla.ac.uk In Information Retrieval

More information

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin)

Information Retrieval and Topic Models. Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Information Retrieval and Topic Models Mausam (Based on slides of W. Arms, Dan Jurafsky, Thomas Hofmann, Ata Kaban, Chris Manning, Melanie Martin) Sec. 1.1 Unstructured data in 1620 Which plays of Shakespeare

More information

CS 572: Information Retrieval

CS 572: Information Retrieval CS 572: Information Retrieval Lecture 11: Topic Models Acknowledgments: Some slides were adapted from Chris Manning, and from Thomas Hoffman 1 Plan for next few weeks Project 1: done (submit by Friday).

More information

Variable Latent Semantic Indexing

Variable Latent Semantic Indexing Variable Latent Semantic Indexing Prabhakar Raghavan Yahoo! Research Sunnyvale, CA November 2005 Joint work with A. Dasgupta, R. Kumar, A. Tomkins. Yahoo! Research. Outline 1 Introduction 2 Background

More information

Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science

Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science Lecture 2: IR Models. Johan Bollen Old Dominion University Department of Computer Science http://www.cs.odu.edu/ jbollen January 30, 2003 Page 1 Structure 1. IR formal characterization (a) Mathematical

More information

Text Analysis. Week 5

Text Analysis. Week 5 Week 5 Text Analysis Reference and Slide Source: ChengXiang Zhai and Sean Massung. 2016. Text Data Management and Analysis: a Practical Introduction to Information Retrieval and Text Mining. Association

More information

Lecture 5: Web Searching using the SVD

Lecture 5: Web Searching using the SVD Lecture 5: Web Searching using the SVD Information Retrieval Over the last 2 years the number of internet users has grown exponentially with time; see Figure. Trying to extract information from this exponentially

More information

Scoring, Term Weighting and the Vector Space

Scoring, Term Weighting and the Vector Space Scoring, Term Weighting and the Vector Space Model Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan Content [J

More information

Machine Learning: Assignment 1

Machine Learning: Assignment 1 10-701 Machine Learning: Assignment 1 Due on Februrary 0, 014 at 1 noon Barnabas Poczos, Aarti Singh Instructions: Failure to follow these directions may result in loss of points. Your solutions for this

More information

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation.

Sparse vectors recap. ANLP Lecture 22 Lexical Semantics with Dense Vectors. Before density, another approach to normalisation. ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Previous lectures: Sparse vectors recap How to represent

More information

ANLP Lecture 22 Lexical Semantics with Dense Vectors

ANLP Lecture 22 Lexical Semantics with Dense Vectors ANLP Lecture 22 Lexical Semantics with Dense Vectors Henry S. Thompson Based on slides by Jurafsky & Martin, some via Dorota Glowacka 5 November 2018 Henry S. Thompson ANLP Lecture 22 5 November 2018 Previous

More information

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy

Ranked IR. Lecture Objectives. Text Technologies for Data Science INFR Learn about Ranked IR. Implement: 10/10/2018. Instructor: Walid Magdy Text Technologies for Data Science INFR11145 Ranked IR Instructor: Walid Magdy 10-Oct-2018 Lecture Objectives Learn about Ranked IR TFIDF VSM SMART notation Implement: TFIDF 2 1 Boolean Retrieval Thus

More information

Midterm Examination Practice

Midterm Examination Practice University of Illinois at Urbana-Champaign Midterm Examination Practice CS598CXZ Advanced Topics in Information Retrieval (Fall 2013) Professor ChengXiang Zhai 1. Basic IR evaluation measures: The following

More information

13 Searching the Web with the SVD

13 Searching the Web with the SVD 13 Searching the Web with the SVD 13.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

9 Searching the Internet with the SVD

9 Searching the Internet with the SVD 9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this

More information

Information Retrieval and Organisation

Information Retrieval and Organisation Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Categorization ANLP Lecture 10 Text Categorization with Naive Bayes

Categorization ANLP Lecture 10 Text Categorization with Naive Bayes 1 Categorization ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Important task for both humans and machines object identification face recognition spoken word recognition

More information

ANLP Lecture 10 Text Categorization with Naive Bayes

ANLP Lecture 10 Text Categorization with Naive Bayes ANLP Lecture 10 Text Categorization with Naive Bayes Sharon Goldwater 6 October 2014 Categorization Important task for both humans and machines 1 object identification face recognition spoken word recognition

More information

A Study of Smoothing Methods for Language Models Applied to Information Retrieval

A Study of Smoothing Methods for Language Models Applied to Information Retrieval A Study of Smoothing Methods for Language Models Applied to Information Retrieval CHENGXIANG ZHAI and JOHN LAFFERTY Carnegie Mellon University Language modeling approaches to information retrieval are

More information

The Naïve Bayes Classifier. Machine Learning Fall 2017

The Naïve Bayes Classifier. Machine Learning Fall 2017 The Naïve Bayes Classifier Machine Learning Fall 2017 1 Today s lecture The naïve Bayes Classifier Learning the naïve Bayes Classifier Practical concerns 2 Today s lecture The naïve Bayes Classifier Learning

More information

Lecture 12: Link Analysis for Web Retrieval

Lecture 12: Link Analysis for Web Retrieval Lecture 12: Link Analysis for Web Retrieval Trevor Cohn COMP90042, 2015, Semester 1 What we ll learn in this lecture The web as a graph Page-rank method for deriving the importance of pages Hubs and authorities

More information

The Static Absorbing Model for the Web a

The Static Absorbing Model for the Web a Journal of Web Engineering, Vol. 0, No. 0 (2003) 000 000 c Rinton Press The Static Absorbing Model for the Web a Vassilis Plachouras University of Glasgow Glasgow G12 8QQ UK vassilis@dcs.gla.ac.uk Iadh

More information

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze Term Weighting and Vector Space Model Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze 1 Ranked retrieval Thus far, our queries have all been Boolean. Documents either

More information

Document indexing, similarities and retrieval in large scale text collections

Document indexing, similarities and retrieval in large scale text collections Document indexing, similarities and retrieval in large scale text collections Eric Gaussier Univ. Grenoble Alpes - LIG Eric.Gaussier@imag.fr Eric Gaussier Document indexing, similarities & retrieval 1

More information

DISTRIBUTIONAL SEMANTICS

DISTRIBUTIONAL SEMANTICS COMP90042 LECTURE 4 DISTRIBUTIONAL SEMANTICS LEXICAL DATABASES - PROBLEMS Manually constructed Expensive Human annotation can be biased and noisy Language is dynamic New words: slangs, terminology, etc.

More information

Information Retrieval. Lecture 6

Information Retrieval. Lecture 6 Information Retrieval Lecture 6 Recap of the last lecture Parametric and field searches Zones in documents Scoring documents: zone weighting Index support for scoring tf idf and vector spaces This lecture

More information

Language as a Stochastic Process

Language as a Stochastic Process CS769 Spring 2010 Advanced Natural Language Processing Language as a Stochastic Process Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu 1 Basic Statistics for NLP Pick an arbitrary letter x at random from any

More information

CS47300: Web Information Search and Management

CS47300: Web Information Search and Management CS47300: Web Information Search and Management Prof. Chris Clifton 6 September 2017 Material adapted from course created by Dr. Luo Si, now leading Alibaba research group 1 Vector Space Model Disadvantages:

More information

Lecture 2: Probability, Naive Bayes

Lecture 2: Probability, Naive Bayes Lecture 2: Probability, Naive Bayes CS 585, Fall 205 Introduction to Natural Language Processing http://people.cs.umass.edu/~brenocon/inlp205/ Brendan O Connor Today Probability Review Naive Bayes classification

More information

Topic Models and Applications to Short Documents

Topic Models and Applications to Short Documents Topic Models and Applications to Short Documents Dieu-Thu Le Email: dieuthu.le@unitn.it Trento University April 6, 2011 1 / 43 Outline Introduction Latent Dirichlet Allocation Gibbs Sampling Short Text

More information

Leverage Sparse Information in Predictive Modeling

Leverage Sparse Information in Predictive Modeling Leverage Sparse Information in Predictive Modeling Liang Xie Countrywide Home Loans, Countrywide Bank, FSB August 29, 2008 Abstract This paper examines an innovative method to leverage information from

More information

Matrices, Vector Spaces, and Information Retrieval

Matrices, Vector Spaces, and Information Retrieval Matrices, Vector Spaces, and Information Authors: M. W. Berry and Z. Drmac and E. R. Jessup SIAM 1999: Society for Industrial and Applied Mathematics Speaker: Mattia Parigiani 1 Introduction Large volumes

More information

Geometric and Quantum Methods for Information Retrieval

Geometric and Quantum Methods for Information Retrieval PAPER Geometric and Quantum Methods for Information Retrieval Yaoyong Li, Hamish Cunningham Department of Computer Science, University of Sheffield, Sheffield, UK {yaoyong;hamish}@dcs.shef.ac.uk Abstract

More information

Name: Matriculation Number: Tutorial Group: A B C D E

Name: Matriculation Number: Tutorial Group: A B C D E Name: Matriculation Number: Tutorial Group: A B C D E Question: 1 (5 Points) 2 (6 Points) 3 (5 Points) 4 (5 Points) Total (21 points) Score: General instructions: The written test contains 4 questions

More information

Information Retrieval

Information Retrieval Introduction to Information CS276: Information and Web Search Christopher Manning and Pandu Nayak Lecture 13: Latent Semantic Indexing Ch. 18 Today s topic Latent Semantic Indexing Term-document matrices

More information

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search Part I: Web Structure Mining Chapter : Information Retrieval an Web Search The Web Challenges Crawling the Web Inexing an Keywor Search Evaluating Search Quality Similarity Search The Web Challenges Tim

More information

Modern Information Retrieval

Modern Information Retrieval Modern Information Retrieval Chapter 8 Text Classification Introduction A Characterization of Text Classification Unsupervised Algorithms Supervised Algorithms Feature Selection or Dimensionality Reduction

More information

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model

Vector Space Model. Yufei Tao KAIST. March 5, Y. Tao, March 5, 2013 Vector Space Model Vector Space Model Yufei Tao KAIST March 5, 2013 In this lecture, we will study a problem that is (very) fundamental in information retrieval, and must be tackled by all search engines. Let S be a set

More information

Data Mining Recitation Notes Week 3

Data Mining Recitation Notes Week 3 Data Mining Recitation Notes Week 3 Jack Rae January 28, 2013 1 Information Retrieval Given a set of documents, pull the (k) most similar document(s) to a given query. 1.1 Setup Say we have D documents

More information

Web Information Retrieval Dipl.-Inf. Christoph Carl Kling

Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Institute for Web Science & Technologies University of Koblenz-Landau, Germany Web Information Retrieval Dipl.-Inf. Christoph Carl Kling Exercises WebIR ask questions! WebIR@c-kling.de 2 of 40 Probabilities

More information

Probabilistic Language Modeling

Probabilistic Language Modeling Predicting String Probabilities Probabilistic Language Modeling Which string is more likely? (Which string is more grammatical?) Grill doctoral candidates. Regina Barzilay EECS Department MIT November

More information

Hidden Markov Models in Language Processing

Hidden Markov Models in Language Processing Hidden Markov Models in Language Processing Dustin Hillard Lecture notes courtesy of Prof. Mari Ostendorf Outline Review of Markov models What is an HMM? Examples General idea of hidden variables: implications

More information

Language Models, Smoothing, and IDF Weighting

Language Models, Smoothing, and IDF Weighting Language Models, Smoothing, and IDF Weighting Najeeb Abdulmutalib, Norbert Fuhr University of Duisburg-Essen, Germany {najeeb fuhr}@is.inf.uni-due.de Abstract In this paper, we investigate the relationship

More information

Comparing Relevance Feedback Techniques on German News Articles

Comparing Relevance Feedback Techniques on German News Articles B. Mitschang et al. (Hrsg.): BTW 2017 Workshopband, Lecture Notes in Informatics (LNI), Gesellschaft für Informatik, Bonn 2017 301 Comparing Relevance Feedback Techniques on German News Articles Julia

More information