
Chap 2: Classical models for information retrieval
Jean-Pierre Chevallet & Philippe Mulhem
LIG-MRIM, Sept 2016

Outline

1 Basic IR Models
2 Vector Space Model: weighting, implementing VSM, learning from the user
3 Probabilistic Models
4 Solutions to the exercises

IR System

[Diagram: an Information Retrieval System (IRS) matches a request against documents. The request is expressed as a query in a query language (the query model); the content of documents is represented by surrogates in an indexing language (the document model); a knowledge base provides the knowledge model. The implementation of the matching function ties these together: query model, matching function, document model and knowledge model form the Information Retrieval Model.]

IR Models

[Diagram: taxonomy of classical IR models — Set model; Boolean model (strict or weighted); Vector Space model; Probabilistic models (classical, inference network, language model).]


Set Model

Membership of a set:
- Requests are single descriptors
- Indexing assignments: a set of descriptors
- In reply to a request, documents are either retrieved or not (no ordering)
- Retrieval rule: if the descriptor in the request is a member of the descriptors assigned to a document, then the document is retrieved.

This model is the simplest one and describes the retrieval characteristics of a typical library, where books are retrieved by looking up a single author, title or subject descriptor in a catalog.

Example

Request: information retrieval
Doc1 = { information retrieval, database, salton } => RETRIEVED
Doc2 = { database, SQL } => NOT RETRIEVED

Set inclusion

- Request: a set of descriptors
- Indexing assignments: a set of descriptors
- Documents are either retrieved or not (no ordering)
- Retrieval rule: a document is retrieved if ALL the descriptors in the request are in the indexing set of the document.

This model uses the notion of inclusion of the descriptor set of the request in the descriptor set of the document.

Set intersection

- Request: a set of descriptors PLUS a cut-off value
- Indexing assignments: a set of descriptors
- Documents are either retrieved or not (no ordering)
- Retrieval rule: a document is retrieved if it shares a number of descriptors with the request that exceeds the cut-off value.

This model uses the notion of set intersection between the descriptor set of the request and the descriptor set of the document.

Set intersection plus ranking

- Request: a set of descriptors PLUS a cut-off value
- Indexing assignments: a set of descriptors
- Retrieved documents are ranked
- Retrieval rule: documents sharing with the request more than the specified number of descriptors are ranked in order of decreasing overlap.
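As a concrete illustration, here is a minimal Python sketch of the four set-based retrieval rules above (the document collection, function names and cut-off values are illustrative, not from the slides):

```python
# Minimal sketch of the four set-based retrieval rules (illustrative names).

def retrieve_membership(docs, descriptor):
    """Set model: the request is a single descriptor; no ordering."""
    return [d for d, terms in docs.items() if descriptor in terms]

def retrieve_inclusion(docs, request):
    """Set inclusion: ALL request descriptors must index the document."""
    return [d for d, terms in docs.items() if request <= terms]

def retrieve_intersection(docs, request, cutoff):
    """Set intersection: overlap with the request must exceed the cut-off."""
    return [d for d, terms in docs.items() if len(request & terms) > cutoff]

def retrieve_ranked(docs, request, cutoff):
    """Set intersection plus ranking: order by decreasing overlap."""
    hits = [(d, len(request & terms)) for d, terms in docs.items()]
    return sorted([h for h in hits if h[1] > cutoff], key=lambda h: -h[1])

docs = {"Doc1": {"information retrieval", "database", "salton"},
        "Doc2": {"database", "SQL"}}
print(retrieve_membership(docs, "information retrieval"))  # ['Doc1']
```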


Logical Models

Based on a given formalized logic:
- Propositional logic: Boolean model
- First-order logic: conceptual graph matching
- Modal logic
- Description logic: matching with knowledge
- Concept analysis
- Fuzzy logic
- ...

Matching: a deduction of the query Q from the document D.

Boolean Model

- Request: any Boolean combination of descriptors using the operators AND, OR and NOT
- Indexing assignments: a set of descriptors
- Documents are either retrieved or not (no ordering)
- Retrieval rules:
  - if Request = $t_a \wedge t_b$, then retrieve only documents with both $t_a$ and $t_b$
  - if Request = $t_a \vee t_b$, then retrieve only documents with either $t_a$ or $t_b$
  - if Request = $\neg t_a$, then retrieve only documents without $t_a$.

Boolean Model

- Knowledge model: $T = \{t_i\}, i \in [1..N]$, the terms $t_i$ that index the documents.
- Document model (content): a Boolean expression in propositional logic, with the $t_i$ considered as propositions. A document $D_1 = \{t_1, t_3\}$ is represented by the logic formula that is the conjunction of all terms, direct (in the set) or negated:
  $t_1 \wedge \neg t_2 \wedge t_3 \wedge \neg t_4 \wedge \dots \wedge \neg t_{N-1} \wedge \neg t_N$
- A query Q is represented by any logic formula.
- The matching function is the logical implication: $D \models Q$.
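A minimal sketch of this matching, assuming documents indexed as term sets and queries encoded as nested tuples (an illustrative encoding, not the authors'):

```python
# Sketch: evaluating a Boolean query against a set-indexed document.
# Queries are nested tuples, e.g. ("AND", "t1", ("NOT", "t2")).

def satisfies(terms, query):
    """True iff the document's term set is a model of the query (D |= Q)."""
    if isinstance(query, str):          # a term used as a proposition
        return query in terms
    op = query[0]
    if op == "AND":
        return all(satisfies(terms, q) for q in query[1:])
    if op == "OR":
        return any(satisfies(terms, q) for q in query[1:])
    if op == "NOT":
        return not satisfies(terms, query[1])
    raise ValueError(op)

doc = {"t1", "t3"}
print(satisfies(doc, ("AND", "t1", ("NOT", "t2"))))  # True
```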

Boolean model limit: relevance value

No distinction between relevant documents: take $Q = t_1 \wedge t_2$ over the vocabulary $\{t_1, t_2, t_3, t_4, t_5, t_6\}$:

$D_1 = \{t_1, t_2\}$: $t_1 \wedge t_2 \wedge \neg t_3 \wedge \neg t_4 \wedge \neg t_5 \wedge \neg t_6$
$D_2 = \{t_1, t_2, t_3, t_4, t_5\}$: $t_1 \wedge t_2 \wedge t_3 \wedge t_4 \wedge t_5 \wedge \neg t_6$

Both documents are relevant because $D_1 \models Q$ and $D_2 \models Q$. We feel that $D_1$ is a better response because it is closer to the query. Possible solution: require $D \models Q$ and $Q \models D$. (See the chapter on Logical Models.)

Boolean model limit: complexity of queries

$Q = ((t_1 \vee t_2) \wedge t_3) \vee (t_4 \wedge (\neg t_5 \vee t_6))$

The meaning of the logical $\vee$ (inclusive) differs from the usual "or" (exclusive).

Conclusion for the Boolean model

- Model used until the 1990s.
- Very simple to implement.
- Still used in some web search engines as a basic fast document filter, with an extra step for document ordering.


Logical models: choice of a logic

A lot of possible choices to go beyond the Boolean model. The choice of the underlying logic L to model the IR matching process:
- Propositional logic
- First-order logic
- Modal logic
- Fuzzy logic
- Description logic
- ...

Logical models: matching interpretation

For a logic L, different possible interpretations of matching:
- Deduction theory: $D \vdash_L Q$, i.e. $\vdash_L D \rightarrow Q$
- Model theory: $D \models_L Q$, i.e. $\models_L D \rightarrow Q$

Interpretations

[Diagram: the 16 interpretations over 4 terms, drawn as bit vectors 0000 to 1111. The document $d = \{t_1, t_2, t_3\}$ is one interpretation; the documents answering $q = t_1 \wedge t_3$ are the interpretations where $t_1$ and $t_3$ are both true (1010, 1011, 1110, 1111); the rest are other interpretations.]

Inverted File

        d1 d2 d3 d4 d5 d6 d7 d8
a        0  1  0  0  1  1  0  1
b        1  1  1  1  0  1  0  0
c        0  0  0  1  0  0  0  0
...
a ∧ b    0  1  0  0  0  1  0  0   (bitwise AND of rows a and b)

Implementing the Boolean model

- Posting list: the list of unique doc ids associated with an indexing term
- If no ordering, this list is in fact a set
- If ordered, nice algorithms exist to fuse lists
- Main set (list) fusion algorithms: intersection, union and complement
- The algorithms depend on the ordering: with no order, complexity is $O(n \cdot m)$; with order, complexity is $O(n + m)$! So keep posting lists ordered.

Implementing the Boolean model

Left as an exercise: produce the 3 algorithms (union, intersection and complement) with the following constraints:
- The two lists are read-only
- The produced list is a new one and is write-only
- Using sorting algorithms is not allowed

Deduce the complexity from the algorithms.


Weighted Boolean model

Extension of the Boolean model with weights (a weight denotes the representativity of a term for a document).

- Knowledge model: $T = \{t_i\}, i \in [1..N]$, the terms $t_i$ that index the documents.
- A document D is represented by:
  - a logical formula D (as in the Boolean model)
  - a function $W_D : T \rightarrow [0, 1]$, which gives, for each term in T, the weight of the term in D. The weight is 0 for a term not present in the document.

Weighted Boolean model: matching

Non-binary matching function based on fuzzy logic:

$RSV(D, a \vee b) = \max[W_D(a), W_D(b)]$
$RSV(D, a \wedge b) = \min[W_D(a), W_D(b)]$
$RSV(D, \neg a) = 1 - W_D(a)$

Limitation: this matching does not take all the query terms into account.

Weighted Boolean model: matching

Non-binary matching function based on a similarity function which takes all the query terms more into account:

$RSV(D, a \vee b) = \sqrt{\frac{W_D(a)^2 + W_D(b)^2}{2}}$
$RSV(D, a \wedge b) = 1 - \sqrt{\frac{(1 - W_D(a))^2 + (1 - W_D(b))^2}{2}}$
$RSV(D, \neg a) = 1 - W_D(a)$

Limitation: query expression for complex needs.
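A small Python sketch of both weighted-Boolean matching functions above; representing $W_D$ as a dict is an illustrative choice:

```python
import math

# Sketch of the fuzzy and distance-based weighted-Boolean RSV functions.
# W maps each term to its weight in the document (0 if absent).

def rsv_fuzzy_or(W, a, b):  return max(W[a], W[b])
def rsv_fuzzy_and(W, a, b): return min(W[a], W[b])

def rsv_dist_or(W, a, b):
    # score grows with the distance from the worst point (0, 0)
    return math.sqrt((W[a]**2 + W[b]**2) / 2)

def rsv_dist_and(W, a, b):
    # score grows with the closeness to the best point (1, 1)
    return 1 - math.sqrt(((1 - W[a])**2 + (1 - W[b])**2) / 2)

D2 = {"a": 1.0, "b": 0.0}
print(rsv_dist_or(D2, "a", "b"))   # 1/sqrt(2) ~ 0.71
print(rsv_dist_and(D2, "a", "b"))  # 1 - 1/sqrt(2) ~ 0.29
```

On document D2 with $W(a) = 1$ and $W(b) = 0$, the distance-based functions reproduce the 0.71 and 0.29 values of the example table on the next slide.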

Weighted Boolean model: matching example

Document | W(a) | W(b) | Boolean $a \vee b$ | Boolean $a \wedge b$ | Weighted $a \vee b$ | Weighted $a \wedge b$
D1       | 1    | 1    | 1 | 1 | 1 | 1
D2       | 1    | 0    | 1 | 0 | $1/\sqrt{2} \approx 0.71$ | $1 - 1/\sqrt{2} \approx 0.29$
D3       | 0    | 1    | 1 | 0 | $1/\sqrt{2} \approx 0.71$ | $1 - 1/\sqrt{2} \approx 0.29$
D4       | 0    | 0    | 0 | 0 | 0 | 0

Weighted Boolean model: matching example

$d_1 = (0.2, 0.4)$, $d_2 = (0.6, 0.9)$, $d_3 = (0, 1)$

[Figure: two plots of the documents in the (b, a) weight plane. For "a or b", the score grows with the distance from (0, 0), the point to avoid; for "a and b", with the closeness to (1, 1), the target. E.g. $d_3$ scores $1/\sqrt{2} \approx 0.71$ for "a or b" and $1 - 1/\sqrt{2} \approx 0.29$ for "a and b".]


Upper bounds for results

Let $t_i$ be an indexing term, $\#t_i$ the size of its posting list, and $|Q|$ the number of documents that answer the Boolean query Q.

1. Propose an upper bound, in terms of posting-list sizes, for the queries $t_1 \wedge t_2$ and $t_1 \vee t_2$.
2. Supposing that $\#t_1 > \#t_2 > \#t_3$, compute the upper bound for $t_1 \wedge t_2 \wedge t_3$, then propose a way to reduce the global computation of the posting-list merge (use the complexity of a merge as a function of the list sizes).
3. Supposing that $\#t_1 > \#t_2 > \#t_3$, compute the upper bound for $t_1 \vee t_2$ and for $(t_1 \vee t_2) \wedge t_3$, then propose a way to reduce the global computation of the posting-list merge for the query $(t_1 \vee t_2) \wedge t_3$.


Vector Space Model

- Request: a set of descriptors, each of which has a positive number associated with it
- Indexing assignments: a set of descriptors, each of which has a positive number associated with it
- Retrieved documents are ranked
- Retrieval rule: the weights of the descriptors common to the request and to the indexing records are treated as vectors. The value of a retrieved document is the cosine of the angle between the document vector and the query vector.

Vector Space Model

- Knowledge model: $T = \{t_i\}, i \in [1..n]$. All documents are described using this vocabulary.
- A document $D_i$ is represented by a vector $\vec{d_i}$ in the $\mathbb{R}^n$ vector space defined on T:
  $\vec{d_i} = (w_{i,1}, w_{i,2}, \dots, w_{i,j}, \dots, w_{i,n})$, with $w_{k,l}$ the weight of term $t_l$ for document $D_k$.
- A query Q is represented by a vector $\vec{q}$ in the same vector space:
  $\vec{q} = (w_{Q,1}, w_{Q,2}, \dots, w_{Q,j}, \dots, w_{Q,n})$

Vector Space Model

The nearer two vectors that represent documents are, the more similar the documents:

[Figure: two document vectors $D_i$ and $D_j$ in a 3-dimensional space with axes Term 1, Term 2, Term 3.]

Vector Space Model: matching

Relevance is related to a vector similarity: $RSV(D, Q) \stackrel{def}{=} SIM(\vec{D}, \vec{Q})$

- Symmetry: $SIM(\vec{D}, \vec{Q}) = SIM(\vec{Q}, \vec{D})$
- Normalization: $SIM : V \rightarrow [min, max]$
- Reflexivity: $SIM(\vec{X}, \vec{X}) = max$


Vector Space Model: weighting

Weighting is based on counting the most frequent words, but also on keeping the most significant ones.

Zipf Law

Rank r and frequency f: $r \cdot f = const$

[Figure: term frequency $f_r$ decreasing hyperbolically with the rank r of terms $t_1 \dots t_r$.]

Heaps Law

Estimation of the vocabulary size V according to the size C of the corpus:

$V = K \cdot C^\beta$

For English: $K \in [10, 100]$ and $\beta \in [0.4, 0.6]$ (e.g., $V \approx 39\,000$ for C = 600 000, K = 50, $\beta$ = 0.5).

[Figure: V growing sublinearly with the corpus size.]

Term Frequency

The frequency $tf_{i,j}$ of the term $t_j$ in the document $D_i$ equals the number of occurrences of $t_j$ in $D_i$.

Taking a whole corpus (document database) into account, a term that occurs a lot does not discriminate documents. Compare:
- a term frequent in the whole corpus
- a term frequent in only one document of the corpus

Document Frequency: tf.idf

The document frequency $df_j$ of a term $t_j$ is the number of documents in which $t_j$ occurs. The larger the df, the worse the term from an IR point of view... so we very often use the inverse document frequency $idf_j$:

$idf_j = \frac{1}{df_j}$ or, classically, $idf_j = \log\left(\frac{|C|}{df_j}\right)$

with $|C|$ the size of the corpus, i.e. the number of documents.

The classical combination: $w_{i,j} = tf_{i,j} \cdot idf_j$.
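A minimal sketch of this tf.idf weighting on a toy two-document corpus (the data is illustrative):

```python
import math
from collections import Counter

# Sketch of tf.idf weighting: w[i][j] = tf[i][j] * idf[j],
# with idf_j = log(|C| / df_j).

corpus = {"D1": "information retrieval database salton".split(),
          "D2": "database sql database".split()}

tf = {d: Counter(tokens) for d, tokens in corpus.items()}
df = Counter(t for tokens in corpus.values() for t in set(tokens))
C = len(corpus)
idf = {t: math.log(C / df_t) for t, df_t in df.items()}

w = {d: {t: tf_t * idf[t] for t, tf_t in counts.items()}
     for d, counts in tf.items()}
print(w["D2"])  # 'database' gets weight 0: it occurs in every document
```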

Vector space matching

Matching function: based on the angle between the query vector $\vec{Q}$ and the document vector $\vec{D_i}$. The smaller the angle, the better the document matches the query.

[Figure: document vector $D_i$ and query vector $Q$ in a 3-dimensional term space.]

Vector space matching: cosine

One solution is to use the cosine of the angle between the query vector and the document vector:

$SIM_{\cos}(\vec{D_i}, \vec{Q}) = \frac{\sum_{k=1}^{n} w_{i,k}\, w_{Q,k}}{\sqrt{\sum_{k=1}^{n} (w_{i,k})^2}\ \sqrt{\sum_{k=1}^{n} (w_{Q,k})^2}}$
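A short sketch of this cosine on sparse term-to-weight dictionaries (an illustrative representation):

```python
import math

# Sketch of the cosine similarity above, on sparse dict vectors
# (term -> weight).

def cosine(d, q):
    dot = sum(w * q[t] for t, w in d.items() if t in q)
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    if norm_d == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_d * norm_q)

print(cosine({"a": 1, "b": 1}, {"a": 1}))  # 1/sqrt(2) ~ 0.71
```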

Vector space matching: set interpretation

With binary discrete values, one has a set interpretation $D_i^{\{\}}$ and $Q^{\{\}}$:

$SIM_{\cos}(\vec{D_i}, \vec{Q}) = \frac{|D_i^{\{\}} \cap Q^{\{\}}|}{\sqrt{|D_i^{\{\}}|}\ \sqrt{|Q^{\{\}}|}}$

$SIM_{\cos}(\vec{D_i}, \vec{Q}) = 0$: $D_i^{\{\}}$ and $Q^{\{\}}$ are disjoint; $SIM_{\cos}(\vec{D_i}, \vec{Q}) = 1$: $D_i^{\{\}}$ and $Q^{\{\}}$ are equal.

Other matching functions

Dice coefficient:

$SIM_{dice}(\vec{D_i}, \vec{Q}) = \frac{2 \sum_{k=1}^{N} w_{i,k}\, w_{Q,k}}{\sum_{k=1}^{N} w_{i,k} + \sum_{k=1}^{N} w_{Q,k}}$

Discrete Dice coefficient:

$SIM_{dice}(\vec{D_i}, \vec{Q}) = \frac{2\, |D_i^{\{\}} \cap Q^{\{\}}|}{|D_i^{\{\}}| + |Q^{\{\}}|}$


Exercise: matching cosine simplification

Question: transform the cosine formula so as to simplify the matching computation.

$SIM_{\cos}(\vec{D_i}, \vec{Q}) = \frac{\sum_{k=1}^{n} w_{i,k}\, w_{Q,k}}{\sqrt{\sum_{k=1}^{n} (w_{i,k})^2}\ \sqrt{\sum_{k=1}^{n} (w_{Q,k})^2}}$

Exercise: similarity, dissimilarity and distance

The Euclidean distance between points x and y is expressed by:

$L_2(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

Question: show the potential link between $L_2$ and the cosine.

Tips: consider points as vectors; consider document vectors as normalized, i.e. with $\sum_i y_i^2$ constant.

Exercise: dot product and list intersection

The index is usually very sparse: a (useful) term appears in less than 1% of the documents. A term t has a non-null weight in D iff t appears in document D.

Question: show that the algorithm computing the dot product is equivalent to a list intersection.

Exercise: efficiency of list intersection

Let X and Y be two lists of sizes $|X|$ and $|Y|$.

Question: compare the efficiency of a non-sorted list intersection algorithm and a sorted list intersection.


Link between dot product and inverted files

If $RSV(x, y) \propto \sum_i x_i y_i$ after some normalization of y, then:

$RSV(x, y) \propto \sum_{i,\ x_i \neq 0,\ y_i \neq 0} x_i y_i$

- Query: only process the terms that are in the query, i.e. whose weight is not null.
- Documents: store only non-null terms.
- Inverted file: access the non-null document weights for each term id.

The index

- Use an inverted file, as in the Boolean model
- Store the term frequency in the posting list
- Do not pre-compute the weights, but keep raw integer values in the index
- Compute the matching value on the fly, i.e. during the posting-list intersection

Matching: matrix product

With an inverted file, in practice, the matching computation is a matrix product:

        d1 d2 d3 d4 d5 d6 d7 d8
a        0  1  0  0  1  1  0  1
b        1  1  1  1  0  1  0  0
c        0  0  0  1  0  0  0  0
...
q = (1 1 0 0) over the terms (a, b, c, d)
q · M =  1  1  1  1  1  1  0  1   (Boolean product: the documents matching a ∨ b)
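A sketch of this on-the-fly computation: posting lists store raw term frequencies, and scores are accumulated term by term, only for the query's terms (the index layout and idf values are illustrative):

```python
from collections import defaultdict

# Sketch: term-at-a-time scoring over an inverted file.
# index: term -> posting list of (doc id, raw term frequency).

index = {"a": [("d2", 1), ("d5", 2), ("d6", 1), ("d8", 1)],
         "b": [("d1", 1), ("d2", 3), ("d4", 1), ("d6", 2)]}

def score(query_weights, index, idf):
    scores = defaultdict(float)
    for term, w_q in query_weights.items():        # only non-null query terms
        for doc, tf in index.get(term, []):        # only docs containing term
            scores[doc] += (tf * idf[term]) * w_q  # weight computed on the fly
    return sorted(scores.items(), key=lambda x: -x[1])

print(score({"a": 1.0, "b": 1.0}, index, {"a": 1.0, "b": 0.5}))
```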


Relevance feedback

To learn system relevance from user relevance:

[Flowchart: Start → first query → matching → answer presentation → user evaluation → if satisfied: end; otherwise: query reformulation, then matching again.]

Rocchio Formula

$\vec{Q_{i+1}} = \alpha \vec{Q_i} + \beta\, \vec{Rel_i} - \gamma\, \vec{nRel_i}$

With:
- $\vec{Rel_i}$: the cluster center of the relevant documents, i.e., the positive feedback
- $\vec{nRel_i}$: the cluster center of the non-relevant documents, i.e., the negative feedback

Note: when using this formula, the generated query vector may contain negative values.
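A minimal sketch of one Rocchio update on dense vectors; the α, β, γ values and the feedback vectors are illustrative:

```python
# Sketch of one Rocchio update. Negative components may appear, as noted.

def centroid(vectors):
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def rocchio(q, rel, nrel, alpha=1.0, beta=0.75, gamma=0.15):
    c_rel, c_nrel = centroid(rel), centroid(nrel)
    return [alpha * q[k] + beta * c_rel[k] - gamma * c_nrel[k]
            for k in range(len(q))]

q0 = [1.0, 0.0, 0.0]
rel = [[0.8, 0.4, 0.0], [0.6, 0.2, 0.0]]   # judged relevant
nrel = [[0.0, 0.0, 0.9]]                   # judged non-relevant
print(rocchio(q0, rel, nrel))              # last component is negative
```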


Probabilistic Models

Capture the IR problem in a probabilistic framework:
- first probabilistic model (Binary Independence Retrieval model) by Robertson and Spärck Jones in 1976
- ...
- late 90s: emergence of language models, still a hot topic in IR

Overall question: what is the probability for a document to be relevant to a query? There are several interpretations of this sentence.

Models

[Diagram: classical models — Boolean, Vector Space, Probabilistic. Probabilistic models: classical [Robertson & Spärck Jones], inference networks [Turtle & Croft], language models [Croft], [Hiemstra], [Nie].]

Probabilistic Model of IR

Different approaches to a probabilistic view of information retrieval:
- Classical approach: probability of the event Relevant knowing one document and one query.
- Inference networks approach: probability that the query is true after inference from the content of a document.
- Language models approach: probability that a query is generated from a document.

Binary Independence Retrieval Model

Robertson & Spärck Jones, 1976:
- computes the relevance of a document from the relevance known a priori from other documents
- achieved by estimating the contribution of each indexing term to the relevance of the document, and by using the Bayes theorem and a decision rule.

Binary Independence Retrieval Model

- R: binary random variable; R = r: relevant, R = $\bar{r}$: non-relevant
- $P(R = r | \vec{d}, q)$: probability that R is relevant for the document and the query considered (noted $P(r | \vec{d}, q)$)
- The probability of relevance depends only on the document and the query
- Term weights are binary: $\vec{d} = (1 1 \dots 1 0 0 \dots)$, $w_t^d = 0$ or $1$
- Each term t is characterized by a binary variable $w_t$, indicating the probability that the term occurs:
  - $P(w_t = 1 | q, r)$: probability that t occurs in a relevant document
  - $P(w_t = 0 | q, r) = 1 - P(w_t = 1 | q, r)$
- The terms are conditionally independent given R

Binary Independence Retrieval Model

[Diagram: for a query Q, the corpus is split into the relevant documents (R = r) and the non-relevant documents (R = $\bar{r}$): Corpus = Rel ∪ NonRel and Rel ∩ NonRel = ∅.]

Binary Independence Retrieval Model

Matching function: use of the Bayes theorem.

$P(r | \vec{d}, q) = \frac{P(\vec{d} | r, q)\ P(r, q)}{P(\vec{d}, q)}$

- $P(\vec{d} | r, q)$: probability of obtaining the description $\vec{d}$ from the observed relevances
- $P(r, q)$: the relevance probability, i.e. the chance of randomly taking one document from the corpus which is relevant for the query q
- $P(r | \vec{d}, q)$: probability that the document $\vec{d}$ belongs to the set of relevant documents of the query q
- $P(\vec{d}, q)$: probability that the document $\vec{d}$ is picked for q

BIR: matching function

Decision: the document is retrieved if:

$\frac{P(r | \vec{d}, q)}{P(\bar{r} | \vec{d}, q)} = \frac{P(\vec{d} | r, q)\ P(r, q)}{P(\vec{d} | \bar{r}, q)\ P(\bar{r}, q)} > 1$

IR looks for a ranking, so we eliminate $\frac{P(r, q)}{P(\bar{r}, q)}$, which is constant for a given query.

BIR: matching function

In IR, it is simpler to use logs:

$rsv(\vec{d}) \stackrel{rank}{=} \log\left(\frac{P(\vec{d} | r, q)}{P(\vec{d} | \bar{r}, q)}\right)$

BIR: matching function

Simplification using the hypothesis of independence between terms (binary independence), with weight $w_t^d$ for term t in $\vec{d}$:

$P(\vec{d} | r, q) = P(\vec{d} = (10...110...) | r, q) = \prod_{w_t^d = 1} P(w_t^d = 1 | r, q)\ \prod_{w_t^d = 0} P(w_t^d = 0 | r, q)$

$P(\vec{d} | \bar{r}, q) = P(\vec{d} = (10...110...) | \bar{r}, q) = \prod_{w_t^d = 1} P(w_t^d = 1 | \bar{r}, q)\ \prod_{w_t^d = 0} P(w_t^d = 0 | \bar{r}, q)$

BIR: matching function

Let's simplify the notations:

$P(w_t^d = 1 | r, q) = p_t$, so $P(w_t^d = 0 | r, q) = 1 - p_t$
$P(w_t^d = 1 | \bar{r}, q) = u_t$, so $P(w_t^d = 0 | \bar{r}, q) = 1 - u_t$

Hence:

$P(\vec{d} | r, q) = \prod_{w_t^d = 1} p_t\ \prod_{w_t^d = 0} (1 - p_t)$
$P(\vec{d} | \bar{r}, q) = \prod_{w_t^d = 1} u_t\ \prod_{w_t^d = 0} (1 - u_t)$

BIR: matching function

Now let's compute the Relevance Status Value:

$rsv(\vec{d}) \stackrel{rank}{=} \log\left(\frac{P(\vec{d} | r, q)}{P(\vec{d} | \bar{r}, q)}\right) = \log\left(\frac{\prod_{w_t^d = 1} p_t\ \prod_{w_t^d = 0} (1 - p_t)}{\prod_{w_t^d = 1} u_t\ \prod_{w_t^d = 0} (1 - u_t)}\right)$
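A sketch of this rsv as a sum of log ratios over the vocabulary, assuming the $p_t$ and $u_t$ have already been estimated (the values below are illustrative):

```python
import math

# Sketch of the BIR relevance status value: the log of the ratio of the
# two products above, computed as a sum over the vocabulary.

def rsv(doc_terms, vocabulary, p, u):
    s = 0.0
    for t in vocabulary:
        if t in doc_terms:                    # w_t^d = 1
            s += math.log(p[t] / u[t])
        else:                                 # w_t^d = 0
            s += math.log((1 - p[t]) / (1 - u[t]))
    return s

p = {"t1": 0.8, "t2": 0.3}   # P(w_t = 1 | r, q)
u = {"t1": 0.2, "t2": 0.3}   # P(w_t = 1 | not r, q)
print(rsv({"t1"}, ["t1", "t2"], p, u))  # log(4) ~ 1.39
```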

Conclusion

Common steps for indexing:
- Define the index set
- Automate index construction from documents
- Select a model for index representation and weighting
- Define a matching process and an associated ranking

This is used for textual processing, but also for other media.


Solution: matching cosine simplification

Because $\vec{Q}$ is constant, the norm of vector $\vec{Q}$ is a constant that can be removed:

$K = \frac{1}{\sqrt{\sum_{k=1}^{n} (w_{Q,k})^2}}$

$SIM_{\cos}(\vec{D_i}, \vec{Q}) = K\ \frac{\sum_{k=1}^{n} w_{i,k}\, w_{Q,k}}{\sqrt{\sum_{k=1}^{n} (w_{i,k})^2}}$

Solution: matching cosine simplification

The division of the document vector by its norm can be precomputed:

$w'_{i,k} = \frac{w_{i,k}}{\sqrt{\sum_{k=1}^{n} (w_{i,k})^2}}$

Solution: matching cosine simplification

Hence, one just has to compute a vector dot product:

$SIM_{\cos}(\vec{D_i}, \vec{Q}) = K \sum_{k=1}^{n} w'_{i,k}\, w_{Q,k}$

The constant does not influence the order:

$RSV_{\cos}(\vec{D_i}, \vec{Q}) \propto \sum_{k=1}^{n} w'_{i,k}\, w_{Q,k}$
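A sketch of this simplification: once document vectors are pre-normalized, ranking by plain dot product is ranking by cosine (toy data, illustrative names):

```python
import math

# Sketch: pre-normalizing document vectors turns cosine ranking into a
# plain dot product (the query norm K is dropped, as derived above).

def normalize(d):
    n = math.sqrt(sum(w * w for w in d.values()))
    return {t: w / n for t, w in d.items()}

def dot(d, q):
    return sum(w * q.get(t, 0.0) for t, w in d.items())

docs = {"D1": {"a": 1.0, "b": 1.0}, "D2": {"a": 2.0}}
normed = {name: normalize(d) for name, d in docs.items()}
q = {"a": 1.0, "b": 0.5}
print(sorted(normed, key=lambda n: -dot(normed[n], q)))  # cosine ranking
```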

Solution: similarity, dissimilarity and distance

To transform a distance or dissimilarity into a similarity, simply negate the value. E.g. with the squared Euclidean distance:

$RSV_{D_2}(x, y) = -D_2(x, y)^2 = -\sum_i (x_i - y_i)^2 = -\sum_i (x_i^2 + y_i^2 - 2 x_i y_i) = -\sum_i x_i^2 - \sum_i y_i^2 + 2 \sum_i x_i y_i$

If x is the query, then $\sum_i x_i^2$ is constant for matching with a document set:

$RSV_{D_2}(x, y) \propto -\sum_i y_i^2 + 2 \sum_i x_i y_i$

Solution: similarity, dissimilarity and distance

$RSV_{D_2}(x, y) \propto -\sum_i y_i^2 + 2 \sum_i x_i y_i$

If $\sum_i y_i^2$ is constant over the corpus, then:

$RSV_{D_2}(x, y) \propto 2 \sum_i x_i y_i \propto \sum_i x_i y_i$

Hence, if we normalize the corpus so that each document vector length is a constant, then using the Euclidean distance as a similarity provides the same results as the cosine of the vector space model!

Solution: dot product and list intersection

In the dot product $\sum_i x_i y_i$, if the term at rank i is not in x or not in y, then the product $x_i y_i$ is null and does not participate in the value. Hence, if we compute $\sum_i x_i y_i$ directly using a list of terms for each document, then we sum the $x_i y_i$ only for the terms that are in the intersection. Hence, it is equivalent to a list intersection.

Solution: efficiency of list intersection

If X and Y are not sorted, for each $x_i$ one must find the term in Y by a sequential search, so the complexity is $O(|X| \cdot |Y|)$.

If X and Y are sorted, the algorithm uses two pointers:
- Before the two pointers, the lists are already merged.
- Compare the pointed terms: if the terms are equal, compute the merge (or the product), then move both pointers; if the terms are different, because of the order, the smaller term is sure not to be present in the other list, so we can move its pointer.

Each step moves at least one pointer on one of the lists, so the complexity is $O(|X| + |Y|)$.
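A sketch of the two-pointer intersection just described (illustrative doc-id lists):

```python
# Sketch of the sorted-list two-pointer intersection in O(|X| + |Y|):
# each comparison advances at least one pointer.

def intersect(X, Y):
    out, i, j = [], 0, 0
    while i < len(X) and j < len(Y):
        if X[i] == Y[j]:
            out.append(X[i]); i += 1; j += 1
        elif X[i] < Y[j]:
            i += 1          # X[i] cannot appear later in Y
        else:
            j += 1          # Y[j] cannot appear later in X
    return out

print(intersect([1, 3, 5, 8, 9], [2, 3, 8, 10]))  # [3, 8]
```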