The Boolean Model ~1955


The Boolean model is the first, the most criticized, and (until a few years ago) the most widespread commercial model of IR. Its functionality can often be found in the "Advanced Search" pages of many search engines.

A document is represented as an and of index terms:
d1 = (and jazz saxophone Coltrane)

A query is represented as a combination, built with {and, or, not}, of index terms belonging to a controlled vocabulary:
q1 = (and jazz (or saxophone clarinet) (or Parker Coltrane))

The matching function is RSV(d1, q1) = 1 if q1 is a logical consequence of d1 according to Boolean logic.

Boolean queries: Exact match

Boolean queries combine query terms with AND, OR and NOT. The model views each document as a set of words and is precise: a document either matches the condition or it does not. The Boolean model was the primary commercial retrieval tool for three decades, and professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you are getting.

Closed World Assumption

There is an implicit Closed World Assumption in the interpretation of the document representations: the absence of term t in the representation of d is equivalent to the presence of (not t) in that representation. Given the document d1 = (and jazz saxophone Coltrane) and the query q2 = (and jazz (not Parker)), then RSV(d1, q2) = 1 even though q2 is not a classical logical consequence of d1.

This means that the true definition of RSV(di, qj) ∈ {0, 1} is:

RSV(di, t) = 1 if t ∈ IREP(di), 0 otherwise
RSV(di, (not qj)) = 1 − RSV(di, qj)
RSV(di, (and q1 ... qn)) = min{RSV(di, q1), ..., RSV(di, qn)}
RSV(di, (or q1 ... qn)) = max{RSV(di, q1), ..., RSV(di, qn)}
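The four clauses above translate directly into a recursive evaluator. A minimal sketch, assuming documents are represented as plain sets of index terms and queries as nested tuples mirroring the (and/or/not ...) notation; all names are illustrative:

```python
# RSV under the Closed World Assumption: any term absent from the
# document's term set counts as false. Queries are nested tuples such as
# ('and', 'jazz', ('not', 'Parker')).

def rsv(doc_terms, query):
    """Return 1 if the document satisfies the Boolean query, else 0."""
    if isinstance(query, str):                 # bare index term
        return 1 if query in doc_terms else 0  # CWA: absent => 0
    op, *args = query
    if op == 'not':
        return 1 - rsv(doc_terms, args[0])
    if op == 'and':
        return min(rsv(doc_terms, a) for a in args)
    if op == 'or':
        return max(rsv(doc_terms, a) for a in args)
    raise ValueError(f"unknown operator: {op}")

d1 = {'jazz', 'saxophone', 'Coltrane'}
q1 = ('and', 'jazz', ('or', 'saxophone', 'clarinet'),
      ('or', 'Parker', 'Coltrane'))
q2 = ('and', 'jazz', ('not', 'Parker'))
print(rsv(d1, q1), rsv(d1, q2))  # 1 1
```

Note that q2 matches d1 only because of the CWA clause: membership testing makes (not Parker) true whenever Parker is simply absent.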

In queries, expert users tend to use only faceted queries, i.e. disjunctions of quasi-synonyms (facets) conjoined through and:
(and (or jazz classical) (or sax* clarinet flute) (or Parker Coltrane))

Extensions:
Truncation of the query term, e.g. q3 = (and (or sax* clarinet) (or Parker Coltrane))
Information on adjacency, distance and word-order invertibility, e.g. q4 = (and jazz electric prox[0,f] guitar)
Possibility to restrict the search to given subfields such as title, author, abstract, publication date, etc. (fielded search).
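Truncation such as sax* is typically implemented by expanding the pattern into all vocabulary terms sharing the prefix, which a sorted term list makes cheap. A hypothetical sketch (the vocabulary and function name are illustrative):

```python
# Expand a truncated query term like "sax*" into every vocabulary term
# with that prefix, via binary search over a sorted term list.
import bisect

def expand_truncated(vocab_sorted, pattern):
    """Expand 'prefix*' to all matching terms; return other terms as-is."""
    if not pattern.endswith('*'):
        return [pattern]
    prefix = pattern[:-1]
    lo = bisect.bisect_left(vocab_sorted, prefix)
    # '\uffff' sorts after any character, so this bound is past all matches
    hi = bisect.bisect_left(vocab_sorted, prefix + '\uffff')
    return vocab_sorted[lo:hi]

vocab = sorted(['clarinet', 'flute', 'sax', 'saxophone', 'saxophonist'])
print(expand_truncated(vocab, 'sax*'))  # ['sax', 'saxophone', 'saxophonist']
```

The expanded terms would then be joined by or, exactly like the hand-written facets above.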

Query example

Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia?

Term-document incidence matrix (1 if the play contains the word, 0 otherwise):

           Antony and  Julius  The      Hamlet  Othello  Macbeth
           Cleopatra   Caesar  Tempest
Antony        1          1       0        0       0        1
Brutus        1          1       0        1       0        0
Caesar        1          1       0        1       1        1
Calpurnia     0          1       0        0       0        0
Cleopatra     1          0       0        0       0        0
mercy         1          0       1        1       1        1
worser        1          0       1        1       1        0

Incidence vectors

So we have a 0/1 vector for each term. To answer the query, take the vectors for Brutus, Caesar and Calpurnia (complemented) and bitwise AND them:

110100 AND 110111 AND 101111 = 100100

Bigger corpora

Consider n = 1M documents, each with about 1K terms. At an average of 6 bytes per term (including spaces and punctuation), that is 6GB of data in the documents. Say there are m = 500K distinct terms among these.
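The bitwise AND above can be reproduced directly, using Python integers as bit vectors (leftmost play = most significant bit):

```python
# Brutus AND Caesar AND NOT Calpurnia over the 0/1 incidence vectors.
brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000
mask      = 0b111111                 # six plays in the collection

result = brutus & caesar & (~calpurnia & mask)  # complement, then AND
print(format(result, '06b'))         # 100100 -> Antony and Cleopatra, Hamlet
```

The mask is needed only because Python integers are unbounded, so the complement must be cut back to six bits.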

Can't build the matrix

A 500K x 1M matrix has half a trillion 0's and 1's, but no more than one billion 1's: the matrix is extremely sparse. What's a better representation? We only record the 1 positions.

Inverted index

For each term t, we must store a list of all documents that contain t. Do we use an array or a list for this?

Brutus    -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 13 21 34
Caesar    -> 13 16

What happens if the word Caesar is added to document 14?

Inverted index

Linked lists are generally preferred to arrays:
Pro: dynamic space allocation
Pro: insertion of terms into documents is easy
Con: space overhead of pointers

The dictionary holds the terms; each term points to its postings list, sorted by docID (more later on why):

Brutus    -> 2 4 8 16 32 64 128
Calpurnia -> 1 2 3 5 8 13 21 34
Caesar    -> 13 16

Inverted index construction

Documents to be indexed ("Friends, Romans, countrymen.") pass through a tokenizer, producing the token stream (Friends, Romans, Countrymen), then through linguistic modules, producing the modified tokens (friend, roman, countryman; more on these later), and finally through the indexer, which builds the inverted index:

friend     -> 2 4
roman      -> 1 2
countryman -> 13 16

Indexer steps

Start from a sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

This yields, in document order:
(i,1) (did,1) (enact,1) (julius,1) (caesar,1) (i,1) (was,1) (killed,1) (i',1) (the,1) (capitol,1) (brutus,1) (killed,1) (me,1) (so,2) (let,2) (it,2) (be,2) (with,2) (caesar,2) (the,2) (noble,2) (brutus,2) (hath,2) (told,2) (you,2) (caesar,2) (was,2) (ambitious,2)

Sort by term (the core indexing step), breaking ties by document ID:
(ambitious,2) (be,2) (brutus,1) (brutus,2) (capitol,1) (caesar,1) (caesar,2) (caesar,2) (did,1) (enact,1) (hath,2) (i,1) (i,1) (i',1) (it,2) (julius,1) (killed,1) (killed,1) (let,2) (me,1) (noble,2) (so,2) (the,1) (the,2) (told,2) (you,2) (was,1) (was,2) (with,2)
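The indexer steps can be sketched end to end in a few lines. This is a minimal in-memory version, assuming a crude tokenizer that splits on non-letters and lowercases (no real linguistic modules):

```python
# Sort-based indexing: emit (term, docID) pairs, sort them, then merge
# duplicate pairs within a document into a frequency count.
import re

docs = {
    1: "I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.",
    2: "So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.",
}

pairs = [(tok.lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in re.findall(r"[a-zA-Z']+", text)]
pairs.sort()                      # core indexing step: by term, then docID

index = {}                        # term -> {docID: frequency}
for term, doc_id in pairs:
    postings = index.setdefault(term, {})
    postings[doc_id] = postings.get(doc_id, 0) + 1

print(index['brutus'])   # {1: 1, 2: 1}
print(index['caesar'])   # {1: 1, 2: 2}
```

Real indexers work the same way but in blocks on disk, since the pair sequence for a large collection does not fit in memory.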

Multiple term entries in a single document are merged, and frequency information is added (why frequency? we will discuss this later):

(ambitious,2,1) (be,2,1) (brutus,1,1) (brutus,2,1) (capitol,1,1) (caesar,1,1) (caesar,2,2) (did,1,1) (enact,1,1) (hath,2,1) (i,1,2) (i',1,1) (it,2,1) (julius,1,1) (killed,1,2) (let,2,1) (me,1,1) (noble,2,1) (so,2,1) (the,1,1) (the,2,1) (told,2,1) (you,2,1) (was,1,1) (was,2,1) (with,2,1)

The result is split into a dictionary file and a postings file. For each term, the dictionary records the number of documents containing it and its total frequency, plus a pointer into the postings file, which stores the (doc ID, frequency) pairs. For example:

brutus: 2 docs, total freq 2 -> postings (1,1) (2,1)
caesar: 2 docs, total freq 3 -> postings (1,1) (2,2)

Where do we pay in storage?

The dictionary pays for the terms and a pointer per term; the postings file pays for the (doc ID, frequency) entries, which dominate the storage cost.

Query processing

Consider processing the query Brutus AND Caesar:
Locate Brutus in the dictionary and retrieve its postings.
Locate Caesar in the dictionary and retrieve its postings.
Merge the two postings:

Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34

The merge

Walk through the two postings simultaneously, in time linear in the total number of postings entries:

Brutus -> 2 4 8 16 32 64 128
Caesar -> 1 2 3 5 8 13 21 34
Result -> 2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: postings are sorted by docID.

More general merges

Exercise: adapt the merge for the queries
Brutus AND NOT Caesar
Brutus OR NOT Caesar
Can we still run through the merge in time O(x+y)?
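The two-pointer walk above is short enough to write out in full:

```python
# Linear-time intersection of two postings lists sorted by docID:
# advance the pointer of whichever list has the smaller current docID,
# so an AND of lists of length x and y costs O(x + y).
def intersect(p1, p2):
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))   # [2, 8]
```

This is also why sorting by docID is crucial: on unsorted lists, each docID of one list would have to be searched for in the other.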

Query optimization

What about an arbitrary Boolean formula, e.g. (Brutus OR Caesar) AND NOT (Antony OR Cleopatra)? Can we always merge in linear time? Consider (Brutus AND Caesar) AND Calpurnia. Can we do better?

Beyond term search

What about phrases? Proximity: find Gates NEAR Microsoft; this requires the index to capture position information in docs. Zones in documents: find documents with (author = Ullman) AND (text contains "automata").
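For a multi-term AND, the standard answer to "can we do better?" is to intersect the rarest postings list first, so intermediate results stay as small as possible. A sketch over the toy postings from the slides (the simple membership-test intersection here is for brevity, not efficiency):

```python
# Process the terms of an AND query in order of increasing document
# frequency (shortest postings list first).
postings = {
    'brutus':    [2, 4, 8, 16, 32, 64, 128],
    'calpurnia': [13, 16],
    'caesar':    [1, 2, 3, 5, 8, 13, 21, 34],
}

def and_query(terms, postings):
    lists = sorted((postings[t] for t in terms), key=len)
    result = lists[0]
    for plist in lists[1:]:
        result = [d for d in result if d in plist]
        if not result:            # early exit: the AND is already empty
            break
    return result

print(and_query(['brutus', 'calpurnia'], postings))            # [16]
print(and_query(['brutus', 'caesar', 'calpurnia'], postings))  # []
```

In the three-term query, starting from Calpurnia's two postings keeps every intermediate result at most two entries long, whereas starting from Caesar would carry eight.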

Evidence accumulation

1 vs. 0 occurrences of a search term, 2 vs. 1, 3 vs. 2, and so on: we need term frequency information in docs.

Ranking search results

Boolean queries only give inclusion or exclusion of docs; we also need to measure how close each doc is to the query.

Advantages of the BM

The possibility to formulate structured queries, e.g. to distinguish synonymy, (or t1 t2), from noun phrases, (and t1 prox[0,f] t2).

For expert users, intuitiveness: to those familiar with Boolean logic, it is immediately clear why a document has or has not been retrieved for a given query. This allows query refinement and query reformulation on the part of the user.

Efficiency, obtained through the use of inverted files (IFs), the data structures in secondary storage in which document representations are physically stored.

Disadvantages of the BM

1. Intimidating (in general, there is no automatic query acquisition). The proponents of the BM were mathematicians, but lay users find the language of the BM unnatural.

2. Lack of output magnitude control after a given query, unless the user knows the distribution of topics in the collection well. This is a consequence of the fact that RSV takes binary values.

3. The output obtained as a result of a given query is not ranked with respect to the estimated degree (probability) of relevance.

4. No importance factors can be attached to the index terms in the IREPs of documents and queries.

5. Flattened, hence unintuitive, results (lack of output discrimination).

Boolean systems have thus mainly been used by expert intermediaries and are widely considered unfit for the needs of casual (non-professional) information seekers.