Information Retrieval Using Boolean Model SEEM5680



Unstructured (text) vs. structured (database) data in 1996

Unstructured (text) vs. structured (database) data in 2009

The problem of IR
Goal: find documents relevant to an information need from a large document set.
[Diagram] Info. need → Query → IR system (retrieval over the Document collection) → Answer list

Example: Google web search.

Boolean Model for IR
Queries are Boolean expressions, e.g., Caesar AND Brutus. The search engine returns all documents that satisfy the Boolean expression.

Boolean queries: Exact match
Queries use AND, OR and NOT together with query terms. Views each document as a set of words. Is precise: a document either matches the condition or it does not. Boolean retrieval was the primary commercial retrieval tool for three decades. Professional searchers (e.g., lawyers) still like Boolean queries: you know exactly what you're getting.

Example: WestLaw http://www.westlaw.com/
Largest commercial (paying subscribers) legal search service (started 1975; ranking added 1992). About 7 terabytes of data; 700,000 users. The majority of users still use Boolean queries.

Example: WestLaw http://www.westlaw.com/
Example query: What is the statute of limitations in cases involving the federal tort claims act?
LIMIT! /3 STATUTE ACTION /S FEDERAL /2 TORT /3 CLAIM
(/3 = within 3 words, /S = in the same sentence.) Long, precise queries; proximity operators; incrementally developed; not like web search.

Unstructured Data in 1650
Which plays of Shakespeare contain the words Brutus AND Caesar but NOT Calpurnia? One could grep all of Shakespeare's plays for Brutus and Caesar, then strip out lines containing Calpurnia. But this is slow (for large corpora), NOT Calpurnia is non-trivial, and other operations (e.g., finding the phrase Romans and countrymen) are not feasible.

Term-document incidence
Query: Brutus AND Caesar but NOT Calpurnia

              Antony &   Julius   The       Hamlet   Othello   Macbeth
              Cleopatra  Caesar   Tempest
Antony            1         1        0         0        0         1
Brutus            1         1        0         1        0         0
Caesar            1         1        0         1        1         1
Calpurnia         0         1        0         0        0         0
Cleopatra         1         0        0         0        0         0
mercy             1         0        1         1        1         1
worser            1         0        1         1        1         0

(1 if the play contains the word, 0 otherwise)

Incidence vectors
So we have a 0/1 vector for each term. To answer the query, take the vectors for Brutus, Caesar, and Calpurnia (complemented) and bitwise AND them: 110100 AND 110111 AND 101111 = 100100.
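As a minimal sketch of this computation in Python (the bit patterns are read off the incidence matrix above; the variable names are mine):

```python
# Incidence vectors from the term-document matrix above, one bit per play:
plays = ["Antony and Cleopatra", "Julius Caesar", "The Tempest",
         "Hamlet", "Othello", "Macbeth"]

brutus    = 0b110100
caesar    = 0b110111
calpurnia = 0b010000

# Brutus AND Caesar AND NOT Calpurnia:
# complement Calpurnia's vector (masked to 6 bits), then bitwise AND.
result = brutus & caesar & (~calpurnia & 0b111111)

print(f"{result:06b}")  # -> 100100
# Bit i (from the left) corresponds to plays[i].
matches = [p for i, p in enumerate(plays) if result & (1 << (5 - i))]
print(matches)  # -> ['Antony and Cleopatra', 'Hamlet']
```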

Answers to query

Antony and Cleopatra, Act III, Scene ii:
Agrippa [Aside to DOMITIUS ENOBARBUS]: Why, Enobarbus,
When Antony found Julius Caesar dead,
He cried almost to roaring; and he wept
When at Philippi he found Brutus slain.

Hamlet, Act III, Scene ii:
Lord Polonius: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.

Bigger document collections
Consider N = 1 million documents, each with about 1K terms. At an average of 6 bytes per term (including spaces and punctuation), that is 1M × 1K × 6 bytes = 6 GB of data in the documents. Say there are M = 500K distinct terms among these.

Can't build the matrix
A 500K × 1M matrix has half a trillion 0's and 1's, but no more than one billion 1's. Why? Each of the 1M documents contains about 1K terms, so there are at most 1M × 1K = 10⁹ term occurrences; the matrix is extremely sparse. What's a better representation? We record only the 1 positions.

Inverted index
For each term T, store a list of all documents that contain T. Do we use an array or a list for this?

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

What happens if the word Caesar is added to document 14?

Inverted index
Linked lists are generally preferred to arrays: dynamic space allocation and easy insertion of terms from new documents, at the cost of the space overhead of pointers.

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 13 21 34
Caesar    → 13 16

The terms form the Dictionary; the lists of document IDs are the Postings, sorted by docID (more later on why).
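A minimal Python sketch of this structure (the postings mirror the example lists above; `insort` on a Python list stands in for the linked-list insertion the slide describes):

```python
from bisect import insort

# Dictionary maps each term to its postings list, kept sorted by docID.
index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 13, 21, 34],
    "Caesar":    [13, 16],
}

def add_term(index, term, doc_id):
    """Insert doc_id into term's postings, preserving sorted order."""
    postings = index.setdefault(term, [])
    if doc_id not in postings:        # avoid duplicate entries
        insort(postings, doc_id)

# The slide's question: the word Caesar is added to document 14.
add_term(index, "Caesar", 14)
print(index["Caesar"])  # -> [13, 14, 16]
```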

Inverted index construction

Documents to be indexed: Friends, Romans, countrymen.
  ↓ Tokenizer
Token stream: Friends Romans Countrymen
  ↓ Linguistic modules (more on these later)
Modified tokens: friend roman countryman
  ↓ Indexer
Inverted index: friend → 2 4; roman → 1 2; countryman → 13 16

Indexer steps
Generate a sequence of (modified token, document ID) pairs.

Doc 1: I did enact Julius Caesar: I was killed i' the Capitol; Brutus killed me.
Doc 2: So let it be with Caesar. The noble Brutus hath told you Caesar was ambitious.

Pairs (term, doc#): I 1, did 1, enact 1, julius 1, caesar 1, I 1, was 1, killed 1, i' 1, the 1, capitol 1, brutus 1, killed 1, me 1, so 2, let 2, it 2, be 2, with 2, caesar 2, the 2, noble 2, brutus 2, hath 2, told 2, you 2, caesar 2, was 2, ambitious 2.

Sort the pairs by term: the core indexing step. After sorting: ambitious 2, be 2, brutus 1, brutus 2, capitol 1, caesar 1, caesar 2, caesar 2, did 1, enact 1, hath 2, I 1, I 1, i' 1, it 2, julius 1, killed 1, killed 1, let 2, me 1, noble 2, so 2, the 1, the 2, told 2, was 1, was 2, with 2, you 2.

Multiple entries for a term within a single document are merged, and frequency information is added. (Why frequency? Will discuss later.) Result (term, doc#, freq): ambitious 2 1; be 2 1; brutus 1 1; brutus 2 1; capitol 1 1; caesar 1 1; caesar 2 2; did 1 1; enact 1 1; hath 2 1; I 1 2; i' 1 1; it 2 1; julius 1 1; killed 1 2; let 2 1; me 1 1; noble 2 1; so 2 1; the 1 1; the 2 1; told 2 1; was 1 1; was 2 1; with 2 1; you 2 1.

The result is split into a Dictionary file and a Postings file.

Dictionary (term, #docs, total freq): ambitious 1 1; be 1 1; brutus 2 2; capitol 1 1; caesar 2 3; did 1 1; enact 1 1; hath 1 1; I 1 2; i' 1 1; it 1 1; julius 1 1; killed 1 2; let 1 1; me 1 1; noble 1 1; so 1 1; the 2 2; told 1 1; was 2 2; with 1 1; you 1 1.

Postings (doc#, freq) runs for each dictionary entry, e.g.: brutus → (1,1) (2,1); caesar → (1,1) (2,2); the → (1,1) (2,1); was → (1,1) (2,1).
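The whole indexer pipeline above fits in a few lines of Python. This is a sketch, not the lecture's reference implementation; it assumes whitespace tokenization plus lowercasing in place of the real linguistic modules:

```python
from collections import Counter

docs = {
    1: "I did enact Julius Caesar I was killed i' the Capitol Brutus killed me",
    2: "So let it be with Caesar The noble Brutus hath told you Caesar was ambitious",
}

# Step 1: sequence of (term, docID) pairs (crude tokenizer: split + lowercase).
pairs = [(tok.lower(), doc_id)
         for doc_id, text in docs.items()
         for tok in text.split()]

# Step 2: sort by term (the core indexing step), then by docID.
pairs.sort()

# Step 3: merge duplicate (term, docID) entries and record frequencies.
freq = Counter(pairs)  # (term, docID) -> within-document frequency

# Step 4: split into dictionary and postings.
postings = {}   # term -> [(docID, freq), ...], sorted by docID
for (term, doc_id), f in sorted(freq.items()):
    postings.setdefault(term, []).append((doc_id, f))

dictionary = {term: (len(pl), sum(f for _, f in pl))  # (#docs, total freq)
              for term, pl in postings.items()}

print(dictionary["caesar"])   # -> (2, 3)
print(postings["caesar"])     # -> [(1, 1), (2, 2)]
```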

Query processing
Consider processing the query Brutus AND Caesar: locate Brutus in the Dictionary and retrieve its postings; locate Caesar in the Dictionary and retrieve its postings; then merge the two postings lists:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34

The merge
Walk through the two postings lists simultaneously, in time linear in the total number of postings entries:

Brutus → 2 4 8 16 32 64 128
Caesar → 1 2 3 5 8 13 21 34
Result → 2 8

If the list lengths are x and y, the merge takes O(x+y) operations. Crucial: the postings are sorted by docID.

Basic postings intersection
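The textbook presents this intersection as pseudocode; here is a sketch of the same two-pointer merge in Python (the function and variable names are mine, not the slide's):

```python
def intersect(p1, p2):
    """Intersect two postings lists sorted by docID in O(len(p1) + len(p2))."""
    answer = []
    i = j = 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])    # docID present in both lists
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1                  # advance the pointer at the smaller docID
        else:
            j += 1
    return answer

brutus = [2, 4, 8, 16, 32, 64, 128]
caesar = [1, 2, 3, 5, 8, 13, 21, 34]
print(intersect(brutus, caesar))    # -> [2, 8]
```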

Query optimization
What is the best order for query processing? Consider a query that is an AND of t terms: for each of the t terms, get its postings, then AND them together.

Brutus    → 2 4 8 16 32 64 128
Calpurnia → 1 2 3 5 8 16 21 34
Caesar    → 13 16

Query: Brutus AND Calpurnia AND Caesar

Query optimization example
Process terms in order of increasing document frequency: start with the smallest postings set, then keep cutting further. (This is why we keep frequency in the dictionary.)

Brutus    → 2 4 8 16 32 64 128   (freq 7)
Calpurnia → 1 2 3 5 8 16 21 34   (freq 8)
Caesar    → 13 16                (freq 2)

Execute the query as (Caesar AND Brutus) AND Calpurnia.

Query optimization
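A sketch of this optimized conjunctive processing, reusing the intersect helper above (the name intersect_many and the use of postings-list length in place of the dictionary's stored frequencies are my own assumptions):

```python
def intersect_many(index, terms):
    """AND together the postings of all terms, cheapest list first."""
    # Order terms by document frequency (here: postings list length).
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    result = index.get(ordered[0], [])
    for term in ordered[1:]:
        if not result:              # early exit: intersection already empty
            break
        result = intersect(result, index.get(term, []))
    return result

index = {
    "Brutus":    [2, 4, 8, 16, 32, 64, 128],
    "Calpurnia": [1, 2, 3, 5, 8, 16, 21, 34],
    "Caesar":    [13, 16],
}
# Processes as (Caesar AND Brutus) AND Calpurnia.
print(intersect_many(index, ["Brutus", "Calpurnia", "Caesar"]))  # -> [16]
```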

More general optimization
E.g., (madding OR crowd) AND (ignoble OR strife). Get frequencies for all terms. Estimate the size of each OR by the sum of its terms' frequencies; this is conservative (an upper bound), since a document containing several of the OR'ed terms is counted once per term. Process the ORs in increasing order of estimated size.
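A sketch of that estimate (the document frequencies below are hypothetical numbers invented for illustration, not values from the lecture):

```python
def estimate_or_size(df, or_terms):
    """Upper bound on the size of an OR of terms: the sum of their
    document frequencies. Conservative because a document containing
    several of the terms is counted once per term."""
    return sum(df.get(t, 0) for t in or_terms)

# Hypothetical document frequencies for the example query.
df = {"madding": 10, "crowd": 400, "ignoble": 5, "strife": 80}

query = [("madding", "crowd"), ("ignoble", "strife")]
# Process the ORs in increasing order of their estimated sizes.
for group in sorted(query, key=lambda g: estimate_or_size(df, g)):
    print(group, estimate_or_size(df, group))
# -> ('ignoble', 'strife') 85
#    ('madding', 'crowd') 410
```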

Structured vs. unstructured data
Structured data tends to refer to information in tables:

Employee   Manager   Salary
Smith      Jones     50000
Chang      Smith     60000
Ivy        Smith     50000

Such data typically allows numerical range and exact-match (for text) queries, e.g., Salary < 60000 AND Manager = Smith.

Unstructured data
Typically refers to free text. It allows keyword queries (including operators) and more sophisticated concept queries, e.g., find all web pages dealing with drug abuse. This is the classic model for searching text documents.

Semi-structured data
In fact, almost no data is truly unstructured. E.g., this slide has distinctly identified zones such as the Title and Bullets. This facilitates semi-structured search such as: Title contains data AND Bullets contain search.

More sophisticated semi-structured search
Title is about Object Oriented Programming AND Author something like stro*rup, where * is the wildcard operator. Issues: how do you process "about"? How do you rank results? This is the focus of XML search.