CSCE 561 Information Retrieval System Models

Similar documents
CMPS 561 Boolean Retrieval. Ryan Benton Sept. 7, 2011

Information Retrieval Using Boolean Model SEEM5680

boolean queries Inverted index query processing Query optimization boolean model January 15, / 35

TDDD43. Information Retrieval. Fang Wei-Kleiner. ADIT/IDA Linköping University. Fang Wei-Kleiner ADIT/IDA LiU TDDD43 Information Retrieval 1

Dealing with Text Databases

Ricerca dell Informazione nel Web. Aris Anagnostopoulos

The Boolean Model ~1955

Probabilistic Information Retrieval

IR Models: The Probabilistic Model. Lecture 8

Chap 2: Classical models for information retrieval

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

16 The Information Retrieval "Data Model"

INFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from

Maschinelle Sprachverarbeitung

Boolean and Vector Space Retrieval Models CS 290N Some of slides from R. Mooney (UTexas), J. Ghosh (UT ECE), D. Lee (USTHK).

Lecture 5: Web Searching using the SVD

9 Searching the Internet with the SVD

13 Searching the Web with the SVD

Hash-based Indexing: Application, Impact, and Realization Alternatives

Query CS347. Term-document incidence. Incidence vectors. Which plays of Shakespeare contain the words Brutus ANDCaesar but NOT Calpurnia?

IR: Information Retrieval

1998: enter Link Analysis

Information Retrieval

Querying. 1 o Semestre 2008/2009

Lecture 1b: Text, terms, and bags of words

How Latent Semantic Indexing Solves the Pachyderm Problem

Data Mining and Matrices

Outline for today. Information Retrieval. Cosine similarity between query and document. tf-idf weighting

Information Retrieval

1 Boolean retrieval. Online edition (c)2009 Cambridge UP

Information Retrieval and Web Search

Ranking-II. Temporal Representation and Retrieval Models. Temporal Information Retrieval

Natural Language Processing. Topics in Information Retrieval. Updated 5/10

Query. Information Retrieval (IR) Term-document incidence. Incidence vectors. Bigger corpora. Answers to query

Information System Design IT60105

Information Retrieval

Introduction to Information Retrieval

Retrieval by Content. Part 2: Text Retrieval Term Frequency and Inverse Document Frequency. Srihari: CSE 626 1

Vector Space Scoring Introduction to Information Retrieval INF 141 Donald J. Patterson

CS6220: DATA MINING TECHNIQUES

PV211: Introduction to Information Retrieval

Instructions for NLP Practical (Units of Assessment) SVM-based Sentiment Detection of Reviews (Part 2)

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

Boolean and Vector Space Retrieval Models

Singular Value Decompsition

CS276A Text Information Retrieval, Mining, and Exploitation. Lecture 4 15 Oct 2002

Google PageRank. Francesco Ricci Faculty of Computer Science Free University of Bozen-Bolzano

CS6220: DATA MINING TECHNIQUES

CS425: Algorithms for Web Scale Data

Ad Placement Strategies

Information Retrieval. Lecture 6

Behavioral Data Mining. Lecture 2

Information Retrieval and Organisation

Information Retrieval Basic IR models. Luca Bondi

Ligand Scout Tutorials

Part I: Web Structure Mining Chapter 1: Information Retrieval and Web Search

Scoring, Term Weighting and the Vector Space

Improving Diversity in Ranking using Absorbing Random Walks

Machine Learning for natural language processing

Performing Map Cartography. using Esri Production Mapping

Dictionary: an abstract data type

Link Mining PageRank. From Stanford C246

Document Similarity in Information Retrieval

INFO 630 / CS 674 Lecture Notes

Algorithms for Data Science: Lecture on Finding Similar Items

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

Large-Scale Behavioral Targeting

Updating PageRank. Amy Langville Carl Meyer

Term Weighting and the Vector Space Model. borrowing from: Pandu Nayak and Prabhakar Raghavan

INFO 4300 / CS4300 Information Retrieval. IR 9: Linear Algebra Review

Nearest Neighbor Search with Keywords

Combining Vector-Space and Word-based Aspect Models for Passage Retrieval

Matrices, Vector Spaces, and Information Retrieval

Scalable Asynchronous Gradient Descent Optimization for Out-of-Core Models

IS4200/CS6200 Informa0on Retrieval. PageRank Con+nued. with slides from Hinrich Schütze and Chris6na Lioma

Efficient query evaluation

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Latent Dirichlet Allocation (LDA)

INTRODUCTION TO GIS. Dr. Ori Gudes

Edo Liberty Principal Scientist Amazon Web Services. Streaming Quantiles

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Motivation. User. Retrieval Model Result: Query. Document Collection. Information Need. Information Retrieval / Chapter 3: Retrieval Models

new interface and features

Applications. Nonnegative Matrices: Ranking

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

Distributed Architectures

Mining Data Streams. The Stream Model Sliding Windows Counting 1 s

Information Retrieval CS Lecture 03. Razvan C. Bunescu School of Electrical Engineering and Computer Science

GIS Institute Center for Geographic Analysis

Impression Store: Compressive Sensing-based Storage for. Big Data Analytics

Lab 8: Measuring Graph Centrality - PageRank. Monday, November 5 CompSci 531, Fall 2018

Scoring (Vector Space Model) CE-324: Modern Information Retrieval Sharif University of Technology

Part A. P (w 1 )P (w 2 w 1 )P (w 3 w 1 w 2 ) P (w M w 1 w 2 w M 1 ) P (w 1 )P (w 2 w 1 )P (w 3 w 2 ) P (w M w M 1 )

Mass Asset Additions. Overview. Effective mm/dd/yy Page 1 of 47 Rev 1. Copyright Oracle, All rights reserved.

Why duplicate detection?

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

Image Compression Using the Haar Wavelet Transform

Fall CS646: Information Retrieval. Lecture 6 Boolean Search and Vector Space Model. Jiepu Jiang University of Massachusetts Amherst 2016/09/26

Term Weighting and Vector Space Model. Reference: Introduction to Information Retrieval by C. Manning, P. Raghavan, H. Schutze

Transcription:

CSCE 561 Information Retrieval System Models Satya Katragadda 26 August 2015

Agenda Introduction to Information Retrieval Inverted Index IR System Models Boolean Retrieval Model 2

Introduction Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collections (usually stored on computers) Satisfy what does this mean?

Introduction: Applications of IR Traditional Web Search Search Engines: Google, Bing, Yahoo? Digital Catalogues: IMDB, GLADYS Non-Traditional Searches File search Email search Other Applications Text categorization Text Summarization Structured Retrieval

Introduction: IR Assumptions Collection: A set of documents, assume its static Goal: Retrieve documents with information that is relevant to user s information need to complete a task.

Introduction: Classic Search Model User Task Info Need Query Misconception? Misformulation? Movies of Quentin Tarantino with Samuel Jackson List of Quentin Tarantino movies that also has Samuel Jackson Filmography of Samuel Jackson credited to Quentin Tarantino Search Engine Collection Query Refinement Results

Introduction: Search Model Search for movie credits with Quentin Tarantino and Samuel Jackson that does not have Leonardo DiCaprio grep all movie pages with for Quentin Tarantino and Samuel Jackson, then remove the ones that contain Leonardo DiCaprio Problems? Slow for large corpus More additional Information Ranked retrieval?

Indexing How to store documents and terms so that we can retrieve documents Efficiently Effectively With reasonable space requirements?

Indexing: Term - Matrix Create a table Rows: Terms Columns: IDs Term- Incidence Matrix Inverted View of Collection

Indexing: Term - Matrix 1 2 3 4 Term 1 1 0 0 1 Term 2 0 1 1 1 Term 3 1 0 1 1 Term 4 1 1 1 1 Rows: Vectors of documents containing term x Columns: Vectors of terms contained by document Y

Indexing: - Term Matrix Term 1 Term 2 Term 3 Term 4 1 1 0 1 1 2 0 1 0 1 3 0 1 1 1 4 1 1 1 1 Rows: Vectors of terms contained by document x Columns: Vectors of documents containing term y

Indexing: Term Matrix Naïve way of storage: Create and store the matrix Calculations 500,000 Terms 1,000,000 s ½ trillion entries: most of them 0 s and 1 s Memory Impact As documents/terms grows, needs constant updation Can t keep up with memory

Indexing: Term Matrix Observation Term Matrix->Sparse Only a small number of terms in any given document Assume a typical document contains 1000 terms A collection of 1,000,000 documents contains 1 billion 1 Rest of the entries are 0 s Thus, 99.8% of matrix is 0 s

Indexing: Inverted Index Also called Inverted File Dictionary of terms Vocabulary Lexicon Each term List of documents in which it appears Each document is called a posting

Indexing: Inverted Index Docume nt 1 Docume nt 2 Docume nt 3 Docume nt 4 Term 1 1 4 Term 1 1 0 0 1 Term 2 0 1 1 1 Term 3 1 0 1 1 Term 4 1 1 1 1 Term 2 2 3 4 Term 3 1 3 4 Term 4 1 2 3 4

Indexing: Inverted Index Dictionary: Sorted alphabetically Each posting: sorted by ID Storage Dictionary kept in memory Postings can be stored in memory or disk

Indexing: Inverted Index, changed Docume nt 1 Docume nt 2 Docume nt 3 Docume nt 4 Term 1: 2 1 4 Term 1 1 0 0 1 Term 2 0 1 1 1 Term 3 1 0 1 1 Term 4 1 1 1 1 Term 2: 3 2 3 4 Term 3: 3 1 3 4 Term 4: 4 1 2 3 4

IR System Models S=(D, Q, T, V, F) D: s Representation Space Q: Query Representation Space T: The set of Index terms (Indexing Vocabulary) F: D X Q V V: The set of Retrieval Status Value

IR System Models S = (D, Q, T, V, F) If the retrieval status values are unique, then the ordering is linear If the retrieval status values are non-unique, it is a weak ordering Select 10 relevant results?

IRS Models: Subject Catalog Model S=(D, Q, T, V, F) T = set of subject headings Q = T D = 2 T V = {0,1} F q (d), where q ϵ Q, d ϵ D 1, if q ϵ d 0, other wise

IRS Models: Coordination level System S=(D, Q, T, V, F) Q = 2 T D = 2 T V = {0,1} F q (d), where q ϵ Q, d ϵ D 1, if q d 0, other wise F q (d), where q ϵ Q, d ϵ D 1, if q > k 0, other wise

IRS Models: Boolean System S=(D, Q, T, V, F) D = 2 T Q = E (Expression) V = {0,1} F q (d), where q ϵ Q, d ϵ D 1, if q evaluates to True w.r.t. 0, other wise

Boolean System: Expression Let t Then t If Then If, Nothing else is an element of E.

Boolean System: Representation Set of s ID s D = {d α }, α = 1,2,,p Set of all term ID s T = {t i }, i = 1,2,,n

Boolean System: Representation Relation = { < α, t i, μ D α, t i )>} μ D =, μ D α, t i 1, if α contains t i 0, otherwise ti = α μ D α, t i ) = 1} α = {t i μ D α, t i ) = 1}

Boolean System: Retrieval Function V V t α = μ D α, t V α = 1 - V α V α = V α V α V α = V α V α

Boolean System: Example q = c d e a b

Boolean System: Example q =, α = {, } 1 V q α = 1 1 1 0 1 0 c 0 0 d e a b 1 0

Processing Boolean Query: Method 1 t t = α μ D α, t μ D α, t = } t t = α μ D α, t μ D α, t = } t = set of documents containing term t T = {a, b, c, d, e}

Processing Boolean Query: Method 1 Input a b Output

Processing Boolean Query: Method 1 Output Query t t e1 e e2 D\

Processing Boolean Query: Method 2 AND queries Construct a merged list M for and Transfer all duplicated records O d on merge list to output OR queries Construct a merged list M for and Transfer all unique records O u on merge list to output

Processing Boolean Query: Method 2 NOT queries Construct a merged list M for and Remove all the items appearing only once on this list First_List Create a merge list composed of and First_List Second_List Remove items appearing more than once from Second_List Transfer the remaining items (those that are alone) to output O a \ )

Boolean System: Method 2, Example q = t t t t :,, t :,, t :,, t t M( t, t ): { 1, 1, 2, 3} O t t : {1, 2, 3} unique M: Merge Operation, O: Output Selection

Boolean System: Method 2, Example t t t O t t : {1, 2, 3} t t t :,, M( t t, t ): { 1, 2, 2, 3, 3, 4} O( t t t : {2, 3} duplicate M( t t, t t t ): {1, 2, 2, 3, 3} O t t t : - alone

Boolean System: Variations Extended Boolean Has standard operations AND, OR and NOT Plus Term Proximity Within X words, sentences, paragraphs Wildcard Matching Fuzzy Allow for range Function F no longer restricted to {0,1}

Questions?

References Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Chapter 1, 2008. Abraham Bookstein and William Cooper, A General Mathematical Model for Information Retrieval Systems, The Library Quarterly, Vol 26, no. 2, pp 153-67. Ryan Benton s Lecture Notes: http://www.cacs.louisiana.edu/~cmps561/notes/cmps5 61Fall10-BooleanIR-fa2011.pdf