CMPS 561 Boolean Retrieval. Ryan Benton Sept. 7, 2011

1 CMPS 561 Boolean Retrieval Ryan Benton Sept. 7, 2011

2 Agenda
- Indices
- IR System Models
- Processing Boolean Queries
- Algorithms for Intersection

3 Indices

4 Indices
Question: How do we store documents and terms so that we can retrieve documents
- efficiently,
- effectively, and
- with reasonable space requirements?

5 Term-Document Matrix
- Create a table
  - Rows: terms
  - Columns: document IDs
- Official name: Term-Document Incidence Matrix
- Also called: Inverted View of Collection

6 Term-Document Incidence Matrix
[Table: one row per term, one column per document ID; the cell values were not preserved.]

7 Term-Document Matrix: Why?
- Rows: vectors of the documents containing term X.
- Columns: vectors of the terms contained in document Y.
- Technically, take the transpose of the matrix to get the column vectors.
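The row and column views can be illustrated with a tiny incidence matrix. This is a minimal sketch on a made-up three-document collection; the document IDs, terms, and variable names are assumptions for illustration only.

```python
# Toy collection (hypothetical data): document ID -> set of terms it contains.
docs = {
    1: {"boolean", "retrieval"},
    2: {"inverted", "index"},
    3: {"boolean", "index"},
}

terms = sorted(set().union(*docs.values()))   # row labels
doc_ids = sorted(docs)                        # column labels

# Term-document incidence matrix: one row per term, one column per document.
term_doc = [[1 if t in docs[d] else 0 for d in doc_ids] for t in terms]

# Transposing gives the document-term view: one row per document.
doc_term = [list(col) for col in zip(*term_doc)]

row_boolean = term_doc[terms.index("boolean")]   # which documents contain "boolean"
row_doc3 = doc_term[doc_ids.index(3)]            # which terms document 3 contains
```

Reading a row of `term_doc` answers "which documents contain this term?", while a row of `doc_term` answers "which terms does this document contain?".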

8 Document-Term Incidence Matrix
[Table: the transposed matrix, with columns Term 1 to Term 4 and one row per document; the cell values were not preserved.]

9 Term-Document Matrix: Naïve Way of Building
- Create and store the full matrix.
- Some calculations
  - 500,000 terms
  - 1,000,000 documents
  - ½ trillion entries: 500,000,000,000
  - All 0s and 1s.
- Memory impact
  - As the document collection and/or the term list grows, the matrix can't be kept in memory.

10 Term-Document Matrix
- Observation: the term-document matrix is sparse.
- Typically, only a small number of terms appear in any given document.
- If a typical document contains 1,000 terms, the matrix in the previous example has about 1 billion 1s (1,000,000 × 1,000 = 1,000,000,000).
- Thus, 99.8% of the matrix entries are 0s.
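The sparsity figure follows directly from the numbers on the slides. The short calculation below just reproduces that arithmetic; the variable names are chosen for illustration.

```python
num_terms = 500_000
num_docs = 1_000_000
terms_per_doc = 1_000   # typical document size assumed on the slide

total_cells = num_terms * num_docs            # size of the full matrix
max_ones = num_docs * terms_per_doc           # at most one 1 per (document, term) pair
zero_fraction = 1 - max_ones / total_cells    # share of cells that must be 0

print(f"{total_cells:,} cells, at most {max_ones:,} ones, "
      f"{zero_fraction:.1%} zeros")
```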

11 Inverted Index
- Also called: Inverted File
- Dictionary of terms
  - Also called: Vocabulary, Lexicon
- For each term: a list of the documents in which it appears.
  - Each document entry is sometimes called a posting.

12 Term-Document Incidence Matrix
[Table repeated from slide 6; the cell values were not preserved.]

13 Inverted Index
[Diagram: each term with its posting list; the document IDs were not preserved.]

14 Inverted Index
- Note:
  - The dictionary is sorted alphabetically.
  - Each posting list is sorted by document ID.
- Storage:
  - Dictionary: kept in memory.
  - Postings: depends on space; in memory or on disk.
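One way to build such an index is sketched below. It assumes whitespace tokenization and lowercasing, which the slides do not specify; the function name and the sample documents are hypothetical.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of the document IDs containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():   # assumed tokenizer: split on whitespace
            postings[term].add(doc_id)
    # Dictionary sorted alphabetically; each posting list sorted by ID.
    return {term: sorted(ids) for term, ids in sorted(postings.items())}

# Hypothetical sample collection.
docs = {
    1: "boolean retrieval model",
    2: "inverted index",
    3: "boolean query on an inverted index",
}
index = build_inverted_index(docs)
```

Keeping both the dictionary and the posting lists sorted is what later allows two posting lists to be intersected in a single linear pass.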

15 Inverted Index, Some Change
[Diagram: posting lists for Term 1, Term 2, Term 3, and Term 4; the document IDs were not preserved.]

16 IR System Models

17 Model
$S = (D, Q, T, V, F)$
- $d \in D$: documents
- $q \in Q$: queries
- $F: D \times Q \to V$
- $F_q: D \to V$
- $V$: Retrieval Status Values (RSV)
- $T$: index terms

18 Model
$S = (D, Q, T, V, F)$
- The ordering defined over the elements of $V$ is a simple order.
- The ordering defined over the elements of $D$ by $F$ is a weak order.
  - It breaks the elements of $D$ into a number of subsets.
  - These subsets are simply ordered.

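19 Subject Catalog Model
$S = (D, Q, T, V, F)$
- $T$ = set of subject headings
- $Q = T$
- $D = 2^T$
- $V = \{0, 1\}$
- $F_q(d)$, where $q \in Q$ and $d \in D$:
  $$F_q(d) = \begin{cases} 1, & \text{if } q \in d \\ 0, & \text{otherwise} \end{cases}$$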

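20 Coordination Level System
$S = (D, Q, T, V, F)$
- $Q = 2^T$
- $D = 2^T$
- $V = \{0, 1\}$
- Two definitions of $F_q(d)$ are given:
  - $F_q(d) = 1$ if [condition on $q$ and $d$; the symbols were not preserved], $0$ otherwise
  - $F_q(d) = 1$ if $|q \cap d| > k$, $0$ otherwise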

21 Boolean Systems
$S = (D, Q, T, V, F)$
- $D = 2^T$
- $Q = E$
- $V = \{0, 1\}$
- $F_q(d)$:
  $$F_q(d) = \begin{cases} 1, & \text{if } q \text{ evaluates to True with respect to document } d \\ 0, & \text{otherwise} \end{cases}$$

22 What is E?
- Let $t \in T$. Then $t \in E$.
- If $e \in E$, then $\neg e \in E$.
- If $e_1, e_2 \in E$, then
  - $e_1 \wedge e_2 \in E$
  - $e_1 \vee e_2 \in E$
- Nothing else is in $E$!

23 Document Representation
- Set of document IDs: $D = \{d_a\}$, $a = 1, 2, \dots, p$
- Set of all term IDs: $T = \{t_i\}$, $i = 1, 2, \dots, n$

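24 Document Representation
- Relation $D = \{\langle d_a, t_i, m_D(d_a, t_i) \rangle\}$
- $m_D: D \times T \to \{0, 1\}$
  $$m_D(d_a, t_i) = \begin{cases} 1, & \text{if } d_a \text{ contains } t_i \\ 0, & \text{otherwise} \end{cases}$$
- $D_{t_i} = \{d_a \in D \mid m_D(d_a, t_i) = 1\}$
- For $d_a \in D$: $d = \{t_i \in T \mid m_D(d_a, t_i) = 1\}$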

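25 Retrieval Function
RSV (the function $F$):
- $\mathrm{RSV}_t(d_a) = m_D(d_a, t)$
- $\mathrm{RSV}_{\neg e}(d_a) = 1 - \mathrm{RSV}_e(d_a)$
- $\mathrm{RSV}_{e_1 \wedge e_2}(d_a) = \mathrm{RSV}_{e_1}(d_a) \wedge \mathrm{RSV}_{e_2}(d_a)$
- $\mathrm{RSV}_{e_1 \vee e_2}(d_a) = \mathrm{RSV}_{e_1}(d_a) \vee \mathrm{RSV}_{e_2}(d_a)$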
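The recursive definition of the RSV translates almost line for line into code. The sketch below assumes a nested-tuple encoding of Boolean expressions and represents a document by the set of terms it contains; both choices, and the sample query, are illustrative assumptions rather than anything prescribed by the slides.

```python
def rsv(expr, doc_terms):
    """Return 1 if the Boolean expression holds for the document, else 0.

    expr is a term (str) or a tuple: ("not", e), ("and", e1, e2), ("or", e1, e2).
    doc_terms is the set of terms the document contains.
    """
    if isinstance(expr, str):
        return 1 if expr in doc_terms else 0     # RSV of a term: membership test
    op = expr[0]
    if op == "not":
        return 1 - rsv(expr[1], doc_terms)       # RSV of NOT e: 1 - RSV of e
    left = rsv(expr[1], doc_terms)
    right = rsv(expr[2], doc_terms)
    if op == "and":
        return left & right                      # conjunction on {0, 1}
    if op == "or":
        return left | right                      # disjunction on {0, 1}
    raise ValueError(f"unknown operator: {op!r}")

# Hypothetical query (not the one from the slides): (a AND c) OR NOT b
query = ("or", ("and", "a", "c"), ("not", "b"))
```

For example, `rsv(query, {"a", "c"})` evaluates the query against a document containing exactly the terms a and c.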

26 Processing Boolean Queries

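27 Boolean example
q = (d e) (c (a b)), with the Boolean operators between the operands not preserved
[Expression tree for q with leaves c, d, e, a, b.]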

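28 Boolean example
q = (d e) (c (a b)), with the Boolean operators between the operands not preserved
- Document $d_a$ is represented by the term set $\{a, c\}$.
- [Expression tree for $\mathrm{RSV}_q(d_a)$: the surviving leaf values are d = 0, e = 0, a = 1, b = 0; the value at c and the final result were not preserved.]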

29 Processing Boolean Query (Method 1)
- $D_{t_1 \wedge t_2} = \{d_a \in D \mid m_D(d_a, t_1) \wedge m_D(d_a, t_2) = 1\}$
- $D_{t_1 \vee t_2} = \{d_a \in D \mid m_D(d_a, t_1) \vee m_D(d_a, t_2) = 1\}$
- $D_t$ = set of documents containing term $t$
- $T = \{a, b, c, d, e\}$: $D_a, D_b, D_c, D_d, D_e$

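30 Processing Boolean Query (Method 1)
[Diagram: a query over the terms a and b as input, and the corresponding combination of $D_a$ and $D_b$ as output; the operators were not preserved.]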

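31 Processing Boolean Query (Method 1)
Query | Output
$t$ | $D_t$
$e_1$ | $D_{e_1}$
$e_2$ | $D_{e_2}$
$e_1 \wedge e_2$ | $D_{e_1} \cap D_{e_2}$
$e_1 \vee e_2$ | $D_{e_1} \cup D_{e_2}$
$\neg e_1$ | $D \setminus D_{e_1}$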
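Method 1 maps each Boolean operator onto a set operation, so it can be sketched with Python's built-in sets. The tuple encoding of queries, the function name, and the toy posting lists below are assumptions made for this illustration.

```python
def evaluate(expr, postings, all_docs):
    """Return the set of document IDs that satisfy a Boolean expression.

    expr is a term (str) or a tuple ("not", e), ("and", e1, e2), ("or", e1, e2);
    this encoding is an assumption made for the sketch.
    """
    if isinstance(expr, str):
        return set(postings.get(expr, ()))                        # D_t
    op = expr[0]
    if op == "not":
        return all_docs - evaluate(expr[1], postings, all_docs)   # D \ D_e1
    left = evaluate(expr[1], postings, all_docs)
    right = evaluate(expr[2], postings, all_docs)
    if op == "and":
        return left & right                                       # intersection
    if op == "or":
        return left | right                                       # union
    raise ValueError(f"unknown operator: {op!r}")

# Illustrative posting lists; document IDs are plain integers here.
postings = {"t1": {1, 3}, "t2": {1, 2}, "t3": {2, 3, 4}}
all_docs = {1, 2, 3, 4}   # assumed: the collection holds exactly these IDs
query = ("and", ("or", "t1", "t2"), ("not", "t3"))
answer = evaluate(query, postings, all_docs)
```

Note that negation needs the full document set `all_docs`, which is why NOT is usually the most expensive operator under this method.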

32 Boolean Queries (Method 2)
- and-queries ($t_i \wedge t_j$)
  - Construct a merged list $M$ for $D_{t_i}$ and $D_{t_j}$.
  - Transfer all duplicated records ($O_d$) on the merged list to the output.
- or-queries ($t_i \vee t_j$)
  - Construct a merged list $M$ for $D_{t_i}$ and $D_{t_j}$.
  - Transfer all unique records ($O_u$) on the merged list to the output.

33 Boolean Queries (Method 2)
- not-queries ($t_i \wedge \neg t_j$)
  - Construct a merged list $M$ for $D_{t_i}$ and $D_{t_j}$.
  - Remove all items appearing only once on this list, giving FIRST_List.
  - Create a merged list composed of FIRST_List and $D_{t_i}$, giving SECOND_List.
  - Remove the items appearing more than once from SECOND_List ($O_a$).
  - Transfer the remaining items to the output.

34 Reminder: Inverted Index
[Diagram: each term with its posting list; the document IDs were not preserved.]

35 Example Query, part 1
$((t_1 \vee t_2) \wedge \neg t_3)$
Let's do the first part, $(t_1 \vee t_2)$:
- $D_{t_1}$: $\{R_1, R_3\}$
- $D_{t_2}$: $\{R_1, R_2\}$
- $M(D_{t_1}, D_{t_2})$: $\{R_1, R_1, R_2, R_3\}$
- $O(t_1 \vee t_2)$: $\{R_1, R_2, R_3\}$ (unique)
$M$: merging operation; $O$: output selection

36 Example Query, part 2
Now, let's handle the second part of $((t_1 \vee t_2) \wedge \neg t_3)$:
- $O(t_1 \vee t_2)$: $\{R_1, R_2, R_3\}$, used as $D_{t_1 \vee t_2}$
- $D_{t_3}$: $\{R_2, R_3, R_4\}$
- $M(D_{t_1 \vee t_2}, D_{t_3})$: $\{R_1, R_2, R_2, R_3, R_3, R_4\}$
- $O((t_1 \vee t_2) \wedge t_3)$: $\{R_2, R_3\}$ (duplicate)
- $M(D_{t_1 \vee t_2}, D_{(t_1 \vee t_2) \wedge t_3})$: $\{R_1, R_2, R_2, R_3, R_3\}$
- $O((t_1 \vee t_2) \wedge \neg t_3)$: $\{R_1\}$ (alone)
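The merge-and-select steps of Method 2 can be reproduced in a few lines. This is a minimal sketch that assumes each input list is duplicate-free; the function names are invented for illustration, and the record labels are kept as strings to match the example.

```python
from collections import Counter

def merge(list_a, list_b):
    """Merged list M: every record of both lists, duplicates kept, in order."""
    return sorted(list_a + list_b)

def select_unique(merged):
    """Output selection for OR: each distinct record once."""
    return sorted(set(merged))

def select_duplicated(merged):
    """Output selection for AND: records appearing more than once."""
    return sorted(r for r, n in Counter(merged).items() if n > 1)

def select_alone(merged):
    """Output selection for the NOT step: records appearing exactly once."""
    return sorted(r for r, n in Counter(merged).items() if n == 1)

def and_not(list_a, list_b):
    """a AND NOT b using two merges, following the not-query recipe."""
    first = select_duplicated(merge(list_a, list_b))   # FIRST_List: a AND b
    second = merge(first, list_a)                      # SECOND_List
    return select_alone(second)                        # drop what occurs twice

d_t1, d_t2, d_t3 = ["R1", "R3"], ["R1", "R2"], ["R2", "R3", "R4"]
t1_or_t2 = select_unique(merge(d_t1, d_t2))
result = and_not(t1_or_t2, d_t3)
```

With these inputs, `t1_or_t2` and `result` correspond to the outputs of part 1 and part 2 of the example.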

37 Algorithms for Intersection

38 Algorithms: Basic Intersection (aka Merging)
Intersect(p1, p2)
    answer ← {}
    While (p1 != NIL) and (p2 != NIL) Do
        If docid(p1) = docid(p2) Then
            ADD(answer, docid(p1))
            p1 ← next(p1)
            p2 ← next(p2)
        Else If docid(p1) < docid(p2) Then
            p1 ← next(p1)
        Else
            p2 ← next(p2)
    Return answer
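The pseudocode walks two linked lists; a runnable version can use list indices instead of pointers. The sketch below assumes both posting lists are Python lists of document IDs sorted in ascending order.

```python
def intersect(p1, p2):
    """Intersect two posting lists sorted by ascending document ID."""
    answer = []
    i = j = 0                      # cursors into p1 and p2
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:         # same document in both lists: keep it
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:        # advance the cursor with the smaller ID
            i += 1
        else:
            j += 1
    return answer
```

Each loop iteration advances at least one cursor, so the loop runs at most `len(p1) + len(p2)` times.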

39 Algorithms: Intersection
- Complexity: $O(x + y)$
  - For any two given posting lists
    - List A has size $x$
    - List B has size $y$
  - Note: this is an upper bound.
- Formally, complexity: $\Theta(N)$
  - $N$ can be the number of documents in the collection.
  - Note: this is a tight bound.

40 Observation
- In many cases, Boolean queries are conjunctive in nature.
- This allows for a possible improvement based on posting list size (how many documents contain the term).

41 Algorithms: Conjunctive Query Merging
IntersectConjunct(t1, t2, ..., tz)
    Terms ← SortByIncreasingFrequency((t1, t2, ..., tz))
    Results ← postings(first(Terms))
    Terms ← rest(Terms)
    While (Terms != NIL) and (Results != NIL) Do
        Results ← Intersect(Results, postings(first(Terms)))
        Terms ← rest(Terms)
    Return Results
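A runnable counterpart of the conjunctive merge is sketched below. It assumes the index is a dictionary from term to a sorted posting list, and it treats a term that is missing from the index as having an empty posting list; the sample index is hypothetical.

```python
def intersect(p1, p2):
    """Linear-time intersection of two sorted posting lists."""
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

def intersect_conjunct(terms, index):
    """AND together the posting lists of all terms, shortest list first."""
    if not terms:
        return []
    # Sort by posting list length so the rarest term is processed first.
    ordered = sorted(terms, key=lambda t: len(index.get(t, [])))
    results = list(index.get(ordered[0], []))
    for term in ordered[1:]:
        if not results:            # nothing can match any more: stop early
            break
        results = intersect(results, index.get(term, []))
    return results

# Hypothetical index for illustration.
index = {"a": [1, 2, 3, 4, 5, 6], "b": [2, 4, 6], "c": [4, 6, 8]}
hits = intersect_conjunct(["a", "b", "c"], index)
```

Processing the shortest list first keeps every intermediate result no longer than that list, which limits the work done in the remaining intersections.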

42 Why?
- By starting with the least frequent term, all results are guaranteed to be no larger than that term's posting list.
- In practice, the intermediate result list always places an upper bound on the size of the next result.

43 Variations on Boolean
- Extended Boolean
  - Has the standard operations: AND, OR, and NOT
  - Plus
    - Term proximity: within X words, sentences, or paragraphs
    - Wildcard matching
    - Fuzzy
    - Allow for range
- Function $F$ is no longer restricted to $\{0, 1\}$

44 Thank-you Questions?

45 References
- Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze, Introduction to Information Retrieval, Chapter 1,
- Abraham Bookstein and William Cooper, A General Mathematical Model for Information Retrieval Systems, The Library Quarterly, Vol 26, no. 2, pp
- Vijay V. Raghavan's Notes/Lecture Material Model.pdf
- Material in slides used with permission
