Compact Indexes for Flexible Top-k Retrieval

1 Compact Indexes for Flexible Top-k Retrieval
Simon Gog, Matthias Petri
Institute of Theoretical Informatics, Karlsruhe Institute of Technology
Computing and Information Systems, The University of Melbourne
July th 5

2 Top-k document retrieval
Given:
- Collection D = {d,..., d N }. Each d i is a string over alphabet Σ = [, σ] terminated by a sentinel character (also written #). D = D d N, with d N =.
- Bag-of-words query Q = {q_1, q_2, ..., q_m} (an unordered set of size m).
Problem: Given a collection D, a query Q of length m, and a similarity measure S : D × P_{=m}(Σ*) → R, calculate the top-k documents of D with regard to Q and S. The result is a sorted list of document identifiers T = {τ_1, ..., τ_k} with S(d_{τ_i}, Q) ≥ S(d_{τ_{i+1}}, Q) for i < k and S(d_{τ_k}, Q) ≥ S(d_j, Q) for j ∉ T.

3 Example
Fix a concatenation C of D:
C_word = LA O LA # O LA LA LA # O O LA #
S_sfreq(d, q) := f_{d,q} (i.e. single-term frequency ranking)
For q = LA, the three documents of C_word contain LA 2, 3 and 1 times, in order of appearance. The top-2 result T therefore consists of the second document followed by the first.
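The single-term frequency ranking of this example can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names `split_documents` and `top_k_sfreq` are hypothetical, and documents are identified by their 0-based position in the concatenation, which is an assumption and not necessarily the numbering used on the slide.

```python
def split_documents(words, sentinel="#"):
    """Split a word sequence into documents at each sentinel."""
    docs, current = [], []
    for w in words:
        if w == sentinel:
            docs.append(current)
            current = []
        else:
            current.append(w)
    return docs


def top_k_sfreq(docs, q, k):
    """Return the IDs of the k documents with the highest frequency of q."""
    scored = [(doc.count(q), doc_id) for doc_id, doc in enumerate(docs)]
    # Sort by descending frequency; ties keep the lower document ID first.
    scored.sort(key=lambda t: (-t[0], t[1]))
    return [doc_id for _, doc_id in scored[:k]]


c_word = "LA O LA # O LA LA LA # O O LA #".split()
docs = split_documents(c_word)
top2 = top_k_sfreq(docs, "LA", 2)
```

This brute-force scan touches every document. The index structures on the following slides exist to avoid exactly that.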

5 Previous and related work
- Optimal time (O(|q| + k)) and space solution for single-term frequency ranking by Navarro & Nekrich (SODA 2012)
- Multi-term ranking for term frequency using linear space and time dependent on n m by Hon et al. (J. ACM 2014)
- Larsen et al. (CPM 2014): reduction of boolean matrix multiplication to the problem of finding elements which contain both terms of a two-term query
Our goal: a practical solution for multi-term queries and a wide range of similarity measures, like...

6 Okapi BM25 similarity measure
Successful IR similarity measure:
\[
S^{BM25}_{Q,d} = \sum_{q \in Q}
\underbrace{\frac{(k_1 + 1)\, f_{d,q}}{k_1 \left(1 - b + b \frac{n_d}{n_{avg}}\right) + f_{d,q}}}_{= w_{d,q}}
\cdot
\underbrace{f_{Q,q} \cdot \ln\left(\frac{N - F_{D,q} + 0.5}{F_{D,q} + 0.5}\right)}_{= w_{Q,q}}
\]
The score depends on document-dependent factors:
- f_{d,q}: term frequency (number of occurrences of term q in d)
- F_{D,q}: document frequency (number of distinct documents that contain q)
- n_d: length of document d
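As a rough illustration of how the BM25 score is assembled from these factors, the following Python sketch evaluates the sum term by term. The function name `bm25_score`, the dictionary-based inputs, and the parameter values k1 = 1.2 and b = 0.75 are assumptions made for this example (they are common defaults, not values taken from the slides).

```python
import math


def bm25_score(query_tf, doc_tf, n_d, n_avg, N, doc_freq, k1=1.2, b=0.75):
    """BM25 score of one document for a bag-of-words query.

    query_tf: term -> f_{Q,q}   (frequency of the term in the query)
    doc_tf:   term -> f_{d,q}   (frequency of the term in the document)
    doc_freq: term -> F_{D,q}   (number of documents containing the term)
    """
    score = 0.0
    for q, f_Qq in query_tf.items():
        f_dq = doc_tf.get(q, 0)
        if f_dq == 0:
            continue  # a term that is absent from d contributes nothing
        F_Dq = doc_freq[q]
        # Document-dependent weight w_{d,q}
        w_d = (k1 + 1) * f_dq / (k1 * (1 - b + b * n_d / n_avg) + f_dq)
        # Query-dependent weight w_{Q,q}
        w_Q = f_Qq * math.log((N - F_Dq + 0.5) / (F_Dq + 0.5))
        score += w_d * w_Q
    return score


# One query term occurring once in a document of average length.
example = bm25_score({"la": 1}, {"la": 1}, n_d=100, n_avg=100, N=10,
                     doc_freq={"la": 2})
```

With f_{d,q} = 1 and n_d = n_avg the document weight is exactly 1, so the example score reduces to ln(8.5 / 2.5). The weight w_{d,q} grows with f_{d,q} and shrinks with n_d, which is what makes upper-bound estimation over sets of documents possible.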

10 Previous practical solution
The GREEDY framework of Culpepper et al. (ESA 2010) [1] consists of:
- a Compressed Suffix Array (CSA) of the concatenation D
- a Wavelet Tree of the Document Array of D
The CSA provides phrase search and snippet extraction. This functionality is missing in Inverted Indexes (II).

11 The GREEDY framework
[Figure: example text T as a sequence of words ω separated by #, together with its CSA and the document array D.]
The interval of a query term q in D corresponds to the (multi)set of documents which contain q.
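The relation between a term's suffix-array interval and the documents containing it can be illustrated with a naive construction. This sketch sorts suffixes directly instead of using a compressed suffix array, uses a small LA/O collection rather than the ω example of the slide, and the names `document_array` and `term_interval` are hypothetical.

```python
from collections import Counter


def document_array(docs, sentinel="#"):
    """Concatenate docs (each followed by a sentinel) and build SA and D."""
    text, doc_of = [], []
    for doc_id, doc in enumerate(docs):
        for w in doc + [sentinel]:
            text.append(w)
            doc_of.append(doc_id)
    # Naive suffix array over words: sort the suffix start positions.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # D[p] is the document that contains the p-th smallest suffix.
    return text, sa, [doc_of[i] for i in sa]


def term_interval(text, sa, q):
    """Inclusive suffix-array interval of all suffixes starting with q."""
    hits = [p for p, i in enumerate(sa) if text[i] == q]
    return (hits[0], hits[-1]) if hits else None


docs = [["LA", "O", "LA"], ["O", "LA", "LA", "LA"], ["O", "O", "LA"]]
text, sa, D = document_array(docs)
l, r = term_interval(text, sa, "LA")
occurrences = Counter(D[l:r + 1])
```

Counting the entries of D[l..r] per document yields the term frequencies f_{d,q}, so top-k retrieval by frequency becomes a question about the most frequent values in a range of D.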

12 The GREEDY framework
Represent the document array D as a wavelet tree WTD.
Example: search for the top documents containing a term ω (frequency-based ranking). The animation proceeds in these steps:
- push state
- pop state
- expand (O(1) time) and push
- pop state
- expand and push
- pop and report: the first document is reported together with its frequency
- the search continues in the same way until the second document is reported together with its frequency
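The push, pop, expand and report steps can be sketched with a priority queue over wavelet tree nodes, ordered by interval size. The sketch below is a simplified, pointer-based illustration: `WTNode` stores plain prefix counts instead of succinct rank structures, and both `WTNode` and `greedy_topk` are hypothetical names, not the API of any library.

```python
import heapq


class WTNode:
    """Wavelet tree node over the symbol range [lo, hi]."""

    def __init__(self, seq, lo, hi):
        self.lo, self.hi = lo, hi
        self.left = self.right = None
        if lo == hi:
            return  # leaf: represents a single document ID
        mid = (lo + hi) // 2
        # zeros[i] = number of symbols <= mid among the first i symbols (rank)
        self.zeros = [0]
        for s in seq:
            self.zeros.append(self.zeros[-1] + (s <= mid))
        self.left = WTNode([s for s in seq if s <= mid], lo, mid)
        self.right = WTNode([s for s in seq if s > mid], mid + 1, hi)


def greedy_topk(root, l, r, k):
    """Top-k most frequent symbols in the half-open range [l, r)."""
    heap = [(-(r - l), 0, root, l, r)]  # max-heap on interval size
    counter = 1                         # tie-breaker so nodes are never compared
    result = []
    while heap and len(result) < k:
        neg_size, _, node, l, r = heapq.heappop(heap)      # pop state
        if node.lo == node.hi:
            result.append((node.lo, -neg_size))            # report leaf
            continue
        l0, r0 = node.zeros[l], node.zeros[r]              # expand via rank
        for child, a, b in ((node.left, l0, r0),
                            (node.right, l - l0, r - r0)):
            if b > a:                                      # skip empty intervals
                heapq.heappush(heap, (-(b - a), counter, child, a, b))
                counter += 1
    return result


D = [0, 1, 1, 2, 1, 0, 2, 1]
root = WTNode(D, 0, 2)
top = greedy_topk(root, 0, len(D), 2)
```

Because a child's interval is never larger than its parent's, the first leaf popped from the queue is guaranteed to be the most frequent document in the range, and so on for the following leaves.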

22 The GREEDY framework (conclusion)
- Elegant and simple algorithm
- Size: |CSA(T)| + |WT(D)| = nH_k(T) + n log |D| + o(n log |D|) using a plain WT
- Worst-case time depends on the number of distinct documents in the lexicographic range
- Culpepper et al. (ESA 2010) showed that it is practical for single-term frequency ranking; however, the large index size (character-based indexing) restricts it to small collections

23 Generalize and improve GREEDY
- multi-term (a state consists of multiple intervals)
- ranked-and or ranked-or version
- more complex similarity measures: TF×IDF, BM25, LMDS
- Implementation: 64-bit, word alphabet
Three tricks to achieve a better score estimation:
- Overestimate max_{d ∈ D_v} {f_{d,q}} by: interval size - unique docs in interval + 1
- Use the repetition array to get the number of unique docs in O(1)
- min_{d ∈ D_v} {n_d} can be determined in O(1) time (sort doc IDs according to length)
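To see why these estimates give a valid upper bound, consider the BM25 document weight: it increases with the term frequency and decreases with the document length. Plugging in an overestimate of the frequency and the minimum length therefore cannot underestimate any document in the state. The sketch below illustrates this; the function names and the parameter defaults k1 = 1.2 and b = 0.75 are assumptions for the example.

```python
def max_tf_upper_bound(interval_size, unique_docs):
    """Upper bound on the largest term frequency of any document in an interval.

    Each of the unique documents occurs at least once, so a single document
    can account for at most interval_size - (unique_docs - 1) entries.
    """
    if interval_size == 0:
        return 0
    return interval_size - unique_docs + 1


def bm25_doc_weight(f, n_d, n_avg, k1=1.2, b=0.75):
    """Document-dependent BM25 weight w_{d,q}."""
    return (k1 + 1) * f / (k1 * (1 - b + b * n_d / n_avg) + f)


def doc_weight_upper_bound(interval_size, unique_docs, n_min, n_avg):
    """Bound w_{d,q} for all documents of a state using the two estimates."""
    f_max = max_tf_upper_bound(interval_size, unique_docs)
    return bm25_doc_weight(f_max, n_min, n_avg)


# Entries of the document array that fall into one state, plus doc lengths.
D_v = [0, 1, 1, 2, 1]
lengths = {0: 5, 1: 9, 2: 4}
bound = doc_weight_upper_bound(len(D_v), len(set(D_v)),
                               min(lengths.values()), n_avg=6.0)
```

A tighter bound lets the search discard more states early, which is the purpose of the repetition array and the length-sorted document IDs.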

27 Document Frequency: Binary Suffix Tree (BST)
[Figure: binary suffix tree of the example collection with inner nodes v, edge labels over {#, O, LA}, per-node sets of repeated document IDs, and the derived bit vector H.]

28 Document Frequency
Sadakane's [2] solution:
- H is at most 2n bits
- add an o(n)-bit select structure
For [l, r] = CSA.search(q):
document_frequency(H, [l, r])
  s ← r - l + 1
  y ← select(H, r, 1)
  if l = 0 then
    return s - (y - r + 1)
  else
    x ← select(H, l, 1)
    return s - (y - r + 1 - (x - l + 1))
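The idea behind H can be checked on a small collection with a naive construction. In this sketch every pair of consecutive occurrences of the same document in the document array is charged to the position of the minimum LCP value between them (the lowest common ancestor in the suffix tree), and the per-position counts are written in unary. The indexing convention is an assumption of this example and differs slightly from the pseudocode: positions and one-counts are both 0-based, so no special case for l = 0 is needed. All function names are hypothetical, and select is a linear scan rather than an o(n)-bit structure.

```python
def common_prefix(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def build_index(docs, sentinel="#"):
    text, doc_of = [], []
    for doc_id, doc in enumerate(docs):
        for w in doc + [sentinel]:
            text.append(w)
            doc_of.append(doc_id)
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    lcp = [0] + [common_prefix(text[a:], text[b:]) for a, b in zip(sa, sa[1:])]
    D = [doc_of[i] for i in sa]
    h = [0] * len(sa)   # h[p]: duplicates charged to suffix-array position p
    last = {}           # last position at which each document was seen
    for i, d in enumerate(D):
        if d in last:
            p = min(range(last[d] + 1, i + 1), key=lcp.__getitem__)
            h[p] += 1
        last[d] = i
    H = []
    for count in h:     # unary code: h[p] zeros followed by a one
        H.extend([0] * count + [1])
    return text, sa, H


def select1(H, k):
    """Position of the k-th 1 bit, counting ones from 0 (linear scan)."""
    seen = -1
    for pos, bit in enumerate(H):
        seen += bit
        if bit and seen == k:
            return pos
    raise IndexError(k)


def document_frequency(H, l, r):
    """Number of distinct documents in the suffix-array interval [l, r]."""
    duplicates = (select1(H, r) - r) - (select1(H, l) - l)
    return (r - l + 1) - duplicates


def search(text, sa, phrase):
    m = len(phrase)
    hits = [p for p, i in enumerate(sa) if text[i:i + m] == phrase]
    return (hits[0], hits[-1]) if hits else None


docs = [["LA", "O", "LA"], ["O", "LA", "LA", "LA"], ["O", "O", "LA"]]
text, sa, H = build_index(docs)
l, r = search(text, sa, ["LA"])
df_la = document_frequency(H, l, r)
```

The result is only meaningful for intervals returned by a pattern search, because the argument relies on all LCP values inside the interval being at least the pattern length.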

29 Document Frequency for Subsets D_v
Solution for document subsets represented in WTD:
- For each 0 in H, record the repeated doc ID in an array R (O(n log N) bits).
- Build a WT over R (WTR).
- Map [l, r] to WTR (via rank/select).
- Traverse WTD and WTR simultaneously.
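The following sketch illustrates how the repetition array yields the number of unique documents within a document-ID subrange, which is what a wavelet tree node represents. It is a simplified model under stated assumptions: R is built naively from LCP minima, the mapping of [l, r] into R uses a plain offset table instead of rank/select on H, and the per-range counts are computed by scanning instead of traversing WTD and WTR. All names are hypothetical.

```python
def common_prefix(a, b):
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def build_repetition_array(docs, sentinel="#"):
    text, doc_of = [], []
    for doc_id, doc in enumerate(docs):
        for w in doc + [sentinel]:
            text.append(w)
            doc_of.append(doc_id)
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    lcp = [0] + [common_prefix(text[a:], text[b:]) for a, b in zip(sa, sa[1:])]
    D = [doc_of[i] for i in sa]
    buckets = [[] for _ in D]   # repeated doc IDs charged to each position
    last = {}
    for i, d in enumerate(D):
        if d in last:
            p = min(range(last[d] + 1, i + 1), key=lcp.__getitem__)
            buckets[p].append(d)
        last[d] = i
    R, start = [], []
    for bucket in buckets:
        start.append(len(R))    # start[p]: offset of position p's entries in R
        R.extend(bucket)
    start.append(len(R))
    return text, sa, D, R, start


def unique_docs(D, R, start, l, r, lo, hi):
    """Distinct documents with ID in [lo, hi] inside the interval [l, r]."""
    in_D = sum(lo <= d <= hi for d in D[l:r + 1])
    in_R = sum(lo <= d <= hi for d in R[start[l + 1]:start[r + 1]])
    return in_D - in_R


docs = [["LA", "O", "LA"], ["O", "LA", "LA", "LA"], ["O", "O", "LA"]]
text, sa, D, R, start = build_repetition_array(docs)
hits = [p for p, i in enumerate(sa) if text[i] == "LA"]
l, r = hits[0], hits[-1]
unique_low = unique_docs(D, R, start, l, r, 0, 1)
```

Each document that occurs f times in the interval contributes f entries to D and f - 1 entries to the mapped range of R, so the difference of the two counts is the number of distinct documents in the chosen ID range.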

30 Document Frequency for Subsets: Repetition Array
[Figure: the binary suffix tree of the example with the bit vector H and the repetition array R, which lists the repeated document IDs of each inner node v.]

31 Space Reduction
- R̂: omit the entries in R which belong to the suffix tree (ST) root.
- R̂_l and D̂_l: if the phrase length is restricted to l, sorting the intervals with a common prefix of length l does not change the correctness of the method.
- Both WT-R̂_l and WT-D_l contain frequency information. Given WT-R̂_l, omit the duplicates in WT-D_l for intervals with a common prefix of length l.
- In our example, sorting the intervals gives a document array D with 10 entries instead of 14.
[Figure: the binary suffix tree of the example with H and the reduced repetition array.]

34 Implementation Based on SDSL components. Available at Includes engineered II implementation (block-max WAND + document reordering for better compression).

35 Collection Statistics

          GOV2           CLUEWEB09
n         ,468,78,575    4,579,89,95
N         5,5,79         5,,4
n_avg     9              88
n_min
n_max     7,49           7,44
σ         9,77,9         9,4,65
|C_raw|   46 GiB         . TiB
|C_word|  7 GiB          8 GiB

Notice:
- Input is the word parsing C_word (generated by Indri)
- Queries are from the TREC 2005/2006 efficiency track

36 Index Sizes
[Figure: index size in GiB on GOV2 and CLUEWEB09 for the evaluated index variants, broken down into the components CSA, DF, WT-D_l and WT-R̂_l.]
Details:
- Horizontal line: size of the word parsing
- PII: typically 5%-6% (+original text), our II: 7 GiB for GOV2

37 Evaluated States
[Figure: explored search space (%) as a function of k for three estimation strategies:
- range size (f_{D_v,q}) only
- range size (f_{D_v,q}) and min. doc. length
- repeats (δ_{D_v,q}) and min. doc. length]

38 Query Times (1)
[Figure, left: time in ms for different k for the self-index variants and INVIDX-W. Right: mean time per state in µs by query length in words.]
Left: BM25 Ranked-OR retrieval on GOV2. Right: Time per state.

39 Query Times (2)
[Figure, left: time in ms for Ranked-AND and MWE Ranked-AND for different k, for the self-index variants and INVIDX-E. Right: time in ms for TF×IDF, BM25 and LMDS.]
Ranked-AND BM25 runtime for unparsed and MWE-parsed queries (left) and Ranked-OR runtime for different similarity measures and indexes (right).

40 Conclusion
- Extended the GREEDY approach to multi-term queries and complex scoring functions
- Conceptually very simple search engine
- Flexible: ranked-and/or, scoring function
- II still better for bag-of-words queries (for precomputed scores, i.e. a fixed scoring function)
- But: the self-index based solution provides more functionality:
  - phrase search (experiment with multi-word expressions)
  - text extraction
  - query completion (without query log)
- First self-index based system at this scale
- Future work: WTs of higher arity, faster state processing

41 References
[1] J. Shane Culpepper, Gonzalo Navarro, Simon J. Puglisi, and Andrew Turpin. Top-k ranked document search in general text databases. In Proc. ESA, pages 194-205, 2010.
[2] Kunihiko Sadakane. Succinct data structures for flexible text retrieval systems. J. Discrete Alg., 5(1):12-22, 2007.
