Par$$oned Elias- Fano indexes. Giuseppe O)aviano Rossano Venturini

Size: px

Start display at page:

Download "Par$$oned Elias- Fano indexes. Giuseppe O)aviano Rossano Venturini"

Brandon Conley
6 years ago
Views:

1 Par$$oned Elias- Fano indexes Giuseppe O)aviano Rossano Venturini

2 Inverted indexes Core data structure of Informa$on Retrieval Documents are sequences of terms 1: [it is what it is not] 2: [what is a] 3: [it is a banana] 1, 2, 3 are document iden$fiers (docids) We want to find the documents that match a query, that is a set of terms

3 Inverted indexes 1: [it is what it is not] 2: [what is a] 3: [it is a banana] Query: [what] Results: 1, 2 Query: [it AND is] Results: 1, 3 Query: [banana OR not] Results: 1, 3 Query: [what AND is] Results: 1, 2 Query: [what AND not] Results: 1

4 Inverted index Pos$ng list: list of docids containing a term Inverted index: pos$ng lists for each term 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3 banana 3 is 3, 1, 2 it 1, 3 not 1 what 2, 1

5 Inverted indexes 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3 banana 3 is 3, 1, 2 it 1, 3 not 1 what 2, 1 Query: it AND is a 2, 3 banana 3 is 1, 2, 3 it 1, 3 not 1 what 1, 2

6 Docid- sorted inverted indexes Efficient intersec$on One- pass scan of the pos$ng lists Efficient compression Reduces to storage of monotonically increasing lists of integers

7 Index compression Inverted indexes can be very large Example: ClueWeb09 Track B Collec$on 50M documents 15B pos$ngs No compression (32 bits per integer): 60GB Spoiler: with compression, can go as low as 11GB

8 Sequences in pos$ng lists Generally, a pos$ng lists is composed of Sequence of docids: strictly monotonic Sequence of frequencies/scores: strictly posi$ve Sequence of posi$ons: concatena$on of strictly monotonic lists We focus on docids and frequencies

9 Elias Fano! Monotonic Sequences a i + i Docids Strictly monotonic Interpola$ve coding i a i i=0 a i a i - 1 Non- nega$ve Gap coding a i - 1 Scores Strictly non- nega$ve (Posi$ve)

10 Gap compression Replace list of integers with gaps between adjacent integers Since list is increasing, all gaps are posi$ve a 2, 3 banana 3 is 1, 2, 3 it 1, 3 not 1 what 1, 2 a 2, 1 banana 3 is 1, 1, 1 it 1, 2 not 1 what 1, 1

11 Elias- Fano encoding Data structure from the 70s, mostly in the succinct data structures niche Natural encoding of monotonically increasing sequences of integers Recently successfully applied to the representa$on of inverted indexes Used by Facebook Graph Search!

Elias- Fano representa$on Example: 2, 3, 5, 7, 11, 13, 19 w l upper bits 00010 00011 00101 00111 01011 01101 10011 l lower bits 0, 0, 1, 1, 2, 3, 4 Extract

12 Elias- Fano representa$on Example: 2, 3, 5, 7, 11, 13, 19 w l upper bits l lower bits 0, 0, 1, 1, 2, 3, 4 Extract the upper bits 0, 0, 1, 0, 1, 1, 1 Compute the gaps Encode gaps with unary code Elias- Fano representa$on of the sequence

13 Elias- Fano representa$on Example: 2, 3, 5, 7, 11, 13, 19 w l upper bits l lower bits Elias- Fano representa$on of the sequence Maximum upper value: [U / 2 l ] where U is largest sequence value Sequence length is n Example: [19 / 2 2 ] = 4 = 100 In upper bits there are [U / 2 l ] zeros and n ones Space [U / 2 l ] + n + nl bits

14 Elias- Fano representa$on Example: 2, 3, 5, 7, 11, 13, 19 w l upper bits l lower bits Can show that l = [log(u/n)] is op$mal Space [U / 2 l ] + n + nl bits (2 + log(u/n))n bits

15 Elias- Fano representa$on With addi$onal (small) data structures can support fast search and random access Almost- O(1)- $me skips for intersec$on queries (2 + log(u/n))n- bits space independent of values distribu1on! is this a good thing?

16 Term- document matrix Alterna$ve interpreta$on of inverted index a 2, 3 banana 3 is 1, 2, 3 it 1, 3 not 1 what 1, a X X banana X is X X X it X X not X what X X Gaps are distances between the Xs

17 Gaps are usually small Assume that documents from the same domain have similar docids unipi.it/a unipi.it/b unipi.it/c unipi.it/d livorno.it livorno.it/nemici pisa X X X X X Cluster of docids Pos$ng lists contain long runs of very close integers That is, long runs of very small gaps

18 Gap compression Gaps are mostly small integers Small integers can be compressed efficiently with universal codes Example: unary code 1: 1 2: 01 3: 001 4: 0001 Example: 2, 3, 5, 9 - > 2, 1, 2, 4 - >

19 Gap compression Gap compression historically the most common representa$on of inverted indexes Drawbacks Decoding is sequen$al, can be slow Bemer compression - > even slower decoding

20 Elias- Fano and clustering Consider the following two lists 1, 2, 3, 4,, n 1, U n random values between 1 and U Both have n elements and last value U Elias- Fano compresses both to the exact same number of bits! But first list is fare more compressible: it is sufficient to store n and U Elias- Fano doesn t exploit clustering

21 Par11oned Elias- Fano XXX XXXXX XXXXXXXXX X X X X XXXX X XXX Par$$on the sequence into chunks Add pointers to the beginning of each chunk Represent each chunk and the sequence of pointers with Elias- Fano If the chunks approximate the clusters, compression improves

22 Par$$on op$miza$on We want to find, among all the possible par$$ons, the one that takes up less space Exhaus$ve search is exponen1al Dynamic programming can be done quadra1c Our solu$on: (1 + ε)- approximate solu$on in linear $me O(n log(1/ε)/log(1 + ε)) Reduce to a shortest path in a sparsified DAG

23 Par$$on op$miza$on Nodes correspond to sequence elements Edges to poten$al chunks Paths = Sequence par$$ons

24 Par$$on op$miza$on Each edge weight is the cost of the chunk defined by the edge endpoints Shortest path = Minimum cost par$$on Edge costs can be computed in O(1) but number of edges is quadra$c!

25 Idea n i

26 Idea n.1 General DAG sparsifica1on technique Quan$ze edge costs in classes of cost between (1 + ε 1 ) i and (1 + ε 1 ) i + 1 For each node and each cost class, keep only one maximal edge O(log n / log (1 + ε 1 )) edges per node! Shortest path in sparsified DAG at most (1 + ε 1 ) $mes more expensive than in original DAG Sparsified DAG can be computed on the fly

27 Idea n.2 XXXXXXXXX X X X X If we split a chunk at an arbitrary posi$on New cost Old cost + cost of new pointer If chunk is big enough, loss is negligible We keep only edges with cost O(1 / ε 2 ) New DAG has O(n log (1 / ε 2 ) / log (1 + ε 1 )) edges! Approxima$on factor becomes (1 + ε 2 ) (1 + ε 1 )

Dependency on ε 1 No$ce the scale! 4.66 3500 4.65 3000 4.64 2500 4.63 2000 4.

28 Dependency on ε 1 No$ce the scale! Bits per posbng Time in seconds

Dependency on ε 2 Here we go from O(n log n) to O(n) 4.85 350 4.80 300 4.75 250 200 4.

29 Dependency on ε 2 Here we go from O(n log n) to O(n) Bits per posbng Time in seconds

30 Results on GOV2 and ClueWeb09 GOV2 ClueWeb09 Elias- Fano 5.6GB 14GB Par$$oned Elias- Fano 3.0GB 11GB

31 Results on GOV2 and ClueWeb09 Gov2 ClueWeb09 space doc freq space doc freq GB bpi bpi GB bpi bpi EF single 7.66 (+64.7%) 7.53 (+83.4%) 3.14 (+32.4%) (+23.1%) 7.46 (+27.7%) 2.44 (+11.0%) EF uniform 5.17 (+11.2%) 4.63 (+12.9%) 2.58 (+8.4%) (+11.5%) 6.58 (+12.6%) 2.39 (+8.8%) EF -optimal Interpolative 4.57 ( 1.8%) 4.03 ( 1.8%) 2.33 ( 1.8%) ( 8.3%) 5.33 ( 8.8%) 2.04 ( 7.1%) OptPFD 5.22 (+12.3%) 4.72 (+15.1%) 2.55 (+7.4%) (+11.6%) 6.42 (+9.8%) 2.56 (+16.4%) Varint-G8IU (+202.2%) (+158.2%) 8.98 (+278.3%) (+148.3%) (+88.1%) 8.98 (+308.8%) Gov2 ClueWeb09 TREC 05 TREC 06 TREC 05 TREC 06 Gov2 ClueWeb09 TREC 05 TREC 06 TREC 05 TREC 06 EF single 2.1 (+10%) 4.7 (+1%) 13.6 ( 5%) 15.8 ( 9%) EF uniform 2.1 (+9%) 5.1 (+10%) 15.5 (+8%) 18.9 (+9%) EF -optimal Interpolative 7.5 (+291%) 20.4 (+343%) 55.7 (+289%) 76.5 (+341%) OptPFD 2.2 (+14%) 5.7 (+24%) 16.6 (+16%) 21.9 (+26%) Varint-G8IU 1.5 ( 20%) 4.0 ( 13%) 11.1 ( 23%) 14.8 ( 15%) OR queries EF single 80.7 (+8%) (+10%) (+0%) ( 2%) EF uniform 72.1 ( 3%) ( 3%) ( 3%) ( 4%) EF -optimal Interpolative (+62%) (+62%) (+53%) (+51%) OptPFD 69.5 ( 7%) ( 7%) ( 10%) ( 12%) Varint-G8IU 67.4 ( 10%) ( 10%) ( 15%) ( 17%) AND queries

32 Thanks for your amen$on! Ques$ons?

Par$$oned Elias- Fano Indexes

Par$$oned Elias- Fano Indexes Giuseppe O)aviano ISTI- CNR, Pisa Rossano Venturini Università di Pisa Inverted indexes Docid Document 1: [it is what it is not] 2: [what is a] 3: [it is a banana] a 2, 3