Space and Time-Efficient Data Structures for Massive Datasets
1 Space and Time-Efficient Data Structures for Massive Datasets. Giulio Ermanno Pibiri. Supervisor: Rossano Venturini. Computer Science Department, University of Pisa. 17/10/2016
5 Evidence. The increase of information does not scale with technology. Even more relevant today! "Software is getting slower more rapidly than hardware becomes faster." (Niklaus Wirth, A Plea for Lean Software)
10 Scenario. Algorithms give EFFICIENCY: how much work is required by a program (less work?). Data Structures give PERFORMANCE: how quickly a program does its work (faster work), gaining time at the cost of space. Data Compression gains space at the cost of time.
13 Dichotomy Problem. Small VS Fast? Choose one. NO.
15 High Level Thesis. Data Structures + Data Compression = Faster Algorithms. "Space optimization is closely related to time optimization in a disk memory." (Donald E. Knuth, The Art of Computer Programming)
19 Hierarchical Memory Organisation. Moving away from the CPU, capacity grows while speed drops.
        registers   L1       L2       RAM       DISK
Size:   bits        32 KB    256 KB   4-32 GB   1 TB
Speed:  1 ns        0.5 ns   7 ns     100 ns    1-10 ms
(gaps of 14x and 200X between levels)
Numbers are taken from:
22 Proposal. Design space-efficient ad-hoc data structures, both from a theoretical and practical perspective, that support fast data extraction. Must exploit properties of the addressed problem. Data compression & fast retrieval together. Mature algorithmic solutions now ready for technology transfer.
30 Why? Who cares? Because engineered data structures are the ones that boost the availability and wealth of information around us. Examples: Inverted Indexes, N-grams, B-trees.
35 Wait! Can't we use existing libraries? The C++ Standard Template Library (STL)? std::list, std::stack, std::queue, std::map, std::unordered_map. Prefer cache-friendly (contiguous) data structures. Always. Use std::vector.
47 Example: vector of strings VS string pool. Prefer cache-friendly data structures. A vector of pointers to strings scattered in memory: every access is a cache miss, pointer chasing :-( (~5.2 GB). A vector of offsets into a single string pool: offsets instead of pointers, contiguous memory layout (~3 GB).
52 Inverted Indexes 1. Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T. For each term t in T we store in a list Lt the identifiers of the documents in which t appears. The collection of all inverted lists {Lt1, ..., Lt|T|} is the inverted index.
Documents: 1: red is always good; 2: the house is red; 3: the house is always hungry; 4: boy is red; 5: the boy is hungry.
Lexicon: T = {always, boy, good, house, hungry, is, red, the} = {t1, ..., t8}.
Lists: Lt1 = [1, 3], Lt2 = [4, 5], Lt3 = [1], Lt4 = [2, 3], Lt5 = [3, 5], Lt6 = [1, 2, 3, 4, 5], Lt7 = [1, 2, 4], Lt8 = [2, 3, 5].
57 Inverted Indexes 2. Inverted indexes owe their popularity to the efficient resolution of queries such as: return all documents in which the terms {t1, ..., tk} occur. On the toy collection above, the query q = {boy, is, the} is answered by intersecting the posting lists Lt2 = [4, 5], Lt6 = [1, 2, 3, 4, 5] and Lt8 = [2, 3, 5], which yields document 5.
60 General Problem. Consider a sequence S[0,n) of n positive and monotonically increasing integers, i.e., S[i] <= S[i+1] for 0 <= i < n-1, possibly repeated. How can we represent it as a bit vector in which each original integer is self-delimited, using as few bits as possible? A huge research corpus describes different space/time trade-offs: Elias gamma/delta [Elias, TIT 1975]; Variable Byte [Salomon, Springer 2007]; Binary Interpolative Coding [Moffat and Stuiver, IRJ 2000]; Simple-9 [Anh and Moffat, IRJ 2005]; PForDelta [Zukowski et al., ICDE 2006]; OptPFD [Yan et al., WWW 2009]; Simple-16 [Anh and Moffat, SPE 2010]; Varint-G8IU [Stepanov et al., CIKM 2011]; Elias-Fano [Vigna, WSDM 2013]; Partitioned Elias-Fano [Ottaviano and Venturini, SIGIR 2014].
65 September 7, 2016. For Firefly, Dropbox's full-text search engine, speed has always been a priority. They were unable to scale because of the size of their (distributed) inverted index. Consequence? Query latencies deteriorated from 250 ms to 1 s. Solution? Compress the index to reduce I/O pressure. Data Structures + Data Compression = Faster Algorithms.
67 Elias-Fano - Genesis. Peter Elias [ ], Robert Fano [ ]. Robert Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971). Peter Elias. Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM (JACM) 21, 2 (1974). Sebastiano Vigna. Quasi-Succinct Indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM) (2013). 40 years later!
81 Elias-Fano - Encoding example. (The bit-level figure did not survive transcription.) Running example: a sorted sequence of n = 8 integers drawn from a universe u = 43. Each integer is split into a low part, its lg(u/n) lowest bits, stored consecutively in a bitvector L, and a high part, its remaining bits (lg n of them), encoded in negated unary, together with the missing high bits, in a bitvector H.
96 Elias-Fano - Properties.
EF(S[0,n)) = n lg(u/n) + 2n bits, less than half a bit per element away from the information-theoretic minimum [Elias, JACM 1974]: letting X be the set of all monotone sequences of length n drawn from a universe u, |X| = (u+n choose n) and lg (u+n choose n) is approximately n lg((u+n)/n).
Access to each S[i] in O(1) worst-case.
predecessor(x) = max{S[i] : S[i] < x} and successor(x) = min{S[i] : S[i] >= x} queries in O(lg(u/n)) worst-case.
But: o(n) more bits are needed to support fast rank/select primitives on bitvector H.
99 Clustered Elias-Fano Indexes 1. Every encoder represents each sequence individually: no exploitation of redundancy. Idea: encode clusters of posting lists.
107 Clustered Elias-Fano Indexes 2. Take a cluster of posting lists and synthesise a reference list R for it. Encoding a posting against the unbounded universe costs lg u bits, VS encoding it against the reference list, which costs lg |R| bits with |R| << u. Problems: 1. how to build the clusters; 2. how to synthesise the reference list. NP-hard already for a simplified formulation.
112 Clustered Elias-Fano Indexes 3. Time VS space trade-offs by varying the reference size. Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%). Time: much faster than BIC (103% on average), slightly slower than PEF (20% on average).
117 Integer Data Structures 1. Elias-Fano matches the information-theoretic minimum: n lg(u/n) + 2n + o(n) bits, with O(1) random access and O(lg(u/n)) predecessor/successor. But it is a static succinct data structure: NO dynamic updates. Dynamic alternatives: vEB Trees [van Emde Boas, FOCS 1975]; x/y-fast Tries [Willard, IPL 1983]; Fusion Trees [Fredman and Willard, JCSS 1993]; Exponential Search Trees [Andersson and Thorup, JACM 2007]. Most of them take optimal time, but O(n lg u) bits (or even worse).
124 Integer Data Structures - Problems and Results 2.
The (general) Dictionary problem. The dynamic dictionary problem consists in representing a set S of n objects so that the following operations are supported:
insert(x) inserts x in S;
delete(x) deletes x from S;
search(x) checks whether x belongs to S;
minimum() returns the minimum element of S;
maximum() returns the maximum element of S;
predecessor(x) returns max{y in S : y < x};
successor(x) returns min{y in S : y >= x}.
The Dynamic List Representation problem [Fredman and Saks, STOC 1989]. Given a list S of n sorted integers, support the following operations, under the assumption that w = lg^gamma(n) for some gamma:
access(i) returns the i-th smallest element of S;
insert(x) inserts x in S;
delete(x) deletes x from S.
Lower bound: Omega(lg n / lg lg n) amortized time per operation, in the cell-probe computational model.
[Patrascu and Thorup, STOC 2007] Optimal space/time trade-off for a static data structure taking m = n 2^a w bits, where a is the number of bits necessary to represent the mean number of bits per integer, i.e., a = lg(m/n) - lg w.
129 Dynamic Integer Sets in Succinct Space and Optimal Time - Goals. Keep the space at n lg(u/n) + 2n + o(n) bits, plus o(n) bits: negligible redundancy! 1. Extend the static Elias-Fano representation to support predecessor and successor queries in optimal worst-case O(lg lg n) time. 2. Maintain S in a fully dynamic fashion, supporting in optimal worst-case time all the operations defined in the Dynamic Dictionary and Dynamic List Representation problems.
131 Results - Static Elias-Fano, Optimal Successor Queries 1. Ingredients: the optimal time/space trade-off for successor search [Patrascu and Thorup, STOC 2007] and y-fast tries [Willard, IPL 1983]. Idea: divide the sequence into blocks and use a y-fast trie to index the blocks.
133 Results - Dynamic Elias-Fano 2. Ingredients: the optimal time/space trade-off for successor search [Patrascu and Thorup, STOC 2007], y-fast tries [Willard, IPL 1983] and a dynamic prefix-sum data structure [Bille et al., arXiv preprint 2015]. Idea: use a 2-level indexing data structure. The first level indexes blocks using a y-fast trie and the dynamic prefix-sum data structure by Bille et al.; the second level indexes mini-blocks using the data structure of the Lemma.
134 N-grams 1
Strings of at most N words. N typically ranges from 1 to 5.
Example, for the text "different algorithms devised to":
N = 1: different / algorithms / devised
N = 2: different algorithms / algorithms devised / devised to
Books (~6% of the books ever published):
N   number of grams
1   24,359,...
2   667,284,...
3   7,397,041,...
4   1,644,807,...
5   1,415,355,596
More than 11 billion grams. 27
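The extraction step behind such datasets can be sketched as follows, reproducing the slide's example sentence (plain in-memory Python; datasets of this scale need external-memory tooling):

```python
# Extract and count all N-grams (N = 1..max_n) from a token stream.

from collections import Counter

def extract_ngrams(tokens, max_n=5):
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = extract_ngrams("different algorithms devised to".split(), max_n=2)
print(sorted(g for g in counts if len(g) == 2))
# [('algorithms', 'devised'), ('devised', 'to'), ('different', 'algorithms')]
```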
159 N-grams - Why? 2
Word prediction: given the context "space and time-efficient", rank the candidate continuations ("algorithms", "data structures", ...) by their frequency counts.
P(data structures | space and time-efficient) ~ f(space and time-efficient data structures) / f(space and time-efficient) 28
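The estimate above as code (the counts are made up for illustration):

```python
# Maximum-likelihood estimate of the next word given a context:
# the ratio of two frequency counts, as on the slide.

def p_next(word, context, f):
    """P(word | context) estimated as f(context + word) / f(context)."""
    return f[tuple(context) + (word,)] / f[tuple(context)]

f = {
    ("space", "and", "time-efficient"): 10,
    ("space", "and", "time-efficient", "data"): 6,
}
print(p_next("data", ("space", "and", "time-efficient"), f))  # 0.6
```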
165 N-grams - Who cares? 3 29
168 What can I help you with? - Siri 30
170 N-grams - Challenge 4
Store massive N-gram datasets such that, given a pattern, we can return its frequency count at light speed. In short: an efficient map.
Data Structures + Data Compression = Faster Algorithms 31
173 N-grams - Data structures 5
Open addressing (+ time - space) VS Tries (+ space - time).
(Figure: a gram such as "BBC" is hashed by a function h into a table; a trie over the vocabulary {A, B, C, D} shares common prefixes across grams.)
Several software libraries: KenLM [Heafield, 2011], BerkeleyLM [Pauls and Klein, 2011], IRSTLM [Federico et al., 2008], RandLM [Talbot and Osborne, 2007], Get1T [Hawker et al., 2007], SRILM [Stolcke, 2002]. 32
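A toy comparison behind the trade-off above (illustrative code, not from the thesis): a hash table stores every gram in full, while a trie lets grams with equal prefixes share a single path.

```python
# Compare the number of stored words in a hash table (every gram explicit)
# against the number of nodes in a nested-dict trie over the same grams.

def trie_insert(root, gram):
    """Insert a gram (tuple of words) into a nested-dict trie."""
    node = root
    for word in gram:
        node = node.setdefault(word, {})

def count_nodes(node):
    """Total number of trie nodes reachable from node."""
    return len(node) + sum(count_nodes(child) for child in node.values())

grams = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
words_in_hash = sum(len(g) for g in grams)   # every word stored explicitly

trie = {}
for g in grams:
    trie_insert(trie, g)
print(words_in_hash, count_nodes(trie))      # 8 6: the trie shares the "A" prefix
```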
186 tongrams - Tons of N-Grams 1
Hash-based VS Trie-based.
Hash-based: open addressing? The data structure is static, so use Minimal Perfect Hashing instead.
Trie-based: map each word to an integer id through the vocabulary (A -> 0, B -> 1, C -> 2, D -> 3); each trie level then stores a sorted sequence of ids together with the pointers delimiting the children of every node. Encode each level with Elias-Fano, which supports random access. 33
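A hedged sketch of that trie layout (plain Python lists stand in for the Elias-Fano-compressed arrays of the thesis; the data below is illustrative):

```python
# Each level stores the sorted word ids of its nodes plus a pointer array
# delimiting, for every node, the range of its children in the next level.

from bisect import bisect_left

class CountTrie:
    def __init__(self, vocab, levels):
        self.vocab = vocab    # word -> integer id
        self.levels = levels  # per level: (ids, pointers, counts)

    def count(self, gram):
        lo, hi = 0, len(self.levels[0][0])
        for depth, word in enumerate(gram):
            ids, pointers, counts = self.levels[depth]
            wid = self.vocab[word]
            pos = bisect_left(ids, wid, lo, hi)  # search within the node's range
            if pos >= hi or ids[pos] != wid:
                return 0                         # gram not in the dataset
            if depth == len(gram) - 1:
                return counts[pos]
            lo, hi = pointers[pos], pointers[pos + 1]

vocab = {"A": 0, "B": 1, "C": 2, "D": 3}
levels = [
    ([0, 1, 2, 3], [0, 2, 3, 3, 4], [10, 8, 6, 4]),  # unigram level
    ([1, 3, 2, 0], None, [5, 2, 7, 1]),              # bigram (leaf) level
]
t = CountTrie(vocab, levels)
print(t.count(("A", "D")), t.count(("C", "A")))      # 2 0
```

Each level is a sorted sequence within every parent's range, which is exactly what makes Elias-Fano with random access applicable.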
199 tongrams - Preliminary results 2
(Chart omitted; highlighted factors: 2.6x and 3.8x.) 34
202 (Some) Future Research Problems 1: Dynamic Inverted Indexes.
Classic solution: use two indexes. One is big and cold; the other is small and hot. Merge them periodically. 35
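The classic two-index scheme can be sketched as follows (the class name and merge threshold are illustrative, not from the thesis):

```python
# New postings go to a small in-memory "hot" index; the large "cold" index
# stays immutable between periodic merges; queries consult both.

class DynamicIndex:
    def __init__(self, merge_threshold=4):
        self.cold = {}   # big, cold: term -> sorted docid list
        self.hot = {}    # small, hot: term -> docid list
        self.hot_postings = 0
        self.merge_threshold = merge_threshold

    def add(self, term, docid):
        self.hot.setdefault(term, []).append(docid)
        self.hot_postings += 1
        if self.hot_postings >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Periodically fold the hot index into the cold one."""
        for term, docs in self.hot.items():
            self.cold[term] = sorted(self.cold.get(term, []) + docs)
        self.hot, self.hot_postings = {}, 0

    def postings(self, term):
        """A query consults both indexes."""
        return sorted(self.cold.get(term, []) + self.hot.get(term, []))

idx = DynamicIndex()
for docid, text in enumerate(["small fast", "small space", "fast time"]):
    for term in set(text.split()):
        idx.add(term, docid)
print(idx.postings("small"))   # [0, 1]
```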
205 (Some) Future Research Problems 2: Compressed B-trees.
Problem: maintain a dictionary on disk. Motivations: databases and file systems.
"Fancy indexing structures may be a luxury now, but they will be essential by the decade's end." - Michael Bender (Stony Brook University), Martin Farach-Colton (Rutgers University), Bradley Kuszmaul (MIT Laboratory for Computer Science) 36
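A back-of-the-envelope for why B-trees suit disks (the node and entry sizes below are assumptions for illustration):

```python
# With 4 KiB nodes and 16-byte entries the fan-out is 256, so even a
# billion keys fit in a tree of height ~4: few disk accesses per lookup.

import math

def btree_height(n_keys, node_bytes=4096, entry_bytes=16):
    fanout = node_bytes // entry_bytes
    return math.ceil(math.log(n_keys, fanout))

print(btree_height(10**9))   # 4
```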
209 (Some) Future Research Problems 3: Fast Successor for IP-lookup.
Successor search is what routers do for every incoming packet, which arguably makes it the most frequently run algorithm in the world. Time and space efficiency is crucial. 37
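To see why routing is a successor/predecessor problem, consider non-overlapping prefixes: each prefix is an address interval, and a lookup is a predecessor search among the interval start points (a simplification: real tables have nested prefixes and need the full reduction; the prefixes below are examples).

```python
# Longest-prefix match over non-overlapping IPv4 prefixes, reduced to a
# predecessor search (bisect) among the sorted range start addresses.

from bisect import bisect_right

def build_table(prefixes):
    """prefixes: list of (base_address, prefix_length, next_hop), 32-bit."""
    table = sorted((base, base + (1 << (32 - plen)) - 1, hop)
                   for base, plen, hop in prefixes)
    starts = [base for base, _, _ in table]
    return starts, table

def route(addr, starts, table):
    i = bisect_right(starts, addr) - 1    # predecessor among range starts
    if i >= 0 and table[i][0] <= addr <= table[i][1]:
        return table[i][2]
    return "default"

# 10.0.0.0/8 -> hop A, 192.168.0.0/16 -> hop B (example prefixes)
prefixes = [(10 << 24, 8, "A"), ((192 << 24) | (168 << 16), 16, "B")]
starts, table = build_table(prefixes)
print(route((10 << 24) + 1234, starts, table))   # A
```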
More informationCS425: Algorithms for Web Scale Data
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges
More informationON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION
ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced
More informationCount-Min Tree Sketch: Approximate counting for NLP
Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26
More informationCorrelation Autoencoder Hashing for Supervised Cross-Modal Search
Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia
More informationCS483 Design and Analysis of Algorithms
CS483 Design and Analysis of Algorithms Lectures 2-3 Algorithms with Numbers Instructor: Fei Li lifei@cs.gmu.edu with subject: CS483 Office hours: STII, Room 443, Friday 4:00pm - 6:00pm or by appointments
More informationNearest Neighbor Search with Keywords: Compression
Nearest Neighbor Search with Keywords: Compression Yufei Tao KAIST June 3, 2013 In this lecture, we will continue our discussion on: Problem (Nearest Neighbor Search with Keywords) Let P be a set of points
More informationIndexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile
Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:
More informationFast Text Compression with Neural Networks
Fast Text Compression with Neural Networks Matthew V. Mahoney Florida Institute of Technology 150 W. University Blvd. Melbourne FL 32901 mmahoney@cs.fit.edu Abstract Neural networks have the potential
More information10 van Emde Boas Trees
Dynamic Set Data Structure S: S. insert(x) S. delete(x) S. search(x) S. min() S. max() S. succ(x) S. pred(x) Ernst Mayr, Harald Räcke 389 For this chapter we ignore the problem of storing satellite data:
More informationImproved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationmd5bloom: Forensic Filesystem Hashing Revisited
DIGITAL FORENSIC RESEARCH CONFERENCE md5bloom: Forensic Filesystem Hashing Revisited By Vassil Roussev, Timothy Bourg, Yixin Chen, Golden Richard Presented At The Digital Forensic Research Conference DFRWS
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationProblem. Maintain the above set so as to support certain
Randomized Data Structures Roussos Ioannis Fundamental Data-structuring Problem Collection {S 1,S 2, } of sets of items. Each item i is an arbitrary record, indexed by a key k(i). Each k(i) is drawn from
More informationSpace-Efficient Construction Algorithm for Circular Suffix Tree
Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes
More information