Space and Time-Efficient Data Structures for Massive Datasets
1 Space and Time-Efficient Data Structures for Massive Datasets. Giulio Ermanno Pibiri. Supervisor: Rossano Venturini. Computer Science Department, University of Pisa. 17/10/2016
5 Evidence. The increase of information does not scale with technology. Even more relevant today! "Software is getting slower more rapidly than hardware becomes faster." (Niklaus Wirth, A Plea for Lean Software)
10 Scenario. Algorithms give EFFICIENCY: how much work is required by a program (less work?). Data Structures give PERFORMANCE: how quickly a program does its work (faster work), gaining time at the cost of space. Data Compression gains space at the cost of time.
13 Dichotomy Problem. Small VS Fast? Choose one. NO.
15 High Level Thesis. Data Structures + Data Compression = Faster Algorithms. "Space optimization is closely related to time optimization in a disk memory." (Donald E. Knuth, The Art of Computer Programming)
19 Hierarchical Memory Organisation. Moving away from the CPU, capacity grows while speed drops.
        registers   L1       L2       RAM       DISK
Size:   bits        32 KB    256 KB   4-32 GB   1 TB
Speed:  1 ns        0.5 ns   7 ns     100 ns    1-10 ms
(gaps of 14x and 200X between levels)
Numbers are taken from:
22 Proposal. Design space-efficient ad-hoc data structures, both from a theoretical and practical perspective, that support fast data extraction. Must exploit properties of the addressed problem. Data compression & fast retrieval together. Mature algorithmic solutions now ready for technology transfer.
30 Why? Who cares? Because engineered data structures are the ones that boost the availability and wealth of information around us. Examples: Inverted Indexes, N-grams, B-trees.
35 Wait! Can't we use existing libraries? The C++ Standard Template Library (STL)? std::list, std::stack, std::queue, std::map, std::unordered_map. Prefer cache-friendly (contiguous) data structures. Always. Use std::vector.
47 Example: vector of strings VS string pool. Prefer cache-friendly data structures. A vector of pointers to strings scattered in memory: every access is a cache miss, pointer chasing :-( (~5.2 GB). A vector of offsets into a single string pool: offsets instead of pointers, contiguous memory layout (~3 GB).
52 Inverted Indexes 1. Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T. For each term t in T we store in a list Lt the identifiers of the documents in which t appears. The collection of all inverted lists {Lt1, ..., Lt|T|} is the inverted index.
Documents: 1: red is always good; 2: the house is red; 3: the house is always hungry; 4: boy is red; 5: the boy is hungry.
Lexicon: T = {always, boy, good, house, hungry, is, red, the} = {t1, ..., t8}.
Lists: Lt1 = [1, 3], Lt2 = [4, 5], Lt3 = [1], Lt4 = [2, 3], Lt5 = [3, 5], Lt6 = [1, 2, 3, 4, 5], Lt7 = [1, 2, 4], Lt8 = [2, 3, 5].
57 Inverted Indexes 2. Inverted indexes owe their popularity to the efficient resolution of queries such as: return all documents in which the terms {t1, ..., tk} occur. On the toy collection above, the query q = {boy, is, the} is answered by intersecting the posting lists Lt2 = [4, 5], Lt6 = [1, 2, 3, 4, 5] and Lt8 = [2, 3, 5], which yields document 5.
60 General Problem. Consider a sequence S[0,n) of n positive and monotonically increasing integers, i.e., S[i] <= S[i+1] for 0 <= i < n-1, possibly repeated. How can we represent it as a bit vector in which each original integer is self-delimited, using as few bits as possible? A huge research corpus describes different space/time trade-offs: Elias gamma/delta [Elias, TIT 1975]; Variable Byte [Salomon, Springer 2007]; Binary Interpolative Coding [Moffat and Stuiver, IRJ 2000]; Simple-9 [Anh and Moffat, IRJ 2005]; PForDelta [Zukowski et al., ICDE 2006]; OptPFD [Yan et al., WWW 2009]; Simple-16 [Anh and Moffat, SPE 2010]; Varint-G8IU [Stepanov et al., CIKM 2011]; Elias-Fano [Vigna, WSDM 2013]; Partitioned Elias-Fano [Ottaviano and Venturini, SIGIR 2014].
65 September 7, 2016. For Firefly, Dropbox's full-text search engine, speed has always been a priority. They were unable to scale because of the size of their (distributed) inverted index. Consequence? Query latencies deteriorated from 250 ms to 1 s. Solution? Compress the index to reduce I/O pressure. Data Structures + Data Compression = Faster Algorithms.
67 Elias-Fano - Genesis. Peter Elias [ ], Robert Fano [ ]. Robert Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971). Peter Elias. Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM (JACM) 21, 2 (1974). Sebastiano Vigna. Quasi-Succinct Indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM) (2013). 40 years later!
81 Elias-Fano - Encoding example. (The bit-level figure did not survive transcription.) Running example: a sorted sequence of n = 8 integers drawn from a universe u = 43. Each integer is split into a low part, its lg(u/n) lowest bits, stored consecutively in a bitvector L, and a high part, its remaining bits (lg n of them), encoded in negated unary, together with the missing high bits, in a bitvector H.
96 Elias-Fano - Properties.
EF(S[0,n)) = n lg(u/n) + 2n bits, less than half a bit per element away from the information-theoretic minimum [Elias, JACM 1974]: letting X be the set of all monotone sequences of length n drawn from a universe u, |X| = (u+n choose n) and lg (u+n choose n) is approximately n lg((u+n)/n).
Access to each S[i] in O(1) worst-case.
predecessor(x) = max{S[i] : S[i] < x} and successor(x) = min{S[i] : S[i] >= x} queries in O(lg(u/n)) worst-case.
But: o(n) more bits are needed to support fast rank/select primitives on bitvector H.
99 Clustered Elias-Fano Indexes 1. Every encoder represents each sequence individually: no exploitation of redundancy. Idea: encode clusters of posting lists.
107 Clustered Elias-Fano Indexes 2. Take a cluster of posting lists and synthesise a reference list R for it. Encoding a posting against the unbounded universe costs lg u bits, VS encoding it against the reference list, which costs lg |R| bits with |R| << u. Problems: 1. how to build the clusters; 2. how to synthesise the reference list. NP-hard already for a simplified formulation.
112 Clustered Elias-Fano Indexes 3. Time VS space trade-offs by varying the reference size. Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%). Time: much faster than BIC (103% on average), slightly slower than PEF (20% on average).
117 Integer Data Structures 1. Elias-Fano matches the information-theoretic minimum: n lg(u/n) + 2n + o(n) bits, with O(1) random access and O(lg(u/n)) predecessor/successor. But it is a static succinct data structure: NO dynamic updates. Dynamic alternatives: vEB Trees [van Emde Boas, FOCS 1975]; x/y-fast Tries [Willard, IPL 1983]; Fusion Trees [Fredman and Willard, JCSS 1993]; Exponential Search Trees [Andersson and Thorup, JACM 2007]. Most of them take optimal time, but O(n lg u) bits (or even worse).
124 Integer Data Structures - Problems and Results 2.
The (general) Dictionary problem. The dynamic dictionary problem consists in representing a set S of n objects so that the following operations are supported:
insert(x) inserts x in S;
delete(x) deletes x from S;
search(x) checks whether x belongs to S;
minimum() returns the minimum element of S;
maximum() returns the maximum element of S;
predecessor(x) returns max{y in S : y < x};
successor(x) returns min{y in S : y >= x}.
The Dynamic List Representation problem [Fredman and Saks, STOC 1989]. Given a list S of n sorted integers, support the following operations, under the assumption that w = lg^gamma(n) for some gamma:
access(i) returns the i-th smallest element of S;
insert(x) inserts x in S;
delete(x) deletes x from S.
Lower bound: Omega(lg n / lg lg n) amortized time per operation, in the cell-probe computational model.
[Patrascu and Thorup, STOC 2007] Optimal space/time trade-off for a static data structure taking m = n 2^a w bits, where a is the number of bits necessary to represent the mean number of bits per integer, i.e., a = lg(m/n) - lg w.
129 Dynamic Integer Sets in Succinct Space and Optimal Time - Goals. Keep the space at n lg(u/n) + 2n + o(n) bits, plus o(n) bits: negligible redundancy! 1. Extend the static Elias-Fano representation to support predecessor and successor queries in optimal worst-case O(lg lg n) time. 2. Maintain S in a fully dynamic fashion, supporting in optimal worst-case time all the operations defined in the Dynamic Dictionary and Dynamic List Representation problems.
131 Results - Static Elias-Fano, Optimal Successor Queries 1. Ingredients: the optimal time/space trade-off for successor search [Patrascu and Thorup, STOC 2007] and y-fast tries [Willard, IPL 1983]. Idea: divide the sequence into blocks and use a y-fast trie to index the blocks.
133 Results - Dynamic Elias-Fano 2. Ingredients: the optimal time/space trade-off for successor search [Patrascu and Thorup, STOC 2007], y-fast tries [Willard, IPL 1983] and a dynamic prefix-sum data structure [Bille et al., arXiv preprint 2015]. Idea: use a 2-level indexing data structure. The first level indexes blocks using a y-fast trie and the dynamic prefix-sum data structure by Bille et al.; the second level indexes mini-blocks using the data structure of the Lemma.
134 N-grams 1
Strings of at most N words. N typically ranges from 1 to 5.
Example, for the text "different algorithms devised to":
N = 1: different / algorithms / devised
N = 2: different algorithms / algorithms devised / devised to
Books (~6% of the books ever published):
N   number of grams
1   24,359,...
2   667,284,...
3   7,397,041,...
4   1,644,807,...
5   1,415,355,596
More than 11 billion grams. 27
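The extraction step behind such datasets can be sketched as follows, reproducing the slide's example sentence (plain in-memory Python; datasets of this scale need external-memory tooling):

```python
# Extract and count all N-grams (N = 1..max_n) from a token stream.

from collections import Counter

def extract_ngrams(tokens, max_n=5):
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts

counts = extract_ngrams("different algorithms devised to".split(), max_n=2)
print(sorted(g for g in counts if len(g) == 2))
# [('algorithms', 'devised'), ('devised', 'to'), ('different', 'algorithms')]
```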
159 N-grams - Why? 2
Word prediction: given the context "space and time-efficient", rank the candidate continuations ("algorithms", "data structures", ...) by their frequency counts.
P(data structures | space and time-efficient) ~ f(space and time-efficient data structures) / f(space and time-efficient) 28
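The estimate above as code (the counts are made up for illustration):

```python
# Maximum-likelihood estimate of the next word given a context:
# the ratio of two frequency counts, as on the slide.

def p_next(word, context, f):
    """P(word | context) estimated as f(context + word) / f(context)."""
    return f[tuple(context) + (word,)] / f[tuple(context)]

f = {
    ("space", "and", "time-efficient"): 10,
    ("space", "and", "time-efficient", "data"): 6,
}
print(p_next("data", ("space", "and", "time-efficient"), f))  # 0.6
```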
165 N-grams - Who cares? 3 29
168 What can I help you with? - Siri 30
170 N-grams - Challenge 4
Store massive N-gram datasets such that, given a pattern, we can return its frequency count at light speed. In short: an efficient map.
Data Structures + Data Compression = Faster Algorithms 31
173 N-grams - Data structures 5
Open addressing (+ time - space) VS Tries (+ space - time).
(Figure: a gram such as "BBC" is hashed by a function h into a table; a trie over the vocabulary {A, B, C, D} shares common prefixes across grams.)
Several software libraries: KenLM [Heafield, 2011], BerkeleyLM [Pauls and Klein, 2011], IRSTLM [Federico et al., 2008], RandLM [Talbot and Osborne, 2007], Get1T [Hawker et al., 2007], SRILM [Stolcke, 2002]. 32
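A toy comparison behind the trade-off above (illustrative code, not from the thesis): a hash table stores every gram in full, while a trie lets grams with equal prefixes share a single path.

```python
# Compare the number of stored words in a hash table (every gram explicit)
# against the number of nodes in a nested-dict trie over the same grams.

def trie_insert(root, gram):
    """Insert a gram (tuple of words) into a nested-dict trie."""
    node = root
    for word in gram:
        node = node.setdefault(word, {})

def count_nodes(node):
    """Total number of trie nodes reachable from node."""
    return len(node) + sum(count_nodes(child) for child in node.values())

grams = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")]
words_in_hash = sum(len(g) for g in grams)   # every word stored explicitly

trie = {}
for g in grams:
    trie_insert(trie, g)
print(words_in_hash, count_nodes(trie))      # 8 6: the trie shares the "A" prefix
```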
186 tongrams - Tons of N-Grams 1
Hash-based VS Trie-based.
Hash-based: open addressing? The data structure is static, so use Minimal Perfect Hashing instead.
Trie-based: map each word to an integer id through the vocabulary (A -> 0, B -> 1, C -> 2, D -> 3); each trie level then stores a sorted sequence of ids together with the pointers delimiting the children of every node. Encode each level with Elias-Fano, which supports random access. 33
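A hedged sketch of that trie layout (plain Python lists stand in for the Elias-Fano-compressed arrays of the thesis; the data below is illustrative):

```python
# Each level stores the sorted word ids of its nodes plus a pointer array
# delimiting, for every node, the range of its children in the next level.

from bisect import bisect_left

class CountTrie:
    def __init__(self, vocab, levels):
        self.vocab = vocab    # word -> integer id
        self.levels = levels  # per level: (ids, pointers, counts)

    def count(self, gram):
        lo, hi = 0, len(self.levels[0][0])
        for depth, word in enumerate(gram):
            ids, pointers, counts = self.levels[depth]
            wid = self.vocab[word]
            pos = bisect_left(ids, wid, lo, hi)  # search within the node's range
            if pos >= hi or ids[pos] != wid:
                return 0                         # gram not in the dataset
            if depth == len(gram) - 1:
                return counts[pos]
            lo, hi = pointers[pos], pointers[pos + 1]

vocab = {"A": 0, "B": 1, "C": 2, "D": 3}
levels = [
    ([0, 1, 2, 3], [0, 2, 3, 3, 4], [10, 8, 6, 4]),  # unigram level
    ([1, 3, 2, 0], None, [5, 2, 7, 1]),              # bigram (leaf) level
]
t = CountTrie(vocab, levels)
print(t.count(("A", "D")), t.count(("C", "A")))      # 2 0
```

Each level is a sorted sequence within every parent's range, which is exactly what makes Elias-Fano with random access applicable.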
199 tongrams - Preliminary results 2
(Chart omitted; highlighted factors: 2.6x and 3.8x.) 34
202 (Some) Future Research Problems 1: Dynamic Inverted Indexes.
Classic solution: use two indexes. One is big and cold; the other is small and hot. Merge them periodically. 35
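The classic two-index scheme can be sketched as follows (the class name and merge threshold are illustrative, not from the thesis):

```python
# New postings go to a small in-memory "hot" index; the large "cold" index
# stays immutable between periodic merges; queries consult both.

class DynamicIndex:
    def __init__(self, merge_threshold=4):
        self.cold = {}   # big, cold: term -> sorted docid list
        self.hot = {}    # small, hot: term -> docid list
        self.hot_postings = 0
        self.merge_threshold = merge_threshold

    def add(self, term, docid):
        self.hot.setdefault(term, []).append(docid)
        self.hot_postings += 1
        if self.hot_postings >= self.merge_threshold:
            self.merge()

    def merge(self):
        """Periodically fold the hot index into the cold one."""
        for term, docs in self.hot.items():
            self.cold[term] = sorted(self.cold.get(term, []) + docs)
        self.hot, self.hot_postings = {}, 0

    def postings(self, term):
        """A query consults both indexes."""
        return sorted(self.cold.get(term, []) + self.hot.get(term, []))

idx = DynamicIndex()
for docid, text in enumerate(["small fast", "small space", "fast time"]):
    for term in set(text.split()):
        idx.add(term, docid)
print(idx.postings("small"))   # [0, 1]
```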
205 (Some) Future Research Problems 2: Compressed B-trees.
Problem: maintain a dictionary on disk. Motivations: databases and file systems.
"Fancy indexing structures may be a luxury now, but they will be essential by the decade's end." - Michael Bender (Stony Brook University), Martin Farach-Colton (Rutgers University), Bradley Kuszmaul (MIT Laboratory for Computer Science) 36
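A back-of-the-envelope for why B-trees suit disks (the node and entry sizes below are assumptions for illustration):

```python
# With 4 KiB nodes and 16-byte entries the fan-out is 256, so even a
# billion keys fit in a tree of height ~4: few disk accesses per lookup.

import math

def btree_height(n_keys, node_bytes=4096, entry_bytes=16):
    fanout = node_bytes // entry_bytes
    return math.ceil(math.log(n_keys, fanout))

print(btree_height(10**9))   # 4
```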
209 (Some) Future Research Problems 3: Fast Successor for IP-lookup.
Successor search is what routers do for every incoming packet, which arguably makes it the most frequently run algorithm in the world. Time and space efficiency is crucial. 37
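To see why routing is a successor/predecessor problem, consider non-overlapping prefixes: each prefix is an address interval, and a lookup is a predecessor search among the interval start points (a simplification: real tables have nested prefixes and need the full reduction; the prefixes below are examples).

```python
# Longest-prefix match over non-overlapping IPv4 prefixes, reduced to a
# predecessor search (bisect) among the sorted range start addresses.

from bisect import bisect_right

def build_table(prefixes):
    """prefixes: list of (base_address, prefix_length, next_hop), 32-bit."""
    table = sorted((base, base + (1 << (32 - plen)) - 1, hop)
                   for base, plen, hop in prefixes)
    starts = [base for base, _, _ in table]
    return starts, table

def route(addr, starts, table):
    i = bisect_right(starts, addr) - 1    # predecessor among range starts
    if i >= 0 and table[i][0] <= addr <= table[i][1]:
        return table[i][2]
    return "default"

# 10.0.0.0/8 -> hop A, 192.168.0.0/16 -> hop B (example prefixes)
prefixes = [(10 << 24, 8, "A"), ((192 << 24) | (168 << 16), 16, "B")]
starts, table = build_table(prefixes)
print(route((10 << 24) + 1234, starts, table))   # A
```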
More informationCS425: Algorithms for Web Scale Data
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges
More informationON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION
ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced
More informationCount-Min Tree Sketch: Approximate counting for NLP
Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26
More informationCorrelation Autoencoder Hashing for Supervised Cross-Modal Search
Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia
More informationCS483 Design and Analysis of Algorithms
CS483 Design and Analysis of Algorithms Lectures 2-3 Algorithms with Numbers Instructor: Fei Li lifei@cs.gmu.edu with subject: CS483 Office hours: STII, Room 443, Friday 4:00pm - 6:00pm or by appointments
More informationNearest Neighbor Search with Keywords: Compression
Nearest Neighbor Search with Keywords: Compression Yufei Tao KAIST June 3, 2013 In this lecture, we will continue our discussion on: Problem (Nearest Neighbor Search with Keywords) Let P be a set of points
More informationIndexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile
Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:
More informationFast Text Compression with Neural Networks
Fast Text Compression with Neural Networks Matthew V. Mahoney Florida Institute of Technology 150 W. University Blvd. Melbourne FL 32901 mmahoney@cs.fit.edu Abstract Neural networks have the potential
More information10 van Emde Boas Trees
Dynamic Set Data Structure S: S. insert(x) S. delete(x) S. search(x) S. min() S. max() S. succ(x) S. pred(x) Ernst Mayr, Harald Räcke 389 For this chapter we ignore the problem of storing satellite data:
More informationImproved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts
Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards
More information2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51
2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each
More informationmd5bloom: Forensic Filesystem Hashing Revisited
DIGITAL FORENSIC RESEARCH CONFERENCE md5bloom: Forensic Filesystem Hashing Revisited By Vassil Roussev, Timothy Bourg, Yixin Chen, Golden Richard Presented At The Digital Forensic Research Conference DFRWS
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationProblem. Maintain the above set so as to support certain
Randomized Data Structures Roussos Ioannis Fundamental Data-structuring Problem Collection {S 1,S 2, } of sets of items. Each item i is an arbitrary record, indexed by a key k(i). Each k(i) is drawn from
More informationSpace-Efficient Construction Algorithm for Circular Suffix Tree
Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes
More information