Space and Time-Efficient Data Structures for Massive Datasets

Space and Time-Efficient Data Structures for Massive Datasets
Giulio Ermanno Pibiri
Supervisor: Rossano Venturini
Computer Science Department, University of Pisa
17/10/2016

Evidence
The increase of information does not scale with technology.
"Software is getting slower more rapidly than hardware becomes faster." (Niklaus Wirth, A Plea for Lean Software)
Even more relevant today!

Scenario
Data Structures: PERFORMANCE, how quickly a program does its work. Faster work: + time, - space.
Algorithms: EFFICIENCY, how much work is required by a program. Less work.
Data Compression: + space, - time.

Dichotomy Problem
Small vs fast? Choose one. NO.

High Level Thesis
Data Structures + Data Compression -> Faster Algorithms
"Space optimization is closely related to time optimization in a disk memory." (Donald E. Knuth, The Art of Computer Programming)

Hierarchical Memory Organisation
CPU registers -> L1 -> L2 -> RAM -> DISK
Size: registers (bits), L1 (32 KB), L2 (256 KB), RAM (4-32 GB), DISK (1 TB).
Speed: registers (1 ns), L1 (0.5 ns), L2 (7 ns), RAM (100 ns), DISK (1-10 ms).
RAM is about 14x slower than L2 and 200x slower than L1.
Numbers are taken from:

Proposal
Design space-efficient ad-hoc data structures, from both a theoretical and a practical perspective, that support fast data extraction. They must exploit the properties of the addressed problem.
Data compression and fast retrieval together. Mature algorithmic solutions, now ready for technology transfer.

Why? Who cares?
Because engineered data structures are the ones that boost the availability and wealth of information around us: inverted indexes, N-grams, B-trees.

Wait! Can't we use existing libraries? The C++ Standard Template Library (STL)?
std::list, std::stack, std::queue, std::map, std::unordered_map.
Prefer cache-friendly (contiguous) data structures. Always. Use std::vector.

Example: vector of strings vs string pool
Prefer cache-friendly data structures.
Vector of pointers: every access is a cache miss. Pointer chasing :-( (~5.2 GB)
Vector of offsets into a single contiguous buffer: offsets instead of pointers, contiguous memory layout. (~3 GB)
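The offset-based layout above can be sketched as follows; `StringPool` is a hypothetical name for illustration, not a class from the thesis. All character data lives in one contiguous `std::string`, and each entry is identified by an offset into it.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Sketch of the string-pool layout: one contiguous character buffer
// plus a vector of offsets, instead of a vector of individually
// heap-allocated strings.
class StringPool {
public:
    // Append a string; returns its index in the pool.
    std::size_t add(std::string_view s) {
        offsets_.push_back(data_.size());
        data_.append(s);
        return offsets_.size() - 1;
    }
    // Read back the i-th string without chasing any pointer:
    // both offsets_ and data_ are laid out contiguously in memory.
    std::string_view get(std::size_t i) const {
        std::size_t begin = offsets_[i];
        std::size_t end = (i + 1 < offsets_.size()) ? offsets_[i + 1] : data_.size();
        return std::string_view(data_).substr(begin, end - begin);
    }
    std::size_t size() const { return offsets_.size(); }
private:
    std::string data_;                 // all characters, back to back
    std::vector<std::size_t> offsets_; // start of each string in data_
};
```

By contrast, a `std::vector<std::string>` stores each string in its own heap allocation, so a scan over its elements touches scattered cache lines.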

Inverted Indexes 1
Given a textual collection D, each document can be seen as a (multi-)set of terms. The set of terms occurring in D is the lexicon T. For each term t in T we store in a list Lt the identifiers of the documents in which t appears. The collection of all inverted lists {Lt1, ..., Lt|T|} is the inverted index.
Example documents: 1. "red is always good"; 2. "the house is red"; 3. "the house is always hungry"; 4. "boy is red"; 5. "the boy is hungry".
T = {always, boy, good, house, hungry, is, red, the} = {t1, ..., t8}
Lt1 = [1, 3], Lt2 = [4, 5], Lt3 = [1], Lt4 = [2, 3], Lt5 = [3, 5], Lt6 = [1, 2, 3, 4, 5], Lt7 = [1, 2, 4], Lt8 = [2, 3, 5]

Inverted Indexes 2
Inverted indexes owe their popularity to the efficient resolution of queries such as: return all documents in which the terms {t1, ..., tk} occur.
With the example lists above, the query q = {boy, is, the} is answered by intersecting the posting lists Lt2 = [4, 5], Lt6 = [1, 2, 3, 4, 5] and Lt8 = [2, 3, 5], yielding document 5.
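The intersection step can be sketched with standard algorithms; `intersect` is a hypothetical helper for illustration, not code from the thesis. Processing the lists smallest-first keeps the candidate set small.

```cpp
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using PostingList = std::vector<uint32_t>;

// Intersect k sorted posting lists pairwise, starting from the
// shortest list, so each std::set_intersection pass shrinks the
// candidate set as quickly as possible.
PostingList intersect(std::vector<PostingList> lists) {
    if (lists.empty()) return {};
    std::sort(lists.begin(), lists.end(),
              [](const PostingList& a, const PostingList& b) { return a.size() < b.size(); });
    PostingList result = lists[0];
    for (std::size_t i = 1; i < lists.size() && !result.empty(); ++i) {
        PostingList tmp;
        std::set_intersection(result.begin(), result.end(),
                              lists[i].begin(), lists[i].end(),
                              std::back_inserter(tmp));
        result = std::move(tmp);
    }
    return result;
}
```

On the slide's example, intersecting [4, 5], [1, 2, 3, 4, 5] and [2, 3, 5] yields [5].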

General Problem
Consider a sequence S[0,n) of n positive and monotonically increasing integers, i.e., S[i] <= S[i+1] for 0 <= i < n-1, possibly with repetitions. How can we represent it as a bit vector in which each original integer is self-delimited, using as few bits as possible?
A huge body of research describes different space/time trade-offs:
Elias gamma/delta [Elias, TIT 1975]; Variable Byte [Salomon, Springer 2007]; Binary Interpolative Coding [Moffat and Stuiver, IRJ 2000]; Simple-9 [Anh and Moffat, IRJ 2005]; PForDelta [Zukowski et al., ICDE 2006]; OptPFD [Yan et al., WWW 2009]; Simple-16 [Anh and Moffat, SPE 2010]; Varint-G8IU [Stepanov et al., CIKM 2011]; Elias-Fano [Vigna, WSDM 2013]; Partitioned Elias-Fano [Ottaviano and Venturini, SIGIR 2014].
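As a taste of the simplest scheme in the list, here is a sketch of Variable Byte applied to the gaps of S. This is one common variant (7 data bits per byte, high bit set on the final byte of each integer), not necessarily the exact one cited above.

```cpp
#include <cstdint>
#include <vector>

// Variable Byte sketch: delta-encode the monotone sequence, then write
// each gap in 7-bit chunks, least significant first; the high bit of a
// byte is set on the final chunk of each integer (self-delimiting).
std::vector<uint8_t> vbyte_encode(const std::vector<uint64_t>& S) {
    std::vector<uint8_t> out;
    uint64_t prev = 0;
    for (uint64_t x : S) {
        uint64_t gap = x - prev;   // >= 0 since S is monotone
        prev = x;
        while (gap >= 128) {
            out.push_back(uint8_t(gap & 127));
            gap >>= 7;
        }
        out.push_back(uint8_t(gap | 128)); // terminator bit
    }
    return out;
}

std::vector<uint64_t> vbyte_decode(const std::vector<uint8_t>& in) {
    std::vector<uint64_t> S;
    uint64_t value = 0, prev = 0;
    unsigned shift = 0;
    for (uint8_t b : in) {
        value |= uint64_t(b & 127) << shift;
        if (b & 128) { prev += value; S.push_back(prev); value = 0; shift = 0; }
        else shift += 7;
    }
    return S;
}
```

A sequence of small gaps thus costs one byte per integer, regardless of how large the universe u is.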

September 7, 2016
For Firefly, Dropbox's full-text search engine, speed has always been a priority. They were unable to scale because of the size of their (distributed) inverted index. Consequence? Query latencies deteriorated from 250 ms to 1 s.
Solution? Compress the index to reduce I/O pressure.
Data Structures + Data Compression -> Faster Algorithms

Elias-Fano - Genesis
Robert Fano. On the number of bits required to implement an associative memory. Memorandum 61, Computer Structures Group, MIT (1971).
Peter Elias. Efficient Storage and Retrieval by Content and Address of Static Files. Journal of the ACM 21, 2 (1974).
40 years later: Sebastiano Vigna. Quasi-Succinct Indices. In Proceedings of the 6th ACM International Conference on Web Search and Data Mining (WSDM) (2013).

Elias-Fano - Encoding example
Example parameters: u = 43, n = 8. Each integer is split into a low part, its lg(u/n) least significant bits, and a high part, its lg n remaining bits. The low parts are concatenated verbatim into a bitvector L. The high parts are encoded into a bitvector H by writing, for each possible high value in increasing order, one 1-bit per element having that high value, followed by a 0-bit (the "missing high bits").
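A minimal sketch of this encoding, under illustrative values chosen here (not the slide's concrete sequence); `EliasFano`, `ef_encode` and `ef_access` are hypothetical names. For clarity, L holds one word per low part and H one bool per bit; a real implementation packs both into bitvectors.

```cpp
#include <cmath>
#include <cstdint>
#include <vector>

struct EliasFano {
    unsigned ell = 0;             // bits per low part, lg(u/n)
    std::vector<uint64_t> L;      // low parts (packed compactly in practice)
    std::vector<bool> H;          // high parts in negated unary
};

EliasFano ef_encode(const std::vector<uint64_t>& S, uint64_t u) {
    EliasFano ef;
    uint64_t n = S.size();
    if (n > 0 && u / n > 1)
        ef.ell = unsigned(std::floor(std::log2(double(u) / double(n))));
    uint64_t prev_high = 0;
    for (uint64_t x : S) {
        ef.L.push_back(x & ((1ULL << ef.ell) - 1)); // low part, stored verbatim
        uint64_t high = x >> ef.ell;
        for (; prev_high < high; ++prev_high)       // close the buckets in between
            ef.H.push_back(false);
        ef.H.push_back(true);                       // one 1-bit per element
    }
    return ef;
}

// access(i): locate the i-th 1-bit in H; the number of 0-bits before it
// is the high part of S[i] (done with select in O(1) in a real structure).
uint64_t ef_access(const EliasFano& ef, uint64_t i) {
    uint64_t ones = 0, zeros = 0;
    for (bool b : ef.H) {
        if (b) {
            if (ones == i) return (zeros << ef.ell) | ef.L[i];
            ++ones;
        } else {
            ++zeros;
        }
    }
    return UINT64_MAX; // i out of range
}
```

With constant-time select on H, this linear scan becomes the O(1) access of the next slide.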

Elias-Fano - Properties
EF(S[0,n)) = n lg(u/n) + 2n bits, less than half a bit per element away from the information-theoretic minimum [Elias, JACM 1974]: if X is the set of all monotone sequences of length n drawn from a universe u, then |X| = (u+n choose n), and lg (u+n choose n) is approximately n lg((u+n)/n).
Access to each S[i] in O(1) worst case.
predecessor(x) = max{S[i] : S[i] < x} and successor(x) = min{S[i] : S[i] >= x} queries in O(lg(u/n)) worst case.
But: o(n) more bits are needed to support fast rank/select primitives on the bitvector H.

Clustered Elias-Fano Indexes 1
Every encoder represents each sequence individually: no exploitation of the redundancy shared by different lists. Idea: encode clusters of posting lists.

Clustered Elias-Fano Indexes 2
Within a cluster of posting lists, postings are rewritten with respect to a reference list R: instead of lg u bits each (unbounded universe), they take lg |R| bits, with |R| << u.
Problems: 1. how to build the clusters; 2. how to synthesise the reference list. The problem is NP-hard already for a simplified formulation.

Clustered Elias-Fano Indexes 3
Time vs space trade-offs are obtained by varying the reference size.
Space: always better than PEF (by up to 11%) and better than BIC (by up to 6.25%).
Time: much faster than BIC (103% on average), slightly slower than PEF (20% on average).

Integer Data Structures 1
Elias-Fano matches the information-theoretic minimum: n lg(u/n) + 2n + o(n) bits, with O(1) random access and O(lg(u/n)) predecessor/successor. But it is a static succinct data structure: NO dynamic updates.
Dynamic alternatives: vEB trees [van Emde Boas, FOCS 1975]; x/y-fast tries [Willard, IPL 1983]; Fusion trees [Fredman and Willard, JCSS 1993]; Exponential search trees [Andersson and Thorup, JACM 2007]. Most of them take optimal time, but O(n lg u) bits of space (or even worse).

Integer Data Structures - Problems and Results 2
The (general) dictionary problem: represent a set S of n objects so that the following operations are supported.
insert(x): inserts x in S
delete(x): deletes x from S
search(x): checks whether x belongs to S
minimum(): returns the minimum element of S
maximum(): returns the maximum element of S
predecessor(x): returns max{y in S : y < x}
successor(x): returns min{y in S : y >= x}
The dynamic list representation problem [Fredman and Saks, STOC 1989]: given a list S of n sorted integers, support the following operations, under the assumption that the word size is w = lg^γ n bits for some γ.
access(i): returns the i-th smallest element of S
insert(x): inserts x in S
delete(x): deletes x from S
Lower bound: Ω(lg n / lg lg n) amortized time per operation, in the cell-probe computational model.
[Patrascu and Thorup, STOC 2007]: optimal space/time trade-offs for a static data structure taking m = n 2^a w bits, where a is the number of bits necessary to represent the mean number of bits per integer, i.e., a = lg(m/n) - lg w.

Dynamic Integer Sets in Succinct Space and Optimal Time
Goals: n lg(u/n) + 2n + o(n) bits, plus o(n) bits: negligible redundancy!
1. Extend the static Elias-Fano representation to support predecessor and successor queries in optimal worst-case O(lg lg n) time.
2. Maintain S in a fully dynamic fashion, supporting in optimal worst-case time all the operations defined in the dynamic dictionary and dynamic list representation problems.

Results - Static Elias-Fano with Optimal Successor Queries 1
Ingredients: the optimal time/space trade-off for successor search [Patrascu and Thorup, STOC 2007] and y-fast tries [Willard, IPL 1983].
Idea: divide the sequence into blocks and use a y-fast trie to index the blocks.

Results - Dynamic Elias-Fano 2
Ingredients: the optimal time/space trade-off for successor search [Patrascu and Thorup, STOC 2007], y-fast tries [Willard, IPL 1983], and a dynamic prefix-sum data structure [Bille et al., arXiv preprint 2015].
Idea: use a 2-level indexing data structure. The first level indexes blocks using a y-fast trie and the dynamic prefix-sum data structure by Bille et al.; the second level indexes mini-blocks using the data structure of the Lemma.

134 N-grams 1
Strings of at most N words. N typically ranges from 1 to 5. For example, over the text "different algorithms devised to ...":
N = 1: different, algorithms, devised, ...
N = 2: different algorithms, algorithms devised, devised to, ...
Google Books covers ~6% of the books ever published, and its datasets report the number of grams for each N from 1 to 5: more than 11 billion grams in total. 27
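The word-by-word extraction the slides walk through is just a sliding window over the text:

```python
def ngrams(words, n):
    # All contiguous sequences of exactly n words.
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
```

For the slide's example text, `ngrams("different algorithms devised to".split(), 2)` yields the bigrams ("different", "algorithms"), ("algorithms", "devised") and ("devised", "to").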

159 N-grams - Why? 2
Word prediction: given a context such as "space and time-efficient", candidate continuations ("data structures", "algorithms", ...) are ranked by their frequency counts:
P(data structures | space and time-efficient) = f(space and time-efficient data structures) / f(space and time-efficient) 28
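The frequency-count formula above translates directly into code. The counts below are hypothetical toy values; a real table holds billions of grams.

```python
from collections import Counter

# Hypothetical toy frequency counts, keyed by gram (a tuple of words).
counts = Counter({
    ("space", "and", "time-efficient"): 10,
    ("space", "and", "time-efficient", "data"): 6,
    ("space", "and", "time-efficient", "algorithms"): 4,
})

def probability(context, word):
    # P(word | context) = f(context followed by word) / f(context)
    return counts[context + (word,)] / counts[context]
```

With these counts, "data" is predicted over "algorithms" because 6/10 > 4/10.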

165 N-grams - Who cares? 3
N-gram language models power everyday applications such as virtual assistants; think of Siri's "What can I help you with?" prompt. 29

170 N-grams - Challenge 4
Store massive N-gram datasets such that, given a pattern, we can return its frequency count at light speed. In other words: an efficient map.
Data Structures + Data Compression ⇒ Faster Algorithms 31

173 N-grams - Data structures 5
Open-addressing vs. Tries.
Open-addressing: + time, - space. A gram X (e.g., BBC) is hashed to a table slot h(X), so lookup takes a constant number of probes, but the table pays space for hash codes and empty slots.
Tries: + space, - time. Grams sharing a prefix share a path in the tree (the slides build a trie over the vocabulary {A, B, C, D}); looking up BBC walks one node per word.
Several software libraries implement these approaches: KenLM [Heafield, 2011], BerkeleyLM [Pauls and Klein, 2011], IRSTLM [Federico et al., 2008], RandLM [Talbot and Osborne, 2007], Get1T [Hawker et al., 2007], SRILM [Stolcke, 2002]. 32
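The trie side of the comparison can be sketched with nested dictionaries; the grams and counts below are hypothetical. Lookup walks one trie level per word of the pattern, which is exactly the "- time" cost the slide refers to.

```python
def build_trie(gram_counts):
    # Nested-dict trie: one node per word; a node's count is stored
    # under the sentinel key None so words and counts never collide.
    root = {}
    for gram, count in gram_counts.items():
        node = root
        for word in gram:
            node = node.setdefault(word, {})
        node[None] = count
    return root

def lookup(trie, gram):
    # Walk one trie level per word; return the count, or None if absent.
    node = trie
    for word in gram:
        if word not in node:
            return None
        node = node[word]
    return node.get(None)
```

Shared prefixes are stored once: ("A", "B") and ("A", "B", "D") below share the path A → B.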

186 tongrams - Tons of N-Grams 1
Hash-based vs. Trie-based.
Hash-based: open addressing? The data structure is static, so Minimal Perfect Hashing can be used instead.
Trie-based: map each word to a vocabulary ID (A → 0, B → 1, C → 2, D → 3), so that every trie level becomes a sequence of integers, together with the pointers delimiting each node's children in the level below. Encode each level with Elias-Fano: compact, and still supporting random access. 33
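The vocabulary-remapping step, assigning integer IDs in order of first appearance so that trie levels become plain integer sequences amenable to Elias-Fano encoding, can be sketched as:

```python
def map_to_ids(grams):
    # Replace every word with a vocabulary ID assigned on first appearance.
    vocab = {}
    id_grams = []
    for gram in grams:
        id_grams.append(tuple(vocab.setdefault(w, len(vocab)) for w in gram))
    return vocab, id_grams
```

After remapping, each trie level is just a sequence of small integers, which is what makes per-level Elias-Fano encoding applicable.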

199 tongrams - Preliminary results 2
Preliminary experiments show improvements of ×2.6 and ×3.8. 34

202 (Some) Future Research Problems 1
Dynamic Inverted Indexes.
Classic solution: use two indexes. One is big and cold; the other is small and hot. Merge them periodically. 35
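The classic hot/cold scheme can be sketched as follows; the threshold and the plain-dict posting lists are illustrative assumptions, not a specific system's design. New documents go into the small hot index, queries consult both, and the hot index is periodically merged into the cold one.

```python
class DualIndex:
    """Two-index scheme: a big cold index plus a small hot index."""

    def __init__(self, merge_threshold=4):
        self.cold = {}          # term -> sorted list of doc ids
        self.hot = {}           # term -> unsorted list of recent doc ids
        self.hot_docs = 0
        self.merge_threshold = merge_threshold

    def add_document(self, doc_id, terms):
        # Insertions only touch the small hot index.
        for term in set(terms):
            self.hot.setdefault(term, []).append(doc_id)
        self.hot_docs += 1
        if self.hot_docs >= self.merge_threshold:
            self._merge()

    def _merge(self):
        # Periodic merge: fold the hot postings into the cold index.
        for term, postings in self.hot.items():
            self.cold[term] = sorted(set(self.cold.get(term, []) + postings))
        self.hot = {}
        self.hot_docs = 0

    def postings(self, term):
        # Queries consult both indexes.
        return sorted(set(self.cold.get(term, []) + self.hot.get(term, [])))
```

The research question the slide raises is how to make such a scheme space-efficient, e.g., keeping the cold index compressed while still merging cheaply.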

205 (Some) Future Research Problems 2
Compressed B-trees.
Problem: maintain a dictionary on disk. Motivations: databases and file-systems.
"Fancy indexing structures may be a luxury now, but they will be essential by the decade's end."
Michael Bender (Stony Brook University), Martin Farach-Colton (Rutgers University), Bradley Kuszmaul (MIT Laboratory for Computer Science) 36

209 (Some) Future Research Problems 3
Fast Successor for IP-lookup.
Successor search is what routers do for every incoming packet; hence, it is arguably the most frequently run algorithm in the world. Time and space efficiency are crucial. 37
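To see why routing reduces to predecessor/successor search: if the forwarding table is viewed as disjoint address ranges (an assumption that sidesteps the overlapping prefixes of real longest-prefix matching), forwarding one address is a single predecessor search over the range start points. The routes below are hypothetical.

```python
import bisect

# Hypothetical routing table: disjoint (start, end, next_hop) address ranges,
# sorted by start address. Real tables use 32- or 128-bit prefixes.
ROUTES = sorted([(0, 63, "gw-A"), (64, 95, "gw-B"), (128, 255, "gw-C")])
STARTS = [r[0] for r in ROUTES]

def route(addr):
    # Predecessor search: the candidate range is the last one starting <= addr.
    i = bisect.bisect_right(STARTS, addr) - 1
    if i >= 0:
        start, end, hop = ROUTES[i]
        if start <= addr <= end:
            return hop
    return None  # address falls in a gap: no matching route
```

One such search runs per packet, which is why shaving constants from successor search, in both time and space, matters so much here.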


More information

Communication Complexity

Communication Complexity Communication Complexity Jaikumar Radhakrishnan School of Technology and Computer Science Tata Institute of Fundamental Research Mumbai 31 May 2012 J 31 May 2012 1 / 13 Plan 1 Examples, the model, the

More information

LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations

LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations LRM-Trees: Compressed Indices, Adaptive Sorting, and Compressed Permutations Jérémy Barbay 1, Johannes Fischer 2, and Gonzalo Navarro 1 1 Department of Computer Science, University of Chile, {jbarbay gnavarro}@dcc.uchile.cl

More information

Preview: Text Indexing

Preview: Text Indexing Simon Gog gog@ira.uka.de - Simon Gog: KIT University of the State of Baden-Wuerttemberg and National Research Center of the Helmholtz Association www.kit.edu Text Indexing Motivation Problems Given a text

More information

Dictionary: an abstract data type

Dictionary: an abstract data type 2-3 Trees 1 Dictionary: an abstract data type A container that maps keys to values Dictionary operations Insert Search Delete Several possible implementations Balanced search trees Hash tables 2 2-3 trees

More information

Nearest Neighbor Search with Keywords in Spatial Databases

Nearest Neighbor Search with Keywords in Spatial Databases 776 Nearest Neighbor Search with Keywords in Spatial Databases 1 Sphurti S. Sao, 2 Dr. Rahila Sheikh 1 M. Tech Student IV Sem, Dept of CSE, RCERT Chandrapur, MH, India 2 Head of Department, Dept of CSE,

More information

Succinct Data Structures for Approximating Convex Functions with Applications

Succinct Data Structures for Approximating Convex Functions with Applications Succinct Data Structures for Approximating Convex Functions with Applications Prosenjit Bose, 1 Luc Devroye and Pat Morin 1 1 School of Computer Science, Carleton University, Ottawa, Canada, K1S 5B6, {jit,morin}@cs.carleton.ca

More information

8 Priority Queues. 8 Priority Queues. Prim s Minimum Spanning Tree Algorithm. Dijkstra s Shortest Path Algorithm

8 Priority Queues. 8 Priority Queues. Prim s Minimum Spanning Tree Algorithm. Dijkstra s Shortest Path Algorithm 8 Priority Queues 8 Priority Queues A Priority Queue S is a dynamic set data structure that supports the following operations: S. build(x 1,..., x n ): Creates a data-structure that contains just the elements

More information

Hashing. Martin Babka. January 12, 2011

Hashing. Martin Babka. January 12, 2011 Hashing Martin Babka January 12, 2011 Hashing Hashing, Universal hashing, Perfect hashing Input data is uniformly distributed. A dynamic set is stored. Universal hashing Randomised algorithm uniform choice

More information

arxiv: v1 [cs.ds] 25 Nov 2009

arxiv: v1 [cs.ds] 25 Nov 2009 Alphabet Partitioning for Compressed Rank/Select with Applications Jérémy Barbay 1, Travis Gagie 1, Gonzalo Navarro 1 and Yakov Nekrich 2 1 Department of Computer Science University of Chile {jbarbay,

More information

In-core computation of distance distributions and geometric centralities with HyperBall: A hundred billion nodes and beyond

In-core computation of distance distributions and geometric centralities with HyperBall: A hundred billion nodes and beyond In-core computation of distance distributions and geometric centralities with HyperBall: A hundred billion nodes and beyond Paolo Boldi, Sebastiano Vigna Laboratory for Web Algorithmics Università degli

More information

Application: Bucket Sort

Application: Bucket Sort 5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from

More information

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s Mining Data Streams The Stream Model Sliding Windows Counting 1 s 1 The Stream Model Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make

More information

Chapter 5 Arrays and Strings 5.1 Arrays as abstract data types 5.2 Contiguous representations of arrays 5.3 Sparse arrays 5.4 Representations of

Chapter 5 Arrays and Strings 5.1 Arrays as abstract data types 5.2 Contiguous representations of arrays 5.3 Sparse arrays 5.4 Representations of Chapter 5 Arrays and Strings 5.1 Arrays as abstract data types 5.2 Contiguous representations of arrays 5.3 Sparse arrays 5.4 Representations of strings 5.5 String searching algorithms 0 5.1 Arrays as

More information

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding

SIGNAL COMPRESSION Lecture Shannon-Fano-Elias Codes and Arithmetic Coding SIGNAL COMPRESSION Lecture 3 4.9.2007 Shannon-Fano-Elias Codes and Arithmetic Coding 1 Shannon-Fano-Elias Coding We discuss how to encode the symbols {a 1, a 2,..., a m }, knowing their probabilities,

More information

Bloom Filters, Minhashes, and Other Random Stuff

Bloom Filters, Minhashes, and Other Random Stuff Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida What? Probabilistic Space-efficient Fast Not exact Why?

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION

ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION ON THE BIT-COMPLEXITY OF LEMPEL-ZIV COMPRESSION PAOLO FERRAGINA, IGOR NITTO, AND ROSSANO VENTURINI Abstract. One of the most famous and investigated lossless data-compression schemes is the one introduced

More information

Count-Min Tree Sketch: Approximate counting for NLP

Count-Min Tree Sketch: Approximate counting for NLP Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26

More information

Correlation Autoencoder Hashing for Supervised Cross-Modal Search

Correlation Autoencoder Hashing for Supervised Cross-Modal Search Correlation Autoencoder Hashing for Supervised Cross-Modal Search Yue Cao, Mingsheng Long, Jianmin Wang, and Han Zhu School of Software Tsinghua University The Annual ACM International Conference on Multimedia

More information

CS483 Design and Analysis of Algorithms

CS483 Design and Analysis of Algorithms CS483 Design and Analysis of Algorithms Lectures 2-3 Algorithms with Numbers Instructor: Fei Li lifei@cs.gmu.edu with subject: CS483 Office hours: STII, Room 443, Friday 4:00pm - 6:00pm or by appointments

More information

Nearest Neighbor Search with Keywords: Compression

Nearest Neighbor Search with Keywords: Compression Nearest Neighbor Search with Keywords: Compression Yufei Tao KAIST June 3, 2013 In this lecture, we will continue our discussion on: Problem (Nearest Neighbor Search with Keywords) Let P be a set of points

More information

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile

Indexing LZ77: The Next Step in Self-Indexing. Gonzalo Navarro Department of Computer Science, University of Chile Indexing LZ77: The Next Step in Self-Indexing Gonzalo Navarro Department of Computer Science, University of Chile gnavarro@dcc.uchile.cl Part I: Why Jumping off the Cliff The Past Century Self-Indexing:

More information

Fast Text Compression with Neural Networks

Fast Text Compression with Neural Networks Fast Text Compression with Neural Networks Matthew V. Mahoney Florida Institute of Technology 150 W. University Blvd. Melbourne FL 32901 mmahoney@cs.fit.edu Abstract Neural networks have the potential

More information

10 van Emde Boas Trees

10 van Emde Boas Trees Dynamic Set Data Structure S: S. insert(x) S. delete(x) S. search(x) S. min() S. max() S. succ(x) S. pred(x) Ernst Mayr, Harald Räcke 389 For this chapter we ignore the problem of storing satellite data:

More information

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts

Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Improved Approximate String Matching and Regular Expression Matching on Ziv-Lempel Compressed Texts Philip Bille 1, Rolf Fagerberg 2, and Inge Li Gørtz 3 1 IT University of Copenhagen. Rued Langgaards

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

md5bloom: Forensic Filesystem Hashing Revisited

md5bloom: Forensic Filesystem Hashing Revisited DIGITAL FORENSIC RESEARCH CONFERENCE md5bloom: Forensic Filesystem Hashing Revisited By Vassil Roussev, Timothy Bourg, Yixin Chen, Golden Richard Presented At The Digital Forensic Research Conference DFRWS

More information

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Lecture 24: Bloom Filters. Wednesday, June 2, 2010 Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture

More information

Problem. Maintain the above set so as to support certain

Problem. Maintain the above set so as to support certain Randomized Data Structures Roussos Ioannis Fundamental Data-structuring Problem Collection {S 1,S 2, } of sets of items. Each item i is an arbitrary record, indexed by a key k(i). Each k(i) is drawn from

More information

Space-Efficient Construction Algorithm for Circular Suffix Tree

Space-Efficient Construction Algorithm for Circular Suffix Tree Space-Efficient Construction Algorithm for Circular Suffix Tree Wing-Kai Hon, Tsung-Han Ku, Rahul Shah and Sharma Thankachan CPM2013 1 Outline Preliminaries and Motivation Circular Suffix Tree Our Indexes

More information