Bloom Filters, general theory and variants
- Zoe Simon
1 Bloom Filters: general theory and variants
G. Caravagna, Information Retrieval
"Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered. When using a Bloom Filter, consider the effects of false positives."
2 Index
1 The problem
2 Main idea, Mathematics
3
4 Applications, References
4 Membership query
Definition (The Membership Problem). Given a set S and an element y, decide whether $y \in S$.
Equivalently, given a set S, compute its characteristic function
$\chi_S(y) = \begin{cases} 1, & \text{if } y \in S \\ 0, & \text{if } y \notin S \end{cases}$
Well-known solutions:
- Linear Scan
- Deterministic Arrays
- Hash Functions
6 Well-known Solutions
Linear Scan
- not an option at our scale
Deterministic Arrays (compute $\chi_S$ exactly)
- the elements of S belong to a finite universe
- use a boolean array as big as the universe
- map the elements of the universe onto it
Hash Functions (approximate $\chi_S$)
- use $\alpha$ bits for each element of S (usually $\alpha = \log n$)
- sort the hashed values (sort the $\alpha$-tuples)
- what may happen with collisions?
Bloom Filters use these as a starting point.
9 Reminder
"Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered. When using a Bloom Filter, consider the effects of false positives." Bloom [1970]
11 Preconditions
- given a set of objects $S = \{x_1, \dots, x_n\}$, with no restrictions on the objects
- a vector B of m bits, where $b_i \in \{0, 1\}$ (the value of m is discussed later)
- suppose we have k hash functions $h_1, \dots, h_k$
- each $h_i$ is defined as $h_i : U \to [1, m]$, where $U \supseteq S$ is the universe
- each $h_i$ indexes the B vector
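12 Building vector B
How to build B:
$\forall i.\; b_i = 1 \iff \exists (j, t).\; h_j(x_t) = i$
[Figure: two elements $x_i$ and $x_j$ of S, each mapped by the k hash functions to positions $h_1(x_i), \dots, h_k(x_i)$ and $h_1(x_j), \dots, h_k(x_j)$ of B.]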
13 Building vector B
procedure build
begin
  for each s in S do        // Θ(|S|) iterations
    for each h in H do      // k iterations, O(1) for constant k
      B[h(s)] = 1;          // assume hashing is O(1)
    done
  done
end
The whole build takes Θ(|S|) = Θ(n) time and Θ(|B|) = Θ(m) space.
14 Searching into vector B
How to search for y: we answer $y \in S$ if and only if $\forall i = 1, \dots, k.\; b_{h_i(y)} = 1$.
[Figure: y mapped by the k hash functions onto B.]
S is no longer needed once B is built.
16 Searching into vector B
procedure search
begin
  for each h in H do        // k iterations, O(1) for constant k
    if B[h(y)] = 0 then
      return "not found"
  done
  return "found"
end
The whole search takes O(1) time and O(1) space.
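The build and search procedures above can be sketched in Python as follows. This is only an illustration, not the slides' implementation: the class name `BloomFilter`, the sizes `m=64` and `k=3`, and the way the k hash functions are derived (salting SHA-256 with the index i) are all assumptions, and positions run over [0, m) rather than [1, m].

```python
import hashlib


class BloomFilter:
    """Bit vector B of m bits indexed by k hash functions h_1..h_k."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _indexes(self, item):
        # Assumption: h_i is obtained by salting SHA-256 with i.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Build step: set every bit the item hashes to.
        for idx in self._indexes(item):
            self.bits[idx] = 1

    def __contains__(self, item):
        # A single 0 bit proves absence; all 1s only suggest presence.
        return all(self.bits[idx] for idx in self._indexes(item))


def build(items, m, k):
    bf = BloomFilter(m, k)
    for s in items:
        bf.add(s)
    return bf


S = ["TTA", "TCT", "ATA"]
B = build(S, m=64, k=3)
print(all(x in B for x in S))  # inserted elements are never reported missing
print("CGA" in B)              # usually False; True would be a false positive
```

Once `B` is built the list `S` is not consulted again, which is the point made on the slide.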
17 Searching into vector B
It is easy to notice that, when computing $\chi_S(y)$:
1. if $\exists i.\; b_{h_i(y)} = 0$ then $\chi_S(y) = 0$, thus $y \notin S$
2. if $\forall i.\; b_{h_i(y)} = 1$ then $\chi_S(y) = 1$, thus $y \in S$
Statement 1 is certainly true. What about statement 2?
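19 The real problem of Bloom Filters
Definition. The main problem is false positives: for each hash function there may exist an $x_j \neq y$, with $x_j \in S$ and $y \notin S$, such that $h_\cdot(x_j) = h_\cdot(y)$.
[Figure: y and elements $x_j$ of S colliding on the same bits of B.]
Statement 2 is not always true.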
20 What we compute
- what may happen with a false positive? We say $y \in S$ even though this is false
- we should rather say "$y \in S$, probably"
- can we compute this probability?
21-26 Example
[Figure: the strings ACG, ATA, CGA, TTA, ATA, CGC, AAA, TCT hashed by $h_1$, $h_2$, $h_3$ into the bit vector B, with S = {TTA, TCT, ATA}.]
Building B: inserting the elements of S sets the bits they hash to, and collisions occur, e.g. $h_1(TCT) = h_2(TTA) = 9$. Along the way we may have introduced a false positive at $B_{10}$, and then two, at $B_{10}$ and $B_{11}$.
Searching for CGA: $B_{h_3(CGA)} = 0$, so the answer to "CGA $\in$ S?" is NO.
Searching for AAA: $h_1(TTA) = h_2(AAA) = 1$, $h_2(ATA) = h_3(AAA) = 2$, $h_2(TCT) = h_1(AAA) = 4$. All three bits are set, so the answer to "AAA $\in$ S?" is YES: a false positive.
28 Probability of a false positive
Assumption: the hash functions are perfectly random.
After the build,
$P(b_i = 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$
The probability of a false positive is
$\left(1 - e^{-kn/m}\right)^k = (1 - p)^k = \varepsilon$
Other formulations are asymptotically equivalent.
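To get a feeling for these formulas, the short sketch below evaluates them numerically. The function names and the sample values m = 10000, n = 1000, k = 7 are illustrative assumptions, not taken from the slides.

```python
import math


def zero_bit_probability(m, n, k):
    """Return (exact, approximate) probability that a given bit is still 0."""
    exact = (1 - 1 / m) ** (k * n)
    approx = math.exp(-k * n / m)
    return exact, approx


def false_positive_rate(m, n, k):
    """epsilon = (1 - p)^k with p = e^(-kn/m)."""
    p = math.exp(-k * n / m)
    return (1 - p) ** k


exact, approx = zero_bit_probability(m=10_000, n=1_000, k=7)
eps = false_positive_rate(m=10_000, n=1_000, k=7)
print(f"p exact = {exact:.5f}, p approx = {approx:.5f}, epsilon = {eps:.5f}")
```

For these sizes the exact and approximate values of p agree to about four decimal places, which is why the exponential form is normally used.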
29 Optimizing the number of hash functions
- a higher k gives more chances to find a 0 bit for $y \notin S$
- a lower k increases the fraction of 0 bits in B
- minimizing the function $\varepsilon$ gives $k = \ln 2 \cdot (m/n)$
- at this optimum $p = 0.5$ and $\varepsilon$ depends only on the ratio $m/n$: $\varepsilon = (0.5)^k \approx (0.6185)^{m/n}$
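The optimal k can be derived in a few lines. The sketch below treats k as a continuous variable, which is an approximation (in practice k is rounded to an integer), and uses the approximation $p = e^{-kn/m}$ from the previous slide.

```latex
\begin{align*}
\ln \varepsilon &= k \,\ln\!\left(1 - e^{-kn/m}\right),
  \qquad p = e^{-kn/m} \;\Longrightarrow\; k = -\frac{m}{n}\,\ln p \\
\ln \varepsilon &= -\frac{m}{n}\,\ln p \,\ln(1 - p) \\
\intertext{The product $\ln p \,\ln(1-p)$ is symmetric in $p$ and $1-p$ and is maximal at $p = \tfrac12$, so $\ln\varepsilon$ is minimal there:}
k &= \frac{m}{n}\,\ln 2,
  \qquad
  \varepsilon = \left(\tfrac12\right)^{k}
              = \left(2^{-\ln 2}\right)^{m/n}
              \approx (0.6185)^{m/n}
\end{align*}
```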
31 How big should the B vector be?
- it depends on the $\varepsilon$ value we want
- given n, we fix m

m      ε
n      %
2n     %
5n     %
10n    %

m = O(n) is generally a good choice.
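The relation between m and $\varepsilon$ can be explored with the formula from the previous slide. The sketch below assumes the optimal, non-integer $k = \ln 2 \cdot (m/n)$, so the printed values are lower bounds on what an integer k achieves; the function name is hypothetical.

```python
import math


def epsilon_at_optimal_k(bits_per_element):
    """False positive rate (0.5)^k with k = ln 2 * (m/n)."""
    k = math.log(2) * bits_per_element
    return 0.5 ** k


for ratio in (1, 2, 5, 10):
    print(f"m = {ratio}n  ->  epsilon ~ {epsilon_at_optimal_k(ratio):.2%}")
```

Under this assumption the rate falls from roughly 62% at m = n to below 1% at m = 10n.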
32 Bloom Filters vs. hash functions

               hash functions      Bloom Filters
build time     Θ(n + n log n)      Θ(n)
space needed   Θ(n log n)          Θ(m)
search time    O(log n)            O(1)
ε value        1/n                 (1 - p)^k

Hash functions are Bloom Filters with k = 1.
33 Bloom Filters tricks
Union by OR
1. we have sets $S_1$, $S_2$ and Bloom Filters $B^1$, $B^2$
2. suppose $m_1 = m_2$ and the same hash functions
3. just OR the bits: $B^{12}_i = B^1_i \lor B^2_i$
Halved size
1. suppose $m = 2^\alpha$
2. take the union by OR of the two halves
3. when hashing, mask the high-order bit
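Both tricks are a few lines of code when the filter is held as a list of bits. The sketch below is an illustration under that assumption; the function names are hypothetical and positions run over [0, m).

```python
def union(b1, b2):
    """Bloom filter of S1 ∪ S2: bitwise OR of two same-size filters."""
    if len(b1) != len(b2):
        raise ValueError("filters must have the same size m")
    return [x | y for x, y in zip(b1, b2)]


def halve(bits):
    """Fold the upper half of the filter onto the lower half with OR."""
    m = len(bits)
    if m < 2 or m & (m - 1):
        raise ValueError("m must be a power of two, at least 2")
    half = m // 2
    return [bits[i] | bits[i + half] for i in range(half)]


def halved_index(index, m):
    """Mask the high-order bit of a position computed for the size-m filter."""
    return index & (m // 2 - 1)


b1 = [1, 0, 0, 0, 0, 1, 0, 0]
b2 = [0, 0, 1, 0, 0, 1, 0, 0]
print(union(b1, b2))       # [1, 0, 1, 0, 0, 1, 0, 0]
print(halve(b1))           # [1, 1, 0, 0]
print(halved_index(5, 8))  # 1
```

Both operations keep every bit that was set, so neither can introduce false negatives; halving does raise the false positive rate because the same n elements now share fewer bits.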
34 Summary
- we have a tradeoff between space and false positives
- the $\varepsilon$ value is computable (and constant)
- we use the abstraction provided by hash functions on $x_i \in S$
- we approximate the characteristic function
- we have an easy-to-code data structure
We started from the Membership Problem and we solve this one: handle massive data sets to support membership queries using a compact data structure. What else could we want?
35 Why variants
- extend Bloom Filters to multisets: Spectral Bloom Filter, Matias and Cohen [2003]
- compute almost any function: Bloomier Filter, Chazelle et al. [2004]
- something else? someone else...
These are recent results; let's try to give a brief overview.
37 Spectral Bloom Filters (SBF)
We extend Bloom Filters to multisets.
Definition. $M = \langle S, f_x \rangle$ is a multiset, where
- S is a set
- $f_x$ is a function giving the number of occurrences of x in M
Example: $M = \{A^{=2}, B^{=1}, C^{=2}\}$, $\sum_x f_x = |M| = 5$, $f_A = f_C = 2$, $f_B = 1$.
38 Main features
- space usage is slightly larger
- performance is generally better
- insertions are always possible, deletions are not
- can be built incrementally for streaming data
- we query for values $f_x > T$, with T a threshold
- with T = 0 we test for membership
- there are tricks for SBFs too
39 SBF
- the B vector is replaced by a vector of counters $C_1, C_2, \dots, C_m$
- $C_i$ is the sum of the $f_x$ values of every $x \in S$ mapping to i
- as always, approximations of $f_x$ are stored in $C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}$
- thus, to estimate $f_x$, we use $m_x = \min\{C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}\}$
- $m_x$ is the basic estimator, or Minimum Selection (MS)
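40 The Minimum Selection
[Figure: counters $C_{i-1}$, $C_i$, $C_j$, $C_{j+1}$ receiving the contributions $f_x$, $f_y$, $f_z$; x and y collide at position $i-1$, i.e. $h_\cdot(y) = h_\cdot(x) = i-1$.]
- $C_{i-1}$ is not a good approximation of $f_x$ (nor of $f_y$)
- $C_i$ holds the exact value of $f_x$
- $C_{j+1}$ holds the exact value of $f_z$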
41 Insertion and Deletion
Insertion is simple: increase each counter by 1.
  ...
  for each h in H do        // O(1)
    C[h(x)] = C[h(x)] + 1;
  done
  ...
Deletion is simple: decrease each counter by 1.
Search for an element x: compute the Minimum Selection $m_x$.
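A minimal Python sketch of an SBF with the Minimum Selection estimator is given below. The class name, the sizes, and the SHA-256-based hash functions are assumptions made for the example; positions that coincide for the same item are counted once.

```python
import hashlib


class SpectralBloomFilter:
    """Vector of m counters; the estimate of f_x is the minimum over x's counters."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.counters = [0] * m

    def _indexes(self, item):
        # Assumption: h_i is SHA-256 salted with i; a set removes repeated positions.
        return {
            int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % self.m
            for i in range(self.k)
        }

    def insert(self, item):
        for idx in self._indexes(item):
            self.counters[idx] += 1

    def delete(self, item):
        # Only meaningful for items that were inserted before.
        for idx in self._indexes(item):
            self.counters[idx] -= 1

    def estimate(self, item):
        # Minimum Selection: m_x = min over the counters of x.
        return min(self.counters[idx] for idx in self._indexes(item))


sbf = SpectralBloomFilter(m=64, k=3)
for item in ["A", "A", "B", "C", "C"]:   # the multiset M from the definition slide
    sbf.insert(item)
print({x: sbf.estimate(x) for x in "ABC"})  # each estimate is at least the true f_x
```

With T = 0 the test `sbf.estimate(x) > 0` behaves like an ordinary Bloom filter membership query.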
42 On the error of SBF
The error is the same $\varepsilon$ as for Bloom Filters.
Theorem. For all x, $f_x \le m_x$. Furthermore, $f_x \neq m_x$ with probability $E_{SBF} = \varepsilon \approx (1 - p)^k$.
Proof. With no collisions, $m_x = f_x$. With collisions on every counter of x, $m_x > f_x$. The case $m_x < f_x$ cannot happen, because collisions only add to a counter. The event $f_x \neq m_x$ means that all the counters of x have a collision, which is exactly a false positive.
43 Implementing an SBF: challenges
Mainly two challenges:
1. the vector of counters: computational complexity of random accesses, insertion and deletion
2. performance: allow insertion and deletion while keeping $E_{SBF}$ low
44 Solving Problem 2 with Minimal Increase (MI)
We minimize redundant insertions.
Minimal Increase principle: when inserting an element x, increase only the counters that equal $m_x$. Each lookup returns the value $m_x$.
We get the inequality $E_{SBF} \le \varepsilon$.
45 Minimal Increase: example of increase
Insertion is always possible.
[Figure: repeated insertions of x under Minimal Increase; only the counters equal to $m_x$ are incremented, so $m_x$ goes from 1 to 2 and onward.]
46 Minimal Increase: example of decrease
Deletion may introduce false negatives.
[Figure: x and y share counters. Initially $m_x = 1$ and $m_y = 0$; after inserting y, $m_x = 1 = m_y$; after deleting y, $m_x = 0 = m_y$.]
We lie, saying $x \notin S$: MI doesn't allow deletion.
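The difference between plain insertion and Minimal Increase can be sketched as below. The helper names, the stream, and the SHA-256-based hash functions are assumptions made for the example; both variants share the same positions so that their counters can be compared directly.

```python
import hashlib


def indexes(item, m, k):
    # Assumption: h_i is SHA-256 salted with i; repeated positions are merged.
    return {
        int.from_bytes(hashlib.sha256(f"{i}:{item}".encode()).digest()[:8], "big") % m
        for i in range(k)
    }


def insert_ms(counters, item, k):
    """Plain insertion: increase every counter of the item."""
    for i in indexes(item, len(counters), k):
        counters[i] += 1


def insert_mi(counters, item, k):
    """Minimal Increase: increase only the counters equal to the current minimum."""
    idxs = indexes(item, len(counters), k)
    low = min(counters[i] for i in idxs)
    for i in idxs:
        if counters[i] == low:
            counters[i] += 1


def estimate(counters, item, k):
    return min(counters[i] for i in indexes(item, len(counters), k))


m, k = 32, 3
ms, mi = [0] * m, [0] * m
stream = ["x", "y", "x", "z", "x", "y"] * 50   # f_x = 150, f_y = 100, f_z = 50
for item in stream:
    insert_ms(ms, item, k)
    insert_mi(mi, item, k)

for item in "xyz":
    print(item, estimate(mi, item, k), estimate(ms, item, k))
```

On an insert-only stream the MI estimate never drops below the true count and never exceeds the plain estimate. No `delete` is provided for the MI counters, for the reason shown on the slide.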
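47 Solving Problem 2 with Recurring Minimum (RM)
Recurring Minimum: definition
[Figure: counters of x and z; the minimum $m_x = f_x$ appears in more than one counter of x, while the minimum $m_z$ appears in only one counter of z.]
- x has a Recurring Minimum (RM)
- z has a Single Minimum (SM)
An element has an RM if and only if more than one of its counters holds its MS value.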
48 Solving Problem 2 with Recurring Minimum (RM)
We identify Bloom errors and handle them.
Recurring Minimum: principle
- for an item x with an RM we use $m_x$ as the estimator: $E_{SBF} < \varepsilon$
- for items with a single minimum we use a secondary SBF, with $|SBF_2| \ll |SBF_1|$
Improvements are remarkable: $E_{SBF_2} \ll \varepsilon$.
49 Recurring Minimum: insertion
Insertion handles potential future errors:
1. increase($SBF_1$, x)
2. if x has an RM in $SBF_1$, stop
3. look for x in $SBF_2$
   1. if $x \in SBF_2$: increase($SBF_2$, x)
   2. if $x \notin SBF_2$: increase($SBF_2$, search($SBF_1$, x))
[Figure: inserting x when it has a single minimum (SM) versus a recurring minimum (RM).]
50 Recurring Minimum: lookup and deletion
Lookup looks, if needed, in both SBFs:
1. if x has an RM in $SBF_1$, return it
2. otherwise let $m_{x2}$ be the value of x in $SBF_2$
   1. if $m_{x2} > 0$, return it
   2. else return the minimum value of x in $SBF_1$
Deletion is the reverse of insertion:
1. decrease($SBF_1$, x)
2. if x has an SM in $SBF_1$, decrease($SBF_2$, x)
As insertion goes into both SBFs, deletion cannot create false negatives.
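The insertion and lookup steps of the Recurring Minimum method can be sketched as follows. This is a simplified illustration, not the authors' implementation: the class and method names, the filter sizes, and the SHA-256-based hash functions are assumptions, and deletion is left out.

```python
import hashlib


class SBF:
    """Counter vector with Minimum Selection and a recurring-minimum test."""

    def __init__(self, m, k, salt):
        self.m, self.k, self.salt = m, k, salt
        self.counters = [0] * m

    def _indexes(self, item):
        # Assumption: salted SHA-256 stands in for the k hash functions.
        return {
            int.from_bytes(
                hashlib.sha256(f"{self.salt}:{i}:{item}".encode()).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        }

    def increase(self, item, amount=1):
        for idx in self._indexes(item):
            self.counters[idx] += amount

    def minimum(self, item):
        return min(self.counters[idx] for idx in self._indexes(item))

    def has_recurring_minimum(self, item):
        values = [self.counters[idx] for idx in self._indexes(item)]
        return values.count(min(values)) > 1


class RecurringMinimumSBF:
    def __init__(self, m1, m2, k):
        self.primary = SBF(m1, k, "primary")      # SBF_1
        self.secondary = SBF(m2, k, "secondary")  # SBF_2, smaller

    def insert(self, item):
        self.primary.increase(item)
        if self.primary.has_recurring_minimum(item):
            return
        if self.secondary.minimum(item) > 0:
            self.secondary.increase(item)
        else:
            # First time the item is seen in SBF_2: copy its SBF_1 value.
            self.secondary.increase(item, self.primary.minimum(item))

    def lookup(self, item):
        if self.primary.has_recurring_minimum(item):
            return self.primary.minimum(item)
        m2 = self.secondary.minimum(item)
        return m2 if m2 > 0 else self.primary.minimum(item)


rm = RecurringMinimumSBF(m1=64, m2=16, k=3)
for _ in range(3):
    rm.insert("a")
print(rm.lookup("a"))  # 3
```

With a single distinct item there are no collisions, so the lookup is exact; with many items the secondary filter only receives those whose minimum is not repeated in `SBF_1`.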
51 Methods comparison: MS vs. MI vs. RM

error rates          MI   RM   MS = ε
space overhead       MI   MS   RM
complexity           MS   MI   RM
insertion/deletion   MS = RM   MI
52 Solving Problem 1 with an integer vector
- each counter fits in one word, for example a 4-byte word
- all the m counters take 4m bytes
- to get $\varepsilon < 1\%$ we need m = 10n (with the optimal k, $\varepsilon \approx 0.0082$)
- so the m counters take 40n bytes
- with $n = 2^{20}$ (slightly more than $10^6$ objects) the counters need 40 MB!
A vector of integers is too big: do we really need to count up to $2^{32}$?
54 Solving Problem 1 with a static bit vector
- suppose $\forall i.\; f_i < 10$, thus $C_j < 10 + \alpha$, with $\alpha$ depending on collisions
- use $\lceil \log_2 10 \rceil = 4$ bits (= 0.5 bytes) for each counter
- to get $\varepsilon < 1\%$ we need m = 10n
- so the m counters take 5n bytes
- with $n = 2^{20}$ we have a 5 MB static vector!
This static vector doesn't allow insertion or deletion.
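The space figures for the integer vector and for the packed 4-bit vector follow from simple arithmetic, reproduced below. The variable names are illustrative, and MB is taken as $2^{20}$ bytes.

```python
import math

n = 2 ** 20                 # number of objects
m = 10 * n                  # counters needed for epsilon below 1%
MIB = 2 ** 20               # bytes per MB, taken as 2^20

# One 4-byte word per counter.
word_bytes = 4
integer_vector_bytes = m * word_bytes

# Counters bounded by 10 need ceil(log2(10)) = 4 bits each.
bits_per_counter = math.ceil(math.log2(10))
packed_vector_bytes = m * bits_per_counter // 8

print(integer_vector_bytes / MIB)  # 40.0
print(packed_vector_bytes / MIB)   # 5.0
```

The eightfold saving comes entirely from shrinking each counter from 32 bits to 4.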
56 Solving Problem 1 with the String Array Index
- use the number of bits (per counter) that is strictly needed
- use some slack bits: fix a value $\alpha > 0$ and add $\alpha m$ bits to the vector, one every $1/\alpha$ items
- each counter $C_i$ uses $\lceil \log_2 C_i \rceil$ bits
- each counter counts up to $2^{\lceil \log_2 C_i \rceil} + \dots$
- the C vector takes $\sum_{i=1}^{m} \lceil \log_2 C_i \rceil + \alpha m = N$ bits
[Figure: the counters $C_1, C_2, \dots, C_m$ stored back to back, each in $\lceil \log_2 C_i \rceil$ bits, with a slack bit after every $1/\alpha$ counters.]
57 The String Array Index: main idea
First level of pointers to subsequences of the SBF:
- a Coarse Offset Vector pointing to groups of $\log N$ items
- there are $m / \log N$ of these pointers
The second level may be:
- another Coarse Offset Vector of pointers to subsequences
- a simple vector of offsets
58 The String Array Index: graphic
[Figure: a Coarse Offset Vector (C.O.V.) pointing into the string S, refined at the second level by further C.O.V.s and Offset Vectors.]
59 The String Array Index: performance
Two levels of pointers to sub-sequences, with $N = \sum_{i=1}^{m} \left( \lceil \log_2 C_i \rceil + \alpha \right)$.
Theorem. The SAI, of size $o(N) + O(m)$ bits, can be built in $O(m)$ time and supports access to sub-sequences in $O(1)$ time.
Theorem. An SBF of size $N + o(N) + O(m)$ bits can be built in $O(N)$ time and supports lookup in $O(1)$ time. Furthermore, each update takes $O(1)$ amortized time.
60 SBF tricks
Merging by addition
1. we have two multisets $S_1$, $S_2$ and two SBFs $C^1$, $C^2$
2. suppose $m_1 = m_2$ and the same hash functions
3. just sum the counters: $C^{12}_i = C^1_i + C^2_i$
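Merging two SBFs is the counter-wise analogue of the OR trick for Bloom Filters. A minimal sketch, assuming each filter is a plain list of counters built with the same m and the same hash functions (the function name is hypothetical):

```python
def merge(c1, c2):
    """SBF of the multiset union: add the two counter vectors position by position."""
    if len(c1) != len(c2):
        raise ValueError("filters must have the same size m")
    return [a + b for a, b in zip(c1, c2)]


c1 = [2, 0, 1, 0]
c2 = [1, 1, 0, 0]
print(merge(c1, c2))  # [3, 1, 1, 0]
```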
62 Bloomier Filters (BBF)
- we compute any function f using a BBF, with some constraints on f
- the same tradeoff is inherited from Bloom Filters
- we associate values with a subset of the domain elements
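63 Which function
- $f : D = \{0, \dots, N-1\} \to R = \{\bot, 0, \dots, 2^r - 1\}$
- values are stored only for a subset $S \subseteq D$
- for $x \in S$, $f(x)$ is computed error free
- for $x \notin S$, the result is $\bot$ with probability arbitrarily close to 1
[Figure: the domain D with its subset S mapped by f into R.]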
64 Main features
- query is O(1)
- space requirement is O(nr)
- can be generalized to handle dynamic updates: the function can be updated with the space unchanged
- we query values of f
- we may change f(x) for $x \in S$, but S is immutable
65 Main idea
- a false positive in a BBF means returning a result when the key is not in the map
- we give a simple idea of a BBF: the Bloom Filter cascade
- it can be formally generalized
66 A near-optimal and simple BBF
- the possible values are {0, 1}
- $A_0$ is a BF holding the keys that map to 0
- $B_0$ is a BF holding the keys that map to 1
- we build many pairs $(A_i, B_i)$ (here is the cascade)
- we make a cross search, going as deep as we need
- what may happen when searching?
67 A near-optimal and simple BBF
We start looking in $(A_0, B_0)$:
- if the key is in neither, it is not in the map (surely)
- if it is in $A_0$ but not in $B_0$, it does not map to 1 (surely) and it does map to 0 (probably)
- if it is in both $A_0$ and $B_0$, which one lies? (false positive)
  - we have to go recursively into $(A_1, B_1)$
  - $A_1$ holds the keys mapping to 0 that are false positives in $B_0$
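71 A BF cascade
[Figure: the pair $(A_0, B_0)$; the false positives of each filter feed the next pair $(A_1, B_1)$, and so on.]
- the keys of $A_{i+1}$ are a subset of the keys of $A_i$
- average search is O(1): the first pairs are generally enough
- total space is independent of n: the first pair occupies most of the space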
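The cascade for a {0, 1}-valued map can be sketched as follows. This is an illustration under several assumptions that are not in the slides: the names, the filter sizing (`bits_per_key`), a different salt per level, and a `max_depth` cut-off with an exact dictionary for any keys still unresolved at that depth.

```python
import hashlib


class BloomFilter:
    def __init__(self, m, k, salt):
        self.m, self.k, self.salt = m, k, salt
        self.bits = [0] * m

    def _indexes(self, key):
        # Assumption: salted SHA-256 stands in for the k hash functions.
        for i in range(self.k):
            data = f"{self.salt}:{i}:{key}".encode()
            yield int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % self.m

    def add(self, key):
        for idx in self._indexes(key):
            self.bits[idx] = 1

    def __contains__(self, key):
        return all(self.bits[idx] for idx in self._indexes(key))


def make_filter(keys, salt, bits_per_key=8, k=3):
    bf = BloomFilter(max(8, bits_per_key * len(keys)), k, salt)
    for key in keys:
        bf.add(key)
    return bf


def build_cascade(mapping, max_depth=32):
    zeros = [x for x, v in mapping.items() if v == 0]
    ones = [x for x, v in mapping.items() if v == 1]
    levels = []
    while (zeros or ones) and len(levels) < max_depth:
        a = make_filter(zeros, f"A{len(levels)}")   # keys mapping to 0
        b = make_filter(ones, f"B{len(levels)}")    # keys mapping to 1
        levels.append((a, b))
        # Keys that the opposite filter also claims go one level deeper.
        zeros, ones = [x for x in zeros if x in b], [x for x in ones if x in a]
    leftovers = {**{x: 0 for x in zeros}, **{x: 1 for x in ones}}
    return levels, leftovers


def lookup(cascade, key):
    levels, leftovers = cascade
    for a, b in levels:
        in_a, in_b = key in a, key in b
        if in_a and not in_b:
            return 0
        if in_b and not in_a:
            return 1
        if not in_a and not in_b:
            return None          # surely not in the map
    return leftovers.get(key)


mapping = {f"key{i}": i % 2 for i in range(200)}
cascade = build_cascade(mapping)
print(all(lookup(cascade, key) == value for key, value in mapping.items()))
print(len(cascade[0]))  # number of levels actually built
```

Keys in the map are always answered correctly; a key outside the map usually yields `None`, but may occasionally get a 0 or 1, which is the false positive described on the "Main idea" slide.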
72 The general idea
- results are binary-coded: $v \in R$ is coded with $\beta_v \in \{0, 1\}^q$
- for each bit of $\beta_v$ we use the simple BBF
What we get:
- space is slightly larger than the space for 2q BFs
- lookup is $\Theta(q)$
- build is $O(n \log n)$
- $E_{BBF}$ is proportional to $2^{-q}$
73 The general idea
- they use a table T of coded values; T has m locations
- as always, we have k hash functions
- we use a masking value M to reduce $E_{BBF}$
- if $x \in S$:
  $\beta_{f(x)} = M \oplus \bigoplus_{i=1}^{k} T[h_i(x)]$
- if $x \notin S$:
  $P[\mathrm{lookup}(x, T) = \bot] \ge 1 - k\,2^{-q}$
75 Other special Bloom Filters
- Counting Bloom Filters, Broder et al. [1998]
- Compressed Bloom Filters, M. Mitzenmacher [2002]
- Attenuated Bloom Filters, Rhea, Kubiatowicz [2002]
- Compact Approximator of Lattice Functions, Boldi, Vigna [2004]
77 Where they are really used
- routing: probabilistic location and routing; shortest path distance information
- proxy: Web proxy cache in SQUID; distributed caching
- peer-to-peer: summarizing the contents
- spell checking: the original idea of B. Bloom
79 References: foundations
- B. Bloom. Space/time tradeoffs in hash coding with allowable errors. CACM, 13(7), 1970.
- Saar Cohen, Yossi Matias. Spectral Bloom filters. ACM SIGMOD '03, 2003.
- B. Chazelle, J. Kilian, R. Rubinfeld, A. Tal. The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. Proceedings of the 15th SODA, 30-39, 2004.
- Bose, Guo, Kranakis, Maheshwari, Morin, Morrison, Smid, Tang. On the false-positive rate of Bloom Filters. School of Computer Science, Carleton University, 2004.
80 References: extras
- M. Mitzenmacher. Compressed Bloom Filters. In Proceedings of the 20th ACM SIGACT-SIGOPS, 2002.
- P. Boldi, S. Vigna. Compact Approximation of Lattice Functions with Applications to Large-Alphabet Text Search. Dipartimento di Scienze dell'Informazione, Università di Milano, 2004.
- A. Broder, M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Proceedings of the 40th Allerton Conference (2004), 2002.
More informationCS246 Final Exam. March 16, :30AM - 11:30AM
CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions
More informationBottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence. Mikkel Thorup University of Copenhagen
Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence Mikkel Thorup University of Copenhagen Min-wise hashing [Broder, 98, Alta Vita] Jaccard similary of sets A and B
More informationComputational Complexity - Pseudocode and Recursions
Computational Complexity - Pseudocode and Recursions Nicholas Mainardi 1 Dipartimento di Elettronica e Informazione Politecnico di Milano nicholas.mainardi@polimi.it June 6, 2018 1 Partly Based on Alessandro
More informationHash tables. Hash tables
Basic Probability Theory Two events A, B are independent if Conditional probability: Pr[A B] = Pr[A] Pr[B] Pr[A B] = Pr[A B] Pr[B] The expectation of a (discrete) random variable X is E[X ] = k k Pr[X
More information1 Approximate Quantiles and Summaries
CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity
More informationShannon-Fano-Elias coding
Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The
More informationCS 61B Asymptotic Analysis Fall 2017
CS 61B Asymptotic Analysis Fall 2017 1 More Running Time Give the worst case and best case running time in Θ( ) notation in terms of M and N. (a) Assume that slam() Θ(1) and returns a boolean. 1 public
More informationCS 170 Algorithms Fall 2014 David Wagner MT2
CS 170 Algorithms Fall 2014 David Wagner MT2 PRINT your name:, (last) SIGN your name: (first) Your Student ID number: Your Unix account login: cs170- The room you are sitting in right now: Name of the
More informationDatabases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)
Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR) References Hashing Techniques: Elmasri, 7th Ed. Chapter 16, section 8. Cormen, 3rd Ed. Chapter 11. Inverted indexing: Elmasri,
More informationProbabilistic Counting with Randomized Storage
Probabilistic Counting with Randomized Storage Benjamin Van Durme University of Rochester Rochester, NY 14627, USA Ashwin Lall Georgia Institute of Technology Atlanta, GA 30332, USA Abstract Previous work
More informationCSCI Honor seminar in algorithms Homework 2 Solution
CSCI 493.55 Honor seminar in algorithms Homework 2 Solution Saad Mneimneh Visiting Professor Hunter College of CUNY Problem 1: Rabin-Karp string matching Consider a binary string s of length n and another
More informationLecture 5: Hashing. David Woodruff Carnegie Mellon University
Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of
More informationApplication: Bucket Sort
5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from
More informationi=1 i B[i] B[i] + A[i, j]; c n for j n downto i + 1 do c n i=1 (n i) C[i] C[i] + A[i, j]; c n
Fundamental Algorithms Homework #1 Set on June 25, 2009 Due on July 2, 2009 Problem 1. [15 pts] Analyze the worst-case time complexity of the following algorithms,and give tight bounds using the Theta
More information1 Estimating Frequency Moments in Streams
CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature
More informationElectrical & Computer Engineering University of Waterloo Canada February 6, 2007
Lecture 9: Lecture 9: Electrical & Computer Engineering University of Waterloo Canada February 6, 2007 Hash tables Lecture 9: Recall that a hash table consists of m slots into which we are placing items;
More informationAlgorithms lecture notes 1. Hashing, and Universal Hash functions
Algorithms lecture notes 1 Hashing, and Universal Hash functions Algorithms lecture notes 2 Can we maintain a dictionary with O(1) per operation? Not in the deterministic sense. But in expectation, yes.
More informationComputer Networks 55 (2011) Contents lists available at ScienceDirect. Computer Networks. journal homepage:
Computer Networks 55 (2) 84 89 Contents lists available at ScienceDirect Computer Networks journal homepage: www.elsevier.com/locate/comnet A Generalized Bloom Filter to Secure Distributed Network Applications
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More informationIntroduction. An Introduction to Algorithms and Data Structures
Introduction An Introduction to Algorithms and Data Structures Overview Aims This course is an introduction to the design, analysis and wide variety of algorithms (a topic often called Algorithmics ).
More informationSuccinct Approximate Counting of Skewed Data
Succinct Approximate Counting of Skewed Data David Talbot Google Inc., Mountain View, CA, USA talbot@google.com Abstract Practical data analysis relies on the ability to count observations of objects succinctly
More informationThe space complexity of approximating the frequency moments
The space complexity of approximating the frequency moments Felix Biermeier November 24, 2015 1 Overview Introduction Approximations of frequency moments lower bounds 2 Frequency moments Problem Estimate
More informationDATA MINING LECTURE 3. Frequent Itemsets Association Rules
DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.
More informationComplexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler
Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard
More informationLecture 8 HASHING!!!!!
Lecture 8 HASHING!!!!! Announcements HW3 due Friday! HW4 posted Friday! Q: Where can I see examples of proofs? Lecture Notes CLRS HW Solutions Office hours: lines are long L Solutions: We will be (more)
More informationInteger Sorting on the word-ram
Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words
More informationOutline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.
Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität
More informationBloom Filters, Minhashes, and Other Random Stuff
Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida What? Probabilistic Space-efficient Fast Not exact Why?
More informationcompare to comparison and pointer based sorting, binary trees
Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:
More informationIntroduction to Algorithms March 10, 2010 Massachusetts Institute of Technology Spring 2010 Professors Piotr Indyk and David Karger Quiz 1
Introduction to Algorithms March 10, 2010 Massachusetts Institute of Technology 6.006 Spring 2010 Professors Piotr Indyk and David Karger Quiz 1 Quiz 1 Do not open this quiz booklet until directed to do
More informationLecture Lecture 25 November 25, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,
More informationAverage Case Analysis. October 11, 2011
Average Case Analysis October 11, 2011 Worst-case analysis Worst-case analysis gives an upper bound for the running time of a single execution of an algorithm with a worst-case input and worst-case random
More informationN/4 + N/2 + N = 2N 2.
CS61B Summer 2006 Instructor: Erin Korber Lecture 24, 7 Aug. 1 Amortized Analysis For some of the data structures we ve discussed (namely hash tables and splay trees), it was claimed that the average time
More informationLecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1
Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a
More informationSearching, mainly via Hash tables
Data structures and algorithms Part 11 Searching, mainly via Hash tables Petr Felkel 26.1.2007 Topics Searching Hashing Hash function Resolving collisions Hashing with chaining Open addressing Linear Probing
More informationFinding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures Modified from Jeff Ullman Goals Many Web-mining problems can be expressed as finding similar sets:. Pages
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationQuiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)
Introduction to Algorithms October 13, 2010 Massachusetts Institute of Technology 6.006 Fall 2010 Professors Konstantinos Daskalakis and Patrick Jaillet Quiz 1 Solutions Quiz 1 Solutions Problem 1. We
More informationLecture 4 February 2nd, 2017
CS 224: Advanced Algorithms Spring 2017 Prof. Jelani Nelson Lecture 4 February 2nd, 2017 Scribe: Rohil Prasad 1 Overview In the last lecture we covered topics in hashing, including load balancing, k-wise
More informationSecure Indexes* Eu-Jin Goh Stanford University 15 March 2004
Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004 * Generalizes an early version of my paper How to search on encrypted data on eprint Cryptology Archive on 7 October 2003 Secure Indexes Data
More informationAdapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts
Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /
More informationthe subset partial order Paul Pritchard Technical Report CIT School of Computing and Information Technology
A simple sub-quadratic algorithm for computing the subset partial order Paul Pritchard P.Pritchard@cit.gu.edu.au Technical Report CIT-95-04 School of Computing and Information Technology Grith University
More informationStreaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org.
Streaming - 2 Bloom Filters, Distinct Item counting, Computing moments credits:www.mmds.org http://www.mmds.org Outline More algorithms for streams: 2 Outline More algorithms for streams: (1) Filtering
More informationDeterministic Finite Automaton (DFA)
1 Lecture Overview Deterministic Finite Automata (DFA) o accepting a string o defining a language Nondeterministic Finite Automata (NFA) o converting to DFA (subset construction) o constructed from a regular
More information1 Substitution method
Recurrence Relations we have discussed asymptotic analysis of algorithms and various properties associated with asymptotic notation. As many algorithms are recursive in nature, it is natural to analyze
More informationAnalysis of Algorithms I: Perfect Hashing
Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the
More informationCS 580: Algorithm Design and Analysis
CS 580: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 2018 Reminder: Homework 1 due tonight at 11:59PM! Recap: Greedy Algorithms Interval Scheduling Goal: Maximize number of meeting
More informationChapter 2. Recurrence Relations. Divide and Conquer. Divide and Conquer Strategy. Another Example: Merge Sort. Merge Sort Example. Merge Sort Example
Recurrence Relations Chapter 2 Divide and Conquer Equation or an inequality that describes a function by its values on smaller inputs. Recurrence relations arise when we analyze the running time of iterative
More information