Bloom Filters, general theory and variants

1 Bloom Filters: general theory and variants
G. Caravagna
Information Retrieval

"Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered. When using a Bloom Filter, consider the effects of false positives."

2 Index
1 The problem
2 Main idea, Mathematics
3
4 Applications, References

4 Membership query
Definition (The Membership Problem). Given a set $S$ and an element $y$: is $y \in S$?
Equivalently, given a set $S$, compute its characteristic function $\chi_S$:
$$\chi_S(y) = \begin{cases} 1, & \text{if } y \in S \\ 0, & \text{if } y \notin S \end{cases}$$
Well-known solutions:
- Linear Scan
- Deterministic Arrays
- Hash Functions

6 Well-known Solutions
- Linear Scan: not in our world
- Deterministic Arrays (compute $\chi_S$ exactly)
  - the elements of $S$ belong to a finite universe
  - use a boolean array as big as the universe
  - map the elements of the universe onto it
- Hash Functions (approximate $\chi_S$)
  - use $\alpha$ bits for each element of $S$ (usually $\alpha = \log(n)$)
  - sort the hashed values (sort $\alpha$-tuples)
  - what may happen with collisions?
Bloom Filters use these as a starting point.

9 Reminder
"Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered. When using a Bloom Filter, consider the effects of false positives."
Bloom [1970]

11 Preconditions
- given a set of objects $S = \{x_1, \dots, x_n\}$, with no restrictions on the objects
- a vector $B$ of $m$ bits, where $b_i \in \{0, 1\}$ (the value of $m$ is discussed later)
- suppose we have $k$ hash functions $h_1, \dots, h_k$
- each $h_i$ is defined as $h_i : U \supseteq S \to [1; m]$, so each $h_i$ indexes the vector $B$

12 Building vector B
How to build $B$:
$$\forall i.\ b_i = 1 \iff \exists (j, t).\ h_j(x_t) = i$$
[Figure: two elements $x_i$ and $x_j$ of $S$, each setting $k$ bits of $B$, at positions $h_1(x_i), \dots, h_k(x_i)$ and $h_1(x_j), \dots, h_k(x_j)$.]

13 Building vector B
    procedure build
    begin
      for each s in S do        // is Θ(|S|)
        for each h in H do      // is O(1), since k is a constant
          B[h(s)] = 1;          // suppose is O(1)
        done
      done
    end
The whole build takes $\Theta(|S|) = \Theta(n)$ time and $\Theta(|B|) = \Theta(m)$ space.

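14 Searching into vector B
How to search for $y$: we answer $y \in S$ when
$$\forall i = 1, \dots, k.\ b_{h_i(y)} = 1$$
[Figure: $y$ is hashed by $h_1, \dots, h_k$ and the corresponding bits of $B$ are inspected.]
$S$ is no longer needed once $B$ is built.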

16 Searching into vector B
    procedure search
    begin
      for each h in H do          // is O(1)
        if B[h(y)] = 0 then
          return "not found"
      done
      return "found"
    end
The whole search takes $O(1)$ time and $O(1)$ space.
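The build and search procedures can be sketched in Python. This is a minimal sketch rather than the lecture's own code: the class name `BloomFilter`, the SHA-256 double hashing used to derive the $k$ indexes, and the 0-based positions (the slides index $B$ over $[1; m]$) are all assumptions made for illustration.

```python
import hashlib


class BloomFilter:
    """Bit vector B of m bits probed by k hash functions (illustrative sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.bits = bytearray(m)  # one byte per bit, kept simple for readability

    def _positions(self, item: str) -> list[int]:
        # Assumed hashing scheme: derive k indexes from one SHA-256 digest
        # with double hashing, h_i(x) = (h1 + i * h2) mod m.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, item: str) -> None:
        # "procedure build", one element at a time: set B[h(s)] = 1 for each h.
        for p in self._positions(item):
            self.bits[p] = 1

    def __contains__(self, item: str) -> bool:
        # "procedure search": any 0 bit means "not found" for sure.
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(m=64, k=3)
for s in ("TTA", "TCT", "ATA"):
    bf.add(s)

print("TTA" in bf)  # members are always reported as found
print("AAA" in bf)  # non-members are usually rejected, but may be false positives
```

Once the filter is built the set itself is not consulted again, which is why a "found" answer can only ever be probable.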

17 Searching into vector B
It's easy to notice that, when computing $\chi_S(y)$:
1. if $\exists i.\ b_{h_i(y)} = 0$ then $\chi_S(y) = 0$, thus $y \notin S$
2. if $\forall i.\ b_{h_i(y)} = 1$ then $\chi_S(y) = 1$, thus $y \in S$
Of course sentence 1 is true. What about 2?

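19 The real problem of Bloom Filters
Definition. The main problem is false positives: for some $y \notin S$ there may exist elements $x_j \in S$, $x_j \neq y$, such that $h_{\dots}(x_j) = h_{\dots}(y)$ on every bit that $y$ probes.
[Figure: $y \notin S$ and $x_j \in S$ hashing to the same positions of $B$.]
Sentence 2 is not always true.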

20 What we compute
- what may happen with a false positive? We say $y \in S$ even if this is false
- we should rather say "$y \in S$, probably"
- can we compute this probability?

21 Example
[Figure: the elements ACG, ATA, CGA, TTA, ATA, CGC, AAA, TCT, the hash functions $h_1, h_2, h_3$ and the bit vector $B$.]
$S = \{TTA, TCT, ATA\}$

22 Example: building B
[Figure: the elements of $S = \{TTA, TCT, ATA\}$ are hashed by $h_1, h_2, h_3$ and the corresponding bits of $B$ are set.]

23 Example: building B
We may have introduced a false positive in $B_{10}$.
$h_1(TCT) = h_2(TTA) = 9$

24 Example: building B
We may have introduced two false positives, in $B_{10}$ and $B_{11}$.

25 Example: searching into B
Query: CGA. Since $B_{h_3(CGA)} = 0$, the answer to "CGA $\in S$?" is NO.

26 Example: searching into B
Query: AAA.
- $h_1(TTA) = h_2(AAA) = 1$
- $h_2(ATA) = h_3(AAA) = 2$
- $h_2(TCT) = h_1(AAA) = 4$
The answer to "AAA $\in S$?" is YES: a false positive.

28 Probability of a false positive
- assumption: the hash functions are perfectly random
- after the build, $P(b_i = 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$
- the probability of a false positive is $\left(1 - e^{-kn/m}\right)^k = (1 - p)^k = \varepsilon$
- other formulations are asymptotically equivalent
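The two expressions for $\varepsilon$ can be compared numerically. The sketch below is illustrative only: the function names and the sample values $n = 1000$, $m = 10000$, $k = 7$ are assumptions, not figures from the lecture.

```python
import math


def false_positive_exact(n: int, m: int, k: int) -> float:
    """epsilon computed with P(b_i = 0) = (1 - 1/m)^(k n)."""
    p_zero = (1.0 - 1.0 / m) ** (k * n)
    return (1.0 - p_zero) ** k


def false_positive_approx(n: int, m: int, k: int) -> float:
    """epsilon computed with the approximation p = e^(-k n / m)."""
    p = math.exp(-k * n / m)
    return (1.0 - p) ** k


exact = false_positive_exact(n=1000, m=10000, k=7)
approx = false_positive_approx(n=1000, m=10000, k=7)
print(f"exact  = {exact:.6f}")
print(f"approx = {approx:.6f}")
```

For these sample values both forms give a rate a little above 0.8%, and they differ only far beyond the digits that matter in practice.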

29 Optimizing number of hash functions
- a higher $k$ gives more chances to find a 0-bit for $y \notin S$
- a lower $k$ increases the fraction of 0-bits in $B$
- minimizing the $\varepsilon$ function gives $k = \ln 2 \cdot (m/n)$
- with this $k$ we have $p = 0.5$, and $\varepsilon$ is a constant for a fixed ratio $m/n$: $\varepsilon = (0.5)^k = (0.6185)^{m/n}$
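A short derivation of the optimal $k$, using only the quantities $p = e^{-kn/m}$ and $\varepsilon = (1 - p)^k$ defined on the previous slide (the intermediate steps are a reconstruction and are not taken from the slides):

```latex
\begin{align*}
p = e^{-kn/m} \;&\Longrightarrow\; k = -\frac{m}{n}\,\ln p \\
\ln \varepsilon = k \ln(1 - p) &= -\frac{m}{n}\,\ln p \,\ln(1 - p) \\
\ln p \,\ln(1 - p) \text{ is symmetric in } p \leftrightarrow 1 - p
  &\text{ and is maximal at } p = \tfrac{1}{2} \\
\Longrightarrow\; k = \frac{m}{n}\,\ln 2, \qquad
\varepsilon = \left(\tfrac{1}{2}\right)^{k}
  &= \left(2^{-\ln 2}\right)^{m/n} \approx (0.6185)^{m/n}
\end{align*}
```

In other words, the false-positive rate is lowest when the filter is half full of 1-bits.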

31 How big should the B vector be?
- it depends on the $\varepsilon$ value we want
- given $n$, we fix $m$
[Table: $\varepsilon$ (in percent) for $m = n$, $2n$, $5n$, $10n$; the percentages did not survive extraction.]
- $m = O(n)$ is generally a good choice
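The relation between $m$ and $\varepsilon$ can be tabulated from the formula $\varepsilon = (0.5)^k$ with $k = \ln 2 \cdot (m/n)$. The sketch below assumes the optimal (non-integer) $k$ is used; it does not claim to reproduce the figures of the slide's own table, and the function names are hypothetical.

```python
import math


def optimal_k(bits_per_item: float) -> float:
    """k = ln 2 * (m / n); in practice k is rounded to an integer."""
    return math.log(2) * bits_per_item


def min_false_positive(bits_per_item: float) -> float:
    """epsilon = 0.5^k, evaluated at the optimal k."""
    return 0.5 ** optimal_k(bits_per_item)


for c in (1, 2, 5, 10):
    k = optimal_k(c)
    eps = min_false_positive(c)
    print(f"m = {c:>2}n   k = {k:.2f}   epsilon = {eps:.2%}")
```

Under this assumption the rate falls from roughly 62% at $m = n$ to roughly 38%, 9% and 0.8% at $m = 2n$, $5n$ and $10n$.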

32 Bloom Filters vs. hash functions
                 hash functions           Bloom Filters
build time       $\Theta(n + n\log n)$    $\Theta(n)$
space needed     $\Theta(n \log n)$       $\Theta(m)$
search time      $O(\log n)$              $O(1)$
ε value          $1/n$                    $(1 - p)^k$
Hash functions are Bloom Filters with $k = 1$.

33 Bloom Filters tricks
Union by OR:
1. we have sets $S_1$, $S_2$ and Bloom Filters $B^1$, $B^2$
2. suppose $m_1 = m_2$ and the same hash functions
3. just OR the bits: $B^{12}_i = B^1_i \lor B^2_i$
Halved size:
1. suppose $m = 2^{\alpha}$
2. make the union by OR of the two halves
3. when hashing, mask the high-order bit
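Both tricks can be sketched in a few lines of Python. The helper names (`positions`, `build`, `contains`, `union`, `halve`) and the SHA-256 double hashing are assumptions for illustration. Because the indexes are taken modulo the filter length, querying the halved filter is the same as masking the high-order bit of the original index when $m$ is a power of two.

```python
import hashlib


def positions(item: str, k: int, m: int) -> list[int]:
    # Assumed hashing scheme: k indexes via double hashing on SHA-256.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return [(h1 + i * h2) % m for i in range(k)]


def build(items: list[str], m: int, k: int) -> list[int]:
    bits = [0] * m
    for item in items:
        for p in positions(item, k, m):
            bits[p] = 1
    return bits


def contains(bits: list[int], item: str, k: int) -> bool:
    return all(bits[p] for p in positions(item, k, len(bits)))


def union(b1: list[int], b2: list[int]) -> list[int]:
    # Requires the same m and the same hash functions for both filters.
    assert len(b1) == len(b2)
    return [x | y for x, y in zip(b1, b2)]


def halve(bits: list[int]) -> list[int]:
    # OR the two halves together; m is assumed to be a power of two.
    half = len(bits) // 2
    return [bits[i] | bits[i + half] for i in range(half)]


M, K = 64, 3
b1 = build(["TTA", "TCT"], M, K)
b2 = build(["ATA"], M, K)
b12 = union(b1, b2)   # filter for S1 united with S2
small = halve(b12)    # same members, half the bits, higher epsilon
```

Neither operation can lose a member; halving only raises the false-positive rate, since the same elements now share half as many bits.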

34 Summary
- we have a tradeoff between space and false positives
- the $\varepsilon$ value is computable (and constant)
- we use the abstraction provided by hash functions on $x_i \in S$
- we approximate the characteristic function
- we have an easy-to-code data structure
- we started from the Membership Problem, and we solve this one: handle massive data sets to support membership queries using a compact data structure
- what else could we want?

35 Why variants
- extend Bloom Filters to multisets: Spectral Bloom Filter, Matias and Cohen [2003]
- compute almost any function: Bloomier Filter, Chazelle et al. [2004]
- something else? someone else...
These are up-to-date results; let's try to give a brief overview.

37 Spectral Bloom Filters (SBF)
We extend Bloom Filters to multisets.
Definition. $M = \langle S, f_x \rangle$ is a multiset, where
- $S$ is a set
- $f_x$ is a function: the number of occurrences of $x$ in $M$
Example: $M = \{A^{=2}, B^{=1}, C^{=2}\}$, $\sum_{x \in S} f_x = 5$, $f_A = f_C = 2$, $f_B = 1$

38 Main features
- space usage is slightly larger
- performance is generally better
- insertions are always possible, deletions are not
- it can be built incrementally for streaming data
- we query for values $f_x > T$, with $T$ a threshold
- with $T = 0$ we test for membership
- we have tricks for SBF too

39 SBF
- the $B$ vector is replaced by a vector of counters $C_1, C_2, \dots, C_m$
- $C_i$ is the sum of the $f_x$ values of each $x \in S$ mapping to $i$
- as always, approximations of $f_x$ are stored into $C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}$
- thus, to estimate $f_x$, we take $m_x = \min\{C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}\}$
- $m_x$ is the basic estimator, or Minimum Selection (MS)

40 The Minimum Selection
[Figure: counters $C_{i-1}$, $C_i$, $C_j$, $C_{j+1}$; $x$ and $y$ collide on $C_{i-1}$, that is $h_{\dots}(y) = h_{\dots}(x) = i - 1$, while $C_i$ holds only $f_x$ and $C_{j+1}$ holds only $f_z$.]
- $C_{i-1}$ is not a good approximation of $f_x$ (nor of $f_y$)
- $C_i$ is exactly $f_x$
- $C_{j+1}$ is exactly $f_z$

41 Insertion and Deletion
- insertion is simple: increase each counter by 1
      ...
      for each h in H do          // O(1)
        C[h(x)] = C[h(x)] + 1;
      done
      ...
- deletion is simple: decrease each counter by 1
- search for an element $x$: compute the Minimum Selection $m_x$
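A counter-based filter with Minimum Selection can be sketched as follows. The class name `SpectralBloomFilter` and the hashing scheme are assumptions; repeated indexes for the same item are dropped so that a single insertion never bumps one counter twice, a detail the slides do not discuss.

```python
import hashlib


class SpectralBloomFilter:
    """Vector of m counters with the Minimum Selection estimator (sketch)."""

    def __init__(self, m: int, k: int) -> None:
        self.m, self.k = m, k
        self.counters = [0] * m

    def _positions(self, item: str) -> list[int]:
        # Assumed hashing scheme: double hashing on a SHA-256 digest.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        raw = ((h1 + i * h2) % self.m for i in range(self.k))
        return list(dict.fromkeys(raw))  # keep each index once

    def insert(self, item: str) -> None:
        for p in self._positions(item):
            self.counters[p] += 1

    def delete(self, item: str) -> None:
        # Only meaningful for items that were actually inserted.
        for p in self._positions(item):
            self.counters[p] -= 1

    def estimate(self, item: str) -> int:
        # Minimum Selection: m_x = min over the item's counters.
        return min(self.counters[p] for p in self._positions(item))


sbf = SpectralBloomFilter(m=64, k=3)
for x in "AABCC":        # the multiset {A=2, B=1, C=2} from the definition
    sbf.insert(x)
print(sbf.estimate("A"), sbf.estimate("B"), sbf.estimate("C"))
```

The estimate can only overshoot the true frequency, because every counter of $x$ contains at least the $f_x$ increments made for $x$ itself.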

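42 On the error of SBF
The error is the same $\varepsilon$ of Bloom Filters.
Theorem. For all $x$, $f_x \le m_x$. Furthermore, $f_x \neq m_x$ with probability $E_{SBF} = \varepsilon \approx (1 - p)^k$.
Proof. If at least one counter of $x$ has no collision, $m_x = f_x$. If every counter of $x$ has a collision, $m_x > f_x$. The case $m_x < f_x$ cannot happen, because collisions only add to a counter. So the event $f_x \neq m_x$ means that all counters have a collision, which is exactly a false positive.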

43 Implementing a SBF: challenges
Mainly two challenges:
1. the vector of counters: computational complexity of random accesses, insertion and deletion
2. performance: allow insertion/deletion while keeping $E_{SBF}$ low

44 Solving Problem 2 with Minimal Increase (MI)
We minimize redundant insertions.
Minimal Increase principle: when inserting an element $x$, increase only the counters that equal $m_x$. Each lookup will return the value $m_x$.
We get the inequality $E_{SBF} \le \varepsilon$.
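The principle can be compared with plain insertion on a small stream. This is a sketch under assumptions: the function names, the sample stream and the hashing scheme are illustrative and not part of the lecture.

```python
import hashlib


def positions(item: str, k: int, m: int) -> list[int]:
    # Assumed hashing scheme; repeated indexes are dropped.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return list(dict.fromkeys((h1 + i * h2) % m for i in range(k)))


def insert_ms(counters: list[int], item: str, k: int) -> None:
    # Plain insertion: increase every counter of the item.
    for p in positions(item, k, len(counters)):
        counters[p] += 1


def insert_mi(counters: list[int], item: str, k: int) -> None:
    # Minimal Increase: increase only the counters equal to m_x.
    pos = positions(item, k, len(counters))
    m_x = min(counters[p] for p in pos)
    for p in pos:
        if counters[p] == m_x:
            counters[p] += 1


def estimate(counters: list[int], item: str, k: int) -> int:
    return min(counters[p] for p in positions(item, k, len(counters)))


K = 3
stream = ["TTA", "TCT", "TTA", "ATA", "TTA", "TCT"]
ms, mi = [0] * 16, [0] * 16   # deliberately small, to provoke collisions
for item in stream:
    insert_ms(ms, item, K)
    insert_mi(mi, item, K)

for item in ("TTA", "TCT", "ATA"):
    print(item, estimate(mi, item, K), estimate(ms, item, K))
```

On insert-only streams the Minimal Increase estimate is never below the true frequency and never above the plain Minimum Selection estimate, since each insertion touches a subset of the counters that plain insertion would touch.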

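45 Minimal Increase: example of increase
Insertion is always possible.
[Figure: $x$ is inserted twice; each time only the counters equal to $m_x$ are increased, so $m_x$ goes from 1 to 2 after the first insertion and grows again after the second.]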

46 Minimal Increase: example of decrease
Deletion may introduce false negatives.
[Figure: $x$ and $y$ share a counter. Initially $m_x = 1$ and $m_y = 0$; after inserting $y$, $m_x = 1 = m_y$; after deleting $y$, $m_x = 0 = m_y$.]
- we lie, saying $x \notin S$
- MI doesn't allow deletion

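47 Solving Problem 2 with Recurring Minimum (RM)
Recurring Minimum: definition
[Figure: several counters of $x$ hold the same minimal value $m_x = f_x$, while only one counter of $z$ holds its minimal value $m_z$.]
- $x$ has a Recurring Minimum (RM)
- $z$ has a Single Minimum (SM)
- an element has a RM iff more than one of its counters holds its MS value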

48 Solving Problem 2 with Recurring Minimum (RM)
We identify Bloom errors and handle them.
Recurring Minimum: principle
- for an item $x$ with a RM we use $m_x$ as estimator, with $E_{SBF} < \varepsilon$
- for items with a single minimum we use a secondary SBF, $SBF_2$, smaller than $SBF_1$
Improvements are remarkable: the error $E_{SBF_2}$ on the secondary filter is much lower than $\varepsilon$.

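49 Recurring Minimum: insertion
Insertion handles potential future errors:
1. increase($SBF_1$, $x$)
2. if $x$ has a RM in $SBF_1$, stop
3. look for $x$ in $SBF_2$:
   1. if $x \in SBF_2$: increase($SBF_2$, $x$)
   2. if $x \notin SBF_2$: increase($SBF_2$, search($SBF_1$, $x$)), that is, add $x$ to $SBF_2$ with its current value in $SBF_1$
[Figure: inserting $x$ when it has a RM touches only $SBF_1$; inserting $x$ when it has a SM also updates $SBF_2$.]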

50 Recurring Minimum: lookup and deletion
Lookup looks, if needed, in both SBFs:
1. if $x$ has a RM in $SBF_1$, return it
2. otherwise, let $m_{x2}$ be the value of $x$ in $SBF_2$:
   1. if $m_{x2} > 0$, return it
   2. else return the minimum value of $x$ in $SBF_1$
Deletion is the reverse of insertion:
1. decrease($SBF_1$, $x$)
2. if $x$ has a SM in $SBF_1$, decrease($SBF_2$, $x$)
As insertion is performed in both SBFs, deletion can't create false negatives.
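The insertion, lookup and deletion steps can be put together in one small class. This is a sketch under several assumptions: the class and method names are hypothetical, the two filters use differently salted SHA-256 hashes, "adding $x$ to $SBF_2$ with its value in $SBF_1$" is implemented by increasing the secondary counters by that value, and deletion skips the secondary filter when one of its counters is already 0.

```python
import hashlib


def positions(item: str, k: int, m: int, salt: str) -> list[int]:
    # Assumed hashing scheme; distinct indexes when m is a power of two.
    digest = hashlib.sha256((salt + item).encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return list(dict.fromkeys((h1 + i * h2) % m for i in range(k)))


class RecurringMinimumSBF:
    def __init__(self, m1: int, m2: int, k: int) -> None:
        self.k = k
        self.c1 = [0] * m1   # primary SBF
        self.c2 = [0] * m2   # smaller secondary SBF

    def _p1(self, x: str) -> list[int]:
        return positions(x, self.k, len(self.c1), "primary:")

    def _p2(self, x: str) -> list[int]:
        return positions(x, self.k, len(self.c2), "secondary:")

    def _primary_min(self, x: str) -> tuple[int, bool]:
        values = [self.c1[p] for p in self._p1(x)]
        m_x = min(values)
        return m_x, values.count(m_x) > 1   # (minimum, is it recurring?)

    def insert(self, x: str) -> None:
        for p in self._p1(x):
            self.c1[p] += 1
        m_x, recurring = self._primary_min(x)
        if recurring:
            return
        p2 = self._p2(x)
        in_secondary = min(self.c2[p] for p in p2) > 0
        step = 1 if in_secondary else m_x   # first time: copy the primary value
        for p in p2:
            self.c2[p] += step

    def lookup(self, x: str) -> int:
        m_x, recurring = self._primary_min(x)
        if recurring:
            return m_x
        m_x2 = min(self.c2[p] for p in self._p2(x))
        return m_x2 if m_x2 > 0 else m_x

    def delete(self, x: str) -> None:
        for p in self._p1(x):
            self.c1[p] -= 1
        _, recurring = self._primary_min(x)
        p2 = self._p2(x)
        if not recurring and min(self.c2[p] for p in p2) > 0:
            for p in p2:
                self.c2[p] -= 1


rm = RecurringMinimumSBF(m1=64, m2=16, k=3)
for _ in range(5):
    rm.insert("TTA")
rm.delete("TTA")
rm.delete("TTA")
print(rm.lookup("TTA"))
```

With a single item in the filter all of its counters stay equal, so it always has a recurring minimum and the secondary filter is never touched; the secondary filter only comes into play once collisions leave an item with a single minimum.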

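51 Methods Comparison: MS vs. MI vs. RM
[Table: the three methods ranked on four criteria; the comparison symbols between the methods did not survive extraction, apart from the equalities shown.]
error rates           MI, RM, MS = ε
space overhead        MI, MS, RM
complexity            MS, MI, RM
insertion/deletion    MS = RM, MI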

52 Solving Problem 1 with an integer vector
- each counter fits in one word, for example a 4-byte word
- the $m$ counters take $4m$ bytes
- to get $\varepsilon < 1\%$ we have $m = 10n$, so the $m$ counters take $40n$ bytes
- with $n = 2^{20}$ (a little more than $10^6$ objects) the counters need 40 MB!
A vector of integers is too big: do we really need to count up to $2^{32}$?

54 Solving Problem 1 with a static bit vector
- suppose $\forall i.\ f_i < 10$, thus $C_j < 10 + \alpha$, with $\alpha$ depending on collisions
- use $\lceil \log_2 10 \rceil = 4$ bits (= 0.5 bytes) for each counter
- to get $\varepsilon < 1\%$ we have $m = 10n$, so the $m$ counters take $5n$ bytes
- with $n = 2^{20}$ we have a 5 MB static vector!
This static vector doesn't allow insertion or deletion.
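The two size estimates are simple arithmetic and can be checked directly. The variable names below are illustrative; sizes are expressed in units of $2^{20}$ bytes, which is what the slides' "MB" figures correspond to for $n = 2^{20}$.

```python
n = 2 ** 20             # number of objects
m = 10 * n              # number of counters
MIB = 2 ** 20           # bytes per unit (2^20)

word_vector = m * 4 / MIB          # one 4-byte word per counter
nibble_vector = m * 4 / 8 / MIB    # 4 bits per counter, 8 bits per byte

print(f"4-byte counters: {word_vector:.0f} MB")
print(f"4-bit counters:  {nibble_vector:.0f} MB")
```

The factor of eight between the two layouts comes entirely from shrinking each counter from 32 bits to 4.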

56 Solving Problem 1 with String Array Index
- use only the number of bits (per counter) strictly needed
- use some slack bits:
  - fix a value $\alpha > 0$
  - add $\alpha m$ bits to the vector, one every $1/\alpha$ items
- each counter $C_i$ uses $\lceil \log_2 C_i \rceil$ bits
- each counter counts up to $2^{\lceil \log_2 C_i \rceil} + ..$
- the $C$ vector takes $\sum_{i=1}^{m} \lceil \log_2 C_i \rceil + \alpha m = N$ bits
[Figure: the counters $C_1, C_2, \dots, C_m$ stored back to back, each in $\lceil \log_2 C_i \rceil$ bits, with one slack bit after every $1/\alpha$ counters.]

57 The String Array Index: main idea
- first level: pointers to subsequences of the SBF
  - a Coarse Offset Vector pointing to groups of $\log N$ items
  - there are $m / \log N$ such pointers
- second level, which may be either:
  - another Coarse Offset Vector of pointers to subsequences
  - a simple vector of offsets

58 The String Array Index: graphic
[Figure: a Coarse Offset Vector (C.O.V.) pointing into the string S of counters, with second-level Offset Vectors below it.]

59 The String Array Index: performances
Two levels of pointers to sub-sequences, with $N = \sum_{i=1}^{m} \left( \lceil \log_2 C_i \rceil + \alpha_i \right)$.
Theorem. The SAI, of size $o(N) + O(m)$ bits, can be built in $O(m)$ time, supporting access to sub-sequences in $O(1)$ time.
Theorem. An SBF of size $N + o(N) + O(m)$ bits can be built in $O(N)$ time, supporting lookup in $O(1)$ time. Furthermore, each update takes $O(1)$ amortized time.

60 SBF tricks
Merging by addition:
1. we have two sets $S_1$, $S_2$ and two SBFs $C^1$, $C^2$
2. suppose $m_1 = m_2$ and the same hash functions
3. just sum the counters: $C^{12}_i = C^1_i + C^2_i$
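Merging by addition is the counter analogue of the union by OR. A minimal sketch, with hypothetical helper names and an assumed hashing scheme:

```python
import hashlib


def positions(item: str, k: int, m: int) -> list[int]:
    # Assumed hashing scheme; repeated indexes are dropped.
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    h1 = int.from_bytes(digest[:8], "big")
    h2 = int.from_bytes(digest[8:16], "big") | 1
    return list(dict.fromkeys((h1 + i * h2) % m for i in range(k)))


def insert(counters: list[int], item: str, k: int) -> None:
    for p in positions(item, k, len(counters)):
        counters[p] += 1


def estimate(counters: list[int], item: str, k: int) -> int:
    return min(counters[p] for p in positions(item, k, len(counters)))


def merge(c1: list[int], c2: list[int]) -> list[int]:
    # Same m and same hash functions are required for the sum to make sense.
    assert len(c1) == len(c2)
    return [a + b for a, b in zip(c1, c2)]


K = 3
c1, c2 = [0] * 64, [0] * 64
for x in ("A", "A", "B"):
    insert(c1, x, K)
for x in ("A", "C"):
    insert(c2, x, K)
c12 = merge(c1, c2)
print(estimate(c12, "A", K))
```

The merged vector is identical to the one obtained by inserting both streams into a single SBF, so the Minimum Selection bound $f_x \le m_x$ carries over.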

62 Bloomier Filters (BBF)
- we compute any function $f$ using a BBF
- there are some constraints on $f$
- the same tradeoff is inherited from Bloom Filters
- we associate values with a subset of the domain elements

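63 Which function
$$f : D = \{0, \dots, N - 1\} \to R = \{\bot, 0, \dots, 2^r - 1\}$$
- the values are defined on a subset $S \subseteq D$
[Figure: the domain $D$ with the subset $S$, mapped by $f$ to $f(S)$ inside the range $R$.]
- on $S$, $f$ is error free
- outside $S$, $f$ returns $\bot$ with probability arbitrarily close to 1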

64 Main features
- query is $O(1)$
- space requirement is $O(nr)$
- it can be generalized to handle dynamic updates: the function can be updated, with the space unchanged
- we query values of $f$
- we may change $f(x)$ for $x \in S$, but $S$ is immutable

65 Main idea
- a false positive in a BBF means returning a result when the key is not in the map
- we give a simple idea of a BBF: the Bloom Filter cascade
- it can be formally generalized

66 A near-optimal and simple BBF
- the possible values are $\{0, 1\}$
- $A_0$ is a BF holding the keys mapping to 0
- $B_0$ is a BF holding the keys mapping to 1
- we will build many pairs $(A_i, B_i)$ (here is the cascade)
- we make a cross search, going as deep as we need
- what may happen when searching?

67 A near-optimal and simple BBF
We start looking in $(A_0, B_0)$:
- if the key is in neither, it is not in the map (surely)
- if it is in $A_0$ but not in $B_0$:
  - it does not map to 1 (surely)
  - it does map to 0 (probably)
- if it is in $A_0$ and in $B_0$:
  - which one lies? (false positive)
  - we have to go recursively into $(A_1, B_1)$
  - $A_1$ holds the keys mapping to 0 that are false positives in $B_0$
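The cascade for values in $\{0, 1\}$ can be sketched as follows. The names `Bloom`, `build_cascade` and `lookup`, the per-level salted SHA-256 hashing, the symmetric definition of $B_{i+1}$ (keys mapping to 1 that are false positives in $A_i$) and the `max_levels` guard are assumptions of this sketch, not details given on the slides.

```python
import hashlib
from typing import Optional


class Bloom:
    """Plain Bloom filter with a salt, so every level hashes differently."""

    def __init__(self, items: set[str], m: int, k: int, salt: str) -> None:
        self.m, self.k, self.salt = m, k, salt
        self.bits = bytearray(m)
        for item in items:
            for p in self._positions(item):
                self.bits[p] = 1

    def _positions(self, item: str) -> list[int]:
        digest = hashlib.sha256((self.salt + item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def __contains__(self, item: str) -> bool:
        return all(self.bits[p] for p in self._positions(item))


def build_cascade(mapping: dict[str, int], m: int, k: int,
                  max_levels: int = 32) -> list[tuple[Bloom, Bloom]]:
    zeros = {x for x, v in mapping.items() if v == 0}
    ones = {x for x, v in mapping.items() if v == 1}
    levels: list[tuple[Bloom, Bloom]] = []
    while zeros or ones:
        if len(levels) == max_levels:
            raise RuntimeError("cascade did not converge; increase m")
        a = Bloom(zeros, m, k, f"A{len(levels)}:")
        b = Bloom(ones, m, k, f"B{len(levels)}:")
        levels.append((a, b))
        # The next level keeps only the keys the opposite filter lies about.
        zeros, ones = ({x for x in zeros if x in b},
                       {x for x in ones if x in a})
    return levels


def lookup(levels: list[tuple[Bloom, Bloom]], x: str) -> Optional[int]:
    for a, b in levels:
        in_a, in_b = x in a, x in b
        if in_a and in_b:
            continue              # one of the two lies: go one level deeper
        if in_a:
            return 0
        if in_b:
            return 1
        return None               # in neither: surely not in the map
    return None


mapping = {"TTA": 0, "TCT": 1, "ATA": 0, "CGA": 1}
levels = build_cascade(mapping, m=256, k=3)
print([lookup(levels, x) for x in mapping])
```

By construction every key of the map gets its exact value: a key that collides at one level is stored at the next, and the build stops only when no key is ambiguous any more. Keys outside the map usually yield `None`, but may receive a spurious 0 or 1.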

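71 A BF cascade
[Figure: the pair $(A_0, B_0)$ at the top; the false positives of each filter feed the next pair $(A_1, B_1)$, and so on, with $A_{i+1} \subseteq A_i$.]
- the average search is $O(1)$: the first pairs are generally enough
- total space is independent of $n$: the first pair occupies most of the space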

72 The general idea
- results are binary-coded: $v \in R$ is coded with $\beta_v \in \{0, 1\}^q$
- for each bit of $\beta_v$ we use the simple BBF
What we get:
- space is slightly larger than the space for $2q$ BFs
- lookup is $\Theta(q)$
- build is $O(n \log n)$
- $E_{BBF}$ is proportional to $2^{-q}$

73 The general idea
- they use a table $T$ of coded values; $T$ has $m$ locations
- we have, as always, $k$ hash functions
- we use a masking value $M$ to reduce $E_{BBF}$
- if $x \in S$: $\beta_{f(x)} = M \oplus \bigoplus_{i=1}^{k} T[h_i(x)]$
- $P[\mathrm{lookup}(x, T) = \bot] \ge 1 - k\,2^{-q}$

75 Other special Bloom Filters
- Counting Bloom Filters: Broder et al. [1998]
- Compressed Bloom Filters: M. Mitzenmacher [2002]
- Attenuated Bloom Filters: Rhea, Kubiatowicz [2002]
- Compact Approximator of Lattice Functions: Boldi, Vigna [2004]

77 Where they are really used
- routing: probabilistic location and routing; shortest-path distance information
- proxy: Web proxy cache in SQUID; distributed caching
- peer-to-peer: summarizing the contents
- spell checking: the original idea of B. Bloom

79 References: foundations
- B. Bloom. Space/time tradeoffs in hash coding with allowable errors. CACM, 13(7), 1970
- Saar Cohen, Yossi Matias. Spectral bloom filters. ACM SIGMOD 03, 2003
- B. Chazelle, J. Kilian, R. Rubinfeld, A. Tal. The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. Proceedings of 15th SODA (2004), 30-39, 2004
- Bose, Guo, Kranakis, Maheshwari, Morin, Morrison, Smid, Tang. On the false-positive rate of Bloom Filters. School of Computer Science, Carleton University, 2004

80 References: extras
- M. Mitzenmacher. Compressed Bloom Filters. In Proceedings of 20th ACM SIGACT-SIGOPS, 2002
- P. Boldi, S. Vigna. Compact Approximation of Lattice Functions with Applications to Large-Alphabet Text Search. Dipartimento di Scienze dell'Informazione, Università di Milano, 2004
- A. Broder, M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Proceedings of 40th Allerton Conference (2004), 2002
