Introduction to Randomized Algorithms III

Size: px

Start display at page:

Download "Introduction to Randomized Algorithms III"

Elvin Knight
5 years ago
Views:

1 Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November

2 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability 1 / 2 k Counting with decreasing probability Approximate set membership Bloom Filters Counting Bloom Filters U. Aveiro, November

3 Motivation Is it possible to use a small counter to keep approximate counts of large numbers? Use a large number of such counters to keep track of the number of occurrences of many different events E.g., 8-bit counters Morris, Approximate Count Algorithm, 1978 U. Aveiro, November

4 Motivation But, nowadays memory is no longer scarce Is such an approach still interesting? Yes!! Massive data volumes!! Need quick and memory-efficient processing U. Aveiro, November

5 Application areas Online social networks Large-scale scientific experiments Search engines Online content delivery Product and consumer tracking. Data too large to fit in memory must be analyzed!! U. Aveiro, November

6 Big-Data Scale up vs Downsize Scale up the computation Replicate cheap hardware / devices Build massive DBMSs and warehouses BUT, expensive equipment / energy!! Downsize the data Compact representations of large data sets Approximate answers Probabilistic methods U. Aveiro, November

7 Probabilistic Counters Goal Avoid using large counters when dealing with large data volumes!! A counter with n bits counts up to 2 n events Can we use less bits? What is the cost? U. Aveiro, November

8 1 st Method For each event, increment the counter with probability 1 / 2 Intuition: just incrementing for half of the events!! We can now count up to 2 n + 1 events Using just n bits!! Is that what happens? Draw the state diagram / triangular diagram U. Aveiro, November

9 1 st Method Tasks Simulate such a counter for 10, 100, 1000 and events Repeat the experiments several times! What can you conclude? How to evaluate the accuracy? Relative error or accuracy ratio When knowing the exact value U. Aveiro, November

10 Counting 100 events trials U. Aveiro, November

11 1 st Method Expected value (mean) Counter is a random variable Resulting from a succession of random events What is the expected value after k events? X i represents the i th increment X i = 1 : counter is incremented X i = 0 : counter is not incremented P[ X i = 0 ] = P[ X i = 1 ] = 1 / 2 U. Aveiro, November

12 1 st Method Expected value (mean) E[ X i ] = 0 x P[ X i = 0 ] + 1 x P[ X i = 1 ] = 1 / 2 Counter value after k events is S = X i E[ S ] = E[ X i ] = E[ X i ] = k / 2 Number of events can be estimated by 2 x S U. Aveiro, November

13 1 st Method Variance σ 2 ( X i ) = E[ X i2 ] { E[ X i ] } 2 = E[ X i2 ] 1 / 4 E[ X i 2 ] = 0 2 x P[X i = 0] x P[X i = 1] = 1 / 2 σ 2 ( X i ) = 1 / 4 σ 2 ( S ) = σ 2 ( X i ) = σ 2 ( X i ) = k / 4 Standard deviation: σ ( S ) = k / 2 U. Aveiro, November

14 1 st Method Tasks Simulate such a counter for 10, 100, 1000 and events Repeat the experiments many times!! For each counter, compute the mean, variance and standard deviation of the experimental results Compare with the theoretical results! U. Aveiro, November

15 Counting 100 events trials U. Aveiro, November

16 1 st Method Probability distribution After n events, what is the probability of the counter value being k? p ( n, k ) =? Example for n = 4 More probable / Less probable counter values? p ( 4, k ) =? Binary table / Binary tree / Pascal-like triangle U. Aveiro, November

17 1 st Method Probability distribution U. Aveiro, November

18 Probability Distribution p = 1 / 2 U. Aveiro, November

19 Generalization Can we approx. count the same number of events using less bits? Or approx. count more events using the same number of bits? Yes! Increment the counter with lesser probability Increment with probability 1 / 2 k U. Aveiro, November

20 Generalization Tasks Incrementing with probability 1 / 2 k Obtain an expression for the mean, the variance and the stdr. deviation after n events k = 2, 3,, 6, Analyze the corresponding probability distributions Pascal-like triangle U. Aveiro, November

21 Generalization Mean and Variance Probability of incrementing the counter: p q = ( 1 p ) It is not difficult to check that, after n events: E[ S ] = n p σ 2 ( S ) = n p q U. Aveiro, November

22 Generalization Tasks Set the counting probability to 1 / 32 Simulate such a counter for 10, 100, 1000 and events Compute the mean, variance and standard deviation of the experimental results Compare with the theoretical results! U. Aveiro, November

23 Counting 100 events trials U. Aveiro, November

24 Counting events trials U. Aveiro, November

25 Probability Distribution p = 1 / 32 U. Aveiro, November

26 Probability Distribution p = 1 / 32 U. Aveiro, November

27 Fixed Probability Counters Recap For each event, increment the counter with probability 1 / 2 k, for k >= 1 On average, just incrementing for 1 / 2 k of the events!! Number of events estimated by 2 k x Counter We can now count up to 2 n + k events Using just n bits!! U. Aveiro, November

28 Issues What happens when counting a small number of events with probability 1 / 32? For much larger numbers of events, can we be more economical? U. Aveiro, November

29 Approximate Counting Binary Base Morris, 1978 For an arbitrary counting base As the counter value increases, it will be incremented with lesser probability If counter has value k Increment it with probability 1 / 2 k Do not increment it with probability ( 1 1 / 2 k ) Draw the state diagram! U. Aveiro, November

30 Approximate Counting Binary Base On average, how many events, n, are needed to reach a counter value of k? What does k represent? Events Counter value Number of events X 1 1 X Let s do it on the board! U. Aveiro, November

31 Approximate Counting Binary Base Counter is a random variable What is the expected value after n events? X i represents the i th increment X i = 1 : counter is incremented X i = 0 : counter is not incremented P[X i = 0 ] = 1 1 / 2 i-1 P[X i = 1 ] = 1 / 2 i-1 U. Aveiro, November

32 Approximate Counting Binary Base E[ X i ] = 1 / 2 i-1 Counter value after n events is S = X i E[ S ] = E[ X i ] = E[ X i ] E[ S ] = / / / / 4 + U. Aveiro, November

33 Approximate Counting Binary Base BUT, we only store integer values!! Number of events E[S] Expected counter value / / x 1 / x 1 / x 1 / x 1 / x 1 / 8 4 How to estimate the number of events from the counter value? U. Aveiro, November

34 Approximate Counting Binary Base After n = 2 k 1 events the expected counter value is k k = log 2 ( n + 1 ) = floor( log 2 n ) + 1 Generalize! After n events the expected counter value is floor( log 2 ( n + 1 ) ) Logarithmic counter!! For larger values, it counts slower U. Aveiro, November

35 Approximate Counting Binary Base After n probabilistic updates, the counter contains an approximation of log n That value is stored in log log n bits!! U. Aveiro, November

36 Approximate Counting Binary Base How to estimate the number of events from the counter value k? Compute 2 k 1 How to evaluate the counter s accuracy? Compare with floor( log 2 ( n + 1 ) ) What is the largest value that we can count with a 4-bit or 8-bit or 16-bit counter? U. Aveiro, November

37 Tasks Simulate such a counter for 10, 50, 100, 500, 1000, events Repeat the experiments many times! For each counter, compute the mean, variance and standard deviation of the experimental results What can you conclude? U. Aveiro, November

38 Counting events trials U. Aveiro, November

39 Approx. Counting Arbitrary Base For some applications the expected error of the previous method might be too large! How to improve the counter performance? If counter has value k Increment it with probability 1 / a k Do not increment it with probability ( 1 1 / a k ) a is now the counter base U. Aveiro, November

40 Approx. Counting Arbitrary Base Take a < 2 The counter value after m increments will be larger than with the binary base Giving a better accuracy!! Probabilities can be stored in a table No need to be recomputing!! U. Aveiro, November

41 Approx. Counting Arbitrary Base Possible values? a = 2 1/2, 2 1/4, How to estimate the number of events from the counter value k? Compute ( a k a + 1 ) / ( a 1 ) What is the largest value that we can count with a 4-bit or 8-bit or 16-bit counter? U. Aveiro, November

42 Tasks Simulate such a counter, with a = 2 1/2, for 10, 50, 100, 500, 1000, events Repeat the experiments many times! For each counter, compute the mean, variance and standard deviation of the experimental results What can you conclude? U. Aveiro, November

43 Counting events trials U. Aveiro, November

44 One recent paper from 2016 U. Aveiro, November

45 References R. Morris, Counting Large Numbers of Events in Small Registers, Commun. ACM, Vol. 21, N. 10, October 1978 P. Flajolet, Approximate Counting: A Detailed Analysis, Bit, Vol. 25, 1985 M. Csurös. Approximate counting with a floatingpoint counter. In COCOON, LNCS vol. 6196, p , Springer, 2010 U. Aveiro, November

46 Set Membership Given an arbitrary sized string s and a set S Does s belong to S? Easy answer for small sets! Complexity? BUT difficult answer for huge sets! E.g., Big-Data applications U. Aveiro, November

47 Hash Tables Data structure for storing key-value pairs No ordering!! BUT, fast access!! No duplicate keys!! U. Aveiro, November

48 Approximate Membership Queries Given a set S = {x 1, x 2,, xn} Answer queries of the form: Is y in S? Data structure should be FAST and SMALL Faster than searching through S Smaller than explicit representation U. Aveiro, November

49 Approximate Membership Queries How to get speed and size improvements? Allow some probability of error!! False positives y S but reporting y S False negatives y S but reporting y S U. Aveiro, November

50 Bloom Filters B. H. Bloom, 1970 Use hash functions to determine approximate set membership Allow for fast set membership tests on very large data sets Applications Spell-Checking / Text Analysis Network monitoring U. Aveiro, November

51 Application Spell-Checkers Determine if candidate words are members of the set of words in a dictionary The Bloom filter should be large enough to allow the inclusion of additional words by the user U. Aveiro, November

52 Application Spam We know 1 billion good addresses If an comes from one of these, it is NOT spam How check for spam in a FAST way? U. Aveiro, November

53 Application Web-Caching Bloom filters are used in WWW caching proxy servers Proxy servers intercept requests from clients and either fulfill the requests themselves or re-issue them to servers U. Aveiro, November

54 Bloom Filters Is y in S? A Bloom filter Provides an answer in constant time Time to hash Uses a small amount of memory space BUT, with some small probability of being wrong! U. Aveiro, November

55 1 st Register the elements of set S [Mitzenmacher] U. Aveiro, November

56 2 nd Process the queries [Mitzenmacher] U. Aveiro, November

57 Basic operations Initialization Clear all cells Insertion Compute the values of k hash functions Set the corresponding cells, if needed It takes constant time, but proportional to k U. Aveiro, November

58 Basic operations Membership test Compute the values of k hash functions Check if the corresponding cells have been set If any such cell is not set, the searched element is not a member of the set Worst-case? Checking all k cells! Set elements and false positives U. Aveiro, November

59 Bloom Filter Simple Demos Bloom Filters by Example Bloom Filters U. Aveiro, November

60 Bloom Filters Behaviour Deterministic hash functions! No attempt to solve hashing collisions! Can we get false negatives? Probability of false positives? How to minimize? U. Aveiro, November

61 Bloom Filter Parameters The behaviour of a Bloom filter is determined by four parameters n set elements registered in B m = c n cells in B (i.e., bits) k independent, random hash functions f is the fraction of cells set to 1 U. Aveiro, November

62 Bloom Filter Parameters How to choose m, the size of the filter? How to choose k, the number of hash functions? How do we choose the best k value? U. Aveiro, November

63 Probabilities After 1 insertion Initially all bits are set to zero Inserting one element What is the probability of b i = 1, after using the first hash function? Equal probability for any cell P b i = 1 = 1 m P b i = 0 = 1 1 m U. Aveiro, November

64 Probabilities After 1 insertion After computing the k hash functions and setting k cells P b i = 0 = 1 1 m k U. Aveiro, November

65 Probabilities After n insertions After inserting all n set elements, by computing each time k hash values Assuming independence P b i = 0 = 1 1 m k n U. Aveiro, November

66 Probabilities After n insertions P b i = 0 = 1 1 m k n n P b i = 1 = 1 a k, a = 1 1 m U. Aveiro, November

67 Probability of a false positive Testing the membership of an item not in S entails a positive answer Corresponding k bits are set to 1 The probability of that happening is k k p = 1 a kn/m k p 1 e U. Aveiro, November

68 Example n = 1 billion items, m = 8 billion bits k = 1 : p 1 e 1/8 = k = 2 : p 1 e 2/8 2 = What happens as we keep increasing k? U. Aveiro, November

69 Optimal value of k U. Aveiro, November

70 Optimal value of k To determine the value of k that minimizes p we minimize log p, which is more tractable And get k opt m n ln m n Use the closest integer to k opt For the previous example : k opt 5,54 6 U. Aveiro, November

71 Which Hash Functions? No need to use cryptographic hash functions! You can simulate k hash functions by simply combining two hash functions Kirsch and Mitzenmacher (2006) Compute one base hash function on unsigned 64-bit numbers Take the upper half and the lower half of that value and return them as two 32 bit numbers U. Aveiro, November

72 Bloom Filters Wrap-up No false negatives and limited memory usage Great for pre-processing before more expensive checks Suitable for hardware implementation Hash computations can be parallelized Error rate can be decreased by increasing the number of hash functions and allocated memory space U. Aveiro, November

73 Bloom Filters Wrap-up Useful for applications where an imperfect set membership test can be helpfully applied to a large data set of unknown composition Advantage over hash tables is Bloom filter speed and error rate U. Aveiro, November

74 Bloom Filters Pending Issues Cannot represent multi-sets I.e., sets with repeated elements Cannot query the multiplicity of an item Deleting an item is not possible! U. Aveiro, November

75 Counting Bloom Filters Multi-set representation Now, each filter cell is a w-bit counter w = 4 seems to be enough for most applications U. Aveiro, November

76 Counting Bloom Filters To insert an element, increase the value of each corresponding cell Test membership checks if each of the required cells is non-zero U. Aveiro, November

77 Counting Bloom Filters To delete an element, decrease the value of each corresponding cell Deletions necessarily introduce false negative errors!! How? U. Aveiro, November

78 Counting Bloom Filters To retrieve the count of an element : Compute its set of counters And return the minimum value as a frequency estimate U. Aveiro, November

79 Counting Bloom Filters [Mitzenmacher] U. Aveiro, November

80 Counting Bloom Filters [Mitzenmacher] U. Aveiro, November

81 Counting Bloom Filters Issues Counter overflow No more increments after reaching 2 w 1 BUT, now we have undercounts!! Choice of counter width w A large w diminishes space savings and introduces unused space (many zeros) A small w quickly leads to maximum values Trade-off U. Aveiro, November

82 Counting Bloomm Filters in Practice If insertions/deletions are rare compared to look-ups Keep a CBF in off-chip memory Keep a BF in on-chip memory Update the BF when the CBF changes Keep space savings of a Bloom filter But can deal with deletions Popular design for network devices U. Aveiro, November

83 References J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, 2014 Chapter 4 B. H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, Commun. ACM, July 1970 J. Blustein and A. El-Maazaw, Bloom Filters A Tutorial, Analysis, and Survey, TR CS , Dalhousie University, Halifax, NS, Canada, December 2002 A. Broder and M. Mitzenmacher, Network Applications of Bloom Filters: A Survey, Internet Mathematics, Vol. 1, N. 4, 2004 U. Aveiro, November

84 Acknowledgments An earlier version of some of these slides was developed by Professor Carlos Bastos Part of the slides adapted from original slides of J. Leskovec, A Rajaraman and J. Ullman Mining of Massive Datasets M. Mitzenmacher, Bloom Filters and Such 2014 Summer School on Hashing, Copenhagen, DK U. Aveiro, November

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of