Introduction to Randomized Algorithms III. Joaquim Madeira. Version 0.1, November 2017. U. Aveiro, November 2017.
Overview: Probabilistic counters (counting with probability $1/2$; counting with probability $1/2^k$; counting with decreasing probability). Approximate set membership (Bloom filters; counting Bloom filters).
Motivation: Is it possible to use a small counter to keep approximate counts of large numbers? Use a large number of such counters to keep track of the number of occurrences of many different events, e.g., 8-bit counters. Morris, Approximate Count Algorithm, 1978.
Motivation: But nowadays memory is no longer scarce. Is such an approach still interesting? Yes!! Massive data volumes!! They need quick and memory-efficient processing.
Application areas: Online social networks; large-scale scientific experiments; search engines; online content delivery; product and consumer tracking. Data too large to fit in memory must be analyzed!!
Big Data, scale up vs. downsize: Scale up the computation: replicate cheap hardware / devices, build massive DBMSs and warehouses, BUT expensive equipment / energy!! Downsize the data: compact representations of large data sets, approximate answers, probabilistic methods.
Probabilistic counters, goal: Avoid using large counters when dealing with large data volumes!! A counter with $n$ bits counts up to $2^n$ events. Can we use fewer bits? What is the cost?
1st method: For each event, increment the counter with probability $1/2$. Intuition: we increment for just half of the events!! We can now count up to $2^{n+1}$ events using just $n$ bits!! Is that what happens? Draw the state diagram / triangular diagram.
1st method, tasks: Simulate such a counter for 10, 100, 1000 and 10000 events. Repeat the experiments several times! What can you conclude? How to evaluate the accuracy? Use the relative error or the accuracy ratio, when the exact value is known.
[Figure: counting 100 events, 10000 trials]
1st method, expected value (mean): The counter is a random variable, resulting from a succession of random events. What is the expected value after $k$ events? $X_i$ represents the outcome of the $i$-th event: $X_i = 1$ if the counter is incremented, $X_i = 0$ if it is not. $P[X_i = 0] = P[X_i = 1] = 1/2$.
1st method, expected value (mean): $E[X_i] = 0 \times P[X_i = 0] + 1 \times P[X_i = 1] = 1/2$. The counter value after $k$ events is $S = \sum_{i=1}^{k} X_i$, so $E[S] = E[\sum_{i=1}^{k} X_i] = \sum_{i=1}^{k} E[X_i] = k/2$. The number of events can be estimated by $2 \times S$.
1st method, variance: $\sigma^2(X_i) = E[X_i^2] - \{E[X_i]\}^2 = E[X_i^2] - 1/4$. $E[X_i^2] = 0^2 \times P[X_i = 0] + 1^2 \times P[X_i = 1] = 1/2$, so $\sigma^2(X_i) = 1/4$. By independence, $\sigma^2(S) = \sigma^2(\sum_{i=1}^{k} X_i) = \sum_{i=1}^{k} \sigma^2(X_i) = k/4$. Standard deviation: $\sigma(S) = \sqrt{k}/2$.
1st method, tasks: Simulate such a counter for 10, 100, 1000 and 10000 events. Repeat the experiments many times!! For each number of events, compute the mean, variance and standard deviation of the experimental results. Compare with the theoretical results!
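One possible way to run this experiment is sketched below. The function names, the fixed seed and the default number of trials are assumptions made for illustration, not part of the original slides. The sketch compares the empirical mean and variance of the counter with the theoretical values $k/2$ and $k/4$ (written for a general increment probability `p`).

```python
import random
import statistics


def simulate_counter(num_events, p, rng):
    """Final value of a counter that is incremented with probability p per event."""
    return sum(1 for _ in range(num_events) if rng.random() < p)


def run_trials(num_events, p=0.5, trials=10_000, seed=2017):
    """Repeat the experiment and report empirical vs. theoretical statistics."""
    rng = random.Random(seed)  # fixed seed (an assumption) so that runs are reproducible
    values = [simulate_counter(num_events, p, rng) for _ in range(trials)]
    mean = statistics.fmean(values)
    return {
        "mean": mean,
        "variance": statistics.pvariance(values),
        "expected_mean": num_events * p,
        "expected_variance": num_events * p * (1 - p),
        "estimated_events": mean / p,  # average counter value scaled back by 1/p
    }


if __name__ == "__main__":
    for n in (10, 100, 1000, 10_000):
        print(n, run_trials(n, trials=200))
```

The same function can be reused for other fixed probabilities, e.g. `run_trials(100, p=1/32)`.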
[Figure: counting 100 events, 10000 trials]
1st method, probability distribution: After $n$ events, what is the probability of the counter value being $k$, i.e., $p(n, k) = ?$ Example for $n = 4$: which counter values are more probable and which are less probable? $p(4, k) = ?$ Use a binary table, a binary tree or a Pascal-like triangle.
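The Pascal-like triangle can be built row by row: each event either leaves the counter unchanged (probability $1 - p$) or increments it (probability $p$). The sketch below follows that recurrence; the function name `counter_distribution` is an assumption made for illustration.

```python
def counter_distribution(n, p=0.5):
    """Return [p(n, 0), ..., p(n, n)], built row by row like Pascal's triangle."""
    row = [1.0]  # before any event, the counter is 0 with probability 1
    for _ in range(n):
        nxt = [0.0] * (len(row) + 1)
        for k, prob in enumerate(row):
            nxt[k] += prob * (1 - p)  # the event does not increment the counter
            nxt[k + 1] += prob * p    # the event increments the counter
        row = nxt
    return row


print(counter_distribution(4))  # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```

For $n = 4$ and $p = 1/2$ the row is $1/16, 4/16, 6/16, 4/16, 1/16$, so the most probable counter value is 2.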
[Figure: 1st method, probability distribution]
[Figure: probability distribution for $p = 1/2$]
Generalization: Can we approximately count the same number of events using fewer bits? Or approximately count more events using the same number of bits? Yes! Increment the counter with a smaller probability: increment with probability $1/2^k$.
Generalization, tasks: Incrementing with probability $1/2^k$, obtain expressions for the mean, the variance and the standard deviation after $n$ events, for $k = 2, 3, \ldots, 6, \ldots$ Analyze the corresponding probability distributions (Pascal-like triangle).
Generalization, mean and variance: Let $p$ be the probability of incrementing the counter and $q = 1 - p$. It is not difficult to check that, after $n$ events, $E[S] = n p$ and $\sigma^2(S) = n p q$.
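The check can be written out as follows, reusing the indicator variables $X_i$ of the 1st method and assuming, as there, that the events are independent:

```latex
\[
\begin{aligned}
E[X_i]        &= 0 \times q + 1 \times p = p, \qquad E[X_i^2] = 0^2 \times q + 1^2 \times p = p,\\
\sigma^2(X_i) &= E[X_i^2] - \{E[X_i]\}^2 = p - p^2 = p\,q,\\
E[S]          &= \sum_{i=1}^{n} E[X_i] = n\,p, \qquad
\sigma^2(S)    = \sum_{i=1}^{n} \sigma^2(X_i) = n\,p\,q .
\end{aligned}
\]
```

With $p = q = 1/2$ this gives back $n/2$ and $n/4$, the values obtained for the 1st method.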
Generalization, tasks: Set the counting probability to $1/32$. Simulate such a counter for 10, 100, 1000 and 10000 events. Compute the mean, variance and standard deviation of the experimental results. Compare with the theoretical results!
[Figure: counting 100 events, 10000 trials]
[Figure: counting 10000 events, 10000 trials]
[Figure: probability distribution for $p = 1/32$]
[Figure: probability distribution for $p = 1/32$]
Fixed-probability counters, recap: For each event, increment the counter with probability $1/2^k$, for $k \geq 1$. On average, we increment for just $1/2^k$ of the events!! The number of events is estimated by $2^k \times \text{counter}$. We can now count up to $2^{n+k}$ events using just $n$ bits!!
Issues: What happens when counting a small number of events with probability $1/32$? For much larger numbers of events, can we be more economical?
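As a worked illustration of the first question, take $n = 10$ events and $p = 1/32$ (the value $n = 10$ is chosen here only as an example), and apply $E[S] = n p$ and $\sigma^2(S) = n p q$:

```latex
\[
\begin{aligned}
P[S = 0] &= q^{n} = \left(\tfrac{31}{32}\right)^{10} \approx 0.73,\\
\frac{\sigma(32\,S)}{n} &= \frac{32\sqrt{n\,p\,q}}{n} = \sqrt{\frac{q}{n\,p}} = \sqrt{3.1} \approx 1.76 .
\end{aligned}
\]
```

So in roughly three runs out of four the counter is still 0, and any non-zero counter yields an estimate of at least 32: for small counts the relative error is large.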
Approximate counting, binary base (Morris, 1978, for an arbitrary counting base): As the counter value increases, it is incremented with smaller probability. If the counter has value $k$: increment it with probability $1/2^k$; do not increment it with probability $1 - 1/2^k$. Draw the state diagram!
Approximate counting, binary base: On average, how many events, $n$, are needed to reach a counter value of $k$? What does $k$ represent? Table with columns "Events", "Counter value" and "Number of events"; first row: X, 1, 1. Let's do it on the board!
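One way to obtain the answer, assuming the counter starts at 0: while the counter holds value $j$, each event increments it with probability $1/2^j$, so the number of events $T_j$ spent at value $j$ is geometrically distributed with mean $2^j$.

```latex
\[
E[n] \;=\; \sum_{j=0}^{k-1} E[T_j] \;=\; \sum_{j=0}^{k-1} 2^{j} \;=\; 2^{k} - 1 .
\]
```

For example, reaching $k = 4$ takes $1 + 2 + 4 + 8 = 15$ events on average.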
Approximate counting, binary base: The counter is a random variable. What is the expected value after $n$ events? $X_i = 1$ if the counter is incremented and $X_i = 0$ if it is not. $P[X_i = 0] = 1 - 1/2^{i-1}$ and $P[X_i = 1] = 1/2^{i-1}$, where the exponent $i - 1$ stands for the counter value when the event arrives, not for the event index.
Approximate counting, binary base: $E[X_i] = 1/2^{i-1}$. The counter value after $n$ events is $S = \sum X_i$, so $E[S] = E[\sum X_i] = \sum E[X_i]$, i.e., $E[S] = 1 + 1/2 + 1/2 + 1/4 + 1/4 + \cdots$
Approximate counting, binary base: BUT we only store integer values!!
Number of events | E[S] | Expected counter value
1 | $1$ | 1
3 | $1 + 1/2 + 1/2$ | 2
7 | $1 + 2 \times 1/2 + 4 \times 1/4$ | 3
15 | $1 + 2 \times 1/2 + 4 \times 1/4 + 8 \times 1/8$ | 4
How to estimate the number of events from the counter value?
Approximate counting, binary base: After $n = 2^k - 1$ events the expected counter value is approximately $k$, with $k = \log_2(n + 1) = \lfloor \log_2 n \rfloor + 1$. Generalize! After $n$ events the expected counter value is approximately $\lfloor \log_2(n + 1) \rfloor$. Logarithmic counter!! For larger values, it counts more slowly.
Approximate counting, binary base: After $n$ probabilistic updates, the counter contains an approximation of $\log_2 n$. That value is stored in about $\log_2 \log_2 n$ bits!!
Approximate counting, binary base: How to estimate the number of events from the counter value $k$? Compute $2^k - 1$. How to evaluate the counter's accuracy? Compare with $\lfloor \log_2(n + 1) \rfloor$. What is the largest value that we can count with a 4-bit, 8-bit or 16-bit counter?
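The binary-base counter and its estimate $2^k - 1$ can be sketched as below. The class and function names, the seed and the choice to saturate at the largest register value are assumptions made for illustration; they are not taken from Morris's paper.

```python
import random


class MorrisCounter:
    """Binary-base approximate counter: at value k, increment with probability 1/2**k."""

    def __init__(self, bits=8, rng=None):
        self.max_value = 2 ** bits - 1  # largest value that fits in the register
        self.value = 0
        self.rng = rng if rng is not None else random.Random()

    def add_event(self):
        # Saturate at max_value instead of overflowing (a choice of this sketch).
        if self.value < self.max_value and self.rng.random() < 2.0 ** -self.value:
            self.value += 1

    def estimate(self):
        """Estimated number of events for the current counter value."""
        return 2 ** self.value - 1


def largest_estimate(bits):
    """Largest event count that a counter with the given width can report."""
    return 2 ** (2 ** bits - 1) - 1


def mean_estimate(num_events, trials=1000, seed=1978):
    """Average estimate over several independent counters fed num_events events."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        counter = MorrisCounter(bits=8, rng=rng)
        for _ in range(num_events):
            counter.add_event()
        total += counter.estimate()
    return total / trials


if __name__ == "__main__":
    print(largest_estimate(4))  # 32767
    print(mean_estimate(1000))
```

A 4-bit register holds values up to 15, so it can report up to $2^{15} - 1 = 32767$ events; an 8-bit register reaches $2^{255} - 1$. Individual estimates are always of the form $2^k - 1$, so a single counter can be far off; only the average over many counters is close to the true count.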
Tasks: Simulate such a counter for 10, 50, 100, 500, 1000 and 10000 events. Repeat the experiments many times! For each number of events, compute the mean, variance and standard deviation of the experimental results. What can you conclude?
[Figure: counting 10000 events, 10000 trials]
Approx. counting, arbitrary base: For some applications the expected error of the previous method might be too large! How to improve the counter performance? If the counter has value $k$: increment it with probability $1/a^k$; do not increment it with probability $1 - 1/a^k$. $a$ is now the counter base.
Approx. counting, arbitrary base: Take $1 < a < 2$. The counter value after $m$ events will be larger than with the binary base, giving a better accuracy!! The probabilities can be stored in a table, so there is no need to recompute them!!
Approx. counting, arbitrary base: Possible values? $a = 2^{1/2}, 2^{1/4}, \ldots$ How to estimate the number of events from the counter value $k$? Compute $(a^k - 1) / (a - 1)$, which reduces to $2^k - 1$ for $a = 2$. What is the largest value that we can count with a 4-bit, 8-bit or 16-bit counter?
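A minimal sketch of the base-$a$ counter with a precomputed probability table follows. The function names and the saturation at the last table entry are assumptions made for illustration. The estimator used is $(a^k - 1)/(a - 1)$, which follows from $E[a^{C}] = 1 + n(a - 1)$ for a counter $C$ that starts at 0 and is incremented with probability $1/a^{C}$; for $a = 2$ it reduces to $2^k - 1$.

```python
import random


def make_probability_table(a, bits):
    """Increment probabilities 1/a**k for every value k a bits-wide register can hold."""
    return [a ** -k for k in range(2 ** bits)]


def count_events(num_events, table, rng):
    """Simulate the counter on num_events events and return its final value."""
    k = 0
    last = len(table) - 1  # saturate at the largest register value
    for _ in range(num_events):
        if k < last and rng.random() < table[k]:
            k += 1
    return k


def estimate_events(k, a):
    """Estimate of the number of events from counter value k and base a."""
    return (a ** k - 1) / (a - 1)


if __name__ == "__main__":
    base = 2 ** 0.5
    table = make_probability_table(base, bits=8)
    rng = random.Random(2017)
    for n in (10, 50, 100, 500, 1000, 10_000):
        k = count_events(n, table, rng)
        print(n, k, round(estimate_events(k, base), 1))
```

With $a = 2^{1/2}$ the counter advances roughly twice as far as the binary-base counter for the same number of events, which is where the accuracy gain comes from, at the price of a smaller maximum count for a given register width.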
Tasks: Simulate such a counter, with $a = 2^{1/2}$, for 10, 50, 100, 500, 1000 and 10000 events. Repeat the experiments many times! For each number of events, compute the mean, variance and standard deviation of the experimental results. What can you conclude?
[Figure: counting 10000 events, 10000 trials]
[Figure: one recent paper, from 2016]
References:
R. Morris, Counting Large Numbers of Events in Small Registers, Commun. ACM, Vol. 21, N. 10, October 1978.
P. Flajolet, Approximate Counting: A Detailed Analysis, BIT, Vol. 25, 1985.
M. Csűrös, Approximate Counting with a Floating-Point Counter, in COCOON, LNCS Vol. 6196, pp. 358-367, Springer, 2010.
Set membership: Given an arbitrary-sized string $s$ and a set $S$, does $s$ belong to $S$? Easy answer for small sets! Complexity? BUT difficult answer for huge sets, e.g., in Big Data applications!
Hash tables: A data structure for storing key-value pairs. No ordering!! BUT fast access!! No duplicate keys!!
Approximate membership queries: Given a set $S = \{x_1, x_2, \ldots, x_n\}$, answer queries of the form: is $y$ in $S$? The data structure should be FAST (faster than searching through $S$) and SMALL (smaller than an explicit representation of $S$).
Approximate membership queries: How to get speed and size improvements? Allow some probability of error!! False positives: $y \notin S$ but reporting $y \in S$. False negatives: $y \in S$ but reporting $y \notin S$.
Bloom filters (B. H. Bloom, 1970): Use hash functions to determine approximate set membership. Allow for fast set membership tests on very large data sets. Applications: spell-checking / text analysis; network monitoring.
Application, spell-checkers: Determine if candidate words are members of the set of words in a dictionary. The Bloom filter should be large enough to allow the inclusion of additional words by the user.
Application, email spam: We know 1 billion good email addresses. If an email comes from one of these, it is NOT spam. How to check for spam in a FAST way?
Application, web caching: Bloom filters are used in WWW caching proxy servers. Proxy servers intercept requests from clients and either fulfill the requests themselves or re-issue them to servers.
Bloom filters: Is $y$ in $S$? A Bloom filter provides an answer in constant time (the time to hash) and uses a small amount of memory space, BUT with some small probability of being wrong!
1st: register the elements of set $S$. [Figure, from Mitzenmacher]
2nd: process the queries. [Figure, from Mitzenmacher]
Basic operations: Initialization: clear all cells. Insertion: compute the values of the $k$ hash functions and set the corresponding cells, if needed. Insertion takes time proportional to $k$, independent of the number of elements already stored.
Basic operations, membership test: Compute the values of the $k$ hash functions and check if the corresponding cells have been set. If any such cell is not set, the searched element is not a member of the set. Worst case? Checking all $k$ cells, which happens for set elements and for false positives.
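The three basic operations can be sketched as follows. The class name, the use of one byte per cell and the way the $k$ hash functions are derived (SHA-256 of the item prefixed with the function index) are assumptions made for illustration only.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter with m cells and k hash functions."""

    def __init__(self, m, k):
        self.m = m
        self.k = k
        # Initialization: all cells cleared. One byte per cell keeps the sketch
        # simple; a real filter would pack eight cells into each byte.
        self.cells = bytearray(m)

    def _indexes(self, item):
        # Assumption: the i-th hash function is SHA-256 of "i:item".
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        """Insertion: set the k cells selected by the hash functions."""
        for idx in self._indexes(item):
            self.cells[idx] = 1

    def __contains__(self, item):
        """Membership test: stops at the first cell found unset."""
        return all(self.cells[idx] for idx in self._indexes(item))


if __name__ == "__main__":
    words = BloomFilter(m=1000, k=3)
    for w in ("apple", "banana"):
        words.add(w)
    print("apple" in words)   # True: inserted elements are always found
    print("cherry" in words)  # usually False, but a false positive is possible
```

Because `all` stops at the first unset cell, a negative answer may need fewer than $k$ probes, while set elements and false positives always need all $k$.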
Bloom filter, simple demos: Bloom Filters by Example, http://billmill.org/bloomfilter-tutorial/ ; Bloom Filters, https://www.jasondavies.com/bloomfilter/
Bloom filters, behaviour: Deterministic hash functions! No attempt to resolve hashing collisions! Can we get false negatives? What is the probability of false positives? How to minimize it?
Bloom filter parameters: The behaviour of a Bloom filter B is determined by four parameters: $n$, the number of set elements registered in B; $m = c\,n$, the number of cells (i.e., bits) in B; $k$, the number of independent, random hash functions; $f$, the fraction of cells set to 1.
Bloom filter parameters: How to choose $m$, the size of the filter? How to choose $k$, the number of hash functions? How do we choose the best $k$ value?
Probabilities after 1 insertion: Initially all bits are set to zero. When inserting one element, what is the probability of $b_i = 1$ after using the first hash function? With equal probability for any cell, $P[b_i = 1] = \frac{1}{m}$ and $P[b_i = 0] = 1 - \frac{1}{m}$.
Probabilities after 1 insertion: After computing the $k$ hash functions and setting $k$ cells, $P[b_i = 0] = \left(1 - \frac{1}{m}\right)^{k}$.
Probabilities after $n$ insertions: After inserting all $n$ set elements, computing $k$ hash values each time and assuming independence, $P[b_i = 0] = \left(1 - \frac{1}{m}\right)^{kn}$.
Probabilities after $n$ insertions: $P[b_i = 0] = \left(1 - \frac{1}{m}\right)^{kn} = a^{kn}$ and $P[b_i = 1] = 1 - a^{kn}$, with $a = 1 - \frac{1}{m}$.
Probability of a false positive: Testing the membership of an item not in $S$ yields a positive answer when the corresponding $k$ bits are all set to 1. The probability of that happening is $p = \left(1 - a^{kn}\right)^{k} \approx \left(1 - e^{-kn/m}\right)^{k}$.
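The exponential approximation comes from the standard limit $(1 - 1/m)^m \to e^{-1}$ for large $m$, applied to $a = 1 - 1/m$:

```latex
\[
a^{kn} \;=\; \left(1 - \frac{1}{m}\right)^{kn}
       \;=\; \left[\left(1 - \frac{1}{m}\right)^{m}\right]^{kn/m}
       \;\approx\; \left(e^{-1}\right)^{kn/m} \;=\; e^{-kn/m},
\qquad\text{so}\qquad
p \;\approx\; \left(1 - e^{-kn/m}\right)^{k} .
\]
```

For filters with millions of cells the approximation error is negligible.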
Example: $n$ = 1 billion items, $m$ = 8 billion bits. $k = 1$: $p \approx 1 - e^{-1/8} = 0.1175$. $k = 2$: $p \approx \left(1 - e^{-2/8}\right)^{2} = 0.0489$. What happens as we keep increasing $k$?
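The question can be explored numerically by evaluating $p \approx (1 - e^{-kn/m})^k$ for increasing $k$. The function name `false_positive_rate` and the range of $k$ values are assumptions made for illustration.

```python
import math


def false_positive_rate(n, m, k):
    """Approximate false-positive probability of a Bloom filter."""
    return (1.0 - math.exp(-k * n / m)) ** k


if __name__ == "__main__":
    n, m = 10**9, 8 * 10**9  # 1 billion items, 8 billion bits
    for k in range(1, 11):
        print(k, round(false_positive_rate(n, m, k), 4))
```

For this example the rate first drops as $k$ grows, bottoms out at about 0.02 around $k = 5$ or $6$, and then rises again: every extra hash function sets more cells, and past a certain point the fuller filter outweighs the benefit of checking more cells.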
[Figure: optimal value of $k$]
Optimal value of $k$: To determine the value of $k$ that minimizes $p$ we minimize $\ln p$, which is more tractable, and get $k_{opt} = \frac{m}{n} \ln 2 \approx 0.693\,\frac{m}{n}$. Use the closest integer to $k_{opt}$. For the previous example: $k_{opt} \approx 5.54$, so use $k = 6$.
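One way to carry out the minimization, starting from $p \approx (1 - e^{-kn/m})^k$, is the substitution $t = e^{-kn/m}$ (so that $k = -\frac{m}{n} \ln t$); the substitution is a choice made here for brevity:

```latex
\[
\begin{aligned}
\ln p &= k \,\ln\!\left(1 - e^{-kn/m}\right) = -\frac{m}{n}\,\ln t \,\ln(1 - t),\\
&\text{which is symmetric in } t \text{ and } 1 - t \text{ and is smallest at } t = \tfrac{1}{2},\\
k_{opt} &= \frac{m}{n}\,\ln 2, \qquad
p_{min} = \left(\tfrac{1}{2}\right)^{k_{opt}} \approx 0.6185^{\,m/n} .
\end{aligned}
\]
```

At the optimum $t = 1/2$, i.e., about half of the cells are set to 1. For $m/n = 8$ this gives $k_{opt} \approx 5.54$ and $p_{min} \approx 0.021$.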
Which hash functions? No need to use cryptographic hash functions! You can simulate $k$ hash functions by simply combining two hash functions (Kirsch and Mitzenmacher, 2006). Compute one base hash function that produces an unsigned 64-bit number; take the upper half and the lower half of that value and use them as two 32-bit numbers.
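A minimal sketch of this scheme: the $i$-th index is computed as $(h_1 + i \cdot h_2) \bmod m$, where $h_1$ and $h_2$ are the two 32-bit halves of one 64-bit hash. The function names are assumptions, and BLAKE2b with an 8-byte digest is used only because it is readily available in Python's standard library; any good 64-bit hash would serve.

```python
import hashlib


def two_hashes(item):
    """Split one 64-bit base hash into its upper and lower 32-bit halves."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    h = int.from_bytes(digest, "big")  # unsigned 64-bit value
    return h >> 32, h & 0xFFFFFFFF


def k_indexes(item, k, m):
    """Simulate k hash functions as h1 + i * h2 (mod m), for i = 0, ..., k - 1."""
    h1, h2 = two_hashes(item)
    return [(h1 + i * h2) % m for i in range(k)]


if __name__ == "__main__":
    print(k_indexes("apple", k=6, m=8_000))
```

Only one hash computation is needed per item, whatever the value of $k$.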
Bloom filters, wrap-up: No false negatives and limited memory usage. Great for pre-processing before more expensive checks. Suitable for hardware implementation: the hash computations can be parallelized. The error rate can be decreased by allocating more memory and choosing the number of hash functions close to its optimal value.
Bloom filters, wrap-up: Useful for applications where an imperfect set membership test can be helpfully applied to a large data set of unknown composition. The advantage over hash tables is a much smaller memory footprint with fast, constant-time tests, at the cost of a controllable false-positive rate.
Bloom filters, pending issues: Cannot represent multisets, i.e., sets with repeated elements. Cannot query the multiplicity of an item. Deleting an item is not possible!
Counting Bloom filters: Multiset representation. Now each filter cell is a $w$-bit counter; $w = 4$ seems to be enough for most applications.
Counting Bloom filters: To insert an element, increase the value of each corresponding cell. The membership test checks if each of the required cells is non-zero.
Counting Bloom filters: To delete an element, decrease the value of each corresponding cell. Deletions can introduce false negative errors!! How?
Counting Bloom filters: To retrieve the count of an element, compute its set of counters and return the minimum value as a frequency estimate.
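Insertion, deletion, membership test and count retrieval can be sketched together as below. The class name, the SHA-256-based index derivation and the decision to saturate counters at $2^w - 1$ are assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Minimal counting Bloom filter with m cells, each a w-bit saturating counter."""

    def __init__(self, m, k, w=4):
        self.m = m
        self.k = k
        self.max_count = 2 ** w - 1
        self.cells = [0] * m

    def _indexes(self, item):
        # Assumption: the i-th hash function is SHA-256 of "i:item".
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for idx in self._indexes(item):
            if self.cells[idx] < self.max_count:  # saturate instead of overflowing
                self.cells[idx] += 1

    def delete(self, item):
        # Deleting an item that was never inserted corrupts other items' counters.
        for idx in self._indexes(item):
            if self.cells[idx] > 0:
                self.cells[idx] -= 1

    def count(self, item):
        """Frequency estimate: the minimum over the item's counters."""
        return min(self.cells[idx] for idx in self._indexes(item))

    def __contains__(self, item):
        return self.count(item) > 0


if __name__ == "__main__":
    cbf = CountingBloomFilter(m=1000, k=3)
    for _ in range(3):
        cbf.insert("apple")
    print(cbf.count("apple"))
    cbf.delete("apple")
    print(cbf.count("apple"))
```

As long as no counter saturates and only inserted items are deleted, the minimum never underestimates the true count; collisions with other items can only push it up.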
Counting Bloom filters. [Figure, from Mitzenmacher]
Counting Bloom filters. [Figure, from Mitzenmacher]
Counting Bloom filters, issues: Counter overflow: no more increments after reaching $2^w - 1$, BUT now we have undercounts!! Choice of counter width $w$: a large $w$ diminishes space savings and introduces unused space (many zeros); a small $w$ quickly leads to maximum values. Trade-off!
Counting Bloom filters in practice: If insertions/deletions are rare compared to look-ups: keep a CBF in off-chip memory; keep a BF in on-chip memory; update the BF when the CBF changes. This keeps the space savings of a Bloom filter but can deal with deletions. Popular design for network devices.
References:
J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, 2014, Chapter 4.
B. H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, Commun. ACM, July 1970.
J. Blustein and A. El-Maazawi, Bloom Filters: A Tutorial, Analysis, and Survey, TR CS-2002-10, Dalhousie University, Halifax, NS, Canada, December 2002.
A. Broder and M. Mitzenmacher, Network Applications of Bloom Filters: A Survey, Internet Mathematics, Vol. 1, N. 4, 2004.
Acknowledgments: An earlier version of some of these slides was developed by Professor Carlos Bastos. Part of the slides were adapted from the original slides of: J. Leskovec, A. Rajaraman and J. Ullman, Mining of Massive Datasets, www.mmds.org; M. Mitzenmacher, Bloom Filters and Such, 2014 Summer School on Hashing, Copenhagen, DK.