Introduction to Randomized Algorithms III


Introduction to Randomized Algorithms III. Joaquim Madeira. Version 0.1, November 2017. U. Aveiro, November 2017.

Overview. Probabilistic counters: counting with probability $1/2$; counting with probability $1/2^k$; counting with decreasing probability. Approximate set membership: Bloom filters; counting Bloom filters.

Motivation. Is it possible to use a small counter to keep approximate counts of large numbers? The aim is to use a large number of such counters to keep track of the number of occurrences of many different events, e.g., with 8-bit counters. Morris, Approximate Count Algorithm, 1978.

Motivation. But nowadays memory is no longer scarce. Is such an approach still interesting? Yes! Massive data volumes need quick and memory-efficient processing.

Application areas. Online social networks; large-scale scientific experiments; search engines; online content delivery; product and consumer tracking. Data too large to fit in memory must still be analyzed!

Big Data: scale up vs. downsize. Scale up the computation: replicate cheap hardware and devices, build massive DBMSs and warehouses, BUT with expensive equipment and energy! Downsize the data: compact representations of large data sets, approximate answers, probabilistic methods.

Probabilistic Counters: Goal. Avoid using large counters when dealing with large data volumes! A counter with $n$ bits counts up to $2^n$ events. Can we use fewer bits? What is the cost?

1st Method. For each event, increment the counter with probability $1/2$. Intuition: we increment for only half of the events, so we can now count up to $2^{n+1}$ events using just $n$ bits! Is that what happens? Draw the state diagram / triangular diagram.

1st Method: Tasks. Simulate such a counter for 10, 100, 1000 and 10000 events. Repeat the experiments several times! What can you conclude? How to evaluate the accuracy? Use the relative error or the accuracy ratio, which requires knowing the exact value.

[Figure: counting 100 events, 10000 trials]

1st Method: Expected value (mean). The counter is a random variable, resulting from a succession of random events. What is the expected value after $k$ events? $X_i$ represents the $i$-th increment: $X_i = 1$ if the counter is incremented, $X_i = 0$ if it is not, with $P[X_i = 0] = P[X_i = 1] = 1/2$.

1st Method: Expected value (mean). $E[X_i] = 0 \times P[X_i = 0] + 1 \times P[X_i = 1] = 1/2$. The counter value after $k$ events is $S = \sum_{i=1}^{k} X_i$, so $E[S] = E[\sum_{i=1}^{k} X_i] = \sum_{i=1}^{k} E[X_i] = k/2$. The number of events can be estimated by $2 \times S$.

1st Method: Variance. $\sigma^2(X_i) = E[X_i^2] - (E[X_i])^2 = E[X_i^2] - 1/4$. $E[X_i^2] = 0^2 \times P[X_i = 0] + 1^2 \times P[X_i = 1] = 1/2$, so $\sigma^2(X_i) = 1/4$. Then $\sigma^2(S) = \sigma^2(\sum_{i=1}^{k} X_i) = \sum_{i=1}^{k} \sigma^2(X_i) = k/4$. Standard deviation: $\sigma(S) = \sqrt{k}/2$.

1st Method: Tasks. Simulate such a counter for 10, 100, 1000 and 10000 events. Repeat the experiments many times! For each number of events, compute the mean, variance and standard deviation of the experimental results. Compare with the theoretical results!
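One possible way to run this experiment is sketched below. The function names `simulate_counter` and `run_trials`, the fixed seed, and the number of trials are illustrative assumptions, not part of the original slides.

```python
import random
import statistics

def simulate_counter(num_events, p, rng):
    """Counter value after num_events events, each counted with probability p."""
    return sum(1 for _ in range(num_events) if rng.random() < p)

def run_trials(num_events, num_trials=1000, p=0.5, seed=1):
    """Return (mean, variance, std deviation) of the counter over many trials."""
    rng = random.Random(seed)  # fixed seed only to make runs repeatable
    values = [simulate_counter(num_events, p, rng) for _ in range(num_trials)]
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)
    return mean, var, var ** 0.5

for k in (10, 100, 1000):
    mean, var, std = run_trials(k)
    # Theory for p = 1/2: mean k/2, variance k/4, standard deviation sqrt(k)/2.
    print(f"k={k}: mean={mean:.2f} (theory {k / 2}), "
          f"var={var:.2f} (theory {k / 4}), std={std:.2f}")
```

The estimate of the number of events is then `2 * counter`, and its relative error can be computed against the known `k`.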

[Figure: counting 100 events, 10000 trials]

1st Method: Probability distribution. After $n$ events, what is the probability of the counter value being $k$, i.e., $p(n, k) = ?$ Example for $n = 4$: which counter values are more probable, and which are less probable? $p(4, k) = ?$ Use a binary table, a binary tree, or a Pascal-like triangle.
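The Pascal-like triangle can be built row by row: after one more event the counter either stays at $k$ (probability $1 - p$) or moves up from $k - 1$ (probability $p$). The sketch below assumes independent events; the function name `counter_triangle` is hypothetical.

```python
def counter_triangle(n_max, p=0.5):
    """rows[n][k] = probability that the counter equals k after n events."""
    rows = [[1.0]]  # before any event the counter is 0 with probability 1
    for n in range(1, n_max + 1):
        prev = rows[-1]  # distribution after n - 1 events, indices 0..n-1
        row = []
        for k in range(n + 1):
            stay = prev[k] * (1 - p) if k < n else 0.0  # event not counted
            step = prev[k - 1] * p if k > 0 else 0.0    # event counted
            row.append(stay + step)
        rows.append(row)
    return rows

for n, row in enumerate(counter_triangle(4)):
    print(n, row)
# The last row, p(4, k), is [0.0625, 0.25, 0.375, 0.25, 0.0625].
```

For $p = 1/2$ each row is the corresponding row of Pascal's triangle divided by $2^n$, i.e., a binomial distribution.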

[Figure: 1st Method, probability distribution]

[Figure: probability distribution for $p = 1/2$]

Generalization. Can we approximately count the same number of events using fewer bits? Or approximately count more events using the same number of bits? Yes! Increment the counter with a smaller probability: increment with probability $1/2^k$.

Generalization: Tasks. When incrementing with probability $1/2^k$, obtain expressions for the mean, the variance and the standard deviation after $n$ events, for $k = 2, 3, \dots, 6, \dots$ Analyze the corresponding probability distributions (Pascal-like triangle).

Generalization: Mean and Variance. Let $p$ be the probability of incrementing the counter and $q = 1 - p$. It is not difficult to check that, after $n$ events, $E[S] = n p$ and $\sigma^2(S) = n p q$.
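The check can be written out as follows, assuming (as in the 1st method) that the events are counted independently, with $X_i = 1$ when the $i$-th event is counted:

```latex
\begin{align*}
E[S] &= E\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E[X_i] = n p, \\
\sigma^2(X_i) &= E[X_i^2] - (E[X_i])^2 = p - p^2 = p q, \\
\sigma^2(S) &= \sum_{i=1}^{n} \sigma^2(X_i) = n p q \quad \text{(by independence)}.
\end{align*}
```

With $p = 1/2$ this gives mean $n/2$ and variance $n/4$, matching the 1st method.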

Generalization: Tasks. Set the counting probability to $1/32$. Simulate such a counter for 10, 100, 1000 and 10000 events. Compute the mean, variance and standard deviation of the experimental results. Compare with the theoretical results!

[Figure: counting 100 events, 10000 trials]

[Figure: counting 10000 events, 10000 trials]

[Figures: probability distribution for $p = 1/32$ (two slides)]

Fixed Probability Counters: Recap. For each event, increment the counter with probability $1/2^k$, for $k \geq 1$. On average, we increment for just $1/2^k$ of the events! The number of events is estimated by $2^k \times \text{Counter}$. We can now count up to $2^{n+k}$ events using just $n$ bits!

Issues. What happens when counting a small number of events with probability $1/32$? For much larger numbers of events, can we be more economical?
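The first issue can be made concrete with a small calculation. The sketch below assumes independent events; the helper names are hypothetical. It computes the probability that the counter is still zero after $n$ events, and the standard deviation of the estimate $S / p$ relative to $n$, which follows from $\sigma^2(S) = n p q$.

```python
def prob_counter_zero(num_events, p=1 / 32):
    """Probability that no event at all has been counted."""
    return (1 - p) ** num_events

def relative_std(num_events, p=1 / 32):
    """Std deviation of the estimate S / p divided by n: sqrt(q / (n p))."""
    return ((1 - p) / (num_events * p)) ** 0.5

for n in (10, 100, 1000, 10000):
    print(f"n={n}: P[counter = 0] = {prob_counter_zero(n):.4f}, "
          f"relative std = {relative_std(n):.3f}")
```

For 10 events the counter is most likely still zero, so a fixed small probability gives poor estimates for small counts.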

Approximate Counting: Binary Base. Morris, 1978, for an arbitrary counting base. As the counter value increases, it is incremented with smaller probability. If the counter has value $k$: increment it with probability $1/2^k$; do not increment it with probability $1 - 1/2^k$. Draw the state diagram!

Approximate Counting: Binary Base. On average, how many events, $n$, are needed to reach a counter value of $k$? What does $k$ represent? [Table: events, counter value, number of events] Let's do it on the board!

Approximate Counting: Binary Base. The counter is a random variable. What is the expected value after $n$ events? $X_i$ represents the $i$-th increment: $X_i = 1$ if the counter is incremented, $X_i = 0$ if it is not, with $P[X_i = 0] = 1 - 1/2^{i-1}$ and $P[X_i = 1] = 1/2^{i-1}$.

Approximate Counting: Binary Base. $E[X_i] = 1/2^{i-1}$. The counter value after $n$ events is $S = \sum X_i$, so $E[S] = E[\sum X_i] = \sum E[X_i]$, i.e., $E[S] = 1 + 1/2 + 1/2 + 1/4 + 1/4 + \dots$

Approximate Counting: Binary Base. BUT, we only store integer values!
Number of events    E[S]                                  Expected counter value
1                   1                                     1
3                   1 + 1/2 + 1/2                         2
7                   1 + 2 x 1/2 + 4 x 1/4                 3
15                  1 + 2 x 1/2 + 4 x 1/4 + 8 x 1/8       4
How to estimate the number of events from the counter value?

Approximate Counting: Binary Base. After $n = 2^k - 1$ events the expected counter value is $k$, with $k = \log_2(n + 1) = \lfloor \log_2 n \rfloor + 1$. Generalize! After $n$ events the expected counter value is $\lfloor \log_2(n + 1) \rfloor$. A logarithmic counter! For larger values, it counts slower.

Approximate Counting: Binary Base. After $n$ probabilistic updates, the counter contains an approximation of $\log_2 n$. That value is stored in $\log_2 \log_2 n$ bits!

Approximate Counting: Binary Base. How to estimate the number of events from the counter value $k$? Compute $2^k - 1$. How to evaluate the counter's accuracy? Compare the counter value with $\lfloor \log_2(n + 1) \rfloor$. What is the largest value that we can count with a 4-bit, 8-bit or 16-bit counter?
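A minimal sketch of the binary-base counter follows. The class name `MorrisCounter`, the saturating behaviour at the largest storable value, and the helper `max_estimate` are illustrative assumptions, not taken from the slides.

```python
import random

class MorrisCounter:
    """Binary-base approximate counter stored in a fixed number of bits."""

    def __init__(self, bits=8, rng=None):
        self.max_value = (1 << bits) - 1   # largest value the register can hold
        self.value = 0
        self.rng = rng or random.Random()

    def increment(self):
        # With value k, count the event with probability 1 / 2**k.
        # Assumption: once the register is full, further events are ignored.
        if self.value < self.max_value and self.rng.random() < 2.0 ** -self.value:
            self.value += 1

    def estimate(self):
        # Estimated number of events for counter value k: 2**k - 1.
        return 2 ** self.value - 1

def max_estimate(bits):
    """Largest estimate representable by a counter with the given width."""
    return 2 ** ((1 << bits) - 1) - 1

counter = MorrisCounter(bits=8, rng=random.Random(42))
for _ in range(10000):
    counter.increment()
print("counter value:", counter.value, "estimate:", counter.estimate())
print("4-bit counter can estimate up to", max_estimate(4))
```

A single run can be far from the true count; averaging many independent runs shows that the estimate is right on average.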

Tasks. Simulate such a counter for 10, 50, 100, 500, 1000 and 10000 events. Repeat the experiments many times! For each number of events, compute the mean, variance and standard deviation of the experimental results. What can you conclude?

[Figure: counting 10000 events, 10000 trials]

Approx. Counting: Arbitrary Base. For some applications the expected error of the previous method might be too large! How to improve the counter performance? If the counter has value $k$: increment it with probability $1/a^k$; do not increment it with probability $1 - 1/a^k$. $a$ is now the counter base.

Approx. Counting: Arbitrary Base. Take $1 < a < 2$. The counter value after the same number of events will be larger than with the binary base, giving better accuracy! The probabilities $1/a^k$ can be stored in a table, so there is no need to recompute them.

Approx. Counting: Arbitrary Base. Possible values? $a = 2^{1/2}, 2^{1/4}, \dots$ How to estimate the number of events from the counter value $k$? Compute $(a^k - 1)/(a - 1)$, which reduces to $2^k - 1$ for $a = 2$. What is the largest value that we can count with a 4-bit, 8-bit or 16-bit counter?
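A short derivation of an unbiased estimate under the update rule above (increment with probability $1/a^k$ when the counter holds $k$, starting from 0). Here $C_n$ denotes the counter value after $n$ events; the notation is introduced only for this sketch.

```latex
\begin{align*}
E\big[a^{C_{n+1}} \mid C_n = c\big]
  &= \Big(1 - \frac{1}{a^{c}}\Big)\, a^{c} + \frac{1}{a^{c}}\, a^{c+1}
   = a^{c} + (a - 1), \\
E\big[a^{C_n}\big] &= 1 + (a - 1)\, n \quad \text{(since } C_0 = 0\text{)}, \\
\hat{n} &= \frac{a^{k} - 1}{a - 1} \quad \text{for an observed counter value } k.
\end{align*}
```

For $a = 2$ this is the binary-base estimate $2^k - 1$.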

Tasks. Simulate such a counter, with $a = 2^{1/2}$, for 10, 50, 100, 500, 1000 and 10000 events. Repeat the experiments many times! For each number of events, compute the mean, variance and standard deviation of the experimental results. What can you conclude?
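A possible starting point for this task is sketched below, with the increment probabilities precomputed in a table. The class name `BaseACounter` and the saturating register are assumptions. The estimate used is $(a^k - 1)/(a - 1)$, which is unbiased for this update rule and equals $2^k - 1$ when $a = 2$.

```python
import random

class BaseACounter:
    """Approximate counter with base a: increment with probability 1 / a**k."""

    def __init__(self, base=2 ** 0.5, bits=8, rng=None):
        self.base = base
        self.max_value = (1 << bits) - 1
        # Lookup table: probs[k] is the increment probability at counter value k.
        self.probs = [base ** -k for k in range(self.max_value + 1)]
        self.value = 0
        self.rng = rng or random.Random()

    def increment(self):
        # Assumption: a full register simply stops counting.
        if self.value < self.max_value and self.rng.random() < self.probs[self.value]:
            self.value += 1

    def estimate(self):
        return (self.base ** self.value - 1) / (self.base - 1)

rng = random.Random(2017)
for n in (10, 50, 100, 500, 1000):
    estimates = []
    for _ in range(200):
        c = BaseACounter(rng=rng)
        for _ in range(n):
            c.increment()
        estimates.append(c.estimate())
    print(f"n={n}: mean estimate over 200 trials = {sum(estimates) / 200:.1f}")
```

Comparing the spread of these estimates with the binary-base counter shows the accuracy gained by a smaller base, at the price of a larger counter value.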

[Figure: counting 10000 events, 10000 trials]

[Figure: one recent paper from 2016]

References. R. Morris, Counting Large Numbers of Events in Small Registers, Commun. ACM, Vol. 21, N. 10, October 1978. P. Flajolet, Approximate Counting: A Detailed Analysis, BIT, Vol. 25, 1985. M. Csurös, Approximate Counting with a Floating-Point Counter, in COCOON, LNCS Vol. 6196, pp. 358-367, Springer, 2010.

Set Membership. Given an arbitrary-sized string $s$ and a set $S$, does $s$ belong to $S$? Easy to answer for small sets! Complexity? BUT difficult to answer for huge sets, e.g., in Big-Data applications.

Hash Tables. Data structure for storing key-value pairs. No ordering, BUT fast access! No duplicate keys!

Approximate Membership Queries. Given a set $S = \{x_1, x_2, \dots, x_n\}$, answer queries of the form: is $y$ in $S$? The data structure should be FAST and SMALL: faster than searching through $S$, and smaller than an explicit representation.

Approximate Membership Queries. How to get speed and size improvements? Allow some probability of error! False positives: $y \notin S$ but reporting $y \in S$. False negatives: $y \in S$ but reporting $y \notin S$.

Bloom Filters. B. H. Bloom, 1970. Use hash functions to determine approximate set membership. Allow for fast set membership tests on very large data sets. Applications: spell-checking / text analysis; network monitoring.

Application: Spell-Checkers. Determine if candidate words are members of the set of words in a dictionary. The Bloom filter should be large enough to allow the inclusion of additional words by the user.

Application: Email Spam. We know 1 billion good email addresses. If an email comes from one of these, it is NOT spam. How to check for spam in a FAST way?

Application: Web-Caching. Bloom filters are used in WWW caching proxy servers. Proxy servers intercept requests from clients and either fulfill the requests themselves or re-issue them to servers.

Bloom Filters. Is $y$ in $S$? A Bloom filter provides an answer in constant time (the time to hash) and uses a small amount of memory space, BUT with some small probability of being wrong!

1st: Register the elements of set $S$. [Figure, from Mitzenmacher]

2nd: Process the queries. [Figure, from Mitzenmacher]

Basic operations. Initialization: clear all cells. Insertion: compute the values of the $k$ hash functions and set the corresponding cells, if needed. Insertion takes time proportional to $k$, independent of the number of stored elements.

Basic operations. Membership test: compute the values of the $k$ hash functions and check whether the corresponding cells have been set. If any such cell is not set, the searched element is not a member of the set. Worst case? Checking all $k$ cells, which happens for set elements and for false positives.
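The three operations can be sketched as below. This is a minimal illustration, not a tuned implementation: the class name `BloomFilter`, the use of salted SHA-256 to obtain $k$ hash values, and the one-byte-per-cell storage are all assumptions made for readability.

```python
import hashlib

class BloomFilter:
    def __init__(self, num_cells, num_hashes):
        self.m = num_cells
        self.k = num_hashes
        # Initialization: all cells cleared (one byte per cell, for clarity;
        # a space-efficient version would pack the cells into bits).
        self.cells = bytearray(num_cells)

    def _positions(self, item):
        # Assumption: k hash functions obtained by prefixing the item with
        # the function index before hashing.
        data = item.encode("utf-8")
        for i in range(self.k):
            digest = hashlib.sha256(i.to_bytes(4, "big") + data).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def add(self, item):
        # Insertion: set the k corresponding cells.
        for pos in self._positions(item):
            self.cells[pos] = 1

    def __contains__(self, item):
        # Membership test: stops at the first cell that is not set.
        return all(self.cells[pos] for pos in self._positions(item))

words = BloomFilter(num_cells=1000, num_hashes=3)
for w in ("counter", "filter", "hash"):
    words.add(w)
print("filter" in words)   # True: registered elements are always found
print("stream" in words)   # usually False, but a false positive is possible
```

Because cells are only ever set, never cleared, a registered element can never be reported as absent.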

Bloom Filter: Simple Demos. Bloom Filters by Example: http://billmill.org/bloomfilter-tutorial/ Bloom Filters: https://www.jasondavies.com/bloomfilter/

Bloom Filters: Behaviour. Deterministic hash functions! No attempt to resolve hashing collisions! Can we get false negatives? What is the probability of false positives? How to minimize it?

Bloom Filter: Parameters. The behaviour of a Bloom filter is determined by four parameters: $n$, the number of set elements registered in $B$; $m = c\,n$, the number of cells (i.e., bits) in $B$; $k$, the number of independent, random hash functions; $f$, the fraction of cells set to 1.

Bloom Filter: Parameters. How to choose $m$, the size of the filter? How to choose $k$, the number of hash functions? How do we choose the best $k$ value?

Probabilities: After 1 insertion. Initially all bits are set to zero. Inserting one element: what is the probability of $b_i = 1$ after using the first hash function? With equal probability for any cell, $P[b_i = 1] = \frac{1}{m}$ and $P[b_i = 0] = 1 - \frac{1}{m}$.

Probabilities: After 1 insertion. After computing the $k$ hash functions and setting $k$ cells, $P[b_i = 0] = \left(1 - \frac{1}{m}\right)^{k}$.

Probabilities: After n insertions. After inserting all $n$ set elements, computing $k$ hash values each time and assuming independence, $P[b_i = 0] = \left(1 - \frac{1}{m}\right)^{kn}$.

Probabilities: After n insertions. $P[b_i = 0] = \left(1 - \frac{1}{m}\right)^{kn} = a^{k}$ and $P[b_i = 1] = 1 - a^{k}$, with $a = \left(1 - \frac{1}{m}\right)^{n}$.

Probability of a false positive. Testing the membership of an item not in $S$ entails a positive answer when the corresponding $k$ bits are all set to 1. The probability of that happening is $p = (1 - a^{k})^{k} \approx (1 - e^{-kn/m})^{k}$.

Example. $n$ = 1 billion items, $m$ = 8 billion bits. $k = 1$: $p \approx 1 - e^{-1/8} = 0.1175$. $k = 2$: $p \approx (1 - e^{-2/8})^2 = 0.0489$. What happens as we keep increasing $k$?
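The last question can be explored numerically with the approximation $p \approx (1 - e^{-kn/m})^k$. The function name `false_positive_rate` is a hypothetical helper for this sketch.

```python
import math

def false_positive_rate(n, m, k):
    """Approximate false positive probability for n items, m cells, k hashes."""
    return (1.0 - math.exp(-k * n / m)) ** k

n = 1_000_000_000   # items
m = 8_000_000_000   # bits
for k in range(1, 13):
    print(f"k={k:2d}: p = {false_positive_rate(n, m, k):.4f}")
```

The probability first decreases as $k$ grows and then increases again, because each extra hash function also fills the filter faster.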

[Figure: optimal value of $k$]

Optimal value of k. To determine the value of $k$ that minimizes $p$ we minimize $\ln p$, which is more tractable, and get $k_{opt} = \frac{m}{n} \ln 2 \approx 0.693\,\frac{m}{n}$. Use the closest integer to $k_{opt}$. For the previous example: $k_{opt} \approx 5.54$, so use $k = 6$.
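One way to carry out the minimization, starting from the approximation $p \approx (1 - e^{-kn/m})^k$ and using the substitution $x = e^{-kn/m}$ (introduced only for this sketch):

```latex
\begin{align*}
\ln p &= k \ln\big(1 - e^{-kn/m}\big), \qquad
  x = e^{-kn/m} \;\Rightarrow\; k = -\frac{m}{n} \ln x, \\
\ln p &= -\frac{m}{n}\, \ln x \, \ln(1 - x), \\
\frac{d}{dx}\big[\ln x \, \ln(1 - x)\big]
  &= \frac{\ln(1 - x)}{x} - \frac{\ln x}{1 - x} = 0
  \;\Rightarrow\; x = \tfrac{1}{2}, \\
k_{opt} &= \frac{m}{n} \ln 2, \qquad
  p_{min} \approx \Big(\frac{1}{2}\Big)^{k_{opt}} \approx 0.6185^{\,m/n}.
\end{align*}
```

At the optimum about half of the cells are set to 1. For $m/n = 8$ this gives $p_{min} \approx 0.021$.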

Which Hash Functions? No need to use cryptographic hash functions! You can simulate $k$ hash functions by simply combining two hash functions (Kirsch and Mitzenmacher, 2006). Compute one base hash function that returns an unsigned 64-bit number, then take the upper half and the lower half of that value and use them as two 32-bit hash values.
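A sketch of this idea: the $i$-th position is taken as $(h_1 + i \cdot h_2) \bmod m$, where $h_1$ and $h_2$ are the two halves of a 64-bit hash. Using `hashlib.blake2b` with an 8-byte digest as the base hash is an assumption made for convenience (any good 64-bit hash would do), and the function names are hypothetical.

```python
import hashlib

def two_hashes(item):
    """Split one 64-bit hash value into an upper and a lower 32-bit half."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8).digest()
    h = int.from_bytes(digest, "big")
    return h >> 32, h & 0xFFFFFFFF

def k_positions(item, k, m):
    """Simulate k hash functions: position i is (h1 + i * h2) mod m."""
    h1, h2 = two_hashes(item)
    return [(h1 + i * h2) % m for i in range(k)]

print(k_positions("bloom", k=6, m=8000))
```

Only one base hash is computed per item, whatever the value of $k$. If $h_2$ happens to be a multiple of $m$, all positions coincide, which a production implementation would want to guard against.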

Bloom Filters: Wrap-up. No false negatives and limited memory usage. Great for pre-processing before more expensive checks. Suitable for hardware implementation: hash computations can be parallelized. The error rate can be decreased by allocating more memory space and adjusting the number of hash functions accordingly.

Bloom Filters: Wrap-up. Useful for applications where an imperfect set membership test can be helpfully applied to a large data set of unknown composition. The advantage over hash tables is the speed and the small memory footprint, at the price of a controlled error rate.

Bloom Filters: Pending Issues. Cannot represent multi-sets, i.e., sets with repeated elements. Cannot query the multiplicity of an item. Deleting an item is not possible!

Counting Bloom Filters. Multi-set representation. Now, each filter cell is a $w$-bit counter. $w = 4$ seems to be enough for most applications.

Counting Bloom Filters. To insert an element, increase the value of each corresponding cell. The membership test checks whether each of the required cells is non-zero.

Counting Bloom Filters. To delete an element, decrease the value of each corresponding cell. Deletions can introduce false negative errors! How?

Counting Bloom Filters. To retrieve the count of an element, compute its set of counters and return the minimum value as a frequency estimate.
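The four operations on a counting Bloom filter can be sketched as follows. The class name, the salted SHA-256 hashing, and the plain Python list of counters are illustrative assumptions; counters saturate at $2^w - 1$.

```python
import hashlib

class CountingBloomFilter:
    def __init__(self, num_cells, num_hashes, counter_bits=4):
        self.m = num_cells
        self.k = num_hashes
        self.max_count = (1 << counter_bits) - 1  # largest value of a w-bit cell
        self.cells = [0] * num_cells

    def _positions(self, item):
        # Assumption: k hash functions simulated by salting SHA-256 with i.
        data = item.encode("utf-8")
        return [
            int.from_bytes(
                hashlib.sha256(i.to_bytes(4, "big") + data).digest()[:8], "big"
            ) % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            if self.cells[pos] < self.max_count:  # saturate instead of overflowing
                self.cells[pos] += 1

    def remove(self, item):
        for pos in self._positions(item):
            if self.cells[pos] > 0:
                self.cells[pos] -= 1

    def count(self, item):
        # The minimum over the item's cells is the frequency estimate.
        return min(self.cells[pos] for pos in self._positions(item))

    def __contains__(self, item):
        return self.count(item) > 0

cbf = CountingBloomFilter(num_cells=1000, num_hashes=3)
for word in ("hash", "hash", "filter"):
    cbf.add(word)
print(cbf.count("hash"))     # at least 2 (collisions can only inflate it)
cbf.remove("filter")
print("hash" in cbf)         # True
```

Calling `remove` on an item that was never inserted (for instance, one that was only a false positive) decrements counters shared with real members, which is one way false negatives can appear.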

Counting Bloom Filters. [Figure, from Mitzenmacher]

Counting Bloom Filters. [Figure, from Mitzenmacher]

Counting Bloom Filters: Issues. Counter overflow: no more increments after reaching $2^w - 1$, BUT now we have undercounts! Choice of counter width $w$: a large $w$ diminishes space savings and introduces unused space (many zeros); a small $w$ quickly leads to maximum values. It is a trade-off.

Counting Bloom Filters in Practice. If insertions/deletions are rare compared to look-ups: keep a CBF in off-chip memory, keep a BF in on-chip memory, and update the BF when the CBF changes. This keeps the space savings of a Bloom filter but can deal with deletions. A popular design for network devices.

References. J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, 2014, Chapter 4. B. H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, Commun. ACM, July 1970. J. Blustein and A. El-Maazawi, Bloom Filters: A Tutorial, Analysis, and Survey, TR CS-2002-10, Dalhousie University, Halifax, NS, Canada, December 2002. A. Broder and M. Mitzenmacher, Network Applications of Bloom Filters: A Survey, Internet Mathematics, Vol. 1, N. 4, 2004.

Acknowledgments. An earlier version of some of these slides was developed by Professor Carlos Bastos. Part of the slides were adapted from original slides of: J. Leskovec, A. Rajaraman and J. Ullman, Mining of Massive Datasets, www.mmds.org; M. Mitzenmacher, Bloom Filters and Such, 2014 Summer School on Hashing, Copenhagen, DK.