Introduction to Randomized Algorithms III
|
|
- Elvin Knight
- 5 years ago
- Views:
Transcription
1 Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November
2 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability 1 / 2 k Counting with decreasing probability Approximate set membership Bloom Filters Counting Bloom Filters U. Aveiro, November
3 Motivation Is it possible to use a small counter to keep approximate counts of large numbers? Use a large number of such counters to keep track of the number of occurrences of many different events E.g., 8-bit counters Morris, Approximate Count Algorithm, 1978 U. Aveiro, November
4 Motivation But, nowadays memory is no longer scarce Is such an approach still interesting? Yes!! Massive data volumes!! Need quick and memory-efficient processing U. Aveiro, November
5 Application areas Online social networks Large-scale scientific experiments Search engines Online content delivery Product and consumer tracking. Data too large to fit in memory must be analyzed!! U. Aveiro, November
6 Big-Data Scale up vs Downsize Scale up the computation Replicate cheap hardware / devices Build massive DBMSs and warehouses BUT, expensive equipment / energy!! Downsize the data Compact representations of large data sets Approximate answers Probabilistic methods U. Aveiro, November
7 Probabilistic Counters Goal Avoid using large counters when dealing with large data volumes!! A counter with n bits counts up to 2 n events Can we use less bits? What is the cost? U. Aveiro, November
8 1 st Method For each event, increment the counter with probability 1 / 2 Intuition: just incrementing for half of the events!! We can now count up to 2 n + 1 events Using just n bits!! Is that what happens? Draw the state diagram / triangular diagram U. Aveiro, November
9 1 st Method Tasks Simulate such a counter for 10, 100, 1000 and events Repeat the experiments several times! What can you conclude? How to evaluate the accuracy? Relative error or accuracy ratio When knowing the exact value U. Aveiro, November
10 Counting 100 events trials U. Aveiro, November
11 1 st Method Expected value (mean) Counter is a random variable Resulting from a succession of random events What is the expected value after k events? X i represents the i th increment X i = 1 : counter is incremented X i = 0 : counter is not incremented P[ X i = 0 ] = P[ X i = 1 ] = 1 / 2 U. Aveiro, November
12 1 st Method Expected value (mean) E[ X i ] = 0 x P[ X i = 0 ] + 1 x P[ X i = 1 ] = 1 / 2 Counter value after k events is S = X i E[ S ] = E[ X i ] = E[ X i ] = k / 2 Number of events can be estimated by 2 x S U. Aveiro, November
13 1 st Method Variance σ 2 ( X i ) = E[ X i2 ] { E[ X i ] } 2 = E[ X i2 ] 1 / 4 E[ X i 2 ] = 0 2 x P[X i = 0] x P[X i = 1] = 1 / 2 σ 2 ( X i ) = 1 / 4 σ 2 ( S ) = σ 2 ( X i ) = σ 2 ( X i ) = k / 4 Standard deviation: σ ( S ) = k / 2 U. Aveiro, November
14 1 st Method Tasks Simulate such a counter for 10, 100, 1000 and events Repeat the experiments many times!! For each counter, compute the mean, variance and standard deviation of the experimental results Compare with the theoretical results! U. Aveiro, November
15 Counting 100 events trials U. Aveiro, November
16 1 st Method Probability distribution After n events, what is the probability of the counter value being k? p ( n, k ) =? Example for n = 4 More probable / Less probable counter values? p ( 4, k ) =? Binary table / Binary tree / Pascal-like triangle U. Aveiro, November
17 1 st Method Probability distribution U. Aveiro, November
18 Probability Distribution p = 1 / 2 U. Aveiro, November
19 Generalization Can we approx. count the same number of events using less bits? Or approx. count more events using the same number of bits? Yes! Increment the counter with lesser probability Increment with probability 1 / 2 k U. Aveiro, November
20 Generalization Tasks Incrementing with probability 1 / 2 k Obtain an expression for the mean, the variance and the stdr. deviation after n events k = 2, 3,, 6, Analyze the corresponding probability distributions Pascal-like triangle U. Aveiro, November
21 Generalization Mean and Variance Probability of incrementing the counter: p q = ( 1 p ) It is not difficult to check that, after n events: E[ S ] = n p σ 2 ( S ) = n p q U. Aveiro, November
22 Generalization Tasks Set the counting probability to 1 / 32 Simulate such a counter for 10, 100, 1000 and events Compute the mean, variance and standard deviation of the experimental results Compare with the theoretical results! U. Aveiro, November
23 Counting 100 events trials U. Aveiro, November
24 Counting events trials U. Aveiro, November
25 Probability Distribution p = 1 / 32 U. Aveiro, November
26 Probability Distribution p = 1 / 32 U. Aveiro, November
27 Fixed Probability Counters Recap For each event, increment the counter with probability 1 / 2 k, for k >= 1 On average, just incrementing for 1 / 2 k of the events!! Number of events estimated by 2 k x Counter We can now count up to 2 n + k events Using just n bits!! U. Aveiro, November
28 Issues What happens when counting a small number of events with probability 1 / 32? For much larger numbers of events, can we be more economical? U. Aveiro, November
29 Approximate Counting Binary Base Morris, 1978 For an arbitrary counting base As the counter value increases, it will be incremented with lesser probability If counter has value k Increment it with probability 1 / 2 k Do not increment it with probability ( 1 1 / 2 k ) Draw the state diagram! U. Aveiro, November
30 Approximate Counting Binary Base On average, how many events, n, are needed to reach a counter value of k? What does k represent? Events Counter value Number of events X 1 1 X Let s do it on the board! U. Aveiro, November
31 Approximate Counting Binary Base Counter is a random variable What is the expected value after n events? X i represents the i th increment X i = 1 : counter is incremented X i = 0 : counter is not incremented P[X i = 0 ] = 1 1 / 2 i-1 P[X i = 1 ] = 1 / 2 i-1 U. Aveiro, November
32 Approximate Counting Binary Base E[ X i ] = 1 / 2 i-1 Counter value after n events is S = X i E[ S ] = E[ X i ] = E[ X i ] E[ S ] = / / / / 4 + U. Aveiro, November
33 Approximate Counting Binary Base BUT, we only store integer values!! Number of events E[S] Expected counter value / / x 1 / x 1 / x 1 / x 1 / x 1 / 8 4 How to estimate the number of events from the counter value? U. Aveiro, November
34 Approximate Counting Binary Base After n = 2 k 1 events the expected counter value is k k = log 2 ( n + 1 ) = floor( log 2 n ) + 1 Generalize! After n events the expected counter value is floor( log 2 ( n + 1 ) ) Logarithmic counter!! For larger values, it counts slower U. Aveiro, November
35 Approximate Counting Binary Base After n probabilistic updates, the counter contains an approximation of log n That value is stored in log log n bits!! U. Aveiro, November
36 Approximate Counting Binary Base How to estimate the number of events from the counter value k? Compute 2 k 1 How to evaluate the counter s accuracy? Compare with floor( log 2 ( n + 1 ) ) What is the largest value that we can count with a 4-bit or 8-bit or 16-bit counter? U. Aveiro, November
37 Tasks Simulate such a counter for 10, 50, 100, 500, 1000, events Repeat the experiments many times! For each counter, compute the mean, variance and standard deviation of the experimental results What can you conclude? U. Aveiro, November
38 Counting events trials U. Aveiro, November
39 Approx. Counting Arbitrary Base For some applications the expected error of the previous method might be too large! How to improve the counter performance? If counter has value k Increment it with probability 1 / a k Do not increment it with probability ( 1 1 / a k ) a is now the counter base U. Aveiro, November
40 Approx. Counting Arbitrary Base Take a < 2 The counter value after m increments will be larger than with the binary base Giving a better accuracy!! Probabilities can be stored in a table No need to be recomputing!! U. Aveiro, November
41 Approx. Counting Arbitrary Base Possible values? a = 2 1/2, 2 1/4, How to estimate the number of events from the counter value k? Compute ( a k a + 1 ) / ( a 1 ) What is the largest value that we can count with a 4-bit or 8-bit or 16-bit counter? U. Aveiro, November
42 Tasks Simulate such a counter, with a = 2 1/2, for 10, 50, 100, 500, 1000, events Repeat the experiments many times! For each counter, compute the mean, variance and standard deviation of the experimental results What can you conclude? U. Aveiro, November
43 Counting events trials U. Aveiro, November
44 One recent paper from 2016 U. Aveiro, November
45 References R. Morris, Counting Large Numbers of Events in Small Registers, Commun. ACM, Vol. 21, N. 10, October 1978 P. Flajolet, Approximate Counting: A Detailed Analysis, Bit, Vol. 25, 1985 M. Csurös. Approximate counting with a floatingpoint counter. In COCOON, LNCS vol. 6196, p , Springer, 2010 U. Aveiro, November
46 Set Membership Given an arbitrary sized string s and a set S Does s belong to S? Easy answer for small sets! Complexity? BUT difficult answer for huge sets! E.g., Big-Data applications U. Aveiro, November
47 Hash Tables Data structure for storing key-value pairs No ordering!! BUT, fast access!! No duplicate keys!! U. Aveiro, November
48 Approximate Membership Queries Given a set S = {x 1, x 2,, xn} Answer queries of the form: Is y in S? Data structure should be FAST and SMALL Faster than searching through S Smaller than explicit representation U. Aveiro, November
49 Approximate Membership Queries How to get speed and size improvements? Allow some probability of error!! False positives y S but reporting y S False negatives y S but reporting y S U. Aveiro, November
50 Bloom Filters B. H. Bloom, 1970 Use hash functions to determine approximate set membership Allow for fast set membership tests on very large data sets Applications Spell-Checking / Text Analysis Network monitoring U. Aveiro, November
51 Application Spell-Checkers Determine if candidate words are members of the set of words in a dictionary The Bloom filter should be large enough to allow the inclusion of additional words by the user U. Aveiro, November
52 Application Spam We know 1 billion good addresses If an comes from one of these, it is NOT spam How check for spam in a FAST way? U. Aveiro, November
53 Application Web-Caching Bloom filters are used in WWW caching proxy servers Proxy servers intercept requests from clients and either fulfill the requests themselves or re-issue them to servers U. Aveiro, November
54 Bloom Filters Is y in S? A Bloom filter Provides an answer in constant time Time to hash Uses a small amount of memory space BUT, with some small probability of being wrong! U. Aveiro, November
55 1 st Register the elements of set S [Mitzenmacher] U. Aveiro, November
56 2 nd Process the queries [Mitzenmacher] U. Aveiro, November
57 Basic operations Initialization Clear all cells Insertion Compute the values of k hash functions Set the corresponding cells, if needed It takes constant time, but proportional to k U. Aveiro, November
58 Basic operations Membership test Compute the values of k hash functions Check if the corresponding cells have been set If any such cell is not set, the searched element is not a member of the set Worst-case? Checking all k cells! Set elements and false positives U. Aveiro, November
59 Bloom Filter Simple Demos Bloom Filters by Example Bloom Filters U. Aveiro, November
60 Bloom Filters Behaviour Deterministic hash functions! No attempt to solve hashing collisions! Can we get false negatives? Probability of false positives? How to minimize? U. Aveiro, November
61 Bloom Filter Parameters The behaviour of a Bloom filter is determined by four parameters n set elements registered in B m = c n cells in B (i.e., bits) k independent, random hash functions f is the fraction of cells set to 1 U. Aveiro, November
62 Bloom Filter Parameters How to choose m, the size of the filter? How to choose k, the number of hash functions? How do we choose the best k value? U. Aveiro, November
63 Probabilities After 1 insertion Initially all bits are set to zero Inserting one element What is the probability of b i = 1, after using the first hash function? Equal probability for any cell P b i = 1 = 1 m P b i = 0 = 1 1 m U. Aveiro, November
64 Probabilities After 1 insertion After computing the k hash functions and setting k cells P b i = 0 = 1 1 m k U. Aveiro, November
65 Probabilities After n insertions After inserting all n set elements, by computing each time k hash values Assuming independence P b i = 0 = 1 1 m k n U. Aveiro, November
66 Probabilities After n insertions P b i = 0 = 1 1 m k n n P b i = 1 = 1 a k, a = 1 1 m U. Aveiro, November
67 Probability of a false positive Testing the membership of an item not in S entails a positive answer Corresponding k bits are set to 1 The probability of that happening is k k p = 1 a kn/m k p 1 e U. Aveiro, November
68 Example n = 1 billion items, m = 8 billion bits k = 1 : p 1 e 1/8 = k = 2 : p 1 e 2/8 2 = What happens as we keep increasing k? U. Aveiro, November
69 Optimal value of k U. Aveiro, November
70 Optimal value of k To determine the value of k that minimizes p we minimize log p, which is more tractable And get k opt m n ln m n Use the closest integer to k opt For the previous example : k opt 5,54 6 U. Aveiro, November
71 Which Hash Functions? No need to use cryptographic hash functions! You can simulate k hash functions by simply combining two hash functions Kirsch and Mitzenmacher (2006) Compute one base hash function on unsigned 64-bit numbers Take the upper half and the lower half of that value and return them as two 32 bit numbers U. Aveiro, November
72 Bloom Filters Wrap-up No false negatives and limited memory usage Great for pre-processing before more expensive checks Suitable for hardware implementation Hash computations can be parallelized Error rate can be decreased by increasing the number of hash functions and allocated memory space U. Aveiro, November
73 Bloom Filters Wrap-up Useful for applications where an imperfect set membership test can be helpfully applied to a large data set of unknown composition Advantage over hash tables is Bloom filter speed and error rate U. Aveiro, November
74 Bloom Filters Pending Issues Cannot represent multi-sets I.e., sets with repeated elements Cannot query the multiplicity of an item Deleting an item is not possible! U. Aveiro, November
75 Counting Bloom Filters Multi-set representation Now, each filter cell is a w-bit counter w = 4 seems to be enough for most applications U. Aveiro, November
76 Counting Bloom Filters To insert an element, increase the value of each corresponding cell Test membership checks if each of the required cells is non-zero U. Aveiro, November
77 Counting Bloom Filters To delete an element, decrease the value of each corresponding cell Deletions necessarily introduce false negative errors!! How? U. Aveiro, November
78 Counting Bloom Filters To retrieve the count of an element : Compute its set of counters And return the minimum value as a frequency estimate U. Aveiro, November
79 Counting Bloom Filters [Mitzenmacher] U. Aveiro, November
80 Counting Bloom Filters [Mitzenmacher] U. Aveiro, November
81 Counting Bloom Filters Issues Counter overflow No more increments after reaching 2 w 1 BUT, now we have undercounts!! Choice of counter width w A large w diminishes space savings and introduces unused space (many zeros) A small w quickly leads to maximum values Trade-off U. Aveiro, November
82 Counting Bloomm Filters in Practice If insertions/deletions are rare compared to look-ups Keep a CBF in off-chip memory Keep a BF in on-chip memory Update the BF when the CBF changes Keep space savings of a Bloom filter But can deal with deletions Popular design for network devices U. Aveiro, November
83 References J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, 2014 Chapter 4 B. H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, Commun. ACM, July 1970 J. Blustein and A. El-Maazaw, Bloom Filters A Tutorial, Analysis, and Survey, TR CS , Dalhousie University, Halifax, NS, Canada, December 2002 A. Broder and M. Mitzenmacher, Network Applications of Bloom Filters: A Survey, Internet Mathematics, Vol. 1, N. 4, 2004 U. Aveiro, November
84 Acknowledgments An earlier version of some of these slides was developed by Professor Carlos Bastos Part of the slides adapted from original slides of J. Leskovec, A Rajaraman and J. Ullman Mining of Massive Datasets M. Mitzenmacher, Bloom Filters and Such 2014 Summer School on Hashing, Copenhagen, DK U. Aveiro, November
Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing
Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of
More informationCS 591, Lecture 7 Data Analytics: Theory and Applications Boston University
CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 13th, 2017 Bloom Filter Approximate membership problem Highly space-efficient randomized data structure
More information4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationLecture 24: Bloom Filters. Wednesday, June 2, 2010
Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based
More informationCSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13
CSCB63 Winter 2019 Week 11 Bloom Filters Anna Bretscher March 30, 2019 1 / 13 Today Bloom Filters Definition Expected Complexity Applications 2 / 13 Bloom Filters (Specification) A bloom filter is a probabilistic
More informationBloom Filters, general theory and variants
Bloom Filters: general theory and variants G. Caravagna caravagn@cli.di.unipi.it Information Retrieval Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered.
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms
More informationAd Placement Strategies
Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January
More informationCS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 14: Exponential decay; convolution Ramin Zabih Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman,
More informationPart 1: Hashing and Its Many Applications
1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random
More informationCS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)
CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose
More informationHigh Dimensional Search Min- Hashing Locality Sensi6ve Hashing
High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of
More informationPAPER Adaptive Bloom Filter : A Space-Efficient Counting Algorithm for Unpredictable Network Traffic
IEICE TRANS.??, VOL.Exx??, NO.xx XXXX x PAPER Adaptive Bloom Filter : A Space-Efficient Counting Algorithm for Unpredictable Network Traffic Yoshihide MATSUMOTO a), Hiroaki HAZEYAMA b), and Youki KADOBAYASHI
More information1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:
CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at
More informationApplication: Bucket Sort
5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from
More informationA General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY
A General-Purpose Counting Filter: Making Every Bit Count Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY Approximate Membership Query (AMQ) insert(x) ismember(x)
More informationBloom Filter Redux. CS 270 Combinatorial Algorithms and Data Structures. UC Berkeley, Spring 2011
Bloom Filter Redux Matthias Vallentin Gene Pang CS 270 Combinatorial Algorithms and Data Structures UC Berkeley, Spring 2011 Inspiration Our background: network security, databases We deal with massive
More informationCount-Min Tree Sketch: Approximate counting for NLP
Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26
More informationCS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32
CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most
More informationAlgorithm Design Strategies V
Algorithm Design Strategies V Joaquim Madeira Version 0.0 October 2016 U. Aveiro, October 2016 1 Overview The 0-1 Knapsack Problem Revisited The Fractional Knapsack Problem Greedy Algorithms Example Coin
More informationThe Bloom Paradox: When not to Use a Bloom Filter
1 The Bloom Paradox: When not to Use a Bloom Filter Ori Rottenstreich and Isaac Keslassy Abstract In this paper, we uncover the Bloom paradox in Bloom filters: sometimes, the Bloom filter is harmful and
More informationHash tables. Hash tables
Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key
More informationDATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing
DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining
More informationCS 591, Lecture 9 Data Analytics: Theory and Applications Boston University
CS 591, Lecture 9 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 22nd, 2017 Announcement We will cover the Monday s 2/20 lecture (President s day) by appending half
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 12: Real-Time Data Analytics (2/2) March 31, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More information15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018
15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science
More informationHash tables. Hash tables
Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key
More informationAn Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems
An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems Arnab Bhattacharyya, Palash Dey, and David P. Woodruff Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in
More informationStreaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org.
Streaming - 2 Bloom Filters, Distinct Item counting, Computing moments credits:www.mmds.org http://www.mmds.org Outline More algorithms for streams: 2 Outline More algorithms for streams: (1) Filtering
More informationarxiv: v1 [cs.ds] 3 Feb 2018
A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based
More informationFoundations of Data Mining
Foundations of Data Mining http://www.cohenwang.com/edith/dataminingclass2017 Instructors: Haim Kaplan Amos Fiat Lecture 1 1 Course logistics Tuesdays 16:00-19:00, Sherman 002 Slides for (most or all)
More informationLecture 1 September 3, 2013
CS 229r: Algorithms for Big Data Fall 2013 Prof. Jelani Nelson Lecture 1 September 3, 2013 Scribes: Andrew Wang and Andrew Liu 1 Course Logistics The problem sets can be found on the course website: http://people.seas.harvard.edu/~minilek/cs229r/index.html
More informationmd5bloom: Forensic Filesystem Hashing Revisited
DIGITAL FORENSIC RESEARCH CONFERENCE md5bloom: Forensic Filesystem Hashing Revisited By Vassil Roussev, Timothy Bourg, Yixin Chen, Golden Richard Presented At The Digital Forensic Research Conference DFRWS
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 9: Real-Time Data Analytics (2/2) March 29, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More information:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m)
BALLS, BINS, AND RANDOM GRAPHS We use the Chernoff bound for the Poisson distribution (Theorem 5.4) to bound this probability, writing the bound as Pr(X 2: x) :s ex-ill-x In(x/m). For x = m + J2m In m,
More informationCS246 Final Exam. March 16, :30AM - 11:30AM
CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions
More informationCS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d
CS 5321: Advanced Algorithms Amortized Analysis of Data Structures Ali Ebnenasir Department of Computer Science Michigan Technological University Motivations Why amortized analysis and when? Suppose you
More informationProblem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)
Problem 1: Chernoff Bounds via Negative Dependence - from MU Ex 5.15) While deriving lower bounds on the load of the maximum loaded bin when n balls are thrown in n bins, we saw the use of negative dependence.
More informationAs mentioned, we will relax the conditions of our dictionary data structure. The relaxations we
CSE 203A: Advanced Algorithms Prof. Daniel Kane Lecture : Dictionary Data Structures and Load Balancing Lecture Date: 10/27 P Chitimireddi Recap This lecture continues the discussion of dictionary data
More informationCS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2
CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #15: Mining Streams 2 Today s Lecture More algorithms for streams: (1) Filtering a data stream: Bloom filters Select elements with property
More informationRAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response
RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova Google & USC Presented By: Pat Pannuto RAPPOR, What is is good for? (Absolutely something!)
More informationLecture 4 Thursday Sep 11, 2014
CS 224: Advanced Algorithms Fall 2014 Lecture 4 Thursday Sep 11, 2014 Prof. Jelani Nelson Scribe: Marco Gentili 1 Overview Today we re going to talk about: 1. linear probing (show with 5-wise independence)
More informationCS425: Algorithms for Web Scale Data
CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges
More informationBloom Filters. filters: A survey, Internet Mathematics, vol. 1 no. 4, pp , 2004.
Bloo Filters References A. Broder and M. Mitzenacher, Network applications of Bloo filters: A survey, Internet Matheatics, vol. 1 no. 4, pp. 485-509, 2004. Li Fan, Pei Cao, Jussara Aleida, Andrei Broder,
More informationDesign of discrete-event simulations
Design of discrete-event simulations Lecturer: Dmitri A. Moltchanov E-mail: moltchan@cs.tut.fi http://www.cs.tut.fi/kurssit/tlt-2707/ OUTLINE: Discrete event simulation; Event advance design; Unit-time
More informationUsing the Power of Two Choices to Improve Bloom Filters
Internet Mathematics Vol. 4, No. 1: 17-33 Using the Power of Two Choices to Improve Bloom Filters Steve Lumetta and Michael Mitzenmacher Abstract. We consider the combination of two ideas from the hashing
More informationCPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017
CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code
More informationSecure Indexes* Eu-Jin Goh Stanford University 15 March 2004
Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004 * Generalizes an early version of my paper How to search on encrypted data on eprint Cryptology Archive on 7 October 2003 Secure Indexes Data
More informationProbabilistic Counting with Randomized Storage
Probabilistic Counting with Randomized Storage Benjamin Van Durme University of Rochester Rochester, NY 14627, USA Ashwin Lall Georgia Institute of Technology Atlanta, GA 30332, USA Abstract Previous work
More informationTight Bounds for Sliding Bloom Filters
Tight Bounds for Sliding Bloom Filters Moni Naor Eylon Yogev November 7, 2013 Abstract A Bloom filter is a method for reducing the space (memory) required for representing a set by allowing a small error
More informationBloom Filters and Locality-Sensitive Hashing
Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,
More information12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)
12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters) Many streaming algorithms use random hashing functions to compress data. They basically randomly map some data items on top of each other.
More informationMining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationCS5112: Algorithms and Data Structures for Applications
CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining
More informationA REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,
More informationRetouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives
Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives Benoit Donnet Univ. Catholique de Louvain Bruno Baynat Univ. Pierre et Marie Curie
More informationProbabilistic Near-Duplicate. Detection Using Simhash
Probabilistic Near-Duplicate Detection Using Simhash Sadhan Sood, Dmitri Loguinov Presented by Matt Smith Internet Research Lab Department of Computer Science and Engineering Texas A&M University 27 October
More informationFrequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:
Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify
More information1 Approximate Quantiles and Summaries
CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity
More informationUniversity of Illinois at Urbana-Champaign. Midterm Examination
University of Illinois at Urbana-Champaign Midterm Examination CS410 Introduction to Text Information Systems Professor ChengXiang Zhai TA: Azadeh Shakery Time: 2:00 3:15pm, Mar. 14, 2007 Place: Room 1105,
More informationCS60021: Scalable Data Mining. Large Scale Machine Learning
J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance
More informationA Model for Learned Bloom Filters, and Optimizing by Sandwiching
A Model for Learned Bloom Filters, and Optimizing by Sandwiching Michael Mitzenmacher School of Engineering and Applied Sciences Harvard University michaelm@eecs.harvard.edu Abstract Recent work has suggested
More informationAdvanced topic: Space complexity
Advanced topic: Space complexity CSCI 3130 Formal Languages and Automata Theory Siu On CHAN Chinese University of Hong Kong Fall 2016 1/28 Review: time complexity We have looked at how long it takes to
More informationRainbow Tables ENEE 457/CMSC 498E
Rainbow Tables ENEE 457/CMSC 498E How are Passwords Stored? Option 1: Store all passwords in a table in the clear. Problem: If Server is compromised then all passwords are leaked. Option 2: Store only
More informationData Stream Methods. Graham Cormode S. Muthukrishnan
Data Stream Methods Graham Cormode graham@dimacs.rutgers.edu S. Muthukrishnan muthu@cs.rutgers.edu Plan of attack Frequent Items / Heavy Hitters Counting Distinct Elements Clustering items in Streams Motivating
More informationAMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM
AMORTIZED ANALYSIS binary counter multipop stack dynamic table Lecture slides by Kevin Wayne http://www.cs.princeton.edu/~wayne/kleinberg-tardos Last updated on 1/24/17 11:31 AM Amortized analysis Worst-case
More informationAlgorithms lecture notes 1. Hashing, and Universal Hash functions
Algorithms lecture notes 1 Hashing, and Universal Hash functions Algorithms lecture notes 2 Can we maintain a dictionary with O(1) per operation? Not in the deterministic sense. But in expectation, yes.
More informationCounting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109
1 Chris Piech CS 109 Counting Lecture Notes #1 Sept 24, 2018 Based on a handout by Mehran Sahami with examples by Peter Norvig Although you may have thought you had a pretty good grasp on the notion of
More informationCOMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from
COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard
More informationCS6931 Database Seminar. Lecture 6: Set Operations on Massive Data
CS6931 Database Seminar Lecture 6: Set Operations on Massive Data Set Resemblance and MinWise Hashing/Independent Permutation Basics Consider two sets S1, S2 U = {0, 1, 2,...,D 1} (e.g., D = 2 64 ) f1
More information27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling
10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel
More informationA Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window
A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY
More informationSlides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit
More informationShannon-Fano-Elias coding
Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The
More informationHashing. Martin Babka. January 12, 2011
Hashing Martin Babka January 12, 2011 Hashing Hashing, Universal hashing, Perfect hashing Input data is uniformly distributed. A dynamic set is stored. Universal hashing Randomised algorithm uniform choice
More informationExpectation of geometric distribution
Expectation of geometric distribution What is the probability that X is finite? Can now compute E(X): Σ k=1f X (k) = Σ k=1(1 p) k 1 p = pσ j=0(1 p) j = p 1 1 (1 p) = 1 E(X) = Σ k=1k (1 p) k 1 p = p [ Σ
More informationComputer Networks 55 (2011) Contents lists available at ScienceDirect. Computer Networks. journal homepage:
Computer Networks 55 (2) 84 89 Contents lists available at ScienceDirect Computer Networks journal homepage: www.elsevier.com/locate/comnet A Generalized Bloom Filter to Secure Distributed Network Applications
More informationLecture 01 August 31, 2017
Sketching Algorithms for Big Data Fall 2017 Prof. Jelani Nelson Lecture 01 August 31, 2017 Scribe: Vinh-Kha Le 1 Overview In this lecture, we overviewed the six main topics covered in the course, reviewed
More informationIncremental Learning and Concept Drift: Overview
Incremental Learning and Concept Drift: Overview Incremental learning The algorithm ID5R Taxonomy of incremental learning Concept Drift Teil 5: Incremental Learning and Concept Drift (V. 1.0) 1 c G. Grieser
More informationComputer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms
Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds
More informationEfficient Data Reduction and Summarization
Ping Li Efficient Data Reduction and Summarization December 8, 2011 FODAVA Review 1 Efficient Data Reduction and Summarization Ping Li Department of Statistical Science Faculty of omputing and Information
More informationCuckoo Hashing and Cuckoo Filters
Cuckoo Hashing and Cuckoo Filters Noah Fleming May 7, 208 Preliminaries A dictionary is an abstract data type that stores a collection of elements, located by their key. It supports operations: insert,
More informationLearning from Time-Changing Data with Adaptive Windowing
Learning from Time-Changing Data with Adaptive Windowing Albert Bifet Ricard Gavaldà Universitat Politècnica de Catalunya {abifet,gavalda}@lsi.upc.edu Abstract We present a new approach for dealing with
More information6 Filtering and Streaming
Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius
More informationDatabases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)
Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR) References Hashing Techniques: Elmasri, 7th Ed. Chapter 16, section 8. Cormen, 3rd Ed. Chapter 11. Inverted indexing: Elmasri,
More informationCuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions
1 / 29 Cuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions Martin Aumüller, Martin Dietzfelbinger Technische Universität Ilmenau 2 / 29 Cuckoo Hashing Maintain a dynamic dictionary
More informationB490 Mining the Big Data
B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations
More information1 Maintaining a Dictionary
15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition
More informationDiscrete-event simulations
Discrete-event simulations Lecturer: Dmitri A. Moltchanov E-mail: moltchan@cs.tut.fi http://www.cs.tut.fi/kurssit/elt-53606/ OUTLINE: Why do we need simulations? Step-by-step simulations; Classifications;
More information2 How many distinct elements are in a stream?
Dealing with Massive Data January 31, 2011 Lecture 2: Distinct Element Counting Lecturer: Sergei Vassilvitskii Scribe:Ido Rosen & Yoonji Shin 1 Introduction We begin by defining the stream formally. Definition
More informationAn Early Traffic Sampling Algorithm
An Early Traffic Sampling Algorithm Hou Ying ( ), Huang Hai, Chen Dan, Wang ShengNan, and Li Peng National Digital Switching System Engineering & Techological R&D center, ZhengZhou, 450002, China ndschy@139.com,
More informationINF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)
INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder
More informationCMSC 858F: Algorithmic Lower Bounds: Fun with Hardness Proofs Fall 2014 Introduction to Streaming Algorithms
CMSC 858F: Algorithmic Lower Bounds: Fun with Hardness Proofs Fall 2014 Introduction to Streaming Algorithms Instructor: Mohammad T. Hajiaghayi Scribe: Huijing Gong November 11, 2014 1 Overview In the
More informationInsert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park
1617 Preview Data Structure Review COSC COSC Data Structure Review Linked Lists Stacks Queues Linked Lists Singly Linked List Doubly Linked List Typical Functions s Hash Functions Collision Resolution
More informationLecture 2. Frequency problems
1 / 43 Lecture 2. Frequency problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 43 1 Frequency problems in data streams 2 Approximating inner product 3 Computing frequency moments
More informationMeelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05
Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population
More informationDATA MINING LECTURE 3. Frequent Itemsets Association Rules
DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.
More informationHow Philippe Flipped Coins to Count Data
1/18 How Philippe Flipped Coins to Count Data Jérémie Lumbroso LIP6 / INRIA Rocquencourt December 16th, 2011 0. DATA STREAMING ALGORITHMS Stream: a (very large) sequence S over (also very large) domain
More information