Introduction to Randomized Algorithms III

Size: px
Start display at page:

Download "Introduction to Randomized Algorithms III"

Transcription

1 Introduction to Randomized Algorithms III Joaquim Madeira Version 0.1 November 2017 U. Aveiro, November

2 Overview Probabilistic counters Counting with probability 1 / 2 Counting with probability 1 / 2 k Counting with decreasing probability Approximate set membership Bloom Filters Counting Bloom Filters U. Aveiro, November

3 Motivation Is it possible to use a small counter to keep approximate counts of large numbers? Use a large number of such counters to keep track of the number of occurrences of many different events E.g., 8-bit counters Morris, Approximate Count Algorithm, 1978 U. Aveiro, November

4 Motivation But, nowadays memory is no longer scarce Is such an approach still interesting? Yes!! Massive data volumes!! Need quick and memory-efficient processing U. Aveiro, November

5 Application areas Online social networks Large-scale scientific experiments Search engines Online content delivery Product and consumer tracking. Data too large to fit in memory must be analyzed!! U. Aveiro, November

6 Big-Data Scale up vs Downsize Scale up the computation Replicate cheap hardware / devices Build massive DBMSs and warehouses BUT, expensive equipment / energy!! Downsize the data Compact representations of large data sets Approximate answers Probabilistic methods U. Aveiro, November

7 Probabilistic Counters Goal Avoid using large counters when dealing with large data volumes!! A counter with n bits counts up to 2 n events Can we use less bits? What is the cost? U. Aveiro, November

8 1 st Method For each event, increment the counter with probability 1 / 2 Intuition: just incrementing for half of the events!! We can now count up to 2 n + 1 events Using just n bits!! Is that what happens? Draw the state diagram / triangular diagram U. Aveiro, November

9 1 st Method Tasks Simulate such a counter for 10, 100, 1000 and events Repeat the experiments several times! What can you conclude? How to evaluate the accuracy? Relative error or accuracy ratio When knowing the exact value U. Aveiro, November

10 Counting 100 events trials U. Aveiro, November

11 1 st Method Expected value (mean) Counter is a random variable Resulting from a succession of random events What is the expected value after k events? X i represents the i th increment X i = 1 : counter is incremented X i = 0 : counter is not incremented P[ X i = 0 ] = P[ X i = 1 ] = 1 / 2 U. Aveiro, November

12 1 st Method Expected value (mean) E[ X i ] = 0 x P[ X i = 0 ] + 1 x P[ X i = 1 ] = 1 / 2 Counter value after k events is S = X i E[ S ] = E[ X i ] = E[ X i ] = k / 2 Number of events can be estimated by 2 x S U. Aveiro, November

13 1 st Method Variance σ 2 ( X i ) = E[ X i2 ] { E[ X i ] } 2 = E[ X i2 ] 1 / 4 E[ X i 2 ] = 0 2 x P[X i = 0] x P[X i = 1] = 1 / 2 σ 2 ( X i ) = 1 / 4 σ 2 ( S ) = σ 2 ( X i ) = σ 2 ( X i ) = k / 4 Standard deviation: σ ( S ) = k / 2 U. Aveiro, November

14 1 st Method Tasks Simulate such a counter for 10, 100, 1000 and events Repeat the experiments many times!! For each counter, compute the mean, variance and standard deviation of the experimental results Compare with the theoretical results! U. Aveiro, November

15 Counting 100 events trials U. Aveiro, November

16 1 st Method Probability distribution After n events, what is the probability of the counter value being k? p ( n, k ) =? Example for n = 4 More probable / Less probable counter values? p ( 4, k ) =? Binary table / Binary tree / Pascal-like triangle U. Aveiro, November

17 1 st Method Probability distribution U. Aveiro, November

18 Probability Distribution p = 1 / 2 U. Aveiro, November

19 Generalization Can we approx. count the same number of events using less bits? Or approx. count more events using the same number of bits? Yes! Increment the counter with lesser probability Increment with probability 1 / 2 k U. Aveiro, November

20 Generalization Tasks Incrementing with probability 1 / 2 k Obtain an expression for the mean, the variance and the stdr. deviation after n events k = 2, 3,, 6, Analyze the corresponding probability distributions Pascal-like triangle U. Aveiro, November

21 Generalization Mean and Variance Probability of incrementing the counter: p q = ( 1 p ) It is not difficult to check that, after n events: E[ S ] = n p σ 2 ( S ) = n p q U. Aveiro, November

22 Generalization Tasks Set the counting probability to 1 / 32 Simulate such a counter for 10, 100, 1000 and events Compute the mean, variance and standard deviation of the experimental results Compare with the theoretical results! U. Aveiro, November

23 Counting 100 events trials U. Aveiro, November

24 Counting events trials U. Aveiro, November

25 Probability Distribution p = 1 / 32 U. Aveiro, November

26 Probability Distribution p = 1 / 32 U. Aveiro, November

27 Fixed Probability Counters Recap For each event, increment the counter with probability 1 / 2 k, for k >= 1 On average, just incrementing for 1 / 2 k of the events!! Number of events estimated by 2 k x Counter We can now count up to 2 n + k events Using just n bits!! U. Aveiro, November

28 Issues What happens when counting a small number of events with probability 1 / 32? For much larger numbers of events, can we be more economical? U. Aveiro, November

29 Approximate Counting Binary Base Morris, 1978 For an arbitrary counting base As the counter value increases, it will be incremented with lesser probability If counter has value k Increment it with probability 1 / 2 k Do not increment it with probability ( 1 1 / 2 k ) Draw the state diagram! U. Aveiro, November

30 Approximate Counting Binary Base On average, how many events, n, are needed to reach a counter value of k? What does k represent? Events Counter value Number of events X 1 1 X Let s do it on the board! U. Aveiro, November

31 Approximate Counting Binary Base Counter is a random variable What is the expected value after n events? X i represents the i th increment X i = 1 : counter is incremented X i = 0 : counter is not incremented P[X i = 0 ] = 1 1 / 2 i-1 P[X i = 1 ] = 1 / 2 i-1 U. Aveiro, November

32 Approximate Counting Binary Base E[ X i ] = 1 / 2 i-1 Counter value after n events is S = X i E[ S ] = E[ X i ] = E[ X i ] E[ S ] = / / / / 4 + U. Aveiro, November

33 Approximate Counting Binary Base BUT, we only store integer values!! Number of events E[S] Expected counter value / / x 1 / x 1 / x 1 / x 1 / x 1 / 8 4 How to estimate the number of events from the counter value? U. Aveiro, November

34 Approximate Counting Binary Base After n = 2 k 1 events the expected counter value is k k = log 2 ( n + 1 ) = floor( log 2 n ) + 1 Generalize! After n events the expected counter value is floor( log 2 ( n + 1 ) ) Logarithmic counter!! For larger values, it counts slower U. Aveiro, November

35 Approximate Counting Binary Base After n probabilistic updates, the counter contains an approximation of log n That value is stored in log log n bits!! U. Aveiro, November

36 Approximate Counting Binary Base How to estimate the number of events from the counter value k? Compute 2 k 1 How to evaluate the counter s accuracy? Compare with floor( log 2 ( n + 1 ) ) What is the largest value that we can count with a 4-bit or 8-bit or 16-bit counter? U. Aveiro, November

37 Tasks Simulate such a counter for 10, 50, 100, 500, 1000, events Repeat the experiments many times! For each counter, compute the mean, variance and standard deviation of the experimental results What can you conclude? U. Aveiro, November

38 Counting events trials U. Aveiro, November

39 Approx. Counting Arbitrary Base For some applications the expected error of the previous method might be too large! How to improve the counter performance? If counter has value k Increment it with probability 1 / a k Do not increment it with probability ( 1 1 / a k ) a is now the counter base U. Aveiro, November

40 Approx. Counting Arbitrary Base Take a < 2 The counter value after m increments will be larger than with the binary base Giving a better accuracy!! Probabilities can be stored in a table No need to be recomputing!! U. Aveiro, November

41 Approx. Counting Arbitrary Base Possible values? a = 2 1/2, 2 1/4, How to estimate the number of events from the counter value k? Compute ( a k a + 1 ) / ( a 1 ) What is the largest value that we can count with a 4-bit or 8-bit or 16-bit counter? U. Aveiro, November

42 Tasks Simulate such a counter, with a = 2 1/2, for 10, 50, 100, 500, 1000, events Repeat the experiments many times! For each counter, compute the mean, variance and standard deviation of the experimental results What can you conclude? U. Aveiro, November

43 Counting events trials U. Aveiro, November

44 One recent paper from 2016 U. Aveiro, November

45 References R. Morris, Counting Large Numbers of Events in Small Registers, Commun. ACM, Vol. 21, N. 10, October 1978 P. Flajolet, Approximate Counting: A Detailed Analysis, Bit, Vol. 25, 1985 M. Csurös. Approximate counting with a floatingpoint counter. In COCOON, LNCS vol. 6196, p , Springer, 2010 U. Aveiro, November

46 Set Membership Given an arbitrary sized string s and a set S Does s belong to S? Easy answer for small sets! Complexity? BUT difficult answer for huge sets! E.g., Big-Data applications U. Aveiro, November

47 Hash Tables Data structure for storing key-value pairs No ordering!! BUT, fast access!! No duplicate keys!! U. Aveiro, November

48 Approximate Membership Queries Given a set S = {x 1, x 2,, xn} Answer queries of the form: Is y in S? Data structure should be FAST and SMALL Faster than searching through S Smaller than explicit representation U. Aveiro, November

49 Approximate Membership Queries How to get speed and size improvements? Allow some probability of error!! False positives y S but reporting y S False negatives y S but reporting y S U. Aveiro, November

50 Bloom Filters B. H. Bloom, 1970 Use hash functions to determine approximate set membership Allow for fast set membership tests on very large data sets Applications Spell-Checking / Text Analysis Network monitoring U. Aveiro, November

51 Application Spell-Checkers Determine if candidate words are members of the set of words in a dictionary The Bloom filter should be large enough to allow the inclusion of additional words by the user U. Aveiro, November

52 Application Spam We know 1 billion good addresses If an comes from one of these, it is NOT spam How check for spam in a FAST way? U. Aveiro, November

53 Application Web-Caching Bloom filters are used in WWW caching proxy servers Proxy servers intercept requests from clients and either fulfill the requests themselves or re-issue them to servers U. Aveiro, November

54 Bloom Filters Is y in S? A Bloom filter Provides an answer in constant time Time to hash Uses a small amount of memory space BUT, with some small probability of being wrong! U. Aveiro, November

55 1 st Register the elements of set S [Mitzenmacher] U. Aveiro, November

56 2 nd Process the queries [Mitzenmacher] U. Aveiro, November

57 Basic operations Initialization Clear all cells Insertion Compute the values of k hash functions Set the corresponding cells, if needed It takes constant time, but proportional to k U. Aveiro, November

58 Basic operations Membership test Compute the values of k hash functions Check if the corresponding cells have been set If any such cell is not set, the searched element is not a member of the set Worst-case? Checking all k cells! Set elements and false positives U. Aveiro, November

59 Bloom Filter Simple Demos Bloom Filters by Example Bloom Filters U. Aveiro, November

60 Bloom Filters Behaviour Deterministic hash functions! No attempt to solve hashing collisions! Can we get false negatives? Probability of false positives? How to minimize? U. Aveiro, November

61 Bloom Filter Parameters The behaviour of a Bloom filter is determined by four parameters n set elements registered in B m = c n cells in B (i.e., bits) k independent, random hash functions f is the fraction of cells set to 1 U. Aveiro, November

62 Bloom Filter Parameters How to choose m, the size of the filter? How to choose k, the number of hash functions? How do we choose the best k value? U. Aveiro, November

63 Probabilities After 1 insertion Initially all bits are set to zero Inserting one element What is the probability of b i = 1, after using the first hash function? Equal probability for any cell P b i = 1 = 1 m P b i = 0 = 1 1 m U. Aveiro, November

64 Probabilities After 1 insertion After computing the k hash functions and setting k cells P b i = 0 = 1 1 m k U. Aveiro, November

65 Probabilities After n insertions After inserting all n set elements, by computing each time k hash values Assuming independence P b i = 0 = 1 1 m k n U. Aveiro, November

66 Probabilities After n insertions P b i = 0 = 1 1 m k n n P b i = 1 = 1 a k, a = 1 1 m U. Aveiro, November

67 Probability of a false positive Testing the membership of an item not in S entails a positive answer Corresponding k bits are set to 1 The probability of that happening is k k p = 1 a kn/m k p 1 e U. Aveiro, November

68 Example n = 1 billion items, m = 8 billion bits k = 1 : p 1 e 1/8 = k = 2 : p 1 e 2/8 2 = What happens as we keep increasing k? U. Aveiro, November

69 Optimal value of k U. Aveiro, November

70 Optimal value of k To determine the value of k that minimizes p we minimize log p, which is more tractable And get k opt m n ln m n Use the closest integer to k opt For the previous example : k opt 5,54 6 U. Aveiro, November

71 Which Hash Functions? No need to use cryptographic hash functions! You can simulate k hash functions by simply combining two hash functions Kirsch and Mitzenmacher (2006) Compute one base hash function on unsigned 64-bit numbers Take the upper half and the lower half of that value and return them as two 32 bit numbers U. Aveiro, November

72 Bloom Filters Wrap-up No false negatives and limited memory usage Great for pre-processing before more expensive checks Suitable for hardware implementation Hash computations can be parallelized Error rate can be decreased by increasing the number of hash functions and allocated memory space U. Aveiro, November

73 Bloom Filters Wrap-up Useful for applications where an imperfect set membership test can be helpfully applied to a large data set of unknown composition Advantage over hash tables is Bloom filter speed and error rate U. Aveiro, November

74 Bloom Filters Pending Issues Cannot represent multi-sets I.e., sets with repeated elements Cannot query the multiplicity of an item Deleting an item is not possible! U. Aveiro, November

75 Counting Bloom Filters Multi-set representation Now, each filter cell is a w-bit counter w = 4 seems to be enough for most applications U. Aveiro, November

76 Counting Bloom Filters To insert an element, increase the value of each corresponding cell Test membership checks if each of the required cells is non-zero U. Aveiro, November

77 Counting Bloom Filters To delete an element, decrease the value of each corresponding cell Deletions necessarily introduce false negative errors!! How? U. Aveiro, November

78 Counting Bloom Filters To retrieve the count of an element : Compute its set of counters And return the minimum value as a frequency estimate U. Aveiro, November

79 Counting Bloom Filters [Mitzenmacher] U. Aveiro, November

80 Counting Bloom Filters [Mitzenmacher] U. Aveiro, November

81 Counting Bloom Filters Issues Counter overflow No more increments after reaching 2 w 1 BUT, now we have undercounts!! Choice of counter width w A large w diminishes space savings and introduces unused space (many zeros) A small w quickly leads to maximum values Trade-off U. Aveiro, November

82 Counting Bloomm Filters in Practice If insertions/deletions are rare compared to look-ups Keep a CBF in off-chip memory Keep a BF in on-chip memory Update the BF when the CBF changes Keep space savings of a Bloom filter But can deal with deletions Popular design for network devices U. Aveiro, November

83 References J. Leskovec, A. Rajaraman and J. D. Ullman, Mining of Massive Datasets, 2014 Chapter 4 B. H. Bloom, Space/Time Trade-offs in Hash Coding with Allowable Errors, Commun. ACM, July 1970 J. Blustein and A. El-Maazaw, Bloom Filters A Tutorial, Analysis, and Survey, TR CS , Dalhousie University, Halifax, NS, Canada, December 2002 A. Broder and M. Mitzenmacher, Network Applications of Bloom Filters: A Survey, Internet Mathematics, Vol. 1, N. 4, 2004 U. Aveiro, November

84 Acknowledgments An earlier version of some of these slides was developed by Professor Carlos Bastos Part of the slides adapted from original slides of J. Leskovec, A Rajaraman and J. Ullman Mining of Massive Datasets M. Mitzenmacher, Bloom Filters and Such 2014 Summer School on Hashing, Copenhagen, DK U. Aveiro, November

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing

Lecture and notes by: Alessio Guerrieri and Wei Jin Bloom filters and Hashing Bloom filters and Hashing 1 Introduction The Bloom filter, conceived by Burton H. Bloom in 1970, is a space-efficient probabilistic data structure that is used to test whether an element is a member of

More information

CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University

CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University CS 591, Lecture 7 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 13th, 2017 Bloom Filter Approximate membership problem Highly space-efficient randomized data structure

More information

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Lecture 24: Bloom Filters. Wednesday, June 2, 2010

Lecture 24: Bloom Filters. Wednesday, June 2, 2010 Lecture 24: Bloom Filters Wednesday, June 2, 2010 1 Topics for the Final SQL Conceptual Design (BCNF) Transactions Indexes Query execution and optimization Cardinality Estimation Parallel Databases 2 Lecture

More information

Algorithms for Data Science

Algorithms for Data Science Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based

More information

CSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13

CSCB63 Winter Week 11 Bloom Filters. Anna Bretscher. March 30, / 13 CSCB63 Winter 2019 Week 11 Bloom Filters Anna Bretscher March 30, 2019 1 / 13 Today Bloom Filters Definition Expected Complexity Applications 2 / 13 Bloom Filters (Specification) A bloom filter is a probabilistic

More information

Bloom Filters, general theory and variants

Bloom Filters, general theory and variants Bloom Filters: general theory and variants G. Caravagna caravagn@cli.di.unipi.it Information Retrieval Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered.

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms

More information

Ad Placement Strategies

Ad Placement Strategies Case Study 1: Estimating Click Probabilities Tackling an Unknown Number of Features with Sketching Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox 2014 Emily Fox January

More information

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 14: Exponential decay; convolution Ramin Zabih Some content from: Piotr Indyk; Wikipedia/Google image search; J. Leskovec, A. Rajaraman,

More information

Part 1: Hashing and Its Many Applications

Part 1: Hashing and Its Many Applications 1 Part 1: Hashing and Its Many Applications Sid C-K Chau Chi-Kin.Chau@cl.cam.ac.u http://www.cl.cam.ac.u/~cc25/teaching Why Randomized Algorithms? 2 Randomized Algorithms are algorithms that mae random

More information

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing) CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose

More information

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing

High Dimensional Search Min- Hashing Locality Sensi6ve Hashing High Dimensional Search Min- Hashing Locality Sensi6ve Hashing Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata September 8 and 11, 2014 High Support Rules vs Correla6on of

More information

PAPER Adaptive Bloom Filter : A Space-Efficient Counting Algorithm for Unpredictable Network Traffic

PAPER Adaptive Bloom Filter : A Space-Efficient Counting Algorithm for Unpredictable Network Traffic IEICE TRANS.??, VOL.Exx??, NO.xx XXXX x PAPER Adaptive Bloom Filter : A Space-Efficient Counting Algorithm for Unpredictable Network Traffic Yoshihide MATSUMOTO a), Hiroaki HAZEYAMA b), and Youki KADOBAYASHI

More information

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is:

1 Probability Review. CS 124 Section #8 Hashing, Skip Lists 3/20/17. Expectation (weighted average): the expectation of a random quantity X is: CS 24 Section #8 Hashing, Skip Lists 3/20/7 Probability Review Expectation (weighted average): the expectation of a random quantity X is: x= x P (X = x) For each value x that X can take on, we look at

More information

Application: Bucket Sort

Application: Bucket Sort 5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from

More information

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY

A General-Purpose Counting Filter: Making Every Bit Count. Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY A General-Purpose Counting Filter: Making Every Bit Count Prashant Pandey, Michael A. Bender, Rob Johnson, Rob Patro Stony Brook University, NY Approximate Membership Query (AMQ) insert(x) ismember(x)

More information

Bloom Filter Redux. CS 270 Combinatorial Algorithms and Data Structures. UC Berkeley, Spring 2011

Bloom Filter Redux. CS 270 Combinatorial Algorithms and Data Structures. UC Berkeley, Spring 2011 Bloom Filter Redux Matthias Vallentin Gene Pang CS 270 Combinatorial Algorithms and Data Structures UC Berkeley, Spring 2011 Inspiration Our background: network security, databases We deal with massive

More information

Count-Min Tree Sketch: Approximate counting for NLP

Count-Min Tree Sketch: Approximate counting for NLP Count-Min Tree Sketch: Approximate counting for NLP Guillaume Pitel, Geoffroy Fouquier, Emmanuel Marchand and Abdul Mouhamadsultane exensa firstname.lastname@exensa.com arxiv:64.5492v [cs.ir] 9 Apr 26

More information

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32

CS 473: Algorithms. Ruta Mehta. Spring University of Illinois, Urbana-Champaign. Ruta (UIUC) CS473 1 Spring / 32 CS 473: Algorithms Ruta Mehta University of Illinois, Urbana-Champaign Spring 2018 Ruta (UIUC) CS473 1 Spring 2018 1 / 32 CS 473: Algorithms, Spring 2018 Universal Hashing Lecture 10 Feb 15, 2018 Most

More information

Algorithm Design Strategies V

Algorithm Design Strategies V Algorithm Design Strategies V Joaquim Madeira Version 0.0 October 2016 U. Aveiro, October 2016 1 Overview The 0-1 Knapsack Problem Revisited The Fractional Knapsack Problem Greedy Algorithms Example Coin

More information

The Bloom Paradox: When not to Use a Bloom Filter

The Bloom Paradox: When not to Use a Bloom Filter 1 The Bloom Paradox: When not to Use a Bloom Filter Ori Rottenstreich and Isaac Keslassy Abstract In this paper, we uncover the Bloom paradox in Bloom filters: sometimes, the Bloom filter is harmful and

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

CS 591, Lecture 9 Data Analytics: Theory and Applications Boston University

CS 591, Lecture 9 Data Analytics: Theory and Applications Boston University CS 591, Lecture 9 Data Analytics: Theory and Applications Boston University Babis Tsourakakis February 22nd, 2017 Announcement We will cover the Monday s 2/20 lecture (President s day) by appending half

More information

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016)

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2016) Week 12: Real-Time Data Analytics (2/2) March 31, 2016 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

Hash tables. Hash tables

Hash tables. Hash tables Dictionary Definition A dictionary is a data-structure that stores a set of elements where each element has a unique key, and supports the following operations: Search(S, k) Return the element whose key

More information

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems

An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems An Optimal Algorithm for l 1 -Heavy Hitters in Insertion Streams and Related Problems Arnab Bhattacharyya, Palash Dey, and David P. Woodruff Indian Institute of Science, Bangalore {arnabb,palash}@csa.iisc.ernet.in

More information

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org.

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org. Streaming - 2 Bloom Filters, Distinct Item counting, Computing moments credits:www.mmds.org http://www.mmds.org Outline More algorithms for streams: 2 Outline More algorithms for streams: (1) Filtering

More information

arxiv: v1 [cs.ds] 3 Feb 2018

arxiv: v1 [cs.ds] 3 Feb 2018 A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based

More information

Foundations of Data Mining

Foundations of Data Mining Foundations of Data Mining http://www.cohenwang.com/edith/dataminingclass2017 Instructors: Haim Kaplan Amos Fiat Lecture 1 1 Course logistics Tuesdays 16:00-19:00, Sherman 002 Slides for (most or all)

More information

Lecture 1 September 3, 2013

Lecture 1 September 3, 2013 CS 229r: Algorithms for Big Data Fall 2013 Prof. Jelani Nelson Lecture 1 September 3, 2013 Scribes: Andrew Wang and Andrew Liu 1 Course Logistics The problem sets can be found on the course website: http://people.seas.harvard.edu/~minilek/cs229r/index.html

More information

md5bloom: Forensic Filesystem Hashing Revisited

md5bloom: Forensic Filesystem Hashing Revisited DIGITAL FORENSIC RESEARCH CONFERENCE md5bloom: Forensic Filesystem Hashing Revisited By Vassil Roussev, Timothy Bourg, Yixin Chen, Golden Richard Presented At The Digital Forensic Research Conference DFRWS

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 9: Real-Time Data Analytics (2/2) March 29, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m)

:s ej2mttlm-(iii+j2mlnm )(J21nm/m-lnm/m) BALLS, BINS, AND RANDOM GRAPHS We use the Chernoff bound for the Poisson distribution (Theorem 5.4) to bound this probability, writing the bound as Pr(X 2: x) :s ex-ill-x In(x/m). For x = m + J2m In m,

More information

CS246 Final Exam. March 16, :30AM - 11:30AM

CS246 Final Exam. March 16, :30AM - 11:30AM CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions

More information

CS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d

CS 5321: Advanced Algorithms Amortized Analysis of Data Structures. Motivations. Motivation cont d CS 5321: Advanced Algorithms Amortized Analysis of Data Structures Ali Ebnenasir Department of Computer Science Michigan Technological University Motivations Why amortized analysis and when? Suppose you

More information

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15) Problem 1: Chernoff Bounds via Negative Dependence - from MU Ex 5.15) While deriving lower bounds on the load of the maximum loaded bin when n balls are thrown in n bins, we saw the use of negative dependence.

More information

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we CSE 203A: Advanced Algorithms Prof. Daniel Kane Lecture : Dictionary Data Structures and Load Balancing Lecture Date: 10/27 P Chitimireddi Recap This lecture continues the discussion of dictionary data

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #15: Mining Streams 2 CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #15: Mining Streams 2 Today s Lecture More algorithms for streams: (1) Filtering a data stream: Bloom filters Select elements with property

More information

RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response

RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response RAPPOR: Randomized Aggregatable Privacy- Preserving Ordinal Response Úlfar Erlingsson, Vasyl Pihur, Aleksandra Korolova Google & USC Presented By: Pat Pannuto RAPPOR, What is is good for? (Absolutely something!)

More information

Lecture 4 Thursday Sep 11, 2014

Lecture 4 Thursday Sep 11, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture 4 Thursday Sep 11, 2014 Prof. Jelani Nelson Scribe: Marco Gentili 1 Overview Today we re going to talk about: 1. linear probing (show with 5-wise independence)

More information

CS425: Algorithms for Web Scale Data

CS425: Algorithms for Web Scale Data CS425: Algorithms for Web Scale Data Most of the slides are from the Mining of Massive Datasets book. These slides have been modified for CS425. The original slides can be accessed at: www.mmds.org Challenges

More information

Bloom Filters. filters: A survey, Internet Mathematics, vol. 1 no. 4, pp , 2004.

Bloom Filters. filters: A survey, Internet Mathematics, vol. 1 no. 4, pp , 2004. Bloo Filters References A. Broder and M. Mitzenacher, Network applications of Bloo filters: A survey, Internet Matheatics, vol. 1 no. 4, pp. 485-509, 2004. Li Fan, Pei Cao, Jussara Aleida, Andrei Broder,

More information

Design of discrete-event simulations

Design of discrete-event simulations Design of discrete-event simulations Lecturer: Dmitri A. Moltchanov E-mail: moltchan@cs.tut.fi http://www.cs.tut.fi/kurssit/tlt-2707/ OUTLINE: Discrete event simulation; Event advance design; Unit-time

More information

Using the Power of Two Choices to Improve Bloom Filters

Using the Power of Two Choices to Improve Bloom Filters Internet Mathematics Vol. 4, No. 1: 17-33 Using the Power of Two Choices to Improve Bloom Filters Steve Lumetta and Michael Mitzenmacher Abstract. We consider the combination of two ideas from the hashing

More information

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017

CPSC 340: Machine Learning and Data Mining. Stochastic Gradient Fall 2017 CPSC 340: Machine Learning and Data Mining Stochastic Gradient Fall 2017 Assignment 3: Admin Check update thread on Piazza for correct definition of trainndx. This could make your cross-validation code

More information

Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004

Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004 Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004 * Generalizes an early version of my paper How to search on encrypted data on eprint Cryptology Archive on 7 October 2003 Secure Indexes Data

More information

Probabilistic Counting with Randomized Storage

Probabilistic Counting with Randomized Storage Probabilistic Counting with Randomized Storage Benjamin Van Durme University of Rochester Rochester, NY 14627, USA Ashwin Lall Georgia Institute of Technology Atlanta, GA 30332, USA Abstract Previous work

More information

Tight Bounds for Sliding Bloom Filters

Tight Bounds for Sliding Bloom Filters Tight Bounds for Sliding Bloom Filters Moni Naor Eylon Yogev November 7, 2013 Abstract A Bloom filter is a method for reducing the space (memory) required for representing a set by allowing a small error

More information

Bloom Filters and Locality-Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,

More information

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters)

12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters) 12 Count-Min Sketch and Apriori Algorithm (and Bloom Filters) Many streaming algorithms use random hashing functions to compress data. They basically randomly map some data items on top of each other.

More information

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University

Mining of Massive Datasets Jure Leskovec, AnandRajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

CS5112: Algorithms and Data Structures for Applications

CS5112: Algorithms and Data Structures for Applications CS5112: Algorithms and Data Structures for Applications Lecture 19: Association rules Ramin Zabih Some content from: Wikipedia/Google image search; Harrington; J. Leskovec, A. Rajaraman, J. Ullman: Mining

More information

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES

A REVIEW ARTICLE ON NAIVE BAYES CLASSIFIER WITH VARIOUS SMOOTHING TECHNIQUES Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,

More information

Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives

Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives Retouched Bloom Filters: Allowing Networked Applications to Trade Off Selected False Positives Against False Negatives Benoit Donnet Univ. Catholique de Louvain Bruno Baynat Univ. Pierre et Marie Curie

More information

Probabilistic Near-Duplicate. Detection Using Simhash

Probabilistic Near-Duplicate. Detection Using Simhash Probabilistic Near-Duplicate Detection Using Simhash Sadhan Sood, Dmitri Loguinov Presented by Matt Smith Internet Research Lab Department of Computer Science and Engineering Texas A&M University 27 October

More information

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit:

Frequent Itemsets and Association Rule Mining. Vinay Setty Slides credit: Frequent Itemsets and Association Rule Mining Vinay Setty vinay.j.setty@uis.no Slides credit: http://www.mmds.org/ Association Rule Discovery Supermarket shelf management Market-basket model: Goal: Identify

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

University of Illinois at Urbana-Champaign. Midterm Examination

University of Illinois at Urbana-Champaign. Midterm Examination University of Illinois at Urbana-Champaign Midterm Examination CS410 Introduction to Text Information Systems Professor ChengXiang Zhai TA: Azadeh Shakery Time: 2:00 3:15pm, Mar. 14, 2007 Place: Room 1105,

More information

CS60021: Scalable Data Mining. Large Scale Machine Learning

CS60021: Scalable Data Mining. Large Scale Machine Learning J. Leskovec, A. Rajaraman, J. Ullman: Mining of Massive Datasets, http://www.mmds.org 1 CS60021: Scalable Data Mining Large Scale Machine Learning Sourangshu Bhattacharya Example: Spam filtering Instance

More information

A Model for Learned Bloom Filters, and Optimizing by Sandwiching

A Model for Learned Bloom Filters, and Optimizing by Sandwiching A Model for Learned Bloom Filters, and Optimizing by Sandwiching Michael Mitzenmacher School of Engineering and Applied Sciences Harvard University michaelm@eecs.harvard.edu Abstract Recent work has suggested

More information

Advanced topic: Space complexity

Advanced topic: Space complexity Advanced topic: Space complexity CSCI 3130 Formal Languages and Automata Theory Siu On CHAN Chinese University of Hong Kong Fall 2016 1/28 Review: time complexity We have looked at how long it takes to

More information

Rainbow Tables ENEE 457/CMSC 498E

Rainbow Tables ENEE 457/CMSC 498E Rainbow Tables ENEE 457/CMSC 498E How are Passwords Stored? Option 1: Store all passwords in a table in the clear. Problem: If Server is compromised then all passwords are leaked. Option 2: Store only

More information

Data Stream Methods. Graham Cormode S. Muthukrishnan

Data Stream Methods. Graham Cormode S. Muthukrishnan Data Stream Methods Graham Cormode graham@dimacs.rutgers.edu S. Muthukrishnan muthu@cs.rutgers.edu Plan of attack Frequent Items / Heavy Hitters Counting Distinct Elements Clustering items in Streams Motivating

More information

AMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM

AMORTIZED ANALYSIS. binary counter multipop stack dynamic table. Lecture slides by Kevin Wayne. Last updated on 1/24/17 11:31 AM AMORTIZED ANALYSIS binary counter multipop stack dynamic table Lecture slides by Kevin Wayne http://www.cs.princeton.edu/~wayne/kleinberg-tardos Last updated on 1/24/17 11:31 AM Amortized analysis Worst-case

More information

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Algorithms lecture notes 1. Hashing, and Universal Hash functions Algorithms lecture notes 1 Hashing, and Universal Hash functions Algorithms lecture notes 2 Can we maintain a dictionary with O(1) per operation? Not in the deterministic sense. But in expectation, yes.

More information

Counting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109

Counting. 1 Sum Rule. Example 1. Lecture Notes #1 Sept 24, Chris Piech CS 109 1 Chris Piech CS 109 Counting Lecture Notes #1 Sept 24, 2018 Based on a handout by Mehran Sahami with examples by Peter Norvig Although you may have thought you had a pretty good grasp on the notion of

More information

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from

COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from COMPUTING SIMILARITY BETWEEN DOCUMENTS (OR ITEMS) This part is to a large extent based on slides obtained from http://www.mmds.org Distance Measures For finding similar documents, we consider the Jaccard

More information

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data

CS6931 Database Seminar. Lecture 6: Set Operations on Massive Data CS6931 Database Seminar Lecture 6: Set Operations on Massive Data Set Resemblance and MinWise Hashing/Independent Permutation Basics Consider two sets S1, S2 U = {0, 1, 2,...,D 1} (e.g., D = 2 64 ) f1

More information

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling

27 : Distributed Monte Carlo Markov Chain. 1 Recap of MCMC and Naive Parallel Gibbs Sampling 10-708: Probabilistic Graphical Models 10-708, Spring 2014 27 : Distributed Monte Carlo Markov Chain Lecturer: Eric P. Xing Scribes: Pengtao Xie, Khoa Luu In this scribe, we are going to review the Parallel

More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University

Slides credits: Mining of Massive Datasets Jure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information

Hashing. Martin Babka. January 12, 2011

Hashing. Martin Babka. January 12, 2011 Hashing Martin Babka January 12, 2011 Hashing Hashing, Universal hashing, Perfect hashing Input data is uniformly distributed. A dynamic set is stored. Universal hashing Randomised algorithm uniform choice

More information

Expectation of geometric distribution

Expectation of geometric distribution Expectation of geometric distribution What is the probability that X is finite? Can now compute E(X): Σ k=1f X (k) = Σ k=1(1 p) k 1 p = pσ j=0(1 p) j = p 1 1 (1 p) = 1 E(X) = Σ k=1k (1 p) k 1 p = p [ Σ

More information

Computer Networks 55 (2011) Contents lists available at ScienceDirect. Computer Networks. journal homepage:

Computer Networks 55 (2011) Contents lists available at ScienceDirect. Computer Networks. journal homepage: Computer Networks 55 (2) 84 89 Contents lists available at ScienceDirect Computer Networks journal homepage: www.elsevier.com/locate/comnet A Generalized Bloom Filter to Secure Distributed Network Applications

More information

Lecture 01 August 31, 2017

Lecture 01 August 31, 2017 Sketching Algorithms for Big Data Fall 2017 Prof. Jelani Nelson Lecture 01 August 31, 2017 Scribe: Vinh-Kha Le 1 Overview In this lecture, we overviewed the six main topics covered in the course, reviewed

More information

Incremental Learning and Concept Drift: Overview

Incremental Learning and Concept Drift: Overview Incremental Learning and Concept Drift: Overview Incremental learning The algorithm ID5R Taxonomy of incremental learning Concept Drift Teil 5: Incremental Learning and Concept Drift (V. 1.0) 1 c G. Grieser

More information

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms Computer Science 385 Analysis of Algorithms Siena College Spring 2011 Topic Notes: Limitations of Algorithms We conclude with a discussion of the limitations of the power of algorithms. That is, what kinds

More information

Efficient Data Reduction and Summarization

Efficient Data Reduction and Summarization Ping Li Efficient Data Reduction and Summarization December 8, 2011 FODAVA Review 1 Efficient Data Reduction and Summarization Ping Li Department of Statistical Science Faculty of omputing and Information

More information

Cuckoo Hashing and Cuckoo Filters

Cuckoo Hashing and Cuckoo Filters Cuckoo Hashing and Cuckoo Filters Noah Fleming May 7, 208 Preliminaries A dictionary is an abstract data type that stores a collection of elements, located by their key. It supports operations: insert,

More information

Learning from Time-Changing Data with Adaptive Windowing

Learning from Time-Changing Data with Adaptive Windowing Learning from Time-Changing Data with Adaptive Windowing Albert Bifet Ricard Gavaldà Universitat Politècnica de Catalunya {abifet,gavalda}@lsi.upc.edu Abstract We present a new approach for dealing with

More information

6 Filtering and Streaming

6 Filtering and Streaming Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius

More information

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR) Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR) References Hashing Techniques: Elmasri, 7th Ed. Chapter 16, section 8. Cormen, 3rd Ed. Chapter 11. Inverted indexing: Elmasri,

More information

Cuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions

Cuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions 1 / 29 Cuckoo Hashing with a Stash: Alternative Analysis, Simple Hash Functions Martin Aumüller, Martin Dietzfelbinger Technische Universität Ilmenau 2 / 29 Cuckoo Hashing Maintain a dynamic dictionary

More information

B490 Mining the Big Data

B490 Mining the Big Data B490 Mining the Big Data 1 Finding Similar Items Qin Zhang 1-1 Motivations Finding similar documents/webpages/images (Approximate) mirror sites. Application: Don t want to show both when Google. 2-1 Motivations

More information

1 Maintaining a Dictionary

1 Maintaining a Dictionary 15-451/651: Design & Analysis of Algorithms February 1, 2016 Lecture #7: Hashing last changed: January 29, 2016 Hashing is a great practical tool, with an interesting and subtle theory too. In addition

More information

Discrete-event simulations

Discrete-event simulations Discrete-event simulations Lecturer: Dmitri A. Moltchanov E-mail: moltchan@cs.tut.fi http://www.cs.tut.fi/kurssit/elt-53606/ OUTLINE: Why do we need simulations? Step-by-step simulations; Classifications;

More information

2 How many distinct elements are in a stream?

2 How many distinct elements are in a stream? Dealing with Massive Data January 31, 2011 Lecture 2: Distinct Element Counting Lecturer: Sergei Vassilvitskii Scribe:Ido Rosen & Yoonji Shin 1 Introduction We begin by defining the stream formally. Definition

More information

An Early Traffic Sampling Algorithm

An Early Traffic Sampling Algorithm An Early Traffic Sampling Algorithm Hou Ying ( ), Huang Hai, Chen Dan, Wang ShengNan, and Li Peng National Digital Switching System Engineering & Techological R&D center, ZhengZhou, 450002, China ndschy@139.com,

More information

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2) INF2270 Spring 2010 Philipp Häfliger Summary/Repetition (1/2) content From Scalar to Superscalar Lecture Summary and Brief Repetition Binary numbers Boolean Algebra Combinational Logic Circuits Encoder/Decoder

More information

CMSC 858F: Algorithmic Lower Bounds: Fun with Hardness Proofs Fall 2014 Introduction to Streaming Algorithms

CMSC 858F: Algorithmic Lower Bounds: Fun with Hardness Proofs Fall 2014 Introduction to Streaming Algorithms CMSC 858F: Algorithmic Lower Bounds: Fun with Hardness Proofs Fall 2014 Introduction to Streaming Algorithms Instructor: Mohammad T. Hajiaghayi Scribe: Huijing Gong November 11, 2014 1 Overview In the

More information

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park 1617 Preview Data Structure Review COSC COSC Data Structure Review Linked Lists Stacks Queues Linked Lists Singly Linked List Doubly Linked List Typical Functions s Hash Functions Collision Resolution

More information

Lecture 2. Frequency problems

Lecture 2. Frequency problems 1 / 43 Lecture 2. Frequency problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 43 1 Frequency problems in data streams 2 Approximating inner product 3 Computing frequency moments

More information

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05

Meelis Kull Autumn Meelis Kull - Autumn MTAT Data Mining - Lecture 05 Meelis Kull meelis.kull@ut.ee Autumn 2017 1 Sample vs population Example task with red and black cards Statistical terminology Permutation test and hypergeometric test Histogram on a sample vs population

More information

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

DATA MINING LECTURE 3. Frequent Itemsets Association Rules DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.

More information

How Philippe Flipped Coins to Count Data

How Philippe Flipped Coins to Count Data 1/18 How Philippe Flipped Coins to Count Data Jérémie Lumbroso LIP6 / INRIA Rocquencourt December 16th, 2011 0. DATA STREAMING ALGORITHMS Stream: a (very large) sequence S over (also very large) domain

More information