Bloom Filters, general theory and variants

1 Bloom Filters: general theory and variants
G. Caravagna, Information Retrieval
"Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered. When using a Bloom Filter, consider the effects of false positives."

2 Index
1. The problem
2. Main idea, Mathematics
3. Spectral Bloom Filters (SBF)
4. Bloomier Filters (BBF)
Applications, References

4 Membership query
Definition (The Membership Problem). Given a set $S$ and an element $y$, decide whether $y \in S$. Equivalently, given a set $S$, compute its characteristic function $\chi_S$:
$$\chi_S(y) = \begin{cases} 1 & \text{if } y \in S \\ 0 & \text{if } y \notin S \end{cases}$$
Well-known solutions:
- Linear Scan
- Deterministic Arrays
- Hash Functions

6 Well-known Solutions
- Linear Scan: not an option for massive data sets.
- Deterministic Arrays (compute $\chi_S$ exactly): the elements of $S$ belong to a finite universe; use a boolean array as big as the universe and map the elements of the universe onto it.
- Hash Functions (approximate $\chi_S$): use $\alpha$ bits for each element of $S$ (usually $\alpha = \log n$); sort the hashed values (sort $\alpha$-tuples); what may happen with collisions?
Bloom Filters use these as a starting point.

9 Reminder
"Wherever a list or set is used, and space is a consideration, a Bloom Filter should be considered. When using a Bloom Filter, consider the effects of false positives." Bloom [1970]

11 Preconditions
- given a set of objects $S = \{x_1, \dots, x_n\}$, with no restrictions on the objects
- a vector $B$ of $m$ bits, where $b_i \in \{0, 1\}$ (the value of $m$ is discussed later)
- suppose we have $k$ hash functions $h_1, \dots, h_k$
- each $h_i$ is defined as $h_i : U \to [1; m]$ with $S \subseteq U$, so $h_i$ indexes the vector $B$

12 Building vector B
How to build $B$: $\forall i.\ b_i = 1 \iff \exists (j, t).\ h_j(x_t) = i$
[Figure: two elements $x_i$ and $x_j$, each mapped by the $k$ hash functions to $k$ positions of $B$.]

13 Building vector B
procedure build
begin
    for each s in S do          // is Θ(|S|)
        for each h in H do      // is O(1), k is a constant
            B[h(s)] = 1;        // suppose is O(1)
        done
    done
end
The whole build takes $\Theta(|S|) = \Theta(n)$ time and $\Theta(|B|) = \Theta(m)$ space.

14 Searching into vector B
How to search for $y$: we answer "$y \in S$" when $\forall i = 1, \dots, k.\ b_{h_i(y)} = 1$.
[Figure: $y$ is hashed by $h_1, \dots, h_k$ to $k$ positions of $B$.]
$S$ is no longer needed once $B$ is built.

16 Searching into vector B
procedure search
begin
    for each h in H do          // is O(1)
        if B[h(y)] = 0 then
            return "not found"
    done
    return "found"
end
The whole search takes $O(1)$ time and $O(1)$ space.
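
The two procedures above can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: the helper `make_hashes` and the use of salted SHA-256 as a stand-in for the $k$ hash functions are assumptions, and indices run over `range(m)` rather than $[1; m]$.

```python
import hashlib


def make_hashes(k, m):
    """Return k hash functions mapping any object to an index in range(m).

    Assumption: salted SHA-256 stands in for the k hash functions.
    """
    def make(i):
        def h(item):
            digest = hashlib.sha256(f"{i}:{item!r}".encode()).digest()
            return int.from_bytes(digest[:8], "big") % m
        return h
    return [make(i) for i in range(k)]


def build(S, hashes, m):
    """Set B[h(s)] = 1 for every s in S and every hash function h."""
    B = [0] * m
    for s in S:
        for h in hashes:
            B[h(s)] = 1
    return B


def search(y, B, hashes):
    """Return False ("not found") as soon as one probed bit is 0."""
    for h in hashes:
        if B[h(y)] == 0:
            return False
    return True  # "found", possibly a false positive


m, k = 64, 3
H = make_hashes(k, m)
S = ["TTA", "TCT", "ATA"]
B = build(S, H, m)
print([search(s, B, H) for s in S])  # every member of S is found
```

Once `B` is built, `search` never consults `S` again; only the bit vector and the hash functions are needed.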

17 Searching into vector B
It is easy to notice that, when computing $\chi_S(y)$:
1. if $\exists i.\ b_{h_i(y)} = 0$ then $\chi_S(y) = 0$, thus $y \notin S$
2. if $\forall i.\ b_{h_i(y)} = 1$ then $\chi_S(y) = 1$, thus $y \in S$
Of course sentence 1 is true. What about sentence 2?

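19 The real problem of Bloom Filters
Definition. The main problem is false positives: for some $y \notin S$ there may exist an $x_j \neq y$ such that $h_{\ldots}(x_j) = h_{\ldots}(y)$.
[Figure: $y \notin S$ and $x_j$ hashing to the same position of $B$.]
The statement "if $\forall i.\ b_{h_i(y)} = 1$ then $y \in S$" is not always true.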

20 What we compute
- what may happen with a false positive? We say $y \in S$ even if this is false
- we should rather say "$y \in S$, probably"
- can we compute this probability?

21 Example
[Figure: the strings ACG, ATA, CGA, TTA, ATA, CGC, AAA, TCT, the hash functions $h_1$, $h_2$, $h_3$ and the bit vector $B$.]
$S = \{TTA, TCT, ATA\}$

22 Example: building B
[Figure: the elements of $S = \{TTA, TCT, ATA\}$ are hashed by $h_1$, $h_2$, $h_3$ into $B$.]

23 Example: building B
We may have introduced a false positive in $B_{10}$.
[Figure: same layout as before, with $h_1(TCT) = h_2(TTA) = 9$.]
$S = \{TTA, TCT, ATA\}$

24 Example: building B
We may have introduced two false positives, in $B_{10}$ and $B_{11}$.
[Figure: same layout as before.]
$S = \{TTA, TCT, ATA\}$

25 Example: searching into B
Query: CGA.
[Figure: CGA is hashed by $h_1$, $h_2$, $h_3$ into $B$.]
$CGA \in S$? NO, because $B_{h_3(CGA)} = 0$.
$S = \{TTA, TCT, ATA\}$

26 Example: searching into B
Query: AAA.
$h_1(TTA) = h_2(AAA) = 1$, $h_2(ATA) = h_3(AAA) = 2$, $h_2(TCT) = h_1(AAA) = 4$
[Figure: AAA is hashed by $h_1$, $h_2$, $h_3$ into $B$.]
$AAA \in S$? YES: a false positive.
$S = \{TTA, TCT, ATA\}$

28 Probability of a false positive
- assume that the hash functions are perfectly random
- after the build, $P(b_i = 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m} = p$
- the probability of a false positive is $\left(1 - e^{-kn/m}\right)^k = (1 - p)^k = \varepsilon$
- other formulations are asymptotically equivalent

29 Optimizing number of hash functions
- a higher $k$ gives more chances to find a 0-bit for $y \notin S$
- a lower $k$ increases the fraction of 0-bits in $B$
- minimizing $\varepsilon$ as a function of $k$ gives $k = \ln 2 \cdot (m/n)$
- at this optimum $p = 0.5$, so $\varepsilon$ depends only on $m/n$: $\varepsilon = (0.5)^k = (0.6185)^{m/n}$
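
The choice of $k$ can be checked numerically. The sketch below evaluates $\varepsilon = (1 - e^{-kn/m})^k$ for integer values of $k$ and compares the best one with $\ln 2 \cdot (m/n)$; the function names and the sample values $m = 10000$, $n = 1000$ are illustrative assumptions.

```python
import math


def false_positive_rate(m, n, k):
    """epsilon = (1 - p)^k with p = e^(-k n / m)."""
    p = math.exp(-k * n / m)
    return (1.0 - p) ** k


def optimal_k(m, n):
    """Real-valued minimiser k = ln 2 * (m / n)."""
    return math.log(2) * m / n


m, n = 10_000, 1_000           # illustrative sizes, m = 10n
k_star = optimal_k(m, n)
# Best integer k found by brute force over a small range.
best_int = min(range(1, 21), key=lambda k: false_positive_rate(m, n, k))
print(round(k_star, 2), best_int)
print(false_positive_rate(m, n, k_star), 0.6185 ** (m / n))
```

In practice $k$ must be an integer, so the real-valued optimum is rounded to a neighbouring integer.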

31 How big should the B vector be?
- it depends on the $\varepsilon$ value we want
- given $n$ we fix $m$
[Table: $\varepsilon$ (in %) for $m = n$, $2n$, $5n$, $10n$; the numeric values are missing from this transcription.]
- $m = O(n)$ is generally a good choice
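
The relation between $m$, $n$ and $\varepsilon$ can be tabulated from the formula $\varepsilon = (0.5)^k$ with $k = \ln 2 \cdot (m/n)$. The sketch below is an illustration under that formula only: it does not reproduce the figures of the original table, and `size_for` (which inverts the formula to $m = -n \ln \varepsilon / (\ln 2)^2$) is a hypothetical helper.

```python
import math


def eps_at_optimal_k(bits_per_element):
    """epsilon = 0.5^k with k = ln 2 * (m / n)."""
    return 0.5 ** (math.log(2) * bits_per_element)


def size_for(n, eps):
    """Bits m and hash count k for n elements and a target rate eps."""
    m = math.ceil(-n * math.log(eps) / math.log(2) ** 2)
    k = max(1, round(math.log(2) * m / n))
    return m, k


for c in (1, 2, 5, 10):
    print(f"m = {c}n  ->  eps = {100 * eps_at_optimal_k(c):.2f}%")

print(size_for(1_000, 0.01))   # bits and hash functions for eps = 1%
```

With about 10 bits per element the rate drops below 1%, which is consistent with $m = O(n)$ being a good choice.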

32 Bloom Filters vs. hash functions

| | hash functions | Bloom Filters |
|---|---|---|
| build time | $\Theta(n + n \log n)$ | $\Theta(n)$ |
| space needed | $\Theta(n \log n)$ | $\Theta(m)$ |
| search time | $O(\log n)$ | $O(1)$ |
| $\varepsilon$ value | $1/n$ | $(1 - p)^k$ |

Hash functions are Bloom Filters with $k = 1$.

33 Bloom Filters tricks
Union by OR:
1. we have sets $S_1$, $S_2$ and Bloom Filters $B^1$, $B^2$
2. suppose $m_1 = m_2$ and the same hash functions
3. just OR the bits: $B^{12}_i = B^1_i \lor B^2_i$
Halved size:
1. suppose $m = 2^\alpha$
2. take the union by OR of the two halves
3. when hashing, mask the high-order bit
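
Both tricks can be sketched in a few lines of Python. The helper names and the salted SHA-256 hashing are assumptions made for illustration; $m$ is taken to be a power of two so that reducing an index to `range(m)` is a bit mask.

```python
import hashlib


def index(item, i, m):
    """i-th hash of item, reduced to range(m); m must be a power of two."""
    digest = hashlib.sha256(f"{i}:{item!r}".encode()).digest()
    return int.from_bytes(digest[:8], "big") & (m - 1)


def build(S, k, m):
    B = [0] * m
    for s in S:
        for i in range(k):
            B[index(s, i, m)] = 1
    return B


def contains(y, B, k):
    return all(B[index(y, i, len(B))] for i in range(k))


def union(B1, B2):
    """Bitwise OR of two filters built with the same m and hash functions."""
    return [a | b for a, b in zip(B1, B2)]


def halve(B):
    """OR the two halves; later lookups mask off the high-order index bit."""
    half = len(B) // 2
    return [a | b for a, b in zip(B[:half], B[half:])]


S1, S2 = ["TTA", "TCT"], ["ATA", "CGC"]
B12 = union(build(S1, 3, 64), build(S2, 3, 64))
small = halve(B12)
print(all(contains(x, small, 3) for x in S1 + S2))
```

The union equals the filter built directly from $S_1 \cup S_2$, and the halved filter equals the one built directly with $m/2$ bits, so neither trick introduces false negatives; halving does raise the false positive rate.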

34 Summary
- we have a tradeoff between space and false positives
- the $\varepsilon$ value is computable (and constant)
- we use the abstraction provided by hash functions on $x_i \in S$
- we approximate the characteristic function
- we have an easy-to-code data structure
We started from the Membership Problem and we solve this one: handle massive data sets to support membership queries using a compact data structure.
What else could we want?

35 Why variants
- extend Bloom Filters to multisets: Spectral Bloom Filter, Matias and Cohen [2003]
- compute almost any function: Bloomier Filter, Chazelle et al. [2004]
- something else? someone else...
These are recent results; let us try to give a brief overview.

37 Spectral Bloom Filters (SBF)
We extend Bloom Filters to multisets.
Definition. $M = \langle S, f_x \rangle$ is a multiset, where $S$ is a set and $f_x$ is a function giving the number of occurrences of $x$ in $M$.
Example: $M = \{A^{=2}, B^{=1}, C^{=2}\}$, $\sum_{x \in S} f_x = |M| = 5$, $f_A = f_C = 2$, $f_B = 1$.

38 Main features
- space usage is slightly larger
- performance is generally better
- insertion is always possible, deletion is not
- can be built incrementally for streaming data
- we query for values $f_x > T$, with $T$ a threshold
- with $T = 0$ we answer membership queries
- we have tricks for SBF

39 SBF
- the $B$ vector is replaced by a vector of counters $C_1, C_2, \dots, C_m$
- $C_i$ is the sum of the $f_x$ values of every $x \in S$ mapping to $i$
- as always, approximations of $f_x$ are stored in $C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}$
- thus, to compute $f_x$, we have $m_x = \min\{C_{h_1(x)}, C_{h_2(x)}, \dots, C_{h_k(x)}\}$
- $m_x$ is the basic estimator, or Minimum Selection (MS)

40 The Minimum Selection
[Figure: counters $C_{i-1}$, $C_i$, $C_j$, $C_{j+1}$ holding contributions of $f_x$, $f_y$ and $f_z$, with $h_{\ldots}(y) = h_{\ldots}(x) = i - 1$.]
- $C_{i-1}$ is not a good approximation of $f_x$ (nor of $f_y$)
- $C_i$ is exactly $f_x$
- $C_{j+1}$ is exactly $f_z$

41 Insertion and Deletion
- insertion is simple: increase each counter by 1
    ...
    for each h in H do          // O(1)
        C[h(x)] = C[h(x)] + 1;
    done
    ...
- deletion is simple: decrease each counter by 1
- search for an element $x$: compute the Minimum Selection $m_x$
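
A counter vector with Minimum Selection can be sketched as below. The class name, the salted SHA-256 hashing and the choice to touch each distinct counter once per operation are assumptions of this illustration; deletion is only meaningful for items that were previously inserted.

```python
import hashlib


class SpectralBloomFilter:
    """Vector of m counters probed by k hash functions (illustrative sketch)."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.C = [0] * m

    def _positions(self, x):
        # Distinct counter indices of x; salted SHA-256 is an assumption.
        out = set()
        for i in range(self.k):
            d = hashlib.sha256(f"{i}:{x!r}".encode()).digest()
            out.add(int.from_bytes(d[:8], "big") % self.m)
        return out

    def insert(self, x):
        for j in self._positions(x):
            self.C[j] += 1

    def delete(self, x):
        # Caller must only delete items that were inserted before.
        for j in self._positions(x):
            self.C[j] -= 1

    def estimate(self, x):
        """Minimum Selection: m_x = min over the counters of x."""
        return min(self.C[j] for j in self._positions(x))


sbf = SpectralBloomFilter(m=32, k=3)
M = ["A", "A", "B", "C", "C"]          # the multiset used as example earlier
for x in M:
    sbf.insert(x)
print({x: sbf.estimate(x) for x in "ABC"})
```

Because collisions can only add to a counter, `estimate(x)` never falls below the true number of occurrences of `x`.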

42 On the error of SBF
The error is the same $\varepsilon$ of Bloom Filters.
Theorem. For all $x$, $f_x \le m_x$. Furthermore, $f_x \neq m_x$ with probability $E_{SBF} = \varepsilon \approx (1 - p)^k$.
Proof. With no collisions $m_x = f_x$. With collisions on every counter of $x$, $m_x > f_x$. The case $m_x < f_x$ cannot happen, because a collision can only add to a counter. The event $f_x \neq m_x$ means that all the counters of $x$ have a collision, which is a false positive.

43 Implementing a SBF: challenges
Mainly two challenges:
1. vector of counters: computational complexity of random accesses, insertion, deletion
2. performances: allow insertion/deletion while keeping $E_{SBF}$ low

44 Solving Problem 2 with Minimal Increase (MI)
We minimize redundant insertions.
Minimal Increase principle: when performing an insertion of element $x$, increase only the counters that equal $m_x$. Each lookup will return the value $m_x$.
We get the inequality $E_{SBF} \le \varepsilon$.
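
The Minimal Increase rule can be compared with plain insertion in a short sketch. The function names and the salted SHA-256 hashing are illustrative assumptions; a deliberately small counter vector is used so that collisions do occur.

```python
import hashlib


def positions(x, k, m):
    """Distinct counter indices of x (salted SHA-256 is an assumption)."""
    out = set()
    for i in range(k):
        d = hashlib.sha256(f"{i}:{x!r}".encode()).digest()
        out.add(int.from_bytes(d[:8], "big") % m)
    return out


def insert_plain(C, x, k):
    """Plain insertion: increase every counter of x."""
    for j in positions(x, k, len(C)):
        C[j] += 1


def insert_mi(C, x, k):
    """Minimal Increase: increase only the counters equal to m_x."""
    pos = positions(x, k, len(C))
    mx = min(C[j] for j in pos)
    for j in pos:
        if C[j] == mx:
            C[j] += 1


def estimate(C, x, k):
    return min(C[j] for j in positions(x, k, len(C)))


m, k = 16, 3
stream = [f"item{i % 20}" for i in range(100)]   # each item occurs 5 times
C_plain, C_mi = [0] * m, [0] * m
for x in stream:
    insert_plain(C_plain, x, k)
    insert_mi(C_mi, x, k)
print(sum(C_plain), sum(C_mi))
```

Every Minimal Increase counter stays at or below the corresponding plain counter, so its estimate is never worse, while it still never drops below the true count.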

45 Minimal Increase: example of increase
Insertion is always possible.
[Figure: the counters of $x$ before and after successive "insert x" operations, with $m_x = 1$, then $m_x = 2$, and so on.]

46 Minimal Increase: example of decrease
Deletion may introduce false negatives.
[Figure: initially $m_x = 1$ and $m_y = 0$; after "insert y", $m_x = 1 = m_y$; after "delete y", $m_x = 0 = m_y$.]
- we lie, saying $x \notin S$
- MI doesn't allow deletion

47 Solving Problem 2 with Recurring Minimum (RM)
Recurring Minimum: definition
[Figure: counters holding contributions of $f_x$ and $f_z$, with the minima $m_x$ and $m_z$ marked.]
- $x$ has a Recurring Minimum (RM)
- $z$ has a Single Minimum (SM)
An element has a RM iff more than one of its counters holds its MS value.

48 Solving Problem 2 with Recurring Minimum (RM)
We identify Bloom errors and handle them.
Recurring Minimum: principle
- for an item $x$ with a RM we use $m_x$ as estimator: $E_{SBF} < \varepsilon$
- for items with a single minimum we use a secondary SBF, with $SBF_2 \ll SBF_1$
Improvements are remarkable: $E_{SBF_2} \ll \varepsilon$

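49 Recurring Minimum: insertion
Insertion handles potential future errors:
1. increase($SBF_1$, x)
2. if $x$ has a RM in $SBF_1$, stop
3. look for $x$ in $SBF_2$:
   1. if $x \in SBF_2$, increase($SBF_2$, x)
   2. if $x \notin SBF_2$, increase($SBF_2$, search($SBF_1$, x)), i.e. add $x$ to $SBF_2$ with its minimum value in $SBF_1$
[Figure: successive "insert x" operations on an item with a RM and on an item with a SM.]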

50 Recurring Minimum: lookup and deletion
Lookup looks, if needed, in both SBFs:
1. if $x$ has a RM in $SBF_1$, return it
2. otherwise, let $m_x^2$ be the value of $x$ in $SBF_2$:
   1. if $m_x^2 > 0$, return it
   2. otherwise return the min value of $x$ in $SBF_1$
Deletion is the reverse of insertion:
1. decrease($SBF_1$, x)
2. if $x$ has a SM in $SBF_1$, decrease($SBF_2$, x)
As insertion is done in both SBFs, deletion can't create false negatives.

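51 Methods Comparison: MS vs. MI vs. RM
[Table: the comparison symbols between the methods did not survive transcription; the methods are listed in the order given on the slide.]
- error rates: MI, RM, MS = $\varepsilon$
- space overhead: MI, MS, RM
- complexity: MS, MI, RM
- insertion/deletion: MS = RM, MI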

52 Solving Problem 1 with an integer vector
Each counter fits in one word, for example a 4-byte word. All the $m$ counters take $4m$ bytes. To get $\varepsilon < 1\%$ we need $m = 10n$, so the $m$ counters take $40n$ bytes. With $n = 2^{20}$ (a little more than $10^6$ objects) the counters need 40MB!
- a vector of integers is too big
- do we really need counters that large?

54 Solving Problem 1 with a static bit vector
Suppose $\forall i.\ f_i < 10$, thus $C_j < 10 + \alpha$, with $\alpha$ depending on collisions. Use $\lceil \log_2 10 \rceil = 4$ bits (= 0.5 bytes) for each counter. To get $\varepsilon < 1\%$ we need $m = 10n$, so the $m$ counters take $5n$ bytes. With $n = 2^{20}$ we have a 5MB static vector!
This static vector doesn't allow insertion or deletion.
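
The two space figures can be reproduced with a few lines of arithmetic. The variable names are illustrative; "MB" is read here as $2^{20}$ bytes, which is an assumption that matches the round numbers above.

```python
import math

n = 2 ** 20                      # a little more than 10^6 objects
m = 10 * n                       # counters needed for eps below 1%

# One 4-byte word per counter.
int_vector_bytes = m * 4

# ceil(log2(10)) = 4 bits per counter when every f_i < 10.
bits_per_counter = math.ceil(math.log2(10))
static_vector_bytes = m * bits_per_counter // 8

print(int_vector_bytes / 2 ** 20, "MB with 4-byte counters")
print(static_vector_bytes / 2 ** 20, "MB with 4-bit counters")
```

The factor of eight between the two figures comes entirely from the counter width, 32 bits against 4 bits.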

56 Solving Problem 1 with String Array Index
- use the number of bits (per counter) strictly needed
- use some slack bits: fix a value $\alpha > 0$ and add $\alpha m$ bits to the vector, one every $1/\alpha$ items
- each counter $C_i$ uses $\lceil \log_2 C_i \rceil$ bits
- each counter counts up to $2^{\lceil \log_2 C_i \rceil}$ + ..
- the $C$ vector takes $\sum_{i=1}^{m} \lceil \log_2 C_i \rceil + \alpha m = N$ bits
[Figure: the counters $C_1, C_2, \dots, C_m$ packed one after another, counter $C_i$ taking $\lceil \log_2 C_i \rceil$ bits, with a slack bit after every $1/\alpha$ counters.]

57 The String Array Index: main idea
- the first level holds pointers to subsequences of the SBF: a Coarse Offset Vector pointing to groups of $\log N$ items; there are $m / \log N$ such pointers
- the second level may be either another Coarse Offset Vector of pointers to subsequences, or a simple vector of offsets

58 The String Array Index: graphic
[Figure: a Coarse Offset Vector (C.O.V.) pointing into the string $S$; each group is indexed in turn by a second-level C.O.V. or by Offset Vectors.]

59 The String Array Index: performances
2 levels of pointers to sub-sequences, with $N = \sum_{i=1}^{m} \lceil \log_2 C_i \rceil + \alpha m$.
Theorem. The SAI of size $o(N) + O(m)$ bits can be built in $O(m)$ time, supporting access to sub-sequences in $O(1)$ time.
Theorem. An SBF of size $N + o(N) + O(m)$ bits can be built in $O(N)$ time, supporting lookup in $O(1)$ time. Furthermore, each update takes $O(1)$ amortized time.

60 SBF tricks
Merging by addition:
1. we have two sets $S_1$, $S_2$ and two SBFs $C^1$, $C^2$
2. suppose $m_1 = m_2$ and the same hash functions
3. just sum the counters: $C^{12}_i = C^1_i + C^2_i$
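
Merging by addition can be sketched as follows. The helper names and the salted SHA-256 hashing are assumptions made for illustration, and both counter vectors are built with the same $m$ and $k$ as the trick requires.

```python
import hashlib


def positions(x, k, m):
    """Distinct counter indices of x (salted SHA-256 is an assumption)."""
    out = set()
    for i in range(k):
        d = hashlib.sha256(f"{i}:{x!r}".encode()).digest()
        out.add(int.from_bytes(d[:8], "big") % m)
    return out


def build_counters(multiset, k, m):
    """Counter vector of a multiset given as a list with repetitions."""
    C = [0] * m
    for x in multiset:
        for j in positions(x, k, m):
            C[j] += 1
    return C


def merge(C1, C2):
    """Element-wise sum of two counter vectors with equal m and hashes."""
    return [a + b for a, b in zip(C1, C2)]


M1 = ["A", "A", "B"]
M2 = ["B", "C", "C"]
C12 = merge(build_counters(M1, 3, 32), build_counters(M2, 3, 32))
print(C12 == build_counters(M1 + M2, 3, 32))
```

Since every counter is a plain sum of occurrences, the merged vector is identical to the one built from the combined multiset.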

62 Bloomier Filters (BBF)
- we compute any function $f$ using a BBF
- there are some constraints on $f$
- the same tradeoff is inherited from Bloom Filters
- we associate values with a subset of the domain elements

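63 Which function
$f : D = \{0, \dots, N - 1\} \to R = \{\bot, 0, \dots, 2^r - 1\}$
- the values computed are those of a subset $S \subseteq D$
[Figure: $f$ maps $S \subseteq D$ to $f(S) \subseteq R$.]
- on $S$, $f$ is error free
- elsewhere, the answer is correct with probability arbitrarily close to 1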

64 Main features
- query is $O(1)$
- space requirement is $O(nr)$
- can be generalized to handle dynamic updates: the function can be updated, with space unchanged
- we query values of $f$
- we may change $f(x)$ for $x \in S$, but $S$ is immutable

65 Main idea
- a false positive in a BBF: returning a result when the key is not in the map
- we give a simple idea of a BBF: the Bloom Filter cascade
- it can be formally generalized

66 A near-optimal and simple BBF
- possible values are $\{0, 1\}$
- $A_0$ is a BF with the elements mapping to 0
- $B_0$ is a BF with the elements mapping to 1
- we will build many pairs $(A_i, B_i)$ (here is the cascade)
- we make a cross search
- we search as deep as we need
- what may happen when searching?

67 A near-optimal and simple BBF
We start looking in $(A_0, B_0)$:
- if it is in neither, it is not in the map (surely)
- if it is in $A_0$ but not in $B_0$: it does not map to 1 (surely); it does map to 0 (probably)
- if it is in $A_0$ and in $B_0$: which one lies? (false positive); we have to go recursively into $(A_1, B_1)$
$A_1$ holds the elements mapping to 0 that are false positives in $B_0$.
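
The cascade search described above can be sketched in Python. Several details are assumptions of this illustration rather than part of the slides: the `Bloom` helper with per-level salts, the fixed size `m` for every level, the `max_depth` limit, and the exact `residual` dictionary that stores any keys still unresolved at the last level.

```python
import hashlib


class Bloom:
    """Plain Bloom filter; salted SHA-256 stands in for the hash functions."""

    def __init__(self, items, m, k, salt):
        self.m, self.k, self.salt = m, k, salt
        self.bits = [0] * m
        for x in items:
            for j in self._positions(x):
                self.bits[j] = 1

    def _positions(self, x):
        for i in range(self.k):
            d = hashlib.sha256(f"{self.salt}:{i}:{x!r}".encode()).digest()
            yield int.from_bytes(d[:8], "big") % self.m

    def __contains__(self, x):
        return all(self.bits[j] for j in self._positions(x))


def build_cascade(mapping, m=64, k=2, max_depth=8):
    """Build pairs (A_i, B_i); mapping sends each key to 0 or 1."""
    zeros = {x for x, v in mapping.items() if v == 0}
    ones = {x for x, v in mapping.items() if v == 1}
    levels = []
    for depth in range(max_depth):
        if not zeros and not ones:
            break
        a = Bloom(zeros, m, k, salt=f"A{depth}")
        b = Bloom(ones, m, k, salt=f"B{depth}")
        levels.append((a, b))
        # Next level keeps only the keys that are false positives
        # in the opposite filter of this level.
        zeros, ones = ({x for x in zeros if x in b},
                       {x for x in ones if x in a})
    residual = {x: 0 for x in zeros}
    residual.update({x: 1 for x in ones})
    return levels, residual


def lookup(x, levels, residual):
    for a, b in levels:
        in_a, in_b = x in a, x in b
        if not in_a and not in_b:
            return None               # surely not in the map
        if in_a != in_b:
            return 0 if in_a else 1   # only one filter claims x
    return residual.get(x)            # still ambiguous after the last pair


mapping = {f"key{i}": i % 2 for i in range(100)}
levels, residual = build_cascade(mapping)
print(all(lookup(x, levels, residual) == v for x, v in mapping.items()))
```

Keys in the map always get their correct value, because a Bloom filter has no false negatives; a key outside the map may still receive 0 or 1 instead of `None`, which is the false positive of this structure.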

71 A BF cascade
[Figure: the pair $(A_0, B_0)$ feeds its false positives to $(A_1, B_1)$, and so on down the cascade, with $A_{i+1} \subseteq A_i$.]
- average search is $O(1)$: the first pairs are generally enough
- total space is independent of $n$: the first pair occupies most of the space

72 The general idea
- results are binary-coded: $v \in R$ is coded with $\beta_v \in \{0, 1\}^q$
- for each bit of $\beta_v$ we use the simple BBF
What we get:
- space is slightly larger than the space for $2q$ BFs
- lookup is $\Theta(q)$
- build is $O(n \log n)$
- $E_{BBF}$ is proportional to $2^{-q}$

73 The general idea
- a table $T$ of coded values is used; $T$ has $m$ locations
- we have, as always, $k$ hash functions
- we use a masking value $M$ to reduce $E_{BBF}$
- if $x \in S$: $\beta_{f(x)} = M \oplus \bigoplus_{i=1}^{k} T[h_i(x)]$
- if $x \notin S$: $P[\mathrm{lookup}(x, T) = \bot] \ge 1 - k\,2^{-q}$

75 Other special Bloom Filters
- Counting Bloom Filters, Broder et al. [1998]
- Compressed Bloom Filters, M. Mitzenmacher [2002]
- Attenuated Bloom Filters, Rhea, Kubiatowicz [2002]
- Compact Approximator of Lattice Functions, Boldi, Vigna [2004]

77 Where they are really used
- routing: probabilistic location and routing; shortest path distance information
- proxy: Web proxy cache in SQUID; distributed caching
- peer-to-peer: summarize the contents
- spell checking: the original B. Bloom idea...

79 References: foundations
- B. Bloom. Space/time tradeoffs in hash coding with allowable errors. CACM, 13(7), 1970.
- Saar Cohen, Yossi Matias. Spectral bloom filters. ACM SIGMOD 03, 2003.
- B. Chazelle, J. Kilian, R. Rubinfeld, A. Tal. The Bloomier Filter: An Efficient Data Structure for Static Support Lookup Tables. Proceedings of 15th SODA, 30-39, 2004.
- Bose, Guo, Kranakis, Maheshwari, Morin, Morrison, Smid, Tang. On the false-positive rate of Bloom Filters. School of Computer Science, Carleton University, 2004.

80 References: extras
- M. Mitzenmacher. Compressed Bloom Filters. In Proceedings of 20th ACM SIGACT-SIGOPS, 2002.
- P. Boldi, S. Vigna. Compact Approximation of Lattice Functions with Applications to Large-Alphabet Text Search. Dipartimento di Scienze dell'Informazione, Università di Milano, 2004.
- A. Broder, M. Mitzenmacher. Network Applications of Bloom Filters: A Survey. In Proceedings of 40th Allerton Conference, 2002.


Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions

Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Algorithm Design CS 515 Fall 2015 Sample Final Exam Solutions Copyright c 2015 Andrew Klapper. All rights reserved. 1. For the functions satisfying the following three recurrences, determine which is the

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018

CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis. Ruth Anderson Winter 2018 CSE332: Data Structures & Parallelism Lecture 2: Algorithm Analysis Ruth Anderson Winter 2018 Today Algorithm Analysis What do we care about? How to compare two algorithms Analyzing Code Asymptotic Analysis

More information

Lecture 3 Sept. 4, 2014

Lecture 3 Sept. 4, 2014 CS 395T: Sublinear Algorithms Fall 2014 Prof. Eric Price Lecture 3 Sept. 4, 2014 Scribe: Zhao Song In today s lecture, we will discuss the following problems: 1. Distinct elements 2. Turnstile model 3.

More information

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park

Insert Sorted List Insert as the Last element (the First element?) Delete Chaining. 2 Slide courtesy of Dr. Sang-Eon Park 1617 Preview Data Structure Review COSC COSC Data Structure Review Linked Lists Stacks Queues Linked Lists Singly Linked List Doubly Linked List Typical Functions s Hash Functions Collision Resolution

More information

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we

As mentioned, we will relax the conditions of our dictionary data structure. The relaxations we CSE 203A: Advanced Algorithms Prof. Daniel Kane Lecture : Dictionary Data Structures and Load Balancing Lecture Date: 10/27 P Chitimireddi Recap This lecture continues the discussion of dictionary data

More information

Bloom Filters and Locality-Sensitive Hashing

Bloom Filters and Locality-Sensitive Hashing Randomized Algorithms, Summer 2016 Bloom Filters and Locality-Sensitive Hashing Instructor: Thomas Kesselheim and Kurt Mehlhorn 1 Notation Lecture 4 (6 pages) When e talk about the probability of an event,

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/26/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 More algorithms

More information

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a

Hash Tables. Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a Hash Tables Given a set of possible keys U, such that U = u and a table of m entries, a Hash function h is a mapping from U to M = {1,..., m}. A collision occurs when two hashed elements have h(x) =h(y).

More information

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing)

CS5314 Randomized Algorithms. Lecture 15: Balls, Bins, Random Graphs (Hashing) CS5314 Randomized Algorithms Lecture 15: Balls, Bins, Random Graphs (Hashing) 1 Objectives Study various hashing schemes Apply balls-and-bins model to analyze their performances 2 Chain Hashing Suppose

More information

CS246 Final Exam. March 16, :30AM - 11:30AM

CS246 Final Exam. March 16, :30AM - 11:30AM CS246 Final Exam March 16, 2016 8:30AM - 11:30AM Name : SUID : I acknowledge and accept the Stanford Honor Code. I have neither given nor received unpermitted help on this examination. (signed) Directions

More information

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence. Mikkel Thorup University of Copenhagen

Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence. Mikkel Thorup University of Copenhagen Bottom-k and Priority Sampling, Set Similarity and Subset Sums with Minimal Independence Mikkel Thorup University of Copenhagen Min-wise hashing [Broder, 98, Alta Vita] Jaccard similary of sets A and B

More information

Computational Complexity - Pseudocode and Recursions

Computational Complexity - Pseudocode and Recursions Computational Complexity - Pseudocode and Recursions Nicholas Mainardi 1 Dipartimento di Elettronica e Informazione Politecnico di Milano nicholas.mainardi@polimi.it June 6, 2018 1 Partly Based on Alessandro

More information

Hash tables. Hash tables

Hash tables. Hash tables Basic Probability Theory Two events A, B are independent if Conditional probability: Pr[A B] = Pr[A] Pr[B] Pr[A B] = Pr[A B] Pr[B] The expectation of a (discrete) random variable X is E[X ] = k k Pr[X

More information

1 Approximate Quantiles and Summaries

1 Approximate Quantiles and Summaries CS 598CSC: Algorithms for Big Data Lecture date: Sept 25, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri Suppose we have a stream a 1, a 2,..., a n of objects from an ordered universe. For simplicity

More information

Shannon-Fano-Elias coding

Shannon-Fano-Elias coding Shannon-Fano-Elias coding Suppose that we have a memoryless source X t taking values in the alphabet {1, 2,..., L}. Suppose that the probabilities for all symbols are strictly positive: p(i) > 0, i. The

More information

CS 61B Asymptotic Analysis Fall 2017

CS 61B Asymptotic Analysis Fall 2017 CS 61B Asymptotic Analysis Fall 2017 1 More Running Time Give the worst case and best case running time in Θ( ) notation in terms of M and N. (a) Assume that slam() Θ(1) and returns a boolean. 1 public

More information

CS 170 Algorithms Fall 2014 David Wagner MT2

CS 170 Algorithms Fall 2014 David Wagner MT2 CS 170 Algorithms Fall 2014 David Wagner MT2 PRINT your name:, (last) SIGN your name: (first) Your Student ID number: Your Unix account login: cs170- The room you are sitting in right now: Name of the

More information

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR)

Databases. DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR) Databases DBMS Architecture: Hashing Techniques (RDBMS) and Inverted Indexes (IR) References Hashing Techniques: Elmasri, 7th Ed. Chapter 16, section 8. Cormen, 3rd Ed. Chapter 11. Inverted indexing: Elmasri,

More information

Probabilistic Counting with Randomized Storage

Probabilistic Counting with Randomized Storage Probabilistic Counting with Randomized Storage Benjamin Van Durme University of Rochester Rochester, NY 14627, USA Ashwin Lall Georgia Institute of Technology Atlanta, GA 30332, USA Abstract Previous work

More information

CSCI Honor seminar in algorithms Homework 2 Solution

CSCI Honor seminar in algorithms Homework 2 Solution CSCI 493.55 Honor seminar in algorithms Homework 2 Solution Saad Mneimneh Visiting Professor Hunter College of CUNY Problem 1: Rabin-Karp string matching Consider a binary string s of length n and another

More information

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Lecture 5: Hashing. David Woodruff Carnegie Mellon University Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of

More information

Application: Bucket Sort

Application: Bucket Sort 5.2.2. Application: Bucket Sort Bucket sort breaks the log) lower bound for standard comparison-based sorting, under certain assumptions on the input We want to sort a set of =2 integers chosen I+U@R from

More information

i=1 i B[i] B[i] + A[i, j]; c n for j n downto i + 1 do c n i=1 (n i) C[i] C[i] + A[i, j]; c n

i=1 i B[i] B[i] + A[i, j]; c n for j n downto i + 1 do c n i=1 (n i) C[i] C[i] + A[i, j]; c n Fundamental Algorithms Homework #1 Set on June 25, 2009 Due on July 2, 2009 Problem 1. [15 pts] Analyze the worst-case time complexity of the following algorithms,and give tight bounds using the Theta

More information

1 Estimating Frequency Moments in Streams

1 Estimating Frequency Moments in Streams CS 598CSC: Algorithms for Big Data Lecture date: August 28, 2014 Instructor: Chandra Chekuri Scribe: Chandra Chekuri 1 Estimating Frequency Moments in Streams A significant fraction of streaming literature

More information

Electrical & Computer Engineering University of Waterloo Canada February 6, 2007

Electrical & Computer Engineering University of Waterloo Canada February 6, 2007 Lecture 9: Lecture 9: Electrical & Computer Engineering University of Waterloo Canada February 6, 2007 Hash tables Lecture 9: Recall that a hash table consists of m slots into which we are placing items;

More information

Algorithms lecture notes 1. Hashing, and Universal Hash functions

Algorithms lecture notes 1. Hashing, and Universal Hash functions Algorithms lecture notes 1 Hashing, and Universal Hash functions Algorithms lecture notes 2 Can we maintain a dictionary with O(1) per operation? Not in the deterministic sense. But in expectation, yes.

More information

Computer Networks 55 (2011) Contents lists available at ScienceDirect. Computer Networks. journal homepage:

Computer Networks 55 (2011) Contents lists available at ScienceDirect. Computer Networks. journal homepage: Computer Networks 55 (2) 84 89 Contents lists available at ScienceDirect Computer Networks journal homepage: www.elsevier.com/locate/comnet A Generalized Bloom Filter to Secure Distributed Network Applications

More information

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

More information

Introduction. An Introduction to Algorithms and Data Structures

Introduction. An Introduction to Algorithms and Data Structures Introduction An Introduction to Algorithms and Data Structures Overview Aims This course is an introduction to the design, analysis and wide variety of algorithms (a topic often called Algorithmics ).

More information

Succinct Approximate Counting of Skewed Data

Succinct Approximate Counting of Skewed Data Succinct Approximate Counting of Skewed Data David Talbot Google Inc., Mountain View, CA, USA talbot@google.com Abstract Practical data analysis relies on the ability to count observations of objects succinctly

More information

The space complexity of approximating the frequency moments

The space complexity of approximating the frequency moments The space complexity of approximating the frequency moments Felix Biermeier November 24, 2015 1 Overview Introduction Approximations of frequency moments lower bounds 2 Frequency moments Problem Estimate

More information

DATA MINING LECTURE 3. Frequent Itemsets Association Rules

DATA MINING LECTURE 3. Frequent Itemsets Association Rules DATA MINING LECTURE 3 Frequent Itemsets Association Rules This is how it all started Rakesh Agrawal, Tomasz Imielinski, Arun N. Swami: Mining Association Rules between Sets of Items in Large Databases.

More information

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard

More information

Lecture 8 HASHING!!!!!

Lecture 8 HASHING!!!!! Lecture 8 HASHING!!!!! Announcements HW3 due Friday! HW4 posted Friday! Q: Where can I see examples of proofs? Lecture Notes CLRS HW Solutions Office hours: lines are long L Solutions: We will be (more)

More information

Integer Sorting on the word-ram

Integer Sorting on the word-ram Integer Sorting on the word-rm Uri Zwick Tel viv University May 2015 Last updated: June 30, 2015 Integer sorting Memory is composed of w-bit words. rithmetical, logical and shift operations on w-bit words

More information

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181. Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität

More information

Bloom Filters, Minhashes, and Other Random Stuff

Bloom Filters, Minhashes, and Other Random Stuff Bloom Filters, Minhashes, and Other Random Stuff Brian Brubach University of Maryland, College Park StringBio 2018, University of Central Florida What? Probabilistic Space-efficient Fast Not exact Why?

More information

compare to comparison and pointer based sorting, binary trees

compare to comparison and pointer based sorting, binary trees Admin Hashing Dictionaries Model Operations. makeset, insert, delete, find keys are integers in M = {1,..., m} (so assume machine word size, or unit time, is log m) can store in array of size M using power:

More information

Introduction to Algorithms March 10, 2010 Massachusetts Institute of Technology Spring 2010 Professors Piotr Indyk and David Karger Quiz 1

Introduction to Algorithms March 10, 2010 Massachusetts Institute of Technology Spring 2010 Professors Piotr Indyk and David Karger Quiz 1 Introduction to Algorithms March 10, 2010 Massachusetts Institute of Technology 6.006 Spring 2010 Professors Piotr Indyk and David Karger Quiz 1 Quiz 1 Do not open this quiz booklet until directed to do

More information

Lecture Lecture 25 November 25, 2014

Lecture Lecture 25 November 25, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,

More information

Average Case Analysis. October 11, 2011

Average Case Analysis. October 11, 2011 Average Case Analysis October 11, 2011 Worst-case analysis Worst-case analysis gives an upper bound for the running time of a single execution of an algorithm with a worst-case input and worst-case random

More information

N/4 + N/2 + N = 2N 2.

N/4 + N/2 + N = 2N 2. CS61B Summer 2006 Instructor: Erin Korber Lecture 24, 7 Aug. 1 Amortized Analysis For some of the data structures we ve discussed (namely hash tables and splay trees), it was claimed that the average time

More information

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1

Lecture 10. Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Lecture 10 Sublinear Time Algorithms (contd) CSC2420 Allan Borodin & Nisarg Shah 1 Recap Sublinear time algorithms Deterministic + exact: binary search Deterministic + inexact: estimating diameter in a

More information

Searching, mainly via Hash tables

Searching, mainly via Hash tables Data structures and algorithms Part 11 Searching, mainly via Hash tables Petr Felkel 26.1.2007 Topics Searching Hashing Hash function Resolving collisions Hashing with chaining Open addressing Linear Probing

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures. Modified from Jeff Ullman Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Distance Measures Modified from Jeff Ullman Goals Many Web-mining problems can be expressed as finding similar sets:. Pages

More information

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing

DATA MINING LECTURE 6. Similarity and Distance Sketching, Locality Sensitive Hashing DATA MINING LECTURE 6 Similarity and Distance Sketching, Locality Sensitive Hashing SIMILARITY AND DISTANCE Thanks to: Tan, Steinbach, and Kumar, Introduction to Data Mining Rajaraman and Ullman, Mining

More information

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts)

Quiz 1 Solutions. Problem 2. Asymptotics & Recurrences [20 points] (3 parts) Introduction to Algorithms October 13, 2010 Massachusetts Institute of Technology 6.006 Fall 2010 Professors Konstantinos Daskalakis and Patrick Jaillet Quiz 1 Solutions Quiz 1 Solutions Problem 1. We

More information

Lecture 4 February 2nd, 2017

Lecture 4 February 2nd, 2017 CS 224: Advanced Algorithms Spring 2017 Prof. Jelani Nelson Lecture 4 February 2nd, 2017 Scribe: Rohil Prasad 1 Overview In the last lecture we covered topics in hashing, including load balancing, k-wise

More information

Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004

Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004 Secure Indexes* Eu-Jin Goh Stanford University 15 March 2004 * Generalizes an early version of my paper How to search on encrypted data on eprint Cryptology Archive on 7 October 2003 Secure Indexes Data

More information

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts

Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Adapting Boyer-Moore-Like Algorithms for Searching Huffman Encoded Texts Domenico Cantone Simone Faro Emanuele Giaquinta Department of Mathematics and Computer Science, University of Catania, Italy 1 /

More information

the subset partial order Paul Pritchard Technical Report CIT School of Computing and Information Technology

the subset partial order Paul Pritchard Technical Report CIT School of Computing and Information Technology A simple sub-quadratic algorithm for computing the subset partial order Paul Pritchard P.Pritchard@cit.gu.edu.au Technical Report CIT-95-04 School of Computing and Information Technology Grith University

More information

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org.

Streaming - 2. Bloom Filters, Distinct Item counting, Computing moments. credits:www.mmds.org. Streaming - 2 Bloom Filters, Distinct Item counting, Computing moments credits:www.mmds.org http://www.mmds.org Outline More algorithms for streams: 2 Outline More algorithms for streams: (1) Filtering

More information

Deterministic Finite Automaton (DFA)

Deterministic Finite Automaton (DFA) 1 Lecture Overview Deterministic Finite Automata (DFA) o accepting a string o defining a language Nondeterministic Finite Automata (NFA) o converting to DFA (subset construction) o constructed from a regular

More information

1 Substitution method

1 Substitution method Recurrence Relations we have discussed asymptotic analysis of algorithms and various properties associated with asymptotic notation. As many algorithms are recursive in nature, it is natural to analyze

More information

Analysis of Algorithms I: Perfect Hashing

Analysis of Algorithms I: Perfect Hashing Analysis of Algorithms I: Perfect Hashing Xi Chen Columbia University Goal: Let U = {0, 1,..., p 1} be a huge universe set. Given a static subset V U of n keys (here static means we will never change the

More information

CS 580: Algorithm Design and Analysis

CS 580: Algorithm Design and Analysis CS 580: Algorithm Design and Analysis Jeremiah Blocki Purdue University Spring 2018 Reminder: Homework 1 due tonight at 11:59PM! Recap: Greedy Algorithms Interval Scheduling Goal: Maximize number of meeting

More information

Chapter 2. Recurrence Relations. Divide and Conquer. Divide and Conquer Strategy. Another Example: Merge Sort. Merge Sort Example. Merge Sort Example

Chapter 2. Recurrence Relations. Divide and Conquer. Divide and Conquer Strategy. Another Example: Merge Sort. Merge Sort Example. Merge Sort Example Recurrence Relations Chapter 2 Divide and Conquer Equation or an inequality that describes a function by its values on smaller inputs. Recurrence relations arise when we analyze the running time of iterative

More information