Hash Tables

Dictionary Definition. A dictionary is a data structure that stores a set of elements, each with a unique key, and supports the following operations: Search(S, k): return the element whose key is k. Insert(S, x): add x to S. Delete(S, x): remove x from S. Assume that the keys are from U = {0, ..., u-1}.

Direct addressing. S is stored in an array T[0..u-1]. The entry T[k] contains a pointer to the element with key k, if such an element exists, and NULL otherwise. Search(T, k): return T[k]. Insert(T, x): T[x.key] ← x. Delete(T, x): T[x.key] ← NULL. What is the problem with this structure? The space usage is Θ(u), even when the number of stored elements is much smaller than u. [Figure: a direct-address table over U = {0, ..., 9}, storing keys 2, 3, 5, 8 with their satellite data.]
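As a concrete illustration, here is a minimal Python sketch of a direct-address table; the class and method names are ours, not from the slides:

```python
class DirectAddressTable:
    """Direct addressing: one slot per possible key in U = {0, ..., u-1}.

    Illustrative sketch. Space is Theta(u) no matter how few elements
    are stored, which is exactly the drawback noted above.
    """

    def __init__(self, u):
        self.table = [None] * u   # T[k] holds the element with key k, or None

    def search(self, k):
        return self.table[k]

    def insert(self, x):
        self.table[x.key] = x     # assumes x has a .key attribute in range

    def delete(self, x):
        self.table[x.key] = None
```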

Hashing. In order to reduce the space complexity of direct addressing, map the keys to a smaller range {0, ..., m-1} using a hash function. This creates the problem of collisions (two or more keys mapped to the same value). There are several methods to handle collisions: chaining and open addressing.

Hash table with chaining. Let h : U → {0, ..., m-1} be a hash function (m < u). S is stored in a table T[0..m-1] of linked lists; the element x ∈ S is stored in the list T[h(x.key)]. Search(T, k): search the list T[h(k)]. Insert(T, x): insert x at the head of T[h(x.key)]. Delete(T, x): delete x from the list containing x. Example: S = {6, 9, 19, 26, 30}, m = 5, h(x) = x mod 5; slot 0 holds 30, slot 1 holds 26 and 6, and slot 4 holds 19 and 9.
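A minimal Python sketch of chaining, reproducing the slide's example; the class name is our choice:

```python
class ChainedHashTable:
    """Hashing with chaining; each slot T[i] is a Python list used as a chain."""

    def __init__(self, m):
        self.m = m
        self.table = [[] for _ in range(m)]

    def _h(self, key):
        return key % self.m                      # the slide's h(x) = x mod m

    def search(self, key):
        for x in self.table[self._h(key)]:       # scan the chain T[h(key)]
            if x == key:
                return x
        return None

    def insert(self, key):
        self.table[self._h(key)].insert(0, key)  # insert at head of the chain

    def delete(self, key):
        self.table[self._h(key)].remove(key)

# The slide's example: S = {6, 9, 19, 26, 30}, m = 5, h(x) = x mod 5.
t = ChainedHashTable(5)
for k in [6, 9, 19, 26, 30]:
    t.insert(k)
# slot 0: [30], slot 1: [26, 6], slot 4: [19, 9]  (heads first)
```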

Analysis: worst case. The worst-case running times of the operations are: Search: Θ(n). Insert: Θ(1). Delete: Θ(n) (can be reduced to Θ(1) by using a doubly linked list).

Analysis. Assumption of simple uniform hashing: any element is equally likely to hash into any of the m slots, independently of where other elements have hashed. This assumption holds when the keys are chosen uniformly and independently at random (with repetitions), and the hash function satisfies |{k ∈ U : h(k) = i}| = u/m for every i ∈ {0, ..., m-1}. We want to analyze the performance of hashing under the assumption of simple uniform hashing. This is the balls-into-bins problem: suppose we randomly place n balls into m bins, and let X be the number of balls in bin 1. The time complexity of a random search in a hash table is Θ(1 + X).

The expectation of X. Each ball has probability 1/m of being in bin 1. The random variable X has binomial distribution with parameters n and 1/m. Therefore, E[X] = n/m.

The distribution of X. Claim: Pr[X = r] ≈ e^(-α) α^r / r!, where α = n/m (i.e., X has approximately Poisson distribution). Example: if α = 1, then Pr[X = 0] ≈ 0.368, Pr[X = 1] ≈ 0.368, Pr[X = 2] ≈ 0.184, Pr[X = 3] ≈ 0.061, Pr[X = 4] ≈ 0.015, Pr[X = 5] ≈ 0.003, Pr[X = 6] ≈ 0.0005, Pr[X = 7] ≈ 0.00007.

The distribution of X. Claim: Pr[X = r] ≈ e^(-α) α^r / r!, where α = n/m. Proof. Pr[X = r] = (n choose r) (1/m)^r (1 - 1/m)^(n-r) = [n(n-1)···(n-r+1) / r!] · (1/m^r) · (1 - 1/m)^(n-r). If m and n are large, n(n-1)···(n-r+1) ≈ n^r and (1 - 1/m)^(n-r) ≈ e^(-n/m). Thus, Pr[X = r] ≈ e^(-n/m) (n/m)^r / r!.
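To check the approximation numerically, this Python snippet (ours, for illustration) compares the exact binomial probability with the Poisson estimate for α = 1:

```python
import math

def exact(n, m, r):
    """Pr[X = r] for n balls in m bins: exact binomial probability."""
    return math.comb(n, r) * (1 / m) ** r * (1 - 1 / m) ** (n - r)

def poisson(alpha, r):
    """Poisson estimate e^(-alpha) * alpha^r / r!."""
    return math.exp(-alpha) * alpha ** r / math.factorial(r)

n = m = 10**6                 # alpha = n/m = 1, as in the table above
for r in range(5):
    print(r, round(exact(n, m, r), 6), round(poisson(1, r), 6))
# Both columns agree to about six decimals: 0.367879, 0.367879, 0.18394, ...
```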

The maximum number of balls in a bin. Suppose that n = m = 10^6. Most bins contain 0-3 balls. The probability that a specific bin contains at least 8 balls is ≈ 0.000009. However, it is very likely that some bin contains 8 balls: with 10^6 bins, the expected number of such bins is about 9. Theorem: if m = Θ(n), then the size of the largest bin is Θ(log n / log log n) with probability at least 1 - 1/n^Θ(1).

Rehashing. We wish to maintain n = O(m) in order to have Θ(1) search time. This can be achieved by rehashing. Suppose we want n ≤ m. When the table has m elements and a new element is inserted, create a new table of size 2m and copy all elements into the new table, rehashing each with a new hash function (in the slide's example, from h(x) = x mod 5 to h'(x) = x mod 10). The cost of rehashing is Θ(n).

Rehashing. In order to keep the space usage Θ(n), we need n = Ω(m). We can keep α ∈ [1/4, 1]: if an insert would make α > 1, create a new table of size 2m; if a delete would make α < 1/4, create a new table of size m/2.
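A sketch of chaining plus rehashing that keeps α in [1/4, 1], following the grow/shrink policy just described; the minimum table size of 8 is an arbitrary choice of ours:

```python
class ResizingChainedHashTable:
    """Chaining with rehashing that keeps alpha = n/m in [1/4, 1]."""

    def __init__(self, m=8):
        self.m = m
        self.n = 0
        self.table = [[] for _ in range(m)]

    def _h(self, key):
        return key % self.m

    def _rehash(self, new_m):
        old = self.table
        self.m = new_m
        self.table = [[] for _ in range(new_m)]
        for chain in old:                        # Theta(n + m) work per rehash
            for key in chain:
                self.table[self._h(key)].append(key)

    def insert(self, key):
        if self.n + 1 > self.m:                  # alpha would exceed 1: grow
            self._rehash(2 * self.m)
        self.table[self._h(key)].insert(0, key)
        self.n += 1

    def delete(self, key):
        self.table[self._h(key)].remove(key)
        self.n -= 1
        if self.m > 8 and self.n < self.m / 4:   # alpha below 1/4: shrink
            self._rehash(self.m // 2)
```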

Universal hash functions. Random keys + a fixed hash function (e.g., mod): the hashed keys are random numbers. A fixed set of keys + a random hash function (selected from a universal set): the hashed keys are "semi-random" numbers.

Universal hash functions. Definition: a collection H of hash functions is universal if for every pair of distinct keys x, y ∈ U, Pr_{h∈H}[h(x) = h(y)] ≤ 1/m. Example: let p be a prime number larger than u, and define f_{a,b}(x) = ((ax + b) mod p) mod m and H_{p,m} = {f_{a,b} : a ∈ {1, 2, ..., p-1}, b ∈ {0, 1, ..., p-1}}.
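A Python sketch of drawing a random function from the family H_{p,m}; the specific prime p = 2^31 - 1 is our choice and assumes keys smaller than p:

```python
import random

# A universal family H_{p,m}: f_{a,b}(x) = ((a*x + b) mod p) mod m,
# with p prime and p > u.  We use the Mersenne prime 2^31 - 1, which
# assumes all keys lie in {0, ..., P-1}.
P = 2_147_483_647

def random_hash(m):
    """Draw f_{a,b} uniformly at random from H_{P,m}."""
    a = random.randrange(1, P)     # a in {1, ..., p-1}
    b = random.randrange(0, P)     # b in {0, ..., p-1}
    return lambda x: ((a * x + b) % P) % m

h = random_hash(100)
print(h(42), h(43))   # any fixed pair of distinct keys collides w.p. <= 1/m
```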

Universal hash functions. Theorem: suppose that H is a universal collection of hash functions. If a hash table for S is built using a randomly chosen h ∈ H, then for every k ∈ U, the expected time of Search(S, k) is Θ(1 + n/m). Proof. Let X be the length of the list T[h(k)]. Then X = Σ_{y∈S} I_y, where I_y = 1 if h(y.key) = h(k) and I_y = 0 otherwise. E[X] = E[Σ_{y∈S} I_y] = Σ_{y∈S} E[I_y] = Σ_{y∈S} Pr_{h∈H}[h(y.key) = h(k)]. If k ∉ S, then E[X] ≤ n · (1/m). Otherwise, E[X] ≤ 1 + (n-1) · (1/m).

Time complexity of chaining. Under the assumption of simple uniform hashing, the expected time of a search is Θ(1 + α). If α = Θ(1), and under the assumption of simple uniform hashing, the worst-case time of a search is Θ(log n / log log n), with probability at least 1 - 1/n^Θ(1). If the hash function is chosen at random from a universal collection, the expected time of a search is Θ(1 + α). The worst-case time of an insert is Θ(1) if there is no rehashing.

Interpreting keys as natural numbers. How can we convert floats or ASCII strings to natural numbers? An ASCII string can be interpreted as a number in base 256. Example: for the string CLRS, the ASCII values of the characters are C = 67, L = 76, R = 82, S = 83, so CLRS is (67 · 256^3) + (76 · 256^2) + (82 · 256^1) + (83 · 256^0) = 1129075283.

Horner's rule: a_d x^d + a_{d-1} x^{d-1} + ··· + a_1 x + a_0 = (···((a_d x + a_{d-1})x + a_{d-2})x + ···)x + a_0. Example: 67 · 256^3 + 76 · 256^2 + 82 · 256^1 + 83 · 256^0 = ((67 · 256 + 76) · 256 + 82) · 256 + 83. If d is large, the value of y is too big, so we evaluate the polynomial modulo p: (1) y ← a_d (2) for i = d-1 downto 0 (3) y ← (a_i + x · y) mod p
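The same computation in Python; the function name and the choice of p are ours:

```python
def horner_hash(s, p):
    """Interpret string s as a base-256 number, evaluated mod p by Horner's rule."""
    y = 0
    for ch in s:
        y = (ord(ch) + 256 * y) % p   # y <- a_i + x*y, reduced mod p each step
    return y

# With p large enough that no reduction occurs, this reproduces 1129075283.
print(horner_hash("CLRS", 2_147_483_647))   # 1129075283, as computed above
```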

Rabin-Karp pattern matching algorithm. Suppose we are given strings P and T, and we want to find all occurrences of P in T. The Rabin-Karp algorithm: compute h(P); for every substring T' of T of length |P|, if h(T') = h(P), check whether T' = P. The values h(T') for all T' can be computed in Θ(|T|) total time using a rolling hash. Example: let T = BGUABC, P = GUAB, T_1 = BGUA, T_2 = GUAB. Then h(T_1) = (66 · 256^3 + 71 · 256^2 + 85 · 256 + 65) mod p, and h(T_2) = (71 · 256^3 + 85 · 256^2 + 65 · 256 + 66) mod p = ((h(T_1) - 66 · 256^3) · 256 + 66) mod p.
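A Python sketch of Rabin-Karp with a rolling hash, under the base-256 interpretation above; the modulus choice is illustrative:

```python
def rabin_karp(T, P, p=2_147_483_647):
    """Return the start indices of all occurrences of P in T (rolling hash mod p)."""
    n, m = len(T), len(P)
    if m > n:
        return []
    base = 256
    top = pow(base, m - 1, p)                 # weight of the character that leaves
    hp = ht = 0
    for i in range(m):                        # initial hashes via Horner's rule
        hp = (ord(P[i]) + base * hp) % p
        ht = (ord(T[i]) + base * ht) % p
    hits = []
    for s in range(n - m + 1):
        if ht == hp and T[s:s + m] == P:      # verify, to rule out false positives
            hits.append(s)
        if s + m < n:                         # roll: drop T[s], append T[s+m]
            ht = ((ht - ord(T[s]) * top) * base + ord(T[s + m])) % p
    return hits

print(rabin_karp("BGUABC", "GUAB"))  # [1], matching the slide's example
```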

Applications Data deduplication: Suppose that we have many files, and some files have duplicates. In order to save storage space, we want to store only one instance of each distinct file. Distributed storage: Suppose we have many files, and we want to store them on several servers.

Distributed storage. Let m denote the number of servers. The simple solution is to use a hash function h : U → {1, ..., m} and assign file x to server h(x). The problem with this solution is that if we add a server, we need to rehash, which moves most files between servers.

Distributed storage: consistent hashing. Suppose that the servers have identifiers s_1, ..., s_m. Let h : U → [0, 1] be a hash function. For each server i, associate the point h(s_i) on the unit circle. Assign each file f to the server whose point is the first point encountered when traversing the unit circle anti-clockwise starting from h(f). [Figure: three server points on the unit circle, each responsible for the arc of files preceding it.]

Distributed storage: consistent hashing. When a new server m+1 is added, let i be the server whose point is the first server point after h(s_{m+1}). We only need to reassign some of the files that were previously assigned to server i. The expected number of file reassignments is n/(m+1). [Figure: a fourth server point added to the circle takes over part of one server's arc.]
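A minimal Python sketch of consistent hashing. Mapping names to [0, 1) via MD5 and traversing the circle in increasing order are our illustrative choices (the slide's anti-clockwise traversal is the same idea with the opposite orientation):

```python
import bisect
import hashlib

def point(name):
    """Map a name to a point on the unit circle, i.e. a value in [0, 1)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16) / 2**128

class ConsistentHash:
    """Servers sit at hashed points on the circle; a file goes to the
    first server point reached from the file's own point."""

    def __init__(self, servers):
        self.points = sorted((point(s), s) for s in servers)

    def server_for(self, filename):
        i = bisect.bisect_left(self.points, (point(filename), ""))
        return self.points[i % len(self.points)][1]   # wrap around the circle

ring = ConsistentHash(["server1", "server2", "server3"])
print(ring.server_for("report.pdf"))   # one of the three servers, stable across calls
```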

Linear probing. In the following, we assume that the elements in the hash table are keys with no satellite information. To insert element k, try inserting at T[h(k)]. If T[h(k)] is not empty, try T[(h(k) + 1) mod m], then T[(h(k) + 2) mod m], etc.

Insert(T, k) (1) j ← h(k) (2) for i = 0 to m-1 (3) if T[j] = NULL (4) T[j] ← k (5) return (6) j ← (j + 1) mod m (7) error "hash table overflow"

Example: with m = 10 and h(k) = k mod 10, inserting 14, 20, 34, 19, 55, 49 into an empty table gives T = [20, 49, -, -, 14, 34, 55, -, -, 19]: 14, 20, and 19 go to their home slots 4, 0, and 9; 34 finds slot 4 occupied and lands in slot 5; 55 finds slot 5 occupied and lands in slot 6; 49 finds slots 9 and 0 occupied and lands in slot 1. A runnable sketch combining insert, search, and deletion appears after the deletion discussion below.

Searching. Search(T, k) (1) j ← h(k) (2) for i = 0 to m-1 (3) if T[j] = k (4) return j (5) if T[j] = NULL (6) return NULL (7) j ← (j + 1) mod m (8) return NULL

Example: in the table above, search(T, 55) probes slots 5 and 6 and returns 6; search(T, 25) probes slots 5, 6, and 7, finds NULL at slot 7, and returns NULL.

Deletion. Delete method 1: to delete element k, store in the slot containing k a special value DELETED. Change line (3) in Insert to: if T[j] = NULL or T[j] = DELETED. Search is unchanged: it skips DELETED slots and stops only at NULL. [Example: after also inserting 45, deleting 34 replaces slot 5 with the DELETED marker.]

Deletion. Delete method 2: erase k from the table (replace it by NULL) and also erase the consecutive block of elements following k. Then, reinsert the erased elements into the table. Example: starting from T = [20, 49, -, -, 14, 34, 55, 45, -, 19], deleting 34 erases 34 together with the block 55, 45 that follows it; reinserting 55 and 45 yields T = [20, 49, -, -, 14, 55, 45, -, -, 19].
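A Python sketch of linear probing with insert, search, and tombstone deletion (method 1), reproducing the slides' running example:

```python
DELETED = object()   # tombstone marker for delete method 1

class LinearProbingTable:
    """Open addressing with linear probing; delete uses a DELETED marker."""

    def __init__(self, m):
        self.m = m
        self.T = [None] * m

    def _h(self, k):
        return k % self.m            # the slides' example: h(k) = k mod 10

    def insert(self, k):
        j = self._h(k)
        for _ in range(self.m):
            if self.T[j] is None or self.T[j] is DELETED:
                self.T[j] = k
                return
            j = (j + 1) % self.m
        raise OverflowError("hash table overflow")

    def search(self, k):
        j = self._h(k)
        for _ in range(self.m):
            if self.T[j] == k:
                return j
            if self.T[j] is None:    # an empty slot ends the probe sequence
                return None
            j = (j + 1) % self.m     # DELETED slots are skipped, not stopped at
        return None

    def delete(self, k):
        j = self.search(k)
        if j is not None:
            self.T[j] = DELETED

t = LinearProbingTable(10)
for k in [14, 20, 34, 19, 55, 49]:
    t.insert(k)
print(t.T)  # [20, 49, None, None, 14, 34, 55, None, None, 19]
```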

Open addressing. Open addressing is a generalization of linear probing. Let h : U × {0, ..., m-1} → {0, ..., m-1} be a hash function such that (h(k, 0), h(k, 1), ..., h(k, m-1)) is a permutation of {0, ..., m-1} for every k ∈ U. The slots examined during search/insert are h(k, 0), then h(k, 1), h(k, 2), etc. Example: with h(k, i) = (h_1(k) + i·h_2(k)) mod 13, h_1(k) = k mod 13, and h_2(k) = 1 + (k mod 11), insert(T, 14) probes h(14, 0) = 1 (occupied by 79), h(14, 1) = 5 (occupied by 98), and h(14, 2) = 9 (empty), so 14 is placed in slot 9.

Open addressing. Insertion is analogous to linear probing: Insert(T, k) (1) for i = 0 to m-1 (2) j ← h(k, i) (3) if T[j] = NULL or T[j] = DELETED (4) T[j] ← k (5) return (6) error "hash table overflow". Deletion is done using delete method 1 defined above (using the special value DELETED).

Double hashing. In the double hashing method, h(k, i) = (h_1(k) + i·h_2(k)) mod m for some hash functions h_1 and h_2. The value h_2(k) must be relatively prime to m for the entire hash table to be searched. This can be ensured by either taking m to be a power of 2 and making the image of h_2 contain only odd numbers, or taking m to be a prime number and making the image of h_2 contain only integers from {1, ..., m-1}. For example, h_1(k) = k mod m and h_2(k) = 1 + (k mod m'), where m is prime and m' < m.
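A small Python sketch of the double-hashing probe sequence, using the slide's example parameters:

```python
def probe_sequence(k, m, m_prime):
    """Double hashing: h(k, i) = (h1(k) + i*h2(k)) mod m, with
    h1(k) = k mod m and h2(k) = 1 + (k mod m').  With m prime and
    m' < m, h2(k) is in {1, ..., m-1}, so it is coprime to m and
    the sequence visits every slot."""
    h1 = k % m
    h2 = 1 + (k % m_prime)
    return [(h1 + i * h2) % m for i in range(m)]

# The slide's example: m = 13, h2(k) = 1 + (k mod 11).
seq = probe_sequence(14, 13, 11)
print(seq[:3])                       # [1, 5, 9]: the probes of insert(T, 14)
assert len(set(seq)) == 13           # the sequence is a permutation of the slots
```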

Time complexity of open addressing. Assume uniform hashing: the probe sequence of each key is equally likely to be any of the m! permutations of {0, ..., m-1}. Assuming uniform hashing and no deletions, the expected number of probes in a search is at most 1/(1-α) for an unsuccessful search, and at most (1/α) ln(1/(1-α)) for a successful search. For example, at α = 1/2 these bounds are 2 probes and 2 ln 2 ≈ 1.39 probes, respectively.

Comparison of hash table methods. [Four plots, omitted: clock cycles as a function of log n (for log n = 8 to 22) for cuckoo hashing, two-way chaining, chained hashing, and linear probing, measuring successful lookup, unsuccessful lookup, insert, and delete.]

Bloom Filter

Blocking malicious web pages. Suppose we want to implement blocking of malicious web pages in a web browser, and assume we have a list of 10,000,000 malicious pages. We can store all pages in a hash table, but this requires a large amount of memory. A Bloom filter is an implementation of a static dictionary that uses a small amount of memory. For a query key that is in the set, the Bloom filter always returns a positive answer. For a query key that is not in the set, the Bloom filter can return either a negative answer or a positive answer (a false positive).

Blocking malicious web pages. Suppose that the list of malicious pages is stored in a Bloom filter. For a query URL, if the answer is negative, we know that the URL is not a malicious page. If the answer is positive, we do not know the correct answer; in this case, we get the answer from a server. We want a small probability of a false positive.

Bloom filter, simple version. Suppose we want to store a set S. We use a bit array A of size m, initialized to 0. We then choose a hash function h and set A[h(x)] ← 1 for every x ∈ S. For a query y: if A[h(y)] = 0, we know that y ∉ S; if A[h(y)] = 1, y may or may not be in S. [Example: with m = 10, inserting two elements sets bits 2 and 5.]

Probability of false positive. Assume simple uniform hashing: any element is equally likely to hash into any of the m slots, independently of where other elements have hashed. For a fixed i, the probability that A[i] = 0 is p = (1 - 1/m)^n. For y ∉ S, the probability that A[h(y)] = 1 is 1 - p.

Reducing the false positive probability. We choose k hash functions h_1, h_2, ..., h_k, and for every x ∈ S set A[h_j(x)] ← 1 for j = 1, 2, ..., k. For a query y: if A[h_j(y)] = 0 for some j, we know that y ∉ S; if A[h_j(y)] = 1 for all j, y may or may not be in S.

Probability of false positive. For a fixed i, the probability that A[i] = 0 is p = (1 - 1/m)^(kn). For y ∉ S, the probability that A[h_j(y)] = 1 for all j is approximately (1 - p)^k.
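A Python sketch of a Bloom filter. Deriving the k hash functions by salting SHA-256 with the function index is our illustrative choice, not part of the slides:

```python
import hashlib

class BloomFilter:
    """Bloom filter with a bit array of size m and k salted hash functions."""

    def __init__(self, m, k):
        self.m, self.k = m, k
        self.A = [0] * m

    def _hashes(self, item):
        # k hash values, derived by salting one digest with the index j.
        for j in range(self.k):
            d = hashlib.sha256(f"{j}:{item}".encode()).hexdigest()
            yield int(d, 16) % self.m

    def add(self, item):
        for i in self._hashes(item):
            self.A[i] = 1

    def might_contain(self, item):
        # False means definitely not in the set; True may be a false positive.
        return all(self.A[i] for i in self._hashes(item))

bf = BloomFilter(m=1000, k=3)
bf.add("http://malicious.example")                    # hypothetical URL
print(bf.might_contain("http://malicious.example"))   # True
print(bf.might_contain("http://benign.example"))      # almost surely False
```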

Optimizing the probability of false positive. For the previous example (n = 2, m = 10): for k = 1, p = 0.81 and (1-p)^1 = 0.19; for k = 2, p ≈ 0.656 and (1-p)^2 ≈ 0.118; for k = 3, p ≈ 0.531 and (1-p)^3 ≈ 0.103; for k = 4, p ≈ 0.430 and (1-p)^4 ≈ 0.105; for k = 5, p ≈ 0.349 and (1-p)^5 ≈ 0.117. The false positive probability is minimized around k = 3.

Optimizing the probability of false positive. Let f(k) be the probability of a false positive. Since p = (1 - 1/m)^(kn) ≈ e^(-kn/m), we have f(k) = (1 - p)^k ≈ (1 - e^(-kn/m))^k = e^(k ln(1 - e^(-kn/m))). Minimizing f(k) is equivalent to minimizing g(k) = k ln(1 - e^(-kn/m)). dg/dk = ln(1 - e^(-kn/m)) + (kn/m) · e^(-kn/m) / (1 - e^(-kn/m)). The minimum of g(k) is attained when dg/dk = 0, which gives k = (m/n) ln 2. The minimum of f(k) is then approximately 0.6185^(m/n). Example: for m = 8n, (m/n) ln 2 ≈ 5.55; f(5) ≈ 0.0217 and f(6) ≈ 0.0216.
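A few lines of Python (ours) that reproduce the numbers in the example:

```python
import math

def false_positive(k, n, m):
    """f(k) = (1 - e^(-kn/m))^k, the approximate false-positive probability."""
    return (1 - math.exp(-k * n / m)) ** k

n, m = 1000, 8000                              # m = 8n, as in the example
best = min(range(1, 12), key=lambda k: false_positive(k, n, m))
print(best)                                    # 6 (k must be an integer)
print((m / n) * math.log(2))                   # 5.545..., the optimum (m/n) ln 2
print(round(false_positive(5, n, m), 4))       # 0.0217
print(round(false_positive(6, n, m), 4))       # 0.0216
```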