CSE 190, Great ideas in algorithms: Pairwise independent hash functions

1 Hash functions

The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required guarantees, we need not just one function, but a family of functions, from which we use randomness to sample a hash function. Let H = {h : U → R} be a family of functions, mapping elements from a (large) universe to a (small) range. Ideally, we would like this family to have certain random-like properties while keeping its size small. This is important for applications in data structures and streaming algorithms, where we need to store a seed telling us which hash function we chose, as well as in de-randomization, where we can replace true randomness with pseudo-randomness obtained by enumerating all the hash functions in the family.

We start by describing one of the most basic but very useful properties we can require from hash functions, that of being pairwise independent.

Definition 1.1 (Pairwise independent hash functions). A family H = {h : U → R} is said to be pairwise independent if for any two distinct elements x_1 ≠ x_2 ∈ U, and any two (possibly equal) values y_1, y_2 ∈ R,

    Pr_{h∈H}[h(x_1) = y_1 and h(x_2) = y_2] = 1/|R|^2.

2 Pairwise independent bits

We will construct a family of hash functions H = {h : U → R} where R = {0, 1}. We will assume that |U| = 2^k, by possibly increasing the universe size, and identify U = {0, 1}^k. Define the following family of hash functions:

    H = {h_{a,b}(x) = ⟨a, x⟩ + b (mod 2) : a ∈ {0, 1}^k, b ∈ {0, 1}}.

It is easy to see that |H| = 2^{k+1} = 2|U|. In order to show that H is pairwise independent, we need the following simple claim.

Claim 2.1. Let x ∈ {0, 1}^k, x ≠ 0. Then

    Pr_{a∈{0,1}^k}[⟨a, x⟩ ≡ 0 (mod 2)] = 1/2.

Proof. Assume x_i = 1 for some i ∈ [k]. Then

    Pr_{a∈{0,1}^k}[⟨a, x⟩ ≡ 0 (mod 2)] = Pr[a_i ≡ Σ_{j≠i} a_j x_j (mod 2)] = 1/2.

Lemma 2.2. H as described above is pairwise independent.

Proof. Fix x_1 ≠ x_2 ∈ {0, 1}^k and y_1, y_2 ∈ {0, 1}. All the calculations of ⟨a, x⟩ + b below are modulo 2. We need to prove

    Pr_{a∈{0,1}^k, b∈{0,1}}[⟨a, x_1⟩ + b = y_1 and ⟨a, x_2⟩ + b = y_2] = 1/4.

If we randomize only over a, then by the claim, for any y ∈ {0, 1},

    Pr_a[⟨a, x_1⟩ − ⟨a, x_2⟩ = y] = Pr_a[⟨a, x_1 − x_2⟩ = y] = 1/2.

Randomizing also over b gives us the desired result:

    Pr_{a,b}[⟨a, x_1⟩ + b = y_1 and ⟨a, x_2⟩ + b = y_2]
      = Pr_{a,b}[⟨a, x_1⟩ − ⟨a, x_2⟩ = y_1 − y_2 and b = y_1 − ⟨a, x_1⟩]
      = (1/2) · (1/2) = 1/4.

An equivalent way to view this is as follows. We can generate a joint distribution over n = |U| bits, such that any pair of bits is uniform. Let U = {u_1, ..., u_n}. To generate a random binary string x_1, ..., x_n, we sample h ∈ H uniformly and set x_i = h(u_i). Our distribution has support of size at most |H| = 2n = O(n). That is, only O(n) binary strings are possible. This is much fewer than the uniform distribution over n bits, which is supported on all 2^n binary strings. In particular, we can represent a string by specifying the hash function which generated it, which takes only log|H| = log n + O(1) bits.

Example 2.3. For n = 4, we get that sampling a uniform string from the following set of 8 = 2^3 strings is pairwise independent: {0000, 0011, 0101, 0110, 1001, 1010, 1100, 1111}.
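To make the construction concrete, here is a short Python sketch (an illustration, not part of the original notes) that builds the family H = {h_{a,b}} for k = 2, reproduces the 8 strings of Example 2.3, and checks the condition of Definition 1.1 by brute force.

    from itertools import product

    def h(a, b, x):
        # h_{a,b}(x) = <a, x> + b (mod 2), with a and x given as bit tuples
        return (sum(ai * xi for ai, xi in zip(a, x)) + b) % 2

    k = 2
    universe = list(product([0, 1], repeat=k))                  # U = {0,1}^k, n = |U| = 4
    family = [(a, b) for a in product([0, 1], repeat=k) for b in (0, 1)]  # |H| = 2^(k+1) = 8

    # One string per hash function, x_i = h(u_i); this recovers Example 2.3.
    strings = {"".join(str(h(a, b, u)) for u in universe) for (a, b) in family}
    print(sorted(strings))   # ['0000', '0011', '0101', '0110', '1001', '1010', '1100', '1111']

    # Definition 1.1: every pair of distinct inputs takes every pair of values
    # with probability exactly 1/|R|^2 = 1/4 over a uniform choice of (a, b).
    for x1, x2 in product(universe, repeat=2):
        if x1 == x2:
            continue
        for y1, y2 in product([0, 1], repeat=2):
            count = sum(1 for (a, b) in family
                        if h(a, b, x1) == y1 and h(a, b, x2) == y2)
            assert count * 4 == len(family)
    print("pairwise independence verified")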

2.1 Application: de-randomized MAXCUT

Let G = (V, E) be a simple undirected graph. For S ⊆ V let

    E(S, S^c) = |{(u, v) ∈ E : u ∈ S, v ∈ S^c}|

be the number of edges which cross the cut S. The MAXCUT problem asks to find the maximal number of edges in a cut:

    MAXCUT(G) = max_{S⊆V} E(S, S^c).

Computing the MAXCUT of a graph is known to be NP-hard. Still, there is a simple randomized algorithm which approximates it within a factor of 2. Let n = |V| and V = {v_1, ..., v_n}.

Lemma 2.4. Let x_1, ..., x_n ∈ {0, 1} be uniformly and independently chosen. Set S = {v_i : x_i = 1}. Then

    E_S[E(S, S^c)] ≥ |E(G)|/2 ≥ MAXCUT(G)/2.

Proof. For any choice of S we have

    E(S, S^c) = Σ_{(v_i,v_j)∈E} 1_{v_i∈S} 1_{v_j∈S^c}.

Note that every undirected edge {u, v} in G is counted twice in the calculation above, once as (u, v) and once as (v, u). However, clearly at most one of these is in E(S, S^c). By linearity of expectation, the expected size of the cut is

    E_S[E(S, S^c)] = Σ_{(v_i,v_j)∈E} E[1_{v_i∈S} 1_{v_j∉S}] = Σ_{(v_i,v_j)∈E} E[1_{x_i=1} 1_{x_j=0}]
                   = Σ_{(v_i,v_j)∈E} Pr[x_i = 1 and x_j = 0] = 2|E(G)| · (1/4) = |E(G)|/2.

This implies that a random choice of S has a non-negligible probability of giving a 2-approximation.

Corollary 2.5. Pr_S[E(S, S^c) ≥ |E(G)|/2] ≥ 1/(2|E(G)|) ≥ 1/n^2.

Proof. Let X = E(S, S^c) be a random variable counting the number of edges in a random cut. Let µ = |E(G)|/2, where we know that E[X] ≥ µ. Note that whenever X < µ, we in fact have X ≤ µ − 1/2, since X is an integer and µ a half-integer. Also, note that always X ≤ |E(G)| = 2µ. Let p = Pr[X ≥ µ]. Then

    E[X] = E[X | X ≥ µ] Pr[X ≥ µ] + E[X | X ≤ µ − 1/2] Pr[X ≤ µ − 1/2]
         ≤ 2µ · p + (µ − 1/2) · (1 − p) ≤ µ − 1/2 + 2µp.

So we must have 2µp ≥ 1/2, which means that p ≥ 1/(4µ) = 1/(2|E(G)|).
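The following hypothetical Python sketch implements the repeated-sampling algorithm suggested by Corollary 2.5: sample many uniformly random cuts and keep the best one. (The derandomization discussed next replaces the random bits with the O(n) strings of a pairwise independent family, so only O(n) cuts need to be tested.)

    import random

    def cut_size(edges, S):
        # number of edges {u, v} with exactly one endpoint in S
        return sum(1 for (u, v) in edges if (u in S) != (v in S))

    def best_random_cut(vertices, edges, trials):
        # Sample uniformly random cuts and keep the best; by Corollary 2.5,
        # O(n^2) trials find a cut of size >= |E|/2 with high probability.
        best_S, best = set(), -1
        for _ in range(trials):
            S = {v for v in vertices if random.random() < 0.5}
            c = cut_size(edges, S)
            if c > best:
                best_S, best = S, c
        return best_S, best

    # Example: a 4-cycle; a bipartition into opposite pairs cuts all 4 edges.
    vertices = [0, 1, 2, 3]
    edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
    n = len(vertices)
    S, size = best_random_cut(vertices, edges, trials=4 * n * n)
    print(S, size, ">= |E|/2 =", len(edges) / 2)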

So in particular, we can sample O(n^2) sets S, compute for each one its cut size, and we are guaranteed that with high probability, the maximum will be at least |E(G)|/2.

We can derandomize this randomized algorithm using pairwise independent bits. As a side benefit, it reduces the computation from testing O(n^2) sets to testing only O(n) sets.

Lemma 2.6. Let x_1, ..., x_n ∈ {0, 1} be pairwise independent bits. Set S = {v_i : x_i = 1}. Then

    E_S[E(S, S^c)] ≥ |E(G)|/2.

Proof. The only place where we used the fact that the bits were uniform was in the calculation that Pr[x_i = 1 and x_j = 0] = 1/4 for all distinct i, j. However, this is also true for pairwise independent bits.

In particular, for one of the O(n) sets S that we generate in the algorithm, E(S, S^c) must be at least the average, and hence E(S, S^c) ≥ |E(G)|/2.

2.2 Optimal sample size for pairwise independent bits

The previous application showed the usefulness of having small sample spaces for pairwise independent bits. We saw that we can generate O(n) binary strings of length n, such that choosing one of them uniformly gives us pairwise independent bits. We next show that this is optimal.

Lemma 2.7. Let X ⊆ {0, 1}^n. Assume that for any distinct i, j ∈ [n] and any b, b' ∈ {0, 1} we have Pr_{x∈X}[x_i = b and x_j = b'] = 1/4. Then |X| ≥ n.

Proof. Let m = |X| and suppose X = {x^1, ..., x^m}. For any i ∈ [n], construct a real vector v_i ∈ {−1, 1}^m as follows: (v_i)_l = (−1)^{(x^l)_i}. We will show that the vectors {v_1, ..., v_n} are linearly independent over the reals, and hence span a subspace of dimension n inside R^m. Hence, m ≥ n.

To do that, we first show that ⟨v_i, v_j⟩ = 0 for all i ≠ j. To see that, note that

    ⟨v_i, v_j⟩ = Σ_{x∈X} (−1)^{x_i} (−1)^{x_j} = Σ_{x∈X} (−1)^{x_i + x_j}
               = |{x ∈ X : x_i = x_j}| − |{x ∈ X : x_i ≠ x_j}|.

By our assumption however, |{x ∈ X : x_i = x_j}| = |{x ∈ X : x_i ≠ x_j}| = |X|/2 and hence ⟨v_i, v_j⟩ = 0.

Now, assume towards contradiction that v_1, ..., v_n are linearly dependent. Then there exist coefficients α_1, ..., α_n ∈ R, not all zero, such that Σ_i α_i v_i = 0. However, for any j ∈ [n], we have

    0 = ⟨Σ_i α_i v_i, v_j⟩ = Σ_i α_i ⟨v_i, v_j⟩ = α_j ⟨v_j, v_j⟩ = |X| α_j.

So we must have α_j = 0 for all j, a contradiction.

3 Hash functions with large ranges

We now consider the problem of constructing a family of hash functions H = {h : U → R} for large R. For simplicity, we will assume that |R| is prime, although this requirement can be somewhat relaxed. So, let us identify R = F_p for a prime p. We may assume that |U| = p^k, by possibly increasing the size of the universe by a factor of at most p. So, we can identify U = F_p^k. Define the following family of hash functions:

    H = {h_{a,b}(x) = ⟨a, x⟩ + b : a ∈ F_p^k, b ∈ F_p}.

Note that |H| = p^{k+1} = |U| · |R|. In order to show that H is pairwise independent, we need the following generalized claim.

Claim 3.1. Let x ∈ F_p^k, x ≠ 0. Then for any y ∈ F_p,

    Pr_{a∈F_p^k}[⟨a, x⟩ = y] = 1/p.

Proof. Assume x_i ≠ 0 for some i ∈ [k]. Then

    Pr_{a∈F_p^k}[⟨a, x⟩ = y] = Pr[a_i x_i = y − Σ_{j≠i} a_j x_j].

Now, for every fixing of {a_j : j ≠ i}, we have that a_i x_i is uniformly distributed in F_p, hence the probability that it equals any particular value is exactly 1/p.

Lemma 3.2. H as described above is pairwise independent.

Proof. Fix x_1 ≠ x_2 ∈ F_p^k and y_1, y_2 ∈ F_p. All the calculations of ⟨a, x⟩ + b below are in F_p. We need to prove

    Pr_{a∈F_p^k, b∈F_p}[⟨a, x_1⟩ + b = y_1 and ⟨a, x_2⟩ + b = y_2] = 1/p^2.

If we randomize only over a, then by the claim, for any y ∈ F_p,

    Pr_a[⟨a, x_1⟩ − ⟨a, x_2⟩ = y] = Pr_a[⟨a, x_1 − x_2⟩ = y] = 1/p.

Randomizing also over b gives us the desired result:

    Pr_{a,b}[⟨a, x_1⟩ + b = y_1 and ⟨a, x_2⟩ + b = y_2]
      = Pr_{a,b}[⟨a, x_1⟩ − ⟨a, x_2⟩ = y_1 − y_2 and b = y_1 − ⟨a, x_1⟩]
      = (1/p) · (1/p) = 1/p^2.

3.1 Application: collision free hashing

Let S ⊆ U be a set of objects. A hash function h : U → R is said to be collision free for S if it is injective on S. That is, h(x) ≠ h(y) for all distinct x, y ∈ S. We will show that if R is large enough, then any pairwise independent hash family contains many collision free hash functions for any small set S. This is extremely useful: it allows us to give lossless compression of elements from a large universe to a small range.

Lemma 3.3. Let H = {h : U → R} be a pairwise independent hash family. Let S ⊆ U be a set of size |S|^2 ≤ |R|. Then

    Pr_{h∈H}[h is collision free for S] ≥ 1/2.

Proof. Let h ∈ H be uniformly chosen, and let X be a random variable that counts the number of collisions in S. That is,

    X = Σ_{x≠y∈S} 1_{h(x)=h(y)},

where the sum is over unordered pairs. The expected value of X is

    E[X] = Σ_{x≠y∈S} Pr_{h∈H}[h(x) = h(y)] = (|S| choose 2) · (1/|R|) ≤ |S|^2 / (2|R|) ≤ 1/2.

Let p = Pr[X ≥ 1] be the probability that there is at least one collision. Then

    E[X] = E[X | X = 0] Pr[X = 0] + E[X | X ≥ 1] Pr[X ≥ 1] ≥ p.

So, p ≤ 1/2, and hence at least half the functions h ∈ H are collision free for S.
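As an illustration (assumptions: a concrete prime p = 101 and a small example set S, neither taken from the notes), the following Python sketch samples h_{a,b}(x) = ⟨a, x⟩ + b over F_p and estimates the fraction of collision free functions for S; Lemma 3.3 guarantees this fraction is at least 1/2 whenever |S|^2 ≤ |R| = p.

    import random

    p = 101    # a prime, so R = F_p with |R| = 101
    k = 2      # universe U = F_p^k

    def sample_hash():
        # draw h_{a,b}(x) = <a, x> + b (mod p) uniformly from the family H
        a = [random.randrange(p) for _ in range(k)]
        b = random.randrange(p)
        return lambda x: (sum(ai * xi for ai, xi in zip(a, x)) + b) % p

    # An example set S of size 10, so |S|^2 = 100 <= p and Lemma 3.3 applies.
    S = [(i, (i * i) % p) for i in range(10)]

    def collision_free(h, S):
        return len({h(x) for x in S}) == len(S)

    trials = 10_000
    good = sum(collision_free(sample_hash(), S) for _ in range(trials))
    print(f"fraction collision free: {good / trials:.3f} (Lemma 3.3 guarantees >= 0.5)")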

4 Efficient dictionaries: storing sets efficiently

We now show how to use pairwise independent hash functions in order to design efficient dictionaries. Fix a universe U. For simplicity, we will assume that for any R we have a family of pairwise independent hash functions H = {h : U → R}; note that while our previous construction required |R| to be a prime (or in fact, a prime power), this assumption at most doubles the size of the range, which in the end changes our space requirements by only a constant factor.

Given a set S ⊆ U of size |S| = n, we would like to design a data structure which supports queries of the form "is x ∈ S?". Our goal will be to do so while minimizing both the space requirements and the time it takes to answer a query. If we simply store the set as a list of n elements, this takes space O(n log|U|), and queries take time O(n log|U|). We will see that this can be improved via hashing.

First, consider the following simple hashing scheme. Fix a range R = {1, ..., n^2}. Let H = {h : U → [n^2]} be a pairwise independent hash family. We showed that a randomly chosen h ∈ H will be collision free on S with probability at least 1/2. So, we can sample h ∈ H until we find such an h, and we will find one on average after two samples. Let A be an array of length n^2. It will be mostly empty, except that we set A[h(x)] = x for all x ∈ S. Now, to check whether x ∈ S, we compute h(x) and check whether A[h(x)] = x or not. Thus, the query time is only O(log|U|). However, the space requirements are big: to store n elements, we maintain an array of size n^2, which requires at least n^2 bits (and maybe even O(n^2 log|U|), depending on how cleverly your implementation stores the empty cells).

We now describe a two-step hashing scheme due to Fredman, Komlós and Szemerédi which avoids this large waste of space. It uses only O(n log n + log|U|) space, but still allows a query time of O(log|U|). As a preliminary step, we apply the collision free hash scheme we just described. So, we will assume from now on that |U| = O(n^2) and that S ⊆ U is of size |S| = n.

Step 1. We first find a hash function h : U → [n] which has only n collisions. Define

    Coll(h, S) = |{{x, y} ⊆ S : x ≠ y, h(x) = h(y)}|.

If H = {h : U → [n]} is a family of pairwise independent hash functions, then

    E_{h∈H}[Coll(h, S)] = Σ_{{x,y}⊆S} Pr[h(x) = h(y)] = (|S| choose 2) · (1/n) ≤ |S|^2 / (2n) ≤ n/2.

By Markov's inequality, we have

    Pr_{h∈H}[Coll(h, S) ≥ n] ≤ 1/2.

So, after on average two iterations of randomly choosing h ∈ H, we find a function h : U → [n] such that Coll(h, S) ≤ n. We fix it from now on. Note that it is represented using only O(log n) bits.
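A possible Python sketch of Step 1 (my illustration; the hash family below, (a·x + b mod p) mod n with p a prime larger than the universe, is an assumed stand-in for a pairwise independent family and is only approximately so): repeatedly draw a first-level function into [n] until Coll(h, S) ≤ n, which takes two draws in expectation by the Markov argument above.

    import random
    from collections import Counter

    def sample_hash(p, m):
        # (a*x + b mod p) mod m: an assumed stand-in for a pairwise independent
        # family from a universe of size at most p into [m]
        a, b = random.randrange(1, p), random.randrange(p)
        return lambda x: ((a * x + b) % p) % m

    def num_collisions(h, S):
        # Coll(h, S): number of unordered pairs {x, y} in S with h(x) = h(y)
        counts = Counter(h(x) for x in S)
        return sum(c * (c - 1) // 2 for c in counts.values())

    def find_first_level(S, p):
        # expected two iterations, by the Markov bound Pr[Coll >= n] <= 1/2
        n = len(S)
        while True:
            h = sample_hash(p, n)
            if num_collisions(h, S) <= n:
                return h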

Step 2. Next, for any i ∈ [n] let S_i = {x ∈ S : h(x) = i}. Observe that Σ_i |S_i| = n and

    Σ_{i=1}^n (|S_i| choose 2) = Coll(h, S) ≤ n.

Let n_i = |S_i|^2. Note that

    Σ_i n_i = 2 · Coll(h, S) + Σ_i |S_i| ≤ 3n.

We will find hash functions h_i : U → [n_i] which are collision free on S_i. Choosing a uniform hash function from a pairwise independent family H_i = {h : U → [n_i]} succeeds on average after two samples. So, we only need O(n) time to find these functions. As each h_i requires O(log n) bits to be represented, we need in total O(n log n) bits to represent all of them.

Let A be an array of size 3n. Let offset_i = Σ_{j<i} n_j. The sub-array A[offset_i : offset_i + n_i] will be used to store the elements of S_i. Initially A is empty. We set

    A[offset_i + h_i(x)] = x for all x ∈ S_i.

Note that there are no collisions in A, as we are guaranteed that the h_i are collision free on S_i. We will also keep all of {offset_i : i ∈ [n]} in a separate array.

Query. To check whether x ∈ S, we do the following:
- Compute i = h(x).
- Read offset_i.
- Check whether A[offset_i + h_i(x)] = x or not.
This can be computed using O(log n) bit operations.

Space requirements. The hash function h requires O(log n) bits. The hash functions {h_i : i ∈ [n]} require O(n log n) bits. The array A requires O(n log n) bits.

Setup time. The setup algorithm is randomized, as it needs to find good hash functions. Its expected running time is O(n log n) bit operations. To find h takes O(n log n) time, as this is how long it takes to verify that it has at most n collisions. To find each h_i takes O(|S_i| log n) time, and in total this is O(n log n) time. To set up the arrays {offset_i : i ∈ [n]} and A takes O(n log n) time.

RAM model vs bit model. Up until now, we counted bit operations. However, computers can operate on words efficiently. A model for that is the RAM model, where we can perform basic operations on O(log n)-bit words. In this model, it can be verified that the query time is O(1) word operations, the space requirement is O(n) words, and the setup time is O(n) word operations.
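Putting the two steps together, here is a self-contained Python sketch of the whole two-level dictionary (again using the (a·x + b mod p) mod m family as an assumed stand-in for a pairwise independent family, with the Mersenne prime p = 2^31 − 1 and a toy set S chosen purely for illustration).

    import random
    from collections import defaultdict

    P = (1 << 31) - 1    # a Mersenne prime; universe elements are ints in [0, P)

    def sample_hash(m):
        # (a*x + b mod P) mod m: assumed stand-in for a pairwise independent family into [m]
        a, b = random.randrange(1, P), random.randrange(P)
        return lambda x: ((a * x + b) % P) % m

    def build_dictionary(S):
        n = len(S)
        # Step 1: first-level hash with Coll(h, S) <= n (expected two tries).
        while True:
            h = sample_hash(n)
            buckets = defaultdict(list)
            for x in S:
                buckets[h(x)].append(x)
            if sum(len(b) * (len(b) - 1) // 2 for b in buckets.values()) <= n:
                break
        # Step 2: for each bucket of size s_i, a collision free hash into [s_i^2].
        second, tables = [None] * n, [None] * n
        for i in range(n):
            b = buckets.get(i, [])
            m_i = max(len(b) ** 2, 1)
            while True:                       # expected two tries by Lemma 3.3
                h_i = sample_hash(m_i)
                if len({h_i(x) for x in b}) == len(b):
                    break
            T = [None] * m_i
            for x in b:
                T[h_i(x)] = x
            second[i], tables[i] = h_i, T
        return h, second, tables

    def query(x, h, second, tables):
        i = h(x)
        return tables[i][second[i](x)] == x

    S = {9, 17, 42, 313, 1000, 271828}
    h, second, tables = build_dictionary(S)
    print(query(42, h, second, tables), query(43, h, second, tables))    # True False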

5 Bloom filters

Bloom filters allow for even more efficient data structures for set membership, if some errors are allowed. Let U be a universe, and S ⊆ U a subset of size |S| = n. Let h : U → [m] be a uniform hash function, for m to be determined later. The data structure maintains a bit array A of length m, initially set to zero. Then, for every x ∈ S, we set A[h(x)] = 1. To check if x ∈ S we answer yes if A[h(x)] = 1. This has the following guarantees:
- No false negatives: if x ∈ S we will always say yes.
- Few false positives: if x ∉ S, we will say yes with probability |{i : A[i] = 1}| / m, assuming h is a uniform hash function.

So, if we set for example m = 2n, then the probability that we (correctly) say no for x ∉ S is at least 1/2. It is in fact more, since when hashing n elements to 2n values there will be collisions, so the number of 1's in the array will be less than n. In fact, to get probability 1/2 we only need m ≈ 1.44n. This is because the probability that A[i] = 0, over the choice of h, is

    Pr_h[A[i] = 0] = Pr[h(x) ≠ i for all x ∈ S] = (1 − 1/m)^n ≈ exp(−n/m).

So, for m = n/ln(2) ≈ 1.44n, the expected number of 0's in A is m/2.

Note that a Bloom filter uses O(n) bits, which is much less than the O(n log|U|) bits we needed when no errors were allowed. It can also be shown that we don't need h to be fully uniform: an O(log n)-wise independent hash function gives the same guarantees, and it can be stored very efficiently (concretely, using only O(log n) bits).
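A minimal Python sketch of the single-hash Bloom filter described above (the hash below, built from Python's hash() and a random affine map, merely stands in for the uniform hash function assumed in the analysis; m, n and the example keys are illustrative).

    import math
    import random

    class SingleHashBloomFilter:
        # bit array of length m = n / ln 2 with a single hash function:
        # no false negatives, and a false positive rate of roughly 1/2
        def __init__(self, n):
            self.m = math.ceil(n / math.log(2))        # about 1.44 n
            self.p = (1 << 31) - 1                     # a Mersenne prime
            self.a = random.randrange(1, self.p)       # random affine map over hash():
            self.b = random.randrange(self.p)          # a stand-in for a uniform hash
            self.bits = bytearray(self.m)

        def _h(self, x):
            return ((self.a * hash(x) + self.b) % self.p) % self.m

        def add(self, x):
            self.bits[self._h(x)] = 1

        def __contains__(self, x):
            return self.bits[self._h(x)] == 1

    bf = SingleHashBloomFilter(n=1000)
    for i in range(1000):
        bf.add(("user", i))
    print(("user", 17) in bf)                                   # always True: no false negatives
    fp = sum(("other", i) in bf for i in range(10_000)) / 10_000
    print(f"false positive rate ~ {fp:.2f}")                    # about 1/2 for m = n / ln 2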