Lecture 4: Hashing and Streaming Algorithms

CSE 521: Design and Analysis of Algorithms I                                Winter 2017

Lecture 4: Hashing and Streaming Algorithms

Lecturer: Shayan Oveis Gharan        01/18/2017        Scribe: Yuqing Ai

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

At the end of the previous lecture, we introduced the basic notions of hashing and saw some of its applications. In this lecture, we study hashing in more detail.

4.1 The Problem of Hashing

Let $U = \{1, 2, \dots, |U|\}$ be a large universe of numbers, and let $x_1, \dots, x_n \in U$ be $n$ input numbers from this universe, where $n \ll |U|$. Recall from the last lecture that our goal is to construct a family of hash functions $H$ in which every function maps $U$ to $\{1, 2, \dots, N\}$. We want the collision probability
$$\Pr_{h \sim H}\bigl[\exists\, i, j \in \{1, \dots, n\},\ i \neq j,\ h(x_i) = h(x_j)\bigr]$$
to be small. At the same time, we would like to use as little space as possible.

Recall the birthday paradox: with $n$ people and $N$ days in a year, the probability that some two people were born on the same day is small when $N \gg n^2$. Therefore, if we map the $n$ input numbers uniformly to $\{1, 2, \dots, cn^2\}$ for a large enough constant $c$, then by an analysis similar to the birthday-paradox calculation, the probability of a collision is small. The problem, however, is that recording a uniformly random hash function requires $|U| \log N$ bits, which is far too much. Therefore, instead of choosing uniformly from all possible mappings, we choose uniformly from a smaller set of functions. This motivates hash functions with limited independence.
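To see the first point concretely, here is a minimal Python simulation (the parameters n = 200, c = 4, and the function names are illustrative, not from the lecture): it estimates the collision probability when $n$ keys are mapped by a fresh, truly uniform random function into $cn^2$ buckets.

```python
import random

def collision_probability(n, N, trials=2000):
    """Estimate Pr[some two of n keys collide] under a fresh uniformly
    random function into N buckets, by straightforward simulation."""
    collisions = 0
    for _ in range(trials):
        values = [random.randrange(N) for _ in range(n)]   # n uniform hash values
        if len(set(values)) < n:                            # some bucket repeated
            collisions += 1
    return collisions / trials

n, c = 200, 4
# The expected number of colliding pairs is C(n,2) / (c n^2) < 1/(2c), so by
# Markov's inequality the collision probability should be well below 1/2
# (about 1/8 here).
print(collision_probability(n, c * n * n))
```

The catch, as noted above, is that writing down such a uniformly random function takes one table entry per universe element, i.e. $|U| \log N$ bits; the constructions in the rest of the lecture replace it with a small, explicitly computable family.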

4.2 Limited Independence

Definition 4.1 (One-way Independence). Let $H$ be a family of hash functions. We say $H$ is one-way independent if for all $x_1 \in U$ and all $a_1 \in [N]$, we have
$$\Pr_{h \sim H}[h(x_1) = a_1] = \frac{1}{N}.$$

Note that one-way independence alone is not enough for a good family of hash functions. The family of constant functions $\{h_1, h_2, \dots, h_N\}$, where $h_i(x) = i$ for every $x \in U$, is one-way independent, yet constant functions produce the largest number of collisions imaginable. However, taking one step further to a pairwise independent family of functions is enough to achieve a small collision probability.

Definition 4.2 (Pairwise Independence). Let $H$ be a family of hash functions. We say $H$ is pairwise independent if for all distinct $x_1, x_2 \in U$ and all $a_1, a_2 \in [N]$, we have
$$\Pr_{h \sim H}[h(x_1) = a_1,\ h(x_2) = a_2] = \frac{1}{N^2}.$$

In fact, for our problem, hash function families with the following approximate pairwise independence property are already sufficient.

Definition 4.3 (Approximate Pairwise Independence). Let $H$ be a family of hash functions. We say $H$ is $\alpha$-approximate pairwise independent if for all distinct $x_1, x_2 \in U$ and all $a_1, a_2 \in [N]$, we have
$$\Pr_{h \sim H}[h(x_1) = a_1,\ h(x_2) = a_2] \leq \frac{\alpha}{N^2}.$$

Before proving that pairwise independence is sufficient for a good family of hash functions, we remark that Definitions 4.1 and 4.2 extend naturally to $k$-wise independence. Although pairwise independence already suffices for today's application, $k$-wise independent hash functions are very important objects in computer science and have found many applications elsewhere.

Definition 4.4 ($k$-wise Independence). Let $H$ be a family of hash functions. We say $H$ is $k$-wise independent if for all distinct $x_1, x_2, \dots, x_k \in U$ and all $a_1, a_2, \dots, a_k \in [N]$, we have
$$\Pr_{h \sim H}[h(x_1) = a_1,\ h(x_2) = a_2,\ \dots,\ h(x_k) = a_k] = \frac{1}{N^k}.$$

4.3 Birthday Paradox Revisited

Now let us revisit the analysis of the birthday paradox. We will see that the property we actually need is approximate pairwise independence rather than full mutual independence. As in the birthday-paradox analysis, let $Y_{ij}$ be the indicator random variable of the event $h(x_i) = h(x_j)$, and let $Y = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} Y_{ij}$. Suppose $H$ is an $\alpha$-approximate pairwise independent family. Then for all distinct $i, j \in [n]$,
$$E[Y_{ij}] = \Pr_{h \sim H}[h(x_i) = h(x_j)] = \sum_{a \in [N]} \Pr_{h \sim H}[h(x_i) = a,\ h(x_j) = a] \leq \frac{\alpha}{N^2} \cdot N = \frac{\alpha}{N}.$$
Therefore,
$$E[Y] = \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} E[Y_{ij}] \leq \binom{n}{2} \frac{\alpha}{N}.$$
So if $n^2 < N/\alpha$, then $E[Y] < 1/2$, and by Markov's inequality $\Pr[Y \geq 1] \leq E[Y] < 1/2$. Hence
$$\Pr_{h \sim H}\bigl[\text{for all distinct } i, j \in \{1, \dots, n\}:\ h(x_i) \neq h(x_j)\bigr] = \Pr[Y = 0] \geq \frac{1}{2}.$$
The above analysis shows that a family of hash functions with the $\alpha$-approximate pairwise independence property for some constant $\alpha$ suffices for our purposes.

4.4 Double Hashing

The material of this section follows the work of Fredman et al. [1]. As we will see in the next section, storing an (approximately) pairwise independent hash function takes only $O(\log n)$ bits. One plausible hashing scheme would store a linked list for each possible image of our hash function. Used directly, this scheme needs $O(n^2)$ space just to store the heads of the linked lists for the $O(n^2)$ possible images, so it requires $O(n^2)$ space overall, which is not affordable when $n$ is large. In fact, we can save memory by using the following two-layer hashing scheme.

Instead of choosing $N = \Theta(n^2)$, we choose $N = \Theta(n)$ and use our hashing algorithm to map the inputs to $N$ buckets, still maintaining a linked list for each bucket. For this choice of $N$ we should not expect to avoid collisions altogether; however, we can bound the number of pairs of input elements that map to the same bucket. Then, by hashing the elements of each bucket a second time, we avoid collisions with constant probability.

To analyze this scheme, let $Z_i$ denote the number of elements that map to the $i$-th bucket. When we do the second layer of hashing, according to Section 4.3, we should map the elements of the $i$-th bucket into $[\alpha Z_i^2]$ locations. On the other hand, by the proof in Section 4.3,
$$E\left[\sum_{i=1}^{N} Z_i^2\right] = 2 \sum_{i=1}^{n-1} \sum_{j=i+1}^{n} E[Y_{ij}] + n \leq 2\alpha \binom{n}{2}\frac{1}{N} + n = O(\alpha n).$$
Therefore, the expected total amount of memory we use is $O(n)$ (for constant $\alpha$).
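The following Python sketch outlines this two-layer scheme under simplifying assumptions: Python's built-in hash with a random salt stands in for the pairwise independent family constructed in Section 4.5 (it carries no independence guarantee), $\alpha$ is taken to be 1, and each bucket's second level is simply resampled until it is collision-free, which by the analysis above succeeds with constant probability per attempt.

```python
import random

def build_two_level_table(keys):
    """Sketch of the two-layer scheme: the first level has N = n buckets;
    a bucket receiving Z_i keys gets its own collision-free table of size
    roughly Z_i^2 (alpha is taken to be 1 in this toy version)."""
    n = len(keys)
    N = max(1, n)

    def salted_hash(salt, x, m):
        # Stand-in for h_{a,b}; NOT pairwise independent in any provable sense.
        return hash((salt, x)) % m

    top_salt = random.random()
    buckets = [[] for _ in range(N)]
    for x in keys:
        buckets[salted_hash(top_salt, x, N)].append(x)

    tables = []
    for bucket in buckets:
        size = max(1, len(bucket) ** 2)    # quadratic size => collisions unlikely
        while True:                        # retry until this bucket hashes perfectly
            salt = random.random()
            slots = [None] * size
            ok = True
            for x in bucket:
                j = salted_hash(salt, x, size)
                if slots[j] is not None:   # second-level collision: resample salt
                    ok = False
                    break
                slots[j] = x
            if ok:
                tables.append((salt, slots))
                break
    return top_salt, tables

def lookup(table, x):
    """Membership query: follow the two hash levels and compare."""
    top_salt, tables = table
    salt, slots = tables[hash((top_salt, x)) % len(tables)]
    return slots[hash((salt, x)) % len(slots)] == x

t = build_two_level_table(list(range(0, 1000, 7)))
print(lookup(t, 7), lookup(t, 8))   # True False
```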

4.5 Construction of Pairwise Independent Hash Functions

Let $p$ be a prime such that $|U| \leq p \leq 2|U|$, and let $a$ and $b$ be chosen uniformly from $\{0, \dots, p-1\}$ subject to $a \neq 0$. We show that the family of functions $f_{a,b}(x) = ax + b \bmod p$ is pairwise independent. Note that these functions map $[p]$ to $[p]$; later we will see that, using one more mod operation, we can construct pairwise independent hash functions that map $[p]$ to $[N]$.

Claim 4.5. For all $x, y \in \{0, \dots, p-1\}$ with $x \neq y$, and for all $s, t \in \{0, \dots, p-1\}$ with $s \neq t$, we have
$$\Pr_{a,b}[f_{a,b}(x) = s,\ f_{a,b}(y) = t] = \frac{1}{p(p-1)}.$$

Proof. If $f_{a,b}(x) = s$ and $f_{a,b}(y) = t$, then $ax + b \equiv s \pmod p$ and $ay + b \equiv t \pmod p$. Subtracting one from the other gives $a(x - y) \equiv s - t \pmod p$. Since $p$ is prime, every $k \in \{1, \dots, p-1\}$ has a unique modular inverse $k^{-1}$ with $kk^{-1} \equiv 1 \pmod p$. (We do not discuss the algorithm for finding modular inverses; see https://en.wikipedia.org/wiki/Modular_multiplicative_inverse for details.) Since $x \neq y$, the difference $x - y$ has a modular inverse, so we can solve the above system of modular equations for $a$ and $b$; in particular,
$$a \equiv (s - t)(x - y)^{-1} \pmod p,$$
and since $s \neq t$ we get $a \neq 0$, as required. Furthermore, $b \equiv s - ax \pmod p$. This shows that for $x, y, s, t$ as in the statement of the claim, there is exactly one pair $(a, b)$ with $f_{a,b}(x) = s$ and $f_{a,b}(y) = t$. Since there are $p(p-1)$ possible values of $(a, b)$ and $(a, b)$ is chosen uniformly among them, the claim follows.

Now we choose the family of hash functions to be $H = \{h_{a,b}\}$ where $h_{a,b}(x) = f_{a,b}(x) \bmod N$. Note that to store such a function in memory we only have to store $a$ and $b$, which takes $O(\log p) = O(\log |U|)$ bits. For the particular application of hashing the universe $U$, another idea reduces the memory to $O(\log n)$; see [1, Lemma 2] for details. We now show that $H$ is a family of hash functions with the 2-approximate pairwise independence property.

Claim 4.6. For all $x, y \in U$ with $x \neq y$, we have
$$\Pr_{a,b}[h_{a,b}(x) = h_{a,b}(y)] \leq \frac{1}{N}.$$

Proof. For $x \neq y$, we have $h_{a,b}(x) = h_{a,b}(y)$ if and only if $f_{a,b}(x) \equiv f_{a,b}(y) \pmod N$ (since $a \neq 0$ and $x \neq y$, we always have $f_{a,b}(x) \neq f_{a,b}(y)$, so only pairs with $s \neq t$ arise). Thus, by Claim 4.5,
$$\Pr_{a,b}[h_{a,b}(x) = h_{a,b}(y)] = \Pr_{a,b}[f_{a,b}(x) \equiv f_{a,b}(y) \pmod N] = \sum_{0 \leq s, t < p:\ s \neq t} \frac{I[s \equiv t \pmod N]}{p(p-1)} \leq \frac{1}{p(p-1)} \cdot p \cdot \frac{p-1}{N} = \frac{1}{N}.$$
The last inequality follows from the fact that for any $s \in [p]$ there are at most $(p-1)/N$ numbers $t$ with $t \neq s$ and $s \equiv t \pmod N$.

Note that the family we constructed is only approximately pairwise independent, but that is enough for all the applications we are interested in.
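A direct rendering of this family in Python might look as follows. It is a minimal sketch: the prime p is hard-coded as 2^31 - 1 (a Mersenne prime) rather than found by searching the interval $[|U|, 2|U|]$, and the bucket count N = 1000 is illustrative.

```python
import random

def make_pairwise_hash(p, N):
    """Sample h_{a,b}(x) = ((a*x + b) mod p) mod N with a != 0, as in Claim 4.6.
    p must be a prime at least as large as the universe size |U|."""
    a = random.randrange(1, p)   # a uniform in {1, ..., p-1}
    b = random.randrange(0, p)   # b uniform in {0, ..., p-1}
    return lambda x: ((a * x + b) % p) % N

p = 2**31 - 1                    # a Mersenne prime; any prime in [|U|, 2|U|] works
h = make_pairwise_hash(p, N=1000)
print(h(42), h(42), h(43))       # the first two agree; the third likely differs
# Only a and b need to be remembered: O(log p) = O(log |U|) bits of memory.
```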

We remark that we can extend the above construction to obtain a family of hash functions that is $k$-wise independent. For a prime number $p$, consider the family of hash functions
$$f_{a_0, \dots, a_{k-1}}(x) = a_{k-1}x^{k-1} + \dots + a_1 x + a_0 \bmod p,$$
where $a_0, \dots, a_{k-1}$ are chosen uniformly from $\{0, 1, \dots, p-1\}$. Similar to the invertibility argument used above, the proof that this construction is $k$-wise independent follows from the fact that the Vandermonde matrix
$$\begin{pmatrix}
1 & 1 & \cdots & 1 \\
x_1 & x_2 & \cdots & x_k \\
x_1^2 & x_2^2 & \cdots & x_k^2 \\
\vdots & \vdots & \ddots & \vdots \\
x_1^{k-1} & x_2^{k-1} & \cdots & x_k^{k-1}
\end{pmatrix}$$
is invertible for distinct $x_1, \dots, x_k$. So, in general, we can store a $k$-wise independent hash function using only $O(k \log |U|)$ bits of memory.
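For completeness, here is an analogous sketch of the $k$-wise construction: a uniformly random polynomial of degree $k-1$ over $\mathbb{Z}_p$, evaluated by Horner's rule and reduced mod N at the end (as with $h_{a,b}$, the final mod N makes it only approximately $k$-wise independent). The parameters k = 4, p = 2^31 - 1, N = 1000 are again illustrative.

```python
import random

def make_kwise_hash(k, p, N):
    """Sample f(x) = (a_{k-1} x^{k-1} + ... + a_1 x + a_0) mod p, reduced mod N.
    Storing the k coefficients takes O(k log p) = O(k log |U|) bits."""
    coeffs = [random.randrange(p) for _ in range(k)]   # a_0, ..., a_{k-1}, uniform

    def h(x):
        value = 0
        for a in reversed(coeffs):       # Horner's rule, all arithmetic mod p
            value = (value * x + a) % p
        return value % N

    return h

h = make_kwise_hash(k=4, p=2**31 - 1, N=1000)   # degree-3 polynomial: 4-wise over [p]
print(h(7), h(7), h(8))
```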

4.6 Introduction to Streaming Algorithms

As an application of hashing, we are going to discuss streaming algorithms over the next couple of lectures. Streaming algorithms have become a hot topic in theoretical computer science because of the massive amounts of data we now have to process. Typically, we do not have enough space to store the entire data set. Instead, we process the data in a streaming fashion and sketch the information we want from it in a few passes.

We will discuss algorithms for $F_0$ and $F_2$ estimation. These are classic results that appeared in the first paper on streaming algorithms [2]. The problem is as follows. Let $U = \{1, \dots, |U|\}$ be a large universe of numbers, and let $X_1, \dots, X_n$ be a sequence of numbers in $U$. Let $f_i = \sum_{j=1}^{n} I[X_j = i]$ be the number of times $i$ appears in the sequence. For $0 \leq k \leq \infty$, we let
$$F_k = \sum_{i=1}^{|U|} f_i^k,$$
where we define $0^0 = 0$. The interesting values of $k$ for us are the following.

When $k = 0$, $F_0$ counts the number of distinct elements in the sequence.

When $k = 2$, $F_2$ is the second moment of the vector $(f_1, \dots, f_{|U|})$.

When $k = \infty$, $F_\infty$ corresponds to the number of times the most frequent element appears in the sequence.

In the next lecture, we are going to prove the following theorem.

Theorem 4.7. With probability $1 - \delta$, and using $O\left(\frac{\log |U| + \log n}{\epsilon^2} \log \frac{1}{\delta}\right)$ space, we can give a $(1 \pm \epsilon)$-approximation of $F_0$ and $F_2$ (in the worst case).

We remark that allowing randomness and approximate answers is crucial: there is no hope of achieving a logarithmic amount of space with a deterministic or exact algorithm. Please see Tim Roughgarden's lecture notes for more details.

References

[1] M. L. Fredman, J. Komlós, and E. Szemerédi. Storing a Sparse Table with O(1) Worst Case Access Time. In: J. ACM 31.3 (June 1984), pp. 538-544.

[2] N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. In: STOC. ACM, 1996, pp. 20-29.