Lecture 2: Streaming Algorithms

Size: px

Start display at page:

Download "Lecture 2: Streaming Algorithms"

Roxanne McLaughlin
5 years ago
Views:

1 CS369G: Algorithmic Techniques for Big Data Spring Lecture 2: Streaming Algorithms Prof. Moses Chariar Scribes: Stephen Mussmann 1 Overview In this lecture, we first derive a concentration inequality for an algorithm for counting distinct element in a stream using pairwise independent hash functions. Then, we present a proof technique to prove lower bounds for streaming algorithms. Finally, we define the concept of a frequency moment and propose an algorithm to estimate F 2. 2 Review Last lecture we examined the problem of estimating the number of distinct elements in a stream. We found a solution that performed better than the brute force approach of eep an enormous hash table. The solution that was presented was a probabilistic algorithm that gave an (1 + ɛ) approximation with probability 1 δ. Further, the algorithm required space O( 1 ɛ 2 log(1 δ )). More precisely, we defined Y as the minimum hash value of the stream. For a fully independent hash, we found that E[Y ] = (1) where is the number of distinct elements. Recall that we combined many copies of Y using independent hashes to provide an estimate. In particular, we created O(log( 1 )) groups of hashes δ where each group had O( 1 ) hashes. For the estimate, we computed the mean of each group, then ɛ2 calculated the median of these means. 3 Setches Informally, a data setch is a smaller description of a stream of data that enables the calculation or estimate of a property of the data. An important attribute of setches is that they are composable. Suppose we have data streams S 1 and S 2 with corresponding setches s(s 1 ) and s(s 2 ). We wish there to be an efficiently computable function f where s(s 1 S 2 ) = f(s(s 1 ), s(s 2 )) (2) 1

2 4 Bounds for Pairwise Independent Hashes In the analysis of the distinct element setch from last time, we relied on a fully independent family of hash functions. Unfortunately, such hash functions are not practical. Here, we examine pairwise independent hashes. Recall that a pairwise independent family of hash functions satisfies P h [h(x 1 ) = y 1, h(x 2 ) = y 2 ] = P h [h(x 1 ) = y 1 ]P h [h(x 2 ) = y 2 ] (3) 4.1 Example Choose p to be a large prime number. Let a, b [p]. Let us define the following hash function h a,b : [p] [p] as This family of hash functions is pairwise independent. h a,b (x) = ax + b(mod p) (4) In particular, for this family of hash functions, we have the following bounds P[Y < 1 3 ] < 2 5 P[Y > 3 ] < 1 3 (5) (6) We can then mae O(log( 1 )) copies of the hash and tae the median to be an estimate that is δ within a factor of 3 of the true answer with probability 1 δ. The first bound has an easy proof in the continuous case since we can do a union bound on the interval [0, 1 3 ] among elements to get a probabilistic bound of General Pairwise Independent Analysis To get a general bound for pairwise independent hash families, we need to change the algorithm. Instead of taing the mean of the min within a group of hashes, we eep trac of the smallest t hash elements. Let y i be the i th smallest element. For this setup, our estimator is t/y t. Theorem 1. Fix t = c/ɛ 2. With probability 2/3, (1 ɛ)t y t Proof. Let us first prove the second inequality first. Let I = [0, (1 + ɛ) t ]. Let X i be an indicator variable for the event h(x i ) I. Let X = i X i. Thus, X is the number of hash values in the interval I. 2

3 Note that E[X] = i E[X i ] = (1 + ɛ) t =. P[y t > ] = P[X < t] = P[X E[X] < ɛt] P[ X E[X] > ɛt] (7) By Chebyshev s inequality, P[ X E[X] > ɛt] V ar[x] ɛ 2 t 2 (8) Let p be the probability that X i = 1. Then, E[X i ] = p and V ar(x i ) = p(1 p). By linearity of expectation, E[X] = p and by pairwise independence, V ar(x) = p(1 p) E[X] =. Thus, P[ X E[X] > ɛt] (1 + ɛ) ɛ 2 t 2 = c (9) We can choose the value of c so that (1 + ɛ) c 1. Putting this together, we get, 6 P[y t < ] 1 6 (10) the proof of the other direction is the same except that the Chebyshev bound is in the other direction. For this scheme, we need O(log( 1 δ )) different hashes and for each hash we need to store t = O( 1 ɛ 2 ) values. Thus, the memory of the algorithm will be O( 1 ɛ 2 log(1 δ ). 4.3 Proving Lower Bounds on Streaming Algorithms In this section we describe a common technique for deriving bounds for streaming problems. For this technique, we specify I 1,..., I N different streams. If we can show that the algorithm must have a unique state after processing each stream, Ω(log(N)) is lower bound on the space since the algorithm must distinguish the N different streams. In order to show that the algorithm must have different states after processing two streams I i and I j, we can construct an additional stream I such that the algorithm must give a different result for I i I than for I j I. Using this technique, we can prove a lower bound for a deterministic, approximate algorithm for the distinct element problem. Let {S i } N i=1 be subsets of [n] that satisfy i : S i = n 10 i j : S i S j n 20 (11) (12) 3

4 With a probabilistic construction, we can find 2 cn subsets that meet these requirements. Note that the number of distinct elements in S i S i is (n/10) while the number of distinct elements in S i S j (for i j) is at least (3/2)(n/10). Thus, any deterministic, approximate algorithm must distinguish the N streams corresponding to the sets S i. Thus, Ω(log(2 cn )) = Ω(n) is a lower bound on the required space. 4.4 Algorithms for Frequency Moments Define f i as the number of times that element i appears in a stream. The t th frequency moment is the quantity, F t = i f t i (13) Note that F 0 is the number of distinct elements (assuming 0 0 = 0). F 1 is simply the number of elements in the stream, and thus is trivial to compute. F 2 is a measure of sewness and is the problem we will next examine, which has a solution given in [1]. Suppose we have a hash function h : U {±1}. Let Y = i h(x i ) be a random variable. This variable is a setch because we can simply add the sums of two different sets. Further, Y 2 is an unbiased estimate of F 2. To see this, define the random variable X i = h(x i ) and note that Y = i f i X i E[Y 2 ] = E[( i f i X i ) 2 ] (14) = E[ i,j f i f j X i X j ] (15) = i,j f i f j E[X i X j ] (16) = i f 2 i (17) The last equation follows from the fact that E[X i X j ] = 0 if i j. We need a bound on the variance in order to obtain a concentration inequality. E[Y 4 ] = E[( i f i X i ) 4 ] = i f 4 i + 6 i,j f 2 i f 2 j (18) Thus, V ar[y 2 ] = E[Y 4 ] E[Y 2 ] 2 (19) 4

5 = [ i f 2 i + 6 i,j f 2 i f 2 j ] [ i f 2 i + 2 i,j f 2 i f 2 j ] (20) = 4 i,j f 2 i f 2 j 2F 2 2 = 2E[Y 2 ] 2 (21) Thus, we can use Chebyshev s inequality to concentrate the mean and then use the median of means technique. References [1] N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, Proceedings of the twenty-eighth annual ACM symposium on Theory of computing,

Lecture 2 Sept. 8, 2015

Lecture 2 Sept. 8, 2015 CS 9r: Algorithms for Big Data Fall 5 Prof. Jelani Nelson Lecture Sept. 8, 5 Scribe: Jeffrey Ling Probability Recap Chebyshev: P ( X EX > λ) < V ar[x] λ Chernoff: For X,..., X n independent in [, ],