Lecture 3 Frequency Moments, Heavy Hitters

Size: px

Start display at page:

Download "Lecture 3 Frequency Moments, Heavy Hitters"

Fay Black
6 years ago
Views:

1 COMS E6998-9: Algorithmic Techniques for Massive Data Sep 15, 2015 Lecture 3 Frequency Moments, Heavy Hitters Instructor: Alex Andoni Scribes: Daniel Alabi, Wangda Zhang 1 Introduction This lecture is about the second frequency moment and heavy hitters. First, e present the Tug-of-War algorithm by Alon-Matias-Szegedy to obtain a 1 + ɛ approximation using O( 1 ɛ log n) space. Second, e define heavy hitters and present the CountMin algorithm hich can be used to obtain heavy hitters. 2 Second Frequency Moment Assume as in previous lectures that e are given a stream of length m from hich e ant to obtain a frequency vector f = (f 1, f 2,..., f n ) (n distinct elements) here f i is the frequency of the ith distinct element in the stream, then the kth frequency moment of the stream is F k = fi k (1) In Lectures 1 and 2, e estimated F 0 and F 1 corresponding to the number of distinct elements and the length of the stream respectively. No e ish to obtain the second moment: F 2 = fi 2 (2) The idea is to use i.i.d random variables r i (= r(i)) here P (r i = 1) = P (r i = 1) = 1 2 (Rademacher random variables) to get an estimator. Thus, E[r i ] = 0 and Var[r i ] = 1 (since r 2 i = 1 and Var[r i] = E[r 2 i ] (E[r i]) 2 ). We can think of r i as a random variable defined for every distinct element so that r : [n] { 1, +1} (3) As an example, let s consider hen the entries of the frequency vector is all 1s ( m = n). Let z = r i (4) Then E[z] = E[r i ] = 0 by linearity of expectation and using the distribution of r i. Similarly, Var[z] = Var[ r i ] = m. We can apply Chebyshev to obtain that z 0 = z O( m) ith constant probability. Turns out that this bound is tight. 1

2 No, let s consider the more general case and define z r i f i (5) Claim 1. E[z 2 ] = f 2 i = F 2 Proof. E[z 2 ] = E[( r i f i ) 2 ] (6) = E[ri 2 ]fi 2 + E[r i ]E[r j ]f i f j (7) i i j f 2 i E[r 2 i ] (8) f 2 i = F 2 (9) (7) holds because for i j, r i and r j are independent, hile (8) holds because E[r i ] = 0 for all i. Finally, (9) holds because r 2 i = 1( E[r2 i ] = 1). Claim 2. Var[z 2 ] O(1) F 2 2 Proof. First, let s compute E[z 4 ] E[z 4 ] = E[( r i f i ) 4 ] (10) i = E[ r i f i r r k f k r l f l ] (11) i,j,k,l = f i f j f k f l E[r i r j r k r l ] (12) i,j,k,l f 4 i + 6 i<j f 2 i f 2 j (13) O(1) ( i f 2 i ) 2 (14) (13) holds because: Of independence of the random variables The terms ith odd poers of r i evaluate to zero There are ( 4 2) = 6 terms of the form r 2 i r 2 j Finally, e obtain that Var[z 2 ] E[z 4 ] O(1) F 2 2 Having bounded E[z 2 ] and Var[z 2 ], e present the algorithm belo 2

3 2.1 Tug-of-War (Alon-Matias-Szegedy 1996) We maintain a counter z. 1. Initialize z = 0 2. When e see element i: z = z + r(i) 3. Return the estimator z 2 Earlier, e defined r i in the form r : [n] { 1, +1} here r i acts like a hash function. The hash function for the r i s should be 4-ise independent. Next, e apply the average trick using k = O( 1 ɛ 2 ) parallel runs of the algorithm to obtain a 1 + ɛ approximation in O( 1 ɛ 2 log n) space. 2.2 Linearity So far e have only considered simple estimators, and e next consider something more complex. Suppose e have to parts of a stream seen by to different estimators, and e ant to estimate the union of these to parts (i.e., ho e can combine them). Claim 3. Linearity: given estimates z for f and z for f, e can combine them as z = z + z for f = f + f. Proof. Since z = r i f i and z = r i f i, e have z + z = r i (f i + f i ). Note that e need to use the same randomness for to estimates. Similarly, e can use (z z ) 2 to estimate (f i f i )2. Hoever, e cannot use linearity in a similar ay for f i f i, and ill discuss this pointer later in class. 2.3 General Streaming Model We no consider a more generalized model: at each moment, e have an update (i, δ i ) to increase the i-th entry by δ i. (δ i may be negative) A linear algorithm S handles this easily, S(f + e i δ i ) = S(f) + S(e i δ i ). We call S a sketch. According to [Nguyen-Li-Woodruff 14], any algorithm for general streaming might as ell be linear. 3 Heavy Hitters No that e are able to compute many types of frequency counts, e onder if e can also compute the max frequency in a stream. It turns out that e cannot: it is impossible to approximate the max frequency in sublinear space. Therefore, e ill solve a more modest problem here e ant to detect the max-frequency element if it is very heavy. We ill sho that e can find these heavy hitters in space O(1/φ). Definition 4. i is φ-heavy if f i φ. 3

4 The basic idea is still to use hash functions. A first-attempt method uses a single hash function h mapping from [n] to [] randomly, here = O(1/φ). Then each element i goes to bucket i, and e sum up the frequencies in each of the buckets. We denote the sum of each bucket as S. So the estimator for f i is ˆf = S(h(i)). For example, consider a stream of 2, 5, 7, 5, 5. If the hash function h 1 orks as h 1 (2) = 2, h 1 (5) = 1, h 1 (7) = 2, then e ill obtain the folloing estimates: ˆf2 = 2, ˆf 5 = 3, ˆf 7 = 2. Hoever, for an element that never appears in the stream, e.g. 11, this method also estimates its frequency as f ˆ 11 = 2, assuming h 1 (11) = 2. Claim 5. S(h(i)) is a biased estimator. Proof. Analyzing this estimator, e have ˆf i = S(h(i)) = f i + {j:h(j)=h(i)} f j. Let C = {j:h(j)=h(i)} f j. Thus, E[C] = P r[h(j) = h(i)] f j = f j 0 j j i Hoever, it is easy to see that E[C] f i /. By Markov Inequality, e have Cle 10 ( P r[ ˆf i f i < O. So the bias is at most /, hich is small for ith at least 90% probability, i.e. ) ] 0.9 ( ) For = O 1 and f i φ, e have C ɛf i. That is, ˆf i is a (1 + ɛ) approximation (ith 90% probability). Still, there are to issues ith this estimator: (1) only constant probability; (2) overestimate for many indices (10%). Fundamentally, there is a conflict beteen avoiding many collisions and reducing space used by the hash table. 3.1 CountMin This motivates us to use the median trick. We can use L = O(log n) hash tables ith hash functions h j. The CountMin algorithm orks as follos: Initialize (r, L): array S[L][] L hash functions h 1,..., h L, into {1,..., } Process ( int i): for (j = 0; j < L; ++j) S[j][h j (i)] += 1 Estimator : foreach i in PossibleIP : ˆf i = median j (S[j][h j (i)] 4

5 Claim 6. The median is a ± estimator ith (1 1/n 2 ) probability. Proof. For an index i, each ro (out of the L ros) gives ˆf i = f i ± ith 90% probability. By Median Trick (see Lecture 2), the median gives an estimator of ± ith (1 1/n 2 ) probability. Alternatively, e can take the min, instead of median, since all counts are overestimated. 3.2 Output Heavy Hitters We can no identify the heavy hitters by iterating over all i s and outputing those ith particular, for true frequencies f i s, if if f i f i if φ(1 ɛ) φ(1 ɛ), then i is not in output; φ(1 + ɛ), then i is reported as a heavy hitter; f i φ(1 + ɛ), then i may or may not be reported as a heavy hitter. ˆf i φ. In If e really care about those elements in beteen, then e could take more space to further narro the gap. ( ) The space used is O log 2 n bits. Since e iterate over all i s, the time complexity is Ω(n). We can improve this time complexity, at the cost of increasing space to O ( log 3 n ) bits. The idea is to use dyadic intervals. For each level, e maintain its on sketch, and find the heavy hitters by folloing don the subtrees of heavy hitters in intermediary. 5

Lecture 6 September 13, 2016

Lecture 6 September 13, 2016 CS 395T: Sublinear Algorithms Fall 206 Prof. Eric Price Lecture 6 September 3, 206 Scribe: Shanshan Wu, Yitao Chen Overview Recap of last lecture. We talked about Johnson-Lindenstrauss (JL) lemma [JL84]