CS 229r: Algorithms for Big Data — Fall 2015

Prof. Jelani Nelson — Lecture 2, Sept. 8, 2015 — Scribe: Jeffrey Ling

1 Probability Recap

Chebyshev: $P(|X - \mathbb{E}X| > \lambda) < \mathrm{Var}[X]/\lambda^2$.

Chernoff: For $X_1, \ldots, X_n$ independent in $[0, 1]$, $0 < \epsilon < 1$, and $\mu = \mathbb{E}\sum_i X_i$,
$$P\Big(\Big|\sum_i X_i - \mu\Big| > \epsilon\mu\Big) < 2e^{-\epsilon^2\mu/3}.$$

2 Today

- Distinct elements
- Norm estimation (if there's time)

3 Distinct elements ($F_0$)

Problem: Given a stream of integers $i_1, \ldots, i_m \in [n]$, where $[n] := \{1, 2, \ldots, n\}$, we want to output the number $t$ of distinct elements seen.

3.1 Straightforward algorithms

1. Keep a bit array of length $n$. Flip bit $i$ on when the number $i$ is seen.
2. Store the whole stream. Takes $m \lg n$ bits.

We can thus solve the problem exactly with $O(\min(n, m \lg n))$ bits.

3.2 Randomized approximation

We can settle for outputting $\tilde{t}$ such that $P(|\tilde{t} - t| > \epsilon t) < \delta$. The original solution was by Flajolet and Martin [2].
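For reference, the first straightforward algorithm is a few lines of code; a minimal sketch (function and variable names are ours):

```python
def exact_distinct_count(stream, n):
    # Straightforward algorithm 1: a length-n bit array; flip bit i on seeing i.
    seen = [False] * (n + 1)  # index 1..n directly
    count = 0
    for i in stream:
        if not seen[i]:
            seen[i] = True
            count += 1
    return count

print(exact_distinct_count([3, 1, 3, 2, 1], n=4))  # prints 3
```

This is exact but uses $n$ bits of state, which is what the randomized approximation below avoids.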
3.3 Idealized algorithm (FM)

1. Pick a random function $h : [n] \to [0, 1]$ (idealized, since we can't actually nicely store this).
2. Maintain the counter $X = \min_{i \in \text{stream}} h(i)$.
3. Output $1/X - 1$.

Intuition. $X$ is a random variable that's the minimum of $t$ i.i.d. $\mathrm{Unif}(0,1)$ r.v.s.

Claim 1. $\mathbb{E}X = \frac{1}{t+1}$.

Proof.
$$\mathbb{E}X = \int_0^\infty P(X > \lambda)\,d\lambda = \int_0^1 P(\forall i \in \text{str},\ h(i) > \lambda)\,d\lambda = \int_0^1 \prod_{r=1}^t P(h(i_r) > \lambda)\,d\lambda = \int_0^1 (1-\lambda)^t\,d\lambda = \frac{1}{t+1}.$$

Claim 2. $\mathbb{E}X^2 = \frac{2}{(t+1)(t+2)}$.

Proof.
$$\mathbb{E}X^2 = \int_0^\infty P(X^2 > \lambda)\,d\lambda = \int_0^1 P(X > \sqrt{\lambda})\,d\lambda = \int_0^1 (1-\sqrt{\lambda})^t\,d\lambda \overset{u = 1-\sqrt{\lambda}}{=} \int_0^1 2u^t(1-u)\,du = \frac{2}{(t+1)(t+2)}.$$

This gives $\mathrm{Var}[X] = \mathbb{E}X^2 - (\mathbb{E}X)^2 = \frac{2}{(t+1)(t+2)} - \frac{1}{(t+1)^2} = \frac{t}{(t+1)^2(t+2)}$, and furthermore $\mathrm{Var}[X] < \frac{1}{(t+1)^2} = (\mathbb{E}X)^2$.
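The idealized algorithm is easy to simulate, representing the idealized $h$ as an explicit table of values (a sketch; names are ours):

```python
def fm_idealized(stream, h):
    # h maps each possible element to a value in [0,1];
    # X = min over the stream of h(i), and the estimate is 1/X - 1.
    X = min(h[i] for i in stream)
    return 1.0 / X - 1.0

# With a fixed table for h the estimator is deterministic:
h = {1: 0.5, 2: 0.25, 3: 0.9}
print(fm_idealized([1, 2, 1, 3], h))  # min is 0.25, so the output is 3.0
```

Note that a single counter is noisy: the claims above bound $\mathbb{E}X$ and $\mathrm{Var}[X]$, not $1/X - 1$ directly, which is why the next section averages several $X$'s before inverting.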
4 FM+

We average together multiple estimates from the idealized algorithm FM.

1. Instantiate $q = 1/(\epsilon^2\eta)$ FMs independently.
2. Let $X_i$ come from $\mathrm{FM}_i$.
3. Output $1/Z - 1$, where $Z = \frac{1}{q}\sum_{i=1}^q X_i$.

We have that $\mathbb{E}Z = \frac{1}{t+1}$, and $\mathrm{Var}(Z) = \frac{1}{q} \cdot \frac{t}{(t+1)^2(t+2)} < \frac{1}{q(t+1)^2}$.

Claim 3. $P\big(|Z - \frac{1}{t+1}| > \frac{\epsilon}{t+1}\big) < \eta$.

Proof. Chebyshev.
$$P\Big(\Big|Z - \frac{1}{t+1}\Big| > \frac{\epsilon}{t+1}\Big) < \frac{1}{q(t+1)^2} \cdot \frac{(t+1)^2}{\epsilon^2} = \frac{1}{q\epsilon^2} = \eta.$$

Claim 4. $P\big(|(\frac{1}{Z} - 1) - t| > O(\epsilon)t\big) < \eta$.

Proof. By the previous claim, with probability $1 - \eta$ we have
$$\frac{1}{Z} = \frac{1}{(1 \pm \epsilon)\frac{1}{t+1}} = (1 \pm O(\epsilon))(t+1),$$
so $\frac{1}{Z} - 1 = (1 \pm O(\epsilon))t \pm O(\epsilon)$.

5 FM++

We take the median of multiple estimates from FM+.

1. Instantiate $s = \lceil 36\ln(2/\delta)\rceil$ independent copies of FM+ with $\eta = 1/3$.
2. Output the median $\hat{t}$ of $\{1/Z_j - 1\}_{j=1}^s$, where $Z_j$ is from the $j$th copy of FM+.

Claim 5. $P(|\hat{t} - t| > \epsilon t) < \delta$.

Proof. Let
$$Y_j = \begin{cases} 1 & \text{if } |(1/Z_j - 1) - t| > \epsilon t \\ 0 & \text{else} \end{cases}$$
We have $\mathbb{E}Y_j = P(Y_j = 1) < 1/3$ from the choice of $\eta$. The probability we seek to bound is the probability that the median fails, i.e. that at least half of the FM+ estimates have $Y_j = 1$. In other words,
$$\sum_{j=1}^s Y_j > s/2.$$
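The whole FM → FM+ → FM++ pipeline can be simulated end to end, again using fresh uniform draws in place of the idealized hash functions (a sketch under that assumption; names and the `rng` parameter are ours):

```python
import math
import random
import statistics

def fm_plus_plus(t, eps, delta, rng):
    """Idealized FM++ simulation for a stream with t distinct elements.
    Each idealized FM counter is the min of t Unif(0,1) draws (a fresh h).
    Parameters follow the notes: eta = 1/3, q = ceil(1/(eps^2 * eta)),
    s = ceil(36 * ln(2/delta))."""
    eta = 1.0 / 3.0
    q = math.ceil(1.0 / (eps * eps * eta))
    s = math.ceil(36.0 * math.log(2.0 / delta))
    estimates = []
    for _ in range(s):  # s independent copies of FM+
        # FM+: average q independent FM counters, then invert.
        Z = sum(min(rng.random() for _ in range(t)) for _ in range(q)) / q
        estimates.append(1.0 / Z - 1.0)
    return statistics.median(estimates)  # FM++: median of the FM+ outputs

est = fm_plus_plus(t=200, eps=0.5, delta=0.05, rng=random.Random(1))
print(est)  # typically lands within roughly (1 +/- eps) of t = 200
```

The averaging step (FM+) shrinks the variance by a factor of $q$; the median step (FM++) then boosts the constant success probability $1-\eta$ up to $1-\delta$.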
We then get that
$$P\Big(\sum_j Y_j > s/2\Big) \le P\Big(\Big|\sum_j Y_j - \frac{s}{3}\Big| > \frac{s}{6}\Big). \qquad (1)$$

Make the simplifying assumption that $\mathbb{E}Y_j = 1/3$ (this turns out to be stronger than $\mathbb{E}Y_j < 1/3$). Then equation (1) becomes, using Chernoff with $\epsilon = 1/2$ and $\mu = s/3$,
$$P\Big(\Big|\sum_j Y_j - \mathbb{E}\sum_j Y_j\Big| > \frac{1}{2}\,\mathbb{E}\sum_j Y_j\Big) < 2e^{-(\frac{1}{2})^2 \cdot \frac{s}{3} \cdot \frac{1}{3}} = 2e^{-s/36} \le \delta,$$
as desired.

The final space required, ignoring the storage of $h$, is $O(\frac{1}{\epsilon^2}\lg\frac{1}{\delta})$: we keep $O(\lg(1/\delta))$ copies of FM+, each requiring $O(1/\epsilon^2)$ space.

6 k-wise independent functions

Definition 6. A family $\mathcal{H}$ of functions mapping $[a]$ to $[b]$ is k-wise independent if for all $j_1, \ldots, j_k \in [b]$ and all distinct $i_1, \ldots, i_k \in [a]$,
$$P_{h \in \mathcal{H}}(h(i_1) = j_1 \wedge \cdots \wedge h(i_k) = j_k) = 1/b^k.$$

Example. The set $\mathcal{H}$ of all functions $[a] \to [b]$ is k-wise independent for every $k$. $|\mathcal{H}| = b^a$, so $h$ is representable in $a \lg b$ bits.

Example. Let $a = b = q$ for $q = p^r$ a prime power, and let $\mathcal{H}_{\mathrm{poly}}$ be the set of polynomials of degree at most $k - 1$ with coefficients in $\mathbb{F}_q$, the finite field of order $q$. $|\mathcal{H}_{\mathrm{poly}}| = q^k$, so $h$ is representable in $k \lg q$ bits.

Claim 7. $\mathcal{H}_{\mathrm{poly}}$ is k-wise independent.

Proof. Interpolation: for distinct $i_1, \ldots, i_k$ and any $j_1, \ldots, j_k$, exactly one polynomial of degree at most $k-1$ passes through the $k$ points $(i_\ell, j_\ell)$.

7 Non-idealized FM

First, we get an $O(1)$-approximation in $O(\lg n)$ bits, i.e. our estimate $\tilde{t}$ satisfies $t/C \le \tilde{t} \le Ct$ for some constant $C$.

1. Pick $h$ from a 2-wise family $[n] \to [n]$, for $n$ a power of 2 (round up if necessary).
2. Maintain $X = \max_{i \in \text{str}} \mathrm{lsb}(h(i))$, where $\mathrm{lsb}$ is the index of the least significant 1 bit of a number.
3. Output $2^X$.
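The three steps above can be sketched in code. One substitution to flag: instead of the $\mathbb{F}_q$ construction for $q = 2^r$, we draw a random degree-1 polynomial modulo a prime $p \ge n$, which is also 2-wise independent by the interpolation argument (the prime choice, the convention $\mathrm{lsb}(0) = 0$, and all names are ours):

```python
import random

def make_pairwise_hash(p, rng):
    # h(x) = (a*x + b) mod p: a random degree-1 polynomial over F_p,
    # a 2-wise independent family by the interpolation argument (Claim 7).
    a = rng.randrange(p)
    b = rng.randrange(p)
    return lambda x: (a * x + b) % p

def lsb(x):
    # index of the least significant set bit; we send 0 to 0 by convention
    return (x & -x).bit_length() - 1 if x > 0 else 0

def rough_estimate(stream, p, rng):
    # Non-idealized FM: track X = max lsb(h(i)) over the stream; output 2^X.
    h = make_pairwise_hash(p, rng)
    X = max(lsb(h(i)) for i in stream)
    return 2 ** X

print(rough_estimate(range(1, 1001), p=2**31 - 1, rng=random.Random(7)))
# with good probability this lands within a constant factor of t = 1000
```

Only $X$ (an integer below $\lg p$) and the two coefficients of $h$ are stored, which is the claimed $O(\lg n)$ bits.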
For fixed $j$, let $Z_j$ be the number of $i$ in the stream with $\mathrm{lsb}(h(i)) = j$, and let $Z_{>j}$ be the number of $i$ with $\mathrm{lsb}(h(i)) > j$. Let
$$Y_i = \begin{cases} 1 & \mathrm{lsb}(h(i)) = j \\ 0 & \text{else} \end{cases}$$
Then $Z_j = \sum_{i \in \text{str}} Y_i$. We can compute $\mathbb{E}Z_j = t/2^{j+1}$, and similarly
$$\mathbb{E}Z_{>j} = t\Big(\frac{1}{2^{j+2}} + \frac{1}{2^{j+3}} + \cdots\Big) < \frac{t}{2^{j+1}},$$
and also
$$\mathrm{Var}[Z_j] = \mathrm{Var}\Big[\sum_i Y_i\Big] = \mathbb{E}\Big(\sum_i Y_i\Big)^2 - \Big(\mathbb{E}\sum_i Y_i\Big)^2 = \sum_{i, i'}\big[\mathbb{E}(Y_i Y_{i'}) - \mathbb{E}(Y_i)\mathbb{E}(Y_{i'})\big].$$
Since $h$ is from a 2-wise family, the $Y_i$ are pairwise independent, so $\mathbb{E}(Y_i Y_{i'}) = \mathbb{E}(Y_i)\mathbb{E}(Y_{i'})$ for $i \ne i'$. We can then show $\mathrm{Var}[Z_j] < t/2^{j+1}$.

Now for $j = \lg t - 5$ we have $16 \le \mathbb{E}Z_j \le 32$, so
$$P(Z_j = 0) \le P(|Z_j - \mathbb{E}Z_j| \ge 16) < 1/5$$
by Chebyshev. For $j = \lg t + 5$ we have $\mathbb{E}Z_{>j} \le 1/16$, so
$$P(Z_{>j} \ge 1) \le 1/16$$
by Markov.

This means that with good probability the max lsb will be above $\lg t - 5$ but below $\lg t + 5$, i.e. within a constant range around $\lg t$. This gives us a 32-approximation, i.e. a constant-factor approximation.

8 Refine to $1 + \epsilon$

Trivial solution. Algorithm TS stores the first $C/\epsilon^2$ distinct elements seen. This is correct if $t \le C/\epsilon^2$.

Algorithm.

1. Instantiate $\mathrm{TS}_0, \ldots, \mathrm{TS}_{\lg n}$.
2. Pick $g : [n] \to [n]$ from a 2-wise family.
3. Feed $i$ to $\mathrm{TS}_{\mathrm{lsb}(g(i))}$.
4. Output $2^{j+1} \cdot \mathrm{out}_j$, where $j$ is chosen (using the constant-factor estimate $\tilde{t}$ from before) so that $\tilde{t}/2^{j+1} \approx 1/\epsilon^2$, and $\mathrm{out}_j$ is the output of $\mathrm{TS}_j$.
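The bucketing algorithm above can be sketched as follows. Assumptions to flag: $g$ is again a random degree-1 polynomial mod a prime (standing in for the $[n] \to [n]$ family), the cap constant $C$, the rule for picking $j$ from a rough estimate, and all names are ours:

```python
import random

def lsb(x):
    # index of the least significant set bit (0 by convention for x == 0)
    return (x & -x).bit_length() - 1 if x > 0 else 0

def refined_estimate(stream, eps, t_rough, rng, C=4):
    """Sketch of the refinement: route element i to bucket lsb(g(i)),
    keep each bucket TS_j as a set capped at C/eps^2 elements (the
    'trivial solution'), and read off 2^(j+1) * |TS_j| for a level j
    with t_rough / 2^(j+1) roughly 1/eps^2."""
    p = 2**31 - 1
    a, b = rng.randrange(p), rng.randrange(p)
    g = lambda x: (a * x + b) % p  # 2-wise independent hash
    cap = int(C / eps**2)
    buckets = [set() for _ in range(p.bit_length() + 1)]
    for i in stream:
        ts = buckets[lsb(g(i))]
        if len(ts) < cap:
            ts.add(i)  # TS_j keeps only its first cap distinct elements
    # pick j so that 2^(j+1) is about t_rough * eps^2
    j = max(0, round(t_rough * eps**2).bit_length() - 1)
    return 2 ** (j + 1) * len(buckets[j])
```

In a real single-pass implementation $j$ is not known in advance, which is exactly why all $\lg n$ levels are maintained in parallel and only the right one is read at query time.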
Let $B_j$ be the number of distinct elements hashed by $g$ to $\mathrm{TS}_j$. Then $\mathbb{E}B_j = t/2^{j+1} =: Q_j$. By Chebyshev, $B_j = Q_j \pm O(\sqrt{Q_j})$ with good probability, and this equals $(1 \pm O(\epsilon))Q_j$ if $Q_j \ge 1/\epsilon^2$.

Final space: $\frac{C}{\epsilon^2} \cdot (\lg n)^2 = O(\frac{1}{\epsilon^2}\lg^2 n)$ bits.

It is known that space $O(1/\epsilon^2 + \log n)$ is achievable [4], and furthermore this is optimal [1, 5] (also see [3]).

References

[1] Noga Alon, Yossi Matias, Mario Szegedy. The Space Complexity of Approximating the Frequency Moments. J. Comput. Syst. Sci. 58(1):137–147, 1999.

[2] Philippe Flajolet, G. Nigel Martin. Probabilistic counting algorithms for data base applications. J. Comput. Syst. Sci. 31(2):182–209, 1985.

[3] T. S. Jayram, Ravi Kumar, D. Sivakumar. The One-Way Communication Complexity of Hamming Distance. Theory of Computing 4(1):129–135, 2008.

[4] Daniel M. Kane, Jelani Nelson, David P. Woodruff. An optimal algorithm for the distinct elements problem. In Proceedings of the Twenty-Ninth ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS), pages 41–52, 2010.

[5] David P. Woodruff. Optimal space lower bounds for all frequency moments. In SODA, pages 167–175, 2004.