CSCI8980 Algorithmic Techniques for Big Data September 12, Lecture 2

CSCI8980 Algorithmic Techniques for Big Data
September 12, 2013
Dr. Barna Saha
Lecture 2
Scribe: Matt Nohelty

Overview

We continue our discussion of data streaming models, where streams of elements arrive and main memory is not sufficient to hold all the data. We begin by discussing the Chernoff Bound and its proof. We then look at the Universal Hash Family and discuss pairwise, k-wise, and fully independent hash functions. Finally, we look more closely at counting distinct items in a stream, presenting and analyzing two algorithms.

1 Chernoff Bound

The Chernoff Bound is commonly used to show that randomized algorithms produce results of acceptable quality, or to determine the number of runs needed to achieve a result with a given probability. Many data streaming algorithms are randomized, so the Chernoff Bound is used frequently in their analysis. It produces tighter bounds than the Markov Inequality or the Chebyshev Inequality, but it needs a stronger assumption: its inputs must be independent Bernoulli random variables, which the other two inequalities do not require.

Theorem 1 (The Chernoff Bound). Let $X_1, X_2, \ldots, X_n$ be $n$ independent Bernoulli random variables with $\Pr(X_i = 1) = p_i$. Let $X = \sum_{i=1}^{n} X_i$. Hence,

\[ E[X] = E\Big[\sum_{i=1}^{n} X_i\Big] = \sum_{i=1}^{n} E[X_i] = \sum_{i=1}^{n} \Pr(X_i = 1) = \sum_{i=1}^{n} p_i = \mu \text{ (say)}. \]

Then the Chernoff Bound says that for any $\varepsilon > 0$

\[ \Pr(X > (1+\varepsilon)\mu) \le \left( \frac{e^{\varepsilon}}{(1+\varepsilon)^{(1+\varepsilon)}} \right)^{\mu} \]

and for any $0 < \varepsilon < 1$

\[ \Pr(X < (1-\varepsilon)\mu) \le \left( \frac{e^{-\varepsilon}}{(1-\varepsilon)^{(1-\varepsilon)}} \right)^{\mu}. \]

When $0 < \varepsilon < 1$ the above expressions can be further simplified to

\[ \Pr(X > (1+\varepsilon)\mu) \le e^{-\mu\varepsilon^2/3} \quad \text{and} \quad \Pr(X < (1-\varepsilon)\mu) \le e^{-\mu\varepsilon^2/2}. \]

Hence

\[ \Pr(|X - \mu| > \varepsilon\mu) \le 2e^{-\mu\varepsilon^2/3}. \]
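To make the Chernoff Bound statement above concrete, here is a minimal Python sketch that evaluates the general and the simplified upper-tail bounds, and solves the two-sided simplified bound for the number of trials. The function names and the helper `runs_needed` are illustrative assumptions, not part of the lecture; `runs_needed` further assumes n identically distributed Bernoulli(p) trials, so that $\mu = np$.

```python
import math


def chernoff_upper(mu, eps):
    """General upper-tail bound (e^eps / (1+eps)^(1+eps))^mu, computed in log space."""
    return math.exp(mu * (eps - (1 + eps) * math.log(1 + eps)))


def chernoff_upper_simple(mu, eps):
    """Simplified upper-tail bound e^(-mu*eps^2/3), stated for 0 < eps < 1."""
    return math.exp(-mu * eps**2 / 3)


def runs_needed(eps, delta, p):
    """Smallest n with 2*exp(-n*p*eps^2/3) <= delta (assumes i.i.d. Bernoulli(p) trials)."""
    return math.ceil(3 * math.log(2 / delta) / (p * eps**2))


if __name__ == "__main__":
    mu, eps = 50, 0.5
    print("general bound:   ", chernoff_upper(mu, eps))
    print("simplified bound:", chernoff_upper_simple(mu, eps))
    print("trials for eps=0.1, delta=0.05, p=0.5:", runs_needed(0.1, 0.05, 0.5))
```

The general bound is never larger than the simplified one on $0 < \varepsilon < 1$, which is why the simplified form is a valid (if looser) replacement.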

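1.1 Proof of the Chernoff Upper Bound

The upper bound of the Chernoff Bound states:

\[ \Pr(X > (1+\varepsilon)\mu) \le e^{-\mu\varepsilon^2/3} \]

Proof. For any $t > 0$,

\[ \Pr(X \ge (1+\varepsilon)\mu) = \Pr\big(e^{tX} \ge e^{t(1+\varepsilon)\mu}\big) \le \frac{E[e^{tX}]}{e^{t(1+\varepsilon)\mu}} \quad \text{by the Markov Inequality.} \]

Expand $X$ in the numerator:

\begin{align*}
E[e^{tX}] &= E\big[e^{t\sum_i X_i}\big] = E\big[e^{tX_1} e^{tX_2} \cdots e^{tX_n}\big] \\
&= \prod_{i=1}^{n} E\big[e^{tX_i}\big] && \text{all $X_i$ are independent by the assumption of the Chernoff Bound} \\
&= \prod_{i=1}^{n} \big[p_i e^{t} + (1 - p_i)\big] \\
&= \prod_{i=1}^{n} \big[1 + p_i(e^{t} - 1)\big] \\
&\le \prod_{i=1}^{n} e^{p_i(e^{t} - 1)} && \text{because } e^{x} \ge 1 + x \\
&= e^{\sum_{i=1}^{n} p_i (e^{t} - 1)} = e^{(e^{t} - 1)\mu}
\end{align*}

Using the simplified numerator in the bound yields

\[ \Pr(X \ge (1+\varepsilon)\mu) \le \frac{e^{(e^{t} - 1)\mu}}{e^{t(1+\varepsilon)\mu}}. \]

Differentiating to find the $t$ that minimizes the right-hand side gives $t = \ln(1+\varepsilon)$. Returning to the upper bound with this $t$:

\begin{align*}
\Pr(X \ge (1+\varepsilon)\mu) &\le \frac{e^{(e^{\ln(1+\varepsilon)} - 1)\mu}}{e^{(1+\varepsilon)\ln(1+\varepsilon)\mu}} = \left( \frac{e^{\varepsilon}}{(1+\varepsilon)^{(1+\varepsilon)}} \right)^{\mu} \\
&= e^{-\mu[(1+\varepsilon)\ln(1+\varepsilon) - \varepsilon]} \\
&= e^{-\mu\left[(1+\varepsilon)\left(\varepsilon - \frac{\varepsilon^2}{2} + \frac{\varepsilon^3}{3} - \cdots\right) - \varepsilon\right]} && \text{Taylor expansion of } \ln(1+\varepsilon) \\
&= e^{-\mu\left[\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{6} + \frac{\varepsilon^4}{12} - \cdots\right]} \\
&\le e^{-\mu\left[\frac{\varepsilon^2}{2} - \frac{\varepsilon^3}{6}\right]} && \text{alternating series with decreasing terms, } 0 < \varepsilon < 1 \\
&= e^{-\mu\frac{\varepsilon^2}{2}\left(1 - \frac{\varepsilon}{3}\right)} \\
&\le e^{-\mu\varepsilon^2/3} && \text{since } 1 - \tfrac{\varepsilon}{3} \ge \tfrac{2}{3} \text{ for } \varepsilon < 1
\end{align*}

which is the upper bound of the Chernoff Bound.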
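The minimization over $t$ used in the proof can be spelled out as a short worked derivation. It uses only quantities already defined in the proof; the name $f(t)$ for the exponent is introduced here for convenience and is not notation from the lecture.

```latex
% f(t) is the exponent of the bound e^{(e^t - 1)\mu} / e^{t(1+\varepsilon)\mu}
\begin{align*}
f(t)   &= (e^{t} - 1)\mu - t(1+\varepsilon)\mu \\
f'(t)  &= e^{t}\mu - (1+\varepsilon)\mu = 0 \iff t = \ln(1+\varepsilon) \\
f''(t) &= e^{t}\mu > 0 \quad \text{(so the stationary point is a minimum)} \\
f(\ln(1+\varepsilon)) &= \varepsilon\mu - (1+\varepsilon)\ln(1+\varepsilon)\,\mu
\end{align*}
```

Since $\varepsilon > 0$, the minimizer $t = \ln(1+\varepsilon)$ is positive, as the Markov step requires.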

The lower bound of the Chernoff Bound can be proved with an argument similar to the proof of the upper bound.

2 Universal Hash Family

A family of hash functions $H = \{h \mid h : [N] \to [M]\}$ is called a pairwise independent family of hash functions if for all $i \ne j \in [N]$ and any $k, l \in [M]$

\[ \Pr_{h \in H}[h(i) = k \wedge h(j) = l] = \frac{1}{M^2} \qquad (1) \]

A family with this property is also called a strongly universal hash family. A hash function is pairwise independent if property (1) holds. This definition can be extended to k-wise independent hash functions as well. K-wise independent hash functions are important because they allow for efficient construction of hash families, whereas fully independent hash functions generally require a large amount of space.

Hash functions from such a family are uniform over $[M]$:

\[ \Pr_{h \in H}[h(i) = k] = \frac{1}{M} \qquad (2) \]

A family that satisfies

\[ \Pr_{h \in H}[h(i) = h(j)] \le \frac{1}{M} \qquad (3) \]

is called a weakly universal hash family.

To construct a pairwise independent hash family: let $p$ be a prime. For any $a, b \in Z_p = \{0, 1, 2, \ldots, p-1\}$, define $h_{a,b} : Z_p \to Z_p$ by $h_{a,b}(x) = ax + b \bmod p$. The resulting collection of functions $H = \{h_{a,b} \mid a, b \in Z_p\}$ is a pairwise independent hash family.

3 Counting Distinct Items

Given a stream of data $a$, find the total number of distinct items in the stream. For the purpose of this discussion, we assume the stream is too large to be stored in main memory.

\[ a = a_1 a_2 \ldots a_m, \qquad a_i = (j, \mu) \text{ where } j \in [1, n] \text{ and } \mu = 1 \]

Here $m$ is the number of elements in the stream and $n$ is the maximum number of distinct elements that could be in the stream.

The goal is to find the actual number of distinct elements, $DE$. However, because we cannot store $a$ in main memory, we must approximate $DE$. This approximation will be denoted $\widehat{DE}$. We want to find $\widehat{DE}$ such that the following constraint holds with probability $(1 - \delta)$.

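\[ (1 - \varepsilon)DE \le \widehat{DE} \le DE(1 + \varepsilon) \quad \text{for } \varepsilon > 0 \qquad (4) \]

4 Algorithm 1 - Count Distinct Items

The following algorithm attempts to guess the actual value of $DE$ by looping through exponentially growing values of $t$. For each guess, the algorithm calls ESTIMATE, which returns YES if there are at least $t$ distinct values and NO otherwise. ESTIMATE returns the correct answer with probability $(1 - \delta')$, as we will see later. After the for loop, we have a list of YES/NO values, one for each $t$. The algorithm returns the value of $t$ at which the answers switch from YES to NO.

Algorithm 1 COUNT-DISTINCT-ITEMS[$a, \varepsilon, \delta$]
  $\varepsilon' \leftarrow \varepsilon/2$
  for $t = 1, (1+\varepsilon'), (1+\varepsilon')^2, \ldots, (1+\varepsilon')^{\log_{1+\varepsilon'} n}$ do   {Run in parallel}
    $\delta' \leftarrow \varepsilon'\delta / \log n$
    $b_t \leftarrow$ ESTIMATE($a, t, \varepsilon', \delta'$)   {$b_t$ is a boolean variable YES/NO}
  end for
  return the smallest value of $t$ such that $b_{t-1} = $ YES and $b_t = $ NO, where $b_{t-1}$ denotes the answer for the guess preceding $t$
  if no such $t$ exists, return $n$

Below is an example of the output produced by the for loop in Algorithm 1. This is the likely output in the case where $(1+\varepsilon') \le DE < (1+\varepsilon')^2$.

  $t = 1$                      YES
  $t = (1+\varepsilon')$       YES
  $t = (1+\varepsilon')^2$     NO
  $t = (1+\varepsilon')^3$     NO
  ...
  $t = n$                      NO

As the example illustrates, the resulting $\widehat{DE}$ satisfies the constraint $(1 - \varepsilon)DE \le \widehat{DE} \le DE(1 + \varepsilon)$.

Proof. For each $t$, we get the correct result with probability at least $1 - \delta'$, where $\delta' = \varepsilon'\delta/\log n$, and there are $\log_{1+\varepsilon'} n$ different values of $t$.

\begin{align*}
\Pr(\text{error for a given } t) &\le \delta' \\
\Pr(\text{error in at least one } t) &\le \sum_t \Pr(\text{error for } t) \le \log_{1+\varepsilon'} n \cdot \delta' \approx \frac{1}{\varepsilon'} \log n \cdot \delta' = \delta \\
\Pr(\text{no error in any } t) &\ge 1 - \delta
\end{align*}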
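The guessing loop of COUNT DISTINCT ITEMS can be sketched in Python as follows. This is only an illustration of the control flow: `exact_oracle` is a hypothetical stand-in for ESTIMATE that stores the whole stream (which a real streaming algorithm cannot do), the guesses are evaluated sequentially rather than in parallel, and all function names are assumptions made for this sketch.

```python
import math


def guesses(n, eps_prime):
    """Yield t = 1, (1+eps'), (1+eps')^2, ... below n, followed by n itself."""
    t = 1.0
    while t < n:
        yield t
        t *= 1 + eps_prime
    yield float(n)


def exact_oracle(stream, t, eps_prime, delta_prime):
    """Hypothetical stand-in for ESTIMATE: exact answer, not a streaming algorithm."""
    return len(set(stream)) >= t


def count_distinct(stream, n, eps, delta, estimate=exact_oracle):
    """Return the first guess t that answers NO right after a guess that answered YES."""
    eps_prime = eps / 2
    delta_prime = eps_prime * delta / math.log(n)  # per-guess failure budget (n > 1)
    previous_yes = False
    for t in guesses(n, eps_prime):
        answer = estimate(stream, t, eps_prime, delta_prime)
        if previous_yes and not answer:
            return t
        previous_yes = answer
    return n  # no YES-to-NO switch was observed


if __name__ == "__main__":
    stream = [i % 100 for i in range(5000)]  # 100 distinct items
    print(count_distinct(stream, n=1000, eps=0.2, delta=0.05))
```

With the exact oracle, the returned guess overshoots $DE$ by a factor of at most $(1 + \varepsilon')$, which is inside the $(1 \pm \varepsilon)$ window of constraint (4).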

5 Algorithm 2 - ESTIMATE

ESTIMATE randomly selects $\frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$ hash functions from a fully independent hash family. Each hash function $h$ is of the form $h : [1 \ldots n] \to [1 \ldots t]$. For each hash function, we compute the hash value of every element in the stream. If the hash function ever returns 1, that run records YES for this $t$; otherwise it records NO. Finally, we count the number of NO values; if the count is greater than or equal to $\frac{1}{e} \cdot \frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$, ESTIMATE returns NO, otherwise it returns YES. ESTIMATE returns the correct answer with probability $(1 - \delta')$ because $\frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$ hash functions are used and their answers are combined by this threshold vote, which reduces the impact of the randomness in any single hash function.

Algorithm 2 ESTIMATE($a, t, \varepsilon', \delta'$)
  count $\leftarrow 0$
  for $i = 1, \ldots, \frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$ do   {run in parallel}
    Select a hash function $h_i$ uniformly at random from a fully independent hash family $H$
    $b_t^i \leftarrow$ NO
    repeat
      Consider the current element in the stream $a$, say $(j, \mu)$
      if $h_i(j) = 1$ then
        $b_t^i \leftarrow$ YES, BREAK
      end if
    until $a$ is exhausted
    if $b_t^i = $ NO then
      count $\leftarrow$ count $+ 1$
    end if
  end for
  if count $\ge \frac{1}{e} \cdot \frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$ then
    return NO
  else
    return YES
  end if

Proof. The goal is to return YES when $DE > (1 + \varepsilon')t$ and to return NO when $DE < (1 - \varepsilon')t$. Let $h_i$ be the hash function used in the $i$-th run through the for loop. There are $k$ runs, where $k = \frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$.

\[ \Pr(h_i(j) = 1) = \frac{1}{t} \quad \text{by definition of } h \]

\[ \Pr(\text{return NO for the } i\text{-th run}) = \Pr(\text{none of the distinct elements are mapped to 1 by } h_i) = \left(1 - \frac{1}{t}\right)^{DE} \]
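The ESTIMATE procedure above can be sketched in Python as follows. Several things here are assumptions made for illustration and are not from the lecture: the hash functions are drawn from the pairwise independent family $h_{a,b}(x) = ax + b \bmod p$ (reduced into $[1..t]$) rather than from a fully independent family, so the analysis in these notes does not formally apply to this code; the constant `c = 4` and the prime `P` are arbitrary choices; `t` is taken to be a positive integer; and the k runs are simulated in a single pass over the stream.

```python
import math
import random

P = 2_147_483_647  # prime 2^31 - 1; assumed to exceed every item id in the stream


def make_hash(t, rng):
    """Draw h(x) = ((a*x + b) mod P) mod t + 1, which maps items into [1..t]."""
    a = rng.randrange(P)
    b = rng.randrange(P)
    return lambda x: ((a * x + b) % P) % t + 1


def estimate(stream, t, eps, delta, c=4, rng=None):
    """Return True (YES) if fewer than k/e of the k runs answer NO, else False (NO)."""
    rng = rng or random.Random()
    k = math.ceil(c / eps**2 * math.log(1 / delta))  # number of hash functions
    hashes = [make_hash(t, rng) for _ in range(k)]
    hit = [False] * k  # hit[i] becomes True once h_i maps some item to 1
    for item in stream:  # one pass; the k runs proceed side by side
        for i, h in enumerate(hashes):
            if not hit[i] and h(item) == 1:
                hit[i] = True
    no_count = hit.count(False)  # runs whose answer is NO
    return no_count < k / math.e


if __name__ == "__main__":
    rng = random.Random(0)
    print(estimate(range(1000), 10, 0.5, 0.1, rng=rng))           # many more than t distinct items
    print(estimate([1, 2, 3, 4, 5] * 20, 1000, 0.5, 0.1, rng=rng))  # far fewer than t distinct items
```

Each run needs only the pair (a, b) and one bit of state, which is what makes the k parallel runs cheap to store.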

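Lemma 2. Consider the $i$-th round of ESTIMATE($a, t, \varepsilon', \delta'$) for any $i \in \left[\frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}\right]$.

If $DE > (1 + \varepsilon')t$ and $\varepsilon' < 1$ then $\Pr[b_t^i = \text{NO}] \le \frac{1}{e} - \frac{\varepsilon'}{2e}$:

\begin{align*}
\Pr(i\text{-th run returns NO}) &\le \left(1 - \frac{1}{t}\right)^{(1+\varepsilon')t} \\
&\approx e^{-(1+\varepsilon')} && \text{when $t$ is large} \\
&= \frac{1}{e}\left(1 - \varepsilon' + \frac{\varepsilon'^2}{2} - \cdots\right) \\
&\le \frac{1}{e} - \frac{\varepsilon'}{e} + \frac{\varepsilon'^2}{2e} \\
&\le \frac{1}{e} - \frac{\varepsilon'}{2e}
\end{align*}

If $DE < (1 - \varepsilon')t$ and $\varepsilon' < 1$ then $\Pr[b_t^i = \text{NO}] \ge \frac{1}{e} + \frac{\varepsilon'}{e}$:

\[ \Pr(i\text{-th run returns NO}) \ge \left(1 - \frac{1}{t}\right)^{(1-\varepsilon')t} \approx e^{-(1-\varepsilon')} \ge \frac{1}{e} + \frac{\varepsilon'}{e} \quad \text{by the same logic as above.} \]

Lemma 3. This lemma bounds the error of Algorithm 2.
If $DE > (1 + \varepsilon')t$ then $\Pr[b_t = \text{NO}] \le \frac{\delta'}{2}$.
If $DE < (1 - \varepsilon')t$ then $\Pr[b_t = \text{YES}] \le \frac{\delta'}{2}$.

Proof (first case). $\Pr(\text{algorithm returns NO}) = \Pr\left(x > \frac{k}{e}\right)$, because we return NO if more than $\frac{k}{e} = \frac{1}{e} \cdot \frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$ runs return NO.

Define a random variable $x_i = 1$ if the $i$-th run returns NO, and $x_i = 0$ otherwise. Let $x = \sum_{i=1}^{k} x_i$. Then

\[ E[x] = \sum_{i=1}^{k} E[x_i] = \sum_{i=1}^{k} \Pr(x_i = 1) = \sum_{i=1}^{k} \Pr(i\text{-th run returns NO}) \le k\left(\frac{1}{e} - \frac{\varepsilon'}{2e}\right) \quad \text{by Lemma 2.} \]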

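Rewrite $\Pr\left(x > \frac{k}{e}\right)$ in the form of the Chernoff Bound, $\Pr(x > (1 + \varepsilon'')E[x])$. Using the bound $k\left(\frac{1}{e} - \frac{\varepsilon'}{2e}\right)$ on $E[x]$ from Lemma 2, $\varepsilon''$ is chosen so that

\[ (1 + \varepsilon'')\, k\left(\frac{1}{e} - \frac{\varepsilon'}{2e}\right) = \frac{k}{e}, \quad \text{that is} \quad (1 + \varepsilon'')\left(1 - \frac{\varepsilon'}{2}\right) = 1, \quad \text{so} \quad \varepsilon'' = \frac{\varepsilon'}{2 - \varepsilon'} \ge \frac{\varepsilon'}{2}. \]

Then

\[ \Pr\left(x > \frac{k}{e}\right) \le e^{-\varepsilon''^2 \mu / 3} \le e^{-c_1 \varepsilon'^2 k} \le \frac{\delta'}{2} \]

for some constant $c_1 > 0$, using $k = \frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$ with $c$ sufficiently large.

The lower bound can be demonstrated with logic similar to the proof of the upper bound above. This shows that with enough runs, namely $\frac{c}{\varepsilon'^2} \log \frac{1}{\delta'}$, the probability of error can be reduced to a sufficient level.

Lemma 4. If $|DE - t| > \varepsilon' t$ then $\Pr[\text{error}] \le \delta'$.

Using the Union Bound, the total $\Pr[\text{error}]$ cannot exceed the sum of the $\Pr[\text{error}]$ of the lower bound and the $\Pr[\text{error}]$ of the upper bound: $\frac{\delta'}{2} + \frac{\delta'}{2} = \delta'$.

Lemma 5. Over all $t$ such that $|DE - t| > \varepsilon' t$, $\Pr[\text{error}] \le \delta$.

Theorem 6. Algorithm 1 returns an estimate of $DE$ within $(1 \pm \varepsilon)$ with probability $(1 - \delta)$.

Theorem 6 shows that this algorithm for counting distinct items achieves our goal of computing $\widehat{DE}$ under the accuracy constraint

\[ (1 - \varepsilon)DE \le \widehat{DE} \le DE(1 + \varepsilon) \quad \text{for } \varepsilon > 0 \]

and does so with probability $(1 - \delta)$.

6 Space and Time Complexity of Count Distinct Items

Space Complexity: $O\left(\frac{1}{\varepsilon^3} \log n \left(\log \frac{1}{\delta} + \log\log n + \log \frac{1}{\varepsilon}\right)\right)$

Time Complexity: $O\left(\frac{1}{\varepsilon^3} \log n \left(\log \frac{1}{\delta} + \log\log n + \log \frac{1}{\varepsilon}\right)\right)$

Ignoring constants, there are $\frac{1}{\varepsilon} \log n$ copies of ESTIMATE (one per guess $t$), and each run inside a copy needs to store 1 bit. The space complexity of ESTIMATE is

\[ \frac{1}{\varepsilon^2} \log \frac{\log n}{\varepsilon\delta}. \]

Expanding this space complexity yields

\[ \frac{1}{\varepsilon^2}\left(\log\log n + \log \frac{1}{\varepsilon} + \log \frac{1}{\delta}\right). \]

Combining the space complexity and the number of copies yields the total space complexity: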

\[ O\left(\frac{1}{\varepsilon^3} \log n \left(\log \frac{1}{\delta} + \log\log n + \log \frac{1}{\varepsilon}\right)\right) \qquad (5) \]

The time complexity can be computed in the same way as the space complexity. In practice, the dependence of space and time on $\frac{1}{\varepsilon^3}$ is generally problematic. The optimal lower bound on the space complexity of counting distinct items in a stream was shown to be $\Omega\left(\frac{1}{\varepsilon^2} + \log n\right)$.

References

[1] Daniel M. Kane, Jelani Nelson and David P. Woodruff. An Optimal Algorithm for the Distinct Elements Problem. PODS 2010.
