Hierarchical Sampling from Sketches: Estimating Functions over Data Streams


Sumit Ganguly^1 and Lakshminath Bhuvanagiri^2
^1 Indian Institute of Technology, Kanpur
^2 Google Inc., Bangalore

Abstract. We present a randomized procedure named Hierarchical Sampling from Sketches (HSS) that can be used for estimating a class of functions over the frequency vector f of update streams, namely functions of the form Ψ(S) = Σ_{i=1}^n ψ(|f_i|). We illustrate this by applying the HSS technique to design nearly space-optimal algorithms for estimating the p-th moment of the frequency vector, for real p ≥ 2, and for estimating the entropy of a data stream.^3

^3 A preliminary version of this paper appeared as the following conference publications: "Simpler algorithm for estimating frequency moments of data streams", Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh and Chandan Saha, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2006; and "Estimating Entropy over Data Streams", Lakshminath Bhuvanagiri and Sumit Ganguly, Proceedings of the European Symposium on Algorithms, Springer LNCS Volume 4168, 2006.

1 Introduction

A variety of applications in diverse areas, such as networking, database systems, sensor networks and web applications, share some common characteristics: data is generated rapidly and continuously, and must be analyzed in real time and in a single pass over the data to identify large trends, anomalies, user-defined exception conditions, etc. Furthermore, it is frequently sufficient to continuously track the big picture, or an aggregate view of the data. In this context, efficient and approximate computation with bounded error probability is often acceptable. The data stream model is a computational model for such applications, where incoming data is processed in an online fashion using sub-linear space.

1.1 The data stream model

A data stream S is viewed as a sequence of records of the form (pos, i, v), where pos is the index of the record in the sequence, i is the identity of an item in [1, n] = {1, ..., n}, and v is the change to the frequency of the item. v > 0 indicates an insertion of multiplicity v, while v < 0 indicates a corresponding deletion. The frequency of an item i, denoted by f_i, is the sum of the changes to the frequency of i since the inception of the stream, that is, f_i = Σ_{(pos,i,v) appears in S} v.

The following variations of the data stream model have been considered in the research literature.

1. The insert-only model, where data streams do not have deletions, that is, v > 0 for all records. The unit insert-only model is a special case of the insert-only model, where v = 1 for all records.
2. The strict update model, where insertions and deletions are allowed, subject to the constraint that f_i ≥ 0 for all i ∈ [1, n].
3. The general update model, where no constraints are placed on insertions and deletions.
4. The sliding window model, where a window size parameter W is given and only the portion of the stream that has arrived within the last W time units is considered to be relevant. Records that are not part of the current window are deemed to have expired.

In the data stream model, an algorithm must perform its computations in an online manner, that is, the algorithm gets to view each stream record exactly once. Further, the computation is constrained to use sub-linear space, that is, o(n log F_1) bits, where F_1 = Σ_{i=1}^n |f_i|. This implies that the stream cannot be stored in its entirety and summary structures must be devised to solve the problem.

We say that an algorithm estimates a quantity C with ɛ-accuracy and probability 1 − δ if it returns an estimate Ĉ satisfying |Ĉ − C| ≤ ɛ|C| with probability 1 − δ. The probability is assumed to hold for every instance of input data and is taken over the random coin tosses used by the algorithm. More simply, we say that a randomized algorithm estimates C with ɛ-accuracy if it returns an estimate satisfying |Ĉ − C| ≤ ɛ|C| with constant probability greater than 1/2 (e.g., 7/8). Such an estimator can be used to obtain another estimator for C that is ɛ-accurate and correct with probability 1 − δ by following the standard procedure of returning the median of s = O(log(1/δ)) independent such estimates Ĉ_1, ..., Ĉ_s. (A minimal code sketch of this boosting step is given at the end of this subsection.)

In this paper, we consider two basic problems in the data streaming model, namely, estimating the moments of the frequency vector of a data stream and estimating the entropy of a data stream. We first define the problems and review the research literature.
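The following is a minimal sketch, in Python, of the median-boosting step just described; the function name and the particular constant used for s are illustrative choices of this sketch rather than taken from the paper.

```python
import math
import statistics

def boost(single_estimate, delta):
    """Median of s = O(log(1/delta)) independent estimates.  Each call to
    single_estimate() is assumed to be epsilon-accurate with probability
    >= 7/8; a standard Chernoff-bound argument then shows that the median
    is epsilon-accurate with probability >= 1 - delta."""
    s = 4 * math.ceil(math.log(1.0 / delta)) + 1   # odd, so the median is an actual estimate
    return statistics.median(single_estimate() for _ in range(s))
```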

1.2 Previous work on estimating F_p for p ≥ 2

For any real p ≥ 0, the p-th moment of the frequency vector of the stream S is defined as

F_p(S) = Σ_{i=1}^n |f_i|^p.

The problem of estimating F_p has led to a number of advancements in the design of algorithms and lower bound techniques for data stream computation. It was first introduced in [1], which also presented the first sub-linear space randomized algorithm for estimating F_p, for p > 1, with ɛ-accuracy, using space O((1/ɛ^2) n^{1−1/p} log F_1) bits. For the special case of F_2, the seminal sketch technique was presented in [1]; it uses space O((1/ɛ^2) log F_1) bits for estimating F_2 with ɛ-accuracy. The work in [9, 13] reduced the space requirement for ɛ-accurate estimation of F_p, for p > 2, to O((1/ɛ^2) n^{1−1/(p−1)} log F_1).^4 The space requirement was further reduced in [12] to O(n^{1−2/(p+1)} log F_1), for p > 2 and p integral. A space lower bound of Ω(n^{1−2/p}) for estimating F_p, for p > 2, was shown in a series of contributions [1, 2, 7] (see also [20]). Finally, Indyk and Woodruff [18] presented the first algorithm for estimating F_p, for real p > 2, that matches the above space lower bound up to poly-logarithmic factors. The 2-pass algorithm of Indyk and Woodruff requires space O((1/ɛ^{12}) n^{1−2/p} (log^2 n)(log^6 F_1)) bits. The 1-pass data streaming algorithm derived from the 2-pass algorithm further increases the constant and poly-logarithmic factors. The work in [24] presents an Ω(1/ɛ^2) space lower bound for ɛ-accurate estimation of F_p, for any real p ≥ 1 and ɛ ≥ 1/√n.

1.3 Previous work on estimating entropy

The entropy H of a data stream is defined as

H = Σ_{i: f_i ≠ 0} (|f_i|/F_1) log(F_1/|f_i|).

It is a measure of the information-theoretic randomness, or the incompressibility, of the data stream. A value of the entropy close to log n indicates that the frequencies in the stream are nearly uniformly distributed, whereas low values indicate patterns in the data. Monitoring changes in the entropy of a network traffic stream has been used to detect anomalies [15, 22, 25]. The work in [16] presents an algorithm for estimating H over unit insert-only streams to within an α-approximation (i.e., an estimate Ĥ with H/a ≤ Ĥ ≤ bH and ab ≤ α) using space O(n^{1/α}/H). The work in [5] presents an algorithm for estimating H to within α-approximation over insert-only streams using space O((1/ɛ^2) F_1^{2/α} (log F_1)^2 (log F_1 + log n)). Subsequent to the publication of the conference version [3] of our algorithm, a different algorithm for estimating the entropy of insert-only streams was presented in [6]. The algorithm of [6] requires space O((1/ɛ^2)(log n)(log F_1)(log F_1 + log(1/ɛ))). The work in [6] also shows a space lower bound of Ω(1/(ɛ^2 log(1/ɛ))) for estimating entropy over a data stream.

^4 The algorithm of [13] assumes p to be integral.

1.4 Contributions

In this paper we present a technique called Hierarchical Sampling from Sketches, or HSS, which is a general randomized technique for estimating a variety of functions over update data streams. We illustrate the HSS technique by applying it to derive near-optimal-space algorithms for estimating the frequency moments F_p, for real p > 2. The HSS algorithm estimates F_p, for any real p ≥ 2, to within ɛ-accuracy using space

O( (p^2/ɛ^{2+4/p}) n^{1−2/p} (log F_1)^2 (log n + log log F_1)^2 ).

Thus, the space upper bound of this algorithm matches the lower bound up to factors that are poly-logarithmic in F_1 and n and polynomial in 1/ɛ. The expected time required to process each stream update is O(log n + log log F_1) operations. The algorithm essentially uses the idea of Indyk and Woodruff [18] of classifying items into groups based on frequency. However, Indyk and Woodruff define groups whose boundaries are randomized; in our algorithm, the group boundaries are deterministic.

The HSS technique is also used to design an algorithm that estimates the entropy of general update data streams to within ɛ-accuracy using space

O( ((log F_1)^2/(ɛ^3 log(1/ɛ))) (log n + log log F_1)^2 ).

Organization. The remainder of the paper is organized as follows. In Section 2, we review relevant algorithms from the research literature. In Section 3, we present the HSS technique for estimating a class of data stream metrics. Sections 4 and 5 use the HSS technique to estimate frequency moments and entropy, respectively, of a data stream. Finally, we conclude in Section 6.

2 Preliminaries

In this section, we review the COUNTSKETCH and the COUNT-MIN algorithms for finding frequent items in a data stream. We also review algorithms to estimate the residual second moment of a data stream [14]. For completeness, we also present an algorithm for estimating the residual first moment of a data stream.

Given a data stream, rank(r) denotes an item with the r-th largest absolute value of frequency, where ties are broken arbitrarily. We say that an item i has rank r if rank(r) = i. For a given value of k, 1 ≤ k ≤ n, the set top(k) is the set of items with rank at most k. The residual second moment [8] of a data stream, denoted by F_2^{res}(k), is defined as the second moment of the stream after the top-k frequencies have been removed, that is,

F_2^{res}(k) = Σ_{r>k} f_{rank(r)}^2.

The residual first moment [10] of a data stream, denoted by F_1^{res}(k), is analogously defined as the ℓ_1 norm of the data stream after the top-k frequencies have been removed, that is,

F_1^{res}(k) = Σ_{r>k} |f_{rank(r)}|.

Sketches. A linear sketch [1] is a random integer X = Σ_i f_i x_i, where x_i ∈ {−1, +1} for i ∈ [1, n], and the family of variables {x_i} is either pair-wise or 4-wise independent, depending on the use of the sketches. The family of random variables {x_i} is referred to as the linear sketch basis. For any d ≥ 2, a d-wise independent linear sketch basis can be constructed in a pseudo-random manner from a truly random seed of size O(d log n) bits as follows. Let F be a field of characteristic 2 and of size at least n + 1. Choose a degree d−1 polynomial g : F → F whose coefficients are randomly chosen from F [4, 23]. Define x_i to be +1 if the first bit of g(i) is 1, and define x_i to be −1 otherwise. The d-wise independence of the x_i's follows from Wegman and Carter's universal hash functions [23].

Linear sketches were pioneered in [1] to obtain an elegant and efficient algorithm for returning an estimate of the second moment F_2 of a stream to within a factor of (1 ± ɛ) with probability 7/8. Their procedure keeps s = O(1/ɛ^2) independent linear sketches (i.e., sketches using independent sketch bases) X_1, X_2, ..., X_s, where the sketch basis used for each X_j is four-wise independent. The algorithm returns F̂_2 as the average of the squares of the sketches, that is, F̂_2 = (1/s) Σ_{j=1}^s X_j^2. A crucial property observed in [1] is that E[X_j^2] = F_2 and Var[X_j^2] ≤ 5F_2^2. In the remainder of the paper, we abbreviate linear sketches as simply sketches.

COUNTSKETCH algorithm for estimating frequency. Pair-wise independent sketches are used in [8] to design the COUNTSKETCH algorithm for estimating the frequency of any given item i ∈ [1, n] of a stream. The data structure consists of a collection of s = O(log(1/δ)) independent hash tables U_1, U_2, ..., U_s, each consisting of 8k buckets. A pair-wise independent hash function h_j : [1, n] → {1, 2, ..., 8k} is associated with each hash table and maps items randomly to one of the 8k buckets. Additionally, for each table index j = 1, 2, ..., s, the algorithm keeps a pair-wise independent family of random variables {x_{ij}}, where each x_{ij} ∈ {−1, +1} with equal probability. Each bucket keeps a sketch of the sub-stream that maps to it, that is, U_j[r] = Σ_{i: h_j(i)=r} f_i x_{ij}, for r ∈ {1, 2, ..., 8k} and j ∈ {1, 2, ..., s}. An estimate f̂_i is returned as follows: f̂_i = median_{j=1,...,s} U_j[h_j(i)] x_{ij}. The accuracy of estimation is stated as a function of the residual second moment, via the quantity [8]

Δ(s, A) := 8 (F_2^{res}(s)/A)^{1/2}.
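Before stating the space-accuracy guarantee, we give a minimal illustrative implementation of the COUNTSKETCH update and point-query steps in Python. The class and parameter names are ours, and Python's salted hash() stands in for the pair-wise independent hash and sign families; a faithful implementation would use the polynomial constructions over a finite field described above.

```python
import random
from statistics import median

class CountSketch:
    """Minimal CountSketch: s hash tables of b buckets; each bucket stores a
    signed linear sketch of the frequencies that hash into it."""
    def __init__(self, s, b, seed=0):
        rng = random.Random(seed)
        self.s, self.b = s, b
        # per-table salts standing in for pair-wise independent hash/sign functions
        self.salts = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(s)]
        self.tables = [[0] * b for _ in range(s)]

    def _bucket(self, j, i):
        return hash((self.salts[j][0], i)) % self.b

    def _sign(self, j, i):
        return 1 if hash((self.salts[j][1], i)) & 1 else -1

    def update(self, i, v):
        """Process a stream record (pos, i, v): add v to item i's frequency."""
        for j in range(self.s):
            self.tables[j][self._bucket(j, i)] += v * self._sign(j, i)

    def estimate(self, i):
        """Point query: the median over tables of the signed bucket contents."""
        return median(self.tables[j][self._bucket(j, i)] * self._sign(j, i)
                      for j in range(self.s))
```

The COUNT-MIN structure reviewed below differs only in that its buckets store plain (unsigned) sums and the estimate is the median of the bucket counters.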

The space versus accuracy guarantee of the COUNTSKETCH algorithm is presented in Theorem 1.

Theorem 1 ([8]). Let Δ = Δ(k, 8k). Then, for each i ∈ [1, n], Pr{ |f̂_i − f_i| ≤ Δ } ≥ 1 − δ. The space used is O(k (log(1/δ))(log F_1)) bits and the time taken to process a stream update is O(log(1/δ)).

COUNT-MIN sketch for estimating frequency. The COUNT-MIN algorithm [10] for estimating frequencies keeps a collection of s = O(log(1/δ)) independent hash tables T_1, T_2, ..., T_s, where each hash table T_j has b = 2k buckets and uses a pair-wise independent hash function h_j : [1, n] → {1, ..., 2k}, for j = 1, 2, ..., s. The bucket T_j[r] is an integer counter that maintains the sum T_j[r] = Σ_{i: h_j(i)=r} f_i. The estimated frequency f̂_i is obtained as f̂_i = median_{j=1,...,s} T_j[h_j(i)]. The space versus accuracy guarantee for the COUNT-MIN algorithm is given in terms of the quantity F_1^{res}(k) = Σ_{r>k} |f_{rank(r)}|.

Theorem 2 ([10]). For each i ∈ [1, n], Pr{ |f̂_i − f_i| ≤ F_1^{res}(k)/k } ≥ 1 − δ. The space used is O(k (log(1/δ))(log F_1)) bits and the time taken to process each stream update is O(log(1/δ)).

Estimating F_2^{res}. The work in [14] presents an algorithm to estimate F_2^{res}(s) to within an accuracy of (1 ± ɛ) with confidence 1 − δ using space O((s/ɛ^2)(log F_1)(log(n/δ))) bits. The data structure used is identical to the COUNTSKETCH structure. The algorithm basically removes the top-s estimated frequencies from the COUNTSKETCH structure and then estimates F_2. The COUNTSKETCH structure is used to find the top-k items with respect to the absolute values of their estimated frequencies. Let |f̂_{τ_1}| ≥ ... ≥ |f̂_{τ_k}| denote the top-k estimated frequencies. Next, the contributions of these estimates are removed from the structure, that is, U_j[r] := U_j[r] − Σ_{t: h_j(τ_t)=r} f̂_{τ_t} x_{j τ_t}. Subsequently, the FASTAMS algorithm [21], a variant of the original sketch algorithm [1], is used to estimate the second moment as follows: F̂_2^{res} = median_{j=1,...,s} Σ_{r=1}^{8k} (U_j[r])^2. If k = O(ɛ^{-2} s), then |F̂_2^{res} − F_2^{res}| ≤ ɛ F̂_2^{res} [14].

Lemma 1 ([14]). For a given integer k ≥ 1 and 0 < ɛ < 1, there exists an algorithm for update streams that returns an estimate F̂_2^{res}(k) satisfying |F̂_2^{res}(k) − F_2^{res}(k)| ≤ ɛ F_2^{res}(k) with probability 1 − δ, using space O((k/ɛ^2)(log(F_1/δ))(log F_1)) bits.

Estimating F_1^{res}. An argument similar to the one used to estimate F_2^{res}(k) can be applied to estimate F_1^{res}(k). We will prove the following property in this subsection.

Lemma 2. For 0 < ɛ ≤ 1, there exists an algorithm for update streams that returns an estimate F̂_1^{res}(k) satisfying Pr{ |F̂_1^{res}(k) − F_1^{res}(k)| ≤ ɛ F_1^{res}(k) } ≥ 3/4. The algorithm uses O(((k log F_1)/ɛ + (log F_1)/ɛ^2)(log n + log(1/ɛ))) bits.

The algorithm is the following. Keep a COUNT-MIN sketch structure with height b, where b is a parameter to be fixed, and width w = O(log(n/δ)). In parallel, we keep a set of s = O(1/ɛ^2) sketches based on a 1-stable distribution [17]. A 1-stable sketch is a linear sketch Y_j = Σ_i f_i z_{j,i}, where the z_{j,i}'s are drawn from a 1-stable distribution, for j = 1, 2, ..., s. As shown by Indyk [17], the random variable Ŷ = median_{j=1,...,s} |Y_j| satisfies Pr{ |Ŷ − F_1| ≤ ɛ F_1 } ≥ 7/8.

The COUNT-MIN structure is used to obtain the top-k elements with respect to the absolute values of their estimated frequencies. Suppose these items are i_1, i_2, ..., i_k with |f̂_{i_1}| ≥ |f̂_{i_2}| ≥ ... ≥ |f̂_{i_k}|, and let I = {i_1, i_2, ..., i_k}. Each 1-stable sketch Y_j is updated to remove the contribution of the estimated frequencies, that is,

Y'_j = Y_j − Σ_{r=1}^k f̂_{i_r} z_{j,i_r}.

Finally, the following value is returned:

F̂_1^{res}(k) = median_{j=1,...,s} |Y'_j|.

We now analyze the algorithm. Let T = T_k = {t_1, t_2, ..., t_k} denote the set of indices of the top-k (true) frequencies, such that |f_{t_1}| ≥ |f_{t_2}| ≥ ... ≥ |f_{t_k}|. Let Δ denote the additive error of the COUNT-MIN estimates at height b, so that |f̂_i − f_i| ≤ Δ with probability 1 − δ/(4n) for each i; by Theorem 2, Δ ≤ F_1^{res}(k)/b.

Lemma 3. F_1^{res}(k) ≤ Σ_{i ∉ I} |f_i| ≤ F_1^{res}(k)(1 + 2k/b).

Proof. Since both T and I are sets of k elements each, |T − I| = |I − T|. Let i ↦ i' be an arbitrary 1-1 map from an element i ∈ T − I to an element i' ∈ I − T. Since i is a top-k frequency and i' is not, |f_{i'}| ≤ |f_i|. Further, |f̂_{i'}| ≥ |f̂_i|, since otherwise i would be among the top-k items with respect to estimated frequencies, that is, i would be in I, contradicting i ∈ T − I. The condition |f̂_{i'}| ≥ |f̂_i| implies the following:

|f_i| − Δ ≤ |f̂_i| ≤ |f̂_{i'}| ≤ |f_{i'}| + Δ, with probability 1 − δ/(4n),

or, |f_i| ≤ |f_{i'}| + 2Δ, with probability 1 − δ/(4n). Therefore, |f_{i'}| ≤ |f_i| ≤ |f_{i'}| + 2Δ.

Thus, using a union bound over all items in [1, n], we have, with probability 1 − δ/2,

Σ_{i ∈ T−I} |f_i| ≤ Σ_{i ∈ I−T} (|f_i| + 2Δ) = Σ_{i ∈ I−T} |f_i| + 2kΔ.   (1)

Let G = Σ_{i ∉ I} |f_i| = Σ_{i ∉ (I ∪ T)} |f_i| + Σ_{i ∈ T−I} |f_i|. By (1), it follows that

G ≤ Σ_{i ∉ (I ∪ T)} |f_i| + Σ_{i ∈ I−T} |f_i| + 2kΔ.

Since Σ_{i ∉ (I ∪ T)} |f_i| + Σ_{i ∈ I−T} |f_i| = F_1^{res}(k), we have

F_1^{res}(k) ≤ G ≤ F_1^{res}(k) + 2kΔ.

By Theorem 2, Δ ≤ F_1^{res}(b)/b ≤ F_1^{res}(k)/b. Therefore,

F_1^{res}(k) ≤ G ≤ F_1^{res}(k)(1 + 2k/b), with probability 1 − δ/2.

Proof (of Lemma 2). Let f' denote the frequency vector after the estimated frequencies are removed, that is,

f'_r = f_r if r ∈ [1, n] − I, and f'_r = f_r − f̂_r if r ∈ I.

Let F̃_1 denote Σ_i |f'_i|. Then, by Lemma 3, it follows that

F̃_1 = Σ_{r ∉ I} |f_r| + Σ_{i ∈ I} |f_i − f̂_i| ≤ G + kΔ ≤ F_1^{res}(k)(1 + 3k/b),

since Δ ≤ F_1^{res}(b)/b ≤ F_1^{res}(k)/b. Further,

F̃_1 ≥ Σ_{r ∉ I} |f_r| = G ≥ F_1^{res}(k) ≥ F_1^{res}(k)(1 − k/b).

Combining,

F_1^{res}(k)(1 − k/b) ≤ F̃_1 ≤ F_1^{res}(k)(1 + 3k/b)

with probability 1 − 2^{−Ω(w)}, where w is the width of the COUNT-MIN structure. By construction, the 1-stable sketches return F̂_1^{res}(k) = median_{j=1,...,s} |Y'_j| satisfying Pr{ |F̂_1^{res}(k) − F̃_1| ≤ ɛ F̃_1 } ≥ 7/8. Therefore, using a union bound,

|F̂_1^{res}(k) − F_1^{res}(k)| ≤ (3k/b) F_1^{res}(k) + ɛ F_1^{res}(k)(1 + 3k/b), with probability at least 7/8 − n 2^{−Ω(w)}.

If b ≥ 3k/ɛ, then |F̂_1^{res}(k) − F_1^{res}(k)| ≤ 3ɛ F_1^{res}(k) with probability at least 3/4. Replacing ɛ by ɛ/3 yields the first statement of the lemma.

The space requirement is calculated as follows. We use the idea of a pseudo-random generator of Indyk [17] for streaming computations. The state of the data structure (i.e., the stable sketches) is the same as if the items arrived in sorted order. Therefore, as observed by Indyk [17], Nisan's pseudo-random generator [19] can be used to simulate the randomized computation using O(S log R) random bits, where S is the space used by the algorithm (not counting the random bits used) and R is the running time of the algorithm. We apply Indyk's observation to the portion of the algorithm that estimates F̃_1. Thus, S = O((log F_1)/ɛ^2) and the running time is R = O(F_0/ɛ^2), assuming the input is presented in sorted order of items. Here, F_0 is the number of distinct items in the stream with non-zero frequency. Then the total number of random bits required is O(S log R) = O(((log F_1)/ɛ^2)(log n + log(1/ɛ))).

3 The HSS algorithm

In this section, we present a procedure for obtaining a representative sample over the input stream, which we refer to as Hierarchical Sampling over Sketches (HSS), and use it for estimating a class of metrics over data streams of the following form:

Ψ(S) = Σ_{i: f_i ≠ 0} ψ(|f_i|).   (2)

Sampling sub-streams. The HSS algorithm uses a sampling scheme as follows. From the input stream S, we create sub-streams S_0, ..., S_L such that S_0 = S and, for 1 ≤ l ≤ L, S_l is obtained from S_{l−1} by sub-sampling each distinct item appearing in S_{l−1} independently with probability 1/2. At level 0, S_0 = S. The stream S_1, corresponding to level 1, is obtained by choosing each distinct value of i independently with probability 1/2. Since the sampling is based on the identity of an item i, either all records in S with identity i are present in S_1, or none are; each of these cases holds with probability 1/2. The stream S_2,

corresponding to level 2, is obtained by sampling each distinct value of i appearing in the sub-stream S_1 with probability 1/2, independently of the other items in S_1. In this manner, S_l is a randomly sampled sub-stream of S_{l−1}, for l ≥ 1, based on the identity of the items.

The sub-sampling scheme is implemented as follows. We assume that n is a power of 2. Let h : [1, n] → [0, max(n^2, W)] be a random hash function drawn from a pair-wise independent hash family, where W ≥ 2, and let L_max = log(max(n^2, W)). Define the random function level : [1, n] → [1, L_max] as follows:

level(i) = 1 if h(i) = 0, and level(i) = lsb(h(i)) otherwise,

where lsb(x) is the position of the least significant 1 in the binary representation of x. The probability distribution of the random level function is as follows:

Pr{level(i) = l} = 1/2 + Pr{h(i) = 0} if l = 1, and 1/2^l for 2 ≤ l ≤ L_max.

All records pertaining to i are included in the sub-streams S_0 through S_{level(i)}. The sampling technique is based on the original idea of Flajolet and Martin [11] for estimating the number of distinct items in a data stream. The Flajolet-Martin scheme maps the original stream into disjoint sub-streams S_1, S_2, ..., S_{log n}, where S_l is the sequence of records of the form (pos, i, v) such that level(i) = l. The HSS technique instead creates a monotonically decreasing sequence of random sub-streams, in the sense that S_0 ⊇ S_1 ⊇ S_2 ⊇ ... ⊇ S_{L_max}, where S_l is the sequence of records for items i such that level(i) ≥ l.

At each level l ∈ {0, 1, ..., L_max}, the HSS algorithm keeps a frequency estimation data structure denoted by D_l, which takes as input the sub-stream S_l and returns an approximation to the frequencies of items that map to S_l. The D_l structure can be any standard data structure such as the COUNT-MIN sketch structure or the COUNTSKETCH structure. We use the COUNT-MIN structure for estimating entropy and the COUNTSKETCH structure for estimating F_p. Each stream update (pos, i, v) belonging to S_l is propagated to the frequent items data structures D_l for 0 ≤ l ≤ level(i). Let k(l) denote a space parameter for the data structure D_l; for example, k(l) is the size of the hash tables in the COUNT-MIN or COUNTSKETCH structures. The values of k(l) are the same for levels l = 1, 2, ..., L_max, and are twice the value for k(0). That is, if k = k(0), then k(1) = ... = k(L_max) = 2k. This non-uniformity is a technicality required by Lemma 4 and Corollary 1. We refer to k = k(0) as the space parameter of the HSS structure.
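As a concrete illustration of this sub-sampling scheme, the following Python sketch assigns levels and routes an update to the per-level structures D_0, ..., D_{level(i)}. A truly random table stands in for the pair-wise independent hash family, and the function and variable names are ours.

```python
import random

def lsb(x):
    """1-based position of the least significant set bit of x > 0."""
    return (x & -x).bit_length()

def make_level_fn(n, W, seed=0):
    """level(i) = 1 if h(i) == 0, else lsb(h(i)), for a random hash
    h : [1, n] -> [0, max(n*n, W)].  (A truly random map stands in for the
    pair-wise independent family used in the paper.)"""
    M = max(n * n, W)
    rng = random.Random(seed)
    h = {i: rng.randint(0, M) for i in range(1, n + 1)}
    return lambda i: 1 if h[i] == 0 else lsb(h[i])

def route_update(i, v, level, structures):
    """Propagate the record (pos, i, v) to D_0, ..., D_{level(i)}.
    `structures` is a list of L_max + 1 frequency-estimation structures,
    e.g. instances of the CountSketch class sketched earlier."""
    for l in range(0, min(level(i), len(structures) - 1) + 1):
        structures[l].update(i, v)
```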

Approximating f_i. Let Δ_l(k) denote the additive error of the frequency estimation by the data structure D_l at level l using space parameter k. That is, we assume that

|f̂_{i,l} − f_i| ≤ Δ_l(k) with probability 1 − 2^{−t},

where t is a parameter and f̂_{i,l} is the estimate for the frequency f_i obtained using the frequent items structure D_l(k). By Theorem 2, if D_l is instantiated by the COUNT-MIN sketch structure with height k and width O(t), then |f̂_{i,l} − f_i| ≤ F_1^{res}(k, l)/k with probability 1 − 2^{−t}. If D_l is instantiated using the COUNTSKETCH structure with height 8k and width O(t), then, by Theorem 1, it follows that |f̂_{i,l} − f_i| ≤ (8 F_2^{res}(k, l)/k)^{1/2} with probability 1 − 2^{−t}. Here, F_1^{res}(k, l) and F_2^{res}(k, l) denote the residual first and second moments, respectively, of the sub-stream S_l.

We first relate the random values F_1^{res}(k, l) and F_2^{res}(k, l) to the corresponding non-random values F_1^{res}(k) and F_2^{res}(k), respectively.

Lemma 4. For l ≥ 1 and k ≥ 2,

Pr{ F_1^{res}(k, l) ≤ F_1^{res}(2^{l−1} k)/2^{l−1} } ≥ 1 − 2e^{−k/6}.

Proof. For an item i ∈ [1, n], define an indicator variable x_i to be 1 if records corresponding to i are included in the stream at level l, namely S_l, and let x_i be 0 otherwise. Then Pr{x_i = 1} = 1/2^l. Define the random variable T_{l,k} as the number of items in [1, n] with rank at most 2^{l−1} k in the original stream S that are included in S_l, that is,

T_{l,k} = Σ_{i: rank(i) ≤ 2^{l−1} k} x_i.

By linearity of expectation, E[T_{l,k}] = 2^{l−1} k / 2^l = k/2. Applying Chernoff's bound to the sum of indicator variables T_{l,k}, we obtain

Pr{T_{l,k} > k} < e^{−k/6}.

Say that the event SPARSE(l) occurs if the element with rank k + 1 in S_l has rank larger than 2^{l−1} k in the original stream S. The argument above shows that Pr{SPARSE(l)} ≥ 1 − e^{−k/6}. Thus,

F_1^{res}(k, l) ≤ Σ_{rank(i) > 2^{l−1} k} |f_i| x_i, assuming SPARSE(l) holds.   (3)

By linearity of expectation,

E[ F_1^{res}(k, l) | SPARSE(l) ] ≤ F_1^{res}(2^{l−1} k) / 2^l.

Suppose u_l is the frequency of the item with rank k + 1 in S_l. Applying Hoeffding's bound to the sum of non-negative random variables in (3), each upper bounded by u_l, we have

Pr{ F_1^{res}(k, l) > 2 E[F_1^{res}(k, l)] | SPARSE(l) } < e^{−E[F_1^{res}(k, l)]/(3 u_l)},

or,

Pr{ F_1^{res}(k, l) > F_1^{res}(2^{l−1} k)/2^{l−1} | SPARSE(l) } < e^{−F_1^{res}(2^{l−1} k)/(3 · 2^l · u_l)}.   (4)

Assuming the event SPARSE(l), it follows that u_l ≤ f_{rank(2^{l−1} k + 1)}, and hence

u_l ≤ (1/(2^{l−2} k)) Σ_{2^{l−2} k < rank(i) ≤ 2^{l−1} k} f_{rank(i)} ≤ F_1^{res}(2^{l−2} k)/(2^{l−2} k),

that is, F_1^{res}(2^{l−2} k) ≥ u_l · 2^{l−2} k. Substituting in (4) and taking the probability of the complement event, we have

Pr{ F_1^{res}(k, l) ≤ F_1^{res}(2^{l−1} k)/2^{l−1} | SPARSE(l) } > 1 − e^{−k/6}.

Since Pr{SPARSE(l)} > 1 − e^{−k/6},

Pr{ F_1^{res}(k, l) ≤ F_1^{res}(2^{l−1} k)/2^{l−1} } ≥ (1 − e^{−k/6}) Pr{SPARSE(l)} ≥ (1 − e^{−k/6})^2 > 1 − 2e^{−k/6}.

Corollary 1. For l ≥ 1, F_2^{res}(k, l) ≤ F_2^{res}(2^{l−1} k)/2^{l−1} with probability at least 1 − 2e^{−k/6}.

Proof. Apply Lemma 4 to the frequency vector obtained by replacing f_i by f_i^2, for i ∈ [1, n].

3.1 Group definitions

Recall that at each level l, the sampled stream S_l is provided as input to a data structure D_l that, when queried, returns an estimate f̂_{i,l} for any i ∈ [1, n] satisfying |f̂_{i,l} − f_i| ≤ Δ_l with probability 1 − 2^{−t}. Here, t is a parameter that will be fixed in the analysis, and the additive error Δ_l is a function of the algorithm used by D_l (e.g., Δ_l = F_1^{res}(k)/(2^{l−1} k) for COUNT-MIN sketches and

Δ_l = (F_2^{res}(k)/(2^{l−1} k))^{1/2} for COUNTSKETCH). Fix a parameter ɛ̄, which will be closely related to the given accuracy parameter ɛ and is chosen depending on the problem; for example, in order to estimate F_p, ɛ̄ is set to ɛ/(4p). Therefore,

f̂_{i,l} = (1 ± ɛ̄) f_i, provided f_i > Δ_l/ɛ̄ and i ∈ S_l, with probability 1 − 2^{−t}.

Define the following event:

GOODEST: |f̂_{i,l} − f_i| < Δ_l, for each i ∈ S_l and l ∈ {0, 1, ..., L}.

By a union bound,

Pr{GOODEST} ≥ 1 − n(L + 1) 2^{−t}.   (5)

Our analysis will be conditioned on the event GOODEST.

Define a sequence of geometrically decreasing thresholds T_0, T_1, ..., T_L as follows:

T_l = T_0/2^l, l = 1, 2, ..., L, and 1/2 < T_L ≤ 1.   (6)

In other words, L = ⌈log T_0⌉. Note that L and L_max are distinct parameters: L_max is a data structure parameter decided prior to the run of the algorithm, whereas L is a dynamic parameter that depends on T_0 and is instantiated at the time of inference. In the next paragraph, we discuss how T_0 is chosen. The threshold values T_l are used to partition the elements of the stream into groups G_0, ..., G_L as follows:

G_0 = {i ∈ S : |f_i| ≥ T_0} and G_l = {i ∈ S : T_l < |f_i| ≤ T_{l−1}}, l = 1, 2, ..., L.

An item i is said to be discovered as frequent at level l provided i maps to S_l and f̂_{i,l} ≥ Q_l, where Q_l, l = 0, 1, 2, ..., L, is a parameter family. The values of Q_l are chosen as follows:

Q_l = T_l (1 − ɛ̄).   (7)

The space parameter k(l) is chosen at level l so that

Δ_0 = Δ_0(k) ≤ ɛ̄ Q_0, and Δ_l = Δ_l(2k) ≤ ɛ̄ Q_l, l = 1, 2, ..., L.   (8)

The choice of T_0. The value of T_0 is a critical parameter for the HSS structure, and its precise choice depends on the problem that is being solved. For example, for estimating F_p, T_0 is chosen as (F̂_2/(ɛ̄(1 − ɛ̄) k))^{1/2}. For estimating the entropy H, it is sufficient to choose T_0 as

F̂_1^{res}(k̄)/(ɛ̄(1 − ɛ̄) k), where k and k̄ are parameters of the estimation algorithm. T_0 must be chosen as small as possible subject to the following property: Δ_l ≤ ɛ̄(1 − ɛ̄) T_0/2^l. Lemma 4 and Corollary 1 show that, for the COUNT-MIN structure and the COUNTSKETCH structure, T_0 can be chosen to be as small as F_1^{res}(k)/k and (F_2^{res}(k)/k)^{1/2}, respectively (up to the factor 1/(ɛ̄(1 − ɛ̄))). Since neither F_1^{res}(k) nor F_2^{res}(k) can be computed exactly in sub-linear space, the algorithms of Lemma 1 and Lemma 2 are used to obtain 1/2-approximations^5 to the corresponding quantities. By replacing k by 2k at each level, it suffices to define T_0 in terms of F̂_1^{res}(k)/(2k) or (F̂_2^{res}(k)/(2k))^{1/2}, respectively.

3.2 Hierarchical samples

Items are sampled and placed into sampled groups Ḡ_0, Ḡ_1, ..., Ḡ_L as follows. The estimated frequency of an item i is defined as f̂_i = f̂_{i,r}, where r is the lowest level such that f̂_{i,r} ≥ Q_r. The sampled groups are defined as follows:

Ḡ_0 = {i : f̂_i ≥ T_0} and Ḡ_l = {i : T_l ≤ f̂_i < T_{l−1} and i ∈ S_l}, 1 ≤ l ≤ L.

The choices of the parameter settings satisfy the following properties. We use the following standard notation: for a, b ∈ R and a < b, (a, b) denotes the open interval defined by the set of points between a and b (end points not included), [a, b] represents the closed interval of points between a and b (both included), and [a, b) and (a, b] respectively represent the two half-open intervals.

Partition a frequency group G_l, for 1 ≤ l ≤ L − 1, into three adjacent sub-regions:

lmargin(G_l) = [T_l, T_l + ɛ̄ Q_l], l = 0, 1, ..., L − 1 (undefined for l = L),
rmargin(G_l) = [Q_{l−1} − ɛ̄ Q_{l−1}, T_{l−1}), l = 1, 2, ..., L (undefined for l = 0),
mid(G_l) = (T_l + ɛ̄ Q_l, Q_{l−1} − ɛ̄ Q_{l−1}), 1 ≤ l ≤ L − 1.

These regions respectively denote the lmargin (left margin), rmargin (right margin) and middle region of the group G_l. An item i is said to belong to one of these regions if its true frequency lies in that region. The middle regions of groups G_0 and G_L are extended to include the right and left margins, respectively. That is,

lmargin(G_0) = [T_0, T_0 + ɛ̄ Q_0) and mid(G_0) = [T_0 + ɛ̄ Q_0, F_1],
rmargin(G_L) = (Q_{L−1} − ɛ̄ Q_{L−1}, T_{L−1}) and mid(G_L) = (0, Q_{L−1} − ɛ̄ Q_{L−1}].

^5 More accurate estimates of F_2^{res} and F_1^{res} can be obtained using Lemma 1 and Lemma 2, but in our applications a constant-factor accuracy suffices.
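The threshold and group structure of Sections 3.1-3.2 is easy to state in code. The sketch below (function names are ours) computes the deterministic thresholds T_l and maps a true frequency to the index of its group G_l; it is meant only to make the definitions concrete, not as part of the streaming algorithm itself.

```python
import math

def thresholds(T0):
    """T_l = T_0 / 2**l for l = 0, ..., L, where L is chosen so that
    1/2 < T_L <= 1 (equation (6)): L = ceil(log2(T_0)) for T_0 > 1."""
    L = max(0, math.ceil(math.log2(T0)))
    return [T0 / 2 ** l for l in range(L + 1)]

def group_index(f, T):
    """Group of a true frequency: G_0 = {|f| >= T_0},
    G_l = {T_l < |f| <= T_{l-1}} for l = 1, ..., L."""
    f = abs(f)
    if f >= T[0]:
        return 0
    for l in range(1, len(T)):
        if T[l] < f <= T[l - 1]:
            return l
    return None   # |f| <= T_L: the item falls below the last threshold
```

For example, with T0 = 20 the thresholds are [20, 10, 5, 2.5, 1.25, 0.625], and group_index(7, thresholds(20)) returns 2.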

Important Convention. For clarity of presentation, from now on the description of the algorithm and the analysis use the frequencies f_i instead of |f_i|. However, the analysis remains unchanged if frequencies are negative and f_i is replaced by |f_i| throughout. The only reason for making this notational convenience is to avoid writing |·| in many places. An equivalent way of viewing this is to assume that the actual frequencies are given by an n-dimensional vector g, and the vector f is defined as the coordinate-wise absolute value of g (i.e., f_i = |g_i| for all i). It is important to note that the HSS technique is only designed to work with functions of the form Σ_{i=1}^n ψ(|g_i|). All results in this paper, and their analysis, hold for general update data streams, where item frequencies can be positive, negative or zero.

We would now like to show that the following properties hold, with probability 1 − 2^{−t} each.

1. Items belonging to the middle region of any G_l may be discovered as frequent, that is, f̂_{i,r} ≥ Q_r, only at a level r ≥ l. Further, f̂_i = f̂_{i,l}, that is, the estimate of its frequency is obtained from level l. These items are never misclassified, that is, if i belongs to some sampled group Ḡ_r, then r = l.
2. Items belonging to the right margin of G_l may be discovered as frequent at level r ≥ l − 1, but not at levels less than l − 1, for l ≥ 1. Such items may be misclassified, but only to the extent that i may be placed in either Ḡ_{l−1} or Ḡ_l.
3. Similarly, items belonging to the left margin of G_l may be discovered as frequent only at levels l or higher. Such items may be misclassified, but only to the extent that i is placed either in Ḡ_l or in Ḡ_{l+1}.

Lemma 5 states these properties formally.

Lemma 5. Let ɛ̄ ≤ 1/6. The following properties hold conditional on the event GOODEST.

1. Suppose i ∈ mid(G_l). Then i is classified into Ḡ_l iff i ∈ S_l, and f̂_i = f̂_{i,l}. If i ∉ S_l, then f̂_i is undefined and i is unclassified.
2. Suppose i ∈ lmargin(G_l), for some l ∈ {0, 1, ..., L − 1}. If i ∉ S_l, then i is not classified into any group. Suppose i ∈ S_l. Then, (1) i is classified into Ḡ_l iff f̂_{i,l} ≥ T_l, and (2) i is classified into Ḡ_{l+1} iff i ∈ S_{l+1} and f̂_{i,l} < T_l. In both cases, f̂_i = f̂_{i,l}.
3. Suppose i ∈ rmargin(G_l), for some l ∈ {1, 2, ..., L}. If i ∉ S_{l−1}, then f̂_i is undefined and i is unclassified. Suppose i ∈ S_{l−1}. Then,
   (a) i is classified into Ḡ_{l−1} iff (1) f̂_{i,l−1} ≥ T_{l−1}, or (2) f̂_{i,l−1} < Q_{l−1} and i ∈ S_l and f̂_{i,l} ≥ T_{l−1}. In case (1), f̂_i = f̂_{i,l−1} and in case (2), f̂_i = f̂_{i,l}.
   (b) i is classified into Ḡ_l iff i ∈ S_l and either (1) f̂_{i,l−1} ≥ Q_{l−1} and f̂_i < T_{l−1}, or (2) f̂_{i,l−1} < Q_{l−1} and f̂_i = f̂_{i,l} < T_{l−1}. In case (1), f̂_i = f̂_{i,l−1} and in case (2), f̂_i = f̂_{i,l}.

Proof. We prove the statements in sequence. Assume that the event GOODEST holds. Let i ∈ mid(G_l). If l = 0 then the statement is obviously true, so we consider l ≥ 1. Suppose i ∈ S_r, for some r < l, and i is discovered as frequent at level r, that is, f̂_{i,r} ≥ Q_r. Since GOODEST holds, f_i ≥ Q_r − Δ_r. Since i ∈ mid(G_l), f_i < Q_{l−1} − ɛ̄ Q_{l−1}. Combining, we have

Q_{l−1} − ɛ̄ Q_{l−1} > f_i ≥ Q_r − Δ_r = Q_r − ɛ̄ Q_r,

which is a contradiction for r ≤ l − 1. Therefore, i is not discovered as frequent at any level r < l. Hence, if i ∉ S_l, i remains unclassified. Now suppose that i ∈ S_l. Since GOODEST holds, f̂_{i,l} ≤ f_i + Δ_l. Since i ∈ mid(G_l), f_i < Q_{l−1} − ɛ̄ Q_{l−1}. Therefore,

f̂_{i,l} < Q_{l−1} − ɛ̄ Q_{l−1} + Δ_l < Q_{l−1},   (9)

since Δ_l = ɛ̄ Q_l = ɛ̄ Q_{l−1}/2. Further, i ∈ mid(G_l) implies that f_i > T_l + ɛ̄ Q_l. Since GOODEST holds,

f̂_{i,l} ≥ f_i − Δ_l > T_l + ɛ̄ Q_l − Δ_l = T_l,   (10)

since Δ_l = ɛ̄ Q_l. Combining (9) and (10), we have T_l < f̂_{i,l} < Q_{l−1} < T_{l−1}. Thus, i is classified into Ḡ_l.

We now consider statement 2) of the lemma. Assume that GOODEST holds. Suppose i ∈ lmargin(G_l), for l ∈ {0, 1, ..., L − 1}. Then, T_l ≤ f_i ≤ T_l + ɛ̄ Q_l = T_l + Δ_l. Suppose r < l. We first show that if i ∈ S_r, then i cannot be discovered as frequent at level r, that is, f̂_{i,r} < Q_r. Assume to the contrary that f̂_{i,r} ≥ Q_r. Since GOODEST holds, we have f̂_{i,r} ≤ f_i + Δ_r. Further, Δ_r = ɛ̄ Q_r and Q_r = (1 − ɛ̄) T_r. Therefore,

T_r (1 − ɛ̄)^2 = Q_r − Δ_r ≤ f_i ≤ T_l + Δ_l = T_l (1 + ɛ̄(1 − ɛ̄)).

Since T_r = T_l 2^{l−r}, this gives

2^{l−r} ≤ (1 + ɛ̄(1 − ɛ̄))/(1 − ɛ̄)^2 < 2, if ɛ̄ ≤ 1/6.

This is a contradiction if l > r. We conclude that i is not discovered as frequent at any level r < l. Therefore, if i ∉ S_l, then i is not classified into any of the Ḡ_r's. Now suppose that

i ∈ S_l. We first show that i is discovered as frequent at level l. Since i ∈ lmargin(G_l), f_i ≥ T_l and hence

f̂_{i,l} ≥ T_l − Δ_l = Q_l/(1 − ɛ̄) − ɛ̄ Q_l > Q_l.   (11)

Thus, i is discovered as frequent at level l. There are two cases, namely, either f̂_{i,l} ≥ T_l or f̂_{i,l} < T_l. In the former case, i is classified into Ḡ_l and f̂_i = f̂_{i,l}. In the latter case, f̂_{i,l} < T_l, the decision regarding the classification is made at the next level l + 1. If i ∉ S_{l+1}, then i remains unclassified. Otherwise, suppose i ∈ S_{l+1}. The estimate f̂_{i,l+1} is ignored in favor of the lower level estimate f̂_{i,l}, which is deemed to be accurate, since it is at least Q_l. By (11), f̂_{i,l} > Q_l > T_{l+1}. By assumption, f̂_{i,l} < T_l. Therefore, i is classified into Ḡ_{l+1}. This proves statement 2) of the lemma. Statement 3) is proved in a similar fashion.

Estimator. The sample is used to compute the estimate Ψ̂. We also define an idealized estimator Ψ̃ that assumes that the frequent items structure is an oracle that does not make errors:

Ψ̂ = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} ψ(f̂_i) 2^l,   Ψ̃ = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} ψ(f_i) 2^l.   (12)

3.3 Analysis

For i ∈ [1, n] and r ∈ [0, L], define the indicator variable x_{i,r} as follows:

x_{i,r} = 1 if i ∈ S_r and i ∈ Ḡ_r, and x_{i,r} = 0 otherwise.

In this notation, equation (12) can be written as follows:

Ψ̂ = Σ_i ψ(f̂_i) Σ_{r=0}^{L} x_{i,r} 2^r,   Ψ̃ = Σ_i ψ(f_i) Σ_{r=0}^{L} x_{i,r} 2^r.   (13)

Note that, for a fixed i, the family of variables {x_{i,r}} is not independent, since each item belongs to at most one sampled group Ḡ_r (i.e., Σ_{r=0}^{L} x_{i,r} is either 0 or 1).
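The following is a minimal sketch of how the estimate Ψ̂ of equation (12) would be computed from the sampled groups once classification is done; the data layout (a list of dicts mapping items to estimated frequencies) and the function name are assumptions of this illustration, not part of the paper's specification.

```python
def hss_estimate(sampled_groups, psi):
    """Psi_hat = sum_{l=0}^{L} sum_{i in Gbar_l} psi(|fhat_i|) * 2**l,
    where sampled_groups[l] is a dict {item: estimated frequency} for Gbar_l."""
    return sum((2 ** l) * sum(psi(abs(fhat)) for fhat in group.values())
               for l, group in enumerate(sampled_groups))
```

For instance, psi = lambda x: x ** p corresponds to estimating F_p, and psi = lambda x: (x / F1) * math.log(F1 / x) (with F1 the first moment) to estimating the entropy, as in Sections 4 and 5.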

We now prove a basic property of the sampling procedure.

Lemma 6. Let i ∈ G_l.

1. If i ∈ mid(G_l), then (a) Pr{x_{i,l} = 1 | GOODEST} = 1/2^l, and (b) Pr{x_{i,r} = 1 | GOODEST} = 0 for r ≠ l.
2. If 0 ≤ l ≤ L − 1 and i ∈ lmargin(G_l), then (a) Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l+1} = 1 | GOODEST} 2^{l+1} = 1, and (b) Pr{x_{i,r} = 1 | GOODEST} = 0 for r ∈ {0, ..., L} − {l, l + 1}.
3. If 1 ≤ l ≤ L and i ∈ rmargin(G_l), then (a) Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l−1} = 1 | GOODEST} 2^{l−1} = 1, and (b) Pr{x_{i,r} = 1 | GOODEST} = 0 for r ∈ {0, ..., L} − {l − 1, l}.

Proof. We first note that part (b) of each of the three statements of the lemma is a restatement of parts of Lemma 5. For example, suppose i ∈ lmargin(G_l), l < L, and assume that the event GOODEST holds. By Lemma 5, part 2), either i ∈ Ḡ_l or i ∈ Ḡ_{l+1}. Thus, x_{i,r} = 0 if r is neither l nor l + 1. In a similar manner, parts (b) of the other statements of the lemma can be seen as restatements of parts of Lemma 5.

We now consider parts (a) of the statements. Assume that the event GOODEST holds. Suppose i ∈ mid(G_l). The probability that i is sampled into S_l is 1/2^l, by construction of the sampling technique. By Lemma 5, part 1), if i ∈ S_l, then i ∈ Ḡ_l. Therefore, Pr{x_{i,l} = 1 | GOODEST} = 1/2^l.

Suppose i ∈ lmargin(G_l) and l < L. Then, Pr{i ∈ S_l} = 1/2^l and Pr{i ∈ S_{l+1} | i ∈ S_l} = 1/2. By Lemma 5, part 2), (a) i ∈ Ḡ_l, or x_{i,l} = 1, iff i ∈ S_l and f̂_{i,l} ≥ T_l, and (b) i ∈ Ḡ_{l+1}, or x_{i,l+1} = 1, iff i ∈ S_{l+1} and f̂_{i,l} < T_l. Therefore,

Pr{x_{i,l+1} = 1 | i ∈ S_l and GOODEST}
  = Pr{i ∈ S_{l+1} and f̂_{i,l} < T_l | i ∈ S_l and GOODEST}
  = Pr{f̂_{i,l} < T_l | i ∈ S_l and GOODEST} · Pr{i ∈ S_{l+1} | i ∈ S_l and GOODEST and f̂_{i,l} < T_l}.   (14)

We know that Pr{f̂_{i,l} < T_l | i ∈ S_l and GOODEST} = 1 − Pr{x_{i,l} = 1 | i ∈ S_l and GOODEST}. The specific value of the estimate f̂_{i,l} is a function solely of the random bits employed by D_l and the sub-stream S_l. By full independence of the hash function mapping items to levels, we have that

Pr{i ∈ S_{l+1} | i ∈ S_l and GOODEST and f̂_{i,l} < T_l} = Pr{i ∈ S_{l+1} | i ∈ S_l} = 1/2.

Substituting in (14), we have

Pr{x_{i,l+1} = 1 | i ∈ S_l and GOODEST} = (1/2)(1 − Pr{x_{i,l} = 1 | i ∈ S_l and GOODEST}).

By the definition of conditional probabilities (and multiplying by 2),

2 Pr{x_{i,l+1} = 1 | GOODEST} / Pr{i ∈ S_l} = 1 − Pr{x_{i,l} = 1 | GOODEST} / Pr{i ∈ S_l}.

Since Pr{i ∈ S_l} = 1/2^l, we obtain

2^{l+1} Pr{x_{i,l+1} = 1 | GOODEST} = 1 − 2^l Pr{x_{i,l} = 1 | GOODEST},

or,

Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l+1} = 1 | GOODEST} 2^{l+1} = 1.

This proves statement 2) of the lemma. Statement 3), regarding the right margin of G_l, can be proved analogously.

A useful corollary of Lemma 6 is the following.

Lemma 7. For i ∈ [1, n], Σ_{r=0}^{L} E[x_{i,r} | GOODEST] 2^r = 1.

Proof. If i ∈ mid(G_l), then, by Lemma 6, Pr{x_{i,l} = 1 | GOODEST} = 1/2^l and Pr{x_{i,r} = 1 | GOODEST} = 0 if r ≠ l. Therefore,

Σ_{r=0}^{L} E[x_{i,r} | GOODEST] 2^r = Σ_{r=0}^{L} Pr{x_{i,r} = 1 | GOODEST} 2^r = Pr{x_{i,l} = 1 | GOODEST} 2^l = 1.

Suppose i ∈ lmargin(G_l), for 0 ≤ l < L. By Lemma 6, Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l+1} = 1 | GOODEST} 2^{l+1} = 1 and Pr{x_{i,r} = 1 | GOODEST} = 0 for r ∉ {l, l + 1}. Therefore,

Σ_{r=0}^{L} E[x_{i,r} | GOODEST] 2^r = Σ_{r=0}^{L} Pr{x_{i,r} = 1 | GOODEST} 2^r = 1.

The case i ∈ rmargin(G_l) is proved analogously.

Lemma 8 shows that the expected value of Ψ̃ is Ψ, assuming the event GOODEST holds.

Lemma 8. E[Ψ̃ | GOODEST] = Ψ.

Proof. By (13), Ψ̃ = Σ_i ψ(f_i) Σ_{r=0}^{L} x_{i,r} 2^r. Taking expectation and using linearity of expectation,

E[Ψ̃ | GOODEST] = Σ_i ψ(f_i) Σ_{r=0}^{L} E[x_{i,r} 2^r | GOODEST] = Σ_i ψ(f_i) = Ψ,

since Σ_{r=0}^{L} E[x_{i,r} 2^r | GOODEST] = 1, by Lemma 7.

The following lemma is useful in the calculation of the variance of Ψ̃.

Notation. Let l(i) denote the index of the group G_l such that i ∈ G_l.

Lemma 9. For i ∈ [1, n] with i ∉ G_0 − lmargin(G_0), Σ_{r=0}^{L} E[x_{i,r} 2^{2r} | GOODEST] ≤ 2^{l(i)+1}. If i ∈ G_0 − lmargin(G_0), then Σ_{r=0}^{L} E[x_{i,r} 2^{2r} | GOODEST] = 1.

Proof. We assume that all probabilities and expectations in this proof are conditioned on the event GOODEST; for brevity, we do not write the conditioning event. Let i ∈ mid(G_l). Then x_{i,l} = 1 iff i ∈ S_l, by Lemma 6. Thus,

Pr{x_{i,l} = 1} 2^{2l} = (1/2^l) 2^{2l} = 2^l.

If i ∈ lmargin(G_l), then, by the argument in Lemma 7, Pr{x_{i,l+1} = 1} 2^{l+1} + Pr{x_{i,l} = 1} 2^l = 1. Multiplying by 2^{l+1},

Pr{x_{i,l+1} = 1} 2^{2(l+1)} + Pr{x_{i,l} = 1} 2^{2l} ≤ 2^{l+1}.

Similarly, if i ∈ rmargin(G_l), then Pr{x_{i,l} = 1} 2^l + Pr{x_{i,l−1} = 1} 2^{l−1} = 1, and therefore

Pr{x_{i,l} = 1} 2^{2l} + Pr{x_{i,l−1} = 1} 2^{2(l−1)} ≤ 2^l.

Since l(i) denotes the index of the group G_l to which i belongs, in all cases

Σ_{r=0}^{L} E[x_{i,r} 2^{2r}] ≤ 2^{l(i)+1}.

In particular, if i ∈ G_0 − lmargin(G_0) or if i ∈ mid(G_l), then the above sum is at most 2^{l(i)}; for i ∈ G_0 − lmargin(G_0) it equals 1.

Lemma 10. Var[Ψ̃ | GOODEST] ≤ Σ_{i ∉ G_0 − lmargin(G_0)} ψ^2(f_i) 2^{l(i)+1}.

Proof. We assume that all probabilities and expectations in this proof are conditioned on the event GOODEST; for brevity, the conditioning event is not written explicitly. By (13),

E[Ψ̃^2] = E[(Σ_i ψ(f_i) Σ_{r=0}^{L} x_{i,r} 2^r)^2]
       = E[Σ_i ψ^2(f_i)(Σ_{r=0}^{L} x_{i,r} 2^r)^2] + E[Σ_{i ≠ j} ψ(f_i) ψ(f_j)(Σ_{r_1=0}^{L} x_{i,r_1} 2^{r_1})(Σ_{r_2=0}^{L} x_{j,r_2} 2^{r_2})]
       = E[Σ_i ψ^2(f_i) Σ_{r=0}^{L} x_{i,r}^2 2^{2r}] + E[Σ_i ψ^2(f_i) Σ_{r_1 ≠ r_2} x_{i,r_1} x_{i,r_2} 2^{r_1+r_2}]
         + E[Σ_{i ≠ j} ψ(f_i) ψ(f_j) Σ_{r_1=0}^{L} Σ_{r_2=0}^{L} x_{i,r_1} 2^{r_1} x_{j,r_2} 2^{r_2}].

We note that (a) x_{i,r}^2 = x_{i,r}, (b) an item i is classified into a unique sampled group Ḡ_r, and therefore x_{i,r_1} x_{i,r_2} = 0 for r_1 ≠ r_2, and (c) for i ≠ j, x_{i,r_1} and x_{j,r_2} are assumed to be independent of each other, regardless of the values of r_1 and r_2. Thus,

E[Ψ̃^2] = Σ_i ψ^2(f_i) E[Σ_{r=0}^{L} x_{i,r} 2^{2r}] + Σ_{i ≠ j} E[ψ(f_i) Σ_{r_1=0}^{L} x_{i,r_1} 2^{r_1}] E[ψ(f_j) Σ_{r_2=0}^{L} x_{j,r_2} 2^{r_2}]
        = Σ_i ψ^2(f_i) E[Σ_{r=0}^{L} x_{i,r} 2^{2r}] + Ψ^2 − Σ_i ψ^2(f_i),

since, by Lemma 7, E[ψ(f_i) Σ_{r} x_{i,r} 2^r] = ψ(f_i), and, by Lemma 8, E[Ψ̃] = Ψ. As a result, the expression for Var[Ψ̃] simplifies to

Var[Ψ̃] = E[Ψ̃^2] − (E[Ψ̃])^2
        = Σ_i ψ^2(f_i) (E[Σ_{r=0}^{L} x_{i,r} 2^{2r}] − 1)
        ≤ Σ_{i ∉ G_0 − lmargin(G_0)} ψ^2(f_i) 2^{l(i)+1}, by Lemma 9.

For any subset S ⊆ [1, n], denote by ψ(S) the expression Σ_{i ∈ S} ψ(f_i). Let Ψ_2 = Ψ_2(S) denote Σ_{i=1}^{n} ψ^2(f_i).

Corollary 2. If the function ψ(·) is non-decreasing in the interval [0, T], then

Var[Ψ̃ | GOODEST] ≤ Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)).   (15)

Proof. If the monotonicity condition is satisfied, then ψ(f_i) ≤ ψ(T_{l−1}) for all i ∈ G_l, l ≥ 1, and ψ(f_i) ≤ ψ(T) for i ∈ lmargin(G_0). Therefore, ψ^2(f_i) ≤ ψ(T_{l−1}) ψ(f_i) in the first case and ψ^2(f_i) ≤ ψ(T) ψ(f_i) in the second case. By Lemma 10,

Var[Ψ̃ | GOODEST] ≤ Σ_{l=1}^{L} Σ_{i ∈ G_l} ψ(T_{l−1}) ψ(f_i) 2^{l(i)+1} + Σ_{i ∈ lmargin(G_0)} ψ(T) ψ(f_i)
                 = Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)).

3.4 Error in the estimate

The error incurred by our estimate Ψ̂ is |Ψ̂ − Ψ|, and it can be written as the sum of two error components using the triangle inequality:

|Ψ̂ − Ψ| ≤ |Ψ̃ − Ψ| + |Ψ̂ − Ψ̃| = E_1 + E_2.

Here, E_1 = |Ψ̃ − Ψ| is the error due to sampling and E_2 = |Ψ̂ − Ψ̃| is the error due to the estimation of the frequencies. By Chebychev's inequality,

Pr{ E_1 ≤ 3 (Var[Ψ̃])^{1/2} | GOODEST } ≥ 8/9.

Substituting the expression for Var[Ψ̃] from (15),

Pr{ E_1 ≤ 3 (Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)))^{1/2} | GOODEST } ≥ 8/9.   (16)

We now present an upper bound on E_2. Define a real-valued function π : [1, n] → R as follows:

π_i = Δ_l ψ'(ξ_i(f_i, Δ_l)) if i ∈ G_0 − lmargin(G_0) or i ∈ mid(G_l),
π_i = Δ_l ψ'(ξ_i(f_i, Δ_l)) if i ∈ lmargin(G_l), for some l ∈ {0, ..., L − 1},
π_i = Δ_{l−1} ψ'(ξ_i(f_i, Δ_{l−1})) if i ∈ rmargin(G_l),

where the notation ξ_i(f_i, Δ_l) returns the value of t that maximizes ψ'(t) in the interval [f_i − Δ_l, f_i + Δ_l].

Lemma 11. Assume that GOODEST holds. Then E_2 ≤ Σ_i π_i Σ_{r=0}^{L} x_{i,r} 2^r.

Proof. Assume that the event GOODEST holds. By the triangle inequality,

E_2 ≤ Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} |ψ(f̂_i) − ψ(f_i)| x_{i,l} 2^l = Σ_i |ψ(f̂_i) − ψ(f_i)| Σ_{r=0}^{L} x_{i,r} 2^r.

Case 1: i ∈ mid(G_l) or i ∈ G_0 − lmargin(G_0). Then i is classified only into group Ḡ_l (with probability 1/2^l, or remains unclassified), and f̂_i = f̂_{i,l}, by Lemma 6. By Taylor's series,

|ψ(f̂_i) − ψ(f_i)| ≤ Δ_l ψ'(ξ_i), where ξ_i = ξ_i(f_i, Δ_l) maximizes ψ'(t) for t ∈ [f_i − Δ_l, f_i + Δ_l].

Case 2: i ∈ lmargin(G_l) and l < L. Then f̂_i = f̂_{i,l}, whether i is classified into Ḡ_l or into Ḡ_{l+1}. Therefore |f̂_i − f_i| ≤ Δ_l and, by Taylor's series, |ψ(f̂_i) − ψ(f_i)| ≤ Δ_l ψ'(ξ_i).

Case 3: i ∈ rmargin(G_l) and l > 0. Then i ∈ Ḡ_{l−1} or i ∈ Ḡ_l, and the estimate is obtained at level l − 1. Similarly, it can be shown that |ψ(f̂_i) − ψ(f_i)| ≤ Δ_{l−1} ψ'(ξ_i).

Adding the contributions of the three cases,

E_2 ≤ Σ_{i ∈ G_0 − lmargin(G_0) or i ∈ mid(G_l)} Δ_l ψ'(ξ_i(f_i, Δ_l)) Σ_{r=0}^{L} x_{i,r} 2^r
     + Σ_{l=1}^{L} Σ_{i ∈ rmargin(G_l)} Δ_{l−1} ψ'(ξ_i(f_i, Δ_{l−1})) Σ_{r=0}^{L} x_{i,r} 2^r
     + Σ_{l=0}^{L−1} Σ_{i ∈ lmargin(G_l)} Δ_l ψ'(ξ_i(f_i, Δ_l)) Σ_{r=0}^{L} x_{i,r} 2^r.

Using the notation of the π_i's, we have E_2 ≤ Σ_i π_i Σ_{r=0}^{L} x_{i,r} 2^r.

To abbreviate the statement of the next few lemmas, we introduce the following notation:

Π_1 = Σ_i π_i,   (17)
Π_2 = 3 (Σ_{i ∉ G_0 − lmargin(G_0)} π_i^2 2^{l(i)+1})^{1/2},   (18)
Λ = 3 (Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)))^{1/2}.   (19)

Lemma 12. E[E_2 | GOODEST] ≤ Π_1 and Var[E_2 | GOODEST] ≤ Π_2^2/9. Therefore, Pr{E_2 ≤ Π_1 + Π_2 | GOODEST} ≥ 8/9.

Proof. Assume that GOODEST holds and define E'_2 = Σ_i π_i Σ_{l=0}^{L} x_{i,l} 2^l. By Lemma 11, E_2 ≤ E'_2. Applying Lemma 8 and Lemma 10 to E'_2 gives E[E'_2 | GOODEST] ≤ Π_1 and Var[E'_2 | GOODEST] ≤ Π_2^2/9. By Chebychev's inequality, Pr{E'_2 ≤ Π_1 + Π_2 | GOODEST} ≥ 8/9. Thus, Pr{E_2 ≤ Π_1 + Π_2 | GOODEST} ≥ 8/9.

Lemma 13 presents the overall expression of the error and its probability.

Lemma 13. Let ɛ̄ ≤ 1/3 and suppose ψ(·) is a monotonic function in the interval [0, T]. Then

Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 } ≥ (7/9)(1 − n(L + 1) 2^{−t}).

Proof. Combining Lemma 12 and equation (16), and using the notation of equations (17), (18) and (19), we have

Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 | GOODEST } ≥ 7/9.

Since Pr{GOODEST} ≥ 1 − n(L + 1) 2^{−t},

Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 } ≥ Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 | GOODEST } Pr{GOODEST} ≥ (7/9)(1 − n(L + 1) 2^{−t}).

Reducing randomness by using a pseudo-random generator. The analysis has assumed that the hash function mapping items to levels is completely independent. A pseudo-random generator can be constructed along the lines of Indyk [17] and Indyk and Woodruff [18] to reduce the required randomness. This is illustrated for each of the two estimations that we consider in the following sections, namely, estimating F_p and estimating H.

4 Estimating F_p for p ≥ 2

In this section, we apply the HSS technique to estimate F_p for p ≥ 2. Let ɛ̄ = ɛ/(4p). For estimating F_p, we use the HSS technique instantiated with the COUNTSKETCH data structure with space parameter k; the value of k will be fixed during the analysis. We use a standard estimator, such as sketches [1] or FASTAMS [21], for estimating F̂_2 to within an accuracy of (1 ± 1/4) with constant confidence, using space O(log F_1) bits. We extend the event GOODEST to include this event, that is,

GOODEST: |F̂_2 − F_2| < F_2/4 and |f̂_{i,l} − f_i| ≤ Δ_l, for all i ∈ [1, n] and l ∈ {0, 1, ..., L}.

The width of the COUNTSKETCH structure at each level is chosen to be s = O(log n + log log F_1) so that Pr{GOODEST} ≥ 8/9. Let F_0 denote the number of distinct items in the data stream, that is, the number of items with non-zero frequency.

Lemma 14. Let ɛ̄ = min(ɛ/(4p), 1/6). Then Π_1 ≤ ɛ F_p.

Proof. Let i ∈ G_l and l > 0. Then π_i ≤ Δ_{l−1} ψ'(ξ(f_i, Δ_{l−1})). Since x^p is monotonic and increasing, we have ξ(f_i, t) = f_i + t, for t > 0. For i ∈ G_l and l > 0, Δ_{l−1} ≤ 2ɛ̄ f_i. Further,

ψ'(ξ(f_i, Δ_{l−1})) ≤ p(f_i + Δ_{l−1})^{p−1} ≤ p f_i^{p−1} (1 + 2ɛ̄)^{p−1} ≤ p f_i^{p−1} e^{2(p−1)/(4p)} ≤ 2p f_i^{p−1}.

Thus,

π_i ≤ Δ_{l−1} ψ'(ξ(f_i, Δ_{l−1})) ≤ 2ɛ̄ f_i · 2p f_i^{p−1} = 4ɛ̄ p f_i^p ≤ ɛ f_i^p, since ɛ̄ ≤ ɛ/(4p).   (20)

For i ∈ G_0, Δ_0 ≤ ɛ̄ T_0 ≤ ɛ̄ f_i. Therefore,

π_i ≤ Δ_0 ψ'(ξ(f_i, Δ_0)) ≤ 2ɛ̄ f_i · p(f_i + Δ_0)^{p−1} ≤ ɛ f_i^p.   (21)

Combining this with (20), we have Π_1 ≤ Σ_i ɛ f_i^p = ɛ F_p.

Lemma 15. Let ɛ̄ ≤ min(ɛ/(4p), 1/6). Then Π_2 ≤ ɛ F_p, provided k ≥ 32 (72)^{2/p} p^2 ɛ^{−2} n^{1−2/p}.

Proof. Π_2 is defined by

(Π_2)^2 = 9 Σ_{i ∉ G_0 − lmargin(G_0)} π_i^2 2^{l(i)+1}.

Suppose i ∈ G_l with l ≥ 1, or i ∈ lmargin(G_0). Then, by (20) and (21), it follows that π_i ≤ ɛ f_i^p. Therefore,

(Π_2)^2 ≤ 9 Σ_{i ∉ G_0 − lmargin(G_0)} ɛ^2 f_i^{2p} 2^{l(i)+1}.   (22)

For i ∈ G_l and l ≥ 1, f_i < T_{l−1} and therefore f_i^{2p} ≤ T_{l−1}^p f_i^p. For i ∈ lmargin(G_0), f_i ≤ T < T_0(1 + ɛ̄) and therefore

f_i^{2p} ≤ T^p f_i^p ≤ T_0^p (1 + ɛ̄)^p f_i^p, for i ∈ lmargin(G_0).

Combining, we have

(Π_2)^2 ≤ 9 ɛ^2 ( T_0^p (1 + ɛ̄)^p Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} T_{l−1}^p Σ_{i ∈ G_l} f_i^p 2^{l+1} ).   (23)

By construction, T_0 = (F̂_2/(ɛ̄(1 − ɛ̄) k))^{1/2}, so that T_0^p ≤ (1 − ɛ̄)^{−p} (2F_2/(ɛ̄^2 k))^{p/2} using |F̂_2 − F_2| < F_2/4, and T_l = T_0/2^l. Substituting in (23), we have

(Π_2)^2 ≤ 9 ɛ^2 (1 − ɛ̄)^{−p} (2F_2/(ɛ̄^2 k))^{p/2} ( (1 + ɛ̄)^p Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} Σ_{i ∈ G_l} f_i^p 2^{l+1}/2^{(l−1)p} ).   (24)

Since p ≥ 2 and l ≥ 1, 2^{l+1}/2^{(l−1)p} ≤ 4. Further, since ɛ̄ ≤ 1/(4p), (1 + ɛ̄)^p ≤ 2 and (1 − ɛ̄)^p > 1/2. Therefore, (24) can be simplified as

(Π_2)^2 ≤ 72 ɛ^2 (2F_2/(ɛ̄^2 k))^{p/2} ( Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} Σ_{i ∈ G_l} f_i^p ) ≤ 72 ɛ^2 (2F_2/(ɛ̄^2 k))^{p/2} F_p,   (25)

since Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} Σ_{i ∈ G_l} f_i^p = F_p − Σ_{i ∈ G_0 − lmargin(G_0)} f_i^p ≤ F_p. We now use the identity

(F_2/F_0)^{1/2} ≤ (F_r/F_0)^{1/r}, or, F_2^{r/2} ≤ F_r F_0^{r/2−1}, for any r ≥ 2.   (26)

Letting r = p, we have F_2^{p/2} ≤ F_p F_0^{p/2−1}. Substituting in (25), we have

(Π_2)^2 ≤ 72 ɛ^2 2^{p/2} F_p F_0^{p/2−1} F_p / (ɛ̄^2 k)^{p/2} ≤ 72 ɛ^2 2^{p/2} F_p^2 (n^{1−2/p}/(ɛ̄^2 k))^{p/2}.

Letting k = 2 (72)^{2/p} n^{1−2/p}/ɛ̄^2 = 32 (72)^{2/p} p^2 n^{1−2/p}/ɛ^2, we have

(Π_2)^2 ≤ ɛ^2 F_p^2, or, Π_2 ≤ ɛ F_p.

Lemma 16. If ɛ̄ ≤ min(ɛ/(4p), 1/6) and k ≥ 32 (72)^{2/p} p^2 ɛ^{−(2+4/p)} n^{1−2/p}, then Λ < ɛ F_p.

Proof. The expression (19) for Λ can be written as follows:

Λ^2/9 = Σ_{i ∈ lmargin(G_0)} T^p f_i^p + Σ_{l=1}^{L} T_{l−1}^p Σ_{i ∈ G_l} f_i^p 2^{l+1}.   (27)

Except for the factor of ɛ^2, the expression on the right-hand side is identical to the right-hand side of (23), for which an upper bound was derived in Lemma 15. Following the same proof and modifying the constants, we obtain that Λ^2 ≤ ɛ^2 F_p^2 if k ≥ 32 (72)^{2/p} p^2 ɛ^{−(2+4/p)} n^{1−2/p}.

Recall that s is the width of the COUNTSKETCH structures kept at each level l = 0, ..., L.

Lemma 17. Suppose p ≥ 2. Let k ≥ 32 (72)^{2/p} p^2 ɛ^{−(2+4/p)} n^{1−2/p}, ɛ̄ = min(ɛ/(4p), 1/6) and s = O(log n + log log F_1). Then Pr{ |F̂_p − F_p| ≤ 3ɛ F_p } ≥ 6/9.

Proof. By Lemma 13,

Pr{ |F̂_p − F_p| ≤ Π_1 + Π_2 + Λ } ≥ (7/9) Pr{GOODEST}.

By Lemma 14, we have Π_1 ≤ ɛ F_p. By Lemma 15, we have Π_2 ≤ ɛ F_p. Finally, by Lemma 16, Λ ≤ ɛ F_p. Therefore, Π_1 + Π_2 + Λ ≤ 3ɛ F_p. By choosing the number s of independent tables in the COUNTSKETCH structure at each level to be O(log n + log log F_1), we have Pr{GOODEST} ≥ 8/9; this includes the error probability of estimating F̂_2 to within a factor of (1 ± 1/4) of F_2. Thus,

Pr{ |F̂_p − F_p| ≤ Π_1 + Π_2 + Λ } ≥ (1 − 2/9) Pr{GOODEST} > 6/9.

The space requirement for the estimation procedure whose guarantees are presented in Lemma 17 can be calculated as follows. There are L = O(log F_1) levels, where each level uses a COUNTSKETCH structure with hash table size k = O(p^2 ɛ^{−(2+4/p)} n^{1−2/p}) and number of independent hash tables s = O(log n + log log F_1). Since each table entry requires O(log F_1) bits, the space required to store the tables is

S = O( (p^2/ɛ^{2+4/p}) n^{1−2/p} (log^2 F_1)(log n + log log F_1) ).

The number of random bits can be reduced from O(n(log n + log F_1)) to O(S log R) using the pseudo-random generator of [17, 18]. Since the final state of the data structure is the same if the input is presented in the order of item identities, R = O(n L s). Therefore, the number of random bits required is O(S(log n + log log F_1)). The expected time required to update the structure for each stream update of the form (pos, i, v) is O(s) = O(log n + log log F_1). This is summarized in the following theorem.

Theorem 3. For every p ≥ 2 and 0 < ɛ ≤ 1, there exists a randomized algorithm that returns an estimate F̂_p satisfying Pr{ |F̂_p − F_p| ≤ ɛ F_p } ≥ 2/3, using space

O( (p^2/ɛ^{2+4/p}) n^{1−2/p} (log F_1)^2 (log n + log log F_1)^2 ).

The expected time to process each stream update is O(log n + log log F_1).

5 Estimating Entropy

In this section, we apply the HSS technique to estimate the entropy H = Σ_{i: f_i ≠ 0} (|f_i|/F_1) log(F_1/|f_i|).
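As a concrete reference for the quantity being estimated, here is a minimal offline computation of the entropy of a frequency vector using the definition above; the function name is an assumption of this sketch, and it is not a streaming method.

```python
import math

def empirical_entropy(freqs):
    """H = sum_{i: f_i != 0} (|f_i|/F_1) * log(F_1/|f_i|), computed exactly
    from a full frequency vector (offline reference computation)."""
    F1 = sum(abs(f) for f in freqs)
    return sum((abs(f) / F1) * math.log(F1 / abs(f)) for f in freqs if f != 0)
```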


More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

Arthur-Merlin Streaming Complexity

Arthur-Merlin Streaming Complexity Weizmann Institute of Science Joint work with Ran Raz Data Streams The data stream model is an abstraction commonly used for algorithms that process network traffic using sublinear space. A data stream

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

CS Communication Complexity: Applications and New Directions

CS Communication Complexity: Applications and New Directions CS 2429 - Communication Complexity: Applications and New Directions Lecturer: Toniann Pitassi 1 Introduction In this course we will define the basic two-party model of communication, as introduced in the

More information

New Characterizations in Turnstile Streams with Applications

New Characterizations in Turnstile Streams with Applications New Characterizations in Turnstile Streams with Applications Yuqing Ai 1, Wei Hu 1, Yi Li 2, and David P. Woodruff 3 1 The Institute for Theoretical Computer Science (ITCS), Institute for Interdisciplinary

More information

A Lower Bound for the Size of Syntactically Multilinear Arithmetic Circuits

A Lower Bound for the Size of Syntactically Multilinear Arithmetic Circuits A Lower Bound for the Size of Syntactically Multilinear Arithmetic Circuits Ran Raz Amir Shpilka Amir Yehudayoff Abstract We construct an explicit polynomial f(x 1,..., x n ), with coefficients in {0,

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

Stochastic Submodular Cover with Limited Adaptivity

Stochastic Submodular Cover with Limited Adaptivity Stochastic Submodular Cover with Limited Adaptivity Arpit Agarwal Sepehr Assadi Sanjeev Khanna Abstract In the submodular cover problem, we are given a non-negative monotone submodular function f over

More information

Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, Morten Houmøller Nygaard,

Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, Morten Houmøller Nygaard, Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, 2011 3884 Morten Houmøller Nygaard, 2011 4582 Master s Thesis, Computer Science June 2016 Main Supervisor: Gerth Stølting Brodal

More information

1 Variance of a Random Variable

1 Variance of a Random Variable Indian Institute of Technology Bombay Department of Electrical Engineering Handout 14 EE 325 Probability and Random Processes Lecture Notes 9 August 28, 2014 1 Variance of a Random Variable The expectation

More information

Approximate counting: count-min data structure. Problem definition

Approximate counting: count-min data structure. Problem definition Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem

More information

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard

More information

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

CSE 190, Great ideas in algorithms: Pairwise independent hash functions CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required

More information

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181. Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität

More information

Efficient Sketches for Earth-Mover Distance, with Applications

Efficient Sketches for Earth-Mover Distance, with Applications Efficient Sketches for Earth-Mover Distance, with Applications Alexandr Andoni MIT andoni@mit.edu Khanh Do Ba MIT doba@mit.edu Piotr Indyk MIT indyk@mit.edu David Woodruff IBM Almaden dpwoodru@us.ibm.com

More information

Computing the Entropy of a Stream

Computing the Entropy of a Stream Computing the Entropy of a Stream To appear in SODA 2007 Graham Cormode graham@research.att.com Amit Chakrabarti Dartmouth College Andrew McGregor U. Penn / UCSD Outline Introduction Entropy Upper Bound

More information

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 Allan Borodin March 13, 2016 1 / 33 Lecture 9 Announcements 1 I have the assignments graded by Lalla. 2 I have now posted five questions

More information

CS 350 Algorithms and Complexity

CS 350 Algorithms and Complexity CS 350 Algorithms and Complexity Winter 2019 Lecture 15: Limitations of Algorithmic Power Introduction to complexity theory Andrew P. Black Department of Computer Science Portland State University Lower

More information

CS 350 Algorithms and Complexity

CS 350 Algorithms and Complexity 1 CS 350 Algorithms and Complexity Fall 2015 Lecture 15: Limitations of Algorithmic Power Introduction to complexity theory Andrew P. Black Department of Computer Science Portland State University Lower

More information

MATH 31BH Homework 1 Solutions

MATH 31BH Homework 1 Solutions MATH 3BH Homework Solutions January 0, 04 Problem.5. (a) (x, y)-plane in R 3 is closed and not open. To see that this plane is not open, notice that any ball around the origin (0, 0, 0) will contain points

More information

Testing random variables for independence and identity

Testing random variables for independence and identity Testing random variables for independence and identity Tuğkan Batu Eldar Fischer Lance Fortnow Ravi Kumar Ronitt Rubinfeld Patrick White January 10, 2003 Abstract Given access to independent samples of

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store

More information

Lecture Lecture 9 October 1, 2015

Lecture Lecture 9 October 1, 2015 CS 229r: Algorithms for Big Data Fall 2015 Lecture Lecture 9 October 1, 2015 Prof. Jelani Nelson Scribe: Rachit Singh 1 Overview In the last lecture we covered the distance to monotonicity (DTM) and longest

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

1 Randomized Computation

1 Randomized Computation CS 6743 Lecture 17 1 Fall 2007 1 Randomized Computation Why is randomness useful? Imagine you have a stack of bank notes, with very few counterfeit ones. You want to choose a genuine bank note to pay at

More information

Notes for Lecture 15

Notes for Lecture 15 U.C. Berkeley CS278: Computational Complexity Handout N15 Professor Luca Trevisan 10/27/2004 Notes for Lecture 15 Notes written 12/07/04 Learning Decision Trees In these notes it will be convenient to

More information

Lecture 9: March 26, 2014

Lecture 9: March 26, 2014 COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 9: March 26, 204 Spring 204 Scriber: Keith Nichols Overview. Last Time Finished analysis of O ( n ɛ ) -query

More information

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Lecture 5: Hashing. David Woodruff Carnegie Mellon University Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of

More information

Some hard families of parameterised counting problems

Some hard families of parameterised counting problems Some hard families of parameterised counting problems Mark Jerrum and Kitty Meeks School of Mathematical Sciences, Queen Mary University of London {m.jerrum,k.meeks}@qmul.ac.uk September 2014 Abstract

More information

Lecture 4: Hashing and Streaming Algorithms

Lecture 4: Hashing and Streaming Algorithms CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 4: Hashing and Streaming Algorithms Lecturer: Shayan Oveis Gharan 01/18/2017 Scribe: Yuqing Ai Disclaimer: These notes have not been subjected

More information

Efficient Approximation for Restricted Biclique Cover Problems

Efficient Approximation for Restricted Biclique Cover Problems algorithms Article Efficient Approximation for Restricted Biclique Cover Problems Alessandro Epasto 1, *, and Eli Upfal 2 ID 1 Google Research, New York, NY 10011, USA 2 Department of Computer Science,

More information

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s Mining Data Streams The Stream Model Sliding Windows Counting 1 s 1 The Stream Model Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make

More information

A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees

A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees Yoshimi Egawa Department of Mathematical Information Science, Tokyo University of

More information

10 Distributed Matrix Sketching

10 Distributed Matrix Sketching 10 Distributed Matrix Sketching In order to define a distributed matrix sketching problem thoroughly, one has to specify the distributed model, data model and the partition model of data. The distributed

More information

Random Feature Maps for Dot Product Kernels Supplementary Material

Random Feature Maps for Dot Product Kernels Supplementary Material Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains

More information

Lecture 8: The Goemans-Williamson MAXCUT algorithm

Lecture 8: The Goemans-Williamson MAXCUT algorithm IU Summer School Lecture 8: The Goemans-Williamson MAXCUT algorithm Lecturer: Igor Gorodezky The Goemans-Williamson algorithm is an approximation algorithm for MAX-CUT based on semidefinite programming.

More information

Lecture 10 September 27, 2016

Lecture 10 September 27, 2016 CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 10 September 27, 2016 Scribes: Quinten McNamara & William Hoza 1 Overview In this lecture, we focus on constructing coresets, which are

More information

Randomized Time Space Tradeoffs for Directed Graph Connectivity

Randomized Time Space Tradeoffs for Directed Graph Connectivity Randomized Time Space Tradeoffs for Directed Graph Connectivity Parikshit Gopalan, Richard Lipton, and Aranyak Mehta College of Computing, Georgia Institute of Technology, Atlanta GA 30309, USA. Email:

More information

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions 58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen

More information

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Range-efficient computation of F 0 over massive data streams

Range-efficient computation of F 0 over massive data streams Range-efficient computation of F 0 over massive data streams A. Pavan Dept. of Computer Science Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura Dept. of Elec. and Computer Engg. Iowa State

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

The Algorithmic Aspects of the Regularity Lemma

The Algorithmic Aspects of the Regularity Lemma The Algorithmic Aspects of the Regularity Lemma N. Alon R. A. Duke H. Lefmann V. Rödl R. Yuster Abstract The Regularity Lemma of Szemerédi is a result that asserts that every graph can be partitioned in

More information

Existence of Nash Networks in One-Way Flow Models

Existence of Nash Networks in One-Way Flow Models Existence of Nash Networks in One-Way Flow Models pascal billand a, christophe bravard a, sudipta sarangi b a CREUSET, Jean Monnet University, Saint-Etienne, France. email: pascal.billand@univ-st-etienne.fr

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

arxiv:quant-ph/ v1 29 May 2003

arxiv:quant-ph/ v1 29 May 2003 Quantum Lower Bounds for Collision and Element Distinctness with Small Range arxiv:quant-ph/0305179v1 29 May 2003 Andris Ambainis Abstract We give a general method for proving quantum lower bounds for

More information

Lecture Lecture 25 November 25, 2014

Lecture Lecture 25 November 25, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,

More information

Algorithms for Distributed Functional Monitoring

Algorithms for Distributed Functional Monitoring Algorithms for Distributed Functional Monitoring Graham Cormode AT&T Labs and S. Muthukrishnan Google Inc. and Ke Yi Hong Kong University of Science and Technology We study what we call functional monitoring

More information

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions Economics 204 Fall 2011 Problem Set 1 Suggested Solutions 1. Suppose k is a positive integer. Use induction to prove the following two statements. (a) For all n N 0, the inequality (k 2 + n)! k 2n holds.

More information

Balls and Bins: Smaller Hash Families and Faster Evaluation

Balls and Bins: Smaller Hash Families and Faster Evaluation Balls and Bins: Smaller Hash Families and Faster Evaluation L. Elisa Celis Omer Reingold Gil Segev Udi Wieder May 1, 2013 Abstract A fundamental fact in the analysis of randomized algorithms is that when

More information

Notes for Lecture 14 v0.9

Notes for Lecture 14 v0.9 U.C. Berkeley CS27: Computational Complexity Handout N14 v0.9 Professor Luca Trevisan 10/25/2002 Notes for Lecture 14 v0.9 These notes are in a draft version. Please give me any comments you may have,

More information

Lecture 4: LMN Learning (Part 2)

Lecture 4: LMN Learning (Part 2) CS 294-114 Fine-Grained Compleity and Algorithms Sept 8, 2015 Lecture 4: LMN Learning (Part 2) Instructor: Russell Impagliazzo Scribe: Preetum Nakkiran 1 Overview Continuing from last lecture, we will

More information

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15) Problem 1: Chernoff Bounds via Negative Dependence - from MU Ex 5.15) While deriving lower bounds on the load of the maximum loaded bin when n balls are thrown in n bins, we saw the use of negative dependence.

More information

Notes on Complexity Theory Last updated: December, Lecture 27

Notes on Complexity Theory Last updated: December, Lecture 27 Notes on Complexity Theory Last updated: December, 2011 Jonathan Katz Lecture 27 1 Space-Bounded Derandomization We now discuss derandomization of space-bounded algorithms. Here non-trivial results can

More information

arxiv:cs/ v1 [cs.cg] 7 Feb 2006

arxiv:cs/ v1 [cs.cg] 7 Feb 2006 Approximate Weighted Farthest Neighbors and Minimum Dilation Stars John Augustine, David Eppstein, and Kevin A. Wortman Computer Science Department University of California, Irvine Irvine, CA 92697, USA

More information

Some Sieving Algorithms for Lattice Problems

Some Sieving Algorithms for Lattice Problems Foundations of Software Technology and Theoretical Computer Science (Bangalore) 2008. Editors: R. Hariharan, M. Mukund, V. Vinay; pp - Some Sieving Algorithms for Lattice Problems V. Arvind and Pushkar

More information

6 Filtering and Streaming

6 Filtering and Streaming Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Testing Monotone High-Dimensional Distributions

Testing Monotone High-Dimensional Distributions Testing Monotone High-Dimensional Distributions Ronitt Rubinfeld Computer Science & Artificial Intelligence Lab. MIT Cambridge, MA 02139 ronitt@theory.lcs.mit.edu Rocco A. Servedio Department of Computer

More information

arxiv: v2 [cs.ds] 3 Oct 2017

arxiv: v2 [cs.ds] 3 Oct 2017 Orthogonal Vectors Indexing Isaac Goldstein 1, Moshe Lewenstein 1, and Ely Porat 1 1 Bar-Ilan University, Ramat Gan, Israel {goldshi,moshe,porately}@cs.biu.ac.il arxiv:1710.00586v2 [cs.ds] 3 Oct 2017 Abstract

More information

Multi-Linear Formulas for Permanent and Determinant are of Super-Polynomial Size

Multi-Linear Formulas for Permanent and Determinant are of Super-Polynomial Size Multi-Linear Formulas for Permanent and Determinant are of Super-Polynomial Size Ran Raz Weizmann Institute ranraz@wisdom.weizmann.ac.il Abstract An arithmetic formula is multi-linear if the polynomial

More information

CS 161: Design and Analysis of Algorithms

CS 161: Design and Analysis of Algorithms CS 161: Design and Analysis of Algorithms Greedy Algorithms 3: Minimum Spanning Trees/Scheduling Disjoint Sets, continued Analysis of Kruskal s Algorithm Interval Scheduling Disjoint Sets, Continued Each

More information

Notes on Logarithmic Lower Bounds in the Cell Probe Model

Notes on Logarithmic Lower Bounds in the Cell Probe Model Notes on Logarithmic Lower Bounds in the Cell Probe Model Kevin Zatloukal November 10, 2010 1 Overview Paper is by Mihai Pâtraşcu and Erik Demaine. Both were at MIT at the time. (Mihai is now at AT&T Labs.)

More information

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond Measure Theory on Topological Spaces Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond May 22, 2011 Contents 1 Introduction 2 1.1 The Riemann Integral........................................ 2 1.2 Measurable..............................................

More information

Lecture 1 : Probabilistic Method

Lecture 1 : Probabilistic Method IITM-CS6845: Theory Jan 04, 01 Lecturer: N.S.Narayanaswamy Lecture 1 : Probabilistic Method Scribe: R.Krithika The probabilistic method is a technique to deal with combinatorial problems by introducing

More information

Lecture 3 Frequency Moments, Heavy Hitters

Lecture 3 Frequency Moments, Heavy Hitters COMS E6998-9: Algorithmic Techniques for Massive Data Sep 15, 2015 Lecture 3 Frequency Moments, Heavy Hitters Instructor: Alex Andoni Scribes: Daniel Alabi, Wangda Zhang 1 Introduction This lecture is

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

Resolution Lower Bounds for the Weak Pigeonhole Principle

Resolution Lower Bounds for the Weak Pigeonhole Principle Electronic Colloquium on Computational Complexity, Report No. 21 (2001) Resolution Lower Bounds for the Weak Pigeonhole Principle Ran Raz Weizmann Institute, and The Institute for Advanced Study ranraz@wisdom.weizmann.ac.il

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information