On Estimating Frequency Moments of Data Streams


Sumit Ganguly 1 and Graham Cormode 2

1 Indian Institute of Technology, Kanpur. sganguly@iitk.ac.in
2 AT&T Labs Research. graham@research.att.com

Abstract. Space-economical estimation of the $p$-th frequency moment, defined as $F_p = \sum_{i=1}^{n}|f_i|^p$, for $p > 0$, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches, formed as the inner product of the frequency vector $f_1, \ldots, f_n$ with a suitably chosen random vector, were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating $F_p$ and in data stream computations in general. The concept of $p$-stable sketches, formed as the inner product of the frequency vector with a random vector whose components are drawn from a $p$-stable distribution, was proposed by Indyk [11] for estimating $F_p$, for $0 < p < 2$, and has been further studied by Li [13].

In this paper, we consider the problem of estimating $F_p$, for $0 < p < 2$. A disadvantage of the stable sketches technique and its variants is that they require $O(\frac{1}{\varepsilon^2})$ inner products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained, which means that each stream update can be quite time-consuming. We present algorithms for estimating $F_p$, for $0 < p < 2$, that do not require the use of stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space $\tilde O(\frac{1}{\varepsilon^{2+p}})$ (following standard convention, we say that $f(n)$ is $\tilde O(g(n))$ if $f(n) = O\big((\frac{1}{\varepsilon})^{O(1)}\,(\log m)^{O(1)}\,(\log n)^{O(1)}\,g(n)\big)$) to estimate $F_p$ to within $1 \pm \varepsilon$ factors, and require expected time $O(\log F_1\,\log\frac{1}{\delta})$ to process each stream update. Thus, our technique trades an $O(\frac{1}{\varepsilon^{p}})$ factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for $F_1$.

1 Introduction

Recently, there has been an emergence of monitoring applications in diverse areas, including network traffic monitoring, network topology monitoring, sensor networks, financial market monitoring, and web-log monitoring. In these applications, data is generated rapidly and continuously, and must be analyzed efficiently, in real time and in a single pass over the data, to identify large trends, anomalies, user-defined exception conditions, and so on. In many of these applications, it is often required to continuously track the big picture, or an aggregate view of the data, as opposed to a detailed view. In such scenarios, efficient approximate computation is often acceptable. The data streaming model has gained popularity as a computational model for such applications, where incoming data (or updates) are processed very efficiently, in an online fashion, and using space much less than that needed to store the data in its entirety.

A data stream $S$ is viewed as a sequence of arrivals of the form $(i, v)$, where $i$ is the identity of an item that is a member of the domain $[n] = \{1, \ldots, n\}$ and $v$ is the update to the frequency of the item: $v > 0$ indicates an insertion of multiplicity $v$, while $v < 0$ indicates a corresponding deletion. The frequency of an item $i$, denoted $f_i$, is the sum of the updates to $i$ since the inception of the stream, that is, $f_i = \sum_{(i,v)\,\text{appears in}\,S}\,v$.

In this paper, we consider the problem of estimating the $p$-th frequency moment of a data stream, defined as $F_p = \sum_{i=1}^{n}|f_i|^p$, for $0 < p < 2$. Equivalently, this can be interpreted as the $p$-th power of the $L_p$ norm of the vector defined by the stream. The techniques used to design algorithms and lower bounds for the frequency moment problem have been influential in the design of algorithmic and lower bound techniques for data stream computation in general. We briefly review the current state of the art in estimating $F_p$, with particular emphasis on the range $0 < p < 2$.

1.1 Review

Alon, Matias and Szegedy [1] presented the seminal technique of AMS sketches for estimating $F_2$. An (atomic) AMS sketch is a random integer $X = \sum_{i=1}^{n} f_i\,\xi_i$, where $\{\xi_i\}_{i=1,2,\ldots,n}$ is a family of four-wise independent random variables assuming the values $+1$ or $-1$ with equal probability. An AMS sketch is easily maintained over a stream: for each update $(i, v)$, the sketch $X$ is updated as $X := X + v\,\xi_i$. Moreover, since the family $\{\xi_i\}$ is only 4-wise independent, each $\xi_i$ can be obtained from a randomly chosen cubic polynomial $h$ over a field $\mathbb{F}$ that contains the domain of items $[n]$ (set $\xi_i = 1$ if the last bit of $h(i)$ is 1, and $\xi_i = -1$ otherwise). It then follows that $\mathbb{E}[X^2] = F_2$ and $\mathrm{Var}[X^2] \le 2\,F_2^2$ [1]. An estimate of $F_2$ that is accurate to within $1\pm\varepsilon$ with confidence $\frac{7}{8}$ can therefore be obtained as the average of the squares of $O(\frac{1}{\varepsilon^2})$ independent sketch values.

There has also been significant study of the case $p = 0$, also known as the distinct elements problem. Alon, Matias and Szegedy [1] gave a constant-factor approximation in small space. Gibbons and Tirthapura [10] showed a $(1\pm\varepsilon)$-factor approximation in space $\tilde O(\frac{1}{\varepsilon^2})$; subsequent work has improved the (hidden) logarithmic factors [2].

p-stable sketches. The use of $p$-stable sketches was pioneered by Indyk [11] for estimating $F_p$, with $0 < p < 2$. A stable sketch of $S$ is defined as $Y = \sum_{i=1}^{n} f_i\,s_i$, where each $s_i$ is drawn at random from a $p$-stable distribution, denoted $S(p, 1)$ (the second parameter of $S(\cdot,\cdot)$ is the scale factor). By the defining property of $p$-stable distributions, $Y$ is distributed as $S\big(p, (\sum_{i=1}^{n}|f_i|^p)^{1/p}\big)$. In other words, $Y$ is $p$-stable distributed, with scale factor $F_p^{1/p}$. Indyk gives a technique to estimate $F_p$ by keeping $O(\frac{1}{\varepsilon^2})$ independent $p$-stable sketches and returning the median of these observations [11]. Woodruff [18] presents an $\Omega(\frac{1}{\varepsilon^2})$ space lower bound for the problem of estimating $F_p$, for all $p \ge 0$, implying that the stable sketches technique is space optimal up to logarithmic factors.
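To make the stable-sketch baseline concrete, the following minimal Python sketch implements Indyk's median estimator for the case $p = 1$, where the $p$-stable distribution is the standard Cauchy distribution (sampled by inversion as $\tan(\pi(u - \frac12))$). The class name and the trick of re-deriving each random variate from a seeded generator are illustrative assumptions of this sketch, not details fixed by the paper.

```python
import math
import random

class CauchySketchF1:
    """Illustrative 1-stable sketch for F_1 = sum_i |f_i| (Indyk [11], p = 1).

    Maintains r inner products Y_j = sum_i f_i * s_ji with s_ji standard
    Cauchy.  By 1-stability, each Y_j is Cauchy with scale F_1, so the
    median of |Y_1|, ..., |Y_r| estimates F_1; r = O(1/eps^2) copies give
    a (1 +- eps)-accurate estimate with constant probability.
    """

    def __init__(self, r, seed=0):
        self.r = r
        self.seed = seed
        self.sketch = [0.0] * r

    def _cauchy(self, j, i):
        # Deterministically regenerate the Cauchy variate for coordinate
        # (j, i) from a seeded generator, instead of storing a dense matrix.
        rng = random.Random(hash((self.seed, j, i)))
        return math.tan(math.pi * (rng.random() - 0.5))

    def update(self, i, v):
        # Every one of the r sketches is touched per update; this O(r)
        # per-update cost is exactly what the present paper seeks to avoid.
        for j in range(self.r):
            self.sketch[j] += v * self._cauchy(j, i)

    def estimate(self):
        ys = sorted(abs(y) for y in self.sketch)
        return ys[len(ys) // 2]
```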

Li [13] further analyses stable sketches and suggests the use of the geometric mean estimator
$$\hat F_p = c_{p,k}\,\prod_{j=1}^{k}|Y_j|^{\,p/k},$$
where $Y_1, Y_2, \ldots, Y_k$ are independent $p$-stable sketches of the data stream and $c_{p,k}$ is a normalizing constant. Li shows that the above estimator is unbiased, that is, $\mathbb{E}[\hat F_p] = F_p$, and that $\mathrm{Var}[\hat F_p] \le \frac{\pi^2\,F_p^2}{6\,p^2\,k}$. It follows (by Chebychev's inequality) that returning the geometric mean of $k = O(\frac{1}{\varepsilon^2 p^2})$ sketches yields an estimate of $F_p$ that is accurate to within factors of $(1\pm\varepsilon)$ with probability $\frac{7}{8}$. Li also shows that the geometric mean estimator has lower variance than the median estimator proposed originally by Indyk [11].

Very sparse sketches. The very sparse sketch method due to Li et al. aims to maintain the same space and accuracy bounds, but to reduce the time cost of processing each update [14, 13]. Note that this technique applies only when the data satisfies some uniformity properties, whereas the preceding techniques need no such assumptions. A very sparse (nearly) $p$-stable sketch is a linear combination of the form $W = \sum_{i=1}^{n} f_i\,w_i$, where $w_i$ is $P_p$ with probability $\frac{\beta}{2}$, $-P_p$ with probability $\frac{\beta}{2}$, and $0$ otherwise. Here, $P_p$ is the $p$-Pareto distribution, with probability tail function $\Pr\{P_p > t\} = \frac{1}{t^{\,p}}$, $t \ge 1$. Pareto distributions are proposed since they are much simpler to sample from than stable distributions. Further, Li shows that $W$ is asymptotically $p$-stable, provided $\frac{F_\infty}{F_p^{1/p}} \to 0$, where $F_\infty = \max_i |f_i|$. Thus, very sparse sketches provide a principled way of reducing the data stream processing time, provided the data satisfies certain uniformity properties.

Drawbacks of stable-based methods. A drawback of the original technique of stable sketches is that, in general, for each stream update, all of the $O(\frac{1}{\varepsilon^2})$ stable sketches must be updated. Each sketch update requires the pseudo-random generation of a random variable drawn from a $p$-stable distribution, making it time-consuming to process each stream update. The very sparse stable sketches somewhat alleviate this problem by speeding up the processing time by a factor of approximately $\frac{1}{\beta}$, although the data must now satisfy certain uniformity conditions. In general, it is not possible to guarantee or verify a priori whether the data stream satisfies the desired properties. We therefore advocate that, in the absence of knowledge of the data distribution, the geometric mean estimator over $p$-stable sketches is the most reliable of the known estimators, although it remains quite expensive.

Contributions. In this paper, we present a technique for estimating $F_p$, for $0 < p < 2$. Our technique requires space $O(\frac{1}{\varepsilon^{2+p}}\,\log F_1\,\log n)$ to estimate $F_p$ to within relative error $(1\pm\varepsilon)$ with probability $\frac{7}{8}$. Further, it requires $O(\log n)$ expected time (and $O(\log F_1\,\log n)$ worst-case time) to process each stream update. Thus, our technique trades a factor of $O(\frac{1}{\varepsilon^{p}})$ in space for improved processing time per stream update. From an algorithm design viewpoint, perhaps the most salient feature of the technique is that it makes no recourse to stable distributions. Our technique is elementary in nature, and uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. It is based on making some crucial and subtle modifications to the HSS technique [3].

Organization. The remainder of the paper is organized as follows. In Section 2, we review the HSS technique for estimating a class of data stream metrics. Sections 3 and 4 respectively present a family of algorithms for estimating $F_p$ and an iterative estimator for $F_1$. Finally, we conclude in Section 5.
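As background for the discussion above, the following sketch shows the two sampling ingredients involved: a symmetric $p$-stable variate generated by the Chambers, Mallows and Stuck transform (a standard method; the paper itself does not prescribe one), and a signed Pareto entry of a very sparse sketch vector. The function names and the sparsity parameter `beta` are illustrative.

```python
import math
import random

def sample_p_stable(p, rng):
    """Symmetric p-stable variate via the Chambers-Mallows-Stuck transform.

    theta ~ Uniform(-pi/2, pi/2), w ~ Exponential(1).  For p = 1 this
    reduces to a standard Cauchy; the general formula assumes 0 < p <= 2.
    """
    theta = (rng.random() - 0.5) * math.pi
    w = rng.expovariate(1.0)
    if abs(p - 1.0) < 1e-12:
        return math.tan(theta)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

def sample_very_sparse_weight(p, beta, rng):
    """Entry w_i of a very sparse sketch vector (Li et al. style):
    +P_p with prob. beta/2, -P_p with prob. beta/2, and 0 otherwise,
    where P_p has tail Pr{P_p > t} = t^{-p}, t >= 1 (sampled by inversion)."""
    u = rng.random()
    if u >= beta:
        return 0.0
    pareto = (1.0 - rng.random()) ** (-1.0 / p)  # inverse-CDF Pareto sample
    return pareto if u < beta / 2 else -pareto
```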

2 Review of the HSS technique

In this section, we briefly review the HSS (Hierarchical Sampling from Sketches) procedure [3] for estimating $F_p$, $p \ge 2$, over data streams. Appendix A reviews the COUNTSKETCH and COUNT-MIN algorithms for finding frequent items in a data stream, together with algorithms for estimating the residual first and second moments of a data stream [9].

The HSS method is a general technique for estimating a class of metrics over data streams of the form
$$\Psi(S) = \sum_{i : f_i > 0} \psi(f_i). \qquad (1)$$
From the input stream $S$, sub-streams $S_0, \ldots, S_L$ are created by successive sub-sampling; that is, $S_0 = S$ and, for $1 \le l \le L$, $S_l$ is obtained from $S_{l-1}$ by sub-sampling each distinct item appearing in $S_{l-1}$ independently with probability $\frac{1}{2}$ (hence $L = O(\log n)$). Let $k$ be a space parameter. At each level $l$, we keep a frequent-items data structure, denoted $D_l(k)$, that takes as input the sub-stream $S_l$ and returns an approximation to the top($k$) items of its input stream and their frequencies. $D_l(k)$ is instantiated by either the COUNT-MIN or the COUNTSKETCH data structure. At level $l$, suppose that the frequent-items structure has an additive error of $\Delta_l(k)$ (with high probability); that is, $|\hat f_i - f_i| \le \Delta_l(k)$ with probability $1 - 2^{-t}$, where $t$ is a parameter. Define $F_1^{res}(k, l)$ to be (the random variable denoting) the value of $F_1$ of the sub-stream $S_l$ after removing the $k$ largest absolute frequencies, and $F_2^{res}(k, l)$ to be the corresponding value of $F_2$. The (non-random) value $F_1^{res}(k, 0)$ (respectively, $F_2^{res}(k, 0)$) is written as $F_1^{res}(k)$ (resp. $F_2^{res}(k)$).

Lemma 1 (Lemma 1 from [3]).
1. For $l \ge 1$ and $k \ge 48$, $F_1^{res}(k, l) \le \frac{F_1^{res}(k)}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$.
2. For $l \ge 1$ and $k \ge 48$, $F_2^{res}(k, l) \le \frac{F_2^{res}(k)}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$.

By the above lemma, let $\Delta_0 = \frac{F_1^{res}(k)}{k}$ or $\Delta_0 = \big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$, depending on whether the COUNT-MIN or the COUNTSKETCH structure is used as the frequent-items structure at each level. Let $\bar\varepsilon = \frac{\varepsilon}{16}$, $T_0 = \frac{\Delta_0}{\bar\varepsilon}$ and $T_l = \frac{T_0}{2^{\,l}}$, for $l = 0, 1, \ldots, \log T_0$. The items are grouped into groups $G_0, G_1, \ldots, G_L$ as follows: $G_0 = \{i \in S : |f_i| \ge T_0\}$ and $G_l = \{i \in S : T_l \le |f_i| < T_{l-1}\}$, $1 \le l \le L$. It follows that, with high probability, for all items of $G_l$ that are present in the random sub-stream $S_l$, $\hat f_{i,l} \ge \frac{\Delta_l}{\bar\varepsilon}$ and $|\hat f_i - f_i| \le \bar\varepsilon f_i$.

Corresponding to every stream update $(i, v)$, we use a hash function $h : [n] \to [n]$ to map the item $i$ to a level $u = \mathrm{lsb}(h(i))$, where $\mathrm{lsb}(x)$ is the position of the least significant 1 in the binary representation of $x$. The stream update $(i, v)$ is then propagated to the frequent-items data structures $D_l$ for $0 \le l \le u$, so that, in effect, $i$ is included in the sub-streams from level 0 to level $u$ (a code sketch of this update path follows). The hash function is assumed to be chosen randomly from a fully independent family; later, we reduce the number of random bits required by a standard data streaming argument.
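The update path just described is summarized by the following Python sketch; the seeded integer hash and the `sketches[l].update(i, v)` interface are assumptions of this illustration rather than the paper's exact choices.

```python
import random

L = 20  # number of sub-sampling levels; L = O(log n)

def lsb(x):
    """Position of the least significant 1 bit of x (lsb(1) = 0)."""
    return (x & -x).bit_length() - 1

def level_of(i, seed=0):
    """Level u = lsb(h(i)): item i survives to level l with prob. 2^{-l}."""
    rng = random.Random(hash(("level", seed, i)))
    h = rng.getrandbits(L) | (1 << L)  # set bit L so lsb() is well defined
    return lsb(h)

def process_update(i, v, sketches):
    """Propagate update (i, v) to the frequent-items structures D_0, ..., D_u,
    so that i is included in sub-streams S_0, ..., S_u."""
    u = level_of(i)
    for l in range(u + 1):
        sketches[l].update(i, v)
```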

At the time of inference, the algorithm collects samples as follows. From each level $l$, the set of items whose estimated frequency crosses the threshold $T_l = \frac{\Delta_0}{\bar\varepsilon\,2^{\,l}}$ is identified, using the frequent-items structure $D_l$. It is possible for the estimate $\hat f_{i,l}$ of an item $i$, obtained from the sub-stream $S_l$, to exceed this threshold at multiple levels $l$. We therefore apply the disambiguation rule of using the estimate obtained from the lowest level at which it crosses the threshold for that level. The estimated frequency after disambiguation is denoted $\hat f_i$. Based on their disambiguated frequencies, the sampled items are sorted into their respective groups $\bar G_0, \bar G_1, \ldots, \bar G_L$, as follows:
$$\bar G_0 = \{i \mid \hat f_i \ge T_0\} \quad\text{and}\quad \bar G_l = \{i \mid T_l \le \hat f_i < T_{l-1} \text{ and } i \in S_l\}, \quad 1 \le l \le L.$$
We define the estimator $\hat\Psi$, and a second, idealized estimator $\tilde\Psi$ that is used for the analysis only:
$$\hat\Psi = \sum_{l=0}^{L}\sum_{i\in\bar G_l}\psi(\hat f_i)\,2^{\,l}, \qquad \tilde\Psi = \sum_{l=0}^{L}\sum_{i\in\bar G_l}\psi(f_i)\,2^{\,l}. \qquad (2)$$
We now briefly review the salient points of the error analysis. Lemma 2 shows that the expected value of $\tilde\Psi$ is close to $\Psi$.

Lemma 2 (Lemma 2 from [3]). Suppose that for $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \bar\varepsilon f_i$ with probability $1 - 2^{-t}$. Then $|\mathbb{E}[\tilde\Psi] - \Psi| \le \frac{\Psi}{2^{\,t + \log L}}$.

We next present a bound on the variance of the idealized estimator. The frequency group $G_l$ is partitioned into three sub-groups, namely, $\mathrm{lmargin}(G_l) = [T_l,\ T_l(1+\bar\varepsilon/2)]$, $\mathrm{rmargin}(G_l) = [T_{l-1}(1-\bar\varepsilon),\ T_{l-1}]$ and $\mathrm{midregion}(G_l) = [T_l(1+\bar\varepsilon/2),\ T_{l-1}(1-\bar\varepsilon)]$, which respectively denote the left margin, the right margin and the mid-region of the group $G_l$. An item $i$ is said to belong to one of these regions if its true frequency lies in that region. For any item $i$ with non-zero frequency, we denote by $l(i)$ the group index $l$ such that $i \in G_l$. For any subset $T \subseteq [n]$, denote by $\psi(T)$ the expression $\sum_{i\in T}\psi(f_i)$, and let $\Psi_2 = \Psi_2(S)$ denote $\sum_{i=1}^{n}\psi^2(f_i)$.

Lemma 3 (Lemma 3 from [3]). Suppose that for all $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \bar\varepsilon f_i$ with probability $1 - 2^{-t}$. Then
$$\mathrm{Var}\big[\tilde\Psi\big] \le \frac{\Psi^2}{2^{\,t+L+2}} + \sum_{i \notin (G_0\,\cup\,\mathrm{lmargin}(G_0))}\psi^2(f_i)\,2^{\,l(i)+1}.$$

Corollary 4. If the function $\psi(\cdot)$ is non-decreasing in the interval $[0 \ldots T_0]$, then, choosing $t = L + 2\log\frac{1}{\bar\varepsilon} + 2$, we get
$$\mathrm{Var}\big[\tilde\Psi\big] \le \bar\varepsilon^2\,\Psi^2 + \sum_{l=0}^{L}\psi(T_{l-1})\,\psi(G_l)\,2^{\,l+1} + 2\,\psi(T_0)\,\psi(\mathrm{lmargin}(G_0)). \qquad (3)$$

The error incurred by the estimate $\hat\Psi$ is $|\hat\Psi - \Psi|$, and it can be written as the sum of two error components using the triangle inequality:
$$|\hat\Psi - \Psi| \le |\tilde\Psi - \Psi| + |\hat\Psi - \tilde\Psi| = E_1 + E_2.$$
Here, $E_1 = |\tilde\Psi - \Psi|$ is the error due to sampling, and $E_2 = |\hat\Psi - \tilde\Psi|$ is the error due to the approximate estimation of the frequencies. By Chebychev's inequality,
$$E_1 = |\tilde\Psi - \Psi| \le \big|\mathbb{E}[\tilde\Psi] - \Psi\big| + 3\,\mathrm{Var}\big[\tilde\Psi\big]^{1/2}$$
with probability $\frac{8}{9}$.

Using Lemma 2 and Corollary 4, and choosing $t = L + 2\log\frac{1}{\bar\varepsilon} + 2$, the expression for $E_1$ can be simplified as follows:
$$E_1 \le \frac{\bar\varepsilon^2\,\Psi}{L\,2^{L}} + 3\left(\bar\varepsilon^2\Psi^2 + \sum_{l=0}^{L}\psi(T_{l-1})\,\psi(G_l)\,2^{\,l+1} + 2\,\psi(T_0)\,\psi(\mathrm{lmargin}(G_0))\right)^{1/2} \qquad (4)$$
with probability $\frac{8}{9}$. We now present an upper bound on $E_2$.

Lemma 5. Suppose that for $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \bar\varepsilon f_i$ with probability $1 - 2^{-t}$. Then
$$|E_2| \le 2\sum_{l=0}^{L}\sum_{i\in G_l}\frac{\Delta_0}{2^{\,l}}\,\psi'(\xi_i)$$
with probability at least $\frac{9}{10} - 2^{-t}$, where, for $i \in G_l$, $\xi_i$ lies between $f_i$ and $\hat f_i$ and maximizes $\psi'(\cdot)$.

The analysis assumes that the hash function mapping items to levels is completely independent. We adopt a standard technique for reducing the required randomness, namely, the pseudo-random generator (PRG) of Nisan [15], along the lines of Indyk [11] and of Indyk and Woodruff [12]. More details are provided in Appendix B.

3 Estimating $F_p$

In this section, we use the HSS technique, with some subtle but vital modifications, to estimate $F_p$ for $0 < p < 2$. We use the COUNTSKETCH structure as the frequent-items structure at each level $l$. We observe that a direct application of the HSS technique does not give an $\tilde O(1)$-space procedure, and so we need some novel analysis. To see this, suppose that $k$ is the space parameter of the COUNTSKETCH procedure at each level. Firstly, observe that
$$|E_2| \le 2\sum_{l=0}^{L}\sum_{i\in G_l}\frac{\Delta_0}{2^{\,l}}\,\psi'(\xi_i) \le 2\sum_{l=0}^{L}\sum_{i\in G_l}\bar\varepsilon f_i \cdot p\,(2f_i)^{p-1} \le 2^{\,1+p}\,p\,\bar\varepsilon\sum_{i}|f_i|^p \le \varepsilon F_p$$
as required, since $p < 2$ and $\frac{\Delta_0}{2^{\,l}} \le \bar\varepsilon f_i$ for $i \in G_l$. Now, by equation (4) and using $t = L + 2\log\frac{1}{\bar\varepsilon} + 2$, we have
$$E_1 \le \bar\varepsilon^2 F_p + 3\left(\bar\varepsilon^2 F_p^2 + \frac{1}{\bar\varepsilon^{\,p}}\sum_{l=0}^{L}\left(\frac{F_2^{res}(k)}{k}\right)^{p/2}\frac{2^{\,l+1}}{2^{\,lp}}\,F_p(G_l) + \frac{2\,(1+\bar\varepsilon)}{\bar\varepsilon^{\,p}}\left(\frac{F_2^{res}(k)}{k}\right)^{p/2} F_p(\mathrm{lmargin}(G_0))\right)^{1/2}.$$
Further, if we write $f_{rank(r)}$ to denote the $r$-th largest frequency (in absolute value), then
$$F_2^{res}(k) = \sum_{r>k} f_{rank(r)}^2 \le f_{rank(k+1)}^{\,2-p}\sum_{r>k}|f_{rank(r)}|^p \le \left(\frac{F_p}{k}\right)^{2/p - 1} F_p,$$
so that $\big(\frac{F_2^{res}(k)}{k}\big)^{p/2} \le \frac{F_p}{k}$, and hence the expression for $E_1$ simplifies to
$$E_1 \le \bar\varepsilon^2 F_p + 3\left(\bar\varepsilon^2 F_p^2 + \frac{F_p}{\bar\varepsilon^{\,p}\,k}\sum_{l=0}^{L} 2^{\,l+1-lp}\,F_p(G_l) + \frac{2\,(1+\bar\varepsilon)\,F_p}{\bar\varepsilon^{\,p}\,k}\,F_p(\mathrm{lmargin}(G_0))\right)^{1/2}, \qquad\text{for } p > 0.$$

The main problem arises with the middle term in the above expression, namely $\frac{F_p}{\bar\varepsilon^{\,p} k}\sum_{l=0}^{L} 2^{\,l+1-lp}\,F_p(G_l)$, which can be quite large. Our approach relies on altering the group definitions to make them depend on the (randomized) quantities $F_2^{res}(k, l)$, so that the resulting expression for $E_1$ becomes bounded by $O(\varepsilon F_p)$. We also note that the expression for $E_2$ derived above remains bounded by $\varepsilon F_p$.

Altered group definitions and estimator. We use the COUNTSKETCH data structure as the frequent-items structure at each level $l = 0, 1, \ldots, L$, with $k = O(\frac{1}{\varepsilon^{2+p}})$ buckets per hash function and $s = O(\log F_1)$ hash functions per sketch. We first observe that Lemma 1 can be slightly strengthened, as follows.

Lemma 6 (A slightly stronger version of Lemma 1). For $l \ge 1$ and $k \ge 48$:
1. $F_1^{res}(k, l) \le \frac{F_1^{res}(k\,2^{\,l-1})}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$;
2. $F_2^{res}(k, l) \le \frac{F_2^{res}(k\,2^{\,l-1})}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$.

Proof. The result follows from the proof given for Lemma 1 in [3].

At each level $l$, we use the COUNTSKETCH structure to estimate $F_2^{res}(k, l)$ to within an accuracy of $(1 \pm \frac{1}{4})$ with probability $1 - \frac{1}{16L}$, where $\bar\varepsilon = \frac{\varepsilon}{32}$. Let $\bar F_2^{res}(k, l)$ denote $\frac{5}{4}\,\hat F_2^{res}(k, l)$. We redefine the thresholds $T_0, T_1, \ldots$ as follows:
$$\Delta_l = \left(\frac{\bar F_2^{res}(k, l)}{k\,2^{\,l}}\right)^{1/2}, \qquad T_l = \frac{\Delta_l}{\bar\varepsilon}, \qquad l = 0, 1, 2, \ldots, L.$$
The groups $G_0, G_1, G_2, \ldots$ are defined in the usual way, using the new thresholds:
$$G_0 = \{i \mid |f_i| \ge T_0\} \quad\text{and}\quad G_l = \{i \mid T_l \le |f_i| < T_{l-1}\}.$$
The estimator is defined by (2) as before.

Lemma 7. Suppose $k \ge \frac{16}{\bar\varepsilon^{\,2+p}}$. Then $|\hat F_p - F_p| \le 8\,\bar\varepsilon F_p$ with probability at least $\frac{3}{4}$.

Proof. We use the property of $F_2^{res}(t)$ derived above: for any $1 \le t \le F_0$,
$$F_2^{res}(t) \le t\left(\frac{F_p}{t}\right)^{2/p} = \frac{F_p^{2/p}}{t^{\,2/p - 1}}, \qquad\text{for } p > 0. \qquad (5)$$
We therefore have
$$T_l = \frac{1}{\bar\varepsilon}\left(\frac{\bar F_2^{res}(k, l)}{k\,2^{\,l}}\right)^{1/2} \le \frac{1}{\bar\varepsilon}\left(\frac{5\,F_2^{res}(k\,2^{\,l-1})}{4\,k\,2^{\,l}}\right)^{1/2}, \quad\text{by Lemma 6,}$$
$$\le \frac{1}{\bar\varepsilon}\left(\frac{5\,F_p^{2/p}}{4\,(k\,2^{\,l-1})^{2/p-1}\,k\,2^{\,l}}\right)^{1/2}, \quad\text{by (5),} \qquad \le \frac{1}{\bar\varepsilon}\left(\frac{5\,F_p}{k\,2^{\,l}}\right)^{1/p}. \qquad (6)$$

By equation (4), one component of $E_1$ can be simplified as follows:
$$E_{1,1} = \sum_{l=0}^{L} T_l^{\,p}\,F_p(G_l)\,2^{\,l+1} \le \frac{10}{\bar\varepsilon^{\,p}\,k}\,F_p\sum_{l=0}^{L}F_p(G_l), \quad\text{substituting (6), and since } p < 2,$$
$$= \frac{10}{\bar\varepsilon^{\,p}\,k}\,F_p\,\big(F_p - F_p(G_0)\big) \le \frac{5}{8}\,\bar\varepsilon^2 F_p^2, \quad\text{since } k \ge \frac{16}{\bar\varepsilon^{\,2+p}}.$$
The other component of $E_1$ is
$$E_{1,2} = 2\,T_0^{\,p}\,(1+\bar\varepsilon)\,F_p(\mathrm{lmargin}(G_0)) \le \frac{10}{\bar\varepsilon^{\,p}\,k}\,(1+\bar\varepsilon)\,F_p^2 \le 2\,\bar\varepsilon^2 F_p^2,$$
also using $k \ge \frac{16}{\bar\varepsilon^{\,2+p}}$ and $\bar\varepsilon < 1$. Substituting in (4), we have
$$E_1 \le \bar\varepsilon^2 F_p + 3\,\big(\bar\varepsilon^2 F_p^2 + E_{1,1} + E_{1,2}\big)^{1/2} < 6\,\bar\varepsilon F_p. \qquad (7)$$
Adding, the total error is bounded by
$$|E| \le E_1 + E_2 \le 8\,\bar\varepsilon F_p.$$
We summarize this section in the following theorem.

Theorem 8. There exists an algorithm that returns an estimate $\hat F_p$ satisfying $|\hat F_p - F_p| \le \varepsilon F_p$ with probability at least $\frac{3}{4}$, using space $O\big(\frac{1}{\varepsilon^{2+p}}\,(\log n)(\log F_1)\big)$, and that processes each stream update in expected time $O(\log n)$ and worst-case time $O(\log n\,\log F_1)$, measured in standard arithmetic operations on words of size $O(\log F_1)$ bits.

Remarks.
1. We note that for $0 < p < 1$, an estimator for $F_p$ with similar properties may be designed in an exactly analogous fashion by using COUNT-MIN instead of COUNTSKETCH as the frequent-items structure at each level. Such an estimator would require an $\varepsilon$-accurate estimation of $F_1^{res}(k)$ (which would imply an estimation of $F_1$ using standard techniques). This could be done either using Cauchy sketches [11, 13] or using the stand-alone technique presented in Section 4. However, using Cauchy sketches means that, in general, $O(\frac{1}{\varepsilon^2})$ time is required to process each stream update. In order to maintain poly-logarithmic processing time per stream update, the technique of Section 4 may be used.
2. The space requirement of the stable sketches estimator grows as $\tilde O(\frac{1}{\varepsilon^2 p^2})$ as a function of $p$ [13], whereas the HSS-based technique requires space $\tilde O(\frac{1}{\varepsilon^{2+p}})$. For small values of $p$, i.e., $p = O\big(\frac{1 + \log\log\frac{1}{\varepsilon}}{\log\frac{1}{\varepsilon}}\big)$, the HSS technique can be asymptotically more space-efficient.
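To isolate the role of the sub-sampling, the following sketch computes the idealized estimator $\tilde\Psi$ of (2) with $\psi(f) = |f|^p$, feeding it exact frequencies instead of COUNTSKETCH estimates; the threshold list and the seeded coin flips are assumptions of this illustration. An item of group $G_l$ contributes $|f_i|^p\,2^{\,l}$ precisely when it survives $l$ independent halvings, an event of probability $2^{-l}$, which is why the estimator is unbiased.

```python
import random

def idealized_hss_estimate(freqs, thresholds, p, seed=0):
    """Idealized HSS estimator (the paper's tilde-Psi with psi(f) = |f|^p):
    frequency estimation is assumed exact, so only sampling error remains.
    `thresholds` is the decreasing list T_0 > T_1 > ... > T_L."""
    L = len(thresholds) - 1
    est = 0.0
    for i, f in freqs.items():
        l = 0  # group index: G_0 = [T_0, inf), G_l = [T_l, T_{l-1})
        while l < L and abs(f) < thresholds[l]:
            l += 1  # items below T_L are lumped into group L here
        rng = random.Random(hash((seed, i)))
        if all(rng.random() < 0.5 for _ in range(l)):  # survives l halvings
            est += abs(f) ** p * (2 ** l)
    return est
```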

4 An iterative estimator for $F_1$

In this section, we use the HSS technique to present a stand-alone, iterative estimator for $F_1 = \sum_{i=1}^{n}|f_i|$. The previous section presented an estimator for $F_p$ that uses, as a sub-routine, an estimator for $F_2^{res}(k)$ at each level of the structure. In this section, we present a stand-alone estimator that uses only the COUNT-MIN sketch to estimate $F_1$. The technique may be of independent interest.

The estimator uses two separate instantiations of the HSS structure. The first instantiation uses the COUNT-MIN sketch structure with $k = \frac{8}{\bar\varepsilon^2}$ buckets per hash function and $s = O(\log G)$ hash functions, where $G = O(F_2)$ and $\bar\varepsilon = \frac{\varepsilon}{8}$. A collection of $s_1 = O(\log\frac{1}{\delta})$ independent copies of the structure is kept for the purpose of boosting the confidence of an estimate by taking the median. The second instantiation of the HSS structure uses $k = \frac{128}{\bar\varepsilon^3}$ buckets per hash function and $s = O(\log G)$ hash functions. For estimating $F_1$, we use a two-step procedure: (1) first, we obtain an estimate of $F_1$ that is correct to within a factor of 16, using the first HSS instantiation; (2) then, we use the second instantiation to obtain an $\varepsilon$-accurate estimate of $F_1$.

The first step of the estimation is done using the first instantiation of the HSS structure as follows. We set the threshold $T_0$ to a parameter $t$, with $T_l = \frac{T_0}{2^{\,l}}$, and the threshold frequency for placing an item in group $l$ is $T_l$. The group definitions are as before: $G_0 = [t, F_1]$ and $G_l = [\frac{t}{2^{\,l}}, \frac{t}{2^{\,l-1}})$, for $1 \le l \le L$. The disambiguation rule for the estimated frequency is as follows: if $\hat f_{i,l} > T_l$ at several levels $l$, then $\hat f_i$ is set to the estimate obtained from the lowest of these levels. The sampled groups $\bar G_l$ are defined as follows:
$$\bar G_0 = \{i \mid \hat f_i \ge T_0\}, \qquad \bar G_l = \Big\{i \;\Big|\; \frac{T_0}{2^{\,l}} \le \hat f_i < \frac{T_0}{2^{\,l-1}} \text{ and } i \in S_l\Big\}, \quad 1 \le l \le L.$$
The estimators $\hat F_1$ and $\tilde F_1$ are defined as before; they are now functions of $t$:
$$\hat F_1(t) = \sum_{l=0}^{L}\sum_{i\in\bar G_l}\hat f_i\,2^{\,l}, \qquad \tilde F_1(t) = \sum_{l=0}^{L}\sum_{i\in\bar G_l} f_i\,2^{\,l}.$$

Estimator. Let $t$ iterate over the values $1, 2, 2^2, \ldots, G$, and, for each value of $t$, let $\hat F_1^{med}(t)$ denote the median of the estimates $\hat F_1(t)$ returned by the $s_1 = O(\log\frac{1}{\delta})$ copies of the HSS structure, each using independent random bits and the same value of $t$. Let $t_{max}$ denote the largest value of $t$ satisfying
$$\hat F_1^{med}(t) \ge \frac{t}{16\,(1.01)\,\bar\varepsilon^2}.$$
The final estimate returned is $\hat F_1^{med}(t_{max})$, computed using the second HSS instantiation. A code sketch of this search follows.
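The doubling search for $t_{max}$ can be summarized as follows; the `estimate(t)` method of an HSS copy, and the exact form of the acceptance predicate (which follows the reconstruction above), are assumptions of this sketch.

```python
import statistics

def find_t_max(copies, G, eps_bar):
    """Largest power of two t <= G whose median estimate still clears the
    acceptance line; `copies` are independent HSS instances, each assumed
    to expose estimate(t) -> float, the value of hat-F_1(t)."""
    t, t_max = 1, None
    while t <= G:
        f1_med = statistics.median(c.estimate(t) for c in copies)
        # acceptance predicate (as reconstructed in Lemma 10)
        if f1_med >= t / (16 * 1.01 * eps_bar ** 2):
            t_max = t
        t *= 2
    return t_max
```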

Analysis. We first note that Lemmas 2 and 3 hold for all choices of $t$. Lemma 5 is modified as follows.

Lemma 9. Suppose that for $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \frac{\Delta_0}{2^{\,l}}$ with probability $1 - 2^{-t}$, where $\Delta_0 = \frac{F_1^{res}(k)}{k}$. Then
$$|E_2| \le 16\,\Delta_0\sum_{l=0}^{L}\sum_{i\in G_l}\frac{\psi'(\xi_i)}{2^{\,l}}$$
with probability at least $\frac{9}{10} - 2^{-t}$, where, for $i \in G_l$, $\xi_i$ lies between $f_i$ and $\hat f_i$ and maximizes $\psi'(\cdot)$.

Lemma 10. Let $\bar\varepsilon = \frac{\varepsilon}{8}$, $k = \frac{128}{\bar\varepsilon^3}$ and $\bar\varepsilon \le \frac{1}{8}$. Then, with probability $1-\delta$ each:
1. for $4\,\bar\varepsilon^2 F_1 \le t \le 8\,\bar\varepsilon^2 F_1$, $|\hat F_1^{med}(t) - F_1| \le 1.02\,\varepsilon F_1$ and $\hat F_1^{med}(t) \ge \frac{t}{16\,(1.02)\,\bar\varepsilon^2}$;
2. for any $t \ge 64\,\bar\varepsilon^2 F_1$, $\hat F_1^{med}(t) < \frac{t}{16\,(1.02)\,\bar\varepsilon^2}$.

Proof. We consider the two summation terms in the error term $E_1$ given by equation (4) separately:
$$E_{1,1} = \sum_{l=0}^{L}\frac{t}{2^{\,l}}\,F_1(G_l)\,2^{\,l+1} \le 2\,t\,\big(F_1 - F_1(G_0)\big), \qquad E_{1,2} = 2\,t\,F_1(\mathrm{lmargin}(G_0)).$$
Adding, $E_1 \le 2\,\big((t + \Delta_0)\,F_1\big)^{1/2}$. We ignore the term $2^{-(t+L+2)}\,\Psi^2$ in $E_1$ since, by choosing the parameter $t$ of Lemma 2 to be $O(L)$, this term can be made arbitrarily small in comparison with the other two terms. Since $|G_l| \le \frac{F_1(G_l)\,2^{\,l}}{t}$, our bound on $E_2$ becomes
$$E_2 \le 16\,\Delta_0\sum_{l=0}^{L}\frac{|G_l|}{2^{\,l}} \le \frac{16\,F_1\,\Delta_0}{t}.$$
Therefore, the total error is
$$E(t) = E_1 + E_2 \le 2\,\big((t + \Delta_0)\,F_1\big)^{1/2} + \frac{16\,F_1\,\Delta_0}{t}. \qquad (8)$$
Suppose $k = \frac{128}{\bar\varepsilon^3}$ and $4\,\bar\varepsilon^2 F_1 \le t \le 8\,\bar\varepsilon^2 F_1$. Using $\bar\varepsilon \le \frac{1}{8}$, $\bar\varepsilon = \frac{\varepsilon}{8}$ and $\Delta_0 \le \frac{F_1}{k}$, we have
$$E(t) \le 1.01\,\varepsilon\,F_1, \qquad\text{for } 4\,\bar\varepsilon^2 F_1 \le t \le 8\,\bar\varepsilon^2 F_1, \qquad (9)$$
with probability $1-\delta$. This proves claim 1, since $|\hat F_1^{med}(t) - F_1| \le E(t)$ with probability $1-\delta$, and the acceptance condition
$$\hat F_1^{med}(t) \ge \frac{t}{16\,(1.02)\,\bar\varepsilon^2} \qquad (10)$$
then follows, since $t \le 8\,\bar\varepsilon^2 F_1$.

For claim 2, let $t = 2^{\,j+2}\,\bar\varepsilon^2 F_1$ for some $j \ge 0$, and suppose that (10) is satisfied. Then, by (8), $E(t) \le 2^{\,(j+1)/2}\,(1.01)\,\bar\varepsilon F_1 + 2^{-j}\,\bar\varepsilon F_1 \le 2^{\,j/2+2}\,\bar\varepsilon F_1$ with probability $1-\delta$. With probability $1-\delta$, $|\hat F_1^{med}(t) - F_1| \le E(t)$, and so, using $\bar\varepsilon \le \frac{1}{8}$,
$$\frac{2^{\,j+2}\,F_1}{16\,(1.02)} \le \hat F_1^{med}(t) \le F_1\,\big(1 + 2^{\,j/2 - 1}\big),$$
which is a contradiction for $j \ge 4$, proving claim 2.

The correctness of the algorithm follows from the above lemma. The space requirement of the algorithm is $O\big(\frac{1}{\varepsilon^3}\,(\log^3 n)\,(\log F_1)\big)$ bits, and the expected time taken to process each stream update is $O(\log F_1\,\log\frac{1}{\delta})$ standard arithmetic operations on words of size $O(\log n)$.

5 Conclusions

We presented a family of algorithms for the randomized estimation of $F_p$ for $0 < p < 2$, and another family of algorithms for estimating $F_p$ for $0 < p < 1$. The first algorithm family estimates $F_p$ by using the COUNTSKETCH structure and $F_2^{res}$ estimation as sub-routines. The second algorithm family estimates $F_p$ by using the COUNT-MIN sketch structure and $F_1^{res}$ estimation as sub-routines. The space required by these algorithms is $O\big(\frac{1}{\varepsilon^{2+p}}\,(\log n)(\log F_1)(\log\frac{1}{\delta})\big)$, and the expected time required to process each stream update is $O(\log n\,\log\frac{1}{\delta})$. Finally, we also presented a stand-alone iterative estimator for $F_1$ that uses only the COUNT-MIN sketch structure as a sub-routine.

Prior approaches to the problem of estimating $F_p$ [11, 13] used sketches of the frequency vector with random variables drawn from a symmetric $p$-stable distribution. An interesting feature of the above algorithms is that they do not require the use of stable distributions. The proposed algorithms trade an extra $O(\varepsilon^{-p})$ factor of space for dramatically improved processing time (with no polynomial dependency on $\varepsilon$) per stream update.

References

1. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.
2. Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, 2002.
3. L. Bhuvanagiri and S. Ganguly. Estimating Entropy over Data Streams. In Proc. ESA, 2006.
4. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: a new class of data management applications. In Proc. VLDB, 2002.
5. M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. ICALP, 2002.
6. G. Cormode and S. Muthukrishnan. What's New: Finding Significant Differences in Network Data Streams. In IEEE INFOCOM, 2004.
7. G. Cormode and S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms, 55(1):58-75, April 2005.
8. P. Flajolet and G. N. Martin. Probabilistic Counting Algorithms for Database Applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.
9. S. Ganguly, D. Kesh, and C. Saha. Practical Algorithms for Tracking Database Join Sizes. In Proc. FSTTCS, 2005.
10. P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. SPAA, 2001.
11. P. Indyk. Stable Distributions, Pseudo Random Generators, Embeddings and Data Stream Computation. In Proc. IEEE FOCS, 2000.
12. P. Indyk and D. Woodruff. Optimal Approximations of the Frequency Moments. In Proc. ACM STOC, 2005.
13. P. Li. Very Sparse Stable Random Projections, Estimators and Tail Bounds for Stable Random Projections. Manuscript, 2006.

14. P. Li, T. J. Hastie, and K. W. Church. Very Sparse Random Projections. In Proc. ACM SIGKDD, 2006.
15. N. Nisan. Pseudo-Random Generators for Space Bounded Computation. In Proc. ACM STOC, 1990.
16. M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In Proc. ACM SODA, January 2004.
17. M. N. Wegman and J. L. Carter. New Hash Functions and their Use in Authentication and Set Equality. Journal of Computer and System Sciences, 22:265-279, 1981.
18. D. P. Woodruff. Optimal space lower bounds for all frequency moments. In Proc. ACM SODA, January 2004.

A COUNT-MIN and COUNTSKETCH summaries

Given a data stream defining a set of item frequencies, $rank(r)$ returns an item with the $r$-th largest absolute value of the frequency (ties are broken arbitrarily). We say that an item $i$ has rank $r$ if $rank(r) = i$. For a given value of $k$, $1 \le k \le n$, the set top($k$) is the set of items with rank at most $k$. The residual second moment [5] of a data stream, denoted $F_2^{res}(k)$, is defined as the second moment of the stream after the top-$k$ frequencies have been removed: $F_2^{res}(k) = \sum_{r>k} f_{rank(r)}^2$. The residual first moment [7], denoted $F_1^{res}(k)$, is analogously defined as the $F_1$ norm of the data stream after the top-$k$ frequencies have been removed, that is, $F_1^{res}(k) = \sum_{r>k}|f_{rank(r)}|$.

A sketch [1] is a random integer $X = \sum_i f_i\,x_i$, where $x_i \in \{-1, +1\}$ for $i \in D$, and the family of variables $\{x_i\}_{i\in D}$ satisfies certain independence properties. The family $\{x_i\}_{i\in D}$ is referred to as the sketch basis. For any $d \ge 2$, a $d$-wise independent sketch basis can be constructed in a pseudo-random manner from a truly random seed of size $O(d\log n)$ bits as follows. Let $\mathbb{F}$ be a field of characteristic 2 and of size at least $n+1$. Choose a degree-$(d-1)$ polynomial $g : \mathbb{F} \to \mathbb{F}$ with coefficients that are randomly chosen from $\mathbb{F}$ [17]. Define $x_i$ to be 1 if the first bit (i.e., the least significant position) of $g(i)$ is 1, and $-1$ otherwise. The $d$-wise independence of the $x_i$'s follows from an application of Wegman and Carter's universal hash functions [17].

Pair-wise independent sketches are used in [5] to design the COUNTSKETCH algorithm for finding the top-$k$ frequent items in an insert-only stream. The data structure consists of a collection of $s = O(\log\frac{1}{\delta})$ independent hash tables $U_1, U_2, \ldots, U_s$, each consisting of $8k$ buckets, where $k$ is a space parameter. A pair-wise independent hash function $h_j : [n] \to \{1, 2, \ldots, 8k\}$, mapping items randomly to one of the $8k$ buckets, is associated with each hash table. Additionally, for each table index $j = 1, 2, \ldots, s$, we keep a pair-wise independent family of random variables $\{x_{ij}\}_{i\in[n]}$, where each $x_{ij} \in \{-1, +1\}$ with equal probability. Each bucket keeps a sketch of the sub-stream that maps to it, that is, $U_j[r] = \sum_{i : h_j(i) = r} f_i\,x_{ij}$, for $1 \le j \le s$ and $1 \le r \le 8k$. An estimate $\hat f_i$ is returned as $\hat f_i = \mathrm{median}_{j=1}^{s}\,U_j[h_j(i)]\,x_{ij}$. The accuracy of the estimation, given parameters $b$ and $k$, is stated as a function of the residual second moment [5]:
$$\Delta(b, k) = \left(\frac{8\,F_2^{res}(k)}{b}\right)^{1/2}.$$
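For concreteness, here is a minimal COUNTSKETCH along the lines just described; Python's seeded generator stands in for the pair-wise independent hash family and sketch basis (a stronger primitive than required, used here only to keep the sketch short).

```python
import random

class CountSketch:
    """Minimal CountSketch [5]: s hash tables of b buckets; bucket (j, r)
    holds the sum of f_i * x_ij over items i with h_j(i) = r; a point
    query returns the median of the s per-table estimates."""

    def __init__(self, s, b, seed=0):
        self.s, self.b, self.seed = s, b, seed
        self.tables = [[0.0] * b for _ in range(s)]

    def _hash(self, j, i):
        # seeded stand-in for pair-wise independent h_j and sign x_ij
        rng = random.Random(hash((self.seed, j, i)))
        bucket = rng.randrange(self.b)
        sign = 1 if rng.random() < 0.5 else -1
        return bucket, sign

    def update(self, i, v):
        # process stream record (i, v)
        for j in range(self.s):
            bucket, sign = self._hash(j, i)
            self.tables[j][bucket] += v * sign

    def estimate(self, i):
        ests = []
        for j in range(self.s):
            bucket, sign = self._hash(j, i)
            ests.append(self.tables[j][bucket] * sign)
        ests.sort()
        return ests[len(ests) // 2]
```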

The space versus accuracy guarantee of the COUNTSKETCH algorithm is presented in Theorem 11.

Theorem 11 ([5]). Let $\Delta = \Delta(8k, k) = \big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$. Then, for any given $i \in [n]$, $\Pr\{|\hat f_i - f_i| \le \Delta\} \ge 1 - \delta$. The space used is $O(k\,\log\frac{1}{\delta}\,\log F_1)$ bits, and the time taken to process a stream update is $O(\log\frac{1}{\delta})$.

The COUNTSKETCH algorithm can be adapted to return approximate frequent items and their frequencies. The original algorithm [5] uses a heap for maintaining the current top-$k$ items in terms of their estimated frequencies. After processing each arriving stream record of the form $(i, v)$, where $v$ is assumed to be non-negative, an estimate $\hat f_i$ is calculated using the scheme outlined above. If $i$ is already in the current estimated top-$k$ heap, then its frequency is correspondingly increased. If $i$ is not in the heap but $\hat f_i$ is larger than the current smallest frequency in the heap, then it replaces that element in the heap. This scheme is applicable to insert-only streams. A generalization of this method to strict update streams is presented in [6]; with probability $1-\delta$, it (a) returns all items with frequency at least $\big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$ and (b) does not return any item with frequency less than $(1-\varepsilon)\big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$, using space $O\big(\frac{k}{\varepsilon^2}\,\log n\,\log(\varepsilon^{-1}\log\varepsilon^{-1})\,\log F_1\big)$ bits. For general update streams, a variation of this technique can be used for retrieving items satisfying properties (a) and (b) above, using space $O\big(\frac{k}{\varepsilon^2}\,\log(\delta^{-1} n)\,\log F_1\big)$ bits.

The COUNT-MIN algorithm [7] for finding approximate frequent items keeps a collection of $s = O(\log\frac{1}{\delta})$ independent hash tables $T_1, T_2, \ldots, T_s$, where each hash table $T_j$ consists of $b = 2k$ buckets and uses a pair-wise independent hash function $h_j : [n] \to \{1, \ldots, 2k\}$, for $j = 1, 2, \ldots, s$. The bucket $T_j[r]$ is an integer counter that maintains the sum $T_j[r] = \sum_{i : h_j(i) = r} f_i$. The estimated frequency $\hat f_i$ is obtained as $\hat f_i = \mathrm{median}_{j=1}^{s}\,T_j[h_j(i)]$. The space versus accuracy guarantee for the COUNT-MIN algorithm is given in terms of the quantity $F_1^{res}(k) = \sum_{r>k}|f_{rank(r)}|$.

Theorem 12 ([7]). $\Pr\big\{|\hat f_i - f_i| \le \frac{F_1^{res}(k)}{k}\big\} \ge 1 - \delta$, using space $O(k\,\log\frac{1}{\delta}\,\log F_1)$ bits and time $O(\log\frac{1}{\delta})$ to process each stream update.

Estimating $F_1^{res}$ and $F_2^{res}$. [9] presents an algorithm to estimate $F_2^{res}(k)$ to within an accuracy of $(1\pm\varepsilon)$ with confidence $1-\delta$, using space $O\big(\frac{k}{\varepsilon^2}\,\log(F_1)\,\log(\frac{n}{\delta})\big)$ bits. The data structure used is identical to the COUNTSKETCH structure. The algorithm essentially removes the top-$k$ estimated frequencies from the COUNTSKETCH structure and then estimates $F_2$. Let $\hat f_{\tau_1}, \ldots, \hat f_{\tau_k}$ denote the top-$k$ estimated frequencies obtained from the COUNTSKETCH structure. The contributions of these estimates are removed from the structure, that is, $U_j[r] := U_j[r] - \sum_{t : h_j(\tau_t) = r}\hat f_{\tau_t}\,x_{j\tau_t}$. Subsequently, the Fast-AMS algorithm [16], a variant of the original sketch algorithm [1], is used to estimate the second moment as $\hat F_2 = \mathrm{median}_{j=1}^{s}\sum_{r=1}^{8k}(U_j[r])^2$. Formally, we can state:

Lemma 13 ([9]). For a given integer $k \ge 1$ and $0 < \varepsilon < 1$, there exists an algorithm for update streams that returns an estimate $\hat F_2^{res}(k)$ satisfying $|\hat F_2^{res}(k) - F_2^{res}(k)| \le \varepsilon\,F_2^{res}(k)$ with probability $1-\delta$, using space $O\big(\frac{k}{\varepsilon^2}\,(\log\frac{F_1}{\delta})\,(\log F_1)\big)$ bits.
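A matching minimal sketch of the COUNT-MIN summary described above follows; as in the text, the point query takes the median of the counters, which handles general update streams (the classical minimum applies to insert-only streams). The seeded hash is again an illustrative stand-in for a pair-wise independent family.

```python
import random

class CountMin:
    """Minimal COUNT-MIN summary [7]: s hash tables of b counters, where
    T_j[r] = sum of f_i over items i with h_j(i) = r."""

    def __init__(self, s, b, seed=0):
        self.s, self.b, self.seed = s, b, seed
        self.tables = [[0] * b for _ in range(s)]

    def _bucket(self, j, i):
        return random.Random(hash((self.seed, j, i))).randrange(self.b)

    def update(self, i, v):
        # process stream record (i, v); v may be negative
        for j in range(self.s):
            self.tables[j][self._bucket(j, i)] += v

    def estimate(self, i):
        counts = sorted(self.tables[j][self._bucket(j, i)]
                        for j in range(self.s))
        return counts[len(counts) // 2]
```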

A similar argument can be applied to estimate $F_1^{res}(k)$, where, instead of the COUNTSKETCH algorithm, we use the COUNT-MIN algorithm for retrieving the top-$k$ estimated absolute frequencies. In parallel, a set of $s = O(\frac{1}{\varepsilon^2})$ sketches based on a 1-stable distribution [11] is maintained (i.e., $Y_j = \sum_i f_i\,z_{ji}$, where $z_{ji}$ is drawn from a 1-stable distribution). After retrieving the top-$k$ frequencies $\hat f_{\tau_1}, \ldots, \hat f_{\tau_k}$ with respect to their absolute values, we reduce the sketches as $Y_j := Y_j - \sum_{r=1}^{k}\hat f_{\tau_r}\,z_{j\tau_r}$ and estimate $F_1^{res}(k)$ as $\mathrm{median}_{j=1}^{s}\,|Y_j|$. We summarize this in Lemma 14.

Lemma 14. For a given integer $k \ge 1$ and $0 < \varepsilon < 1$, there exists an algorithm for update streams that returns an estimate $\hat F_1^{res}(k)$ satisfying $|\hat F_1^{res}(k) - F_1^{res}(k)| \le \varepsilon\,F_1^{res}(k)$ with probability $1-\delta$, using $O\big((k + \frac{1}{\varepsilon^2})\,(\log\frac{1}{\delta})\,(\log F_1)\big)$ bits.

B Reducing random bits by using a PRG

We use a standard technique for reducing the randomness required, namely, the pseudo-random generator (PRG) of Nisan [15], along the lines of Indyk [11] and of Indyk and Woodruff [12].

Notation. Let $M$ be a finite state machine that uses $S$ bits and has running time $R$. Assume that $M$ uses the random bits in segments, each segment consisting of $b$ bits. Let $U_r$ denote the uniform distribution over $\{0,1\}^r$, and, for a discrete random variable $X$, let $\mathcal{F}[X]$ denote the probability distribution of $X$, treated as a vector. Let $M(x)$ denote the state of $M$ after using the random bits in $x$. A generator $G : \{0,1\}^{u} \to \{0,1\}^{Rb}$ expands a small number $u$ of truly random bits into a sequence of $Rb$ bits that appear random to $M$. $G$ is said to be a pseudo-random generator for a class $\mathcal{C}$ of finite state machines with parameter $\epsilon$, provided that, for every $M \in \mathcal{C}$,
$$\big\lVert\,\mathcal{F}[M(U_{Rb})] - \mathcal{F}[M(G(U_u))]\,\big\rVert_1 \le \epsilon,$$
where $\lVert y\rVert_1$ denotes the $F_1$ norm of the vector $y$. Nisan [15] shows the following property (this version is from [11]).

Theorem 15 ([15]). There exists a PRG $G$ for Space($S$) and Time($R$) with parameter $\epsilon = 2^{-O(S)}$ such that $G$ expands $O(S\log R)$ truly random bits into $O(R)$ bits that appear random to any machine in the class.

This suffices for our purposes, because we can compute the frequency moments by considering each (aggregate) frequency $f_i$ in turn, using only segments of $O(\log F_1)$ bits to store and process it.


More information

Data stream algorithms via expander graphs

Data stream algorithms via expander graphs Data stream algorithms via expander graphs Sumit Ganguly sganguly@cse.iitk.ac.in IIT Kanpur, India Abstract. We present a simple way of designing deterministic algorithms for problems in the data stream

More information

CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms

CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms Prof. Gregory Provan Department of Computer Science University College Cork 1 Lecture Outline CS 4407, Algorithms Growth Functions

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 39 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur 5. Akaike s information

More information

Approximation algorithms for cycle packing problems

Approximation algorithms for cycle packing problems Approximation algorithms for cycle packing problems Michael Krivelevich Zeev Nutov Raphael Yuster Abstract The cycle packing number ν c (G) of a graph G is the maximum number of pairwise edgedisjoint cycles

More information

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

More information

Kolmogorov complexity

Kolmogorov complexity Kolmogorov complexity In this section we study how we can define the amount of information in a bitstring. Consider the following strings: 00000000000000000000000000000000000 0000000000000000000000000000000000000000

More information

Ping Li Compressed Counting, Data Streams March 2009 DIMACS Compressed Counting. Ping Li

Ping Li Compressed Counting, Data Streams March 2009 DIMACS Compressed Counting. Ping Li Ping Li Compressed Counting, Data Streams March 2009 DIMACS 2009 1 Compressed Counting Ping Li Department of Statistical Science Faculty of Computing and Information Science Cornell University Ithaca,

More information

Improved Concentration Bounds for Count-Sketch

Improved Concentration Bounds for Count-Sketch Improved Concentration Bounds for Count-Sketch Gregory T. Minton 1 Eric Price 2 1 MIT MSR New England 2 MIT IBM Almaden UT Austin 2014-01-06 Gregory T. Minton, Eric Price (IBM) Improved Concentration Bounds

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

On (ε, k)-min-wise independent permutations

On (ε, k)-min-wise independent permutations On ε, -min-wise independent permutations Noga Alon nogaa@post.tau.ac.il Toshiya Itoh titoh@dac.gsic.titech.ac.jp Tatsuya Nagatani Nagatani.Tatsuya@aj.MitsubishiElectric.co.jp Abstract A family of permutations

More information

COMP 355 Advanced Algorithms

COMP 355 Advanced Algorithms COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background 1 Polynomial Running Time Brute force. For many non-trivial problems, there is a natural brute force search algorithm that

More information

Algorithms for Querying Noisy Distributed/Streaming Datasets

Algorithms for Querying Noisy Distributed/Streaming Datasets Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang Indiana University Bloomington Sublinear Algo Workshop @ JHU Jan 9, 2016 1-1 The big data models The streaming model (Alon, Matias

More information

Lecture Lecture 25 November 25, 2014

Lecture Lecture 25 November 25, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,

More information

The Count-Min-Sketch and its Applications

The Count-Min-Sketch and its Applications The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local

More information

RANDOM SIMULATIONS OF BRAESS S PARADOX

RANDOM SIMULATIONS OF BRAESS S PARADOX RANDOM SIMULATIONS OF BRAESS S PARADOX PETER CHOTRAS APPROVED: Dr. Dieter Armbruster, Director........................................................ Dr. Nicolas Lanchier, Second Committee Member......................................

More information

Introduction: The Perceptron

Introduction: The Perceptron Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron

More information

Sparser Johnson-Lindenstrauss Transforms

Sparser Johnson-Lindenstrauss Transforms Sparser Johnson-Lindenstrauss Transforms Jelani Nelson Princeton February 16, 212 joint work with Daniel Kane (Stanford) Random Projections x R d, d huge store y = Sx, where S is a k d matrix (compression)

More information

Tight Bounds for Distributed Functional Monitoring

Tight Bounds for Distributed Functional Monitoring Tight Bounds for Distributed Functional Monitoring Qin Zhang MADALGO, Aarhus University Joint with David Woodruff, IBM Almaden NII Shonan meeting, Japan Jan. 2012 1-1 The distributed streaming model (a.k.a.

More information

COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background

COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background 1 Polynomial Time Brute force. For many non-trivial problems, there is a natural brute force search algorithm that checks every

More information

Zero-One Laws for Sliding Windows and Universal Sketches

Zero-One Laws for Sliding Windows and Universal Sketches Zero-One Laws for Sliding Windows and Universal Sketches Vladimir Braverman 1, Rafail Ostrovsky 2, and Alan Roytman 3 1 Department of Computer Science, Johns Hopkins University, USA vova@cs.jhu.edu 2 Department

More information

Analysis of Algorithms

Analysis of Algorithms October 1, 2015 Analysis of Algorithms CS 141, Fall 2015 1 Analysis of Algorithms: Issues Correctness/Optimality Running time ( time complexity ) Memory requirements ( space complexity ) Power I/O utilization

More information

Solving Zero-Sum Security Games in Discretized Spatio-Temporal Domains

Solving Zero-Sum Security Games in Discretized Spatio-Temporal Domains Solving Zero-Sum Security Games in Discretized Spatio-Temporal Domains APPENDIX LP Formulation for Constant Number of Resources (Fang et al. 3) For the sae of completeness, we describe the LP formulation

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Lecture 3 Frequency Moments, Heavy Hitters

Lecture 3 Frequency Moments, Heavy Hitters COMS E6998-9: Algorithmic Techniques for Massive Data Sep 15, 2015 Lecture 3 Frequency Moments, Heavy Hitters Instructor: Alex Andoni Scribes: Daniel Alabi, Wangda Zhang 1 Introduction This lecture is

More information

Revisiting Frequency Moment Estimation in Random Order Streams

Revisiting Frequency Moment Estimation in Random Order Streams Revisiting Frequency Moment Estimation in Random Order Streams arxiv:803.02270v [cs.ds] 6 Mar 208 Vladimir Braverman, Johns Hopkins University David Woodruff, Carnegie Mellon University March 6, 208 Abstract

More information