On Estimating Frequency Moments of Data Streams


Sumit Ganguly 1 and Graham Cormode 2

1 Indian Institute of Technology, Kanpur. sganguly@iitk.ac.in
2 AT&T Labs Research. graham@research.att.com

Abstract. Space-economical estimation of the $p$-th frequency moment, defined as $F_p = \sum_{i=1}^{n}|f_i|^p$, for $p > 0$, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches, formed as the inner product of the frequency vector $f_1, \ldots, f_n$ with a suitably chosen random vector, were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating $F_p$ and in data stream computations in general. The concept of $p$-stable sketches, formed as the inner product of the frequency vector with a random vector whose components are drawn from a $p$-stable distribution, was proposed by Indyk [11] for estimating $F_p$, for $0 < p < 2$, and has been further studied by Li [13].

In this paper, we consider the problem of estimating $F_p$, for $0 < p < 2$. A disadvantage of the stable sketches technique and its variants is that they require $O(\frac{1}{\varepsilon^2})$ inner products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained, which means that each stream update can be quite time-consuming. We present algorithms for estimating $F_p$, for $0 < p < 2$, that do not require the use of stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space $\tilde O(\frac{1}{\varepsilon^{2+p}})$ (following standard convention, we say that $f(n)$ is $\tilde O(g(n))$ if $f(n) = O\big((\frac{1}{\varepsilon})^{O(1)}\,(\log m)^{O(1)}\,(\log n)^{O(1)}\,g(n)\big)$) to estimate $F_p$ to within $1 \pm \varepsilon$ factors, and require expected time $O(\log F_1\,\log\frac{1}{\delta})$ to process each stream update. Thus, our technique trades an $O(\frac{1}{\varepsilon^{p}})$ factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for $F_1$.

1 Introduction

Recently, there has been an emergence of monitoring applications in diverse areas, including network traffic monitoring, network topology monitoring, sensor networks, financial market monitoring, and web-log monitoring. In these applications, data is generated rapidly and continuously, and must be analyzed efficiently, in real time and in a single pass over the data, to identify large trends, anomalies, user-defined exception conditions, and so on. In many of these applications, it is often required to continuously track the big picture, or an aggregate view of the data, as opposed to a detailed view. In such scenarios, efficient approximate computation is often acceptable. The data streaming model has gained popularity as a computational model for such applications, where incoming data (or updates) are processed very efficiently, in an online fashion, and using space much less than that needed to store the data in its entirety.

A data stream $S$ is viewed as a sequence of arrivals of the form $(i, v)$, where $i$ is the identity of an item that is a member of the domain $[n] = \{1, \ldots, n\}$ and $v$ is the update to the frequency of the item: $v > 0$ indicates an insertion of multiplicity $v$, while $v < 0$ indicates a corresponding deletion. The frequency of an item $i$, denoted $f_i$, is the sum of the updates to $i$ since the inception of the stream, that is, $f_i = \sum_{(i,v)\,\text{appears in}\,S}\,v$.

In this paper, we consider the problem of estimating the $p$-th frequency moment of a data stream, defined as $F_p = \sum_{i=1}^{n}|f_i|^p$, for $0 < p < 2$. Equivalently, this can be interpreted as the $p$-th power of the $L_p$ norm of the vector defined by the stream. The techniques used to design algorithms and lower bounds for the frequency moment problem have been influential in the design of algorithmic and lower bound techniques for data stream computation in general. We briefly review the current state of the art in estimating $F_p$, with particular emphasis on the range $0 < p < 2$.

1.1 Review

Alon, Matias and Szegedy [1] presented the seminal technique of AMS sketches for estimating $F_2$. An (atomic) AMS sketch is a random integer $X = \sum_{i=1}^{n} f_i\,\xi_i$, where $\{\xi_i\}_{i=1,2,\ldots,n}$ is a family of four-wise independent random variables assuming the values $+1$ or $-1$ with equal probability. An AMS sketch is easily maintained over a stream: for each update $(i, v)$, the sketch $X$ is updated as $X := X + v\,\xi_i$. Moreover, since the family $\{\xi_i\}$ is only 4-wise independent, each $\xi_i$ can be obtained from a randomly chosen cubic polynomial $h$ over a field $\mathbb{F}$ that contains the domain of items $[n]$ (set $\xi_i = 1$ if the last bit of $h(i)$ is 1, and $\xi_i = -1$ otherwise). It then follows that $\mathbb{E}[X^2] = F_2$ and $\mathrm{Var}[X^2] \le 2\,F_2^2$ [1]. An estimate of $F_2$ that is accurate to within $1\pm\varepsilon$ with confidence $\frac{7}{8}$ can therefore be obtained as the average of the squares of $O(\frac{1}{\varepsilon^2})$ independent sketch values.

There has also been significant study of the case $p = 0$, also known as the distinct elements problem. Alon, Matias and Szegedy [1] gave a constant-factor approximation in small space. Gibbons and Tirthapura [10] showed a $(1\pm\varepsilon)$-factor approximation in space $\tilde O(\frac{1}{\varepsilon^2})$; subsequent work has improved the (hidden) logarithmic factors [2].

p-stable sketches. The use of $p$-stable sketches was pioneered by Indyk [11] for estimating $F_p$, with $0 < p < 2$. A stable sketch of $S$ is defined as $Y = \sum_{i=1}^{n} f_i\,s_i$, where each $s_i$ is drawn at random from a $p$-stable distribution, denoted $S(p, 1)$ (the second parameter of $S(\cdot,\cdot)$ is the scale factor). By the defining property of $p$-stable distributions, $Y$ is distributed as $S\big(p, (\sum_{i=1}^{n}|f_i|^p)^{1/p}\big)$. In other words, $Y$ is $p$-stable distributed, with scale factor $F_p^{1/p}$. Indyk gives a technique to estimate $F_p$ by keeping $O(\frac{1}{\varepsilon^2})$ independent $p$-stable sketches and returning the median of these observations [11]. Woodruff [18] presents an $\Omega(\frac{1}{\varepsilon^2})$ space lower bound for the problem of estimating $F_p$, for all $p \ge 0$, implying that the stable sketches technique is space optimal up to logarithmic factors.
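To make the stable-sketch baseline concrete, the following minimal Python sketch implements Indyk's median estimator for the case $p = 1$, where the $p$-stable distribution is the standard Cauchy distribution (sampled by inversion as $\tan(\pi(u - \frac12))$). The class name and the trick of re-deriving each random variate from a seeded generator are illustrative assumptions of this sketch, not details fixed by the paper.

```python
import math
import random

class CauchySketchF1:
    """Illustrative 1-stable sketch for F_1 = sum_i |f_i| (Indyk [11], p = 1).

    Maintains r inner products Y_j = sum_i f_i * s_ji with s_ji standard
    Cauchy.  By 1-stability, each Y_j is Cauchy with scale F_1, so the
    median of |Y_1|, ..., |Y_r| estimates F_1; r = O(1/eps^2) copies give
    a (1 +- eps)-accurate estimate with constant probability.
    """

    def __init__(self, r, seed=0):
        self.r = r
        self.seed = seed
        self.sketch = [0.0] * r

    def _cauchy(self, j, i):
        # Deterministically regenerate the Cauchy variate for coordinate
        # (j, i) from a seeded generator, instead of storing a dense matrix.
        rng = random.Random(hash((self.seed, j, i)))
        return math.tan(math.pi * (rng.random() - 0.5))

    def update(self, i, v):
        # Every one of the r sketches is touched per update; this O(r)
        # per-update cost is exactly what the present paper seeks to avoid.
        for j in range(self.r):
            self.sketch[j] += v * self._cauchy(j, i)

    def estimate(self):
        ys = sorted(abs(y) for y in self.sketch)
        return ys[len(ys) // 2]
```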

Li [13] further analyses stable sketches and suggests the use of the geometric mean estimator
$$\hat F_p = c_{p,k}\,\prod_{j=1}^{k}|Y_j|^{\,p/k},$$
where $Y_1, Y_2, \ldots, Y_k$ are independent $p$-stable sketches of the data stream and $c_{p,k}$ is a normalizing constant. Li shows that the above estimator is unbiased, that is, $\mathbb{E}[\hat F_p] = F_p$, and that $\mathrm{Var}[\hat F_p] \le \frac{\pi^2\,F_p^2}{6\,p^2\,k}$. It follows (by Chebychev's inequality) that returning the geometric mean of $k = O(\frac{1}{\varepsilon^2 p^2})$ sketches yields an estimate of $F_p$ that is accurate to within factors of $(1\pm\varepsilon)$ with probability $\frac{7}{8}$. Li also shows that the geometric mean estimator has lower variance than the median estimator proposed originally by Indyk [11].

Very sparse sketches. The very sparse sketch method due to Li et al. aims to maintain the same space and accuracy bounds, but to reduce the time cost of processing each update [14, 13]. Note that this technique applies only when the data satisfies some uniformity properties, whereas the preceding techniques need no such assumptions. A very sparse (nearly) $p$-stable sketch is a linear combination of the form $W = \sum_{i=1}^{n} f_i\,w_i$, where $w_i$ is $P_p$ with probability $\frac{\beta}{2}$, $-P_p$ with probability $\frac{\beta}{2}$, and $0$ otherwise. Here, $P_p$ is the $p$-Pareto distribution, with probability tail function $\Pr\{P_p > t\} = \frac{1}{t^{\,p}}$, $t \ge 1$. Pareto distributions are proposed since they are much simpler to sample from than stable distributions. Further, Li shows that $W$ is asymptotically $p$-stable, provided $\frac{F_\infty}{F_p^{1/p}} \to 0$, where $F_\infty = \max_i |f_i|$. Thus, very sparse sketches provide a principled way of reducing the data stream processing time, provided the data satisfies certain uniformity properties.

Drawbacks of stable-based methods. A drawback of the original technique of stable sketches is that, in general, for each stream update, all of the $O(\frac{1}{\varepsilon^2})$ stable sketches must be updated. Each sketch update requires the pseudo-random generation of a random variable drawn from a $p$-stable distribution, making it time-consuming to process each stream update. The very sparse stable sketches somewhat alleviate this problem by speeding up the processing time by a factor of approximately $\frac{1}{\beta}$, although the data must now satisfy certain uniformity conditions. In general, it is not possible to guarantee or verify a priori whether the data stream satisfies the desired properties. We therefore advocate that, in the absence of knowledge of the data distribution, the geometric mean estimator over $p$-stable sketches is the most reliable of the known estimators, although it remains quite expensive.

Contributions. In this paper, we present a technique for estimating $F_p$, for $0 < p < 2$. Our technique requires space $O(\frac{1}{\varepsilon^{2+p}}\,\log F_1\,\log n)$ to estimate $F_p$ to within relative error $(1\pm\varepsilon)$ with probability $\frac{7}{8}$. Further, it requires $O(\log n)$ expected time (and $O(\log F_1\,\log n)$ worst-case time) to process each stream update. Thus, our technique trades a factor of $O(\frac{1}{\varepsilon^{p}})$ in space for improved processing time per stream update. From an algorithm design viewpoint, perhaps the most salient feature of the technique is that it makes no recourse to stable distributions. Our technique is elementary in nature, and uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. It is based on making some crucial and subtle modifications to the HSS technique [3].

Organization. The remainder of the paper is organized as follows. In Section 2, we review the HSS technique for estimating a class of data stream metrics. Sections 3 and 4 respectively present a family of algorithms for estimating $F_p$ and an iterative estimator for $F_1$. Finally, we conclude in Section 5.
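As background for the discussion above, the following sketch shows the two sampling ingredients involved: a symmetric $p$-stable variate generated by the Chambers, Mallows and Stuck transform (a standard method; the paper itself does not prescribe one), and a signed Pareto entry of a very sparse sketch vector. The function names and the sparsity parameter `beta` are illustrative.

```python
import math
import random

def sample_p_stable(p, rng):
    """Symmetric p-stable variate via the Chambers-Mallows-Stuck transform.

    theta ~ Uniform(-pi/2, pi/2), w ~ Exponential(1).  For p = 1 this
    reduces to a standard Cauchy; the general formula assumes 0 < p <= 2.
    """
    theta = (rng.random() - 0.5) * math.pi
    w = rng.expovariate(1.0)
    if abs(p - 1.0) < 1e-12:
        return math.tan(theta)
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

def sample_very_sparse_weight(p, beta, rng):
    """Entry w_i of a very sparse sketch vector (Li et al. style):
    +P_p with prob. beta/2, -P_p with prob. beta/2, and 0 otherwise,
    where P_p has tail Pr{P_p > t} = t^{-p}, t >= 1 (sampled by inversion)."""
    u = rng.random()
    if u >= beta:
        return 0.0
    pareto = (1.0 - rng.random()) ** (-1.0 / p)  # inverse-CDF Pareto sample
    return pareto if u < beta / 2 else -pareto
```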

2 Review of the HSS technique

In this section, we briefly review the HSS (Hierarchical Sampling from Sketches) procedure [3] for estimating $F_p$, $p \ge 2$, over data streams. Appendix A reviews the COUNTSKETCH and COUNT-MIN algorithms for finding frequent items in a data stream, together with algorithms for estimating the residual first and second moments of a data stream [9].

The HSS method is a general technique for estimating a class of metrics over data streams of the form
$$\Psi(S) = \sum_{i : f_i > 0} \psi(f_i). \qquad (1)$$
From the input stream $S$, sub-streams $S_0, \ldots, S_L$ are created by successive sub-sampling; that is, $S_0 = S$ and, for $1 \le l \le L$, $S_l$ is obtained from $S_{l-1}$ by sub-sampling each distinct item appearing in $S_{l-1}$ independently with probability $\frac{1}{2}$ (hence $L = O(\log n)$). Let $k$ be a space parameter. At each level $l$, we keep a frequent-items data structure, denoted $D_l(k)$, that takes as input the sub-stream $S_l$ and returns an approximation to the top($k$) items of its input stream and their frequencies. $D_l(k)$ is instantiated by either the COUNT-MIN or the COUNTSKETCH data structure. At level $l$, suppose that the frequent-items structure has an additive error of $\Delta_l(k)$ (with high probability); that is, $|\hat f_i - f_i| \le \Delta_l(k)$ with probability $1 - 2^{-t}$, where $t$ is a parameter. Define $F_1^{res}(k, l)$ to be (the random variable denoting) the value of $F_1$ of the sub-stream $S_l$ after removing the $k$ largest absolute frequencies, and $F_2^{res}(k, l)$ to be the corresponding value of $F_2$. The (non-random) value $F_1^{res}(k, 0)$ (respectively, $F_2^{res}(k, 0)$) is written as $F_1^{res}(k)$ (resp. $F_2^{res}(k)$).

Lemma 1 (Lemma 1 from [3]).
1. For $l \ge 1$ and $k \ge 48$, $F_1^{res}(k, l) \le \frac{F_1^{res}(k)}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$.
2. For $l \ge 1$ and $k \ge 48$, $F_2^{res}(k, l) \le \frac{F_2^{res}(k)}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$.

By the above lemma, let $\Delta_0 = \frac{F_1^{res}(k)}{k}$ or $\Delta_0 = \big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$, depending on whether the COUNT-MIN or the COUNTSKETCH structure is used as the frequent-items structure at each level. Let $\bar\varepsilon = \frac{\varepsilon}{16}$, $T_0 = \frac{\Delta_0}{\bar\varepsilon}$ and $T_l = \frac{T_0}{2^{\,l}}$, for $l = 0, 1, \ldots, \log T_0$. The items are grouped into groups $G_0, G_1, \ldots, G_L$ as follows: $G_0 = \{i \in S : |f_i| \ge T_0\}$ and $G_l = \{i \in S : T_l \le |f_i| < T_{l-1}\}$, $1 \le l \le L$. It follows that, with high probability, for all items of $G_l$ that are present in the random sub-stream $S_l$, $\hat f_{i,l} \ge \frac{\Delta_l}{\bar\varepsilon}$ and $|\hat f_i - f_i| \le \bar\varepsilon f_i$.

Corresponding to every stream update $(i, v)$, we use a hash function $h : [n] \to [n]$ to map the item $i$ to a level $u = \mathrm{lsb}(h(i))$, where $\mathrm{lsb}(x)$ is the position of the least significant 1 in the binary representation of $x$. The stream update $(i, v)$ is then propagated to the frequent-items data structures $D_l$ for $0 \le l \le u$, so that, in effect, $i$ is included in the sub-streams from level 0 to level $u$ (a code sketch of this update path follows). The hash function is assumed to be chosen randomly from a fully independent family; later, we reduce the number of random bits required by a standard data streaming argument.
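The update path just described is summarized by the following Python sketch; the seeded integer hash and the `sketches[l].update(i, v)` interface are assumptions of this illustration rather than the paper's exact choices.

```python
import random

L = 20  # number of sub-sampling levels; L = O(log n)

def lsb(x):
    """Position of the least significant 1 bit of x (lsb(1) = 0)."""
    return (x & -x).bit_length() - 1

def level_of(i, seed=0):
    """Level u = lsb(h(i)): item i survives to level l with prob. 2^{-l}."""
    rng = random.Random(hash(("level", seed, i)))
    h = rng.getrandbits(L) | (1 << L)  # set bit L so lsb() is well defined
    return lsb(h)

def process_update(i, v, sketches):
    """Propagate update (i, v) to the frequent-items structures D_0, ..., D_u,
    so that i is included in sub-streams S_0, ..., S_u."""
    u = level_of(i)
    for l in range(u + 1):
        sketches[l].update(i, v)
```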

At the time of inference, the algorithm collects samples as follows. From each level $l$, the set of items whose estimated frequency crosses the threshold $T_l = \frac{\Delta_0}{\bar\varepsilon\,2^{\,l}}$ is identified, using the frequent-items structure $D_l$. It is possible for the estimate $\hat f_{i,l}$ of an item $i$, obtained from the sub-stream $S_l$, to exceed this threshold at multiple levels $l$. We therefore apply the disambiguation rule of using the estimate obtained from the lowest level at which it crosses the threshold for that level. The estimated frequency after disambiguation is denoted $\hat f_i$. Based on their disambiguated frequencies, the sampled items are sorted into their respective groups $\bar G_0, \bar G_1, \ldots, \bar G_L$, as follows:
$$\bar G_0 = \{i \mid \hat f_i \ge T_0\} \quad\text{and}\quad \bar G_l = \{i \mid T_l \le \hat f_i < T_{l-1} \text{ and } i \in S_l\}, \quad 1 \le l \le L.$$
We define the estimator $\hat\Psi$, and a second, idealized estimator $\tilde\Psi$ that is used for the analysis only:
$$\hat\Psi = \sum_{l=0}^{L}\sum_{i\in\bar G_l}\psi(\hat f_i)\,2^{\,l}, \qquad \tilde\Psi = \sum_{l=0}^{L}\sum_{i\in\bar G_l}\psi(f_i)\,2^{\,l}. \qquad (2)$$
We now briefly review the salient points of the error analysis. Lemma 2 shows that the expected value of $\tilde\Psi$ is close to $\Psi$.

Lemma 2 (Lemma 2 from [3]). Suppose that for $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \bar\varepsilon f_i$ with probability $1 - 2^{-t}$. Then $|\mathbb{E}[\tilde\Psi] - \Psi| \le \frac{\Psi}{2^{\,t + \log L}}$.

We next present a bound on the variance of the idealized estimator. The frequency group $G_l$ is partitioned into three sub-groups, namely, $\mathrm{lmargin}(G_l) = [T_l,\ T_l(1+\bar\varepsilon/2)]$, $\mathrm{rmargin}(G_l) = [T_{l-1}(1-\bar\varepsilon),\ T_{l-1}]$ and $\mathrm{midregion}(G_l) = [T_l(1+\bar\varepsilon/2),\ T_{l-1}(1-\bar\varepsilon)]$, which respectively denote the left margin, the right margin and the mid-region of the group $G_l$. An item $i$ is said to belong to one of these regions if its true frequency lies in that region. For any item $i$ with non-zero frequency, we denote by $l(i)$ the group index $l$ such that $i \in G_l$. For any subset $T \subseteq [n]$, denote by $\psi(T)$ the expression $\sum_{i\in T}\psi(f_i)$, and let $\Psi_2 = \Psi_2(S)$ denote $\sum_{i=1}^{n}\psi^2(f_i)$.

Lemma 3 (Lemma 3 from [3]). Suppose that for all $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \bar\varepsilon f_i$ with probability $1 - 2^{-t}$. Then
$$\mathrm{Var}\big[\tilde\Psi\big] \le \frac{\Psi^2}{2^{\,t+L+2}} + \sum_{i \notin (G_0\,\cup\,\mathrm{lmargin}(G_0))}\psi^2(f_i)\,2^{\,l(i)+1}.$$

Corollary 4. If the function $\psi(\cdot)$ is non-decreasing in the interval $[0 \ldots T_0]$, then, choosing $t = L + 2\log\frac{1}{\bar\varepsilon} + 2$, we get
$$\mathrm{Var}\big[\tilde\Psi\big] \le \bar\varepsilon^2\,\Psi^2 + \sum_{l=0}^{L}\psi(T_{l-1})\,\psi(G_l)\,2^{\,l+1} + 2\,\psi(T_0)\,\psi(\mathrm{lmargin}(G_0)). \qquad (3)$$

The error incurred by the estimate $\hat\Psi$ is $|\hat\Psi - \Psi|$, and it can be written as the sum of two error components using the triangle inequality:
$$|\hat\Psi - \Psi| \le |\tilde\Psi - \Psi| + |\hat\Psi - \tilde\Psi| = E_1 + E_2.$$
Here, $E_1 = |\tilde\Psi - \Psi|$ is the error due to sampling, and $E_2 = |\hat\Psi - \tilde\Psi|$ is the error due to the approximate estimation of the frequencies. By Chebychev's inequality,
$$E_1 = |\tilde\Psi - \Psi| \le \big|\mathbb{E}[\tilde\Psi] - \Psi\big| + 3\,\mathrm{Var}\big[\tilde\Psi\big]^{1/2}$$
with probability $\frac{8}{9}$.

Using Lemma 2 and Corollary 4, and choosing $t = L + 2\log\frac{1}{\bar\varepsilon} + 2$, the expression for $E_1$ can be simplified as follows:
$$E_1 \le \frac{\bar\varepsilon^2\,\Psi}{L\,2^{L}} + 3\left(\bar\varepsilon^2\Psi^2 + \sum_{l=0}^{L}\psi(T_{l-1})\,\psi(G_l)\,2^{\,l+1} + 2\,\psi(T_0)\,\psi(\mathrm{lmargin}(G_0))\right)^{1/2} \qquad (4)$$
with probability $\frac{8}{9}$. We now present an upper bound on $E_2$.

Lemma 5. Suppose that for $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \bar\varepsilon f_i$ with probability $1 - 2^{-t}$. Then
$$|E_2| \le 2\sum_{l=0}^{L}\sum_{i\in G_l}\frac{\Delta_0}{2^{\,l}}\,\psi'(\xi_i)$$
with probability at least $\frac{9}{10} - 2^{-t}$, where, for $i \in G_l$, $\xi_i$ lies between $f_i$ and $\hat f_i$ and maximizes $\psi'(\cdot)$.

The analysis assumes that the hash function mapping items to levels is completely independent. We adopt a standard technique for reducing the required randomness, namely, the pseudo-random generator (PRG) of Nisan [15], along the lines of Indyk [11] and of Indyk and Woodruff [12]. More details are provided in Appendix B.

3 Estimating $F_p$

In this section, we use the HSS technique, with some subtle but vital modifications, to estimate $F_p$ for $0 < p < 2$. We use the COUNTSKETCH structure as the frequent-items structure at each level $l$. We observe that a direct application of the HSS technique does not give an $\tilde O(1)$-space procedure, and so we need some novel analysis. To see this, suppose that $k$ is the space parameter of the COUNTSKETCH procedure at each level. Firstly, observe that
$$|E_2| \le 2\sum_{l=0}^{L}\sum_{i\in G_l}\frac{\Delta_0}{2^{\,l}}\,\psi'(\xi_i) \le 2\sum_{l=0}^{L}\sum_{i\in G_l}\bar\varepsilon f_i \cdot p\,(2f_i)^{p-1} \le 2^{\,1+p}\,p\,\bar\varepsilon\sum_{i}|f_i|^p \le \varepsilon F_p$$
as required, since $p < 2$ and $\frac{\Delta_0}{2^{\,l}} \le \bar\varepsilon f_i$ for $i \in G_l$. Now, by equation (4) and using $t = L + 2\log\frac{1}{\bar\varepsilon} + 2$, we have
$$E_1 \le \bar\varepsilon^2 F_p + 3\left(\bar\varepsilon^2 F_p^2 + \frac{1}{\bar\varepsilon^{\,p}}\sum_{l=0}^{L}\left(\frac{F_2^{res}(k)}{k}\right)^{p/2}\frac{2^{\,l+1}}{2^{\,lp}}\,F_p(G_l) + \frac{2\,(1+\bar\varepsilon)}{\bar\varepsilon^{\,p}}\left(\frac{F_2^{res}(k)}{k}\right)^{p/2} F_p(\mathrm{lmargin}(G_0))\right)^{1/2}.$$
Further, if we write $f_{rank(r)}$ to denote the $r$-th largest frequency (in absolute value), then
$$F_2^{res}(k) = \sum_{r>k} f_{rank(r)}^2 \le f_{rank(k+1)}^{\,2-p}\sum_{r>k}|f_{rank(r)}|^p \le \left(\frac{F_p}{k}\right)^{2/p - 1} F_p,$$
so that $\big(\frac{F_2^{res}(k)}{k}\big)^{p/2} \le \frac{F_p}{k}$, and hence the expression for $E_1$ simplifies to
$$E_1 \le \bar\varepsilon^2 F_p + 3\left(\bar\varepsilon^2 F_p^2 + \frac{F_p}{\bar\varepsilon^{\,p}\,k}\sum_{l=0}^{L} 2^{\,l+1-lp}\,F_p(G_l) + \frac{2\,(1+\bar\varepsilon)\,F_p}{\bar\varepsilon^{\,p}\,k}\,F_p(\mathrm{lmargin}(G_0))\right)^{1/2}, \qquad\text{for } p > 0.$$

The main problem arises with the middle term in the above expression, namely $\frac{F_p}{\bar\varepsilon^{\,p} k}\sum_{l=0}^{L} 2^{\,l+1-lp}\,F_p(G_l)$, which can be quite large. Our approach relies on altering the group definitions to make them depend on the (randomized) quantities $F_2^{res}(k, l)$, so that the resulting expression for $E_1$ becomes bounded by $O(\varepsilon F_p)$. We also note that the expression for $E_2$ derived above remains bounded by $\varepsilon F_p$.

Altered group definitions and estimator. We use the COUNTSKETCH data structure as the frequent-items structure at each level $l = 0, 1, \ldots, L$, with $k = O(\frac{1}{\varepsilon^{2+p}})$ buckets per hash function and $s = O(\log F_1)$ hash functions per sketch. We first observe that Lemma 1 can be slightly strengthened, as follows.

Lemma 6 (A slightly stronger version of Lemma 1). For $l \ge 1$ and $k \ge 48$:
1. $F_1^{res}(k, l) \le \frac{F_1^{res}(k\,2^{\,l-1})}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$;
2. $F_2^{res}(k, l) \le \frac{F_2^{res}(k\,2^{\,l-1})}{2^{\,l}}$ with probability at least $1 - \frac{1}{2^{\,l+1}}$.

Proof. The result follows from the proof given for Lemma 1 in [3].

At each level $l$, we use the COUNTSKETCH structure to estimate $F_2^{res}(k, l)$ to within an accuracy of $(1 \pm \frac{1}{4})$ with probability $1 - \frac{1}{16L}$, where $\bar\varepsilon = \frac{\varepsilon}{32}$. Let $\bar F_2^{res}(k, l)$ denote $\frac{5}{4}\,\hat F_2^{res}(k, l)$. We redefine the thresholds $T_0, T_1, \ldots$ as follows:
$$\Delta_l = \left(\frac{\bar F_2^{res}(k, l)}{k\,2^{\,l}}\right)^{1/2}, \qquad T_l = \frac{\Delta_l}{\bar\varepsilon}, \qquad l = 0, 1, 2, \ldots, L.$$
The groups $G_0, G_1, G_2, \ldots$ are defined in the usual way, using the new thresholds:
$$G_0 = \{i \mid |f_i| \ge T_0\} \quad\text{and}\quad G_l = \{i \mid T_l \le |f_i| < T_{l-1}\}.$$
The estimator is defined by (2) as before.

Lemma 7. Suppose $k \ge \frac{16}{\bar\varepsilon^{\,2+p}}$. Then $|\hat F_p - F_p| \le 8\,\bar\varepsilon F_p$ with probability at least $\frac{3}{4}$.

Proof. We use the property of $F_2^{res}(t)$ derived above: for any $1 \le t \le F_0$,
$$F_2^{res}(t) \le t\left(\frac{F_p}{t}\right)^{2/p} = \frac{F_p^{2/p}}{t^{\,2/p - 1}}, \qquad\text{for } p > 0. \qquad (5)$$
We therefore have
$$T_l = \frac{1}{\bar\varepsilon}\left(\frac{\bar F_2^{res}(k, l)}{k\,2^{\,l}}\right)^{1/2} \le \frac{1}{\bar\varepsilon}\left(\frac{5\,F_2^{res}(k\,2^{\,l-1})}{4\,k\,2^{\,l}}\right)^{1/2}, \quad\text{by Lemma 6,}$$
$$\le \frac{1}{\bar\varepsilon}\left(\frac{5\,F_p^{2/p}}{4\,(k\,2^{\,l-1})^{2/p-1}\,k\,2^{\,l}}\right)^{1/2}, \quad\text{by (5),} \qquad \le \frac{1}{\bar\varepsilon}\left(\frac{5\,F_p}{k\,2^{\,l}}\right)^{1/p}. \qquad (6)$$

By equation (4), one component of $E_1$ can be simplified as follows:
$$E_{1,1} = \sum_{l=0}^{L} T_l^{\,p}\,F_p(G_l)\,2^{\,l+1} \le \frac{10}{\bar\varepsilon^{\,p}\,k}\,F_p\sum_{l=0}^{L}F_p(G_l), \quad\text{substituting (6), and since } p < 2,$$
$$= \frac{10}{\bar\varepsilon^{\,p}\,k}\,F_p\,\big(F_p - F_p(G_0)\big) \le \frac{5}{8}\,\bar\varepsilon^2 F_p^2, \quad\text{since } k \ge \frac{16}{\bar\varepsilon^{\,2+p}}.$$
The other component of $E_1$ is
$$E_{1,2} = 2\,T_0^{\,p}\,(1+\bar\varepsilon)\,F_p(\mathrm{lmargin}(G_0)) \le \frac{10}{\bar\varepsilon^{\,p}\,k}\,(1+\bar\varepsilon)\,F_p^2 \le 2\,\bar\varepsilon^2 F_p^2,$$
also using $k \ge \frac{16}{\bar\varepsilon^{\,2+p}}$ and $\bar\varepsilon < 1$. Substituting in (4), we have
$$E_1 \le \bar\varepsilon^2 F_p + 3\,\big(\bar\varepsilon^2 F_p^2 + E_{1,1} + E_{1,2}\big)^{1/2} < 6\,\bar\varepsilon F_p. \qquad (7)$$
Adding, the total error is bounded by
$$|E| \le E_1 + E_2 \le 8\,\bar\varepsilon F_p.$$
We summarize this section in the following theorem.

Theorem 8. There exists an algorithm that returns an estimate $\hat F_p$ satisfying $|\hat F_p - F_p| \le \varepsilon F_p$ with probability at least $\frac{3}{4}$, using space $O\big(\frac{1}{\varepsilon^{2+p}}\,(\log n)(\log F_1)\big)$, and that processes each stream update in expected time $O(\log n)$ and worst-case time $O(\log n\,\log F_1)$, measured in standard arithmetic operations on words of size $O(\log F_1)$ bits.

Remarks.
1. We note that for $0 < p < 1$, an estimator for $F_p$ with similar properties may be designed in an exactly analogous fashion by using COUNT-MIN instead of COUNTSKETCH as the frequent-items structure at each level. Such an estimator would require an $\varepsilon$-accurate estimation of $F_1^{res}(k)$ (which would imply an estimation of $F_1$ using standard techniques). This could be done either using Cauchy sketches [11, 13] or using the stand-alone technique presented in Section 4. However, using Cauchy sketches means that, in general, $O(\frac{1}{\varepsilon^2})$ time is required to process each stream update. In order to maintain poly-logarithmic processing time per stream update, the technique of Section 4 may be used.
2. The space requirement of the stable sketches estimator grows as $\tilde O(\frac{1}{\varepsilon^2 p^2})$ as a function of $p$ [13], whereas the HSS-based technique requires space $\tilde O(\frac{1}{\varepsilon^{2+p}})$. For small values of $p$, i.e., $p = O\big(\frac{1 + \log\log\frac{1}{\varepsilon}}{\log\frac{1}{\varepsilon}}\big)$, the HSS technique can be asymptotically more space-efficient.
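To isolate the role of the sub-sampling, the following sketch computes the idealized estimator $\tilde\Psi$ of (2) with $\psi(f) = |f|^p$, feeding it exact frequencies instead of COUNTSKETCH estimates; the threshold list and the seeded coin flips are assumptions of this illustration. An item of group $G_l$ contributes $|f_i|^p\,2^{\,l}$ precisely when it survives $l$ independent halvings, an event of probability $2^{-l}$, which is why the estimator is unbiased.

```python
import random

def idealized_hss_estimate(freqs, thresholds, p, seed=0):
    """Idealized HSS estimator (the paper's tilde-Psi with psi(f) = |f|^p):
    frequency estimation is assumed exact, so only sampling error remains.
    `thresholds` is the decreasing list T_0 > T_1 > ... > T_L."""
    L = len(thresholds) - 1
    est = 0.0
    for i, f in freqs.items():
        l = 0  # group index: G_0 = [T_0, inf), G_l = [T_l, T_{l-1})
        while l < L and abs(f) < thresholds[l]:
            l += 1  # items below T_L are lumped into group L here
        rng = random.Random(hash((seed, i)))
        if all(rng.random() < 0.5 for _ in range(l)):  # survives l halvings
            est += abs(f) ** p * (2 ** l)
    return est
```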

4 An iterative estimator for $F_1$

In this section, we use the HSS technique to present a stand-alone, iterative estimator for $F_1 = \sum_{i=1}^{n}|f_i|$. The previous section presented an estimator for $F_p$ that uses, as a sub-routine, an estimator for $F_2^{res}(k)$ at each level of the structure. In this section, we present a stand-alone estimator that uses only the COUNT-MIN sketch to estimate $F_1$. The technique may be of independent interest.

The estimator uses two separate instantiations of the HSS structure. The first instantiation uses the COUNT-MIN sketch structure with $k = \frac{8}{\bar\varepsilon^2}$ buckets per hash function and $s = O(\log G)$ hash functions, where $G = O(F_2)$ and $\bar\varepsilon = \frac{\varepsilon}{8}$. A collection of $s_1 = O(\log\frac{1}{\delta})$ independent copies of the structure is kept for the purpose of boosting the confidence of an estimate by taking the median. The second instantiation of the HSS structure uses $k = \frac{128}{\bar\varepsilon^3}$ buckets per hash function and $s = O(\log G)$ hash functions. For estimating $F_1$, we use a two-step procedure: (1) first, we obtain an estimate of $F_1$ that is correct to within a factor of 16, using the first HSS instantiation; (2) then, we use the second instantiation to obtain an $\varepsilon$-accurate estimate of $F_1$.

The first step of the estimation is done using the first instantiation of the HSS structure as follows. We set the threshold $T_0$ to a parameter $t$, with $T_l = \frac{T_0}{2^{\,l}}$, and the threshold frequency for placing an item in group $l$ is $T_l$. The group definitions are as before: $G_0 = [t, F_1]$ and $G_l = [\frac{t}{2^{\,l}}, \frac{t}{2^{\,l-1}})$, for $1 \le l \le L$. The disambiguation rule for the estimated frequency is as follows: if $\hat f_{i,l} > T_l$ at several levels $l$, then $\hat f_i$ is set to the estimate obtained from the lowest of these levels. The sampled groups $\bar G_l$ are defined as follows:
$$\bar G_0 = \{i \mid \hat f_i \ge T_0\}, \qquad \bar G_l = \Big\{i \;\Big|\; \frac{T_0}{2^{\,l}} \le \hat f_i < \frac{T_0}{2^{\,l-1}} \text{ and } i \in S_l\Big\}, \quad 1 \le l \le L.$$
The estimators $\hat F_1$ and $\tilde F_1$ are defined as before; they are now functions of $t$:
$$\hat F_1(t) = \sum_{l=0}^{L}\sum_{i\in\bar G_l}\hat f_i\,2^{\,l}, \qquad \tilde F_1(t) = \sum_{l=0}^{L}\sum_{i\in\bar G_l} f_i\,2^{\,l}.$$

Estimator. Let $t$ iterate over the values $1, 2, 2^2, \ldots, G$, and, for each value of $t$, let $\hat F_1^{med}(t)$ denote the median of the estimates $\hat F_1(t)$ returned by the $s_1 = O(\log\frac{1}{\delta})$ copies of the HSS structure, each using independent random bits and the same value of $t$. Let $t_{max}$ denote the largest value of $t$ satisfying
$$\hat F_1^{med}(t) \ge \frac{t}{16\,(1.01)\,\bar\varepsilon^2}.$$
The final estimate returned is $\hat F_1^{med}(t_{max})$, computed using the second HSS instantiation. A code sketch of this search follows.
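The doubling search for $t_{max}$ can be summarized as follows; the `estimate(t)` method of an HSS copy, and the exact form of the acceptance predicate (which follows the reconstruction above), are assumptions of this sketch.

```python
import statistics

def find_t_max(copies, G, eps_bar):
    """Largest power of two t <= G whose median estimate still clears the
    acceptance line; `copies` are independent HSS instances, each assumed
    to expose estimate(t) -> float, the value of hat-F_1(t)."""
    t, t_max = 1, None
    while t <= G:
        f1_med = statistics.median(c.estimate(t) for c in copies)
        # acceptance predicate (as reconstructed in Lemma 10)
        if f1_med >= t / (16 * 1.01 * eps_bar ** 2):
            t_max = t
        t *= 2
    return t_max
```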

Analysis. We first note that Lemmas 2 and 3 hold for all choices of $t$. Lemma 5 is modified as follows.

Lemma 9. Suppose that for $1 \le i \le n$ and $0 \le l \le L$, $|\hat f_{i,l} - f_i| \le \frac{\Delta_0}{2^{\,l}}$ with probability $1 - 2^{-t}$, where $\Delta_0 = \frac{F_1^{res}(k)}{k}$. Then
$$|E_2| \le 16\,\Delta_0\sum_{l=0}^{L}\sum_{i\in G_l}\frac{\psi'(\xi_i)}{2^{\,l}}$$
with probability at least $\frac{9}{10} - 2^{-t}$, where, for $i \in G_l$, $\xi_i$ lies between $f_i$ and $\hat f_i$ and maximizes $\psi'(\cdot)$.

Lemma 10. Let $\bar\varepsilon = \frac{\varepsilon}{8}$, $k = \frac{128}{\bar\varepsilon^3}$ and $\bar\varepsilon \le \frac{1}{8}$. Then, with probability $1-\delta$ each:
1. for $4\,\bar\varepsilon^2 F_1 \le t \le 8\,\bar\varepsilon^2 F_1$, $|\hat F_1^{med}(t) - F_1| \le 1.02\,\varepsilon F_1$ and $\hat F_1^{med}(t) \ge \frac{t}{16\,(1.02)\,\bar\varepsilon^2}$;
2. for any $t \ge 64\,\bar\varepsilon^2 F_1$, $\hat F_1^{med}(t) < \frac{t}{16\,(1.02)\,\bar\varepsilon^2}$.

Proof. We consider the two summation terms in the error term $E_1$ given by equation (4) separately:
$$E_{1,1} = \sum_{l=0}^{L}\frac{t}{2^{\,l}}\,F_1(G_l)\,2^{\,l+1} \le 2\,t\,\big(F_1 - F_1(G_0)\big), \qquad E_{1,2} = 2\,t\,F_1(\mathrm{lmargin}(G_0)).$$
Adding, $E_1 \le 2\,\big((t + \Delta_0)\,F_1\big)^{1/2}$. We ignore the term $2^{-(t+L+2)}\,\Psi^2$ in $E_1$ since, by choosing the parameter $t$ of Lemma 2 to be $O(L)$, this term can be made arbitrarily small in comparison with the other two terms. Since $|G_l| \le \frac{F_1(G_l)\,2^{\,l}}{t}$, our bound on $E_2$ becomes
$$E_2 \le 16\,\Delta_0\sum_{l=0}^{L}\frac{|G_l|}{2^{\,l}} \le \frac{16\,F_1\,\Delta_0}{t}.$$
Therefore, the total error is
$$E(t) = E_1 + E_2 \le 2\,\big((t + \Delta_0)\,F_1\big)^{1/2} + \frac{16\,F_1\,\Delta_0}{t}. \qquad (8)$$
Suppose $k = \frac{128}{\bar\varepsilon^3}$ and $4\,\bar\varepsilon^2 F_1 \le t \le 8\,\bar\varepsilon^2 F_1$. Using $\bar\varepsilon \le \frac{1}{8}$, $\bar\varepsilon = \frac{\varepsilon}{8}$ and $\Delta_0 \le \frac{F_1}{k}$, we have
$$E(t) \le 1.01\,\varepsilon\,F_1, \qquad\text{for } 4\,\bar\varepsilon^2 F_1 \le t \le 8\,\bar\varepsilon^2 F_1, \qquad (9)$$
with probability $1-\delta$. This proves claim 1, since $|\hat F_1^{med}(t) - F_1| \le E(t)$ with probability $1-\delta$, and the acceptance condition
$$\hat F_1^{med}(t) \ge \frac{t}{16\,(1.02)\,\bar\varepsilon^2} \qquad (10)$$
then follows, since $t \le 8\,\bar\varepsilon^2 F_1$.

For claim 2, let $t = 2^{\,j+2}\,\bar\varepsilon^2 F_1$ for some $j \ge 0$, and suppose that (10) is satisfied. Then, by (8), $E(t) \le 2^{\,(j+1)/2}\,(1.01)\,\bar\varepsilon F_1 + 2^{-j}\,\bar\varepsilon F_1 \le 2^{\,j/2+2}\,\bar\varepsilon F_1$ with probability $1-\delta$. With probability $1-\delta$, $|\hat F_1^{med}(t) - F_1| \le E(t)$, and so, using $\bar\varepsilon \le \frac{1}{8}$,
$$\frac{2^{\,j+2}\,F_1}{16\,(1.02)} \le \hat F_1^{med}(t) \le F_1\,\big(1 + 2^{\,j/2 - 1}\big),$$
which is a contradiction for $j \ge 4$, proving claim 2.

The correctness of the algorithm follows from the above lemma. The space requirement of the algorithm is $O\big(\frac{1}{\varepsilon^3}\,(\log^3 n)\,(\log F_1)\big)$ bits, and the expected time taken to process each stream update is $O(\log F_1\,\log\frac{1}{\delta})$ standard arithmetic operations on words of size $O(\log n)$.

5 Conclusions

We presented a family of algorithms for the randomized estimation of $F_p$ for $0 < p < 2$, and another family of algorithms for estimating $F_p$ for $0 < p < 1$. The first algorithm family estimates $F_p$ by using the COUNTSKETCH structure and $F_2^{res}$ estimation as sub-routines. The second algorithm family estimates $F_p$ by using the COUNT-MIN sketch structure and $F_1^{res}$ estimation as sub-routines. The space required by these algorithms is $O\big(\frac{1}{\varepsilon^{2+p}}\,(\log n)(\log F_1)(\log\frac{1}{\delta})\big)$, and the expected time required to process each stream update is $O(\log n\,\log\frac{1}{\delta})$. Finally, we also presented a stand-alone iterative estimator for $F_1$ that uses only the COUNT-MIN sketch structure as a sub-routine.

Prior approaches to the problem of estimating $F_p$ [11, 13] used sketches of the frequency vector with random variables drawn from a symmetric $p$-stable distribution. An interesting feature of the above algorithms is that they do not require the use of stable distributions. The proposed algorithms trade an extra $O(\varepsilon^{-p})$ factor of space for dramatically improved processing time (with no polynomial dependency on $\varepsilon$) per stream update.

References

1. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137-147, 1999.
2. Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, 2002.
3. L. Bhuvanagiri and S. Ganguly. Estimating Entropy over Data Streams. In Proc. ESA, 2006.
4. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: a new class of data management applications. In Proc. VLDB, 2002.
5. M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. ICALP, 2002.
6. G. Cormode and S. Muthukrishnan. What's New: Finding Significant Differences in Network Data Streams. In IEEE INFOCOM, 2004.
7. G. Cormode and S. Muthukrishnan. An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms, 55(1):58-75, April 2005.
8. P. Flajolet and G. N. Martin. Probabilistic Counting Algorithms for Database Applications. Journal of Computer and System Sciences, 31(2):182-209, 1985.
9. S. Ganguly, D. Kesh, and C. Saha. Practical Algorithms for Tracking Database Join Sizes. In Proc. FSTTCS, 2005.
10. P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. SPAA, 2001.
11. P. Indyk. Stable Distributions, Pseudo Random Generators, Embeddings and Data Stream Computation. In Proc. IEEE FOCS, 2000.
12. P. Indyk and D. Woodruff. Optimal Approximations of the Frequency Moments. In Proc. ACM STOC, 2005.
13. P. Li. Very Sparse Stable Random Projections, Estimators and Tail Bounds for Stable Random Projections. Manuscript, 2006.

14. P. Li, T. J. Hastie, and K. W. Church. Very Sparse Random Projections. In Proc. ACM SIGKDD, 2006.
15. N. Nisan. Pseudo-Random Generators for Space Bounded Computation. In Proc. ACM STOC, 1990.
16. M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In Proc. ACM SODA, January 2004.
17. M. N. Wegman and J. L. Carter. New Hash Functions and their Use in Authentication and Set Equality. Journal of Computer and System Sciences, 22:265-279, 1981.
18. D. P. Woodruff. Optimal space lower bounds for all frequency moments. In Proc. ACM SODA, January 2004.

A COUNT-MIN and COUNTSKETCH summaries

Given a data stream defining a set of item frequencies, $rank(r)$ returns an item with the $r$-th largest absolute value of the frequency (ties are broken arbitrarily). We say that an item $i$ has rank $r$ if $rank(r) = i$. For a given value of $k$, $1 \le k \le n$, the set top($k$) is the set of items with rank at most $k$. The residual second moment [5] of a data stream, denoted $F_2^{res}(k)$, is defined as the second moment of the stream after the top-$k$ frequencies have been removed: $F_2^{res}(k) = \sum_{r>k} f_{rank(r)}^2$. The residual first moment [7], denoted $F_1^{res}(k)$, is analogously defined as the $F_1$ norm of the data stream after the top-$k$ frequencies have been removed, that is, $F_1^{res}(k) = \sum_{r>k}|f_{rank(r)}|$.

A sketch [1] is a random integer $X = \sum_i f_i\,x_i$, where $x_i \in \{-1, +1\}$ for $i \in D$, and the family of variables $\{x_i\}_{i\in D}$ satisfies certain independence properties. The family $\{x_i\}_{i\in D}$ is referred to as the sketch basis. For any $d \ge 2$, a $d$-wise independent sketch basis can be constructed in a pseudo-random manner from a truly random seed of size $O(d\log n)$ bits as follows. Let $\mathbb{F}$ be a field of characteristic 2 and of size at least $n+1$. Choose a degree-$(d-1)$ polynomial $g : \mathbb{F} \to \mathbb{F}$ with coefficients that are randomly chosen from $\mathbb{F}$ [17]. Define $x_i$ to be 1 if the first bit (i.e., the least significant position) of $g(i)$ is 1, and $-1$ otherwise. The $d$-wise independence of the $x_i$'s follows from an application of Wegman and Carter's universal hash functions [17].

Pair-wise independent sketches are used in [5] to design the COUNTSKETCH algorithm for finding the top-$k$ frequent items in an insert-only stream. The data structure consists of a collection of $s = O(\log\frac{1}{\delta})$ independent hash tables $U_1, U_2, \ldots, U_s$, each consisting of $8k$ buckets, where $k$ is a space parameter. A pair-wise independent hash function $h_j : [n] \to \{1, 2, \ldots, 8k\}$, mapping items randomly to one of the $8k$ buckets, is associated with each hash table. Additionally, for each table index $j = 1, 2, \ldots, s$, we keep a pair-wise independent family of random variables $\{x_{ij}\}_{i\in[n]}$, where each $x_{ij} \in \{-1, +1\}$ with equal probability. Each bucket keeps a sketch of the sub-stream that maps to it, that is, $U_j[r] = \sum_{i : h_j(i) = r} f_i\,x_{ij}$, for $1 \le j \le s$ and $1 \le r \le 8k$. An estimate $\hat f_i$ is returned as $\hat f_i = \mathrm{median}_{j=1}^{s}\,U_j[h_j(i)]\,x_{ij}$. The accuracy of the estimation, given parameters $b$ and $k$, is stated as a function of the residual second moment [5]:
$$\Delta(b, k) = \left(\frac{8\,F_2^{res}(k)}{b}\right)^{1/2}.$$
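For concreteness, here is a minimal COUNTSKETCH along the lines just described; Python's seeded generator stands in for the pair-wise independent hash family and sketch basis (a stronger primitive than required, used here only to keep the sketch short).

```python
import random

class CountSketch:
    """Minimal CountSketch [5]: s hash tables of b buckets; bucket (j, r)
    holds the sum of f_i * x_ij over items i with h_j(i) = r; a point
    query returns the median of the s per-table estimates."""

    def __init__(self, s, b, seed=0):
        self.s, self.b, self.seed = s, b, seed
        self.tables = [[0.0] * b for _ in range(s)]

    def _hash(self, j, i):
        # seeded stand-in for pair-wise independent h_j and sign x_ij
        rng = random.Random(hash((self.seed, j, i)))
        bucket = rng.randrange(self.b)
        sign = 1 if rng.random() < 0.5 else -1
        return bucket, sign

    def update(self, i, v):
        # process stream record (i, v)
        for j in range(self.s):
            bucket, sign = self._hash(j, i)
            self.tables[j][bucket] += v * sign

    def estimate(self, i):
        ests = []
        for j in range(self.s):
            bucket, sign = self._hash(j, i)
            ests.append(self.tables[j][bucket] * sign)
        ests.sort()
        return ests[len(ests) // 2]
```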

The space versus accuracy guarantee of the COUNTSKETCH algorithm is presented in Theorem 11.

Theorem 11 ([5]). Let $\Delta = \Delta(8k, k) = \big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$. Then, for any given $i \in [n]$, $\Pr\{|\hat f_i - f_i| \le \Delta\} \ge 1 - \delta$. The space used is $O(k\,\log\frac{1}{\delta}\,\log F_1)$ bits, and the time taken to process a stream update is $O(\log\frac{1}{\delta})$.

The COUNTSKETCH algorithm can be adapted to return approximate frequent items and their frequencies. The original algorithm [5] uses a heap for maintaining the current top-$k$ items in terms of their estimated frequencies. After processing each arriving stream record of the form $(i, v)$, where $v$ is assumed to be non-negative, an estimate $\hat f_i$ is calculated using the scheme outlined above. If $i$ is already in the current estimated top-$k$ heap, then its frequency is correspondingly increased. If $i$ is not in the heap but $\hat f_i$ is larger than the current smallest frequency in the heap, then it replaces that element in the heap. This scheme is applicable to insert-only streams. A generalization of this method to strict update streams is presented in [6]; with probability $1-\delta$, it (a) returns all items with frequency at least $\big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$ and (b) does not return any item with frequency less than $(1-\varepsilon)\big(\frac{F_2^{res}(k)}{k}\big)^{1/2}$, using space $O\big(\frac{k}{\varepsilon^2}\,\log n\,\log(\varepsilon^{-1}\log\varepsilon^{-1})\,\log F_1\big)$ bits. For general update streams, a variation of this technique can be used for retrieving items satisfying properties (a) and (b) above, using space $O\big(\frac{k}{\varepsilon^2}\,\log(\delta^{-1} n)\,\log F_1\big)$ bits.

The COUNT-MIN algorithm [7] for finding approximate frequent items keeps a collection of $s = O(\log\frac{1}{\delta})$ independent hash tables $T_1, T_2, \ldots, T_s$, where each hash table $T_j$ consists of $b = 2k$ buckets and uses a pair-wise independent hash function $h_j : [n] \to \{1, \ldots, 2k\}$, for $j = 1, 2, \ldots, s$. The bucket $T_j[r]$ is an integer counter that maintains the sum $T_j[r] = \sum_{i : h_j(i) = r} f_i$. The estimated frequency $\hat f_i$ is obtained as $\hat f_i = \mathrm{median}_{j=1}^{s}\,T_j[h_j(i)]$. The space versus accuracy guarantee for the COUNT-MIN algorithm is given in terms of the quantity $F_1^{res}(k) = \sum_{r>k}|f_{rank(r)}|$.

Theorem 12 ([7]). $\Pr\big\{|\hat f_i - f_i| \le \frac{F_1^{res}(k)}{k}\big\} \ge 1 - \delta$, using space $O(k\,\log\frac{1}{\delta}\,\log F_1)$ bits and time $O(\log\frac{1}{\delta})$ to process each stream update.

Estimating $F_1^{res}$ and $F_2^{res}$. [9] presents an algorithm to estimate $F_2^{res}(k)$ to within an accuracy of $(1\pm\varepsilon)$ with confidence $1-\delta$, using space $O\big(\frac{k}{\varepsilon^2}\,\log(F_1)\,\log(\frac{n}{\delta})\big)$ bits. The data structure used is identical to the COUNTSKETCH structure. The algorithm essentially removes the top-$k$ estimated frequencies from the COUNTSKETCH structure and then estimates $F_2$. Let $\hat f_{\tau_1}, \ldots, \hat f_{\tau_k}$ denote the top-$k$ estimated frequencies obtained from the COUNTSKETCH structure. The contributions of these estimates are removed from the structure, that is, $U_j[r] := U_j[r] - \sum_{t : h_j(\tau_t) = r}\hat f_{\tau_t}\,x_{j\tau_t}$. Subsequently, the Fast-AMS algorithm [16], a variant of the original sketch algorithm [1], is used to estimate the second moment as $\hat F_2 = \mathrm{median}_{j=1}^{s}\sum_{r=1}^{8k}(U_j[r])^2$. Formally, we can state:

Lemma 13 ([9]). For a given integer $k \ge 1$ and $0 < \varepsilon < 1$, there exists an algorithm for update streams that returns an estimate $\hat F_2^{res}(k)$ satisfying $|\hat F_2^{res}(k) - F_2^{res}(k)| \le \varepsilon\,F_2^{res}(k)$ with probability $1-\delta$, using space $O\big(\frac{k}{\varepsilon^2}\,(\log\frac{F_1}{\delta})\,(\log F_1)\big)$ bits.
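A matching minimal sketch of the COUNT-MIN summary described above follows; as in the text, the point query takes the median of the counters, which handles general update streams (the classical minimum applies to insert-only streams). The seeded hash is again an illustrative stand-in for a pair-wise independent family.

```python
import random

class CountMin:
    """Minimal COUNT-MIN summary [7]: s hash tables of b counters, where
    T_j[r] = sum of f_i over items i with h_j(i) = r."""

    def __init__(self, s, b, seed=0):
        self.s, self.b, self.seed = s, b, seed
        self.tables = [[0] * b for _ in range(s)]

    def _bucket(self, j, i):
        return random.Random(hash((self.seed, j, i))).randrange(self.b)

    def update(self, i, v):
        # process stream record (i, v); v may be negative
        for j in range(self.s):
            self.tables[j][self._bucket(j, i)] += v

    def estimate(self, i):
        counts = sorted(self.tables[j][self._bucket(j, i)]
                        for j in range(self.s))
        return counts[len(counts) // 2]
```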

A similar argument can be applied to estimate $F_1^{res}(k)$, where, instead of the COUNTSKETCH algorithm, we use the COUNT-MIN algorithm for retrieving the top-$k$ estimated absolute frequencies. In parallel, a set of $s = O(\frac{1}{\varepsilon^2})$ sketches based on a 1-stable distribution [11] is maintained (i.e., $Y_j = \sum_i f_i\,z_{ji}$, where $z_{ji}$ is drawn from a 1-stable distribution). After retrieving the top-$k$ frequencies $\hat f_{\tau_1}, \ldots, \hat f_{\tau_k}$ with respect to their absolute values, we reduce the sketches as $Y_j := Y_j - \sum_{r=1}^{k}\hat f_{\tau_r}\,z_{j\tau_r}$ and estimate $F_1^{res}(k)$ as $\mathrm{median}_{j=1}^{s}\,|Y_j|$. We summarize this in Lemma 14.

Lemma 14. For a given integer $k \ge 1$ and $0 < \varepsilon < 1$, there exists an algorithm for update streams that returns an estimate $\hat F_1^{res}(k)$ satisfying $|\hat F_1^{res}(k) - F_1^{res}(k)| \le \varepsilon\,F_1^{res}(k)$ with probability $1-\delta$, using $O\big((k + \frac{1}{\varepsilon^2})\,(\log\frac{1}{\delta})\,(\log F_1)\big)$ bits.

B Reducing random bits by using a PRG

We use a standard technique for reducing the randomness required, namely, the pseudo-random generator (PRG) of Nisan [15], along the lines of Indyk [11] and of Indyk and Woodruff [12].

Notation. Let $M$ be a finite state machine that uses $S$ bits and has running time $R$. Assume that $M$ uses the random bits in segments, each segment consisting of $b$ bits. Let $U_r$ denote the uniform distribution over $\{0,1\}^r$, and, for a discrete random variable $X$, let $\mathcal{F}[X]$ denote the probability distribution of $X$, treated as a vector. Let $M(x)$ denote the state of $M$ after using the random bits in $x$. A generator $G : \{0,1\}^{u} \to \{0,1\}^{Rb}$ expands a small number $u$ of truly random bits into a sequence of $Rb$ bits that appear random to $M$. $G$ is said to be a pseudo-random generator for a class $\mathcal{C}$ of finite state machines with parameter $\epsilon$, provided that, for every $M \in \mathcal{C}$,
$$\big\lVert\,\mathcal{F}[M(U_{Rb})] - \mathcal{F}[M(G(U_u))]\,\big\rVert_1 \le \epsilon,$$
where $\lVert y\rVert_1$ denotes the $F_1$ norm of the vector $y$. Nisan [15] shows the following property (this version is from [11]).

Theorem 15 ([15]). There exists a PRG $G$ for Space($S$) and Time($R$) with parameter $\epsilon = 2^{-O(S)}$ such that $G$ expands $O(S\log R)$ truly random bits into $O(R)$ bits that appear random to any machine in the class.

This suffices for our purposes, because we can compute the frequency moments by considering each (aggregate) frequency $f_i$ in turn, using only segments of $O(\log F_1)$ bits to store and process it.


More information

Data stream algorithms via expander graphs

Data stream algorithms via expander graphs Data stream algorithms via expander graphs Sumit Ganguly sganguly@cse.iitk.ac.in IIT Kanpur, India Abstract. We present a simple way of designing deterministic algorithms for problems in the data stream

More information

CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms

CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms CS 4407 Algorithms Lecture 2: Iterative and Divide and Conquer Algorithms Prof. Gregory Provan Department of Computer Science University College Cork 1 Lecture Outline CS 4407, Algorithms Growth Functions

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 39 Variable Selection and Model Building Dr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur 5. Akaike s information

More information

Approximation algorithms for cycle packing problems

Approximation algorithms for cycle packing problems Approximation algorithms for cycle packing problems Michael Krivelevich Zeev Nutov Raphael Yuster Abstract The cycle packing number ν c (G) of a graph G is the maximum number of pairwise edgedisjoint cycles

More information

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS341 info session is on Thu 3/1 5pm in Gates415. CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS341 info session is on Thu 3/1 5pm in Gates415 CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/28/18 Jure Leskovec, Stanford CS246: Mining Massive Datasets,

More information

Kolmogorov complexity

Kolmogorov complexity Kolmogorov complexity In this section we study how we can define the amount of information in a bitstring. Consider the following strings: 00000000000000000000000000000000000 0000000000000000000000000000000000000000

More information

Ping Li Compressed Counting, Data Streams March 2009 DIMACS Compressed Counting. Ping Li

Ping Li Compressed Counting, Data Streams March 2009 DIMACS Compressed Counting. Ping Li Ping Li Compressed Counting, Data Streams March 2009 DIMACS 2009 1 Compressed Counting Ping Li Department of Statistical Science Faculty of Computing and Information Science Cornell University Ithaca,

More information

Improved Concentration Bounds for Count-Sketch

Improved Concentration Bounds for Count-Sketch Improved Concentration Bounds for Count-Sketch Gregory T. Minton 1 Eric Price 2 1 MIT MSR New England 2 MIT IBM Almaden UT Austin 2014-01-06 Gregory T. Minton, Eric Price (IBM) Improved Concentration Bounds

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

On (ε, k)-min-wise independent permutations

On (ε, k)-min-wise independent permutations On ε, -min-wise independent permutations Noga Alon nogaa@post.tau.ac.il Toshiya Itoh titoh@dac.gsic.titech.ac.jp Tatsuya Nagatani Nagatani.Tatsuya@aj.MitsubishiElectric.co.jp Abstract A family of permutations

More information

COMP 355 Advanced Algorithms

COMP 355 Advanced Algorithms COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background 1 Polynomial Running Time Brute force. For many non-trivial problems, there is a natural brute force search algorithm that

More information

Algorithms for Querying Noisy Distributed/Streaming Datasets

Algorithms for Querying Noisy Distributed/Streaming Datasets Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang Indiana University Bloomington Sublinear Algo Workshop @ JHU Jan 9, 2016 1-1 The big data models The streaming model (Alon, Matias

More information

Lecture Lecture 25 November 25, 2014

Lecture Lecture 25 November 25, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,

More information

The Count-Min-Sketch and its Applications

The Count-Min-Sketch and its Applications The Count-Min-Sketch and its Applications Jannik Sundermeier Abstract In this thesis, we want to reveal how to get rid of a huge amount of data which is at least dicult or even impossible to store in local

More information

RANDOM SIMULATIONS OF BRAESS S PARADOX

RANDOM SIMULATIONS OF BRAESS S PARADOX RANDOM SIMULATIONS OF BRAESS S PARADOX PETER CHOTRAS APPROVED: Dr. Dieter Armbruster, Director........................................................ Dr. Nicolas Lanchier, Second Committee Member......................................

More information

Introduction: The Perceptron

Introduction: The Perceptron Introduction: The Perceptron Haim Sompolinsy, MIT October 4, 203 Perceptron Architecture The simplest type of perceptron has a single layer of weights connecting the inputs and output. Formally, the perceptron

More information

Sparser Johnson-Lindenstrauss Transforms

Sparser Johnson-Lindenstrauss Transforms Sparser Johnson-Lindenstrauss Transforms Jelani Nelson Princeton February 16, 212 joint work with Daniel Kane (Stanford) Random Projections x R d, d huge store y = Sx, where S is a k d matrix (compression)

More information

Tight Bounds for Distributed Functional Monitoring

Tight Bounds for Distributed Functional Monitoring Tight Bounds for Distributed Functional Monitoring Qin Zhang MADALGO, Aarhus University Joint with David Woodruff, IBM Almaden NII Shonan meeting, Japan Jan. 2012 1-1 The distributed streaming model (a.k.a.

More information

COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background

COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background COMP 355 Advanced Algorithms Algorithm Design Review: Mathematical Background 1 Polynomial Time Brute force. For many non-trivial problems, there is a natural brute force search algorithm that checks every

More information

Zero-One Laws for Sliding Windows and Universal Sketches

Zero-One Laws for Sliding Windows and Universal Sketches Zero-One Laws for Sliding Windows and Universal Sketches Vladimir Braverman 1, Rafail Ostrovsky 2, and Alan Roytman 3 1 Department of Computer Science, Johns Hopkins University, USA vova@cs.jhu.edu 2 Department

More information

Analysis of Algorithms

Analysis of Algorithms October 1, 2015 Analysis of Algorithms CS 141, Fall 2015 1 Analysis of Algorithms: Issues Correctness/Optimality Running time ( time complexity ) Memory requirements ( space complexity ) Power I/O utilization

More information

Solving Zero-Sum Security Games in Discretized Spatio-Temporal Domains

Solving Zero-Sum Security Games in Discretized Spatio-Temporal Domains Solving Zero-Sum Security Games in Discretized Spatio-Temporal Domains APPENDIX LP Formulation for Constant Number of Resources (Fang et al. 3) For the sae of completeness, we describe the LP formulation

More information

Solving Classification Problems By Knowledge Sets

Solving Classification Problems By Knowledge Sets Solving Classification Problems By Knowledge Sets Marcin Orchel a, a Department of Computer Science, AGH University of Science and Technology, Al. A. Mickiewicza 30, 30-059 Kraków, Poland Abstract We propose

More information

Lecture 3 Frequency Moments, Heavy Hitters

Lecture 3 Frequency Moments, Heavy Hitters COMS E6998-9: Algorithmic Techniques for Massive Data Sep 15, 2015 Lecture 3 Frequency Moments, Heavy Hitters Instructor: Alex Andoni Scribes: Daniel Alabi, Wangda Zhang 1 Introduction This lecture is

More information

Revisiting Frequency Moment Estimation in Random Order Streams

Revisiting Frequency Moment Estimation in Random Order Streams Revisiting Frequency Moment Estimation in Random Order Streams arxiv:803.02270v [cs.ds] 6 Mar 208 Vladimir Braverman, Johns Hopkins University David Woodruff, Carnegie Mellon University March 6, 208 Abstract

More information