On Estimating Frequency Moments of Data Streams
Sumit Ganguly(1) and Graham Cormode(2)

(1) Indian Institute of Technology, Kanpur, sganguly@iitk.ac.in
(2) AT&T Labs Research, graham@research.att.com

Abstract. Space-economical estimation of the pth frequency moment, defined as F_p = Σ_{i=1}^n |f_i|^p, for p > 0, is of interest in estimating all-pairs distances in a large data matrix [14], in machine learning, and in data stream computation. Random sketches formed by the inner product of the frequency vector f_1, ..., f_n with a suitably chosen random vector were pioneered by Alon, Matias and Szegedy [1], and have since played a central role in estimating F_p and in data stream computations in general. The concept of p-stable sketches, formed by the inner product of the frequency vector with a random vector whose components are drawn from a p-stable distribution, was proposed by Indyk [11] for estimating F_p for 0 < p < 2, and has been further studied by Li [13]. In this paper, we consider the problem of estimating F_p for 0 < p < 2. A disadvantage of the stable sketches technique and its variants is that they require O(1/ε²) inner products of the frequency vector with dense vectors of stable (or nearly stable [14, 13]) random variables to be maintained. This means that each stream update can be quite time-consuming. We present algorithms for estimating F_p, for 0 < p < 2, that do not require the use of stable sketches or their approximations. Our technique is elementary in nature, in that it uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. Our algorithms require space Õ(1/ε^{2+p}) to estimate F_p to within 1 ± ε factors, and require expected time O(log F_1 log(1/δ)) to process each update. Thus, our technique trades an O(1/ε^p) factor in space for much more efficient processing of stream updates. We also present a stand-alone iterative estimator for F_1.
1 Introduction

Recently, there has been an emergence of monitoring applications in diverse areas including network traffic monitoring, network topology monitoring, sensor networks, financial market monitoring, and web-log monitoring. In these applications, data is generated rapidly and continuously, and must be analyzed efficiently, in real time and in a single pass over the data, to identify large trends, anomalies, user-defined exception conditions, and so on. In many of these applications, it is often required to continuously track the big picture, or an aggregate view of the data, as opposed to a detailed view of the data. In such scenarios, efficient approximate computation is often acceptable. The data streaming model has gained popularity as a computational model for such applications, where incoming data

3. Following standard convention, we say that f(n) is Õ(g(n)) if f(n) = O((1/ε)^{O(1)} (log m)^{O(1)} (log n)^{O(1)} g(n)).
(or updates) are processed very efficiently and in an online fashion, using space much less than that needed to store the data in its entirety. A data stream S is viewed as a sequence of arrivals of the form (i, v), where i is the identity of an item that is a member of the domain [n] = {1, ..., n} and v is the update to the frequency of the item: v > 0 indicates an insertion of multiplicity v, while v < 0 indicates a corresponding deletion. The frequency of an item i, denoted f_i, is the sum of the updates to i since the inception of the stream, that is, f_i = Σ_{(i,v) appears in S} v. In this paper, we consider the problem of estimating the pth frequency moment of a data stream, defined as F_p = Σ_{i=1}^n |f_i|^p, for 0 < p < 2. Equivalently, this can be interpreted as the pth power of the L_p norm of the vector defined by the stream. The techniques used to design algorithms and lower bounds for the frequency moment problem have been influential in the design of algorithmic and lower bound techniques for data stream computation. We briefly review the current state of the art in estimating F_p, with particular emphasis on the range 0 < p < 2.

1.1 Review

Alon, Matias and Szegedy [1] present the seminal technique of AMS sketches for estimating F_2. An (atomic) AMS sketch is a random integer X = Σ_{i=1}^n f_i ξ_i, where {ξ_i}_{i=1,...,n} is a family of four-wise independent random variables assuming the values +1 or −1 with equal probability. An AMS sketch is easily maintained over a stream: for each update (i, v), the sketch X is updated as X := X + v ξ_i. Moreover, since the family {ξ_i} is only 4-wise independent, each ξ_i can be obtained from a randomly chosen cubic polynomial h over a field F that contains the domain of items [n] (ξ_i = 1 if the last bit of h(i) is 1, and ξ_i = −1 otherwise). It then follows that E[X²] = F_2 and Var[X²] ≤ 2 F_2² [1].
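To make the update rule concrete, the atomic AMS sketch described above can be rendered in a few lines of Python. This is an illustrative sketch, not the authors' code: the choice of the Mersenne prime 2^31 − 1 as the field (in place of the characteristic-2 field mentioned later in Appendix A) and the use of Python's random module to draw the cubic polynomial are assumptions of this demo.

```python
import random

class AMSSketch:
    """Atomic AMS sketch X = sum_i f_i * xi_i, with 4-wise independent signs
    xi_i in {-1, +1} derived from a random cubic polynomial over a prime field."""

    P = 2_147_483_647  # Mersenne prime 2^31 - 1; any prime field containing [n] works

    def __init__(self):
        # Random degree-3 polynomial h(i) = ((a3*i + a2)*i + a1)*i + a0 mod P.
        self.coeffs = [random.randrange(self.P) for _ in range(4)]
        self.x = 0

    def sign(self, i):
        h = 0
        for a in self.coeffs:  # Horner evaluation of the cubic polynomial
            h = (h * i + a) % self.P
        return 1 if h & 1 else -1  # xi_i = +1 iff the last bit of h(i) is 1

    def update(self, i, v):
        self.x += v * self.sign(i)  # stream update (i, v): X := X + v * xi_i

    def estimate_f2(self):
        return self.x ** 2  # E[X^2] = F_2 and Var[X^2] <= 2 * F_2^2
```

Averaging the squares of O(1/ε²) independent sketches of this kind then gives the (1 ± ε)-estimate of F_2 discussed next.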
An estimate of F_2 that is accurate to within 1 ± ε with confidence 7/8 can therefore be obtained as the average of the squares of O(1/ε²) independent sketch values. There has also been significant study of the case p = 0, also known as the distinct elements problem. Alon, Matias and Szegedy [1] gave a constant factor approximation in small space. Gibbons and Tirthapura [10] showed a (1 ± ε) factor approximation in space Õ(1/ε²); subsequent work has improved the (hidden) logarithmic factors [2].

p-stable sketches. The use of p-stable sketches was pioneered by Indyk [11] for estimating F_p, with 0 < p < 2. A stable sketch is defined as Y = Σ_{i=1}^n f_i s_i, where s_i is drawn at random from a p-stable distribution, denoted S(p, 1) (the second parameter of S(·, ·) is the scale factor). By the defining property of p-stable distributions, Y is distributed as S(p, (Σ_{i=1}^n |f_i|^p)^{1/p}). In other words, Y is p-stable distributed, with scale factor F_p^{1/p}. Indyk gives a technique to estimate F_p by keeping O(1/ε²) independent p-stable sketches and returning the median of these observations [11]. Woodruff [18] presents an Ω(1/ε²) space lower bound for the problem of estimating F_p, for all p ≥ 0, implying that the stable sketches technique is space optimal up to logarithmic factors. Li [13] further analyses stable sketches and suggests the use of the geometric mean estimator, that is,

ˆF_p = c_{p,k} |Y_1|^{p/k} |Y_2|^{p/k} ··· |Y_k|^{p/k}
where Y_1, Y_2, ..., Y_k are independent p-stable sketches of the data stream and c_{p,k} is a constant depending on p and k. Li shows the above estimator is unbiased, that is, E[ˆF_p] = F_p, and that Var[ˆF_p] ≤ π² F_p² / (6 p² k). It follows (by Chebyshev's inequality) that returning the geometric mean of k = O(1/(ε² p²)) sketches gives an estimate of F_p that is accurate to within factors of (1 ± ε) with probability 7/8. Li also shows that the geometric mean estimator has lower variance than the median estimator proposed originally by Indyk [11].

Very sparse sketches. The very sparse sketch method due to Li et al. aims to maintain the same space and accuracy bounds, but to reduce the time cost to process each update [14, 13]. Note that this technique applies only when the data satisfies some uniformity properties, whereas the preceding techniques need no such assumptions. A very sparse (nearly) p-stable sketch is a linear combination of the form W = Σ_{i=1}^n f_i w_i, where w_i is P_p with probability β/2, −P_p with probability β/2, and 0 otherwise. Here, P_p is the p-Pareto distribution with probability tail function Pr{P_p > t} = 1/t^p, t ≥ 1. Pareto distributions are proposed since they are much simpler to sample from than stable distributions. Further, Li shows that W is asymptotically p-stable provided max_i |f_i| / F_p^{1/p} → 0. Thus, very sparse sketches provide a principled way of reducing the data stream processing time, provided the data satisfies certain uniformity properties.

Drawbacks of stable-based methods. A drawback of the original technique of stable sketches is that, in general, for each stream update all of the O(1/ε²) stable sketches must be updated. Each sketch update requires the pseudo-random generation of a random variable drawn from a p-stable distribution, making it time-consuming to process each stream update. The very sparse stable sketches technique somewhat alleviates this problem by speeding up the processing time by a factor of approximately 1/β, although the data must now satisfy certain uniformity conditions.
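For concreteness, the following sketch illustrates p-stable sketches with Indyk's median estimator. It is a demonstration under stated assumptions, not the authors' code: stable variates are generated by the Chambers–Mallows–Stuck method, and the per-item variates are keyed by an integer-seeded generator, a demo simplification standing in for proper pseudo-random generation (cf. Appendix B).

```python
import math
import random
from statistics import median

def stable_sample(p, rng):
    """Chambers-Mallows-Stuck sampler for a standard symmetric p-stable variate."""
    theta = rng.uniform(-math.pi / 2, math.pi / 2)
    w = rng.expovariate(1.0)
    if p == 1.0:
        return math.tan(theta)  # p = 1 is the standard Cauchy distribution
    return (math.sin(p * theta) / math.cos(theta) ** (1.0 / p)
            * (math.cos((1.0 - p) * theta) / w) ** ((1.0 - p) / p))

class StableSketch:
    """r independent p-stable sketches Y_j = sum_i f_i * s_ij.

    estimate() returns median_j |Y_j|^p; for p = 1 this directly estimates F_1,
    since the median of |Cauchy| is tan(pi/4) = 1 (general p needs a constant)."""

    def __init__(self, p, r, seed=0):
        self.p, self.r, self.seed = p, r, seed
        self.y = [0.0] * r

    def _s(self, j, i):
        # Deterministic per-(sketch, item) stable variate via an integer seed.
        rng = random.Random((self.seed * 1_000_003 + j) * 1_000_003 + i)
        return stable_sample(self.p, rng)

    def update(self, i, v):
        # Every update touches all r sketches -- the time cost noted above.
        for j in range(self.r):
            self.y[j] += v * self._s(j, i)

    def estimate(self):
        return median(abs(y) for y in self.y) ** self.p
```

The inner loop over all r sketches per update is exactly the per-update cost that the algorithms of this paper are designed to avoid.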
In general, it is not possible to guarantee or verify a priori whether the data stream satisfies the desired properties. We therefore advocate that, in the absence of knowledge of the data distribution, the geometric mean estimator over p-stable sketches is the most reliable of the known estimators, although it is quite expensive.

Contributions. In this paper, we present a technique for estimating F_p, for 0 < p < 2. Our technique requires space O((1/ε^{2+p}) log F_1 log n) to estimate F_p to within relative error (1 ± ε) with probability 7/8. Further, it requires O(log n) expected time (and O(log F_1 log n) worst-case time) to process each stream update. Thus, our technique trades a factor of O(1/ε^p) space for improved processing time per stream update. From an algorithm design viewpoint, perhaps the most salient feature of the technique is that it has no recourse to stable distributions. Our technique is elementary in nature and uses simple randomization in conjunction with well-known summary structures for data streams, such as the COUNT-MIN sketch [7] and the COUNTSKETCH structure [5]. It is based on making some crucial and subtle modifications to the HSS technique [3].

Organization. The remainder of the paper is organized as follows. In Section 2, we review the HSS technique for estimating a class of data stream metrics. Sections 3 and 4 present, respectively, a family of algorithms for estimating F_p and an iterative estimator for F_1. Finally, we conclude in Section 5.
2 Review of the HSS technique

In this section, we briefly review the HSS (for "Hierarchical Sampling from Sketches") procedure [3] for estimating F_p, p > 2, over data streams. Appendix A reviews the COUNTSKETCH and COUNT-MIN algorithms for finding frequent items in a data stream, and algorithms to estimate the residual first and second moments, respectively, of a data stream [9]. The HSS method is a general technique for estimating a class of metrics over data streams of the form:

Ψ(S) = Σ_{i : |f_i| > 0} ψ(f_i).   (1)

From the input stream S, sub-streams S_0, ..., S_L are created by successive sub-sampling, that is, S_0 = S and for 1 ≤ l ≤ L, S_l is obtained from S_{l−1} by sub-sampling each distinct item appearing in S_{l−1} independently with probability 1/2 (hence L = O(log n)). Let k be a space parameter. At each level l, we keep a frequent items data structure, denoted D_l(k), that takes as input the sub-stream S_l and returns an approximation to the top(k) items of its input stream and their frequencies. D_l(k) is instantiated by either the COUNT-MIN or the COUNTSKETCH data structures. At level l, suppose that the frequent items structure at this level has an additive error of Δ_l(k) (with high probability), that is, |ˆf_i − f_i| ≤ Δ_l(k) with probability ≥ 1 − 2^{−t}, where t is a parameter. Define F_1^res(k, l) to be (the random variable denoting) the value of F_1 of the sub-stream S_l after removing the k largest absolute frequencies, and F_2^res(k, l) to be the corresponding value of F_2. The (non-random) value F_1^res(k, 0) (respectively, F_2^res(k, 0)) is written as F_1^res(k) (resp. F_2^res(k)).

Lemma 1 (Lemma 1 from [3]).
1. For l ≥ 1 and k ≥ 48, F_1^res(k, l) ≤ F_1^res(k)/2^l with probability ≥ 1 − 2^{−Ω(k)}.
2. For l ≥ 1 and k ≥ 48, F_2^res(k, l) ≤ F_2^res(k)/2^l with probability ≥ 1 − 2^{−Ω(k)}.

By the above lemma, let Δ_0 = F_1^res(k)/k or Δ_0 = (F_2^res(k)/k)^{1/2}, depending on whether the COUNT-MIN or the COUNTSKETCH structure is used as the frequent items structure at each level. Let ε̄ = ε/16, T_0 = Δ_0/ε̄ and T_l = T_0/2^l, for l = 0, 1, ..., log T_0.
The items are grouped into groups G_0, G_1, ..., G_L as follows: G_0 = {i ∈ S : |f_i| ≥ T_0} and G_l = {i ∈ S : T_l ≤ |f_i| < T_{l−1}}, for 1 ≤ l ≤ L. It follows that, with high probability, for all items of G_l that are present in the random sub-stream S_l, Δ_l ≤ ε̄ |f_i| and hence |ˆf_i − f_i| ≤ ε̄ |f_i|. Corresponding to every stream update (i, v), we use a hash function h : [n] → [n] to map the item i to a level u = lsb(h(i)), where lsb(x) is the position of the least significant 1 in the binary representation of x. The stream update (i, v) is then propagated to the frequent items data structures D_l for 0 ≤ l ≤ u; in effect, i is included in the sub-streams from level 0 to level u. The hash function is assumed to be chosen randomly from a fully independent family; later, we reduce the number of random bits required by a standard data streaming argument. At the time of inference, the algorithm collects samples as follows. From each level l, the set of items whose estimated frequency crosses the threshold T_l = Δ_0/(ε̄ 2^l) is identified, using the frequent items structure D_l. It is possible for the estimate ˆf_{i,l} of an item i obtained from the sub-stream S_l to exceed
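The level-routing step just described can be sketched as follows. This is a toy illustration: the fully random hash function is simulated by a seeded lookup table, and plain dictionaries stand in for the frequent items structures D_l — both assumptions of this demo.

```python
import random

def lsb(x):
    """Position of the least significant set bit of x (lsb(1) = 0)."""
    return (x & -x).bit_length() - 1

class HSSLevels:
    """Route each update (i, v) to sub-streams S_0 .. S_u with u = lsb(h(i)),
    so item i survives to level l with probability 2^-l."""

    def __init__(self, levels, seed=0):
        self.L = levels
        self.rng = random.Random(seed)
        self.h = {}  # seeded table simulating a fully random hash h : [n] -> [n]
        # One frequency table per level stands in for the D_l structures.
        self.sub = [dict() for _ in range(levels + 1)]

    def _hash(self, i):
        if i not in self.h:
            self.h[i] = self.rng.randrange(1, 1 << 30)  # nonzero, so lsb is defined
        return self.h[i]

    def update(self, i, v):
        u = min(lsb(self._hash(i)), self.L)
        for l in range(u + 1):  # propagate (i, v) to D_0, ..., D_u
            self.sub[l][i] = self.sub[l].get(i, 0) + v
```

Note that each update touches an expected O(1) levels beyond level 0, which is the source of the fast expected update time.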
this threshold at multiple levels l. We therefore apply the disambiguation rule of using the estimate obtained from the lowest level at which it crosses the threshold for that level. The estimated frequency after disambiguation is denoted ˆf_i. Based on their disambiguated frequencies, the sampled items are sorted into their respective groups Ḡ_0, Ḡ_1, ..., Ḡ_L, as follows: Ḡ_0 = {i : ˆf_i ≥ T_0} and Ḡ_l = {i : T_l ≤ ˆf_i < T_{l−1} and i ∈ S_l}, 1 ≤ l ≤ L. We define the estimator ˆΨ, and a second idealized estimator Ψ̄ which is used for analysis only:

ˆΨ = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} ψ(ˆf_i) 2^l,   Ψ̄ = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} ψ(f_i) 2^l.   (2)

We now briefly review the salient points of the error analysis. Lemma 2 shows that the expected value of Ψ̄ is close to Ψ.

Lemma 2 (Lemma 2 from [3]). Suppose that for 0 ≤ i ≤ N − 1 and 0 ≤ l ≤ L, |ˆf_{i,l} − f_i| ≤ ε̄ f_i with probability ≥ 1 − 2^{−t}. Then |E[Ψ̄] − Ψ| ≤ Ψ · 2^{−t + log L}.

We now present a bound on the variance of the idealized estimator. The frequency group G_l is partitioned into three sub-groups, namely, lmargin(G_l) = [T_l, T_l(1 + ε̄/2)], rmargin(G_l) = [T_{l−1}(1 − ε̄), T_{l−1}] and midregion(G_l) = [T_l(1 + ε̄/2), T_{l−1}(1 − ε̄)], that respectively denote the left-margin, right-margin and mid-region of the group G_l. An item i is said to belong to one of these regions if its true frequency lies in that region. For any item i with non-zero frequency, we denote by l(i) the group index l such that i ∈ G_l. For any subset T ⊆ [n], denote by ψ(T) the expression Σ_{i∈T} ψ(f_i). Let Ψ_2 = Ψ_2(S) denote Σ_{i=1}^n ψ²(f_i).

Lemma 3 (Lemma 3 from [3]). Suppose that for all 0 ≤ i ≤ N − 1 and 0 ≤ l ≤ L, |ˆf_{i,l} − f_i| ≤ ε̄ f_i with probability ≥ 1 − 2^{−t}. Then, Var[Ψ̄] ≤ 2^{−t+L+2} Ψ_2 + Σ_{i ∉ (G_0 ∪ lmargin(G_0))} ψ²(f_i) 2^{l(i)+1}.

Corollary 4. If the function ψ(·) is non-decreasing in the interval [0, T_0], then, choosing t = L + log(1/ε̄²) + 2, we get

Var[Ψ̄] ≤ ε̄² Ψ_2 + Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + 2 ψ(T_0) ψ(lmargin(G_0)).   (3)

The error incurred by the estimate ˆΨ is |ˆΨ − Ψ|, and can be written as the sum of two error components using the triangle inequality.
|ˆΨ − Ψ| ≤ |Ψ̄ − Ψ| + |ˆΨ − Ψ̄| = E_1 + E_2.

Here, E_1 = |Ψ̄ − Ψ| is the error due to sampling, and E_2 = |ˆΨ − Ψ̄| is the error due to the approximate estimation of the frequencies. By Chebyshev's inequality, E_1 = |Ψ̄ − Ψ| ≤ |E[Ψ̄] − Ψ| +
3 (Var[Ψ̄])^{1/2} with probability ≥ 8/9. Using Lemma 2 and Corollary 4, and choosing t = L + log(1/ε̄²) + 2, the expression for E_1 can be simplified as follows:

E_1 ≤ ε̄² L 2^{−L} Ψ + 3 ( ε̄² Ψ_2 + Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + 2 ψ(T_0) ψ(lmargin(G_0)) )^{1/2}   (4)

with probability ≥ 8/9. We now present an upper bound on E_2.

Lemma 5. Suppose that for 1 ≤ i ≤ n and 0 ≤ l ≤ L, |ˆf_{i,l} − f_i| ≤ ε̄ f_i with probability ≥ 1 − 2^{−t}. Then E_2 ≤ 2 Σ_{l=0}^{L} Δ_l Σ_{i ∈ G_l} ψ′(ξ_i) with probability ≥ 9/10 − L 2^{−t}, where for i ∈ G_l, ξ_i lies between f_i and ˆf_i, and maximizes ψ′(·).

The analysis assumes that the hash function mapping items to levels is completely independent. We adopt a standard technique of reducing the required randomness by using a pseudo-random generator (PRG) of Nisan [15], along the lines of Indyk in [11] and Indyk and Woodruff in [12]. More details are provided in Appendix B.

3 Estimating F_p

In this section, we use the HSS technique, with some subtle but vital modifications, to estimate F_p for 0 < p < 2. We use the COUNTSKETCH structure as the frequent items structure at each level l. We observe that a direct application of the HSS technique does not yield an Õ(1) space procedure, and so we need some novel analysis. To see this, suppose that k is the space parameter of the COUNTSKETCH procedure at each level. Firstly, observe that

E_2 ≤ 2 Σ_{l=0}^{L} Δ_l Σ_{i ∈ G_l} ψ′(ξ_i) ≤ 2 Σ_{l=0}^{L} Σ_{i ∈ G_l} ε̄ |f_i| · p |f_i|^{p−1} = 2 p ε̄ Σ_i |f_i|^p ≤ ε̄ 2^{1+p/2} F_p ≤ ε F_p,

as required, since p < 2 and Δ_l ≤ ε̄ |f_i| for i ∈ G_l. Now, by equation (4) and using t = L + log(1/ε̄²) + 2, we have

E_1 ≤ ε̄² F_p + 3 ( ε̄² F_p² + ε̄^{−p} Σ_{l=0}^{L} (F_2^res(k)/(k 2^l))^{p/2} F_p(G_l) 2^{l+1} + 2 ε̄^{−p} (1 + ε̄) (F_2^res(k)/k)^{p/2} F_p(lmargin(G_0)) )^{1/2}.

Further, if we write f_rank(r) to denote the r-th largest frequency (in absolute value), then

F_2^res(k) = Σ_{r>k} f_rank(r)² ≤ f_rank(k+1)^{2−p} Σ_{r>k} f_rank(r)^p ≤ (F_p/k)^{2/p−1} F_p, for p > 0,

and hence the expression for E_1 simplifies to

E_1 ≤ ε̄² F_p + 3 ( ε̄² F_p² + ε̄^{−p} (F_p/k) Σ_{l=0}^{L} 2^{l+1−lp/2} F_p(G_l) + 2 ε̄^{−p} (1 + ε̄) (F_p/k) F_p(lmargin(G_0)) )^{1/2}.
The main problem arises with the middle term in the above expression, namely, ε̄^{−p} (F_p/k) Σ_{l=0}^{L} 2^{l+1−lp/2} F_p(G_l), which can be quite large. Our approach relies on altering the group definitions to make them depend on the (randomized) quantities F_2^res(k, l), so that the resulting expression for E_1 becomes bounded by O(ε F_p). We also note that the expression for E_2 derived above remains bounded by ε F_p.

Altered group definitions and estimator. We use the COUNTSKETCH data structure as the frequent items structure at each level l = 0, 1, ..., L, with k = O(1/ε^{2+p}) buckets per hash function and s = O(log F_1) hash functions per sketch. We first observe that Lemma 1 can be slightly strengthened as follows:

Lemma 6 (a slightly stronger version of Lemma 1). For l ≥ 1 and k ≥ 48:
1. F_1^res(k, l) ≤ F_1^res(k 2^{l−1}) / 2^l with probability ≥ 1 − 2^{−Ω(k)}.
2. F_2^res(k, l) ≤ F_2^res(k 2^{l−1}) / 2^l with probability ≥ 1 − 2^{−Ω(k)}.

Proof. The result follows from the proof given for Lemma 1 in [3].

At each level l, we use the COUNTSKETCH structure to estimate F_2^res(k, l) to within an accuracy of (1 ± 1/4) with probability ≥ 1 − 1/(16L). Let F̄_2^res(k, l) denote (5/4) F_2^res(k, l). We redefine the thresholds T_0, T_1, ..., as follows:

Δ_l = (F̄_2^res(k, l)/k)^{1/2},   T_l = Δ_l/ε̄,   l = 0, 1, 2, ..., L.

The groups G_0, G_1, G_2, ..., are set in the usual way, using the new thresholds:

G_0 = {i : |f_i| ≥ T_0} and G_l = {i : T_l ≤ |f_i| < T_{l−1}}.

The estimator is defined by (2) as before.

Lemma 7. Suppose k ≥ 16/ε̄^{2+p}. Then E ≤ 8 ε F_p with probability at least 3/4.

Proof. We use the property of F_2^res(t) derived above, that for any 1 ≤ t ≤ F_0,

F_2^res(t) ≤ t (F_p/t)^{2/p} = F_p^{2/p} / t^{2/p−1}, for p > 0.   (5)

We therefore have,

T_l = (F̄_2^res(k, l)/k)^{1/2} / ε̄
    ≤ (5 F_2^res(k 2^{l−1}) / (4 k 2^l))^{1/2} / ε̄,   by Lemma 6
    ≤ (5 F_p^{2/p} / (4 (k 2^{l−1})^{2/p−1} k 2^l))^{1/2} / ε̄,   by (5)
    ≤ (1/ε̄) (5 F_p / (k 2^l))^{1/p}.   (6)
By equation (4), one component of E_1 can be simplified as follows:

E_{1,1} = Σ_{l=0}^{L} T_l^p F_p(G_l) 2^{l+1}
        ≤ (10/(ε̄^p k)) F_p Σ_{l=0}^{L} F_p(G_l),   substituting (6)
        = (10/(ε̄^p k)) F_p (F_p − F_p(G_0))
        ≤ (5/8) ε̄² F_p²,   since k ≥ 16/ε̄^{2+p}.

The other component of E_1 is

E_{1,2} = 2 T_0^p (1 + ε̄) F_p(lmargin(G_0)) ≤ (10/(ε̄^p k)) (1 + ε̄) F_p² ≤ ε̄² F_p²,

also using k ≥ 16/ε̄^{2+p} and ε̄ < 1. Substituting in (4), we have

E_1 ≤ ε̄² F_p + 3 (ε̄² F_p² + E_{1,1} + E_{1,2})^{1/2} < 6 ε F_p.   (7)

Adding, the total error is bounded by

E ≤ E_1 + E_2 ≤ 8 ε F_p.

We summarize this section in the following theorem.

Theorem 8. There exists an algorithm that returns ˆF_p satisfying |ˆF_p − F_p| ≤ ε F_p with probability ≥ 3/4, using space O((1/ε^{2+p})(log n)(log F_1)), and processing each stream update in expected time O(log n) and worst-case time O(log n log F_1) standard arithmetic operations on words of size O(log F_1) bits.

Remarks.
1. We note that for 0 < p < 1, an estimator for F_p with similar properties may be designed in an exactly analogous fashion by using COUNT-MIN instead of COUNTSKETCH as the frequent items structure at each level. Such an estimator would require an ε-accurate estimation of F_1^res(k) (which would imply an estimation of F_1 using standard techniques), which could be done either using Cauchy sketches [11, 13] or using the stand-alone technique presented in Section 4. However, using Cauchy sketches means that, in general, O(1/ε²) time is required to process each stream update. In order to maintain poly-logarithmic processing time per stream update, the technique of Section 4 may be used.
2. The space requirement of the stable sketches estimator grows as Õ(1/(ε² p²)) as a function of p [13], whereas the HSS-based technique requires space Õ(1/ε^{2+p}). For small values of p, i.e., p = O(1/(log ε^{−1} (1 + log log ε^{−1}))), the HSS technique can be asymptotically more space-efficient.
4 An iterative estimator for F_1

In this section, we use the HSS technique to present a stand-alone, iterative estimator for F_1 = Σ_{i=1}^n |f_i|. The previous section presents an estimator for F_p that uses, as a sub-routine, an estimator for F_2^res(k) at each level of the structure. In this section, we present a stand-alone estimator that uses only the COUNT-MIN sketch to estimate F_1. The technique may be of independent interest. The estimator uses two separate instantiations of the HSS structure. The first instantiation uses the COUNT-MIN sketch structure with k = 8/ε̄² buckets per hash function and s = O(log G) hash functions, where G = O(F_2) and ε̄ = ε/8. A collection of s_1 = O(log(1/δ)) independent copies of the structure is kept for the purpose of boosting the confidence of an estimate by taking the median. The second instantiation of the HSS structure uses k = 128/ε̄³ buckets per hash function and s = O(log G) hash functions. For estimating F_1, we use a two-step procedure: (1) first, we obtain an estimate of F_1 that is correct to within a factor of 16 using the first HSS instantiation, and (2) then, we use the second instantiation to obtain an ε-accurate estimate of F_1. The first step of the estimation is done using the first instantiation of the HSS structure as follows. We set the threshold T_0 to a parameter t, T_l = T_0/2^l, and the threshold frequency for finding items in group l to be T_l. The group definitions are as defined earlier: G_0 = [t, F_1], G_l = [t/2^l, t/2^{l−1}), 1 ≤ l ≤ L. The disambiguation rule for the estimated frequency is as follows: if ˆf_{i,l} > T_l, then ˆf_i is set to the estimate obtained from the lowest of these levels. The sampled groups Ḡ_l are defined as follows:

Ḡ_0 = {i : ˆf_i ≥ T_0},   Ḡ_l = {i : T_0/2^l ≤ ˆf_i < T_0/2^{l−1} and i ∈ S_l}, 1 ≤ l ≤ L.

The estimators ˆF_1 and F̄_1 are defined as before; these are now functions of t:

ˆF_1(t) = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} ˆf_i 2^l,   F̄_1(t) = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} f_i 2^l.

Estimator.
Let t iterate over the values 1, 2, 4, ..., G, and for each value of t let ˆF_1^med(t) denote the median of the estimates ˆF_1(t) returned from the s_1 = O(log(1/δ)) copies of the HSS structure, each using independent random bits and the same value of t. Let t_max denote the largest value of t satisfying

ˆF_1^med(t) ≥ 16t/(1.01 ε̄).

The final estimate returned is ˆF_1^med(t_max), computed using the second HSS instantiation.

Analysis. We first note that Lemmas 2 and 3 hold for all choices of t. Lemma 5 is modified as follows.

Lemma 9. Suppose that for 1 ≤ i ≤ n and 0 ≤ l ≤ L, |ˆf_{i,l} − f_i| ≤ Δ_0/2^l with probability ≥ 1 − 2^{−t}, where Δ_0 = F_1^res(k)/k. Then E_2 ≤ 16 Δ_0 Σ_{l=0}^{L} 2^{−l} Σ_{i ∈ G_l} ψ′(ξ_i) with probability ≥ 9/10 − L 2^{−t}, where for i ∈ G_l, ξ_i lies between f_i and ˆf_i, and maximizes ψ′(·).

Lemma 10. Let ε̄ = ε/8, k = 128/ε̄³ and ε̄ ≤ 1/128. Then, with probability ≥ 1 − δ each:
1. For 4 ε̄² F_1 ≤ t ≤ 8 ε̄² F_1, |ˆF_1^med(t) − F_1| ≤ 1.02 ε̄ F_1 and ˆF_1^med(t) ≥ 16t/(1.02 ε̄), each with probability ≥ 1 − δ.
2. For any t ≥ 64 ε̄² F_1, ˆF_1^med(t) < 16t/(1.02 ε̄), with probability ≥ 1 − δ.

Proof. We consider the two summation terms in the error term E_1 given by equation (4) separately:

E_{1,1} = Σ_{l=1}^{L} (t/2^l) F_1(G_l) 2^{l+1} ≤ 2t (F_1 − F_1(G_0)), and E_{1,2} = 2t F_1(lmargin(G_0)) ≤ 2t F_1.

Adding, E_1 ≤ 3 ((2t + Δ_0) F_1)^{1/2}. We ignore the term 2^{−t+L+2} Ψ_2 in E_1, since, by choosing t = O(L), this term can be made arbitrarily small in comparison with the other two terms. Since |G_l| ≤ F_1 2^l / t, our bound on E_2 becomes

E_2 ≤ 16 Δ_0 Σ_{l=0}^{L} |G_l| 2^{−l} ≤ 16 F_1² L / (k t).

Therefore, the expression for the total error E(t) = E_1 + E_2 becomes

E(t) ≤ 3 ((2t + Δ_0) F_1)^{1/2} + 16 F_1² L / (k t).   (8)

Suppose k = 128/ε̄³ and 4 ε̄² F_1 ≤ t ≤ 8 ε̄² F_1. Using ε̄ ≤ 1/128 and ε̄ = ε/8, we have

E(t) ≤ 1.01 ε̄ F_1, for 4 ε̄² F_1 ≤ t ≤ 8 ε̄² F_1.   (9)

Therefore, for 4 ε̄² F_1 ≤ t ≤ 8 ε̄² F_1, with probability ≥ 1 − δ, we have from (9) that |ˆF_1^med(t) − F_1| ≤ 1.01 ε̄ F_1, and so ˆF_1^med(t) ≥ 16t/(1.02 ε̄), proving claim 1.

Let t = 2^{j+2} ε̄² F_1 for some j ≥ 0, and suppose that

ˆF_1^med(t) ≥ 16t/(1.02 ε̄)   (10)

is satisfied. Then, by (8), E(t) ≤ 2^{(j−1)/2} (1.01) ε̄ F_1 + 2^{−j} ε̄ F_1 ≤ 2^{j/2} ε̄ F_1, with probability ≥ 1 − δ. With probability ≥ 1 − δ, |ˆF_1^med(t) − F_1| ≤ E(t), and so, using ε̄ ≤ 1/128,

2^{j/2} ε̄ F_1 ≥ E(t) ≥ ˆF_1^med(t) − F_1 ≥ 16t/(1.02 ε̄) − F_1 ≥ 2^{j+5} ε̄ F_1 − F_1,   by (10)

using ε̄ = ε/8 and k = 128/ε̄³, which is a contradiction for j ≥ 4, proving claim 2.

The correctness of the algorithm follows from the above lemma. The space requirement of the algorithm is O((1/ε³)(log n)(log F_1)(log(1/δ))) bits, and the expected time taken to process each stream update is O(log F_1 log(1/δ)) standard arithmetic operations on words of size O(log n).
5 Conclusions

We present a family of algorithms for the randomized estimation of F_p for 0 < p < 2, and another family of algorithms for estimating F_p for 0 < p < 1. The first algorithm family estimates F_p by using the COUNTSKETCH structure and F_2^res estimation as sub-routines. The second algorithm family estimates F_p by using the COUNT-MIN sketch structure and F_1^res estimation as sub-routines. The space required by these algorithms is O((1/ε^{2+p})(log n)(log F_1)(log(1/δ))), and the expected time required to process each stream update is O(log n log(1/δ)). Finally, we also present a stand-alone iterative estimator for F_1 that only uses the COUNT-MIN sketch structure as a sub-routine. Prior approaches to the problem of estimating F_p [11, 13] used sketches of the frequency vector with random variables drawn from a symmetric p-stable distribution. An interesting feature of the above algorithms is that they do not require the use of stable distributions. The proposed algorithms trade an extra O(ε^{−p}) factor of space for dramatically improved processing time (with no polynomial dependency on ε) per stream update.

References

1. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the frequency moments. Journal of Computer and System Sciences, 58(1):137–147, 1999.
2. Z. Bar-Yossef, T. S. Jayram, R. Kumar, D. Sivakumar, and L. Trevisan. Counting distinct elements in a data stream. In Proc. RANDOM, 2002.
3. L. Bhuvanagiri and S. Ganguly. Estimating Entropy over Data Streams. In Proc. ESA, 2006.
4. D. Carney, U. Cetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams: a new class of data management applications. In Proc. VLDB, 2002.
5. M. Charikar, K. Chen, and M. Farach-Colton. Finding frequent items in data streams. In Proc. ICALP, 2002.
6. G. Cormode and S. Muthukrishnan. What's New: Finding Significant Differences in Network Data Streams. In Proc. IEEE INFOCOM, 2004.
7. G. Cormode and S. Muthukrishnan.
An Improved Data Stream Summary: The Count-Min Sketch and its Applications. J. Algorithms, 55(1):58–75, April 2005.
8. P. Flajolet and G. N. Martin. Probabilistic Counting Algorithms for Database Applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.
9. S. Ganguly, D. Kesh, and C. Saha. Practical Algorithms for Tracking Database Join Sizes. In Proc. FSTTCS, 2005.
10. P. Gibbons and S. Tirthapura. Estimating simple functions on the union of data streams. In Proc. ACM SPAA, 2001.
11. P. Indyk. Stable Distributions, Pseudo Random Generators, Embeddings and Data Stream Computation. In Proc. IEEE FOCS, 2000.
12. P. Indyk and D. Woodruff. Optimal Approximations of the Frequency Moments. In Proc. ACM STOC, 2005.
13. P. Li. Very Sparse Stable Random Projections, Estimators and Tail Bounds for Stable Random Projections. Manuscript, 2006.
14. P. Li, T. J. Hastie, and K. W. Church. Very Sparse Random Projections. In Proc. ACM SIGKDD, 2006.
15. N. Nisan. Pseudo-Random Generators for Space Bounded Computation. In Proc. ACM STOC, 1990.
16. M. Thorup and Y. Zhang. Tabulation based 4-universal hashing with applications to second moment estimation. In Proc. ACM-SIAM SODA, 2004.
17. M. N. Wegman and J. L. Carter. New Hash Functions and their Use in Authentication and Set Equality. Journal of Computer and System Sciences, 22:265–279, 1981.
18. D. P. Woodruff. Optimal space lower bounds for all frequency moments. In Proc. ACM-SIAM SODA, 2004.

A COUNT-MIN and COUNTSKETCH summaries

Given a data stream defining a set of item frequencies, rank(r) returns an item with the r-th largest absolute value of the frequency (ties are broken arbitrarily). We say that an item i has rank r if rank(r) = i. For a given value of k, 1 ≤ k ≤ n, the set top(k) is the set of items with rank ≤ k. The residual second moment [5] of a data stream, denoted F_2^res(k), is defined as the second moment of the stream after the top-k frequencies have been removed: F_2^res(k) = Σ_{r>k} f_rank(r)². The residual first moment [7] of a data stream, denoted F_1^res(k), is analogously defined as the F_1 norm of the data stream after the top-k frequencies have been removed, that is, F_1^res(k) = Σ_{r>k} |f_rank(r)|. A sketch [1] is a random integer X = Σ_i f_i x_i, where x_i ∈ {−1, +1}, for i ∈ D, and the family of variables {x_i}_{i∈D} has certain independence properties. The family of random variables {x_i}_{i∈D} is referred to as the sketch basis. For any d ≥ 2, a d-wise independent sketch basis can be constructed in a pseudo-random manner from a truly random seed of size O(d log n) bits as follows. Let F be a field of characteristic 2 and of size at least n + 1. Choose a degree d − 1 polynomial g : F → F with coefficients that are randomly chosen from F [17]. Define x_i to be 1 if the first bit (i.e., the least significant position) of g(i) is 1, and define x_i to be −1 otherwise.
The d-wise independence of the x_i's follows from an application of Wegman and Carter's universal hash functions [17]. Pair-wise independent sketches are used in [5] to design the COUNTSKETCH algorithm for finding the top-k frequent items in an insert-only stream. The data structure consists of a collection of s = O(log(1/δ)) independent hash tables U_1, U_2, ..., U_s, each consisting of 8k buckets. A pair-wise independent hash function h_j : [n] → {1, 2, ..., 8k} is associated with each hash table, mapping items randomly to one of the 8k buckets, where k is a space parameter. Additionally, for each table index j = 1, 2, ..., s, we keep a pair-wise independent family of random variables {x_ij}_{i∈[n]}, where each x_ij ∈ {−1, +1} with equal probability. Each bucket keeps a sketch of the sub-stream that maps to it, that is, U_j[r] = Σ_{i: h_j(i)=r} f_i x_ij, for 1 ≤ j ≤ s and 1 ≤ r ≤ 8k. An estimate ˆf_i is returned as follows: ˆf_i = median_{j=1,...,s} U_j[h_j(i)] x_ij. The accuracy of estimation is stated as a function of the residual second moment; given parameters b and k, Δ(b, k) is defined as [5]

Δ(b, k) = 8 (F_2^res(k)/b)^{1/2}.

The space versus accuracy guarantees of the COUNTSKETCH algorithm are presented in Theorem 11.
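A compact rendering of this data structure follows. It is a demonstration under simplifying assumptions, not the authors' code: the pair-wise independent hash and sign families are simulated here with seeded pseudo-random generators keyed on the item identifier.

```python
import random
from statistics import median

class CountSketch:
    """COUNTSKETCH with s hash tables of 8k buckets; the point estimate of f_i
    is the median over tables j of U_j[h_j(i)] * x_ij."""

    def __init__(self, k, s, seed=0):
        self.b = 8 * k
        self.s = s
        rng = random.Random(seed)
        self.hseeds = [rng.randrange(1 << 30) for _ in range(s)]
        self.tables = [[0] * self.b for _ in range(s)]

    def _h(self, j, i):
        # Demo stand-in for the pair-wise independent h_j and sign family x_ij.
        r = random.Random(self.hseeds[j] * 1_000_003 + i)
        return r.randrange(self.b), r.choice((-1, 1))

    def update(self, i, v):
        for j in range(self.s):
            r, x = self._h(j, i)
            self.tables[j][r] += v * x  # U_j[r] accumulates f_i * x_ij

    def estimate(self, i):
        ests = []
        for j in range(self.s):
            r, x = self._h(j, i)
            ests.append(self.tables[j][r] * x)
        return median(ests)
```

The random signs make collisions cancel in expectation, so heavy items are recovered even amid many light colliding items.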
Theorem 11 ([5]). Let Δ = Δ(8k, k). Then, for any given i ∈ [n], Pr{|ˆf_i − f_i| ≤ Δ} ≥ 1 − δ. The space used is O(k log(1/δ)(log F_1)) bits, and the time taken to process a stream update is O(log(1/δ)).

The COUNTSKETCH algorithm can be adapted to return approximate frequent items and their frequencies. The original algorithm [5] uses a heap for maintaining the current top-k items in terms of their estimated frequencies. After processing each arriving stream record of the form (i, v), where v is assumed to be non-negative, an estimate ˆf_i is calculated using the scheme outlined above. If i is already in the current estimated top-k heap, then its frequency is correspondingly increased. If i is not in the heap but ˆf_i is larger than the current smallest frequency in the heap, then it replaces that element in the heap. This scheme is applicable to insert-only streams. A generalization of this method for strict update streams is presented in [6]; with probability ≥ 1 − δ, it (a) returns all items with frequency at least (F_2^res(k)/k)^{1/2} and (b) does not return any item with frequency less than (1 − ε)(F_2^res(k)/k)^{1/2}, using space O((k/ε²) log n log(ε^{−1} log ε^{−1}) log F_1) bits. For general update streams, a variation of this technique can be used for retrieving items satisfying properties (a) and (b) above, using space O((k/ε²) log(δ^{−1} n) log F_1) bits.

The COUNT-MIN algorithm [7] for finding approximate frequent items keeps a collection of s = O(log(1/δ)) independent hash tables T_1, T_2, ..., T_s, where each hash table T_j is of size b = 2k buckets and uses a pair-wise independent hash function h_j : [n] → {1, ..., 2k}, for j = 1, 2, ..., s. The bucket T_j[r] is an integer counter that maintains the sum T_j[r] = Σ_{i: h_j(i)=r} f_i. The estimated frequency ˆf_i is obtained as ˆf_i = median_{j=1,...,s} T_j[h_j(i)]. The space versus accuracy guarantees for the COUNT-MIN algorithm are given in terms of the quantity F_1^res(k) = Σ_{r>k} |f_rank(r)|.
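The COUNT-MIN point query can likewise be sketched in a few lines (again a demo with assumptions: seeded pseudo-random tables replace the pair-wise independent hash functions, and the median over tables is used as described above for general update streams; for insert-only streams the minimum over tables is the classical alternative):

```python
import random
from statistics import median

class CountMin:
    """COUNT-MIN sketch: s integer counter arrays of b = 2k buckets each;
    T_j[r] accumulates the frequencies of all items hashing to bucket r."""

    def __init__(self, k, s, seed=0):
        self.b, self.s = 2 * k, s
        rng = random.Random(seed)
        self.hseeds = [rng.randrange(1 << 30) for _ in range(s)]
        self.tables = [[0] * self.b for _ in range(s)]

    def _h(self, j, i):
        # Demo stand-in for a pair-wise independent hash h_j : [n] -> [2k].
        return random.Random(self.hseeds[j] * 1_000_003 + i).randrange(self.b)

    def update(self, i, v):
        for j in range(self.s):
            self.tables[j][self._h(j, i)] += v

    def estimate(self, i):
        # Each table overestimates f_i (for non-negative streams) by the
        # colliding mass; the median over tables controls this error w.h.p.
        return median(self.tables[j][self._h(j, i)] for j in range(self.s))
```

Unlike COUNTSKETCH, no random signs are kept, so for non-negative streams every table's counter dominates the true frequency.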
Theorem 12 ([7]). For any given i ∈ [n], Pr{|f̂_i − f_i| ≤ F_1^res(k)/k} ≥ 1 − δ, using space O(k log(1/δ) log F_1) bits and time O(log(1/δ)) to process each stream update.

Estimating F_1^res(k) and F_2^res(k). [9] presents an algorithm to estimate F_2^res(k) to within an accuracy of (1 ± ɛ) with confidence 1 − δ using space O((k/ɛ²) log(F_1) log(n/δ)) bits. The data structure used is identical to the COUNTSKETCH structure. The algorithm essentially removes the top-k estimated frequencies from the COUNTSKETCH structure and then estimates F_2. Let f̂_τ1, ..., f̂_τk denote the top-k estimated frequencies from the COUNTSKETCH structure. First, the contributions of these estimates are removed from the structure, that is, U_j[r] := U_j[r] − Σ_{t : h_j(τ_t) = r} f̂_τt x_jτt. Subsequently, the Fast-AMS algorithm [16], a variant of the original sketch algorithm [1], is used to estimate the second moment as F̂_2 = median_{j=1,...,s} Σ_{r=1}^{8k} (U_j[r])². Formally, we can state:

Lemma 13 ([9]). For a given integer k ≥ 1 and 0 < ɛ < 1, there exists an algorithm for update streams that returns an estimate F̂_2^res(k) satisfying |F̂_2^res(k) − F_2^res(k)| ≤ ɛ F_2^res(k) with probability ≥ 1 − δ, using space O((k/ɛ²)(log(F_1/δ))(log F_1)) bits.

A similar argument can be applied to estimate F_1^res(k), where, instead of using the COUNTSKETCH algorithm, we use the COUNT-MIN algorithm for retrieving the top-k estimated absolute frequencies. In parallel, we maintain a set of s = O(1/ɛ²) sketches based on a 1-stable distribution [11] (i.e., Y_j = Σ_i f_i z_ji, where each z_ji is drawn from a 1-stable distribution). After retrieving the top-k frequencies f_τ1, ..., f_τk
with respect to their absolute values, we reduce the sketches as Y_j := Y_j − Σ_{r=1}^{k} f_τr z_jτr and estimate F_1^res(k) as median_{j=1,...,s} |Y_j|. We summarize this in Lemma 14.

Lemma 14. For a given integer k ≥ 1 and 0 < ɛ < 1, there exists an algorithm for update streams that returns an estimate F̂_1^res(k) satisfying |F̂_1^res(k) − F_1^res(k)| ≤ ɛ F_1^res(k) with probability ≥ 1 − δ, using O((1/ɛ²)(k + 1/ɛ²)(log(1/δ))(log F_1)) bits.

B Reducing random bits by using a PRG

We use a standard technique for reducing the randomness, namely the pseudo-random generator (PRG) of Nisan [15], along the lines of Indyk [11] and of Indyk and Woodruff [12].

Notation. Let M be a finite state machine that uses S bits of space and has running time R. Assume that M uses its random bits in segments, each segment consisting of b bits. Let U_r denote the uniform distribution over {0, 1}^r, and for a discrete random variable X, let F[X] denote the probability distribution of X, treated as a vector. Let M(x) denote the state of M after using the random bits in x. A generator G : {0, 1}^u → {0, 1}^b expands a small number u of truly random bits into a sequence of b bits that appear random to M. G is said to be a pseudo-random generator for a class C of finite state machines with parameter ɛ, provided that, for every M ∈ C,

|F[M(U_b)] − F[M(G(U_u))]|_1 ≤ ɛ,

where |y|_1 denotes the ℓ_1 norm of the vector y. Nisan [15] shows the following property (this version is from [11]).

Theorem 15 ([15]). There exists a PRG G for Space(S) and Time(R) with parameter ɛ = 2^{−O(S)} that requires O(S log R) truly random bits; G expands these bits into O(R) bits.

This is sufficient because we can compute the frequency moments by considering each (aggregate) frequency f_i in turn, using only segments of O(log F_1) bits to store and process it.
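The estimator of Lemma 14 can be sketched in a few lines. This toy version works from pre-aggregated frequencies, takes the top-k exactly (where the text retrieves them approximately with COUNT-MIN), and stores the dense matrix of 1-stable variables explicitly; that explicit storage is precisely the cost that Nisan's PRG removes in the streaming setting, where each z_ji must be regenerated reproducibly on every update. The function name and parameters are illustrative.

```python
import math
import random
import statistics

def residual_f1_estimate(freqs, k, s=801, seed=42):
    """Toy F_1^res(k) estimator in the spirit of Lemma 14: keep s
    1-stable (Cauchy) sketches Y_j = sum_i f_i z_ji, subtract the
    contributions of the k largest |f_i| (taken exactly here), and
    return median_j |Y_j|."""
    rng = random.Random(seed)
    items = list(freqs)
    # dense matrix of standard Cauchy variables z[j][t]; a real
    # streaming implementation regenerates these via Nisan's PRG
    z = [[math.tan(math.pi * (rng.random() - 0.5)) for _ in items]
         for _ in range(s)]
    Y = [sum(freqs[i] * z[j][t] for t, i in enumerate(items))
         for j in range(s)]
    # peel off the top-k items by absolute frequency
    top = sorted(range(len(items)), key=lambda t: abs(freqs[items[t]]),
                 reverse=True)[:k]
    for j in range(s):
        for t in top:
            Y[j] -= freqs[items[t]] * z[j][t]
    return statistics.median(abs(y) for y in Y)
```

By 1-stability, each reduced Y_j is distributed as F_1^res(k) times a standard Cauchy variable, whose absolute value has median 1, so the median over the s sketches concentrates around F_1^res(k).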