Hierarchical Sampling from Sketches: Estimating Functions over Data Streams


Sumit Ganguly^1 and Lakshminath Bhuvanagiri^2
^1 Indian Institute of Technology, Kanpur
^2 Google Inc., Bangalore

Abstract. We present a randomized procedure named Hierarchical Sampling from Sketches (HSS) that can be used for estimating a class of functions over the frequency vector f of update streams, namely functions of the form Ψ(S) = Σ_{i=1}^n ψ(|f_i|). We illustrate this by applying the HSS technique to design nearly space-optimal algorithms for estimating the p-th moment of the frequency vector, for real p ≥ 2, and for estimating the entropy of a data stream.^3

^3 A preliminary version of this paper appeared as the following conference publications: "Simpler algorithm for estimating frequency moments of data streams", Lakshminath Bhuvanagiri, Sumit Ganguly, Deepanjan Kesh and Chandan Saha, Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 2006; and "Estimating Entropy over Data Streams", Lakshminath Bhuvanagiri and Sumit Ganguly, Proceedings of the European Symposium on Algorithms, Springer LNCS Volume 4168, 2006.

1 Introduction

A variety of applications in diverse areas, such as networking, database systems, sensor networks and web applications, share some common characteristics: data is generated rapidly and continuously, and must be analyzed in real time and in a single pass over the data to identify large trends, anomalies, user-defined exception conditions, etc. Furthermore, it is frequently sufficient to continuously track the big picture, or an aggregate view of the data. In this context, efficient and approximate computation with bounded error probability is often acceptable. The data stream model is a computational model for such applications, where incoming data is processed in an online fashion using sub-linear space.

1.1 The data stream model

A data stream S is viewed as a sequence of records of the form (pos, i, v), where pos is the index of the record in the sequence, i is the identity of an item in [1, n] = {1, ..., n}, and v is the change to the frequency of the item. v > 0 indicates an insertion of multiplicity v, while v < 0 indicates a corresponding deletion. The frequency of an item i, denoted by f_i, is the sum of the changes to the frequency of i since the inception of the stream, that is, f_i = Σ_{(pos,i,v) appears in S} v.

The following variations of the data stream model have been considered in the research literature.

1. The insert-only model, where data streams do not have deletions, that is, v > 0 for all records. The unit insert-only model is a special case of the insert-only model, where v = 1 for all records.
2. The strict update model, where insertions and deletions are allowed, subject to the constraint that f_i ≥ 0 for all i ∈ [1, n].
3. The general update model, where no constraints are placed on insertions and deletions.
4. The sliding window model, where a window size parameter W is given and only the portion of the stream that has arrived within the last W time units is considered to be relevant. Records that are not part of the current window are deemed to have expired.

In the data stream model, an algorithm must perform its computations in an online manner, that is, the algorithm gets to view each stream record exactly once. Further, the computation is constrained to use sub-linear space, that is, o(n log F_1) bits, where F_1 = Σ_{i=1}^n |f_i|. This implies that the stream cannot be stored in its entirety and summary structures must be devised to solve the problem.

We say that an algorithm estimates a quantity C with ɛ-accuracy and probability 1 − δ if it returns an estimate Ĉ satisfying |Ĉ − C| ≤ ɛ|C| with probability 1 − δ. The probability is assumed to hold for every instance of input data and is taken over the random coin tosses used by the algorithm. More simply, we say that a randomized algorithm estimates C with ɛ-accuracy if it returns an estimate satisfying |Ĉ − C| ≤ ɛ|C| with constant probability greater than 1/2 (e.g., 7/8). Such an estimator can be used to obtain another estimator for C that is ɛ-accurate and correct with probability 1 − δ by following the standard procedure of returning the median of s = O(log(1/δ)) independent such estimates Ĉ_1, ..., Ĉ_s. (A minimal code sketch of this boosting step is given at the end of this subsection.)

In this paper, we consider two basic problems in the data streaming model, namely, estimating the moments of the frequency vector of a data stream and estimating the entropy of a data stream. We first define the problems and review the research literature.
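The following is a minimal sketch, in Python, of the median-boosting step just described; the function name and the particular constant used for s are illustrative choices of this sketch rather than taken from the paper.

```python
import math
import statistics

def boost(single_estimate, delta):
    """Median of s = O(log(1/delta)) independent estimates.  Each call to
    single_estimate() is assumed to be epsilon-accurate with probability
    >= 7/8; a standard Chernoff-bound argument then shows that the median
    is epsilon-accurate with probability >= 1 - delta."""
    s = 4 * math.ceil(math.log(1.0 / delta)) + 1   # odd, so the median is an actual estimate
    return statistics.median(single_estimate() for _ in range(s))
```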

1.2 Previous work on estimating F_p for p ≥ 2

For any real p ≥ 0, the p-th moment of the frequency vector of the stream S is defined as

F_p(S) = Σ_{i=1}^n |f_i|^p.

The problem of estimating F_p has led to a number of advancements in the design of algorithms and lower bound techniques for data stream computation. It was first introduced in [1], which also presented the first sub-linear space randomized algorithm for estimating F_p, for p > 1, with ɛ-accuracy, using space O((1/ɛ^2) n^{1−1/p} log F_1) bits. For the special case of F_2, the seminal sketch technique was presented in [1]; it uses space O((1/ɛ^2) log F_1) bits for estimating F_2 with ɛ-accuracy. The work in [9, 13] reduced the space requirement for ɛ-accurate estimation of F_p, for p > 2, to O((1/ɛ^2) n^{1−1/(p−1)} log F_1).^4 The space requirement was further reduced in [12] to O(n^{1−2/(p+1)} log F_1), for p > 2 and p integral. A space lower bound of Ω(n^{1−2/p}) for estimating F_p, for p > 2, was shown in a series of contributions [1, 2, 7] (see also [20]). Finally, Indyk and Woodruff [18] presented the first algorithm for estimating F_p, for real p > 2, that matches the above space lower bound up to poly-logarithmic factors. The 2-pass algorithm of Indyk and Woodruff requires space O((1/ɛ^{12}) n^{1−2/p} (log^2 n)(log^6 F_1)) bits. The 1-pass data streaming algorithm derived from the 2-pass algorithm further increases the constant and poly-logarithmic factors. The work in [24] presents an Ω(1/ɛ^2) space lower bound for ɛ-accurate estimation of F_p, for any real p ≥ 1 and ɛ ≥ 1/√n.

1.3 Previous work on estimating entropy

The entropy H of a data stream is defined as

H = Σ_{i: f_i ≠ 0} (|f_i|/F_1) log(F_1/|f_i|).

It is a measure of the information-theoretic randomness, or the incompressibility, of the data stream. A value of the entropy close to log n indicates that the frequencies in the stream are nearly uniformly distributed, whereas low values indicate patterns in the data. Monitoring changes in the entropy of a network traffic stream has been used to detect anomalies [15, 22, 25]. The work in [16] presents an algorithm for estimating H over unit insert-only streams to within an α-approximation (i.e., an estimate Ĥ with H/a ≤ Ĥ ≤ bH and ab ≤ α) using space O(n^{1/α}/H). The work in [5] presents an algorithm for estimating H to within α-approximation over insert-only streams using space O((1/ɛ^2) F_1^{2/α} (log F_1)^2 (log F_1 + log n)). Subsequent to the publication of the conference version [3] of our algorithm, a different algorithm for estimating the entropy of insert-only streams was presented in [6]. The algorithm of [6] requires space O((1/ɛ^2)(log n)(log F_1)(log F_1 + log(1/ɛ))). The work in [6] also shows a space lower bound of Ω(1/(ɛ^2 log(1/ɛ))) for estimating entropy over a data stream.

^4 The algorithm of [13] assumes p to be integral.

1.4 Contributions

In this paper we present a technique called Hierarchical Sampling from Sketches, or HSS, which is a general randomized technique for estimating a variety of functions over update data streams. We illustrate the HSS technique by applying it to derive near-optimal-space algorithms for estimating the frequency moments F_p, for real p > 2. The HSS algorithm estimates F_p, for any real p ≥ 2, to within ɛ-accuracy using space

O( (p^2/ɛ^{2+4/p}) n^{1−2/p} (log F_1)^2 (log n + log log F_1)^2 ).

Thus, the space upper bound of this algorithm matches the lower bound up to factors that are poly-logarithmic in F_1 and n and polynomial in 1/ɛ. The expected time required to process each stream update is O(log n + log log F_1) operations. The algorithm essentially uses the idea of Indyk and Woodruff [18] of classifying items into groups based on frequency. However, Indyk and Woodruff define groups whose boundaries are randomized; in our algorithm, the group boundaries are deterministic.

The HSS technique is also used to design an algorithm that estimates the entropy of general update data streams to within ɛ-accuracy using space

O( ((log F_1)^2/(ɛ^3 log(1/ɛ))) (log n + log log F_1)^2 ).

Organization. The remainder of the paper is organized as follows. In Section 2, we review relevant algorithms from the research literature. In Section 3, we present the HSS technique for estimating a class of data stream metrics. Sections 4 and 5 use the HSS technique to estimate frequency moments and entropy, respectively, of a data stream. Finally, we conclude in Section 6.

2 Preliminaries

In this section, we review the COUNTSKETCH and the COUNT-MIN algorithms for finding frequent items in a data stream. We also review algorithms to estimate the residual second moment of a data stream [14]. For completeness, we also present an algorithm for estimating the residual first moment of a data stream.

Given a data stream, rank(r) denotes an item with the r-th largest absolute value of frequency, where ties are broken arbitrarily. We say that an item i has rank r if rank(r) = i. For a given value of k, 1 ≤ k ≤ n, the set top(k) is the set of items with rank at most k. The residual second moment [8] of a data stream, denoted by F_2^{res}(k), is defined as the second moment of the stream after the top-k frequencies have been removed, that is,

F_2^{res}(k) = Σ_{r>k} f_{rank(r)}^2.

The residual first moment [10] of a data stream, denoted by F_1^{res}(k), is analogously defined as the ℓ_1 norm of the data stream after the top-k frequencies have been removed, that is,

F_1^{res}(k) = Σ_{r>k} |f_{rank(r)}|.

Sketches. A linear sketch [1] is a random integer X = Σ_i f_i x_i, where x_i ∈ {−1, +1} for i ∈ [1, n], and the family of variables {x_i} is either pair-wise or 4-wise independent, depending on the use of the sketches. The family of random variables {x_i} is referred to as the linear sketch basis. For any d ≥ 2, a d-wise independent linear sketch basis can be constructed in a pseudo-random manner from a truly random seed of size O(d log n) bits as follows. Let F be a field of characteristic 2 and of size at least n + 1. Choose a degree d−1 polynomial g : F → F whose coefficients are randomly chosen from F [4, 23]. Define x_i to be +1 if the first bit of g(i) is 1, and define x_i to be −1 otherwise. The d-wise independence of the x_i's follows from Wegman and Carter's universal hash functions [23].

Linear sketches were pioneered in [1] to obtain an elegant and efficient algorithm for returning an estimate of the second moment F_2 of a stream to within a factor of (1 ± ɛ) with probability 7/8. Their procedure keeps s = O(1/ɛ^2) independent linear sketches (i.e., sketches using independent sketch bases) X_1, X_2, ..., X_s, where the sketch basis used for each X_j is four-wise independent. The algorithm returns F̂_2 as the average of the squares of the sketches, that is, F̂_2 = (1/s) Σ_{j=1}^s X_j^2. A crucial property observed in [1] is that E[X_j^2] = F_2 and Var[X_j^2] ≤ 5F_2^2. In the remainder of the paper, we abbreviate linear sketches as simply sketches.

COUNTSKETCH algorithm for estimating frequency. Pair-wise independent sketches are used in [8] to design the COUNTSKETCH algorithm for estimating the frequency of any given item i ∈ [1, n] of a stream. The data structure consists of a collection of s = O(log(1/δ)) independent hash tables U_1, U_2, ..., U_s, each consisting of 8k buckets. A pair-wise independent hash function h_j : [1, n] → {1, 2, ..., 8k} is associated with each hash table and maps items randomly to one of the 8k buckets. Additionally, for each table index j = 1, 2, ..., s, the algorithm keeps a pair-wise independent family of random variables {x_{ij}}, where each x_{ij} ∈ {−1, +1} with equal probability. Each bucket keeps a sketch of the sub-stream that maps to it, that is, U_j[r] = Σ_{i: h_j(i)=r} f_i x_{ij}, for r ∈ {1, 2, ..., 8k} and j ∈ {1, 2, ..., s}. An estimate f̂_i is returned as follows: f̂_i = median_{j=1,...,s} U_j[h_j(i)] x_{ij}. The accuracy of estimation is stated as a function of the residual second moment, via the quantity [8]

Δ(s, A) := 8 (F_2^{res}(s)/A)^{1/2}.
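Before stating the space-accuracy guarantee, we give a minimal illustrative implementation of the COUNTSKETCH update and point-query steps in Python. The class and parameter names are ours, and Python's salted hash() stands in for the pair-wise independent hash and sign families; a faithful implementation would use the polynomial constructions over a finite field described above.

```python
import random
from statistics import median

class CountSketch:
    """Minimal CountSketch: s hash tables of b buckets; each bucket stores a
    signed linear sketch of the frequencies that hash into it."""
    def __init__(self, s, b, seed=0):
        rng = random.Random(seed)
        self.s, self.b = s, b
        # per-table salts standing in for pair-wise independent hash/sign functions
        self.salts = [(rng.getrandbits(32), rng.getrandbits(32)) for _ in range(s)]
        self.tables = [[0] * b for _ in range(s)]

    def _bucket(self, j, i):
        return hash((self.salts[j][0], i)) % self.b

    def _sign(self, j, i):
        return 1 if hash((self.salts[j][1], i)) & 1 else -1

    def update(self, i, v):
        """Process a stream record (pos, i, v): add v to item i's frequency."""
        for j in range(self.s):
            self.tables[j][self._bucket(j, i)] += v * self._sign(j, i)

    def estimate(self, i):
        """Point query: the median over tables of the signed bucket contents."""
        return median(self.tables[j][self._bucket(j, i)] * self._sign(j, i)
                      for j in range(self.s))
```

The COUNT-MIN structure reviewed below differs only in that its buckets store plain (unsigned) sums and the estimate is the median of the bucket counters.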

The space versus accuracy guarantee of the COUNTSKETCH algorithm is presented in Theorem 1.

Theorem 1 ([8]). Let Δ = Δ(k, 8k). Then, for each i ∈ [1, n], Pr{ |f̂_i − f_i| ≤ Δ } ≥ 1 − δ. The space used is O(k (log(1/δ))(log F_1)) bits and the time taken to process a stream update is O(log(1/δ)).

COUNT-MIN sketch for estimating frequency. The COUNT-MIN algorithm [10] for estimating frequencies keeps a collection of s = O(log(1/δ)) independent hash tables T_1, T_2, ..., T_s, where each hash table T_j has b = 2k buckets and uses a pair-wise independent hash function h_j : [1, n] → {1, ..., 2k}, for j = 1, 2, ..., s. The bucket T_j[r] is an integer counter that maintains the sum T_j[r] = Σ_{i: h_j(i)=r} f_i. The estimated frequency f̂_i is obtained as f̂_i = median_{j=1,...,s} T_j[h_j(i)]. The space versus accuracy guarantee for the COUNT-MIN algorithm is given in terms of the quantity F_1^{res}(k) = Σ_{r>k} |f_{rank(r)}|.

Theorem 2 ([10]). For each i ∈ [1, n], Pr{ |f̂_i − f_i| ≤ F_1^{res}(k)/k } ≥ 1 − δ. The space used is O(k (log(1/δ))(log F_1)) bits and the time taken to process each stream update is O(log(1/δ)).

Estimating F_2^{res}. The work in [14] presents an algorithm to estimate F_2^{res}(s) to within an accuracy of (1 ± ɛ) with confidence 1 − δ using space O((s/ɛ^2)(log F_1)(log(n/δ))) bits. The data structure used is identical to the COUNTSKETCH structure. The algorithm basically removes the top-s estimated frequencies from the COUNTSKETCH structure and then estimates F_2. The COUNTSKETCH structure is used to find the top-k items with respect to the absolute values of their estimated frequencies. Let |f̂_{τ_1}| ≥ ... ≥ |f̂_{τ_k}| denote the top-k estimated frequencies. Next, the contributions of these estimates are removed from the structure, that is, U_j[r] := U_j[r] − Σ_{t: h_j(τ_t)=r} f̂_{τ_t} x_{j τ_t}. Subsequently, the FASTAMS algorithm [21], a variant of the original sketch algorithm [1], is used to estimate the second moment as follows: F̂_2^{res} = median_{j=1,...,s} Σ_{r=1}^{8k} (U_j[r])^2. If k = O(ɛ^{-2} s), then |F̂_2^{res} − F_2^{res}| ≤ ɛ F̂_2^{res} [14].

Lemma 1 ([14]). For a given integer k ≥ 1 and 0 < ɛ < 1, there exists an algorithm for update streams that returns an estimate F̂_2^{res}(k) satisfying |F̂_2^{res}(k) − F_2^{res}(k)| ≤ ɛ F_2^{res}(k) with probability 1 − δ, using space O((k/ɛ^2)(log(F_1/δ))(log F_1)) bits.

Estimating F_1^{res}. An argument similar to the one used to estimate F_2^{res}(k) can be applied to estimate F_1^{res}(k). We will prove the following property in this subsection.

Lemma 2. For 0 < ɛ ≤ 1, there exists an algorithm for update streams that returns an estimate F̂_1^{res}(k) satisfying Pr{ |F̂_1^{res}(k) − F_1^{res}(k)| ≤ ɛ F_1^{res}(k) } ≥ 3/4. The algorithm uses O(((k log F_1)/ɛ + (log F_1)/ɛ^2)(log n + log(1/ɛ))) bits.

The algorithm is the following. Keep a COUNT-MIN sketch structure with height b, where b is a parameter to be fixed, and width w = O(log(n/δ)). In parallel, we keep a set of s = O(1/ɛ^2) sketches based on a 1-stable distribution [17]. A 1-stable sketch is a linear sketch Y_j = Σ_i f_i z_{j,i}, where the z_{j,i}'s are drawn from a 1-stable distribution, for j = 1, 2, ..., s. As shown by Indyk [17], the random variable Ŷ = median_{j=1,...,s} |Y_j| satisfies Pr{ |Ŷ − F_1| ≤ ɛ F_1 } ≥ 7/8.

The COUNT-MIN structure is used to obtain the top-k elements with respect to the absolute values of their estimated frequencies. Suppose these items are i_1, i_2, ..., i_k with |f̂_{i_1}| ≥ |f̂_{i_2}| ≥ ... ≥ |f̂_{i_k}|, and let I = {i_1, i_2, ..., i_k}. Each 1-stable sketch Y_j is updated to remove the contribution of the estimated frequencies, that is,

Y'_j = Y_j − Σ_{r=1}^k f̂_{i_r} z_{j,i_r}.

Finally, the following value is returned:

F̂_1^{res}(k) = median_{j=1,...,s} |Y'_j|.

We now analyze the algorithm. Let T = T_k = {t_1, t_2, ..., t_k} denote the set of indices of the top-k (true) frequencies, such that |f_{t_1}| ≥ |f_{t_2}| ≥ ... ≥ |f_{t_k}|. Let Δ denote the additive error of the COUNT-MIN estimates at height b, so that |f̂_i − f_i| ≤ Δ with probability 1 − δ/(4n) for each i; by Theorem 2, Δ ≤ F_1^{res}(k)/b.

Lemma 3. F_1^{res}(k) ≤ Σ_{i ∉ I} |f_i| ≤ F_1^{res}(k)(1 + 2k/b).

Proof. Since both T and I are sets of k elements each, |T − I| = |I − T|. Let i ↦ i' be an arbitrary 1-1 map from an element i ∈ T − I to an element i' ∈ I − T. Since i is a top-k frequency and i' is not, |f_{i'}| ≤ |f_i|. Further, |f̂_{i'}| ≥ |f̂_i|, since otherwise i would be among the top-k items with respect to estimated frequencies, that is, i would be in I, contradicting i ∈ T − I. The condition |f̂_{i'}| ≥ |f̂_i| implies the following:

|f_i| − Δ ≤ |f̂_i| ≤ |f̂_{i'}| ≤ |f_{i'}| + Δ, with probability 1 − δ/(4n),

or, |f_i| ≤ |f_{i'}| + 2Δ, with probability 1 − δ/(4n). Therefore, |f_{i'}| ≤ |f_i| ≤ |f_{i'}| + 2Δ.

Thus, using a union bound over all items in [1, n], we have, with probability 1 − δ/2,

Σ_{i ∈ T−I} |f_i| ≤ Σ_{i ∈ I−T} (|f_i| + 2Δ) = Σ_{i ∈ I−T} |f_i| + 2kΔ.   (1)

Let G = Σ_{i ∉ I} |f_i| = Σ_{i ∉ (I ∪ T)} |f_i| + Σ_{i ∈ T−I} |f_i|. By (1), it follows that

G ≤ Σ_{i ∉ (I ∪ T)} |f_i| + Σ_{i ∈ I−T} |f_i| + 2kΔ.

Since Σ_{i ∉ (I ∪ T)} |f_i| + Σ_{i ∈ I−T} |f_i| = F_1^{res}(k), we have

F_1^{res}(k) ≤ G ≤ F_1^{res}(k) + 2kΔ.

By Theorem 2, Δ ≤ F_1^{res}(b)/b ≤ F_1^{res}(k)/b. Therefore,

F_1^{res}(k) ≤ G ≤ F_1^{res}(k)(1 + 2k/b), with probability 1 − δ/2.

Proof (of Lemma 2). Let f' denote the frequency vector after the estimated frequencies are removed, that is,

f'_r = f_r if r ∈ [1, n] − I, and f'_r = f_r − f̂_r if r ∈ I.

Let F̃_1 denote Σ_i |f'_i|. Then, by Lemma 3, it follows that

F̃_1 = Σ_{r ∉ I} |f_r| + Σ_{i ∈ I} |f_i − f̂_i| ≤ G + kΔ ≤ F_1^{res}(k)(1 + 3k/b),

since Δ ≤ F_1^{res}(b)/b ≤ F_1^{res}(k)/b. Further,

F̃_1 ≥ Σ_{r ∉ I} |f_r| = G ≥ F_1^{res}(k) ≥ F_1^{res}(k)(1 − k/b).

Combining,

F_1^{res}(k)(1 − k/b) ≤ F̃_1 ≤ F_1^{res}(k)(1 + 3k/b)

with probability 1 − 2^{−Ω(w)}, where w is the width of the COUNT-MIN structure. By construction, the 1-stable sketches return F̂_1^{res}(k) = median_{j=1,...,s} |Y'_j| satisfying Pr{ |F̂_1^{res}(k) − F̃_1| ≤ ɛ F̃_1 } ≥ 7/8. Therefore, using a union bound,

|F̂_1^{res}(k) − F_1^{res}(k)| ≤ (3k/b) F_1^{res}(k) + ɛ F_1^{res}(k)(1 + 3k/b), with probability at least 7/8 − n 2^{−Ω(w)}.

If b ≥ 3k/ɛ, then |F̂_1^{res}(k) − F_1^{res}(k)| ≤ 3ɛ F_1^{res}(k) with probability at least 3/4. Replacing ɛ by ɛ/3 yields the first statement of the lemma.

The space requirement is calculated as follows. We use the idea of a pseudo-random generator of Indyk [17] for streaming computations. The state of the data structure (i.e., the stable sketches) is the same as if the items arrived in sorted order. Therefore, as observed by Indyk [17], Nisan's pseudo-random generator [19] can be used to simulate the randomized computation using O(S log R) random bits, where S is the space used by the algorithm (not counting the random bits used) and R is the running time of the algorithm. We apply Indyk's observation to the portion of the algorithm that estimates F̃_1. Thus, S = O((log F_1)/ɛ^2) and the running time is R = O(F_0/ɛ^2), assuming the input is presented in sorted order of items. Here, F_0 is the number of distinct items in the stream with non-zero frequency. Then the total number of random bits required is O(S log R) = O(((log F_1)/ɛ^2)(log n + log(1/ɛ))).

3 The HSS algorithm

In this section, we present a procedure for obtaining a representative sample over the input stream, which we refer to as Hierarchical Sampling over Sketches (HSS), and use it for estimating a class of metrics over data streams of the following form:

Ψ(S) = Σ_{i: f_i ≠ 0} ψ(|f_i|).   (2)

Sampling sub-streams. The HSS algorithm uses a sampling scheme as follows. From the input stream S, we create sub-streams S_0, ..., S_L such that S_0 = S and, for 1 ≤ l ≤ L, S_l is obtained from S_{l−1} by sub-sampling each distinct item appearing in S_{l−1} independently with probability 1/2. At level 0, S_0 = S. The stream S_1, corresponding to level 1, is obtained by choosing each distinct value of i independently with probability 1/2. Since the sampling is based on the identity of an item i, either all records in S with identity i are present in S_1, or none are; each of these cases holds with probability 1/2. The stream S_2,

corresponding to level 2, is obtained by sampling each distinct value of i appearing in the sub-stream S_1 with probability 1/2, independently of the other items in S_1. In this manner, S_l is a randomly sampled sub-stream of S_{l−1}, for l ≥ 1, based on the identity of the items.

The sub-sampling scheme is implemented as follows. We assume that n is a power of 2. Let h : [1, n] → [0, max(n^2, W)] be a random hash function drawn from a pair-wise independent hash family, where W ≥ 2, and let L_max = log(max(n^2, W)). Define the random function level : [1, n] → [1, L_max] as follows:

level(i) = 1 if h(i) = 0, and level(i) = lsb(h(i)) otherwise,

where lsb(x) is the position of the least significant 1 in the binary representation of x. The probability distribution of the random level function is as follows:

Pr{level(i) = l} = 1/2 + Pr{h(i) = 0} if l = 1, and 1/2^l for 2 ≤ l ≤ L_max.

All records pertaining to i are included in the sub-streams S_0 through S_{level(i)}. The sampling technique is based on the original idea of Flajolet and Martin [11] for estimating the number of distinct items in a data stream. The Flajolet-Martin scheme maps the original stream into disjoint sub-streams S_1, S_2, ..., S_{log n}, where S_l is the sequence of records of the form (pos, i, v) such that level(i) = l. The HSS technique instead creates a monotonically decreasing sequence of random sub-streams, in the sense that S_0 ⊇ S_1 ⊇ S_2 ⊇ ... ⊇ S_{L_max}, where S_l is the sequence of records for items i such that level(i) ≥ l.

At each level l ∈ {0, 1, ..., L_max}, the HSS algorithm keeps a frequency estimation data structure denoted by D_l, which takes as input the sub-stream S_l and returns an approximation to the frequencies of items that map to S_l. The D_l structure can be any standard data structure such as the COUNT-MIN sketch structure or the COUNTSKETCH structure. We use the COUNT-MIN structure for estimating entropy and the COUNTSKETCH structure for estimating F_p. Each stream update (pos, i, v) belonging to S_l is propagated to the frequent items data structures D_l for 0 ≤ l ≤ level(i). Let k(l) denote a space parameter for the data structure D_l; for example, k(l) is the size of the hash tables in the COUNT-MIN or COUNTSKETCH structures. The values of k(l) are the same for levels l = 1, 2, ..., L_max, and are twice the value for k(0). That is, if k = k(0), then k(1) = ... = k(L_max) = 2k. This non-uniformity is a technicality required by Lemma 4 and Corollary 1. We refer to k = k(0) as the space parameter of the HSS structure.
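As a concrete illustration of this sub-sampling scheme, the following Python sketch assigns levels and routes an update to the per-level structures D_0, ..., D_{level(i)}. A truly random table stands in for the pair-wise independent hash family, and the function and variable names are ours.

```python
import random

def lsb(x):
    """1-based position of the least significant set bit of x > 0."""
    return (x & -x).bit_length()

def make_level_fn(n, W, seed=0):
    """level(i) = 1 if h(i) == 0, else lsb(h(i)), for a random hash
    h : [1, n] -> [0, max(n*n, W)].  (A truly random map stands in for the
    pair-wise independent family used in the paper.)"""
    M = max(n * n, W)
    rng = random.Random(seed)
    h = {i: rng.randint(0, M) for i in range(1, n + 1)}
    return lambda i: 1 if h[i] == 0 else lsb(h[i])

def route_update(i, v, level, structures):
    """Propagate the record (pos, i, v) to D_0, ..., D_{level(i)}.
    `structures` is a list of L_max + 1 frequency-estimation structures,
    e.g. instances of the CountSketch class sketched earlier."""
    for l in range(0, min(level(i), len(structures) - 1) + 1):
        structures[l].update(i, v)
```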

Approximating f_i. Let Δ_l(k) denote the additive error of the frequency estimation by the data structure D_l at level l using space parameter k. That is, we assume that

|f̂_{i,l} − f_i| ≤ Δ_l(k) with probability 1 − 2^{−t},

where t is a parameter and f̂_{i,l} is the estimate for the frequency f_i obtained using the frequent items structure D_l(k). By Theorem 2, if D_l is instantiated by the COUNT-MIN sketch structure with height k and width O(t), then |f̂_{i,l} − f_i| ≤ F_1^{res}(k, l)/k with probability 1 − 2^{−t}. If D_l is instantiated using the COUNTSKETCH structure with height 8k and width O(t), then, by Theorem 1, it follows that |f̂_{i,l} − f_i| ≤ (8 F_2^{res}(k, l)/k)^{1/2} with probability 1 − 2^{−t}. Here, F_1^{res}(k, l) and F_2^{res}(k, l) denote the residual first and second moments, respectively, of the sub-stream S_l.

We first relate the random values F_1^{res}(k, l) and F_2^{res}(k, l) to the corresponding non-random values F_1^{res}(k) and F_2^{res}(k), respectively.

Lemma 4. For l ≥ 1 and k ≥ 2,

Pr{ F_1^{res}(k, l) ≤ F_1^{res}(2^{l−1} k)/2^{l−1} } ≥ 1 − 2e^{−k/6}.

Proof. For an item i ∈ [1, n], define an indicator variable x_i to be 1 if records corresponding to i are included in the stream at level l, namely S_l, and let x_i be 0 otherwise. Then Pr{x_i = 1} = 1/2^l. Define the random variable T_{l,k} as the number of items in [1, n] with rank at most 2^{l−1} k in the original stream S that are included in S_l, that is,

T_{l,k} = Σ_{i: rank(i) ≤ 2^{l−1} k} x_i.

By linearity of expectation, E[T_{l,k}] = 2^{l−1} k / 2^l = k/2. Applying Chernoff's bound to the sum of indicator variables T_{l,k}, we obtain

Pr{T_{l,k} > k} < e^{−k/6}.

Say that the event SPARSE(l) occurs if the element with rank k + 1 in S_l has rank larger than 2^{l−1} k in the original stream S. The argument above shows that Pr{SPARSE(l)} ≥ 1 − e^{−k/6}. Thus,

F_1^{res}(k, l) ≤ Σ_{rank(i) > 2^{l−1} k} |f_i| x_i, assuming SPARSE(l) holds.   (3)

By linearity of expectation,

E[ F_1^{res}(k, l) | SPARSE(l) ] ≤ F_1^{res}(2^{l−1} k) / 2^l.

Suppose u_l is the frequency of the item with rank k + 1 in S_l. Applying Hoeffding's bound to the sum of non-negative random variables in (3), each upper bounded by u_l, we have

Pr{ F_1^{res}(k, l) > 2 E[F_1^{res}(k, l)] | SPARSE(l) } < e^{−E[F_1^{res}(k, l)]/(3 u_l)},

or,

Pr{ F_1^{res}(k, l) > F_1^{res}(2^{l−1} k)/2^{l−1} | SPARSE(l) } < e^{−F_1^{res}(2^{l−1} k)/(3 · 2^l · u_l)}.   (4)

Assuming the event SPARSE(l), it follows that u_l ≤ f_{rank(2^{l−1} k + 1)}, and hence

u_l ≤ (1/(2^{l−2} k)) Σ_{2^{l−2} k < rank(i) ≤ 2^{l−1} k} f_{rank(i)} ≤ F_1^{res}(2^{l−2} k)/(2^{l−2} k),

that is, F_1^{res}(2^{l−2} k) ≥ u_l · 2^{l−2} k. Substituting in (4) and taking the probability of the complement event, we have

Pr{ F_1^{res}(k, l) ≤ F_1^{res}(2^{l−1} k)/2^{l−1} | SPARSE(l) } > 1 − e^{−k/6}.

Since Pr{SPARSE(l)} > 1 − e^{−k/6},

Pr{ F_1^{res}(k, l) ≤ F_1^{res}(2^{l−1} k)/2^{l−1} } ≥ (1 − e^{−k/6}) Pr{SPARSE(l)} ≥ (1 − e^{−k/6})^2 > 1 − 2e^{−k/6}.

Corollary 1. For l ≥ 1, F_2^{res}(k, l) ≤ F_2^{res}(2^{l−1} k)/2^{l−1} with probability at least 1 − 2e^{−k/6}.

Proof. Apply Lemma 4 to the frequency vector obtained by replacing f_i by f_i^2, for i ∈ [1, n].

3.1 Group definitions

Recall that at each level l, the sampled stream S_l is provided as input to a data structure D_l that, when queried, returns an estimate f̂_{i,l} for any i ∈ [1, n] satisfying |f̂_{i,l} − f_i| ≤ Δ_l with probability 1 − 2^{−t}. Here, t is a parameter that will be fixed in the analysis, and the additive error Δ_l is a function of the algorithm used by D_l (e.g., Δ_l = F_1^{res}(k)/(2^{l−1} k) for COUNT-MIN sketches and

Δ_l = (F_2^{res}(k)/(2^{l−1} k))^{1/2} for COUNTSKETCH). Fix a parameter ɛ̄, which will be closely related to the given accuracy parameter ɛ and is chosen depending on the problem; for example, in order to estimate F_p, ɛ̄ is set to ɛ/(4p). Therefore,

f̂_{i,l} = (1 ± ɛ̄) f_i, provided f_i > Δ_l/ɛ̄ and i ∈ S_l, with probability 1 − 2^{−t}.

Define the following event:

GOODEST: |f̂_{i,l} − f_i| < Δ_l, for each i ∈ S_l and l ∈ {0, 1, ..., L}.

By a union bound,

Pr{GOODEST} ≥ 1 − n(L + 1) 2^{−t}.   (5)

Our analysis will be conditioned on the event GOODEST.

Define a sequence of geometrically decreasing thresholds T_0, T_1, ..., T_L as follows:

T_l = T_0/2^l, l = 1, 2, ..., L, and 1/2 < T_L ≤ 1.   (6)

In other words, L = ⌈log T_0⌉. Note that L and L_max are distinct parameters: L_max is a data structure parameter decided prior to the run of the algorithm, whereas L is a dynamic parameter that depends on T_0 and is instantiated at the time of inference. In the next paragraph, we discuss how T_0 is chosen. The threshold values T_l are used to partition the elements of the stream into groups G_0, ..., G_L as follows:

G_0 = {i ∈ S : |f_i| ≥ T_0} and G_l = {i ∈ S : T_l < |f_i| ≤ T_{l−1}}, l = 1, 2, ..., L.

An item i is said to be discovered as frequent at level l provided i maps to S_l and f̂_{i,l} ≥ Q_l, where Q_l, l = 0, 1, 2, ..., L, is a parameter family. The values of Q_l are chosen as follows:

Q_l = T_l (1 − ɛ̄).   (7)

The space parameter k(l) is chosen at level l so that

Δ_0 = Δ_0(k) ≤ ɛ̄ Q_0, and Δ_l = Δ_l(2k) ≤ ɛ̄ Q_l, l = 1, 2, ..., L.   (8)

The choice of T_0. The value of T_0 is a critical parameter for the HSS structure, and its precise choice depends on the problem that is being solved. For example, for estimating F_p, T_0 is chosen as (F̂_2/(ɛ̄(1 − ɛ̄) k))^{1/2}. For estimating the entropy H, it is sufficient to choose T_0 as

F̂_1^{res}(k̄)/(ɛ̄(1 − ɛ̄) k), where k and k̄ are parameters of the estimation algorithm. T_0 must be chosen as small as possible subject to the following property: Δ_l ≤ ɛ̄(1 − ɛ̄) T_0/2^l. Lemma 4 and Corollary 1 show that, for the COUNT-MIN structure and the COUNTSKETCH structure, T_0 can be chosen to be as small as F_1^{res}(k)/k and (F_2^{res}(k)/k)^{1/2}, respectively (up to the factor 1/(ɛ̄(1 − ɛ̄))). Since neither F_1^{res}(k) nor F_2^{res}(k) can be computed exactly in sub-linear space, the algorithms of Lemma 1 and Lemma 2 are used to obtain 1/2-approximations^5 to the corresponding quantities. By replacing k by 2k at each level, it suffices to define T_0 in terms of F̂_1^{res}(k)/(2k) or (F̂_2^{res}(k)/(2k))^{1/2}, respectively.

3.2 Hierarchical samples

Items are sampled and placed into sampled groups Ḡ_0, Ḡ_1, ..., Ḡ_L as follows. The estimated frequency of an item i is defined as f̂_i = f̂_{i,r}, where r is the lowest level such that f̂_{i,r} ≥ Q_r. The sampled groups are defined as follows:

Ḡ_0 = {i : f̂_i ≥ T_0} and Ḡ_l = {i : T_l ≤ f̂_i < T_{l−1} and i ∈ S_l}, 1 ≤ l ≤ L.

The choices of the parameter settings satisfy the following properties. We use the following standard notation: for a, b ∈ R and a < b, (a, b) denotes the open interval defined by the set of points between a and b (end points not included), [a, b] represents the closed interval of points between a and b (both included), and [a, b) and (a, b] respectively represent the two half-open intervals.

Partition a frequency group G_l, for 1 ≤ l ≤ L − 1, into three adjacent sub-regions:

lmargin(G_l) = [T_l, T_l + ɛ̄ Q_l], l = 0, 1, ..., L − 1 (undefined for l = L),
rmargin(G_l) = [Q_{l−1} − ɛ̄ Q_{l−1}, T_{l−1}), l = 1, 2, ..., L (undefined for l = 0),
mid(G_l) = (T_l + ɛ̄ Q_l, Q_{l−1} − ɛ̄ Q_{l−1}), 1 ≤ l ≤ L − 1.

These regions respectively denote the lmargin (left margin), rmargin (right margin) and middle region of the group G_l. An item i is said to belong to one of these regions if its true frequency lies in that region. The middle regions of groups G_0 and G_L are extended to include the right and left margins, respectively. That is,

lmargin(G_0) = [T_0, T_0 + ɛ̄ Q_0) and mid(G_0) = [T_0 + ɛ̄ Q_0, F_1],
rmargin(G_L) = (Q_{L−1} − ɛ̄ Q_{L−1}, T_{L−1}) and mid(G_L) = (0, Q_{L−1} − ɛ̄ Q_{L−1}].

^5 More accurate estimates of F_2^{res} and F_1^{res} can be obtained using Lemma 1 and Lemma 2, but in our applications a constant-factor accuracy suffices.
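The threshold and group structure of Sections 3.1-3.2 is easy to state in code. The sketch below (function names are ours) computes the deterministic thresholds T_l and maps a true frequency to the index of its group G_l; it is meant only to make the definitions concrete, not as part of the streaming algorithm itself.

```python
import math

def thresholds(T0):
    """T_l = T_0 / 2**l for l = 0, ..., L, where L is chosen so that
    1/2 < T_L <= 1 (equation (6)): L = ceil(log2(T_0)) for T_0 > 1."""
    L = max(0, math.ceil(math.log2(T0)))
    return [T0 / 2 ** l for l in range(L + 1)]

def group_index(f, T):
    """Group of a true frequency: G_0 = {|f| >= T_0},
    G_l = {T_l < |f| <= T_{l-1}} for l = 1, ..., L."""
    f = abs(f)
    if f >= T[0]:
        return 0
    for l in range(1, len(T)):
        if T[l] < f <= T[l - 1]:
            return l
    return None   # |f| <= T_L: the item falls below the last threshold
```

For example, with T0 = 20 the thresholds are [20, 10, 5, 2.5, 1.25, 0.625], and group_index(7, thresholds(20)) returns 2.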

Important Convention. For clarity of presentation, from now on the description of the algorithm and the analysis use the frequencies f_i instead of |f_i|. However, the analysis remains unchanged if frequencies are negative and f_i is replaced by |f_i| throughout. The only reason for making this notational convenience is to avoid writing |·| in many places. An equivalent way of viewing this is to assume that the actual frequencies are given by an n-dimensional vector g, and the vector f is defined as the coordinate-wise absolute value of g (i.e., f_i = |g_i| for all i). It is important to note that the HSS technique is only designed to work with functions of the form Σ_{i=1}^n ψ(|g_i|). All results in this paper, and their analysis, hold for general update data streams, where item frequencies can be positive, negative or zero.

We would now like to show that the following properties hold, with probability 1 − 2^{−t} each.

1. Items belonging to the middle region of any G_l may be discovered as frequent, that is, f̂_{i,r} ≥ Q_r, only at a level r ≥ l. Further, f̂_i = f̂_{i,l}, that is, the estimate of its frequency is obtained from level l. These items are never misclassified, that is, if i belongs to some sampled group Ḡ_r, then r = l.
2. Items belonging to the right margin of G_l may be discovered as frequent at level r ≥ l − 1, but not at levels less than l − 1, for l ≥ 1. Such items may be misclassified, but only to the extent that i may be placed in either Ḡ_{l−1} or Ḡ_l.
3. Similarly, items belonging to the left margin of G_l may be discovered as frequent only at levels l or higher. Such items may be misclassified, but only to the extent that i is placed either in Ḡ_l or in Ḡ_{l+1}.

Lemma 5 states these properties formally.

Lemma 5. Let ɛ̄ ≤ 1/6. The following properties hold conditional on the event GOODEST.

1. Suppose i ∈ mid(G_l). Then i is classified into Ḡ_l iff i ∈ S_l, and f̂_i = f̂_{i,l}. If i ∉ S_l, then f̂_i is undefined and i is unclassified.
2. Suppose i ∈ lmargin(G_l), for some l ∈ {0, 1, ..., L − 1}. If i ∉ S_l, then i is not classified into any group. Suppose i ∈ S_l. Then, (1) i is classified into Ḡ_l iff f̂_{i,l} ≥ T_l, and (2) i is classified into Ḡ_{l+1} iff i ∈ S_{l+1} and f̂_{i,l} < T_l. In both cases, f̂_i = f̂_{i,l}.
3. Suppose i ∈ rmargin(G_l), for some l ∈ {1, 2, ..., L}. If i ∉ S_{l−1}, then f̂_i is undefined and i is unclassified. Suppose i ∈ S_{l−1}. Then,
   (a) i is classified into Ḡ_{l−1} iff (1) f̂_{i,l−1} ≥ T_{l−1}, or (2) f̂_{i,l−1} < Q_{l−1} and i ∈ S_l and f̂_{i,l} ≥ T_{l−1}. In case (1), f̂_i = f̂_{i,l−1} and in case (2), f̂_i = f̂_{i,l}.
   (b) i is classified into Ḡ_l iff i ∈ S_l and either (1) f̂_{i,l−1} ≥ Q_{l−1} and f̂_i < T_{l−1}, or (2) f̂_{i,l−1} < Q_{l−1} and f̂_i = f̂_{i,l} < T_{l−1}. In case (1), f̂_i = f̂_{i,l−1} and in case (2), f̂_i = f̂_{i,l}.

Proof. We prove the statements in sequence. Assume that the event GOODEST holds. Let i ∈ mid(G_l). If l = 0 then the statement is obviously true, so we consider l ≥ 1. Suppose i ∈ S_r, for some r < l, and i is discovered as frequent at level r, that is, f̂_{i,r} ≥ Q_r. Since GOODEST holds, f_i ≥ Q_r − Δ_r. Since i ∈ mid(G_l), f_i < Q_{l−1} − ɛ̄ Q_{l−1}. Combining, we have

Q_{l−1} − ɛ̄ Q_{l−1} > f_i ≥ Q_r − Δ_r = Q_r − ɛ̄ Q_r,

which is a contradiction for r ≤ l − 1. Therefore, i is not discovered as frequent at any level r < l. Hence, if i ∉ S_l, i remains unclassified. Now suppose that i ∈ S_l. Since GOODEST holds, f̂_{i,l} ≤ f_i + Δ_l. Since i ∈ mid(G_l), f_i < Q_{l−1} − ɛ̄ Q_{l−1}. Therefore,

f̂_{i,l} < Q_{l−1} − ɛ̄ Q_{l−1} + Δ_l < Q_{l−1},   (9)

since Δ_l = ɛ̄ Q_l = ɛ̄ Q_{l−1}/2. Further, i ∈ mid(G_l) implies that f_i > T_l + ɛ̄ Q_l. Since GOODEST holds,

f̂_{i,l} ≥ f_i − Δ_l > T_l + ɛ̄ Q_l − Δ_l = T_l,   (10)

since Δ_l = ɛ̄ Q_l. Combining (9) and (10), we have T_l < f̂_{i,l} < Q_{l−1} < T_{l−1}. Thus, i is classified into Ḡ_l.

We now consider statement 2) of the lemma. Assume that GOODEST holds. Suppose i ∈ lmargin(G_l), for l ∈ {0, 1, ..., L − 1}. Then, T_l ≤ f_i ≤ T_l + ɛ̄ Q_l = T_l + Δ_l. Suppose r < l. We first show that if i ∈ S_r, then i cannot be discovered as frequent at level r, that is, f̂_{i,r} < Q_r. Assume to the contrary that f̂_{i,r} ≥ Q_r. Since GOODEST holds, we have f̂_{i,r} ≤ f_i + Δ_r. Further, Δ_r = ɛ̄ Q_r and Q_r = (1 − ɛ̄) T_r. Therefore,

T_r (1 − ɛ̄)^2 = Q_r − Δ_r ≤ f_i ≤ T_l + Δ_l = T_l (1 + ɛ̄(1 − ɛ̄)).

Since T_r = T_l 2^{l−r}, this gives

2^{l−r} ≤ (1 + ɛ̄(1 − ɛ̄))/(1 − ɛ̄)^2 < 2, if ɛ̄ ≤ 1/6.

This is a contradiction if l > r. We conclude that i is not discovered as frequent at any level r < l. Therefore, if i ∉ S_l, then i is not classified into any of the Ḡ_r's. Now suppose that

i ∈ S_l. We first show that i is discovered as frequent at level l. Since i ∈ lmargin(G_l), f_i ≥ T_l and hence

f̂_{i,l} ≥ T_l − Δ_l = Q_l/(1 − ɛ̄) − ɛ̄ Q_l > Q_l.   (11)

Thus, i is discovered as frequent at level l. There are two cases, namely, either f̂_{i,l} ≥ T_l or f̂_{i,l} < T_l. In the former case, i is classified into Ḡ_l and f̂_i = f̂_{i,l}. In the latter case, f̂_{i,l} < T_l, the decision regarding the classification is made at the next level l + 1. If i ∉ S_{l+1}, then i remains unclassified. Otherwise, suppose i ∈ S_{l+1}. The estimate f̂_{i,l+1} is ignored in favor of the lower level estimate f̂_{i,l}, which is deemed to be accurate, since it is at least Q_l. By (11), f̂_{i,l} > Q_l > T_{l+1}. By assumption, f̂_{i,l} < T_l. Therefore, i is classified into Ḡ_{l+1}. This proves statement 2) of the lemma. Statement 3) is proved in a similar fashion.

Estimator. The sample is used to compute the estimate Ψ̂. We also define an idealized estimator Ψ̃ that assumes that the frequent items structure is an oracle that does not make errors:

Ψ̂ = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} ψ(f̂_i) 2^l,   Ψ̃ = Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} ψ(f_i) 2^l.   (12)

3.3 Analysis

For i ∈ [1, n] and r ∈ [0, L], define the indicator variable x_{i,r} as follows:

x_{i,r} = 1 if i ∈ S_r and i ∈ Ḡ_r, and x_{i,r} = 0 otherwise.

In this notation, equation (12) can be written as follows:

Ψ̂ = Σ_i ψ(f̂_i) Σ_{r=0}^{L} x_{i,r} 2^r,   Ψ̃ = Σ_i ψ(f_i) Σ_{r=0}^{L} x_{i,r} 2^r.   (13)

Note that, for a fixed i, the family of variables {x_{i,r}} is not independent, since each item belongs to at most one sampled group Ḡ_r (i.e., Σ_{r=0}^{L} x_{i,r} is either 0 or 1).
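The following is a minimal sketch of how the estimate Ψ̂ of equation (12) would be computed from the sampled groups once classification is done; the data layout (a list of dicts mapping items to estimated frequencies) and the function name are assumptions of this illustration, not part of the paper's specification.

```python
def hss_estimate(sampled_groups, psi):
    """Psi_hat = sum_{l=0}^{L} sum_{i in Gbar_l} psi(|fhat_i|) * 2**l,
    where sampled_groups[l] is a dict {item: estimated frequency} for Gbar_l."""
    return sum((2 ** l) * sum(psi(abs(fhat)) for fhat in group.values())
               for l, group in enumerate(sampled_groups))
```

For instance, psi = lambda x: x ** p corresponds to estimating F_p, and psi = lambda x: (x / F1) * math.log(F1 / x) (with F1 the first moment) to estimating the entropy, as in Sections 4 and 5.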

We now prove a basic property of the sampling procedure.

Lemma 6. Let i ∈ G_l.

1. If i ∈ mid(G_l), then (a) Pr{x_{i,l} = 1 | GOODEST} = 1/2^l, and (b) Pr{x_{i,r} = 1 | GOODEST} = 0 for r ≠ l.
2. If 0 ≤ l ≤ L − 1 and i ∈ lmargin(G_l), then (a) Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l+1} = 1 | GOODEST} 2^{l+1} = 1, and (b) Pr{x_{i,r} = 1 | GOODEST} = 0 for r ∈ {0, ..., L} − {l, l + 1}.
3. If 1 ≤ l ≤ L and i ∈ rmargin(G_l), then (a) Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l−1} = 1 | GOODEST} 2^{l−1} = 1, and (b) Pr{x_{i,r} = 1 | GOODEST} = 0 for r ∈ {0, ..., L} − {l − 1, l}.

Proof. We first note that part (b) of each of the three statements of the lemma is a restatement of parts of Lemma 5. For example, suppose i ∈ lmargin(G_l), l < L, and assume that the event GOODEST holds. By Lemma 5, part 2), either i ∈ Ḡ_l or i ∈ Ḡ_{l+1}. Thus, x_{i,r} = 0 if r is neither l nor l + 1. In a similar manner, parts (b) of the other statements of the lemma can be seen as restatements of parts of Lemma 5.

We now consider parts (a) of the statements. Assume that the event GOODEST holds. Suppose i ∈ mid(G_l). The probability that i is sampled into S_l is 1/2^l, by construction of the sampling technique. By Lemma 5, part 1), if i ∈ S_l, then i ∈ Ḡ_l. Therefore, Pr{x_{i,l} = 1 | GOODEST} = 1/2^l.

Suppose i ∈ lmargin(G_l) and l < L. Then, Pr{i ∈ S_l} = 1/2^l and Pr{i ∈ S_{l+1} | i ∈ S_l} = 1/2. By Lemma 5, part 2), (a) i ∈ Ḡ_l, or x_{i,l} = 1, iff i ∈ S_l and f̂_{i,l} ≥ T_l, and (b) i ∈ Ḡ_{l+1}, or x_{i,l+1} = 1, iff i ∈ S_{l+1} and f̂_{i,l} < T_l. Therefore,

Pr{x_{i,l+1} = 1 | i ∈ S_l and GOODEST}
  = Pr{i ∈ S_{l+1} and f̂_{i,l} < T_l | i ∈ S_l and GOODEST}
  = Pr{f̂_{i,l} < T_l | i ∈ S_l and GOODEST} · Pr{i ∈ S_{l+1} | i ∈ S_l and GOODEST and f̂_{i,l} < T_l}.   (14)

We know that Pr{f̂_{i,l} < T_l | i ∈ S_l and GOODEST} = 1 − Pr{x_{i,l} = 1 | i ∈ S_l and GOODEST}. The specific value of the estimate f̂_{i,l} is a function solely of the random bits employed by D_l and the sub-stream S_l. By full independence of the hash function mapping items to levels, we have that

Pr{i ∈ S_{l+1} | i ∈ S_l and GOODEST and f̂_{i,l} < T_l} = Pr{i ∈ S_{l+1} | i ∈ S_l} = 1/2.

Substituting in (14), we have

Pr{x_{i,l+1} = 1 | i ∈ S_l and GOODEST} = (1/2)(1 − Pr{x_{i,l} = 1 | i ∈ S_l and GOODEST}).

By the definition of conditional probabilities (and multiplying by 2),

2 Pr{x_{i,l+1} = 1 | GOODEST} / Pr{i ∈ S_l} = 1 − Pr{x_{i,l} = 1 | GOODEST} / Pr{i ∈ S_l}.

Since Pr{i ∈ S_l} = 1/2^l, we obtain

2^{l+1} Pr{x_{i,l+1} = 1 | GOODEST} = 1 − 2^l Pr{x_{i,l} = 1 | GOODEST},

or,

Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l+1} = 1 | GOODEST} 2^{l+1} = 1.

This proves statement 2) of the lemma. Statement 3), regarding the right margin of G_l, can be proved analogously.

A useful corollary of Lemma 6 is the following.

Lemma 7. For i ∈ [1, n], Σ_{r=0}^{L} E[x_{i,r} | GOODEST] 2^r = 1.

Proof. If i ∈ mid(G_l), then, by Lemma 6, Pr{x_{i,l} = 1 | GOODEST} = 1/2^l and Pr{x_{i,r} = 1 | GOODEST} = 0 if r ≠ l. Therefore,

Σ_{r=0}^{L} E[x_{i,r} | GOODEST] 2^r = Σ_{r=0}^{L} Pr{x_{i,r} = 1 | GOODEST} 2^r = Pr{x_{i,l} = 1 | GOODEST} 2^l = 1.

Suppose i ∈ lmargin(G_l), for 0 ≤ l < L. By Lemma 6, Pr{x_{i,l} = 1 | GOODEST} 2^l + Pr{x_{i,l+1} = 1 | GOODEST} 2^{l+1} = 1 and Pr{x_{i,r} = 1 | GOODEST} = 0 for r ∉ {l, l + 1}. Therefore,

Σ_{r=0}^{L} E[x_{i,r} | GOODEST] 2^r = Σ_{r=0}^{L} Pr{x_{i,r} = 1 | GOODEST} 2^r = 1.

The case i ∈ rmargin(G_l) is proved analogously.

Lemma 8 shows that the expected value of Ψ̃ is Ψ, assuming the event GOODEST holds.

Lemma 8. E[Ψ̃ | GOODEST] = Ψ.

Proof. By (13), Ψ̃ = Σ_i ψ(f_i) Σ_{r=0}^{L} x_{i,r} 2^r. Taking expectation and using linearity of expectation,

E[Ψ̃ | GOODEST] = Σ_i ψ(f_i) Σ_{r=0}^{L} E[x_{i,r} 2^r | GOODEST] = Σ_i ψ(f_i) = Ψ,

since Σ_{r=0}^{L} E[x_{i,r} 2^r | GOODEST] = 1, by Lemma 7.

The following lemma is useful in the calculation of the variance of Ψ̃.

Notation. Let l(i) denote the index of the group G_l such that i ∈ G_l.

Lemma 9. For i ∈ [1, n] with i ∉ G_0 − lmargin(G_0), Σ_{r=0}^{L} E[x_{i,r} 2^{2r} | GOODEST] ≤ 2^{l(i)+1}. If i ∈ G_0 − lmargin(G_0), then Σ_{r=0}^{L} E[x_{i,r} 2^{2r} | GOODEST] = 1.

Proof. We assume that all probabilities and expectations in this proof are conditioned on the event GOODEST; for brevity, we do not write the conditioning event. Let i ∈ mid(G_l). Then x_{i,l} = 1 iff i ∈ S_l, by Lemma 6. Thus,

Pr{x_{i,l} = 1} 2^{2l} = (1/2^l) 2^{2l} = 2^l.

If i ∈ lmargin(G_l), then, by the argument in Lemma 7, Pr{x_{i,l+1} = 1} 2^{l+1} + Pr{x_{i,l} = 1} 2^l = 1. Multiplying by 2^{l+1},

Pr{x_{i,l+1} = 1} 2^{2(l+1)} + Pr{x_{i,l} = 1} 2^{2l} ≤ 2^{l+1}.

Similarly, if i ∈ rmargin(G_l), then Pr{x_{i,l} = 1} 2^l + Pr{x_{i,l−1} = 1} 2^{l−1} = 1, and therefore

Pr{x_{i,l} = 1} 2^{2l} + Pr{x_{i,l−1} = 1} 2^{2(l−1)} ≤ 2^l.

Since l(i) denotes the index of the group G_l to which i belongs, in all cases

Σ_{r=0}^{L} E[x_{i,r} 2^{2r}] ≤ 2^{l(i)+1}.

In particular, if i ∈ G_0 − lmargin(G_0) or if i ∈ mid(G_l), then the above sum is at most 2^{l(i)}; for i ∈ G_0 − lmargin(G_0) it equals 1.

Lemma 10. Var[Ψ̃ | GOODEST] ≤ Σ_{i ∉ G_0 − lmargin(G_0)} ψ^2(f_i) 2^{l(i)+1}.

Proof. We assume that all probabilities and expectations in this proof are conditioned on the event GOODEST; for brevity, the conditioning event is not written explicitly. By (13),

E[Ψ̃^2] = E[(Σ_i ψ(f_i) Σ_{r=0}^{L} x_{i,r} 2^r)^2]
       = E[Σ_i ψ^2(f_i)(Σ_{r=0}^{L} x_{i,r} 2^r)^2] + E[Σ_{i ≠ j} ψ(f_i) ψ(f_j)(Σ_{r_1=0}^{L} x_{i,r_1} 2^{r_1})(Σ_{r_2=0}^{L} x_{j,r_2} 2^{r_2})]
       = E[Σ_i ψ^2(f_i) Σ_{r=0}^{L} x_{i,r}^2 2^{2r}] + E[Σ_i ψ^2(f_i) Σ_{r_1 ≠ r_2} x_{i,r_1} x_{i,r_2} 2^{r_1+r_2}]
         + E[Σ_{i ≠ j} ψ(f_i) ψ(f_j) Σ_{r_1=0}^{L} Σ_{r_2=0}^{L} x_{i,r_1} 2^{r_1} x_{j,r_2} 2^{r_2}].

We note that (a) x_{i,r}^2 = x_{i,r}, (b) an item i is classified into a unique sampled group Ḡ_r, and therefore x_{i,r_1} x_{i,r_2} = 0 for r_1 ≠ r_2, and (c) for i ≠ j, x_{i,r_1} and x_{j,r_2} are assumed to be independent of each other, regardless of the values of r_1 and r_2. Thus,

E[Ψ̃^2] = Σ_i ψ^2(f_i) E[Σ_{r=0}^{L} x_{i,r} 2^{2r}] + Σ_{i ≠ j} E[ψ(f_i) Σ_{r_1=0}^{L} x_{i,r_1} 2^{r_1}] E[ψ(f_j) Σ_{r_2=0}^{L} x_{j,r_2} 2^{r_2}]
        = Σ_i ψ^2(f_i) E[Σ_{r=0}^{L} x_{i,r} 2^{2r}] + Ψ^2 − Σ_i ψ^2(f_i),

since, by Lemma 7, E[ψ(f_i) Σ_{r} x_{i,r} 2^r] = ψ(f_i), and, by Lemma 8, E[Ψ̃] = Ψ. As a result, the expression for Var[Ψ̃] simplifies to

Var[Ψ̃] = E[Ψ̃^2] − (E[Ψ̃])^2
        = Σ_i ψ^2(f_i) (E[Σ_{r=0}^{L} x_{i,r} 2^{2r}] − 1)
        ≤ Σ_{i ∉ G_0 − lmargin(G_0)} ψ^2(f_i) 2^{l(i)+1}, by Lemma 9.

For any subset S ⊆ [1, n], denote by ψ(S) the expression Σ_{i ∈ S} ψ(f_i). Let Ψ_2 = Ψ_2(S) denote Σ_{i=1}^{n} ψ^2(f_i).

Corollary 2. If the function ψ(·) is non-decreasing in the interval [0, T], then

Var[Ψ̃ | GOODEST] ≤ Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)).   (15)

Proof. If the monotonicity condition is satisfied, then ψ(f_i) ≤ ψ(T_{l−1}) for all i ∈ G_l, l ≥ 1, and ψ(f_i) ≤ ψ(T) for i ∈ lmargin(G_0). Therefore, ψ^2(f_i) ≤ ψ(T_{l−1}) ψ(f_i) in the first case and ψ^2(f_i) ≤ ψ(T) ψ(f_i) in the second case. By Lemma 10,

Var[Ψ̃ | GOODEST] ≤ Σ_{l=1}^{L} Σ_{i ∈ G_l} ψ(T_{l−1}) ψ(f_i) 2^{l(i)+1} + Σ_{i ∈ lmargin(G_0)} ψ(T) ψ(f_i)
                 = Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)).

3.4 Error in the estimate

The error incurred by our estimate Ψ̂ is |Ψ̂ − Ψ|, and it can be written as the sum of two error components using the triangle inequality:

|Ψ̂ − Ψ| ≤ |Ψ̃ − Ψ| + |Ψ̂ − Ψ̃| = E_1 + E_2.

Here, E_1 = |Ψ̃ − Ψ| is the error due to sampling and E_2 = |Ψ̂ − Ψ̃| is the error due to the estimation of the frequencies. By Chebychev's inequality,

Pr{ E_1 ≤ 3 (Var[Ψ̃])^{1/2} | GOODEST } ≥ 8/9.

Substituting the expression for Var[Ψ̃] from (15),

Pr{ E_1 ≤ 3 (Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)))^{1/2} | GOODEST } ≥ 8/9.   (16)

We now present an upper bound on E_2. Define a real-valued function π : [1, n] → R as follows:

π_i = Δ_l ψ'(ξ_i(f_i, Δ_l)) if i ∈ G_0 − lmargin(G_0) or i ∈ mid(G_l),
π_i = Δ_l ψ'(ξ_i(f_i, Δ_l)) if i ∈ lmargin(G_l), for some l ∈ {0, ..., L − 1},
π_i = Δ_{l−1} ψ'(ξ_i(f_i, Δ_{l−1})) if i ∈ rmargin(G_l),

where the notation ξ_i(f_i, Δ_l) returns the value of t that maximizes ψ'(t) in the interval [f_i − Δ_l, f_i + Δ_l].

Lemma 11. Assume that GOODEST holds. Then E_2 ≤ Σ_i π_i Σ_{r=0}^{L} x_{i,r} 2^r.

Proof. Assume that the event GOODEST holds. By the triangle inequality,

E_2 ≤ Σ_{l=0}^{L} Σ_{i ∈ Ḡ_l} |ψ(f̂_i) − ψ(f_i)| x_{i,l} 2^l = Σ_i |ψ(f̂_i) − ψ(f_i)| Σ_{r=0}^{L} x_{i,r} 2^r.

Case 1: i ∈ mid(G_l) or i ∈ G_0 − lmargin(G_0). Then i is classified only into group Ḡ_l (with probability 1/2^l, or remains unclassified), and f̂_i = f̂_{i,l}, by Lemma 6. By Taylor's series,

|ψ(f̂_i) − ψ(f_i)| ≤ Δ_l ψ'(ξ_i), where ξ_i = ξ_i(f_i, Δ_l) maximizes ψ'(t) for t ∈ [f_i − Δ_l, f_i + Δ_l].

Case 2: i ∈ lmargin(G_l) and l < L. Then f̂_i = f̂_{i,l}, whether i is classified into Ḡ_l or into Ḡ_{l+1}. Therefore |f̂_i − f_i| ≤ Δ_l and, by Taylor's series, |ψ(f̂_i) − ψ(f_i)| ≤ Δ_l ψ'(ξ_i).

Case 3: i ∈ rmargin(G_l) and l > 0. Then i ∈ Ḡ_{l−1} or i ∈ Ḡ_l, and the estimate is obtained at level l − 1. Similarly, it can be shown that |ψ(f̂_i) − ψ(f_i)| ≤ Δ_{l−1} ψ'(ξ_i).

Adding the contributions of the three cases,

E_2 ≤ Σ_{i ∈ G_0 − lmargin(G_0) or i ∈ mid(G_l)} Δ_l ψ'(ξ_i(f_i, Δ_l)) Σ_{r=0}^{L} x_{i,r} 2^r
     + Σ_{l=1}^{L} Σ_{i ∈ rmargin(G_l)} Δ_{l−1} ψ'(ξ_i(f_i, Δ_{l−1})) Σ_{r=0}^{L} x_{i,r} 2^r
     + Σ_{l=0}^{L−1} Σ_{i ∈ lmargin(G_l)} Δ_l ψ'(ξ_i(f_i, Δ_l)) Σ_{r=0}^{L} x_{i,r} 2^r.

Using the notation of the π_i's, we have E_2 ≤ Σ_i π_i Σ_{r=0}^{L} x_{i,r} 2^r.

To abbreviate the statement of the next few lemmas, we introduce the following notation:

Π_1 = Σ_i π_i,   (17)
Π_2 = 3 (Σ_{i ∉ G_0 − lmargin(G_0)} π_i^2 2^{l(i)+1})^{1/2},   (18)
Λ = 3 (Σ_{l=1}^{L} ψ(T_{l−1}) ψ(G_l) 2^{l+1} + ψ(T) ψ(lmargin(G_0)))^{1/2}.   (19)

Lemma 12. E[E_2 | GOODEST] ≤ Π_1 and Var[E_2 | GOODEST] ≤ Π_2^2/9. Therefore, Pr{E_2 ≤ Π_1 + Π_2 | GOODEST} ≥ 8/9.

Proof. Assume that GOODEST holds and define E'_2 = Σ_i π_i Σ_{l=0}^{L} x_{i,l} 2^l. By Lemma 11, E_2 ≤ E'_2. Applying Lemma 8 and Lemma 10 to E'_2 gives E[E'_2 | GOODEST] ≤ Π_1 and Var[E'_2 | GOODEST] ≤ Π_2^2/9. By Chebychev's inequality, Pr{E'_2 ≤ Π_1 + Π_2 | GOODEST} ≥ 8/9. Thus, Pr{E_2 ≤ Π_1 + Π_2 | GOODEST} ≥ 8/9.

Lemma 13 presents the overall expression of the error and its probability.

Lemma 13. Let ɛ̄ ≤ 1/3 and suppose ψ(·) is a monotonic function in the interval [0, T]. Then

Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 } ≥ (7/9)(1 − n(L + 1) 2^{−t}).

Proof. Combining Lemma 12 and equation (16), and using the notation of equations (17), (18) and (19), we have

Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 | GOODEST } ≥ 7/9.

Since Pr{GOODEST} ≥ 1 − n(L + 1) 2^{−t},

Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 } ≥ Pr{ |Ψ̂ − Ψ| ≤ Λ + Π_1 + Π_2 | GOODEST } Pr{GOODEST} ≥ (7/9)(1 − n(L + 1) 2^{−t}).

Reducing randomness by using a pseudo-random generator. The analysis has assumed that the hash function mapping items to levels is completely independent. A pseudo-random generator can be constructed along the lines of Indyk [17] and Indyk and Woodruff [18] to reduce the required randomness. This is illustrated for each of the two estimations that we consider in the following sections, namely, estimating F_p and estimating H.

4 Estimating F_p for p ≥ 2

In this section, we apply the HSS technique to estimate F_p for p ≥ 2. Let ɛ̄ = ɛ/(4p). For estimating F_p, we use the HSS technique instantiated with the COUNTSKETCH data structure with space parameter k; the value of k will be fixed during the analysis. We use a standard estimator, such as sketches [1] or FASTAMS [21], for estimating F̂_2 to within an accuracy of (1 ± 1/4) with constant confidence, using space O(log F_1) bits. We extend the event GOODEST to include this event, that is,

GOODEST: |F̂_2 − F_2| < F_2/4 and |f̂_{i,l} − f_i| ≤ Δ_l, for all i ∈ [1, n] and l ∈ {0, 1, ..., L}.

The width of the COUNTSKETCH structure at each level is chosen to be s = O(log n + log log F_1) so that Pr{GOODEST} ≥ 8/9. Let F_0 denote the number of distinct items in the data stream, that is, the number of items with non-zero frequency.

Lemma 14. Let ɛ̄ = min(ɛ/(4p), 1/6). Then Π_1 ≤ ɛ F_p.

Proof. Let i ∈ G_l and l > 0. Then π_i ≤ Δ_{l−1} ψ'(ξ(f_i, Δ_{l−1})). Since x^p is monotonic and increasing, we have ξ(f_i, t) = f_i + t, for t > 0. For i ∈ G_l and l > 0, Δ_{l−1} ≤ 2ɛ̄ f_i. Further,

ψ'(ξ(f_i, Δ_{l−1})) ≤ p(f_i + Δ_{l−1})^{p−1} ≤ p f_i^{p−1} (1 + 2ɛ̄)^{p−1} ≤ p f_i^{p−1} e^{2(p−1)/(4p)} ≤ 2p f_i^{p−1}.

Thus,

π_i ≤ Δ_{l−1} ψ'(ξ(f_i, Δ_{l−1})) ≤ 2ɛ̄ f_i · 2p f_i^{p−1} = 4ɛ̄ p f_i^p ≤ ɛ f_i^p, since ɛ̄ ≤ ɛ/(4p).   (20)

For i ∈ G_0, Δ_0 ≤ ɛ̄ T_0 ≤ ɛ̄ f_i. Therefore,

π_i ≤ Δ_0 ψ'(ξ(f_i, Δ_0)) ≤ 2ɛ̄ f_i · p(f_i + Δ_0)^{p−1} ≤ ɛ f_i^p.   (21)

Combining this with (20), we have Π_1 ≤ Σ_i ɛ f_i^p = ɛ F_p.

Lemma 15. Let ɛ̄ ≤ min(ɛ/(4p), 1/6). Then Π_2 ≤ ɛ F_p, provided k ≥ 32 (72)^{2/p} p^2 ɛ^{−2} n^{1−2/p}.

Proof. Π_2 is defined by

(Π_2)^2 = 9 Σ_{i ∉ G_0 − lmargin(G_0)} π_i^2 2^{l(i)+1}.

Suppose i ∈ G_l with l ≥ 1, or i ∈ lmargin(G_0). Then, by (20) and (21), it follows that π_i ≤ ɛ f_i^p. Therefore,

(Π_2)^2 ≤ 9 Σ_{i ∉ G_0 − lmargin(G_0)} ɛ^2 f_i^{2p} 2^{l(i)+1}.   (22)

For i ∈ G_l and l ≥ 1, f_i < T_{l−1} and therefore f_i^{2p} ≤ T_{l−1}^p f_i^p. For i ∈ lmargin(G_0), f_i ≤ T < T_0(1 + ɛ̄) and therefore

f_i^{2p} ≤ T^p f_i^p ≤ T_0^p (1 + ɛ̄)^p f_i^p, for i ∈ lmargin(G_0).

Combining, we have

(Π_2)^2 ≤ 9 ɛ^2 ( T_0^p (1 + ɛ̄)^p Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} T_{l−1}^p Σ_{i ∈ G_l} f_i^p 2^{l+1} ).   (23)

By construction, T_0 = (F̂_2/(ɛ̄(1 − ɛ̄) k))^{1/2}, so that T_0^p ≤ (1 − ɛ̄)^{−p} (2F_2/(ɛ̄^2 k))^{p/2} using |F̂_2 − F_2| < F_2/4, and T_l = T_0/2^l. Substituting in (23), we have

(Π_2)^2 ≤ 9 ɛ^2 (1 − ɛ̄)^{−p} (2F_2/(ɛ̄^2 k))^{p/2} ( (1 + ɛ̄)^p Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} Σ_{i ∈ G_l} f_i^p 2^{l+1}/2^{(l−1)p} ).   (24)

Since p ≥ 2 and l ≥ 1, 2^{l+1}/2^{(l−1)p} ≤ 4. Further, since ɛ̄ ≤ 1/(4p), (1 + ɛ̄)^p ≤ 2 and (1 − ɛ̄)^p > 1/2. Therefore, (24) can be simplified as

(Π_2)^2 ≤ 72 ɛ^2 (2F_2/(ɛ̄^2 k))^{p/2} ( Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} Σ_{i ∈ G_l} f_i^p ) ≤ 72 ɛ^2 (2F_2/(ɛ̄^2 k))^{p/2} F_p,   (25)

since Σ_{i ∈ lmargin(G_0)} f_i^p + Σ_{l=1}^{L} Σ_{i ∈ G_l} f_i^p = F_p − Σ_{i ∈ G_0 − lmargin(G_0)} f_i^p ≤ F_p. We now use the identity

(F_2/F_0)^{1/2} ≤ (F_r/F_0)^{1/r}, or, F_2^{r/2} ≤ F_r F_0^{r/2−1}, for any r ≥ 2.   (26)

Letting r = p, we have F_2^{p/2} ≤ F_p F_0^{p/2−1}. Substituting in (25), we have

(Π_2)^2 ≤ 72 ɛ^2 2^{p/2} F_p F_0^{p/2−1} F_p / (ɛ̄^2 k)^{p/2} ≤ 72 ɛ^2 2^{p/2} F_p^2 (n^{1−2/p}/(ɛ̄^2 k))^{p/2}.

Letting k = 2 (72)^{2/p} n^{1−2/p}/ɛ̄^2 = 32 (72)^{2/p} p^2 n^{1−2/p}/ɛ^2, we have

(Π_2)^2 ≤ ɛ^2 F_p^2, or, Π_2 ≤ ɛ F_p.

Lemma 16. If ɛ̄ ≤ min(ɛ/(4p), 1/6) and k ≥ 32 (72)^{2/p} p^2 ɛ^{−(2+4/p)} n^{1−2/p}, then Λ < ɛ F_p.

Proof. The expression (19) for Λ can be written as follows:

Λ^2/9 = Σ_{i ∈ lmargin(G_0)} T^p f_i^p + Σ_{l=1}^{L} T_{l−1}^p Σ_{i ∈ G_l} f_i^p 2^{l+1}.   (27)

Except for the factor of ɛ^2, the expression on the right-hand side is identical to the right-hand side of (23), for which an upper bound was derived in Lemma 15. Following the same proof and modifying the constants, we obtain that Λ^2 ≤ ɛ^2 F_p^2 if k ≥ 32 (72)^{2/p} p^2 ɛ^{−(2+4/p)} n^{1−2/p}.

Recall that s is the width of the COUNTSKETCH structures kept at each level l = 0, ..., L.

Lemma 17. Suppose p ≥ 2. Let k ≥ 32 (72)^{2/p} p^2 ɛ^{−(2+4/p)} n^{1−2/p}, ɛ̄ = min(ɛ/(4p), 1/6) and s = O(log n + log log F_1). Then Pr{ |F̂_p − F_p| ≤ 3ɛ F_p } ≥ 6/9.

Proof. By Lemma 13,

Pr{ |F̂_p − F_p| ≤ Π_1 + Π_2 + Λ } ≥ (7/9) Pr{GOODEST}.

By Lemma 14, we have Π_1 ≤ ɛ F_p. By Lemma 15, we have Π_2 ≤ ɛ F_p. Finally, by Lemma 16, Λ ≤ ɛ F_p. Therefore, Π_1 + Π_2 + Λ ≤ 3ɛ F_p. By choosing the number s of independent tables in the COUNTSKETCH structure at each level to be O(log n + log log F_1), we have Pr{GOODEST} ≥ 8/9; this includes the error probability of estimating F̂_2 to within a factor of (1 ± 1/4) of F_2. Thus,

Pr{ |F̂_p − F_p| ≤ Π_1 + Π_2 + Λ } ≥ (1 − 2/9) Pr{GOODEST} > 6/9.

The space requirement for the estimation procedure whose guarantees are presented in Lemma 17 can be calculated as follows. There are L = O(log F_1) levels, where each level uses a COUNTSKETCH structure with hash table size k = O(p^2 ɛ^{−(2+4/p)} n^{1−2/p}) and number of independent hash tables s = O(log n + log log F_1). Since each table entry requires O(log F_1) bits, the space required to store the tables is

S = O( (p^2/ɛ^{2+4/p}) n^{1−2/p} (log^2 F_1)(log n + log log F_1) ).

The number of random bits can be reduced from O(n(log n + log F_1)) to O(S log R) using the pseudo-random generator of [17, 18]. Since the final state of the data structure is the same if the input is presented in the order of item identities, R = O(n L s). Therefore, the number of random bits required is O(S(log n + log log F_1)). The expected time required to update the structure for each stream update of the form (pos, i, v) is O(s) = O(log n + log log F_1). This is summarized in the following theorem.

Theorem 3. For every p ≥ 2 and 0 < ɛ ≤ 1, there exists a randomized algorithm that returns an estimate F̂_p satisfying Pr{ |F̂_p − F_p| ≤ ɛ F_p } ≥ 2/3, using space

O( (p^2/ɛ^{2+4/p}) n^{1−2/p} (log F_1)^2 (log n + log log F_1)^2 ).

The expected time to process each stream update is O(log n + log log F_1).

5 Estimating Entropy

In this section, we apply the HSS technique to estimate the entropy H = Σ_{i: f_i ≠ 0} (|f_i|/F_1) log(F_1/|f_i|).
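As a concrete reference for the quantity being estimated, here is a minimal offline computation of the entropy of a frequency vector using the definition above; the function name is an assumption of this sketch, and it is not a streaming method.

```python
import math

def empirical_entropy(freqs):
    """H = sum_{i: f_i != 0} (|f_i|/F_1) * log(F_1/|f_i|), computed exactly
    from a full frequency vector (offline reference computation)."""
    F1 = sum(abs(f) for f in freqs)
    return sum((abs(f) / F1) * math.log(F1 / abs(f)) for f in freqs if f != 0)
```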


More information

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window

A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window A Deterministic Algorithm for Summarizing Asynchronous Streams over a Sliding Window Costas Busch 1 and Srikanta Tirthapura 2 1 Department of Computer Science Rensselaer Polytechnic Institute, Troy, NY

More information

Arthur-Merlin Streaming Complexity

Arthur-Merlin Streaming Complexity Weizmann Institute of Science Joint work with Ran Raz Data Streams The data stream model is an abstraction commonly used for algorithms that process network traffic using sublinear space. A data stream

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018

15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 15-451/651: Design & Analysis of Algorithms September 13, 2018 Lecture #6: Streaming Algorithms last changed: August 30, 2018 Today we ll talk about a topic that is both very old (as far as computer science

More information

CS Communication Complexity: Applications and New Directions

CS Communication Complexity: Applications and New Directions CS 2429 - Communication Complexity: Applications and New Directions Lecturer: Toniann Pitassi 1 Introduction In this course we will define the basic two-party model of communication, as introduced in the

More information

New Characterizations in Turnstile Streams with Applications

New Characterizations in Turnstile Streams with Applications New Characterizations in Turnstile Streams with Applications Yuqing Ai 1, Wei Hu 1, Yi Li 2, and David P. Woodruff 3 1 The Institute for Theoretical Computer Science (ITCS), Institute for Interdisciplinary

More information

A Lower Bound for the Size of Syntactically Multilinear Arithmetic Circuits

A Lower Bound for the Size of Syntactically Multilinear Arithmetic Circuits A Lower Bound for the Size of Syntactically Multilinear Arithmetic Circuits Ran Raz Amir Shpilka Amir Yehudayoff Abstract We construct an explicit polynomial f(x 1,..., x n ), with coefficients in {0,

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

Stochastic Submodular Cover with Limited Adaptivity

Stochastic Submodular Cover with Limited Adaptivity Stochastic Submodular Cover with Limited Adaptivity Arpit Agarwal Sepehr Assadi Sanjeev Khanna Abstract In the submodular cover problem, we are given a non-negative monotone submodular function f over

More information

Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, Morten Houmøller Nygaard,

Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, Morten Houmøller Nygaard, Estimating Frequencies and Finding Heavy Hitters Jonas Nicolai Hovmand, 2011 3884 Morten Houmøller Nygaard, 2011 4582 Master s Thesis, Computer Science June 2016 Main Supervisor: Gerth Stølting Brodal

More information

1 Variance of a Random Variable

1 Variance of a Random Variable Indian Institute of Technology Bombay Department of Electrical Engineering Handout 14 EE 325 Probability and Random Processes Lecture Notes 9 August 28, 2014 1 Variance of a Random Variable The expectation

More information

Approximate counting: count-min data structure. Problem definition

Approximate counting: count-min data structure. Problem definition Approximate counting: count-min data structure G. Cormode and S. Muthukrishhan: An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms 55 (2005) 58-75. Problem

More information

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler Complexity Theory Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität Wien 15 May, 2018 Reinhard

More information

CSE 190, Great ideas in algorithms: Pairwise independent hash functions

CSE 190, Great ideas in algorithms: Pairwise independent hash functions CSE 190, Great ideas in algorithms: Pairwise independent hash functions 1 Hash functions The goal of hash functions is to map elements from a large domain to a small one. Typically, to obtain the required

More information

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181. Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 6. The Polynomial Hierarchy Reinhard Pichler Institut für Informationssysteme Arbeitsbereich DBAI Technische Universität

More information

Efficient Sketches for Earth-Mover Distance, with Applications

Efficient Sketches for Earth-Mover Distance, with Applications Efficient Sketches for Earth-Mover Distance, with Applications Alexandr Andoni MIT andoni@mit.edu Khanh Do Ba MIT doba@mit.edu Piotr Indyk MIT indyk@mit.edu David Woodruff IBM Almaden dpwoodru@us.ibm.com

More information

Computing the Entropy of a Stream

Computing the Entropy of a Stream Computing the Entropy of a Stream To appear in SODA 2007 Graham Cormode graham@research.att.com Amit Chakrabarti Dartmouth College Andrew McGregor U. Penn / UCSD Outline Introduction Entropy Upper Bound

More information

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9

CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 CSC2420 Fall 2012: Algorithm Design, Analysis and Theory Lecture 9 Allan Borodin March 13, 2016 1 / 33 Lecture 9 Announcements 1 I have the assignments graded by Lalla. 2 I have now posted five questions

More information

CS 350 Algorithms and Complexity

CS 350 Algorithms and Complexity CS 350 Algorithms and Complexity Winter 2019 Lecture 15: Limitations of Algorithmic Power Introduction to complexity theory Andrew P. Black Department of Computer Science Portland State University Lower

More information

CS 350 Algorithms and Complexity

CS 350 Algorithms and Complexity 1 CS 350 Algorithms and Complexity Fall 2015 Lecture 15: Limitations of Algorithmic Power Introduction to complexity theory Andrew P. Black Department of Computer Science Portland State University Lower

More information

MATH 31BH Homework 1 Solutions

MATH 31BH Homework 1 Solutions MATH 3BH Homework Solutions January 0, 04 Problem.5. (a) (x, y)-plane in R 3 is closed and not open. To see that this plane is not open, notice that any ball around the origin (0, 0, 0) will contain points

More information

Testing random variables for independence and identity

Testing random variables for independence and identity Testing random variables for independence and identity Tuğkan Batu Eldar Fischer Lance Fortnow Ravi Kumar Ronitt Rubinfeld Patrick White January 10, 2003 Abstract Given access to independent samples of

More information

B669 Sublinear Algorithms for Big Data

B669 Sublinear Algorithms for Big Data B669 Sublinear Algorithms for Big Data Qin Zhang 1-1 2-1 Part 1: Sublinear in Space The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store

More information

Lecture Lecture 9 October 1, 2015

Lecture Lecture 9 October 1, 2015 CS 229r: Algorithms for Big Data Fall 2015 Lecture Lecture 9 October 1, 2015 Prof. Jelani Nelson Scribe: Rachit Singh 1 Overview In the last lecture we covered the distance to monotonicity (DTM) and longest

More information

Jeffrey D. Ullman Stanford University

Jeffrey D. Ullman Stanford University Jeffrey D. Ullman Stanford University 3 We are given a set of training examples, consisting of input-output pairs (x,y), where: 1. x is an item of the type we want to evaluate. 2. y is the value of some

More information

1 Randomized Computation

1 Randomized Computation CS 6743 Lecture 17 1 Fall 2007 1 Randomized Computation Why is randomness useful? Imagine you have a stack of bank notes, with very few counterfeit ones. You want to choose a genuine bank note to pay at

More information

Notes for Lecture 15

Notes for Lecture 15 U.C. Berkeley CS278: Computational Complexity Handout N15 Professor Luca Trevisan 10/27/2004 Notes for Lecture 15 Notes written 12/07/04 Learning Decision Trees In these notes it will be convenient to

More information

Lecture 9: March 26, 2014

Lecture 9: March 26, 2014 COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 9: March 26, 204 Spring 204 Scriber: Keith Nichols Overview. Last Time Finished analysis of O ( n ɛ ) -query

More information

Lecture 5: Hashing. David Woodruff Carnegie Mellon University

Lecture 5: Hashing. David Woodruff Carnegie Mellon University Lecture 5: Hashing David Woodruff Carnegie Mellon University Hashing Universal hashing Perfect hashing Maintaining a Dictionary Let U be a universe of keys U could be all strings of ASCII characters of

More information

Some hard families of parameterised counting problems

Some hard families of parameterised counting problems Some hard families of parameterised counting problems Mark Jerrum and Kitty Meeks School of Mathematical Sciences, Queen Mary University of London {m.jerrum,k.meeks}@qmul.ac.uk September 2014 Abstract

More information

Lecture 4: Hashing and Streaming Algorithms

Lecture 4: Hashing and Streaming Algorithms CSE 521: Design and Analysis of Algorithms I Winter 2017 Lecture 4: Hashing and Streaming Algorithms Lecturer: Shayan Oveis Gharan 01/18/2017 Scribe: Yuqing Ai Disclaimer: These notes have not been subjected

More information

Efficient Approximation for Restricted Biclique Cover Problems

Efficient Approximation for Restricted Biclique Cover Problems algorithms Article Efficient Approximation for Restricted Biclique Cover Problems Alessandro Epasto 1, *, and Eli Upfal 2 ID 1 Google Research, New York, NY 10011, USA 2 Department of Computer Science,

More information

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s

Mining Data Streams. The Stream Model. The Stream Model Sliding Windows Counting 1 s Mining Data Streams The Stream Model Sliding Windows Counting 1 s 1 The Stream Model Data enters at a rapid rate from one or more input ports. The system cannot store the entire stream. How do you make

More information

A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees

A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees A necessary and sufficient condition for the existence of a spanning tree with specified vertices having large degrees Yoshimi Egawa Department of Mathematical Information Science, Tokyo University of

More information

10 Distributed Matrix Sketching

10 Distributed Matrix Sketching 10 Distributed Matrix Sketching In order to define a distributed matrix sketching problem thoroughly, one has to specify the distributed model, data model and the partition model of data. The distributed

More information

Random Feature Maps for Dot Product Kernels Supplementary Material

Random Feature Maps for Dot Product Kernels Supplementary Material Random Feature Maps for Dot Product Kernels Supplementary Material Purushottam Kar and Harish Karnick Indian Institute of Technology Kanpur, INDIA {purushot,hk}@cse.iitk.ac.in Abstract This document contains

More information

Lecture 8: The Goemans-Williamson MAXCUT algorithm

Lecture 8: The Goemans-Williamson MAXCUT algorithm IU Summer School Lecture 8: The Goemans-Williamson MAXCUT algorithm Lecturer: Igor Gorodezky The Goemans-Williamson algorithm is an approximation algorithm for MAX-CUT based on semidefinite programming.

More information

Lecture 10 September 27, 2016

Lecture 10 September 27, 2016 CS 395T: Sublinear Algorithms Fall 2016 Prof. Eric Price Lecture 10 September 27, 2016 Scribes: Quinten McNamara & William Hoza 1 Overview In this lecture, we focus on constructing coresets, which are

More information

Randomized Time Space Tradeoffs for Directed Graph Connectivity

Randomized Time Space Tradeoffs for Directed Graph Connectivity Randomized Time Space Tradeoffs for Directed Graph Connectivity Parikshit Gopalan, Richard Lipton, and Aranyak Mehta College of Computing, Georgia Institute of Technology, Atlanta GA 30309, USA. Email:

More information

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions

Supervised Machine Learning (Spring 2014) Homework 2, sample solutions 58669 Supervised Machine Learning (Spring 014) Homework, sample solutions Credit for the solutions goes to mainly to Panu Luosto and Joonas Paalasmaa, with some additional contributions by Jyrki Kivinen

More information

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S

4/26/2017. More algorithms for streams: Each element of data stream is a tuple Given a list of keys S Determine which tuples of stream are in S Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit

More information

Range-efficient computation of F 0 over massive data streams

Range-efficient computation of F 0 over massive data streams Range-efficient computation of F 0 over massive data streams A. Pavan Dept. of Computer Science Iowa State University pavan@cs.iastate.edu Srikanta Tirthapura Dept. of Elec. and Computer Engg. Iowa State

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

The Algorithmic Aspects of the Regularity Lemma

The Algorithmic Aspects of the Regularity Lemma The Algorithmic Aspects of the Regularity Lemma N. Alon R. A. Duke H. Lefmann V. Rödl R. Yuster Abstract The Regularity Lemma of Szemerédi is a result that asserts that every graph can be partitioned in

More information

Existence of Nash Networks in One-Way Flow Models

Existence of Nash Networks in One-Way Flow Models Existence of Nash Networks in One-Way Flow Models pascal billand a, christophe bravard a, sudipta sarangi b a CREUSET, Jean Monnet University, Saint-Etienne, France. email: pascal.billand@univ-st-etienne.fr

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

arxiv:quant-ph/ v1 29 May 2003

arxiv:quant-ph/ v1 29 May 2003 Quantum Lower Bounds for Collision and Element Distinctness with Small Range arxiv:quant-ph/0305179v1 29 May 2003 Andris Ambainis Abstract We give a general method for proving quantum lower bounds for

More information

Lecture Lecture 25 November 25, 2014

Lecture Lecture 25 November 25, 2014 CS 224: Advanced Algorithms Fall 2014 Lecture Lecture 25 November 25, 2014 Prof. Jelani Nelson Scribe: Keno Fischer 1 Today Finish faster exponential time algorithms (Inclusion-Exclusion/Zeta Transform,

More information

Algorithms for Distributed Functional Monitoring

Algorithms for Distributed Functional Monitoring Algorithms for Distributed Functional Monitoring Graham Cormode AT&T Labs and S. Muthukrishnan Google Inc. and Ke Yi Hong Kong University of Science and Technology We study what we call functional monitoring

More information

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions

Economics 204 Fall 2011 Problem Set 1 Suggested Solutions Economics 204 Fall 2011 Problem Set 1 Suggested Solutions 1. Suppose k is a positive integer. Use induction to prove the following two statements. (a) For all n N 0, the inequality (k 2 + n)! k 2n holds.

More information

Balls and Bins: Smaller Hash Families and Faster Evaluation

Balls and Bins: Smaller Hash Families and Faster Evaluation Balls and Bins: Smaller Hash Families and Faster Evaluation L. Elisa Celis Omer Reingold Gil Segev Udi Wieder May 1, 2013 Abstract A fundamental fact in the analysis of randomized algorithms is that when

More information

Notes for Lecture 14 v0.9

Notes for Lecture 14 v0.9 U.C. Berkeley CS27: Computational Complexity Handout N14 v0.9 Professor Luca Trevisan 10/25/2002 Notes for Lecture 14 v0.9 These notes are in a draft version. Please give me any comments you may have,

More information

Lecture 4: LMN Learning (Part 2)

Lecture 4: LMN Learning (Part 2) CS 294-114 Fine-Grained Compleity and Algorithms Sept 8, 2015 Lecture 4: LMN Learning (Part 2) Instructor: Russell Impagliazzo Scribe: Preetum Nakkiran 1 Overview Continuing from last lecture, we will

More information

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15)

Problem 1: (Chernoff Bounds via Negative Dependence - from MU Ex 5.15) Problem 1: Chernoff Bounds via Negative Dependence - from MU Ex 5.15) While deriving lower bounds on the load of the maximum loaded bin when n balls are thrown in n bins, we saw the use of negative dependence.

More information

Notes on Complexity Theory Last updated: December, Lecture 27

Notes on Complexity Theory Last updated: December, Lecture 27 Notes on Complexity Theory Last updated: December, 2011 Jonathan Katz Lecture 27 1 Space-Bounded Derandomization We now discuss derandomization of space-bounded algorithms. Here non-trivial results can

More information

arxiv:cs/ v1 [cs.cg] 7 Feb 2006

arxiv:cs/ v1 [cs.cg] 7 Feb 2006 Approximate Weighted Farthest Neighbors and Minimum Dilation Stars John Augustine, David Eppstein, and Kevin A. Wortman Computer Science Department University of California, Irvine Irvine, CA 92697, USA

More information

Some Sieving Algorithms for Lattice Problems

Some Sieving Algorithms for Lattice Problems Foundations of Software Technology and Theoretical Computer Science (Bangalore) 2008. Editors: R. Hariharan, M. Mukund, V. Vinay; pp - Some Sieving Algorithms for Lattice Problems V. Arvind and Pushkar

More information

6 Filtering and Streaming

6 Filtering and Streaming Casus ubique valet; semper tibi pendeat hamus: Quo minime credas gurgite, piscis erit. [Luck affects everything. Let your hook always be cast. Where you least expect it, there will be a fish.] Publius

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Testing Monotone High-Dimensional Distributions

Testing Monotone High-Dimensional Distributions Testing Monotone High-Dimensional Distributions Ronitt Rubinfeld Computer Science & Artificial Intelligence Lab. MIT Cambridge, MA 02139 ronitt@theory.lcs.mit.edu Rocco A. Servedio Department of Computer

More information

arxiv: v2 [cs.ds] 3 Oct 2017

arxiv: v2 [cs.ds] 3 Oct 2017 Orthogonal Vectors Indexing Isaac Goldstein 1, Moshe Lewenstein 1, and Ely Porat 1 1 Bar-Ilan University, Ramat Gan, Israel {goldshi,moshe,porately}@cs.biu.ac.il arxiv:1710.00586v2 [cs.ds] 3 Oct 2017 Abstract

More information

Multi-Linear Formulas for Permanent and Determinant are of Super-Polynomial Size

Multi-Linear Formulas for Permanent and Determinant are of Super-Polynomial Size Multi-Linear Formulas for Permanent and Determinant are of Super-Polynomial Size Ran Raz Weizmann Institute ranraz@wisdom.weizmann.ac.il Abstract An arithmetic formula is multi-linear if the polynomial

More information

CS 161: Design and Analysis of Algorithms

CS 161: Design and Analysis of Algorithms CS 161: Design and Analysis of Algorithms Greedy Algorithms 3: Minimum Spanning Trees/Scheduling Disjoint Sets, continued Analysis of Kruskal s Algorithm Interval Scheduling Disjoint Sets, Continued Each

More information

Notes on Logarithmic Lower Bounds in the Cell Probe Model

Notes on Logarithmic Lower Bounds in the Cell Probe Model Notes on Logarithmic Lower Bounds in the Cell Probe Model Kevin Zatloukal November 10, 2010 1 Overview Paper is by Mihai Pâtraşcu and Erik Demaine. Both were at MIT at the time. (Mihai is now at AT&T Labs.)

More information

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond Measure Theory on Topological Spaces Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond May 22, 2011 Contents 1 Introduction 2 1.1 The Riemann Integral........................................ 2 1.2 Measurable..............................................

More information

Lecture 1 : Probabilistic Method

Lecture 1 : Probabilistic Method IITM-CS6845: Theory Jan 04, 01 Lecturer: N.S.Narayanaswamy Lecture 1 : Probabilistic Method Scribe: R.Krithika The probabilistic method is a technique to deal with combinatorial problems by introducing

More information

Lecture 3 Frequency Moments, Heavy Hitters

Lecture 3 Frequency Moments, Heavy Hitters COMS E6998-9: Algorithmic Techniques for Massive Data Sep 15, 2015 Lecture 3 Frequency Moments, Heavy Hitters Instructor: Alex Andoni Scribes: Daniel Alabi, Wangda Zhang 1 Introduction This lecture is

More information

Rank Determination for Low-Rank Data Completion

Rank Determination for Low-Rank Data Completion Journal of Machine Learning Research 18 017) 1-9 Submitted 7/17; Revised 8/17; Published 9/17 Rank Determination for Low-Rank Data Completion Morteza Ashraphijuo Columbia University New York, NY 1007,

More information

Resolution Lower Bounds for the Weak Pigeonhole Principle

Resolution Lower Bounds for the Weak Pigeonhole Principle Electronic Colloquium on Computational Complexity, Report No. 21 (2001) Resolution Lower Bounds for the Weak Pigeonhole Principle Ran Raz Weizmann Institute, and The Institute for Advanced Study ranraz@wisdom.weizmann.ac.il

More information

Linear Discrimination Functions

Linear Discrimination Functions Laurea Magistrale in Informatica Nicola Fanizzi Dipartimento di Informatica Università degli Studi di Bari November 4, 2009 Outline Linear models Gradient descent Perceptron Minimum square error approach

More information