Notes on Frequency Estimation in Data Streams

In (one of) the data streaming model(s), the data is a sequence of arrivals a_1, a_2, ..., a_m of the form a_j = (i, v), where i is the identity of the item and belongs to the domain {1, ..., n}, and v is the change in the frequency of the item: if v ≥ 1 then the meaning is v additions of item i, and if v ≤ -1 then the meaning is |v| deletions of item i. The goal is to compute some function while using space that is sublinear in the length of the stream. This is relevant both when the data is literally obtained as a long stream of signals, where the stream is too long to keep in memory, and when the data resides on some external device and reading it in one pass is much more efficient than allowing random access.

A natural special case is that v = +1 for every element. In this case the stream is simply a sequence of items (with repetitions) a_j = i for i ∈ {1, ..., n}.

One of the first problems studied in this model (with the special case of single additions) is computing frequency moments. Namely, let m_i = |{j : a_j = i}| denote the number of occurrences of i in the stream. Then for each k ≥ 0 we define

    F_k = \sum_{i=1}^{n} (m_i)^k.    (1)

In particular, F_1 equals m, the length of the sequence; F_0 is the number of distinct elements appearing in the sequence (since if m_i > 0 then m_i^0 = 1 and if m_i = 0 then m_i^0 = 0); and F_2 is the repeat rate, or Gini's index of homogeneity, needed in order to compute the surprise index of the sequence. Finally, for k = ∞ we define

    F_∞ = \max_{1 ≤ i ≤ n} m_i.    (2)

Given an approximation parameter ε and a confidence parameter δ, the algorithm should compute an estimate \hat{F}_k such that the probability that |\hat{F}_k - F_k| > ε F_k is at most δ.
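
Before turning to sublinear-space algorithms, it may help to fix the target quantities with a tiny, non-streaming reference computation (it uses linear space, one counter per item); the function name and the toy stream below are purely illustrative.

    from collections import Counter

    def exact_frequency_moment(stream, k):
        # Exact F_k = sum over i of (m_i)^k, using full per-item counts.
        counts = Counter(stream)
        return sum(m ** k for m in counts.values())

    # Toy stream: m_1 = 3, m_2 = 2, m_3 = 1.
    stream = [1, 2, 1, 3, 1, 2]
    print(exact_frequency_moment(stream, 0))   # F_0 = 3 (distinct elements)
    print(exact_frequency_moment(stream, 1))   # F_1 = 6 (length of the stream)
    print(exact_frequency_moment(stream, 2))   # F_2 = 3^2 + 2^2 + 1^2 = 14
    print(max(Counter(stream).values()))       # F_infinity = 3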

What is known?

1. There is a lower bound of n^{1-2/k} (for constant ε and δ), which in particular means that for k ≥ 3 the lower bound is of the form n^α for a constant α (that approaches 1 as k increases).

2. There is a (recent) upper bound whose dependence on n is Õ(n^{1-2/k}), so that it roughly matches the lower bound (the exact expression is O((k^2 log(1/δ) / ε^{2+4/k}) · n^{1-2/k} · log^2 m · (log m + log n))).

3. For the special case of k = 1, clearly the exact value of F_1 = m can be computed using space log m. To get an estimate, O(log log m + log(1/ε)) bits suffice.

4. For the special case of k = 0 it is possible to compute an estimate that is within a factor 1/c and a factor c of F_0 with probability at least 1 - 2/c, where c > 2, using O(log n) bits.

5. For the special case of k = 2 it suffices to use O((log(1/δ)/ε^2) · (log n + log m)) bits.

6. Estimating F_∞ requires space Ω(n) for m = O(n) and constant ε and δ.

7. Randomness is crucial: for k ≠ 1, every deterministic algorithm that computes an estimate of F_k with constant ε must use Ω(n) space.

We shall discuss the original result of Alon et al., whose dependence on n is Õ(n^{1-1/k}) (to be precise: O((k log(1/δ)/ε^2) · n^{1-1/k} · (log n + log m))). If time permits we will talk about some of the special cases.

Assume first that the length of the sequence, m, is known in advance. This assumption is removed later. Let s_1 = (8/ε^2) · k · n^{1-1/k} and s_2 = 2 log(1/δ). The algorithm computes s_2 random variables Y_1, ..., Y_{s_2} and outputs their median (this is a standard technique for going from a constant probability of deviating by more than the allowed amount to only a δ probability that this event occurs, so the interesting part is in defining and analyzing the behavior of the Y_t's). Each Y_t is the average of s_1 random variables X_{t,j}, where 1 ≤ j ≤ s_1. The X_{t,j}'s are independent, identically distributed random variables.

In order to explain how each X_{t,j} = X is distributed, we introduce some notation. For each p ∈ {1, ..., m}, let

    r(p) = |{q : q ≥ p, a_q = a_p}|    (3)

denote the number of occurrences of a_p among the elements in the sequence that follow a_p, including a_p itself (so that r(p) ≥ 1). Next define

    R_k(p) = m · ((r(p))^k - (r(p) - 1)^k).    (4)

Each variable X_{t,j} = X is determined (independently) by selecting an index p ∈ {1, ..., m} uniformly at random and letting X = R_k(p). Note that in order to compute r(p), and hence X = R_k(p), it suffices to use log m bits to select p and count up to p, and then to maintain the log n bits representing a_p, the log m bits representing r(p), and the log m bits representing R_k(p).

By the definition of X (recall that m_i = |{j : a_j = i}|),

    Exp[X] = (1/m) \sum_{p=1}^{m} R_k(p)    (5)
           = (1/m) \sum_{p=1}^{m} m · ((r(p))^k - (r(p) - 1)^k)    (6)
           = \sum_{i=1}^{n} [((m_i)^k - (m_i - 1)^k) + ((m_i - 1)^k - (m_i - 2)^k) + ... + (2^k - 1^k) + (1^k - 0^k)]    (7)
           = \sum_{i=1}^{n} (m_i)^k = F_k.    (8)

Thus we have an unbiased estimator of F_k. What remains to be done is to bound the deviation of the average of the X_{t,j}'s from this correct expected value. (The X_{t,j}'s are independent, so we could apply Chernoff. However, their range is very big, so we would not get a very good bound.) To this end we bound the variance Var[X] = Exp[X^2] - Exp^2[X] and apply Chebyshev:

    Pr[|X - Exp[X]| ≥ t · Var^{1/2}[X]] ≤ 1/t^2,

so that

    Pr[|X - Exp[X]| ≥ T] ≤ Var[X] / T^2.

In order to bound Exp[X^2] we shall use the following inequality, which holds for any pair of numbers a > b > 0:

    a^k - b^k = (a - b)(a^{k-1} + a^{k-2} b + ... + a b^{k-2} + b^{k-1})    (9)
              ≤ (a - b) · k · a^{k-1}.    (10)

(You may be familiar with the special case a^2 - b^2 = (a - b)(a + b).) We use this inequality with a = b + 1, so that a^k - (a - 1)^k ≤ k · a^{k-1}, and get:

    Exp[X^2] = (1/m) \sum_{p=1}^{m} (R_k(p))^2 = m \sum_{i=1}^{n} \sum_{r=1}^{m_i} (r^k - (r-1)^k)^2    (11)
             ≤ m \sum_{i=1}^{n} \sum_{r=1}^{m_i} k · r^{k-1} · (r^k - (r-1)^k)    (12)
             = k · m \sum_{i=1}^{n} [((m_i)^{2k-1} - (m_i)^{k-1}(m_i - 1)^k) + ((m_i - 1)^{2k-1} - (m_i - 1)^{k-1}(m_i - 2)^k) + ... + (2^{2k-1} - 2^{k-1} · 1^k) + 1^{2k-1}]    (13)
             ≤ k · m \sum_{i=1}^{n} (m_i)^{2k-1}    (14)
             = k · m · F_{2k-1} = k · F_1 · F_{2k-1},    (15)

where (14) uses r^{k-1}(r-1)^k ≥ (r-1)^{2k-1}, so that the sum in (13) telescopes to at most (m_i)^{2k-1}. It can be shown (and is given as an exercise) that

    F_1 · F_{2k-1} ≤ n^{1-1/k} (F_k)^2,    (16)

where one uses the inequality ((1/n) \sum_{i=1}^{n} m_i)^k ≤ (1/n) \sum_{i=1}^{n} m_i^k. Therefore,

    Var[X] ≤ Exp[X^2] ≤ k · F_1 · F_{2k-1} ≤ k · n^{1-1/k} · F_k^2,    (17)

and so

    Var[Y_t] = Var[(1/s_1) \sum_{j=1}^{s_1} X_{t,j}] = (1/s_1) Var[X] ≤ k · n^{1-1/k} · F_k^2 / s_1,    (18)

whereas

    Exp[Y_t] = Exp[(1/s_1) \sum_{j=1}^{s_1} X_{t,j}] = Exp[X] = F_k.    (19)

By Chebyshev's inequality,

    Pr[|Y_t - F_k| > ε F_k] ≤ Var[Y_t] / (ε^2 F_k^2) ≤ k · n^{1-1/k} · F_k^2 / (s_1 · ε^2 · F_k^2).    (20)

By our choice of s_1 = (8/ε^2) · k · n^{1-1/k}, this is at most 1/8. As mentioned before, a standard analysis transforms the constant probability of a small deviation for each Y_t into a high probability of a small deviation for their median (given as an exercise).

Dealing with an unknown m. In this case we start computing the random variable X under the assumption that m = 1, so that necessarily a_p = a_1 (and we get that r(p) = 1 and X = 1 · (1^k - 0^k) = 1). If indeed m = 1 the process ends (note that if m = 1 then F_k = 1 for every k). Otherwise, the value of m is updated to 2, and p = 1 is replaced by p = 2 with probability 1/2. In either case, r(p) is modified accordingly. In general, after viewing the first t - 1 items, there is a current choice of p_{t-1} and a corresponding value of r(p_{t-1}). If a new item arrives, the belief for m is changed to t, and p_t is set to t with probability 1/t and remains p_{t-1} with probability 1 - 1/t. In the former case we have that r(p_t) = 1, and in the latter case r(p_t) is r(p_{t-1}) + 1 if a_t = a_{p_t}, and is r(p_{t-1}) otherwise. As in the case that m is known, the algorithm only needs to remember a_{p_t} and r(p_t) at each step, at a cost of O(log n + log m) bits, and flipping a coin with bias 1/t takes O(log m) bits as well.

On the relation between m and n. If m = poly(n) then the factor of (log n + log m) is simply O(log n). When m is very large, then instead of computing r(p) exactly, we can estimate it using log log m + log(1/ε) bits.
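
The following is a minimal, illustrative sketch (in Python) of the above estimator for the single-additions case, combining the median-of-averages scheme with the unknown-m (reservoir-style) update of p and r(p). For readability it keeps all s_1 · s_2 basic estimators in ordinary lists rather than packing each one into O(log n + log m) bits; the function name and parameter choices are ours, not from the notes.

    import random
    from statistics import median

    def ams_fk_estimate(stream, k, s1, s2, seed=None):
        # Each of the s1 * s2 basic estimators keeps a uniformly random stream
        # position via reservoir sampling: on the t-th item it adopts that item
        # with probability 1/t (resetting r to 1), and otherwise increments r
        # whenever the current item equals its sampled item a_p.
        rng = random.Random(seed)
        num = s1 * s2
        sampled = [None] * num   # a_p for each basic estimator
        r = [0] * num            # r(p) for each basic estimator

        m = 0
        for item in stream:      # a single pass over the data
            m += 1
            for e in range(num):
                if rng.random() < 1.0 / m:
                    sampled[e], r[e] = item, 1
                elif item == sampled[e]:
                    r[e] += 1

        # X = m * (r^k - (r-1)^k); Y_t = average of s1 X's; output the median.
        xs = [m * (r[e] ** k - (r[e] - 1) ** k) for e in range(num)]
        ys = [sum(xs[t * s1:(t + 1) * s1]) / s1 for t in range(s2)]
        return median(ys)

    # Illustrative use; in the notes s1 = (8/eps^2) * k * n^(1-1/k) and
    # s2 = 2 log(1/delta).
    stream = [1, 2, 1, 3, 1, 2, 2, 4, 1, 1] * 50
    print(ams_fk_estimate(stream, k=2, s1=200, s2=5, seed=0))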

Improved Estimation of F_2

If we plug k = 2 into the aforementioned expression, we get a dependence on n that grows like Õ(√n). We next show how to get an estimate using only O((log(1/δ)/ε^2) · (log n + log m)) memory bits.

We set s_2 = 2 log(1/δ) as before, and s_1 = 16/ε^2. Here too the output is the median of s_2 random variables Y_1, ..., Y_{s_2}, where each Y_t is the average of X_{t,j} for j = 1, ..., s_1. Each X_{t,j} = X is computed as follows. A central idea is to use a set V = {v_1, ..., v_h} of vectors of length n with +1, -1 entries that are four-wise independent. That is, for every four distinct coordinates 1 ≤ i_1 < i_2 < i_3 < i_4 ≤ n and for every choice of γ_1, ..., γ_4 ∈ {-1, +1}, exactly a (1/16)-fraction of the vectors in V have γ_j in their i_j-th coordinate for every j = 1, ..., 4. (Note that 4-wise independence implies that for each coordinate, half of the vectors in V have +1 in that coordinate and half have -1, and it implies s-wise independence for s = 2 and s = 3.) Such sets, of size only h = O(n^2), not only exist, but it is possible to compute each particular coordinate of any v_p of our choice using O(log n) space.

To compute X, we first select 1 ≤ p ≤ h uniformly at random (this requires O(log n) bits of space). This determines v_p = (β_1, ..., β_n) (where we compute the coordinates of v_p only when we need them). Let Z = \sum_{i=1}^{n} β_i m_i. Computing Z can be done in one pass using O(log n + log m) space: initially Z = 0; for each a_j, j = 1, ..., m, if β_{a_j} = +1 then Z is incremented by 1, and if β_{a_j} = -1 then it is decremented by 1. Computing each β_{a_j} takes O(log n) space, and maintaining Z takes O(log m) space. When the sequence terminates, we set X = Z^2.

As in the proof for general k, we next compute Exp[X] and Var[X]. Before doing so, we make a few observations that follow from the fact that each β_i ∈ {-1, +1} and from the 4-wise independence:

1. For every i, β_i^2 = β_i^4 = 1, while β_i^3 = β_i.

2. For every i ≠ j, Exp[β_i β_j] = (1/4)(+1)(+1) + (1/4)(+1)(-1) + (1/4)(-1)(+1) + (1/4)(-1)(-1) = 0.

3. Similarly, for every three distinct indices i, j, k, Exp[β_i β_j β_k] = 0, and for every four distinct indices i, j, k, l, Exp[β_i β_j β_k β_l] = 0.

Using the first two properties:

    Exp[X] = Exp[Z^2] = Exp[(\sum_{i=1}^{n} β_i m_i)^2]    (21)
           = Exp[\sum_{i,j} β_i β_j m_i m_j]    (22)
           = \sum_{i} (m_i)^2 Exp[β_i^2] + \sum_{i ≠ j} m_i m_j Exp[β_i β_j]    (23)
           = \sum_{i} (m_i)^2 = F_2.    (24)

Similarly (though a bit more tediously), with all sums below over distinct indices:

    Exp[X^2] = Exp[(\sum_{i=1}^{n} β_i m_i)^4]    (25)
             = \sum_{i} (m_i)^4 Exp[β_i^4] + 4 \sum_{i ≠ j} (m_i)^3 m_j Exp[β_i^3 β_j] + 12 \sum_{j < k, i ≠ j,k} (m_i)^2 m_j m_k Exp[β_i^2 β_j β_k]
               + 6 \sum_{i < j} (m_i)^2 (m_j)^2 Exp[β_i^2 β_j^2] + 24 \sum_{i < j < k < l} m_i m_j m_k m_l Exp[β_i β_j β_k β_l]    (26)
             = \sum_{i} (m_i)^4 + 6 \sum_{i < j} (m_i)^2 (m_j)^2.    (27)

It follows that

    Var[X] = Exp[X^2] - (Exp[X])^2    (28)
           = \sum_{i} (m_i)^4 + 6 \sum_{i < j} (m_i)^2 (m_j)^2 - (\sum_{i} (m_i)^2)^2    (29)
           = \sum_{i} (m_i)^4 + 6 \sum_{i < j} (m_i)^2 (m_j)^2 - \sum_{i} (m_i)^4 - 2 \sum_{i < j} (m_i)^2 (m_j)^2    (30)
           = 4 \sum_{i < j} (m_i)^2 (m_j)^2 ≤ 2 F_2^2.    (31)

By Chebyshev, for each 1 ≤ t ≤ s_2,

    Pr[|Y_t - F_2| > ε F_2] ≤ Var[Y_t] / (ε^2 F_2^2) ≤ 2 F_2^2 / (s_1 · ε^2 · F_2^2) = 1/8,    (32)

and we complete the argument as before.
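
As a minimal illustration of this F_2 estimator (again our own sketch, not code from the notes): instead of the explicit O(n^2)-size family V, the coordinate β_i is generated on the fly from a random degree-3 polynomial over a prime field, which gives ±1 signs that are four-wise independent up to an O(1/P) bias; the exact GF(2^d)-based construction is sketched in the last section.

    import random
    from statistics import median

    P = 2_147_483_647  # the Mersenne prime 2^31 - 1

    def fourwise_sign(coeffs, i):
        # A random degree-3 polynomial over Z_P evaluated at i, mapped to +/-1
        # via its low bit (four-wise independent up to an O(1/P) bias).
        c0, c1, c2, c3 = coeffs
        h = (c0 + c1 * i + c2 * i * i + c3 * i * i * i) % P
        return 1 if h & 1 else -1

    def ams_f2_estimate(stream, s1, s2, seed=None):
        # Each basic estimator maintains Z = sum_i beta_i m_i in one pass and
        # reports X = Z^2; the output is the median of s2 averages of s1 X's.
        rng = random.Random(seed)
        num = s1 * s2
        coeffs = [tuple(rng.randrange(P) for _ in range(4)) for _ in range(num)]
        z = [0] * num
        for item in stream:
            for e in range(num):
                z[e] += fourwise_sign(coeffs[e], item)
        xs = [zz * zz for zz in z]
        ys = [sum(xs[t * s1:(t + 1) * s1]) / s1 for t in range(s2)]
        return median(ys)

    # Illustrative use; in the notes s1 = 16/eps^2 and s2 = 2 log(1/delta).
    stream = [1, 2, 1, 3, 1, 2, 2, 4] * 100
    print(ams_f2_estimate(stream, s1=64, s2=5, seed=0))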

Estimating F_0 to within a constant factor

Here we will only give the idea, without the full analysis. Let F = GF(2^d), where d = log n. We view each a_j in the sequence as an element of the field F. To compute an estimate for F_0 (the number of distinct elements in the sequence), the algorithm selects α, β uniformly at random in F. For each a_j, the algorithm computes z_j = z(a_j) = α · a_j + β, and considers the representation of z_j as a d-bit vector z_{j,1}, ..., z_{j,d}. It then sets r_j = r(z_j) to be the largest index such that all r(z_j) rightmost bits of z_j are 0. It maintains R as the maximum over all r_j, and when the sequence terminates it outputs Y = 2^R.

The underlying idea is that for each fixed l ∈ F, z(l) is uniformly distributed in F (over the choice of α and β). That is, for every l, l' ∈ F, Pr_{α,β}[z(l) = l'] = 1/|F|, and so, for any r, the probability that the r rightmost bits of z(l) are 0 is 2^{-r}. Now, if F_0 < 2^r / c, then by Markov's inequality, the probability that any one of the F_0 different values l appearing in the stream gives z(l) with r(z(l)) ≥ r is less than 1/c.

(A few more details: let B denote the subset of elements that appear in the stream, so that we want to estimate |B|. Let F_r denote the set of all elements of F whose r rightmost bits are 0, so that |F_r| = 2^{d-r}. For each element l ∈ F, let X_l be a 0/1 random variable that is 1 if and only if z(l) ∈ F_r. Now, Pr[X_l = 1] = 2^{-r}, so that Exp[\sum_{l ∈ B} X_l] = |B| · 2^{-r}. If |B| < 2^r / c then this expectation is less than 1/c, and so the probability that the sum is at least 1 (i.e., at least c times its expectation) is less than 1/c.)

For the other direction (showing that if F_0 > c · 2^r then the probability that none of the F_0 different l's gives r(z(l)) ≥ r is less than 1/c), one applies Chebyshev, using the fact that for any pair l ≠ l', the probability that both r(z(l)) ≥ r and r(z(l')) ≥ r is 2^{-2r}.
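
A small illustrative sketch of this idea follows (ours, with one substitution that should be flagged: the pairwise-independent map is taken over a prime field Z_P rather than GF(2^d), so "the r rightmost bits are 0" happens only with probability approximately 2^{-r}).

    import random

    def f0_estimate(stream, seed=None):
        # Track the maximum number R of trailing zero bits of z(a) = alpha*a + beta
        # over the stream, and output 2^R as a constant-factor estimate of F_0.
        P = 2_147_483_647                   # prime modulus (assumed > n)
        rng = random.Random(seed)
        alpha = rng.randrange(1, P)
        beta = rng.randrange(P)

        R = 0
        for a in stream:                    # one pass, O(log n) bits of state
            z = (alpha * a + beta) % P
            r = 0
            while z % 2 == 0 and r < 31:    # count trailing zero bits (capped)
                z //= 2
                r += 1
            R = max(R, r)
        return 2 ** R

    # Illustrative use: 1000 distinct items, each repeated 20 times.
    stream = list(range(1, 1001)) * 20
    print(f0_estimate(stream, seed=0))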

Constructing k-wise Independent Sample Spaces

In the estimation of F_2 we built on the existence of a set of n-dimensional ±1 vectors, of size O(n^2), that are 4-wise independent. Here we shall show a general (but slightly weaker) construction of k-wise independent sample spaces of size O(n^k). Here too let F = GF(2^d), where d = log n. We shall actually construct a set of n^k vectors over F^n (if we want to get binary vectors we can take the least significant bit of each coordinate). Let w_1, ..., w_n denote the elements of the field F. For each choice of k elements c_0, ..., c_{k-1} ∈ F we define the vector v^{c_0,...,c_{k-1}} by

    v_i^{c_0,...,c_{k-1}} = \sum_{j=0}^{k-1} c_j w_i^j.

In other words, if we define the (univariate) polynomial p_{c_0,...,c_{k-1}}(x) = \sum_{j=0}^{k-1} c_j x^j, then v_i^{c_0,...,c_{k-1}} = p_{c_0,...,c_{k-1}}(w_i). By construction there are n^k vectors, and each coordinate of any given vector can be computed using O(log n) bits. To see why we get a k-wise independent sample space, consider the n × k Vandermonde matrix M, where M_{i,j} = w_i^j. Then each vector v^{c_0,...,c_{k-1}} is the result of multiplying the matrix M by the vector (c_0, ..., c_{k-1}). Any choice of k rows of M is linearly independent, implying the desired k-wise independence.
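
A minimal sketch of this construction, assuming for concreteness the field GF(2^8) (i.e., n ≤ 256) with the irreducible polynomial x^8 + x^4 + x^3 + x + 1. A single random vector corresponds to one choice of coefficients c_0, ..., c_{k-1}, and any single coordinate can be computed on its own, which is what the streaming algorithms above rely on.

    import random

    IRRED = 0x11B   # x^8 + x^4 + x^3 + x + 1, irreducible over GF(2)

    def gf_mul(a, b):
        # Carry-less multiplication in GF(2^8), reduced modulo IRRED.
        result = 0
        while b:
            if b & 1:
                result ^= a
            b >>= 1
            a <<= 1
            if a & 0x100:
                a ^= IRRED
        return result

    def kwise_coordinate(coeffs, w):
        # Coordinate at field element w: p(w) = c_0 + c_1*w + ... + c_{k-1}*w^{k-1}.
        value, power = 0, 1
        for c in coeffs:
            value ^= gf_mul(c, power)
            power = gf_mul(power, w)
        return value

    # One random vector from a 4-wise independent sample space on n = 16
    # coordinates (the field elements w_i are taken to be 0, 1, ..., n-1).
    rng = random.Random(0)
    coeffs = [rng.randrange(256) for _ in range(4)]      # c_0, ..., c_3
    vector = [kwise_coordinate(coeffs, w) for w in range(16)]
    bits = [x & 1 for x in vector]                       # k-wise independent bits
    print(vector)
    print(bits)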