Estimating Entropy and Entropy Norm on Data Streams

Amit Chakrabarti¹, Khanh Do Ba¹, and S. Muthukrishnan²

¹ Department of Computer Science, Dartmouth College, Hanover, NH 03755, USA
² Department of Computer Science, Rutgers University, Piscataway, NJ 08854, USA

[Support: an NSF CAREER award and Dartmouth College startup funds; work partly done while visiting DIMACS in the REU program, supported by NSF ITR 0220280, DMS 0354600, and a Dean of Faculty Fellowship from Dartmouth College; NSF ITR 0220280 and DMS 0354600.]

Abstract. We consider the problem of computing information theoretic functions such as entropy on a data stream, using sublinear space. Our first result deals with a measure we call the entropy norm of an input stream: it is closely related to entropy but is structurally similar to the well-studied notion of frequency moments. We give a polylogarithmic space one-pass algorithm for estimating this norm under certain conditions on the input stream. We also prove a lower bound that rules out such an algorithm if these conditions do not hold. Our second group of results are for estimating the empirical entropy of an input stream. We first present a sublinear space one-pass algorithm for this problem. For a stream of $m$ items and a given real parameter $\alpha$, our algorithm uses space $\tilde O(m^{2\alpha})$ and provides an approximation of $1/\alpha$ in the worst case and $(1+\varepsilon)$ in "most" cases. We then present a two-pass polylogarithmic space $(1+\varepsilon)$-approximation algorithm. All our algorithms are quite simple.

1 Introduction

Algorithms for computational problems on data streams have been the focus of plenty of recent research in several communities, such as theory, databases and networks [1, 6, 2, 13]. In this model of computation, the input is a stream of items that is too long to be stored completely in memory, and a typical problem involves computing some statistics on this stream. The main challenge is to design algorithms that are efficient not only in terms of running time, but also in terms of space (i.e., memory usage): sublinear space is mandatory and polylogarithmic space is often the goal.

The seminal paper of Alon, Matias and Szegedy [1] considered the problem of estimating the frequency moments of the input stream: if a stream contains $m_i$ occurrences of item $i$ (for $1 \le i \le n$), its $k$th frequency moment is denoted $F_k$ and is defined by $F_k := \sum_{i=1}^n m_i^k$. Alon et al. showed that $F_k$ could be estimated arbitrarily well in sublinear space for all nonnegative integers $k$, and in polylogarithmic (in $m$ and $n$) space for $k \in \{0, 1, 2\}$. Their algorithmic results were subsequently improved by Coppersmith and Kumar [4] and Indyk and Woodruff [10].

In this work, we first consider a somewhat related statistic of the input stream, inspired by the classic information theoretic notion of entropy. We consider the entropy norm of the input stream, denoted $F_H$ and defined by $F_H := \sum_{i=1}^n m_i \lg m_i$.³ We prove (see Theorem 2.2) that $F_H$ can be estimated arbitrarily well in polylogarithmic space provided its value is not too small, a condition that is satisfied if, e.g., the input stream is at least twice as long as the number of distinct items in it. We also prove (see Theorem 2.5) that $F_H$ cannot be estimated well in polylogarithmic space if its value is too small.

³ Throughout this paper lg denotes logarithm to the base 2.

Second, we consider the estimation of entropy itself, as opposed to the entropy norm. Any input stream implicitly defines an empirical probability distribution on the set of items it contains, the probability of item $i$ being $m_i/m$, where $m$ is the length of the stream. The empirical entropy of the stream, denoted $H$, is defined to be the entropy of this probability distribution:

$$H := \sum_{i=1}^n (m_i/m) \lg(m/m_i) = \lg m - F_H/m. \quad (1)$$

An algorithm that computes $F_H$ exactly clearly suffices to compute $H$ as well. However, since we are only able to approximate $F_H$ in the data stream model, we need new techniques to estimate $H$. We prove (see Theorem 3.1) that $H$ can be approximated using sublinear space. Although the space usage is not polylogarithmic in general, our algorithm provides a tradeoff between space and approximation factor and can be tuned to use space arbitrarily close to polylogarithmic. The standard data stream model allows us only one pass over the input. If, however, we are allowed two passes over the input but still restricted to small space, we have an algorithm that approximates $H$ to within a $(1+\varepsilon)$ factor and uses polylogarithmic space.

Both entropy and entropy norm are natural statistics to approximate on data streams. Arguably, entropy-related measures are even more natural than $L_p$ norms or frequency moments $F_k$. In addition, they have direct applications. The quintessential need arises in analyzing IP network traffic at packet level on high speed routers. In monitoring IP traffic, one cares about anomalies. In general, anomalies are hard to define and detect, since there are subtle intrusions, sophisticated dependence amongst network events, and agents gaming the attacks. A number of recent results in the networking community use entropy as an approach [7, 14, 15] to detect sudden changes in network behavior and as an indicator of anomalous events. The rationale is well explained elsewhere, chiefly in Section 2 of [14]. The current research in this area [14, 7, 15] relies on full space algorithms for entropy calculation; this is a serious bottleneck in high speed routers where high speed memory is at a premium. Indeed, this is the bottleneck that motivated data stream algorithms and their applications to IP network analysis [6, 13]. Our small-space algorithms can immediately make entropy estimation at line speed practical on high speed routers.

To the best of our knowledge, our upper and lower bound results for the entropy norm are the first of their kind. Recently Guha, McGregor and Venkatasubramanian [8] considered approximation algorithms for the entropy of a given distribution under various models, including the data stream model. They obtain a $(\frac{e}{e-1} + \varepsilon)$-approximation for the entropy $H$ of an input stream provided $H$ is at least a sufficiently large constant, using space $\tilde O(1/(\varepsilon^2 H))$, where the $\tilde O$-notation hides factors polylogarithmic in $m$ and $n$. Our work shows that $H$ can be $(1+\varepsilon)$-approximated in $\tilde O(1/\varepsilon^2)$ space for $H \ge 1$ (see the remark after Theorem 3.1); our space bounds are independent of $H$. Guha et al. [8] also give a two-pass $(1+\varepsilon)$-approximation algorithm for entropy, using $\tilde O(1/(\varepsilon^2 H))$ space. We do the same using only $\tilde O(1/\varepsilon^2)$ space. Finally, Guha et al. consider the entropy estimation problem in the random streams model, where it is assumed that the items in the input stream are presented in a uniform random order. Under this assumption, they obtain a $(1+\varepsilon)$-approximation using $\tilde O(1/\varepsilon^2)$ space. We study adversarial data stream inputs only.

2 Estimating the Entropy Norm

In this section we present a polylogarithmic space $(1+\varepsilon)$-approximation algorithm for the entropy norm that assumes the norm is sufficiently large, and prove a matching lower bound if the norm is in fact not as large.

2.1 Upper Bound

Our algorithm is inspired by the work of Alon et al. [1]. Their first algorithm, for the frequency moments $F_k$, has the following nice structure to it (some of the terminology is ours). A subroutine computes a basic estimator, which is a random variable $X$ whose mean is exactly the quantity we seek and whose variance is small. The algorithm itself uses this subroutine to maintain $s_1 s_2$ independent basic estimators $\{X_{ij} : 1 \le i \le s_1,\ 1 \le j \le s_2\}$, where each $X_{ij}$ is distributed identically to $X$. It then outputs a final estimator $Y$ defined by

$$Y := \operatorname{median}_{1 \le j \le s_2} \left( \frac{1}{s_1} \sum_{i=1}^{s_1} X_{ij} \right).$$

Lemma 2.1 below, implicit in [1], gives a guarantee on the quality of this final estimator.
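To make the construction concrete, here is a minimal Python sketch of the median-of-means combining step (the names draw_basic, s1 and s2 are illustrative, not from the paper); Lemma 2.1 below justifies the parameter choices:

import statistics

def final_estimator(draw_basic, s1, s2):
    # Y := median over s2 groups of the mean of s1 independent
    # basic estimators per group (median-of-means).
    group_means = [sum(draw_basic() for _ in range(s1)) / s1
                   for _ in range(s2)]
    return statistics.median(group_means)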

Lemma 2.1. Let $\mu := \mathrm{E}[X]$. For any $\varepsilon, \delta \in (0, 1]$, if $s_1 \ge 8\,\mathrm{Var}[X]/(\varepsilon^2 \mu^2)$ and $s_2 = 4 \lg(1/\delta)$, then the above final estimator deviates from $\mu$ by no more than $\varepsilon\mu$ with probability at least $1 - \delta$. The above algorithm can be implemented to use space $O(S \log(1/\delta)\,\mathrm{Var}[X]/(\varepsilon^2 \mu^2))$, provided the basic estimator can be computed using space at most $S$.

Proof. The claim about the space usage is immediate from the structure of the algorithm. Let $Y_j = \frac{1}{s_1}\sum_{i=1}^{s_1} X_{ij}$. Then $\mathrm{E}[Y_j] = \mu$ and $\mathrm{Var}[Y_j] = \mathrm{Var}[X]/s_1 \le \varepsilon^2\mu^2/8$. Applying Chebyshev's Inequality gives us $\Pr[|Y_j - \mu| \ge \varepsilon\mu] \le 1/8$. Now, if fewer than $s_2/2$ of the $Y_j$'s deviate by as much as $\varepsilon\mu$ from $\mu$, then $Y$ must be within $\varepsilon\mu$ of $\mu$. So we upper bound the probability that this does not happen. Define $s_2$ indicator random variables $I_j$, where $I_j = 1$ iff $|Y_j - \mu| \ge \varepsilon\mu$, and let $W = \sum_{j=1}^{s_2} I_j$. Then $\mathrm{E}[W] \le s_2/8$. A standard Chernoff bound (see, e.g., [12, Theorem 4.1]) gives

$$\Pr\big[|Y - \mu| \ge \varepsilon\mu\big] \le \Pr\Big[W \ge \frac{s_2}{2}\Big] \le \left(\frac{e^3}{4^4}\right)^{s_2/8} = \left(\frac{e^3}{4^4}\right)^{\frac{1}{2}\lg(1/\delta)} \le \delta,$$

which completes the proof. □

We use the following subroutine to compute a basic estimator $X$ for the entropy norm $F_H$.

Input stream: $A = \langle a_1, a_2, \ldots, a_m \rangle$, where each $a_i \in \{1, \ldots, n\}$.
1. Choose $p$ uniformly at random from $\{1, \ldots, m\}$.
2. Let $r = |\{q : a_q = a_p,\ p \le q \le m\}|$. Note that $r \ge 1$.
3. Let $X = m\,\big(r \lg r - (r-1)\lg(r-1)\big)$, with the convention that $0 \lg 0 = 0$.

Our algorithm for estimating the entropy norm outputs a final estimator based on this basic estimator, as described above. This gives us the following theorem.

Theorem 2.2. For any $\Delta > 0$, if $F_H \ge m/\Delta$, the above one-pass algorithm can be implemented so that its output deviates from $F_H$ by no more than $\varepsilon F_H$ with probability at least $1 - \delta$, and so that it uses space

$$O\!\left(\frac{\log(1/\delta)}{\varepsilon^2}\,\Delta \log m\,(\log m + \log n)\right).$$

In particular, taking $\Delta$ to be a constant, we have a polylogarithmic space algorithm that works on streams whose $F_H$ is not too small.
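Before the proof, here is a small self-contained sketch of the basic estimator, assuming for illustration that the whole stream fits in memory (the genuinely one-pass version is discussed around Theorem 2.3). Averaging many independent draws should approach the exact $F_H$, matching the unbiasedness computation in the proof below:

import math
import random
from collections import Counter

def lg_term(x):
    # f(x) = x lg x, with the convention 0 lg 0 = 0.
    return x * math.log2(x) if x > 0 else 0.0

def basic_estimator(stream):
    # Choose a position p uniformly; r counts occurrences of a_p
    # from position p to the end of the stream.
    m = len(stream)
    p = random.randrange(m)
    r = sum(1 for q in range(p, m) if stream[q] == stream[p])
    return m * (lg_term(r) - lg_term(r - 1))

def entropy_norm(stream):
    # Exact F_H = sum_i m_i lg m_i, for comparison.
    return sum(lg_term(c) for c in Counter(stream).values())

stream = [1, 1, 1, 2, 2, 3] * 50
est = sum(basic_estimator(stream) for _ in range(20000)) / 20000
print(est, entropy_norm(stream))   # the two values should be close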

Proof. We first check that the expected value of $X$ is indeed the desired quantity:

$$\mathrm{E}[X] = \sum_{v=1}^n \sum_{r=1}^{m_v} \big(r \lg r - (r-1)\lg(r-1)\big) = \sum_{v=1}^n (m_v \lg m_v - 0 \lg 0) = F_H.$$

The approximation guarantee of the algorithm now follows from Lemma 2.1. To bound the space usage, we must bound the variance $\mathrm{Var}[X]$, and for this we bound $\mathrm{E}[X^2]$. Let $f(r) := r \lg r$, with $f(0) := 0$, so that $X$ can be expressed as $X = m(f(r) - f(r-1))$. Then

$$\mathrm{E}[X^2] = m \sum_{v=1}^n \sum_{r=1}^{m_v} \big(f(r) - f(r-1)\big)^2 \le m \cdot \max_{1 \le r \le m} \big|f(r) - f(r-1)\big| \cdot \sum_{v=1}^n \sum_{r=1}^{m_v} \big(f(r) - f(r-1)\big)$$
$$\le m \cdot \sup\{f'(x) : x \in (0, m]\} \cdot F_H \quad (2)$$
$$= m(\lg e + \lg m)\, F_H \quad (3)$$
$$\le \Delta(\lg e + \lg m)\, F_H^2,$$

where (2) follows from the Mean Value Theorem. Thus, $\mathrm{Var}[X]/\mathrm{E}[X]^2 = O(\Delta \lg m)$. Moreover, the basic estimator can be implemented using space $O(\log m + \log n)$: $O(\log m)$ to count $m$ and $r$, and $O(\log n)$ to store the value of $a_p$. Plugging these bounds into Lemma 2.1 yields the claimed upper bound on the space of our algorithm. □

Let $F_0$ denote the number of distinct items in the input stream (this notation deliberately coincides with that for frequency moments). Let $f(x) := x \lg x$ as used in the proof above. Observe that $f$ is convex on $(0, \infty)$, whence, via Jensen's inequality, we obtain

$$F_H = F_0 \sum_{v=1}^n \frac{f(m_v)}{F_0} \ge F_0\, f\!\left(\frac{1}{F_0}\sum_{v=1}^n m_v\right) = m \lg\frac{m}{F_0}. \quad (4)$$

Thus, if the input stream satisfies $m \ge 2F_0$ (or the simpler, but stronger, condition $m \ge 2n$), then we have $F_H \ge m$. As a direct corollary of Theorem 2.2 (for $\Delta = 1$) we obtain a $(1+\varepsilon)$-approximation algorithm for the entropy norm in space $O((\log(1/\delta)/\varepsilon^2) \log m\,(\log m + \log n))$. However, we can do slightly better.

Theorem 2.3. If $m \ge 2F_0$ then the above one-pass, $(1+\varepsilon)$-approximation algorithm can be implemented in space

$$O\!\left(\frac{\log(1/\delta)}{\varepsilon^2} \log m \log n\right)$$

without a priori knowledge of the stream length $m$.
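The proof below notes that each basic estimator can be maintained without prior knowledge of $m$. A minimal single-estimator sketch of one standard way to do this, a reservoir-style update in the spirit of the implementation of Alon et al. [1] (names illustrative; the full algorithm runs $s_1 s_2$ such estimators in parallel):

import math
import random

class BasicEstimator:
    # Maintains one basic estimator for F_H in a single pass,
    # without knowing the stream length m in advance.
    def __init__(self):
        self.m = 0        # items seen so far
        self.item = None  # currently sampled item a_p
        self.r = 0        # occurrences of a_p from position p onward

    def update(self, a):
        self.m += 1
        # Replace the sampled position with probability 1/m, so that it
        # stays uniformly distributed over all positions seen so far.
        if random.randrange(self.m) == 0:
            self.item, self.r = a, 1
        elif a == self.item:
            self.r += 1

    def estimate(self):
        f = lambda x: x * math.log2(x) if x > 0 else 0.0
        return self.m * (f(self.r) - f(self.r - 1))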

Proof. We follow the proof of Theorem 2.2 up to the bound (3) to obtain $\mathrm{Var}[X] \le (2m \lg m) F_H$, for large enough $m$. We now make the following claim:

$$\lg m \le \lg(m/F_0) \cdot 2\max\{\lg F_0,\, 1\}. \quad (5)$$

Assuming the truth of this claim and using (4), we obtain

$$\mathrm{Var}[X] \le (2m \lg m)\, F_H \le \frac{2\lg m}{\lg(m/F_0)}\, F_H^2 \le 4\max\{\lg F_0,\, 1\}\, F_H^2 \le (4\lg n)\, F_H^2.$$

Plugging this into Lemma 2.1 and proceeding as before, we obtain the desired space upper bound. Note that we no longer need to know $m$ before starting the algorithm, because the number of basic estimators used by the algorithm is now independent of $m$. Although maintaining each basic estimator seems, at first, to require prior knowledge of $m$, a careful implementation can avoid this, as shown by Alon et al. [1].

We turn to proving our claim (5). We will need the assumption $m \ge 2F_0$. If $m \le F_0^2$, then $\lg m \le 2\lg F_0 = 2\lg F_0 \cdot \lg(2F_0/F_0) \le 2\lg F_0 \cdot \lg(m/F_0)$ and we are done. On the other hand, if $m \ge F_0^2$, then $F_0 \le m^{1/2}$, so that $\lg(m/F_0) \ge \lg m^{1/2} = (1/2)\lg m$ and we are done as well. □

Remark 2.4. Theorem 2.2 generalizes to estimating quantities of the form $\hat\mu = \sum_{v=1}^n \hat f(m_v)$, for any monotone increasing (on integer values), differentiable function $\hat f$ that satisfies $\hat f(0) = 0$. Assuming $\hat\mu \ge m/\Delta$, it gives us a one-pass $(1+\varepsilon)$-approximation algorithm that uses $\tilde O(\Delta \hat f'(m))$ space. For instance, this space usage is polylogarithmic in $m$ if $\hat f(x) = x\,\mathrm{polylog}(x)$.

2.2 Lower Bound

The following lower bound shows that the algorithm of Theorem 2.2 is optimal, up to factors polylogarithmic in $m$ and $n$.

Theorem 2.5. Suppose $m$, $\Delta$ and $c$ are integers with $4 \le \Delta = o(m)$ and $0 \le c \le m/\Delta$. On input streams of size at most $m$, a randomized algorithm able to distinguish between $F_H \le 2c$ and $F_H \ge c + 2m/\Delta$ must use space at least $\Omega(\Delta)$. In particular, the upper bound in Theorem 2.2 is tight in its dependence on $\Delta$.

Proof. We present a reduction from the classic problem of (two-party) Set Disjointness in communication complexity [11]. Suppose Alice has a subset $X$ and Bob a subset $Y$ of $\{1, 2, \ldots, \Delta - 1\}$, such that $X$ and $Y$ either are disjoint or intersect at exactly one point. Let us define the mapping

$$\varphi : x \mapsto \left\{ \frac{(m - 2c)x}{\Delta} + i \;:\; i \in \mathbb{Z},\ 0 \le i < \frac{m - 2c}{\Delta} \right\}.$$

Alice creates a stream $A$ by listing all elements in $\bigcup_{x \in X} \varphi(x)$ and concatenating the $c$ special elements $m+1, \ldots, m+c$. Similarly, Bob creates a stream $B$ by listing all elements in $\bigcup_{y \in Y} \varphi(y)$ and concatenating the same $c$ special elements $m+1, \ldots, m+c$. Now, Alice can process her stream (with the hypothetical entropy norm estimation algorithm) and send over her memory contents to Bob, who can then finish the processing. Note that the length of the combined stream $A \circ B$ is at most $2c + |X \cup Y| \cdot ((m-2c)/\Delta)$. We now show that, based on the output of the algorithm, Alice and Bob can tell whether or not $X$ and $Y$ intersect. Since the Set Disjointness problem has communication complexity $\Omega(\Delta)$, we get the desired space lower bound.

Suppose $X$ and $Y$ are disjoint. Then the items in $A \circ B$ are all distinct except for the $c$ special elements, which appear twice each. So $F_H(A \circ B) = c\,(2 \lg 2) = 2c$. Now suppose $X \cap Y = \{z\}$. Then the items in $A \circ B$ are all distinct except for the $\lfloor (m-2c)/\Delta \rfloor$ elements in $\varphi(z)$ and the $c$ special elements, each of which appears twice. So $F_H(A \circ B) = 2(c + \lfloor (m-2c)/\Delta \rfloor) \ge c + 2m/\Delta$, since $\Delta \ge 4$. □

Remark 2.6. Notice that the above theorem rules out even a polylogarithmic space constant factor approximation to $F_H$ that can work on streams with small $F_H$. This can be seen by setting $\Delta = m^\gamma$ for some constant $\gamma > 0$.

3 Estimating the Empirical Entropy

We now turn to the estimation of the empirical entropy $H$ of a data stream, defined as in equation (1): $H = \sum_{i=1}^n (m_i/m) \lg(m/m_i)$. Although $H$ can be computed exactly from $F_H$, as shown in (1), a $(1+\varepsilon)$-approximation of $F_H$ can yield a poor estimate of $H$ when $H$ is small (sublinear in its maximum value, $\lg m$). We therefore present a different sublinear space, one-pass algorithm that directly computes entropy.

Our data structure takes a user parameter $\alpha > 0$ and consists of three components. The first (A1) is a sketch in the manner of Section 2, with basic estimator

$$X = m\left(\frac{r}{m}\lg\frac{m}{r} - \frac{r-1}{m}\lg\frac{m}{r-1}\right), \quad (6)$$

and a final estimator derived from this basic estimator using $s_1 = (8/\varepsilon^2)\, m^{2\alpha} \lg^2 m$ and $s_2 = 4 \lg(1/\delta)$. The second component (A2) is an array of $m^{2\alpha}$ counters (each counting from 1 to $m$) used to keep exact counts of the first $m^{2\alpha}$ distinct items seen in the input stream. The third component (A3) is a Count-Min Sketch, as described by Cormode and Muthukrishnan [5], which we use to estimate $k$, defined to be the number of items in the stream that are different from the most frequent item; i.e., $k = m - \max\{m_i : 1 \le i \le n\}$.

The algorithm itself works as follows. Recall that $F_0$ denotes the number of distinct items in the stream.

1. Maintain A1, A2, A3 as described above.
2. When queried (or at end of input):
3.   if $F_0 \le m^{2\alpha}$ then return exact $H$ from A2
4.   else let $\hat k$ := estimate of $k$ from A3
5.     if $\hat k \ge (1-\varepsilon)\, m^{1-\alpha}$ then return final estimator, $Y$, of A1
6.     else return $(\hat k \lg m)/m$
7. end

Theorem 3.1. The above algorithm uses

$$O\!\left(\frac{\log(1/\delta)}{\varepsilon^2}\, m^{2\alpha} \log^2 m\, (\log m + \log n)\right)$$

space and outputs a random variable $Z$ that satisfies the following properties.
1. If $k \le m^{2\alpha} - 1$, then $Z = H$.
2. If $k \ge m^{1-\alpha}$, then $\Pr[\,|Z - H| \ge \varepsilon H\,] \le \delta$.
3. Otherwise (i.e., if $m^{2\alpha} \le k < m^{1-\alpha}$), $Z$ is a $(1/\alpha)$-approximation of $H$.

Remark 3.2. Under the assumption $H \ge 1$, an algorithm that uses only the basic estimator in A1 and sets $s_1 = (8/\varepsilon^2) \lg^2 m$ suffices to give a $(1+\varepsilon)$-approximation in $\tilde O(\varepsilon^{-2} \log^2 m)$ space.

Proof. The space bound is clear from the specifications of A1, A2 and A3, and Lemma 2.1. We now prove the three claimed properties of $Z$ in sequence.

Property 1: This follows directly from the fact that $F_0 \le k + 1$.

Property 2: The Count-Min sketch guarantees that $\hat k \le k$ and, with probability at least $1 - \delta$, $\hat k \ge (1-\varepsilon)k$. The condition in Property 2 therefore implies that $\hat k \ge (1-\varepsilon)\,m^{1-\alpha}$, that is, $Z = Y$, with probability at least $1 - \delta$. Here we need the following lemma.

Lemma 3.3. Given that the most frequent item in the input stream $A$ has count $m - k$, the minimum entropy $H_{\min}$ is achieved when all the remaining $k$ items are identical, and the maximum $H_{\max}$ is achieved when they are all distinct. Therefore,

$$H_{\min} = \frac{m-k}{m}\lg\frac{m}{m-k} + \frac{k}{m}\lg\frac{m}{k}, \quad\text{and}\quad H_{\max} = \frac{m-k}{m}\lg\frac{m}{m-k} + \frac{k}{m}\lg m.$$

(The proof of this lemma follows the code sketch below.)
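Steps 2-7 above translate directly into code. A minimal sketch of the query-time logic (the maintenance of A1, A2 and A3 is assumed to happen elsewhere; all names are illustrative):

import math

def entropy_query(m, F0, exact_H, k_hat, Y, alpha, eps):
    # m: stream length; F0: number of distinct items seen;
    # exact_H: entropy computed exactly from the A2 counters;
    # k_hat: Count-Min estimate of k; Y: final estimator from A1.
    if F0 <= m ** (2 * alpha):
        return exact_H                    # few distinct items: A2 is exact
    if k_hat >= (1 - eps) * m ** (1 - alpha):
        return Y                          # entropy not too small: use A1
    return k_hat * math.log2(m) / m       # fallback estimate via A3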

Proof. Consider a minimum-entropy stream $A_{\min}$ and suppose that, apart from its most frequent item, it has at least two other items with positive count. Without loss of generality, let $m_1 = m - k$ and $m_2, m_3 \ge 1$. Modify $A_{\min}$ to $A'$ by letting $m'_2 = m_2 + m_3$ and $m'_3 = 0$, and keeping all other counts the same. Then

$$H(A') - H(A_{\min}) = \big(\lg m - F_H(A')/m\big) - \big(\lg m - F_H(A_{\min})/m\big) = \big(F_H(A_{\min}) - F_H(A')\big)/m$$
$$= \big(m_2 \lg m_2 + m_3 \lg m_3 - (m_2 + m_3)\lg(m_2 + m_3)\big)/m < 0,$$

since $x \lg x$ is convex and monotone increasing (on integer values), giving us a contradiction. The proof of the maximum-entropy distribution is similar. □

Now, consider equation (6) and note that for any $r$, $|X| \le \lg m$. Thus, if $\mathrm{E}[X] = H \ge 1$, then $\mathrm{Var}[X]/\mathrm{E}[X]^2 \le \mathrm{E}[X^2] \le \lg^2 m$ and our choice of $s_1$ is sufficiently large to give us the desired $(1+\varepsilon)$-approximation, by Lemma 2.1.⁴ On the other hand, if $H < 1$, then $k < m/2$, by a simple argument similar to the proof of Lemma 3.3. Using the expression for $H_{\min}$ from Lemma 3.3, we then have

$$H \ge H_{\min} \ge \frac{k}{m}\lg\frac{m}{k} \ge \frac{k}{m} \ge m^{-\alpha}$$

(the last two steps using $k < m/2$ and $k \ge m^{1-\alpha}$), which gives us $\mathrm{Var}[X]/\mathrm{E}[X]^2 \le \mathrm{E}[X^2]/m^{-2\alpha} \le (\lg^2 m)\, m^{2\alpha}$. Again, plugging this and our choice of $s_1$ into Lemma 2.1 gives us the desired $(1+\varepsilon)$-approximation.

Property 3: By assumption, $k < m^{1-\alpha}$. If $\hat k \ge (1-\varepsilon)\,m^{1-\alpha}$, then $Z = Y$ and the analysis proceeds as for Property 2. Otherwise, $Z = (\hat k \lg m)/m \le (k \lg m)/m$. This time, again by Lemma 3.3, we have

$$H_{\min} \ge \frac{k}{m}\lg\frac{m}{k} \ge \frac{k}{m}\lg(m^{\alpha}) = \alpha\,\frac{k}{m}\lg m,$$

and

$$H_{\max} = \frac{m-k}{m}\lg\frac{m}{m-k} + \frac{k}{m}\lg m = \lg\frac{m}{m-k} + \frac{k}{m}\lg(m-k) \le \frac{k}{m}\lg m + O\!\left(\frac{k}{m}\right),$$

which, for large $m$, implies $H - o(H) \le Z \le H/\alpha$ and gives us Property 3. Note that we did not use the inequality $m^{2\alpha} \le k$ in the proof of this property. □

The ideas involved in the proof of Theorem 3.1 can be used to yield a very efficient two-pass algorithm for estimating $H$, which can be found in [3].

⁴ This observation, that $H \ge 1 \Rightarrow \mathrm{Var}[X] \le \lg^2 m$, proves the statement in the remark following Theorem 3.1.
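As a quick numeric illustration of Lemma 3.3 and the window used in Property 3 (a sketch with illustrative values, not from the paper):

import math

def h_min(m, k):
    # Remaining k items identical (Lemma 3.3).
    return ((m - k) / m) * math.log2(m / (m - k)) + (k / m) * math.log2(m / k)

def h_max(m, k):
    # Remaining k items all distinct (Lemma 3.3).
    return ((m - k) / m) * math.log2(m / (m - k)) + (k / m) * math.log2(m)

m, k = 10**6, 10**3          # here k = m**0.5 < m**(1-alpha) for any alpha < 1/2
z = k * math.log2(m) / m     # the fallback estimate, taking k_hat = k for simplicity
print(h_min(m, k), z, h_max(m, k))
# prints roughly 0.0114, 0.0199, 0.0214: z is sandwiched between
# H_min and H_max, consistent with the (1/alpha)-approximation claim.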

4 Conclusions

Entropy and entropy norms are natural measures with direct applications in IP network traffic analysis, for which one-pass streaming algorithms are needed. We have presented one-pass sublinear space algorithms for approximating the entropy norm as well as the empirical entropy. We have also presented a two-pass algorithm for empirical entropy that has a stronger approximation guarantee and space bound. We believe our algorithms will be of interest in the practice of data stream systems. It will be of interest to study these problems on streams in the presence of inserts and deletes.

Note: Very recently, we have learned of a work in progress [9] that may lead to a one-pass polylogarithmic space algorithm for approximating $H$ to within a $(1+\varepsilon)$-factor.

References

1. N. Alon, Y. Matias and M. Szegedy. The space complexity of approximating the frequency moments. Proc. ACM STOC, 20-29, 1996.
2. B. Babcock, S. Babu, M. Datar, R. Motwani and J. Widom. Models and issues in data stream systems. Proc. ACM PODS, 1-16, 2002.
3. A. Chakrabarti, K. Do Ba and S. Muthukrishnan. Estimating entropy and entropy norm on data streams. DIMACS Technical Report 2005-33.
4. D. Coppersmith and R. Kumar. An improved data stream algorithm for frequency moments. Proc. ACM-SIAM SODA, 151-156, 2004.
5. G. Cormode and S. Muthukrishnan. An improved data stream summary: the count-min sketch and its applications. J. Algorithms, 55(1): 58-75, April 2005.
6. C. Estan and G. Varghese. New directions in traffic measurement and accounting: Focusing on the elephants, ignoring the mice. ACM Trans. Comput. Syst., 21(3): 270-313, 2003.
7. Y. Gu, A. McCallum and D. Towsley. Detecting anomalies in network traffic using maximum entropy estimation. Proc. Internet Measurement Conference, 2005.
8. S. Guha, A. McGregor and S. Venkatasubramanian. Streaming and sublinear approximation of entropy and information distances. Proc. ACM-SIAM SODA, to appear, 2006.
9. P. Indyk. Personal e-mail communication, September 2005.
10. P. Indyk and D. Woodruff. Optimal approximations of the frequency moments of data streams. Proc. ACM STOC, 202-208, 2005.
11. E. Kushilevitz and N. Nisan. Communication Complexity. Cambridge University Press, Cambridge, 1997.
12. R. Motwani and P. Raghavan. Randomized Algorithms. Cambridge University Press, New York, 1995.
13. S. Muthukrishnan. Data Streams: Algorithms and Applications. Manuscript, available online at http://www.cs.rutgers.edu/~muthu/stream-1-1.ps
14. A. Wagner and B. Plattner. Entropy based worm and anomaly detection in fast IP networks. Proc. 14th IEEE International Workshops on Enabling Technologies: Infrastructures for Collaborative Enterprises (WET ICE), STCA Security Workshop, Linköping, Sweden, June 2005.
15. K. Xu, Z. Zhang and S. Bhattacharya. Profiling internet backbone traffic: Behavior models and applications. Proc. ACM SIGCOMM, 2005.