Analysis of an algorithm catching elephants on the Internet

Size: px

Start display at page:

Download "Analysis of an algorithm catching elephants on the Internet"

Amber Rodgers
5 years ago
Views:

1 Fifth Colloquiu on Matheatics and Coputer Science DMTCS proc. AI, 2008, Analysis of an algorith catching elephants on the Internet Yousra Chabchoub 1, Christine Fricker 1, Frédéric Meunier 2 Tibi 3 and Danielle 1 INRIA, Doaine de Voluceau, Le Chesnay CX, France 2 Université Paris Est, LVMT, Ecole des Ponts et Chaussées, 6 et 8 avenue Blaise Pascal, Cité Descartes, Chaps sur Marne Marne-la-Vallée 3 Université Paris 7, UMR 7599, 2 Place Jussieu, Paris Cedex 05 The paper deals with the proble of catching the elephants in the Internet traffic. The ai is to investigate an algorith proposed by Azzana based on a ultistage Bloo filter, with a refreshent echanis called shift in the present paper), able to treat on-line a huge aount of flows with high traffic variations. An analysis of a siplified odel estiates the nuber of false positives. Liit theores for the Markov chain that describes the algorith for large filters are rigorously obtained. The asyptotic behavior of the stochastic odel is here deterinistic. The liit has a nice forulation in ters of a M/G/1/C queue, which is analytically tractable and which allows to tune the algorith optially. Keywords: attack detection; Bloo filter; coupon collector; elephants and ice; network ining Introduction One traditionally distinguishes two kinds of flows in the Internet traffic: long flows, called elephants, which are the less nuerous typically 5-10%), and short flows, called ice, which are the ost nuerous. The convention is to fix a threshold C and to call elephant any flow having ore than C packets, and ouse any flow having strictly less than C packets. For various reasons, detections of attacks, pricing, statistics, it is an iportant task to be able to catch the elephants, that eans to be able to get the list of all elephants, with their IP addresses, flowing through a given router. We ephasize the fact that this proble is distinct fro the one consisting only in providing statistical estiates for the traffic. Even a siple on-line counting of the nuber of distinct flows reveals to be difficult due to the high throughput of the traffic. There is a wide literature on algoriths for fast estiations of cardinality i.e. of the nuber of distinct eleents in a set with repeated eleents) of huge data sets see [6], [9] and [11]). A siilar question consists in finding the k ost frequent flows the so-called icebergs see [4] and [12]). If one asks for the proportion of elephants or the size distribution of elephants, it is possible to use the Adaptive Sapling algorith proposed by Wegan and analyzed by Flajolet [8], which provides a saple of the flows independently fro their size. This saple can then be used to copute statistics for the elephants c 2008 Discrete Matheatics and Theoretical Coputer Science DMTCS), Nancy, France

2 296 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi and actually for all the flows). For counting the elephants, Gandouet and Jean-Marie have proposed in [10] an algorith based on sapling, thus requiring a knowledge on the flow size distribution, which reduces its application. For the target applications, a unique elephant hidden in a huge traffic of ice which does not exist fro a statistical point of view has to be detected. For this purpose, an algorith based on Bloo filters has been already presented by Estan and Varghese [7] in As explained here, a Bloo filter allows elephants to accuulate but due to huge traffic, collisions occur and ice can be detected as elephants. Collisions will be controlled by taking several stages and cleverly flushing the filter. More precisely, the principle of this latter algorith is the following. The IP header of each packet is hashed with k independent hashing functions in a k-ulti-stage filter. Counters are increented and when a counter reaches C, the corresponding flow is declared as an elephant and its packets are counted to give eventually its size. The proble is that in fact, under a heavy Internet traffic, the ultistage filter quickly gets totally filled. To avoid this proble, Estan and Varghese propose to periodically every 5 seconds) reinitialize the filter to zero. But, without any a priori knowledge about the traffic intensity, flows arrival rate...), 5 seconds can either be too long in which case the filter can be saturated) or too short a lot of long flows can be issed). Therefore the accuracy of the algorith closely depends on the characteristics of the traffic trace. In order to settle this proble, Azzana [2] introduces an adaptative refreshent echanis, that we will call shift, in the ulti-stage filter algorith. It is an efficient ethod to adapt the algorith to traffic variations: The filter is refreshed with a frequency depending on the current traffic intensity. Moreover the filter is not reinitialized to zero, but a softer technique is used to avoid issing soe elephants. The ain difference with the Estan and Varghese algorith is its ability to deal with traffic variations. Azzana shows in [2] that the refreshent echanis iproves notably the efficiency and accuracy of the algorith see Section 3 for practical results). Paraeters, as the filter size or related to the refreshent echanis, are experientally optiized. Azzana proposes soe eleents of analysis for this algorith. Our purpose is to go further and to provide analytical results when the filter is large. stage 1 Flow A h 1A) h A) 2 h k A) stage 2 stage k All > C Elephants Meory Fig. 1: The ulti-stage filter Description of the algorith The algorith designed by Azzana uses a Bloo filter with counters defined below) and involves four paraeters in input: the nuber k of stages in the Bloo filter, the nuber of counters in each stage, the axiu value C of each counter, i.e. the size threshold C to be declared as an elephant and the filling rate r. A Bloo filter is a set of k stages, each of these stages being a set of counters, initially at 0 and

3 Analysis of an algorith catching elephants on the Internet 297 taking values in {0,..., C}. Together with the k stages F 1,..., F k, one supposes that k hashing functions h 1,..., h k are given, one for each stage. We ake the strong) assuption that these hashing functions are independent, which iplies that k is sall k = 10 is probably the upper liit). Each hashing function h i aps the part of the IP header of a packet indicating the flow to which it belongs, to one of the counters of stage F i. The algorith works on-line on the strea, processing the packets one after the other. Flows identified as elephants are stored in a list E. When a packet is processed, it is first checked if it belongs to a flow already identified as an elephant that is a flow already in E). Indeed, in this case, there is no interest in apping it to the k counters, and the algorith siply forgets this packet. If not, it is apped by the hashing functions on one counter per stage and it increases these counters by one, except for those that have already reached C, in which case they reain at C. When, after processing soe packet, all the k counters are at C, the flow is declared to be an elephant and stored in the dedicated eory E. When the proportion of nonzero counters reaches r in the whole set of k counters, one decreases all nonzero counters by one. This last operation is called the shift. See Figure 1 for an illustration. Motivation of the algorith Packets of a sae flow hit the sae k counters, but two distinct flows ay also increase the sae counter in one or several stages. The idea of using several stages where flows are apped independently and uniforly, intends to reduce the probability of collisions between flows. The shift is crucial in the sense that it prevents the filters to be copletely saturated, that is, to have any counters with high values. Without the shift operation, ice would be very quickly apped to counters equal to C and declared as elephants. The algorith would have a finite lifetie because when the filter is saturated, nothing can be detected. False positive and false negative A false positive is a ouse detected as an elephant by the algorith. A false negative is an elephant not declared as such hence considered as a ouse) by the algorith. Generally, a false negative is worse than a false positive. Think of an attack: One does not want to iss it, and a false alar has less serious consequences than a successful attack. In our context, a false positive is a ouse one packet of which is apped onto counters all C 1. A false negative is due to the shift, and if it happens, it eans that there were at least f C + 1 shifts during the transission tie of soe elephant of size f. If shifts do not occur too often, a false negative is then an elephant whose packets are broadcast at a slow rate. Intuitively, and it will be confired by the forthcoing analysis, if the paraeters actually r) are chosen so as to aintain counters at low values, then shifts occur often, and if one tries to decrease the shift frequency, then the counters tend to have high values. Therefore, a coproise has to be found between these two properties frequency of the shifts, height of the counters), which translates into a coproise between false positives and false negatives. This last coproise depends on the applications. A Markovian representation In this paper, we will ainly focus our analysis on the one-stage filter case when the traffic is ade up only of ice of size 1. The ai is to estiate the proportion of false positives. Fro this analysis, we will then derive results for the general case. Let us now introduce our ain notations.

4 298 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi We thus assue that k = 1 and that all flows have size 1. Throughout the paper, Wn i) denotes the proportion of counters having value i just before the nth shift in a filter with counters. According to this notation, Wn 0) is close to and C i=0 W n i) = 1. Notice that the nth shift exactly decreases the nuber of nonzero counters by Wn 1). An iportant part of our analysis will consist in estiating Wn C 1) + Wn C). Indeed, it gives an upper bound on the probability that a flow is declared as an elephant that is a false positive) between the n 1)th and the nth shifts, since, according to our assuptions, there is no elephant at all. In this fraework k = 1 and flows of size one), the algorith has a siple description in ters of urns and balls. Each flow is a ball thrown at rando into one of urns each urn being one of the counters). When a ball falls into an urn with C balls, it is iediately reoved, in order to have at ost C balls in each urn. When the proportion of non epty urns) reaches r, one ball is reoved in every non epty urn. For fixed, Wn ) n N = Wn i)) i=0,...,c is an ergodic Markov chain on soe finite state n N space. Its invariant probability easure π is the distribution of soe variable W. For C = 2, the first non-trivial case, even the expression of the transition atrix P of the Markov chain is cobinatorially quite coplicated and an expression for π sees out of reach. In practice, the nuber of counters per stage is large. This suggests to look at the liiting behavior of the algorith when tends to. We use as far as possible the Markovian structure of the algorith in order to derive rigorous liit theores and analytical expressions for the liiting regie. This is the longest and ost technical part of the paper, which also contains the ain result, fro a atheatical point of view. Main results The odel considered in the paper describes the collisions between ice in order to evaluate the nuber of false positives due to these collisions. In a one-stage filter where all flows are ice of size 1, the Markov chain Wn ) n N describes the evolution of the counters observed just before shift ties. The ain result is that, when is large, the rando vector W converges in distribution to soe deterinistic value w. This result is not quite copletely proved. The way to proceed is classical for large Markovian odels see for exaple [5] and [1]). The idea is to study the convergence of the process over finite ties. It is shown that the Markov chain given by the epirical distributions Wn ) n N converges to a deterinistic dynaical syste w n+1 = F w n ), which has a unique fixed point w. The situation is analogous in discrete tie to the study by Antunes and al. [1]. A Lyapunov function for F would allow to prove the convergence in distribution of W. Such a Lyapunov function is exhibited in the particular case C = 2. The dynaical syste provides a liiting description of the original chain which stationary behavior is then described by w. The fixed point w has the following interpretation. The fixed point w is identified as the invariant probability easure µ λ of the nuber of custoers in an M/G/1/C queue where service ties are 1 and arrival rate is soe λ satisfying the fixed point equation or equivalently µ λ 0) = λ = log 1 + µ λ 1) ). As a byproduct, the stationary tie between two shifts divided by converges in distribution to the constant λ. Thus the inter-shift tie closely related to the nuber of false negatives) and the probability

5 Analysis of an algorith catching elephants on the Internet 299 of false positives are respectively approxiated by λ and bounded by µ λ C 1) + µ λ C) when is large. When ice have general size distribution, the previous odel is extended to an approxiated odel where packets of a given ouse arrive siultaneously. The involved quantity is the invariant easure of an M/G/1/C queue with arrivals by batches with distribution the ouse size distribution. In the case of size 1 ice, the ulti-stage filter case is investigated. Even if µ λ is not explicit, which coplicates the exhibition of a Lyapunov function, the quantities λ and µ λ C 1) + µ λ C) can be nuerically coputed. It appears that the latter quantity is an increasing function of r as r varies fro 0 to 1). Hence, given the ouse size distribution, one can nuerically deterine the values of r for which the algorith perfors well. Section 1 is the ost technical part of the paper. It investigates the one-stage filter in case of size 1 flows. In Section 2, this analysis is generalized to a general ouse size distribution in a siplified odel and to a ulti-stage filter. Then Section 3 is devoted to discussing the perforance of the algorith, to experiental results and iproveents validated through an ipleentation). 1 The Markovian urn and ball odel In this section, C is fixed and we consider the sequence Wn ) n N, where Wn denotes the vector of the proportions of urns with 0,..., C balls just before the nth shift tie. For 1, Wn ) n N is an ergodic Markov chain on the finite state space P r) = { w = w0),..., wc)) ) C+1 N, C wi) = 1 and i=0 C i=1 } wi) = r, where r denotes the sallest integer larger or equal to r) with transition atrix P defined as follows: If Wn = w P r), then Wn+1, distributed according to P w,.), is the epirical distribution of urns when, starting with distribution w, one ball is reoved fro every non epty urn and then balls are thrown at rando until r urns are non epty again, balls overflowing the capacity C being rejected. The required nuber of thrown balls is τ n = r 1 l= r W n 1) Y l, 1) where Y l, l N are independent rando variables with geoetrical distributions on N with respective paraeters l/, i.e. PY l = k) = l/) k 1 1 l/), k 1. { Let F be defined on P = w R C+1 +, } C i=0 wi) = 1 by F w) = T C sw) Pλw) ) 2)

6 300 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi where s : w w0) + w1), w2),..., wc), 0) on P { } T C : PN) = w n ) n N, λ : P R +, w log + i=0 w i = w1) ) P, w w0),..., wc 1), wi) i C and P λ is the Poisson { distribution with paraeter λ. Notice that F aps P to itself and, also by definition of λ, P r) def = w R C+1 +, C i=0 wi) = 1 and } C i=1 wi) = r to itself. 1.1 Convergence to a dynaical syste We prove the convergence of W n ) n N to the dynaical syste given by F as tends to +. The following lea is the key arguent. The unifor convergence stated below appears as the convenient way to express the convergence of P w,.) to δ F w) in order to prove both the convergence of W n ) n N, and, later on, the convergence of the stationary distributions. Define x = sup C i=0 x i for x R C+1. Lea 1 For ε > 0, sup P w, {w P r) : w F w) > ε}) 0. w P r) + Proof: The first step is to prove that, for ε > 0, ) λw) > ε 0 3) where λw) = log 1 + w1) = w). By Bienayé-Chebyshev s inequality, it is enough to prove that and τ1 sup P w w P r) ) and P w.) denotes P. W 0 sup w P r) E w ) τ 1 By equation 1), as EY l ) = 1/1 l/), using a change of index, ) τ E 1 w = λw) 0 4) ) τ sup Var 1 w 0. 5) w P r) r 1 l= r w1) r +w1) 1 l = 1 j. 6) j= r +1

7 Analysis of an algorith catching elephants on the Internet 301 A coparison with integrals leads to the following inequalities: log 1 r + w1) r + 1 ) τ E 1 w log 1 r + w1). 1 r It is then easy to show that the two extree ters tend to λw) = log1 + w1)/)), uniforly in w1) [0, 1]. This gives 4). For 5), as VarY l ) = l/)/1 l/) 2, by the sae change of index, ) τ Var 1 w = 1 r +w1) j= r +1 r +w1) j 1 j 2 = j 2 1 ) τ E 1 w. 7) j= r +1 The first ter of the right-hand side is bounded independently of w by + j= r +1 1/j2, which tends to 0 as tends to +. The second ter tends to 0 uniforly in w using 4) together with the unifor bound λw) log1 + 1/)). To obtain the lea, it is then sufficient to prove that, for each ε > 0, ) sup w P r) P w W1 F w) > ε, τ1 λw) ε ) Since W1 and F w) are probability easures on {0,..., C}, to get 8), it is sufficient to prove that for j {0,..., C 1}, sup w P r) P w W1 j) F w)j) > ε, τ1 λw) ε ) ) Let w P r). Define the following rando variables: For 1 i, Ni respectively Ñ i w)) is the nuber of additional balls in urn i when τ1 respectively λw)) new balls are thrown in the urns. One can construct these variables fro the sae sequence of balls i.e. of i.i.d. unifor on {1,..., } rando variables), eaning that balls are thrown in the sae locations for both operations until stopping. This provides a natural coupling for the N i s and Ñi s. Let j {0,..., C 1} be fixed. Given W0 = w, as j C 1, the capacity constraint does not interfere and W1 j) can be represented as W 1 j) = 1 j k=0 i Iw,k 1 {N i =j k} 10) where Iw,k is the set of urns with k balls in soe configuration of urns with distribution sw), so that card Iw,k = sw)k). The su over i is exactly the nuber of urns that contains k balls after the reoving of one ball per urn, and having j balls after new balls have been thrown. By coupling, on the event {W0 = w, τ1 / λw) ε/2}, the following is true: card{i, N i Ñ i w)} ε 2 11)

8 302 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi thus, denoting W 1 j) = 1 j k=0 i I 1 w,k { N e i w)=j k}, on the sae event, W 1 j) W 1 j) ε 2. To prove equation 9), it is then sufficient to show that This will result fro sup P w W 1 j) F w)j) > ε w P r) sup w P r) ) 0. E w W 1 j)) F w)j) 0 and sup r) w P Var w W 1 j)) 0. which is quite standard to prove. The key arguent, with classical proof, is the following: If L i is the nuber of balls in urn i when throwing λ balls at rando in urns, if 0 < a < b, then, for all i 1, i 2 ) N 2, i) sup PL 1 λ [a,b] = i 1 ) P λ i 1 ) 0, ii) sup PL 1 = i 1, L 2 = i 2 ) P λ i 1 )P λ i 2 ) 0. λ [a,b] It is applied since λw) [0, log1 + 1/)]. It ends the proof. Proposition 1 If W0 converges in distribution to w 0 P r) then Wn ) n N converges in distribution to the dynaical syste w n ) n N given by the recursion w n+1 = F w n ), n N. Proof: Assue that W0 converges in distribution to w 0 P r). Convergence of W0,..., Wn ) can be proved by induction on n N. By assuption it is true for n = 0. Let us just prove it for n = 1, the sae arguents holding for general n, fro the assued property for n 1. Let g be continuous on the copact) set P r)2. Since the distribution µ of W0 has support in P r), E gw0, W1 )) = gw, w )P w, dw )dµ w) = P r)2 P r) P r) 12) gw, w )P w, dw ) gw, F w))) dµ w) + gw, F w))dµ w). P r) Since g., F.)) is continuous on P r) F being continuous as can be easily checked), the last integral converges to gw 0, w 1 ) by assuption or case n = 0). The first ter is bounded in odulus, for each η > 0, by sup gw, w )P w, dw ) gw, F w)) w P r) P r) 2 g { }) sup P w, w P r), w F w) > ε + η w P r)

9 Analysis of an algorith catching elephants on the Internet 303 where ε is associated to η by the unifor continuity of g on P r)2. By Lea 1, this is less than 2η for sufficiently large. Thus, as tends to +, E gw 0, W 1 )) gw 0, w 1 ). 1.2 Convergence of invariant easures Let, for N, π be the stationary distribution of W n ) n N. Define P as the transition on P r) given by P w,.) = δ F w). Proposition 2 Any liiting point π of π ) N is a probability easure on P r) which is invariant for P i.e. that satisfies F π) = π. Proof: A classical result states that, if P and P, N, are transition kernels on soe etric space E such that, for any bounded continuous f on E, P f is continuous and P f converges to P f uniforly on E then, for any sequence π ) of probability easures such that π is invariant under P, any liiting point of π is invariant under P. Indeed, for any and any bounded continuous f, π P f = π f. If a subsequence π p ) converges weakly to π, then π p f converges to πf. Writing π p P p f = π p P f + π p P p f P f), since P f continuous and bounded since f is), the first ter π p P f converges to πp f and the second ter tends to 0 by unifor convergence of P f to P f. Equation π p P p f = π p f thus gives, in the liit, πp f = πf for any bounded continuous f. Here the difficulty is that the P s and P are transitions on P r) and P r), which are in general disjoint. To solve this difficulty, extend artificially P and P to P by setting: P w,.) = δ F w) for w P \ P r) P w,.) = δ F w) for w P \ P r). The proposition is then deduced fro the classical result if we prove that, for each f continuous on P notice that then P f = f F is continuous), sup P fw) ff w)) 0, w P r) which is straightforward fro Lea 1. The fact that the support of π is in P r) is deduced fro the portanteau theore see Billingsley [3] p.16) using the sequence of closed sets { } C P r),n = w P, r wi) r + 1. n The fixed points of the dynaical syste are the probability easures w on P r) such that i=1 w = F w) = T C sw) P λw) )

10 304 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi where λw) = log1 + w1)/)). This is exactly the invariant easure equation for the nuber of custoers just after copletion ties in an M/G/1/C queue with arrival rate λw) and service ties 1, so that it is equivalent to w = µ λw) 13) where µ λ respectively ν λ ) is the liiting distribution of the process of the nuber of custoers in an M/G/1/C respectively M/G/1/ ) queue with arrival rate λ and service ties 1. Indeed, it is well-known that this queue has a liiting distribution for λ R + respectively 0 λ < 1) which is the invariant probability easure of the ebedded Markov chain of the nuber of custoers just after copletion ties. The balance equations here reduce to a recursion syste, so that, even when λ 1, ν λ is well defined up to a ultiplicative constant which can not be noralized into a probability easure in this case). Moreover, ν λ is given by the Pollaczek-Khintchine forula for its generating function: n N ν λ n)u n = ν λ 0) g λu)u 1), for u < 1 14) u g λ u) where g λ u) = e λ1 u) and for λ < 1, ν λ 0) = 1 λ see for exaple Robert [13] p ). Notice that ν λ n) n N) has no closed for. For exaple, the expressions of the first ters are ν λ 1) = ν λ 0)e λ 1), ν λ 2) = ν λ 0)e λ e λ 1 λ), ν λ 3) = ν λ 0)e λ λλ + 2) λ)e λ + e 2λ ) 15) where ν λ 0) = 1 λ if λ < 1. For the M/G/1/C queue, µ λ i) = The following proposition characterizes the fixed points of F. ν λ i) C l=0 ν, i {0,..., C}. 16) λl) Proposition 3 F defined by 2) has one unique fixed point, denoted by w, in P r) given by the liiting distribution µ λ of the nuber of custoers in an M/G/1/C queue with arrival rate λ and service ties 1, where λ is deterined by the iplicit equation µ λ 0) = which is equivalent to λ = log 1 + µ λ 1) ), 17) where µ λ is given by 16) and ν λ by the Pollaczek-Kintchine forula 14). Notice that, oreover, r λ log). The upper bound on λ, obtained fro equation 17) using µ λ 1) r just says that the stationary ean nuber of balls between two shifts is less than the ean nuber of balls thrown until the first shift

11 Analysis of an algorith catching elephants on the Internet 305 starting with epty urns). Moreover λ r, which is clear if λ 1, and obtained if λ < 1 writing 16) for i = 0 and using C l=0 ν λ l) 1 in this equation. This is exactly the fact that the asyptotic stationary ean nuber of balls λ arriving between two shift ties is greater than the nuber of reoved balls at each shift, which is r. It is due to the losses under the capacity liit C. Proof: Only the existence and uniqueness result reains to prove. According to 13), w is soe fixed point if and only if it is a fixed point of the function P r) P r) w µ λw) with λw) = log1 + w1)/)). This function being continuous on the convex copact set P r), by Brouwer s theore, it has a fixed point. To prove uniqueness, let w and w be two fixed points of F in P r). By definition of P r), µ λw) 0) = µ λw )0) =. 18) A coupling arguent shows that, if λ λ then µ λ is stochastically doinated by µ λ, and in particular, µ λ 0) + µ λ 1) µ λ 0) + µ λ 1). 19) It can then be deduced that λw) = λw ). Indeed, if for exaple λw) < λw ), by equations 18) and 19), µ λw) 1) µ λw )1). thus, using 15) together with 18), λw) = log 1 + µ ) λw)1) λw ) = log 1 + µ ) λw )1) which contradicts λw) < λw ). One finally gets λw) = λw ), and then by equation 13), w = w. A Lyapunov function for the dynaical syste given by F on P r) is a function g 0 on P r) such that, for each w P r), gf w)) gw) with equality if and only if w is the fixed point of F. In the particular case C = 2, a Lyapunov function can be exhibited, resulting fro a contracting property of F in this case. Indeed, restricted to P r), F is here given by: [ w = 1 r, w1), w2) = r w1)) F w) =, ) log [ 1 ) log 1 + w1) 1 + w1) ) + ) + r w1) + w1) 1 + w1) P r) is soe one diensional subvariety of R 3, so that any w P r) can be identified with its second coordinate w1) [0, r], or equivalently with λw) = log1 + w1)/)) [0, log1/))]. ]). ],

12 306 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi Using this last paraetrization of P r), it is easy to show that F rewrites as G, ) apping the interval I = [0, log1/))] to itself and defined, for λ I, by Gλ) = log λ + e λ. An eleentary coputation shows that G has derivative on I taking values in the interval ] 1, 0], which gives the already known existence and uniqueness of a fixed point λ for G or F, both assertions being equivalent, and λ being equal to λw)). Moreover, the following inequality holds for λ I: Gλ) λ λ λ, equality occurring only at λ = λ. As a result, g defined on P r) by gw) = λw) λ = λw) λw) = + w1) log + w1), is a Lyapunov function for the dynaical syste defined by F. For C > 2, we conjecture the existence of such a g. Theore 1 Assue that a Lyapunov function exists for the dynaical syste given by F on P r) then, as tends to +, the invariant easure of W n ) n N converges to δ w where w is the unique fixed point of F. Thus the following diagra coutes, Wn d) ) n N n + + d) W d) w n ) n N w Proof: We prove that δ w is the unique invariant easure π of P with support in P r). Let g be the Lyapunov function for F on P r). π is P -invariant, thus πp = π and πp g = πg which can be rewritten g F g)dπ = 0. This iplies that g = g F holds π alost surely because g g F 0. Equality being only true at w, π has support in { w}. 2 A ore general odel 2.1 Mice with general size distribution Let Wn ) n N be the sequence of vectors giving the proportions of urns at 0,..., C just before the nth shift tie in a odel where balls are thrown by batches. The balls in a batch are thrown together in a unique urn chosen at rando aong the urns. The ith batch is coposed with S i balls and S i ) i N is a sequence of i.i.d. rando variables distributed as a rando variable S on N with support containing 1. Let φ be the generating function of S. The quantity S i is called the size of batch i. The dynaics is the sae: If, before the nth shift tie, the state is w P r), it first becoes sw) and then a nuber τn defined by 1) of successive batches are thrown in urns until r urns are non epty. The odel generalizes the previous one obtained for S = 1.

13 Analysis of an algorith catching elephants on the Internet 307 Let F be defined on P by F w) = T c sw) C λw),s ) 20) where T C, λ and S are already defined and C λw),s is a copound Poisson distribution i.e. the distribution of the rando variable Y = X S i 21) i=1 where X is independent of S i ) i N with Poisson distribution of paraeter λw). We iic the arguents in Section 1 to obtain the convergence of the stationary distribution of the ergodic Markov chain W n ) n N as tends to + to a Dirac easure at the unique fixed point of F. Propositions 1 and 2 hold. The fixed points of F are described in the following proposition. Proposition 4 F defined for w P by F w) = T c sw) C λw),s ) has a unique fixed point on P r) which is exactly the invariant easure µ λ of the nuber of custoers in a M/G/1/C queue with batches of custoers arriving according to a Poisson process with intensity λ, batch sizes being i.i.d. distributed as S with generating function φ and service ties 1, where λ is deterined by the iplicit equation µ λ 0) = which is equivalent to where for i {0,..., C}, and ν λ is given by + n=0 λ = log 1 + µ λ1) ) µ λ i) = ν λ i) C l=0 ν λl) ν λ n)u n g φu)u 1) = ν λ 0) u g φu), u < 1 where g λ u) = e λ1 u) and ν λ 0) = 1 λes) when λ < 1. Recall that the first ters of ν λ are given by ν λ 1) = ν λ 0)e λ 1) and ν λ 2) = ν λ 0)e λ e λ 1 λps = 1)) where ν λ 0) = 1 λes) when λ < 1, which generalizes the previous expressions. For C = 2, the Lyapunov function defined when S = 1 still works. Furtherore, for C > 2, we assue the existence of a Lyapunov function for F. Theore 1 still holds.

14 308 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi 2.2 A ulti-stage filter The filter is previously supposed to have only one stage. Let now assue that the filter has k stages of counters each. The natural odel then consists of k sets of urns where, when a ball is thrown, k copies of this ball are sent siultaneously and independently into the k stages, each falling at rando in one of the urns of its set. If soe ball hits an urn with C balls, then it is rejected. Moreover, when the proportion of non-epty urns in the whole filter reaches r, then one ball is reoved fro each non-epty urn: This is called a shift. The previous analysis extends with one ain difference: When the syste is initialized at soe state w j i), 1 j k, 0 i C), where w j i) is the proportion of urns with i balls in stage j, the nuber τ1 of balls thrown in each stage before the next shift is now asyptotically equivalent to λw), where w = wi), 0 i C) here gives the global proportions of urns in each possible state in the whole filter, that is wi) = 1 k w j i) for 0 i C. Thus λ is the sae function as for the one-filter case, here k j=1 evaluated at the global proportions: λw) = log 1 + k j=1 w j1) k) The one-stage proof of 3) is however not reproducible here, due to the lack of a representation of τ 1 analogous to 1) of Section 1. Another proof can be written. The alternative arguent is provided by noticing that when α balls per stage are thrown, with α / [λw) ε, λw)+ε] using the sae arguents i) and ii) as in Section 1), the epirical distributions of the urns at each stage are precisely known for large ) and do not correspond to the global proportion r of non-epty urns. Once the unifor) convergence of τ 1 / is established, the proof then proceeds along the sae lines as for k = 1 the sae reasoning holding for each stage). Notice however that the Markov property does not hold for the process of global proportions at shift ties, so that convergence in distribution is proved for the process of proportions detailed by stage, then inducing convergence. 3 Discussions 3.1 Synthesis: false positives and false negatives Fro a practical point of view, the ain results are Propositions 3 and 4. Given soe size distribution for the flows the generating function φ of Section 2), these propositions show how the values of the counters can be coputed fro the different paraeters of the algorith, since these values are encoded by the fixed point w of F : according to Theore 1, w is the state reached in the stationary regie when there is one stage and also when there are several stages see Subsection 2.2: one has k j=1 w jc) = kwc)). Moreover, the convergence is experientally really fast see the reark below), which ensures that in practice the algorith lives in the stationary phase. The coponent wi) of w gives the approxiate proportion of counters having value i in the whole Bloo filter. λ is the nuber of packets that arrive between two shifts. w and λ are respectively related to the nuber of false positives and to the nuber of false negatives: ).

15 Analysis of an algorith catching elephants on the Internet 309 The probability that a packet is a false positive is less than wc 1) + wc)) k since, in stage j, the probability to hit a counter at height C is at ost w j C 1) + w j C) and k w j C 1) + w j C)) wc 1) + wc)) k. j=1 The quantity λ is the tie nuber of packets) between two shifts, which is connected to the nuber of false negatives according to the discussion False positive and false negative of the Introduction. 3.2 Ipleentation and tests The algorith has been ipleented with an iproveent called the in-rule already proposed in [7]. Instead of increasing the k counters, an arriving packet is increenting only the counters aong k having the iniu value. Analytically ore difficult to study, the algorith should perfor better: Heuristically, ore flows are needed to reach high values of the counters inducing fewer false positives; oreover, the tie between two shifts is longer and hence the nuber of false negatives is decreased. It has been tested against on two ADSL traffic traces fro France Teleco, involving illions of flows. The perforance of the algorith is evaluated coparing the real nuber of elephants with the value estiated by the algorith. Even under the in-rule, the algorith perfors well only if r stays under a critical value r c, closely dependent on the ice distribution. Siulations have been processed with a one-stage filter with flows of size 1 to evaluate the transient phase duration. It appears that the nuber of shifts to reach the stationary phase is not uch greater than C. Such a result on the speed of convergence sees however theoretically out of reach. References [1] Nelson Antunes, Christine Fricker, Philippe Robert, and Danielle Tibi. Stochastic networks with ultiple stable points. Annals of Probability, 361): , [2] Youssef Azzana. Mesures de la topologie et du trafic Internet. PhD thesis, Université Pierre et Marie Curie, july [3] Patrick Billingsley. Convergence of probability easures. Wiley Series in Probability and Statistics: Probability and Statistics. John Wiley & Sons Inc., New York, second edition, A Wiley- Interscience Publication. [4] E. D. Deaine, A. López-Ortiz, and J. I. Munro. Frequency estiation, of internet packet streas with liited space. In Proceedings of the 10th Annual European Syposiu on Algoriths ESA 2002), pages , Roe, Italy, Septeber [5] Vincent Duas, Fabrice Guillein, and Philippe Robert. A Markovian analysis of additive-increase ultiplicative-decrease algoriths. Adv. in Appl. Probab., 341):85 111, [6] M. Durand and P. Flajolet. Loglog counting of large cardinalities. In G. Di Battista and U. Zwick, editors, Proceedings of the annual european syposiu on algoriths, ESA03), pages , 2003.

16 310 Yousra Chabchoub, Christine Fricker, Frédéric Meunier and Danielle Tibi [7] C. Estan and G. Varghese. New directions in traffic easureent and accounting. In John Wroclawski, editor, Proceedings of the ACM SIGCOMM 2002 Conference on Applications, Technologies, Architectures, and Protocols for Coputer Counications SIGCOMM-02), volue 32, 4 of Coputer Counication Review, pages , New York, August ACM Press. [8] P. Flajolet. On adaptative sapling. Coputing, pages , [9] P. Flajolet, E. Fusy, O. Gandouët, and F. Meunier. Hyperloglog: the analysis of a near-optial cardinality estiation algorith. In Proceedings of the 13th conference on analysis of algorith AofA 07), pages , Juan-les-Pins, France, [10] O. Gandouet and A. Jean-Marie. Loglog counting for the estiation of ip traffic. In Proceedings of the 4th Colloquiu on Matheatics and Coputer Science Algoriths, Trees, Cobinatorics and Probabilities, Nancy, France, [11] F. Giroire. Order statistics and estiating cardinalities of assive data sets. Discrete Applied Matheatics, to appear. [12] G. S. Manku and R. Motwani. Approxiate frequency counts aver data streas. In Proceedings of the 28th VLDB Conference, pages , Hong Kong, China, [13] Philippe Robert. Stochastic networks and queues, volue 52 of Applications of Matheatics. Springer-Verlag, Berlin, Stochastic Modelling and Applied Probability.

LogLog-Beta and More: A New Algorithm for Cardinality Estimation Based on LogLog Counting

LogLog-Beta and More: A New Algorith for Cardinality Estiation Based on LogLog Counting Jason Qin, Denys Ki, Yuei Tung The AOLP Core Data Service, AOL, 22000 AOL Way Dulles, VA 20163 E-ail: jasonqin@teaaolco