Modeling Residual-Geometric Flow Sampling

Size: px

Start display at page:

Download "Modeling Residual-Geometric Flow Sampling"

Tyrone Wilcox
6 years ago
Views:

1 Modeling Residual-Geometric Flow Samling Xiaoming Wang Amazon.com Seattle, WA USA Xiaoyong Li Texas A&M University College Station, TX USA Dmitri Loguinov Texas A&M University College Station, TX USA Abstract Traffic monitoring and estimation of flow arameters in high seed routers have recently become challenging as the Internet grew in both scale and comlexity. In this aer, we focus on a family of flow-size estimation algorithms we call Residual- Geometric Samling (RGS), which generates a random oint within each flow according to a geometric random variable and records all remaining ackets in a flow counter. Our analytical investigation shows that revious estimation algorithms based on this method exhibit certain bias in recovering flow statistics from the samled measurements. To address this roblem, we derive a novel set of unbiased estimators for RGS, validate them using real Internet traces, and show that they rovide an accurate and scalable solution to Internet traffic monitoring. I. INTRODUCTION Recent growth of the Internet in both scale and comlexity has imosed a number of challenges on network management, oeration, and traffic monitoring. The main roblem in this line of work is to scale measurement algorithms to achieve certain objectives (e.g., accuracy) while satisfying real-time resource constraints (e.g., fixed memory consumtion and er-acket rocessing delay) of high-seed Internet routers. This is commonly accomlished (e.g., [5], [6], [7], [8], [9], [10], [11], [14], [15], [17], [21], [18], [19], [20], [22], [25], [30]) by reducing the amount of information a router has to store in its internal tables, which comes at the exense of deloying secial estimation techniques that can recover metrics of interest from the collected samles. In this aer, we study two roblems in the general area of measuring s 1) determining the number of ackets transmitted by elehant flows [11], [15], [17], [21], [20], [22] and 2) building the distribution of s seen by the router in some time window [7], [18], [30] couled in a single measurement technique. The former roblem arises in usage-based accounting and traffic engineering [6], [11], [12], [13], [26], while the latter has many security alications such as anomaly and intrusion detection [1], [23], [16]. Our interest falls within the family of residual samling, which selects a random oint A within each flow and then samles the remainder R of that flow until it ends. Denoting by L the size (in ackets) of a random flow, samled residuals R are simly L A. Stochastically larger A results in fewer flows being samled and leads to lower overhead in terms of both CPU and RAM consumtion. Besides reduced overhead arising from omission of many small-size flows from counter Suorted by NSF grants CNS and CNS tables, residual samling guarantees to cature large flows with robability 1 o(1) as their size L. This allows ISPs to determine heavy-hitters and charge the corresonding customers for generated traffic. While in P2P networks residual samling distributes the initial oint A uniformly within user lifetimes [29], flow-based estimation [11], [17] usually emloys geometric A since it can be easily imlemented with a sequence of indeendent Bernoulli variables. We call the resulting aroach Residual- Geometric Samling (RGS) and note that it has received some limited analytical attention in [11], [17]; however, unbiased estimation of individual s, analysis of the resulting error, asymtotically accurate recovery of flow-size distribution P (L = i) from samled residuals R, and analysis of sace- CPU requirements in steady state have not been exlored. We overcome these issues below. A. Single-Flow Usage We start with the roblem of obtaining sizes of individual flows for accounting uroses. Since residual samling requires an estimator to convert residuals into the metrics of interest, our first task is to define roer notation and desired roerties for the estimation algorithm. Assume that for a flow of size L the samling algorithm roduces residual R L, where both L and R L are random variables. We call an estimator e(r L ) unbiased if its exectation roduces the correct flow size, i.e., E[e(R L ) L = l] = E[e(R l )] = l. Unbiased estimation allows one to average the of several flows of a given size l and accurately estimate their total contribution. We further call an estimator elehant-accurate if ratio e(r l )/l converges to 1 in mean-square as l. Elehant-accuracy ensures that the variance of e(r l )/l tends to zero as l, which means that the amount of relative error between e(r l ) and l becomes negligible for large flows. Prior work on RGS [11], [17] has suggested the following estimator: e(r L ) = R L 1 + 1/, (1) where 0 < 1 is the arameter of geometric variable A. To understand the erformance of (1), we first build a general robabilistic for residual-geometric samling and derive the relationshi between L and its residual R L. Using this result, we rove that: E[e(R l )] = l 1 (1 ) l, (2)

2 which indicates that (2) is generally biased and on average tends to overestimate the original by a factor of u to 1/. To address this roblem, we derive a different estimator: ê(r L ) = R L 1 + 1/ (1 )R L and rove that it is both unbiased and elehant-accurate. We also derive in closed-form the mean-square error δ l = E[(ê(R l )/l 1) 2 ] for finite l, which can be used to determine when (3) aroximates the true with accuracy sufficient for billing uroses. B. Flow-Size Distribution Our second roblem is estimation of the original flow-size robability mass function (PMF), which we assume is given by f i = P (L = i), i = 1, 2,... We call PMF estimator q i asymtotically unbiased if it converges in robability to f i for all i as the number of samled flows M. One may be at first temted to comute this distribution based on the values roduced by either (1) or (3) for each observed flow; however, we show that such q i almost always differ from the original distribution f i and the bias ersists as samle size M. The reason for this discreancy is that e(.) and ê(.) both estimate the sizes of flows that have been samled by the algorithm, which are not reresentative of the entire oulation assing through the router. As longer flows are more likely to be selected by residual samling, this aroach overestimates their fraction and skews the PMF towards the tail. Denote by M i the number of samled flows with R L = i and define a new estimator: (3) q i = M i (1 )M i+1 M + (1 )M 1. (4) Using the general of RGS derived later in the aer, we rove that q i tends to f i in robability as M = i M i and obtain the amount of error q i f i for finite M. We also rovide asymtotically unbiased estimators for the total number of flows n: ñ = M + 1 M 1 (5) and the number of flows n i with exactly i ackets: ñ i = M i (1 )M i+1, (6) where ñ/n 1 and ñ i /n i 1, both in robability, as M. We call the resulting combination (3)-(6) Unbiased Residual-Geometric Estimators (URGE). C. Performance Evaluation To reduce RAM overhead, our imlementation eriodically discards flow records if the corresonding flows have comleted (i.e., FIN, RST ackets detected) or if no ackets from these flows arrive within some timeout τ. Unfortunately, no analytical results are available on the number of flows M(t) that a router needs to track in steady state or the amount of RAM needed to kee the counters. We overcome this roblem by deriving E[M(t)] in equilibrium and showing that it can be orders of magnitude smaller than both the total number of flows n and the number of samled flows M. We finish the aer by evaluating URGE with real Internet traces obtained from NLANR [24] and CAIDA [3]. Our exeriments reveal that the roosed algorithm roduces very accurate estimation of flow metrics and thus allows one to erform more aggressive samling (i.e., smaller robability ) of the monitored traffic. We also discover in exeriments that URGE works very well even on short traces, which makes it suitable for monitoring small customer networks and individual rotocols. II. RELATED WORK In this section, we review several samling algorithms in the area of traffic monitoring. In articular, we classify existing work into two categories: acket samling and flow samling, where the former makes er-acket and the latter er-flow decisions to samle incoming traffic. A. Packet Samling Samled NetFlow (SNF) [25] is a widely used technique in which incoming ackets are samled with a fixed robability. The general goal of SNF is to obtain the PMF of flow sizes; however, [14] shows that it is imossible to accurately recover the original flow-size distribution from samled SNF data. Estan et al. [10] roose Adative NetFlow (ANF), which adjusts the samling robability according to the size of the flow table; however, ANF s bias in the samled data is equivalent to that in SNF and is similarly difficult to overcome in ractice. Instead of using one uniform robability for all flows as in [10], [25], another direction in acket samling is to comute i (c) for each flow i based on its currently observed size c. This aroach has been studied by two indeendent aers, Sketch-Guided Samling (SGS) [20] and Adative Non-Linear Samling (ANLS) [15]. A common feature of these two methods is to samle a new flow with robability 1 and then monotonically decrease i (c) as c grows. Both methods must maintain a counter for each flow resent in the network and are difficult to scale due to the high RAM/CPU usage. B. Flow Samling In flow thinning [14], each flow is samled indeendently with robability and then all ackets in samled flows are counted. Hohn et al. [14] show that flow thinning is able to accurately estimate the distribution; however, this method tyically misses 1 ercent of elehant flows and thus does not suort alications such as usage-based accounting and traffic engineering [6], [11], [12], [13], [26]. For highly skewed distributions with a few extremely large flows and many short ones (which is tyical for Internet links), this method may also take a long time to converge. To address these roblems of flow thinning, Estan et. al. [11] introduce a size-deendent flow samling algorithm called Samle-and-Hold (S&H), which is roosed to identify 2

3 elehant flows. For each acket from a new flow, the algorithm creates a flow counter with robability ; once a flow is samled, all of its subsequent ackets are then counted. It is easy to verify that S&H samles a flow with size l with robability 1 (1 ) l, which quickly aroaches 1 as l grows. Creating a unifying analytical for this aroach and understanding the roerties of samles it collects is the main toic of this aer. Another direction of size-deendent flow samling has been exlored by Duffield et al. in [5], [6], [8], which resent another size-deendent flow measurement method called Smart Samling. Their aroach selects each flow of size L with robability (L) = min(1, L/z), where z is some constant. Since this method requires L before deciding whether to samle it or not, it can only be alied off-line. Komella et. al. [17] examine a method called Flow Slicing (FS), which combines SNF and S&H with a variant of smart samling. Other non-samling methods include exact counting [27], [31] and lossy counting [18], [22], which are orthogonal to our work. III. UNDERLYING MODEL In this section, we build a general robabilistic of Samle-and-Hold [11] and establish the necessary analytical foundation for the results that follow. Due to limited sace, omitted roofs, s, and various imlementation details can be found in the extended technical reort [28]. A. Samle-and-Hold Consider a sequence of ackets traversing a router and assume that its flow-measurement algorithm checks each acket s flow identifier x in some RAM table. If x is found in the table, the corresonding counter is incremented by 1; otherwise, with robability a new entry for x is created in the table (with counter value 1) and with robability 1 the acket is ignored. To this rocess, we first need several definitions. Assume that s are i.i.d. random variables and define geometric age A L to be the number of ackets discarded from the front of a flow with size L before it is samled (see Fig. 1). Let G be a shifted geometric random variable with success robability, i.e., P (G = j) = (1 ) j. It thus follows that A L is simly: A L = min(g, L). (7) Now define geometric residual R L to be the final counter value of a flow of size L conditioned on the fact that it has been samled (i.e., A L < L): R L = L A L, (8) which is also illustrated in Fig. 1. From the ersective of traffic monitoring in this aer, geometric residual R L is the only quantity collected during measurement and available to an estimation algorithm. Since this aroach belongs to the class of residual-samling techniques [29] and secifically uses geometric age, this aer calls S&H by a more mathematicallysecific name Residual-Geometric Samling (RGS). ackets in a flow Age A L L Residual R L Fig. 1. Residual-geometric samling of a flow with size L. discarded acket samled acket Assume that L has a PMF f i = P (L = i), where i = 1, 2,..., and denote by s = P (A L < L) the robability that a random flow is samled. Then, we have the following result. Lemma 1: Probability s that a flow is selected by RGS is: s = E[1 (1 ) L ] = 1 f i (1 ) i. (9) i=1 Next, let h i = P (R L = i) be the PMF of geometric residual R L. The following lemma exresses h i in terms of f i. Lemma 2: The PMF of geometric residual R L is: h i = j=i f j(1 ) j i s. (10) The result of Lemma 2 is fundamental as most of the results in this aer are conveniently derived from (10). B. Fixed Flow Size We next analyze a secial case of residual samling where the original is fixed at L = l. Note that residuals are now R l instead of R L since the original is no longer a random variable. Recall that the goal of single-flow size estimation is to obtain l from R l for each samled flow. The next corollary follows from (10) and gives the distribution and exectation of geometric residual R l. Corollary 1: Given L = l, the PMF of R l is: and its exectation is: P (R l = i) = (1 )l i 1 (1 ) l (11) E[R l ] = l + 1 1/. (12) 1 (1 ) l Next, we aly the results obtained in this section to analyze existing estimation methods that have been roosed for RGS. IV. ANALYSIS OF EXISTING METHODS In this section, we examine rior aroaches [11], [17] to estimating single-flow usage and whether their results can be generalized to recover the PMF of L. A. Single-Flow Usage To evaluate single-flow estimators, we use the following definition that is commonly used in statistics [2]. Definition 1: Estimator e(r l ) is called unbiased if E[e(R l )] = l for all l 1. Unbiased estimation is a key roerty of an estimator as it allows accurate estimation of the total contribution from a sufficiently large ool of flows (e.g., one customer network). However, since large flows are tyically rare, one commonly 3

4 unbiased unbiased Fig. 2. Exectation of estimator (13) in s and its (14). Fig. 3. RRMSE of (13) in s and its (16). faces an additional requirement to estimate their size with just a single samle e(r l ), which is formalized in the next definition. Definition 2: Estimator e(r l ) is called elehant-accurate if e(r l )/l 1 in mean-square as l. Elehant-accuracy guarantees that the amount of relative error between e(r l ) and l decays to zero as l. As before, suose that a flow of size l roduces a counter with value R l. Recall that [11], [17] suggest the following estimator: e(r l ) = R l 1 + 1/, (13) where is the robability of residual-geometric samling. The next result directly follows from (12). Theorem 1: Exectation E[e(R l )] is given by: E[e(R l )] = l 1 (1 ) l. (14) Note that (14) indicates that (13) is generally biased, esecially when l is small. Indeed, for l 0, we have 1 (1 ) l l and E[e(R l )] 1/ regardless of l, which shows that in such cases E[e(R l )] carries no information about the original. However, as l, it is straightforward to verify that the bias in e(r l ) vanishes exonentially, which is consistent with the analysis in [17], which has only considered the case of l. To see the extent of bias in (13) and verify (14), we aly residual-geometric samling to flows of size l ranging from 1 to ackets, feed the measured sizes to (13), and average the result after 1000 iterations for each l. Fig. 2 lots the obtained E[e(R l )] along with (14). The figure indicates that (14) indeed catures the bias and that (13) tends to overestimate the size of short flows even in exectation, where smaller samling robability leads to more error. To quantify the error of individual values e(r l ) in estimating l and to understand elehant-accuracy, denote by Y l = e(r l )/l and define the Relative Root Mean Square Error (RRMSE) to be: δ l = E[(Y l 1) 2 ]. (15) Note that δ l 0 indicates that Y l 1 in mean-square and thus imlies elehant-accurate estimation. The next result derives δ l in closed form. Theorem 2: The RRMSE of (13) is given by: 1 l(l 1) δ l = 2 (1 ) l (1 ) l+1 l 2 2 (1 (1 ) l. (16) ) Observe from (16) that for flows with size l = 1, the relative error is 1 /, but as l, δ l 0 and the estimator is elehant-accurate. Fig. 3 lots (16) against s, indicating a close match. The figure also shows that the RRMSE starts from 1/ and decreases towards zero as Θ(1/l) as l. B. Flow-Size Distribution We now investigate whether e(r L ) defined in (13) can be used to estimate the actual flow-size distribution {f i } i=1. Denote by q i = P (e(r L ) = i) the PMF of s among the samled flows. To understand our objectives with aroximating the PMF of L, the following definition is in order. Definition 3: An estimator {q i } i=1 of PMF {f i} i=1 is called asymtotically unbiased if q i converges in robability to f i for all i as the number of samled flows M. The next theorem follows directly from (10). Theorem 3: The PMF of s estimated from (13) is given by: j=y(i) q i = f j(1 ) j y(i), (17) s where y(i) = i + 1 1/ and s is in (9). The result in (17) indicates that each q i is different from f i regardless of the samling duration and thus cannot be used to aroximate the flow-size distribution. We verify (17) with a simulated acket stream with 5M flows, where flow sizes follow a ower-law distribution P (L i) = 1 i α for i = 1, 2,... and α = 1.1. Fig. 4 lots the CCDF of random variable e(r L ) obtained from s as well as (17), both in comarison to the tail of the actual distribution. The figure shows that (17) accurately redicts the values obtained from s and that PMF {q i } is indeed quite different from {f i }. So far, our study of existing methods in residual-geometric samling has shown that they are not only generally biased, but also unable to recover the flow-size distribution from residuals R L. This motivates us to seek better estimation aroaches, which we erform next. 4

5 Fig. 4. Distribution {q i } in s and its (17). Fig. 6. RRMSE of (19) in s and (21). unbiased Fig. 5. unbiased Exectation of estimator (19) in s. V. URGE This section rooses a family of algorithms called Unbiased Residual-Geometric Estimators (URGE), roves their accuracy, and verifies them in s. A. Single-Flow Usage For estimating individual s, we first consider an estimator directly imlied by the result in (12). Notice that solving (12) for l and exressing l in terms of E[R l ], we get: 1 ( ) l = u log(1 ) W u(1 ) u log(1 ), (18) where u = E[R l ] + 1/ 1 and W (z) is Lambert s function (i.e., a multi-valued solution to W e W = z) [4]. Thus, a ossible estimator can be comuted from (18) with E[R l ] relaced by the measured value of geometric residual R l. However, there are two reasons that (18) is a bad estimator of s. First, Lambert s function W (z) has no closed form solution and has to be numerically solved using tools such as Matlab. Second, it can be verified (not shown here for brevity) that (18) is not an unbiased estimator. Instead, we define a new estimator: ê(r l ) = R l 1 + 1/ (1 )R l. (19) and next show that it is unbiased. Lemma 3: Estimator ê(r l ) in (19) is unbiased, i.e., E[ê(R l )] = l. (20) We lot in Fig. 5 results obtained from (19). The figure indicates that ê(r l ) accurately estimates s for all flows in both cases of. Next, we derive the RRMSE of URGE. Theorem 4: The RRMSE of (19) is given by: 1 + l( 2)(1 ) ˆδ l = l (1 ) 2l+1 l 2 2 (1 (1 ) l. (21) ) It is easy to verify from (21) that URGE has zero RRMSE for l = 1 or l, confirming its elehant-accuracy. We lot ˆδ l obtained from s along with the in Fig. 6, which shows that (21) accurately tracks the actual relative error. From Figures 5-6, it is clear that ê(r l ) significantly imroves the accuracy of estimating small s comared to e(r l ). In ractice, (21) can be used to determine threshold l 0, which leads to desired bounds on error for all l l 0 and allows ISPs to use e(r l ) instead of l. B. Flow-Size Distribution It is worth mentioning that while (19) roduces unbiased estimation of s, ê(r L ) is not suitable for comuting the flow-size distribution, as we show below. Denote by ˆq i = P (ê(r L ) = i) the PMF of ê(r L ). Then, we have the following result. Lemma 4: PMF of ê(r L ) is given by: ˆq i = 1 s j=y(i) where s is in (9), function y(i) is: (1 ) j y(i) f j, (22) y(i) = i + 1 1/ ω, (23) and ω = W ( (1 ) i+1 1/ log(1 ) ). Notice from (22)-(23) that distribution ˆq i does not even remotely aroximate the original PMF f i. This roblem is fundamental since residual samling exhibits bias towards larger flows and even if we could recover L from R L exactly, the distribution of samled s would not accurately aroximate that of all flows assing through the router. We thus exlore another technique for estimating the flowsize distribution. Before doing that, we need the next lemma. Lemma 5: The distribution f i can be exressed using the PMF of geometric residuals {h i } in (10) as: f i = h i (1 )h i+1 + (1 )h 1. (24) 5

6 , M = 194, 208, M = 26, 233 (a) =, M = 3, 090 (b) = 10 5, M = 337 Fig. 7. Estimator (25) in s. Fig. 8. Estimator (25) in s with very small. This result leads to a new estimator for the flow-size distribution: q i = M i (1 )M i+1 M + (1 )M 1, (25) where M is the total number of samled flows and M i is the number of them with the geometric residual equal to i. Since M i /M h i in robability as M (from the weak law of large numbers), we immediately get the following result. Corollary 2: The estimator in (25) is asymtotically unbiased. We next verify the accuracy of q i in s with 5M flows in the same setting as in the revious section. We lot in Fig. 7 the CCDF estimated from (25) along with the actual distribution. The figure shows that q i accurately follows the for both cases of. C. Convergence Seed We next examine the effect of samle size M on the convergence of estimator q i. To illustrate the roblems arising from small M, we study (25) with = and 10 5 in s with the same 5M flows. The estimator obtained M = 3, 090 flows for = and just M = 337 for = Fig. 8 indicates that while the estimated curves under both choices of still aroximate the trend of the original distribution, they exhibit different levels of noise. As the next result indicates, small leads to a small samle size M and thus more noise in the estimated values. Corollary 3: Suose that M flows are selected by residual-geometric samling from a total of n flows. Then, the exected value of M is given by: E[M] = n s = ne[1 (1 ) L ]. (26) To shed light on the choice of roer for RGS, we show how to determine the minimum M that would guarantee a certain level of accuracy in q i. Define h i = M i /M to be an estimate of h i = P (R L = i). The next lemma follows from Lemma 5 and Corollary 2 and indicates that the accuracy of q i directly deends on whether h i aroximates h i accurately. Lemma 6: Suose that h j h j ηh j holds with robability 1 ξ for j [1, i + 1] and small constants η and ξ. Then, there exists a constant ζ: ζ = η( + 2η(1 )h 1) + (1 )(1 η)h 1 (27) such that ζ 0 as η 0 and P ( q i f i ζf i ) = 1 ξ. Next, we obtain a bound on M from the requirement that h i be bounded in robability within a given range [h i (1 η), h i (1 + η)]. Theorem 5: For small constants η and ξ, h i h i ηh i holds with robability 1 ξ if samle size M is no less than: M (1 h i) h i η 2 ( Φ 1 (1 ξ/2) ) 2, (28) where Φ(x) is the CDF of the standard Gaussian distribution N (0, 1). For examle, to bound h i within 10% ercent of h i (i.e., η = 0.1) with robability 1 ξ = 95% for all h i, the following must hold: M ( ) , (29) which indicates that M = 38K flows must be samled to achieve target accuracy. If we reduce η to 1%, increase 1 ξ to 99%, and require the aroximation to hold for all h i, then M must be at least 66M flows. Converting η into ζ using (27), one can establish similar bounds on the deviation of q i from f i. D. Estimation of Other Flow Metrics Besides s and the flow-size distribution, URGE also rovides estimators for the total number of flows and the number of them with size i. Before introducing these estimators, we need the next lemma. Lemma 7: The exected number of flows with samled residuals R L = i is: E[M i ] = E[M]h i = nh i s, (30) where h i is the PMF of geometric residuals R L and s is given by (9). Based on this, we next develo two estimators and rove their accuracy. Let ñ be an estimator of the total number of flows n observed in the measurement window [0, T ]: ñ = M + (1 ) M 1 (31) and ñ i be an estimator of the number of flows n i with size i: ñ i = M i (1 )M i+1. (32) 6

7 Then, the next result shows that both of these estimators are asymtotically unbiased. Lemma 8: Ratios ñ/n and ñ i /n i converge to 1 in robability as M. Note that [17] rovided a similar estimator as (31) and roved E[ñ] = n using a different aroach from ours; however, our results are stronger as they show convergence in robability and additionally address estimation of n i. Simulations verifying (31)-(32) are omitted for brevity. E. Active Flows In a tyical imlementation of URGE, one needs a flow table to kee a maing between flow identifiers and associated counters. An imortant element of any samling algorithm is to ensure that the table kees only active flows, which can be accomlished by eriodic swees through RAM and removal of all flows that have comleted. Such a strategy together with RGS can achieve a significant reduction in the table size. To understand how much benefit removal of dead flows rovides to memory consumtion, we next derive the exected number of active flows at any time t and their fraction samled by the algorithm. Assume a measurement window [0, T ], where T is given in ackets seen by the router. For each flow j, let inter-acket delays within the flow be given by a random variable j, which counts the number of acket arrivals from other flows between adjacent ackets of j. Denoting by = E[ j ], we have the following result. Lemma 9: Assuming stationary flow arrivals in [0, T ] and T, the exected number of active flows N(t) at time t is given by: E[N(t)] = + 1. (33) Our baseline reduction in flow volume comes from geometric samling in revious sections and reduces the number of flows by a factor of r 1 = n/e[m]. Now additionally define ratio r 2 = n/e[n(t)] = T/( + 1)E[L] and observe that longer observation windows (i.e., larger T ), smaller flow sizes (i.e., smaller E[L]), and denser arrivals (i.e., smaller ) imly more savings of memory. In fact, T results in r 2 if the other arameters are fixed. However, even more reduction is ossible by discarding dead flows in RGS. Denote by M(t) the number of samled flows that are still alive at t and consider the next result. Lemma 10: Assuming the flow arrival rocess is stationary in [0, T ] and T, the exected number of active samled flows at time t is given by: ( E[M(t)] = ( + 1) 1 1 ) E[L] s, (34) where s in (9) is the fraction of all flows samled by RGS. Define r 3 = n/e[m(t)] and notice that it increases not only as T grows, but also when decreases. Performing a selfcheck using Jensen s inequality, observe that 0 s /E[L] 1 and therefore E[M(t)] E[N(t)], which means that the former indeed always results in more reduction in table size. Simulations with heavy-tailed flows (omitted due to limited sace) show that the is very accurate. actual estimated (a) E[e(R l )] actual estimated (b) E[ê(R l )] Fig. 9. Estimating single-flow usage in the FRG trace with = (a) δ l (b) ˆδ l Fig. 10. RRMSE of single-flow usage in the FRG trace with = VI. PERFORMANCE EVALUATION In this section, we evaluate our s using several Internet traces in Table I from NLANR [24] and CAIDA [3]. Trace FRG was collected from a gigabit link between UCSD and Abilene in We extracted from it additional traces with only Web, DNS, and NTP flows (also seen in the table). Additionally, we use three traces from CAIDA: LARGE a one-hour trace from an OC48 link, MEDIUM a one-minute trace from a OC192 link, and SMALL a 7-minute trace from a gigabit link. As the table shows, URGE tyically sees a reasonably large number of flows M over the entire interval [0, T ]; however, the number of active flows N(t) and those constantly ket in memory M(t) is much smaller. For the FRG trace, for examle, E[M] is 15 times smaller than n, while E[N(t)] is 81 and E[M(t)] is 658 times smaller. In general, NLANR traces benefit more from the removal of dead flows than CAIDA data, because former was collected over two consecutive days and thus had a larger observation window T, which led to larger ratios r 2 and r 3. The same reasoning also exlains the fact that the LARGE trace exhibits much higher benefit from removing dead flows than MEDIUM or SMALL traces. A. Estimation Accuracy First, we examine the roblem of estimating the total number of flows n in [0, T ] and size-one flows n 1 in this interval. The fifth and eighth columns of Table II list the absolute error of s (31) and (32), resectively. The table indicates that these estimates are commonly within 2.5% of the correct value. We next evaluate the erformance of URGE in estimating single-flow usage. Fig. 9 lots the exectation of estimated 7

8 TABLE I REDUCTION IN THE NUMBER OF FLOWS USING RESIDUAL SAMPLING WITH = 0.01 AND DIFFERENT TYPES OF PERIODIC REMOVAL OF DEAD FLOWS source trace total flows n total kts ne[l] samling only removal only both E[M] r 1 E[N(t)] r 2 E[M(t)] r 3 FRG 1, 756, , 821, , , , NLANR Web 239, 174 6, 497, , , DNS 120, , 977 2, , 797 NTP 382, , 447 4, , , 887 LARGE 9, 653, , 250, , , , CAIDA MEDIUM 2, 317, , 837, , , , SMALL 200, 9, 179, , , , source NLANR CAIDA TABLE II PERFORMANCE OF URGE WITH = trace # of flows # of size-one flows actual (n) estimated (ñ) error actual (n 1 ) estimated (ñ 1 ) error FRG 1, 756, 702 1, 736, % 768, , % Web 239, , % 13, , % DNS 120, , % 76, , % NTP 382, , % 281, , % LARGE 9, 653, 609 9, 717, % 4, 535, 449 4, 630, % MEDIUM 2, 317, 369 2, 278, % 1, 299, 343 1, 273, % SMALL 200, 902, % 93, , % (c) = Fig. 11. Estimating the distribution using URGE in the FRG trace. s (averaged over 100 iterations) along with the actual values obtained from the FRG trace using = The figure shows that the estimator e(r l ) from revious work tends to overestimate the sizes of small flows, while URGE s estimator ê(r l ) accurately follows the actual values. We also comare the relative errors of the two studied methods in Fig. 10, which indicates that URGE has RRMSE bounded by 1 for all flows, while e(r l ) exhibits very large δ l for small and medium flows, which is an increasing function of 1/. For the flow-size distribution, we first examine three values of to comare its effect on the accuracy of URGE in the FRG trace. Fig. 11 indicates that estimation for all three values of are very consistent and all of them follow the accurately. In our exeriments with = , URGE recovered the original PMF {f i } using only M = 7, 616 total flows out of n = 1.75M. Finally, we aly URGE with = to NLANR traces of different traffic tyes and lot in Fig. 12 the estimated distributions along with the actual ones. As the figure shows, the flow statistics of different alications can be accurately estimated by URGE. We observe a similar match in our exeriments with three CAIDA traces as shown in Fig. 13. VII. CONCLUSION In this aer, we roved that revious methods based on residual-geometric samling had certain bias in estimating single-flow usage and were unable to recover the flow-size distribution from the samled residuals. To overcome this limitation, we roosed a novel ing framework for analyzing residual samling and develoed a set of algorithms that were able to erform accurate estimation of flow statistics, even under the constraints of small router RAM size, short trace duration, and low CPU samling overhead. REFERENCES [1] D. Brauckhoff, B. Tellenbach, A. Wagner, A. Lakhina, and M. May, Imact of Traffic Samling on Anomaly Detection Metrics, in Proc. ACM IMC, Oct. 2006, [2] G. Casella and R. L. Berger, Statistical Inference, 2nd ed. Duxbury/Thomson Learning, [3] Cooerative Association for Internet Data Analysis (CAIDA). [Online]. Available: htt:// [4] R. M. Corless, G. H. Gonnet, D. E. G. Hare, D. J. Jeffrey, and D. E. Knuth, On the Lambert W Function, Advances in Comutational Mathematics, vol. 5, , [5] N. Duffield and C. Lund, Predicting Resource Usage and Estimation Accuracy in an IP Flow Measurement Collection Infrastructure, in Proc. ACM IMC, Oct. 2003,

9 10 5 (a) Web (b) NTP (c) DNS Fig. 12. Estimating the distribution using URGE in NLANR traces with = (a) LARGE 10 5 (b) MEDIUM (c) SMALL Fig. 13. Estimating the distribution using URGE in CAIDA traces with = [6] N. Duffield, C. Lund, and M. Thoru, Charging from Samled Network Usage, in Proc. ACM IMW, Nov. 2001, [7] N. Duffield, C. Lund, and M. Thoru, Estimating Flow Distributions from Samled Flow Statistics, in Proc. ACM SIGCOMM, Aug. 2003, [8] N. Duffield, C. Lund, and M. Thoru, Flow Samling under Hard Resource Constraints, in Proc. ACM SIGMETRICS, Jun. 2004, [9] N. Duffield, C. Lund, and M. Thoru, Learn More, Samle Less: Control of Volume and Variance in Network Measurement, IEEE Trans. Inform. Theory, vol. 51, no. 5, , May [10] C. Estan, K. Keys, D. Moore, and G. Varghese, Building a Better Netflow, in Proc. ACM SIGCOMM, Aug. 2004, [11] C. Estan and G. Varghese, New Directions in Traffic Measurement and Accounting, in Proc. ACM SIGCOMM, Aug. 2002, [12] W. Fang and L. Peterson, Inter-AS Traffic Patterns and their Imlications, in Proc. IEEE GLOBECOM, Dec. 1999, [13] A. Feldmann, A. Greenberg, C. Lund, N. Reingold, J. Rexford, and F. True, Deriving Traffic Demands for Oerational IP Networks: Methodology and Exerience, IEEE/ACM Trans. Netw., vol. 9, no. 3, , Jun [14] N. Hohn and D. Veitch, Inverting Samled Traffic, in Proc. ACM IMC, Oct. 2003, [15] C. Hu, S. Wang, J. Tian, B. Liu, Y. Cheng, and Y. Chen, Accurate and Efficient Traffic Monitoring Using Adative Non-linear Samling Method, in Proc. IEEE INFOCOM, Ar. 2008, [16] K. Ishibashi, R. Kawahara, T. Mori, T. Kondoh, and S. Asano, Effect of Samling Rate and Monitoring Granularity on Anomaly Detectability, in Proc. IEEE Global Internet Symosium, May 2007, [17] R. R. Komella and C. Estan, The Power of Slicing in Internet Flow Measurement, in Proc. USENIX/ACM IMC, Oct. 2005, [18] A. Kumar, M. Sung, J. Xu, and J. Wang, Data Streaming Algorithms for Efficient and Accurate Estimation of Flow Size Distribution, in Proc. ACM SIGMETRICS, Jun. 2004, [19] A. Kumar, M. Sung, J. Xu, and E. Zegura, A Data Streaming Algorithm for Estimating Suboulation Flow Size Distribution, in Proc. ACM SIGMETRICS, Jun. 2005, [20] A. Kumar and J. Xu, Sketch Guided Samling Using On-Line Estimates of Flow Size for Adative Data Collection, in Proc. IEEE INFOCOM, Ar. 2006, [21] A. Kumar, J. Xu, J. Wang, O. Satschek, and L. Li, Sace-Code Bloom Filter for Efficient Per-Flow Traffic Measurement, in Proc. IEEE INFOCOM, Mar. 2004, [22] Y. Lu, A. Montanari, B.Prabhakar, S. Dharmaurikar, and A. Kabbani, Counter Braids: A Novel Counter Architecture for Per-Flow Measurement, in Proc. ACM SIGMETRICS, Jun. 2008, [23] J. Mai, C.-N. Chuah, A. Sridharan, T. Ye, and H. Zang, Is Samled Data Sufficient for Anomaly Detection? in Proc. ACM IMC, Oct. 2006, [24] National Laboratory for Alied Network Research (NLANR). [Online]. Available: htt://moat.nlanr.net/. [25] Cisco Samled NetFlow. [Online]. Available: htt:// US/docs/ios/12 0s/feature/guide/12s sanf.html. [26] R. Pan, L. Breslau, B. Prabhakar, and S. Shenker, Aroximate Fairness through Differential Droing, ACM SIGCOMM Com. Comm. Rev., vol. 33, no. 2, , Ar [27] S. Ramabhadran and G. Varghese, Efficient Imlementation of a Statistics Counter Architecture, in Proc. ACM SIGMETRICS, Jun. 2003, [28] X. Wang, X. Li, and D. Loguinov, Modeling Residual-Geometric Flow Samling (extended version), Texas A&M University, Tech. Re , Dec [Online]. Available: htt://irl.cs.tamu.edu/ublications/. [29] X. Wang, Z. Yao, and D. Loguinov, Residual-Based Measurement of Peer and Link Lifetimes in Gnutella Networks, in Proc. IEEE INFOCOM, May 2007, [30] L. Yang and M. Michailidis, Samled Based Estimation of Network Traffic Flow Characteristics, in Proc. IEEE INFOCOM, May 2007, [31] Q. Zhao, J. Xu, and Z. Liu, Design of a Novel Statistics Counter Architecture with Otimal Sace and Time Efficiency, in Proc. ACM SIGMETRICS, Jun. 2006,

Modeling Residual-Geometric Flow Sampling

1 Modeling Residual-Geometric Flow Samling Xiaoming Wang, Xiaoyong Li, and Dmitri Loguinov Abstract Traffic monitoring and estimation of flow arameters in high seed routers have recently become challenging