Modeling Residual-Geometric Flow Sampling


Xiaoming Wang, Xiaoyong Li, and Dmitri Loguinov

Abstract

Traffic monitoring and estimation of flow parameters in high-speed routers have recently become challenging as the Internet grew in both scale and complexity. In this paper, we focus on a family of flow-size estimation algorithms we call Residual-Geometric Sampling (RGS), which generates a random point within each flow according to a geometric random variable and records all remaining packets in a flow counter. Our analytical investigation shows that previous estimation algorithms based on this method exhibit certain bias in recovering flow statistics from the sampled measurements. To address this problem, we derive a novel set of unbiased estimators for RGS, validate them using real Internet traces, and show that they provide an accurate and scalable solution to Internet traffic monitoring.

I. INTRODUCTION

Recent growth of the Internet in both scale and complexity has imposed a number of challenges on network management, operation, and traffic monitoring. The main problem in this line of work is to scale measurement algorithms to achieve certain objectives (e.g., accuracy) while satisfying real-time resource constraints (e.g., fixed memory consumption and per-packet processing delay) of high-speed Internet routers. This is commonly accomplished (e.g., [5], [6], [7], [8], [9], [10], [11], [14], [15], [17], [21], [18], [19], [20], [22], [26], [32]) by reducing the amount of information a router has to store in its internal tables, which comes at the expense of deploying special estimation techniques that can recover metrics of interest from the collected samples.

In this paper, we study two problems in the general area of measuring flow sizes: 1) determining the number of packets transmitted by elephant flows [11], [15], [17], [21], [20], [22] and 2) building the distribution of flow sizes seen by the router in some time window [7], [18], [32], coupled in a single measurement technique.
The former problem arises in usage-based accounting and traffic engineering [6], [11], [12], [13], [27], while the latter has many security applications such as anomaly and intrusion detection [1], [23], [16].

Our interest falls within the family of residual sampling, which selects a random point $A$ within each flow and then samples the remainder $R$ of that flow until it ends. Denoting by $L$ the size (in packets) of a random flow, sampled residuals $R$ are simply $L - A$. Stochastically larger $A$ results in fewer flows being sampled and leads to lower overhead in terms of both CPU and RAM consumption. Besides reduced overhead arising from omission of many small flows from counter tables, residual sampling guarantees to capture large flows with probability $1 - o(1)$ as their size $L \to \infty$. This allows ISPs to determine heavy-hitters and charge the corresponding customers for generated traffic.

A shorter version of this paper appeared in IEEE INFOCOM. Xiaoming Wang is with Amazon.com, Seattle, WA USA (xmwang@gmail.com). Xiaoyong Li and Dmitri Loguinov are with Texas A&M University, College Station, TX USA ({xiaoyong, dmitri}@cse.tamu.edu).

While in P2P networks residual sampling distributes the initial point $A$ uniformly within user lifetimes [31], flow-based estimation [11], [17] usually employs geometric $A$ since it can be easily implemented with a sequence of independent Bernoulli variables. We call the resulting approach Residual-Geometric Sampling (RGS) and note that it has received some limited analytical attention in [11], [17]; however, unbiased estimation of individual flow sizes, analysis of the resulting error, asymptotically accurate recovery of flow-size distribution $P(L = i)$ from sampled residuals $R$, and analysis of space-CPU requirements in steady state have not been explored. We overcome these issues below.

A. Single-Flow Usage

We start with the problem of obtaining sizes of individual flows for accounting purposes.
Since residual sampling requires an estimator to convert residuals into the metrics of interest, our first task is to define proper notation and desired properties for the estimation algorithm. Assume that for a flow of size $L$ the sampling algorithm produces residual $R_L$, where both $L$ and $R_L$ are random variables. We call an estimator $e(R_L)$ unbiased if its expectation produces the correct flow size, i.e., $E[e(R_L) \mid L = l] = E[e(R_l)] = l$. Unbiased estimation allows one to average the estimated size of several flows of a given size $l$ and accurately estimate their total contribution. We further call an estimator elephant-accurate if ratio $e(R_l)/l$ converges to 1 in mean-square as $l \to \infty$. Elephant-accuracy ensures that the variance of $e(R_l)/l$ tends to zero as $l \to \infty$, which means that the amount of relative error between $e(R_l)$ and $l$ becomes negligible for large flows.

Prior work on RGS [11], [17] has suggested the following estimator:

$$e(R_L) = R_L - 1 + \frac{1}{p}, \tag{1}$$

where $0 < p \le 1$ is the parameter of geometric variable $A$. To understand the performance of (1), we first build a general probabilistic model for residual-geometric sampling and derive the relationship between $L$ and its residual $R_L$. Using this result, we prove that:

$$E[e(R_l)] = \frac{l}{1 - (1-p)^l}, \tag{2}$$

which indicates that (1) is generally biased and on average tends to overestimate the original flow size by a factor of up to $1/p$. To address this problem, we derive a different estimator:

$$\hat e(R_L) = R_L - 1 + \frac{1}{p} - \frac{(1-p)^{R_L}}{p} \tag{3}$$

and prove that it is both unbiased and elephant-accurate. We also derive in closed form the mean-square error $\delta_l = E[(\hat e(R_l)/l - 1)^2]$ for finite $l$, which can be used to determine when (3) approximates the true flow size with accuracy sufficient for billing purposes.

B. Flow-Size Distribution

Our second problem is estimation of the original flow-size probability mass function (PMF), which we assume is given by $f_i = P(L = i)$, $i = 1, 2, \ldots$ We call PMF estimator $q_i$ asymptotically unbiased if it converges in probability to $f_i$ for all $i$ as the number of sampled flows $M \to \infty$. One may be at first tempted to compute this distribution based on the values produced by either (1) or (3) for each observed flow; however, we show that such $q_i$ almost always differ from the original distribution $f_i$ and the bias persists as sample size $M \to \infty$. The reason for this discrepancy is that $e(\cdot)$ and $\hat e(\cdot)$ both estimate the sizes of flows that have been sampled by the algorithm, which are not representative of the entire population passing through the router. Since longer flows are more likely to be selected by residual sampling, this approach severely overestimates their fraction and thus skews the PMF towards the tail.

Denote by $M_i$ the number of sampled flows with $R_L = i$ and define a new estimator:

$$\tilde q_i = \frac{M_i - (1-p) M_{i+1}}{pM + (1-p) M_1}. \tag{4}$$

Using the general model of RGS derived later in the paper, we prove that $\tilde q_i$ tends to $f_i$ in probability as $M = \sum_i M_i \to \infty$ and obtain the amount of error $|\tilde q_i - f_i|$ for finite $M$. We also provide asymptotically unbiased estimators for the total number of flows $n$:

$$\tilde n = M + \frac{1-p}{p} M_1 \tag{5}$$

and the number of flows $n_i$ with exactly $i$ packets:

$$\tilde n_i = \frac{M_i - (1-p) M_{i+1}}{p}, \tag{6}$$

where $\tilde n / n \to 1$ and $\tilde n_i / n_i \to 1$, both in probability, as $M \to \infty$. We call the resulting combination (3), (4)-(6) Unbiased Residual-Geometric Estimators (URGE).

C. Implementation and Evaluation

We finish the paper by discussing an efficient implementation of the above algorithms and evaluating their accuracy and performance using several Internet traces.
Prior work has not discussed how residual sampling should be implemented or its overhead in steady state, which prompts a fairly detailed exposition below. We assume URGE uses a chain-linked hash table of size $K$, which keeps individual flow counters. Each linked list is sorted by flow ID and is traversed linearly until a match is found or an ID larger than the one being sought is encountered. Keeping the list sorted (as opposed to FIFO) reduces the lookup delay by half for flows not already in the table. To reduce RAM overhead, we remove flows from the table if they have completed (i.e., FIN or RST packets are detected) or if no packets from these flows arrive within some timeout $\tau$. To keep the overhead manageable, the removal process is run over the entire table on the timescale of seconds or even minutes.

As before, assume that the router sees a total of $n$ flows in window $[0, T]$. Then, denote by $N(t)$ the number of active flows at time $t$ and by $M(t)$ the number of them sampled by the router. It then follows that memory consumption $W_R(t)$ and lookup delay $T_R(t)$ are both functions of $M(t)$. Under certain mild assumptions, we obtain a simple result on $E[M(t)]$ and show that even as the total number of flows $n \to \infty$, both RAM usage and CPU overhead of RGS remain constant. We then explore how to satisfy the tradeoff between three design objectives (memory consumption, processing speed, and accuracy) using parameters $K$ and $p$. Given upper bounds on memory usage $W_0$ and per-packet processing delay $T_0$, we propose a technique for deciding $K$ based on the above analysis such that $W_R(t) \le W_0$ and $T_R(t) \le T_0$ are satisfied, while maximizing $p$ at the same time (i.e., achieving the best accuracy within the constraints).

We finish the paper by evaluating URGE with real Internet traces obtained from NLANR [24] and CAIDA [3].
Our experiments reveal that the proposed algorithm produces very accurate estimation of flow metrics and thus allows one to perform more aggressive sampling (i.e., smaller probability $p$) of the monitored traffic. With $p = 0.01$, we find that $E[M(t)]$ is […] times smaller than $n$ and […] times smaller than $E[M]$, with most lookups requiring just 1-2 RAM hits. We also discover in the experiments with small traces that URGE does not degrade significantly in terms of accuracy even for small sample sizes, which makes it suitable for monitoring individual customer networks and certain protocols.

The remainder of the paper is organized as follows. We review prior work on traffic monitoring in Section II. We then develop a probabilistic model for residual-geometric sampling in Section III, analyze previous methods in Section IV, and propose the new estimators in Section V. We explore the implementation of the suggested framework in Section VI, evaluate its performance in Section VII, and conclude the paper in Section VIII.

II. RELATED WORK

In this section, we review several sampling algorithms in the area of traffic monitoring. In particular, we classify existing work into two categories: packet sampling and flow sampling, where the former makes per-packet and the latter per-flow decisions to sample incoming traffic.

A. Packet Sampling

Sampled NetFlow (SNF) [26] is a widely used technique in which incoming packets are sampled with a fixed probability $p$. The general goal of SNF is to obtain the PMF of flow sizes; however, [14] shows that it is impossible to accurately recover the original flow-size distribution from sampled SNF data. Estan et al. [10] propose Adaptive NetFlow (ANF), which adjusts the sampling probability according to the size of the flow table; however, ANF's bias in the sampled data is equivalent to that in SNF and is similarly difficult to overcome in practice.

Instead of using one uniform probability $p$ for all flows as in [10], [26], another direction in packet sampling is to compute $p_i(c)$ for each flow $i$ based on its currently observed size $c$. This approach has been studied by two independent papers, Sketch-Guided Sampling (SGS) [20] and Adaptive Non-Linear Sampling (ANLS) [15]. A common feature of these two methods is to sample a new flow with probability 1 and then monotonically decrease $p_i(c)$ as $c$ grows. Both methods must maintain a counter for each flow present in the network and are difficult to scale due to the high RAM/CPU usage.

B. Flow Sampling

In flow thinning [14], each flow is sampled independently with probability $p$ and then all packets in sampled flows are counted. Hohn et al. [14] show that flow thinning is able to accurately estimate the flow-size distribution; however, this method typically misses a fraction $1 - p$ of elephant flows and thus does not support applications such as usage-based accounting and traffic engineering [6], [11], [12], [13], [27]. For highly skewed distributions with a few extremely large flows and many short ones (which is typical for Internet links), this method may also take a long time to converge.

To address these problems of flow thinning, Estan et al. [11] introduce a size-dependent flow sampling algorithm called Sample-and-Hold (S&H), which is proposed to identify elephant flows. For each packet from a new flow, the algorithm creates a flow counter with probability $p$; once a flow is sampled, all of its subsequent packets are then counted. It is easy to verify that S&H samples a flow of size $l$ with probability $1 - (1-p)^l$, which quickly approaches 1 as $l$ grows. Creating a unifying analytical model for this approach and understanding the properties of samples it collects is the main topic of this paper.

Another direction of size-dependent flow sampling has been explored by Duffield et al. in [5], [6], [8], which present another size-dependent flow measurement method called Smart Sampling. Their approach selects each flow of size $L$ with probability $p(L) = \min(1, L/z)$, where $z$ is some constant. Since this method requires knowing $L$ before deciding whether to sample the flow or not, it can only be applied off-line. Kompella et al. [17] examine a method called Flow Slicing (FS), which combines SNF and S&H with a variant of smart sampling. Other non-sampling methods include exact counting [25], [28], [30], [33] and lossy counting [18], [22], which are orthogonal to our work.

III. UNDERLYING MODEL

In this section, we build a general probabilistic model of Sample-and-Hold [11] and establish the necessary analytical foundation for the results that follow.

A. Sample-and-Hold

Consider a sequence of packets traversing a router and assume that its flow-measurement algorithm checks each packet's flow identifier $x$ in some RAM table. If $x$ is found in the table, the corresponding counter is incremented by 1; otherwise, with probability $p$ a new entry for $x$ is created in the table (with counter value 1) and with probability $1 - p$ the packet is ignored.

Fig. 1. Residual-geometric sampling of a flow with size $L$ (the figure marks the packets in a flow, age $A_L$, residual $R_L$, discarded packets, and sampled packets).

To model this process, we first need several definitions. Assume that flow sizes are i.i.d. random variables and define geometric age $A_L$ to be the number of packets discarded from the front of a flow with size $L$ before it is sampled (see Fig. 1). Let $G$ be a shifted geometric random variable with success probability $p$, i.e., $P(G = j) = (1-p)^j p$ for $j = 0, 1, 2, \ldots$ It thus follows that $A_L$ is simply:

$$A_L = \min(G, L). \tag{7}$$

Now define geometric residual $R_L$ to be the final counter value of a flow of size $L$ conditioned on the fact that it has been sampled (i.e., $A_L < L$):

$$R_L = L - A_L, \tag{8}$$

which is also illustrated in Fig. 1.
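The sampling process and definitions (7)-(8) can be sketched in a few lines of Python. This is an illustrative sketch, not code from the paper: the function names `rgs_residual` and `rgs_residual_via_age` are hypothetical, and a flow is represented only by its packet count. The first function mimics the per-packet table logic; the second draws $G$ directly and applies $A_L = \min(G, L)$ and $R_L = L - A_L$.

```python
import random


def rgs_residual(flow_size, p, rng):
    """Per-packet view: return the final counter of one flow, or None if missed.

    A packet of an untracked flow creates a counter with probability p;
    once the counter exists, every later packet increments it.
    """
    counter = 0
    for _ in range(flow_size):
        if counter > 0:
            counter += 1            # flow already in the table: count the packet
        elif rng.random() < p:
            counter = 1             # first sampled packet creates the entry
    return counter if counter > 0 else None


def rgs_residual_via_age(flow_size, p, rng):
    """Model view: draw G with P(G = j) = (1-p)^j p, then R_L = L - min(G, L)."""
    g = 0
    while rng.random() >= p:        # each failure discards one more packet
        g += 1
        if g >= flow_size:          # age reached L: the flow was never sampled
            return None
    return flow_size - g


if __name__ == "__main__":
    rng = random.Random(7)
    sampled = [r for r in (rgs_residual(10, 0.1, rng) for _ in range(100000)) if r]
    # Fraction of sampled flows should be close to 1 - (1 - p)^l.
    print(len(sampled) / 100000, 1 - 0.9 ** 10)
```

Both functions return `None` for flows that are never sampled, which matches the conditioning on $A_L < L$ in the definition of $R_L$.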
From the perspective of traffic monitoring in this paper, geometric residual $R_L$ is the only quantity collected during measurement and available to an estimation algorithm. Since this approach belongs to the class of residual-sampling techniques [31] and specifically uses geometric age, this paper calls S&H by a more mathematically specific name, Residual-Geometric Sampling (RGS).

Assume that $L$ has a PMF $f_i = P(L = i)$, where $i = 1, 2, \ldots$, and denote by $p_s = P(A_L < L)$ the probability that a random flow is sampled. Then, we have the following result.

Lemma 1: Probability $p_s$ that a flow is selected by RGS is:

$$p_s = E[1 - (1-p)^L] = 1 - \sum_{i=1}^{\infty} f_i (1-p)^i. \tag{9}$$

Proof: Observe that for a fixed $L = l$, we have $P(A_l < l) = 1 - (1-p)^l$. Unconditioning $L$, we immediately get (9).

Next, let $h_i = P(R_L = i)$ be the PMF of geometric residual $R_L$. The following lemma expresses $h_i$ in terms of $f_i$.

Lemma 2: The PMF of geometric residual $R_L$ is:

$$h_i = \frac{p}{p_s} \sum_{j=i}^{\infty} f_j (1-p)^{j-i}. \tag{10}$$

Proof: Using (8), we have:

$$h_i = P(R_L = i) = P(L - A_L = i \mid A_L < L) = \frac{P(L - A_L = i \cap A_L < L)}{p_s}, \tag{11}$$

where $p_s = P(A_L < L)$. Substituting (7) into (11) and combining the fact that $L - G = i \ge 1$, we establish:

$$h_i = \frac{P(L - G = i)}{p_s} = \frac{1}{p_s} \sum_{j=i}^{\infty} P(G = j - i) f_j, \tag{12}$$

which gives the desired result in (10) by substituting the PMF of $G$ into (12).

The result of Lemma 2 is fundamental as most of the results in this paper are conveniently derived from (10).

B. Fixed Flow Size

We next analyze a special case of residual sampling where the original flow size is fixed at $L = l$. Note that residuals are now $R_l$ instead of $R_L$ since the original flow size is no longer a random variable. Recall that the goal of single-flow size estimation is to obtain $l$ from $R_l$ for each sampled flow. The next corollary follows from (10) and gives the distribution and expectation of geometric residual $R_l$.

Corollary 1: Given $L = l$, the PMF of $R_l$ is:

$$P(R_l = i) = \frac{(1-p)^{l-i} p}{1 - (1-p)^l} \tag{13}$$

and its expectation is:

$$E[R_l] = \frac{l}{1 - (1-p)^l} + 1 - \frac{1}{p}. \tag{14}$$

Proof: For $L = l$, we have $f_l = 1$ and $f_i = 0$ for all $i \ne l$. Writing $p_s = 1 - (1-p)^l$, we get from (10):

$$P(R_l = i) = \frac{p \sum_{j=i}^{\infty} f_j (1-p)^{j-i}}{1 - (1-p)^l} = \frac{(1-p)^{l-i} p}{1 - (1-p)^l}, \tag{15}$$

which is exactly (13). We next derive expectation $E[R_l]$, which can be expanded into:

$$E[R_l] = E[l - A_l \mid A_l < l] = l - E[G \mid G < l]. \tag{16}$$

Recall that for any non-negative discrete random variable $Y$ taking values over the integer set $\{0, 1, \ldots\}$, its expectation is given by $E[Y] = \sum_{y=0}^{\infty} P(Y > y)$. It thus follows that (16) reduces to:

$$E[R_l] = l - \sum_{j=0}^{l-1} P(G > j \mid G < l) = \sum_{j=0}^{l-1} P(G \le j \mid G < l) = \frac{\sum_{j=0}^{l-1} P(G \le j)}{P(G < l)}. \tag{17}$$

Substituting $P(G \le j) = 1 - (1-p)^{j+1}$ into (17), we have:

$$E[R_l] = \frac{\sum_{j=0}^{l-1} [1 - (1-p)^{j+1}]}{1 - (1-p)^l} = \frac{l - (1-p)(1 - (1-p)^l)/p}{1 - (1-p)^l}, \tag{18}$$

which can be simplified to (14).

Next, we apply the results obtained in this section to analyze existing estimation methods that have been proposed for RGS.

IV. ANALYSIS OF EXISTING METHODS

In this section, we examine prior approaches [11], [17] to estimating single-flow usage and whether their results can be generalized to recover the PMF of $L$.

Fig. 2. Expectation of estimator (19) in simulations and its model (20), plotted as estimated size together with the unbiased reference: (a) $p = 0.01$; (b) $p = $ […].

A. Single-Flow Usage

To evaluate single-flow estimators, we use the following definition that is commonly used in statistics [2].

Definition 1: Estimator $e(R_l)$ is called unbiased if $E[e(R_l)] = l$ for all $l \ge 1$.

Unbiased estimation is a key property of an estimator as it allows accurate estimation of the total contribution from a sufficiently large pool of flows (e.g., one customer network). However, since large flows are typically rare, one commonly faces an additional requirement to estimate their size with just a single sample $e(R_l)$, which is formalized in the next definition.

Definition 2: Estimator $e(R_l)$ is called elephant-accurate if $e(R_l)/l \to 1$ in mean-square as $l \to \infty$.

Elephant-accuracy guarantees that the amount of relative error between $e(R_l)$ and $l$ decays to zero as $l \to \infty$.

As before, suppose that a flow of size $l$ produces a counter with value $R_l$. Recall that [11], [17] suggest the following estimator:

$$e(R_l) = R_l - 1 + \frac{1}{p}, \tag{19}$$

where $p$ is the probability of residual-geometric sampling. The next result directly follows from (14).

Theorem 1: Expectation $E[e(R_l)]$ is given by:

$$E[e(R_l)] = \frac{l}{1 - (1-p)^l}. \tag{20}$$

Proof: Taking the expectation of (19), we have:

$$E[e(R_l)] = E[R_l] - 1 + \frac{1}{p}, \tag{21}$$

which immediately leads to (20) using (14).

Note that (20) indicates that (19) is generally biased, especially when $l$ is small. Indeed, for $pl \to 0$, we have $1 - (1-p)^l \approx pl$ and $E[e(R_l)] \approx 1/p$ regardless of $l$, which shows that in such cases $E[e(R_l)]$ carries no information about the original flow size. However, as $l \to \infty$, it is straightforward to verify that the bias in $e(R_l)$ vanishes exponentially, which is consistent with the analysis in [17], which has only considered the case of $l \to \infty$.

To see the extent of bias in (19) and verify (20), we apply residual-geometric sampling to flows of size $l$ ranging from 1 to $10^6$ packets, feed the measured sizes to (19), and average the result after 1000 iterations for each $l$. Fig. 2 plots the obtained $E[e(R_l)]$ along with model (20). The figure indicates that (20) indeed captures the bias and that (19) tends to overestimate the size of short flows even in expectation, where smaller sampling probability $p$ leads to more error.
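The bias predicted by (20) can also be checked without simulation by summing estimator (19) against the residual PMF (13). The sketch below is illustrative only; the helper names (`residual_pmf`, `prior_estimator`, `expected_prior_estimate`, `model_20`) are hypothetical and not taken from the paper.

```python
def residual_pmf(l, p):
    """PMF (13) of R_l for a fixed flow size l; entry i-1 holds P(R_l = i)."""
    s = 1.0 - (1.0 - p) ** l                      # probability the flow is sampled
    return [(1.0 - p) ** (l - i) * p / s for i in range(1, l + 1)]


def prior_estimator(r, p):
    """Estimator (19): e(R_l) = R_l - 1 + 1/p."""
    return r - 1 + 1.0 / p


def expected_prior_estimate(l, p):
    """Exact E[e(R_l)] obtained by enumerating all residual values 1..l."""
    pmf = residual_pmf(l, p)
    return sum(prior_estimator(i, p) * pmf[i - 1] for i in range(1, l + 1))


def model_20(l, p):
    """Closed-form expectation (20): l / (1 - (1-p)^l)."""
    return l / (1.0 - (1.0 - p) ** l)


if __name__ == "__main__":
    p = 0.01
    for l in (1, 10, 100, 1000):
        # The two columns agree; the ratio to l shows the overestimation.
        print(l, expected_prior_estimate(l, p), model_20(l, p), model_20(l, p) / l)
```

For $l = 1$ both expressions equal $1/p$, which is the worst-case overestimation factor mentioned above.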

To quantify the error of individual values $e(R_l)$ in estimating $l$ and to understand elephant-accuracy, denote by $Y_l = e(R_l)/l$ and define the Relative Root Mean Square Error (RRMSE) to be:

$$\delta_l = \sqrt{E[(Y_l - 1)^2]}. \tag{22}$$

Note that $\delta_l \to 0$ indicates that $Y_l \to 1$ in mean-square and thus implies elephant-accurate estimation. The next result derives $\delta_l$ in closed form. We omit the rather tedious derivations for brevity.

Theorem 2: The RRMSE of (19) is given by:

$$\delta_l = \sqrt{\frac{1 - p - l^2 p^2 (1-p)^l - (1-p)^{l+1}}{l^2 p^2 \left(1 - (1-p)^l\right)}}. \tag{23}$$

Observe from (23) that for flows with size $l = 1$, the relative error is $(1-p)/p$, but as $l \to \infty$, $\delta_l \to 0$ and the estimator is elephant-accurate. Fig. 3 plots (23) against simulations, indicating a close match. The figure also shows that the RRMSE starts from roughly $1/p$ and decreases towards zero as $\Theta(1/l)$ as $l \to \infty$.

Fig. 3. RRMSE of (19) in simulations and its model (23): (a) $p = 0.01$; (b) $p = $ […].

B. Flow-Size Distribution

We now investigate whether $e(R_L)$ defined in (19) can be used to estimate the actual flow-size distribution $\{f_i\}_{i=1}^{\infty}$. Denote by $q_i = P(e(R_L) = i)$ the PMF of estimated sizes among the sampled flows. To understand our objectives with approximating the PMF of $L$, a definition is in order.

Definition 3: An estimator $\{q_i\}_{i=1}^{\infty}$ of PMF $\{f_i\}_{i=1}^{\infty}$ is called asymptotically unbiased if $q_i$ converges in probability to $f_i$ for all $i$ as the number of sampled flows $M \to \infty$.

The next theorem follows directly from (10).

Theorem 3: The PMF of flow sizes estimated from (19) is given by:

$$q_i = \frac{p}{p_s} \sum_{j=y(i)}^{\infty} f_j (1-p)^{j - y(i)}, \tag{24}$$

where $y(i) = i + 1 - 1/p$ and $p_s$ is in (9).

The result in (24) indicates that each $q_i$ is different from $f_i$ regardless of the sampling duration and thus cannot be used to approximate the flow-size distribution. We verify (24) with a simulated packet stream with 5M flows, where flow sizes follow a power-law distribution $P(L < i) = 1 - i^{-\alpha}$ for $i = 1, 2, \ldots$ and $\alpha = 1.1$. Fig. 4 plots the CCDF of random variable $e(R_L)$ obtained from simulations as well as model (24), both in comparison to the tail of the actual distribution. The figure shows that (24) accurately predicts the values obtained from simulations and that PMF $\{q_i\}$ is indeed quite different from $\{f_i\}$.

Fig. 4. Distribution $\{q_i\}$ in simulations and its model (24): (a) $p = $ […]; (b) $p = $ […].

So far, our study of existing methods in residual-geometric sampling has shown that they are not only generally biased, but also unable to recover the flow-size distribution from residuals $R_L$. This motivates us to seek better estimation approaches, which we perform next.

V. URGE

This section proposes a family of algorithms called Unbiased Residual-Geometric Estimators (URGE), proves their accuracy, and verifies them in simulations.

A. Single-Flow Usage

For estimating individual flow sizes, we first consider an estimator directly implied by the result in (14). Notice that solving (14) for $l$ and expressing $l$ in terms of $E[R_l]$, we get:

$$l = u - \frac{1}{\log(1-p)} W\!\left(u (1-p)^u \log(1-p)\right), \tag{25}$$

where $u = E[R_l] + 1/p - 1$ and $W(z)$ is Lambert's function (i.e., a multi-valued solution to $W e^W = z$) [4]. Thus, a possible estimator can be computed from (25) with $E[R_l]$ replaced by the measured value of geometric residual $R_l$. However, there are two reasons that (25) is a bad estimator of flow sizes. First, Lambert's function $W(z)$ has no closed-form solution and has to be numerically solved using tools such as Matlab. Second, it can be verified (not shown here for brevity) that (25) is not an unbiased estimator.

Instead, we define a new estimator:

$$\hat e(R_l) = R_l - 1 + \frac{1}{p} - \frac{(1-p)^{R_l}}{p} \tag{26}$$

and next show that it is unbiased.

Lemma 3: Estimator $\hat e(R_l)$ in (26) is unbiased, i.e.,

$$E[\hat e(R_l)] = l. \tag{27}$$

Proof: We prove (27) by deriving a function $\psi(R_l)$ that satisfies $E[\psi(R_l)] = l$. First, it follows from (13) that:

$$E[\psi(R_l)] = \sum_{j=1}^{l} \psi(j) P(R_l = j) = \sum_{j=1}^{l} \frac{\psi(j) (1-p)^{l-j} p}{1 - (1-p)^l}. \tag{28}$$

For $E[\psi(R_l)] = l$ to hold, we must have:

$$\sum_{j=1}^{l} \psi(j) (1-p)^{-j} = \frac{l \left(1 - (1-p)^l\right)}{p (1-p)^l}. \tag{29}$$

Writing (29) twice for $l$ and $l - 1$ and subtracting the two equations from each other, we get:

$$\psi(l) (1-p)^{-l} = \frac{1 + (l-1)p - (1-p)^l}{p (1-p)^l}. \tag{30}$$

Simplifying (30), we obtain (26).

We plot in Fig. 5 simulation results obtained from (26). The figure indicates that $\hat e(R_l)$ accurately estimates flow sizes for all flows in both cases of $p$.

Fig. 5. Expectation of estimator (26) in simulations, plotted as estimated size together with the unbiased reference: (a) $p = 0.01$; (b) $p = $ […].

Next, we derive the RRMSE of URGE.

Theorem 4: The RRMSE of (26) is given by:

$$\hat\delta_l = \sqrt{\frac{1 - p + l p (p - 2)(1-p)^l - (1-p)^{2l+1}}{l^2 p^2 \left(1 - (1-p)^l\right)}}. \tag{31}$$

It is easy to verify from (31) that URGE has zero RRMSE for $l = 1$ or $l \to \infty$, confirming its elephant-accuracy. We plot $\hat\delta_l$ obtained from simulations along with the model in Fig. 6, which shows that (31) accurately tracks the actual relative error. From Figures 5-6, it is clear that $\hat e(R_l)$ significantly improves the accuracy of estimating small flow sizes compared to $e(R_l)$. In practice, (31) can be used to determine threshold $l_0$, which leads to desired bounds on error for all $l \ge l_0$ and allows ISPs to use $\hat e(R_l)$ instead of $l$.

Fig. 6. RRMSE of (26) in simulations and its model (31): (a) $p = 0.01$; (b) $p = $ […].

B. Flow-Size Distribution

It is worth mentioning that while (26) produces unbiased estimation of flow sizes, $\hat e(R_L)$ is not suitable for computing the flow-size distribution, as we show below. Denote by $\hat q_i = P(\hat e(R_L) = i)$ the PMF of $\hat e(R_L)$. Then, we have the following result.

Lemma 4: The PMF of $\hat e(R_L)$ is given by:

$$\hat q_i = \frac{1}{p_s} \sum_{j=y(i)}^{\infty} p (1-p)^{j - y(i)} f_j, \tag{32}$$

where $p_s$ is in (9), function $y(i)$ is:

$$y(i) = i + 1 - \frac{1}{p} - \frac{\omega}{\log(1-p)}, \tag{33}$$

and $\omega = W\!\left(-(1-p)^{i + 1 - 1/p} \log(1-p) / p\right)$.

Proof: We first solve

$$R_L + \frac{1}{p} - 1 - \frac{(1-p)^{R_L}}{p} = i \tag{34}$$

for $R_L$ and express it in terms of $i$, i.e., $R_L = y(i)$, where $y(i)$ is given by (33), ignoring approximate round-offs to the nearest integer. Combining with (10), we have:

$$\hat q_i = P(R_L = y(i)) = h_{y(i)}, \tag{35}$$

where $h_i$ is in (10). This directly leads to (32).
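As a numerical sanity check on Lemma 3 and Theorem 4, the mean and RRMSE of estimator (26) can be computed exactly by enumerating the residual PMF (13) and compared with (31). This is a sketch under the definitions given above; the function names `urge_size`, `exact_moments`, and `rrmse_model` are hypothetical.

```python
import math


def urge_size(r, p):
    """Estimator (26): R - 1 + 1/p - (1-p)^R / p."""
    return r - 1 + 1.0 / p - (1.0 - p) ** r / p


def exact_moments(l, p):
    """Return (E[e_hat(R_l)], RRMSE) by summing over the PMF (13) of R_l."""
    s = 1.0 - (1.0 - p) ** l
    mean = 0.0
    mse = 0.0
    for i in range(1, l + 1):
        w = (1.0 - p) ** (l - i) * p / s          # P(R_l = i)
        est = urge_size(i, p)
        mean += w * est
        mse += w * (est / l - 1.0) ** 2
    return mean, math.sqrt(mse)


def rrmse_model(l, p):
    """Closed form (31) for the RRMSE of estimator (26)."""
    q = 1.0 - p
    num = q + l * p * (p - 2.0) * q ** l - q ** (2 * l + 1)
    den = l * l * p * p * (1.0 - q ** l)
    return math.sqrt(max(num, 0.0) / den)          # clamp tiny negative round-off


if __name__ == "__main__":
    for l in (1, 5, 50, 500):
        mean, rrmse = exact_moments(l, 0.05)
        print(l, mean, rrmse, rrmse_model(l, 0.05))
```

The printed mean equals $l$ up to floating-point error for every $l$, and the enumerated RRMSE coincides with (31), including the zero value at $l = 1$.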
Notice from (32)-(33) that distribution $\hat q_i$ does not even remotely approximate the original PMF $f_i$. This problem is fundamental since residual sampling exhibits bias towards larger flows and even if we could recover $L$ from $R_L$ exactly, the distribution of sampled flow sizes would not accurately approximate that of all flows passing through the router.

We thus explore another technique for estimating the flow-size distribution. Before doing that, we need the next lemma.

Lemma 5: The distribution $f_i$ can be expressed using the PMF of geometric residuals $\{h_i\}$ in (10) as:

$$f_i = \frac{h_i - (1-p) h_{i+1}}{p + (1-p) h_1}. \tag{36}$$

Proof: From (10), we obtain that:

$$h_i - (1-p) h_{i+1} = \frac{p}{p_s} f_i. \tag{37}$$

It then immediately follows that $f_i$ is given by:

$$f_i = \frac{p_s}{p} \left(h_i - (1-p) h_{i+1}\right). \tag{38}$$

Notice that $p_s$ in (9) is a function of $\{f_i\}$, which are unknown from the measurement perspective. The last step of the proof is to express $p_s$ in terms of known quantities $\{h_i\}$, which can be accomplished by applying the normalization condition $\sum_{i=1}^{\infty} f_i = 1$. It is easy to verify that:

$$\sum_{i=1}^{\infty} h_i = 1 \quad \text{and} \quad \sum_{i=1}^{\infty} h_{i+1} = 1 - h_1. \tag{39}$$

Then, summing up both sides of (38) for $i$ from 1 to infinity gives us:

$$p_s = \frac{p}{\sum_{i=1}^{\infty} \left(h_i - (1-p) h_{i+1}\right)} = \frac{p}{p + (1-p) h_1}, \tag{40}$$

which together with (38) establishes (36).

This result leads to a new estimator for the flow-size distribution:

$$\tilde q_i = \frac{M_i - (1-p) M_{i+1}}{pM + (1-p) M_1}, \tag{41}$$

where $M$ is the total number of sampled flows and $M_i$ is the number of them with the geometric residual equal to $i$. Since $M_i / M \to h_i$ in probability as $M \to \infty$ (from the weak law of large numbers), we immediately get the following result.

Corollary 2: The estimator in (41) is asymptotically unbiased.

We next verify the accuracy of $\tilde q_i$ in simulations with 5M flows in the same setting as in the previous section. We plot in Fig. 7 the CCDF estimated from (41) along with the actual distribution. The figure shows that $\tilde q_i$ accurately follows the actual distribution for both cases of $p$.

Fig. 7. Estimator (41) in simulations: (a) $p = 0.01$, $M = 194{,}208$; (b) $p = 0.001$, $M = 26{,}233$.

C. Convergence Speed

We next examine the effect of sample size $M$ on the convergence of estimator $\tilde q_i$. To illustrate the problems arising from small $M$, we study (41) with $p = $ […] and $10^{-5}$ in simulations with the same 5M flows. The estimator obtained $M = 3{,}090$ flows for $p = $ […] and just $M = 337$ for $p = 10^{-5}$. Fig. 8 indicates that while the estimated curves under both choices of $p$ still approximate the trend of the original distribution, they exhibit different levels of noise. As the next result indicates, small $p$ leads to a small sample size $M$ and thus more noise in the estimated values.

Fig. 8. Estimator (41) in simulations with very small $p$: (a) $p = $ […], $M = 3{,}090$; (b) $p = 10^{-5}$, $M = 337$.

Corollary 3: Suppose that $M$ flows are selected by residual-geometric sampling from a total of $n$ flows. Then, the expected value of $M$ is given by:

$$E[M] = n p_s = n E[1 - (1-p)^L]. \tag{42}$$

Proof: This result follows from the fact that $E[M] = n P(A_L < L) = n p_s$, where $p_s$ is given by (9).

To shed light on the choice of proper $p$ for RGS, we show how to determine the minimum $M$ that would guarantee a certain level of accuracy in $\tilde q_i$. Define $\tilde h_i = M_i / M$ to be an estimate of $h_i = P(R_L = i)$. The next lemma follows from Lemma 5 and Corollary 2 and indicates that the accuracy of $\tilde q_i$ directly depends on whether $\tilde h_i$ approximates $h_i$ accurately.

Lemma 6: Suppose that $|\tilde h_j - h_j| \le \eta h_j$ holds with probability $1 - \xi$ for $j \in [1, i+1]$, where $\eta$ and $\xi$ are small constants.
Then, there exists a constant $\zeta$:

$$\zeta = \frac{\eta \left(p + 2(1-p) h_1\right)}{p + (1-p)(1-\eta) h_1} \tag{43}$$

such that $\zeta \to 0$ as $\eta \to 0$ and $P(|\tilde q_i - f_i| \le \zeta f_i) = 1 - \xi$.

Proof: We prove the result by deriving $\zeta$ that satisfies $|\tilde q_i - f_i| \le \zeta f_i$ given that $|\tilde h_j - h_j| \le \eta h_j$. From (36) and (41), we have:

$$\tilde q_i - f_i = \frac{a_1}{a_2}, \tag{44}$$

where

$$a_1 = p(\tilde h_i - h_i) - p(1-p)(\tilde h_{i+1} - h_{i+1}) + (1-p)(h_1 \tilde h_i - \tilde h_1 h_i) - (1-p)^2 (h_1 \tilde h_{i+1} - \tilde h_1 h_{i+1}) \tag{45}$$

and

$$a_2 = \left(p + (1-p) h_1\right)\left(p + (1-p) \tilde h_1\right). \tag{46}$$

From the condition $|\tilde h_j - h_j| \le \eta h_j$, we bound $a_1$ and $a_2$ as follows:

$$|a_1| \le p \eta h_i + 2\eta(1-p) h_1 h_i + p \eta (1-p) h_{i+1} + 2\eta (1-p)^2 h_1 h_{i+1} = \eta \left(h_i + (1-p) h_{i+1}\right)\left(p + 2(1-p) h_1\right), \tag{47}$$

and

$$a_2 \ge \left(p + (1-p) h_1\right)\left(p + (1-p)(1-\eta) h_1\right). \tag{48}$$

It thus follows from (36) and (47)-(48) that $|\tilde q_i - f_i| \le \zeta f_i$, where constant $\zeta$ is given by:

$$\zeta = \frac{\eta \left(p + 2(1-p) h_1\right)}{p + (1-p)(1-\eta) h_1}, \tag{49}$$

and that $\zeta \to 0$ as $\eta \to 0$.

Next, we obtain a bound on $M$ from the requirement that $\tilde h_i$ be bounded in probability within a given range $[h_i(1-\eta), h_i(1+\eta)]$.

Theorem 5: For small constants $\eta$ and $\xi$, $|\tilde h_i - h_i| \le \eta h_i$ holds with probability $1 - \xi$ if sample size $M$ is no less than:

$$M \ge \frac{1 - h_i}{h_i \eta^2} \left(\Phi^{-1}(1 - \xi/2)\right)^2, \tag{50}$$

where $\Phi(x)$ is the CDF of the standard Gaussian distribution $N(0, 1)$.

Proof: Notice that $M_i$ is a random variable whose distribution is given by Binomial$(M, h_i)$ and that $\tilde h_i$ can be approximated by a Gaussian random variable with mean $\mu_i = h_i$ and variance $\sigma_i^2 = h_i(1 - h_i)/M$. Define

$$Z = \frac{\tilde h_i - \mu_i}{\sigma_i}, \tag{51}$$

which is a standard Gaussian random variable with mean 0 and variance 1. It follows that:

$$P(|Z| \le z) = 2\Phi(z) - 1, \tag{52}$$

where $\Phi(\cdot)$ is the CDF of the standard Gaussian distribution $N(0, 1)$. Therefore, we establish that:

$$P(|\tilde h_i - h_i| \le z \sigma_i) = 2\Phi(z) - 1. \tag{53}$$

We can guarantee target accuracy by setting $z \sigma_i = \eta h_i$ and $2\Phi(z) - 1 = 1 - \xi$, which gives the following equality:

$$\frac{\eta h_i}{\sigma_i} = \Phi^{-1}(1 - \xi/2). \tag{54}$$

Substituting $\sigma_i = \sqrt{h_i(1 - h_i)/M}$ into the above equation and solving for $M$, we obtain (50).

For example, to bound $\tilde h_i$ within 10% of $h_i$ (i.e., $\eta = 0.1$) with probability $1 - \xi = 95\%$ for all $h_i \ge 10^{-2}$, the following must hold:

$$M \ge \frac{1 - 0.01}{0.01 \times 0.1^2} \times 1.96^2 \approx 3.8 \times 10^4, \tag{55}$$

which indicates that $M = 38$K flows must be sampled to achieve target accuracy. If we reduce $\eta$ to 1%, increase $1 - \xi$ to 99%, and require the approximation to hold for all $h_i \ge 10^{-3}$, then $M$ must be at least 66M flows. Converting $\eta$ into $\zeta$ using (43), one can establish similar bounds on the deviation of $\tilde q_i$ from $f_i$.

D. Estimation of Other Flow Metrics

Besides flow sizes and the flow-size distribution, URGE also provides estimators for the total number of flows and the number of them with size $i$. Before introducing these estimators, we need the next lemma.

Lemma 7: The expected number of flows with sampled residuals $R_L = i$ is:

$$E[M_i] = E[M] h_i = n h_i p_s, \tag{56}$$

where $h_i$ is the PMF of geometric residuals and $p_s$ is given by (9).

Proof: Writing:

$$E[M_i] = n P(A_L < L \cap R_L = i) = n P(R_L = i \mid A_L < L) P(A_L < L), \tag{57}$$

notice that (56) follows from the fact that $P(R_L = i \mid A_L < L) = h_i$ and $P(A_L < L) = p_s$.

Based on this, we next develop two estimators and prove their accuracy. Let $\tilde n$ be an estimator of the total number of flows $n$ observed in the measurement window $[0, T]$:

$$\tilde n = M + \frac{1-p}{p} M_1 \tag{58}$$

and $\tilde n_i$ be an estimator of the number of flows $n_i$ with size $i$:

$$\tilde n_i = \frac{M_i - (1-p) M_{i+1}}{p}. \tag{59}$$

Then, the next result shows that both of these estimators are asymptotically unbiased.

Lemma 8: Ratios $\tilde n / n$ and $\tilde n_i / n_i$, for all $i$ such that $f_i > 0$, converge to 1 in probability as $M \to \infty$.

Proof: To prove convergence in probability, it suffices to show that $E[\tilde n / n] = 1$ and $\mathrm{Var}[\tilde n / n] \to 0$ as $n \to \infty$.
From (58), we have:

$$E[\tilde n] = E[M] + \frac{1-p}{p}\,E[M_1]. \tag{60}$$

Applying (42) and (56), we get:

$$E[\tilde n] = n p_s\Bigl(1 + \frac{1-p}{p}\,h_1\Bigr), \tag{61}$$

which simplifies to $E[\tilde n] = n$ using (40). To tackle the variance of $\tilde n/n$, first notice that $M$ can be represented as a sum of $n$ i.i.d. Bernoulli variables (i.e., $M = \sum_{j=1}^{n} A_j$), each with fixed probability $p_s$. Therefore:

$$\mathrm{Var}\Bigl[\frac{M}{n}\Bigr] = \frac{1}{n^2}\sum_{j=1}^{n}\mathrm{Var}[A_j] = \frac{p_s(1 - p_s)}{n}, \tag{62}$$

where the last term is bounded by $1/n$. Applying similar reasoning to $M_1$, we obtain that $\mathrm{Var}[\tilde n/n] \sim 1/n$. Since we assumed that the number of sampled flows $M \to \infty$, this implies that $n p_s \to \infty$ and thus from (40) that $n \to \infty$, which establishes that $\mathrm{Var}[\tilde n/n] \to 0$. Convergence in probability immediately follows (in fact, an even stronger convergence in mean-square holds, but this distinction is not essential in our context).

For the second part of the lemma, define $X_n = \tilde n_i/n$ and $Y_n = n_i/n$. We first prove that both $X_n$ and $Y_n$ converge in probability to $f_i$. We then argue that their ratio $X_n/Y_n$ converges to 1, also in probability. Using (56), (40), and finally (36), we have:

$$E[X_n] = \frac{E[M_i] - (1-p)E[M_{i+1}]}{np} = \frac{p_s\bigl(h_i - (1-p)h_{i+1}\bigr)}{p} = \frac{h_i - (1-p)h_{i+1}}{p + (1-p)h_1} = f_i. \tag{63}$$

Since $n_i$ is the number of flows with size $i$, its expectation is $E[n_i] = nP(L = i) = n f_i$ and thus $E[Y_n] = f_i$. Using reasoning similar to that in the first half of this proof, we obtain that $\mathrm{Var}[X_n] \to 0$ and $\mathrm{Var}[Y_n] \to 0$, which shows convergence of these variables to $f_i$ in probability.

For the final step, consider two sequences $\{X_n\}$ and $\{Y_n\}$ that converge to the same positive constant $f_i > 0$. Then, simple manipulation shows that their ratio converges to 1 in probability. We leave details to the reader.

Note that [17] provided an estimator similar to (58) and proved $E[\tilde n] = n$ using a different approach from ours; however, our results are stronger as they show convergence in probability and additionally address estimation of $n_i$. Simulations verifying (58)-(59) are omitted for brevity.

VI. IMPLEMENTATION

In this section, we implement URGE and examine its memory consumption and processing speed.

Fig. 9. The URGE framework.

Fig. 10. Illustration of a chained hash table for maintaining flow counters.

A. General Structure

Fig. 9 illustrates a framework that implements the various URGE algorithms. This framework contains three processes (flow classification, residual-geometric sampling, and estimation) as well as one data structure containing the flow counter table. Flow classification processes each incoming packet to extract its flow ID and then forwards it to residual-geometric sampling. For each flow ID $x$ arriving from flow classification, residual-geometric sampling first checks if the flow table has an existing entry for $x$ and, if so, increments the counter by 1; if an entry does not exist, it is created with probability $p$ and its counter is initialized to 1. The geometric estimation process collects counter values from the flow table and then uses URGE to estimate flow statistics.

The flow table keeps a mapping between flow IDs and associated counters. The table supports three operations: 1) lookup(x) to retrieve the record of flow $x$; 2) add(x) to insert a new entry for flow $x$ in the table with the initial counter value 1; and 3) increment(x) to add 1 to the counter of flow $x$.

We display in Fig. 10 an implementation of the counter table, which is based on a chained hash table. Assume a hash function hash(x) that produces an integer value in $\{0, 1, \ldots, K-1\}$. We assume that the generated hash values are uniformly distributed within interval $[0, K-1]$ and that the implementation of function hash(.) is fast enough. Efficient hardware hash functions can be found in [29]. We maintain an array $A$ of size $K$ and each entry $A[k]$ points to a linked list that keeps the set of flows whose IDs have the same hash value $k$.
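The three table operations and the per-packet sampling rule described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `FlowTable` and `rgs_packet`, the use of Python's built-in `hash`, and Python lists standing in for linked lists are all assumptions made for the example.

```python
import random


class FlowTable:
    """Chained hash table mapping flow IDs to packet counters (illustrative)."""

    def __init__(self, K):
        self.K = K
        # A[k] holds the chain of [flow_id, counter] records hashing to k
        self.A = [[] for _ in range(K)]

    def _bucket(self, x):
        # Built-in hash() stands in for the fast uniform hash assumed in the text
        return self.A[hash(x) % self.K]

    def lookup(self, x):
        """Return the record of flow x, or None if x is not in the table."""
        for rec in self._bucket(x):
            if rec[0] == x:
                return rec
        return None

    def add(self, x):
        """Insert a new entry for flow x with initial counter value 1."""
        self._bucket(x).append([x, 1])

    def increment(self, x):
        """Add 1 to the counter of flow x (x must already be in the table)."""
        self.lookup(x)[1] += 1


def rgs_packet(table, x, p, rng=random):
    """Residual-geometric sampling step for one packet of flow x."""
    if table.lookup(x) is not None:
        table.increment(x)      # flow already sampled: count every packet
    elif rng.random() < p:
        table.add(x)            # new entry created with probability p
```

Calling `rgs_packet` once per arriving packet leaves each sampled flow's counter equal to its geometric residual, which is the quantity the estimation process later reads from the table.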
Each node in the list contains two fields: 1) flow data that keep the flow ID, the packet counter, and the timestamp of the last packet; and 2) a pointer to the next node.

An important element of our algorithm is to ensure that the table keeps only active flows, which is accomplished by periodic sweeps through the table and removal of all flows that have completed using FIN/RST packets or have been idle for longer than $\tau$ time units. Upon removal, flow information is saved to disk (single-flow usage) or aggregated into a RAM-based PMF table (flow-distribution usage). Operations add(x) and increment(x) automatically modify the timestamps associated with each flow and allow timeout-based expulsion of dead flows.

Notice that the flow table is accessed by residual-geometric sampling upon each packet arrival. Therefore, the scalability of the measurement algorithm essentially depends on the access speed to the table. In what follows, we analyze the design of the flow table and quantify its two important properties: memory consumption and processing speed.

B. Active Flows

To understand how much benefit removal of dead flows provides to memory consumption, we next derive the expected number of active flows at any time $t$ and their fraction sampled by the algorithm. Assume a measurement window $[0, T]$, where $T$ is given in packets seen by the router. For each flow $j$, let inter-packet delays within the flow be given by a random variable $\Delta_j$, which counts the number of packet arrivals from other flows between adjacent packets of $j$. Denoting by $\Delta = E[\Delta_j]$, we have the following result.

Lemma 9: Assuming stationary flow arrivals in $[0, T]$ and $T \to \infty$, the expected number of active flows $N(t)$ at time $t$ is given by:

$$E[N(t)] = \Delta + 1. \tag{64}$$

Proof: Represent $N(t) = \sum_{j=1}^{n} A_j(t)$ as the sum of $n$ indicator variables, where $A_j(t)$ is 1 if flow $j$ is alive at $t$ and 0 otherwise. Observe that:

$$E[N(t)] = nE[A_j(t)] = nP(A_j(t) = 1) \tag{65}$$

and notice that each flow exists at the router for $\sum_{k=1}^{L}(\Delta_j^k + 1)$ packet units, where $\Delta_j^1, \Delta_j^2, \ldots$ are i.i.d. instances of variable $\Delta_j$. Then, the probability that $t \in [0, T]$ lands within a given flow is simply the flow's expected footprint (in packets seen by the router) normalized by the window size:

$$E[A_j(t)] = \frac{1}{T}\,E\Bigl[\sum_{k=1}^{L}(\Delta_j^k + 1)\Bigr]. \tag{66}$$

Using Wald's equation, this simplifies to $E[A_j(t)] = (\Delta + 1)E[L]/T$. Finally, since $nE[L]/T = 1$, we immediately obtain (64) using (65).

Our baseline reduction in flow volume comes from geometric sampling in previous sections and reduces the number of flows by a factor of $r_1 = n/E[M]$. Now additionally define ratio $r_2 = n/E[N(t)] = T/\bigl((\Delta + 1)E[L]\bigr)$ and observe that longer observation windows (i.e., larger $T$), smaller flow sizes (i.e., smaller $E[L]$), and denser arrivals (i.e., smaller $\Delta$) imply more savings of memory. In fact, $T \to \infty$ results in $r_2 \to \infty$ if the other parameters are fixed. However, even more reduction is possible by discarding dead flows in RGS. Denote by $M(t)$ the number of sampled flows that are still alive at $t$ and consider the next result.

TABLE I. Comparison of (64) and (67) to simulation results (columns: time $t$; simulated $E[N(t)]$ and $E[M(t)]$; models (64) and (67)).

Fig. 11. Verifying models (64) and (67): (a) alive flows $N(t)$ versus time $t$; (b) sampled flows $M(t)$ versus time $t$.

Lemma 10: Assuming the flow arrival process is stationary in $[0, T]$ and $T \to \infty$, the expected number of active sampled flows at time $t$ is given by:

$$E[M(t)] = (\Delta + 1)\Bigl(1 - \frac{1-p}{p\,E[L]}\,p_s\Bigr), \tag{67}$$

where $p_s$ in (9) is the fraction of all flows sampled by RGS.

Proof: Following Lemma 9, it suffices to derive the average packet footprint of flow $j$ within window $[0, T]$. Dividing this footprint by $T$ gives us the probability that current time $t$ falls within the residual of the flow, and multiplying the result by $n$ produces the expected number of flows stored in RAM. Condition on $L = l$ and define $P_l$ as the number of packets counted by RGS from flow $j$:

$$P_l = \begin{cases} R_l & \text{flow sampled} \\ 0 & \text{otherwise} \end{cases}. \tag{68}$$

Then, the flow's footprint $F_l$ is:

$$F_l = \sum_{k=1}^{P_l}(\Delta_j^k + 1), \tag{69}$$

where as before $\Delta_j^k$ are i.i.d. inter-packet delays induced by cross-traffic that do not depend on the size of flow $j$. Next, taking the expectation of $F_l$, we have:

$$E[F_l] = E[P_l](\Delta + 1) = E[R_l \mid \text{sampled}]\,P(\text{sampled})\,(\Delta + 1). \tag{70}$$

Using (14) and recalling that $P(\text{sampled}) = 1 - (1-p)^l$, we have:

$$E[F_l] = (\Delta + 1)\Bigl[\,l - \frac{1-p}{p}\bigl(1 - (1-p)^l\bigr)\Bigr]. \tag{71}$$

Unconditioning $L = l$, we have the expected footprint as:

$$E[F] = (\Delta + 1)\Bigl[E[L] - \frac{1-p}{p}\,p_s\Bigr], \tag{72}$$

where $p_s = E[1 - (1-p)^L]$ is the probability that a flow is sampled by RGS. Multiplying $E[F]$ by $n$, dividing by $T$, and taking $E[L]$ outside, we get (67).

Define $r_3 = n/E[M(t)]$ as the expected reduction of space when tracking only active RGS flows compared to all flows seen at the router and notice that this ratio increases not only as $T$ grows, but also when $p$ decreases. Performing a self-check using Jensen's inequality, observe that $0 \le p_s/(pE[L]) \le 1$ and therefore $E[M(t)] \le E[N(t)]$, which means that the former indeed always results in more reduction in table size.

We discuss numerical values of $r_1$ to $r_3$ in the next section and now focus on the accuracy of the obtained results.
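As a numerical companion to Lemmas 9 and 10, the sketch below evaluates (64) and (67) for a given flow-size PMF. The function names and the dictionary representation of the PMF are assumptions made for illustration; `delta` stands for the mean number of cross-traffic packets between adjacent packets of a flow, and the toy distribution is hypothetical.

```python
def sampling_prob(pmf, p):
    """p_s = E[1 - (1 - p)^L]: probability that RGS samples a flow."""
    return sum(f * (1.0 - (1.0 - p) ** l) for l, f in pmf.items())


def expected_active(delta):
    """Eq. (64): expected number of active flows, E[N(t)] = delta + 1."""
    return delta + 1.0


def expected_active_sampled(pmf, p, delta):
    """Eq. (67): expected number of sampled flows that are alive at time t."""
    mean_len = sum(l * f for l, f in pmf.items())   # E[L]
    p_s = sampling_prob(pmf, p)
    return (delta + 1.0) * (1.0 - (1.0 - p) / (p * mean_len) * p_s)


# Toy example (hypothetical numbers): every flow has exactly two packets
pmf = {2: 1.0}
print(expected_active(9.0))                    # 10.0
print(expected_active_sampled(pmf, 0.5, 9.0))  # 6.25
```

In the toy case a flow is counted for 2 packets with probability 1/2 and for 1 packet with probability 1/4, i.e., 1.25 of its 2 packets on average, which matches the factor 0.625 applied to $E[N(t)] = 10$.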
We evaluate models (64) and (67) in simulations with 1,000 iterations through window $[0, T]$ with randomly generated flows from a power-law distribution with flow-size CDF $F_i = 1 - i^{-\alpha}$, where $\alpha = 1.1$ and $p$ = …. Fig. 11 plots the evolution of $N(t)$ and $M(t)$ along with the expected values computed from the models. Table I compares the models with $E[N(t)]$ and $E[M(t)]$ computed in simulations, where each value is averaged using the same 1,000 iterations of the traffic stream. Both indicate a very close match.

C. Memory Consumption

The memory used by the flow table can be divided into two parts: one for the hash table, which contains an array of pointers, and the other for flow records, which are organized in a set of linked lists. Define $w_p$ to be the number of bytes used by each memory pointer and $w_f$ to be that needed for the flow counter, timestamp, and flow ID. Then, the following theorem gives the memory required for the measurement algorithm.

Theorem 6: The average number of bytes required by URGE in steady-state is:

$$E[W_R(t)] = K w_p + E[M(t)]\,(w_c + w_f), \tag{73}$$

where $E[M(t)]$ is the average number of sampled active flows at time $t$ given by (67).

From (73), observe that for $n$ original flows with a given distribution of $L$, memory consumption $E[W_R(t)]$ can be reduced by lowering either $M(t)$ or $K$. As discussed in the previous section, $M(t)$ cannot be arbitrarily small as it would lead to lower accuracy. At the same time, small $K$ leads to more collisions in the hash table and longer linked lists, and thus may slow down the sampling process, which are the issues we study next.

D. Processing Time

The time spent in processing each packet depends on how linked lists are built. We examine an approach that sorts flow entries of each linked list based on flow IDs. In this approach, function lookup(x) returns a pointer to the entry of flow $x$ if it exists in the table; otherwise, the function returns a pointer to where the new entry should be inserted.
For each packet with flow ID $x$, we perform the following steps in sequential order: 1) compute $k$ = hash(x); 2) retrieve the linked-list head pointer $A[k]$ from the hash table; 3) iterate through the linked list until a flow record is matched or a flow with ID larger than $x$ is reached; 4) if $x$ is not found, a new entry for $x$ is created with probability $p$ and inserted at the location returned by lookup(x).

Notice that the fourth step is executed only when a new flow arrives and is sampled, which is much less frequent compared to the case of an existing flow. Thus, we consider its contribution to the overall overhead negligible and omit it from analysis. Denote by $t_h$ the time spent in computing a hash, by $t_p$ that of memory access, and by $t_c$ that of each comparison of flow IDs. Define $T_R(t)$ to be the processing delay/latency of each incoming packet at time $t$. Then, noticing that the expected list length is $E[M(t)]/K$ entries and that on average traversal stops in the middle of a list, we have the next result.

Theorem 7: The expected per-packet processing time is:

$$E[T_R(t)] = t_h + t_p + (t_c + t_p)\,\frac{E[M(t)]}{2K}. \tag{74}$$

The result in (74) indicates that both large hash table size $K$ and small sample size $M(t)$ can contribute to a faster sampling process. Since larger $K$ reduces (74), but increases (73), we next examine how to properly select $K$ and $p$ to simultaneously satisfy certain target constraints on $E[W_R(t)]$ and $E[T_R(t)]$ given their conflicting dependency on $K$.

TABLE II. Constants used in (73) and (74)

| RAM constant | Value | CPU constant | Value |
|---|---|---|---|
| $w_p$ | 4 B | $t_h$ | 12 ns |
| $w_f$ | 17 B | $t_p$ | 9 ns |
| $W_0$ | 1.65 MB | $t_c$ | 3 ns |
| | | $T_0$ | 24 ns |

Fig. 12. Tradeoff: (a) memory consumption $E[W_R(t)]$ (MB) and (b) processing time $E[T_R(t)]$ (ns) versus table size $K$, with $E[M(t)]$ = …. Gray areas display the acceptable ranges of $K$.

E. Tradeoff Analysis

Now, we are ready to explore the design space of constants $(K, p)$ to strike a balance between accuracy and scalability. Suppose that a router requires that $E[W_R(t)] \le W_0$ and $E[T_R(t)] \le T_0$. Further assume that the number of sampled flows $E[M(t)]$ is known and fixed (i.e., fixed $p$, window $T$, and flow-size distribution).
Define two constants:

$$K_l = \frac{(t_c + t_p)\,E[M(t)]}{2\bigl(T_0 - (t_h + t_p)\bigr)}, \tag{75}$$

and

$$K_u = \frac{W_0 - E[M(t)]\,(w_c + w_f)}{w_p}. \tag{76}$$

Assuming $K_l \le K_u$, it then follows from (73) and (74) that one can choose any value $K \in [K_l, K_u]$ to satisfy the two constraints on memory and speed. We show below how to vary $p$ in order to maximize accuracy while ensuring $K_l \le K_u$.

Fig. 13. Lower and upper bounds on table size $K$ with varying sampling probability $p$. Gray areas display the acceptable range of $K$ and $p$.

To understand this better, consider the following example. Assume that the original traffic contains $n = 10^6$ flows with a power-law distribution $P(L \le i) = 1 - i^{-1.1}$. With $p = 0.01$, residual-geometric sampling obtains $E[M(t)]$ = … sampled flows. Table II gives the constants we use to compute the expected memory consumption and processing time in (73) and (74). We also impose the following constraints on memory and delay: $W_0 = 1.65$ MB and $T_0 = 24$ ns.¹ Fig. 12 illustrates the acceptable ranges of table size $K$ derived from the models. The figure indicates that table size $K$ can be any value between $K_l$ = … and $K_u$ = … to simultaneously satisfy both requirements $W_0$ and $T_0$.

Note that for some values of $E[M(t)]$ it is possible that $K_l$ is larger than $K_u$ and thus the constraints cannot be met. Therefore, we next vary $p$ to show how the choice of $K$ will be affected. Fig. 13 plots $K_u$ and $K_l$ as functions of $p$, where both curves are obtained from the corresponding models. Notice from the figure that $K_l$ monotonically increases and $K_u$ monotonically decreases in $p$. This implies that interval $[K_l, K_u]$ eventually shrinks to a single point $K_0$, after which no feasible assignment of table size $K$ exists. Since larger $p$ implies more accurate estimation (i.e., the router sees more flows $M$ in the interval $[0, T]$ and thus estimates distribution $\{h_i\}$ more accurately), it is desirable to select the maximum $p$ that allows the router to satisfy the space and speed constraints. This occurs at a single optimum point $p_0$ that corresponds to $K_l = K_u = K_0$.
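The feasible range of $K$ in (75) and (76) can be computed directly, as in the rough sketch below. It plugs in the constants of Table II, but several inputs are assumptions made for illustration: the value of $E[M(t)]$ is hypothetical, the per-record overhead `w_c` is taken equal to the pointer size because Table II does not list it, and 1 MB is treated as $10^6$ bytes. The function names are likewise illustrative.

```python
def table_size_bounds(m_t, T0, W0, t_h=12.0, t_p=9.0, t_c=3.0,
                      w_p=4.0, w_c=4.0, w_f=17.0):
    """Return (K_l, K_u) from (75) and (76).

    Times are in nanoseconds and sizes in bytes. Defaults follow Table II,
    except w_c, which is an assumed value (taken equal to w_p here).
    """
    K_l = (t_c + t_p) * m_t / (2.0 * (T0 - (t_h + t_p)))   # speed constraint
    K_u = (W0 - m_t * (w_c + w_f)) / w_p                   # memory constraint
    return K_l, K_u


def memory_bytes(K, m_t, w_p=4.0, w_c=4.0, w_f=17.0):
    """Eq. (73): expected memory consumption in bytes."""
    return K * w_p + m_t * (w_c + w_f)


def packet_time_ns(K, m_t, t_h=12.0, t_p=9.0, t_c=3.0):
    """Eq. (74): expected per-packet processing time in nanoseconds."""
    return t_h + t_p + (t_c + t_p) * m_t / (2.0 * K)


m_t = 50_000   # hypothetical E[M(t)], not a value from the paper
K_l, K_u = table_size_bounds(m_t, T0=24.0, W0=1.65e6)
print(K_l, K_u)   # 100000.0 150000.0
```

Under these assumed inputs any $K$ between the two printed bounds meets both constraints; at $K = K_l$ the processing time equals $T_0$ and at $K = K_u$ the memory equals $W_0$.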
In our example with $n = 10^6$ power-law flows, we get $p_0$ = … and $K_0$ = ….

¹ These values allow the router to hold about $10^5$ flow records (each with a flow ID and a counter) and process 1-Kbit packets at OC-768 rates (i.e., 40 Gbps).

VII. PERFORMANCE EVALUATION

In this section, we evaluate our models using several Internet traces in Table III from NLANR [24] and CAIDA [3]. Trace FRG was collected from a gigabit link between UCSD and Abilene in …. We extracted from it additional traces with only Web, DNS, and NTP flows (also seen in the table). Additionally, we use three traces from CAIDA: LARGE, a one-hour trace from an OC48 link; MEDIUM, a one-minute trace from an OC192 link; and SMALL, a 7-minute trace from a gigabit link.

TABLE III. Reduction in the number of flows using residual sampling with $p = 0.01$ and different types of periodic removal of dead flows. Columns: source; trace; total flows $n$; total packets $nE[L]$; sampling only ($E[M]$, $r_1$); removal only ($E[N(t)]$, $r_2$); both ($E[M(t)]$, $r_3$). Rows: NLANR (FRG, Web, DNS, NTP) and CAIDA (LARGE, MEDIUM, SMALL).

TABLE IV. Performance of URGE with $p$ = … and hash table size $K = E[M(t)]$

| Source | Trace | $E[W_R(t)]$ | $E[T_R(t)]$ | $n$ (actual) | $\tilde n$ (estimated) | Error | $n_1$ (actual) | $\tilde n_1$ (estimated) | Error |
|---|---|---|---|---|---|---|---|---|---|
| NLANR | FRG | 31 KB | 24.1 ns | 1,756,702 | 1,736,… | … | 768,… | … | … |
| NLANR | Web | 10 KB | 21.4 ns | 239,… | … | … | 13,… | … | … |
| NLANR | DNS | 257 B | 21 ns | 120,… | … | … | 76,… | … | … |
| NLANR | NTP | 752 B | 21.1 ns | 382,… | … | … | 281,… | … | … |
| CAIDA | LARGE | 132 KB | 28.1 ns | 9,653,609 | 9,717,… | … | 4,535,449 | 4,630,… | … |
| CAIDA | MEDIUM | 341 KB | 23.7 ns | 2,317,369 | 2,278,… | … | 1,299,343 | 1,273,… | … |
| CAIDA | SMALL | 23 KB | 21.2 ns | 200,… | … | … | 93,… | … | … |

Fig. 14. Estimating single-flow usage in the FRG trace with $p$ = …: (a) $E[e(R_l)]$; (b) $E[\hat e(R_l)]$.

Fig. 15. RRMSE of single-flow usage in the FRG trace with $p$ = …: (a) $\delta_l$; (b) $\hat\delta_l$.

As Table III shows, URGE typically sees a reasonably large number of flows $M$ over the entire interval $[0, T]$; however, the number of active flows $N(t)$ and the number of those constantly kept in memory $M(t)$ are much smaller. For the FRG trace, for example, $E[M]$ is 15 times smaller than $n$, while $E[N(t)]$ is 81 and $E[M(t)]$ is 658 times smaller. In general, NLANR traces benefit more from the removal of dead flows than CAIDA data, because the former was collected over two consecutive days and thus had a larger observation window $T$, which led to larger ratios $r_2$ and $r_3$. The same reasoning also explains the fact that the LARGE trace exhibits much higher benefit from removing dead flows than the MEDIUM or SMALL traces.

A. Memory and Speed

We use the settings of Table II to compute the amount of memory consumed by URGE according to (73).
As shown in the third column of Table IV for $p$ = … and $K = E[M(t)]$, the required memory size is small and rarely exceeds 40 KB. Even for the LARGE trace that has the most flows in this comparison, URGE only needs 132 KB of RAM, much smaller than the roughly 120 MB required for keeping all flow counters. We also compute per-packet processing time from (74) based on Table II and show in the fourth column of Table IV that $E[T_R(t)] \le 25$ ns in the majority of the studied cases.

B. Estimation Accuracy

First, we examine the problem of estimating the total number of flows $n$ in $[0, T]$ and size-one flows $n_1$ in this interval. The seventh and tenth columns of Table IV list the absolute error of estimators (58) and (59), respectively. With the exception of the Web NLANR trace, these estimates are within approximately 3% of the correct value.

We next evaluate the performance of URGE in estimating single-flow usage. Fig. 14 plots the expectation of estimated flow sizes (averaged over 100 iterations) along with the actual values obtained from the FRG trace using $p$ = …. The figure shows that the estimator $e(R_l)$ from previous work tends to overestimate the sizes of small flows, while URGE's estimator $\hat e(R_l)$ accurately follows the actual values. We also compare the relative errors of the two studied methods in Fig. 15, which indicates that URGE has RRMSE bounded by 1 for all flows, while $e(R_l)$ exhibits very large $\delta_l$ for small and medium flows, which is an increasing function of $1/p$.

For the flow-size distribution, we first examine three values of $p$ to compare its effect on the accuracy of URGE in the FRG trace. Fig. 16 indicates that the estimates for all three values of $p$ are very consistent and that all of them follow the actual distribution accurately. In our experiments with $p$ = …, URGE recovered the original PMF $\{f_i\}$ using only $M = 7{,}616$ total flows out of $n = 1.75$M.
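For concreteness, the estimators evaluated in this section can be sketched as below, together with the sample-size bound (50). The sketch assumes the residual counts are available as a dictionary mapping counter value $i$ to $M_i$; the function names are hypothetical, and the PMF estimate uses the form $q_i = (\tilde h_i - (1-p)\tilde h_{i+1})/(p + (1-p)\tilde h_1)$ with $\tilde h_i = M_i/M$, which is how (36) appears in the derivation of (63).

```python
from statistics import NormalDist


def urge_estimates(M_counts, p):
    """Estimate n, {n_i}, and the PMF {f_i} from residual counts.

    M_counts[i] is the number of sampled flows whose counter equals i;
    at least one flow must have been sampled.
    """
    M = sum(M_counts.values())

    def count(i):
        return M_counts.get(i, 0)

    n_hat = M + (1.0 - p) / p * count(1)                        # (58)
    n_i_hat = {i: (count(i) - (1.0 - p) * count(i + 1)) / p     # (59)
               for i in M_counts}
    denom = p + (1.0 - p) * count(1) / M
    q = {i: (count(i) - (1.0 - p) * count(i + 1)) / M / denom   # PMF estimate
         for i in M_counts}
    return n_hat, n_i_hat, q


def min_sample_size(h_i, eta, xi):
    """Right-hand side of (50): sample size M needed for accuracy eta."""
    z = NormalDist().inv_cdf(1.0 - xi / 2.0)
    return (1.0 - h_i) / (h_i * eta ** 2) * z ** 2
```

As a sanity check on a hypothetical input, 1,000 two-packet flows sampled with $p = 0.5$ yield on average $M_2 = 500$ and $M_1 = 250$; passing `{1: 250, 2: 500}` returns $\tilde n = 1000$, $\tilde n_2 = 1000$, and $q_2 = 1$. Evaluating `min_sample_size(0.01, 0.1, 0.05)` gives roughly 38K and `min_sample_size(1e-3, 0.01, 0.01)` roughly 66M, in line with the example following Theorem 5.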


Modeling Residual-Geometric Flow Sampling Xiaoming Wang Amazon.com Seattle, WA 98101 USA Email: xmwang@gmail.com Xiaoyong Li Texas A&M University College Station, TX 77843 USA Email: xiaoyong@cse.tamu.edu