Sampling and Censoring in Estimation of Flow Distributions


Nelson Antunes
Center for Computational and Stochastic Mathematics, University of Lisbon, Portugal

Vladas Pipiras
Department of Statistics and Operations Research, University of North Carolina, USA

Abstract

Traffic monitoring and estimation of flow characteristics, such as the size and duration distributions, can be problematic when the length of an observation window is constrained (e.g., due to hard limits on network resources). Indeed, as shown in this work, sampled flows are usually affected by censoring in an observation window, which leads to biased estimators. To account for censoring, a mathematical framework that describes sampling of flows in a time window is developed. Using censoring analysis, we provide nonparametric maximum likelihood estimators for the flow duration and size distributions. The estimators are computed using the EM algorithm. Finally, the estimators are applied to an actual traffic trace, and are found to perform very well.

I. INTRODUCTION

Measuring and monitoring flow traffic plays a central role in operating today's computer networks. Among the flow metrics considered in inference is the flow size (number of packets) [1], which is useful for traffic modeling, management and security. Flow duration [2] is another important metric for traffic prediction, traffic engineering to support QoS, and accounting, and it is arguably the most important one from the viewpoint of the user [3]. Due to the massive volume of traffic and the costs of storing and processing it, it has become impossible to reconstruct and track all flows on a network link. To overcome this problem, flow sampling has become commonplace [4].

One issue which has seemingly attracted little attention so far in the networking community is the role of the observation (time) window in making inference about flow metrics. An exception seems to be the work [5] on the flow duration distribution, which relies on full (non-sampled) traffic.
An observation window with a typical length of several minutes naturally censors the packets of a flow (see Figure 1). This can be more pronounced as the number of long-lived flows increases. The length of the observation window is often constrained by the need to sample flows under hard resource limits in the measurement infrastructure [6]. On the other hand, the length of the time window also depends on the targeted application. For instance, in network attacks [7], short time intervals are adequate to detect a sudden increase in the number of flows with one packet. In addition, it is well known that Internet traffic is nonstationary [8] over long time periods because of daily variations of interactive services (video, web, etc.), which limits the length of the observation window.

What are the effects of censoring on estimation, and how significant are they? The answer actually depends on the quantities used in estimation, as well as other factors (see Section II-D below). For example, if only uncensored flows are used in estimation of the distribution of flow sizes or durations, then the true distribution function will naturally tend to be overestimated for larger values of sizes and durations. Note that this leads not only to a bias but also to a loss of information from censored flows. Perhaps more surprising are the effects on estimation when all the sampled flows are used as if they were not censored. To leave the reader in some suspense, the answer can be found in Section II-D below. But regardless of which sampled flows are used, estimation that ignores censoring will lead to bias. Moreover, the bias will be larger for smaller observation windows. The problem might be mitigated if the observation window is sufficiently long; however, the measurement cost would be prohibitive.

The main goal of this paper is to show how the bias can be eliminated by taking censoring into account. We develop an analytical framework that describes the sampling of flows in an observation window.
Estimating the flow duration distribution under censoring turns out to be related to the problem considered by Wijers [9] in censoring analysis. The resulting estimator contrasts with the Kaplan-Meier estimator [10] used in [5], which does not fully describe all types of censored flows (only the ones that start in the observation window). It should be noted here that while censoring has not been considered in much depth in networking problems, it is widespread in other contexts, especially biostatistics [11]. The only assumption needed to apply the Wijers setting concerns the arrival process of flows. The problem of interest concerning the flow size distribution is more challenging. However, if one is willing to assume a simple and natural model for the packet arrival process within a flow, we show that the censoring framework is close to that considered in a celebrated work of Laslett [12] on the so-called line segment problem. The corresponding nonparametric maximum likelihood estimators for flow duration and size distributions are implemented using the EM algorithm. We assess the adequacy of the modeling assumptions and apply the results to infer the flow distributions from censored flows on an Internet trace. The estimates show very good accuracy compared to the true distributions.

The paper is organized as follows. Section II describes the analytical framework considered. Section III gives the nonparametric maximum likelihood estimators for the flow duration and size distributions based on sampled censored flows. The results are checked on an Internet trace in Section IV. Finally, conclusions are drawn in Section V.

II. ANALYTICAL FRAMEWORK

In this section, we introduce the conceptual framework for the censoring analysis. We first describe the traffic model and the sampling method used. We then classify the types of censored flows and show that ignoring censoring can lead to biased estimators.

A. Traffic modeling

A flow is defined by the usual 5-tuple of origin and destination IP addresses, port numbers and protocol field. We first need to describe the dynamics of network traffic in terms of flow arrivals and the internal flow structure of packet arrivals. The model that we use was verified against a large amount of Internet traffic data and was found to reproduce the main features of the network traffic [13].

We suppose that flows arrive in time according to a Poisson process with rate $\lambda$. A flow arriving at random time $T$ consists of a number of packets (size) $W$, with the probability mass function $f_W(w) = P(W = w)$, $w \ge 1$, the distribution function $F_W(w) = P(W \le w)$, $w \ge 1$, and the mean $\mu_W = E(W)$. The $W$ packets of the flow are separated by interarrival times (IATs) $D_i$, $i = 1, \ldots, W-1$, which are assumed to be i.i.d. and independent of $W$. Let $D$ be a variable with the common distribution of the $D_i$, with the distribution function $F_D(t) = P(D \le t)$, $t \ge 0$, and the mean $\mu_D = E(D)$. Note that the flow duration is $V = \sum_{i=1}^{W-1} D_i$, with $F_V(v) = P(V \le v \mid W \ge 2)$ and $\mu_V = E(V \mid W \ge 2)$ representing the distribution function and the mean of the duration of a flow with at least two packets. We note that the distribution functions of $W$ and $D$ are considered nonparametric, and the flows are i.i.d. In the stochastics literature, this model is known as the Bartlett-Lewis cluster point process [14], the only difference being the terminology used (cluster points instead of flow packets).

B. Flow sampling in a time window

Consider a measurement interval of duration $t > 0$, with $0$ and $t$ denoting the endpoints of the interval. Conceptually, the FS (flow sampling) scheme is very simple: flows are sampled independently with probability $p$, and thus discarded with probability $1 - p$, within the observation window $(0, t)$. Sampling a given flow means that each packet of the flow that falls within the observation window is captured, while none of its packets outside this window are captured. Hence, the size $W$ and duration $V$ of a sampled flow may not be fully observed, that is, they may be censored.

C. Censored and uncensored flows

Suppose that $n$ sampled flows arrive during the interval $(0, t)$. Suppose that we also observe $m$ sampled flows in $(0, t)$ that arrive before time $0$. According to the traffic model above, the random variables that describe the quantities $n$ and $m$ are independent and have Poisson distributions.

Fig. 1: Classification of sampled flows into l.c., d.c., u.c. and r.c. relative to the observation window (both empty and full circles correspond to packets and full circles to sampled packets).

TABLE I: Summary statistics of the data trace.
Trace name | Local start time | Duration | Packets | TCP Flows
Auckland IX | 2:: | :: | 48 million | ,37,756

Assume that sampled flows are enumerated in some way. Let $T_i$, $W_i$ and $V_i$ represent the arrival time, size and duration of the sampled flow $i$. For a sampled flow arriving within $(0, t)$, if $T_i + V_i < t$, then the whole flow is observed and uncensored (u.c., in short). If $T_i + V_i \ge t$, then the sampled flow is right censored (r.c.). Sampled flows arriving before time $0$ are necessarily either left censored (l.c.), in the case $T_i + V_i < t$, or double censored (d.c.), when $T_i + V_i \ge t$. See Figure 1. The original sizes $W_i$ and durations $V_i$ are not observed for either single end censored (s.e.c., i.e., r.c. or l.c.) or d.c. flows. Moreover, the observed duration of a d.c. flow is always equal to $t$.

We consider only TCP flows, which account for 80-90% of packets in the Internet traffic [15].
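The classification rules of Section II-C translate directly into a small decision function. The following Python sketch is only illustrative: the function name `classify_flow`, its argument names, and the convention of returning `None` for a flow that does not overlap the window are assumptions, not part of the original text.

```python
def classify_flow(arrival, duration, t):
    """Classify a sampled flow relative to the observation window (0, t).

    arrival  -- flow arrival time T_i (negative if the flow started before time 0)
    duration -- true flow duration V_i
    Returns (label, observed_duration), or None if the flow has no part
    inside the window (hypothetical convention for this sketch).
    """
    end = arrival + duration
    if arrival >= t or (arrival < 0 and end <= 0):
        return None
    if arrival >= 0:
        if end < t:
            return ("u.c.", duration)   # whole flow observed
        return ("r.c.", t - arrival)    # cut off at the right end
    if end < t:
        return ("l.c.", end)            # only the part after time 0 is seen
    return ("d.c.", t)                  # spans the whole window
```

For example, with a 180-second window, a flow arriving at time -20 with duration 50 is left censored and only 30 seconds of it are observed.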
The classification of the sampled flows is based on the SYN and FIN flags in the packet header. The analysis and results of this paper can be applied to other kinds of flows provided that suitable substitutes can be found for connection startup (SYN) and connection termination (FIN).

D. Trivial estimators

In this paper, we are interested in estimating the complementary distribution function (CDF) $\bar F_W = 1 - F_W$ of the sizes of flows, and the CDF $\bar F_V = 1 - F_V$ of the durations of flows, from the available (and hence possibly censored) data in the observation window $(0, t)$. In this regard, it is important first to understand the relationship of these CDFs to the corresponding ones obtained from some of the available data without accounting for censoring. We call these estimators the trivial estimators.

We consider the publicly available Internet trace Auckland IX [16] (see Table I), spanning 60 min, and take an observation window that starts 2 min after the start of the trace and has a length of 3 min. Figure 2 presents the log plot of $\bar F_V$ using different available types of censored flows without sampling (i.e., $p = 1$) in order to show the full magnitude of the differences. If a sampled flow has only one SYN packet, we assume that it is u.c. if a timeout (3 sec) since its arrival has expired during the observation window. Otherwise, it is assumed to be r.c. and is used in the estimation of the CDF of flow durations.

Fig. 2: CDF of flow durations $\bar F_V$ in an observation window (t = 3 min). Curves: original, u.c., u.c. + r.c., u.c. + s.e.c. + d.c.

The solid line in Figure 2 corresponds to the empirical $\bar F_V$ computed from all the flows with at least two packets in the trace, called original. The figure provides the estimates of the CDF when using only u.c. flows and when adding the r.c. flows in the estimation. In both cases, as expected, the complementary distribution is underestimated. Somewhat surprisingly, when all flows (i.e., u.c., s.e.c. and d.c.) in the observation interval are considered, the estimator lies above the original distribution. This is because the l.c. and d.c. flows, which started before the beginning of the measurement interval, have in general a larger number of packets due to the heavy-tailed nature of the flow size distribution, and therefore longer durations. These are the flows that have a higher chance of being active at a given time. The bias of the trivial estimators that ignore censoring in Figure 2 motivates the study of more accurate estimators. The same can be observed in a time window of 5, or even 30 min, but to a lesser extent. Similar conclusions can also be drawn for the CDF of the flow sizes $\bar F_W$ using the trivial estimators.

III. CENSORING ANALYSIS

The censoring analysis is based on the construction of a likelihood function where each type of sampled flow is treated differently. The likelihood function can be maximized in practice using the EM algorithm. To avoid defining different variables to denote similar quantities for the censoring analysis of flow durations and sizes, the notation introduced below is restricted to each section.

A. Flow duration

We are interested here in estimating the distribution of flow durations $F_V$ from the data of possibly censored durations in the observation window $(0, t)$. Suppose that $N = n$ sampled flows (with at least two packets) start in the observation window at times $T_i$, with the corresponding durations $V_i$, $i = 1, \ldots, n$. The random variable $N$ is Poisson with mean $\lambda p \bar F_W(1) t$. Let

$$Z_i = \min(V_i, t - T_i), \qquad L_i = \begin{cases} 1, & V_i < t - T_i, \\ 0, & V_i \ge t - T_i, \end{cases}$$

be the observed (possibly censored) durations and the observed variables indicating whether right censoring occurs, respectively. Similarly, we suppose that there are $M = m$ sampled flows (with at least two packets) that started at times, abusing the notation, $T_j$ before time $0$ ($T_j < 0$) and that continue in the observation window ($T_j + V_j > 0$). $M$ follows a Poisson distribution with parameter $\lambda p \bar F_W(1) \mu_V$. For $j = 1, \ldots, m$, let

$$Y_j = \min(T_j + V_j, t), \qquad E_j = \begin{cases} 1, & T_j + V_j < t, \\ 0, & T_j + V_j \ge t, \end{cases}$$

be the corresponding observed (censored) durations and the observed variables indicating whether left or double censoring occurs, respectively. Note that the values $E_j = 1, 0$ and $L_i = 1, 0$ correspond, respectively, to the left censored (l.c.), double censored (d.c.), uncensored (u.c.) and right censored (r.c.) durations.

Again, the basic problem is to estimate the distribution $F_V$ from the available data $(Z_i = z_i, L_i = l_i)$ and $(Y_j = y_j, E_j = e_j)$. Given $N + M = n + m$, the distribution $F_V$ can be estimated by a nonparametric maximum likelihood estimator (NPMLE) $\hat F_V$, maximizing the likelihood proportional to

$$\frac{1}{(t + \mu_V)^{n+m}} \prod_{i=1}^{n} (dF_V(z_i))^{l_i} (1 - F_V(z_i))^{1 - l_i} \prod_{j=1}^{m} (1 - F_V(y_j))^{e_j} \left( \int_{u=t}^{\infty} (1 - F_V(u))\, du \right)^{1 - e_j}. \tag{1}$$

Note that the likelihood (1) is defined in a natural way. For example, when $l_i = 1$, $(dF_V(z_i))^{l_i} = dF_V(z_i)$ can be thought of as the probability that the duration $V$ equals $z_i$ and is within the observation window $(0, t)$. The term $(1 - F_V(y_j))^{e_j} = 1 - F_V(y_j)$ when $e_j = 1$ above comes from the stationary density of the duration of the flow after time $0$, if the flow started before time $0$. Indeed, recall that this density is $(1 - F_V(v))/\mu_V$, $v > 0$.

The solution of the problem stated above can be obtained following the censoring analysis developed by Wijers [9]. It is more convenient to find and compute the solution $\hat F_V$ through a related probability distribution $G$ satisfying

$$dG(v) = \frac{t + v}{t + \mu_V}\, dF_V(v).$$

The likelihood (1) can be rewritten using $G$ as

$$\prod_{i=1}^{r} (dG(x_i))^{\phi_i} \left( \int_{v=x_i}^{\infty} \frac{1}{t + v}\, dG(v) \right)^{\gamma_i} \left( \int_{v=t}^{\infty} \frac{v - t}{t + v}\, dG(v) \right)^{n+m-r}, \tag{2}$$

where $x_1 < x_2 < \ldots < x_r$ are the ordered values of $y_j$ and $z_i$ for which either $e_j = 1$ or $l_i = 1, 0$ (that is, uncensored and single end censored durations), and $\phi_i$ and $\gamma_i$ are, respectively, the number of the uncensored and single end censored values at $x_i$. Note that, unlike (1), the likelihood (2) no longer involves the term $1/(t + \mu_V)^{n+m}$ at the very front. The likelihood (2) can further be expressed as

$$\prod_{i=1}^{r} (dG(x_i))^{\phi_i} \left( \int_{v=x_i}^{t} \frac{1}{t + v}\, dG(v) + g(t) \right)^{\gamma_i} h^{n+m-r}, \tag{3}$$

where

$$g(t) = \int_{t}^{\infty} \frac{1}{t + v}\, dG(v), \qquad h = \int_{t}^{\infty} \frac{v - t}{t + v}\, dG(v).$$

The log-likelihood for (3) can finally be written as

$$\int_{0}^{t} \log(dG(v))\, dF_{u.c.}(v) + \int_{0}^{t} \log(g(v))\, dF_{s.e.c.}(v) + \log(h)\, F_{d.c.}(t), \tag{4}$$

where

$$g(v) = \int_{v}^{t} \frac{dG(u)}{t + u} + g(t),$$

with $g(t)$ defined by $G(t) + 2t\, g(t) + h = 1$, and $F_{u.c.}$, $F_{s.e.c.}$ and $F_{d.c.}$ are, respectively, the empirical distributions of u.c., s.e.c. and d.c. durations but normalized by $n + m$ rather than the number of respective points; see [9] for further details.

When maximizing (4), the NPMLE $(\hat G, \hat h)$ of $(G, h)$ can be found using the EM algorithm. This amounts to essentially differentiating (4) with respect to the unknown parameters $dG(v)$ and setting the derivative equal to $0$, leading to the so-called self-consistency equations

$$d\hat G(v) = dF_{u.c.}(v) + \left( \int_{s=0}^{s=v} \frac{1}{\hat g(s)}\, dF_{s.e.c.}(s) \right) \frac{1}{t + v}\, d\hat G(v), \qquad \hat h = F_{d.c.}(t), \qquad 2t\, \hat g(t) = 1 - \hat h - \hat G(t).$$

In practice, the solution $(\hat G, \hat h)$ is found through the following steps:

Algorithm 1:

Input: $z_i, l_i$, $i = 1, \ldots, n$; $y_j, e_j$, $j = 1, \ldots, m$. The following quantities can then be computed: $x_i$, $i = 1, \ldots, r$ (see above), $F_{u.c.}(v)$, $F_{s.e.c.}(v)$, $F_{d.c.}(v)$. E.g., $F_{d.c.}(t) = (\text{\# of } y_j = t)/(n + m)$.

Initialization: Let $\hat h = F_{d.c.}(t)$ and, for $k = 0$, set $d\hat G_i^k = (1 - \hat h)/(n + m)$, $i = 1, \ldots, r$.

Loop: Until the stopping condition below is reached, compute:

$$\hat g^k(t) = \frac{1 - \hat h - \sum_{i=1}^{r} d\hat G_i^k}{2t},$$

$$\hat g_1^k = \sum_{i=1}^{r} \frac{d\hat G_i^k}{t + x_i} + \hat g^k(t), \qquad \hat g_i^k = \hat g_{i-1}^k - \frac{d\hat G_{i-1}^k}{t + x_{i-1}}, \quad i = 2, \ldots, r,$$

$$d\hat G_i^{k+1} = dF_{u.c.}(x_i) + \left( \sum_{j=1}^{i} \frac{dF_{s.e.c.}(x_j)}{\hat g_j^k} \right) \frac{d\hat G_i^k}{t + x_i}, \quad i = 1, \ldots, r,$$

$$k := k + 1.$$

Here $dF_{u.c.}(x_i)$ and $dF_{s.e.c.}(x_j)$ denote the jumps of $F_{u.c.}$ and $F_{s.e.c.}$ at $x_i$ and $x_j$, that is, $\phi_i/(n+m)$ and $\gamma_j/(n+m)$.

Stopping condition: $\max_{i=1,\ldots,r} |d\hat G_i^{k+1} - d\hat G_i^k| < \epsilon$, for a given $\epsilon > 0$.

The implementation of the algorithm in MATLAB code is available upon request. The estimator of $F_V$ is obtained from $d\hat G_i$, $i = 1, \ldots, r$, and $\hat h$ through

$$\hat F_V(x_i) = \frac{1}{\hat\nu} \sum_{j=1}^{i} \frac{d\hat G_j}{t + x_j}, \qquad \hat\nu = \frac{1 - \hat h - \sum_{i=1}^{r} d\hat G_i}{2t} + \sum_{j=1}^{r} \frac{d\hat G_j}{t + x_j}.$$

Moreover, the mean flow duration estimator is

$$\hat\mu_V = \frac{1}{\hat\nu} - t. \tag{5}$$

B. Flow size

The estimation of the flow size distribution under censoring is more challenging. We first need to develop a framework where flow sizes and censoring variables are independent.
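Before turning to the details of the flow size case, the EM iteration of Algorithm 1 in Section III-A can be sketched in Python. This is not the authors' MATLAB implementation: the function name `em_flow_duration`, its arguments, and the handling of ties through counters are assumptions made for illustration, and the sketch assumes at least one u.c. or s.e.c. observation.

```python
from collections import Counter


def em_flow_duration(z, l, y, e, t, eps=1e-10, max_iter=10000):
    """Illustrative sketch of Algorithm 1 (hypothetical helper).

    z, l -- observed durations and indicators for flows starting in (0, t):
            l = 1 for u.c., l = 0 for r.c.
    y, e -- observed durations and indicators for flows started before 0:
            e = 1 for l.c., e = 0 for d.c. (then y equals t)
    Returns (x, F, mu): support points x_i, estimates of F_V(x_i), mean estimate.
    """
    n_total = len(z) + len(y)
    uc = Counter(zi for zi, li in zip(z, l) if li == 1)
    sec = Counter(zi for zi, li in zip(z, l) if li == 0)
    sec.update(yj for yj, ej in zip(y, e) if ej == 1)
    x = sorted(set(uc) | set(sec))
    r = len(x)
    d_uc = [uc[xi] / n_total for xi in x]        # jumps of F_u.c.
    d_sec = [sec[xi] / n_total for xi in x]      # jumps of F_s.e.c.
    h = sum(1 for ej in e if ej == 0) / n_total  # h = F_d.c.(t)

    dG = [(1.0 - h) / n_total] * r               # initialization (k = 0)
    for _ in range(max_iter):
        g_t = (1.0 - h - sum(dG)) / (2.0 * t)
        # g[i] = sum over l >= i of dG[l] / (t + x[l]), plus g(t)
        g, tail = [0.0] * r, g_t
        for i in range(r - 1, -1, -1):
            tail += dG[i] / (t + x[i])
            g[i] = tail
        new_dG, acc = [], 0.0
        for i in range(r):
            if d_sec[i] > 0.0:
                acc += d_sec[i] / g[i]
            new_dG.append(d_uc[i] + acc * dG[i] / (t + x[i]))
        diff = max(abs(a - b) for a, b in zip(new_dG, dG))
        dG = new_dG
        if diff < eps:                           # stopping condition
            break

    g_t = (1.0 - h - sum(dG)) / (2.0 * t)
    nu = g_t + sum(dGi / (t + xi) for dGi, xi in zip(dG, x))
    F, cum = [], 0.0
    for dGi, xi in zip(dG, x):
        cum += dGi / (t + xi)
        F.append(cum / nu)
    return x, F, 1.0 / nu - t                    # mean estimate as in (5)
```

A call such as `em_flow_duration([1.0, 2.0, 5.0], [1, 1, 0], [3.0, 10.0], [1, 0], 10.0)` mixes all four flow types in a window of length 10; the tolerance `eps` and the iteration cap `max_iter` are arbitrary choices of this sketch.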
In particular, we will have to consider explicitly the IATs between consecutive packets of a flow. We show below how this formulation can be achieved.

Suppose that $N = n$ sampled flows start in $(0, t)$, and $T_i$ and $W_i$ denote the arrival time and size of the sampled flow $i$, respectively, which are independent. The arrival times $T_i$, $i = 1, \ldots, n$, are independently and uniformly distributed in $(0, t)$. With these $n$ sampled flows, we associate $n$ flows with an infinite number of packets. For infinite size flow $i$, let $S_{i,k}$ be the time of the $k$th packet, defined as

$$S_{i,1} = T_i, \qquad S_{i,k} = S_{i,1} + D_{i,1} + D_{i,2} + \ldots + D_{i,k-1}, \quad k = 2, 3, \ldots,$$

where the IATs $D_{i,1}, D_{i,2}, \ldots, D_{i,k-1}$ are i.i.d. and have the same distribution as $D$. Then, $C_i = \max\{k : S_{i,k} < t\}$ represents the number of packets of infinite size flow $i$ over $[T_i, t)$. The size of the sampled flow $i$ can now be viewed as censored by the independent random variable $C_i$. That is, we observe $Z_i = z_i$, $L_i = l_i$, $i = 1, \ldots, n$, where

$$Z_i = \min(W_i, C_i), \qquad L_i = \begin{cases} 1, & W_i \le C_i, \\ 0, & W_i > C_i. \end{cases}$$

To derive the estimator of the flow size distribution, we assume for simplicity that the size of a flow is driven by a positive continuous random variable. More specifically, this is used to simplify the likelihood (7) below. This approximation is a common assumption in the literature [17] since the range of flow sizes is very large. (The accuracy of the estimator obtained using this approximation is good; see Section IV.) Thus, the function $f_W$ should be read as a probability density function. The probability that we observe the pairs $(Z_i = z_i, L_i = l_i)$ for the $n$ sampled flows is proportional to

$$\prod_{i=1}^{n} f_W(z_i)^{l_i} (1 - F_W(z_i))^{1 - l_i}. \tag{6}$$
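The construction of the censoring variable $C_i$ and of the observed pair $(Z_i, L_i)$ can be illustrated with a few lines of Python. The function name `observe_size` and the use of a finite list of IATs standing in for the infinite flow are assumptions of this sketch; the list must be long enough for the packet times to pass the end of the window.

```python
from itertools import accumulate


def observe_size(arrival, iats, size, t):
    """Return (Z, L) for a sampled flow starting in (0, t) (hypothetical helper).

    arrival -- arrival time T_i of the flow, with 0 < T_i < t
    iats    -- IATs D_{i,1}, D_{i,2}, ... of the associated infinite flow; the
               finite list passed here is assumed to reach beyond time t
    size    -- true flow size W_i (number of packets)
    """
    # Packet times S_{i,1} = T_i, S_{i,k} = S_{i,1} + D_{i,1} + ... + D_{i,k-1}
    packet_times = accumulate([arrival] + list(iats))
    # C_i = max{k : S_{i,k} < t}; times are nondecreasing, so counting suffices
    c = sum(1 for s in packet_times if s < t)
    z = min(size, c)
    l = 1 if size <= c else 0
    return z, l
```

For instance, `observe_size(170, [4, 4, 4, 4], 5, 180)` gives packet times 170, 174, 178, 182, 186, so three packets fall before time 180 and a flow of true size 5 is observed as right censored with observed size 3.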

Maximization of (6) yields the celebrated Kaplan-Meier estimator [10] of $F_W$ for right censored flows.

Recall that we also observe $M = m$ sampled flows that start before time $0$. Similarly to above, consider $m$ flows of infinite size that have started a long time ago (i.e., the infinite renewal process is in a stationary regime). For infinite size flow $j$, denote the arrival time of the $k$th packet after time $0$ by

$$S^*_{j,k} = D^*_{j,1} + D_{j,2} + \ldots + D_{j,k}, \quad k = 1, 2, \ldots,$$

where $D^*_{j,1}$ has the equilibrium density function of $D$ given by $(1 - F_D(u))/\mu_D$, $u \ge 0$, and $D_{j,2}, \ldots, D_{j,k}$ have the same distribution as $D$. Thus, $C^*_j = \max\{k : S^*_{j,k} < t\}$ is the number of packets of an infinite size flow $j$ in $(0, t)$. For the $m$ sampled flows that start before time $0$, let $W^*_j$ denote the number of packets of the sampled flow $j$ after time $0$. The random variable $W^*_j$ has density $(1 - F_W(w))/\mu_W$ and is possibly censored by the independent random variable $C^*_j$. We thus observe $(Y_j = y_j, E_j = e_j)$ for the sampled flow $j$ that starts before time $0$, where

$$Y_j = \min(W^*_j, C^*_j), \qquad E_j = \begin{cases} 1, & W^*_j \le C^*_j, \\ 0, & W^*_j > C^*_j. \end{cases}$$

The probability that we observe the pairs $(Y_j = y_j, E_j = e_j)$, $j = 1, \ldots, m$, for the sampled flows is proportional to

$$\prod_{j=1}^{m} \left( (1 - F_W(y_j))/\mu_W \right)^{e_j} \left( \int_{y_j}^{\infty} (1 - F_W(v))/\mu_W\, dv \right)^{1 - e_j}. \tag{7}$$

Under the traffic model, the total number of flows sampled $N + M$ has a Poisson distribution with mean $\lambda p (E[V] + t)$ and $E[V] \approx \mu_W \mu_D$. Conditioning on $N + M = n + m$, and using (6) and (7), the likelihood to be maximized is

$$\binom{m + n}{m} \left( \frac{\mu_W \mu_D}{\mu_W \mu_D + t} \right)^{m} \left( \frac{t}{\mu_W \mu_D + t} \right)^{n} \prod_{i=1}^{n} f_W(z_i)^{l_i} (1 - F_W(z_i))^{1 - l_i} \prod_{j=1}^{m} \left( (1 - F_W(y_j))/\mu_W \right)^{e_j} \left( \int_{y_j}^{\infty} (1 - F_W(v))/\mu_W\, dv \right)^{1 - e_j},$$

which simplifies, up to factors that do not depend on $F_W$, to

$$\prod_{i=1}^{n} f_W(z_i)^{l_i} (1 - F_W(z_i))^{1 - l_i} \prod_{j=1}^{m} (1 - F_W(y_j))^{e_j} \left( \int_{y_j}^{\infty} (1 - F_W(v))\, dv \right)^{1 - e_j} \Big/ (\mu_W + t/\mu_D)^{m+n}. \tag{8}$$

Since $\mu_W$ in the denominator also depends on $F_W$, the maximization of (8) is a non-standard problem. The likelihood in (8) has the same form as the one considered in Laslett's line segment problem [12]. Let $x_1 < \ldots < x_r$ be the ordered values of $z_i$ and $y_j$ for which either $l_i = 1, 0$ or $e_j = 1$ (these are the uncensored and single end censored sampled sizes), and let $\phi_i$ and $\gamma_i$ be the number of uncensored and single end censored values, respectively, at $x_i$ (with $x_0 = 0$). Denote by $n_i$ the number of double-end censored flow sizes in $(x_{i-1}, x_i]$, written as $u_{i1}, \ldots, u_{in_i}$, with $x_{r+1} = \infty$. Instead of using the likelihood (8) with the distribution $F_W$, where we have to deal with $\mu_W$, Laslett proposed to write (8) in terms of

$$v_i = \frac{(t^* + x_i) f_W(x_i)}{t^* + \mu_W}, \quad i = 1, \ldots, r,$$

with $t^* = t/\mu_D$, and to let $v_{r+1}$ be such that $v_1 + \ldots + v_{r+1} = 1$. We get the equivalent likelihood

$$\prod_{i=1}^{r} (\theta_i - \theta_{i+1})^{\phi_i}\, \theta_i^{\gamma_i} \prod_{j=1}^{n_i} \left( H_{i-1} - (u_{ij} - x_{i-1})\, \theta_i \right) H_r^{n_{r+1}}, \tag{9}$$

where

$$\theta_i = \sum_{j=i}^{r} \frac{v_j}{t^* + x_j}, \qquad H_i = v_{r+1} + \sum_{j=i+1}^{r} v_j \left( 1 - \frac{t^* + x_i}{t^* + x_j} \right), \quad i = 1, \ldots, r,$$

are linear functions of $v_1, \ldots, v_{r+1}$, $\theta_{r+1} = 0$ and $H_0 = (t^* H_1 + x_1)/(t^* + x_1)$. The log-likelihood of (9) can be written in the form

$$L(v) = \sum_{i=1}^{N} \log\left( \sum_{j=1}^{r+1} \alpha_{ij} v_j \right), \tag{10}$$

where $\alpha_{ij} \ge 0$. Turnbull [18] gives an algorithm (now known to be the EM algorithm) for maximization of the likelihood (10) with respect to the $v_j$'s as follows:

Algorithm 2:

Input: $x_i$, $u_{ij}$, $i = 1, \ldots, r$, $j = 1, \ldots, n_i$, $t^*$.

Initialization: $v_i^k = 1/(r + 1)$, $k = 0$, $i = 1, \ldots, r + 1$.

Loop: Until the stopping condition below is reached, compute:

$$\theta_i^k = \sum_{j=i}^{r} \frac{v_j^k}{t^* + x_j}, \qquad H_i^k = v_{r+1}^k + \sum_{j=i+1}^{r} v_j^k \left( 1 - \frac{t^* + x_i}{t^* + x_j} \right), \qquad H_0^k = \frac{t^* H_1^k + x_1}{t^* + x_1},$$

$$v_i^{k+1} = \frac{1}{n + m} \left( \phi_i + \frac{v_i^k}{t^* + x_i} \sum_{j=1}^{i} \frac{\gamma_j}{\theta_j^k} + \frac{v_i^k}{t^* + x_i} \sum_{u_{lj} \le x_i} \frac{x_i - u_{lj}}{H_{l-1}^k - (u_{lj} - x_{l-1})\, \theta_l^k} \right), \quad i = 1, \ldots, r,$$

$$k := k + 1.$$

Stopping condition: $\max_{i=1,\ldots,r} |v_i^{k+1} - v_i^k| < \epsilon$, for a given $\epsilon > 0$.

The MATLAB code is available upon request. The estimator of the flow size distribution is obtained by going from the $v_i$'s back to $\bar F_W$ through

$$\hat{\bar F}_W(x_i) = \frac{(H_{i-1} - H_i)\, t^*}{(x_i - x_{i-1})(1 - H_0)}, \quad i = 1, \ldots, r.$$
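Algorithm 2 and the back-transformation to the flow size distribution can be sketched in Python as follows. This is not the authors' MATLAB code: the function names, the input layout (counts `phi` and `gamma` aligned with the sorted support `x`, and a flat list `dc` of double censored sizes) are assumptions, and $v_{r+1}$ is obtained here from the constraint $v_1 + \ldots + v_{r+1} = 1$. The sketch assumes $r \ge 1$ and positive, strictly increasing `x`.

```python
from bisect import bisect_left


def _theta_and_H(v, x, t_star):
    """theta[i] holds theta_{i+1} (theta[r] = 0); H[i] holds H_i for i = 0..r."""
    r = len(x)
    theta = [0.0] * (r + 1)
    for i in range(r - 1, -1, -1):
        theta[i] = theta[i + 1] + v[i] / (t_star + x[i])
    H = [0.0] * (r + 1)
    for i in range(1, r + 1):
        H[i] = v[r] + sum(v[j] * (1.0 - (t_star + x[i - 1]) / (t_star + x[j]))
                          for j in range(i, r))
    H[0] = (t_star * H[1] + x[0]) / (t_star + x[0])
    return theta, H


def turnbull_flow_size(x, phi, gamma, dc, t_star, eps=1e-10, max_iter=10000):
    """Illustrative sketch of Algorithm 2 (hypothetical helper).

    x      -- sorted distinct u.c. and s.e.c. sizes x_1 < ... < x_r
    phi    -- number of u.c. values at each x_i
    gamma  -- number of s.e.c. values at each x_i
    dc     -- list of double censored sizes u
    t_star -- t / mu_D
    Returns (v, surv), where surv[i] is the estimated complementary
    distribution at x_{i+1} obtained from the H values.
    """
    r = len(x)
    n_total = sum(phi) + sum(gamma) + len(dc)
    v = [1.0 / (r + 1)] * (r + 1)
    for _ in range(max_iter):
        theta, H = _theta_and_H(v, x, t_star)
        dc_terms = []
        for u in dc:
            idx = bisect_left(x, u)              # u lies in (x_{idx}, x_{idx+1}]
            x_prev = x[idx - 1] if idx > 0 else 0.0
            dc_terms.append((u, H[idx] - (u - x_prev) * theta[idx]))
        new_v, sec_sum = [], 0.0
        for i in range(r):
            if gamma[i]:
                sec_sum += gamma[i] / theta[i]
            dc_sum = sum((x[i] - u) / den for u, den in dc_terms if u <= x[i])
            new_v.append((phi[i] + v[i] / (t_star + x[i]) * (sec_sum + dc_sum))
                         / n_total)
        new_v.append(max(0.0, 1.0 - sum(new_v)))  # v_{r+1} from the sum constraint
        diff = max(abs(a - b) for a, b in zip(new_v[:r], v[:r]))
        v = new_v
        if diff < eps:
            break
    theta, H = _theta_and_H(v, x, t_star)
    x_prev = [0.0] + list(x[:-1])
    surv = [(H[i] - H[i + 1]) * t_star / ((x[i] - x_prev[i]) * (1.0 - H[0]))
            for i in range(r)]
    return v, surv
```

The quadratic cost of recomputing the $H_i$ in each iteration keeps the sketch close to the formulas; a production implementation would likely use running sums instead.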

Fig. 3: Checking the model assumptions. (a) Empirical distribution of the flow arrival times in an observation window (t = 3 min). (b) Flow density for flow size against the average IAT. (c) Average autocorrelation function of IATs at lag 3 for flow size against average IAT.

IV. EXPERIMENTS

In this section, we assess the adequacy of the traffic model and the accuracy of the estimators for the flow duration and flow size distributions under censoring with real Internet traffic [16]. The observation window starts 2 min after the beginning of the trace, which is one hour long.

A. Model checking

To check the Poisson arrival assumption of flows, we may test whether, conditionally on the number of arrivals until time $t$, the arrival times come from a uniform distribution on $(0, t)$. Figure 3a plots the empirical distribution of the arrival times of flows during an observation window of 3 min, which agrees with the linear distribution function of the uniform distribution. The standard Kolmogorov-Smirnov test for the uniform distribution gives the p-value smaller than 6.

The adequacy of the finite renewal process to describe a flow can be assessed in several ways. We need to check whether IATs between packets of a flow are i.i.d. and independent of flow size. This is particularly important for the analysis in Section III-B. Note that $V_i/(W_i - 1)$, $W_i \ge 2$, can be thought of as the average IAT between packets of flow $i$. We consider all the flows in the trace. Figure 3b depicts $W$ against the average IAT in logarithmic scale. The level corresponds to the flow density (number of flows in each small square). If IATs are identically distributed, the average IAT should cover a wide range of values for small flow sizes. Since most flows have a small number of packets, the highest density is concentrated in this region.
We do not see multiple regions of high concentration that would suggest several non-identical IAT distributions. As the flow size $W$ increases, the independence of IATs and size should translate into the top of the triangle, with the average IAT being around the value where the density is higher. In the plot shown, some large flows have shorter IATs, indicating small deviations from the finite renewal process assumptions for large flows. Regarding the independence of IATs, Figure 3c gives the average of the autocorrelation function of IATs at lag 3 within a flow, calculated individually for each flow with at least three packets and then averaged over squares in a log-log plot, for flow size against average IAT. We conclude that the autocorrelation is weak, and we have observed that it diminishes as the lag between IATs increases. In conclusion, there is no indication of any severe problem with the model assumptions used in the censoring analysis.

B. Flow duration and size distribution

Figure 4a depicts the log plot of the complementary distribution function (CDF) $\bar F_V$ of the duration of flows with at least two packets, obtained with the trivial estimators and with the estimator developed in the censoring analysis of Section III-A. The length of the observation window is 3 min and the probability of sampling a flow is .9. If $p$ is very small, we do not have enough sampled flows to use in the estimation. This also depends on the number of active flows in the time window. Sampled flows are classified based on SYN and FIN flags and a timeout between packets (3 sec); see Sections II-C and II-D. The estimators are plotted against the original CDF computed with all flows in the trace with at least two packets. Comparing with Figure 2, the trivial estimators show the same poor performance with sampling. When we consider all the sampled types of flows (i.e., u.c., s.e.c. and d.c. flows), there is an intriguing overestimation of the complementary distribution in comparison with the other trivial estimators; see Section II-D for a discussion. The estimator of Section III-A, which uses the information of the censored sampled flows, shows very good accuracy, and the convergence of the EM algorithm is quite fast. We only observe durations within $(0, 180)$, and so we are not able to estimate the CDF on $[180, \infty)$. However, we can use (5) to estimate the mean flow duration $\mu_V$. This gives .9425, with the true value being equal to .353.

Figure 4b shows the estimation of $\bar F_V$ over a 5 min window. The sampling probability is $p = .5$, since more flows can now be sampled in a larger window. As the window size increases, the bias of the trivial estimators becomes smaller. We have more variability in the tail of the estimator based on censoring analysis due to a reduced number of large durations in the sample. The mean flow duration estimate is

Fig. 4: CDF of flow durations $\bar F_V$. (a) t = 3 min and p = .9. (b) t = 5 min and p = .5.

Figure 5a presents the log plot of the CDF of flow size $\bar F_W$ given by the trivial estimators and the estimator developed using censoring analysis (Section III-B). We use the same setting as in Figure 4a with respect to the observation window length and sampling probability. The original flow size distribution is computed using all flows in the trace. The trivial estimator that includes l.c. and d.c. sampled flows gives a higher estimate of the complementary distribution. These sampled flows, which start before the beginning of the observation window, tend in general to have a large number of packets due to the heavy-tailed distribution of the flow sizes. The estimator which takes into account the information of the censored flows produces an accurate estimate of the distribution. Contrary to the flow duration, which is limited to the length of the window, we do not have here a bound on the flow size, since this depends on the IATs between packets of the flow. We compute this nonparametric estimator using the Turnbull algorithm, which converges quickly, and plot it over the range where its performance is satisfactory. The mean IAT between consecutive packets of a flow, $\mu_D$, used as input in the algorithm, is estimated from the sampled flows.

Fig. 5: CDF of flow sizes $\bar F_W$ (in packets). (a) t = 3 min and p = .5. (b) t = 5 min and p = .5.

Finally, Figure 5b shows $\bar F_W$ in log scale for t = 5 min and p = .5. Similar conclusions can be drawn, as a larger part of the distribution tail can be estimated with censoring analysis. Due to space limitations, we only presented the results for one data trace. The performance of the estimators was also good for other network traces found in [16].

V. CONCLUSIONS AND FUTURE WORK

This paper was motivated by the need to sample flows in a constrained time interval due to limited network resources or application needs. The incomplete (censored) record of sampled flows can seriously degrade the accuracy of the estimates of the flow duration and size distributions. Under an appropriate mathematical description of flow sampling in an observation window and using censoring analysis, we presented nonparametric maximum likelihood estimators that rely on the information of the censored sampled flows. The estimators are computed using the EM algorithm. We evaluated them on a real Internet trace, and showed that they provide an accurate estimation of the distributions of flow level characteristics. As future work, we would like to extend this analysis to other available and popular sampling methods such as Dual Sampling, and Sample and Hold [19].

REFERENCES

[1] N. Duffield, C. Lund, and M. Thorup, "Estimating flow distributions from sampled flow statistics," IEEE/ACM Trans. Netw., vol. 13, Oct. 2005.
[2] A. Chen, Y. Jin, J. Cao, and L. E. Li, "Tracking long duration flows in network traffic," in Proc. INFOCOM, March 2010, pp. 1-5.
[3] N. Dukkipati and N. McKeown, "Why flow-completion time is the right metric for congestion control," SIGCOMM Comput. Commun. Rev., vol. 36, January 2006.
[4] N. Duffield, "Sampling for passive internet measurement: A review," Statist. Sci., vol. 19, no. 3, 2004.
[5] Y. Sakai, M. Uchida, M. Tsuru, and Y. Oie, "Impact of censoring on estimation of flow duration distribution and its mitigation using Kaplan-Meier-based method," IEICE Trans. Inf. & Syst., vol. E92-D, 2009.
[6] N. Duffield, C. Lund, and M. Thorup, "Flow sampling under hard resource constraints," in Proc. ACM SIGMETRICS, 2004.
[7] M. Thottan and C. Ji, "Anomaly detection in IP networks," IEEE Trans. Signal Process., vol. 51, no. 8, Aug. 2003.
[8] J. Cao, W. S. Cleveland, D. Lin, and D. X. Sun, "On the nonstationarity of internet traffic," in Proc. ACM SIGMETRICS, 2001.
[9] B. J. Wijers, "Consistent non-parametric estimation for a one-dimensional line segment process observed in an interval," Scandinavian Journal of Statistics, vol. 22, no. 3, 1995.
[10] E. L. Kaplan and P. Meier, "Nonparametric estimation from incomplete observations," Journal of the American Statistical Association, vol. 53, no. 282, 1958.
[11] J. P. Klein and M. L. Moeschberger, Survival Analysis: Techniques for Censored and Truncated Data. Springer, 2003.
[12] G. M. Laslett, "The survival curve under monotone density constraints with application to two-dimensional line segment processes," Biometrika, vol. 69, no. 1, pp. 153-160, 1982.
[13] N. Hohn, D. Veitch, and P. Abry, "Cluster processes: a natural language for network traffic," IEEE Trans. Signal Process., vol. 51, no. 8, Aug. 2003.
[14] D. J. Daley and D. Vere-Jones, An Introduction to the Theory of Point Processes. Springer-Verlag, 1988.
[15] B. Ribeiro, D. Towsley, T. Ye, and J. C. Bolot, "Fisher information of sampled packets: an application to flow size estimation," in Proc. ACM SIGCOMM, 2006.
[16] Auckland IX, file , Available:
[17] C. Barakat, G. Iannaccone, and C. Diot, "Ranking flows from sampled traffic," in Proc. CoNEXT, 2005.
[18] B. W. Turnbull, "The empirical distribution function with arbitrarily grouped, censored and truncated data," Journal of the Royal Statistical Society, Series B, vol. 38, no. 3, 1976.
[19] P. Tune and D. Veitch, "Fisher information in flow size estimation," IEEE Trans. Info. Theory, vol. 57, Oct. 2011.
