Bivariate statistical analysis of TCP-flow sizes and durations


Ann Oper Res (2009) 170: DOI /s

Natalia M. Markovich, Jorma Kilpi

Published online: 7 March 2009
Springer Science+Business Media, LLC 2009

N.M. Markovich, Institute of Control Sciences, Russian Academy of Sciences, Profsoyuznaya str. 65, Moscow, Russia. E-mail: markovic@ipu.rssi.ru, nmarkovich@yahoo.com
J. Kilpi, VTT, Vuorimiehentie 3, VTT, Finland. E-mail: jorma.kilpi@vtt.fi

Abstract We approximate the distribution of the TCP-flow rate by deriving it from the joint bivariate distribution of the flow sizes and flow durations of a given access network. The latter distribution is represented by a bivariate extreme value distribution using the Pickands dependence function A. We estimate the A-function to measure the dependence of random pairs: TCP-flow size and duration, TCP-flow rate and size, as well as rate and duration. We provide a method to test that the obtained estimate of the A-function is good and perform the analysis on one concrete data example.

Keywords TCP-flow, Extreme value distribution, Pickands function

In a recent paper, D'Auria and Resnick (2006) show that a new family of infinite-source Poisson models is able to take into account several properties of real traffic, such as burstiness, long range dependence (LRD), heavy tails, bursty behavior determined by high-bandwidth users and dependence determined by users with low transmission capacity. The basic elements of these flow-level models are the beginnings of the data transfer sessions and the rates (R), durations (D) and sizes (S) of each of these sessions. Different dependence structures within the triple (R, D, S) are possible in these models, and D'Auria and Resnick (2006) analyze the natural but optimistic case in which the rate R of a data transfer session is independent of its size S. A form of asymptotic independence for the pair (D, R) is obtained in Resnick (2006), p. 239, as a result of examining the tail of the product DR. In van de Meent and Mandjes (2005)

the dependence of R and D was found in 80% of the traces for aggregated flows generated by the same user by checking the condition $\mathrm{E}D\,\mathrm{E}R/\mathrm{E}S = 1$. Some specific variant of this general model-family framework could also be suitable for access networks where the level of traffic aggregation may be moderate. The advantages of such flow-level models are discussed, for example, in van de Meent and Mandjes (2005).

The data transfer session could be interpreted as an IP-flow, which is an aggregate of traffic with the same source/destination IP-addresses, or it could be defined by a TCP-connection if we restrict our interest to TCP-traffic only. This latter choice is somewhat more specific but more accurately defined. Sizes and durations of both directions of a TCP-connection are well-defined and can be easily measured with, for example, a tool like Tstat (2007).

The quantities S and D are obtained directly from the data. For the purposes of this paper we define the rate R of a TCP-flow in a natural way as the ratio R = S/D. Since both S and D are random, the rate R is also random. Since S and D are dependent (see the results of the empirical study in Sect. 4.2) and positive, the joint distribution F of the pair (S, D) determines the distribution of R:

$$F_R(x) = P\{S/D \le x\} = \int_0^\infty \int_0^{zx} dF(y,z) = \int_0^\infty P\{S \le zx \mid D = z\}\, dF_D(z). \qquad (1)$$

We denote by $F_D$ the marginal cumulative distribution function (CDF) of D. Direct estimation of the distribution of R from the data, as the ratio S/D, would disregard the dependence structure of S and D. Since the distributions of both S and D are heavy-tailed (see the investigation in Sect. 2), their joint distribution F is very difficult to estimate. This dependence structure is driven by the physical structure of the access network, which contains access links of various capacities, and by the separation of users behind their access links.
With a faster access rate users tend to download larger files, but the access link capacity is not the only factor. The tariffing policy, for example, also has an effect.

The distribution of the rate R = S/D is of interest, since it allows us to estimate the expectation and variance of the rate. Another aim is to measure the degree of dependence between S and D, R and S, as well as R and D. In the context of heavy-tailed distributions, the asymptotic distribution (in the sense that the sample size increases to infinity) of the normalized maximum of the sample is used as a model of the tail. We shall estimate a bivariate Extreme Value Distribution (EVD) G of the pair (S, D) and use it instead of F(y, z) in (1) (the necessary explanations are given in Sect. 4.1). There are at least two reasons for this. First, the maximum $M_n = \max_{i=1,\dots,n}\{X_i\}$ of n observations of a random variable X fully represents its distribution in the sense that $P\{X_1 + \dots + X_n > x\} \sim nP\{X_1 > x\} \sim P\{M_n > x\}$ as $x \to \infty$ (Embrechts et al. 1997). Second, the lack of observations beyond the sample range does not allow such a distribution to be restored at infinity.

Our approach follows three steps:
1. The preliminary detection of univariate heavy tails.
2. The dependence structure of univariate time series data.
3. The dependence structure of multivariate data.

For heavy-tailed data and, in particular, data with infinite variance, the classical statistical methods are not adequate and flexible enough. We should also distinguish between methods that are valid for independent and for dependent data. For example, an empirical CDF is

biased when applied to dependent data. The maximum likelihood (ML) method requires an independence assumption. The evaluation of the distributions of univariate data is especially important when estimating multivariate quantiles and distributions. When the data are independent or weakly dependent one can apply traditional methods like kernel estimators to estimate the probability density function (PDF), see Hall et al. (1995). The problem of density estimation arises when the data are long-range dependent (LRD), see Castellana and Leadbetter (1986).

The paper is organized as follows. First, the data analysed in this paper are briefly described in Sect. 1. The data analysis starts in Sect. 2, where we first describe rough methods to detect heaviness of tails of univariate distributions. Then, in Sect. 3, we study methods to detect dependencies in the data. The bivariate analysis is presented in Sect. 4. Finally, conclusions are drawn in the last section.

1 A brief description of the data

The data set we use in this paper is part of a trace measured at a gateway between a mobile network and the Internet. The important point is that all of the flows have passed through a mobile radio access, hence the access rates have always been rather limited, always less than 384 kb/s. The same data were used in a different study, reported in Kilpi and Lassila (2006), and from this study we know that the mobile backbone network was not congested at any time during the measurement. For this reason we ignore the daily structure of this traffic. In Kilpi and Lassila (2006) the problems of TCP in this specific trace were found to be due to the low access rate, the limited CPU capacity of mobile terminals and self-congestion at the access link due to user behavior. The latter two properties are prominent in this trace due to the tariffing policy of the operator, flat-rate charged hours.
Hence they are specific to this particular trace and are mentioned here only because of the problems they generated for the data analysis.

The extreme value theory that will be applied to these data assumes that the size-duration pairs $(S_i, D_i)$, $i = 1,\dots,N$, are an i.i.d. sample from a common joint distribution F. However, data obtained by monitoring a single point of a network fall into the category of observational data. They cannot automatically be considered i.i.d. data. Rather, they resemble (unequally spaced) time series data. For example, a single web session of a single user can spawn several TCP-connections with short inter-arrival times, and these TCP-connections are naturally dependent.

The observational data are not necessarily homogeneous enough for meaningful inference from the trace-specific empirical distribution. In order to get a more homogeneous data set we considered only downstream flows running on TCP port 80 (HTTP). We can assume that a user selects random web content, that is, he/she chooses the size S randomly from the file size distribution, and then downloads this content using TCP. The action of a user initiates the TCP connection and determines its arrival time AT, but the TCP protocol generates the flow departure time DT when the download is finished and thus also determines the flow duration D = DT − AT. The duration D of the download depends on the rate of the flow and, since the access rate is very low, the access rate can be assumed to be the major constraint on the flow rate rather than congestion elsewhere in the network. This is the special feature of this data set. The majority of pairs (S, D) are like this, although there are also outliers in the data.

The reconstruction of flows from the original tcpdump-file was made using the Tstat program (Tstat 2007). The total number of above described flows is N = and, throughout the paper, we will divide the data into smaller blocks of flows, consecutive according to arrival times AT.

The block size is denoted by n, the number of blocks is denoted by m, and we always have N = mn. We first consider m = 61 disjoint bivariate blocks, each of size n = . Table 1 gives the observed ranges [min, max] of sample means, sample variances and sample maxima over these 61 blocks. Content in Table 1 refers to the size of the downloaded web content and Transmitted means Content plus segments retransmitted by TCP. Both are measures of the size of a flow. SYN-FIN means the period from the three-way handshake (synchronization) to finish. (Other measures of duration would also be possible.) There are some individual examples where Transmitted is significantly larger than Content but, in general, they differ little. This may be due to a bias arising from the fact that we consider only completely observed flows. In the data examples of this paper the size S corresponds to Content. Then R = S/D can be interpreted as the rate that the user experiences.

Table 1 Brief description of the data

| Statistic | Unit | Definition | Sample mean (Min, Max) | Sample variance (Min, Max) | Sample max (Min, Max) |
| Size | kb | Content | | | |
| Size | kb | Transmitted | | | |
| Duration | sec | SYN-FIN | | | |

2 Testing for heaviness of univariate tails

2.1 Theory

Let $X_1, X_2, \dots, X_n$ be the observed sample of r.v.s that have an identical marginal distribution. Rough methods to test the heaviness of tails and the number of finite moments, see Embrechts et al. (1997) and Markovich and Krieger (2006), include the ratio of the maximum to the sum, the plot of the empirical mean excess function, QQ-plots and the tail index estimation.

The statistic

$$R_n(p) = M_n(p)/S_n(p), \quad n \ge 1,\ p > 0,$$

where $S_n(p) = |X_1|^p + \dots + |X_n|^p$ and $M_n(p) = \max(|X_1|^p, \dots, |X_n|^p)$, can be used to check the moment conditions of the data (Embrechts et al. 1997, p. 308). If $R_n(p)$ is small for large n, then $E|X|^p < \infty$; otherwise it suggests that the pth moment is infinite, $E|X|^p = \infty$.
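The max-to-sum check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `max_to_sum_ratio` and the synthetic Pareto and exponential samples are hypothetical and are not the trace data analysed in the paper.

```python
import random


def max_to_sum_ratio(xs, p):
    """Return the path R_1(p), ..., R_n(p) with R_k(p) = M_k(p) / S_k(p)."""
    running_sum = 0.0
    running_max = 0.0
    path = []
    for x in xs:
        v = abs(x) ** p
        running_sum += v
        running_max = max(running_max, v)
        # The ratio is undefined while all observations seen so far are zero.
        path.append(running_max / running_sum if running_sum > 0 else float("nan"))
    return path


rng = random.Random(0)
# Synthetic samples (assumption, for illustration only):
# a Pareto sample with tail index 1.5, so E X^2 is infinite,
# and an exponential sample, for which all moments are finite.
heavy = [rng.paretovariate(1.5) for _ in range(20000)]
light = [rng.expovariate(1.0) for _ in range(20000)]

r_heavy = max_to_sum_ratio(heavy, 2.0)[-1]
r_light = max_to_sum_ratio(light, 2.0)[-1]
print(f"R_n(2): heavy tail {r_heavy:.4f}, light tail {r_light:.4f}")
```

For the light-tailed sample the ratio should be close to zero at large n, whereas for the heavy-tailed sample a single large observation keeps it visibly away from zero.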
The Extreme Value Index (EVI) γ, or the tail index α = 1/γ, indicates the shape of the tail and is a significant part of the analysis of heavy-tailed data. Distributions with regularly varying tails (RVT) provide a wide class of heavy-tailed distributions. The distribution function of an RVT distribution is given by the formula $F(x) = 1 - x^{-1/\gamma}\ell(x)$, where $\ell(x)$ denotes a slowly varying function, e.g., $\ln x$ or a positive constant. The smaller the α, the heavier the tail. A positive sign of the EVI may imply heaviness of the tail. The EVI may indicate the number of finite moments if the distribution is RVT. In this case, the moments of order β exist, i.e. $E\{X^\beta\} < \infty$, if β < α, but $E\{X^\beta\} = \infty$ if β > α (Embrechts et al. 1997).

The well-known Hill's estimator is defined as

$$\hat\gamma^H(n,k) = \frac{1}{k}\sum_{i=1}^{k} \log X_{(n-i+1)} - \log X_{(n-k)},$$

where the parameter k determines the largest order statistics $X_{(n-k)} \le \dots \le X_{(n)}$ used in the estimate. This estimator is valid for a positive EVI γ and can also be constructed for dependent data. Two other estimators of the EVI, the moment estimator

$$\hat\gamma^M(n,k) = \hat\gamma^H(n,k) + 1 - \frac{1}{2}\left(1 - \frac{(\hat\gamma^H(n,k))^2}{S_{n,k}}\right)^{-1}, \qquad (2)$$

where $S_{n,k} = (1/k)\sum_{i=1}^{k} (\log X_{(n-i+1)} - \log X_{(n-k)})^2$, and the UH estimator

$$\hat\gamma^{UH}(n,k) = \frac{1}{k}\sum_{i=1}^{k} \log UH_i - \log UH_{k+1}, \qquad (3)$$

where $UH_i = X_{(n-i)}\,\hat\gamma^H(n,i)$, see Beirlant et al. (2004) or Embrechts et al. (1997), are based on a similar argument involving the k + 1 largest values and are valid for any real-valued EVI.

We also apply the group estimator, which is based on a different principle. For this estimator the sample is divided into l groups $V_1, \dots, V_l$, each group containing r r.v.s, i.e. n = lr. Let $M^{(1)}_{li} = \max\{X_j : X_j \in V_i\}$ and let $M^{(2)}_{li}$ denote the second largest element in the same group $V_i$. Then $\gamma_l = 1/z_l - 1$, where $z_l = (1/l)\sum_{i=1}^{l} M^{(2)}_{li}/M^{(1)}_{li}$, can also be used as an estimator of γ, see Davydov et al. (2000) for details.

The main problem of all EVI estimates is the choice of the parameter k (or r in the group estimator). This parameter can be calculated automatically by the bootstrap method, Markovitch and Krieger (2002), Markovich (2005).

The linearity of a QQ-plot shows that the parametric model of the distribution is selected correctly. It is fruitful to check as a candidate the Generalized Pareto distribution GPD(σ, γ), whose distribution function equals $1 - (1 + \gamma x/\sigma)^{-1/\gamma}$ for $\gamma \ne 0$ and $1 - \exp(-x/\sigma)$ for $\gamma = 0$, where $\sigma > 0$, and $x \ge 0$ if $\gamma \ge 0$ and $0 \le x \le -\sigma/\gamma$ if $\gamma < 0$. The GPD tail-fitting approach is based on the Pickands theorem that provides the GPD as a limiting distribution of the excesses over a high fixed threshold, Beirlant et al. (2004). However, a choice of parameters which provides a linear-looking QQ-plot is not unique.

For heavy-tailed distributions the mean excess function $e(u) = E[X - u \mid X > u]$ tends to infinity as the threshold u grows.
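As a rough illustration of the Hill estimator and the moment estimator (2), the following sketch computes both from the k + 1 largest order statistics. The function name `hill_and_moment`, the fixed choice k = 1000 and the synthetic Pareto sample are assumptions made for the example; the paper selects k by a bootstrap method, which is not reproduced here.

```python
import math
import random


def hill_and_moment(xs, k):
    """Hill and moment estimates of the EVI from the k + 1 largest values."""
    order = sorted(xs)
    n = len(order)
    if not 0 < k < n:
        raise ValueError("k must satisfy 0 < k < n")
    if order[n - k - 1] <= 0:
        raise ValueError("the k + 1 largest observations must be positive")
    log_threshold = math.log(order[n - k - 1])  # log X_(n-k)
    # log X_(n-i+1) - log X_(n-k) for i = 1, ..., k
    log_excesses = [math.log(order[n - i]) - log_threshold for i in range(1, k + 1)]
    hill = sum(log_excesses) / k
    s_nk = sum(d * d for d in log_excesses) / k
    # Degenerate samples (all log excesses equal) would make the
    # denominator below vanish; they are not handled in this sketch.
    moment = hill + 1.0 - 0.5 / (1.0 - hill * hill / s_nk)
    return hill, moment


rng = random.Random(1)
# Synthetic Pareto sample with tail index alpha = 2, i.e. gamma = 0.5
# (assumption, for illustration only).
sample = [rng.paretovariate(2.0) for _ in range(20000)]
gamma_hill, gamma_moment = hill_and_moment(sample, k=1000)
print(f"Hill: {gamma_hill:.3f}, moment: {gamma_moment:.3f}")
```

In practice both estimates are plotted against k, as in the middle and right plots of Fig. 1, because a single value of k can be misleading.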
For example, for the Pareto distribution the mean excess function increases linearly. For the data this can be tested with the empirical mean excess function, defined as

$$e_n(u) = \sum_{i=1}^{n} (X_i - u)\mathbf{1}\{X_i > u\} \Big/ \sum_{i=1}^{n} \mathbf{1}\{X_i > u\}.$$

A more detailed description of these methods is given in Markovich and Krieger (2006).

2.2 Practice

The estimators of the EVI mentioned above were applied to the observed flow sizes and durations. The results, again in the form of [min, max]-ranges of these statistics over all blocks when n = and m = 61, are given in Table 2. For each sample the parameter k was estimated by a bootstrapping method. For the group estimator we used $r = l = 100 = \sqrt{n}$. The positive signs of all estimates allow us to assume that both flow sizes and durations have heavy-tailed

distributions. All estimators apart from the group estimator indicate that the block-samples of flow sizes (both Content and Transmitted) may have infinite variance under the assumption that their distributions are regularly varying. Some block-samples of flow durations may have two finite first moments. Examples of the results of this analysis for one typical block-sample of observations are given in Figs. 1 and 2, and a summary over all m = 61 blocks is given in Table 3. The column Estimators of γ summarizes Table 2. The indication <1 implies that only the pth moments with p < 1 may exist.

Fig. 1 On the left are examples of $n \mapsto R_n(p)$ calculated from one block of the durations and for a variety of p-values: p = 0.5 (lowest curve), p = 1.0 (middle curve) and p = 2.0 (topmost curve). Since $R_n(1)$ does not decrease as n increases, one may conclude that the moments beginning with the first are not finite. The middle and right plots are examples of the estimation of γ by the Hill's estimator (solid line) and by the moment estimator (dotted line) for the flow size (middle) and the duration of transmission (right). Horizontal lines show the bootstrap selected values: in the middle plot they are $\hat\gamma^H(n,k)$ = and $\hat\gamma^M(n,k)$ = and in the right plot they are $\hat\gamma^H(n,k)$ = and $\hat\gamma^M(n,k)$ = . For this block of data the estimate of the tail index α is larger than 1. This may imply that the first moment of the corresponding distribution is finite

Fig. 2 On the left is an example of a QQ-plot for the duration with quantiles of the GPD(0.85, 1). In the middle and on the right are examples of the exceedance $e_n(u)$ against the threshold u for the flow size and duration, respectively

Table 2 Estimation of the EVI for flow sizes and durations

| | $\hat\gamma^H(n,k)$ (Min, Max) | $\gamma_l$ (Min, Max) | $\hat\gamma^M(n,k)$ (Min, Max) | $\hat\gamma^{UH}(n,k)$ (Min, Max) |
| Content | | | | |
| Transmitted | | | | |
| SYN-FIN | | | | |
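The empirical mean excess function $e_n(u)$ plotted in Fig. 2 is simple to compute. The sketch below assumes nothing beyond the definition given in Sect. 2.1; the function name `mean_excess` and the synthetic Pareto sample are hypothetical.

```python
import random


def mean_excess(xs, u):
    """Empirical mean excess e_n(u): average of X_i - u over all X_i > u."""
    excesses = [x - u for x in xs if x > u]
    if not excesses:
        return float("nan")  # no observation exceeds the threshold
    return sum(excesses) / len(excesses)


rng = random.Random(2)
# Synthetic Pareto sample with tail index 3 (assumption, for illustration
# only); its theoretical mean excess e(u) = u / 2 grows linearly in u.
sample = [rng.paretovariate(3.0) for _ in range(50000)]
for u in (1.0, 2.0, 4.0):
    print(f"u = {u}: e_n(u) = {mean_excess(sample, u):.3f}")
```

As noted in the text, the values at large u rest on very few exceedances and should be read with caution.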
The conclusion of Table 3 does not mean that the GPD was a good model. It means only that in the QQ-plot the few largest values of the tail could be forced into a roughly linear relationship with the model quantiles and that $e_n(u)$ seems

to grow linearly when u is small, as shown in Fig. 2. When u is large the number of data points larger than u is very small and, thus, $e_n(u)$ is not very reliable.

We emphasize that the common distribution found here is affected by factors specific to the trace. It is also affected by our choice to consider only port 80 traffic. The results cannot be generalized to arbitrary TCP-flows. However, these results are in line with similar results obtained elsewhere, see e.g. those referred to by D'Auria and Resnick (2006).

Table 3 Conclusion over all m = 61 samples

| | Amount of first finite moments: $R_n(p)$ | Amount of first finite moments: Estimators of γ | Type of distribution: QQ-plot | Type of distribution: $e_n(u)$ |
| Content | 1 | 2 or 1 | GPD(σ = 1, γ = 1.3) | Pareto-like |
| Transmitted | 1 | 2 or <1 | GPD(σ = 1, γ = 1.3) | Pareto-like |
| SYN-FIN | <1 | 2 or <1 | GPD(σ = 1, γ = 0.85) | Pareto-like |

3 Testing for dependence of univariate tails

3.1 Theory

The autocorrelation function (ACF) at lag h,

$$\rho_X(h) = \rho(X_t, X_{t+h}) = E\big((X_t - EX_t)(X_{t+h} - EX_{t+h})\big)/\mathrm{Var}(X_t),$$

is considered an important indicator of the dependence structure of a time series. For a stationary time series $\{X_t, t = 0, \pm 1, \pm 2, \dots\}$ the standard sample ACF at lag $h \in \mathbb{Z}$ is defined by

$$\rho_{n,X}(h) = \frac{\sum_{t=1}^{n-h} (X_t - \bar X_n)(X_{t+h} - \bar X_n)}{\sum_{t=1}^{n} (X_t - \bar X_n)^2}, \qquad (4)$$

where $\bar X_n = \frac{1}{n}\sum_{t=1}^{n} X_t$ represents the sample mean. The relevance of this estimate is determined by its rate of convergence to the real ACF. When the distribution of the $X_t$'s is very heavy-tailed (in the sense that $EX_t^4 = \infty$, i.e. the tail index is small enough, α < 4 in this case), this rate can be extremely slow. For heavy-tailed data with infinite variance it is better to use the following slightly modified estimate without the usual centering by the sample mean:

$$\tilde\rho_{n,X}(h) = \frac{\sum_{t=1}^{n-h} X_t X_{t+h}}{\sum_{t=1}^{n} X_t^2}. \qquad (5)$$

However, this estimate may behave in a very unpredictable way and not estimate anything reasonable for the class of non-linear processes, in the sense that this sample ACF may converge in distribution to a non-degenerate r.v. depending on h. For linear processes it converges in distribution to a constant depending on h (Davis and Resnick 1985). Resnick (1997) says: "If on graphing the sample heavy tailed ACF $\tilde\rho_{n,X}(h)$ one finds only small values, then it may be possible to model the data as i.i.d. If the sample ACF is small beyond lag q, then there is some evidence that the moving average process MA(q) may be an appropriate model."
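The two sample ACFs (4) and (5) differ only in the centering by the sample mean, which the following minimal sketch makes explicit. The function names `sample_acf` and `heavy_tailed_acf` are hypothetical, and the four-point series is a toy input chosen so that the values can be checked by hand.

```python
def sample_acf(xs, h):
    """Standard sample ACF (4) at lag h, centered by the sample mean."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t + h] - mean) for t in range(n - h))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


def heavy_tailed_acf(xs, h):
    """Sample ACF (5) at lag h, without centering by the sample mean."""
    n = len(xs)
    num = sum(xs[t] * xs[t + h] for t in range(n - h))
    den = sum(x * x for x in xs)
    return num / den


series = [1.0, 2.0, 3.0, 4.0]  # toy series, for illustration only
print(sample_acf(series, 1))        # 1.25 / 5 = 0.25
print(heavy_tailed_acf(series, 1))  # 20 / 30
```

Both functions cost O(n) per lag, so evaluating several thousand lags on a block of 10000 flows remains cheap.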

Usual short range dependent data sets would show a sample ACF dying out after only a few lags and then remaining within the 95% Gaussian confidence window. Independence can be assumed with a certainty of 95% if $|\rho_{n,X}(h)| \le 1.96/\sqrt{n}$. The boundaries $\pm 1.96/\sqrt{n}$ provided by Bartlett's formula (Brockwell and Davis 1991) are valid only for linear processes with Gaussian noise. In the case of regularly varying linear processes (Resnick 2006) or nonlinear processes (Mikosch 2002) such boundaries cannot be indicated so easily. A decision regarding independence then evidently cannot be made from Bartlett's interval alone.

We also have to test for long range dependence, which means that there is dependence in a time series over an unusually long period of time. More exactly, the time series $(X_t)$ is long range dependent if

$$\sum_{h=0}^{\infty} |\rho_X(h)| = \infty. \qquad (6)$$

The Hurst parameter is a tool to test for LRD. One can assume that for some constant $c_\rho > 0$

$$\rho_X(h) \sim c_\rho h^{2(H-1)} \quad \text{for large } h \text{ and some } H \in (0.5, 1)$$

(in this case (6) holds). The closer the Hurst parameter H is to 1, the slower $\rho_X(h)$ decays to zero as $h \to \infty$, i.e., the longer the range of dependence in the time series.

3.2 Practice

Since we treat unequally spaced data as equally spaced, we are completely blind to dependencies caused by the fact that a single user's web session can, and often does, generate several TCP connections which obviously are dependent. Hence, we know beforehand that the data are not independent and the task is merely to quantify how dependent the univariate time series data are.

Non-stationarities caused by the daily traffic profile, tariffing policy and other similar trace-specific features easily affect the ACFs. Therefore we did the analysis using block sizes n = 10000, n = 1000 and, a few times, even n = 100. When the traffic load is low, n = may cover too long a period for the necessary stationarity assumption to hold.
When the traffic load is high, n = 1000 may be too short and biased due to trace-specific features. This resulted in hundreds of pictures, only a few of which can be shown here.

Very large values also easily affect the ACFs. The previous analysis (see Table 3) shows that the TCP-flow data appear to be heavy-tailed with possibly infinite variance. Therefore, the application of formula (5) is relevant. However, as indicated in Resnick (1997), there is no general knowledge of what small values of the ACF (5) should be, and without known confidence lines inference from any single example of (5) alone is not possible. Pictures of (5) contain the suggestive confidence lines $\pm 1.96/\sqrt{n}$. Therefore, we compared the ACFs calculated from the data ranked in arrival order with the ACFs calculated from the same data ranked in random order. Putting the data in a random order should destroy the dependence structure caused by correlated arrivals. Moreover, the operation of random ordering can be repeated many times for any fixed block.

Figure 3 shows one comparison of heavy-tailed ACFs. The general conclusion of this type of comparison was that dependence due to correlated arrivals is not very significant, since the overall shape of the (heavy-tailed) ACF did not usually change much. Large spikes are due to large values, not due to significant dependence at any lag. The standard ACFs of the TCP-flow durations have small

values except at several lags. One can recognize at least three clusters in the ACF-plot that may indicate dependence, Mikosch (2002). The heavy-tailed ACFs are small beyond approximately the 5000th and 7000th lag, respectively, for the plots on the left and on the right. This may point to a possible linear model of the considered time series and to dependence of the data. Since the values of both ACFs are of the same order, it may imply that the correlation of arrivals does not influence the ACF.

For flow sizes $S_i$ we also compared ACFs when the whole data set was first ranked in arrival order versus in departure order, with the division into blocks done after this departure ranking. In this case the same jth block does not consist of exactly the same points. This could bring forth some dependencies which are due to correlated flow departures. For flow durations this may not be meaningful since, due to the functional relationship $AT_i + D_i = DT_i$ for the data, it is hard to interpret the results. An example of this comparison is shown in Fig. 4. Again, the differences were typically small and could be explained by non-stationarities. The ACFs of the TCP-flow sizes may indicate a weak dependence.

Fig. 3 On the left is an example of (4) for a block of n = flow durations, in the middle is an example of the heavy-tailed ACF (5) for the same data, and on the right is the same data but ranked in a random order before calculating the ACF

Fig. 4 Left and middle plots show estimates of the sample heavy-tailed ACF (5) of the j = 10th block (n = 10000) of flow sizes: on the left the data are ranked in arrival order and in the middle the data are ranked in departure order before division into blocks. The right plot shows the sample values of Spearman's Rho, $r_{S,n}(D_{AT_i}, D_{DT_i})$, n = 10000, between durations in arrival and in departure rankings of all m = 61 blocks

The claim that the mobile access/backbone network was not congested at any time during the measurement means that, as a system, it was in a (trivial) equilibrium state. Since congestion at the access link, due to trace-specific user behaviour, and congestion in the Internet are only local, the flow departure process should be roughly independent of the flow arrival process. Spearman's Rho, denoted by $\rho_S$, was used to check whether sizes and durations, when ranked in arrival order, were independent of exactly the same data when ranked in departure order. That is, division into blocks was done first and rankings were done within the block only. The sample version $r_{S,n}$ of $\rho_S$ satisfies the well-known Kendall (1970) inequality $\mathrm{Var}(r_{S,n}) \le \frac{3}{n}(1 - \rho_S^2)$, which provides us with the confidence lines $\pm\sqrt{3/n}$. They are shown in the right plot of Fig. 4. This type of plot could indicate that ignoring the daily structure on the grounds of an uncongested mobile backbone would not be justified, but here it confirms this knowledge. The largest exceedances occur during the night time due to non-stationarity.
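A sample version of Spearman's Rho can be computed as the Pearson correlation of the ranks. The sketch below is one common way to do this and is not taken from the paper: the function names are hypothetical, ties receive average ranks (an assumption), and the band $\sqrt{3/n}$ is this sketch's reading of the confidence lines derived from the Kendall inequality.

```python
import math


def average_ranks(values):
    """Ranks 1..n of the values, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of positions i..j, 1-based
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Sample Spearman's Rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)


def kendall_band(n):
    """Half-width sqrt(3/n) implied by Var(r) <= 3/n (assumed reading)."""
    return math.sqrt(3.0 / n)


print(spearman_rho([1, 2, 3], [1, 3, 2]))  # 0.5
print(kendall_band(10000))
```

For a block of n = 10000 flows the band is about ±0.017, so even weak rank correlation between the two orderings would be visible.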

LRD was also tested. It would be slightly odd if Content, as chosen by a user, were LRD, but such statistics as Transmitted and the (SYN-FIN) duration could in principle be affected by the LRD nature of the aggregate packet level traffic. Again, since the assumptions of the various existing methods, see for example Taqqu and Teverovsky (1998), typically cannot be fully verified, different estimators of the Hurst parameter were compared. Figure 5 shows the results using the estimator $\hat H_n = 0.5\,(1 + \log_2(1 + \rho_{n,X}(1)))$ from Kettani and Gubner (2006). The confidence lines $\hat H_n \pm 5/\sqrt{n}$ are based on the assumption of an exactly second-order self-similar process. The exceedances of the confidence lines in the left plot of Fig. 5 are due to non-stationarity at night time. None of the time series is LRD.

Finally, since we really need to know the dependencies between the bivariate time series $(S_i, D_i)$, $i = 1,\dots,n$, we also constructed ACFs of the flow inter-arrival times $AT_{i+1} - AT_i$, the empirical rates $R_i = S_i/D_i$, and $S_i^2 + D_i^2$, $i = 1,\dots,n$. Two examples are shown in Fig. 5. These ACFs could indicate some significant bivariate time series dependencies that would not be visible in other ways, but they were extremely well behaved and, furthermore, the empirical rates confirmed the fact that the mobile backbone was not congested at any time during the over 30 hour long trace collection.

Fig. 5 Left plot shows estimates of the Hurst parameter for all block samples of durations when n = . The middle and right plots are examples of ACFs calculated for the flow inter-arrival times and for the empirical rates, respectively

As a conclusion of the analysis presented in this section we can say that there are dependencies in the univariate time series data but the data are certainly not LRD. On the contrary, they are close to independent data. Sizes (Content) are less dependent than durations, which is natural.
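The Hurst estimator quoted above from Kettani and Gubner (2006) depends only on the lag-1 sample ACF, so it can be sketched directly. The function names and the Gaussian white-noise input are assumptions for illustration; for such independent data the estimate should be close to 0.5.

```python
import math
import random


def lag1_acf(xs):
    """Standard sample ACF (4) at lag 1."""
    n = len(xs)
    mean = sum(xs) / n
    num = sum((xs[t] - mean) * (xs[t + 1] - mean) for t in range(n - 1))
    den = sum((x - mean) ** 2 for x in xs)
    return num / den


def hurst_from_rho1(rho1):
    """H = 0.5 * (1 + log2(1 + rho(1))); requires rho1 > -1."""
    return 0.5 * (1.0 + math.log2(1.0 + rho1))


rng = random.Random(3)
# Synthetic white noise (assumption, for illustration only).
noise = [rng.gauss(0.0, 1.0) for _ in range(10000)]
h_hat = hurst_from_rho1(lag1_acf(noise))
half_width = 5.0 / math.sqrt(len(noise))  # confidence lines H +/- 5 / sqrt(n)
print(f"H = {h_hat:.3f} +/- {half_width:.3f}")
```

A value of $\hat H_n$ whose confidence band stays near 0.5 is consistent with the conclusion that the series is not LRD.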
We can also assume that the bivariate time series data $(S_i, D_i)$, $i = 1,\dots,n$, are only weakly dependent. As a conclusion of the data analysis of both the previous section and this one, we can now conclude that the data not only appear to be but really are heavy-tailed.

4 Non-parametric estimation of the Pickands dependence function

4.1 Theory

Let $(X_1, Y_1), \dots, (X_n, Y_n)$ be a bivariate i.i.d. (or weakly dependent) sample with a bivariate max-stable distribution G. This implies that there exist normalizing constants $a_{j,n} > 0$ and $b_{j,n} \in \mathbb{R}$, $j = 1, 2$, such that as $n \to \infty$,

$$P\{(M_{1,n} - b_{1,n})/a_{1,n} \le x,\ (M_{2,n} - b_{2,n})/a_{2,n} \le y\} \to G(x,y), \qquad (7)$$

where $M_{1,n} = \max\{X_1, \dots, X_n\}$ and $M_{2,n} = \max\{Y_1, \dots, Y_n\}$ are the componentwise maxima (Fougères 2004). An asymptotic bivariate distribution G of normalized maxima $(M_1, M_2)$

may be determined by its margins $G_1$ and $G_2$ through the representation

$$G(x,y) = \exp\left(\log\big(G_1(x)G_2(y)\big)\, A\!\left(\frac{\log G_2(y)}{\log(G_1(x)G_2(y))}\right)\right), \qquad (8)$$

where $A(t)$, $t \in [0, 1]$, is the Pickands dependence function, Beirlant et al. (2004). One should not confuse the marginal CDFs $F_1$ and $F_2$ of the r.v.s X and Y with the corresponding univariate extreme value CDFs $G_1$ and $G_2$ of their maxima $M_{1,n}$ and $M_{2,n}$.

In order to get an approximation of $F_R(x)$ we can replace F(y, z) in (1) by the estimated G(y, z), Beirlant et al. (2004). Generally, the detailed motivation for replacing a d-variate CDF F by a d-variate extreme value CDF G is given in Beirlant et al. (2004), Sect. 9.4 (see also formula (9.62)), under the assumption that F is in the domain of attraction of G, i.e. (7) is fulfilled. We obtain

$$F_R(x) \approx \int_0^\infty \int_0^{zx} d\,\exp\left(\log\big(\hat G_1(y)\hat G_2(z)\big)\, \hat A\!\left(\frac{\log \hat G_2(z)}{\log(\hat G_1(y)\hat G_2(z))}\right)\right). \qquad (9)$$

Let the random pair (X, Y) have the DF G(x, y). The Pickands function indicates the degree of dependence between X and Y. In practice, (X, Y) are component-wise maxima over large blocks of data. In the bivariate case the function A(t) satisfies A(0) = A(1) = 1, it is convex and it lies inside the triangle determined by the points (0, 1), (1, 1) and (0.5, 0.5). The cases $A \equiv 1$ and $A(t) = \max\{1 - t, t\}$ correspond to total independence and total dependence, respectively.

It is convenient to transform the initial random pairs $(X^*_i, Y^*_i)$ to new pairs $(\xi_i, \eta_i)$, $i = 1,\dots,m$, where m is the number of blocks, in such a way that the margins are all the same. For example, the transformations

$$\xi_i = -\log G_1(X^*_i), \qquad \eta_i = -\log G_2(Y^*_i) \qquad (10)$$

lead to exponentially distributed r.v.s ξ and η. In this case we get

$$P\{\xi > x, \eta > y\} = \exp\left(-(x + y)\, A\!\left(\frac{y}{x + y}\right)\right), \quad x \ge 0,\ y \ge 0. \qquad (11)$$

The simulation study provided in Hall and Tajvidi (2000) indicates that the best estimators of A are $\hat A^C_m$ from Capéraà et al. (1997) and $\hat A^{HT}_m$ from Hall and Tajvidi (2000):

$$\frac{1}{\hat A^{HT}_m(t)} = \frac{1}{m}\sum_{j=1}^{m} \min\left\{\frac{\hat\xi_j/\bar\xi_m}{1 - t},\ \frac{\hat\eta_j/\bar\eta_m}{t}\right\}, \qquad (12)$$

and

$$\log \hat A^C_m(t) = \frac{1}{m}\sum_{j=1}^{m} \log\max\{t\hat\xi_j, (1 - t)\hat\eta_j\} - t\,\frac{1}{m}\sum_{j=1}^{m}\log\hat\xi_j - (1 - t)\,\frac{1}{m}\sum_{j=1}^{m}\log\hat\eta_j. \qquad (13)$$

Here $\hat\xi_j = -\log \hat G_1(X^*_j)$ and $\hat\eta_j = -\log \hat G_2(Y^*_j)$, $j = 1,\dots,m$, $\bar\xi_m = \frac{1}{m}\sum_{j=1}^{m}\hat\xi_j$, $\bar\eta_m = \frac{1}{m}\sum_{j=1}^{m}\hat\eta_j$.

When estimating A we face the following three problems.
1. The estimators are not convex. The easiest solution is to take their convex hull.

2. The margins $G_1$ and $G_2$ are unknown. One has to replace them by their estimates $\hat G_1$ and $\hat G_2$. One can choose some parametric model. The GPD(σ, γ) or the univariate generalized extreme value distribution (GEV) may be applied as such a model, i.e.

$$H_\gamma(x) = \begin{cases} \exp\left(-\left(1 + \gamma\,\frac{x - \mu}{\sigma}\right)^{-1/\gamma}\right), & \gamma \ne 0, \\ \exp\left(-e^{-(x - \mu)/\sigma}\right), & \gamma = 0, \end{cases}$$

where $1 + \gamma(x - \mu)/\sigma > 0$. This approach requires estimates of the parameters. In the i.i.d. case the ML method can be applied.
3. The pairs of component-wise maxima may be unobservable (artificial) in the sense that such pairs are not present in the initial bivariate sample. Under certain conditions one can estimate a bivariate extreme value distribution from an initial random sample, see Sect. 9.4 of Beirlant et al. (2004). The presence of artificial pairs does not affect the asymptotic distribution (7) and the representations (8), (11). Excluding artificial pairs may reduce the sample size. In consequence, the accuracy of the A-function estimation may decrease.

4.2 Practice

Given the jth block of size n, let us denote the block maxima by $S^*_j = \max\{S_i \mid i = 1,\dots,n\}$ and $D^*_j = \max\{D_i \mid i = 1,\dots,n\}$. If the number of such blocks is m we obtain bivariate block maxima data $(S^*_j, D^*_j)$, $j = 1,\dots,m$. The selection of the number of blocks m is a principal problem. Obviously, a larger number of blocks leads to a lower variance and a larger bias of the parameter estimates. The optimal number of blocks is connected with the dependence in the data. Roughly speaking, the size of the blocks should provide approximate independence of the maxima sample. The more dependent the data, the larger the size of the blocks should be (Leadbetter 1983). As concluded in Sect. 3, we can assume independence of the original pairs $(S_i, D_i)$. Experimentally the block sizes n = 200, 500, 1000 and were tested.
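The Hall-Tajvidi and Capéraà et al. estimators of A from Sect. 4.1 can be sketched as follows, assuming that the block maxima have already been transformed to unit-exponential margins by (10). The function names are hypothetical, the inputs are synthetic, and the convex-hull correction mentioned in problem 1 is not applied in this sketch.

```python
import math
import random


def pickands_ht(xi, eta, t):
    """Hall-Tajvidi estimate of A(t) for 0 < t < 1."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    m = len(xi)
    xi_bar = sum(xi) / m
    eta_bar = sum(eta) / m
    inverse = sum(
        min((x / xi_bar) / (1.0 - t), (e / eta_bar) / t) for x, e in zip(xi, eta)
    ) / m
    return 1.0 / inverse


def pickands_c(xi, eta, t):
    """Caperaa et al. estimate of A(t) for 0 < t < 1."""
    if not 0.0 < t < 1.0:
        raise ValueError("t must lie strictly between 0 and 1")
    m = len(xi)
    mean_log_max = sum(math.log(max(t * x, (1.0 - t) * e)) for x, e in zip(xi, eta)) / m
    mean_log_xi = sum(math.log(x) for x in xi) / m
    mean_log_eta = sum(math.log(e) for e in eta) / m
    return math.exp(mean_log_max - t * mean_log_xi - (1.0 - t) * mean_log_eta)


rng = random.Random(4)
# Synthetic unit-exponential margins (assumption, for illustration only).
xi_ind = [rng.expovariate(1.0) for _ in range(5000)]
eta_ind = [rng.expovariate(1.0) for _ in range(5000)]
print(pickands_ht(xi_ind, eta_ind, 0.5), pickands_c(xi_ind, eta_ind, 0.5))  # near 1
print(pickands_ht(xi_ind, xi_ind, 0.5), pickands_c(xi_ind, xi_ind, 0.5))    # 0.5
```

Independent margins give estimates near the upper boundary A ≡ 1, while identical margins reproduce the lower boundary max{1 − t, t}, which is a quick sanity check before applying the estimators to real block maxima.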
Since we have enough data, the analysis of flow size and duration pairs is done from true (not artificial) data points only, i.e., from flows where both size and duration are simultaneously block maxima. However, not every block contains such a pair, hence m is typically smaller than N/n. The left plot of Fig. 6 shows a scatter plot of such pairs when n = 1000 and m = 211. Before the estimation of the marginal CDFs of the component-wise maxima of size and duration by the parametric GEV model, the independence was checked with ACFs as explained in Sect. 3. The ML estimates of the GEV parameters are summarized in Table 4. We checked the GEV models by QQ-plots, and examples for n = 1000 and m = 211 are shown in Fig. 6. One can see that the QQ-plots show a quite linear behavior close to the diagonal line, except at a few tail points where the observations are much less heavy than the parametric model suggests. The cases n = 500 and n = 200 were worse, and in those cases the GEV model is not adequate. The normalizing constants of (7) are embedded in the parameters $\mu$ and $\sigma$ of Table 4. Hence they depend on the block size n, but the values of $\gamma$ should correspond to the values of Table 2 of Sect. 2. Since the fit was good (except for the outliers) only in the case n = 1000, this case provides the best estimates of the EVI. The obtained estimates of the Pickands function, shown in Fig. 7, show strong dependence between S and D (since both estimates are

situated deeply under the upper boundary of the triangle), weak dependence between R and S and almost independence of R and D (since both estimates, especially $\hat A^{C}_m(t)$, are close to the horizontal line).

Fig. 6 On the left is a scatter plot of pairs of block maxima $(S^*_j, D^*_j)$, $j = 1,\dots,m$, where n = 1000 and m = 211. QQ-plots against the GEV model of S (middle) and D (right) when n = 1000 and m = 211

Fig. 7 On the left plot are the estimates $\hat A^{HT}_m(t)$ (dashed line) and $\hat A^{C}_m(t)$ (solid line) calculated from $(S^*_j, D^*_j)$, $j = 1,\dots,m$, where n = 1000 and m = 211, when GEV models were used for the transform (10). The middle and right plots show estimates of A for the pairs $(S^*_j, R^*_j)$ and $(D^*_j, R^*_j)$, respectively, where $R^*_j = \max_{1 \le i \le n} R_{ij}$ is the maximal rate $S_i/D_i$ in the jth block, $j = 1,\dots,m$, m = 610, n = 1000

Table 4 ML estimates of GEV parameters of TCP-flow data

Statistic   Definition   Block size   Maxima   γ   μ   σ
Size        Content      n = 1000     m =
                         n = 500      m =
                         n = 200      m =
Duration    SYN-FIN      n = 1000     m =
                         n = 500      m =
                         n = 200      m =

The corresponding dependence (or independence) of the initial r.v.s S and D, R and S, R and D follows from the dependence (or independence) of

their extreme values, assuming that the corresponding sequences of pairs $(S_i, D_i)$, $(R_i, S_i)$ and $(R_i, D_i)$ are independent.¹ Since the GEV model was adequate only in the case n = 1000 and still there were outliers, the empirical CDFs were used as a model for the margins when estimating $\hat A^{HT}_m(t)$ and $\hat A^{C}_m(t)$ in the cases n = 200, n = 500 and also for n = . The middle plot of Fig. 8 shows the case when n = 200 and m = 857.

¹ Indeed, suppose the maxima $X^*$ and $Y^*$ are independent. This means that $P\{X^* \le x, Y^* \le y\} = P\{X^* \le x\}P\{Y^* \le y\}$. Then for i.i.d. pairs $(X_i, Y_i)$, $i = 1,\dots,n$, we have $P\{X^* \le x\}P\{Y^* \le y\} = P^n\{X_i \le x\}P^n\{Y_i \le y\} = P^n\{X_i \le x, Y_i \le y\}$. From here the independence of $X_i$ and $Y_i$ follows. Suppose now that the maxima $X^*$ and $Y^*$ of the r.v.s X and Y are dependent. Then evidently X and Y cannot be independent.

Fig. 8 Left: PP-plot test of $\hat A^{HT}_m(t)$ with the GEV marginals when n = 1000 and m = 211. Middle: The empirical CDF based estimation of the Pickands dependence function by the estimators $\hat A^{C}_m(t)$ (solid line) and $\hat A^{HT}_m(t)$ (dashed line) when n = 200 and m = 857. Right: PP-plot test of $\hat A^{HT}_m(t)$ with the empirical CDF as marginals when n = 200 and m = 857

4.3 Testing of estimates of A-function

Finally, we need a method to show that the estimated A is good. Assume that the second derivative $A''$ exists. It is a reasonable assumption since A is known to be convex. One such method proceeds as follows. We first transform the block maxima data $(S^*_i, D^*_i)$, $i = 1,\dots,m$, into data $(\xi_i, \eta_i)$, $i = 1,\dots,m$, by (10), using either parametric GEV models as margins or simply empirical CDFs. Then we define H(x, y) by

$$H(x,y) = P\{\xi \le x, \eta \le y\} = 1 - P\{\xi > x\} - P\{\eta > y\} + P\{\xi > x, \eta > y\} = 1 - e^{-x} - e^{-y} + \exp\left(-(x+y)\,A\left(\frac{y}{x+y}\right)\right).$$

We use the notation $h_{\xi|\eta}(x|y) = h(x,y)/e^{-y}$ for the conditional density of $\xi$ given that $\eta = y$. Then

$$P\{\xi \le x \mid \eta = y\} = \int_0^x h_{\xi|\eta}(w|y)\,dw = \int_0^x \frac{h(w,y)}{e^{-y}}\,dw = \frac{1}{e^{-y}}\,\frac{\partial H(x,y)}{\partial y}$$

and

$$P\{\xi/\eta \le z\} = \int_0^\infty P\{\xi \le zy \mid \eta = y\}\,e^{-y}\,dy = \int_0^\infty \frac{\partial H}{\partial y}(zy, y)\,dy.$$

Now, a straightforward calculation shows that the distribution of $\chi = \xi/\eta$ turns out to be

$$F_\chi(z) = \frac{z\left(1 + z - \dfrac{A'\left(\frac{1}{1+z}\right)}{A\left(\frac{1}{1+z}\right)}\right)}{(1+z)^2}. \tag{14}$$

The value of the derivative $A'(t)$ can be taken to be the slope of the edge of the convex hull of the estimate of A. After constructing the values $\chi_i = \xi_i/\eta_i$, $i = 1,\dots,m$, we can make a probability plot (PP-plot)

$$\left(F_\chi(\chi_{(i)}),\ \frac{i}{m}\right), \quad i = 1,\dots,m, \tag{15}$$

and see whether the fit is good. The proposed method makes use of the well known fact that if a random variable X has a continuous CDF F(x), then F(X) is uniformly distributed. Hence, the PP-plot is close to linear if the model of the distribution is selected correctly, see Coles (2001). The accuracy of this test depends on the accuracy of the estimation of both marginal distributions $G_1$ and $G_2$. The PP-plot can indicate an appropriate value of m for a selected model of the marginal distribution. The left and right plots of Fig. 8 show two examples of this test. In the left plot of Fig. 8 the model is the convex hull of $\hat A^{HT}_m(t)$, m = 211, with GEV marginals; in the right plot the model is the convex hull of $\hat A^{HT}_m(t)$, m = 857, with empirical CDFs as marginals.

4.4 Demonstration of the use of the approximation formula

In order to demonstrate the use of the approximation formula (9) we introduce the notation $\tilde R_j = S^*_j/D^*_j$, $j = 1,\dots,m$. (Note that $\tilde R_j \neq R^*_j = \max_{1 \le i \le n} R_{ij}$; the notation $R^*_j$ was used in Fig. 7.) We show a generic method to obtain a parametric model $F_R$ for the distribution of $\tilde R$, and show that it serves as a model for the conditional distribution of R = S/D given that S and D are larger than some thresholds $s_0$ and $d_0$, respectively. (Note that, since the access rates are bounded, a threshold $s_0$ for the size induces an implicit threshold $d_0$ for the duration.)
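The PP-plot test built on (14) and (15) can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the pairs are simulated independent Exp(1) variables (for which A is identically 1), and the derivative of A is supplied analytically instead of being read off the convex hull of an estimate.

```python
import random


def f_chi(z, A, dA):
    """CDF (14) of chi = xi / eta for a dependence function A with derivative dA."""
    w = 1.0 / (1.0 + z)
    return z * (1.0 + z - dA(w) / A(w)) / (1.0 + z) ** 2


def pp_plot_points(xi, eta, A, dA):
    """Return the PP-plot points (F_chi(chi_(i)), i/m), i = 1, ..., m, of (15)."""
    chi = sorted(x / y for x, y in zip(xi, eta))
    m = len(chi)
    return [(f_chi(c, A, dA), i / m) for i, c in enumerate(chi, start=1)]


def max_deviation(points):
    """Largest vertical distance of the PP-plot points from the diagonal."""
    return max(abs(u - v) for u, v in points)


def a_indep(t):
    """Independence model: A(t) = 1."""
    return 1.0


def da_indep(t):
    return 0.0


def a_logistic2(t):
    """Logistic model with r = 2: A(t) = (t^2 + (1 - t)^2)^(1/2)."""
    return (t * t + (1.0 - t) * (1.0 - t)) ** 0.5


def da_logistic2(t):
    return (2.0 * t - 1.0) / a_logistic2(t)


# Simulated stand-in data: independent unit exponential pairs.
random.seed(2)
xi = [random.expovariate(1.0) for _ in range(2000)]
eta = [random.expovariate(1.0) for _ in range(2000)]

print("independence model:", max_deviation(pp_plot_points(xi, eta, a_indep, da_indep)))
print("logistic r = 2 model:", max_deviation(pp_plot_points(xi, eta, a_logistic2, da_logistic2)))
```

On these simulated data the correctly specified independence model should give points close to the diagonal, while the misspecified logistic model should deviate visibly. Summarizing the plot by its largest deviation is a choice made for this sketch; the text above judges the plot visually.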
First we briefly list some reasons why a parametric model for $F_R$ is expected to be useful: (1) The quantiles of $F_R$ contain information about bottlenecks of the rates of all flows with size larger than $s_0$. (2) The LRD of the aggregate traffic is known to be a consequence of heavy-tailed file sizes. Thus, the ability to simulate rates of large flows can be expected to be helpful in studying LRD and burstiness phenomena of the aggregate traffic. The margins $G_1$ and $G_2$ of the right hand side of (9) can be modelled, for example, by the fitted GEV distributions or by GPDs, see Sect. 2. The POT package for the R statistical software, Ribatet (2006), was found useful when dealing with GPDs. However, for this data, the GEV distribution with parameters taken from Table 4 with m = 211 was finally the best choice. Next, guided by the non-parametric estimates of A, we searched for a suitable parametric model for A. In the literature, see e.g. Hall and Tajvidi (2000), Ribatet (2006), there exist at

least six different families of parametric models for A. The left plot of Fig. 9 shows that the best estimates of A, namely $\hat A^{HT}_m$ (and $\hat A^{C}_m$) with m = 857, can be approximated, for example, by a logistic model of the dependence function, see Hall and Tajvidi (2000), Ribatet (2006), of the type

$$A_r(t) = \left(t^r + (1-t)^r\right)^{1/r}$$

with r = 2. On the other hand, if $A_r$ was assumed a priori as a model for A and the value of r was estimated from the data, we also got $r \approx 2$.

Fig. 9 Left: A comparison of $\hat A^{HT}_m$, m = 857 (dashed curve) with the logistic model $A_r$ when r = 2 (solid curve). Middle: A PP-plot of the GEV and $A_2$ based parametric model $F_R$ against all flows with size larger than a threshold value 200 kb. Right: The parametric model CDF $F_R(x)$ (dashed curve) and the empirical CDF of rates of all flows with size $S \ge 200$ kb (solid curve)

To demonstrate the relevance of the model $F_R$ to empirical data, we show the empirical CDF of the corresponding observations $R_i = S_i/D_i$ and a PP-plot of the model. The middle plot of Fig. 9 shows a PP-plot of the parametric model $F_R$, where $G_1$ and $G_2$ are modelled by GEVs with parameters taken from Table 4 (m = 211) and $A_2$ is chosen as a model for A, against all flows with size larger than a threshold value $s_0 = 200$ kb. This threshold value was found by requiring the PP-plot to be as linear as possible. It corresponds to the 99.5% empirical quantile of sizes and there were flows with $S \ge s_0$. The implicit threshold for durations was $d_0 = 13.4$ s, which corresponds to the 65% empirical quantile of durations. The difference in the quantiles is due to the fact that there were a lot of flows with size smaller than 200 kb but duration longer than 13.4 s. Finally, the right plot of Fig. 9 shows the parametric CDF $F_R$, where $G_1$ and $G_2$ are modelled by GEVs with parameters taken from Table 4 (m = 211) and $A_2$ is chosen as a model for A. For comparison it also shows the empirical CDF of R = S/D of all flows with $S \ge s_0 = 200$ kb.
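The logistic family $A_r$ and one simple way of estimating r from a non-parametric estimate of A can be sketched as below. The text does not say how r was estimated, so the least-squares grid search here is an assumption of this sketch; the function names are hypothetical, and the "non-parametric estimate" is a synthetic stand-in generated from $A_2$ itself.

```python
def logistic_A(t, r):
    """Logistic dependence function A_r(t) = (t^r + (1 - t)^r)^(1/r), r >= 1."""
    return (t ** r + (1.0 - t) ** r) ** (1.0 / r)


def fit_r(ts, a_hat, r_grid):
    """Pick the r on the grid that minimizes the squared distance to a_hat at the points ts."""
    def sse(r):
        return sum((logistic_A(t, r) - a) ** 2 for t, a in zip(ts, a_hat))
    return min(r_grid, key=sse)


ts = [k / 20 for k in range(21)]
# Synthetic stand-in for a non-parametric estimate of A (hypothetical data).
a_hat = [logistic_A(t, 2.0) for t in ts]
r_grid = [1.0 + 0.05 * k for k in range(81)]  # candidate values from 1.0 to 5.0

print("fitted r:", fit_r(ts, a_hat, r_grid))
# Every dependence function must lie between max(t, 1 - t) and 1.
print(all(max(t, 1.0 - t) <= logistic_A(t, 2.0) <= 1.0 for t in ts))
```

The value r = 1 gives A = 1 (independence), and increasing r moves $A_r$ towards the lower boundary max(t, 1 - t) of the triangle (complete dependence), so r = 2 corresponds to an intermediate strength of dependence.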
5 Conclusions

This paper presents several new ideas concerning TCP-flow data analysis which are of interest both for statistics and for telecommunications. The distribution of the rate R of the TCP-flows of a given access network is approximated by means of the bivariate extreme value distribution of the flow size S and duration D, using the representation by the Pickands A-function. The dependence of the extreme values of size, duration and rate of TCP-flows is investigated by means of the Pickands function and different models of the marginal distributions, namely the GEV and the GPD. We found that, for this data, the TCP-flow size and duration are dependent random variables, the throughput rate and size are weakly dependent, and the throughput rate and duration are almost independent.

Regarding the statistical innovations, the proposed statistic (14) and the subsequent use of the PP-plot technique (15) provide a new method to test the relevance of estimates of the Pickands function. Another new statistical method is proposed to estimate the most sensitive smoothing parameter of the Pickands function, that is, the number of blocks m. To the best of our knowledge, this was done for the first time in this paper. Moreover, we gave a motivation for this type of analysis and performed the analysis with one concrete data example. A data transformation technique (10) to new data sets with exponentially distributed random variables is applied to estimate and to test the A-function. The use of such a transformation allows one to extend the described methodology to other tailed (not necessarily exponentially distributed) data sets, specifically to other data on TCP flows.

References

Beirlant, J., Goegebeur, Y., Teugels, J., & Segers, J. (2004). Statistics of extremes. New York: Wiley.
Brockwell, P., & Davis, R. (1991). Time series: theory and methods. Berlin: Springer.
Capéraà, P., Fougères, A.-L., & Genest, C. (1997). Estimation of bivariate extreme value copulas. Biometrika, 84.
Castellana, J. V., & Leadbetter, M. R. (1986). On smoothed probability density estimation for stationary processes. Stochastic Processes and their Applications, 21.
Coles, S. (2001). An introduction to statistical modeling of extreme values. Berlin: Springer.
D'Auria, B., & Resnick, S. (2006). Data network models of burstiness. Advances in Applied Probability, 38.
Davis, R., & Resnick, S. (1985). Limit theory for moving averages of random variables with regularly varying tail probabilities. Annals of Probability, 13.
Davydov, Y., Paulauskas, V., & Račkauskas, A. (2000). More on p-stable convex sets in Banach spaces. Journal of Theoretical Probability, 13(1).
Embrechts, P., Klüppelberg, C., & Mikosch, T. (1997). Modelling extremal events for finance and insurance. Berlin: Springer.
Fougères, A.-L. (2004). Multivariate extremes. In Extreme values in finance, telecommunications and the environment. London: Chapman & Hall.
Hall, P., & Tajvidi, N. (2000). Distribution and dependence-function estimation for bivariate extreme-value distributions. Bernoulli, 6.
Hall, P., Lahiri, S. N., & Truong, Y. K. (1995). On bandwidth choice for density estimation with dependent data. Annals of Statistics, 23(6).
Kendall, M. (1970). Rank correlation methods (4th edn.). London: Griffin.
Kettani, H., & Gubner, J. A. (2006). A novel approach to the estimation of the long-range dependence parameter. IEEE Transactions on Circuits and Systems-II: Express Briefs, 53(6).
Kilpi, J., & Lassila, P. (2006). Micro- and macroscopic analysis of RTT variability in GPRS and UMTS networks. In F. Boavida et al. (Eds.), LNCS: Networking 2006. Berlin: Springer.
Leadbetter, M. R. (1983). Extremes and local dependence in stationary sequences. Probability Theory and Related Fields, 65(2).
Markovich, N. M. (2005). On-line estimation of the tail index for heavy-tailed distributions with applications to WWW-traffic. In Proceedings of the EuroNGI first conference on NGI: traffic engineering, Rome, Italy.
Markovitch, N. M., & Krieger, U. R. (2002). The estimation of heavy-tailed probability density functions, their mixtures and quantiles. Computer Networks, 40(3).
Markovich, N. M., & Krieger, U. R. (2006). Inspection and analysis techniques for traffic data arising from the Internet. In Proceedings of the HETNETs '04 2nd international working conference on performance modelling and evaluation of heterogeneous networks, Ilkley, West Yorkshire (pp. 72/1-72/9).
Mikosch, T. (2002). Modeling dependence and tails of financial time series (Technical Report Working Paper No. 181). University of Copenhagen, Laboratory of Actuarial Mathematics.
Resnick, S. (1997). Heavy tail modeling and teletraffic data. Annals of Statistics, 25. With discussion and a rejoinder by the author.
Resnick, S. (2006). Heavy-tail phenomena. Probabilistic and statistical modeling. Berlin: Springer.
Ribatet, M. (2006). A user's guide to the POT package (Version 1.0).
Taqqu, M., & Teverovsky, V. (1998). On estimating the intensity of long-range dependence in finite and infinite variance time series. In R. J. Adler, F. E. Feldman, & M. S. Taqqu (Eds.), A practical guide to heavy tails. Basel: Birkhäuser.
Tstat (2007). Tstat: TCP statistic and analysis tool.
van de Meent, R., & Mandjes, M. (2005). Evaluation of user-oriented and black-box traffic models for link provisioning. In Proceedings of the 1st EuroNGI conference on next generation Internet networks traffic engineering, Rome, Italy. New York: IEEE Press.
