Methods, Techniques and Tools for IP Traffic Characterization, Measurements and Statistical Methods


Information Society Technologies (IST) - 6th Framework Programme

Deliverable No: D.WP.JRA
Methods, Techniques and Tools for IP Traffic Characterization, Measurements and Statistical Methods
Deliverable Version No: Version 1.0
Sending Date:
Editor's Name: Udo Krieger
Lead Contractor No:
Dissemination level: Public
Report URL:
Reference of Workpackage:

Project Acronym: Euro-NGI
Project Full Title: Design and Engineering of the Next Generation Internet, towards convergent multi-service networks.
Type of contract: Network of Excellence
Contract No:
Project URL:

Information Society Technologies (IST) - 6th Framework Programme

Editor's name: Udo Krieger
Editor's address: udo.krieger@ieee.org

With contributions of the following partners:

Partner No.  Partner Name  Contributor
8   VTT         Jorma Kilpi (Jorma.Kilpi@vtt.fi)
9   HUT         Pasi Lassila (Pasi.Lassila@hut.fi)
21  UniBA       Udo Krieger (udo.krieger@ieee.org)
26  TNG-Polito  Luca Muscariello (luca.muscariello@polito.it)
26  TNG-Polito  Marco Mellia (marco.mellia@polito.it)
35  IITiS PAN   Przemysław Glomb (przemg@iitis.gliwice.pl)
35  IITiS PAN   Krzysztof Grochla (kgrochla@iitis.gliwice.pl)
38  IT          Rui Valadas (rv@det.ua.pt)
38  IT          Paulo Salvador (salvador@av.it.pt)
39  CEMAT       M. Rosário de Oliveira (rosario.oliveira@math.ist.utl.pt)
39  CEMAT       Fátima Ferreira (mmferrei@utad.pt)
39  CEMAT       Antonio Pacheco (apacheco@math.ist.utl.pt)
41  ICS         Natalia Markovich (markovic@ipu.rssi.ru)
42  UPC         David Rincon (drincon@mat.upc.es)
42  UPC         Sebastià Sallent (sallent@mat.upc.es)
49  BTH         Patrick Arlos (Patrick.Arlos@bth.se)
49  BTH         Adrian Popescu (apo@bth.se)

Version History:
Version /01/ UK udo.krieger@ieee.org Internal Version
Version /01/ UK udo.krieger@ieee.org First Public Release


Contents

1 Research Achievements 9
2 Statistical Inspection and Analysis Techniques 13
3 Cluster Analysis of Internet Users 29
4 Web User Session Characterization 49
5 Passive Identification and Analysis of TCP Anomalies 61
6 Micro- and Macroscopic Analysis of RTT Variability in GPRS and UMTS 73
7 On-line Segmentation of Non-stationary Fractal Network Traffic 85
8 Representing Single-Link Traffic Quality 99
9 Nonparametric Estimation of the Renewal Function
10 Comparison of Level-Crossing Times
11 Selected (Joint) Publications during the Reporting Period 139


List of Figures

2.1 n → ln R_n(p) of the duration of sub-sessions for a variety of p-values
2.2 n → ln R_n(p) of the size of sub-sessions for a variety of p-values
2.3 n → ln R_n(p) of the inter-response times for a variety of p-values
2.4 n → ln R_n(p) of the size of responses for a variety of p-values
2.5 Exceedance e_n(u) against the threshold u for the duration of sub-sessions
2.6 Exceedance e_n(u) against the threshold u for the size of sub-sessions
2.7 Exceedance e_n(u) against the threshold u for the inter-response times
2.8 Exceedance e_n(u) against the threshold u for the size of responses
2.9 QQ-plots for the duration of sub-sessions. Left: exponential quantiles (top) and quantiles of the GPD(0.3;1) (bottom) against d.s.s./s (the linear curves correspond to the dependencies quantile against the quantile of the same distributions). Right: empirical DFs of the r.v. U_i = F(X_i); exponential, Pareto, Weibull, lognormal and normal distributions are used as models of F. The linear curve corresponds to the empirical DF against the exponential quantiles of the exponential DF F
2.10 QQ-plot for the size of sub-sessions. Left: exponential quantiles (top) and quantiles of the GPD(1;0.015) (bottom, first plot) and the GPD(0.05;0.3) (bottom, second plot) against s.s.s./s. Right: empirical DFs of the r.v. U_i = F(X_i), as in Fig. 2.9
2.11 QQ-plots for the inter-response times. Left: exponential quantiles (top) and quantiles of the GPD(0.8;0.015) (bottom) against i.r.t./s. Right: empirical DFs of the r.v. U_i = F(X_i), as in Fig. 2.9
2.12 QQ-plot for the size of responses. Left: exponential quantiles (top) and quantiles of the GPD(1;0.015) (bottom) against s.r./s. Right: empirical DFs of the r.v. U_i = F(X_i), as in Fig. 2.9
2.13 The EVI estimation by Hill's estimator and the group estimator γ_l for the data sets s.s.s. (left), d.s.s. (right)

2.14 The EVI estimation by Hill's estimator and the group estimator γ_l for the data sets i.r.t. (left), s.r. (right)
2.15 ACF estimation by a standard sample ACF (2.4) (first and third plots) and the modified sample ACF (2.5) (second and fourth plots) for the data sets s.s.s., d.s.s. The dotted horizontal lines indicate 95% asymptotic confidence bounds (±1.96/√n) corresponding to the ACF of i.i.d. Gaussian r.v.s
2.16 ACF estimation by a standard sample ACF (2.4) (first and third plots) and the modified sample ACF (2.5) (second and fourth plots) for the data sets i.r.t., s.r. The dotted horizontal lines indicate 95% asymptotic confidence bounds (±1.96/√n) corresponding to the ACF of i.i.d. Gaussian r.v.s
3.1 Relative utilization
3.2 Average activity duration
3.3 Average transfer rate
3.4 Total download transfer rate
3.5 Download transfer rates of 10 randomly selected ISP1 users
3.6 Loadings of the first two sample principal components for ISP1 and ISP2
3.7 Scores of the first two sample principal components, ISP
3.8 Dendrogram, ISP1
3.9 Dendrogram, ISP2
3.10 Medoids, ISP1 and ISP2
3.11 Half-hour medians for each cluster, ISP1 and ISP2
3.12 Scores on the two first sample principal components, marking differently users from different clusters, ISP1
3.13 Scores on the two first sample principal components, marking differently users from different clusters, ISP2
3.14 Downloaded traffic, clustered users
3.15 Transfer rate, clustered users
3.16 Activity duration, clustered users
3.17 Median of downloaded traffic, clustered users, ISP1
3.18 Median of downloaded traffic, clustered users, ISP2
3.19 Median of activity duration, clustered users, ISP1
3.20 Median of activity duration, clustered users, ISP2
3.21 Median of transfer rate, clustered users, ISP1
3.22 Median of transfer rate, clustered users, ISP2
4.1 PDF (and complementary CDF in the inset) of the session length
4.2 PDF (and complementary CDF in the inset) of the client-to-server and server-to-client data sent in each session

4.3 PDF (and complementary CDF in the inset) of the number of TCP connections in each session
4.4 CDF of the mean user session inter-arrival and mean user flow inter-arrival times
4.5 Fit of user session inter-arrivals to a Weibull distribution: normal day in the top plot and during a worm attack in the bottom plot
4.6 Autocorrelation function of the inter-arrival process of sessions: normal day in the top plot and during a worm attack in the bottom plot
5.1 Evolution of a TCP flow and the RTT estimation process (a) and anomalous events classification flow chart (b)
5.2 CDFs of the ratio between the inverted packet gap T over the estimated RTT_min (outset) and of the ratio between the recovery time RT over the actual RTO estimation (inset)
5.3 Amount of incoming (bottom y-axis) and outgoing (top y-axis) anomalies at different timescales
5.4 Anomalies percentage normalized over the total traffic (top) and anomalies breakdown (bottom)
5.5 Hurst parameter estimate for different anomalies and traces
5.6 Time plot of anomalous events for the selected elephant flow
5.7 Energy plots for the time series of number of segments and number of out-of-orders within the selected elephant flow
6.1 Left: the SNMP data from MIBs shows that the overall daily traffic profile is significantly affected by the subgroup of users who pay a flat rate between 18:00-06:00. Right: division into subsamples according to TCP connection arrival time
6.2 Time series of Flow 1 (left) and its LPs (middle and right)
6.3 Time series of Flow 2 (left) and its LP (right)
6.4 Rate change for Flow 2 (left) and simultaneous TCP flows (right)
6.5 Time series of Flow 3 (left) and Flows 4 and 5 (right)
6.6 Energy function plot (left) and PDF estimates (right) of aggregate RTTs
6.7 Comparison of CCDFs (left) and maximum number of simultaneous TCP connections from the same mobile (right)
6.8 Maximum RTTs with no significant upstream traffic (left) and significant upstream traffic (right)
6.9 Robust estimates of conditional expectations of maximum RTTs for GPRS (left) and UMTS (right). The right plot also shows individual samples of maximum RTTs
7.1 On the left, the DWT as filter bank for a level-3 decomposition. On the right, the subbands of the normalized spectrum, with the approximation a_3 and details d
7.2 On the left, the temporal relationship between the coefficients of the DWT. On the right, the phase-corrected coefficients
7.3 Change point candidates at each scale. On the left, DWT-SIC at 8 levels and significance = . On the right, SWT-SIC at 8 levels and significance =

7.4 Bellcore trace, DWT-SIC at 7 levels, significance =
7.5 On the left, the evolution of the DWT-SIC algorithm when applied to the synthetic trace (granularity = 5000 samples, significance = 0.001, scales 1-6). On the right, the time-versus-time projection of the same figure
7.6 On the left, evolution of the DWT-SIC algorithm applied to the paug89 trace (granularity = samples, significance = 10^-6, scales 1-8). On the right, the scale-versus-time projection of the same figure
8.1 Various characteristics of the data trace
8.2 Constructing the kernel representation of the quality function
9.1 The averages of the MSEs from Tables 9.1, 9.2 over different T for fixed α = 0.1(0.1)0.7, β ∈ {0.3, 0.5, 0.7} and for a Gamma distribution (left) and an exponential distribution (right)
9.2 The averages of the absolute values of the BIASes in Tables 9.1, 9.2 over different T for fixed α = 0.1(0.1)0.7, β ∈ {0.3, 0.5, 0.7} and for a Gamma distribution (left) and an exponential distribution (right)
9.3 Estimation of the renewal function of a Weibull distribution: H(t, k, l) against t
9.4 Estimation of the renewal function of a Pareto distribution: H(t, k, l) against t
9.5 Estimation of the renewal function of an exponential distribution: H(t, k, l) against t

List of Tables

2.1 Description of the data
2.2 Estimation of the tail index for Web traffic characteristics
2.3 Comparison of the recommended methods for Web traffic data
2.4 Amount of operations for Frees and the histogram-type estimator with a fixed k and a plot-estimated k
3.1 Statistics of the aggregate of applications
3.2 Interpretation of ISP1 and ISP2 clusters
3.3 Cluster sizes for ISP1 and ISP2
3.4 Error rates for ISP1 and ISP2
3.5 Averages for the aggregate of applications, clustered users, ISP1 (left) and ISP2 (right)
3.6 Pairwise comparisons for the aggregate of applications, clustered users, α =
3.7 Pairwise comparisons for the various groups of applications, clustered users, α =
7.1 DWT-SIC results for the paug89 trace
9.1 Quality of estimate (9.4): Gamma (s = 2, λ = 1, IEτ = 2), sample size l =
9.2 Quality of estimate (9.4): Exp (λ = 1, IEτ = 1), sample size l =
9.3 The minimal values min MSE and min BIAS of the averages MSE and BIAS calculated from Tables 9.1, 9.2 and the corresponding α and β
9.4 Comparison of H_3n(t) for sample size 30 and H(t) for sample size 100 regarding averages of the MSE and the BIAS for different distributions
9.5 Part I: Lognormal (µ = 0, σ = 1, IEτ = exp(1/2))
9.6 Part II: Lognormal (µ = 0, σ = 1, IEτ = exp(1/2))
9.7 Part I: Weibull (s = 3, IEτ = 0.89)
9.8 Part II: Weibull (s = 3, IEτ = 0.89)

Abstract

This deliverable describes the methods, techniques and tools for IP traffic characterization, measurements and statistical methods that members of WP.JRA.5.1 "IP Traffic Characterization, Measurements and Statistical Methods" have developed during the second phase of the action EuroNGI. The research results selected for this report reflect the main streams of investigation on traffic measurements, statistical analysis and traffic characterization of IP traffic and new services in evolving wired and wireless IP networks.

1 Research Achievements

1.1 Research Objectives

The research activities of the NoE action EuroNGI, WP.JRA.5.1 "IP Traffic Characterization, Measurements and Statistical Methods", have been divided into three working areas:

- Traffic characterization of different service models and network environments
- Traffic analysis, estimation and generation techniques
- Architectures and tools for traffic monitoring and modeling

While the first two items represent active research areas of members of the work package, the third item reflects the integration aspects of the activity. The deliverable presents selected research results that are related to these research objectives. They are structured according to the tasks identified in the specification "Traffic modeling for new services" of the work package, which represents the outcome of the preparation phase of WP.JRA.5.1.

1.2 Outline of the Document

The scientific investigation of WP.JRA.5.1 has been based on the achievements of the contributing parties during the second phase of the action EuroNGI. It is organized according to the main topics in the stated working areas and addresses the issues sketched below.

Traffic characterization of different service models and network environments

Internet and Web traffic modeling

Regarding the analysis of Internet and Web traffic, the contributions have been classified into the following subcategories:

Statistical inspection and analysis techniques

In chapter 2 Markovich and Krieger discuss the statistical methodology to analyze measurements arising from Internet traffic. Since the underlying traffic characteristics are often distributed with heavy tails, specific analysis and estimation techniques are required. The contribution considers fast and simple methods for the detection of these heavy-tailed characteristics in traffic data. Moreover, a traffic characterization by recurrent marked point processes is discussed. Finally, the recommended statistical tools are applied to real Web data.
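A central tool of the kind surveyed in chapter 2 is Hill's estimator of the extreme-value index. As a rough, self-contained illustration (the synthetic sample and the parameter choices are illustrative, not taken from the deliverable's data), a minimal Python sketch:

```python
import math
import random

random.seed(7)

def hill(sample, k):
    """Hill estimate of the extreme-value index gamma = 1/alpha,
    based on the k largest order statistics of the sample."""
    xs = sorted(sample)
    n = len(xs)
    log_xnk = math.log(xs[n - k - 1])          # log X_(n-k)
    return sum(math.log(xs[n - i]) - log_xnk for i in range(1, k + 1)) / k

# Synthetic Pareto sample with tail index alpha = 2, i.e. gamma = 0.5.
pareto = [random.paretovariate(2.0) for _ in range(20000)]
gamma_hat = hill(pareto, k=500)
print(gamma_hat)   # roughly 0.5
```

In practice the estimate is plotted against k and read off from a region where the plot is stable, as discussed in chapter 2.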

Modeling of user behavior

Internet access traffic follows hourly patterns that depend on various factors, such as the periods users stay on-line at the access point (e.g. at home or in the office) or their preferences for applications. The clustering of Internet users may provide important information for traffic engineering and tariffing. For example, it can be used to set up service differentiation according to hourly behavior, resource optimization based on multi-hour routing, and the definition of tariffs that promote Internet access in less busy hours. In chapter 3 de Oliveira et al. study the user behavior in the Portuguese Internet. The authors identify patterns of similar behavior by grouping Internet users of two distinct Portuguese ISPs, one using a CATV access network and the other an ADSL one, offering distinct traffic contracts. The grouping of the users is based on their traffic utilization measured every half-hour (over one day). Cluster analysis is used to identify the relevant Internet usage profiles, with partitioning around medoids and Ward's method being the preferred clustering methods. For the two data sets, these methods lead to 3 clusters with different hourly traffic utilization profiles. The cluster structure is validated through discriminant analysis. After the identification of the clusters, the types of applications used as well as the amount of downloaded traffic, the activity duration and the transfer rate are analyzed for each cluster, yielding coherent outcomes.

Client-server traffic modeling

In chapter 4 A. Bianco et al. have studied the identification of Web user-sessions. The definition is derived from an aggregation of several TCP connections generated by the same source host on the basis of the TCP connection opening time, which creates non-trivial identification issues.
Applying clustering techniques, a novel methodology to identify Web user-sessions without an a priori definition of threshold values has been developed. Moreover, the characteristics of user sessions were extracted from real traces and the statistical properties of the identified sessions have been studied. It is shown that Web user-sessions tend to be Poissonian, but correlation may arise during periods of anomalous network/host functioning.

Traffic modeling of elastic transport services

In chapter 5 Mellia et al. have studied elastic bearer services in the Internet. Focusing on passive measurements of TCP traffic, a heuristic technique is proposed to classify anomalies that may occur during the lifetime of a TCP flow, such as out-of-sequence and duplicate segments. The possibility to carefully distinguish the causes of anomalies in TCP traffic is very appealing and may be instrumental to a deeper understanding of TCP behavior in real environments and to protocol engineering. In the study the proposed heuristic is applied to traffic traces collected both at network edges and at backbone links. Studying the statistical properties of TCP anomalies, it is shown that their aggregate exhibits long-range dependence phenomena, but that anomalies suffered by single long-lived flows are, on the contrary, uncorrelated. Interestingly, no dependence on the actual link load is observed.

Traffic modeling and analysis of mobile users in wireless network environments

In chapter 6 Kilpi et al. analyze data from a Finnish GPRS/UMTS network operator arising from a passive traffic measurement of a 30-hour IP traffic trace. The study focuses on the variability of round trip times (RTTs) of TCP flows in the network. The RTTs are analyzed at the micro- and macroscopic level. The microscopic level involves detailed analysis of individual flows. Various issues affecting the RTT process of a flow are identified, such as changes in the rate of the radio channel and the impact of simultaneous TCP flows at the mobile device. Lomb periodograms are used to detect periodic behavior in the applications. In the macroscopic analysis dominating RTT values are identified first by energy function plots and then by exploring the spikes of empirical PDFs of aggregated RTTs. The comparison of empirical CDFs shows a significant shift in the quantiles after a considerable traffic increase, which is caused by bandwidth sharing at individual mobile devices. Furthermore, the impact of bandwidth sharing at the mobile device has been investigated. It is shown how an increased number of simultaneous TCP flows seriously affects the RTTs observed by a given flow, both in GPRS and in UMTS.

Traffic analysis, estimation and generation techniques

Reconstruction of LRD and self-similar traffic characteristics by wavelet techniques

In chapter 7 Rincón and Sallent identify and reconstruct fractal traffic characteristics, such as self-similarity and long-range dependence. There are several estimators of the fractal parameters, and those based on the discrete wavelet transform (DWT) are the best in terms of efficiency and accuracy. However, the DWT estimator does not consider the possibility of changes of the fractal parameters over time.
The study proposes the use of the Schwarz information criterion (SIC) to detect changes in the variance structure of the wavelet decomposition and then a segmentation of the trace into pieces with homogeneous characteristics of the Hurst parameter. The procedure can be extended to the stationary wavelet transform (SWT), a non-orthogonal transform that provides higher accuracy in the estimation of the change points. An advantage is that the SIC analysis can be performed progressively. The DWT-SIC and SWT-SIC algorithms were tested against synthetic and well-known real traffic traces, illustrating the feasibility of the proposed approach.

Statistical estimation and prediction of traffic characteristics

In chapter 8 Głomb and Grochla have modelled single-link Internet traffic. Considering adaptive real-time sources like video streaming, the transmission system should detect network events (i.e. congestion, increase or decrease of available bandwidth) and adapt the sources to provide the maximal possible service quality. An efficient realization of this service requires traffic prediction methods. One formulation of the prediction problem states that, given time, connection and packet size, it is required to approximate the distribution of transmission times. With this data, the accuracy of bandwidth utilization can be increased. In the contribution a kernel PCA method is used to represent the data of a single transmission

path. Data collection and analysis, and parameter tuning are studied.

Mathematical and statistical tools

Estimation of renewal models

In chapter 9 Markovich and Krieger consider statistical tools to estimate the renewal function (rf) H(t). The latter arises from a renewal point process and can be applied to model the incoming request streams of Web sessions or denial-of-service attacks, as shown by other studies. The renewal function is estimated using a limited number of independent observations of interarrival times for an unknown interarrival-time distribution (itd). The proposed nonparametric estimate is derived from the representation of the rf as a series of distribution functions (dfs) of consecutive arrival times, using a finite summation and approximations of the latter by empirical dfs. Due to the limited number of observed interarrival times the estimate is accurate only on bounded time intervals [0, t]. An important aspect is the selection of an optimal number of terms k of the finite sum. Here two methods are proposed: (1) an a priori choice of k as a function of the sample size l, which provides almost surely (a.s.) the uniform convergence of the estimate to the rf for light- and heavy-tailed itds if the time interval is not too large, and (2) a data-dependent selection of k by a bootstrap method. A Monte Carlo study is performed to evaluate both the efficiency of the estimate and the selection methods of k. In conclusion, a reliable, computationally tractable histogram-type estimate of the renewal function is proposed that is also applicable to heavy-tailed interarrival-time distributions. It has many applications in traffic engineering of the Internet and the design of Web systems.

Comparison of stochastic models

Finally, Ferreira and Pacheco develop mathematical tools to compare different teletraffic models in chapter 10.
Continuous-time Markov chains (CTMCs) and semi-Markov processes (SMPs) are considered, which often arise as basic mathematical objects in traffic characterization and estimation problems. Sufficient conditions for the level-crossing ordering of CTMCs and SMPs are derived. The former constitute a relaxation of Kirstein's conditions for stochastic ordering of CTMCs in the usual sense, whereas the latter are an extension of the conditions established by Di Crescenzo and Ricciardi and by Irle for skip-free-to-the-right SMPs.
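The histogram-type renewal-function estimate summarized above for chapter 9 can be illustrated in a few lines. The following Python fragment is a simplified sketch of the underlying idea (a truncated series H(t) ≈ Σ_{j=1..k} P(S_j ≤ t) with each df replaced by an empirical df built from disjoint block sums of j consecutive interarrival times), not the exact estimator or k-selection procedure of chapter 9; the sample and the choice k = 12 are illustrative.

```python
import random

random.seed(1)

def renewal_estimate(taus, t, k):
    """Histogram-type estimate of the renewal function
    H(t) = sum_{j>=1} P(S_j <= t), with S_j = tau_1 + ... + tau_j,
    truncated at k terms; each df is approximated by the empirical df
    of sums over disjoint blocks of j consecutive interarrival times."""
    h = 0.0
    for j in range(1, k + 1):
        block_sums = [sum(taus[i:i + j])
                      for i in range(0, len(taus) - j + 1, j)]
        h += sum(1 for s in block_sums if s <= t) / len(block_sums)
    return h

# Exponential(1) interarrivals: the renewal process is Poisson and H(t) = t.
taus = [random.expovariate(1.0) for _ in range(5000)]
h3 = renewal_estimate(taus, t=3.0, k=12)
print(h3)   # close to 3
```

The truncation level k controls the largest t for which the estimate is reliable, which is exactly the selection problem (a priori rule versus bootstrap) studied in chapter 9.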

2 Statistical Inspection and Analysis Techniques for Traffic Data Arising from the Internet

Natalia M. Markovich
Institute of Control Sciences, Russian Academy of Sciences, Profsoyuznaya 65, Moscow, Russia

Udo R. Krieger
Otto-Friedrich-Universität, Fakultät WIAI, Feldkirchenstr. 21, D Bamberg, Germany

Abstract

The statistical characterization of measurements arising from Internet traffic requires specific analysis and estimation techniques since the underlying traffic characteristics are often distributed with heavy tails. Here we consider fast and simple methods for the detection of these heavy-tailed characteristics in traffic data and discuss a traffic characterization by recurrent marked point processes. Furthermore, the recommended tools are applied to real Web data.

2.1 Introduction

Currently, traffic measurements of high-speed packet-switched networks that are taken with high resolution at different time scales constitute an important topic of data mining and teletraffic engineering in the Internet. Associated issues such as the statistical data analysis of high-resolution measurements, the on-line estimation of the underlying random characteristics of the developed Web and Internet traffic models, and network management actions based on on-line traffic analysis are studied intensively, too. Based on these methods, new traffic engineering techniques such as bandwidth allocation and control techniques will be derived in the near future. Considering the transported traffic in IP networks, besides the growing peer-to-peer traffic the majority of the flows are currently generated by classical client-server Web applications (cf. [5]). From the statistical perspective, many observed Web traffic characteristics are independent and are often distributed with heavy tails (e.g., [1], [10], [20]).
The specific features of heavy-tailed distributions are the following:
(1) tails of distributions tend to zero at infinity more slowly than an exponential tail;
(2) the possible non-existence of some moments;

(3) sparse data in the tail domain of the distribution.

These features require specific statistical methods for the investigation of the data. For instance, the histogram is a good estimate of the underlying probability density function (PDF). However, in the case of distributions with heavy tails it gives a misleading estimate in the tail domain or oversmoothes the main part of the density. The same is true for kernel estimators (cf. [19]). Usually, quantiles can be estimated by means of an empirical distribution function (DF). However, high quantiles like 99%, 99.9% cannot be calculated in this way, since the empirical DF is equal to 1 outside the range of the empirical sample. In [11] one can find estimates of heavy-tailed PDFs constructed by a preliminary transformation of the data to a new sample. Another way to reconstruct such PDFs is to use kernel estimates with variable bandwidth kernels (cf. [19]). In [13] a new estimate for high quantiles is proposed. It is a key issue whether such specific estimation procedures have to be applied to reconstruct the traffic characteristics. To resolve it, nonparametric test procedures (cf. [9]) or a set of rough statistical methods for heavy-tailed features can be applied (cf. [8]). Considering the analysis of heavy-tailed data, the maximum of the sample is applied as a model of the DF behavior, and an important issue is related to the use of its asymptotic distribution. Such models always include a parameter called the tail index. The tail index α ∈ IR (or the extreme-value index (EVI) γ = 1/α) determines the shape of the tail. In this paper, we consider several simple procedures that may help us to detect heavy tails and the dependence structure of a gathered Internet data set. We illustrate by real data how these methods are applied to analyze traffic measurements.
Moreover, we describe the simplest modeling of Internet traffic at the burst or packet level by a recurrent marked point process of inter-arrival times and transferred packet or burst sizes. Furthermore, a nonparametric estimation technique of the renewal function proposed in [14] is described. The paper is organized as follows. In Section 2 we sketch simple inspection techniques for heavy-tailed traffic characteristics. In Section 3 we illustrate their application by some real traffic data. In Section 4 the dependence structure of the Web data is given. In Section 5 we describe some simple analysis methods for Internet traffic that is modelled at very coarse time scales by a random marked recurrent process. Finally, we present some conclusions.

2.2 Rough tests and estimation of heavy-tailed features

Here we consider several methods to check whether measurements X^(n) = {X_1, X_2, ..., X_n} are derived from a heavy-tailed DF F(x) = IP{X ≤ x} or not. We may also give rough estimates of the number of finite moments of the DF F.

1. Ratio of the maximum to the sum

Let X_1, X_2, ..., X_n be i.i.d. r.v.s. We define the statistic (cf. [8, p. 308])

    R_n(p) = M_n(p) / S_n(p),   n ≥ 1, p > 0,

where

    S_n(p) = |X_1|^p + ... + |X_n|^p,   M_n(p) = max(|X_1|^p, ..., |X_n|^p),   n ≥ 1,

to check the moment conditions of the data. Then the following equivalent assertions can be exploited:

    R_n(p) → 0 a.s.            iff  IE|X|^p < ∞,
    R_n(p) → 0 in probability  iff  IE[|X|^p 1I{|X| ≤ x}] ∈ R_0,
    R_n(p) → 1 in probability  iff  IP{|X| > x} ∈ R_0,

where 1I{A} denotes the indicator of an event A. The class R_α of distributions with regularly varying tails and tail index α = 1/γ, γ > 0, is defined by

    R_α = {F : 1 − F(x) = x^(−α) l(x), x > 0},

where l(x) is a slowly varying function, i.e. it satisfies lim_{x→∞} l(tx)/l(x) = 1 for all t > 0 (cf. [8]). For different values of p the plot of n → R_n(p) gives preliminary information about the distribution IP{|X| > x}: IE|X|^p < ∞ follows if R_n(p) is small for large n, whereas for large n a significant difference between R_n(p) and zero indicates that the moment IE|X|^p is infinite.

2. QQ-plot

The idea of a QQ-plot (quantiles-against-quantiles plot) is to draw the dependence

    {(X_(k), F^←((n − k + 1)/(n + 1))) : k = 1, ..., n},

where X_(1), ..., X_(n) are the order statistics of the sample and F^← is the inverse function of the DF F. Normally, the QQ-plot is built as a dependence of exponential quantiles against the order statistics of the underlying sample; then F^← is the inverse of the exponential DF, and a linear QQ-plot corresponds to the exponential distribution. Generally, one can investigate the quantiles of any other distribution instead of an exponential one to get a linear QQ-plot. An alternative QQ-plot is based on the following fact. It is well known that for an i.i.d. sample with a continuous DF F the r.v.s

    U_i = F(X_i), i = 1, ..., n,    (2.1)

are independent and uniformly distributed on [0, 1]. The distribution function of the r.v. U_i is linear if the data X_1, ..., X_n belong to the distribution F. Then one can investigate the empirical distribution function F_n(x) of the U_i, taking different models of F in (2.1).

3. Plot of the mean excess function

This simple test allows us to study the tail behavior in more detail and to detect visually the type of the tail, i.e. whether it is light or heavy. Let X be a r.v. with right endpoint x_F = sup{x ∈ IR : F(x) < 1}. Then

    e(u) = IE(X − u | X > u), 0 ≤ u < x_F,

is the mean excess function of the r.v. X over the threshold u. The empirical mean excess function is defined by

    e_n(u) = Σ_{i=1}^n (X_i − u) 1I{X_i > u} / Σ_{i=1}^n 1I{X_i > u}.
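The ratio-of-maximum-to-sum statistic and the empirical mean excess function are straightforward to compute. A minimal Python sketch on synthetic light- and heavy-tailed samples (the distributions and parameter choices are illustrative, not the Web data analyzed later in this chapter):

```python
import random

random.seed(42)

def ratio_max_sum(xs, p):
    """R_n(p) = M_n(p)/S_n(p): ratio of the largest |X_i|^p to their sum."""
    powered = [abs(x) ** p for x in xs]
    return max(powered) / sum(powered)

def mean_excess(xs, u):
    """Empirical mean excess e_n(u): average of X_i - u over the X_i > u."""
    excesses = [x - u for x in xs if x > u]
    return sum(excesses) / len(excesses)

n = 20000
light = [random.expovariate(1.0) for _ in range(n)]    # exponential: all moments finite
heavy = [random.paretovariate(1.5) for _ in range(n)]  # Pareto, alpha = 1.5: E X^2 = inf

# R_n(2) is close to 0 when the second moment exists (exponential),
# but stays away from 0 when it does not (Pareto with alpha < 2).
print(ratio_max_sum(light, 2), ratio_max_sum(heavy, 2))

# For the exponential(1) sample e_n(u) is roughly constant (about 1/lambda = 1);
# for the heavy-tailed sample it grows with the threshold u.
print(mean_excess(light, 1.0), mean_excess(light, 2.0))
print(mean_excess(heavy, 2.0), mean_excess(heavy, 5.0))
```

In practice one plots R_n(p) against n and e_n(u) against u, as in Figures 2.1-2.8, rather than reading off single values.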

For heavy-tailed distributions the function e(u) tends to infinity; a linear plot u → e(u) corresponds to a Pareto distribution; the constant 1/λ corresponds to an exponential distribution; and e(u) tends to 0 for light-tailed distributions. For different samples the curves e_n(u) may differ strongly towards the higher values of u, since only sparse observations exceed the threshold u for large u. This makes the precise interpretation of e_n(u) difficult.

4. Hill-plot of the tail index

The Hill estimate [8]

    γ̂_{n,k} = (1/k) Σ_{i=1}^{k} (log X_(n−i+1) − log X_(n−k))    (2.2)

is used to estimate the positive extreme-value index γ of a heavy-tailed r.v. X. Here k is the integer parameter of this estimate, and X_(1) ≤ ... ≤ X_(n) denotes the order statistics of the sample {X_1, X_2, ..., X_n} of not necessarily independent r.v.s. The Hill estimate may be considered as the empirical mean excess function of the r.v. ln X at the level u = ln X_(n−k). The Hill estimate (2.2) performs poorly if the underlying DF does not have a regularly varying tail, i.e. F ∈ R_α does not hold or α is not positive, if the sample size is not large enough, or if the tail is not heavy enough, i.e. γ is not big. Moreover, the estimation by Hill's formula (2.2) strongly depends on the type of the slowly varying function if F ∈ R_α, which is usually unknown. These disadvantages of Hill's estimate show that one has to apply several estimates of the tail index in a thorough analysis of the data (see, for instance, [8] for a review).

5. The group estimator of the tail index

Here we present the group estimator [4], which seems to be the only known estimator of the tail index having the recursive property; this makes it attractive for on-line calculations [15]. According to this estimator, the sample of i.i.d. r.v.s X_1, ..., X_n is divided into l groups V_1, ..., V_l, each group containing m random variables, i.e. n = l · m. In practice m is chosen and then l = [n/m] is set, where [a] denotes the integer part of a number a > 0. Let

    M^(1)_{li} = max{X_j : X_j ∈ V_i}

and let M^(2)_{li} denote the second largest element in the same group V_i. We set

    k_{li} = M^(2)_{li} / M^(1)_{li},   z_l = (1/l) Σ_{i=1}^{l} k_{li}.    (2.3)

Then the statistic γ_l = 1/z_l − 1 is suggested as the estimate of the extreme-value index γ. The asymptotic convergence of the statistic γ_l to the EVI γ is proved for the class of distributions R_α. Hence, the group estimator performs worse, for instance, for a Weibull distribution (cf. [15]). Both Hill's and the group estimator are very sensitive to the choice of the smoothing parameters: the number k of largest order statistics for Hill's estimator (2.2) and the number m of observations in each
In practice m is chosen and then l = [n/m] is set, where [a] denotes the integer part of a number a > 0. Let

M^{(1)}_{li} = \max\{X_j : X_j \in V_i\}

and let M^{(2)}_{li} denote the second largest element in the same group V_i. We set

k_{li} = M^{(2)}_{li} / M^{(1)}_{li}, \quad z_l = \frac{1}{l}\sum_{i=1}^{l} k_{li}. \qquad (2.3)

Then the statistic γ_l = 1/z_l - 1 is suggested as the estimate of the extreme-value index γ. The asymptotic convergence of the statistic γ_l to the EVI γ is proved for the class of distributions R_\alpha. Hence, the group estimator performs worse, for instance, for a Weibull distribution (cf. [15]). Both Hill's and the group estimator are very sensitive to the choice of their smoothing parameters. This is the number of largest order statistics k for Hill's estimator (2.2) and the number of observations m in each

group for the group estimator (2.3). The use of plots in which the estimator is presented against the smoothing parameter is the easiest way to select these parameters. More exactly, one can plot

\{(m, z_m),\ m_0 < m < M_0\}, \quad m_0 > 2, \quad M_0 < n/2, \quad \text{where } z_m = \frac{m}{n}\sum_{i=1}^{[n/m]} k_{[n/m]\,i}

(or similarly a Hill plot \{(k, \hat\gamma_{n,k}),\ 1 \le k \le n-1\}) and then choose the estimate z_m or \hat\gamma_{n,k} from an interval in which these functions demonstrate stability. The background of this approach is provided by the consistency of the statistics z_m and \hat\gamma_{n,k}, namely, z_m \to 1/(1+\gamma) as n \to \infty, m \to \infty, m < n (cf. [4]), and \hat\gamma_{n,k} \to \gamma with probability 1 as n \to \infty, k \to \infty, k/n \to 0 (cf. [8]). We suggest taking the average value of the estimate within the interval of stability as the estimate of the unknown value of γ (or of 1/(1+γ) in the case of the group estimator). The selected value of the smoothing parameter corresponds to this average.

The tail-index estimates are applied, in particular, to investigate the number of finite moments of the r.v. It is well known that the pth moment exists, i.e. \mathbb{E}|X_1|^p < \infty holds, if the tail index α = 1/γ satisfies 0 < p < α and the distribution has a regularly varying tail, i.e. belongs to the class R_\alpha (cf. [8, p. 330]). A positive tail index indicates the presence of a heavy tail. These sketched simple tests can be successfully applied to analyze visually the features of a data attribute arising from Internet traffic measurements.

2.3 An illustrative analysis of Web traffic characteristics

To illustrate the efficiency of the sketched methodology, real Web traffic data have been gathered in the Ethernet segment of the Department of Computer Science at the University of Würzburg and have been analyzed in a preliminary way (cf.
[10], [13]).

2.3.1 Description of the Web data

The data are described by a conventional hierarchical Web traffic model distinguishing a session and sub-session level as well as a page and an IP-packet level, which is not considered in detail here (cf. [10], [13]). Consequently, the data are described at the related coarse time scales by two basic characteristics and four related r.v.s: the characteristics of sub-sessions, i.e. the size of a sub-session (s.s.s.) in bytes and its duration (d.s.s.) in seconds, and the characteristics of the transferred Web pages, i.e. the size of the response (s.r.) in bytes and the inter-response time (i.r.t.) in seconds. The analyzed data of page sizes contain the information about 7480 Web pages which have been downloaded during 14 days by several TCP/IP connections. The size of a response is defined as the sum of the sizes of all IP packets which are downloaded from a Web server to the client upon a request. To perform the analysis, we have used samples with reduced sample sizes, observed in a shorter period within these two weeks. The description of all these r.v.s is presented in Table 2.1. For simplicity of the calculations, the data were scaled, i.e. divided by a scale parameter s. The value of s is indicated in Table 2.1.

2.3.2 Results of the Web traffic analysis

Considering the data sets d.s.s., s.s.s., i.r.t. and s.r., Figures 2.1–2.4 show the dependencies n ↦ \ln R_n(p) for various p. In all cases, the values of R_n(p) are dramatically large for large n and p ≥ 2; e.g., in the case of

Table 2.1 Description of the data

r.v.          | Sample size | Minimum | Maximum | Mean | StDev | s
s.s.s. (b)    |             |         |         |      |       |
d.s.s. (sec)  |             |         |         |      |       |
s.r. (b)      |             |         |         |      |       |
i.r.t. (sec)  |             |         |         |      |       |

the duration of sub-sessions and n = 350, \ln R_n(p) \approx 10 for p = 2 and \ln R_n(p) \approx 10^3 for p = 3. Hence, one may conclude that all moments of the considered r.v.s apart from the first one are not finite.

Figure 2.2 n ↦ \ln R_n(p) of the size of sub-sessions for a variety of p-values.

Figure 2.4 n ↦ \ln R_n(p) of the size of responses for a variety of p-values.

Table 2.2 Estimation of the tail index for Web traffic characteristics: for each of s.s.s., s.r., i.r.t. and d.s.s., the group estimate γ_l^p with plot-selected m and the Hill estimate \hat\gamma_{n,k} with plot-selected k.

Figure 2.6 Exceedance e_n(u) against the threshold u for the size of sub-sessions.

Figure 2.8 Exceedance e_n(u) against the threshold u for the size of responses.

Table 2.3 Comparison of the recommended methods for Web traffic data

              | Amount of finite moments        | Type of distribution
r.v.          | R_n(p) | Hill & Group estimator | QQ-plot                     | e_n(u)
s.s.s. (b)    | 1      | 1                      | GPD(1;0.015), GPD(0.05;0.3) | Pareto-like
d.s.s. (sec)  | 1      | 1                      | GPD(0.3;1), lognormal       | Pareto-like
s.r. (b)      | 1      | 1                      | GPD(1;0.015)                | Pareto-like
i.r.t. (sec)  | 1      | 2                      | GPD(0.8;0.015)              | Pareto-like

Furthermore, the plots u ↦ e_n(u) tend to infinity for large u, implying heavy tails. These plots are close to a linear shape for all sets of data (Fig. 2.5–2.8). The latter implies that the considered distributions can be modelled by a DF of Pareto type. Two versions of QQ-plots of d.s.s., s.s.s., i.r.t. and s.r. are shown in Figures 2.9–2.12. Obviously, the QQ-plots of d.s.s. corresponding to the lognormal and Pareto distributions are the closest to a linear shape. Regarding s.s.s., i.r.t. and s.r., such a QQ-plot corresponds to a DF of Pareto type, respectively (see the right plots). The left top plots show that the exponential distribution cannot be accepted as an appropriate model for these

Figure 2.9 QQ-plots for the duration of sub-sessions. Left: exponential quantiles (top) and quantiles of the GPD(0.3;1) (bottom) against d.s.s./s (the linear curves correspond to plotting the quantiles of a distribution against its own quantiles). Right: empirical DFs of the r.v. U_i = F(X_i). Exponential, Pareto, Weibull, lognormal and normal distributions are used as models of F. The linear curve corresponds to the empirical DF plotted against the quantiles of the exponential DF F.

r.v.s. The left bottom plots show that the distributions of the samples d.s.s., s.s.s., i.r.t. and s.r. are close to a Generalized Pareto distribution (GPD) with the DF

\Psi_{\sigma,\gamma}(x) = \begin{cases} 1 - (1 + \gamma x/\sigma)^{-1/\gamma}, & \gamma \neq 0, \\ 1 - \exp(-x/\sigma), & \gamma = 0, \end{cases}

with different values of the parameters γ and σ; see also Table 2.3. In Fig. 2.10 it is shown, as an example, that both the GPD(1;0.015) and the GPD(0.05;0.3) could be appropriate models of s.s.s. This implies that the QQ-plot does not give a unique model to fit the underlying distribution. Table 2.2 shows the estimation of the EVI for the data samples s.s.s., d.s.s., i.r.t. and s.r. by means of the group estimator γ_l with the selection of m by the plot; the latter is denoted by γ_l^p. To calculate γ_l^p the average is performed over m \in [10, 35], m \in [10, 40], m \in [100, 200] and m \in [10, 70] in the cases of s.s.s., d.s.s., i.r.t. and s.r., respectively. Hill's estimate \hat\gamma_{n,k} with the plot-selected k is also presented. Figures 2.13 and 2.14 illustrate the estimation of the EVI by Hill's estimator (2.2) and the group estimator γ_l for the data sets s.s.s., d.s.s., i.r.t. and s.r., respectively. The straight lines correspond to γ_l^p and the recommended value (the dotted line) for Hill's estimate. One observes that the values of γ recommended by both estimates are similar. In the case of d.s.s.
the difference of the values (0.739 and 0.5) arises as a result of the selection of k in Hill's estimate corresponding to one of the stability intervals of the Hill plot, see Fig. 2.13 (right). Indeed, one may select another stability interval corresponding to a value closer to 0.739. The latter example demonstrates the obvious disadvantage of the Hill-plot approach, namely, the selection of k and the necessity to use different tail-index estimates. Observing the estimates of γ one may conclude that the estimates of the tail index α = 1/γ are always less

Figure 2.10 QQ-plots for the size of sub-sessions. Left: exponential quantiles (top) and quantiles of the GPD(1;0.015) (bottom, first plot) and the GPD(0.05;0.3) (bottom, second plot) against s.s.s./s (the linear curves correspond to plotting the quantiles of a distribution against its own quantiles). Right: empirical DFs of the r.v. U_i = F(X_i). Exponential, Pareto, Weibull, lognormal and normal distributions are used as the models of F. The linear curve corresponds to the empirical DF plotted against the quantiles of the exponential DF F.

than 2 for all considered data sets, apart from i.r.t. with 2 < α < 3. It follows from extreme value theory [8] that at least the βth moments, β ≥ 2, of the distributions of s.s.s., s.r. and d.s.s. are not finite. The distribution of i.r.t. has two finite moments. The distributions of the considered Web traffic characteristics are thus heavy-tailed. The tail of the s.s.s. distribution is the heaviest one since its γ is the largest. Table 2.3 summarizes the results of this preliminary analysis of the samples with the sketched set of exploratory methods.
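The visual QQ-plot comparison against the GPD Ψ_{σ,γ} can also be automated by correlating empirical quantiles with model quantiles; a correlation near one corresponds to the nearly linear plots reported above. A Python sketch follows; the helper names and the plotting positions (i - 0.5)/n are a common convention of ours, not taken from this study.

```python
import numpy as np

def gpd_quantile(p, sigma, gamma):
    """Quantile function of the GPD Psi_{sigma,gamma}."""
    p = np.asarray(p, dtype=float)
    if gamma == 0.0:
        return -sigma * np.log1p(-p)                    # exponential case
    return sigma * ((1.0 - p) ** -gamma - 1.0) / gamma

def qq_correlation(sample, sigma, gamma):
    """Correlation between sorted sample values and GPD model quantiles;
    values near 1 indicate an approximately linear QQ-plot."""
    xs = np.sort(np.asarray(sample, dtype=float))
    n = xs.size
    probs = (np.arange(1, n + 1) - 0.5) / n             # plotting positions
    return float(np.corrcoef(xs, gpd_quantile(probs, sigma, gamma))[0, 1])
```

Since `gpd_quantile` applied to uniform variates is exactly the inverse-transform sampler for the GPD, the correlation check can be validated on synthetic data before being applied to measured samples.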

2.4 Dependence structure and sample autocorrelation functions

The autocorrelation function (ACF) is considered an important indicator of the dependence structure of a time series. The standard sample autocorrelation function at lag h \in \mathbb{Z} is determined by

\rho_{n,X}(h) = \frac{\sum_{t=1}^{n-h} (X_t - \bar X_n)(X_{t+h} - \bar X_n)}{\sum_{t=1}^{n} (X_t - \bar X_n)^2}. \qquad (2.4)

Here \bar X_n = \frac{1}{n}\sum_{t=1}^{n} X_t represents the sample mean. The relevance of this estimate is determined by its rate of convergence to the real ACF. When the distribution of the X_t is very heavy-tailed (in the sense that \mathbb{E} X_t^4 = \infty), this rate can be extremely slow. For heavy-tailed data with infinite variance it is better to use the following slightly modified estimate without the usual centering by the sample mean:

\tilde\rho_{n,X}(h) = \frac{\sum_{t=1}^{n-h} X_t X_{t+h}}{\sum_{t=1}^{n} X_t^2}. \qquad (2.5)

However, this estimate may behave in a very unpredictable way for non-linear processes, in the sense that this sample ACF may converge in distribution to a non-degenerate r.v. depending on h; for linear models it converges in distribution to a constant depending on h (cf. [3]). The previous analysis (see Table 2.3) shows that the considered Web data are heavy-tailed with infinite variance. Therefore, the application of formula (2.5) is relevant.
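Both ACF estimates, (2.4) and (2.5), are one-liners in practice; a minimal Python sketch (the function names are ours):

```python
import numpy as np

def sample_acf(x, h):
    """Standard sample ACF (2.4), centered by the sample mean."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean()
    return float(np.dot(xc[: x.size - h], xc[h:]) / np.dot(xc, xc))

def uncentered_acf(x, h):
    """Modified sample ACF (2.5) without centering, preferable for
    heavy-tailed data with infinite variance."""
    x = np.asarray(x, dtype=float)
    return float(np.dot(x[: x.size - h], x[h:]) / np.dot(x, x))
```

For an i.i.d. light-tailed sample, both estimates at lag 1 stay inside the ±1.96/√n Gaussian confidence bounds, while, e.g., an AR(1) series with coefficient 0.5 yields a lag-1 value near 0.5.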
For a comparison we present both estimates of the ACF for our Web data in Figures 2.15 and 2.16. Evidently, the Web data exhibit a long-range dependence structure. For an extended survey on the relation between the tail behavior and the dependence structure we refer

Figure 2.12 QQ-plots for the size of responses. Left: exponential quantiles (top) and quantiles of the GPD(1;0.015) (bottom) against s.r./s (the linear curves correspond to plotting the quantiles of a distribution against its own quantiles). Right: empirical DFs of the r.v. U_i = F(X_i). Exponential, Pareto, Weibull, lognormal and normal distributions are used as the models of F. The linear curve corresponds to the empirical DF plotted against the quantiles of the exponential DF F.

Figure 2.13 The EVI estimation by Hill's estimator and the group estimator γ_l for the data sets s.s.s. (left) and d.s.s. (right).

to both [16] and [18].

2.5 Traffic modeling by recurrent marked point processes

Considering measurements at the burst and/or packet level, the generated Internet traffic can be characterized by a marked point process N = \{(\tau_i, Z_i),\ i = 1, 2, \dots\} of the inter-arrival times \tau_i of bursts (or, in a more

Figure 2.14 The EVI estimation by Hill's estimator and the group estimator γ_l for the data sets i.r.t. (left) and s.r. (right).

Figure 2.15 ACF estimation by the standard sample ACF (2.4) (first and third plots) and the modified sample ACF (2.5) (second and fourth plots) for the data sets s.s.s. and d.s.s. The dotted horizontal lines indicate 95% asymptotic confidence bounds (±1.96/√n) corresponding to the ACF of i.i.d. Gaussian r.v.s.

general setting, of IP packets) of sizes Z_i. Then t_n = \sum_{i=1}^{n} \tau_i, t_0 = 0, is the n-th arrival epoch based on the sampled inter-arrival times \{\tau_1, \dots, \tau_l\} with an assumed absolutely continuous DF F. The overall volume of IP packets or Web traffic in an observation period [0, t] is determined by

V(t) = \sum_{i: t_i \le t} Z_i = \sum_{i=1}^{N_t} Z_i,

where N_t = \max\{n : t_n \le t\} is the number of burst (or IP packet) arrivals in [0, t]. Assuming that Z_1, Z_2, \dots are i.i.d. r.v.s with \mathbb{E} Z_i < \infty and \mathbb{E} N_t < \infty, we obtain by Wald's equation that for all t > 0

\mathbb{E}(V(t)) = \mathbb{E}\Big(\sum_{i=1}^{N_t} Z_i\Big) = \mathbb{E}(Z_i)\,\mathbb{E}(N_t) = \mathbb{E}(Z) H(t)

holds (cf. [20]). Here H(t) = \mathbb{E}(N_t) denotes the renewal function of the corresponding non-marked arrival process \{\tau_i,\ i = 1, 2, \dots\} of transferred files and pages, respectively. For all fixed t > 0 the variance of the

Figure 2.16 ACF estimation by the standard sample ACF (2.4) (first and third plots) and the modified sample ACF (2.5) (second and fourth plots) for the data sets i.r.t. and s.r. The dotted horizontal lines indicate 95% asymptotic confidence bounds (±1.96/√n) corresponding to the ACF of i.i.d. Gaussian r.v.s.

overall volume is determined by

\mathrm{Var}(V(t)) = \mathrm{Var}(Z_i) H(t) + (\mathbb{E}(Z_i))^2 \mathrm{Var}(N_t),

provided \mathbb{E} Z_i^2 < \infty holds (cf. [20]). It may be evaluated computationally at any t > 0 by estimating H(t) and the expectation and variance of Z_i. Hence, a key issue concerns the estimation of the renewal function on moderate time intervals [0, t]. To apply the approach related to a renewal process, the measurements \{\tau_1, \tau_2, \dots\} should be i.i.d. r.v.s. Recent empirical studies (cf. [2]) have further investigated the dependence structure of Internet traffic above, at and below several critical time scales: the round-trip times of the TCP streams arising from the HTTP connections between clients and Web servers, the consecutive packet emission periods of the corresponding IP sources along their paths through routers together with the related buffering delays of IP packets in routers, i.e. the burst intervals of IP traffic, and finally the very small transmission periods of IP packets, or even smaller byte structures, on high-speed links. There is evidence that, due to the coarse observation and measurement periods of our data gathering, the modeling of arrival processes of Internet traffic above those critical time scales can be performed by a first simplified approach. Therefore, we make the reasonable assumption that the related traffic constitutes a stationary recurrent process of file or page arrival epochs with independently marked burst lengths Z.
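The identity \mathbb{E}(V(t)) = \mathbb{E}(Z) H(t) is easy to verify by simulation. The sketch below (Python; all names are ours, and the Poisson/exponential choices, for which H(t) = λt holds exactly, are an illustrative assumption) estimates \mathbb{E}(V(t)) by Monte Carlo:

```python
import numpy as np

def mean_volume(rate, mean_size, t, runs, rng):
    """Monte-Carlo estimate of E V(t) for a marked renewal process with
    exponential inter-arrival times (rate `rate`) and i.i.d. exponential
    marks Z_i with mean `mean_size`."""
    total = 0.0
    horizon = int(3 * rate * t) + 50     # generous number of arrivals per run
    for _ in range(runs):
        arrivals = np.cumsum(rng.exponential(1.0 / rate, size=horizon))
        n_t = int(np.searchsorted(arrivals, t, side='right'))   # N_t
        total += rng.exponential(mean_size, size=n_t).sum()     # V(t)
    return total / runs
```

With rate λ = 2, \mathbb{E}(Z) = 5 and t = 10, Wald's equation gives \mathbb{E} V(t) = 5 \cdot \lambda t = 100, which the simulation reproduces up to sampling noise.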
Of course, at finer time scales other, more sophisticated models such as correlated semi-Markovian streams or long-range dependent aggregated traffic streams, e.g. described by fractional Brownian motion or related mathematical point processes, may be more adequate. Obviously, all modeling must be driven by the resolution and time scales that are implemented and can be distinguished during the measurement processes. By definition, the renewal function H(t) is expressed by

H(t) = \mathbb{E}(N_t) = \sum_{n=1}^{\infty} \mathbb{P}\{t_n < t\} = \sum_{n=1}^{\infty} F^{n*}(t), \quad t \ge 0, \qquad (2.6)

where F^{n*} denotes the n-fold recursive Stieltjes convolution of F. There are not many statistical estimates of H(t) on moderate time intervals [0, t] from empirical data; we refer to [14] for a survey.

The well known estimate proposed by Frees [10] is formulated as follows:

H_l(t, k) = \sum_{n=1}^{k} F_l^{(n)}(t), \quad \text{where} \quad F_l^{(n)}(t) = \binom{l}{n}^{-1} \sum_{c} \theta\big(t - (\tau_{i_1} + \dots + \tau_{i_n})\big)

is the unbiased estimate of the convolution F^{n*}(t) with the minimal variance, and \sum_c denotes the sum over all \binom{l}{n} distinct index combinations \{i_1, i_2, \dots, i_n\} of length n. In [12] the following histogram-type estimator has been proposed:

\hat H(t, k, l) = \sum_{n=1}^{k} \frac{1}{l_n} \sum_{i=1}^{l_n} \theta(t - t_n^i), \qquad (2.7)

where t_n^i = \sum_{q=1+n(i-1)}^{ni} \tau_q, i = 1, \dots, l_n, l_n = [l/n], n = 1, \dots, l, defines the observations of the r.v. t_n (with \theta(t) = \mathbb{1}(t \ge 0)), constructed from an empirical sample of i.i.d. inter-arrival times T^{(l)} = \{\tau_1, \dots, \tau_l\}. The reliability and convergence rates of this estimate are highlighted in [14]. The parameter k in both estimates can be selected as a constant. It was proposed in [14] to select k in \hat H(t, k, l) as

k_{plot} = \arg\min\{k : \hat H(t, k, l) = \hat H(t, k+1, l),\ k = 1, \dots, l-1\}.

The latter selection corresponds to the minimal k associated with the stabilization interval of the plot k ↦ \hat H(t, k, l) for a fixed t. Using larger sample sizes l and an adaptive selection of k from the data T^{(l)}, the histogram-type estimate may provide a smaller mean squared error than Frees' estimate (cf. [14]). In contrast to (2.7), Frees' estimate requires a lot of calculations and hence cannot be evaluated for large sample sizes. The number of operations required to calculate H_l(t, k) and \hat H(t, k, l) with a (bootstrap-selected) constant and a plot-selected k is shown in Table 2.4.

Table 2.4 Number of operations for Frees' and the histogram-type estimator with a fixed k and a plot-selected k.
Estimator                                     | Amount of operations                                                                                               | Example (k = 10, l = 20)
Frees' estimator                              | A_1 = \sum_{n=1}^{k} \binom{l}{n}(n+1) = 2^l(1 + l/2) - \sum_{n=k+1}^{l} \binom{l}{n} - \sum_{n=k+1}^{l} n\binom{l}{n} - 1 |
Histogram-type estimator with a fixed k       | A_2 = \sum_{n=1}^{k} [l/n](n+1)                                                                                    | 258
Histogram-type estimator with plot-selected k | \sum_{k=1}^{[l/2]} A_2(k) = \sum_{k=1}^{[l/2]} \sum_{n=1}^{k} [l/n](n+1)                                           |
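A direct implementation of the histogram-type estimate (2.7), including the plot-based selection of k, requires only the block sums t_n^i. The Python sketch below uses names of our own choosing; for a Poisson stream with rate λ it should recover H(t) = λt:

```python
import numpy as np

def renewal_histogram_estimate(taus, t, k):
    """Histogram-type estimate (2.7): for each n <= k, form l_n = [l/n]
    disjoint sums of n consecutive inter-arrival times and count the
    fraction not exceeding t."""
    taus = np.asarray(taus, dtype=float)
    l = taus.size
    h = 0.0
    for n in range(1, k + 1):
        l_n = l // n
        if l_n == 0:
            break
        block_sums = taus[: l_n * n].reshape(l_n, n).sum(axis=1)   # t_n^i
        h += float(np.mean(block_sums <= t))
    return h

def plot_selected_k(taus, t, k_max):
    """k_plot: the smallest k at which the estimate stops changing."""
    prev = renewal_histogram_estimate(taus, t, 1)
    for k in range(1, k_max):
        cur = renewal_histogram_estimate(taus, t, k + 1)
        if cur == prev:
            return k
        prev = cur
    return k_max
```

Each term of (2.7) costs only about [l/n] block sums, which is the few-operations property of the estimator that Table 2.4 quantifies.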

2.6 Conclusions

Considering the statistical characterization of Internet traffic, three items have to be investigated: (1) the preliminary detection of heavy tails; (2) the dependence structure of the data; (3) the traffic modeling and evaluation. For heavy-tailed models and, in particular, models with infinite variance, the classical statistical methods are not adequate, applicable or flexible enough. An example is given by the sample autocorrelation function (ACF), which has a specific form for heavy-tailed distributions with infinite variance. We should also distinguish between methods that are valid for independent and for dependent data. The exploratory techniques introduced in Section 2, like the ratio of the maximum to the sum, one version of a QQ-plot and the group estimator of the tail index, start from an i.i.d. assumption on the underlying data. Their interpretation may become hazardous when they are applied in the non-i.i.d. case. However, we have applied these tools to all Web data that we have gathered despite their evident non-i.i.d. structure. The reason is that the considered methods (apart from the mentioned QQ-plot) have an asymptotic background, and their application to samples of moderate sizes requires additional research. As one may conclude from Table 2.3, all these methods were consistent in their conclusions. Apart from simple methods for the detection of heavy-tailed features and dependence in traffic data, we have considered the evaluation of the mean number of traffic events, i.e. of the renewal function, when the inter-arrivals of Web objects are assumed to be i.i.d. r.v.s. In the latter case, we have considered a marked renewal process as traffic model. An important point is that the histogram-type estimate of the renewal function allows us to restore this function successfully for any heavy-tailed inter-arrival distribution using only a few operations.
In the near future, more sophisticated techniques derived from the theory of statistical ill-posed problems and from the on-line identification of time series with short- and long-range dependence have to be investigated to simplify and improve the understanding of the evolving new services implemented on top of high-speed IP networks with new quality-of-service differentiation schemes.

Bibliography

[1] Crovella, M.E., Bestavros, A. Self-similarity in World Wide Web traffic: Evidence and possible causes. ACM Sigmetrics.

[2] Cao, J., Cleveland, W.S., Lin, D., Sun, D.X. Internet traffic tends toward Poisson and independent as the load increases. Technical Report, Bell Labs, Murray Hill.

[3] Davis, R., Resnick, S. Limit theory for moving averages of random variables with regularly varying tail probabilities. Ann. Probability, 13.

[4] Davydov, Y., Paulauskas, V., Račkauskas, A. More on P-stable convex sets in Banach spaces. J. Theoret. Probab., 13(1), 39–64.

[5] Eurescom Project P1112 NEW DIMENSIONS: Network dimensioning based on modeling of Internet traffic. Traffic characteristics and statistical estimation. Technical Report D8, Heidelberg, June.

[8] Embrechts, P., Klüppelberg, C., Mikosch, T. Modelling Extremal Events. Springer, Berlin.

[10] Frees, E.W. Warranty analysis and renewal function estimation. Nav. Res. Logist. Quart., 33.

[8] Hill, B.M. A simple general approach to inference about the tail of a distribution. Ann. Statist., 3.

[9] Jurečková, J., Picek, J. A class of tests on the tail index. Extremes, 4.

[10] Krieger, U.R., Markovitch, N.M., Vicari, N. Analysis of World Wide Web traffic by nonparametric estimation techniques. In K. Goto et al., eds., Performance and QoS of Next Generation Networking, 67–83, Springer, London.

[11] Maiboroda, R.E., Markovich, N.M. Estimation of heavy-tailed probability density function with application to Web data. Computational Statistics, 19.

[12] Markovitch, N.M., Krieger, U.R. Estimating basic characteristics of arrival processes in telecommunication networks by empirical data. Telecomm. Systems, 20, 11–31.

[13] Markovitch, N.M., Krieger, U.R. The estimation of heavy-tailed probability density functions, their mixtures and quantiles. Computer Networks, 40(3).

[14] Markovich, N.M. Nonparametric renewal function estimation and smoothing by empirical data. Technical Report, ETH, Zürich.

[15] Markovich, N.M. On-line estimation of the tail index for heavy-tailed distributions with application to WWW-traffic. In Proceedings of the EuroNGI First Conference on NGI: Traffic Engineering, April 18–20, Rome, Italy.

[16] Mikosch, T. Modeling dependence and tails of financial time series.
Working Paper No. 181, Laboratory of Actuarial Mathematics, University of Copenhagen.

[20] Nabe, M., Murata, M., Miyahara, H. Analysis and modeling of World Wide Web traffic for capacity dimensioning of Internet access lines. Performance Evaluation, 34.

[18] Resnick, S.I. Heavy tail modeling and teletraffic data. With discussion and a rejoinder by the author. Ann. Statist., 25.

[19] Silverman, B.W. Density Estimation for Statistics and Data Analysis. Chapman & Hall, New York.

[20] Trivedi, K.S. Probability & Statistics with Reliability, Queuing, and Computer Science Applications. Prentice Hall of India.

3 Cluster Analysis of Internet Users Based on Hourly Traffic Utilization

M. Rosário de Oliveira, Instituto Superior Técnico - UTL, Department of Mathematics and CEMAT, Av. Rovisco Pais, Lisboa, Portugal. rosario.oliveira@math.ist.utl.pt

Rui Valadas, University of Aveiro / Institute of Telecommunications Aveiro, Campus de Santiago, Aveiro, Portugal. rv@det.ua.pt

António Pacheco, Instituto Superior Técnico - UTL, Department of Mathematics and CEMAT, Av. Rovisco Pais, Lisboa, Portugal. apacheco@math.ist.utl.pt

Paulo Salvador, University of Aveiro / Institute of Telecommunications Aveiro, Campus de Santiago, Aveiro, Portugal. salvador@av.it.pt

Abstract

Internet access traffic follows hourly patterns that depend on various factors, such as the periods users stay on-line at the access point (e.g. at home or in the office) or their preferences for applications. The clustering of Internet users may provide important information for traffic engineering and tariffing. For example, it can be used to set up service differentiation according to hourly behavior, resource optimization based on multi-hour routing, and the definition of tariffs that promote Internet access during the least busy hours. In this work, we identify patterns of similar behavior by grouping the Internet users of two distinct Portuguese ISPs, one using a CATV access network and the other an ADSL one, offering distinct traffic contracts. The grouping of the users is based on their traffic utilization measured every half-hour (over one day). Cluster analysis is used to identify the relevant Internet usage profiles, with partitioning around medoids and

Ward's method being the preferred clustering methods. For the two data sets, these methods lead to 3 clusters with different hourly traffic utilization profiles. The cluster structure is validated through discriminant analysis. Having identified the clusters, the type of applications used, as well as the amount of downloaded traffic, the activity duration and the transfer rate, are analyzed for each cluster, leading to coherent outcomes.

Keywords: Access networks, cluster analysis, discriminant analysis, principal component analysis, Internet traffic characterization, traffic measurements.

3.1 Introduction

The behavior of Internet users is changing rapidly due to the continuous emergence of new applications and services and the introduction of broadband access network technologies, such as ADSL and CATV, offering large access bandwidths. An example of a recent user practice with a significant impact on access network traffic is the download of large music and film files using file sharing applications such as Kazaa and eDonkey. The characterization of Internet access traffic is important for both Internet Service Providers (ISPs) and access network operators. Due to the rapid changes in the traffic characteristics, this activity requires frequent traffic measurements. Traffic measurements can be performed with different resolution levels. The highest temporal resolution is obtained when information relative to every packet passing an observation point is recorded. Typically this information includes the arrival instant and selected header fields containing, e.g., the packet size and the origin and destination addresses. This method can generate large amounts of data, placing a severe limit on the observation period. Longer observation periods can be obtained at the cost of a lower resolution, by recording data only at predefined times, e.g., at the (periodic) end of sampling intervals.
In this case, the recorded data may be the number of bytes and/or the number of packets observed in the sampling period. The resolution of a traffic measurement should match the particular task it is targeted for. For example, traffic measurements for accounting purposes only record information at the end of a login session, which is typically several hours long. Traffic measurements can be executed over many different types of objects, e.g., all traffic downloaded by a user, the traffic aggregate on a link or all traffic generated by a specific application. An important object type is the so-called traffic flow, introduced in [1]. A flow is defined as a sequence of packets crossing an observation point in the network during a certain time interval. All packets belonging to a particular flow have a set of common properties (e.g. destination IP address, destination port number, next-hop IP address, output interface). A flow is considered active as long as its packets remain separated in time by less than a specified timeout value. The IETF is developing a specification for measurement systems based on IP flows [2]. Several network resource and revenue management tasks require traffic measurements with a resolution close to one hour. Two examples are the periodic update of routing performed by off-line traffic engineering algorithms with the objective of optimizing the resource utilization, and the definition of hourly based tariffing policies that may promote Internet access during the least busy hours. These tasks also benefit from a detailed knowledge of the main types of individual user behavior, which may remain hidden if traffic is analyzed only at an aggregate level. Aiming at the clustering of Internet users with similar patterns of Internet utilization, we analyze in this paper the behavior of ISP customers (users) based on their hourly traffic utilization.
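The timeout-based flow abstraction can be made concrete in a few lines of code. The sketch below (Python; the function name is ours, and the 1800-second value in the usage note mirrors the 30-minute timeout adopted later in Section 3.2) splits a sorted sequence of per-user packet arrival instants into flows:

```python
def split_into_flows(arrival_times, timeout):
    """Group sorted packet arrival instants (seconds) into flows: a new flow
    starts whenever the gap to the previous packet exceeds the timeout."""
    flows = []
    current = []
    for t in arrival_times:
        if current and t - current[-1] > timeout:
            flows.append(current)   # close the active flow
            current = []
        current.append(t)
    if current:
        flows.append(current)
    return flows
```

For example, `split_into_flows([0, 10, 2000, 2100, 6000], 1800)` yields three flows, since the gaps of 1990 s and 3900 s both exceed the 30-minute timeout.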
In particular, a user is characterized by the average transfer rate of downloaded traffic in half-hour periods (over

one day). The users are grouped by means of cluster analysis [3], a set of techniques whose purpose is to partition a set of objects into groups or clusters in such a way that objects in the same group are similar, whereas objects in different clusters are fairly distinct. Generally speaking, cluster analysis methods are of two types: hierarchical and non-hierarchical. The hierarchical clustering techniques proceed either by a successive series of merges (agglomerative hierarchical methods) or by a series of successive divisions (divisive hierarchical methods). The partitioning methods, the most common type of non-hierarchical methods, start with a fixed number of clusters and either (i) an initial partition of the objects or (ii) an initial set of seed points, which will form the nuclei of the clusters. Each object is (re)assigned to one of the clusters in such a way that a criterion of internal cohesion and external separation among clusters is satisfied. In our analysis we use, in particular, Ward's method, an agglomerative hierarchical method, and the partitioning around medoids method, a partitioning method. Any sound statistical analysis requires a preliminary data analysis aimed at highlighting the main characteristics and/or inconsistencies of the data as well as the need for transforming the data in some way. Accordingly, prior to performing the cluster analysis, we present an overview of the traffic traces analyzed and use principal component analysis (PCA) as an exploratory tool. PCA [4] is a multivariate technique that seeks to maximize the variance of linear combinations of the observed variables. These linear combinations, called principal components, are uncorrelated. PCA can be used as a dimension-reduction device if a small number of PCs explains a large proportion of the total sample variance of the observed variables.
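The exploratory PCA step can be sketched directly from the eigen-decomposition of the sample covariance matrix. This is a Python illustration under our own assumptions: the function name, the synthetic user-by-half-hour matrix in the usage note and the choice of two components are not the paper's actual data or setup.

```python
import numpy as np

def principal_components(data, n_components):
    """PCA on the rows of `data` (e.g. users x 48 half-hour transfer rates):
    eigen-decomposition of the sample covariance matrix. Returns the
    component scores and the explained-variance ratios."""
    x = np.asarray(data, dtype=float)
    xc = x - x.mean(axis=0)                            # center the variables
    eigvals, eigvecs = np.linalg.eigh(np.cov(xc, rowvar=False))
    order = np.argsort(eigvals)[::-1][:n_components]   # largest variance first
    return xc @ eigvecs[:, order], eigvals[order] / eigvals.sum()
```

If a single hourly profile dominates the sample, the first principal component captures most of the total variance, which is precisely the dimension-reduction scenario mentioned above.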
Effective statistical analysis must be supported by validation procedures. Consequently, we validate the results of the cluster analysis using discriminant analysis [5]. In particular, we compute discriminant functions that best separate the groups obtained from each cluster analysis. The validation of the cluster structure is performed by using the discriminant functions to obtain a classification procedure that is assessed using several error rates. After validation, the clusters obtained are analyzed based on various statistics of user flows.

Internet traffic characterization has been the subject of several works [1, 6, 7, 8] and is addressed on a permanent basis by organizations such as CAIDA [9]. In particular, [6] analyzed the holding times and the call interarrival times of Internet access traffic at an ISDN central office, and [8] studied the influence of access speed on Internet user behavior. Recently, there has been a renewed interest in the application of multivariate statistical analysis techniques to the characterization of Internet traffic. Lakhina et al. [10] used principal component analysis to decompose the structure of origin-destination flows of backbone networks into three main constituents, accounting for common periodic trends, short-lived bursts and noise. Liu et al. [11] proposed a methodology for the clustering and classification of Web traffic patterns extracted from access logs.

The paper is organized as follows. Section 3.2 gives an overview of the traffic traces analyzed, and in Section 3.3 principal components are used as a preliminary data analysis for the cluster analysis presented in Section 3.4. The results of the cluster analysis are validated, using discriminant analysis, in Section 3.5, and the clusters obtained are characterized, based on several statistics, in Section 3.6. In Section 3.7 we describe several applications of hourly traffic profiles and, finally, in Section 3.8 we present our conclusions.

3.2 Overview of the traffic traces

Table 3.1 Statistics of the aggregate of applications.

                                          ISP1            ISP2
  Capture date                            Nov. 9, 2002    Oct. 19, 2002
  Number of users                         3432            874
  Average downloaded traffic (MBytes)
  Average activity duration (hours)
  Average transfer rate (Kbits/sec)

Our analysis resorts to two data traces measured in two distinct ISPs, which will henceforth be designated as ISP1 and ISP2. ISP1 uses a CATV network and ISP2 an ADSL one. Both ISPs offer several types of services, characterized by the maximum allowed transfer rates in the downstream/upstream directions. For ISP1 the services are (in Kbit/s) 128/64, 256/128 and 512/256 and, for ISP2, 512/128 and 1024/256. Both traces were measured on a Saturday: November 9, 2002 (ISP1) and October 19, 2002 (ISP2).

The measurements were detailed packet-level measurements, where the arrival instant and the first 57 bytes of each packet were recorded. This includes information on the packet size, the origin and destination IP addresses, the origin and destination port numbers, and the IP protocol type. The traffic analyzer was a 1.2 GHz AMD Athlon PC with 1.5 GBytes of RAM, running WinDump. No packet drops were reported by WinDump in either measurement. Users were identified by matching IP addresses with accounting information. The data set of ISP1 includes 3432 users and that of ISP2 includes 874 users.

In order to characterize the traffic traces we considered several statistics calculated over the set of users: the average downloaded traffic, the average transfer rate and the average activity duration. The last two statistics resort to the notion of flow. A flow is defined as a sequence of packets with inter-arrival times smaller than 30 minutes. A flow starts upon arrival of the first packet and ends with the arrival of the last packet that precedes a silence period greater than 30 minutes. Flows can be defined for the aggregate of applications (i.e. considering all packets of a user) or for groups of applications (i.e. considering only packets belonging to specific applications). Using the notion of flow, the activity duration is defined as the sum of all flow durations and the transfer rate is defined as the downloaded traffic divided by the activity duration. Note that the notion of flow is different from the notion of session considered in accounting systems such as RADIUS. In RADIUS, a session corresponds to the period of time a user is logged into the system, irrespective of its level of activity. In contrast, a flow reflects only packet-level activity.

We first concentrate on the statistics for the aggregate of applications, i.e., where flows are defined considering packets of all types. The main characteristics of the traffic traces are presented in Table 3.1. The average downloaded traffic is higher in ISP2, since this ISP offers service contracts with higher allowed transfer rates and, in addition, was not imposing any limit on the total amount of downloaded traffic at the time these measurements were carried out. This goes along with higher sample mean values of the activity duration and transfer rates. We note, in particular, the high values of the average activity duration, reaching average values as high as 8.5 hours.

We now turn our attention to the most typical applications. Applications were identified by port number. For ISP1 / ISP2, we enumerated all applications responsible for more than 0.1% / 0.05% of the downloaded

traffic; the remaining applications were classified as Others. We used a higher percentage for ISP1 because a larger number of applications was observed in this ISP. Then, to allow an easier interpretation, these applications were grouped by type in the following way (we include the port number in parentheses):

File sharing: Kazaa (TCP 1214), edonkey2000 (TCP 4662), direct connect (TCP 412, 1412), WinMX (TCP 6699) and FTP (TCP 20, 21).
HTTP: HTTP (80) and secure HTTP (443).
Games: Operation FlashPoint (TCP 2234), Medal of Honor (UDP 12203), Half-Life (UDP 27005) and all traffic from the Game Connection Port (TCP 2346).
IRC/news: IRC traffic (TCP 1025, 1026) and newsgroup access (TCP 119).
Mail: POP3 mail access (TCP 110).
Streaming: MS-Streaming (TCP 1755) and RealPlayer (UDP 6970).
Others: other applications and all unidentified traffic.

To characterize the groups of applications we again use several statistics: the relative utilization (percentage of downloaded bytes in each application group), the average activity duration and the average transfer rate (where flows are defined considering only packets belonging to the applications of each group). The results are shown in figures 3.1, 3.2 and 3.3; we first comment on the results for the identified application groups. Among these groups, file sharing and HTTP are the dominant ones in terms of relative utilization, and this is accompanied by higher activity durations. The transfer rates of these application groups are also high. Mail applications achieve relatively high transfer rates. This is due to the fact that mail downloads are mainly done from the mail server located at the ISP premises; therefore, mail transfer rates are only conditioned by the access network itself (and not by the Internet). The transfer rates are always higher in ISP2 (vide Figure 3.3). This is a direct consequence of the type of service contracts offered by this ISP.
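The flow-based statistics introduced above (30-minute silence threshold, activity duration, transfer rate) can be sketched in a few lines. The following is our own illustrative implementation, with hypothetical helper names, not code from the measurement campaign:

```python
# Sketch of the flow statistics of Section 3.2 (names are ours).
# A flow is a maximal run of packets with inter-arrival gaps below 30 minutes;
# activity duration = sum of flow durations, transfer rate = bytes / duration.

GAP = 30 * 60  # 30-minute silence threshold, in seconds

def flow_stats(timestamps, total_bytes):
    """timestamps: sorted packet arrival times (s); total_bytes: downloaded bytes."""
    if not timestamps:
        return {"flows": 0, "activity_s": 0.0, "rate_kbit_s": 0.0}
    flows = []
    start = prev = timestamps[0]
    for t in timestamps[1:]:
        if t - prev >= GAP:          # a silence of 30 minutes or more ends the flow
            flows.append((start, prev))
            start = t
        prev = t
    flows.append((start, prev))
    activity = sum(end - beg for beg, end in flows)
    rate = (total_bytes * 8 / 1000) / activity if activity > 0 else 0.0
    return {"flows": len(flows), "activity_s": activity, "rate_kbit_s": rate}

# Example: two flows separated by a one-hour silence
ts = [0, 10, 60, 3600 + 60 + 10, 3600 + 60 + 20]
stats = flow_stats(ts, total_bytes=125_000)
```

A real analysis would additionally have to decide how to treat single-packet flows, whose duration is zero under this definition.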
In ISP1 (vide Figure 3.1), there is not a strong difference between the utilization of file sharing and HTTP, as there is in ISP2. It seems that, in ISP1, a higher percentage of users performs file transfer and sharing through HTTP. This may also explain the fact that, in ISP1, HTTP has higher activity durations than file sharing (vide Figure 3.2). The group called Others includes a significant percentage of the traffic, despite the fact that individually the port numbers assigned to this group generated little traffic (according to our criteria, less than 0.1% of the downloaded bytes in ISP1 and 0.05% in ISP2). Given the high values of the activity durations and transfer rates, we are led to suspect that most of these ports are being used by file sharing or video applications that distribute their traffic over a number of (non-standard) ports.

In this paper, we perform cluster analysis based on the download transfer rates measured in half-hour intervals. Figure 3.4 shows the (total) transfer rate of the traffic aggregates of both ISP1 and ISP2, as a function of the time period, along the day. The two ISPs exhibit coherent average hourly profiles, showing a quasi-sinusoidal shape, with the lowest utilization in the morning period and the highest one in the afternoon period. The agreement between the two ISPs' average hourly profiles may suggest that coherent user hourly profiles are present across the two ISPs. However, it does not preclude that such an aggregate representation of the traffic may hide groups of users (clusters) with specific hourly profiles, markedly distinct from the average aggregate hourly profile. Evidence of the existence of very distinct user hourly profiles is given in Figure 3.5,

where the transfer rates of 10 selected ISP1 users are plotted. The nonhomogeneous character of hourly user profiles motivates the use of cluster analysis to address the possible identification of groups of users (clusters) with similar hourly profiles, for each ISP. Of interest is also the comparison of the hourly profiles of the clusters for the two ISPs, apart from the hourly traffic utilization rates.

Figure 3.2 Average activity duration.

Figure 3.3 Average transfer rate.

3.3 Principal component analysis as an exploratory tool

Principal component analysis (PCA) [4] seeks to maximize the variance of linear combinations of the variables, called principal components. If a small number of principal components explains a large proportion of the total variance of the p original variables, then PCA can successfully be used as a dimension-reduction technique. Moreover, the analyst is also interested in the interpretation of the principal components, i.e., the meaning of the new directions onto which the data is projected. Given a set of n observations on the random variables X_1, X_2, ..., X_p, the k-th principal component (PC_k) is defined as the linear combination

Z_k = α_{k1} X_1 + α_{k2} X_2 + ... + α_{kp} X_p    (3.1)

such that the loadings of Z_k, α_k = (α_{k1}, α_{k2}, ..., α_{kp})^t, have unit Euclidean norm (α_k^t α_k = 1), Z_k has maximum variance, and PC_k, k ≥ 2, is uncorrelated with the previous PCs, which in fact means that α_i^t α_k = 0, i = 1, ..., k-1. Thus, the first principal component is the linear combination of the observed variables with maximum variance. The second principal component verifies a similar optimality criterion and is uncorrelated with PC_1, and so on. As a result, the principal components are indexed by decreasing variance, i.e., λ_1 ≥ λ_2 ≥ ... ≥ λ_s, where λ_i denotes the variance of PC_i and s = min(n, p) is the maximum number of PCs that can be extracted (in the present paper n > p). It can be proved [4] that the vector of loadings of the k-th principal component, α_k, is the eigenvector associated with the k-th highest eigenvalue, λ_k, of the covariance matrix of X_1, X_2, ..., X_p. Therefore, the k-th highest eigenvalue of the sample covariance matrix is the variance of PC_k, i.e. λ_k = Var(Z_k).

Figure 3.5 Download transfer rates of 10 randomly selected ISP1 users.

The proportion of the total variance explained by the first r principal components is

(λ_1 + λ_2 + ... + λ_r) / (λ_1 + λ_2 + ... + λ_p).    (3.2)

If this proportion is close to one, then there is almost as much information in the first r principal components as in the original p variables. In practice, the number r of principal components considered should be chosen as small as possible, while keeping the proportion of explained variance, (3.2), large enough. Once the loadings of the principal components are obtained, the score of user i on PC_k is given by

z_{ik} = α_{k1} x_{i1} + α_{k2} x_{i2} + ... + α_{kp} x_{ip},    (3.3)

where x_i = (x_{i1}, ..., x_{ip})^t is the data vector corresponding to user i. Plotting the scores on the first PCs, we obtain a graphical representation of the data.
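Definitions (3.1)-(3.3) translate directly into a small linear-algebra routine. The sketch below is our own generic implementation on synthetic data, computing loadings, explained-variance proportions and scores from the sample correlation matrix, as done in the analysis that follows:

```python
import numpy as np

def pca_corr(X):
    """PCA on the sample correlation matrix of X (n observations x p variables).
    Returns loadings (columns = alpha_k), eigenvalues (descending) and scores."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize the variables
    R = np.corrcoef(X, rowvar=False)                   # sample correlation matrix
    eigval, eigvec = np.linalg.eigh(R)                 # eigh returns ascending order
    order = np.argsort(eigval)[::-1]                   # sort by decreasing variance
    eigval, eigvec = eigval[order], eigvec[:, order]
    scores = Z @ eigvec                                # z_{ik} = alpha_k^t x_i, eq. (3.3)
    return eigvec, eigval, scores

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=200)         # plant one correlated pair
loadings, lam, scores = pca_corr(X)
explained = lam[:2].sum() / lam.sum()                  # proportion (3.2) for r = 2
```

Since the trace of a p x p correlation matrix is p, the eigenvalues sum to the number of variables, which gives a quick sanity check on the decomposition.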
In this paper, PCA is used as an exploratory tool and is applied to the data corresponding to the transfer rate (in Kbit/s) in the j-th half-hour interval, X_j, j = 1, 2, ..., 48, for each of the two ISPs. The principal components were estimated based on the sample correlation matrix. The first two sample principal components explain 56.7% of the total sample variance for ISP1, and 54.5% for ISP2. The loadings of the first two sample principal components for each ISP are represented in Figure 3.6. Note that all PC_1 loadings are positive and

Figure 3.7 Scores of the first two sample principal components, ISP1.

of similar magnitude, while the PC_2 loadings are negative for the first period of the day and positive for the second part of the day. Taking into account not only the loadings but also the sample correlation between each observed variable and each of the first two principal components, it can be said that, for both data sets, PC_1 is an average of the hourly traffic utilization along the day and PC_2 is a measure of contrast between the "morning" and "afternoon" utilizations. Thus, high values of PC_1 are associated with users with high Internet utilization rates along the day, and low values represent users with low Internet transfer rates. Likewise, high (positive) values of PC_2 can be interpreted as describing users with high rates of utilization only in the last period of the day ("afternoon"), while very small (negative) values of PC_2 represent users with high rates of utilization in the first period of the day ("morning") and low rates of utilization during the second period of the day ("afternoon"). The scores of the ISP1 users on the first two principal components are displayed in Figure 3.7; a similar shape is observed for the (not shown) scores of the ISP2 users on the first two PCs. The peculiar pattern exhibited in Figure 3.7 leads to the conclusion that the variability of PC_2 increases with the value of PC_1. This suggests that the variability of Internet utilization in half-hour intervals along the day increases with the values of PC_1, i.e., with the daily average Internet utilization.
Taking this fact into account, and in order to smooth this variability, the following transformation of the data was performed, for each data set:

Y_j = ln(1 + X_j),    (3.4)

for j = 1, ..., 48. Repeating the PCA for the transformed data, y_1, y_2, ..., y_n, where y_i = (y_{i1}, y_{i2}, ..., y_{i48})^t and n is the number of users, we conclude that the first two PCs explain 50.6% of the total variance for ISP1 and 60.4% for ISP2. These PCs have the following interpretation for both ISPs: PC_1 is an average of Internet utilization along the day and PC_2 is a measure of contrast between the "morning" utilization ("morning" means from 2:30 am until 1 pm for ISP1 and from midnight until 1:30 pm for ISP2) and the "afternoon" utilization. Moreover, the transformation used succeeded in reducing the variability of PC_2 with the value of PC_1. Thus, the transformed data is used in the following sections. In the next section, we use cluster analysis to define groups of users with similar traffic utilization, based on the data transformed by (3.4).
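The variance-stabilizing effect of transformation (3.4) can be illustrated on synthetic rates (our own toy data, not the measured traces): groups whose spread grows with their mean end up with comparable spreads after Y = ln(1 + X):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic per-user rates: the heavy group has both a higher mean and a
# much larger spread, mimicking the pattern seen in the PC score plot
light = rng.exponential(scale=5.0, size=1000)      # low-rate users (Kbit/s)
heavy = rng.exponential(scale=500.0, size=1000)    # high-rate users (Kbit/s)
ratio_raw = heavy.std() / light.std()              # spread grows with the mean

# Transformation (3.4): Y = ln(1 + X)
y_light, y_heavy = np.log1p(light), np.log1p(heavy)
ratio_log = y_heavy.std() / y_light.std()          # spreads become comparable
```

For exponentially distributed rates the standard deviation equals the mean, so the raw spread ratio tracks the mean ratio (here roughly 100), while after the log transform the two groups differ mainly by a shift.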

3.4 Cluster analysis

The purpose of cluster analysis [3] is to partition a set of objects into clusters in such a way that objects in the same cluster are similar, whereas objects in different clusters are distinct. The concept of cluster is linked with the concept of proximity, or distance, between objects and groups of objects. Generally speaking, clustering methods are of two types: hierarchical and non-hierarchical, with the former being the most common. The hierarchical clustering techniques proceed either by a successive series of merges (agglomerative hierarchical methods) or by a series of successive divisions (divisive hierarchical methods). The agglomerative methods start with as many clusters as objects and end with only one cluster, containing all the objects. The divisive methods work in the opposite direction. The results of hierarchical methods may be displayed in the form of a tree diagram, called a dendrogram (vide figures), that illustrates the merges or divisions which have been made. These methods are based on (i) a measure of dissimilarity or similarity between objects and (ii) a measure of proximity between clusters that defines which two clusters are the closest, to be merged in each step.

In the present work, and since the observations are continuous, the chosen distance between two objects (users), x_i = (x_{i1}, ..., x_{i48})^t and x_j = (x_{j1}, ..., x_{j48})^t, is the Euclidean distance

d_{ij} = { (x_i - x_j)^t (x_i - x_j) }^{1/2}.

The merging criterion used leads to what is called Ward's method, also known as the incremental sum of squares method. Let n_r be the number of objects in cluster r and x_{ir} be the observation vector associated with object i of cluster r. The within-cluster distance is given by

SSW_r = Σ_{i=1}^{n_r} (x_{ir} - x̄_r)^t (x_{ir} - x̄_r),

where x̄_r is the mean vector of the n_r observation vectors in cluster r. Let (r, s) denote the cluster containing the users from clusters r and s.
At each step, Ward's method merges the two clusters r and s for which SSW_{(r,s)} - (SSW_r + SSW_s) is minimum and, hence, seeks to minimize the increase in the total within-cluster sum of squares. In addition to Ward's method we have also used the partitioning around medoids method, for which the analyst has to decide in advance how many clusters, say K, he wants to consider. The method starts by choosing the K medoids, here denoted by m_1, m_2, ..., m_K. These are representative objects chosen such that the total (Euclidean) distance of all objects to their nearest medoid is minimal, i.e., the algorithm finds a subset {m_1, ..., m_K} ⊂ {1, ..., n} which minimizes the function

Σ_{i=1}^{n} min_{t=1,...,K} d_{i m_t}.

Each object is then assigned to the cluster corresponding to the nearest medoid. That is, object i is assigned to cluster C_j whose associated medoid, m_j, is nearest to object i, i.e., d_{i m_j} ≤ d_{i m_t} for all t ∈ {1, 2, ..., K}.
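Both clustering procedures are easy to sketch. Below, Ward's method comes from SciPy, while the medoids step is a simplified partitioning-around-medoids of our own (greedy initialisation plus reassignment, without the full PAM swap phase), together with the average silhouette width used later in this section to choose the number of clusters:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import cdist

rng = np.random.default_rng(2)
# Toy stand-in for the 48-dimensional half-hour profiles: 3 planted user groups
X = np.vstack([rng.normal(m, 0.3, size=(30, 48)) for m in (0.0, 2.0, 4.0)])

# Ward's method: build the dendrogram, then cut it so that 3 branches remain
ward_labels = fcluster(linkage(X, method="ward"), t=3, criterion="maxclust")

def k_medoids(X, K, iters=20):
    """Simplified partitioning around medoids: greedy initialisation plus
    alternating reassignment (the full PAM algorithm adds a swap phase)."""
    D = cdist(X, X)                                  # Euclidean distance matrix
    medoids = [int(D.sum(axis=0).argmin())]          # best single medoid
    while len(medoids) < K:                          # greedily add medoids
        cur = D[:, medoids].min(axis=1)
        cost = np.minimum(D, cur[:, None]).sum(axis=0)
        cost[medoids] = np.inf
        medoids.append(int(cost.argmin()))
    medoids = np.array(medoids)
    for _ in range(iters):
        labels = D[:, medoids].argmin(axis=1)        # assign to nearest medoid
        new = np.array([np.flatnonzero(labels == k)[
            D[np.ix_(labels == k, labels == k)].sum(axis=0).argmin()]
            for k in range(K)])                      # best medoid per cluster
        if np.array_equal(np.sort(new), np.sort(medoids)):
            break
        medoids = new
    labels = D[:, medoids].argmin(axis=1)
    return labels, medoids

def avg_silhouette(X, labels):
    """Overall average silhouette width s(i) = (b_i - a_i) / max(a_i, b_i)."""
    D = cdist(X, X)
    s = []
    for i, li in enumerate(labels):
        same = labels == li
        if same.sum() < 2:
            s.append(0.0)                            # singleton-cluster convention
            continue
        a = D[i, same].sum() / (same.sum() - 1)      # mean within-cluster distance
        b = min(D[i, labels == lj].mean() for lj in set(labels.tolist()) - {li})
        s.append((b - a) / max(a, b))
    return float(np.mean(s))

pam_labels, medoids = k_medoids(X, K=3)
sil = avg_silhouette(X, pam_labels)                  # close to 1 for tight clusters
```

On such well-separated toy data both methods recover the planted groups exactly and the silhouette width is high; on real traffic profiles, values such as the 0.49 and 0.32 reported below indicate progressively weaker structure.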

In the present study, the users are the objects and the variables are the (transformed) half-hour interval transfer rates along the day. Thus, we want to group users with similar patterns of Internet utilization, characterized by their transfer rates in each half-hour interval.

Figure 3.9 Dendrogram.

For both ISPs, the two clustering methods considered lead to the choice of 3 clusters. The dendrograms illustrate the merging of clusters using Ward's method, while Figure 3.10 describes the medoids for each ISP using 3 clusters. The clusters considered using Ward's method were obtained by cutting the dendrogram trees at a distance that leads to 3 branches. This choice was confirmed by the medoids method, since with 3 medoids we obtained the highest values of the overall average silhouette width: 0.49 for ISP1 and 0.32 for ISP2. The silhouette width is a quality index that measures how strong the cluster structure is and helps the analyst choose the number of clusters to be considered [12]; a minimum value for this index is recommended in [12]. In both cases, other numbers of clusters, as well as other clustering methods, were considered, leading to less clear and coherent results.

Figure 3.11 Half-hour medians for each cluster, ISP1 and ISP2.

The clusters formed through the medoids and Ward's methods lead to very high rates of in-between cluster

agreement, for both ISPs. In fact, 84.4% (84.0%) of the ISP1 (ISP2) users were assigned to the same cluster by both the medoids and Ward's methods.

Figure 3.13 Scores on the first two sample principal components, with users from different clusters marked differently, ISP2.

To help in the interpretation of the 3 clusters obtained for each ISP, graphs of the median and average half-hour transfer rates along the day within each cluster were obtained. All graphs lead to similar conclusions, so only the graphs of the median, for each ISP, using the medoids method are presented; see Figure 3.11. In addition, the representation of the typical user in each cluster (i.e., the cluster's medoid) is given in Figure 3.10. The clusters obtained can be interpreted in the same way for both ISPs and are described in Table 3.2. This interpretation is supported by figures 3.12 and 3.13, where each observation is projected onto the first two principal components and users from different clusters are marked differently. Taking into consideration that PC_1 is an average of the hourly traffic utilization along the day and PC_2 a measure of the contrast between "morning" and "afternoon" utilization, we can confirm that users from C1 have the highest transfer rates all day long (high values on PC_1). Users from C2 have positive values on PC_2, which means high transfer rates during the afternoon and low ones during the morning. Finally, C3 users have low values on PC_1 and distribute symmetrically on PC_2, which suggests that they are characterized by low transfer rates in all periods of the day.
From Table 3.3 we can conclude that cluster C1 (characterized by a high transfer rate in all periods) is the cluster with the fewest users (with the exception of ISP2, medoids) and cluster C3 (characterized by a low transfer rate in all periods) is the one with the highest number of users. It is also worthwhile to mention that the discrepancy among cluster sizes is smaller for ISP2 than for ISP1. That is, proportionally, there are more ISP1 users characterized by low transfer rates and fewer users associated with high transfer rates all day long. The structure of 3 clusters has to be validated. If the division of the population into clusters is in fact clear, the groups should also be well separated, and it is expected that classification rules will correctly assign most of the

Table 3.2 Interpretation of ISP1 and ISP2 clusters.

  Cluster   Interpretation
  C1        high transfer rate in all periods
  C2        low/high transfer rate in the morning/afternoon
  C3        low transfer rate in all periods

Table 3.3 Cluster sizes for ISP1 and ISP2.

                  ISP1                 ISP2
  Cluster    Medoids    Ward     Medoids    Ward
  C1          4.23%     4.87%    18.52%    20.00%
  C2          7.75%    22.29%    14.17%    21.94%
  C3         88.02%    72.84%    67.32%    58.06%

users. The cluster structure is evaluated, in the next section, using discriminant analysis.

3.5 Validation using discriminant analysis

Discriminant analysis [5] is a multivariate technique concerned with the separation among different sets of objects and the classification of a new object into one of the previously defined groups. If there is in fact a cluster structure, then the separation among the clusters (groups) should be clear, and discriminant analysis should be able to detect it. Moreover, if this separation is clear, it is anticipated that the estimated classification rules will be able to allocate the majority of the users to the right cluster. If so, the cluster analysis is validated.

R. A. Fisher suggested finding linear functions of the data, called discriminant functions, that best separate g groups characterized by p variables [13]. The first discriminant function is the linear combination of the observed variables, l_1^t x = l_{11} x_1 + ... + l_{1p} x_p, that maximizes the ratio of the between-group sum of squares to the within-group sum of squares. This means that the separation is made in such a way that within each group the objects are as similar as possible while, at the same time, the groups are as different as possible. A maximum of s = min(g - 1, p) discriminant functions can be defined as linear functions of the observed variables, uncorrelated with the previous ones, which verify the same optimality criterion. It is assumed that the groups belong to populations with equal covariance matrices of full rank, i.e. Σ_1 = ... = Σ_g = Σ, where Σ_i is the covariance matrix of population i. Let x_{ir} be the vector of observations on user i of group r (with n_r users), where x_{ir} = (x_{i1r}, ..., x_{ipr})^t.
The sample mean vector for group r is x̄_r = (x̄_{1r}, ..., x̄_{pr})^t, where x̄_{jr} = Σ_{i=1}^{n_r} x_{ijr} / n_r, j = 1, ..., p, r = 1, ..., g. The within-group sum of squares matrix, W, is defined by

W = Σ_{r=1}^{g} Σ_{i=1}^{n_r} (x_{ir} - x̄_r)(x_{ir} - x̄_r)^t = Σ_{r=1}^{g} (n_r - 1) S_r,

where S_r is the sample covariance matrix of group r, r = 1, ..., g. The between-group sum of squares matrix is

B = Σ_{r=1}^{g} n_r (x̄_r - x̄)(x̄_r - x̄)^t,

where the overall sample mean is x̄ = Σ_{r=1}^{g} Σ_{i=1}^{n_r} x_{ir} / n, and n = n_1 + ... + n_g. It can be proved that l_j, the j-th discriminant function coefficient vector, is the eigenvector associated with the j-th largest eigenvalue of the matrix W^{-1} B, scaled such that l_j^t S_p l_j = 1, where S_p = W / (n - g) is an estimate of Σ. Sometimes a smaller number of linear combinations, k < s, can be chosen to explain the characteristics that best separate the groups [14].
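The construction of W, B and the scaled discriminant vectors can be sketched as follows; this is our own generic implementation on synthetic groups, not the deliverable's code:

```python
import numpy as np
from scipy.linalg import eig

def fisher_lda(X, groups):
    """Fisher discriminant vectors: eigenvectors of W^{-1} B, scaled so that
    l^t S_p l = 1, following the definitions in the text."""
    labels = np.unique(groups)
    n, p = X.shape
    g = len(labels)
    xbar = X.mean(axis=0)                              # overall sample mean
    W = np.zeros((p, p))
    B = np.zeros((p, p))
    for r in labels:
        Xr = X[groups == r]
        mr = Xr.mean(axis=0)
        W += (Xr - mr).T @ (Xr - mr)                   # within-group sum of squares
        B += len(Xr) * np.outer(mr - xbar, mr - xbar)  # between-group sum of squares
    Sp = W / (n - g)                                   # pooled estimate of Sigma
    vals, vecs = eig(np.linalg.solve(W, B))            # eigenpairs of W^{-1} B
    order = np.argsort(vals.real)[::-1][:min(g - 1, p)]
    L = vecs[:, order].real
    L = L / np.sqrt(np.einsum("ij,jk,ki->i", L.T, Sp, L))  # enforce l^t Sp l = 1
    return L

rng = np.random.default_rng(3)
# Three synthetic groups with different mean vectors, equal covariance (p = 5)
mu = np.zeros((3, 5)); mu[1, 0] = 4.0; mu[2, 1] = 4.0
X = np.vstack([rng.normal(mu[r], 1.0, size=(40, 5)) for r in range(3)])
groups = np.repeat([0, 1, 2], 40)
L = fisher_lda(X, groups)          # min(g - 1, p) = 2 discriminant functions
```

With the l^t S_p l = 1 scaling, the discriminant scores have (pooled) unit within-group variance, so group separation can be read directly off the distances between the group mean scores.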

Fisher's discriminant functions were derived to obtain a representation of the data that separates the populations as much as possible. However, they can also be used to produce a discrimination rule [14]. To classify an object x_0 we compute the values of the discriminant functions at the object, (l_1^t x_0, ..., l_s^t x_0)^t. The object is then allocated to the group r for which the Euclidean distance to (l_1^t x̄_r, ..., l_s^t x̄_r)^t is smallest (in our case we chose k = s). If the groups have very different sizes, prior probabilities associated with each group can be used to obtain a better classification rule. If the prior is such that p_1 = ... = p_g = 1/g, this rule is equivalent to the classification rule obtained when the observations have a normal distribution and the groups have different mean vectors but equal covariance matrices.

Discriminant analysis can be used to validate the obtained clusters. In fact, if we consider that each user belongs to a group, determined by the cluster he was classified in, we can obtain discriminant functions that best separate these groups (clusters). If the groups are well separated, then it is expected that the estimated classification rules will produce low rates of misclassification, validating the cluster structure. Once again, several procedures to estimate the discriminant functions were considered; the one that led to the lowest error rates was Fisher's discriminant functions, modified in order to account for prior probabilities [14].

A discriminant rule can be evaluated by reclassifying the users and calculating the number of individuals badly classified in each group. Classifying the same individuals used to estimate the discriminant functions yields what is called the apparent error rate. The apparent error rate tends to underestimate the true error rate, thus misleading the conclusions.
Accordingly, 3 other error rates were considered. The first error rate is obtained by a procedure called leave-one-out: each user is left out in turn, a classification rule is estimated using all the other (n - 1) users, and the user left out is then classified. The second error rate is obtained by cross-validation. To be more precise, each data set is divided in two: the first part, containing 80% of the data (called the training set), is used to estimate the classification rule, and the rest of the data is classified, leading to an error rate. The third error rate is obtained by double cross-validation. Once again, the data set is divided into two groups, each of them containing 50% of the data. The first group is used to estimate the classification rule, which is then used to classify the members of the second group, yielding an associated error rate. Then the second group is used to estimate the classification rule, the members of the first group are classified, and a second error rate is calculated. The final error rate is the mean of the two previously computed rates.

Given that the obtained clusters have very different sizes (vide Table 3.3), the classification rules, as well as the error rates, were obtained taking into account the proportion of users belonging to each group, i.e. p_r = n_r / n. If all these error rates are low, it is expected that the groups are well separated; thus the cluster structure is well defined and therefore validated. The error rates are presented in Table 3.4. For both ISPs, the clusters obtained using the partitioning around medoids lead to lower error rates. In general, all the rates are low, validating the proposed cluster structure.

In the evaluation of the classification rules, the actual group each user belongs to is known. Therefore, a matrix crossing the group each user belongs to with the group to which the user is classified, called the confusion matrix, can be very helpful.
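The gap between the apparent and leave-one-out error rates can be illustrated with a simple stand-in classifier; here we use a nearest-group-mean rule (an assumption for illustration only — the deliverable uses Fisher's rule with priors):

```python
import numpy as np

def nearest_mean_predict(Xtr, ytr, Xte):
    """Assign each row of Xte to the group with the nearest training-set mean
    (a simplified stand-in for a discriminant rule)."""
    labels = np.unique(ytr)
    means = np.array([Xtr[ytr == g].mean(axis=0) for g in labels])
    d = ((Xte[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return labels[d.argmin(axis=1)]

def apparent_error(X, y):
    """Reclassify the very observations used to build the rule."""
    return float(np.mean(nearest_mean_predict(X, y, X) != y))

def loo_error(X, y):
    """Leave-one-out: refit without observation i, then classify it."""
    errs = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        errs += nearest_mean_predict(X[keep], y[keep], X[i:i + 1])[0] != y[i]
    return errs / len(y)
```

On well-separated groups both estimates are low; on overlapping groups the leave-one-out estimate is typically the more pessimistic (and less biased) of the two, which is why the apparent rate alone is not trusted.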
Having analyzed all the confusion matrices, we can conclude that there are two major types of misclassifications: between clusters C2 and C3, and between clusters C1 and C2. The first case can be justified by comparing the average transfer rates of C2 and C3 users in each period. We can conclude that, at a 5% significance level, the average transfer rates of C2 and C3 users are equal in the periods

Table 3.4 Error rates for ISP1 and ISP2.

                       ISP1                 ISP2
  Error rate      Medoids    Ward     Medoids    Ward
  Apparent         1.31%     9.47%     3.09%     6.74%
  Jackknife        1.81%     9.99%     5.71%     9.49%
  Cross-valid.     1.78%    10.05%     3.26%     9.57%
  Double cross     2.18%    10.02%     6.63%    10.86%

Table 3.5 Averages for the aggregate of applications, clustered users, ISP1 (left) and ISP2 (right): downloaded traffic (MBytes), activity duration (hours) and transfer rate (Kbits/sec), for clusters C1, C2, C3 and for all users.

between 4:30 am and 11 am, for ISP1, and between midnight and 0:30 pm, for ISP2 (using the partition obtained by the medoids method). Since there is a period in the morning where, on average, the two clusters are equal, the classification rules have more difficulty in assigning users to the right clusters. For the second case, it can be said that, in a small period of the afternoon (between 4 pm and 6 pm for ISP1 and between 4 pm and 5 pm for ISP2, if the partition into clusters used is the one obtained by the medoids method), the transfer rates of C1 and C2 users are also not statistically different on average (at a 5% significance level).

3.6 Evaluation of clusters

The goal here is to assess whether the clusters obtained in the previous sections are also meaningful from the point of view of other traffic characteristics that were not used in the cluster analysis. In particular, we are interested in the way the per-user downloaded traffic, transfer rate and activity duration map into the clusters C1, C2 and C3. We restrict our attention to the clusters obtained using the medoids method, since this method led to the lowest misclassification rates (vide Table 3.4). As before, we first consider the statistics of the aggregate of applications. We show in Table 3.5 the average values and, in the corresponding figures, the boxplots for the three statistics. To ease the analysis, the values corresponding to the aggregate of users are repeated (line "All" in Table 3.5).
The figures suggest that there are significant differences among the clusters for each of the three statistics (downloaded traffic, transfer rate and activity duration). To verify this claim, we tested the null hypothesis that the three clusters have the same downloaded traffic (transfer rate, activity duration) location parameter, i.e. H0: μ1 = μ2 = μ3, where μi is the location parameter of cluster Ci. The alternative hypothesis is that at least one of the clusters has a location parameter different from the others. Since the data do not follow a normal distribution, a nonparametric test, the Kruskal-Wallis rank test, was used [15]. All three null hypotheses were rejected at a 5% significance level, and we are thus interested in comparing the magnitude of these differences. Accordingly, a pairwise comparison procedure based on the ranks of the observations was used. In this procedure, suitable for large samples, a Bonferroni correction is made (vide [15]) and a 10% significance level was used. The results of the pairwise tests are reported in Table 3.6. For both ISPs, the downloaded traffic, the activity duration, and the transfer rate are higher in C1 and C2

than in C3 (vide Table 3.6); thus, the C3 users are the least intensive ones. Note that, as the majority of users belong to C3, the values for the aggregate of users (line "All" in Table 3.5) are close to the C3 values. This also shows that the values for the aggregate of users hide the fact that a small number of users consumes most of the resources, and that cluster analysis can enhance traffic characterization by highlighting the behavior of different groups of users. For both ISPs, the activity duration is greater in C1 than in C2, and C1 users have particularly high activity durations, with averages greater than 18 hours in both ISPs. In ISP2 the downloaded traffic is higher in C1 than in C2, whereas in ISP1 it is similar. This can be explained by the fact that ISP2 users had no limitation on the amount of downloaded traffic at the time the measurements were carried out. Moreover, the dispersion in the downloaded traffic values is higher in C1 than in C2, which may be due to the fact that C1 accommodates the users with high utilizations irrespective of their hourly profiles (vide figures ). The transfer rate is similar in clusters C1 and C2. We now consider the most typical applications within each cluster. We present the medians of the downloaded traffic (figures ), the activity duration (figures ) and the transfer rate (figures ), for both ISP1 and ISP2. Note that these statistics were based only on the users that exhibit some activity on the applications. In order to compare the differences among clusters for each group of applications, a Kruskal-Wallis rank test was used. In each case, we test if the three clusters have the same location parameter (H0: μ1 = μ2 = μ3), at a 5% significance level. For the cases where we reject the null hypothesis, a pairwise comparison was performed, following exactly the same procedure as before. The conclusions are summarized in Table 3.7.
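The Kruskal-Wallis procedure applied here can be written down compactly. With three clusters the statistic has two degrees of freedom, for which the chi-square tail probability reduces to exp(-H/2). A minimal sketch on synthetic data (not the measured traces):

```python
import math

def kruskal_wallis(groups):
    """Kruskal-Wallis rank test.  Returns (H, p); the chi-square
    p-value exp(-H/2) is exact for df = 2, i.e. for three groups
    such as clusters C1, C2, C3."""
    data = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(data)
    ranks, ties, i = [0.0] * n, 0, 0
    while i < n:                      # assign average ranks to ties
        j = i
        while j + 1 < n and data[j + 1][0] == data[i][0]:
            j += 1
        for t in range(i, j + 1):
            ranks[t] = (i + j) / 2.0 + 1.0
        ties += (j - i + 1) ** 3 - (j - i + 1)
        i = j + 1
    rsum = [0.0] * len(groups)        # per-group rank sums
    for (v, g), r in zip(data, ranks):
        rsum[g] += r
    h = (12.0 / (n * (n + 1))
         * sum(rs * rs / len(grp) for rs, grp in zip(rsum, groups))
         - 3 * (n + 1))
    if ties:                          # standard tie correction
        h /= 1.0 - ties / float(n ** 3 - n)
    p = math.exp(-h / 2.0) if len(groups) == 3 else float("nan")
    return h, p
```

Three well-separated groups yield a large H and a small p-value, while identical groups yield H close to zero.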
Note that blank cells mean that we have accepted the null hypothesis, at a 5% significance level. The pairwise comparison tests show that C3 users download less traffic than C1 and C2 users in File Sharing, HTTP, and IRC/news in both ISPs, as well as in Games for ISP1. Similarly, the activity durations of C3 users are smaller than those of C1 and C2 users in File Sharing, HTTP, and IRC/news for both ISPs, as well as in Games, Mail, and Streaming in ISP1. In addition, C3 users have a smaller File Sharing transfer rate than C1 and C2 users for both ISPs. Thus, the comparisons of C3 users with C1 and C2 users at the applications level agree with the corresponding comparisons at the aggregate level. Note that C3 users in ISP1 have higher transfer rates than C1 and C2 users in HTTP, IRC/News and Mail. This may be explained by the fact that C3 users from ISP1 have very low File Sharing utilization (vide Figure 3.17). The download traffic statistics show that the main applications of C1 and C2 users are File Sharing and HTTP for both ISPs; see figures . While the same holds for C3 users in ISP2, C3 users in ISP1 use almost exclusively HTTP. Note that in both ISPs, C1 and C2 users download relatively more File Sharing traffic compared to HTTP traffic than C3 users. In all clusters of both ISPs, the downloaded traffic in the remaining applications was negligible. Some of the differences between ISP1 and ISP2 users can be explained by the fact that ISP2 had no limit on the amount of downloaded traffic when the measurements were carried out, strengthened by the fact that this offer was temporary. This promoted a more intense use of File Sharing applications by C2 users of ISP2. The medians of the activity duration (figures ) in File Sharing are higher for C1 users than for C2 users, for both ISPs. However, the difference in activity duration in File Sharing between C1 and C2 users is significant only in ISP2 (vide Table 3.7).
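The large-sample pairwise rank comparisons with Bonferroni correction can be sketched as follows. This is a Dunn-type comparison of mean ranks, one standard reading of the procedure in [15]; the group data are synthetic and the exact variant used by the authors may differ:

```python
import math
from statistics import NormalDist

def dunn_pairwise(groups, alpha=0.10):
    """Bonferroni-corrected pairwise comparisons of mean ranks
    (Dunn-type large-sample procedure).  Returns, for each pair of
    groups, whether their locations differ significantly."""
    data = sorted((v, g) for g, grp in enumerate(groups) for v in grp)
    n = len(data)
    rsum = [0.0] * len(groups)
    for rank, (v, g) in enumerate(data, start=1):   # no ties assumed
        rsum[g] += rank
    mean_rank = [rs / len(grp) for rs, grp in zip(rsum, groups)]
    k = len(groups)
    m = k * (k - 1) // 2                    # number of pairwise tests
    z = NormalDist().inv_cdf(1 - alpha / (2 * m))   # Bonferroni quantile
    result = {}
    for i in range(k):
        for j in range(i + 1, k):
            se = math.sqrt(n * (n + 1) / 12.0
                           * (1.0 / len(groups[i]) + 1.0 / len(groups[j])))
            result[(i, j)] = abs(mean_rank[i] - mean_rank[j]) > z * se
    return result
```

Because the critical value is inflated by the number of pairwise tests, moderate rank differences may fail to reach significance even when the overall Kruskal-Wallis test rejects, which is why blank cells appear in Tables 3.6 and 3.7.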
In fact, the activity durations of C1 and C2 users are similar, with the only additional significant difference being observed for Games at ISP1. Despite that, the activity durations of the applications aggregate are much higher for C1 users. These findings indicate that C1 users tend to use different applications in different time periods, whereas C2 users tend to use more applications

simultaneously. In ISP1 the downloaded traffic and activity duration for Games decrease from C1 to C3, whereas in ISP2 there are no marked differences. This is due to ISP1 having a larger community of on-line game players, which can be explained by the combination of two factors: first, ISP1 had been offering broadband services for a longer time than ISP2 and, second, ISP1 was providing game servers. The use of HTTP by C1 and C2 users was similar in both ISPs. Although HTTP can be used to download large files, its traffic is usually the result of frequent user interactions, which is not the case in File Sharing. Thus, it can be concluded that C1 and C2 users stay approximately the same time interacting with their computers. To conclude, the main characteristics that distinguish the clusters of the two ISPs in terms of applications are: C1 and C2 users use File Sharing applications more, at a higher transfer rate, and during longer periods than C3 users; there is a tendency for C1 users to use File Sharing applications more intensely than C2 users; C3 users have lower downloaded traffic and activity duration in HTTP than C1 and C2 users.

Table 3.6 Pairwise comparisons for the aggregate of applications, clustered users, α = 0.10.

          Downloaded traffic   Activity duration   Transfer rate
  ISP1    μ1 = μ2 > μ3         μ1 > μ2 > μ3        μ1 = μ2 > μ3
  ISP2    μ1 > μ2 > μ3         μ1 > μ2 > μ3        μ1 = μ2 > μ3

Figure 3.15 Transfer rate, clustered users.

3.7 Applications of cluster analysis

The cluster analysis of Internet users based on the hourly traffic utilization, besides allowing a deeper understanding of the traffic characteristics, can be applied in tasks that require assessment of the hourly behavior of traffic, such as resource management and tariffing. We give some examples in the remainder of this section.
Resource management - Cluster analysis of Internet users can be employed by ISPs and network operators in resource management tasks that are performed on an hourly basis. For example, when routing is allowed to change several times per day, a feature sometimes called multi-hour routing and used in telephony, ATM and MPLS networks, cluster analysis can be used to delimit the time periods when routing is kept fixed

and to identify the groups of users with similar routing. Cluster analysis can also be employed by ISPs to differentiate the service given to users based on their hourly utilization, e.g., through traffic shaping.

Figure 3.16 Activity duration, clustered users.

Table 3.7 Pairwise comparisons for the various groups of applications, clustered users, α = 0.10.

  Groups of applications    File sharing     HTTP             Games
  ISP1
    Downloaded traffic      μ1 = μ2 > μ3     μ1 = μ2 > μ3     μ1 > μ2 > μ3
    Activity duration       μ1 = μ2 > μ3     μ1 = μ2 > μ3     μ1 > μ2 > μ3
    Transfer rate           μ1 = μ2 > μ3     μ1 < μ2 < μ3
  ISP2
    Downloaded traffic      μ1 = μ2 > μ3     μ1 = μ2 > μ3
    Activity duration       μ1 > μ2 > μ3     μ1 = μ2 > μ3
    Transfer rate           μ1 = μ2 > μ3

  Groups of applications    IRC/news         Mail             Streaming
  ISP1
    Downloaded traffic      μ1 = μ2 > μ3
    Activity duration       μ1 = μ2 > μ3     μ1 = μ2 > μ3     μ1 = μ2 > μ3
    Transfer rate           μ1 = μ2 < μ3     μ1 = μ2 < μ3
  ISP2
    Downloaded traffic      μ1 = μ2 > μ3
    Activity duration       μ1 = μ2 > μ3
    Transfer rate

Tariffing - Hourly based tariffs can be used, for example, to promote Internet access in the least busy hours. Cluster analysis of Internet users allows ISPs to assess whether or not hourly based tariffs are advantageous. This may be the case when a significant number of users follows a non-flat usage profile, as in our C2 cluster. In this situation, cluster analysis can help in deciding on the number and type of hourly based tariffs, and in delimiting the time periods of each tariff. Moreover, in case an ISP offers hourly based tariffs, cluster and discriminant analysis can be used to advise users on the type of tariff they should select, based on their previous usage. Cluster analysis can also be used when ISPs need to lease circuits from access network operators for the transport of traffic aggregates, a common situation in ADSL networks.
On the one hand, access network operators have to carry out studies similar to those of ISPs in the previous example, to assess whether it is beneficial to offer circuits with hourly based tariffs and to define the characteristics of these tariffs. On the other hand, ISPs can use cluster and discriminant analysis to decide on which tariffs to choose.

Figure 3.18 Median of downloaded traffic, clustered users, ISP2.

Figure 3.20 Median of activity duration, clustered users.

3.8 Conclusions

In this paper, we have partitioned Internet users based on their hourly traffic utilization using cluster analysis. In particular, we have used two clustering methods, the partitioning around medoids and Ward's methods. The analysis resorted to two traffic traces measured at distinct Portuguese ISPs offering distinct traffic contracts, one using a CATV access network and the other an ADSL one. In spite of the ISPs' differences and the use of different clustering methods, three clusters with similar interpretations, based on hourly traffic utilization, were obtained for both ISPs. The typical user of each cluster (C1, C2 and C3) is characterized as follows: a high utilization rate in all time periods, for C1; a low utilization rate in the first half of the day and a high utilization rate in the second half, for C2; and, finally, a low utilization rate in all time periods, for C3. This cluster structure was validated by discriminant analysis, since the computed misclassification rates were low. These results show that the users of both ISPs have a typical residential profile. The three clusters were also evaluated in terms of several characteristics of the user traffic not used in the cluster analysis, such as the amount of downloaded traffic, the activity duration and the transfer rate.
The results showed that the clusters have distinctive characteristics: C1 and C2 users use File Sharing applications more, at a higher transfer rate, and during longer periods than C3 users, with a tendency for C1 users to make a more intense use than C2 ones; and C3 users have lower downloaded traffic and activity duration in HTTP than C1 and C2 users. We have also highlighted the importance of hourly Internet utilization profiles in the context of

traffic engineering and tariffing applications.

Figure 3.22 Median of transfer rate, clustered users.

Bibliography

[1] K. Claffy, H. Braun, G. Polyzos, A parameterizable methodology for Internet traffic flow profiling, IEEE Journal on Selected Areas in Communications 13 (8) (1995).
[2] J. Quittek, T. Zseby, B. Claise, S. Zander, Requirements for IP flow information export, Internet-draft, IETF (Feb. 2003).
[3] L. Kaufman, P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, Wiley, New York.
[4] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York.
[5] G. J. McLachlan, Discriminant Analysis and Statistical Pattern Recognition, Wiley, New York.
[6] J. Färber, S. Bodamer, J. Charzinski, Statistical evaluation and modelling of Internet dial-up traffic, in: Proc. SPIE Photonics East Conf. "Performance and Control of Network Systems III", Boston, MA, USA.
[7] P. Barford, A. Bestavros, A. Bradley, M. E. Crovella, Changes in web client access patterns: Characteristics and caching implications, World Wide Web, Special Issue on Characterization and Performance Evaluation 2 (1999).
[8] N. Vicari, S. Köhler, J. Charzinski, The dependence of Internet user traffic characteristics on access speed, in: Proceedings of the 25th Local Computer Networks (LCN) Conference, Tampa, USA.
[9]
[10] A. Lakhina, K. Papagiannaki, M. Crovella, C. Diot, E. Kolaczyk, N. Taft, Structural analysis of network traffic flows, in: Proceedings of Sigmetrics 2004 / Performance 2004, 2004.
[11] Z. Liu, M. Squillante, C. Xia, S. Yu, L. Zhang, Profile-based traffic characterization of commercial web sites, in: Proceedings of the 18th International Teletraffic Congress - ITC-18, 2003.

[12] A. Struyf, M. Hubert, P. J. Rousseeuw, Integrating robust clustering techniques in S-PLUS, Computational Statistics and Data Analysis 26 (1997).
[13] J. D. Jobson, Applied Multivariate Data Analysis. Volume II: Categorical and Multivariate Methods, Springer-Verlag, New York.
[14] R. A. Johnson, D. W. Wichern, Applied Multivariate Statistical Analysis, 5th Edition, Prentice-Hall, London.
[15] J. Neter, M. H. Kutner, C. J. Nachtsheim, W. Wasserman, Applied Linear Statistical Models, 4th Edition, Irwin, Chicago.

4 Web User Session Characterization via Clustering Techniques

A. Bianco, G. Mardente, M. Mellia, M. Munafò, L. Muscariello
Dipartimento di Elettronica, Politecnico di Torino

Abstract - We focus on the identification and definition of Web user-sessions, an aggregation of several TCP connections generated by the same source host, on the basis of the TCP connection opening time. The identification of a user session is non-trivial; traditional approaches rely on threshold-based mechanisms, which are very sensitive to the value assumed for the threshold and may be difficult to set correctly. By applying clustering techniques, we define a novel methodology to identify Web user-sessions without requiring an a priori definition of threshold values. We analyze the characteristics of user sessions extracted from real traces, studying the statistical properties of the identified sessions. From the study it emerges that Web user-sessions tend to be Poisson, but correlation may arise during periods of anomalous network/host functioning.

Keywords: Network Measurements

4.1 Introduction

The identification of user sessions plays an important role both in Internet traffic characterization and in the proper dimensioning of network resources. We concentrate on the identification of web sessions generated by a single user, as the WWW is the most widely used interactive service. We assume that only a single user runs a browser on each host, a reasonable assumption today given the vast majority of PC-based hosts. An informal definition of a user session can be obtained by describing the typical behavior of a user running a Web browser: an activity period, during which the user browses the Web, alternates with a silent period over which the user is not active on the Internet. This activity period, named session in this paper, may comprise several TCP connections, opened by the same host toward possibly different servers. Unfortunately, the identification of active and silent periods is not trivial.
Traditional approaches [9, 8] rely on the adoption of a threshold η: TCP connections are aggregated into the same session if the inter-arrival time between two TCP connections is smaller than the threshold value; otherwise, a new session is identified. This approach works well if the threshold value is correctly matched to the average values of the connection and session inter-arrival times; however, knowing these values in advance is unrealistic in practice. If the threshold value is not correctly matched to the users' session statistical behavior, threshold-based mechanisms are highly error-prone in session identification.
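The threshold mechanism described above can be stated in a few lines; the value of η and the arrival times below are illustrative only:

```python
def sessions_by_threshold(open_times, eta):
    """Threshold-based session identification: consecutive TCP
    connection opening times closer than eta seconds join the same
    session; a larger gap starts a new one."""
    sessions = []
    for t in sorted(open_times):
        if sessions and t - sessions[-1][-1] < eta:
            sessions[-1].append(t)     # continue the current session
        else:
            sessions.append([t])       # gap >= eta: new session
    return sessions
```

The outcome depends directly on η, which is exactly the sensitivity discussed above: a value mismatched to the connection and session inter-arrival statistics merges or splits sessions incorrectly.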

Clustering techniques [4] are used in many areas to partition a given data set into similar subsets, by defining a proper notion of similarity. Typically, several metrics over which a distance measure can be defined are associated with points (named samples in this paper) in the data set; informally, the partitioning process tries to put neighboring samples in the same subset and distant samples in different subsets. The main advantage of using such an approach is that there is no need to define any threshold value a priori. Thus, this methodology should be less error-prone than simpler threshold-based mechanisms. The contributions of this paper are the following: first, we adapted classical clustering techniques to the described scenario, a non-trivial task that requires a lot of ingenuity to optimize the performance of user-session identification algorithms both in terms of speed and precision. In [1], the proposed methodology was tested using artificially generated traces to assess the error performance of the proposed technique and to compare it with traditional threshold-based mechanisms, proving the strength of clustering approaches versus threshold-based algorithms. In this paper, the defined algorithms are run over real traffic traces to obtain statistical information on user sessions, such as the distributions of i) session duration, ii) amount of data carried over a single session, and iii) number of connections within a single session. A study of the inter-arrival times of Web user-sessions is also presented, from which it emerges that Web user-sessions tend to be Poisson, but correlation may arise during anomalous network/host functioning.

4.2 Clustering Techniques

In this section we briefly describe two clustering techniques, which will be used in the next sections as key tools.
Clustering is a general framework which can be applied to many different areas to infer some sort of similarity among subsets of data, typically in a large data repository. More details on clustering techniques can be found in [4]. Let us consider a metric space X, which we refer to as a sampling space, and a set of samples A = {x_1, ..., x_N : x_i ∈ X} which have to be clustered into K subsets: we wish to find a partition C = {C_1, ..., C_K} such that ∪_i C_i = A and C_i ∩ C_j = ∅ for i ≠ j, with K possibly being unknown a priori. The subsets in the partition are named clusters. They contain similar samples, whereas samples associated with different clusters should be dissimilar, the similarity being measured via the sample-to-sample and cluster-to-cluster distances. Depending on the dataset, an ad hoc distance definition must be provided on the basis of a trial and error procedure. From now on, we assume that: X = R^n; x_i^k represents the k-th component of sample x_i; the sample distance

  d(x_i, x_j) = sqrt( Σ_{k=1}^n (x_i^k − x_j^k)^2 )

is the classical Euclidean metric; and the distance between two clusters C_i, C_j is defined as

  d(C_i, C_j) = min_{x ∈ R(C_i), y ∈ R(C_j)} d(x, y)     (4.1)

where R(C) ⊆ C is a set of selected points representing the whole cluster C.

The hierarchical agglomerative approach

At the beginning of the procedure, each sample is associated with a different cluster, i.e., C_i = {x_i}, so the number of clusters N_c is equal to N = |A|. Then, based on the definition of a cluster-to-cluster distance, the nearest clusters are merged to form a new cluster. Iterating this step, the procedure ends when all samples belong to the same cluster C = A = {x_1, ..., x_N}, and N_c = 1. This procedure defines a merging sequence

based on the minimum distance between clusters. At each step i = 1, ..., N, the quality indicator function γ(i) is evaluated. The set A is finally clustered by selecting the number of clusters N_c = N − (i − 1) such that γ(i) − γ(i−1) is maximized. Intuitively, the quality indicator function γ(i) measures the distance between the two closest clusters at step i. A sharp increase in the value of γ(i) is an indication that the merging procedure is merging two clusters which are too far apart, thus suggesting to adopt the previous partition as the best cluster configuration. Refer to [1] for a typical behavior of γ. This approach can be rather time consuming, especially when the data set is very large, given the need to start with an initial number of clusters equal to N_c = |A|.

The partitional approach

This technique is used when the number K of final clusters is known. The procedure starts with an initial configuration comprising K clusters, selected according to some criteria; the cluster configuration procedure is then iterated. Cluster C_i is represented by a subset of samples R(C_i) in Eq. (4.1). When the cluster representative is the so-called centroid ĉ_i, defined as the mean value of the cluster samples, i.e.,

  ĉ_i^k = (1 / |C_i|) Σ_{x ∈ C_i} x^k,   k = 1, ..., n,

the algorithm is known in the literature as the K-means algorithm. At the beginning, K clusters are created, with cluster centroids positioned according to a given rule in the measure space, e.g., randomly positioned or partitioning the measured space into K + 1 equi-spaced areas. Each sample is associated with the closest cluster, according to the distance between the sample and the centroid of each cluster. When all samples are assigned to a cluster, new centroids are computed and the procedure is iterated.
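For the uni-dimensional setting used later in this chapter, the partitional (K-means) step just described can be sketched as follows; the seed centroids and stopping rule below are simplified assumptions:

```python
def kmeans_1d(samples, centroids, max_iter=1000):
    """Partitional (K-means) clustering on the real line: assign each
    sample to its nearest centroid, recompute centroids as cluster
    means, and iterate until assignments stop changing."""
    assign = [None] * len(samples)
    for _ in range(max_iter):
        new_assign = [min(range(len(centroids)),
                          key=lambda j: abs(x - centroids[j]))
                      for x in samples]
        if new_assign == assign:          # no sample moved: converged
            break
        assign = new_assign
        for j in range(len(centroids)):   # recompute non-empty clusters
            members = [x for x, a in zip(samples, assign) if a == j]
            if members:
                centroids[j] = sum(members) / len(members)
    clusters = [[x for x, a in zip(samples, assign) if a == j]
                for j in range(len(centroids))]
    return clusters, centroids
```

With well-chosen seeds the assignments stabilize after a couple of iterations, which is why the bounded-iteration stopping rule described in the text is usually harmless.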
This algorithm ends when either a prefixed number of iterations is reached, or the number of samples which move to a different cluster is negligible according to a predefined threshold.

4.3 Using Clustering Techniques on Measured Data Set

We start by giving some details about the dataset of traces that will be analyzed, in order to define the variables that will be used by the clustering algorithm.

Traffic trace description

Traffic traces were collected on the Internet access link of Politecnico di Torino, i.e., between the border router of Politecnico and the access router of GARR/B-TEN [10], the Italian and European research network. Within the Politecnico campus LAN, there are approximately 7,000 hosts; most of them are clients, but several servers are also regularly accessed from the outside. The backbone of the campus LAN is based on a switched Fast Ethernet infrastructure. It behaves as a stub Internet subnetwork, which is connected to the public Internet via a single 28 Mbps link. A strict regulation of the network facilities is imposed by means of a firewall architecture which blocks (most of) the peer-to-peer traffic. Thus, still today the majority of our Internet traffic consists of Web browsing. Details on the measurement setup and traffic characteristics can be obtained from [6, 7].

Since 2001, several traces have been regularly collected. Among the available data, we selected the time period from 04/29/2004 to 05/06/2004 (named APR.04), which comprises more than a week of continuously traced data. We performed the analysis by considering only the working period, i.e., traffic from 8AM to 8PM, Monday to Friday. The APR.04 dataset includes two working days (the 5th and 6th of May) during which terminals in our campus were attacked and infected by the so-called Sasser.B worm [11]. The worm infection does not affect our measurement campaign, because the spreading of the worm itself is not based on the HTTP protocol. Nonetheless, the drawbacks of the network and host malfunctions, and the subsequent requirement to download worm removal tools and operating system patches, have quite a large impact on the properties of the user sessions, as we will show later in this paper. Bidirectional packet-level traces have been collected using tcpdump [5], and then processed by Tstat [7] to obtain a TCP-level trace. Tstat is an open source tool developed at the Politecnico di Torino for the collection and statistical analysis of TCP/IP traffic. In addition to standard and recognized performance figures, Tstat infers the TCP connection status from traces. For the purpose of this paper, we used Tstat to track:

- f_id = (C_IP, S_IP, C_TCP, S_TCP): the 4-tuple identifying the flow, i.e., IP addresses and TCP port numbers of client and server (when the IP protocol field is set to TCP);

- t: the flow start time, identified by the time stamp of the first client SYN message;

- t_d: the time instant at which the last segment carrying data is observed (either from the client or from the server);

- t_e: the flow end time, identified by the time instant at which the tear-down procedure is terminated;
- B_c and B_s: the byte-wise amount of data sent from the client and server respectively (excluding retransmissions).

Clustering definition

The first fundamental choice regards the n statistical variables used to define the metric space X = R^n to be used in the clustering analysis; this also implies selecting the metric space that best fits our problem. The typical and easiest approach is to let the clustering algorithm run over a very large number of statistical variables, typically including the vast majority of the available data (in our example, potential statistical variables may be the IP source address, TCP destination port, TCP flow starting and ending times, etc.). However, after several trials we found that an accurate pre-filtering of the data available in the traces both improves algorithmic speed and provides more accurate results. Recalling that we wish to identify a Web user-session, i.e., a group of TCP connections corresponding to the activity period of a user running a Web browser, we used the opening time t of a TCP connection as the only statistical variable for the clustering process; thus n = 1 and X = R. Before running the clustering algorithm on this variable, traces are preprocessed. We assume that a user is identified by its client IP address C_IP, and only connections having TCP server port S_TCP equal to 80 (HTTP protocol) are considered to be Web connections. To consider only significant IPs (users), we selected the most active hosts, i.e., the top 1500 campus LAN IP addresses with respect to the number of generated TCP connections. Each user trace is preprocessed according to the following steps: i) data are partitioned day by day, ii) only working hours of working days are considered to obtain a set of statistically homogeneous

samples, and iii) opening times of two subsequent flows separated by more than half an hour are a priori considered as belonging to two independent data sets. Thus, for a given host IP, the set of samples

  A(IP) = {t : C_IP = IP, S_TCP = 80, t_{i+1} − t_i < 1800 s}

represents the opening times of TCP connections within a given time-frame. The intuition behind this is to allow the clustering algorithm to concentrate only on TCP connections created by a single user. After the metric space definition, we need to select a cluster analysis method. To take advantage of hierarchical and partitional clustering methods while avoiding their drawbacks, we use a mix of them. Thus, for each A(IP), the following 3-step algorithm is run to identify the sessions relative to a given user:

1. an initial smart clustering is obtained by selecting a number of clusters large enough but smaller than the number of samples of A;

2. a hierarchical agglomerative algorithm is used to aggregate the clusters and to obtain a good estimate of their final number N_c;

3. a partitional algorithm is used to obtain a fine definition of the N_c clusters.

Initial clustering selection

We use a partitional method with K clusters, where K is large enough but significantly smaller than the total number of samples (a study of the impact of K is presented in [1]). To efficiently position the K centroids at the first step, in our uni-dimensional metric space, we evaluate the distance between any two adjacent samples t_i, t_{i+1}. According to the distance metric d(t_i, t_{i+1}) = |t_i − t_{i+1}|, we take the farthest (K − 1) couples and determine K intervals. Let t_{i,inf}, t_{i,sup} be the inferior and superior bounds of interval i; the centroid position of each cluster is set to ĉ_i = (t_{i,sup} + t_{i,inf})/2, and the partitional method is run for up to 1000 iterations: in this way, we define K initial clusters.
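The gap-based placement of the K initial centroids can be sketched as follows: sort the opening times, cut at the (K − 1) largest inter-arrival gaps, and put one centroid at the midpoint of each resulting interval. A hedged sketch with illustrative times:

```python
def seed_centroids(times, k):
    """Place k initial centroids by cutting the sorted opening times
    at the (k-1) largest gaps between adjacent samples; each centroid
    is the midpoint of the bounds of its interval."""
    ts = sorted(times)
    # indices of gaps between adjacent samples, largest gap first
    gaps = sorted(range(len(ts) - 1),
                  key=lambda i: ts[i + 1] - ts[i], reverse=True)
    cuts = sorted(gaps[:k - 1])            # indices where intervals end
    centroids, start = [], 0
    for c in cuts + [len(ts) - 1]:
        centroids.append((ts[start] + ts[c]) / 2.0)
        start = c + 1
    return centroids
```

Seeding at the largest gaps places the initial centroids near the likely session boundaries, which is what lets the subsequent partitional step converge in few iterations.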
At this step, we represent each cluster C with a small subset R(C) of samples; |R(C)| ≤ 2 is enough in our case, since the metric space is R. Possible choices for R(C) are: (i) the cluster centroid, which gives the name centroid method (or K-means) to the procedure; (ii) the g-th and (100 − g)-th percentiles, which yields the so-called single linkage procedure when g = 0.

The hierarchical agglomerative procedure

A hierarchical method is iteratively run using only the representative samples {R(C)} to evaluate the distance between two clusters, starting with the K initial clusters, therefore enormously reducing the number of steps of the hierarchical method. At each iteration, the hierarchical procedure merges the two closest clusters; distances among clusters are then recomputed. After K − 1 iterations, the process ends. To decide which is the optimal number of clusters among those determined in the iterative process, we define the clustering quality indicator function γ(i). Denote the j-th cluster at step i as C_j^(i); at each step, we evaluate the function

  γ(i) = d_min^(i) / d̄_min^(i)

where d_min^(i) = min_{j, k ≠ j} d(C_j^(i), C_k^(i)), with d(C_j^(i), C_k^(i)) defined according to Eq. (4.1), and

  d̄_min^(i) = (1 / (i − 1)) Σ_{l=1}^{i−1} d_min^(l).

A sharp increase in the value of γ(i) is an indication that the merging procedure is artificially merging two clusters which are too far apart. The optimal number of clusters N_c is the one that corresponds to the sharpest increase in γ(i) (the first one is chosen in case of multiple occurrences):

  N_c = N − (i* − 1),   i* = argmax_{i>1} ( γ(i) − γ(i−1) ).

Final clustering creation

Finally, a partitional clustering procedure is run over the original dataset, which includes all samples, using the optimal number of clusters N_c determined so far, and the same procedures adopted in the first step (either the centroid, g-percentile or single linkage methods). Then a fixed number of iterations of the partitional approach is run, to permit a final refinement of the clustering definition. This third phase is not strictly required, given that at the end of the hierarchical procedure a partition is already available. However, it produces clusters of real points instead of centroids (which may not coincide with any data point). In addition, the computational cost of this phase is almost negligible compared to the previous one. The complete clustering algorithm has been implemented by a number of Perl scripts, and by using the R Project for Statistical Computing tool [2], which efficiently implements many clustering algorithms.

4.4 Performance Analysis of Trace Data Set

In this section we summarize the results obtained by running the clustering algorithm for each user separately, according to the trace pre-processing algorithm previously described.

Web user-session characterization

In Figs. 4.1, 4.2, and 4.3 the major characteristics of the Web user-sessions identified during APR.04 are shown.
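The hierarchical step and the γ-based choice of N_c can be sketched for the one-dimensional case. This toy version uses single linkage on sorted samples and normalizes each merge distance by the mean of the preceding ones, which is our reading of the γ definition above; the input times are illustrative:

```python
def choose_num_clusters(samples):
    """Agglomerative single-linkage merging on the real line.  At each
    step the two closest clusters are merged and gamma(i) =
    d_min(i) / mean(d_min(1..i-1)) is recorded; the partition kept is
    the one just before the sharpest increase of gamma."""
    clusters = [[t] for t in sorted(samples)]
    d_hist, gammas = [], []
    while len(clusters) > 1:
        # on the line the closest pair of clusters is always adjacent
        i = min(range(len(clusters) - 1),
                key=lambda j: clusters[j + 1][0] - clusters[j][-1])
        d = clusters[i + 1][0] - clusters[i][-1]
        gammas.append(d / (sum(d_hist) / len(d_hist)) if d_hist else 1.0)
        d_hist.append(d)
        clusters[i:i + 2] = [clusters[i] + clusters[i + 1]]
    # sharpest increase of gamma -> keep the partition before that merge
    jumps = [gammas[i] - gammas[i - 1] for i in range(1, len(gammas))]
    best = jumps.index(max(jumps)) + 1      # 1-based step index minus one
    return len(samples) - best              # clusters left before it
```

On the illustrative samples below, the sharpest rise of γ occurs when two groups of close-by opening times would be merged across a large gap, so the procedure keeps three clusters.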
We plot Probability Density Functions (PDFs) using a linear/log scale: since the support is quite large, the PDF is represented only for a limited range of values, and the complementary Cumulative Distribution Function (CDF) is shown using a log/log scale in an inset to highlight the characteristics of the distribution's tail. Fig. 4.1 shows the probability density function of the duration of the identified sessions. The two different distributions shown in Fig. 4.1 represent the effect of different definitions of the Web user-session. Considering TCP connections belonging to the same session, we define the session duration as the time between the first SYN segment of the first connection and: (i) for the protocol session, the last segment observed during the last connection tear-down (solid line); (ii) for the user session, the last segment carrying payload of the last connection (dotted line). Therefore, using the notation introduced in Sec. 4.3, for a given session/cluster C,

we can define

    T_e = max_{f_id∈C} e(f_id) − min_{f_id∈C} s(f_id)
    T_d = max_{f_id∈C} d(f_id) − min_{f_id∈C} s(f_id)

in which T_e and T_d are the protocol and user session durations, respectively.

Figure 4.1: PDF (and complementary CDF in the inset) of the session length.

Figure 4.2: PDF (and complementary CDF in the inset) of the client-to-server and server-to-client data sent in each session.

The first definition is relevant, for example, when either web server or client resources are considered, since TCP connections must be managed until the tear-down procedure has ended. The second definition, on the contrary, is relevant to model the user behavior, since users are satisfied when all data have been sent/received. Clearly, the distribution of the protocol session duration shows longer durations, but also biased peaks at 20s, 60s and 3600s, corresponding to application layer timers imposed by Web browsers or HTTP servers, which trigger the connection tear-down procedure after idle periods. For example, due to HTTP protocol settings, servers may wait for a timer to expire (usually set to 20 seconds) before closing the connection; similarly, HTTP 1.1 and Persistent-HTTP 1.0 protocols use an additional timer, usually set to a multiple of 60 seconds. Therefore, for the protocol session duration, the bias induced by those timers is evident. The same bias disappears when the session duration is evaluated considering the user session duration. Notice that the session duration distributions have a quite large support, showing a great variability in users' behavior. Indeed, there is a percentage of very short sessions (less than a few seconds), but also users whose

activities last for several hours. Indeed, the tail of the complementary CDF shown in the inset underlines the heavy-tailed distribution of the session duration, which can be a possible cause of Long Range Dependence (LRD) at both the connection and packet layers.

Figure 4.3: PDF (and complementary CDF in the inset) of the number of TCP connections in each session.

Figure 4.4: CDF of the mean user session inter-arrival and mean user flow inter-arrival.

In Fig. 4.2 we have the PDF of the volume of data exchanged during each session in the client-server (dashed lines) and the server-client (solid lines) directions: for a given session/cluster C,

    D_c = Σ_{f_id∈C} B_c(f_id)  and  D_s = Σ_{f_id∈C} B_s(f_id)

define the session data volumes. As expected, much less data are transferred from clients to servers and the distribution tail is shorter; still, the number of sessions transferring more than 10 Mbytes in the server-client direction is not negligible. Several peculiar peaks are present in the initial part of both PDFs. Investigating further, we discovered that those peaks are due to the identification of sessions which are not generated by users, but instead by automatic reload procedures imposed by the Web page being browsed. For example, news or online trading services impose periodic updates of pages, which cause the client to automatically reload them. If the automatic reload is triggered periodically, the clustering algorithm tends to identify a separate session for each connection, thus causing a bias in the session data distribution. This is clearly evident also from Fig. 4.3, which reports the number of TCP connections per session C. Indeed, more than 25% of sessions account for only one connection.
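The per-session quantities T_e, T_d, D_c and D_s defined above are simple aggregates of per-flow records. A minimal sketch, where the record fields s, e, d, b_c, b_s are hypothetical names for the first-SYN time, last tear-down time, last payload time, and per-direction byte counts of a flow:

```python
from dataclasses import dataclass

@dataclass
class FlowRecord:
    # hypothetical per-flow fields (names are ours):
    s: float   # time of the first SYN segment
    e: float   # time of the last tear-down segment
    d: float   # time of the last segment carrying payload
    b_c: int   # bytes sent client-to-server
    b_s: int   # bytes sent server-to-client

def session_metrics(cluster):
    """Per-session metrics for a cluster (list) of flow records."""
    t_e = max(f.e for f in cluster) - min(f.s for f in cluster)  # protocol duration
    t_d = max(f.d for f in cluster) - min(f.s for f in cluster)  # user duration
    d_c = sum(f.b_c for f in cluster)   # client-to-server volume
    d_s = sum(f.b_s for f in cluster)   # server-to-client volume
    return t_e, t_d, d_c, d_s
```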
Moreover, most of the identified sessions are built of very few connections (about 50% of 4 connections or fewer), indicating also that i) the client is usually able to obtain all the required data over few TCP connections, ii) the number of external objects required is limited, and iii) the time spent by the users over one web page is large enough to define each web

transaction as a session. The complementary distribution of the number of TCP connections per session, reported in the inset of Fig. 4.3, shows a linear trend, highlighting that the distribution has a heavy tail. This could be one of the possible causes of LRD properties at the flow level, as already known.

Figure 4.5: Fit of user session inter-arrivals to a Weibull distribution: normal day in the top plot (Weibull(a=0.93, b=1.67)) and during a worm attack in the bottom plot (Weibull(a=0.67, b=1.46)).

Fig. 4.4 reports the CDF of the average user flow inter-arrival T_arr and the average user session OFF period T_off, i.e., for each user, we evaluated the average T_arr and T_off, and then derived the corresponding distribution over users. As expected, T_arr assumes smaller values than T_off, but the two distributions overlap, as underlined by the two vertical lines. This shows the variability in user behavior. Notice that the overlap of the two distributions would not have appeared if a threshold methodology had been applied, whatever the adopted threshold. Moreover, the variability of the mean T_off and mean T_arr makes it very difficult to select an appropriate value for η. This confirms the limits of threshold-based approaches and the need for clustering approaches that automatically adapt to different scenarios.

Statistical properties of the session arrival process

Finally, the statistical properties of the aggregate session inter-arrival times are investigated. We obtained a trace of session arrivals by multiplexing all sessions identified for each IP source address (user) during the same time period, i.e., by considering the Web user-session arrival process at the access router of our institution. Fig.
4.5 reports the Q-Q plot of the aggregate session inter-arrival distribution with respect to the best-fit Weibull distribution over the same data set. This distribution has been recognized as a good model for TCP inter-arrival times [3]. The parameters a and b of the Weibull distribution are the so-called shape and scale parameters. When the shape parameter is set to 1, the Weibull distribution degenerates into an exponential distribution. When it is smaller than 1, the tail of the distribution tends to be heavier, while for values of a larger than 1 the distribution tends to assume a bell shape. The classical maximum likelihood method was used to obtain the best a and b parameters for the fitting procedure. The upper plot of Fig. 4.5 refers to a typical measurement day and shows a good match of the data samples with the indicated Weibull distribution. Since a = 0.93, it also shows that the distribution is very close to an exponential one, hinting that the arrival process of Web user-sessions tends to be Poisson, as pointed out by previous studies [8]. The Q-Q plot also shows that the tail of the distribution is in general less heavy than the tail of the fitted Weibull distribution. There is therefore a bias toward small values of

session inter-arrivals, while large inter-arrivals are rare. The Q-Q plot of the same data with respect to the best-fit exponential distribution showed almost the same behavior. On the contrary, the lower plot in Fig. 4.5, which refers to the samples collected on May 5th, 2004, shows a distribution that cannot be fitted by a Weibull distribution. The best fit is obtained by a shape parameter a = 0.67, indicating a heavy-tailed distribution. Moreover, the best-fit Weibull distribution deviates from the measured dataset at large quantiles, showing that the tail of the real distribution is heavier than that of the best-fit distribution. The anomalous behavior is due to the spreading of the Sasser.B worm [11] in our institution, as previously described, and to the subsequent download of the needed OS patches and anti-virus updates, which introduced correlation in the arrival process of sessions, driven as it was by the (large) download times of files.

Figure 4.6: Autocorrelation function of the session inter-arrival process: normal day in the top plot and during a worm attack in the bottom plot.

Fig. 4.6 finally reports the autocorrelation function evaluated on the session inter-arrivals during a typical day in the top plot, while the bottom plot refers to the autocorrelation estimated during the anomalous day of the worm attack. The top plot confirms that the Poisson assumption holds for a normal day, the autocorrelation function being almost negligible except at the origin. On the contrary, the autocorrelation among session inter-arrivals can be quite relevant on days during which user activity is driven by external factors, as shown by the bottom plot, which refers to the day of the worm infection.

4.5 Conclusions

We propose an application of clustering techniques to a large set of real Internet traffic traces to identify Web user-sessions.
The proposed clustering method has been applied to measured trace data sets to study the characteristics of Web user-sessions, showing that the session arrival process tends to be Poisson. Moreover, the analysis of the identified user-sessions shows a wide range of behaviors that cannot be captured by any threshold-based method. Finally, we think that the clustering algorithms proposed in this paper can be helpful in studying other traffic properties. We intend to apply them to different types of traffic, and to extend the study of the statistical properties of the identified session arrival process.

Acknowledgment

We would like to thank the people from Ce.S.I.T., who manage the networking facilities of our Campus and allowed us to access the real data traces.

Bibliography

[1] A. Bianco, G. Mardente, M. Mellia, M. Munafò, L. Muscariello, "Exploiting Clustering Techniques for Web User-session Inference", IPS-MOME 2005, Warsaw, Poland, March 2005.
[2] P. Dalgaard, Introductory Statistics with R, Springer, 2002.
[3] A. Feldmann, "Characteristics of TCP Connection Arrivals", in Self-Similar Network Traffic and Performance Evaluation, Wiley, New York, 2000.
[4] D. J. Hand, H. Mannila, P. Smyth, Principles of Data Mining, MIT Press, 2001.
[5] S. McCanne, C. Leres, and V. Jacobson, Tcpdump.
[6] M. Mellia, A. Carpani, R. Lo Cigno, "Measuring IP and TCP behavior on Edge Nodes", IEEE Globecom 2002, Taipei, TW, November 2002.
[7] M. Mellia, R. Lo Cigno, F. Neri, Tstat web page.
[8] C. Nuzman, I. Saniee, W. Sweldens, A. Weiss, "A compound model for TCP connection arrivals, with applications to LAN and WAN", Computer Networks, Special Issue on Long-Range Dependent Traffic, vol. 40, no. 3, October 2002.
[9] V. Paxson and S. Floyd, "Wide-Area Traffic: The Failure of Poisson Modeling", IEEE/ACM Transactions on Networking, vol. 3, no. 3, pp. 226-244, June 1995.
[10] The GARR Network topology.
[11] What You Should Know About the Sasser Worm, May 2004.


5 Passive Identification and Analysis of TCP Anomalies

Marco Mellia, Michela Meo, Luca Muscariello, Dario Rossi
Dipartimento di Elettronica, Politecnico di Torino, Torino, Italy

Abstract

In this paper we focus on passive measurements of TCP traffic, the main component of today's traffic. We propose a heuristic technique for the classification of the anomalies that may occur during the lifetime of a TCP flow, such as out-of-sequence and duplicate segments. Since TCP is a closed-loop protocol that infers network conditions from losses and reacts accordingly, the possibility of carefully distinguishing the causes of anomalies in TCP traffic is very appealing and may be instrumental to a deep understanding of TCP behavior in real environments and to protocol engineering. We apply the proposed heuristic to traffic traces collected both at network edges and at backbone links. By studying the statistical properties of TCP anomalies, we find that their aggregate exhibits Long Range Dependence phenomena, but that the anomalies suffered by single long-lived flows are, on the contrary, uncorrelated. Interestingly, no dependence on the actual link load is observed.

5.1 Introduction

In the last ten years, the interest in data collection, measurement and analysis to characterize Internet traffic behavior has increased steadily. Indeed, acknowledging the failure of traditional modeling paradigms, the research community focused on the analysis of traffic characteristics with the twofold objective of i) understanding the dynamics of traffic and its impact on the network elements, and ii) finding simple, yet satisfactory, models for the design and planning of packet-switched data networks, akin to the Erlang teletraffic theory for telephone networks. Focusing on passive traffic characterization, we face the task of measuring Internet traffic, which is particularly daunting for a number of reasons.
First, traffic analysis is made very difficult by the strong correlations in both space and time, which are due to the closed-loop behavior of TCP, to the TCP/IP client-server communication paradigm, and to the fact that the highly variable quality provided to the end-user influences her/his behavior. Second, the complexity of the involved protocols, and of TCP in particular, is such that a number of phenomena can be studied only if a deep knowledge of the protocol details is exploited. Finally, some of the traffic dynamics can be understood only if the forward and backward directions of flows are jointly analyzed, which is especially true for the detection of erratic flow behavior. Starting from [1], where a simple but efficient classification algorithm for out-of-sequence TCP segments is presented, this work aims

at identifying and analyzing a larger subset of phenomena, including, e.g., network duplicates, unneeded retransmissions, and flow control mechanisms triggered or suffered by TCP flows. The proposed classification technique has been applied to a set of real traces collected at different measurement points in the network. Results from both the network core and the network edge show that the proposed classification allows us to inspect a plethora of interesting phenomena: e.g., the limited impact of the daily load variation on the occurrence of anomalous events, the surprisingly large amount of network reordering, and the correlation among different phenomena, to name a few.

Figure 5.1: Evolution of a TCP flow and the RTT estimation process (a); anomalous events classification flow-chart (b).

5.2 Methodology

The methodology adopted in this paper furthers the approach followed in [1], where the authors presented a simple classification algorithm for out-of-sequence TCP packets and applied it to analyze the Sprint backbone traffic; the algorithm was able to discriminate among out-of-sequence segments due to i) necessary or unnecessary packet retransmissions by the TCP sender, ii) network duplicates, or iii) network reordering.
Similarly, rather than using active probe traffic, we adopt a passive measurement technique and, building on the same idea, we complete the classification rule to include other event types. Besides, we not only distinguish among the other possible kinds of out-of-sequence or duplicate packets, but we also focus on the cause that actually triggered the segment retransmission by the sender. As previously mentioned, we record packets in both directions of a TCP flow: thus, both data segments and ACKs are analyzed; moreover, both the IP and TCP layers are exposed, so that the TCP sender status can be tracked. Figure 5.1a) sketches the evolution of a TCP flow: connection setup, data transfer, and tear-down phases are highlighted. It is worth pointing out that the measurement point (or sniffer) can be located anywhere in the path between the client and the server; furthermore, as shown in the picture, the sniffer can be close to the client when measurements are taken at the network edge (as happens, e.g., on our campus LAN), whereas this bias vanishes when the sniffer is deployed in the core of the network (as happens with backbone traces). Since we rely on the observation of both DATA and ACK segments to track TCP flow evolution, a problem may arise in case asymmetric routing forces DATA and ACK segments to follow two

different paths. In such cases, we ignore the half-flows. A TCP flow starts when the first SYN from the client is observed, and ends either after the tear-down sequence (the FIN/ACK or RST messages), or when no segment is observed for an amount of time larger than a given threshold¹. By tracking both flow directions, the sniffer correlates the sequence number of TCP data segments to the ACK number of backward receiver acknowledgments and classifies the segments as:

In-sequence: if the sender initial sequence number of the current segment corresponds to the expected one;

Duplicate: if the data carried by the segment have already been observed before;

Out-of-sequence: if the sender initial sequence number is not the expected one, and the data carried by the segment have never been observed before.

The last two classes are symptomatic of some anomaly that occurred during the data transfer: to discriminate such anomalous events further, we devise a fine-grained heuristic classification and implement it in Tstat [3], which we then use to analyze the collected data. The decision process followed to classify such anomalous events makes use of the following variables:

RTT_min: the minimum RTT estimated since the flow started;

RT: the Recovery Time, i.e., the time elapsed between the observation of the current anomalous segment and the reception of the segment with the largest sequence number seen so far;

T: the inverted-packet gap, i.e., the difference between the observation times of the current anomalous segment and of the previously received segment;

RTO: the sender Retransmission Timer value, estimated according to [4];

DupAck: the number of duplicate ACKs observed on the reverse path.

Anomalies Classification Heuristic

Following the flow diagram of Figure 5.1b, we describe the classification heuristic.
Given an anomalous segment, the process initially checks whether the segment has already been seen, by comparing its TCP sequence number with those carried by the segments observed so far: thus, the current segment can be classified as either duplicate or out-of-sequence. In the first case, following the left branch of the decision process, the IP identifier field of the current packet is compared with the same field of the original packet: in case of equality, the anomalous packet is classified as a Network Duplicate (Net. Dup.). Network duplicates may stem from malfunctioning apparatuses, mis-configured networks (e.g., Ethernet LANs with a diameter larger than the collision domain size), or, finally, from unnecessary retransmissions at the link layer (e.g., when a MAC layer ACK is lost in a Wireless LAN). Differently from [1], we classify as network duplicates all packets with the same IP identifier, regardless of T: indeed, there is no reason to exclude that a network duplicate may be observed at any time, and there is no relation between the RTT and the time at which a network can produce duplicate packets.

¹ The tear-down sequence of the flows under analysis may never be observed: to avoid memory starvation we use a 15-minute timer, which is sufficiently large [2] to avoid discarding on-going flows.

When the IP identifiers are different, the TCP sender may have performed a retransmission. If all the bytes carried by the segment have already been acknowledged, then the segment had successfully reached the receiver, and therefore this is an unneeded retransmission, which has to be discriminated further. Indeed, the window probing TCP flow control mechanism performs false retransmissions: this is done to force the immediate transmission of an ACK, so as to probe whether the receiver window RWND, which was announced to be zero in a previous ACK, is now larger than zero. Therefore, we classify as retransmissions due to Flow Control (FC) the segments for which the following three conditions hold: i) the sequence number is equal to the expected sequence number decreased by one, ii) the segment payload size is of zero length, and iii) the last announced RWND in the reverse ACK flow was equal to zero. This is a new possible cause of unneeded retransmissions which was previously neglected in [1]. Otherwise, the unnecessary retransmission could have been triggered because either a Retransmission Timer (RTO) has fired, or the fast retransmit mechanism has been triggered (i.e., three or more duplicate ACKs have been received for the segment preceding the retransmitted one). We identify three situations: i) if the recovery time is larger than the retransmission timer (RT > RTO), the segment is classified as an Unneeded Retransmission by RTO (Un. RTO); ii) if 3 duplicate ACKs have been observed, the segment is classified as an Unneeded Retransmission by Fast Retransmit (Un. FR); iii) otherwise, if none of the previous conditions holds, we do not know how to classify this segment and we are forced to label it as Unknown (Unk.).
Unneeded retransmissions may be due to a misbehaving source, to a wrong estimation of the RTO at the sender side, or, finally, to ACK losses on the reverse path; however, distinguishing among these causes is impossible by means of passive measurements. Let us now consider the case of segments that have already been seen but have not been ACKed yet: this is possibly the case of a retransmission following a packet loss. Indeed, given the recovery mechanism adopted by TCP, a retransmission can occur only after at least one RTT, since duplicate ACKs have to traverse the reverse path and trigger the Fast Retransmit mechanism. When the recovery time is smaller than RTT_min, the anomalous segment can only be classified as Unknown²; otherwise, it is possible to distinguish between a Retransmission by Fast Retransmit (FR) and a Retransmission by RTO (RTO), adopting the same criteria previously used for unneeded retransmissions. Retransmissions of already observed segments may be due to data segment losses on the path from the measurement point to the receiver, or to ACK segments delayed or lost before the measurement point. Eventually, let us consider the right branch of the decision process, which refers to out-of-sequence anomalous segments. In this case, the classification criterion is simpler: indeed, out-of-sequence segments can be caused either by the retransmission of lost segments, or by network reordering. Since retransmissions can only occur if the recovery time RT is larger than RTT_min, by double-checking the number of observed duplicate ACKs and by comparing the recovery time with the estimated RTO, we can distinguish retransmissions triggered by RTO or by FR, or we can classify the segment as an unknown anomaly. On the contrary, if RT is smaller than RTT_min, then a Network Reordering (Reord.) is identified if the inverted-packet gap T is smaller than RTT_min.
Network reordering can be due either to load balancing on parallel paths, to route changes, or to parallel switching architectures which do not ensure in-sequence delivery of packets [5].

² In [1] the authors use the average RTT; however, each RTT being possibly different from the average RTT, and in particular smaller, we believe that the use of the average RTT forces a uselessly large amount of unknown classifications.
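The decision process described above can be condensed into code. The following Python sketch mirrors the flow-chart of Figure 5.1b; the event fields are names of our own, and the three flow-control conditions are collapsed into a single window_probing flag standing for the three tests listed earlier:

```python
def classify_anomaly(ev):
    """Sketch of the anomaly classification of Figure 5.1b.

    `ev` is a hypothetical record exposing, per anomalous segment:
    already_seen, same_ip_id, already_acked, window_probing,
    rt (recovery time), t (inverted-packet gap), rtt_min, rto, dup_acks.
    """
    if ev.already_seen:                       # duplicate branch (left)
        if ev.same_ip_id:
            return "Net. Dup."                # network duplicate
        if ev.already_acked:                  # unneeded retransmission
            if ev.window_probing:
                return "FC"                   # flow-control window probe
            if ev.rt > ev.rto:
                return "Un. RTO"
            if ev.dup_acks >= 3:
                return "Un. FR"
            return "Unknown"
        if ev.rt < ev.rtt_min:                # too early for a retransmission
            return "Unknown"
        if ev.rt > ev.rto:
            return "RTO"
        if ev.dup_acks >= 3:
            return "FR"
        return "Unknown"
    # out-of-sequence branch (right)
    if ev.rt >= ev.rtt_min:                   # plausible retransmission
        if ev.rt > ev.rto:
            return "RTO"
        if ev.dup_acks >= 3:
            return "FR"
        return "Unknown"
    if ev.t < ev.rtt_min:
        return "Reord."                       # network reordering
    return "Unknown"
```

Note how every ambiguous case falls through to "Unknown", matching the conservative spirit of the heuristic.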

Figure 5.2: CDFs of the ratio between the inverted packet gap T and the estimated RTT_min (outset), and of the ratio between the recovery time RT and the actual RTO estimate (inset).

RTT_min and RTO Estimation

The algorithm described so far needs to estimate some thresholds from the packet trace itself, whose values may not be very accurate, or even valid, when classifying an anomalous event. Indeed, all the measurements related to the RTT estimation are particularly critical, since they are used to determine the RTT_min and RTO estimates. The RTT measurement is updated during the flow evolution according to the moving average estimator standardized in [4]: given a new measurement m of the RTT, we update the estimate of the average RTT by means of a low-pass filter

    E[RTT]_new = (1 − α) E[RTT]_old + α m

where α ∈ (0, 1) is equal to 1/8. Since the measurement point is co-located neither at the transmitter nor at the receiver, the measure of the RTT is not directly available. Therefore, to get an estimate of the RTT values, we build on the original proposal of [1]. In particular, denote by RTT(s, d) the half-path RTT sample, which represents the delay measured at the measurement point between an observed data segment flowing from source s to destination d and the corresponding ACK on the reverse path, and denote by RTT(d, s) the delay between the ACK and the following segment, as shown in Figure 5.1a. From the measurements of RTT(s, d) and RTT(d, s) it is possible to derive an estimate of the total round trip time as RTT = RTT(s, d) + RTT(d, s); given the linearity of the expectation operator E[·], the estimate of the average round trip time E[RTT] = E[RTT(s, d)] + E[RTT(d, s)] is unbiased. Furthermore, the RTT standard deviation, std(RTT), can be estimated following the same approach, given that in our scenario RTT(s, d) and RTT(d, s) are independent measurements.
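A minimal sketch of the estimation just described (class and method names are ours; the gain α = 1/8 for the mean and β = 1/4 for the mean deviation follow [4], and the dispersion of the total RTT is approximated by summing the two half-path estimates):

```python
class RttEstimator:
    """Passive RTT/RTO estimation from half-path samples (illustrative sketch).

    Half-path samples RTT(s,d) and RTT(d,s) are fed separately; the total
    mean RTT is the sum of the two half-path means (linearity of E[.]),
    and the two half-path dispersions are likewise summed.
    """
    def __init__(self, alpha=1/8, beta=1/4):
        self.alpha, self.beta = alpha, beta
        self.mean = [None, None]        # E[RTT(s,d)], E[RTT(d,s)]
        self.dev = [0.0, 0.0]           # mean deviation, per half path
        self.min = [float('inf')] * 2   # min RTT(s,d), min RTT(d,s)

    def update(self, half, m):
        """half = 0 for an RTT(s,d) sample, 1 for an RTT(d,s) sample."""
        if self.mean[half] is None:     # first sample initializes the mean
            self.mean[half] = m
        else:                           # low-pass filters as in [4]
            self.dev[half] = ((1 - self.beta) * self.dev[half]
                              + self.beta * abs(m - self.mean[half]))
            self.mean[half] = (1 - self.alpha) * self.mean[half] + self.alpha * m
        self.min[half] = min(self.min[half], m)

    @property
    def rtt_min(self):
        # conservative estimate: RTT_min = min RTT(s,d) + min RTT(d,s)
        return sum(self.min)

    @property
    def rto(self):
        # RTO = max(1, E[RTT] + 4 std(RTT)), per [4]
        return max(1.0, sum(self.mean) + 4 * sum(self.dev))
```

The rtt_min property reproduces the conservative bound discussed next: summing the per-direction minima never exceeds the true minimum total RTT.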
Finally, given E[RTT] and std(RTT), it is possible to estimate both the sender retransmission timer, as in [4], and the minimum RTT:

    RTO = max(1, E[RTT] + 4 std(RTT))
    RTT_min = min(RTT(s, d)) + min(RTT(d, s))

In general, since RTT_min ≤ min(RTT), the latter equation gives a conservative estimate of the real minimum RTT, leading to a conservative algorithm that increases the number of anomalies classified as unknown rather than risking mis-classifications. Besides, we ran a set of measurements to inspect the RTT_min and RTO values, in order to double-check the impact of the estimation described so far. The former is involved in the classification of the network reordering anomalies that may occur when identifying two out-of-sequence segments separated by a time

gap smaller than RTT_min. The outer plot of Figure 5.2 reports the typical³ Cumulative Distribution Function (CDF) of the ratio between the inverted packet gap T and the value of RTT_min, when considering anomalous events classified as network reordering only. It is easy to gather that T is much smaller than RTT_min, which suggests that i) the initial choice of RTT_min is appropriate, and that ii) the conservative estimation of RTT_min does not affect the classification. The inset of Figure 5.2 reports the CDF of the ratio between the actual recovery time RT and the corresponding estimate of the RTO, relative to retransmission-by-RTO events only. The CDF confirms that RT > RTO holds, which is a clear indication that the estimation of the RTO is not critical. Moreover, it can be noted that in about 50% of the cases the recovery time is more than 5 times larger than the estimated RTO. This apparently counterintuitive result is due to the RTO back-off mechanism implemented in TCP (but not considered by the heuristic), which doubles the RTO value at every retransmission of the same segment. Not considering the back-off mechanism during the classification leads to a robust and conservative approach.

Figure 5.3: Amount of incoming (bottom y-axis) and outgoing (top y-axis) anomalies at different timescales (hourly, daily, weekly, monthly).

Figure 5.4: Anomalies percentage normalized over the total traffic (top) and anomalies breakdown (bottom): RTO (68.4%), Unknown (19.0%), Reord. (10.5%), NetDup (0.84%), Un.RTO (0.82%), FR (0.39%), FC (0.04%), Un.FR (0.01%).
5.3 Measurement Results

In this section we present results obtained from traffic traces collected in different networks. The first one is a packet-level trace gathered from the Abilene Internet backbone, publicly available on the NLANR Web site [6]. The trace, which is 4 hours long, was collected on June 1st, 2004 at the OC192c Packet-over-SONET link from Internet2's Indianapolis node toward Kansas City.

³ Since the behavior is qualitatively similar across all the analyzed traces, we report the results referring to a single dataset, namely the GARR 05 traces.

The second measurement point is located on the GARR backbone network [7], the nation-wide ISP for research and educational centers, that

we permanently monitor using Tstat [3]. The monitored link is an OC48 Packet-over-SONET link from the Milano-1 to the Milano-2 node, and the statistics reported in the following are relative to the month of August. The third measurement point is located at the sole egress router of the Politecnico campus LAN, which thus behaves as an Internet stub. It is an OC4 AAL5 ATM link from Politecnico to the GARR POP in Torino. The traces show different degrees of routing symmetry: while the POLITO and GARR traces reflect almost 100% traffic symmetry, only 46% of the ABILENE traffic is symmetric; this is of particular relevance, since Tstat relies on the observation of both DATA and ACK segments to track TCP flow evolution. All the gathered flow-level and aggregated statistics, related to anomalous traffic as well as to many other measurements that we do not report here, are publicly available on the Tstat Web page [8]. In the following, we will arbitrarily use the In and Out tags to discriminate between the traffic directions of the GARR and Abilene traces: since the measurement point is located in the backbone, neither an inner nor an outer network region actually exists.

Statistical Analysis of Backbone Traffic

Figure 5.3 depicts the time evolution of the volume of anomalous segments classified by the proposed heuristic applied to the GARR traffic; in order to make the plots readable, only the main causes of anomalies, i.e., RTO and network reordering, are explicitly reported, whereas the other kinds of anomalies are aggregated. Measurements are evaluated with different granularities over different timescales: each point corresponds to a 5-minute window in the hourly plot, a 30-minute window in both the daily and weekly plots, and a 2-hour interval in the monthly plot. Both link directions are displayed in a single plot: positive values refer to the Out direction, negative values to the In direction.
The plot of the entire month clearly shows a night-and-day trend, which is mainly driven by the link load. The trend is less evident at finer time scales, allowing us to identify stationary periods during which to perform advanced statistical characterization. The same monthly dataset is used in Figure 5.4, which reports in the top plot the volume of anomalies normalized over the total link traffic, and in the bottom plot the anomaly breakdown, i.e., the amount of anomalies of each class normalized over the total number of anomalous events. Labels report, for each kind of anomaly, the average occurrence over the whole period obtained by aggregating both traffic directions. RTO events account for the main part of anomalous events, almost 70%. This is not surprising, since RTO events correspond to the most likely reaction of TCP to segment losses: indeed, fast retransmit is not very frequent, since only long TCP flows, which represent a small portion of the total traffic, can actually use this feature [9]. Packet reordering is also present in a significant amount. The average total amount of anomalous segments is about 5% considering the Out traffic direction, and about 8% considering the In direction, with peaks ranging up to 20%. Beyond the precise partitioning of such events, it is interesting to notice that the periodic trend exhibited by the absolute amount of anomalies in the monthly plot of Figure 5.3 almost completely vanishes when considering the normalized amount shown in Figure 5.4. Thus, while the total amount depends on the traffic load, the percentage of anomalies seems quite independent of the load. This is an interesting finding that clashes with the usual assumption that the probability of anomalies and segment losses increases with load.
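The two normalizations behind Figure 5.4 can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the function name and the per-window counts are hypothetical, chosen to match the averages quoted above (about 5% anomalous segments, about 70% of which are RTOs).

```python
def normalize_anomalies(total_segments, anomalies_by_class):
    """anomalies_by_class maps anomaly class -> count in one time window."""
    total_anomalies = sum(anomalies_by_class.values())
    frac_of_traffic = total_anomalies / total_segments            # top plot
    breakdown = {k: v / total_anomalies                           # bottom plot
                 for k, v in anomalies_by_class.items()}
    return frac_of_traffic, breakdown

frac, breakdown = normalize_anomalies(
    100_000, {"RTO": 3500, "Reordering": 1000, "Other": 500})
print(frac)              # 0.05: anomalies are 5% of the window's traffic
print(breakdown["RTO"])  # 0.7: RTOs are 70% of the anomalous events
```

The point of the second normalization is that the breakdown is insensitive to the absolute anomaly volume, which is what makes its load-independence observable.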
Very similar results were also obtained for the Abilene trace, confirming that not only the percentage of anomalies over the total amount of traffic, but also the anomaly breakdown, remains steadily the same even when the absolute amount of anomalies over the link load exhibits large fluctuations. A possible explanation of this counterintuitive result is the greedy nature of TCP, which pushes the instantaneous offered load to 1, thereby producing packet losses. Indeed, even if the average offered load is

much smaller during off-peak periods, congestion may arise on bottleneck links (i.e., possibly at the network access) when there are active TCP flows. Another interesting statistic we can look into is the correlation structure of the time series counting the number of anomalies per time unit. Though this could be presented in many different ways, we decided to use the Hurst parameter H as a very compact index to identify the presence of Long Range Dependence (LRD) in the series. The estimation of H has been carried out whenever it was possible to analyze stationary time windows. In particular, this holds true for the two backbone links when considering periods of a few hours, while it was not verified on the campus LAN traces, due to the much smaller number of events. The estimate of the Hurst parameter H, measured through the wavelet-based estimator [10], is depicted in Figure 5.5 for different kinds of anomalies, measurement points and traffic directions.

Figure 5.5 Hurst parameter estimate for different anomalies and traces (series ALL, RTO, Reordering and FR, for Abilene In/Out and GARR In/Out).

Roughly speaking, H = 0.5 hints at an uncorrelated process, while increasing values of H ∈ [0.5, 1] are symptomatic of increasing correlation in the process, with values around H = 0.7 usually considered typical of LRD. In all cases, H is considerably higher than 0.5, meaning that these series are LRD. Moreover, H for the Reordering and FR series is smaller than for the RTO series, for all traces and traffic directions.
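The wavelet-based estimation of H can be sketched with a plain Haar transform. This is a minimal numpy-only illustration in the spirit of the log-scale diagram of [10], not the estimator actually used in the study; the function name and the scale range j1..j2 are our own choices.

```python
import numpy as np

def haar_logscale_hurst(x, j1=3, j2=8):
    """Estimate H from the slope of log2(mean detail energy) vs. scale j,
    computed with a plain Haar DWT. For an LRD series the detail energy
    behaves like 2**(j * (2H - 1)), so H = (slope + 1) / 2."""
    a = np.asarray(x, dtype=float)
    scales, energies = [], []
    for j in range(1, j2 + 1):
        n = len(a) // 2 * 2
        if n < 2:
            break
        d = (a[0:n:2] - a[1:n:2]) / np.sqrt(2.0)   # detail coefficients
        a = (a[0:n:2] + a[1:n:2]) / np.sqrt(2.0)   # approximation
        if j >= j1:
            scales.append(j)
            energies.append(np.log2(np.mean(d ** 2)))
    slope = np.polyfit(scales, energies, 1)[0]
    return (slope + 1.0) / 2.0

# White noise is uncorrelated, so the estimate should be close to H = 0.5;
# an LRD anomaly-count series would instead give H well above 0.5.
rng = np.random.default_rng(0)
print(round(haar_logscale_hurst(rng.standard_normal(2 ** 14)), 2))
```

The linear fit over the coarse scales is exactly what "reading the slope of the log-scale diagram" means in the text.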
Finally, we stress that by further measuring H on different subsets of the original trace and by using different time units (from 10 ms to a few seconds), we gathered sufficient information to confirm the presence of LRD in the anomalous traffic.

5.3.2 Analysis of a Single Elephant Connection

Other interesting observations can be drawn by focusing on a specific network path, where it can be assumed that the network setup is homogeneous: in particular, we considered the longest TCP flows measured on our campus LAN. The longest elephant that we ever observed corresponds to a large file download from a host in Canada, which lasted about ten hours (from 9:00am to 8:00pm of Wednesday, May 5th). The flow throughput varied from 60 KBytes/s up to 160 KBytes/s, while the measured RTT varied in the [300, 550] ms range. Figure 5.6 plots a set of time series regarding the flow anomalies, where each series corresponds to the occurrence of events of one of the anomaly types identified by the proposed heuristic. First of all, we point out that, while each individual time series is non-stationary, the aggregated series is stationary. Then, observe that the largest numbers of anomalies refer to RTO and FR, and this holds for the whole flow lifetime. It is worth noticing that, typically, while the number of RTOs increases, the number of FRs decreases (and vice versa): high numbers of RTOs probably correspond to high levels of congestion, driving the TCP congestion control to shrink the sender congestion window, so that the chance of receiving three dup-ACKs is small. Similar behaviors were observed for other very long elephants, not shown here for the sake of brevity.

Figure 5.6 Time plot of anomalous events for the selected elephant flow (one series per class: RTO, Un.RTO, Un.FR, FR, Unk. and Net.Dup.; x-axis: time of day from 9:00 to 20:00).

No network reordering has been identified during the whole flow lifetime. In our experiments we observed that some flows do not suffer from network reordering at all, while others present several network reordering events: this behavior suggests that network reordering is path dependent, and a similar reasoning applies to network duplicate events. Some of the anomalies could not be classified other than as Unknown. Investigating further the reasons for such misidentification, we found that the majority is due to segments whose recovery time was smaller than the estimated RTO (so that they could not be classified as retransmissions due to RTO firing), but larger than RTT_min (thus preventing their identification as network reordering), and for which, eventually, the number of duplicate ACKs was smaller than 3 (thus preventing their identification as retransmissions due to Fast Retransmit). We suppose that either the sender was triggering FR with fewer than 3 duplicate ACKs, or that the RTO value estimated by the sender was smaller than the RTO estimated at the measurement point. Notice again that, as already mentioned, our classification strictly follows the TCP standards, avoiding mis-classifications at the price of an increase in unidentified events. We now focus on the fluctuations of the rate and the statistical properties of the counting processes of i) the received packets per time unit and ii) the anomalous events per time unit, when the time unit is 100 ms long.
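The decision rules just described can be summarized, in simplified form, as follows. The real heuristic tracks per-flow state (sequence numbers, RTT and RTO estimates) that we elide here; the function name and the numeric inputs are purely illustrative, with times in seconds.

```python
def classify_event(recovery_time, rto_est, rtt_min, dup_acks):
    """Simplified classification of an out-of-sequence/duplicate segment,
    following the standard-conformant rules described in the text."""
    if recovery_time >= rto_est:
        return "RTO"          # recovery only after the retransmission timer fired
    if dup_acks >= 3:
        return "FR"           # fast retransmit: at least three duplicate ACKs
    if recovery_time < rtt_min:
        return "Reordering"   # too fast to be any kind of retransmission
    return "Unknown"          # RTT_min <= recovery < RTO and fewer than 3 dupacks

# The ambiguous region discussed above ends up as Unknown:
print(classify_event(recovery_time=0.3, rto_est=1.0, rtt_min=0.1, dup_acks=1))  # Unknown
```

The final branch is precisely the region that produces the Unknown events: strictly standard-conformant thresholds leave it unclassifiable rather than guessed.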
We analyze the scaling behavior of the data through Multi Resolution Analysis (MRA) [11], with the code provided in [12]; according to the MRA, both of these processes are stationary. Moreover, from our measurements, stationarity usually holds for long flows in general. Figure 5.7 depicts the log-scale diagram for both of the considered time series. The log-scale diagram can be seen as an elaborate way to evaluate the power spectrum, but it has good properties when dealing with LRD random signals and, in particular, it delivers very good estimates of scaling exponents [10]. The diagram reports the estimate of the correlation of the considered process: a horizontal line is symptomatic of an uncorrelated process, while linear slopes at large scales have to be read as signs of high correlation. Therefore, from Figure 5.7 we observe that the anomalous events that TCP handles are not correlated. On the contrary, as already observed for traffic aggregates [10], we found the traffic generated by the elephant flow to be highly correlated for the whole flow lifetime. The slope of this curve leads to an estimate of the parameter H of about 0.74; the time scale of the RTT is visible as the bump around 400 ms on the left-hand side of the curve, which represents an intrinsic time periodicity of the flow dynamics related to the RTT. This is not surprising, since the RTT has also been recognized as a natural time cutoff scale i) below which traffic is highly irregular and uncorrelated and ii) above which it is long-term correlated. These results confirm similar findings in the literature: as already observed in [13] and [14], TCP exhibits an ON/OFF kind of behavior which is responsible for burstiness over a limited range of time scales; in other words, TCP generates bursts of traffic clocked with the RTT.

Figure 5.7 Energy plots for the time series of the number of segments and of the number of out-of-order events within the selected elephant flow (log2(energy) vs. log2(scale), with scales ranging from milliseconds to hours).

5.4 Conclusions and Discussion

In this paper, we have defined, identified and studied TCP anomalies, i.e., segments received either out-of-sequence or duplicated. We have extended and refined the heuristic originally proposed in [1], and applied it to real traffic traces collected at both network edges and core links. By studying the statistical properties of these phenomena, we can state a few guidelines to understand the stochastic properties of TCP traffic. First, we observed that, though the absolute amount of anomalies is highly dependent on the link load, both its relative amount over the total traffic and the anomaly breakdown are almost independent of the current load. This is a quite interesting finding that clashes with the usual assumption that the probability of anomalies and segment losses increases with load. As a possible justification, we point out the greedy nature of the TCP congestion control algorithm, which pushes the instantaneous offered traffic to the link capacity even though the average link utilization is much smaller, thus causing anomalies to arise. Second, we have shown that the aggregated anomalies at any Internet link are correlated, as clearly shown by the estimation of the Hurst parameter H. Nevertheless, when considering single flows, we observed that the anomaly arrival process is, on the contrary, uncorrelated, even though the general packet arrival process generated by the same flow is known to be highly correlated, as we experimentally verified. The correlation exhibited by the aggregated anomalies is intuitively tied to the duration of their causes (i.e., congestion periods, path reordering, etc.)
which may last over large timescales. When considering a single flow, the sub-sampling process induced by packet arrivals (i.e., the process observed by the flow's packets crossing a particular node/link/path) and the congestion control mechanisms adopted by TCP tend to decorrelate the process of anomalous segments observed by the flow itself. Indeed, recalling the large number of very short flows and the small number of long-lived flows, it is intuitive that only a very small fraction of the anomalous segments observed over a single link may be due to a single flow. Thus, we conclude that, even though there is high correlation in the aggregate process, its impact on single flows is marginal.

5.5 Acknowledgments

This work was funded by the European Community EuroNGI Network of Excellence. We would like to thank GARR for allowing us to monitor some of their backbone links, the Politecnico di Torino network facilities (CESIT) for their patience in supporting the monitoring of our campus LAN, and finally the people from NLANR for providing the research community with their invaluable traces.

Bibliography

[1] S. Jaiswal, G. Iannaccone, C. Diot, J. Kurose and D. Towsley, Measurement and Classification of Out-of-Sequence Packets in a Tier-1 IP Backbone, IEEE Infocom, San Francisco, CA, March.
[2] G. Iannaccone, C. Diot, I. Graham and N. McKeown, Monitoring very high speed links, ACM Internet Measurement Workshop, San Francisco, CA, November.
[3] M. Mellia, R. Lo Cigno and F. Neri, Measuring IP and TCP behavior on edge nodes with Tstat, Computer Networks, Vol. 47, No. 1, pp. 1-21, Jan.
[4] V. Paxson and M. Allman, Computing TCP's Retransmission Timer, RFC 2988, Nov.
[5] J.C.R. Bennett, C.C. Partridge and N. Shectman, Packet reordering is not pathological network behavior, IEEE/ACM Transactions on Networking, Vol. 7, No. 6, Dec.
[6] AbileneIII packet trace.
[7] GARR Network.
[8] Tstat Web Page.
[9] M. Mellia, M. Meo and C. Casetti, TCP Smart Framing: a Segmentation Algorithm to Reduce TCP Latency, IEEE/ACM Transactions on Networking, Vol. 13, No. 2, Apr.
[10] P. Abry and D. Veitch, Wavelet Analysis of Long-Range Dependent Traffic, IEEE Transactions on Information Theory, Vol. 44, No. 1, pp. 2-15, Jan.
[11] D. Veitch and P. Abry, A statistical test for the time constancy of Scaling Exponents, IEEE Transactions on Signal Processing, Vol. 49, No. 1, Oct.
[12] Darryl Veitch Home Page, Wavelet Estimation Tools, ~darryl.
[13] D. R. Figueiredo, B. Liu, V. Misra and D. F. Towsley, On the autocorrelation structure of TCP traffic, Computer Networks, Vol. 40, No. 3.
[14] H. Jiang and C. Dovrolis, Why is the Internet traffic bursty in short (sub-RTT) time scales?, ACM SIGMETRICS 2005, Banff, CA, Jun.


6 Micro- and Macroscopic Analysis of RTT Variability in GPRS and UMTS Networks

Jorma Kilpi, VTT Information Technology, P.O.Box 12022, FIN VTT, Finland.
Pasi Lassila, Helsinki University of Technology, P.O.Box 3000, FIN HUT, Finland.

Abstract

We study data from a passive traffic measurement representing a 30 hour IP traffic trace from a Finnish operator's GPRS/UMTS network. Of specific interest is the variability of the Round Trip Times (RTTs) of TCP flows in the network. The RTTs are analysed at the micro- and macroscopic level. The microscopic level involves detailed analysis of individual flows. We are able to infer various issues affecting the RTT process of a flow, such as changes in the rate of the radio channel and the impact of simultaneous TCP flows at the mobile device. Also, Lomb periodograms are used to detect periodic behavior in the applications. In the macroscopic analysis we search for dominating RTT values, first by Energy Function Plots and then by exploring the spikes of empirical PDFs of aggregate RTTs. A comparison of empirical CDFs shows a significant shift in quantiles after a considerable traffic increase, which is found to be due to bandwidth sharing at individual mobile devices. We further investigate the impact of bandwidth sharing at the mobile device and show how an increased number of simultaneous TCP flows seriously affects the RTTs observed by a given flow, both in GPRS and in UMTS.

Keywords: traffic measurements, RTT variability, GPRS and UMTS networks, Lomb periodogram, Haar wavelets

6.1 Introduction

Traffic measurements are an important means of assessing the performance of real networks. Using measurements one obtains direct evidence of the performance and, by examining the measured data more closely, it is also possible to detect bottlenecks and/or anomalous behavior in the networks. Technology development is presently especially rapid in the field of wireless networks.
Hence, measurements, for example in GPRS and UMTS networks, are needed to understand the issues affecting the performance of TCP/IP traffic in such networks. In this paper we study the data from a passive, non-intrusive traffic capture measurement representing a 30

hour IP traffic trace measured at one GGSN node of a GPRS/UMTS network in a major Finnish operator's production network. TCP is the most often used transport protocol in IP networks, and the objective of the measurement is to analyze certain properties of TCP flows where one end point of the flow is in the GPRS/UMTS network and the other in the public Internet. The measurement point was the monitoring port of the Gi interface of one of the GGSN nodes. Of specific interest in the data is the variability of the Round Trip Times (RTTs) observed by the TCP flows. Large variability of the RTTs may cause problems, for example to TCP's retransmission mechanism in the form of spurious timeouts (see, e.g., [1] and [2]). As opposed to an active measurement approach, we have a passive measurement of the actual network traffic (other related work is discussed in Section 1.1). The objective of this study is to thoroughly analyse the statistical properties of the RTT process as observed by the TCP flows. To this end, we perform a detailed analysis of the RTTs of selected individual flows (microscopic analysis) and of the RTTs of the aggregate traffic (macroscopic analysis). Furthermore, as the measurement data contains TCP flows from both GPRS users and UMTS users, we are able to compare the properties of the RTT process for both technologies. Briefly, our measurement analysis methodology and our main observations are the following. First, the quality of the time stamps on the captured IP packets is validated, and we demonstrate that the measurement equipment or the setup itself has not impaired the data. The summary data is obtained by using the tstat program [3], which gives information about TCP connections/flows. We use the summary data to get information about the raw data and select TCP flows that seem interesting for our microscopic-level analysis.
The idea is to select long flows, so that it is possible to detect qualitative changes in the RTT process over a relatively long period of time. In the microscopic analysis, we are able to address issues such as changes in the sending rate of a TCP flow and the impact of simultaneous TCP connections at the same mobile. Furthermore, by using the Lomb periodogram, one can identify whether the transmission patterns behind the RTT process are periodic. In the macroscopic analysis we search for dominating RTT values, first by Energy Function Plots, based on the Haar wavelet, and then by exploring spikes of empirical PDFs of aggregate RTTs. A comparison of empirical CDFs shows a significant shift in quantiles after a considerable traffic increase. We also show that the change of tariff has a clear impact on user behavior: during the flat rate period users open more simultaneous TCP connections than during the day time, indicating increased web activity. We argue that the increase in the quantiles is due to a higher degree of bandwidth sharing at the mobile devices during the flat rate period. This issue is investigated in detail and, indeed, we show that an increased degree of bandwidth sharing results in larger (even excessive) maximum RTTs. Thus, the impact of simultaneous TCP connections on the maximum RTTs a flow observes leads us to believe that it is the presence of bandwidth sharing at the mobile device that causes the high RTT values that some flows observe. The paper is organized as follows. Section 2 describes our measurement setup and methodology. The microscopic analysis is given in Section 3 and the macroscopic analysis in Section 4. Conclusions are provided in Section 5.

6.1.1 Related Work and Our Contribution

RTT variation in fixed IP networks with a large data set has been considered in [4]. The authors defined a connection as experiencing high variability if the interquartile range (IQR) was strictly larger than the difference between the median and the minimum.
However, the problem with this statistic is that large RTT values are not taken into account. If, for example, every 10th RTT is 10 seconds while the others are about 1 second, a user notices it but the statistic does not. Thus, we do not present results for this approach in our study.
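The statistic from [4] and the failure mode just described can be reproduced in a few lines; this is an illustrative sketch, and `high_variability` is our own name for the test.

```python
import numpy as np

def high_variability(rtts):
    """The IQR-based test of [4]: flag a connection when IQR > median - minimum."""
    q25, q50, q75 = np.percentile(rtts, [25, 50, 75])
    return (q75 - q25) > (q50 - np.min(rtts))

# Every 10th RTT is 10 s, the rest are 1 s: clearly bad for the user, yet
# IQR = 0 and median - minimum = 0, so the connection is not flagged.
rtts = ([1.0] * 9 + [10.0]) * 20
print(bool(high_variability(rtts)))  # False
```

The spikes sit entirely in the upper tail, which the quartile-based statistic never sees; this is exactly why we discard the approach here.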

Figure 6.1 Left: The SNMP data from MIBs shows that the overall daily traffic profile is significantly affected by the subgroup of users who pay flat rate between 18:00-06:00. Right: Division into subsamples according to TCP connection arrival time.

Sample Number | TCP Connection Arrival Time | Flat Rate Population Present?
1             | 10:30-18:00                 | No
2             | 18:00-21:00                 | Yes
3             | 21:00-24:00                 | Yes
4             | 00:00-06:00                 | Yes
5             | 06:00-16:30                 | No

A few measurements of RTT variation have recently been made in mobile networks. For example, a GPRS measurement made in Austria is reported in [5], and a study of spurious timeout events in TCP/GPRS is reported in [2]. In particular, [2] gives an algorithm for detecting spurious events. In addition, a comprehensive GPRS measurement from seven different European and Asian countries is reported in [6]. That paper not only gives results on the properties of RTTs, but also analyzes TCP connection establishment performance and the loss and throughput characteristics of the TCP flows. Our study complements these studies by presenting a GPRS traffic measurement made in Finland. However, our focus is quite different from that of the earlier studies. We study how individual flows observe the RTT process, as well as the aggregate RTT process of all flows, whereas previous studies only consider the aggregate RTT process. We also experiment with other potentially useful statistical tools not used in the previous GPRS studies (Lomb periodograms and wavelets). The application of wavelets was motivated by the findings in [7], where wavelets, or more precisely energy function plots (EFPs), were applied to the aggregate traffic signal to detect network performance problems indicated by significant changes in the dominating RTT values. Also, the use of a statistical tool called the Lomb periodogram (LP) [8] was inspired by the results in [9].
In that study, LPs were used to detect the embedded timing properties of a data flow from a trace that contained only the traffic of another data flow; this latter flow had merely shared some resources with the former flow. Finally, our observation of the noticeable impact of simultaneous TCP connections from a given mobile (self-congestion) has not appeared in the previous studies.

6.2 Measurement Setup and Classification Methodology

The measurement point was the monitoring port of the Gi interface of one of the GGSN elements of the operator's backbone. The measurement was planned such that a statistically representative sample of the traffic was obtained. We carefully verified the accuracy of the time stamps in the obtained packet data; the accuracy proved to be ±50µs, which is more than sufficient. The flow-level statistics were obtained by using tstat [3]. A detailed description of the measurement setup and the time stamp accuracy validation is available in [10].

Traffic profile: The trace was obtained from a 30 hour measurement on November 10, 2004 and consists of traffic from both GPRS and piloting UMTS users. Some of the subscribers have only volume-based charging, but some portion of the subscribers also had a flat rate between 18:00 and 06:00 during working days and on weekends. The effect of this flat rate population is clearly significant, see Figure 6.1 (left). The arrival time of a completely observed TCP connection is determined by the time stamp of the handshaking SYN packet sent by the client. Downloads are first classified according to TCP connection arrival times into 5 distinct groups, as shown in Figure 6.1 (right). Samples 1 and 5 were statistically similar to each other; samples 2 and 3 were also similar to each other. Thus, to analyse the impact of the tariff change in the macroscopic analysis (Section 4), we compare only Sample 1 versus Sample 2. Analysis of Sample 4 is not contained in this paper.

Semi-RTT and RTT Count: In this study, we focus on the properties of RTTs as experienced by TCP flows. As we are not measuring directly at the sender/receiver, the RTT process cannot be fully observed. Instead, the notion of semi-RTT is used, similarly as in [5]. Semi-RTT refers to the difference between the time stamps of a (downstream) TCP/IP packet carrying data payload and of the corresponding ACK packet. This is explained in more detail in [10]. For each completely observed TCP connection, tstat automatically provides some (semi-)RTT information: minimum, maximum, mean, standard deviation and the number of times a data segment and the corresponding ACK have been observed. This latter value, denoted here by RTT Count, is directly related to the size of the flow. Moreover, to avoid that out-of-sequence and duplicate segments or ACKs provide unreliable samples that affect the RTT measurement itself, only samples referring to in-sequence segments are considered. This RTT information is given in both down- and upstream directions. However, since the built-in RTT statistics of tstat were not sufficient for this study, the code of the tstat program was modified so that we obtained every valid RTT value of every completely observed connection, together with enough information to associate each individual RTT value with the connection it belongs to.
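A heavily simplified sketch of semi-RTT extraction (not tstat's actual implementation): pair each downstream data segment with the first later ACK whose acknowledgment number covers it, and take the timestamp difference. The retransmission and out-of-sequence filtering that tstat performs is omitted, and all names and numbers are illustrative.

```python
def semi_rtts(data_pkts, ack_pkts):
    """data_pkts: time-ordered (timestamp, seq, payload_len) downstream tuples.
    ack_pkts: time-ordered (timestamp, ack_no) upstream tuples.
    Returns one semi-RTT sample per matched data segment (seconds)."""
    samples = []
    for ts_d, seq, length in data_pkts:
        for ts_a, ack in ack_pkts:
            # First ACK after the segment whose ack number covers its last byte.
            if ts_a > ts_d and ack >= seq + length:
                samples.append(round(ts_a - ts_d, 6))
                break
    return samples

data = [(0.00, 1000, 500), (0.01, 1500, 500)]   # two back-to-back segments
acks = [(0.40, 1500), (0.42, 2000)]             # their cumulative ACKs
print(semi_rtts(data, acks))  # [0.4, 0.41]
```

The length of the returned list for the downstream direction corresponds to the RTT Count used below as the flow-size measure.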
Further division of the data is based on the concept of the (downstream) RTT Count, which is the number of valid RTT samples per downloading part of a TCP connection. Since the effect of the packet size is significant, the RTT value obtained from the three-way handshake is not included in our analysis. RTT Count is a measure of the size of the downstream flow of a TCP connection. It is related both to the duration of a download and to the amount of unique bytes that are downloaded. The same concept was used in [5] as a measure of size. We are not very interested in the duration of a flow; the word long is used here to mean long in the sense that the RTT Count is large. More detailed statistical analysis is made only for those flows that were longer than 6 RTT counts, with the RTT from the handshaking excluded. Analysis showed that using the value 6 as a discriminating value divides each sample into approximately 25% long flows and 75% short flows. The tariff change does not have any effect on that. More details can be found in [10].

6.3 Microscopic analysis of long TCP flows

A rather complete analysis of a number of individual long flows was made in order to understand how the radio interface affects the RTTs. There were also other questions, such as: does the considerable increase of traffic due to the tariff change show up somehow in the RTTs? A question that concerns the operator is whether the flat rate population disturbs ordinary subscribers who pay per volume. The flat rate population is not directly related to the operator but through a household appliance selling company. In general, the idea is to find out what can be deduced from changes in the RTT during a flow. Some specifically chosen examples are presented here in order to give the reader a flavour of what is behind the macroscopic level analysis that will be presented later. Example flows were chosen because the time

series

    {(t_i, RTT(t_i)) : i = 1, ..., RTT Count}    (6.1)

had some distinctive qualitative features. In (6.1), t_i is the time stamp of an ACK packet and RTT(t_i) is the RTT value calculated from that ACK packet. Additionally, we focused only on the very longest flows, to be able to detect clear changes over a comparably long time interval. In the examples below, all flows used the SACK option and were several hours in length. The amount of transferred data was several MBs for all flows.

6.3.1 Individual TCP flows in GSM/GPRS

Flow 1: The first flow was chosen because we wanted to understand what caused the improvement in RTT values after about half an hour, as shown in Figure 6.2 (left). Another point to notice with Flow 1 is that the rise in traffic due to the tariff change at 18:00 might have some effect on this particular flow.

Figure 6.2 Time series of Flow 1 (left) and its LPs (middle and right; spectral power vs. time period for the beginning, middle and end of the flow).

The improvement in RTTs after about half an hour was easily seen to be due to a change from a large to a smaller segment size, from 1380 to 536 bytes. What caused this change? Flow 1 is most probably an example of a streaming application (audio), even though it was behind TCP port 80. This streaming application probably had a 3 second play-out buffer. Indeed, analysis of Flow 1 showed a very regular pattern not typical of TCP. Analysis of the receiving rate of packets revealed that the source is receiving data at a constant rate of 18.2 kbit/s. Careful inspection of Flow 1 showed that a data burst of fixed size was sent every 3rd second on average. It also takes about 3 seconds before every segment of one burst has been acknowledged by the mobile host.
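The Lomb periodogram analysis of such unevenly spaced (t_i, RTT(t_i)) samples can be sketched with SciPy's `lombscargle`, which handles irregular sampling directly, with no resampling. Here a synthetic signal with a 3 s period stands in for Flow 1's burst pattern; all data in the example are made up.

```python
import numpy as np
from scipy.signal import lombscargle

# Synthetic stand-in for Flow 1: a 3 s periodic signal observed at
# irregular times, as RTT samples clocked by ACK arrivals would be.
rng = np.random.default_rng(1)
t = np.sort(rng.uniform(0.0, 300.0, 600))                  # uneven sample times (s)
y = np.sin(2 * np.pi * t / 3.0) + 0.3 * rng.standard_normal(t.size)

periods = np.linspace(1.0, 10.0, 500)                      # candidate periods (s)
power = lombscargle(t, y - y.mean(), 2 * np.pi / periods)  # angular frequencies
print(round(float(periods[np.argmax(power)]), 1))          # peak near the 3 s burst period
```

Note that `lombscargle` expects angular frequencies, hence the 2π/period conversion; the location of the dominant spike is what is read off the LP plots in Figure 6.2.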
However, after the change in the segment size, the data burst could be sent slightly better within the 3 second interval, in the sense that the sender received the ACKs slightly earlier. Thus, the streaming application probably forced the change in the segment size. Finally, another important thing to notice was that there were essentially no simultaneous TCP connections from the same mobile. To further analyse the flow, the LP has been computed for the beginning, middle and end of the flow, see Figure 6.2 (middle and right). From the middle plot it is clear that in the beginning the burst period is slightly above 3 seconds. After the change in the segment size, the burst period is somewhat below 3 seconds. Examining Figure 6.2 (left), one can observe that after the tariff change at 18:00 the level of the RTTs rises slightly again (probably due to increased traffic in the network). Computing the LP for the end of the flow shows that the burst period increases to slightly above 3 seconds again, see Figure 6.2 (right). Flow 2: Figure 6.3 (left) shows the remarkable property of Flow 2, namely that RTTs at a level of seconds seem to be normal for this connection. Also, the minimum is less than 1 second, hence a better level could be possible. Unlike for Flow 1, the use of the LP showed that there was no significant regular periodic

structure for Flow 2, see Figure 6.3 (right). There it can be seen that Flow 2 has a small spike in the LP at slightly below 1 second, but it is negligible compared to the pronounced spike of Flow 1 at 3 seconds.

Figure 6.3 Time series of Flow 2 (left) and its LP (right; Lomb periodograms of Flows 1 and 2, spectral power vs. time period).

Moreover, as Figure 6.4 (left) indicates, there was a change in the downlink capacity from 40.2 kb/s to 26.8 kb/s, which corresponds to the loss of one PDCH in the downlink when using the CS-2 channel coding scheme. Finally, as Figure 6.4 (right) shows, the change in the link capacity for Flow 2 was due to self-congestion caused by other simultaneous TCP connections from the same mobile host. The large RTT values were probably due to low terminal capacity.

Flow 3: Again, use of the LP showed no regular periodicity. However, Flow 3 had some peculiar regular intervals of bad RTTs. In this case self-congestion is also the explanation, since the user was simultaneously running an application (MSNP) which regularly, at approximately 8 minute intervals, initiated a new flow on port 80 downloading a file of size kb. These downloads occur exactly at the bad intervals visible in Figure 6.5 (left).

6.3.2 Individual TCP flows in UMTS

During the measurement there were only piloting UMTS users in the network (commercial usage was initiated after the measurement). However, the absolute number of observed TCP connections that could be associated with UMTS was sufficient to make a comparative analysis against GPRS from our data set as well. Due to the measurement setup, exact identification of whether a flow originated from GPRS or UMTS was not possible. However, RTTs in UMTS are at best about one order of magnitude smaller than in GPRS, see also [5].
Thus, UMTS flows were identified by using a threshold on the flow's minimum RTT, i.e., flows with a minimum RTT smaller than the threshold were considered UMTS flows. The threshold value we used was 0.4 ms (in [5] a minimum value of ms was observed for GPRS flows). Hence, a classification error would occur if a GPRS flow observed an RTT smaller than 0.4 ms, but we expect this to be rare.

Figure 6.4 Rate change for Flow 2 (left) and simultaneous TCP flows (right).

Flows 4 and 5 originated from the same mobile. They are both shown in Figure 6.5 (right). The level of the RTTs of Flow 4 increases when Flow 5 starts. There were also several other long flows running simultaneously from the same mobile. Almost all RTT variations are explained by these other flows.

Figure 6.5 Time series of Flow 3 (left) and Flows 4 and 5 (right).

6.4 Properties of RTT at the macroscopic level

The dominating RTT values

A superposition of many TCP connections can be assumed to produce local periodicity in the packet-level aggregate traffic, since the dynamics of an individual TCP connection depend on its RTT. If there are dominant values of RTT, they should therefore also be hidden in the aggregate TCP traffic rate. Figure 6.6 (left) shows how the EFP (using Haar wavelets [7]) looks for our data. It contains energy values at different scales for Samples 1 (normal tariff users only) and 2 (including flat rate users). Significant periodicity at the corresponding scale should show up in the EFP as dips [7]. The 0th scale in Figure 6.6 (left) contains packets per 10 ms bins. The dips occur in all cases at scales 6, 7, 8 and 9, which correspond to 0.64 s, 1.28 s, 2.56 s and 5.12 s bins. Thus, the dominant RTT values, if they exist, can be expected to be found between seconds. In order to see more closely the values where the dips occur, we simply varied the initial bin sizes. We did this both for samples where the flat rate population was present and for samples where it was not. The differences were found to be small.

Figure 6.6 Energy function plot (left) and PDF estimates (right) of aggregate RTTs.
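The EFP computation can be sketched with a plain Haar cascade over binned packet counts. This is a minimal NumPy sketch on synthetic data: the 2.56 s periodic component, bin size and all numbers are illustrative assumptions, not the measured traffic. Once the analysis window spans whole periods, the periodic component cancels out of the Haar details, so the energy drops at the corresponding scales.

```python
import numpy as np

def haar_energy_plot(x, levels):
    """Energy function plot: log2 of the mean squared Haar detail
    coefficients at scales j = 1..levels (orthonormal Haar cascade)."""
    a = np.asarray(x, dtype=float)
    energies = []
    for _ in range(levels):
        n = len(a) // 2 * 2
        pairs = a[:n].reshape(-1, 2)
        d = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)  # details at this scale
        a = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)  # approximation, halved
        energies.append(np.mean(d ** 2))
    return np.log2(energies)

# Packet counts in 10 ms bins with an assumed periodic component of
# 2.56 s = 256 bins; the half-window covers a full period from scale 9 on,
# so the EFP is expected to fall to the noise level there.
rng = np.random.default_rng(1)
i = np.arange(2 ** 14)
counts = 5.0 + 3.0 * np.sin(2 * np.pi * i / 256) + rng.standard_normal(i.size)
efp = haar_energy_plot(counts, 10)
print(np.round(efp, 2))
```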
We now turn back from the packet-level data to the RTT data. From now on, the word aggregate in this section means the set of all RTT samples without making any distinction as to which flows they belong to (data from short flows is also included). To compare the dominating RTT values we calculated estimates of the PDFs simply as normalised histograms, see Figure 6.6 (right). In GPRS, the Abis interface causes a granularity of 20 ms in the upstream traffic. Smaller variations are basically not due to the radio interface. We wanted to capture the variations due to the radio interface, but we were not so interested in the variations caused by the network. Hence we decided to use a bin size of 10 ms. The long flows (25% of all) dominate the histograms: if the short flows (75% of all) were left out, the histograms would still have the same shapes and spikes at the same positions. On the other hand, leaving out the longest flows of each sample does not change the histograms either. Sample 1 had spikes at positions 0.7, 1.3, 1.9 and 4.3 seconds, and they are highlighted in Figure 6.6 (right). Sample 2 also had spikes at the same positions. Note that the positions of the spikes correspond to the typical transmission delays of a packet and its acknowledgement when the packet is sent at the SGSN, where the buffer is per mobile, directly without any buffering, after a queuing delay of one data packet, etc. This also implies that the long flows are not much affected by other traffic in the network; they are only constrained by their own limited radio access link. The prior information about the flat rate population, namely that some of them use GPRS as an alternative to dial-up connections in regions where no broadband is available, suggests that they are typically also far from base stations. Hence, they may often have heavier FEC in their channel coding than the regular users. Indeed, Sample 2 has clear spikes also at about 1 second and at slightly less than 3 seconds, in places where Sample 1 does not have such clear spikes.
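Locating such spikes in a normalised histogram can be sketched as follows. The RTT sample below is hypothetical: a mixture concentrated near the dominant values reported for Sample 1 over a diffuse background; the peak-detection thresholds are illustrative assumptions.

```python
import numpy as np
from scipy.signal import find_peaks

# Hypothetical RTT sample: mass concentrated near 0.7, 1.3, 1.9 and 4.3 s
# (the spike positions of Sample 1) plus a diffuse uniform background.
rng = np.random.default_rng(2)
spikes = np.array([0.7, 1.3, 1.9, 4.3])
rtts = np.concatenate(
    [rng.normal(c, 0.02, 5000) for c in spikes] + [rng.uniform(0.0, 8.0, 5000)])

edges = np.arange(0.0, 8.0, 0.01)              # 10 ms bins, as in the text
pdf, edges = np.histogram(rtts, bins=edges, density=True)
# keep only pronounced local maxima, at least 0.2 s (20 bins) apart
peaks, _ = find_peaks(pdf, height=1.0, distance=20)
positions = edges[peaks] + 0.005               # bin centres of the spikes
print(np.round(positions, 1))
```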
Also, the positions of the spikes practically do not differ among the samples, so the histograms of Samples 3 and 5 need not be shown here. In particular, the traffic increase due to the tariff change does not seem to have any significant effect on the positions of the spikes. This suggests that, in the case of long flows, the flat rate population does not affect the regular users significantly. Finally, as seen in Figure 6.3 (left), RTTs of 6 seconds may occur. Indeed, this is not even so uncommon, as confirmed by the presence of a minor spike at around 6 seconds in the PDF of Sample 1 in Figure 6.6 (right).

Changes in quantiles

From the histogram-based estimates of the PDFs we also calculated CDFs, and long flows dominate them as well. Figure 6.7 (left) compares Samples 1 and 2. The shift in the 0.25 and 0.5 quantiles is likely due to the increased number of long connections that are not close to base stations and thus use a heavier channel coding scheme. However, an interesting observation is that, based on the discussion in Section 4.1, the network does not seem to be congested. Thus, we claim that the significant increase of the 0.95 quantile, from 5.9 s to 7.4 s, is explained by the increased number of simultaneous TCP connections from a mobile (bandwidth sharing). Indeed, in Figure 6.7 (right) we have calculated the maximum number of simultaneous TCP connections observed from the same mobile. As shown in the figure, before 18:00 50% of the mobiles had at most one TCP connection at a time and 50% had at least two simultaneous TCP connections, but during 18:00-21:00 50% of the mobiles had more than 5 simultaneous TCP connections active at least once during their whole Internet session.

Self-congestion

In order to further study how simultaneous TCP connections from the same mobile actually affect the RTTs, we divided those GPRS sessions that had at least two simultaneous TCP connections into two groups.

Figure 6.7 Comparison of CDFs (left) and maximum number of simultaneous TCP connections from the same mobile (right).

Group 0 did not send much data (Unique Bytes) in the upstream direction, whereas Group 1 contained those mobiles that must have had at least one non-trivial data transfer in the upstream direction. More precisely, Group 1 was defined as the set of those mobiles for which the total amount of upstream Unique Bytes was larger than 1 kB times the total number of TCP connections from that mobile. Figure 6.8 shows the maximum RTT over all TCP connections of a mobile against the maximum number of simultaneous connections observed from the same mobile, for Group 0 (left) and Group 1 (right). The scales of the axes are chosen to be the same for both plots in Figure 6.8 in order to show that extremely large RTT values occur almost solely for Group 1 (significant upstream traffic), whereas very large numbers of simultaneous TCP connections occur for Group 0 (downloading traffic only). This implies that the context switching between transmitting and receiving packets sometimes causes problems.

Figure 6.8 Maximum RTTs with no significant upstream traffic (left) and significant upstream traffic (right).

It can be expected that the observed maximum RTTs increase as the number of simultaneous TCP connections increases. This is verified in Figure 6.9 (left), which shows a robust estimate of the expectation of the maximum RTT over all simultaneous flows, conditioned on the maximum number of simultaneous connections.

Figure 6.9 Robust estimates of conditional expectations of maximum RTTs for GPRS (left) and UMTS (right).
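A robust conditional-expectation estimate of this kind can be approximated by a per-level conditional median. The sketch below runs on assumed synthetic session data (the linear trend and the heavy-tailed noise model are illustrative only, not the measured sessions).

```python
import numpy as np

def conditional_median(levels_x, values_y):
    """Median of values_y within each level of levels_x: a robust
    estimate of E[max RTT | max number of simultaneous connections]."""
    levels_x = np.asarray(levels_x)
    values_y = np.asarray(values_y)
    return {int(k): float(np.median(values_y[levels_x == k]))
            for k in np.unique(levels_x)}

# Assumed session data: the maximum RTT grows with the maximum number of
# simultaneous TCP connections, with heavy-tailed fluctuations on top.
rng = np.random.default_rng(3)
conns = rng.integers(1, 11, 2000)               # 1..10 simultaneous flows
max_rtt = 2.0 + 1.5 * conns + rng.exponential(5.0, 2000)
est = conditional_median(conns, max_rtt)
print({k: round(v, 1) for k, v in sorted(est.items())})
```

The median rather than the mean keeps a few extreme maximum-RTT samples from dominating the estimate, which matters with heavy-tailed RTT data.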
The right plot also shows individual samples of maximum RTTs.

Our UMTS sample is not large enough to make the same division into groups as with GPRS. Figure 6.9 (right) shows the individual values of the maximum RTTs as a function of the number of simultaneous TCP flows during a session, for all sessions (red dots), and the conditional expectation of the maximum RTT over all simultaneous flows, conditioned on the maximum number of simultaneous connections (solid line). Based on the figure, self-congestion is also a problem for UMTS mobiles, although a slightly less serious one.

6.5 Conclusions

We studied the data from a passive traffic measurement representing a 30 hour IP traffic trace from a Finnish operator's GPRS/UMTS network. Of specific interest was the variability of the Round Trip Times (RTTs) of the TCP flows in the network. The RTTs were analysed at both the micro- and the macroscopic level. The microscopic level involved detailed analysis of individual flows, and we were able to infer various issues affecting the RTT process of a particular flow, such as changes in the rate of the radio channel and the impact of simultaneous TCP flows at the mobile device. Also, Lomb periodograms were used to detect periodic behavior in the applications. In the macroscopic analysis we searched for dominating RTT values, first by EFPs and then by exploring the spikes of the empirical PDFs of aggregate RTTs. A comparison of empirical CDFs showed a significant shift in quantiles after a considerable traffic increase, which was found to be due to bandwidth sharing at individual mobile devices. Finally, we investigated further the impact of bandwidth sharing at the mobile device and showed how an increased number of simultaneous TCP connections seriously affects the maximum RTTs observed by a given flow, both in GPRS and in UMTS.

Discussion about statistical methods

The Lomb periodogram, applied to the RTT process (6.1), was found to be potentially useful in detecting the performance of streaming applications over TCP, as the analysis of Flow 1 showed. For an ordinary non-streaming TCP flow the LP typically does not indicate any regular periodicity. Hence it could be used to detect streaming applications behind port 80.
For example, whenever the LP detects something significantly and regularly periodic, a streaming application may be the cause. The periodicity may also be found in other ways, but the point is that the LP is systematic: given (6.1) as input, it produces a clear picture as output. Or, if some significance level is given, it could classify long flows on port 80 into possibly streaming applications and definitely non-streaming applications. The way we applied EFPs was straightforward. Again, the point of the EFP is that producing Figure 6.6 (left) and varying the initial bin size requires much less work than producing Figure 6.6 (right). However, Figure 6.6 (left) still contains essentially the same information about dominating RTTs as Figure 6.6 (right).

Acknowledgement

The authors would like to thank Vesa Antervo from Elisa for his efforts in making and anonymising the measured trace, Jussi Vähäpassi from Elisa for financial support, and Marco Mellia for his help with tstat.

Bibliography

[1] Gurtov, A., Ludwig, R.: Responding to spurious timeouts in TCP. In: Proceedings of INFOCOM 2003, San Francisco, California, USA (2003)
[2] Vacirca, F., Ziegler, T., Hasenleithner, E.: Large Scale Estimation of TCP Spurious Timeout Events in Operational GPRS Networks. In: COST 279 (2005)
[3] Tstat: TCP Statistics and Analysis Tool (2005)
[4] Aikat, J., Kaur, J., Donelson Smith, F., Jeffay, K.: Variability in TCP Round-trip Times. In: Proceedings of IMC 03, Miami Beach, Florida, ACM (2003)
[5] Vacirca, F., Ricciato, F., Pilz, R.: Large-Scale RTT Measurements from an Operational UMTS/GPRS Network. In: First International Conference on Wireless Internet (IEEE WICON 05) (2005)
[6] Benko, P., Malicsko, G., Veres, A.: A Large-scale, Passive Analysis of End-to-End TCP Performance over GPRS. In: INFOCOM 2004 (2004)
[7] Huang, P., Feldmann, A., Willinger, W.: A non-intrusive, wavelet-based approach to detecting network performance problems. In: Proceedings of ACM Internet Measurement Workshop (2001)
[8] Lomb, N.: Least-squares frequency analysis of unequally spaced data. Astrophysics and Space Science 39 (1976)
[9] Partridge, C., Cousins, D., Jackson, A., Krishnan, R., Saxena, T., Strayer, W.: Using Signal Processing to Analyze Wireless Data Traffic. In: ACM Workshop on Wireless Security, Atlanta, Georgia, USA, ACM (2002)
[10] Kilpi, J., Lassila, P.: Statistical analysis of RTT variability in GPRS and UMTS networks. Technical report, VTT and TKK, rtt-report.pdf (2005)


7 On-line Segmentation of Non-stationary Fractal Network Traffic with Wavelet Transforms and Log-likelihood-based Statistics 1

David Rincón and Sebastià Sallent
Universitat Politècnica de Catalunya (UPC)
Av. Canal Olímpic s/n, Castelldefels, Barcelona, Spain
{drincon,sallent}@mat.upc.es

Abstract

Network traffic exhibits fractal characteristics, such as self-similarity and long-range dependence. Traffic fractality and its associated burstiness have important consequences for the performance of computer networks, such as higher queueing delays and losses than predicted by classical models. There are several estimators of the fractal parameters, and those based on the discrete wavelet transform (DWT) are the best in terms of efficiency and accuracy. However, the DWT estimator does not consider the possibility of changes in the fractal parameters over time. We propose using the Schwarz information criterion (SIC) to detect changes in the variance structure of the wavelet decomposition and then segmenting the trace into pieces with homogeneous characteristics of the Hurst parameter. The procedure can be extended to the stationary wavelet transform (SWT), a non-orthogonal transform that provides higher accuracy in the estimation of the change points. The SIC analysis can be performed progressively. The DWT-SIC and SWT-SIC algorithms were tested against synthetic and well-known real traffic traces, with promising results.

7.1 Introduction

It is well known [1, 2] that network traffic exhibits some fractal properties (heavy tails, slow-decaying autocorrelation, and scaling, among others) that cannot be captured by traffic models based on Poissonian or Markovian stochastic processes. New fractal-aware algorithms, which exploit the properties of self-similarity, long-range dependence, and multifractality, have been developed for traffic analysis and generation.
Few studies (to our knowledge, only [3], [4] and [5]) have explored the possibility of non-stationarity of the fractal parameters. Real traffic is expected to change its behavior as time goes on, and detecting the change points (i.e., the instants that bound the segments showing homogeneous fractal behavior) can be useful for some algorithms and network mechanisms that exploit the long-memory properties of network traffic. Some examples are the TCP congestion control described in [6], a predictive bandwidth control for MPEG sources [7], the effective bandwidth estimator described in [8] and the novel application of traffic fractality to the design

1 This work was supported in part by the Spanish Government under grant CICYT TIC C04-02, and the EuroNGI European Network of Excellence (IST-FP6)

of VLSI video decoders [9]. Apart from those applications, a better knowledge of traffic characteristics is needed when creating synthetic traces for simulation or testing purposes. Our long-term aim is to accurately describe the temporal and frequential evolution of fractal and multifractal traffic parameters, which, in turn, can provide us with a better understanding of network dynamics.

Among the fractal parameter estimation algorithms, those based on the use of the Discrete Wavelet Transform (DWT) exhibit the highest performance in terms of accuracy, computational efficiency, and adaptability. The Abry-Veitch estimator is widely accepted as the best and most efficient DWT-based LRD estimator, since it is capable of performing a joint estimation of α (the scaling parameter, related to the Hurst parameter) and c_f (related to the variance of the series under study), and of computing the associated confidence intervals. The DWT decomposes the traffic trace in a multiresolution analysis (MRA) that turns the 1/f spectrum of fractal traces into a certain structure of the variance across scales, which can be easily detected. If the scaling parameter changes at any moment, so does the complete variance structure. Therefore, if a change point detection algorithm monitors the variance at every scale, the simultaneous detection of changes across scales will indicate a change in the scaling parameter. Changes must appear at several scales to be significant; a change in variance at a single scale only tells us of non-stationarity of the variance at that scale. A final phase of automatic selection and clustering of the candidate change points must be performed in order to infer the real changes of the fractal parameters. We propose the use of the Schwarz Information Criterion (SIC) as the change point detection algorithm.
This criterion was developed from the Akaike Information Criterion (AIC) for model selection, widely used in statistical analysis. The SIC is based on the maximum likelihood function for the model, and can be easily applied to change point detection by comparing the likelihood of the null hypothesis (no changes in the variance series) against the opposite hypothesis (a change is present). The time-frequency characterization provided by the wavelet transforms and the SIC statistic is useful for off-line analysis, but it can also be easily adapted to perform an on-line (in the sense of sequential, progressive) monitoring of the behavior of traffic. Both the DWT and the SWT can be performed sequentially, and the SIC statistic can be updated appropriately. This paper presents the SIC algorithm working with the DWT and its non-decimated version, the Stationary Wavelet Transform (SWT). The SWT provides higher temporal accuracy for the change point estimation, due to its inherent time redundancy. Both the DWT-SIC and SWT-SIC algorithms are described and applied to synthetic traces and real traffic. The rest of the paper is organized as follows: Section II gives a brief introduction to LRD parameters and their estimation; in Section III, the AIC and SIC algorithms are presented; Section IV explains how the DWT/SWT and the SIC statistic can work together, along with tests performed on synthetic traces; Section V shows the results of applying the algorithms to a real traffic sample; Section VI discusses the on-line performance of the analyzers; and Section VII concludes the paper.

7.2 Long-range Dependence and its Estimation

A Brief Review of Long-range Dependence

A stationary stochastic process x(t) is considered long-range dependent (LRD) if its autocovariance function decays at a rate slower than a negative exponential. Equivalently, LRD can also be defined in the frequency domain as a 1/f-like spectrum around the origin: S_x(f) ~ c_f |f|^(-α) when f → 0.
The LRD parameters are α and c_f. The scaling parameter α is related to the intensity of the LRD phenomenon (a qualitative measure) and is usually expressed as the Hurst parameter H = (1 + α)/2, while c_f has

dimensions of variance and can be interpreted as a quantitative measure of LRD. Though the Hurst parameter has received more attention in the published literature, c_f is not negligible, since it appears in the expression of the loss probability when LRD traffic is fed into a queue, and also in the variance of the sample mean of an LRD process (and thus determines the confidence intervals for its estimation) [10]. In the rest of the paper we will focus our attention on α and H, although our algorithms can be extended to an analysis of the evolution of c_f.

The Discrete Wavelet Transform

The Discrete Wavelet Transform is a powerful tool that allows a fast, efficient and precise estimation of the LRD parameters. Given a signal x(n), we denote by d_x(j, k) = <x, ψ_{j,k}> the coefficients of the DWT, where ψ_{j,k} is a function with finite support, and j and k are respectively the scale and the location where the analysis is performed. The basis of the decomposition is actually a family generated from translated and dilated versions of the mother wavelet ψ_{0,0}: ψ_{j,k}(n) = 2^(-j/2) ψ_{0,0}(2^(-j) n - k) for j = 1...J, k ∈ Z. The DWT can be understood as the result of a cascade of quadrature-mirror low-pass and high-pass filters (h(n) and g(n), respectively), in which the outputs of the high-pass filter (the details of the signal) are the coefficients of the DWT at each scale j, while the output of the low-pass filter (the approximation of the signal) is iteratively filtered again. The output of the filters is decimated in order to maintain orthogonality; therefore, the number of coefficients is halved at each iteration.
In terms of the signal spectrum, the output of the DWT is a decomposition in subbands that are halved at each step, giving rise to a multiresolution analysis in which the original signal is decomposed into a low-pass approximation at scale J, a_x(J, k), and a set of high-pass details d_x(j, k) for each scale j = 1...J. Figure 7.1 illustrates the filter bank interpretation.

Figure 7.1 On the left, the DWT as a filter bank for a level 3 decomposition. On the right, the subbands of the normalized spectrum, with the approximation a_3 and details d_1...d_3.

DWT Analysis of LRD Traffic: the Abry-Veitch Estimator

Abry and Veitch developed the LogScale diagram [10], an unbiased and efficient estimator of the LRD parameters. It estimates the power of the subbands of the wavelet decomposition as the sample variance of the coefficients at each branch of the filter, µ_j. Due to the power law,

µ_j = E[d_x(j, k)²] = 2^(jα) c_f C(α, ψ_0)    (7.1)

and taking logarithms on both sides,

log_2(µ_j) = jα + log_2(c_f C) + g_j    (7.2)

where g_j = ψ(n_j/2)/ln 2 - log_2(n_j/2) are bias-correction terms that depend only on n_j (the number of wavelet coefficients at scale j), required because the expectation of the logarithm is not the logarithm of the expectation, and ψ(x) is the Psi function. The parameters α and c_f C can be estimated from expression (7.2) by performing a weighted linear regression on log_2(µ_j), in which the weight is related to the number of coefficients available at the corresponding scale (decreasing by a factor of 2 for each increase in the scale). Assuming µ_j follows a Gaussian distribution, confidence intervals for the estimation can be derived.

Real-time Estimation with the DWT

There exists an on-line version of the DWT estimator [4] that performs the subband variance computation progressively, but it is accumulative, not dynamic: it returns updated estimations performed over all available samples from t = 0. The same authors studied the stationarity of the scaling exponent (α or H) and developed a statistical test that is capable of determining its constancy [5], but this study is limited to a dyadic temporal decomposition (i.e., the traffic traces are divided into non-overlapping segments whose lengths are powers of 2). We expect traffic to change at arbitrary points (transition points) where the trace changes its behavior (in terms of the distribution of its variance across scales). The change point detection problem is a classical one in statistics, and several algorithms have been developed to address it.
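The estimation pipeline of (7.1)-(7.2) can be sketched in a few lines. This is a hedged, minimal sketch and not the authors' implementation: it uses the Haar wavelet for brevity, omits the bias-correction terms g_j and the confidence intervals, and feeds in white Gaussian noise, whose flat spectrum implies α ≈ 0 and hence H ≈ 0.5, making the result easy to verify.

```python
import numpy as np

def haar_subband_variances(x, levels):
    """Haar filter-bank cascade: per-scale detail variances mu_j and
    coefficient counts n_j (halved by the decimation at each step)."""
    a = np.asarray(x, dtype=float)
    mu, n = [], []
    for _ in range(levels):
        m = len(a) // 2 * 2
        pairs = a[:m].reshape(-1, 2)
        d = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0)   # details d_x(j, k)
        a = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)   # approximation a_x(j, k)
        mu.append(np.mean(d ** 2))
        n.append(len(d))
    return np.array(mu), np.array(n)

def estimate_alpha(mu, n):
    """Weighted least-squares slope of log2(mu_j) versus j, weighting each
    scale by its number of coefficients (g_j corrections omitted)."""
    j = np.arange(1, len(mu) + 1, dtype=float)
    y = np.log2(mu)
    w = n / n.sum()
    jm, ym = np.sum(w * j), np.sum(w * y)
    return np.sum(w * (j - jm) * (y - ym)) / np.sum(w * (j - jm) ** 2)

# White Gaussian noise: flat spectrum, so alpha should come out near 0.
rng = np.random.default_rng(4)
mu, n = haar_subband_variances(rng.standard_normal(2 ** 16), 8)
alpha = estimate_alpha(mu, n)
print(f"alpha = {alpha:.3f}, H = {(1 + alpha) / 2:.3f}")
```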
In a previous study [11] we used the iterated cumulated sum of squares (ICSS) algorithm, in conjunction with the stationary wavelet transform, to develop a method for the temporal segmentation of the trace into regions that show homogeneous LRD properties (a constant Hurst parameter). The results of the ICSS technique are good, but it lacks flexibility in selecting the significance level (critical values are computed using Monte Carlo simulation), and it is also difficult to implement progressively. In this paper we use a more powerful approach, based on an information-theoretic criterion, which can be computed on-line (in the sense of a progressive, sequential analysis). We follow the theory presented in [12] in the context of stock price analysis.

7.3 Variance Change Point Estimation with the Schwarz Information Criterion

Statement of the Problem

Let x_1, x_2, ..., x_n be a sequence of independent normal random variables with a common mean m and variances σ²_1, σ²_2, ..., σ²_n. We test the null hypothesis

H_0: σ²_1 = σ²_2 = ... = σ²_n = σ²    (7.3)

versus the alternative

H_1: σ²_1 = ... = σ²_{k_1} ≠ σ²_{k_1+1} = ... = σ²_{k_2} ≠ ... ≠ σ²_{k_q+1} = ... = σ²_n    (7.4)

where q is the unknown number of change points and 1 ≤ k_1 < k_2 < ... < k_q < n are the unknown positions of the change points. Following the binary segmentation procedure suggested in [12], we reduce the problem to the detection of a single change point, and then iterate the process in the two subsequences that surround the change point. If a change is detected in a subsequence, we split again and iterate until no more changes are found in any of the subsequences. Our problem is then reduced to testing the null hypothesis against

H_1: σ²_1 = ... = σ²_k ≠ σ²_{k+1} = ... = σ²_n    (7.5)

Information-based Criteria: AIC and SIC

We now turn to the discussion of the statistic that will allow us to decide whether a change point exists. One of the most used statistics for change point detection is the Akaike information criterion (AIC) for model selection [13]. Following Akaike's work, other authors have applied information-theoretic criteria in other fields. Schwarz [14] defined the SIC statistic as

SIC(k) = -2 log L(θ̂) + p log k    (7.6)

where L(θ̂) is the maximum likelihood function for the model, p is the number of free parameters, and k is the sample size. In our context there are two different models, which correspond to the null and alternative hypotheses, respectively. Our decision will be taken following the principle of minimum information; that is, do not reject H_0 if SIC(n) ≤ min_k SIC(k), reject H_0 if SIC(n) > SIC(k) for some k, and estimate the position of the change point by k̂ such that SIC(k̂) = min_{1≤k<n} SIC(k), where SIC(n) is the SIC statistic under H_0 and SIC(k) is the SIC under H_1 for k = 1...n-1.
Then, (7.6) becomes:

SIC(n) = n log 2π + n log σ̂² + n + log n    (7.7)

SIC(k) = n log 2π + k log σ̂²_1 + (n - k) log σ̂²_2 + n + 2 log n    (7.8)

where σ̂², σ̂²_1 and σ̂²_2 are the biased estimators of the variances of the whole sequence and of the subsequences 1...k and k+1...n, respectively. These expressions limit our change detection range to 2 ≤ k̂ ≤ n-1. Reference [12] gives a proof that k̂ is a consistent estimator of the true change point, and it also gives the expression for the computation of the significance level. The authors define the critical level c_α, which modifies the criterion: accept H_0 if SIC(n) < min_{2≤k≤n-2} SIC(k) + c_α, and otherwise estimate the position of the change point by k̂ such that

SIC(k̂) = min_{2≤k≤n-2} SIC(k)    (7.9)

The expression for the critical level c_α is derived from the asymptotic null distribution of the statistic [12]. The same paper discusses other versions of the SIC criterion, such as an unbiased estimator and the unknown-mean case. The unbiased estimator yields a slight increase in accuracy at the cost of a large increase in computation. In our study we used only definition (7.9), which provides acceptable results at a lower computational cost.
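A direct transcription of (7.7)-(7.8) for a single variance change point might look as follows. This is a hedged sketch: the critical level c_α and the binary segmentation loop are omitted, and the test sequence is synthetic.

```python
import numpy as np

def sic_change_point(x):
    """Locate a single variance change point by minimising SIC(k) of (7.8);
    returns (k_hat, delta) with delta = SIC(n) - min_k SIC(k). A change
    would be declared when delta exceeds the critical level c_alpha."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    sic_n = n * np.log(2 * np.pi) + n * np.log(x.var()) + n + np.log(n)   # (7.7)
    best_k, best_sic = None, np.inf
    for k in range(2, n - 1):                    # detection range 2 <= k <= n-2
        sic_k = (n * np.log(2 * np.pi) + k * np.log(x[:k].var())
                 + (n - k) * np.log(x[k:].var()) + n + 2 * np.log(n))     # (7.8)
        if sic_k < best_sic:
            best_k, best_sic = k, sic_k
    return best_k, sic_n - best_sic

# Synthetic check: the standard deviation changes from 1 to 3 at k = 300.
rng = np.random.default_rng(5)
x = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(0.0, 3.0, 300)])
k_hat, delta = sic_change_point(x)
print(k_hat, round(delta, 1))
```

Note that `x.var()` is the biased (maximum-likelihood) variance estimator, matching the σ̂² terms in (7.7)-(7.8).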

7.4 DWT-SIC and SWT-SIC Algorithms for Detecting Changes in the Scaling Parameter α

Connecting the DWT and SIC

The main contribution of our work is the application of the SIC change point estimator to the output of each of the branches of the wavelet filter bank. If we find the same variance change position across all scales, or across a significant number of them, this signals a change point in the scaling (or Hurst) parameter. The number of samples available at each scale is not the same, since each branch of the wavelet bank undergoes a different number of decimations. For example, if the DWT filter bank shown in Figure 7.1 is fed with a traffic trace of 1024 samples, the first branch will output 512 d_x(1, k) samples, the second branch will output 256 d_x(2, k) samples, and the third branch will output only 128 d_x(3, k) samples. There is still a temporal relationship between them: the second sample of the lowest frequency subband (d_x(3, 2)), the third and fourth samples of the middle subband (d_x(2, 3) and d_x(2, 4)), and the 5th, 6th, 7th and 8th samples of the highest subband (d_x(1, 5)...d_x(1, 8)) are all related to a certain time segment, which corresponds to samples 9 to 16 of the input trace. Figure 7.2 illustrates the relationship between samples across scales. In order to provide a good estimation of the change point at all available scales, a phase correction is applied to the higher scales, with a scale-dependent delay, as illustrated on the right side of Figure 7.2. This correction aligns the position of the wavelet coefficients with their zone of influence. After the change point detection, a clustering-and-decision step is needed in order to decide whether a change is present at enough scales.
This step has not yet been fine-tuned, but we expect to provide a good solution with pattern-matching algorithms; for the moment, we rely on a visual heuristic technique, and therefore our results for the change points are approximate.

Figure 7.2 On the left, the temporal relationship between the coefficients of the DWT. On the right, the phase-corrected coefficients.

The Stationary Wavelet Transform and SIC

Another approach to dealing with the phase-correction problem is the use of a non-orthogonal wavelet transform, known as the stationary or maximum-overlap wavelet transform (SWT, MODWT). This transform is essentially identical to the DWT except for the decimation step, which is not performed in the SWT. Therefore, the variance change point can be accurately located at each scale. The expression for the estimation of the Hurst parameter is similar to (7.1), although the constant term is not the same, due to the time redundancy:

µ_j = E[d_x(j, k)²] = 2^(jα) c_f C'(α, ψ_0)    (7.10)

The bias-correcting terms are essentially the same as those developed for the DWT, except that their value


More information

Resource Allocation for Video Streaming in Wireless Environment

Resource Allocation for Video Streaming in Wireless Environment Resource Allocation for Video Streaming in Wireless Environment Shahrokh Valaee and Jean-Charles Gregoire Abstract This paper focuses on the development of a new resource allocation scheme for video streaming

More information

On the static assignment to parallel servers

On the static assignment to parallel servers On the static assignment to parallel servers Ger Koole Vrije Universiteit Faculty of Mathematics and Computer Science De Boelelaan 1081a, 1081 HV Amsterdam The Netherlands Email: koole@cs.vu.nl, Url: www.cs.vu.nl/

More information

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements

Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Supplement to A Hierarchical Approach for Fitting Curves to Response Time Measurements Jeffrey N. Rouder Francis Tuerlinckx Paul L. Speckman Jun Lu & Pablo Gomez May 4 008 1 The Weibull regression model

More information

Financial Econometrics and Quantitative Risk Managenent Return Properties

Financial Econometrics and Quantitative Risk Managenent Return Properties Financial Econometrics and Quantitative Risk Managenent Return Properties Eric Zivot Updated: April 1, 2013 Lecture Outline Course introduction Return definitions Empirical properties of returns Reading

More information

Brief Review on Estimation Theory

Brief Review on Estimation Theory Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on

More information