Analysis methods of heavy-tailed data

Size: px

Start display at page:

Download "Analysis methods of heavy-tailed data"

Chloe Price
5 years ago
Views:

1 Institute of Control Sciences Russian Academy of Sciences, Moscow, Russia February, 13-18, 2006, Bamberg, Germany June, 19-23, 2006, Brest, France May, 14-19, 2007, Trondheim, Norway PhD course

2 Chapter 1 Introduction: definitions and basic properties of classes of heavy-tailed distributions. Tail index estimation. Methods for the selection of the number of the largest order statistics in Hill estimator. Rough methods for the detection of heavy tails and the number of finite moments. The Section 1 contains the introduction with necessary definitions, basic properties and examples of heavy-tailed data. The tail index indicates the shape of the tail and therefore it is the basic characteristic of heavy-tailed data. Methods of tail index estimation are presented. Finally, several rough tools for the detection of heavy-tails, the number of finite moments and dependence are considered.

3 Main assumption. Let X 1,...,X n be a sample of i.i.d. r.v.s distributed with the heavy-tailed DF F(x). Definition A DF F(x) (or the r.v. X) is called heavy-tailed if its tail F(x) = 1 F(x) > 0, x 0, satisfies y 0 lim P{X > x + y X > x} = lim F(x + y)/ F(x) = 1. x x Comparison of probability tails.

4 Examples of heavy- and light-tailed distributions Heavy-tailed Subexponential: distributions: Pareto, Lognormal, Weibull with shape parameter less than 1. With regularly varying tails: Pareto, Cauchy, Burr, Frechét, Zipf-Mandelbrot law. Light-tailed exponential, gamma, Weibull distributions with shape parameter more than 1, normal, finite distributions.

5 Heavy-tailed distributions have been accepted as realistic models for various phenomena: WWW-session characteristics sizes and durations of sub-sessions; sizes of responses inter-response time intervals on/off-periods of packet traffic file sizes service-time in queueing model flood levels of rivers major insurance claims extreme levels of ozon concentrations high wind-speed values wave heights during a storm low and high temperatures

6 Basic definitions Let X n = {X 1,...,X n } be a sample of i.i.d. r.v. distributed with the DF F(x) and M n = max(x 1, X 2,...,X n ). Extreme value theory assumes that for a suitable choice of normalizing constants a n, b n it holds P{(M n b n )/a n x} = F n (b n + a n x) n H γ (x), x R, and an Extreme Value DF H γ (x) is of the following type: exp( x 1/γ ), x > 0, γ > 0 Fréchet, H γ (x) = exp( ( x) 1/γ ), x < 0, γ < 0 Weibull, exp( e x ), γ = 0, x R Gumbel. Definition The parameter γ is called the extreme value index (EVI) and defines the shape of the tail of the r.v. X. The parameter α = 1/γ is called tail index.

7 Pickands s theorem: The limit distribution of the excess distribution of the X i s is necessarily of the Generalized Pareto form lim P(X 1 u > x X 1 > u) (1 + γx) 1/γ +, x R, u x F,u+x<x F where x F = sup{x R : F(x) < 1} is the right endpoint of the distribution F and the shape parameter γ R. Therefore, the Generalized Pareto distribution (GPD) with DF Ψ σ,γ (x) = 1 (1 + γx/σ) 1/γ + is often used as a model of the tail of the distribution.

8 The classes of heavy-tailed distributions distributions with regularly varying tails (RVT ) (X R 1/γ ) P{X > x} = x 1/γ l(x), x > 0, γ > 0, where l is a slowly varying function, i.e. lim l(tx)/l(x) = 1, t > 0. x Examples: Pareto, Burr, Fréchet distributions belong to RVT. Examples of l(x) give c ln x, c ln(ln x) and all functions converging to positive constants.

9 The classes of heavy-tailed distributions subexponential distributions (S) (X S) P{S n > x} np{x 1 > x} P{M n > x} as x, where S n = X X n, n 2, M n = max i=1,...,n {X i }. Example: Weibull with shape parameter less than 1 belongs to S. Intuitively, subexponentiality means that the only way the sum can be large is by one of the summands getting large (in contrast, in the light-tailed case all summands are large if the sum is so).

10 Basic properties of regularly varying distributions: Lemma Let X R α. Then, (i) X S. (ii) E{X β } < if β < α, E{X β } = if β α. (iii) If α > 1, then X r R 1 α and P{X r > x} l(x)x 1 α /((α 1)E{X}) as x. (iv) If Y is non-negative and independent of X such that P{Y > x} = l 2 (x)x α 2, then X + Y R min(α,α2 ) and P{X + Y > x} P{X > x} + P{Y > x} as x. (v) If Y is non-negative and independent of X such that E{Y α+ε } < for some ε > 0 then XY R α and P{XY > x} E{Y α }P{X > x} as x.

11 Important property for the rough detection of heavy tails and the number of finite moments: Let X R α. Then E{X β } <, if β < 1/γ; Examples: E{X β } =, if β > 1/γ. If α = 2, γ = 0.5, then EX 1 < (the first moment is finite), EX1 2 = (the second moment is infinite or does not exist). If α = 0.5, γ = 2, then EX 1 =, EX1 2 =... (all moments are infinite or do not exist).

12 Estimators of tail index X (1) X (2)... X (n) are order statistics of the sample X n = {X 1, X 2,..., X n } For γ > 0: Hill s estimator k ˆγ H (n, k) = 1 k i=1 ln X (n i+1) ln X (n k) Ratio estimator Goldie, Smith, (1987) a n = a n (x n ) = n n ln(x i /x n )I{X i > x n }/ I{X i > x n }, i=1 i=1 x n is am arbitrary threshold level, e.g., x n = X (n k)

13 Estimators of tail index For γ R: Moment estimator, Dekkers, Einmahl, de Haan, (1989): ( 1 ˆγ M (n, k) = ˆγ H (n, k) (ˆγ H (n, k)) 2 /S n,k )), S n,k = (1/k) k ( ) 2 i=1 ln X(n i+1) ln X (n k) UH estimator, Berlinet, (1998): ˆγ UH (n, k) = (1/k) k ln UH i ln UH k+1, UH i = X (n i)ˆγ H (n, i) i=1 Pickands s estimator ˆγ P (n, k) = 1/ ln 2 ln ( X (k) X (2k) ) / ( X(2k) X (4k) ) No recursiveness!

14 Group estimator, Davydov, Paulauskas, Račkauskas, (2000): The sample X n is divided into l groups V 1,..., V l, each group containing m r.v.s, i.e. n = l m. Estimator of the function of the tail index: where z l = (1/l) l i=1 k li = ˆα ˆα+1 = 1 1+ˆγ, k li = M (2) li /M (1) li, M (1) li = max{x j : X j V i } and M (2) li is the second largest element in the same group V i.

15 Group estimator, Davydov, Paulauskas, Račkauskas, (2000).Theoretical background. 1 For distributions with regularly varying tails 1 F(x) = x α l(x), and l = m = [ n], it is proved 2 For distributions i=1 z l a.s. α α + 1 = γ. 1 F(x) = C 1 x α + C 2 x β + o(x β ), with β = 2α it holds ( l l l(l 1 k li α(1+α) 1 ) kli 2 l 1 i=1 ) 1/2 l k li p N(0, 1). i=1

16 Recursiveness of the estimate z l. On-line estimation. Having the next group of observations V l+1 it follows γ l+1 = ( l l k ) 1 l+1 l γ l l + 1 After getting i additional groups with m elements each V l+1,..., V l+i ( ) l 1 γ l+i = (l + i) + k l+1l k l+il+i 1, 1 + γ l i.e. γ l+i is obtained using γ l by O(1) calculations.

17 Recursiveness of the estimate z l. On-line estimation. Since it holds z l+i = lz l + z l+i = 1/(1 + γ l+i ) i j=1 k l+jl+j /(l + i), bias(z l+i ) = bias(z l ), var(z l+i ) < var(z l ) for i > 0 var(z l+i ) = var(z l )l/(l + i),

18 The selection of the number of the largest order statistics in Hill estimator A Hill plot {(k, ˆγ H (n, k)), 1 k n 1}): the estimate of ˆγ H (n, k) is chosen from an interval in which this function demonstrate stability. Plot of the mean excess function {(u, e(u)) : X (1) < u < X (n) }, where e(u) = n (X i u)1{x i > u}/ i=1 n 1{X i > u} is the sample mean excess function over the threshold u. If this plot follows a reasonably straight line above a certain value of u, then this indicates that excesses over u follow e P (u) = (1 + γu)/(1 γ) of generalized Pareto distribution with positive shape parameter. One can take the nearest order statistics to u as the estimate of k. i=1

19 The sensitivity of the Hill estimate to k gamma(k) 6 gamma(k) gamma(k) k k k The Hill s estimate against k for 15 samples of Weibull (left), Pareto (middle) and Frechét (right) distributions, all with shape parameter α = 1/γ = 0.5. Sample size is n = 1000.

20 Plot of the mean excess function. Left: Weibull distribution. Right: Pareto distribution. The shape parameter is 0.5, sample size 1000.

21 Bootstrap method for k selection. Minimizing of the empirical bootstrap estimate of the mean squared error of γ: MSE(ˆγ) = E (ˆγ γ) 2 min k. The bootstrap estimate is obtained by drawing B samples with replacement from the original data set X n. Smaller re-samples {X 1,..., X n 1 } of the size are used. n 1 = n β, 0 < β < 1, The corresponding smaller k 1 and an optimal k are related by: k = k 1 (n/n 1 ) α, 0 < α < 1, where β = 1/2 and α = 2/3, P.Hall, (1990).

22 Empirical bootstrap estimate of the MSE(ˆγ) is where MSE (n 1, k 1 ) = (ˆb (n 1, k 1 )) 2 + var (n 1, k 1 ) min k 1, ˆb (n 1, k 1 ) = 1 B var (n 1, k 1 ) = 1 B 1 B ˆγ b (n 1, k 1 ) ˆγ(n, k), b=1 ( B ˆγ b (n 1, k 1 ) 1 B b=1 B ˆγ b (n 1, k 1 ) b=1 are the empirical bootstrap estimates of the bias and the variance, ) 2 ˆγ b is the Hill s estimate constructed by some re-sample of the size n 1 with the parameter k 1.

23 Double bootstrap method for k selection. Danielsson, de Haan, Peng and de Vries, (1997) requires less parameters than bootstrap method, Hall, (1990): n 1 and B are required, α is not required. Auxiliary statistic: MSE(z n,k ) = E ( z n,k z n,k) 2, where z n,k = M n,k 2(ˆγ H (n, k)) 2, M n,k = 1 k ( ) 2 log X(n j+1) log X k (n k), j=1 z n,k is a bootstrap estimate of z n,k.

24 Double bootstrap method for k selection. M n,k /2ˆγ H (n, k) and ˆγ H (n, k) are consistent estimates for γ, = z n,k 0 as n. = AMSE(z n,k ) = E(z n,k ) 2 min k.

25 Double bootstrap procedure is Draw B bootstrap sub-samples of the size n 1 ( n, n) (e.g., n 1 n 3/4 ) from the original sample and determine the value ˆk n 1 that minimizes MSE of z n1,k. Repeat this for B sub-samples of the size n 2 = [n1 2 /n] ([x] is the integer part of the number) and determine the value ˆk n 2 that minimizes MSE of z n2,k. Calculate ˆk opt n from the formula ˆk n opt = [ (ˆk n 1 ) 2 ( 1 1ˆρ ) 2 2ˆρ 1 1 ], ˆρ1 = ˆk n 2 1 log ˆk n 1 2 log(ˆk n 1 /n 1 ), and estimate γ by the Hill s estimate with ˆk opt n. The method is robust with respect to the choice of n 1, Gomes, Oliveira, (2000).

26 Sequential procedure for k selection is based on the theoretical result: i(ˆγ H (n, i) γ) (log log n) 1/2, 2 i k in probability, Drees & Kaufmann, (1998).

27 Sequential procedure for k selection. Algorithm. An initial estimate ˆγ 0 = ˆγ H (n, 2 n) for the parameter γ is obtained by the Hill s estimate. For r n = 2.5 ˆγ 0 n 0.25 we compute ˆk(r n ) = min{k 1,...,n 1 max 2 i k i(ˆγ H (n, i) ˆγ H (n, k)) > r n }. If r n is too large and max 2 i k i(ˆγ H (n, i) ˆγ H (n, k)) > r n is not satisfied it is recommended repeatedly replace r n by 0.9r n until ˆk(r n ) i well defined. Similarly, ˆk(r n) ε for ε = 0.7 is computed. Optimal k ( ) 1/(1 ε) ˆk opt = 1 ˆk(r ε n ) (2ˆγ 0 ) 1/3 3 (ˆk(r n )) ε is calculated and γ is estimated by ˆγ H (n, ˆk opt ). The method is sensitive to the choice of r n.

28 Plot method for m selection in the group estimator. The plot {(m, 1/z m 1)} for Pareto distribution with γ = 1, the true γ is shown by dotted line. Sample sizes are n = {150, 500, 1000}.

29 Plot method for m selection in the group estimator. Plot: {(m, z m ), m 0 < m < M 0 }, m 0 > 2, M 0 < n/2 m = n/l, z m = (m/n) [n/m] i=1 k (n/m)i From consistency result z l a.s. 1 1+γ it follows that there must be an interval [m, m + ] such that z m α/(1 + α) = (1 + γ) 1 for all m [m, m + ]. We suggest choosing the average value z = mean{1/z m 1 : m [m, m + ]} and m [m, m + ] as a point such that z m = z.

30 Bootstrap method for m automatical selection. Minimizing of the empirical bootstrap estimate of the mean squared error of (1 + γ l ) 1, l = n/m: MSE(γ l ) = E ( 1 l l i=1 ) 2 k li 1 min 1 + γ. m The bootstrap estimate is obtained by drawing B samples with replacement from the original data set X n. Smaller re-samples {X 1,..., X n 1 } of the size are used. n 1 = n d, 0 < d < 1, The re-sample is divided into l 1 subgroups. The size of subgroups m 1 and m are related by: m = m 1 (n/n 1 ) c, 0 < c < 1, where m 1 is the size of subgroups in re-samples.

31 Empirical bootstrap estimate of the MSE where MSE (l 1, m 1 ) = (ˆb (l 1, m 1 )) 2 + var (l 1, m 1 ) min m 1, ˆb (l 1, m 1 ) = 1 B var (l 1, m 1 ) = 1 B 1 B zl b 1 z l, b=1 ( B zl b 1 1 B b=1 ) 2 B zl b 1 b=1 are the empirical bootstrap estimates of the bias and the variance, zl b 1 = 1 l1 l 1 i=1 k l1i is the group estimator constructed by some re-sample. How to select c and d?

32 Simulation study: the selection of c and d. Asymptotic theory (P.Hall, (1990)) recommends d = 1/2 and c = 2/3 for the bootstrap estimation of the parameter k of the Hill s estimate of γ. We check c = {0.05, 0.1(0.1); 0.5} for a fixed d = 0.5. Relative bias and the square root of the mean squared error: Biasγ = 1 1 N R ˆγ i γ, γ N R i=1 RMSEγ = 1 1 N R (ˆγ i γ) 2 γ N R i=1

33 Conclusions: the best values of c for the fixed d = 0.5 are c = { }; the bias of the group estimator is larger for Weibull distribution. Further research: proof of theoretically best values of c and d for the group estimator using the bootstrap.

34 Rough methods for the detection of heavy tails and the number of finite moments. 1. Ratio of the maximum to the sum Let X 1, X 2,..., X n be i.i.d. r.v.s. We define the following statistic R n (p) = M n(p), n 1, p > 0, where S n (p) S n (p) = X 1 p +... X n p, M n (p) = max ( X 1 p,..., X n p), n 1 to check the moment conditions of the data. For different values of p the plot of n R n (p) gives a preliminary information about the distribution P{ X > x}. Then E X p < follows, if R n (p) is small for large n. For large n a significant difference between R n (p) and zero indicates that the moment E X p is infinite.

35 Rough methods for the detection of heavy tails and the number of finite moments. 2. QQ-plot A QQ-plot (or quantiles against quantiles -plot) draws ( the dependence ( )) { X (k), F n k+1 n+1 : k = 1,..., n}, where X (1)... X (n) are the order statistics of the sample, and F is an inverse function of the DF F. Often, the QQ-plot is built as a dependence of exponential quantiles against the order statistics of the underlying sample. Then F is an inverse function of the exponential DF. Then a linear QQ-plot corresponds to the exponential distribution.

36 Rough methods for the detection of heavy tails and the number of finite moments. 3. Plot of the mean excess function n n e n (u) = (X i u)1{x i > u}/ 1{X i > u} i=1 i=1 1 Heavy-tailed distributions: e(u), u. 2 A Pareto distribution: a linear e(u). 3 An exponential distribution: the constant e(u) = 1/λ. 4 Light-tailed distributions: e(u) 0, u.

37 Rough methods for the detection of heavy tails and the number of finite moments. several distributions. Plots of function e(u) for

38 Rough methods for the detection of heavy tails and the number of finite moments. 4. Hill s and other estimators of the tail index The Hill estimate works bad if 1 the underlying DF does not have a regularly varying tail, 2 the tail index α = 1/γ is not positive, 3 the sample size is not large enough, 4 the tail is not heavy enough, i.e. γ is not big, 5 F R α, since it strongly depends on the slowly varying function l(x) that is usually unknown. The disadvantages of Hill s estimate show that one has to apply several estimates of the tail index to deal with the complex analysis of data.

39 Definition of the independence. Definition Random variables ξ 1, ξ 2,..., ξ n are independent if for any x 1, x 2,..., x n it holds P{ξ 1 x 1, ξ 2 x 2,..., ξ n x n } = P{ξ 1 x 1 }P{ξ 2 x 2 }...P{ξ n x n } or in terms of cumulative distribution functions F(x 1, x 2,..., x n ) = F 1 (x 1 )F 2 (x 2 )...F n (x n ), where F k (x k ) is a distribution function of the r.v. ξ k.

40 Rough methods for the dependence detection. Correlation between two r.v.s X 1 and X 2 is defined by where ρ(x 1, X 2 ) = cov(x 1, X 2 ) var(x1 )var(x 2 ), cov(x 1, X 2 ) = E(X 1 X 2 ) E(X 1 )E(X 2 ) is a covariance, and var(x) is a variance. 1 If X 1 and X 2 are independent then ρ(x 1, X 2 ) = 0. 2 If ρ(x 1, X 2 ) = 0 then not necessary X 1 and X 2 are independent! 3 If Gaussian r.v.s X 1 and X 2 are not correlated then they are independent. 4 For non-gaussian r.v.s it may be not true.

41 Rough methods for the dependence detection. Generally, the covariances and correlations cannot indicate the dependence. 1 They describe the degree to which two r.v.s. are linearly dependent: ρ(x 1, X 2 ) [ 1, 1]. 2 ρ(x 1, X 2 ) = ±1 X 1 and X 2 are perfectly linearly dependent, i.e. X 2 = α + βx 1, for some α R and β 0. What tool could be an appropriate to indicate the dependence?

42 Exact methods for the dependence detection. The mixing conditions of a stationary sequence should be considered for dependence detection. Definition (Rosenblatt (1956b)) The strictly stationary ergodic sequence of random vectors X t is strongly mixing with rate function φ k for σ-field, if sup P(A B) P(A) P(B) = φ k 0, k. A σ(x t,t 0),B σ(x t,t>k) The rate φ k shows how fast the dependence between the past and the future decreases.

43 Exact methods for the dependence detection. Another measure of dependence is given by the dependence index: n β n = sup f j (x, y) f(x)f(y), x,y j=1 where f(x) is a marginal probability density function (PDF) of a stationary sequence {X j, j = 1, 2,...}, f j (x, y) is a joint PDF of X 1 and X 1+j, j = 1, 2,... 1 for i.i.d. sequences β n = 0 for all n, 2 for sequences with strong long range dependence β n may tend to infinity, and 3 in between β n may converge to a finite limit at various rates. It is difficult to estimate such dependence measures by statistical tools.

44 Rough methods for the dependence detection. The autocorrelation function (ACF ) is ρ X (h) = ρ(x t, X t+h ) = E((X t E(X t ))(X t+h E(X t+h ))) /Var (X t ) Let {X t, t = 0, ±1, ±2,...} be a stationary sample series. The standard sample ACF at lag h Z is ρ n,x (h) = n h t=1 (X t X n )(X t+h X n ) n t=1 (X t X n ) 2, where X n = 1 n n t=1 X t represents the sample mean. The accuracy of ρ n,x (h) may be poor if the sample size n is small or h is large. The relevance of ρ n,x (h) is determined by its rate of convergence to the real ACF. When the distribution of the X t s is very heavy-tailed (in the sense that EXt 4 = ), this rate can be extremely slow.

45 The autocorrelation function for heavy-tailed data. For heavy-tailed data with infinite variance it is better to use the modified sample ACF : i.e. without X n. ρ n,x (h) = n h t=1 X tx t+h n t=1 X t 2, However, this estimate may behave in a very unpredictable way if one uses the class of non-linear processes in the sense that this sample ACF may converge in distribution to a non-degenerate r.v. depending on h. For linear models it converges in distribution to a constant depending on h, Davis and Resnick (1985).

46 Confidence intervals of the sample ACF: linear processes. The causal ARMA process (autoregressive moving average) X t has the representation X t = ψ j Z t j, t = 0, ±1,..., (1) j=0 where Z t is an i.i.d. noise sequence and {ψ j } is a sequence of real numbers providing the convergence of random series in (1), i.e., j=0 ψ j <, Brockwell and Davis (1991). The model ARMA(p,q) has the form X t = p j=0 θ jx t j + q j=0 ψ jz t j, t = 1,..., n. The model MA(q) has the form X t = q j=0 ψ jz t j.

47 Confidence intervals of the sample ACF: linear processes. Bartlett s formula. If the process is linear (1), j= ψ j < and EZt 4 <, then standard sample ACF ρ n,x (i) has the asymptotic joint normal distribution with mean ρ X (i) (i.e. ACF) and variance var ( ρ n,x (i) ) = c ii /n, c ii = k= [ρ 2 X (k + i) + ρ X(k i)ρ X (k + i) + 2ρ 2 X (i)ρ2 X (k) 4ρ X (i)ρ X (k)ρ X (k + i)], (2) as n. The Bartlett s formula (2) allows to check the hypothesis ρ n,x (i) = 0.

48 Confidence intervals of the sample ACF. Example For i.i.d. white noise Z t we have ρ Z (0) = 1 and ρ Z (i) = 0 for i 0 (since Z t and Z t+i are independent) and var ( ρ n,z (i) ) = 1/n by (2). For ARMA process driven by such a white noise Z t 1 the sample ACF is approximately normally distributed with mean 0 and variance 1/n for sufficiently large n. 2 It provides 95% confidence interval with the bounds ±1.96/ n for the sample ACF. 3 The hypothesis ρ n,x (i) = 0 is accepted if ρ n,x (i) falls within this interval. What would happened if 1 the noise Z t is not normal, and (or) 2 the process is not linear?

49 Confidence intervals of the heavy-tailed sample ACF ρ n,x (h): linear processes. ARMA process with i.i.d. regularly varying noise and tail index 0 < α < 2 (infinite variance). ρ n,x (h) estimates the quantity j ψ jψ j+h / j ψ2 j that represents the autocorrelation cov(x 0, X h ) in the case of a finite variance. Illusion: heavy-tailed sample ACF ρ n,x (h) can be applied to heavy-tailed processes without problem. Recommendation (Resnick (2006), p.349): For α < 1 use ρ n,x (h); for 1 < α < 2 the classical sample ACF ρ n,x (h). The calculation of confidence intervals in both cases is not easy.

50 Confidence intervals of the heavy-tailed sample ACF ρ n,x (h): nonlinear processes. GARCH(p,q) process (Mikosch (2002)): (Z t ) is an i.i.d. sequence, X t = µ + σ t Z t, t Z, p q σt 2 = α 0 + α i Xt i 2 + β j σt j 2, i=1 σ t and Z t are independent for a fixed t, the mean µ is estimated from the data, particularly µ = 0, α i and β j are non-negative constants. j=1 If the marginal distribution of the time series is very heavy-tailed, namely, the fourth moment is infinite, then the asymptotic normal confidence bounds for the sample ACF are not applicable anymore.

51 Testing of long-range dependence (LRD). Definition A stationary process (X t ) is long range dependent if h=0 ρ X(h) =, where ρ X (h) is the ACF at lag h Z, and short range dependent otherwise. This property implies that even though ρ X (h) s are individually small for large lags, their cumulative effect is important. To detect LRD by statistical procedures one replaces ρ X (h) by the sample ACF ρ n,x (h). The LRD effect is typical for long time series, e.g., several thousand points. One can look at lags 250, 300, 350, etc. Usual short range dependent data sets would show a sample ACF dying after only a few lags and then persisting within the 95% Gaussian confidence window ±1.96/ n.

52 Testing of long-range dependence: Hurst parameter. One can assume that for some constant c ρ > 0 ρ X (h) c ρ h 2(H 1) for large h and H (0.5, 1), (LRD case). The constant H (0.5, 1) is called the Hurst parameter. The closer H is to 1 the slower is the rate of ρ X (h) to zero as h, i.e., the longer is the range of dependence in the time series. Kettani and Gubner (2002): Ĥ n = 0.5 ( 1 + log 2 (1 + ρ n,x (1)) ).

53 Dependence detection by bivariate data Measures Rank coefficient Kendall s τ can be estimated by the sample: ρ τ = 2S τ n(n 1), where S τ = n n i=1 j=i+1 sign ( r j r i ), where r i is the order number of the individuum by the second property with the number i by the first property. Rank coefficient Spearman s ρ can be estimated by the sample: ρ S = 1 6S ρ n 3 n, where S ρ = Pickands dependence function A(t). n (r i i) 2 i=1 ρ τ and ρ S can be represented by means of A(t).

54 Pickands dependence function Let (X 1, Y 1 ),...,(X n, Y n ) be a bivariate i.i.d. random sample with a bivariate extreme value distribution G(x, y). Example: X 1 is a TCP-flow file size, Y 1 is a duration of its transmission. Similarly to univariate case it implies, that there exist normalizing constants a j,n > 0 and b j,n R, j = 1, 2 such that as n, P{ ( M 1,n b 1,n ) /a1,n x, ( M 2,n b 2,n ) /a2,n y} (3) = F n ( a 1,n x + b 1,n, a 2,n y + b 2,n ) G(x, y), where M 1,n = max (X 1,..., X n ), M 2,n = max (Y 1,..., Y n ) are the component-wise maxima. The vector (M 1,n, M 2,n ) will in general not be presence in the original data.

55 Pickands dependence function. Definition Let F 1 (x) and F 2 (y) be the DFs of X and Y. Let G j (x), j = 1, 2 be a univariate extreme value DF and F j (x) is in its domain of attraction. G(x, y) may be determined by margins G 1 (x) and G 2 (y) by the representation ( ( )) log (G2 (y)) G(x, y) = exp log (G 1 (x)g 2 (y)) A, log (G 1 (x)g 2 (y)) where A(t), t [0, 1], is the Pickands dependence function.

56 Pickands dependence function: properties. In the bivariate case the function A(t) satisfies two properties: 1 (1 t) t A(t) 1, t [0, 1], i.e. A(0) = A(1) = 1 and lies inside the triangle determined by points (0, 1), (1, 1) and (0.5, 0.5); 2 A(t) is convex. Case A(t) 1 corresponds to a total independence and A(t) = (1 t) t corresponds to a total dependence.

57 A-estimators Capéraà et al. (1997): ( Â HT n (t) = (1/n) ( )) 1 n ˆξ min i /ξ n 1 t, ˆη i/η n, t i=1 Hall and Tajvidi (2000): log ÂC n (t) = 1 n t 1 n n i=1 ) log max (t ˆξ i, (1 t)ˆη i n log ˆξ i (1 t) 1 n i=1 n log ˆη i. Here ˆξ i = log Ĝ1(X i ) and ˆη i = log Ĝ2(Y i ), i = 1,..., n, ξ n = n 1 n ˆξ i=1 i, η n = n 1 n i=1 ˆη i. i=1

58 Problems of A-estimators 1 The estimators are not convex. They may be improved by taking a convex hull. 2 The margins G 1 (x) and G 2 (x) are unknown. One has to replace them by their estimates Ĝ1(x) and Ĝ2(x), e.g., by empirical DFs constructed by component-wise maxima over blocks of data. The amount of these maxima may be very moderate that may reflect on the accuracy of an empirical DF. 3 The component-wise maxima may be not observable together, i.e. there are such pairs of maxima which do not presence in the sample.

59 Bivariate quantile curve Both estimates of A(t) allow to get the estimate Ĝ(x, y) and construct bivariate quantile curves Q(Ĝ, p) = {(x, y) : Ĝ(x, y) = p}, 0 < p < 1, (4) ) (Ĝ2 Ĝ(x, y) exp log (Ĝ1 (x)ĝ2(y)) Â log (y) ), log (Ĝ1 (x)ĝ2(y) assuming that Ĝ1(x) = p (1 w)/â(w) and Ĝ2(x) = p w/â(w) for some w [0, 1] in order to get Ĝ(x, y) = p, Beirlant et al. (2004). The quantile curve consists of the points ) Q(Ĝ, p) = { (Ĝ 1 1 (p(1 w)/â(w) ), Ĝ 1 2 (pw/â(w) ) : w [0, 1]}.

60 Web-traffic analysis Characteristics of sub-sessions: the size of a sub-session (s.s.s); the duration of a sub-session (d.s.s.). Characteristics of the transferred Web-pages: the size of the response (s.r.); the inter-response time (i.r.t.).

61 Description of the Web data Main statistical characteristics of the Web-data. s.s.s.(b) d.s.s.(sec) s.r.(b) i.r.t.(sec) Sample Size Mini mum Maxi mum Mean StDev

62 Results of the Web traffic analysis. Estimation of the tail index for Web-traffic characteristics. r.v. c m b γl b γ p l ˆγ n,k H s.s.s s.r i.r.t d.s.s

63 Results of the Web traffic analysis. Notations to the previous table: γl b is the group estimator with the bootstrap selected parameter m (m b ); γ p l is the group estimator with the plot selected parameter m; ˆγ H n,k k. is the Hill s estimator with the plot selected parameter

64 The EVI estimation by the Hill s estimator and the group estimator γ l for the data sets size of sub-sessions (left) and duration of sub-sessions (right).

65 The EVI estimation by the Hill s estimator and the group estimator γ l for the data sets inter-response times (left) and size of responses (right).

66 Conclusions from analysis of the tail index: 1 the distributions of considered Web-traffic characteristics are heavy-tailed; 2 at least βth moments, β 2 of the distribution of the s.s.s., s.r., d.s.s. are not finite; 3 the distribution of i.r.t. has two finite moments; 4 it might be possible for s.s.s. (when 1 < ˆγ < 2) that α < 1 and the expectation could be also not finite.

67 Results of the Web traffic analysis. Analysis of values R n (p). the values R n (p) are dramatically large for large n and p 2, e.g., in the case of the duration of sub-sessions and n = 350 ln R n (p) 10 for p = 2 and ln R n (p) 10 3 for p = 3. Hence, one may conclude that all moments of the considered r.v.s apart from the first one are not finite.

68 n ln R n (p) of the duration of sub-sessions (left) and the size of sub-sessions (right) for a variety of p-values.

69 n ln R n (p) of the inter-response times (left) and the size of responses (right) for a variety of p-values.

70 Results of the Web traffic analysis. Analysis of the mean excess function. the plots u e n (u) tend to infinity for large u implying heavy tails. These plots are close to a linear shape for all sets of data. The latter implies that the considered distributions can be modelled by a DF of a Pareto type.

71 Exceedance e n (u) against the threshold u for the duration of sub-sessions (left) and the size of sub-sessions (right).

72 Exceedance e n (u) against the threshold u for the inter-response times (left) and the size of responses (right).

73 Results of the Web traffic analysis. Analysis of QQ-plots. Two versions of QQ-plots are shown. Right top plots show, that the distribution of the d.s.s. is close to lognormal and Pareto distributions; distributions of s.s.s., i.r.t. and s.r. are close to Pareto. Left top plots show that the exponential distribution cannot be used as an appropriate model for these r.v.s.; distributions of d.s.s., s.s.s., i.r.t. and s.r. are close to a Generalized Pareto distribution with the DF { 1 (1 + γx/σ) Ψ σ,γ (x) = 1/γ, γ 0, 1 exp ( x/σ), γ = 0, with different values of the parameters γ and σ. The QQ-plot does not give a unique model to fit the underlying distribution.

74 QQ-plots for the duration of sub-sessions. Left: exponential quantiles (top) and quantiles of the GPD(0.3;1) (bottom) against d.s.s./s (the linear curves correspond to the QQ-plot of the same distributions). Right: Empirical DF s of the r.v. U i = F(X i ). Exponential, Pareto, Weibull, lognormal and normal distributions are used as models of F. The linear curve corresponds to the exponential distribution.

75 QQ-plot for the size of sub-sessions. Left: exponential quantiles (top) and quantiles of the GPD(1;0.015) (bottom, first plot) and the GPD(0.05;0.3) (bottom, second plot) against s.s.s./s (the linear curves correspond to the QQ-plot of the same distributions). Right: Empirical DFs of the r.v. U i = F(X i ). Exponential, Pareto, Weibull, lognormal and normal distributions are used as the models of F. The linear curve corresponds to the exponential distribution.

76 QQ-plots for the inter-response times. Left: exponential quantiles (top) and quantiles of the GPD(0.8;0.015) (bottom) against i.r.t./s (the linear curves correspond to the QQ-plot of the same distributions). Right: Empirical DFs of the r.v. U i = F(X i ). Exponential, Pareto, Weibull, lognormal and normal distributions are used as models of F. The linear curve corresponds to the exponential distribution.

77 QQ-plot for the size of responses. Left: exponential quantiles (top) and quantiles of the GPD(1;0.015) (bottom) against s.r./s (the linear curves correspond to the QQ-plot of the same distributions). Right: Empirical DFs of the r.v. U i = F(X i ). Exponential, Pareto, Weibull, lognormal and normal distributions are used as the models of F. The linear curve corresponds to the exponential distribution.

78 Results of the Web traffic analysis. ACF analysis of Web-data. The previous analysis shows that the considered Web data are heavy-tailed with infinite variance. Therefore, the application of formula is relevant. ρ n,x (h) = n h t=1 X tx t+h n t=1 X 2 t

79 Summary results of the preliminary analysis Comparison of the recommended methods for Web traffic data Amount of finite moments Type of distribution r.v. R n (p) Hill & Group QQ-plot e n (u) estimator s.s.s. 1 1 GPD(0.015; 1), Pareto (B) GPD(0.05; 0.3) -like d.s.s. 1 1 GPD(1; 0.3), Pareto (sec) lognormal -like s.r. 1 1 GPD(0.015; 1) Pareto (B) -like i.r.t. 1 2 GPD(0.015; 0.8) Pareto (sec) -like

80 Testing of dependence ACF estimation by the modified sample ACF and the standard sample ACF for the data sets s.s.s. (first two plots left), d.s.s. (last two plots right). The dotted horizontal lines indicate 95% asymptotic confidence bounds (±1.96/ n) corresponding to the ACF of i.i.d. Gaussian r.v.s.

81 Testing of dependence ACF estimation by the modified sample ACF and the standard sample ACF for the data sets i.r.t. (first two plots left), s.r. (last two plots right). The dotted horizontal lines indicate 95% asymptotic confidence bounds (±1.96/ n) corresponding to the ACF of i.i.d. Gaussian r.v.s.

82 Hurst parameter estimation for Web traffic data Table: Hurst parameter estimation for Web traffic. Data s.s.s. d.s.s. i.r.t. s.r. Ĥ n Main conclusion: all data sets are heavy-tailed and not long-range dependent.

83 TCP-flow analysis We observe TCP-flow sizes S and durations D gathered from one source destination pair. Motivation is to estimate the distribution of the maximal rate (or throughput) R = S/D and the expected throughput R (or S/D) that the transport system provides.

84 Description of the TCP-flow data The analyzed data consist of TCP-flow sizes and durations of transmissions have been measured from the mobile network of the Finnish operator Elisa; mobile TCP connections from periods of low, average and high network load conditions; TCP flows on port 80 (a WWW (HTTP) application). The number of analyzed flows is and, for practical reasons, we consider 61 disjoint bivariate samples, each of size n =

85 Description of the TCP-flow data Statis Unit Definition Sample Mean Sample Variance tic Min Max Min Max Size kb Content Transmitted Dura sec SYN-FIN tion The results, [min,max] ranges over all 61 samples. Content refers to the size of the downloaded web content and Transmitted means Content plus segments retransmitted by TCP. Both are measures of the size of a flow. SYN-FIN means from the three-way handshaking (synchronization) to finish.

86 Results of the TCP-flow data analysis Table: Estimation of the EVI for flow sizes ( Content and Transmitted ) and durations ( SYN-FIN ) γ H (n, k) γ l ˆγ M (n, k) ˆγ UH (n, k) Min Max Min Max Min Max Min Max Content Transmit ted SYN-FIN

87 Conclusions from analysis of the tail index for TCP-flow data: 1 the distributions of TCP-flow size and duration are heavy-tailed; 2 all estimators apart of the group estimator γ l indicate that the flow sizes samples (both content and transmitted) may have infinite variance under the assumption that their distributions are regularly varying; 3 some samples of flow durations may have two finite first moments.

88 Results of the TCP-flow data analysis Table: Comparison of the rough methods for TCP-flow data Amount of first finite Type of distribution moments R n (p) Estimators of γ QQ-plot e n (u) Content 1 2 or 1 GPD(1, 1.3) Paretolike Transmit 1 2 or < 1 GPD(1, 1.3) Paretoted like SYN-FIN < 1 2 or < 1 GPD(1, 0.85) Paretolike

89 Testing of dependence ACF estimation by the modified sample ACF and the standard sample ACF of one sub-sample (n = 1000) of the TCP-flow sizes (first two plots left), and durations (last two plots right). The horizontal lines indicate 95% asymptotic confidence bounds (±1.96/ n). Conclusions: The TCP-flow sizes may be independent. The ACFs of the TCP-flow durations have three clusters that may indicate the dependence.

90 Testing of long-range dependence Table: Hurst parameter estimation for Web traffic and TCP-flow data Data TCP flow size TCP flow duration Ĥ n Main conclusion: TCP-flow size and duration data sets are heavy-tailed and not long-range dependent.

91 Problems of throughput investigation The distributions of both S and D are heavy-tailed and their expectations may not be finite. Thus, ES/D may be not computable. Since S and D are dependent and positive, then the DF of the ratio R = S/D is defined by F R (x) = P{S/D x} = = zx 0 0 zx 0 0 df(y, z), f(y, z)dydz f(y, z) is a joint PDF of S and D, and its expectation by ER = if the latter integral converges. 0 xdf R (x),

92 Bivariate analysis of TCP-flow data First, we check the dependence between the pairs (S 1, D 1 ),...,(S n, D n ) to apply (3). For this purpose, we can calculate the ACF of the r.v.s r i = Si 2 + Di 2, i = 1,..., n. The sample ACF of {r i } is small in absolute value at all lags (possible exception are two lags that do not persist within 95% confidence interval). One may suppose that the sizes-duration pairs are independent.

93 Bivariate analysis of TCP-flow data Block Maximaof Duration (hours) Block Maximaof Size (MB) Scatter plot of pairs of block maxima (M j S,m, Mj D,m ), j = 1,...,610, when the block size is m = The pairs that are presented in the initial sample are marked by dots as far as the pairs that are not presented (e.g., the maximal size in the group does not necessarily correspond to the maximal duration in this group) by circles. Lines D = S/384 and D = S/42 indicate 384 kb/s (EDGE) and 42 kb/s (GPRS) access rates.

94 Testing of dependence of the block maxima ACF ACF Lag Lag Estimates of standard sample ACF of the both maxima samples of size 61 corresponding to TCP-flow sizes (left), and durations (right). The dotted horizontal lines indicate 95% asymptotic confidence bounds ±1.96/ n.

95 Testing of dependence of the block maxima ACF 0 ACF Lag Lag Estimates of standard sample ACF of the both maxima samples of size 610 corresponding to TCP-flow sizes (left), and durations (right). The dotted horizontal lines indicate 95% asymptotic confidence bounds ±1.96/ n.

96 Testing of distribution of the block maxima The Generalized Extreme Value (GEV) distribution { ( exp( (1 + γ x µ ) σ ) 1/γ ), γ 0 H γ (x) = x µ exp( e ( σ ) ), γ = 0. is applied as a model of the block maxima distribution. Maximum likelihood estimates of GEV parameters by block maxima of size 610 of TCP-flow data Statistic Definition γ µ σ Size Content Duration SYN-FIN

97 Testing of distribution of the block maxima Quantilesof GEV Quantilesof GEV Sorted Sample Sorted Sample QQ-plots of block maxima samples corresponding to TCP-flow sizes (left) and durations (right).

98 Testing of dependence of the TCP-flow data A(t) A(t) t t The estimation of the Pickands dependence function by estimators ÂC n (t) (dashed line) and ÂHT n (t) (solid line). The maxima sample of size 61 (left) and of size 610 (right). The marginal distributions G 1 (x) and G 2 (x) of TCP-flow sizes and durations are estimated by GEV. Conclusions: TCP-flow size and duration are dependent.

99 Bivariate quantile curves of the TCP-flow data Using estimates of A(t) one can construct bivariate quantile curves of TCP-flow data by (4) p = Flow Duration p = 0.9 p = 0.75 p = 0.95 Flow Duration p = 0.9 p = Flow Size 0 p = Flow Size Estimated quantile curves of TCP-flow data for p {0.75, 0.9, 0.95} corresponding to estimator ÂC n (t): the maxima sample of size 61 (left), of size 610 (right).

100 Conclusions from bivariate analysis of TCP-flow data The analysis is made from samples of moderate size. Size S and duration D are heavy-tailed with probably infinite second moment. Their distributions are complicated in the sense that they do not belong to any known parametric models. Estimates of the Pickands dependence function show that S and D are dependent. Bivariate quantile curves show that the bivariate extreme value distribution of (S, D) is not quite heavy-tailed in the sense that not many observations fall in the outliers area, i.e. beyond the 97.5% quantile curve. This can be a special property of this mobile TCP data. Bivariate quantile curves are sensitive to, at least, 1 estimation of parameters of margins of G(x, y) and estimates of A(t) and 2 the amount of component-wise maxima, or to the block size.

101 Notes Considering the real data three items have to be investigated: 1 the preliminary detection of heavy tails; 2 the dependence structure of univariate data; 3 the dependence structure of multivariate data.

102 Reference: 1. Markovich, N.M. and Krieger, U.R. (2005) Statistical Inspection and Analysis Techniques for Traffic Data Arising from the Internet, HETNETs 04 special journal issue on Convergent Multi-Service Networks and Next Generation Internet: Performance Modelling and Evaluation pp.72/1-72/9. 2. Beirlant, J., Goegebeur, Y., Teugels, J. and Segers, J. (2004) Statistics of Extremes: Theory and Applications. Wiley, Chichester, West Sussex. 3. Davis, R. and Resnick, S. (1985) Limit theory for moving averages of random variables with regularly varying tail probabilities. Ann. Probability 13, pp Embrechts, P., Klüppelberg, C. and Mikosch, T. (1997) Modeling Extremal Events. Springer, Berlin. 5. Brockwell P.J. and Davis R.A. (1991) Time series: Theory and Methods, 2nd edition. Springer, New York.

103 Reference: 6. Kettani, H. and Gubner, J.A. (2002) A novel approach to the estimation of the hurst parameter in self-similar traffic. In: Proceedings of IEEE conference on local computer networks, Tampa, Florida. 7. Resnick, S.I. (2006) Heavy-Tail Phenomena. Probabilistic and Statistical Modeling. Springer, New York.

Analysis methods of heavy-tailed data

Analysis methods of heavy-tailed data Institute of Control Sciences Russian Academy of Sciences, Moscow, Russia February, 13-18, 2006, Bamberg, Germany June, 19-23, 2006, Brest, France May, 14-19, 2007, Trondheim, Norway PhD course Chapter