Analysis methods of heavy-tailed data

Institute of Control Sciences, Russian Academy of Sciences, Moscow, Russia. PhD course: February 13-18, 2006, Bamberg, Germany; June 19-23, 2006, Brest, France; May 14-19, 2007, Trondheim, Norway.

Chapter 5. Estimation of high quantiles, endpoints, excess functions. This chapter considers several classical methods for quantile estimation and presents methods for estimating high quantiles, endpoints, and excess functions of heavy-tailed distributions. An application to WWW-traffic data is considered.

Estimation of high quantiles. Main assumption. Let $X_1,\dots,X_n$ be a sample of i.i.d. r.v.s distributed with the unknown CDF $F(x)$. Definition. The $(1-p)$th quantile of $F$ for a given value of $p$ is the solution $x_p$ of the equation $F(x) = 1 - p$. A high quantile corresponds to a probability $p$ close to 0 (e.g. $p = 0.01$, $p = 0.001$).

Main problem: properties of heavy-tailed distributions. Observations in the tail domain are sparse. The behavior of the true CDF $F(x)$ beyond the sample is unknown. The empirical CDF $F_n(x) = 1$ for $x > X_{(n)}$, where $X_{(n)}$ is the maximal observation in the sample.

Classical approach: the empirical distribution function $F_n(x)$ is used to estimate quantiles. But $F_n(x) = 1$ for $x \ge X_{(n)}$. Hence, for $p < 1/n$ it is impossible to estimate the quantiles without knowledge of the behavior of $F$ at infinity. Weighted estimates: $x_p = X_{([np]+1)}$ and $x_p = (1-g)X_{(j)} + gX_{(j+1)}$, where $j = [np]$ and $g = np - j$ (Dielman et al., 1994).
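The two empirical estimates above can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the function names are hypothetical, the argument `q` is read as the quantile level that multiplies $n$ in $[np]$ (an assumption about the notation), and the clamping of indices at the sample boundaries is an added convention that the formulas do not specify.

```python
import math


def order_statistic_quantile(sample, q):
    """Estimate X_([nq]+1); q is assumed to be the quantile level."""
    xs = sorted(sample)
    n = len(xs)
    j = math.floor(n * q)              # [nq], the integer part
    return xs[min(j, n - 1)]           # X_([nq]+1) in 1-based notation, clamped


def weighted_quantile(sample, q):
    """Estimate (1 - g) X_(j) + g X_(j+1) with j = [nq], g = nq - j."""
    xs = sorted(sample)
    n = len(xs)
    j = math.floor(n * q)
    g = n * q - j
    lower = xs[max(j, 1) - 1]          # X_(j), clamped to X_(1) (assumption)
    upper = xs[min(j + 1, n) - 1]      # X_(j+1), clamped to X_(n) (assumption)
    return (1 - g) * lower + g * upper


data = list(range(1, 11))              # toy sample 1, 2, ..., 10
print(order_statistic_quantile(data, 0.25), weighted_quantile(data, 0.25))
```

Neither function can return a value above the sample maximum, which is the limitation the slide points out for levels closer to 1 than $1/n$.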

Results of the extreme value theory. The main assumption is that $F(x)$ behaves beyond the sample according to the limit distribution of the maxima $M_n = \max(X_1,\dots,X_n)$ of an i.i.d. sample, i.e.

$$P\{(M_n - b_n)/a_n \le x\} = F^n(b_n + a_n x) \to H_\gamma(x), \quad n \to \infty, \quad x \in \mathbb{R}, \qquad (1)$$

holds for some real numbers $a_n > 0$ and $b_n$, and the distribution $H_\gamma(x)$ may be only one of three types (Gnedenko, 1943)

$$H_\gamma(x) = \begin{cases} \exp\left(-(1+\gamma x)^{-1/\gamma}\right), & \gamma \ne 0, \\ \exp\left(-e^{-x}\right), & \gamma = 0, \end{cases} \qquad (2)$$

where $\gamma$ is the extreme value index (EVI).

Results of the extreme value theory. To satisfy (1) it is necessary and sufficient (Embrechts et al., 1997) that

$$\lim_{n\to\infty} n\,(1 - F(b_n + a_n x)) = -\ln H_\gamma(x), \quad x \in \mathbb{R}, \quad a_n > 0, \quad b_n \in \mathbb{R}.$$

Then from (2) and for $\gamma \ne 0$ it holds that

$$\lim_{t\to\infty} t\,(1 - F(a(t)x + b(t))) = (1+\gamma x)^{-1/\gamma}.$$

For the $p$th quantile the following approximation may be used (Dekkers et al., 1989):

$$\hat{x}_p^d = \hat{b}(n/k) + \hat{a}(n/k)\,\frac{(k/(pn))^{\hat\gamma} - 1}{\hat\gamma}.$$
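The step from the limit relation to the quantile approximation can be made explicit. The following is a sketch of the usual argument, under the assumption that $t = n/k$ is large enough for the limit to be used as an approximation, and writing the quantile as $x_p = a(n/k)\,x + b(n/k)$:

```latex
\begin{align*}
p = 1 - F(x_p) &\approx \frac{k}{n}\,(1+\gamma x)^{-1/\gamma}
  && \text{(limit relation with } t = n/k\text{)} \\
1 + \gamma x &\approx \left(\frac{k}{pn}\right)^{\gamma} \\
x &\approx \frac{(k/(pn))^{\gamma} - 1}{\gamma} \\
x_p &\approx b(n/k) + a(n/k)\,\frac{(k/(pn))^{\gamma} - 1}{\gamma}
\end{align*}
```

Replacing $a(n/k)$, $b(n/k)$ and $\gamma$ by their estimates gives the estimator attributed to Dekkers et al. (1989).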

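Models of the distribution function $F(x)$. The Generalized Pareto Distribution (GPD):

$$\Psi_{\sigma,\gamma}(x) = \begin{cases} 1 - (1 + \gamma x/\sigma)^{-1/\gamma}, & \gamma \ne 0, \\ 1 - \exp(-x/\sigma), & \gamma = 0, \end{cases}$$

where $\sigma > 0$ and $x \ge 0$ for $\gamma \ge 0$; $0 \le x \le -\sigma/\gamma$ for $\gamma < 0$.

Pareto-type tail model (Hall, 1990):

$$1 - F(x) = c\,x^{-1/\gamma}\left(1 + d\,x^{-\beta} + o(x^{-\beta})\right),$$

where $\gamma, \beta > 0$, $c > 0$, $-\infty < d < \infty$.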

Examples of high quantile estimators. Peaks-over-threshold (POT) method (McNeil, 1997): the GPD is used as the distribution of excesses over a high threshold $u$,

$$x_p^{POT} = u + \frac{\hat\sigma}{\hat\gamma}\left(\left(\frac{p}{1 - F_n(u)}\right)^{-\hat\gamma} - 1\right),$$

where $\hat\sigma$ and $\hat\gamma$ are estimates of the parameters of the GPD and $u$ is a suitable pilot threshold (software EVIS).

Weissman's estimate (Weissman, 1978): the Pareto tail model is used,

$$x_p^w = X_{(n-k)}\left(\frac{k+1}{(n+1)p}\right)^{\hat\gamma}, \quad k = 1,\dots,n-1.$$
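The POT and Weissman formulas can be evaluated directly once the parameters are available. The sketch below only plugs numbers into the two formulas; it does not fit the GPD or estimate $\gamma$, and the function and argument names (`pot_quantile`, `weissman_quantile`, `tail_fraction` for $1 - F_n(u)$) are hypothetical.

```python
def pot_quantile(u, sigma, gamma, p, tail_fraction):
    """POT quantile u + sigma/gamma * ((p / (1 - F_n(u)))**(-gamma) - 1), gamma != 0."""
    if gamma == 0:
        raise ValueError("this sketch covers only the case gamma != 0")
    return u + sigma / gamma * ((p / tail_fraction) ** (-gamma) - 1.0)


def weissman_quantile(sample, k, p, gamma):
    """Weissman quantile X_(n-k) * ((k + 1) / ((n + 1) p))**gamma."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    threshold = xs[n - k - 1]          # X_(n-k) in 1-based order-statistic notation
    return threshold * ((k + 1) / ((n + 1) * p)) ** gamma


# Toy numbers chosen so that the arithmetic is easy to follow by hand.
print(pot_quantile(u=10.0, sigma=2.0, gamma=0.5, p=0.01, tail_fraction=0.04))
print(weissman_quantile(range(1, 10), k=4, p=0.125, gamma=0.5))
```

In the first call $p/(1 - F_n(u)) = 0.25$, so the estimate is $10 + 4\,(0.25^{-0.5} - 1) = 14$; in the second $X_{(5)} = 5$ and the multiplier is $4^{0.5} = 2$, giving 10.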

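Examples of high quantile estimators (continued). Markovich, Krieger (2002):

$$x_p^c = X_{(n-k)}\left(-0.5 + \sqrt{0.25 + \frac{pn\,c(\hat\gamma)}{k}}\right)^{-\hat\gamma}, \quad k = 1,\dots,n-1,$$

where

$$c(\gamma) \approx 1 + X_{(n-k)}^{-1/\gamma} + X_{(n-k)}^{-2/\gamma}$$

is the normalizing multiplier.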

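Common problem of high quantile estimates: the influence of $k$ (through $\hat\gamma$) on the accuracy of the estimation. When $k$ increases ($X_{(n-k)}$ becomes small), $\mathrm{Var}(\hat\gamma)$ decreases and $\mathrm{bias}(\hat\gamma)$ increases. When $k$ decreases (fewer data are used), $\mathrm{Var}(\hat\gamma)$ increases and $\mathrm{bias}(\hat\gamma)$ decreases.

Choice of $k$:
$\min E(\hat{x}_p(k) - x_p)^2$;
$\min \mathrm{as.E}(\hat{x}_p(k) - x_p)^2$ (or its bootstrap version);
$\min \mathrm{as.E}\left(\log\frac{\hat{x}_p(k)}{x_p}\right)^2$ (or its bootstrap version);
$\min E\left(\bar{F}_\theta(\bar{F}^{-1}(p)) - p\right)^2$ ($\bar{F}_\theta$ is some estimate of the tail).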

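Distribution of high quantile estimators. Theorem. Let the tail distribution be of Pareto type,

$$1 - F(x) = c\,x^{-1/\gamma}\left(1 + d\,x^{-\beta} + o(x^{-\beta})\right),$$

where $\gamma, \beta > 0$, $c > 0$, $-\infty < d < \infty$, and let $k \to \infty$, $k/n \to 0$, $p = p_n \le c\,k/n \to 0$, $c > 0$, as $n \to \infty$. Then

$$\frac{\log(x_p^w/x_p) - a}{\delta_w} \to_d N(0,1), \qquad \frac{\log(x_p^c/x_p) - \left(a + ((k+1)/n)^{\gamma}\right)}{\delta_c} \to_d N(0,1),$$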

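where

$$a = \gamma d c^{-\gamma\beta}\left(\left(\frac{k+1}{n}\right)^{\gamma\beta} - p^{\gamma\beta}\right),$$

$$\delta_w^2 = \frac{\gamma^2}{k}\left(\left(1 - \gamma\beta d c^{-\gamma\beta}\left(\frac{k+1}{n}\right)^{\gamma\beta}\right)^2 + \left(\log\frac{k}{np}\right)^2\right),$$

$$\delta_c^2 = \delta_w^2 + \frac{\gamma^2}{k}\left(\frac{k+1}{n}\right)^{\gamma}\left(\left(\frac{k+1}{n}\right)^{\gamma} - 2\left(1 - \gamma\beta d c^{-\gamma\beta}\left(\frac{k+1}{n}\right)^{\gamma\beta}\right)\right).$$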

Theorem 2 shows that the expectation of $\log(x_p^c/x_p)$ is larger than that of $\log(x_p^w/x_p)$, while its variance is smaller. The difference becomes negligible as the sample size increases.

Estimation of the EVI $\gamma$. The high quantile estimators $x_p^c$ and $x_p^w$ require the choice of $k$ (the number of the largest order statistics). The choice of $k$ is also required to estimate the EVI $\gamma$.

Some $\gamma$ estimators. Hill's estimator:

$$\hat\gamma^H(n,k) = \frac{1}{k}\sum_{i=1}^{k}\log X_{(n-i+1)} - \log X_{(n-k)},$$

where $X_{(1)} \le \dots \le X_{(n)}$. Other estimates: the moment estimator, Pickands' estimator, the ratio estimator, the UH-estimator.

The value $k$ has to minimize the mean squared error:

$$E(\hat{x}_p(k) - x_p)^2 = \mathrm{bias}(\hat{x}_p)^2 + \mathrm{variance}(\hat{x}_p) \to \min_k.$$
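Hill's estimator is short enough to write out directly. The following is a minimal sketch using only the standard library; the function name `hill_estimator` and the simulated Pareto sample (shape 2, so the true $\gamma$ is $1/2$) are illustrative choices, and the value of `k` in the example is picked arbitrarily rather than by any of the selection rules discussed here.

```python
import math
import random


def hill_estimator(sample, k):
    """Hill's estimate: mean of log X_(n-i+1), i = 1..k, minus log X_(n-k)."""
    xs = sorted(sample)
    n = len(xs)
    if not 1 <= k <= n - 1:
        raise ValueError("k must satisfy 1 <= k <= n - 1")
    if xs[n - k - 1] <= 0:
        raise ValueError("the k + 1 largest observations must be positive")
    log_threshold = math.log(xs[n - k - 1])                  # log X_(n-k)
    mean_top_logs = sum(math.log(x) for x in xs[n - k:]) / k
    return mean_top_logs - log_threshold


# Simulated data: random.paretovariate(2.0) has tail x**(-2), i.e. gamma = 1/2.
rng = random.Random(0)
pareto_sample = [rng.paretovariate(2.0) for _ in range(5000)]
gamma_hat = hill_estimator(pareto_sample, k=500)
print(gamma_hat)
```

Repeating the last call for a range of `k` values shows the dependence of $\hat\gamma$ on $k$ that motivates the selection problem.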

Classical bootstrap: re-samples of the same size $n$ as the sample $X^n$ are used. For linear estimates (linear regressions, kernel estimators of the probability density) the classical bootstrap estimate of the bias is equal to zero regardless of the true bias of the estimate.

Non-classical bootstrap: re-samples of smaller size $n_1 = n^{\beta}$, $0 < \beta < 1$, than $n$ are used.

Non-classical bootstrap. The bootstrap estimate of the mean squared error of the high quantile estimate is

$$MSE^*(n_1,k_1) = E\left\{\left(\hat{x}_p^*(n_1,k_1) - \hat{x}_p(n,k)\right)^2 \mid X^n\right\} = \left(b^*(n_1,k_1)\right)^2 + \mathrm{var}^*(n_1,k_1) \to \min_{k_1}.$$

The estimate $\hat{x}_p^*(n_1,k_1)$ is the quantile estimate with parameter $k_1$, constructed from the re-sample $X^{n_1}$ of size $n_1 < n$. The relation between $k$ and $k_1$ is $k = k_1(n/n_1)^{\alpha}$, $0 < \alpha < 1$.

How should $\alpha$ and $\beta$ be chosen? The values $\alpha = 2/3$, $\beta = 1/2$ lead to the most accurate results when a bootstrap estimate of the parameter $k$ of Hill's estimate $\hat\gamma^H(n,k)$ is considered (Hall, 1990).

Alternatives to the bootstrap: ML+POT, ML+blocks method.
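The selection of $k$ by the smaller-sample bootstrap can be sketched as follows. Several details are assumptions made for illustration and are not taken from the slides: the quantile estimate is Weissman's formula with Hill's $\hat\gamma$, re-samples are drawn with replacement, $k$ is rounded to the nearest integer, and all function names are hypothetical. The data are assumed to be positive.

```python
import math
import random


def hill(xs_sorted, k):
    """Hill's estimate of gamma from a sorted, positive sample."""
    n = len(xs_sorted)
    log_threshold = math.log(xs_sorted[n - k - 1])           # log X_(n-k)
    return sum(math.log(x) for x in xs_sorted[n - k:]) / k - log_threshold


def weissman(xs_sorted, k, p):
    """Weissman's quantile estimate with Hill's gamma plugged in."""
    n = len(xs_sorted)
    return xs_sorted[n - k - 1] * ((k + 1) / ((n + 1) * p)) ** hill(xs_sorted, k)


def bootstrap_choose_k(sample, p, alpha=2 / 3, beta=0.5, n_resamples=200, seed=0):
    """Return (k, k1, mse) minimising the bootstrap MSE over k1 = 1..n1-1."""
    rng = random.Random(seed)
    xs = sorted(sample)
    n = len(xs)
    n1 = max(3, round(n ** beta))                 # re-sample size n1 = n**beta
    scale = (n / n1) ** alpha                     # k = k1 * (n / n1)**alpha
    best = None
    for k1 in range(1, n1):
        k = min(n - 1, max(1, round(k1 * scale)))
        full_estimate = weissman(xs, k, p)        # x_p estimate on the full sample
        squared_errors = 0.0
        for _ in range(n_resamples):
            resample = sorted(rng.choices(xs, k=n1))   # drawn with replacement
            squared_errors += (weissman(resample, k1, p) - full_estimate) ** 2
        mse = squared_errors / n_resamples
        if best is None or mse < best[0]:
            best = (mse, k1, k)
    mse, k1, k = best
    return k, k1, mse
```

For example, `bootstrap_choose_k(data, p=0.01)` on a positive sample `data` of size 400 uses re-samples of size 20 and searches $k_1 = 1,\dots,19$. The defaults `alpha=2/3` and `beta=0.5` are the values quoted above for Hill's estimate; whether they remain appropriate for the quantile estimate itself is not settled by this sketch.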

Estimation of confidence intervals. Theorem 2 does not allow one to construct asymptotic confidence intervals, because the parameters of the Hall-type distribution ($\gamma$, $\beta$, $c$, $d$) are unknown.

Non-asymptotic confidence intervals. Assume that the estimates of the $p$th quantile $\hat{x}_p^1,\dots,\hat{x}_p^{N_R}$ are normally distributed, where $N_R$ is the number of samples. Tolerance limits of confidence intervals for finite samples are defined by

$$\left(\mathrm{Mean}(\hat{x}_p) - k\,\mathrm{StDev}(\hat{x}_p);\ \mathrm{Mean}(\hat{x}_p) + k\,\mathrm{StDev}(\hat{x}_p)\right),$$

where $\mathrm{Mean}(\hat{x}_p)$ and $\mathrm{StDev}(\hat{x}_p)$ are the empirical mean and the standard deviation of the $N_R$ estimates $\hat{x}_p^1,\dots,\hat{x}_p^{N_R}$.

The confidence interval is constructed in such a way that the $(1-p)$th part of the distribution falls into this interval with probability $P$:

$$k = k_\infty\left(1 + \frac{t_p}{\sqrt{2N_R}} + \frac{5t_p^2 + 10}{12N_R}\right), \qquad \frac{1}{\sqrt{2\pi}}\int_{-k_\infty}^{k_\infty} e^{-t^2/2}\,dt = 2\Phi_0(k_\infty) = 1 - p,$$

where $\Phi_0(z)$ is the Laplace function. The value $t_p$ is calculated from the equation

$$\frac{1}{\sqrt{2\pi}}\int_{t_p}^{\infty} e^{-t^2/2}\,dt = 0.5 - \Phi_0(t_p) = 1 - P.$$

For $P = 0.99$ we have $\Phi_0(t_p) = P - 0.5 = 0.49$, and $t_p = 2.33$ is found from the table of the Laplace function; $k_\infty = 1.645$ for $p = 0.1$, corresponding to the 90% interval; $k = 1.776$ for $N_R = 500$.
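The numbers quoted above can be reproduced without a table of the Laplace function. The sketch below is an illustration with hypothetical function names; it uses the relation $\Phi_0(z) = \Phi(z) - 0.5$ between the Laplace function and the standard normal CDF, and it assumes that the empirical standard deviation is the sample standard deviation with the $N_R - 1$ denominator.

```python
import math
from statistics import NormalDist, mean, stdev


def tolerance_factor(p, P, n_r):
    """k = k_inf * (1 + t_p / sqrt(2 N_R) + (5 t_p**2 + 10) / (12 N_R))."""
    std_normal = NormalDist()
    k_inf = std_normal.inv_cdf(1 - p / 2)   # solves 2 * Phi0(k_inf) = 1 - p
    t_p = std_normal.inv_cdf(P)             # solves Phi0(t_p) = P - 0.5
    return k_inf * (1 + t_p / math.sqrt(2 * n_r) + (5 * t_p ** 2 + 10) / (12 * n_r))


def tolerance_interval(estimates, p, P):
    """(Mean - k * StDev, Mean + k * StDev) for a list of quantile estimates."""
    k = tolerance_factor(p, P, len(estimates))
    centre = mean(estimates)
    spread = stdev(estimates)               # N_R - 1 denominator (assumption)
    return centre - k * spread, centre + k * spread


print(round(tolerance_factor(p=0.1, P=0.99, n_r=500), 3))
```

With $p = 0.1$, $P = 0.99$ and $N_R = 500$ the factor rounds to 1.776, in agreement with the value stated above.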

Confidence intervals. [Figure: two panels, "500 samples of n = 100 observations each" and "500 samples of n = 1000 observations each". Horizontal axis: distribution (0 - Pareto $\gamma = 1$, 1 - Pareto $\gamma = 1/2$, 2 - Weibull $\gamma = 2$). Legend: 99% quantile; 90% confidence intervals for $x_p^c$; 90% confidence intervals for $x_p^w$.] Tolerance 90% confidence intervals of the estimates $x_p^w$ and $x_p^c$ of 99% quantiles for heavy-tailed distributions: 500 samples of $n = 100$ (left) and $n = 1000$ (right) observations each.

[Figure: two panels, "500 samples of n = 100 observations each" and "500 samples of n = 1000 observations each". Horizontal axis: distribution (0 - Pareto $\gamma = 1$, 1 - Pareto $\gamma = 1/2$, 2 - Weibull $\gamma = 2$). Legend: 99.9% quantile; 90% confidence intervals for $x_p^c$; 90% confidence intervals for $x_p^w$.] Tolerance 90% confidence intervals of the estimates $x_p^w$ and $x_p^c$ of 99.9% quantiles for heavy-tailed distributions: 500 samples of $n = 100$ (left) and $n = 1000$ (right) observations each.

[Figure: two panels, "MSE for 99% quantile, n = 100" and "MSE for 99.9% quantile, n = 100". Horizontal axis: distribution (0 - Pareto $\gamma = 1$, 1 - Pareto $\gamma = 1/2$, 2 - Weibull $\gamma = 2$). Legend: $x_p^c$, $x_p^w$.] MSE of the estimates $x_p^w$ and $x_p^c$ of 99% (left) and 99.9% (right) quantiles for heavy-tailed distributions: 500 samples of $n = 100$ observations each.

[Figure: two panels, "MSE for 99% quantile, n = 1000" and "MSE for 99.9% quantile, n = 1000". Horizontal axis: distribution (0 - Pareto $\gamma = 1$, 1 - Pareto $\gamma = 1/2$, 2 - Weibull $\gamma = 2$). Legend: $x_p^c$, $x_p^w$.] MSE of the estimates $x_p^w$ and $x_p^c$ of 99% (left) and 99.9% (right) quantiles for heavy-tailed distributions: 500 samples of $n = 1000$ observations each.

Application to WWW-session data. Characteristics of sub-sessions: the size of a sub-session (s.s.s.) in bytes; the duration of a sub-session (d.s.s.) in seconds. Characteristics of the transferred Web pages: the size of the response (s.r.) in bytes; the inter-response time (i.r.t.) in seconds.

Description of Web-traffic data

               s.s.s. (bytes)   d.s.s. (sec)   s.r. (bytes)   i.r.t. (sec)
Sample size    373              373            7107           7107
Minimum        128              2              0              6.543·10^-3
Maximum        5.884·10^7       9.058·10^4     2.052·10^7     5.676·10^4
Mean           1.283·10^6       1.728·10^3     5.395·10^4     80.908
StDev          4.079·10^6       5.206·10^3     4.931·10^5     728.266
s              10^7             10^3           10^6           10^3

Hill's estimate for Web-traffic data. $\hat\gamma$ is Hill's estimate, $k$ is the number of largest order statistics.

               s.s.s.   d.s.s.   s.r.    i.r.t.
k              50       50       211     211
$\hat\gamma$   0.949    0.601    0.898   0.712

Conclusions: the estimates of the tail index $\alpha = 1/\gamma$ are less than 2 for all considered data sets; it follows from extreme value theory that at least the $\beta$th moments, $\beta \ge 2$, of the distributions of the corresponding r.v.s s.s.s., d.s.s., s.r., i.r.t. are not finite. The distributions of the considered Web-traffic characteristics are heavy-tailed.
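The conclusion about the tail index is a one-line computation from the reported Hill estimates. The snippet below only redoes that arithmetic; the dictionary names are illustrative.

```python
# Hill estimates of gamma reported for the four Web-traffic characteristics.
hill_estimates = {"s.s.s.": 0.949, "d.s.s.": 0.601, "s.r.": 0.898, "i.r.t.": 0.712}

# Tail index alpha = 1 / gamma for each characteristic.
tail_indices = {name: 1.0 / gamma for name, gamma in hill_estimates.items()}

for name, alpha in tail_indices.items():
    print(f"{name}: alpha = {alpha:.3f}")
```

All four values lie between 1 and 2 (roughly 1.05, 1.66, 1.11 and 1.40), which is what the statement about non-finite moments of order 2 and higher relies on.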

High quantiles for Web-traffic data. Quantile values are given as $\hat{x}_p \cdot 10^{-4}$.

Estimate   r.v.     p = 0.99      p = 0.999
x_p^c      d.s.s.   1.4005        5.812
           s.s.s.   2.299·10^3    2.1·10^4
           s.r.     69.27         439.5
           i.r.t.   0.1445        0.5493
x_p^w      d.s.s.   1.435         5.688
           s.s.s.   2.407·10^3    2.02·10^4
           s.r.     56.67         431.6
           i.r.t.   0.0954        0.5402

Endpoints, excess functions. Definition. Let $X$ be a r.v. with the finite right endpoint $X_F = \sup\{x \in \mathbb{R} : F(x) < 1\}$. Then

$$e(u) = E(X - u \mid X > u), \quad 0 \le u < X_F,$$

is the mean excess function of the r.v. $X$ over the threshold $u$. Assuming $0/0 = 0$, the empirical mean excess function is defined by

$$e_n(u) = \sum_{i=1}^{n}(X_i - u)\,1\{X_i > u\} \Big/ \sum_{i=1}^{n} 1\{X_i > u\}.$$
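The empirical mean excess function translates directly into code. This is a minimal sketch with a hypothetical function name; the convention $0/0 = 0$ is implemented by returning 0 when no observation exceeds the threshold.

```python
def empirical_mean_excess(sample, u):
    """e_n(u): average of X_i - u over the observations with X_i > u (0 if none)."""
    excesses = [x - u for x in sample if x > u]
    if not excesses:
        return 0.0                      # the 0/0 = 0 convention
    return sum(excesses) / len(excesses)


data = [1.0, 2.0, 3.0, 4.0]
print(empirical_mean_excess(data, 2.0))    # excesses 1 and 2, mean 1.5
print(empirical_mean_excess(data, 10.0))   # no exceedances, returns 0.0
```

Evaluating the function on a grid of thresholds, for instance at the order statistics of the sample, gives the points of a mean excess plot.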

Endpoints, excess functions (continued). For heavy-tailed distributions the function $e(u)$ tends to infinity; a linear plot $u \to e(u)$ corresponds to a Pareto distribution, $e_P(u) = (1 + \gamma u)/(1 - \gamma)$; a constant $e(u) = 1/\lambda$ corresponds to an exponential distribution; $e(u)$ tends to 0 for light-tailed distributions.

The mean excess function for different distributions. The mean excess function of the Pareto distribution $F(x) = 1 - x^{-2}$, $x > 0$, together with 10 empirical mean excess functions $e_n(u)$, each based on simulated data ($n = 1000$) from the above distribution. For different samples the curves $e_n(u)$ may differ strongly towards the higher values of $u$, since only sparse observations may exceed the threshold $u$ for large $u$. This makes the precise interpretation of $e_n(u)$ difficult.

Papers:
Embrechts, P., Klüppelberg, C., Mikosch, T. (1997) Modelling Extremal Events for Finance and Insurance. Springer, Berlin.
Ferreira, A., de Haan, L., Peng, L. (2000) Adaptive estimators for the endpoint and high quantiles of a probability distribution. Eurandom Research Report No. 99-042.
Hall, P. (1990) Using the Bootstrap to Estimate Mean Squared Error and Select Smoothing Parameter in Nonparametric Problems. Journal of Multivariate Analysis, 32, 177-203.
Markovitch, N.M. and Krieger, U.R. (2002) The estimation of heavy-tailed probability density functions, their mixtures and quantiles. Computer Networks, Vol. 40, Issue 3, pp. 459-474.
Markovich, N.M. (2002) High Quantiles of Heavy-Tailed Distributions: Their Estimation. Automation and Remote Control, Vol. 63, No. 8, pp. 1263-1279.
Markovich, N.M. (2005) High quantile estimation for heavy-tailed distributions. Performance Evaluation.
Weissman, I. (1978) Estimation of parameters and large quantiles based on the k largest observations. Journal of the American Statistical Association, 73, 812-815.