Analysis methods of heavy-tailed data
1 Institute of Control Sciences, Russian Academy of Sciences, Moscow, Russia. PhD course: February 13-18, 2006, Bamberg, Germany; June 19-23, 2006, Brest, France; May 14-19, 2007, Trondheim, Norway.
2 Chapter 3. Heavy-tailed density estimation. Combined parametric-nonparametric methods, Barron's estimate and χ²-optimality. Kernel estimators with variable bandwidth and their smoothing methods: the weighted version of squared-error cross-validation (WISE) and the discrepancy method. Re-transformed nonparametric estimators.
3 In Section 3 the problems of heavy-tailed density estimation are discussed. Three approaches are considered. 1) Combined parametric-nonparametric methods, where the tail domain of the density is fitted by a parametric model and the main part of the density (the body) is fitted by a nonparametric method such as a histogram; a similar approach, realized by Barron's estimator, is also considered. 2) Kernel estimates with variable bandwidth; the optimal accuracy of these estimates as well as their disadvantages for heavy-tailed density estimation are discussed. 3) Re-transformed estimates, which use a preliminary transformation of the underlying random variable to a new one whose density is more convenient to restore.
4 Specific features of the analysis of heavy-tailed distributions: the heavy tail decays to zero at a slower-than-exponential rate; Cramér's condition is violated; observations are sparse in the tail domain of the distribution. Aim: non-parametric PDF estimation with accurate tail behavior. Comparison of PDFs is needed in classification: classification of measurements belonging to different sources (mobile, fax, normal calls, Internet, ...); classification of services using customers' behavior.
5 Example of heavy-tailed density estimation: the Fréchet PDF, estimation of the body and the tail.
6 Statement of the problem. Combined parametric-non-parametric estimators with separate estimation of the tail and the body of the PDF. The kernel estimator

f̂_h(x) = (1/(nh)) Σ_{i=1}^n K((x − X_i)/h)

produces spurious peaks in the tail domain or over-smooths the main part of the PDF for finite samples.
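A minimal Python sketch of the fixed-bandwidth estimator above (illustrative only; the Gaussian kernel, the Pareto test sample, and all names are my own assumptions, not from the slides):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h):
    """Standard fixed-bandwidth kernel estimate f_h(x) = (1/(nh)) sum_i K((x - X_i)/h)."""
    u = (np.asarray(x, dtype=float)[..., None] - sample) / h
    return gauss(u).sum(axis=-1) / (len(sample) * h)

# Pareto(1) sample (F(x) = 1 - 1/x, x >= 1): with a single fixed h the estimate
# shows spurious peaks at the sparse tail observations or over-smooths the body.
rng = np.random.default_rng(0)
sample = (1.0 - rng.uniform(size=500)) ** -1.0
grid = np.linspace(1.0, 10.0, 200)
est = kde(grid, sample, h=0.3)
```

Evaluating `est` for several values of h makes the peak/over-smoothing trade-off visible.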
7 Statement of the problem. Variable bandwidth kernel estimator:

f̂_A(x) = (1/(nh)) Σ_{i=1}^n f(X_i)^{1/2} K((x − X_i) f(X_i)^{1/2} / h),

f̃_A(x) = (1/(nh)) Σ_{i=1}^n f̂(X_i)^{1/2} K((x − X_i) f̂(X_i)^{1/2} / h),

where f̂(X_i) is a pilot estimate of f(x), e.g. a standard kernel estimate. Advantage over the standard kernel estimator: local adaptation to the sample via the local bandwidth h f̂(X_i)^{−1/2} with a fixed h.
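The practical (pilot-based) version can be sketched as follows; the Gaussian kernel, the clipping of the pilot away from zero, and the bandwidth values are assumptions for illustration:

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def kde(x, sample, h):
    u = (np.asarray(x, dtype=float)[..., None] - sample) / h
    return gauss(u).sum(axis=-1) / (len(sample) * h)

def abramson_kde(x, sample, h, h1):
    """Practical variable-bandwidth estimate: a standard KDE with bandwidth h1
    is the pilot f_hat(X_i); each point then gets the local bandwidth
    h / f_hat(X_i)^(1/2)."""
    pilot = np.maximum(kde(sample, sample, h1), 1e-12)  # keep sqrt well-defined
    s = np.sqrt(pilot)
    u = (np.asarray(x, dtype=float)[..., None] - sample) * s / h
    return (s * gauss(u)).sum(axis=-1) / (len(sample) * h)

rng = np.random.default_rng(0)
sample = (1.0 - rng.uniform(size=500)) ** -1.0   # Pareto(1) sample
grid = np.linspace(1.0, 10.0, 200)
est = abramson_kde(grid, sample, h=0.3, h1=0.5)
```

Points sitting in low-density regions get wider kernels, which flattens the spurious tail peaks of the fixed-h estimate.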
8 Problems of kernel estimates for PDFs with finite support. Boundary effects of kernel estimates. Epanechnikov's kernel with Y_(n) = 0.8: h1 < 1 − Y_(n): the kernel is truncated; h2 = 1 − Y_(n): the kernel corresponds to a triangular PDF in the neighborhood of 1; h3 > 1 − Y_(n): oversmoothing of the PDF.
9 Outline. Heavy-tailed density estimation: 1) the combined approach; 2) variable bandwidth kernel estimators; 3) the use of the transform-re-transform scheme. Boundary kernels. The discrepancy method and cross-validation as smoothing tools for a variable bandwidth kernel estimator.
10 Main assumption: the asymptotic behavior of F(x) at infinity is based on the asymptotic limit distribution of the sample maximum. Gnedenko (1943): if F(x) is such that the limit distribution of the maximum M_n = max(X_1, X_2, ..., X_n) exists, then this limit distribution can only be of the following form: for some normalizing constants a_n > 0, b_n ∈ R,

P{(M_n − b_n)/a_n ≤ x} = F^n(b_n + a_n x) → H_γ(x) as n → ∞, x ∈ R,

where

H_γ(x) = exp(−x^{−1/γ}), x > 0, γ > 0 (Fréchet);
H_γ(x) = exp(−(−x)^{−1/γ}), x < 0, γ < 0 (Weibull);
H_γ(x) = exp(−e^{−x}), x ∈ R, γ = 0 (Gumbel).
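A purely numerical illustration of the Fréchet case (my own example, not from the slides): for the Pareto DF F(x) = 1 − 1/x one may take a_n = n, b_n = 0, and then F^n(nx) → exp(−1/x) = H_1(x):

```python
import numpy as np

# Pareto DF F(x) = 1 - 1/x, x >= 1: the maximum M_n normalized by a_n = n,
# b_n = 0 converges to the Frechet law H_gamma with gamma = 1.
def F(x):
    return np.where(x >= 1.0, 1.0 - 1.0 / np.maximum(x, 1.0), 0.0)

x = np.linspace(0.5, 5.0, 10)
n = 100_000
exact = F(n * x) ** n          # P{(M_n - b_n)/a_n <= x}, computed exactly
frechet = np.exp(-1.0 / x)     # H_1(x) = exp(-x^{-1/gamma}) with gamma = 1
```

For n = 100000 the two curves already agree to several decimal places.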
11 Combined estimators for heavy-tailed densities. Combined parametric-nonparametric method:

f(t, γ, N) = f_N(t), t ∈ [0, X_(n−k)];  f_γ(t), t ∈ (X_(n−k), ∞),

where X_(n−k) is some r.v. (an order statistic),

f_γ(x) = (1/γ) x^{−1/γ−1} + (2/γ) x^{−2/γ−1}

is the parametric tail model of Pareto type, and

f_N(t) = (1/X_(n−k)) Σ_{j=1}^N λ_j φ_j(t / X_(n−k))

is the non-parametric estimator of the main part (the body) of the PDF, an expansion in basis functions φ_j(t), j = 1, 2, ....
12 Estimation of mixtures of two PDFs by the combined estimator: the PDF of a mixture of Gamma and Pareto distributions (left) and of two Gamma distributions (right).
13 Barron's estimator and χ²-optimality. Let P_n = {A_n1, ..., A_nm_n} be a partition of the half-line (0, ∞) into finite intervals (bins) by the quantiles G^{−1}(j/m_n), 1 ≤ j ≤ m_n − 1, of an arbitrary distribution G(x), and let

δ_j = ∫_{A_nj} dF_n(x) = (1/n) Σ_{i=1}^n 1{X_i ∈ A_nj},

where n is the sample size and F_n is the empirical DF. Estimator (Barron, Györfi, van der Meulen, 1992):

f̂_B(x) = g(x) (1/n + δ_j)/(1/n + 1/m_n), x ∈ A_nj, 1 ≤ j ≤ m_n,

where g(x) is a tail model. A histogram-type estimate on [0, X_(k)] is superposed with the tail model g(x). The estimate is consistent in the sense of χ²-divergence if m_n → ∞ and m_n/n → 0 as n → ∞.
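Barron's estimator can be sketched as below; the exponential auxiliary model G(x) = 1 − e^{−x}, the sample, the bin count m, and all names are assumptions for illustration:

```python
import numpy as np

def barron(x, sample, m, g_pdf, G_inv):
    """Barron-type estimate: on each bin A_nj (built from the quantiles
    G^{-1}(j/m) of the auxiliary DF G) the density g is rescaled by
    (1/n + delta_j) / (1/n + 1/m)."""
    n = len(sample)
    edges = G_inv(np.arange(1, m) / m)                  # inner bin edges
    bins = np.concatenate(([0.0], edges, [np.inf]))
    delta = np.histogram(sample, bins=bins)[0] / n      # empirical bin masses delta_j
    j = np.clip(np.searchsorted(edges, x, side="right"), 0, m - 1)
    return g_pdf(np.asarray(x, dtype=float)) * (1.0 / n + delta[j]) / (1.0 / n + 1.0 / m)

rng = np.random.default_rng(0)
sample = rng.exponential(scale=1.5, size=400)
grid = np.linspace(0.0, 30.0, 3000)
est = barron(grid, sample, m=10,
             g_pdf=lambda x: np.exp(-x),                # auxiliary density g = G'
             G_inv=lambda u: -np.log(1.0 - u))          # G^{-1} for G(x) = 1 - e^{-x}
```

By construction each bin carries mass (1/m)(1/n + δ_j)/(1/n + 1/m), so the estimate integrates to one regardless of the sample.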
14 Problems of f̂_B(x): optimal selection of the partition; optimal selection of g(x). The behavior of the true DF for x > X_(n) is unknown; one has to apply asymptotic results of extreme value theory concerning the behavior of the DF at infinity. Examples of the auxiliary distribution G(x) are the lognormal, normal, and Weibull distributions. The choice of the auxiliary density g(x) = G′(x) strongly influences the estimate in the tail domain A_nm_n = [X_(k), ∞):

1 − F̂(x) = (1/n + δ_{m_n})/(1/n + 1/m_n) ∫_x^∞ g(t) dt = (1/n + δ_{m_n})/(1/n + 1/m_n) (1 − G(x)).

For samples of moderate size the tail model g(x) distorts the estimate of the body of the PDF; for large samples this influence weakens.
15 Kernel estimates with variable bandwidth. Let X_n = {X_1, ..., X_n} be a sample of i.i.d. r.v.s distributed with a heavy-tailed DF F(x) and PDF f(x). Variable bandwidth kernel estimate (Abramson, 1982):

f̂_A(x|h) = (nh)^{−1} Σ_{i=1}^n f(X_i)^{1/2} K((x − X_i) f(X_i)^{1/2} / h).

Practical version:

f̃_A(x|h_1, h) = (nh)^{−1} Σ_{i=1}^n f̂_{h_1}(X_i)^{1/2} K((x − X_i) f̂_{h_1}(X_i)^{1/2} / h).

Main advantages: non-negativity; the best mean squared error.
16 Mean squared errors (MSE) for kernel estimates:

MSE = E ∫ (f̂_h(x) − f(x))² dx.

MSE of a standard kernel estimate: MSE(f̂_h) ∼ n^{−4/5} (bias ∼ h²; variance ∼ (nh)^{−1}) if a second-order kernel is used, h ∼ n^{−1/5}, and f has two continuous derivatives.

MSE of a variable bandwidth kernel estimate: MSE(f̂_A(·|h)) ∼ n^{−8/9} (bias ∼ h⁴; variance ∼ (nh)^{−1}) if a symmetric kernel with ∫ x⁴ K(x) dx < ∞ is used, h ∼ n^{−1/9}, and f has four continuous derivatives.
17 Cross-validation for a variable bandwidth kernel estimator (P. Hall, 1992). Weighted integrated squared error:

WISE = ∫ (f̂_{−i}(x; h) − f(x))² ω(x) dx = ∫ f̂_{−i}(x; h)² ω(x) dx − 2 ∫ f̂_{−i}(x; h) f(x) ω(x) dx + const,

where, for p-dimensional data,

f̂_{−i}(x; h) = (1/(n h^p)) Σ_{j≠i} f̂(X_j; h_1)^{p/2} K((x − X_j) f̂(X_j; h_1)^{1/2} / h) K_1((x − X_j)/(Ah)), A > 0,

and ω(x) is a bounded, nonnegative function (a weight).
18 Cross-validation for a variable bandwidth kernel estimator (P. Hall, 1992). Example of the weight function:

ω(x) = 1 for ‖Σ^{−1/2}(x − μ)‖² ≤ z_η, and 0 otherwise,

where μ and Σ denote the sample mean and variance, ‖·‖ is the Euclidean distance, and z_η is the upper (1 − η)-level critical point of the chi-squared distribution. Practical version:

ŴISE = ∫ f̂_{−i}(x; h)² ω(x) dx − (2/n) Σ_{i=1}^n f̂_{−i}(X_i; h) ω(X_i).

How good is h?
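A one-dimensional sketch of the practical criterion (illustrative only: p = 1, no clipping kernel K_1, ω taken as the indicator of a central sample region, and the data, grids, and names are my assumptions):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def vb_kde(x, sample, s, h, skip=None):
    """Variable-bandwidth estimate; skip=i leaves X_i out (leave-one-out)."""
    keep = np.ones(len(sample), dtype=bool)
    if skip is not None:
        keep[skip] = False
    xs, ss = sample[keep], s[keep]
    u = (np.asarray(x, dtype=float)[..., None] - xs) * ss / h
    return (ss * gauss(u)).sum(axis=-1) / (len(xs) * h)

def wise_cv(sample, h_grid, h1):
    """Pick h minimizing int f_{-i}(x;h)^2 w(x) dx - (2/n) sum_i f_{-i}(X_i;h) w(X_i)."""
    n = len(sample)
    pilot = np.maximum(vb_kde(sample, sample, np.ones(n), h1), 1e-12)
    s = np.sqrt(pilot)
    lo, hi = np.quantile(sample, [0.05, 0.95])      # w = indicator of a central region
    grid = np.linspace(lo, hi, 300)
    inside = np.flatnonzero((sample >= lo) & (sample <= hi))
    scores = np.empty(len(h_grid))
    for k, h in enumerate(h_grid):
        term1 = np.trapz(vb_kde(grid, sample, s, h) ** 2, grid)
        term2 = 2.0 / n * sum(vb_kde(sample[i], sample, s, h, skip=i) for i in inside)
        scores[k] = term1 - term2
    return h_grid[int(np.argmin(scores))], scores

rng = np.random.default_rng(0)
sample = rng.normal(size=150)
h_grid = np.linspace(0.05, 1.5, 20)
h_best, scores = wise_cv(sample, h_grid, h1=0.4)
```

The minimizer of the score answers "how good is h?" in the WISE sense over the chosen grid.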
19 Discrepancy method for the variable bandwidth kernel estimator. Let h be a solution of the discrepancy equation

sup_{−∞<x<∞} |F_n(x) − F^A_{h,h_1}(x)| = δ n^{−1/2},   (1)

where F_n(x) is the empirical DF,

F^A_{h,h_1}(x) = ∫_{−∞}^x f̃_A(t|h_1, h) dt,

f̃_A(t|h_1, h) is the variable bandwidth kernel estimator, and δ is a quantile of the Kolmogorov-Smirnov statistic √n D_n = √n sup_{−∞<x<∞} |F_n(x) − F(x)|.
20 Discrepancy method for the variable bandwidth kernel estimator. The bias rate. Let h* be a solution of the discrepancy equation. It is possible to prove the following. Assuming h_1 = c n^{−1/5}, we have

P{h* > n^{−1/9}} < exp(−2 n^{1−2/α}) for α > 2,

P{|E f̃_A(x|h_1, h*) − f(x)| > ψ(x) n^{−4/9}} < 2 exp(−2 n^{1/9}),

where ψ(x) is a function independent of n.
21 Discrepancy method for the variable bandwidth kernel estimator. Practical version:

√n max(D̂⁺_n, D̂⁻_n) = 0.5,

where

D̂⁺_n = max_{1≤i≤n} (i/n − F^A_{h,h_1}(X_(i))),
D̂⁻_n = max_{1≤i≤n} (F^A_{h,h_1}(X_(i)) − (i − 1)/n),

and X_(1) ≤ X_(2) ≤ ... ≤ X_(n) are the order statistics.
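The practical equation can be solved by bisection in h, since the discrepancy is small when the estimated DF hugs F_n (tiny h) and grows as h over-smooths. The sketch below is illustrative: the Epanechnikov kernel (whose DF is an explicit polynomial), the Gaussian test sample, the pilot bandwidth, and all names are assumptions:

```python
import numpy as np

def epan_cdf(t):
    """DF of the Epanechnikov kernel K(u) = 0.75(1 - u^2), |u| <= 1."""
    t = np.clip(t, -1.0, 1.0)
    return 0.5 + 0.75 * (t - t**3 / 3.0)

def vb_df(x, sample, s, h):
    """DF of the variable-bandwidth estimate: (1/n) sum_i KCDF((x - X_i) s_i / h)."""
    return epan_cdf((np.asarray(x, dtype=float)[..., None] - sample) * s / h).mean(axis=-1)

def discrepancy(h, sample, s):
    n = len(sample)
    Fh = vb_df(np.sort(sample), sample, s, h)
    d_plus = np.max(np.arange(1, n + 1) / n - Fh)     # D+_n
    d_minus = np.max(Fh - np.arange(0, n) / n)        # D-_n
    return np.sqrt(n) * max(d_plus, d_minus)

def discrepancy_bandwidth(sample, s, target=0.5, h_lo=1e-3, h_hi=50.0, iters=60):
    """Bisection on h for sqrt(n) * max(D+, D-) = target."""
    for _ in range(iters):
        h = 0.5 * (h_lo + h_hi)
        if discrepancy(h, sample, s) < target:
            h_lo = h
        else:
            h_hi = h
    return 0.5 * (h_lo + h_hi)

rng = np.random.default_rng(0)
sample = rng.normal(size=200)
pilot = np.maximum(
    (0.75 * np.clip(1 - ((sample[:, None] - sample) / 0.5) ** 2, 0, None)).sum(1)
    / (200 * 0.5), 1e-12)                             # Epanechnikov pilot, h1 = 0.5
s = np.sqrt(pilot)
h_star = discrepancy_bandwidth(sample, s)
```

At the returned h the discrepancy statistic sits at the target value 0.5.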
22 Approach with transformations.

X_1, ..., X_n → Y_1, ..., Y_n, where Y_j = T(X_j), j = 1, ..., n.

Let T(x) be a monotone increasing, one-to-one, continuous transformation function. The re-transformed estimate of the PDF of X_i is

f̂(x) = ĝ(T(x)) T′(x),

where g(x) is the PDF of the r.v. Y_i. The DF of the r.v. Y_i is

G(x) = P{Y_i ≤ x} = P{T(X_i) ≤ x} = F(T^{−1}(x)).
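The transform-re-transform scheme in code (a sketch under my own assumptions: the fixed transformation T(x) = ln x, a Gaussian kernel on the transformed scale, and a Pareto test sample):

```python
import numpy as np

def gauss(u):
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def retransformed_kde(x, sample, h, T, T_prime):
    """f_hat(x) = g_hat(T(x)) * T'(x): estimate the density of Y_j = T(X_j)
    by a standard KDE, then map it back to the x scale."""
    x = np.asarray(x, dtype=float)
    y, ys = T(x), T(sample)
    g_hat = gauss((y[..., None] - ys) / h).sum(axis=-1) / (len(sample) * h)
    return g_hat * T_prime(x)

# For a Pareto(1) sample, Y = ln X is standard exponential: a distribution
# that a kernel estimate handles far more easily than the heavy-tailed X.
rng = np.random.default_rng(1)
sample = (1.0 - rng.uniform(size=1000)) ** -1.0
grid = np.linspace(1.01, 20.0, 400)
f_hat = retransformed_kde(grid, sample, h=0.25, T=np.log, T_prime=lambda x: 1.0 / x)
```

Because ∫ ĝ(T(x)) T′(x) dx = ∫ ĝ(y) dy, the re-transformed estimate keeps the total mass of ĝ.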
23 Approach with transformations. Preliminary transformations to a new r.v. Y_j = T(X_j), j = 1, ..., n. Fixed transformations: ln x, (2/π) arctan x. Features of fixed transformations. Advantage: they do not require any knowledge about the distribution of X. Disadvantage: they can lead to densities of the transformed r.v.s Y_j with discontinuities, which are difficult to estimate.
24 Approach with transformations. First adapted transformations to a new r.v. Y_j = T(X_j), j = 1, ..., n (Wand et al., 1991):

T(x) = x^λ sign(λ), λ ≠ 0;  ln x, λ = 0,

λ = arg min ∫_R (g″(y))² dy,

where g is the PDF of the transformed r.v. Y_1 = T_λ(X_1).
25 Approach with transformations. First adapted transformations to a new r.v. Y_j = T(X_j), j = 1, ..., n: T: R₊ → [0, 1], where F(x) is some parametric model (Devroye and Györfi, 1985). The transformation to the isosceles triangular PDF φ_tri(x) on [0, 1],

T(x) = Φ_tri^{−1}(F(x)), i.e. T(x) = √(F(x)/2) for F(x) ≤ 0.5 and T(x) = 1 − √((1 − F(x))/2) for F(x) > 0.5,

used for kernel estimates with compactly supported kernels, and the transformation T(x) = F(x) to the uniform PDF φ_uni(x), used for a histogram, provide the minimal convergence rate in L_1:

min_g E ∫_0^1 |ĝ(x) − g(x)| dx.
26 Approach with transformations. Problems of the transform-re-transform scheme: the DF F(x) is unknown, so it is impossible to transform to the exact desired PDF; selection of a parametric or non-parametric family of distributions as guess DFs; selection of a target PDF that keeps the re-transformed estimates stable under minor perturbations of the tail index estimates; selection of the PDF estimate so that the tail decay rate (of the true PDF) is preserved after the inverse transformation.
27 Approach with transformations. Adaptive transformation (Maiboroda & Markovich, 2004):

T_γ̂(x) = Φ^{−1}(Ψ_γ̂(x)) = 1 − (1 + γ̂x)^{−1/(2γ̂)},

where the guess DF F of X_i is assumed to be the Generalized Pareto distribution

Ψ_γ̂(x) = 1 − (1 + γ̂x)^{−1/γ̂}, x ≥ 0;  0, x < 0,

and the target DF G of Y_i is

Φ(x) = (2x − x²) 1{x ∈ [0, 1]} + 1{x > 1}.
28 Approach with transformations. Adaptive transformation. For a consistent estimate γ̂ of γ, the transformation provides a PDF g(x) on [0, 1] that is continuous in the neighborhood of 1 for typical distributions (with regularly varying tails, lognormal-type tails, and Weibull-like tails). Estimators ĝ(x): polygram, kernel estimate (1/(nh)) Σ_{i=1}^n K((x − X_i)/h). The choice of the Generalized Pareto distribution is widespread and motivated by Pickands's theorem, which states that, for a certain class of distributions and a sufficiently high threshold u of the r.v. X, the conditional distribution of the overshoot Y = X − u, given that X exceeds u, converges to a Generalized Pareto distribution.
29 Adaptive transformation approach. Estimation algorithm: 1) The tail index of X_j is estimated from the sample {X_1, ..., X_n} by the Hill estimate

γ̂_k = (1/k) Σ_{i=1}^k log X_(n−i+1) − log X_(n−k)

(X_(1) ≤ ... ≤ X_(n) are the order statistics of the sample). 2) The transformation T = T_γ̂k is constructed so that if ξ has the guess DF Ψ_γ̂k, then T_γ̂k(ξ) has the target DF Φ (e.g. a triangular one); here γ̂_k is treated as a fixed value. 3) The transformed sample Y_j = T_γ̂k(X_j), j = 1, ..., n, is constructed. 4) The PDF of Y_1, ..., Y_n is estimated by some estimate ĝ_h(x). 5) The PDF of X_j is estimated by f̂_h(x) = ĝ_h(T(x)) T′(x).
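The five steps above can be sketched end-to-end (illustrative only: a Gaussian kernel for step 4, a shifted-Pareto test sample with γ = 1, and fixed h and k are my assumptions; the slides select k by bootstrap):

```python
import numpy as np

def hill(sample, k):
    """Step 1: Hill estimate gamma_k = (1/k) sum_i log X_(n-i+1) - log X_(n-k)."""
    xs = np.sort(sample)
    return np.mean(np.log(xs[-k:])) - np.log(xs[-k - 1])

def adaptive_estimate(x, sample, h, k):
    g = hill(sample, k)                                            # step 1
    T = lambda t: 1.0 - (1.0 + g * t) ** (-1.0 / (2.0 * g))        # step 2
    T_prime = lambda t: 0.5 * (1.0 + g * t) ** (-1.0 / (2.0 * g) - 1.0)
    y = T(sample)                                                  # step 3
    gauss = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    x = np.asarray(x, dtype=float)
    g_hat = gauss((T(x)[..., None] - y) / h).sum(axis=-1) / (len(sample) * h)  # step 4
    return g_hat * T_prime(x)                                      # step 5

# Shifted Pareto: P{X > x} = (1 + x)^{-1}, x >= 0, i.e. GPD with gamma = 1.
rng = np.random.default_rng(2)
sample = (1.0 - rng.uniform(size=1000)) ** -1.0 - 1.0
grid = np.linspace(0.0, 30.0, 300)
f_hat = adaptive_estimate(grid, sample, h=0.05, k=100)
```

Here the transformed Y_j land in [0, 1), close to the triangular target, so a standard kernel estimate works on the transformed scale.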
30 The Pareto PDF with γ = 1; sample size n = 100. The PDF of the Gaussian distribution N(0, σ) is used as the pilot f̂(x) in f̃_A(x). The Gaussian kernel is used in the re-transformed kernel estimator. h1 = σ(Y) n^{−1/5} = 0.099, h2 = 1.06 σ(X) n^{−1/5} = 9.453, where σ(X) and σ(Y) are the standard deviations of the samples X_n = {X_1, ..., X_n} and Y_n = T_γ̂(X_n).
31 The Pareto PDF with γ = 1; sample size n = 100. The Gaussian kernel is used in the re-transformed kernel estimator and for f̂(x) in f̃_A(x). T_γ̂(x) = 1 − (1 + γ̂x)^{−1/(2γ̂)} is the adapted transformation; γ̂ is the Hill estimator, and k is selected by bootstrap.
32 The Weibull PDF with γ = 0.5; sample size n = 100. The Gaussian kernel is used for the re-transformed kernel estimator and for f̂(x) in f̃_A(x). h1 = σ(Y) n^{−1/5} = 0.102, h2 = 1.06 σ(X) n^{−1/5} = 3.673, T_γ̂(x) = 1 − (1 + γ̂x)^{−1/(2γ̂)}.
33 The accuracy of the re-transformed estimators. Mean integrated squared error (MISE):

MISE_h(γ̂, Ω) = E ∫_Ω (f̂(x) − f(x))² dx
= E ∫_Ω (ĝ_h(T_γ̂(x)) − g(T_γ̂(x)))² T′_γ̂(x) dT_γ̂(x)
= E ∫_Ω̃ (ĝ_h(y) − g(y))² T′_γ̂(T_γ̂^{−1}(y)) dy,

where Ω̃ = T_γ̂(Ω). For fixed transformations and non-random intervals Ω̃:

MISE_h(Ω) = ∫_Ω̃ T′(T^{−1}(y)) E(ĝ_h(y) − g(y))² dy.
34 The MISE of re-transformed kernel estimators. Mean integrated squared error (MISE). If 0 < T′(T^{−1}(x)) ≤ c holds on Ω̃ for the transformation T (not necessarily fixed), then

MISE_h(Ω) ≤ c ∫_Ω̃ E(ĝ_h(y) − g(y))² dy

for a non-random Ω̃. MSE of kernel estimates: MSE(ĝ_h) ∼ n^{−4/5} if a non-variable bandwidth kernel estimator is used as ĝ_h(y), h ∼ n^{−1/5}, and g^(2) is continuous; MSE(ĝ_h) ∼ n^{−8/9} if a variable bandwidth kernel estimator is used as ĝ_h(y), h ∼ n^{−1/9}, and g^(4) is continuous.
35 The rate of decay of the re-transformed estimators at infinity is determined by γ. Boundary kernels: the bias of the estimate at the boundary.
36 Boundary kernels. Example. Let the PDF be

f(x) = l(x)(1 + γx)^{−(1/γ+1)}, x ≥ 0;  0, x < 0.

The re-transformed estimate is

f̂(x) = ĝ_h(T_γ̂(x)) T′_γ̂(x) = 0.5 ĝ_h(T_γ̂(x)) (1 + γ̂x)^{−1/(2γ̂)−1},

with the transformation T_γ̂(x) = 1 − (1 + γ̂x)^{−1/(2γ̂)}. A smoothed polygram with ĝ_n(x) = C_n(1 − x) near x = 1 gives

f̂_n(x) ≈ 0.5 C_n (1 + γ̂x)^{−(1/γ̂+1)},

while a kernel estimator gives

f̂_h(x) ≈ 0.5 ĝ_h(1)(1 + γ̂x)^{−(1/(2γ̂)+1)},

i.e. the EVI is two times larger than needed.
37 Boundary kernels. Example. Principles of the selection of boundary kernels. 1) The kernel coincides with the target PDF: K(y) = g(y), y ∈ [Y_(n), 1]. 2) Direct fitting of the boundary with h = 1 − Y_(n):

(1/h) K((T(x) − Y_(n))/h) T′(x) = f̂(x),

because

ĝ_h(y) ≈ (1/h) K((y − Y_(n))/h), y ∈ (Y_(n), 1].
38 Reduction of the boundary bias (Simonoff, 1996). Let a new kernel, independent of the PDF, be

B(x) = (a_2(p) − a_1(p) x) K(x) / (a_0(p) a_2(p) − a_1²(p)),  a_l(p) = ∫_{−1}^p u^l K(u) du, 0 < p < 1.

The bias of the kernel estimator with such a kernel in the boundary region is O(h²) and the variance is O((nh)^{−1}) (the same as in the interior) when the second derivative of the underlying density is continuous.
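A numerical check of this construction (my own sketch; the Epanechnikov kernel and the value of p are assumptions): on the truncated support [−1, p] the boundary kernel B recovers the moment conditions ∫ B = 1 and ∫ uB = 0, which is what restores the interior O(h²) bias at the boundary.

```python
import numpy as np

def epan(u):
    return np.where(np.abs(u) <= 1.0, 0.75 * (1.0 - u**2), 0.0)

def boundary_kernel(u, p, K=epan):
    """B(u) = (a2(p) - a1(p) u) K(u) / (a0(p) a2(p) - a1(p)^2),
    with a_l(p) = int_{-1}^p u^l K(u) du computed numerically."""
    grid = np.linspace(-1.0, p, 4001)
    a0, a1, a2 = (np.trapz(grid**l * K(grid), grid) for l in (0, 1, 2))
    return (a2 - a1 * u) * K(u) / (a0 * a2 - a1**2)

p = 0.4
uu = np.linspace(-1.0, p, 4001)
B = boundary_kernel(uu, p)
m0 = np.trapz(B, uu)        # zeroth moment on [-1, p]
m1 = np.trapz(uu * B, uu)   # first moment on [-1, p]
```

The identities follow directly from the definition: ∫ B = (a_0 a_2 − a_1²)/(a_0 a_2 − a_1²) = 1 and ∫ uB = (a_1 a_2 − a_1 a_2)/(a_0 a_2 − a_1²) = 0.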
39 Overcoming boundary effects. Combination of the two approaches: use of B(x) and K(y) = g(y). For the adapted transformation and h = 1 − Y_(n) we have

ĝ_h(T_γ̂(x)) = (1/h) B((T_γ̂(x) − Y_(n))/h)
= (1/h) (a_2(p) − a_1(p)(T_γ̂(x) − Y_(n))/h) K((T_γ̂(x) − Y_(n))/h) / (a_0(p) a_2(p) − a_1²(p)),

and

ĝ_h(y) ≈ (1/h) K((y − Y_(n))/h), y ∈ (Y_(n), 1].
40 Re-transformed kernel estimators applied to Web data. PDF estimation of the sizes of sub-sessions (left) and inter-response times (right). K(x) is Epanechnikov's kernel. h = σ n^{−1/5}, h1 = 1.01 − T_γ̂(X_(n)), h < h1, where σ is the standard deviation of the transformed data.
41 Comparison of the re-transformed kernel estimate and the variable bandwidth kernel estimate. Re-transformed standard kernel estimate and variable bandwidth kernel estimate with Epanechnikov's kernel for the Pareto distribution: body (left) and tail (right). h is selected by the discrepancy method.
42 Comparison of the re-transformed kernel estimate and the variable bandwidth kernel estimate. Conclusions: a pure variable bandwidth kernel estimator, at least with compactly supported kernels, does not fit the density at infinity, in contrast to a variable bandwidth kernel estimator that uses a transformation of the data.
43 Papers:
Markovitch, N.M., Krieger, U.R. (2000a) Nonparametric estimation of long-tailed density functions and its application to the analysis of World Wide Web traffic. Performance Evaluation, 42(2-3).
Markovitch, N.M., Krieger, U.R. (2002) The estimation of heavy-tailed probability density functions, their mixtures and quantiles. Computer Networks, 40(3).
Maiboroda, R.E., Markovich, N.M. (2004) Estimation of heavy-tailed probability density function with application to Web data. Computational Statistics, 4.
Barron, A.R., Sheu, C.-H. (1991) Approximation of density functions by sequences of exponential families. Annals of Statistics, 19(3).
44 Papers:
Barron, A.R., Györfi, L., van der Meulen, E. (1992) Distribution estimation consistent in total variation and in two types of information divergence. IEEE Transactions on Information Theory, 38.
Silverman, B.W. (1986) Density Estimation for Statistics and Data Analysis. New York: Chapman & Hall.
Hall, P. (1992) On global properties of variable bandwidth density estimators. Annals of Statistics, 20(2).
More informationExtreme Value Theory and Applications
Extreme Value Theory and Deauville - 04/10/2013 Extreme Value Theory and Introduction Asymptotic behavior of the Sum Extreme (from Latin exter, exterus, being on the outside) : Exceeding the ordinary,
More informationExtremogram and Ex-Periodogram for heavy-tailed time series
Extremogram and Ex-Periodogram for heavy-tailed time series 1 Thomas Mikosch University of Copenhagen Joint work with Richard A. Davis (Columbia) and Yuwei Zhao (Ulm) 1 Jussieu, April 9, 2014 1 2 Extremal
More informationConfidence intervals for kernel density estimation
Stata User Group - 9th UK meeting - 19/20 May 2003 Confidence intervals for kernel density estimation Carlo Fiorio c.fiorio@lse.ac.uk London School of Economics and STICERD Stata User Group - 9th UK meeting
More informationECON 721: Lecture Notes on Nonparametric Density and Regression Estimation. Petra E. Todd
ECON 721: Lecture Notes on Nonparametric Density and Regression Estimation Petra E. Todd Fall, 2014 2 Contents 1 Review of Stochastic Order Symbols 1 2 Nonparametric Density Estimation 3 2.1 Histogram
More information2 Functions of random variables
2 Functions of random variables A basic statistical model for sample data is a collection of random variables X 1,..., X n. The data are summarised in terms of certain sample statistics, calculated as
More informationOne-Sample Numerical Data
One-Sample Numerical Data quantiles, boxplot, histogram, bootstrap confidence intervals, goodness-of-fit tests University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html
More informationModel-free prediction intervals for regression and autoregression. Dimitris N. Politis University of California, San Diego
Model-free prediction intervals for regression and autoregression Dimitris N. Politis University of California, San Diego To explain or to predict? Models are indispensable for exploring/utilizing relationships
More informationKernel density estimation for heavy-tailed distributions...
Kernel density estimation for heavy-tailed distributions using the Champernowne transformation Buch-Larsen, Nielsen, Guillen, Bolance, Kernel density estimation for heavy-tailed distributions using the
More informationKernel density estimation of reliability with applications to extreme value distribution
University of South Florida Scholar Commons Graduate Theses and Dissertations Graduate School 2008 Kernel density estimation of reliability with applications to extreme value distribution Branko Miladinovic
More informationEstimation de mesures de risques à partir des L p -quantiles
1/ 42 Estimation de mesures de risques à partir des L p -quantiles extrêmes Stéphane GIRARD (Inria Grenoble Rhône-Alpes) collaboration avec Abdelaati DAOUIA (Toulouse School of Economics), & Gilles STUPFLER
More informationAkaike Information Criterion to Select the Parametric Detection Function for Kernel Estimator Using Line Transect Data
Journal of Modern Applied Statistical Methods Volume 12 Issue 2 Article 21 11-1-2013 Akaike Information Criterion to Select the Parametric Detection Function for Kernel Estimator Using Line Transect Data
More informationAsymptotic Statistics-VI. Changliang Zou
Asymptotic Statistics-VI Changliang Zou Kolmogorov-Smirnov distance Example (Kolmogorov-Smirnov confidence intervals) We know given α (0, 1), there is a well-defined d = d α,n such that, for any continuous
More informationS6880 #7. Generate Non-uniform Random Number #1
S6880 #7 Generate Non-uniform Random Number #1 Outline 1 Inversion Method Inversion Method Examples Application to Discrete Distributions Using Inversion Method 2 Composition Method Composition Method
More informationNonparametric Estimation of Luminosity Functions
x x Nonparametric Estimation of Luminosity Functions Chad Schafer Department of Statistics, Carnegie Mellon University cschafer@stat.cmu.edu 1 Luminosity Functions The luminosity function gives the number
More informationMean-Shift Tracker Computer Vision (Kris Kitani) Carnegie Mellon University
Mean-Shift Tracker 16-385 Computer Vision (Kris Kitani) Carnegie Mellon University Mean Shift Algorithm A mode seeking algorithm Fukunaga & Hostetler (1975) Mean Shift Algorithm A mode seeking algorithm
More informationUNIVERSITÄT POTSDAM Institut für Mathematik
UNIVERSITÄT POTSDAM Institut für Mathematik Testing the Acceleration Function in Life Time Models Hannelore Liero Matthias Liero Mathematische Statistik und Wahrscheinlichkeitstheorie Universität Potsdam
More informationNonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix
Nonparametric Identification of a Binary Random Factor in Cross Section Data - Supplemental Appendix Yingying Dong and Arthur Lewbel California State University Fullerton and Boston College July 2010 Abstract
More informationBrief Review on Estimation Theory
Brief Review on Estimation Theory K. Abed-Meraim ENST PARIS, Signal and Image Processing Dept. abed@tsi.enst.fr This presentation is essentially based on the course BASTA by E. Moulines Brief review on
More informationBootstrap. Director of Center for Astrostatistics. G. Jogesh Babu. Penn State University babu.
Bootstrap G. Jogesh Babu Penn State University http://www.stat.psu.edu/ babu Director of Center for Astrostatistics http://astrostatistics.psu.edu Outline 1 Motivation 2 Simple statistical problem 3 Resampling
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued
Introduction to Empirical Processes and Semiparametric Inference Lecture 02: Overview Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and Operations Research
More informationExtreme value theory and high quantile convergence
Journal of Operational Risk 51 57) Volume 1/Number 2, Summer 2006 Extreme value theory and high quantile convergence Mikhail Makarov EVMTech AG, Baarerstrasse 2, 6300 Zug, Switzerland In this paper we
More informationExtremogram and ex-periodogram for heavy-tailed time series
Extremogram and ex-periodogram for heavy-tailed time series 1 Thomas Mikosch University of Copenhagen Joint work with Richard A. Davis (Columbia) and Yuwei Zhao (Ulm) 1 Zagreb, June 6, 2014 1 2 Extremal
More informationPENULTIMATE APPROXIMATIONS FOR WEATHER AND CLIMATE EXTREMES. Rick Katz
PENULTIMATE APPROXIMATIONS FOR WEATHER AND CLIMATE EXTREMES Rick Katz Institute for Mathematics Applied to Geosciences National Center for Atmospheric Research Boulder, CO USA Email: rwk@ucar.edu Web site:
More informationNONLINEAR CLASSIFICATION AND REGRESSION. J. Elder CSE 4404/5327 Introduction to Machine Learning and Pattern Recognition
NONLINEAR CLASSIFICATION AND REGRESSION Nonlinear Classification and Regression: Outline 2 Multi-Layer Perceptrons The Back-Propagation Learning Algorithm Generalized Linear Models Radial Basis Function
More informationNon-parametric Inference and Resampling
Non-parametric Inference and Resampling Exercises by David Wozabal (Last update 3. Juni 2013) 1 Basic Facts about Rank and Order Statistics 1.1 10 students were asked about the amount of time they spend
More informationIntroduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued
Introduction to Empirical Processes and Semiparametric Inference Lecture 09: Stochastic Convergence, Continued Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics and
More informationLocal Polynomial Wavelet Regression with Missing at Random
Applied Mathematical Sciences, Vol. 6, 2012, no. 57, 2805-2819 Local Polynomial Wavelet Regression with Missing at Random Alsaidi M. Altaher School of Mathematical Sciences Universiti Sains Malaysia 11800
More informationDoes k-th Moment Exist?
Does k-th Moment Exist? Hitomi, K. 1 and Y. Nishiyama 2 1 Kyoto Institute of Technology, Japan 2 Institute of Economic Research, Kyoto University, Japan Email: hitomi@kit.ac.jp Keywords: Existence of moments,
More informationStatistics: Learning models from data
DS-GA 1002 Lecture notes 5 October 19, 2015 Statistics: Learning models from data Learning models from data that are assumed to be generated probabilistically from a certain unknown distribution is a crucial
More informationSTATS 200: Introduction to Statistical Inference. Lecture 29: Course review
STATS 200: Introduction to Statistical Inference Lecture 29: Course review Course review We started in Lecture 1 with a fundamental assumption: Data is a realization of a random process. The goal throughout
More informationPractical conditions on Markov chains for weak convergence of tail empirical processes
Practical conditions on Markov chains for weak convergence of tail empirical processes Olivier Wintenberger University of Copenhagen and Paris VI Joint work with Rafa l Kulik and Philippe Soulier Toronto,
More informationIntroduction to Regression
Introduction to Regression p. 1/97 Introduction to Regression Chad Schafer cschafer@stat.cmu.edu Carnegie Mellon University Introduction to Regression p. 1/97 Acknowledgement Larry Wasserman, All of Nonparametric
More informationGaussian processes for inference in stochastic differential equations
Gaussian processes for inference in stochastic differential equations Manfred Opper, AI group, TU Berlin November 6, 2017 Manfred Opper, AI group, TU Berlin (TU Berlin) inference in SDE November 6, 2017
More informationn! (k 1)!(n k)! = F (X) U(0, 1). (x, y) = n(n 1) ( F (y) F (x) ) n 2
Order statistics Ex. 4.1 (*. Let independent variables X 1,..., X n have U(0, 1 distribution. Show that for every x (0, 1, we have P ( X (1 < x 1 and P ( X (n > x 1 as n. Ex. 4.2 (**. By using induction
More informationOverview of Extreme Value Theory. Dr. Sawsan Hilal space
Overview of Extreme Value Theory Dr. Sawsan Hilal space Maths Department - University of Bahrain space November 2010 Outline Part-1: Univariate Extremes Motivation Threshold Exceedances Part-2: Bivariate
More informationBayesian Point Process Modeling for Extreme Value Analysis, with an Application to Systemic Risk Assessment in Correlated Financial Markets
Bayesian Point Process Modeling for Extreme Value Analysis, with an Application to Systemic Risk Assessment in Correlated Financial Markets Athanasios Kottas Department of Applied Mathematics and Statistics,
More informationEXPLICIT NONPARAMETRIC CONFIDENCE INTERVALS FOR THE VARIANCE WITH GUARANTEED COVERAGE
EXPLICIT NONPARAMETRIC CONFIDENCE INTERVALS FOR THE VARIANCE WITH GUARANTEED COVERAGE Joseph P. Romano Department of Statistics Stanford University Stanford, California 94305 romano@stat.stanford.edu Michael
More informationLarge deviations for random walks under subexponentiality: the big-jump domain
Large deviations under subexponentiality p. Large deviations for random walks under subexponentiality: the big-jump domain Ton Dieker, IBM Watson Research Center joint work with D. Denisov (Heriot-Watt,
More informationRandom variables. DS GA 1002 Probability and Statistics for Data Science.
Random variables DS GA 1002 Probability and Statistics for Data Science http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall17 Carlos Fernandez-Granda Motivation Random variables model numerical quantities
More informationESTIMATORS IN THE CONTEXT OF ACTUARIAL LOSS MODEL A COMPARISON OF TWO NONPARAMETRIC DENSITY MENGJUE TANG A THESIS MATHEMATICS AND STATISTICS
A COMPARISON OF TWO NONPARAMETRIC DENSITY ESTIMATORS IN THE CONTEXT OF ACTUARIAL LOSS MODEL MENGJUE TANG A THESIS IN THE DEPARTMENT OF MATHEMATICS AND STATISTICS PRESENTED IN PARTIAL FULFILLMENT OF THE
More information