Two applications of max stable distributions: Random sketches and Heavy tail exponent estimation

Size: px

Start display at page:

Download "Two applications of max stable distributions: Random sketches and Heavy tail exponent estimation"

Cecil Blair
6 years ago
Views:

1 Two applications of max stable distributions: Random sketches and Heavy tail exponent estimation Stilian Stoev Department of Statistics University of Michigan Boston University Probability Seminar March 1, 2007 Based on joint works with Marios Hadjieleftheriou, George Kollios, George Michailidis and Murad Taqqu 1

2 Outline: Max stable distributions: background Max stable sketches: norm, distance and dominance norms Max spectrum: heavy tail exponent estimation 2

3 Max stable distributions Th (Fisher-Tippett-Gnedenko) Let X 1,..., X n,... be iid. If 1 a n (X 1 X n ) b n d Z, (n ) (1) then the non degenerate Z has one of the three types of distributions: (i) Frechet: G α (x) := exp{ x α }, x > 0, with α > 0. (ii) Negative Frechet: Ψ α (x) := exp{ ( x) α }, x < 0, with α > 0. (iii) Gumbel: Λ(x) := exp{ e x }, x R. Analog of the CLT when replaces +. Roughly: the tails of X i determine the type of limit. Def A rv X is max stable if, a, b > 0, c > 0, d: where X, X, X iid. ax bx d = cx + d, Th The limit Z in (1) are max stable. Conversely any max stable Z appears in a limit of iid maxima as in (1). 3

4 More on Frechet distributions Def Z is α Fréchet (α > 0) if P{Z x} = exp{ cx α }, x > 0, (c > 0). Properties scale family: For all a > 0, P{aZ x} = exp{ (a α c)x α }, x > 0 Notation: For the scale c 1/α : Z α := c 1/α, so az α = a Z α, a > 0. heavy tails: As x, P{Z > x} = 1 e cx α cx α. moments: EZ p < iff p < α and EZ p = 0 x p de cx α = c p Γ(1 p/α). Max stability: For iid Z, Z 1,..., Z n : Indeed: Z 1 Z n d = n 1/α Z. P{Z i x, i = 1,..., n} = P{Z x} n = e ncx α, x > 0. 4

5 Application I: Random Sketches First, mathematical concepts, then explanation & use. 5

6 Max stable sketches Signal : f(i) 0, i = 1,..., n, with large n. Def For α > 0, and k N, let n E j (f) := f(i)z j (i), 1 j k, i=1 for iid standard α Fréchet Z j (i) s P{Z j (i) x} = exp{ x α }, x > 0 i.e. Z j (i) α = 1. Max stable sketch of f: E j (f), 1 j k. Idea: use {E j (f)} with k << n as a proxy to f! More on this later... Properties E j (f) are α Fréchet with n E j (f) α = ( f(i) α ) 1/α =: f lα. Indeed, for x > 0: i=1 P{E j (f) x} = i P{Z x/f(i)} = exp{ ( i f(i) α )x α }. max linearity: For a, b > 0 and another signal g(i) E j (af bg) = ae j (f) be j (g), 1 j k, where (af bg)(i) = af(i) bg(i). 6

7 Max stable sketches: estimation Observations The E j (f) s are iid in j = 1,..., k The scale E j (f) α equals the norm f lα. Estimators: (moment) For 0 < p < α, define L p (f) := c k p,α E j (f) p k with c p,α := Γ(1 p/α) 1. j=1 Motivation: 1 E j (f) p EE j (f) p = Γ(1 p/α) E j (f) p k α. j (median) Define, Motivation: 1/p M(f) := ln(2) 1/α med{e j (f), 1 j k}. e xα = 1/2 x = ln(2) 1/α., 7

8 Norms and distances Given the max stable sketch {E j (f)} of a signal f. For large k s, we can estimate the norm: f lα L p (f) or f lα M(f). Performance guarantees later... Given another signal g, define the distance: ρ α (f, g) := n f(i) α g(i) α. i=1 Note that a b = 2(a b) a b, a, b 0, so: ρ α (f, g) = 2 i f(i) α g(i) α i f(i) α i g(i) α, and hence: ρ α (f, g) = 2 f g α l α f α l α g α l α. Thus, given the sketches E j (f) and E j (g), we can estimate ρ α (f, g) by or by ρ α (f, g) 2L p (f g) α L p (f) α L p (g) α ρ α (f, g) 2M(f g) α M(f) α M(g) α The key is: E j (f g) = E j (f) E j (g) so E j (f g) is available from E j (f) and E j (g). 8

9 Dominance norms Given are several signals f t (i), 1 i n, indexed by t = 1,..., T. Def The dominant signal is: f (i) := max 1 t T f t(i), 1 i n. The dominance norm of the f t s is f lα. Problem: Estimate f lα from the sketches {E j (f t )}, t = 1,..., T. Without accessing the signals f t directly! Solution: Max linearity implies E j (f ) = E j (f 1 ) E j (f T ), 1 j k. The sketch E j (f ) is readily available! Proceed as above and estimate f lα by: f lα L p (f ) or by f lα M(f ). 9

10 Motivation: Data streams Signals which cannot be stored and processed with conventional methods are emerging in: Telephone call monitoring (unmanageably large data bases). Internet traffic monitoring (vast, rapidly growing amounts of data). Sensing applications (restrictions on power, bandwidth, etc.). Online transactions (Banking, Stock market data, E-commerce, etc.). The streaming model a general framework. Let f(i), 1 i n be a signal with large domain size n. Starting with f(i) = 0, update f sequentially in time: cash register Observe (i, a(i)), and update f: f(i) := f(i) + a(i) aggregated Observe (i, f(i)) directly. 10

11 Examples: Call detail records & IP monitoring Phone calls: Each time a phone call ends, a CDR packet is sent to a main station. Think of: CDR = (i, a(i), ), with i = (phone no.) a(i) = time used. Given a CDR (i, a(i), update f(i) := f(i) + a(i). So the signal f(i) contains info on all phone numbers! Multiple signals: f t (i), t = 1,...,7 capture the calling patterns for different days of the week. IP monitoring: Monitor a fast Internet link during a given day (or hour). How many distinct IPs used it? How many bytes were transmitted from a given range of IPs? Can you give me estimates in real time??? Signal: f(i) = Bytes transmitted by IP i. Stream items: A packed from IP i contains info: (i, b(i)), i = (IP number), b(i) = (Bytes) On observing (i, b(i)), update: f(i) := f(i) + b(i). Impossible to store and process all data in real time. 11

12 Classical (additive) random sketches Instead of storing the signal f(i), 1 i n, keep: n S j (f) = f(i)x j (i), 1 j k, i=1 where k << n and where X j (i) are iid random variables. The S j (f) s 1 j k is called the random sketch of f. Important: Can generate X j (i) s on the fly from small O(klog 2 (n)) set of seeds in main memory. Can update the sketch sequentially in the most general, cash register model. Can approximate norms and inner products! If Var(X j (i)) = σ 2, then Var(S j (f)) = f 2 l 2 σ 2, f 2 l 2 = i f(i)2. So f 2 l 2 1 K S j (f) 2. For two signals f and g, by linearity, j f g 2 l 2 1 k k S j (f) S j (g) 2, j=1 12

13 Back to max stable sketches Strengths 1. Max stable sketches are max linear. Natural use in: sensor data, Internet traffic monitoring, Transaction record tracking, max sequential updates. 2. Work for any α > 0 and not only for 0 < α 2 (unlike the sum stable sketches). 3. Tailor made for: dominance norm estimation. Illustration: Signals: f t (i) = Bytes of IP number i during t th hour of monitoring. Large domain: 1 i n where n = Problem: Estimate the norm of the heaviest load scenario during the day, i.e. for f (i) = max 1 t 24 f t(i) estimate the dominance norm f lα? Problem: Compare several network traffic scenarios f, g, h,... without accessing the enormous data directly. Compute pair wise distances, norms and dominance norms. Max stable sketches provide efficient solutions. Limitation: Max stable sketches are not linear cannot be updated in the linear cash register format! 13

14 Max stable sketches: Performance guarantees Norms Given a max stable sketch E j (f), j = 1,..., k of a signal f(i) 0. Recall that M(f) is the median based estimator of f lα. Th 1(SHKT 07) Let ǫ, δ (0,1). Then, P{ M(f)/ f lα 1 ǫ} 1 δ, if k C log(1/δ)/ǫ 2, for some C > 0. For optimal value of the constant, as ǫ, δ 0, we have k = k(ǫ, δ) 2 α 2 ln 2 (2) ln(1/δ) ǫ 2. Distances Let now g be another signal and let D(f, g) := 2M(f g) α M(f) M(g) ρ α (f, g) = f α g α l1. Th 4 (SHKT 07) Let ǫ, δ, η (0,1). If ρ α (f) η f g α l α, then P{ D(f, g)/ρ α (f, g) 1 ǫ} 1 δ, for k C log(1/δ)/ǫ 2, with some C > 0. 14

15 Sketches: summary & credits Some seminal papers & algorithmic applications Alon, Matias & Szegedy (1996) Feingenbaum, Kannanm Strauss & Viswanathan (1999) Fong & Strauss (2000) Indyk (2000) Gilbert, Kotidis, Muthukrishnan & Strauss (2001, 2002) Cormode & Muthukrishnan (2003) Datar, Immorlica, Indyk & Mirrokni (2004) Review paper: Muthukrishnan (2003) More on max stable sketches: Stoev, Hadjieleftheriou, Kollios & Taqqu (2007) Code: By Marios Hadjieleftheriou (AT&T Research) A lesson: (For probabilists and computer scientists) Study each other s fields! 15

16 Application II: Heavy tail exponents 16

17 Heavy tailed data A random variable X is said to be heavy tailed if P{ X x} L(x)x α, as x, for some α > 0 and a slowly varying function L. Here we focus on the simpler but important context: X 0, a.s. and P{X > x} Cx α, as x. X (infinite moments) For p > 0, EX p < if and only if p < α. In particular, and 0 < α 2 Var(X) = 0 < α 1 E X =. The estimation of the heavy tail exponent α is an important problem with rich history. Why do we need heavy tail models? Every finite sample X 1,..., X n has finite sample mean, variance and all sample moments! Why consider heavy tailed models in practice?! 17

18 Why use heavy tailed models? All models are wrong, but some are useful. George Box Let F and G be any two distributions with positive densities on (0, ). Let ǫ > 0 and x 1,..., x n (0, ) be arbitrary, then both: and P F {X i (x i ǫ, x i + ǫ), i = 1,..., n} > 0 are positive! P G {X i (x i ǫ, x i + ǫ), i = 1,..., n} > 0 For a given sample, very many models apply. The ones that continue to work as the sample grows are most suitable. We next present real data sets of Financial, Insurance and Internet data. They can be very heavy tailed. 18

19 Traded volumes on the Intel stock 10 x Traded Volumes No. Stocks, INTC, Nov 1, x x

20 Insurance claims due to fire loss 250 Danish Fire Loss Data: Hill plot: α (k) = H order statistics Max Spectrum H= ( ), α = Scales j 20

21 TCP flow sizes (in number of packets) x 10 4 TCP Flow Sizes (packets): UNC link 2001 (~ 36 min) time x 10 4 The first minute

22 History Hill (1975) worked out the MLE in the Pareto model P{X > x} = x α, x 1 and introduced the Hill plot: α H (k) := ( 1 k k log(x i,n ) log(x k+1,n )) 1, i=1 where X 1,n X 2,n X k,n are the top K order statistics of the sample. How to choose k? pick k where the plot of α H (k) vs. k stabilizes. serious problems in practice: volatile & hard to interpret: Hill horror plot confidence intervals robustness Consistency and asymptotic normality resolved: Weissman (1978), Hall (1982) in semi parametric setting. Many other estimators: kernel based Csörgő, Deheuvels and Mason (1985), moment Dekkers, Einmahl and de Haan (1989), among many others. Most estimators exploit the top order statistics. 22

23 Hill horror plots: TCP flow sizes 2 Hill plot 1.5 α Order statistics k x 10 4 α H (300) = α H (12000) = α 1 α Order statistics k Order statistics k x

24 Fréchet max stable laws Consider i.i.d. X i s with P{X 1 > x} Cx α, x. By the Fisher Tippett Gnedenko theorem: 1 max n X 1/α i 1 n d X 1 i n n 1/α i Z, as n, i=1 where Z is α Fréchet extreme value distribution: P{Z x} = exp{ Cx α }, x > 0. The extreme value distributions are max stable. In particular, for i.i.d. α Fréchet Z,& Z i s: Z 1 Z n d = n 1/α Z. A time series of i.i.d. α Fréchet {Z k } is 1/α max-selfsimilar: m N : { 1 i m Z m(k 1)+i } k N d = m 1/α {Z k } k N. Block maxima of size m have the same distribution as the original data, rescaled by the factor m 1/α. Any heavy tailed {X k } (i.i.d.) data set is asymptotically max self similar. 24

25 Max spectrum Given a positive sample X 1,..., X n, consider the dyadic block maxima: D(j, k) := max 1 i 2 j X 2j (k 1)+i with k = 1,..., n j = [n/2 j ]. In view of the asymptotic scaling: observe that Y j := 1 n j 1 d 2j/αD(j, k) Z, n j j=1 where c := Elog 2 Z. 2 j i=1 (j ), X 2j (k 1)+i, log 2 D(j, k) j/α + c, (j, n j ) The last asymptotics follow from the LLN since D(j, k) s are independent in k. The max spectrum of the data is defined as the statistics: Y j, j = 1,...,[log 2 (n)]. Can identify α from the slope of the Y j s vs. j, for large j s. 25

26 Max spectrum estimators of α Given the max spectrum Y j, j = 1,...,[log 2 (n)], define j 2 α(j 1, j 2 ) := ( w j Y j ) 1, j=j 1 where j jw j = 1 and j w j = 0. That is, use linear regression to estimate the slope 1/α. Issues: weighted or generalized least squares must be used, since Var(Y j ) 1/n j 2 j, The choice of the scales j 1 and j 2 is critical. These issues and confidence intervals, are addressed in Stoev, Michailidis and Taqqu (2006). 26

27 Examples of max spectra Frechet(α = 1.5), Estimated α = Intel Volume: α = Max Spectrum 10 5 Max Spectrum Scales j 1.3 stable(β=1), α = Scales j T dist (df = 3), α = Max Spectrum Max Spectrum Scales j Scales j 27

28 Robustness of the max spectrum 20 Max self similarity: α(5,10) = , α(10,12) = Max Spectrum Scales j Compare with the Hill horror plot of the Internet flow size data (next slide). 28

29 Hill horror plots: TCP flow sizes 2 Hill plot 1.5 α Order statistics k x 10 4 α H (300) = α H (12000) = α 1 α Order statistics k Order statistics k x 10 4 Compare with the max spectrum plot of the Internet flow size data (previous slide). 29

30 Asymptotic normality of the max spectrum Let P{X k x} =: F(x), where 1 F(x) = Cx α (1 + O(x β )), (x ), (2) with some β > 0. Consider the max spectrum {Y j } for the range of scales: r(n) + j 1 j r(n) + j 2, with fixed j 1 < j 2 and r(n), n. Theorem Under (2) and another technical condition, P{ n j2 +r(( θ, Y r ) ( θ, µ r )) x} Φ(x/σ θ ) sup x R C θ (1/2 rβ/α + r2 r/2 / n), (3) where Φ is the standard Normal c.d.f. Here Y r = (Y r+j ) j 2 j=j 1 and and ( θ, Y r ) = j 2 j=j 1 θ j Y r+j, σ 2 θ = ( θ,σ j 2 αθ) := µ r (j) = (r + j)/α + c i,j=j 1 θ i Σ α (i, j)θ j > 0. 30

31 The covariance structure of the max spectrum The asymptotic covariance matrix Σ is the same as if the data were i.i.d. α Fréchet! The intuition is that the block maxima {D(r + j, k), k = 1,..., n r+j } behave like i.i.d. α Fréchet variables, as r = r(n). The covariance entries are given by: with Σ α (i, j) = α 2 2 i j j 2ψ( i j ), ψ(a) := Cov(log 2 (Z 1 ),log 2 (Z 1 (2 a 1)Z 2 ), a 0, for i.i.d. 1 Fréchet Z 1 & Z 2. Note that α appears only as a factor in Σ α : Σ α = α 2 Σ 1. It does not affect the correlation structure of the max spectrum. GLS estimators for α use the matrix Σ α. Asymptotic normality for α(r + j 1, r + j 2 ) follows from the last theorem. Stoev, Michailidis and Taqqu (2006) has details on confidence intervals for α and automatic selection of j 1 &j 2. 31

32 Automatic selection of scales Null hypothesis : Y = {Y j } J j=1 is multivariate Normal with covariance matrix α 2 Σ 1 and means EY j = j/α + const,1 j J. Given a range of scales j 1 j j 2, set Ĥ(j 1, j 2 ) 1/ α(j 1, j 2 ) := w j 1,j 2 Y, for the GLS estimator of the slope H = 1/α of Y j vs j. Algorithm: 1. Set j 2 J = [log 2 (n)] and j 1 := j Compute Ĥ +1 := Ĥ(j 1 1, j 2 ) and Ĥ 0 := Ĥ(j 1, j 2 ). 3. Compute a conf int for Ĥ +1 Ĥ 0 (based on the null hypothesis). 4. If the conf int does not contain 0, then stop. Else, set j 1 := j 1 1 and repeat Step 2. Important details: Plug in for α used in the covariance matrix Level of conf int to be set by the user Multiple testing issue not addressed The method works very well in practice! 32

33 Automatic selection of scales: an illustration Simulated: a mixture of 1 Fréchet (10%) and Exponential with mean 5 (90%). Sample size: n = 2 17 = 131,072. Root MSE Automatic choices of j 1 : p= j 1 Square root MSE j Automatic j 1 : α estimates α MSE best j 1 =10: α estimates α 33

34 Rates for moment functionals of maxima Let X 1,..., X n be i.i.d. from F and recall that M n := 1 d X n 1/α i Z, (n ). 1 i n Are there results on the rate of convergence for reasonable f s? Ef(M n ) Ef(Z), (n ) Pickands (1975) shows only the convergence of moments (no rates). Our approach: consider with F(x) = exp{ σ α (x)x α }, x R, (4) σ α (x) C, (x ). Note: 1 F(x) Cx α, x is equivalent to (4). Extra assumption: σ α (x) C Dx β, for large x. 34

35 Rates (cont d) But (4) is very convenient to handle rates! Note that Thus, Ef(M n ) = and also P{M n x} Ef(Z) = 0 0 = P{X 1 n 1/α x} n = F(n 1/α x) n = exp{ σ α (n 1/α x)x α }. (5) f(x)df n (x) = f(x)dg(x) = 0 Now, integration by parts yields: E(f(M n ) f(z)) = 0 0 f(x)dexp{ σ α (n 1/α x)x α }, f(x)dexp{ Cx α }. (G(x) F n (x))f (x)dx. However G(x) and F n (x) are of the same exponential form! 35

36 Rates (cont d) By the mean value theorem: F n (x) G(x) = exp{ σ α (n 1/α x)x α } exp{ Cx α } σ α (n 1/α x) C x α exp{ cx α } Dn β/α x (α+β) exp{ cx α }, as x, Thus, for any ǫ > 0, as n, ǫ F n (x) G(x) f (x) dx Dn β/α f (x) x (α+β) e cx α dx = O(n β/α ). 0 By taking ǫ = ǫ n 0, and using a mild technical condition we can also bound the integral near zero ǫn 0 F n (x) G(x) f (x) dx. Th If σ α (x) Dx β, as x, then as n, n β/α (Ef(M n ) Ef(Z)) D 0 x (α+β) f (x)e Cx α dx, provided mild technical conditions on σ(x) at 0 and on f at 0 and hold. 36

37 Rates (cont d) We have thus obtained exact rates for moment functionals Ef( 1 max n X 1/α i) 1 i n They are valid for a large class of absolutely continuous f, including: f(x) = log 2 (x), and f(x) = x p, p (0, α), for example. More details can be found in Stoev, Michailidis and Taqqu (2006). As a corollary, we also get rates of convergence for Cov(log 2 D(r+j 1, k),log 2 D(r+j 2, k)), as r = r(n). These tools and the Berry Esseen Theorem, yield the uniform asymptotic normality of the max spectrum. Thank you! 37

38 References Alon, N., Matias, Y. & Szegedy, M. (1996), The space complexity of approximating the frequency moments, in Proceedings of the 28th ACM Symposium on Theory of Computing, pp Cormode, G. & Muthukrishnan, S. (2003), Algorithms - ESA 2003, Vol of Lecture Notes in Computer Science, Springer, chapter Estimating Dominance Norms of Multiple Data Streams, pp Csörgő, S., Deheuvels, P. & Mason, D. (1985), Kernel estimates of the tail index of a distribution, Annals of Statistics 13(3), Datar, M., Immorlica, N., Indyk, P. & Mirrokni, V. S. (2004), Locality sensitive hashing scheme based on p stable distributions, in SCG 04: Proceedings of the twentieth annual symposium on Computational geometry, ACM Press, New York, NY, USA, pp Dekkers, A., Einmahl, J. & de Haan, L. (1989), A moment estimator for the index of an extreme value distribution, Ann. Statist. 17(4), Feigenbaum, J., Kannan, S., Strauss, M. & Viswanathan, M. (1999), An approximate l1-difference algorithm for massive data streams, in FOCS 99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, IEEE Computer Society, Washington, DC, USA, p Fong, J. H. & Strauss, M. (2000), An approximate lp-difference algorithm for massive data streams, in STACS 00: Proceedings of the 17th Annual Symposium on Theoretical Aspects of Computer Science, Springer-Verlag, London, UK, pp Gilbert, A., Kotidis, Y., Muthukrishnan, S. & Strauss, M. (2001), Surfing wavelets on streams: one-pass summaries for approximate aggregate queries, in Procedings of VLDB, Rome, Italy.

39 Gilbert, A., Kotidis, Y., Muthukrishnan, S. & Strauss, M. (2002), How to summarize the universe: Dynamic maintenance of quantiles. Hall, P. (1982), On some simple estimates of an exponent of regular variation, J. Roy. Stat. Assoc. 44, Series B. Hill, B. M. (1975), A simple general approach to inference about the tail of a distribution, The Annals of Statistics 3, Indyk, P. (2000), Stable distributions, pseudorandom generators, embeddings and data stream computation, in IEEE Symposium on Foundations of Computer Science, pp Muthukrishnan, S. (2003), Data streams: algorithms and applications, Preprint. muthu/. Pickands, J. (1975), Statistical inference using extreme order statistics, Ann. Statist. 3, Stoev, S., Hadjieleftheriou, M., Kollios, G. & Taqqu, M. (2007), Norm, point and distance estimation over multiple signals using max stable distributions, in 23rd International Conference on Data Engineering, ICDE, IEEE. To appear. Stoev, S., Michailidis, G. & Taqqu, M. (2006), Estimating heavy tail exponents through max self similarity, Preprint. Weissman, I. (1978), Estimation of parameters and large quantiles based on the k largest observations, Journal of the American Statistical Association 73,

On the estimation of the heavy tail exponent in time series using the max spectrum. Stilian A. Stoev

On the estimation of the heavy tail exponent in time series using the max spectrum Stilian A. Stoev (sstoev@umich.edu) University of Michigan, Ann Arbor, U.S.A. JSM, Salt Lake City, 007 joint work with: