The largest eigenvalues of the sample covariance matrix in the heavy-tail case

Thomas Mikosch, University of Copenhagen

Joint work with Richard A. Davis (Columbia, NY) and Johannes Heiny (Aarhus University)

Suzhou, June 2018
Figure 1. Estimation of the lower and upper tail indices for the S&P 500 series.
1. Heavy-tailed matrices: preliminaries

We consider $p\times n$ matrices $\mathbf{X}=\mathbf{X}_n=(X_{it})_{i=1,\dots,p;\,t=1,\dots,n}$, where all entries have the same heavy-tailed distribution; $X$ denotes a generic element with this distribution. We are interested in the asymptotic behavior of the largest eigenvalues and corresponding eigenvectors of the sample covariance matrices for $p=p_n$:
\[
\mathbf{X}\mathbf{X}' = \Big(\sum_{t=1}^n X_{it}X_{jt}\Big)_{i,j=1,\dots,p},\qquad n=1,2,\dots
\]
1.1. An excursion to regularly varying random variables.

The random variable $X$ and its distribution are regularly varying with index $\alpha>0$ if there exist a slowly varying function $L$ and constants $p_\pm\ge 0$ with $p_++p_-=1$ such that
\[
P(X>x)\sim p_+\,\frac{L(x)}{x^{\alpha}}\quad\text{and}\quad P(X<-x)\sim p_-\,\frac{L(x)}{x^{\alpha}},\qquad x\to\infty.
\]
Then $P(X^2>x)\sim L(\sqrt{x})\,x^{-\alpha/2}$.

For iid copies $(X_i)$ of $X>0$ and some slowly varying $\ell$, Embrechts and Goldie (1984):
\[
P(X_1X_2>x)\sim \ell(x)\,x^{-\alpha}.
\]
If $X,\sigma>0$ are independent and $E[\sigma^{\alpha+\delta}]<\infty$ for some $\delta>0$, then (Breiman (1965))
\[
P(\sigma X>x)\sim E[\sigma^{\alpha}]\,P(X>x),\qquad x\to\infty.
\]
A regularly varying random variable $X>0$ and its distribution are subexponential: for some (equivalently, any) $n\ge 2$,
\[
P(X_1+\cdots+X_n>x)\sim n\,P(X>x),\qquad x\to\infty.
\]
Other subexponential distributions: the log-normal and the Weibull with $\overline F(x)=\exp(-x^{\tau})$, $\tau\in(0,1)$.

Assume $X>0$. There exists a sequence $x_n\to\infty$ such that
\[
P(X_1+\cdots+X_n>x_n)\sim n\,P(X>x_n),\qquad n\to\infty,
\]
if and only if $X$ is subexponential. Cline, Hsing (1998)
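Breiman's lemma lends itself to a quick Monte Carlo check. The sketch below uses hypothetical choices: Pareto($\alpha$) for $X$ with $\alpha=2$ and a log-normal $\sigma$, for which $E[\sigma^{\alpha}]=e^{\alpha^2 s^2/2}$ is available in closed form.

```python
import numpy as np

rng = np.random.default_rng(1)
alpha, s = 2.0, 0.5                      # tail index of X, log-normal scale of sigma
n = 10**6
X = rng.pareto(alpha, n) + 1.0           # P(X > x) = x^{-alpha}, x >= 1
sigma = rng.lognormal(0.0, s, n)         # E[sigma^{alpha+delta}] < infinity

x = 50.0
lhs = np.mean(sigma * X > x)             # empirical P(sigma X > x)
rhs = np.exp(alpha**2 * s**2 / 2) * x**(-alpha)   # E[sigma^alpha] P(X > x)
ratio = lhs / rhs                        # should be close to 1 for large x
```

The ratio stabilizes near 1 as $x$ grows, in line with the lemma; the threshold $x=50$ is an arbitrary point in the tail.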
1.2. Heavy-tail large deviations.

Define for iid $(X_i)$ with a generic element $X$:
\[
S_n=X_1+\cdots+X_n,\qquad n\ge 1.
\]
Assume $E[X]=0$ if the first moment is finite. If $x_n\to\infty$, $A$ is bounded away from zero, and $x_n^{-1}S_n\stackrel{P}{\to}0$ as $n\to\infty$, we call $P(x_n^{-1}S_n\in A)$ a large deviation probability.
If $(X_i)$ is iid regularly varying with index $\alpha>0$, then
\[
\sup_{x\ge\gamma_n}\Big|\frac{P(\pm S_n>x)}{n\,P(|X|>x)}-p_\pm\Big|\to 0,\qquad n\to\infty,
\]
where $\gamma_n/a_n\to\infty$ for $\alpha\in(0,2)$, with $n\,P(|X|>a_n)\to 1$, and $\gamma_n=\sqrt{c\,n\log n}$ for $c>\alpha-2$ if $\alpha>2$.

A.V. Nagaev (1969), S.V. Nagaev (1979), C.C. Heyde (1969), Cline, Hsing (1998), ...

If $\alpha>2$, $0<c<\alpha-2$ and $\mathrm{var}(X)=1$, then uniformly for $x\le\sqrt{c\,n\log n}$,
\[
P(S_n>x)\sim\overline{\Phi}(x/\sqrt{n}),\qquad n\to\infty.
\]
Separating sequences for the normal and subexponential approximations to large deviation probabilities exist for general subexponential distributions.
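The Nagaev-type approximation $P(S_n>x)\approx n\,P(X>x)$ in the large deviation zone can be probed numerically. A sketch with a hypothetical choice: centered Pareto summands with $\alpha=1.5$, so $a_n=n^{1/\alpha}=100$, evaluated at $x=10\,a_n$.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, n, m = 1.5, 1000, 10000
mu = alpha / (alpha - 1.0)               # mean of Pareto(1.5) on [1, inf)
# m independent copies of the centered sum S_n
S = (rng.pareto(alpha, (m, n)) + 1.0).sum(axis=1) - n * mu

x = 1000.0                               # well beyond a_n = n^{1/alpha} = 100
lhs = np.mean(S > x)                     # empirical P(S_n > x)
rhs = n * (x + mu) ** (-alpha)           # n P(X - mu > x)
```

Deep in the large deviation zone the two quantities agree up to a moderate factor; the agreement improves as $x/a_n$ grows.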
2. Heavy-tailed matrices with iid entries: finite $p$

We consider $p\times n$ matrices $\mathbf{X}=\mathbf{X}_n=(X_{it})_{i=1,\dots,p;\,t=1,\dots,n}$, where all entries have the same heavy-tailed distribution
\[
P(X>x)\sim p_+\,\frac{L(x)}{x^{\alpha}}\quad\text{and}\quad P(X<-x)\sim p_-\,\frac{L(x)}{x^{\alpha}}
\]
for a slowly varying function $L$, some $\alpha>0$, $p_++p_-=1$. Assume $E[X]=0$ if the expectation is finite. Then (Gnedenko and Kolmogorov (1949), Feller (1971), Resnick (2007))
\[
a_n^{-2}\Big(\sum_{t=1}^n X_{it}^2-c_n\Big)\stackrel{d}{\to}\xi_i^{(\alpha/2)},\qquad \alpha\in(0,4),
\]
where $P(|X|>a_n)\sim n^{-1}$ and $(\xi_i^{(\gamma)})$ are iid $\gamma$-stable, $\gamma\in(0,2]$,
and for $i\ne j$,
\[
b_n^{-1}\sum_{t=1}^n X_{it}X_{jt}\stackrel{d}{\to}\xi_{i,j}^{(2\wedge\alpha)},\qquad \alpha\in(0,4),
\]
where $P(|X_1X_2|>b_n)\sim n^{-1}$ for $\alpha\in(0,2)$ and $b_n=\sqrt{n}$ for $\alpha\in(2,4)$. Since $a_n^2\approx n^{2/\alpha}$ and $b_n\approx n^{1/(2\wedge\alpha)}$,
\[
\big\|a_n^{-2}\big(\mathbf{X}\mathbf{X}'-\mathrm{diag}(\mathbf{X}\mathbf{X}')\big)\big\|_2^2
\le \frac{b_n^2}{a_n^4}\sum_{1\le i\ne j\le p}\Big(\frac{1}{b_n}\sum_{t=1}^n X_{it}X_{jt}\Big)^2\stackrel{P}{\to}0.
\]
Therefore, by Hermann Weyl's inequality,
\[
a_n^{-2}\max_{i=1,\dots,p}\big|\lambda_{(i)}-\lambda_{(i)}(\mathrm{diag}(\mathbf{X}\mathbf{X}'))\big|
\le a_n^{-2}\big\|\mathbf{X}\mathbf{X}'-\mathrm{diag}(\mathbf{X}\mathbf{X}')\big\|_2\stackrel{P}{\to}0,
\]
where $\lambda_{(1)}(A)\ge\cdots\ge\lambda_{(p)}(A)$ for any symmetric matrix $A$ and $\lambda_{(i)}=\lambda_{(i)}(\mathbf{X}\mathbf{X}')$.
Hence for $\alpha\in(0,4)$, with $\xi_{(1)}^{(\alpha/2)}\ge\cdots\ge\xi_{(p)}^{(\alpha/2)}$ the ordered values of $(\xi_i^{(\alpha/2)})$,
\[
\big(a_n^{-2}(\lambda_{(i)}-c_n)\big)_{i=1,\dots,p}\stackrel{d}{\to}\big(\xi_{(i)}^{(\alpha/2)}\big)_{i=1,\dots,p}.
\]
In particular,
\[
a_n^{-2}\big(\lambda_{(1)}-n\,E[X^2]\,\mathbf{1}_{(2,4)}(\alpha)\big)\stackrel{d}{\to}\max_{i=1,\dots,p}\xi_i^{(\alpha/2)}.
\]
The unit eigenvector $V_j$ associated with $\lambda_{(j)}=\lambda_{L_j}$ can be localized: $\|V_j-e_{L_j}\|\stackrel{P}{\to}0$.
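The reduction of the eigenvalues of $\mathbf{X}\mathbf{X}'$ to its ordered diagonal is easy to see in simulation. A sketch with hypothetical choices ($\alpha=0.8$, symmetrized Pareto entries, small fixed $p$):

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, alpha = 5, 10**5, 0.8
# symmetrized Pareto entries with tail index alpha: P(|X| > x) = x^{-alpha}, x >= 1
X = (rng.pareto(alpha, (p, n)) + 1.0) * rng.choice([-1.0, 1.0], (p, n))

S = X @ X.T
eig = np.sort(np.linalg.eigvalsh(S))[::-1]     # lambda_(1) >= ... >= lambda_(p)
diag = np.sort(np.diag(S))[::-1]               # ordered diagonal D_(1) >= ... >= D_(p)
a_n2 = n ** (2.0 / alpha)                      # a_n^2 for exact Pareto tails (no slowly varying part)
err = np.max(np.abs(eig - diag)) / a_n2        # the Weyl bound: should be small
```

For $\alpha<2$ no centering $c_n$ is required, so the ordered diagonal alone tracks the eigenvalues at the scale $a_n^2$.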
We have for $\alpha\in(2,4)$, with $c_n=n\,E[X^2]$,
\[
\frac{1}{a_n^{2}\,c_n^{p-1}}\big(\det(\mathbf{X}\mathbf{X}')-c_n^{p}\big)
=\sum_{i=1}^p a_n^{-2}(\lambda_{(i)}-c_n)\prod_{j=1}^{i-1}\frac{\lambda_{(j)}}{c_n}
\stackrel{d}{\to}\sum_{i=1}^p\xi_{(i)}^{(\alpha/2)}
=\sum_{i=1}^p\xi_i^{(\alpha/2)}
\stackrel{d}{=}p^{2/\alpha}\,\xi_1^{(\alpha/2)}.
\]
3. Heavy-tailed matrices with iid entries: $p$ increases with $n$

We consider $p\times n$ matrices $\mathbf{X}=\mathbf{X}_n=(X_{it})_{i=1,\dots,p;\,t=1,\dots,n}$, where the entries are iid with heavy tails
\[
P(X>x)\sim p_+\,x^{-\alpha}\quad\text{and}\quad P(X<-x)\sim p_-\,x^{-\alpha}
\]
for some $\alpha\in(0,4)$, $p_++p_-=1$, and $p=p_n\approx n^{\beta}$ for some $\beta>0$. Then for $a_{np}\approx n^{(1+\beta)/\alpha}$, i.e., $np\,P(|X|>a_{np})\to 1$,
\[
a_{np}^{-2}\big\|\mathbf{X}\mathbf{X}'-\mathrm{diag}(\mathbf{X}\mathbf{X}')\big\|_2\stackrel{P}{\to}0,\qquad \beta\in(0,1],
\]
\[
a_{np}^{-2}\big\|\mathbf{X}'\mathbf{X}-\mathrm{diag}(\mathbf{X}'\mathbf{X})\big\|_2\stackrel{P}{\to}0,\qquad \beta\in(1,\infty).
\]
One needs Nagaev-type large deviations: for $i\ne j$ and $\beta\le 1$,
\[
p^{2}\,P\Big(a_{np}^{-2}\Big|\sum_{t=1}^n X_{it}X_{jt}\Big|>\varepsilon\Big)
\approx p^{2}\,n\,P(|X_1X_2|>\varepsilon\,a_{np}^{2})\to 0.
\]
Observe that $\mathbf{X}\mathbf{X}'$ and $\mathbf{X}'\mathbf{X}$ have the same positive eigenvalues. For $\beta>1$, interchange the roles of $n$ and $p$ and observe that $n\approx p^{1/\beta}$. For the moment we consider the case $\beta\le 1$ only.
Write $\lambda_1,\dots,\lambda_p$ for the eigenvalues of $\mathbf{X}\mathbf{X}'$ and $\lambda_{(p)}\le\cdots\le\lambda_{(1)}$ for their ordered values. We also write
\[
D_i=\sum_{t=1}^n X_{it}^2=\lambda_i(\mathrm{diag}(\mathbf{X}\mathbf{X}')),\qquad i=1,\dots,p,
\]
and $D_{(p)}\le\cdots\le D_{(1)}$ for the ordered values. By Weyl's inequality,
\[
a_{np}^{-2}\max_{i=1,\dots,p}\big|\lambda_{(i)}-D_{(i)}\big|
\le a_{np}^{-2}\big\|\mathbf{X}\mathbf{X}'-\mathrm{diag}(\mathbf{X}\mathbf{X}')\big\|_2\stackrel{P}{\to}0,\qquad \beta\in(0,1].
\]
Limit theory for the order statistics of $(D_i)$ can be handled by point process convergence towards a Poisson process: for example,
\[
\sum_{i=1}^p \varepsilon_{a_{np}^{-2}(D_i-c_n)}\stackrel{d}{\to}\sum_{i=1}^\infty \varepsilon_{\Gamma_i^{-2/\alpha}}=N_\Gamma
\]
for $\Gamma_i=E_1+\cdots+E_i$, $(E_i)$ iid standard exponential, if and only if (Resnick (2007))
\[
p\,P\big(a_{np}^{-2}(D_i-c_n)>x\big)\to x^{-\alpha/2},\qquad x>0,
\]
\[
p\,P\big(a_{np}^{-2}(D_i-c_n)<-x\big)\to 0,\qquad x>0.
\]
This follows from Nagaev-type large deviations.
Centering with $c_n$ is only needed for $\alpha\in[2,4)$ and if $(n\vee p)/a_{np}^2\not\to 0$. Centering is not needed if, for example,
\[
\min(\beta,\beta^{-1})\in\big((\alpha/2-1)_+,\,1\big].
\]
Under the condition
\[
p/n\to\gamma\in(0,\infty),\tag{3.1}
\]
this was proved by Soshnikov (2004, 2006) for $\alpha\in(0,2)$ and by Auffinger, Ben Arous, Péché (2009) for $\alpha\in(0,4)$. They do not make explicit use of large deviation probabilities but prove their results by approximating the largest eigenvalues $\lambda_{(i)}$ by the corresponding $i$th order statistics $X_{(i)}^2$ of $(X_{jt}^2)_{j=1,\dots,p;\,t=1,\dots,n}$.
Partial results (for particular choices of $p$) were proved in Davis, Pfaffel and Stelzer (2014) and Davis, Mikosch, Pfaffel (2016) for $(X_{it})$ satisfying some linear dependence conditions. If $\alpha<1$, $(p_n)$ can be chosen of almost exponential rate.
In the light-tailed case, limit results for the eigenvalues of $\mathbf{X}\mathbf{X}'$ are very sensitive to the growth rate of $p$. For example, Johnstone (2001) showed for iid standard normal $X_{it}$ under (3.1) that
\[
\frac{\sqrt{n}+\sqrt{p}}{(1/\sqrt{n}+1/\sqrt{p})^{1/3}}\Big(\frac{\lambda_{(1)}}{(\sqrt{n}+\sqrt{p})^{2}}-1\Big)\stackrel{d}{\to}\text{Tracy--Widom distribution}.
\]
This result remains valid for iid entries $X_{it}$ whose first four moments match those of the standard normal and all of whose moments are finite; Tao and Vu (2011).
In the iid regular variation case, a.s. continuous mapping arguments (folklore; Resnick (1987, 2007)) apply to prove:

* the joint convergence of the order statistics (possibly with centering $c_n$ for $\alpha\in[2,4)$):
\[
a_{np}^{-2}\big(\lambda_{(1)},\dots,\lambda_{(k)}\big)\stackrel{d}{\to}\big(\Gamma_1^{-2/\alpha},\dots,\Gamma_k^{-2/\alpha}\big);
\]
in particular, $P(\Gamma_1^{-2/\alpha}\le x)=\exp(-x^{-\alpha/2})$, $x>0$, is the Fréchet distribution $\Phi_{\alpha/2}$;

* the joint convergence of the ratios of successive order statistics:
\[
\big(\lambda_{(2)}/\lambda_{(1)},\dots,\lambda_{(k)}/\lambda_{(k-1)}\big)\stackrel{d}{\to}\big((\Gamma_1/\Gamma_2)^{2/\alpha},\dots,(\Gamma_{k-1}/\Gamma_k)^{2/\alpha}\big);
\]
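The Fréchet limit of $a_{np}^{-2}\lambda_{(1)}$ for $\alpha\in(0,2)$ (no centering) can be probed by replication: transforming the normalized largest eigenvalue by the Fréchet distribution function should give approximately Uniform(0,1) values. A sketch; $p$, $n$, $\alpha$ and the replication number $m$ are hypothetical choices.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, alpha, m = 50, 2000, 0.8, 200
a2 = (n * p) ** (2.0 / alpha)            # a_{np}^2 for exact Pareto tails
stats = np.empty(m)
for r in range(m):
    X = (rng.pareto(alpha, (p, n)) + 1.0) * rng.choice([-1.0, 1.0], (p, n))
    stats[r] = np.linalg.eigvalsh(X @ X.T).max() / a2

# probability integral transform with the Frechet(alpha/2) df exp(-x^{-alpha/2})
u = np.exp(-stats ** (-alpha / 2.0))     # approximately Uniform(0,1) if the limit holds
```

A mean of `u` near 0.5 (and, more generally, a flat histogram) is consistent with the Fréchet limit $\Phi_{\alpha/2}$.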
* the convergence of the ratio of the largest eigenvalue to the trace of $\mathbf{X}\mathbf{X}'$ for $\alpha\in(0,2)$:
\[
\frac{\lambda_{(1)}}{\lambda_1+\cdots+\lambda_p}\stackrel{d}{\to}\frac{\Gamma_1^{-2/\alpha}}{\Gamma_1^{-2/\alpha}+\Gamma_2^{-2/\alpha}+\cdots}.
\]
With these techniques one can handle results involving finitely many of the largest eigenvalues of $\mathbf{X}\mathbf{X}'$. One cannot handle the smallest eigenvalues, determinants, ....

The largest eigenvalues have very distinct sizes and the eigenvectors are localized: let $V_k$ be the unit eigenvector corresponding to $\lambda_{(k)}=\lambda_{L_k}$. Then $\|V_k-e_{L_k}\|\stackrel{P}{\to}0$.
Figure 2. The logarithms of the ratios $\lambda_{(i+1)}/\lambda_{(i)}$ for the S&P 500 series. We also show the 1, 50 and 99% quantiles (bottom, middle and top lines, respectively) of the variables $\log\big((\Gamma_i/\Gamma_{i+1})^{2/\alpha}\big)$.
Figure 3. Eigenvalues $\lambda_{(i)}$ and their ratios $\lambda_{(i+1)}/\lambda_{(i)}$ (as in Lam & Yao) for the S&P 500 series, 2002--2014, $p=442$.
Figure 4. The components of the eigenvector $V_1$. Left: the case of iid standard normal entries. Right: the case of iid Pareto(0.8) entries. We choose $p=200$ and $n=1{,}000$.
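The contrast in Figure 4 can be reproduced qualitatively in a few lines: under heavy tails the top eigenvector concentrates on the row with the largest sum of squares, while for normal entries it is spread over many components. A sketch using the same sizes $p=200$, $n=1000$ and Pareto(0.8) entries; the seed is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, alpha = 200, 1000, 0.8

def top_eigvec(X):
    """Unit eigenvector belonging to the largest eigenvalue of X X'."""
    w, V = np.linalg.eigh(X @ X.T)       # eigh returns ascending eigenvalues
    return V[:, -1]

Xp = rng.pareto(alpha, (p, n)) + 1.0           # iid Pareto(0.8) entries
L1 = int(np.argmax((Xp**2).sum(axis=1)))       # row index of the largest D_i
loc = abs(top_eigvec(Xp)[L1])                  # ~1: V_1 localized at e_{L1}

Xg = rng.standard_normal((p, n))               # iid standard normal entries
deloc = np.max(np.abs(top_eigvec(Xg)))         # small: no dominant component
```

`loc` close to 1 reflects $\|V_1-e_{L_1}\|\stackrel{P}{\to}0$; `deloc` of order $\sqrt{\log p/p}$ reflects delocalization in the light-tailed case.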
4. Stochastic volatility models

The main idea in the iid case is to exploit the heavier tails of $X^2$ in comparison with those of $X_1X_2$. Similar behavior can be observed for regularly varying stochastic volatility models
\[
X_{it}=\sigma_{it}Z_{it},\qquad i,t\in\mathbb{Z},
\]
where $(\sigma_{it})$ is a strictly stationary random field with finite moments of order $\alpha+\delta$ ($(\log\sigma_{it})$ is often assumed Gaussian), independent of the iid random field $(Z_{it})$, assumed regularly varying with index $\alpha\in(0,4)$.
By Breiman's lemma, $(X_{it})$ is (jointly) regularly varying and
\[
P(X^2>x)\sim E[\sigma^{\alpha}]\,P(Z^2>x)\approx x^{-\alpha/2},
\]
\[
P(X_{it}X_{jt}>x)\sim E[(\sigma_{it}\sigma_{jt})^{\alpha}]\,P(Z_1Z_2>x)\approx x^{-\alpha}.
\]
Nagaev-type large deviations for stochastic volatility models (Mikosch and Wintenberger (2016)) confirm that
\[
a_{np}^{-2}\big\|\mathbf{X}\mathbf{X}'-\mathrm{diag}(\mathbf{X}\mathbf{X}')\big\|_2\stackrel{P}{\to}0
\quad\text{and}\quad
\sum_{i=1}^p\varepsilon_{a_{np}^{-2}(\lambda_i-c_n)}\stackrel{d}{\to}\sum_{i=1}^\infty\varepsilon_{\Gamma_i^{-2/\alpha}}.
\]
Then, essentially, the same results as for iid $(X_{it})$ hold.
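A stochastic volatility sketch, with hypothetical simplifications: the log-normal volatility field is taken iid (a special strictly stationary field with all moments finite) and $Z$ is symmetrized Pareto with $\alpha=0.8$. The diagonal again dominates.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, alpha = 10, 10**5, 0.8
sigma = np.exp(0.5 * rng.standard_normal((p, n)))   # iid log-normal volatilities
Z = (rng.pareto(alpha, (p, n)) + 1.0) * rng.choice([-1.0, 1.0], (p, n))
X = sigma * Z                                       # X_{it} = sigma_{it} Z_{it}

S = X @ X.T
eig = np.sort(np.linalg.eigvalsh(S))[::-1]
diag = np.sort(np.diag(S))[::-1]
a2 = (n * p) ** (2.0 / alpha)    # a_{np}^2 up to the Breiman factor E[sigma^alpha], ignored here
err = np.max(np.abs(eig - diag)) / a2               # Weyl bound: should be small
```

By Breiman's lemma the tail index of $X$ is that of $Z$, so the normalization of the iid case carries over up to the constant $E[\sigma^{\alpha}]$.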
Thinning of $\mathbf{X}$

Similar techniques apply when $\sigma_{it}=\sigma_{it}^{(n)}$ assumes the value zero with high probability, e.g. $(\sigma_{it}^{(n)})$ iid Bernoulli distributed for $n\ge 1$ with
\[
q_0^{(n)}=P(\sigma^{(n)}=0)\to 1\quad\text{and}\quad\lim_{n\to\infty} n\,P(\sigma^{(n)}=1)=\lim_{n\to\infty} n\,q_1^{(n)}>0,
\]
and independent of $(Z_{it})$; see Auffinger, Tang (2016): $q_1^{(n)}=n^{-v}$, $v\in(0,1)$.

New normalization $(b_n)$:
\[
np\,E[(\sigma^{(n)})^{\alpha}]\,P(|Z|>b_n)=np\,q_1^{(n)}\,P(|Z|>b_n)\to 1,\qquad n\to\infty,
\]
or $b_n=a_{[np\,E[(\sigma^{(n)})^{\alpha}]]}$.
Then, for $\beta\le 1$, using Nagaev-type large deviations if $\lim_n n\,q_1^{(n)}=\infty$, or regular variation if $\lim_n n\,q_1^{(n)}$ is finite,
\[
b_n^{-2}\sup_{i=1,\dots,p}\big|\lambda_{(i)}-\lambda_{(i)}\big(\mathrm{diag}(\mathbf{X}\mathbf{X}')\big)\big|
\le b_n^{-2}\big\|\mathbf{X}\mathbf{X}'-\mathrm{diag}(\mathbf{X}\mathbf{X}')\big\|_2\stackrel{P}{\to}0
\]
and
\[
\sum_{i=1}^p\varepsilon_{b_n^{-2}(\lambda_i-c_n)}\stackrel{d}{\to}\sum_{i=1}^\infty\varepsilon_{\Gamma_i^{-2/\alpha}}.
\]
5. Another structure where the squares dominate: linear processes

Davis, Pfaffel, Stelzer (2014), Davis, Mikosch, Pfaffel (2016)

Special case: $X_{it}=\theta_0 Z_{i,t}+\theta_1 Z_{i-1,t}$ for iid $(Z_{it})$ with $Z$ regularly varying with index $\alpha\in(0,4)$, real coefficients $\theta_i$, and $E[Z]=0$ if finite. Choose $(a_n)$ such that $n\,P(|Z|>a_n)\to 1$. Observe that, with $D_i=\sum_{t=1}^n Z_{i,t}^2$,
\[
\sum_{t=1}^n X_{i,t}^2=\sum_{t=1}^n\big(\theta_0^2 Z_{i,t}^2+\theta_1^2 Z_{i-1,t}^2+2\theta_0\theta_1 Z_{i,t}Z_{i-1,t}\big)
=\theta_0^2 D_i+\theta_1^2 D_{i-1}+o_P(a_n^2).
\]
Here we used Nagaev-type large deviations and the fact that $Z^2$ has tail index $\alpha/2$ while $Z_1Z_2$ has tail index $\alpha$. Similarly,
\[
\sum_{t=1}^n X_{i,t}X_{i+1,t}=\theta_0\theta_1\sum_{t=1}^n Z_{i,t}^2+o_P(a_n^2)=\theta_0\theta_1 D_i+o_P(a_n^2).
\]
This leads to the approximation
\[
\begin{pmatrix}\sum_t X_{i,t}^2 & \sum_t X_{i,t}X_{i+1,t}\\[2pt] \sum_t X_{i,t}X_{i+1,t} & \sum_t X_{i+1,t}^2\end{pmatrix}
\approx\begin{pmatrix}\theta_0^2 & \theta_0\theta_1\\ \theta_0\theta_1 & \theta_1^2\end{pmatrix}D_i
+\begin{pmatrix}\theta_1^2 & 0\\ 0 & 0\end{pmatrix}D_{i-1}
+\begin{pmatrix}0 & 0\\ 0 & \theta_0^2\end{pmatrix}D_{i+1}.
\]
The sample covariance matrix can be approximated by
\[
\Big\|\mathbf{X}\mathbf{X}'-\sum_{i=1}^p D_i M_i\Big\|_2=o_P(a_{np}^2),
\]
where
\[
M_1=\begin{pmatrix}\theta_0^2 & \theta_0\theta_1 & 0 & \cdots & 0\\ \theta_0\theta_1 & \theta_1^2 & 0 & \cdots & 0\\ 0 & 0 & 0 & \cdots & 0\\ \vdots & & & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & 0\end{pmatrix},\qquad
M_2=\begin{pmatrix}0 & 0 & 0 & \cdots & 0\\ 0 & \theta_0^2 & \theta_0\theta_1 & \cdots & 0\\ 0 & \theta_0\theta_1 & \theta_1^2 & \cdots & 0\\ \vdots & & & \ddots & \vdots\\ 0 & 0 & 0 & \cdots & 0\end{pmatrix},\ \dots
\]
Denote the order statistics of the $D_i$ by $D_{(1)}\ge\cdots\ge D_{(p)}$ and write $D_{L_i}=D_{(i)}$. Then
\[
a_{np}^{-2}\Big\|\mathbf{X}\mathbf{X}'-\sum_{i=1}^p D_{L_i}M_{L_i}\Big\|_2\stackrel{P}{\to}0.
\]
For $k=k_n\to\infty$ sufficiently slowly,
\[
a_{np}^{-2}\Big\|\mathbf{X}\mathbf{X}'-\sum_{i=1}^k D_{L_i}M_{L_i}\Big\|_2\stackrel{P}{\to}0.
\]
Since $(D_i)$ is iid, $(L_1,\dots,L_p)$ is a random permutation of $(1,\dots,p)$; hence the event $A_k=\{|L_i-L_j|>1,\ i\ne j=1,\dots,k\}$ has probability close to one provided $k^2=o(p)$. On the set $A_k$, the matrix $\sum_{i=1}^k D_{L_i}M_{L_i}$ is block-diagonal with non-zero eigenvalues $D_{L_i}(\theta_0^2+\theta_1^2)$, $i=1,\dots,k$. Here we used that $M_{L_i}$ has rank 1 with non-zero eigenvalue equal to $\theta_0^2+\theta_1^2$.
By Weyl's inequality,
\[
a_{np}^{-2}\max_{i=1,\dots,k}\big|\lambda_{(i)}-D_{L_i}(\theta_0^2+\theta_1^2)\big|
\le a_{np}^{-2}\Big\|\mathbf{X}\mathbf{X}'-\sum_{i=1}^k D_{L_i}M_{L_i}\Big\|_2\stackrel{P}{\to}0.
\]
Extension to the general linear structure
\[
X_{it}=\sum_{k,l}h_{kl}\,Z_{i-k,\,t-l}:
\]
use truncation of the coefficient matrix $H=(h_{kl})$ of the linear process. Then
\[
a_{np}^{-2}\max_{i=1,\dots,p}\big|\lambda_{(i)}-\delta_{(i)}\big|\stackrel{P}{\to}0,\qquad n\to\infty,
\]
where $\delta_{(1)},\dots,\delta_{(p)}$ are the $p$ ordered values (with respect to absolute value) of the set
\[
\{(D_i-E[D_1])\,v_j,\ i=1,\dots,k;\ j=1,2,\dots\}
\]
(without centering for $\alpha\in(0,2)$, with centering for $\alpha\in(2,4)$), where $(v_j)$ are the eigenvalues of
\[
HH'=\Big(\sum_{l=0}^\infty h_{il}h_{jl}\Big)_{i,j=1,2,\dots}.
\]
The mapping theorem implies for suitable real or complex numbers $(v_j)$:
\[
\sum_{j=1}^\infty\sum_{i=1}^p\varepsilon_{a_{np}^{-2}(D_i-E[D_1])v_j}\stackrel{d}{\to}\sum_{j=1}^\infty\sum_{i=1}^\infty\varepsilon_{\Gamma_i^{-2/\alpha}v_j}.
\]
The limit is a Poisson cluster process.

An example, the separable case: we assume $h_{kl}=\theta_k c_l$. The matrix $HH'=\big(\sum_{l=0}^\infty c_l^2\,\theta_i\theta_j\big)_{i,j\ge 0}$ has rank $r=1$ and
\[
v_{(1)}=\sum_{l=0}^\infty c_l^2\,\sum_{k=0}^\infty\theta_k^2.
\]
The limit point process is Poisson as in the iid case. For $\alpha<2$,
\[
a_{np}^{-2}\big(\lambda_{(1)},\dots,\lambda_{(k)}\big)\stackrel{d}{\to}v_{(1)}\big(\Gamma_1^{-2/\alpha},\dots,\Gamma_k^{-2/\alpha}\big),
\]
\[
\frac{\lambda_{(1)}}{\lambda_{(1)}+\cdots+\lambda_{(k)}}\stackrel{d}{\to}\frac{\Gamma_1^{-2/\alpha}}{\Gamma_1^{-2/\alpha}+\cdots+\Gamma_k^{-2/\alpha}},
\]
and, as $k\to\infty$,
\[
\frac{\lambda_{(1)}}{\lambda_1+\lambda_2+\cdots+\lambda_p}\stackrel{d}{\to}\frac{\Gamma_1^{-2/\alpha}}{\Gamma_1^{-2/\alpha}+\Gamma_2^{-2/\alpha}+\cdots}.
\]
6. Final comments

The asymptotic behavior of the largest eigenvalues of a heavy-tailed large sample covariance matrix $\mathbf{X}\mathbf{X}'$ is understood

* if the diagonal $\mathrm{diag}(\mathbf{X}\mathbf{X}')=\big(\sum_{t=1}^n X_{it}^2\big)_{i=1,\dots,p}$ dominates the asymptotic behavior of the eigenvalues;

* for linear random fields $X_{it}=\sum_{k,l}h_{k,l}Z_{i-k,t-l}$ with iid heavy-tailed $(Z_{it})$; Davis, Mikosch, Pfaffel (2016), Davis, Heiny, Mikosch, Xie (2017);

* if the rows are iid; Davis, Pfaffel, Stelzer (2015).

Nagaev-type large deviations are the key technique.
The eigenvalues of the sample covariance matrices of other 37 types of heavy-tailed multivariate time series are less understood.