Future Random Matrix Tools for Large Dimensional Signal Processing


1 Future Random Matrix Tools for Large Dimensional Signal Processing (1/142)
EUSIPCO 2014, Lisbon, Portugal.
Abla KAMMOUN (1) and Romain COUILLET (2)
(1) King Abdullah University of Science and Technology, Saudi Arabia
(2) Supélec, France
September 1st, 2014

2 Introduction/ 2/142 High-dimensional data
Consider n observations x_1, ..., x_n of size N, independent and identically distributed with zero mean and covariance C_N, i.e., E[x_1 x_1^H] = C_N. Let X_N = [x_1, ..., x_n]. The sample covariance estimate Ŝ_N of C_N is given by
  Ŝ_N = (1/n) X_N X_N^H = (1/n) Σ_{i=1}^n x_i x_i^H.
From the law of large numbers, as n → ∞, Ŝ_N → C_N almost surely (convergence in the operator norm). In practice, it might be difficult to afford n → ∞:
- if n ≫ N, Ŝ_N can be sufficiently accurate;
- if N/n = O(1), we model this scenario by the assumption N → ∞ and n → ∞ with N/n → c.
Under this assumption, we still have pointwise convergence of each entry, i.e., (Ŝ_N)_{i,j} - (C_N)_{i,j} → 0 a.s., but ||Ŝ_N - C_N|| does not converge to zero: convergence in the operator norm does not hold.

3 Introduction/ 3/142 Illustration
Consider C_N = I_N: the spectrum of Ŝ_N is different from that of C_N.
Figure: Histogram of the eigenvalues of Ŝ_N against the Marchenko-Pastur law, N = 400 and n = 2000.
The asymptotic spectrum can be characterized by the Marchenko-Pastur law.
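A minimal numerical sketch of this illustration (not part of the slides), assuming NumPy and Matplotlib; the sizes N, n and bin count are illustrative choices.

```python
import numpy as np
import matplotlib.pyplot as plt

# Eigenvalues of the sample covariance matrix when C_N = I_N,
# compared with the Marchenko-Pastur density (here c = N/n < 1).
N, n = 400, 2000
c = N / n

X = np.random.randn(N, n) / np.sqrt(n)          # i.i.d. entries of variance 1/n
S = X @ X.T                                     # sample covariance estimate
eigs = np.linalg.eigvalsh(S)

a, b = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
x = np.linspace(a, b, 500)
mp_density = np.sqrt(np.maximum((x - a) * (b - x), 0)) / (2 * np.pi * c * x)

plt.hist(eigs, bins=50, density=True, alpha=0.5, label="Eigenvalues of $\\hat S_N$")
plt.plot(x, mp_density, "r", label="Marchenko-Pastur law")
plt.legend(); plt.show()
```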

4 Introduction/ 4/142 Reasons of interest for signal processing
- Scale similarity in array processing applications: large antenna arrays vs. a limited number of observations.
- Need for detection and estimation based on large dimensional random inputs: subspace methods in array processing.
- The assumption "number of observations ≫ dimension of observations" is no longer valid: large arrays, systems with fast dynamics.
Example: MUSIC with few samples (or with large arrays). Call A(Θ) = [a(θ_1), ..., a(θ_K)] ∈ C^{N×K}, N large, K small, the steering vectors to identify, and X = [x_1, ..., x_n] ∈ C^{N×n} the n samples, taken from
  x_t = Σ_{k=1}^K a(θ_k) √p_k s_{k,t} + σ w_t.
The MUSIC localization function reads
  γ(θ) = a(θ)^H Û_W Û_W^H a(θ)
in the signal vs. noise spectral decomposition XX^H = Û_S Λ̂_S Û_S^H + Û_W Λ̂_W Û_W^H. Writing equivalently A(Θ) P A(Θ)^H + σ² I_N = U_S Λ_S U_S^H + σ² U_W U_W^H, as n, N → ∞ with n/N → c, from our previous remarks, Û_W Û_W^H is not a consistent estimate of U_W U_W^H.
MUSIC is NOT consistent in the large N, n regime! We need improved RMT-based solutions.

5 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 7/142 Stieltjes Transform
Definition. Let F be a real probability distribution function. The Stieltjes transform m_F of F is the function defined, for z ∈ C⁺, as
  m_F(z) = ∫ dF(λ) / (λ - z).
For a < b continuity points of F, denoting z = x + iy, we have the inverse formula
  F(b) - F(a) = lim_{y→0} (1/π) ∫_a^b Im[m_F(x + iy)] dx.
If F has a density f at x, then
  f(x) = lim_{y→0} (1/π) Im[m_F(x + iy)].
The Stieltjes transform is to the Cauchy transform as the characteristic function is to the Fourier transform.
Equivalence F ↔ m_F: similar to the Fourier transform, knowing m_F is the same as knowing F.

6 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 8/142 Stieltjes transform of a Hermitian matrix
Let X be an N×N Hermitian random matrix. Denote by dF^X the empirical measure of its eigenvalues λ_1, ..., λ_N, i.e., dF^X = (1/N) Σ_{i=1}^N δ_{λ_i}.
The Stieltjes transform of X, denoted m_X = m_{F^X}, is the Stieltjes transform of its empirical measure:
  m_X(z) = ∫ dF^X(λ)/(λ - z) = (1/N) Σ_{i=1}^N 1/(λ_i - z) = (1/N) tr (X - zI_N)^{-1}.
The Stieltjes transform of a random matrix is the normalized trace of the resolvent matrix Q(z) = (X - zI_N)^{-1}. The resolvent matrix plays a key role in the derivation of many of the results of random matrix theory.
For compactly supported F, m_F(z) is linked to the moments M_k = E[(1/N) tr X^k]:
  m_F(z) = - Σ_{k=0}^{+∞} M_k / z^{k+1}.
m_F is defined in general on C⁺ but exists everywhere outside the support of F.
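A minimal sketch (not from the slides) of the two equivalent evaluations of the Stieltjes transform of a Hermitian matrix; the matrix and the evaluation point z are illustrative.

```python
import numpy as np

# The Stieltjes transform evaluated both as (1/N) tr(X - zI)^{-1}
# and through the empirical eigenvalue measure: the two agree.
N = 200
A = np.random.randn(N, N)
X = (A + A.T) / np.sqrt(2 * N)             # a Hermitian (real symmetric) matrix

z = 0.5 + 0.1j                              # a point in the upper half plane C+
m_resolvent = np.trace(np.linalg.inv(X - z * np.eye(N))) / N
eigs = np.linalg.eigvalsh(X)
m_empirical = np.mean(1.0 / (eigs - z))

print(m_resolvent, m_empirical)             # identical up to numerical precision
```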

7 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 9/142 Side remark: the Shannon transform
A. M. Tulino, S. Verdú, Random Matrix Theory and Wireless Communications, Now Publishers Inc., 2004.
Definition. Let F be a probability distribution and m_F its Stieltjes transform. The Shannon transform V_F of F is defined as
  V_F(x) ≜ ∫_0^∞ log(1 + xλ) dF(λ) = ∫_0^x ( 1/t - (1/t²) m_F(-1/t) ) dt.
This quantity is fundamental for wireless communication purposes! Note that m_F itself is of interest, not F!

8 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 10/142 Proof of the Marčenko-Pastur law
V. A. Marčenko, L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices", Math. USSR-Sbornik, vol. 1, no. 4, 1967.
The theorem to be proven is the following.
Theorem. Let X_N ∈ C^{N×n} have i.i.d. zero mean, variance 1/n entries with finite eighth order moment. As n, N → ∞ with N/n → c ∈ (0, ∞), the e.s.d. of X_N X_N^H converges almost surely to a nonrandom distribution function F_c with density f_c given by
  f_c(x) = (1 - c^{-1})⁺ δ(x) + (1/(2πcx)) √( (x - a)⁺ (b - x)⁺ ),
where a = (1 - √c)² and b = (1 + √c)².

9 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 11/142 The Marčenko-Pastur density
Figure: The Marčenko-Pastur density f_c for several limit ratios c = lim_N N/n (e.g., c = 0.1 and c = 0.2).

10 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 12/142 Diagonal entries of the resolvent
Since we want an expression for m_F, we start by identifying the diagonal entries of the resolvent (X_N X_N^H - zI_N)^{-1} of X_N X_N^H. Write
  X_N = [ y^H ; Y ]   (y^H the first row of X_N, Y the remaining (N-1)×n block).
Now, for z ∈ C⁺,
  X_N X_N^H - zI_N = [ y^H y - z , y^H Y^H ; Y y , Y Y^H - zI_{N-1} ].
Consider the first diagonal element of (X_N X_N^H - zI_N)^{-1}. From the block matrix inversion lemma,
  [A B; C D]^{-1} = [ (A - BD^{-1}C)^{-1} , -A^{-1}B(D - CA^{-1}B)^{-1} ; -D^{-1}C(A - BD^{-1}C)^{-1} , (D - CA^{-1}B)^{-1} ],
which here gives
  [ (X_N X_N^H - zI_N)^{-1} ]_{11} = 1 / ( -z - z y^H (Y^H Y - zI_n)^{-1} y ).

11 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 13/142 Trace Lemma
Z. Bai, J. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, Springer Series in Statistics, 2010.
To go further, we need the following result.
Theorem. Let {A_N} ⊂ C^{N×N} have bounded spectral norm. Let {x_N} ⊂ C^N be a random vector of i.i.d. entries with zero mean, variance 1/N and finite 8th order moment, independent of A_N. Then
  x_N^H A_N x_N - (1/N) tr A_N → 0 almost surely.
For large N, we therefore have approximately
  [ (X_N X_N^H - zI_N)^{-1} ]_{11} ≈ 1 / ( -z - z (1/n) tr (Y^H Y - zI_n)^{-1} ).

12 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 14/142 Rank-1 perturbation lemma
J. W. Silverstein, Z. D. Bai, "On the empirical distribution of eigenvalues of a class of large dimensional random matrices", Journal of Multivariate Analysis, vol. 54, no. 2, 1995.
It is somewhat intuitive that adding a single column to Y won't affect the trace in the limit.
Theorem. Let A and B be N×N with B Hermitian nonnegative definite, and v ∈ C^N. For z ∈ C \ R⁺,
  | (1/N) tr ( (B - zI_N)^{-1} - (B + vv^H - zI_N)^{-1} ) A | ≤ ||A|| / (N dist(z, R⁺)),
with ||A|| the spectral norm of A and dist(z, A) = inf_{y∈A} |y - z|.
Therefore, for large N, we have approximately
  [ (X_N X_N^H - zI_N)^{-1} ]_{11} ≈ 1/( -z - z (1/n) tr (Y^H Y - zI_n)^{-1} ) ≈ 1/( -z - z (1/n) tr (X_N^H X_N - zI_n)^{-1} ) = 1/( -z - z m_F̲(z) ),
in which we recognize the Stieltjes transform m_F̲ of the l.s.d. F̲ of the companion matrix X_N^H X_N.

13 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 15/142 End of the proof
We have the relation m_F̲(z) = (N/n) m_F(z) + ((N - n)/n)(1/z), hence
  [ (X_N X_N^H - zI_N)^{-1} ]_{11} ≈ 1 / ( 1 - N/n - z - z (N/n) m_F(z) ).
Note that the choice of the entry (1,1) is irrelevant here, so the expression is valid for every pair (i,i). Summing over the N diagonal terms and averaging, we finally have
  m_F(z) = (1/N) tr (X_N X_N^H - zI_N)^{-1} ≈ 1 / ( 1 - c - z - z c m_F(z) ),
which solves a polynomial equation of second order. Finally,
  m_F(z) = ( 1 - c - z + √((1 - c - z)² - 4cz) ) / (2cz),
with the branch of the square root chosen such that m_F(z) ∈ C⁺ for z ∈ C⁺. From the inverse Stieltjes transform formula, we then verify that m_F is the Stieltjes transform of the Marčenko-Pastur law.
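A minimal numerical check of the closed-form expression against the empirical resolvent trace (a sketch, not from the slides; sizes and the test point z are illustrative).

```python
import numpy as np

# Compare the closed-form Marchenko-Pastur Stieltjes transform with
# the empirical resolvent trace (1/N) tr(XX^H - zI)^{-1}.
N, n = 400, 1000
c = N / n
X = (np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2 * n)

z = 2.0 + 0.05j
eigs = np.linalg.eigvalsh(X @ X.conj().T)
m_empirical = np.mean(1.0 / (eigs - z))

# branch of the square root chosen so that Im[m_F(z)] > 0 for z in C+
sqrt_term = np.sqrt((1 - c - z) ** 2 - 4 * c * z)
m_theory = (1 - c - z + sqrt_term) / (2 * c * z)
if m_theory.imag < 0:
    m_theory = (1 - c - z - sqrt_term) / (2 * c * z)

print(m_empirical, m_theory)   # should agree to a few decimal places
```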

14 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 16/142 Related bibliography
- V. A. Marčenko, L. A. Pastur, "Distribution of eigenvalues for some sets of random matrices", Math. USSR-Sbornik, vol. 1, no. 4, 1967.
- J. W. Silverstein, Z. D. Bai, "On the empirical distribution of eigenvalues of a class of large dimensional random matrices", Journal of Multivariate Analysis, vol. 54, no. 2, 1995.
- Z. D. Bai, J. W. Silverstein, Spectral Analysis of Large Dimensional Random Matrices, 2nd edition, Springer Series in Statistics, 2010.
- R. B. Dozier, J. W. Silverstein, "On the empirical distribution of eigenvalues of large dimensional information-plus-noise-type matrices", Journal of Multivariate Analysis, vol. 98, no. 4, 2007.
- V. L. Girko, Theory of Random Determinants, Kluwer, Dordrecht, 1990.
- A. M. Tulino, S. Verdú, Random Matrix Theory and Wireless Communications, Now Publishers Inc., 2004.

15 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 17/142 Asymptotic results involving the Stieltjes transform
J. W. Silverstein, Z. D. Bai, "On the empirical distribution of eigenvalues of a class of large dimensional random matrices", Journal of Multivariate Analysis, vol. 54, no. 2, 1995.
Theorem. Let Y_N = (1/√n) C_N^{1/2} X_N, where X_N ∈ C^{N×n} has i.i.d. entries of mean 0 and variance 1. Consider the regime n, N → ∞ with N/n → c. Let m̂_N be the Stieltjes transform of the e.s.d. of Y_N^H Y_N = (1/n) X_N^H C_N X_N. Then m̂_N - m̲_N → 0 almost surely for all z ∈ C\R⁺, where m̲_N(z) is the unique solution in the set {z ∈ C⁺, m̲_N(z) ∈ C⁺} to
  m̲_N(z) = ( c ∫ t dF^{C_N}(t) / (1 + t m̲_N(z)) - z )^{-1}.
- In general, there is no explicit expression for F_N, the distribution whose Stieltjes transform is m̲_N(z).
- The theorem also characterizes the Stieltjes transform m_N of B_N = Y_N Y_N^H = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2}, through m̲_N = c m_N + (c - 1)/z.
- This gives access to the spectrum of the sample covariance matrix model: y_i = C_N^{1/2} x_i with x_i i.i.d., C_N = E[y y^H].

16 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 18/142 Getting F from m_F
Remember that the density f of F at x is obtained as
  f(x) = lim_{y↓0} (1/π) Im[m_F(x + iy)],
where m_F is (up to now) only defined on C⁺. To plot the density f:
- first approach: span z = x + iy on the line {x ∈ R, y = ε} parallel but close to the real axis, solve for m_F(z) at each z, and plot (1/π) Im[m_F(z)];
- refined approach: spectrum analysis, to come next.
Example (sample covariance matrix). For N a multiple of 3, let
  F^C(x) = (1/3) 1_{x ≥ 1} + (1/3) 1_{x ≥ 3} + (1/3) 1_{x ≥ K},
and let B_N = (1/n) C_N^{1/2} Z_N^H Z_N C_N^{1/2} with F^{B_N} ⇒ F; then
  m_F̲ = c m_F + (c - 1)/z,  m_F̲(z) = ( c ∫ t dF^C(t)/(1 + t m_F̲(z)) - z )^{-1}.
We take c = 1/10 and alternatively K = 7 and K = 4.
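A minimal sketch of the first approach for this three-mass example (not from the slides): solve the fixed-point equation for m_F̲ slightly above the real axis and plot the density. The plain fixed-point iteration, the offset eps and the iteration count are illustrative choices; convergence can be slow very close to the real axis.

```python
import numpy as np
import matplotlib.pyplot as plt

c = 1 / 10
masses = np.array([1.0, 3.0, 7.0])          # population eigenvalues (K = 7 here)
weights = np.ones(3) / 3

def m_under(z, iters=1000):
    m = -1 / z                               # initialization
    for _ in range(iters):                   # fixed-point iteration of the canonical equation
        m = 1 / (c * np.sum(weights * masses / (1 + masses * m)) - z)
    return m

xs = np.linspace(0.01, 12, 600)
eps = 1e-3
density = []
for x in xs:
    z = x + 1j * eps
    mu = m_under(z)
    m_F = (mu - (c - 1) / z) / c             # recover m_F of B_N from the companion transform
    density.append(m_F.imag / np.pi)

plt.plot(xs, density)
plt.xlabel("x"); plt.ylabel("density"); plt.show()
```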

17 Part 1: Fundamentals of Random Matrix Theory/1.1 The Stieltjes Transform Method 19/142 Spectrum of the sample covariance matrix
Figure: Histogram of the eigenvalues of B_N = (1/n) C_N^{1/2} Z_N^H Z_N C_N^{1/2} against the limit law, N = 300, n = 3000, with C_N diagonal composed of three evenly weighted masses at (i) 1, 3 and 7 (top), (ii) 1, 3 and 4 (bottom).

18 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 21/142 Support of a distribution
The support of a density f is the closure of the set {x : f(x) > 0}. For instance, the support of the Marčenko-Pastur law (for c ≤ 1) is [(1 - √c)², (1 + √c)²].
Figure: The Marčenko-Pastur law and its support, for the limit ratio c = 0.5.

19 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 22/142 Extreme eigenvalues
Limiting spectral results are insufficient to infer the location of the extreme eigenvalues.
Example: consider dF_N(x) = (1/N) Σ_{k=1}^N δ_{a_k}. Then dF_N and
  dF_N^0 = ((N-1)/N) dF_N + (1/N) δ_{A_N}(x),
with A_N ≫ a_k, satisfy F_N - F_N^0 ⇒ 0. However, the supports of F_N and F_N^0 differ by the mass at A_N.
Question: how do the extreme eigenvalues of random covariance matrices behave?

20 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 23/142 No eigenvalue outside the support of sample covariance matrices
Z. D. Bai, J. W. Silverstein, "No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices", The Annals of Probability, vol. 26, no. 1, 1998.
Theorem. Let X_N ∈ C^{N×n} have i.i.d. entries with zero mean, unit variance and finite fourth order moment. Let C_N ∈ C^{N×N} be nonrandom and bounded in spectral norm. Let m̲_N be the unique solution in C⁺ of
  m̲_N(z) = -( z - (N/n) ∫ τ/(1 + τ m̲_N(z)) dF^{C_N}(τ) )^{-1},  and set  m_N(z) = (n/N) m̲_N(z) + ((n - N)/N)(1/z),  z ∈ C⁺.
Let F_N be the distribution associated with the Stieltjes transform m_N(z). Consider B_N = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2}. We know that F^{B_N} - F_N converges weakly to zero. Choose N_0 ∈ N and an interval [a, b], a > 0, outside the support of F_N for all N ≥ N_0. Denote L_N the set of eigenvalues of B_N. Then
  P( L_N ∩ [a, b] ≠ ∅ i.o. ) = 0.

21 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 24/142 No eigenvalue outside the support: which models?
D. Paul, J. W. Silverstein, "No eigenvalues outside the support of the limiting empirical spectral distribution of a separable covariance matrix", Journal of Multivariate Analysis, vol. 100, no. 1, 2009.
It has already been shown that (for all large N) there is no eigenvalue outside the support for:
- the Marčenko-Pastur law: XX^H, X with i.i.d. entries of zero mean, variance 1/N, finite 4th order moment;
- the sample covariance matrix: C^{1/2} X X^H C^{1/2} and X^H C X, X with i.i.d. entries of zero mean, variance 1/N, finite 4th order moment;
- the doubly-correlated matrix: R^{1/2} X C X^H R^{1/2}, X with i.i.d. entries of zero mean, variance 1/N, finite 4th order moment.
J. W. Silverstein, Z. D. Bai, Y. Q. Yin, "A note on the largest eigenvalue of a large dimensional sample covariance matrix", Journal of Multivariate Analysis, vol. 26, no. 2, 1988.
- If the 4th order moment is infinite, lim sup_N λ_max(XX^H) = ∞.
Z. D. Bai, J. W. Silverstein, "No eigenvalues outside the support of the limiting spectral distribution of information-plus-noise type matrices", to appear in Random Matrices: Theory and Applications.
- Only recently, information-plus-noise models (X + A)(X + A)^H, X with i.i.d. entries of zero mean, variance 1/N, finite 4th order moment, and the generally correlated model where each column of X has correlation R_i.

22 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 25/142 Extreme eigenvalues: deeper into the spectrum
In order to derive statistical detection tests, we need more information about the extreme eigenvalues.
- We will study the fluctuations of the extreme eigenvalues (second order statistics).
- However, the Stieltjes transform method is not adapted here!

23 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 26/142 Distribution of the largest eigenvalue of XX^H
C. A. Tracy, H. Widom, "On orthogonal and symplectic matrix ensembles", Communications in Mathematical Physics, vol. 177, no. 3, 1996.
K. Johansson, "Shape Fluctuations and Random Matrices", Communications in Mathematical Physics, vol. 209, 2000.
Theorem. Let X ∈ C^{N×n} have i.i.d. Gaussian entries of zero mean and variance 1/n. Denoting λ⁺_N the largest eigenvalue of XX^H, then
  N^{2/3} ( λ⁺_N - (1 + √c)² ) / ( (1 + √c)^{4/3} c^{1/2} ) ⇒ X⁺ ~ F⁺,
with c = lim_N N/n and F⁺ the Tracy-Widom distribution given by
  F⁺(t) = exp( -∫_t^∞ (x - t) q²(x) dx ),
with q the Painlevé II function solving the differential equation
  q''(x) = x q(x) + 2 q³(x),  q(x) ~ Ai(x) as x → ∞,
in which Ai(x) is the Airy function.
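A minimal Monte-Carlo sketch (not from the slides) of the centered and scaled largest eigenvalue whose histogram approaches the Tracy-Widom law; sizes and the number of trials are illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt

N, n, trials = 100, 300, 2000
c = N / n
samples = np.empty(trials)
for t in range(trials):
    X = (np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2 * n)
    lam_max = np.linalg.eigvalsh(X @ X.conj().T)[-1]
    # centered and scaled largest eigenvalue, as in the theorem above
    samples[t] = N ** (2 / 3) * (lam_max - (1 + np.sqrt(c)) ** 2) \
                 / ((1 + np.sqrt(c)) ** (4 / 3) * np.sqrt(c))

plt.hist(samples, bins=60, density=True)
plt.xlabel("centered-scaled largest eigenvalue"); plt.show()
```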

24 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 27/142 The law of Tracy-Widom
Figure: Distribution of N^{2/3} c^{-1/2} (1 + √c)^{-4/3} [λ⁺_N - (1 + √c)²] against the distribution of X⁺ (Tracy-Widom law F⁺) for N = 500, n = 1500, c = 1/3, for the covariance matrix model XX^H. Empirical distribution taken over 10,000 Monte-Carlo simulations.

25 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 28/142 Techniques of proof
The method of proof requires very different tools:
- orthogonal (Laguerre) polynomials: write the joint unordered eigenvalue distribution as a kernel determinant,
  ρ_N(λ_1, ..., λ_p) = det [ K_N(λ_i, λ_j) ]_{i,j=1}^p,
  with K_N(x, y) the Laguerre kernel;
- Fredholm determinants: write the hole probability as a Fredholm determinant,
  P( N^{2/3}(λ_i - (1 + √c)²) ∈ A^c, i = 1, ..., N ) = 1 + Σ_{k≥1} ((-1)^k / k!) ∫_{A^c} ⋯ ∫_{A^c} det [ K_N(x_i, x_j) ]_{i,j=1}^k dx_1 ⋯ dx_k ≜ det(I - K_N);
- kernel theory: show that K_N converges to the Airy kernel,
  K_N(x, y) → K_Airy(x, y) = ( Ai(x)Ai'(y) - Ai'(x)Ai(y) ) / (x - y);
- differential equation tricks: the hole probability on [t, ∞) gives the right-most eigenvalue distribution, which is simplified as the solution of a Painlevé II differential equation: the Tracy-Widom distribution,
  F⁺(t) = exp( -∫_t^∞ (x - t) q(x)² dx ),  q''(x) = x q(x) + 2 q³(x),  q(x) ~ Ai(x) as x → ∞.

26 Part 1: Fundamentals of Random Matrix Theory/1.2 Extreme eigenvalues 29/142 Comments on the Tracy-Widom law
- A deeper result than the limiting eigenvalue result.
- Gives a hint on the convergence speed.
- Fairly biased to the left: even fewer eigenvalues outside the support.
- Can be shown to hold for other distributions than Gaussian under mild assumptions.

27 Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 31/142 Spiked models
- We consider n independent observations x_1, ..., x_n of size N.
- The correlation structure is in general "white + low rank": E[x_1 x_1^H] = I_N + P, where P is of low rank.
- Objective: infer the eigenvalues and/or the eigenvectors of P.

28 Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 32/142 The first result
J. Baik, J. W. Silverstein, "Eigenvalues of large sample covariance matrices of spiked population models", Journal of Multivariate Analysis, vol. 97, no. 6, 2006.
Theorem. Let B_N = (1/n)(I + P)^{1/2} X_N X_N^H (I + P)^{1/2}, where X_N ∈ C^{N×n} has i.i.d. zero mean, unit variance entries, and P ∈ R^{N×N} with eigenvalues
  eig(P) = diag(ω_1, ..., ω_K, 0, ..., 0)   (N - K zeros),
with ω_1 > ... > ω_K > -1 and c = lim_N N/n. Let λ_1, ..., λ_N be the ordered eigenvalues of B_N. We then have:
- if ω_j > √c, λ_j → 1 + ω_j + c(1 + ω_j)/ω_j a.s. (i.e., beyond the Marčenko-Pastur bulk!);
- if ω_j ∈ (0, √c], λ_j → (1 + √c)² a.s. (i.e., right edge of the Marčenko-Pastur bulk!);
- if ω_j ∈ [-√c, 0), λ_j → (1 - √c)² a.s. (i.e., left edge of the Marčenko-Pastur bulk!);
- for the other eigenvalues, we discriminate over c:
  - if ω_j < -√c and c < 1, λ_j → 1 + ω_j + c(1 + ω_j)/ω_j a.s. (i.e., below the Marčenko-Pastur bulk!);
  - if ω_j < -√c and c > 1, λ_j → (1 - √c)² a.s. (i.e., left edge of the Marčenko-Pastur bulk!).
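A minimal numerical check of the isolated-eigenvalue limits (a sketch, not from the slides; the spike values and dimensions are illustrative).

```python
import numpy as np

# Population covariance I_N + P with spikes omega; the isolated sample eigenvalues
# should be near 1 + omega + c*(1 + omega)/omega when omega > sqrt(c).
N, n = 500, 1500
c = N / n
omegas = np.array([2.0, 1.0])                          # spikes, both larger than sqrt(c)

pop = np.ones(N); pop[:len(omegas)] += omegas           # eigenvalues of I + P
X = (np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2 * n)
B = (np.sqrt(pop)[:, None] * X) @ (np.sqrt(pop)[:, None] * X).conj().T
eigs = np.sort(np.linalg.eigvalsh(B))[::-1]

predicted = 1 + omegas + c * (1 + omegas) / omegas
print("largest sample eigenvalues:", np.round(eigs[:2], 3))
print("predicted limits:          ", np.round(predicted, 3))
print("bulk right edge:           ", round((1 + np.sqrt(c)) ** 2, 3))
```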

29 Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 33/142 Illustration of spiked models
Figure: Eigenvalues of B_N = (1/n)(P + I)^{1/2} X_N X_N^H (P + I)^{1/2} against the Marčenko-Pastur law (c = 1/3), where ω_1 = ω_2 = 1 and ω_3 = ω_4 = 2; the isolated eigenvalues concentrate around 1 + ω_1 + c(1 + ω_1)/ω_1 and 1 + ω_3 + c(1 + ω_3)/ω_3. Dimensions: N = 500, n = 1500.

30 Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 34/142 Interpretation of the result
- If c is large or, alternatively, if some population spikes are small, part or all of the population spikes are attracted into the bulk of the support!
- If so, there is no way to decide on the existence of the spikes from looking at the largest eigenvalues.
- In signal processing words, signals might be missed when using largest-eigenvalue methods.
- As a consequence, the more sensors (N) for a given number of samples, the larger c = lim N/n, and the more probable it is that we miss a spike.

31 Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 35/142 Sketch of the proof
We start with a study of the limiting extreme eigenvalues. Let x > 0; then (writing XX^H for (1/n) X_N X_N^H)
  det(B_N - xI_N) = det(I_N + P) det( XX^H - xI_N + x[I_N - (I_N + P)^{-1}] )
                  = det(I_N + P) det(XX^H - xI_N) det( I_N + x P (I_N + P)^{-1} (XX^H - xI_N)^{-1} ).
- If x is an eigenvalue of B_N but not of XX^H, then for n large, x > (1 + √c)² (edge of the MP law support) and
  det( I_r + x Ω (I_r + Ω)^{-1} U^H (XX^H - xI_N)^{-1} U ) = 0,
  with P = UΩU^H, U ∈ C^{N×r}.
- Due to the unitary invariance of X,
  U^H (XX^H - xI_N)^{-1} U → ( ∫ (t - x)^{-1} dF_MP(t) ) I_r = m(x) I_r a.s.,
  with F_MP the MP law and m(x) the Stieltjes transform of the MP law (for r = 1 this is the trace lemma).
- Finally, the limiting solutions x_k satisfy x_k m(x_k) + (1 + ω_k)/ω_k = 0.
- Replacing m(x), this finally gives
  λ_k → x_k = 1 + ω_k + c(1 + ω_k)/ω_k a.s.,  if ω_k > √c.

32 Part 1: Fundamentals of Random Matrix Theory/1.3 Extreme eigenvalues: the spiked models 36/142 Comments on the result
- There exists a phase transition when the largest population eigenvalues move from inside to outside the interval (0, 1 + √c).
- More importantly, for a population spike t < 1 + √c, we still have the same Tracy-Widom fluctuations at the edge: no way to see the spike even when zooming in.
- In fact, simulations suggest that the convergence rate to the Tracy-Widom law is slower with spikes.

33 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 38/142 Stieltjes transform inversion for covariance matrix models
J. W. Silverstein, S. Choi, "Analysis of the limiting spectral distribution of large dimensional random matrices", Journal of Multivariate Analysis, vol. 54, no. 2, 1995.
We know for the model C_N^{1/2} X_N, X_N ∈ C^{N×n}, that if F^{C_N} ⇒ F^C, the Stieltjes transform of the e.s.d. of B̲_N = (1/n) X_N^H C_N X_N satisfies m_{B̲_N}(z) → m_F̲(z) a.s., with
  m_F̲(z) = -( z - c ∫ t/(1 + t m_F̲(z)) dF^C(t) )^{-1},
which is the unique such solution on the set {z ∈ C⁺, m_F̲(z) ∈ C⁺}. This can be inverted into
  z_F̲(m) = -1/m + c ∫ t/(1 + tm) dF^C(t),
for m ∈ C⁺.

34 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 39/142 Stieltjes transform inversion and spectrum characterization
Remember that we can evaluate the spectral density by taking a complex line close to R and evaluating Im[m_F(z)] along this line. Now we can do better.
- It is shown that lim_{z ∈ C⁺ → x ∈ R} m_F(z) ≜ m_0(x) exists.
- We also have, for x_0 inside the support, that the density f(x_0) of F at x_0 is (1/π) Im[m_0(x_0)], with m_0(x_0) the unique solution m ∈ C⁺ of
  [ z_F̲(m) = ]  x_0 = -1/m + c ∫ t/(1 + tm) dF^C(t).
- Let m ∈ R, m ≠ 0, and let x_F̲ be the restriction of z_F̲ to the real line. Then x_0 outside the support of F is equivalent to: x_F̲'(m_F̲(x_0)) > 0, m_F̲(x_0) ≠ 0, and -1/m_F̲(x_0) outside the support of F^C.
This provides another way to determine the support! For m ∈ (-∞, 0), evaluate x_F̲(m). Wherever x_F̲ is increasing, its image is outside the support. The rest is inside.
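A minimal sketch of this support determination for the three-mass example (not from the slides): scan the real-line inverse function x(m) = -1/m + c Σ_k w_k t_k/(1 + t_k m), locate the zeros of its derivative and read the support edges. Grid sizes and the root-finding by sign changes are illustrative choices.

```python
import numpy as np

c = 1 / 10
t = np.array([1.0, 3.0, 7.0])     # population eigenvalues (three evenly weighted masses)
w = np.ones(3) / 3

def x_of_m(m):
    return -1.0 / m + c * np.sum(w * t / (1 + t * m))

def xprime(m):
    # analytic derivative; its zeros are the local extrema of x(m), i.e. the support edges
    return 1.0 / m ** 2 - c * np.sum(w * t ** 2 / (1 + t * m) ** 2)

# scan m < 0 around and between the poles at m = -1/t_k
ms = -np.exp(np.linspace(np.log(1e-4), np.log(1e4), 200000))
vals = np.array([xprime(m) for m in ms])
sign_change = np.where(np.sign(vals[1:]) != np.sign(vals[:-1]))[0]
edges = sorted(x_of_m(ms[i]) for i in sign_change)
print("estimated support edges:", np.round(edges, 3))   # expect 3 clusters, i.e. 6 edges
```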

35 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 40/142 Another way to determine the spectrum: spectrum to analyze
Figure: Histogram of the eigenvalues of B_N = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2} against the limit law, N = 300, n = 3000, with C_N diagonal composed of three evenly weighted masses at 1, 3 and 7.

36 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 41/142 Another way to determine the spectrum: inverse function method
Figure: The function x_F̲(m) for B_N = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2}, N = 300, n = 3000, with C_N diagonal composed of three evenly weighted masses at 1, 3 and 7. The support of F is read on the vertical axis, on the intervals where x_F̲ is decreasing.

37 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 42/142 Cluster boundaries in sample covariance matrix models
X. Mestre, "Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates", IEEE Transactions on Information Theory, vol. 54, no. 11, Nov. 2008.
Theorem. Let X_N ∈ C^{N×n} have i.i.d. entries of zero mean, unit variance, and let C_N be diagonal such that F^{C_N} ⇒ F^C as n, N → ∞, N/n → c, where F^C has K masses at t_1, ..., t_K with multiplicities n_1, ..., n_K respectively. Then the l.s.d. of B_N = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2} has support S given by
  S = [x_1^-, x_1^+] ∪ [x_2^-, x_2^+] ∪ ... ∪ [x_Q^-, x_Q^+],
with x_q^- = x_F̲(m_q^-), x_q^+ = x_F̲(m_q^+), and
  x_F̲(m) = -1/m + c Σ_{k=1}^K (n_k/N) t_k / (1 + t_k m),
where 2Q is the number of real-valued solutions, counting multiplicities, of x_F̲'(m) = 0, denoted in order m_1^- < m_1^+ ≤ m_2^- < m_2^+ ≤ ... ≤ m_Q^- < m_Q^+.

38 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 43/142 Comments on spectrum characterization
The previous results allow us to determine:
- the spectrum boundaries,
- the number Q of clusters,
- as a consequence, the total separation (Q = K) or not (Q < K) of the spectrum into K clusters.
Mestre goes further: to determine local separability of the spectrum,
- identify the K inflexion points, i.e., the K solutions m_1, ..., m_K of x_F̲''(m) = 0,
- check whether x_F̲'(m_i) > 0 and x_F̲'(m_{i+1}) > 0,
- if so, the cluster in between corresponds to a single population eigenvalue.

39 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 44/142 Exact eigenvalue separation
Z. D. Bai, J. W. Silverstein, "Exact Separation of Eigenvalues of Large Dimensional Sample Covariance Matrices", The Annals of Probability, vol. 27, no. 3, 1999.
Recall that the result on "no eigenvalue outside the support":
- says where eigenvalues are not to be found,
- does not say, as we might expect, that (in case of cluster separation) cluster k contains exactly n_k eigenvalues.
This is in fact the case.
Figure: Empirical eigenvalue distribution against the limit law; the three clusters contain exactly n_1, n_2 and n_3 eigenvalues, respectively.

40 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 45/142 Eigen-inference: introduction of the problem
Reminder: for a sequence x_1, ..., x_n ∈ C^N of independent random variables,
  Ĉ_N = (1/n) Σ_{k=1}^n x_k x_k^H
is an n-consistent estimator of C_N = E[x_1 x_1^H].
- If n, N have comparable sizes, this no longer holds.
- Typically, in the (n, N)-regime, estimators of the full C_N matrix perform very badly.
- If only the eigenvalues of C_N are of interest, things can be done.
- The process of retrieving information about the eigenvalues, eigenspace projections, or functionals of these is called eigen-inference.

41 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 46/142 Girko and the G-estimators
V. Girko, "Ten years of general statistical analysis".
Girko has come up with more than 50 (N, n)-consistent estimators, called G-estimators (Generalized estimators). Among those, we find the G_1-estimator of generalized variance. For
  G_1(Ĉ_N) = (1/α_n) [ log det(Ĉ_N) + N log n - Σ_{k=1}^N log(n - k) ],
with α_n any sequence such that α_n^{-2} log(n/(n - N)) → 0, we have
  G_1(Ĉ_N) - (1/α_n) log det(C_N) → 0
in probability. However, Girko's proofs are rarely readable, if existent.

42 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 47/142 A long standing problem
X. Mestre, "Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates", IEEE Transactions on Information Theory, vol. 54, no. 11, 2008.
- Consider the model B_N = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2}, where F^{C_N} is formed of a finite number of masses t_1, ..., t_K.
- It had long been thought that the inverse problem of estimating t_1, ..., t_K through the Stieltjes transform method was not possible. The only attempts were iterative convex optimization methods.
- The problem was partially solved by Mestre in 2008! His technique uses elegant complex analysis tools.
- The description of this technique is the subject of this course.

43 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 48/142 Reminders
Consider the sample covariance matrix model B_N = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2}. Up to now, we saw:
- that there is no eigenvalue outside the support with probability 1, for all large N;
- that for all large N, when the spectrum is divided into clusters, the number of empirical eigenvalues in each cluster is exactly as we expect.
These results are of crucial importance for the following.

44 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 49/142 Eigen-inference for the sample covariance matrix model
X. Mestre, "Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates", IEEE Transactions on Information Theory, vol. 54, no. 11, 2008.
Theorem. Consider the model B_N = (1/n) C_N^{1/2} X_N X_N^H C_N^{1/2}, with X_N ∈ C^{N×n} with i.i.d. entries of zero mean, unit variance, and C_N ∈ R^{N×N} diagonal with K distinct entries t_1, ..., t_K of multiplicities N_1, ..., N_K, all of the same order as n. Let k ∈ {1, ..., K}. Then, if the cluster associated with t_k is separated from the clusters associated with t_{k-1} and t_{k+1}, as N, n → ∞ with N/n → c,
  t̂_k = (n/N_k) Σ_{m ∈ N_k} (λ_m - µ_m)
is an (N, n)-consistent estimator of t_k, where N_k = {N - Σ_{i=k}^K N_i + 1, ..., N - Σ_{i=k+1}^K N_i}, λ_1 ≤ ... ≤ λ_N are the ordered eigenvalues of B_N, and µ_1 ≤ ... ≤ µ_N are the N solutions of m_{B̲_N}(µ) = 0, or equivalently the eigenvalues of diag(λ) - (1/N) √λ √λ^T, with λ = (λ_1, ..., λ_N)^T.
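A minimal sketch of Mestre's estimator for three equal-multiplicity masses (not from the slides; sizes, population values and cluster separation are illustrative assumptions).

```python
import numpy as np

N, n = 300, 3000
t_true = np.array([1.0, 3.0, 7.0])
mult = np.array([N // 3, N // 3, N // 3])

C_sqrt = np.sqrt(np.repeat(t_true, mult))
X = (np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2)
B = (C_sqrt[:, None] * X) @ (C_sqrt[:, None] * X).conj().T / n

lam = np.sort(np.linalg.eigvalsh(B))                    # eigenvalues of B_N (ascending)
sq = np.sqrt(lam)
mu = np.sort(np.linalg.eigvalsh(np.diag(lam) - np.outer(sq, sq) / N))

# sum lambda_m - mu_m over the indices of each cluster, scaled by n/N_k
edges = np.concatenate(([0], np.cumsum(mult)))
t_hat = [n / mult[k] * np.sum(lam[edges[k]:edges[k + 1]] - mu[edges[k]:edges[k + 1]])
         for k in range(len(t_true))]
print("estimated:", np.round(t_hat, 3), " true:", t_true)
```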

45 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 50/142 Remarks on Mestre's result
Assuming cluster separation, the result consists in:
- taking the empirical ordered λ_i's inside cluster k (note that exact separation ensures there are exactly N_k of these!),
- getting the ordered eigenvalues µ_1, ..., µ_N of diag(λ) - (1/N) √λ √λ^T, with λ = (λ_1, ..., λ_N)^T, and keeping only those of index inside N_k,
- taking the difference and scaling.

46 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 51/142 How to obtain this result?
The major trick requires tools from complex analysis.
Silverstein's Stieltjes transform identity: for the conjugate (companion) model B̲_N = (1/n) X_N^H C_N X_N,
  m̲_N(z) = -( z - c ∫ t/(1 + t m̲_N(z)) dF^{C_N}(t) )^{-1},
with m̲_N the deterministic equivalent of m_{B̲_N}. This is the only random matrix result we need.
Before going further, we need some reminders from complex analysis.

47 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 52/142 Limiting spectrum of the sample covariance matrix
J. W. Silverstein, Z. D. Bai, "On the empirical distribution of eigenvalues of a class of large dimensional random matrices", Journal of Multivariate Analysis, vol. 54, no. 2, 1995.
Reminder: if F^{C_N} ⇒ F^C, then m_{B̲_N}(z) → m_F̲(z) a.s., such that
  m_F̲(z) = -( z - c ∫ t/(1 + t m_F̲(z)) dF^C(t) )^{-1},
or equivalently
  m_{F^C}( -1/m_F̲(z) ) = -z m_F(z) m_F̲(z),
with m_F̲(z) = c m_F(z) + (c - 1)/z and N/n → c.

48 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 53/142 Reminders of complex analysis
Cauchy integration formula.
Theorem. Let U ⊂ C be an open set and f : U → C be holomorphic on U. Let γ ⊂ U be a continuous closed contour. Then, for a inside the region enclosed by γ,
  (1/(2πi)) ∮_γ f(z)/(z - a) dz = f(a),
while for a outside the region enclosed by γ,
  (1/(2πi)) ∮_γ f(z)/(z - a) dz = 0.

49 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 54/142 Complex integration
From the Cauchy integral formula, denoting C_k a (counterclockwise) contour enclosing only t_k,
  t_k = (1/(2πi)) ∮_{C_k} ω/(ω - t_k) dω = (N/(N_k 2πi)) ∮_{C_k} ω Σ_{j=1}^K (N_j/N)/(ω - t_j) dω = -(N/(N_k 2πi)) ∮_{C_k} ω m_{F^C}(ω) dω.
After the variable change ω = -1/m_F̲(z),
  t_k = -(N/(N_k 2πi)) ∮_{C_{F̲,k}} z m_F(z) m_F̲'(z)/m_F̲(z)² dz.
When the system dimensions are large,
  m_F(z) ≈ m_{B_N}(z) = (1/N) Σ_{k=1}^N 1/(λ_k - z),  with (λ_1, ..., λ_N) = eig(B_N) = eig(YY^H).
Dominated convergence arguments then show t_k - t̂_k → 0 a.s. with
  t̂_k = -(N/(N_k 2πi)) ∮_{C_{F̲,k}} z m_{B_N}(z) m_{B̲_N}'(z)/m_{B̲_N}(z)² dz.

50 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 55/142 Understanding the contour change
Figure: The function x_F̲(m), with the support of F on the vertical axis (same three-mass setting as before).
IF C_{F̲,k} encloses cluster k with real crossing points m_1 < m_2, THEN x_1 = -1/m_1 < t_k < x_2 = -1/m_2 and the image contour C_k encloses t_k.

51 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 56/142 Poles and residues
We find two sets of poles (besides zeros) for the integrand:
- λ_1, ..., λ_N, the eigenvalues of B_N,
- the solutions µ_1, ..., µ_N of m_{B̲_N}(z) = 0.
Remember that m_{B̲_N}(w) = (N/n) m_{B_N}(w) + ((N - n)/n)(1/w).
Residue calculus: denote f(w) = w m_{B_N}(w) m_{B̲_N}'(w)/m_{B̲_N}(w)².
- The λ_k's are poles of order 1 and lim_{z→λ_k} (z - λ_k) f(z) = -(n/N) λ_k.
- The µ_k's are also poles and, using m_{B_N}(z) = (n/N) m_{B̲_N}(z) + ((n - N)/N)(1/z) and L'Hospital-type arguments, their residue equals (n/N) µ_k.
So, finally,
  t̂_k = -(N/N_k) Σ (residues inside the contour) = (n/N_k) Σ_{m ∈ contour} (λ_m - µ_m).

52 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 57/142 Which poles in the contour?
We now need to determine which poles are in the contour of interest.
- Since the µ_i are rank-1 perturbations of the λ_i, they have the interlacing property λ_1 < µ_2 < λ_2 < ... < µ_N < λ_N. What about µ_1?
- The trick is to use the fact that (1/(2πi)) ∮_{C_k} dω/ω = 0 (0 lies outside C_k), which after the variable change reads (1/(2πi)) ∮_{Γ_k} m_F̲'(w)/m_F̲(w) dw = 0; the empirical version of this integral is
  #{i : λ_i ∈ Γ_k} - #{i : µ_i ∈ Γ_k}.
- Since this (integer) difference tends to 0, there are as many λ_i's as µ_i's in the contour; hence µ_1 asymptotically lies inside the integration contour.

53 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 58/142 Related bibliography
- C. A. Tracy, H. Widom, "On orthogonal and symplectic matrix ensembles", Communications in Mathematical Physics, vol. 177, no. 3, 1996.
- G. W. Anderson, A. Guionnet, O. Zeitouni, An Introduction to Random Matrices, Cambridge Studies in Advanced Mathematics, vol. 118, 2010.
- F. Bornemann, "On the numerical evaluation of distributions in random matrix theory: A review", Markov Processes and Related Fields, vol. 16, 2010.
- Y. Q. Yin, Z. D. Bai, P. R. Krishnaiah, "On the limit of the largest eigenvalue of the large dimensional sample covariance matrix", Probability Theory and Related Fields, vol. 78, no. 4, 1988.
- J. W. Silverstein, Z. D. Bai, Y. Q. Yin, "A note on the largest eigenvalue of a large dimensional sample covariance matrix", Journal of Multivariate Analysis, vol. 26, no. 2, 1988.
- Z. D. Bai, J. W. Silverstein, "No eigenvalues outside the support of the limiting spectral distribution of large-dimensional sample covariance matrices", The Annals of Probability, vol. 26, no. 1, 1998.
- Z. D. Bai, J. W. Silverstein, "Exact Separation of Eigenvalues of Large Dimensional Sample Covariance Matrices", The Annals of Probability, vol. 27, no. 3, 1999.
- D. Paul, J. W. Silverstein, "No eigenvalues outside the support of the limiting empirical spectral distribution of a separable covariance matrix", Journal of Multivariate Analysis, vol. 100, no. 1, 2009.
- J. Baik, J. W. Silverstein, "Eigenvalues of large sample covariance matrices of spiked population models", Journal of Multivariate Analysis, vol. 97, no. 6, 2006.
- I. M. Johnstone, "On the distribution of the largest eigenvalue in principal components analysis", Annals of Statistics, vol. 29, no. 2, 2001.
- K. Johansson, "Shape Fluctuations and Random Matrices", Communications in Mathematical Physics, vol. 209, 2000.
- J. Baik, G. Ben Arous, S. Péché, "Phase transition of the largest eigenvalue for nonnull complex sample covariance matrices", The Annals of Probability, vol. 33, no. 5, 2005.

54 Part 1: Fundamentals of Random Matrix Theory/1.4 Spectrum Analysis and G-estimation 59/142 Related bibliography (2)
- J. W. Silverstein, S. Choi, "Analysis of the limiting spectral distribution of large dimensional random matrices", Journal of Multivariate Analysis, vol. 54, no. 2, 1995.
- W. Hachem, P. Loubaton, X. Mestre, J. Najim, P. Vallet, "A Subspace Estimator for Fixed Rank Perturbations of Large Random Matrices", arXiv preprint, 2011.
- R. Couillet, W. Hachem, "Local failure detection and diagnosis in large sensor networks", (submitted to) IEEE Transactions on Information Theory, arXiv preprint.
- F. Benaych-Georges, R. R. Nadakuditi, "The eigenvalues and eigenvectors of finite, low rank perturbations of large random matrices", Advances in Mathematics, vol. 227, no. 1, 2011.
- X. Mestre, "On the asymptotic behavior of the sample estimates of eigenvalues and eigenvectors of covariance matrices", IEEE Transactions on Signal Processing, vol. 56, no. 11, 2008.
- X. Mestre, "Improved estimation of eigenvalues and eigenvectors of covariance matrices using their sample estimates", IEEE Transactions on Information Theory, vol. 54, no. 11, 2008.
- R. Couillet, J. W. Silverstein, Z. Bai, M. Debbah, "Eigen-Inference for Energy Estimation of Multiple Sources", IEEE Transactions on Information Theory, vol. 57, no. 4, 2011.
- P. Vallet, P. Loubaton, X. Mestre, "Improved subspace estimation for multivariate observations of high dimension: the deterministic signals case", arXiv preprint, 2010.

55 Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 62/142 Problem formulation
We want to test the hypothesis H_0 against H_1:
  Y = h x^T + σW ∈ C^{N×n},  information plus noise,  hypothesis H_1,
  Y = σW,                     pure noise,              hypothesis H_0,
with h ∈ C^N, x ∈ C^n, W ∈ C^{N×n}. We assume no knowledge whatsoever, except that W has i.i.d. (not necessarily Gaussian) entries.

56 Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 63/142 Exploiting the conditioning number
L. S. Cardoso, M. Debbah, P. Bianchi, J. Najim, "Cooperative spectrum sensing using random matrix theory", International Symposium on Wireless Pervasive Computing, 2008.
Under either hypothesis:
- if H_0, for N large, we expect F^{(1/n)YY^H} to be close to the Marčenko-Pastur law, with support [σ²(1 - √c)², σ²(1 + √c)²];
- if H_1, if the population spike exceeds σ²(1 + √(N/n)), the largest eigenvalue lies further away, outside the bulk.
The conditioning number of (1/n)YY^H is therefore asymptotically, as N, n → ∞, N/n → c:
- if H_0, cond = λ_max/λ_min → (1 + √c)²/(1 - √c)²;
- if H_1, cond → [ t + c t σ²/(t - σ²) ] / [ σ²(1 - √c)² ] > (1 + √c)²/(1 - √c)², with t = Σ_{k=1}^N |h_k|² + σ².
Under H_0, the conditioning number is independent of σ. We then have the decision criterion, whether or not σ is known:
  decide H_0 if cond((1/n)YY^H) ≤ (1 + √(N/n))²/(1 - √(N/n))² + ε;  H_1 otherwise,
for some security margin ε.
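A minimal sketch of this test (not from the slides); the function name, the toy data and the margin eps are illustrative assumptions.

```python
import numpy as np

def cond_number_test(Y, eps=0.1):
    # condition-number detector: compare to the Marchenko-Pastur edge ratio plus a margin
    N, n = Y.shape
    c = N / n
    eigs = np.linalg.eigvalsh(Y @ Y.conj().T / n)
    threshold = ((1 + np.sqrt(c)) / (1 - np.sqrt(c))) ** 2 + eps
    return eigs[-1] / eigs[0] > threshold        # True -> decide H1, False -> decide H0

# toy experiment: pure noise vs. a rank-one signal buried in noise (sigma unknown to the test)
N, n, sigma = 20, 200, 1.0
W = (np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2)
h = np.random.randn(N) + 1j * np.random.randn(N)
x = (np.random.randn(n) + 1j * np.random.randn(n)) / np.sqrt(2)
print("H0 sample decides H1?", cond_number_test(sigma * W))
print("H1 sample decides H1?", cond_number_test(np.outer(h, x) + sigma * W))
```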

57 Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 64/142 Comments on the method
Advantages:
- much simpler than a finite-size analysis,
- the ratio is independent of σ, so σ need not be known.
Drawbacks:
- only valid for very large N (and the dimension N at which the asymptotic results kick in is itself a function of σ!),
- ad-hoc method, does not rely on a performance criterion.

58 Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 65/142 Generalized likelihood ratio test
P. Bianchi, M. Debbah, M. Maida, J. Najim, "Performance of Statistical Tests for Source Detection using Random Matrix Theory", IEEE Transactions on Information Theory, vol. 57, no. 4, 2011.
Alternative: the generalized likelihood ratio test (GLRT) decision criterion, i.e.,
  C(Y) = sup_{σ², h} P_{Y|h,σ²}(Y | h, σ²) / sup_{σ²} P_{Y|σ²}(Y | σ²).
Denote
  T_N = λ_max(YY^H) / ( (1/N) tr YY^H ).
The GLR is an increasing function of T_N; to guarantee a maximum false alarm rate of α, decide
  H_1 if  T_N^{-n} ( (N - T_N)/(N - 1) )^{-(N-1)n} > ξ_N,  H_0 otherwise,
for some threshold ξ_N that can be given explicitly as a function of α.
- Optimal test with respect to the GLR.
- Performs better than the conditioning number test.
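A minimal sketch of the GLRT statistic T_N (not from the slides). Instead of the explicit threshold ξ_N(α) of the slides, the threshold here is calibrated by Monte Carlo under H_0; all sizes are illustrative.

```python
import numpy as np

def T_stat(Y):
    # T_N = largest eigenvalue of YY^H over its normalized trace
    S = Y @ Y.conj().T
    return np.linalg.eigvalsh(S)[-1] / (np.trace(S).real / Y.shape[0])

N, n, alpha = 20, 200, 0.05
h0_stats = [T_stat((np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2))
            for _ in range(500)]
threshold = np.quantile(h0_stats, 1 - alpha)      # empirical (1-alpha) quantile under H0

# one H1 sample: rank-one signal plus unit-variance noise
h = np.random.randn(N) + 1j * np.random.randn(N)
x = (np.random.randn(n) + 1j * np.random.randn(n)) / np.sqrt(2)
W = (np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2)
print("T_N =", T_stat(np.outer(h, x) / np.sqrt(N) + W), " threshold =", threshold)
```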

59 Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 66/142 Performance comparison
Figure: ROC curves (correct detection rate vs. false alarm rate), for a priori unknown σ², of the Neyman-Pearson test, the conditioning number method and the GLRT; K = 1, N = 4, M = 8, SNR = 0 dB. For the Neyman-Pearson test, both the uniform and the Jeffreys prior (with exponent β = 1) are provided.

60 Application to Signal Sensing and Array Processing/2.1 Eigenvalue-based detection 67/142 Related bibliography
- R. Couillet, M. Debbah, "A Bayesian Framework for Collaborative Multi-Source Signal Sensing", IEEE Transactions on Signal Processing, vol. 58, no. 10, 2010.
- T. Ratnarajah, R. Vaillancourt, M. Alvo, "Eigenvalues and condition numbers of complex random matrices", SIAM Journal on Matrix Analysis and Applications, vol. 26, no. 2.
- M. Matthaiou, M. R. McKay, P. J. Smith, J. A. Nossek, "On the condition number distribution of complex Wishart matrices", IEEE Transactions on Communications, vol. 58, no. 6, 2010.
- C. Zhong, M. R. McKay, T. Ratnarajah, K. Wong, "Distribution of the Demmel condition number of Wishart matrices", IEEE Transactions on Communications, vol. 59, no. 5, 2011.
- L. S. Cardoso, M. Debbah, P. Bianchi, J. Najim, "Cooperative spectrum sensing using random matrix theory", International Symposium on Wireless Pervasive Computing, 2008.
- P. Bianchi, M. Debbah, M. Maida, J. Najim, "Performance of Statistical Tests for Source Detection using Random Matrix Theory", IEEE Transactions on Information Theory, vol. 57, no. 4, 2011.

61 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 69/142 Source localization
A uniform array of M antennas receives signals from K radio sources during n signal snapshots.
Objective: estimate the arrival angles θ_1, ..., θ_K.

62 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 70/142 Source localization using the MUSIC algorithm
We consider the scenario of K sources and an N-antenna array capturing n observations:
  x_t = Σ_{k=1}^K a(θ_k) s_{k,t} + σ w_t,  t = 1, ..., n,
  A_N = [a_N(θ_1), ..., a_N(θ_K)]  with  a_N(θ) = (1/√N) [1, e^{ıπ sin θ}, ..., e^{ı(N-1)π sin θ}]^T.
σ² is the noise variance and is set to 1 for simplicity.
Objective: infer θ_1, ..., θ_K from the n observations.
Let X_N = [x_1, ..., x_n]; then
  X = AS + W = [A  I_N] [S ; W].
If K is finite while n, N → ∞, the model corresponds to the spiked covariance model.
MUSIC algorithm: let Π be the orthogonal projection matrix onto the column span of A and Π^⊥ = I_N - Π (orthogonal projector onto the noise subspace). The angles θ_1, ..., θ_K are the unique ones verifying
  η(θ) ≜ a_N(θ)^H Π^⊥ a_N(θ) = 0.

63 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 71/142 Traditional MUSIC algorithm
Traditional MUSIC: the angles are estimated as the local minima of
  η̂(θ) = a_N(θ)^H Π̂^⊥ a_N(θ),
where Π̂^⊥ = I_N - Π̂ and Π̂ is the orthogonal projection matrix onto the eigenspace associated with the K largest eigenvalues of (1/n) X_N X_N^H.
- It is well known that this estimator is consistent when n → ∞ with K, N fixed.
- We consider the case of K finite: the spiked covariance model.
- What happens when n, N → ∞?

64 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 72/142 Asymptotic behaviour of the traditional MUSIC (1)
- We first need to understand the spectrum of (1/n) XX^H.
- We know that the weak limit of the spectrum is the MP law.
- Up to K eigenvalues can leave the support: we identify these eigenvalues here.
Denote P = AA^H = U_S Ω U_S^H, Ω = diag(ω_1, ..., ω_K), and Z = [S^T W^T]^T, to recover (up to one row) the generic spiked model X = (I_N + P)^{1/2} Z.
Reminder: if x is an eigenvalue of (1/n) XX^H with x > (1 + √c)² (edge of the MP law), then for all large n and some k,
  x ≈ λ_k → ρ_k ≜ 1 + ω_k + c(1 + ω_k)/ω_k a.s.,  if ω_k > √c.

65 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 73/142 Asymptotic behaviour of the traditional MUSIC (2)
Recall the MUSIC approach: we want to estimate η(θ) = a(θ)^H U_W U_W^H a(θ) (with U_W ∈ C^{N×(N-K)} such that U_W^H U_S = 0).
Instead of this quantity, we start with the study of a(θ)^H û_i û_i^H a(θ), i = 1, ..., K, with û_1, ..., û_N the eigenvectors associated with λ_1 ≥ ... ≥ λ_N.
To fall back on known RMT quantities, we use the Cauchy integral
  a(θ)^H û_i û_i^H a(θ) = -(1/(2πı)) ∮_{C_i} a(θ)^H ( (1/n) XX^H - zI_N )^{-1} a(θ) dz,
with C_i a contour enclosing λ_i only. Woodbury's identity, (A + UCV)^{-1} = A^{-1} - A^{-1} U (C^{-1} + V A^{-1} U)^{-1} V A^{-1}, gives
  a^H û_i û_i^H a = -(1/(2πı)) ∮_{C_i} a^H (I_N + P)^{-1/2} ( (1/n) ZZ^H - zI_N )^{-1} (I_N + P)^{-1/2} a dz + (1/(2πı)) ∮_{C_i} â_1^H Ĥ^{-1} â_2 dz,
where P = U_S Ω U_S^H and
  Ĥ = I_K + z Ω (I_K + Ω)^{-1} U_S^H ( (1/n) ZZ^H - zI_N )^{-1} U_S,
  â_1^H = z a(θ)^H (I_N + P)^{-1/2} ( (1/n) ZZ^H - zI_N )^{-1} U_S,
  â_2 = Ω (I_K + Ω)^{-1} U_S^H ( (1/n) ZZ^H - zI_N )^{-1} (I_N + P)^{-1/2} a(θ).

66 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 74/142 Asymptotic behaviour of the traditional MUSIC (3)
For large n, the first term has no pole inside C_i, while the second converges to
  T_i = (1/(2πı)) ∮_{C_i} ā_1^H H̄^{-1} ā_2 dz,  with
  H̄ = I_K + z m(z) Ω (I_K + Ω)^{-1},
  ā_1^H = z m(z) a^H (I_N + P)^{-1/2} U_S,
  ā_2 = m(z) Ω (I_K + Ω)^{-1} U_S^H (I_N + P)^{-1/2} a,
which after development is
  T_i = Σ_{l=1}^K (ω_l/(1 + ω_l)²) a(θ)^H u_l u_l^H a(θ) (1/(2πı)) ∮_{C_i} z m²(z) / ( 1 + z m(z) ω_l/(1 + ω_l) ) dz.
Using residue calculus, the sole pole inside C_i is at ρ_i, and we find
  a(θ)^H û_i û_i^H a(θ) → ( (1 - c ω_i^{-2}) / (1 + c ω_i^{-1}) ) a(θ)^H u_i u_i^H a(θ) a.s.
Therefore,
  η̂(θ) = a(θ)^H Π̂^⊥ a(θ) → a(θ)^H a(θ) - Σ_{i=1}^K ( (1 - c ω_i^{-2}) / (1 + c ω_i^{-1}) ) a(θ)^H u_i u_i^H a(θ) a.s.

67 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 75/142 Improved G-MUSIC
Recall that
  a(θ)^H u_k u_k^H a(θ) - ( (1 + c ω_k^{-1}) / (1 - c ω_k^{-2}) ) a(θ)^H û_k û_k^H a(θ) → 0 a.s.
The ω_k are however unknown. But they can be estimated from λ_k → ρ_k = 1 + ω_k + c(1 + ω_k)/ω_k a.s.
This finally gives
  η̂_G(θ) ≜ a(θ)^H a(θ) - Σ_{k=1}^K ( (1 + c ω̂_k^{-1}) / (1 - c ω̂_k^{-2}) ) a(θ)^H û_k û_k^H a(θ),
with
  ω̂_k = ( λ̂_k - (c + 1) + √((c + 1 - λ̂_k)² - 4c) ) / 2.
We then obtain another (N, n)-consistent MUSIC estimator, valid only for K finite!
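A minimal sketch of the spiked G-MUSIC pseudo-spectrum (not from the slides); the array geometry, angles, source power and sizes are illustrative assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def steering(theta, N):
    # unit-norm ULA steering vector, half-wavelength spacing assumed
    return np.exp(1j * np.pi * np.arange(N) * np.sin(theta)) / np.sqrt(N)

N, n, K = 20, 150, 3
c = N / n
angles_true = np.deg2rad([10.0, 35.0, 37.0])
A = np.column_stack([steering(t, N) for t in angles_true])

S = np.sqrt(5) * (np.random.randn(K, n) + 1j * np.random.randn(K, n)) / np.sqrt(2)  # sources
W = (np.random.randn(N, n) + 1j * np.random.randn(N, n)) / np.sqrt(2)               # unit noise
X = A @ S + W

lam, U = np.linalg.eigh(X @ X.conj().T / n)          # ascending order
lam_K, U_K = lam[-K:], U[:, -K:]                      # K largest eigenvalues / eigenvectors

# spike estimates from the isolated eigenvalues (valid when lambda_k > (1+sqrt(c))^2)
omega = (lam_K - (c + 1) + np.sqrt(np.maximum((c + 1 - lam_K) ** 2 - 4 * c, 0))) / 2
weights = (1 + c / omega) / (1 - c / omega ** 2)

thetas = np.deg2rad(np.linspace(-90, 90, 1441))
eta_G = np.empty(len(thetas))
for i, th in enumerate(thetas):
    a = steering(th, N)
    proj = np.abs(U_K.conj().T @ a) ** 2
    eta_G[i] = np.real(a.conj() @ a) - np.sum(weights * proj)

plt.plot(np.rad2deg(thetas), eta_G)                   # minima near the true angles
plt.xlabel("angle [deg]"); plt.ylabel("G-MUSIC cost"); plt.show()
```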

68 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 76/142 Simulation results
Figure: MUSIC against G-MUSIC (cost function in dB vs. angle) for DoA detection of K = 3 signal sources, N = 20 sensors, M = 50 samples, SNR of 10 dB. Angles of arrival of 10°, 35°, and 37°.

69 Application to Signal Sensing and Array Processing/2.2 The spiked G-MUSIC algorithm 77/142 Outline of the tutorial
Part 1: Basics of Random Matrix Theory for Sample Covariance Matrices
1.1. Introduction to the Stieltjes transform method, Marčenko-Pastur law, advanced models
1.2. Extreme eigenvalues: no eigenvalue outside the support, exact separation, Tracy-Widom law
1.3. Extreme eigenvalues: the spiked models
1.4. Spectrum analysis and G-estimation
Part 2: Application to Signal Sensing and Array Processing
2.1. Eigenvalue-based detection
2.2. The (spiked) G-MUSIC algorithm
Part 3: Advanced Random Matrix Models for Robust Estimation
3.1. Robust estimation of scatter
3.2. Robust G-MUSIC
3.3. Robust shrinkage in finance
3.4. Second order robust statistics: GLRT detectors
Part 4: Future Directions
4.1. Kernel random matrices and kernel methods
4.2. Neural network applications

70 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 80/142 Covariance estimation and sample covariance matrices
P. J. Huber, Robust Statistics, 1981.
Many statistical inference techniques rely on the sample covariance matrix (SCM) taken from i.i.d. observations x_1, ..., x_n of a r.v. x ∈ C^N. The main reasons are:
- Assuming E[x] = 0, E[xx^H] = C_N, with X = [x_1, ..., x_n], by the LLN, Ŝ_N ≜ (1/n) XX^H → C_N a.s. as n → ∞. Hence, if θ = f(C_N), we often use the n-consistent estimate θ̂ = f(Ŝ_N).
- The SCM Ŝ_N is the ML estimate of C_N for Gaussian x.
- One therefore expects θ̂ to closely approximate θ for all finite n.
This approach however has two limitations:
- if N, n are of the same order of magnitude, ||Ŝ_N - C_N|| does not converge to 0 as N, n → ∞ with N/n → c > 0, so that in general θ̂ - θ does not converge to 0 either. This motivated the introduction of G-estimators;
- if x is not Gaussian but has heavier tails, Ŝ_N is a poor estimator of C_N. This motivated the introduction of robust estimators.

71 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 81/142 Reminders on robust estimation
J. T. Kent, D. E. Tyler, "Redescending M-estimates of multivariate location and scatter", 1991.
R. A. Maronna, "Robust M-estimators of multivariate location and scatter", 1976.
Y. Chitour, F. Pascal, "Exact maximum likelihood estimates for SIRV covariance matrix: Existence and algorithm analysis", 2008.
The objectives of robust estimators:
- Replace the SCM Ŝ_N by another estimate Ĉ_N of C_N which rejects (or downscales) observations deterministically, or rejects observations inconsistent with the full set of observations.
  Example: Huber's estimator, Ĉ_N defined as the solution of
    Ĉ_N = (1/n) Σ_{i=1}^n β_i x_i x_i^H,  with  β_i = α min{ 1, k² / ((1/N) x_i^H Ĉ_N^{-1} x_i) },
  for some α > 1 and k² a function of Ĉ_N.
- Provide scale-free estimators of C_N.
  Example: Tyler's estimator: if one observes x_i = τ_i z_i for unknown scalars τ_i,
    Ĉ_N = (1/n) Σ_{i=1}^n x_i x_i^H / ( (1/N) x_i^H Ĉ_N^{-1} x_i ),
  - existence and uniqueness of Ĉ_N, defined up to a constant,
  - few constraints on x_1, ..., x_n (N + 1 of them must be linearly independent).

72 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 82/142 Reminders on robust estimation (continued)
The objectives of robust estimators:
- Replace the SCM Ŝ_N by the ML estimate of C_N.
  Example: Maronna's estimator for elliptical x,
    Ĉ_N = (1/n) Σ_{i=1}^n u( (1/N) x_i^H Ĉ_N^{-1} x_i ) x_i x_i^H,
  with u(s) such that
  (i) u(s) is continuous and non-increasing on [0, ∞),
  (ii) φ(s) = s u(s) is non-decreasing and bounded by φ_∞ > 1; moreover, φ(s) is increasing wherever φ(s) < φ_∞.
  (Note that Huber's estimator is compliant with Maronna's conditions.)
- Existence is not too demanding.
- Uniqueness imposes a strictly increasing φ(s) (inconsistent with Huber's estimate).
- Consistency result: Ĉ_N → C_N if u(s) corresponds to the ML estimator for C_N.

73 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 83/142 Robust estimation and RMT
So far, RMT has mostly focused on the SCM Ŝ_N for models such as:
- x = A_N w, with w having i.i.d. zero-mean unit-variance entries,
- x satisfying concentration inequalities,
- e.g., elliptically distributed x.
Robust RMT estimation: can we study the performance of estimators based on Ĉ_N?
- What are the spectral properties of Ĉ_N?
- Can we derive RMT-based estimators relying on Ĉ_N?

74 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 84/142 Setting and assumptions
Assumptions: take x_1, ..., x_n ∈ C^N elliptical-like random vectors, i.e., x_i = √τ_i C_N^{1/2} w_i, where
- τ_1, ..., τ_n ∈ R⁺ are random or deterministic with (1/n) Σ_{i=1}^n τ_i → 1 a.s.,
- w_1, ..., w_n ∈ C^N are random, independent, with w_i/√N uniformly distributed over the unit sphere,
- C_N ∈ C^{N×N} is deterministic, with C_N ≻ 0 and lim sup_N ||C_N|| < ∞.
We denote c_N ≜ N/n and consider the growth regime c_N → c ∈ (0, 1).
Maronna's estimator of scatter: the (almost surely) unique solution to
  Ĉ_N = (1/n) Σ_{i=1}^n u( (1/N) x_i^H Ĉ_N^{-1} x_i ) x_i x_i^H,
where u satisfies
(i) u : [0, ∞) → (0, ∞) nonnegative, continuous and non-increasing,
(ii) φ : x ↦ x u(x) increasing and bounded with lim_{x→∞} φ(x) ≜ φ_∞ > 1,
(iii) φ_∞ < c⁻¹.
Additional technical assumption: let ν_n ≜ (1/n) Σ_{i=1}^n δ_{τ_i}. For each a > b > 0,
  lim sup_{t→∞} lim sup_n  ν_n((t, ∞)) / (φ(at) - φ(bt)) = 0  a.s.
This controls the relative speed of the tail of ν_n versus the flattening speed of φ(x) as x → ∞. Examples:
- τ_i < M for each i: in this case ν_n((t, ∞)) = 0 a.s. for t > M;
- for u(t) = (1 + α)/(α + t), α > 0, and τ_i i.i.d., it is sufficient to have E[τ_1^{1+ε}] < ∞.
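A minimal sketch of Maronna's estimator computed by fixed-point iteration with u(t) = (1+α)/(α+t) on elliptical-like data (not from the slides; the population covariance, the τ distribution and the stopping rule are illustrative assumptions).

```python
import numpy as np

N, n, alpha = 100, 500, 0.2
C = np.diag(np.concatenate([np.ones(N // 2), 3 * np.ones(N - N // 2)]))
tau = np.random.gamma(shape=0.5, scale=2.0, size=n)                # mean-1, heavy-ish tails
W = np.random.randn(N, n) + 1j * np.random.randn(N, n)
W = np.sqrt(N) * W / np.linalg.norm(W, axis=0)                      # w_i/sqrt(N) on the unit sphere
X = np.sqrt(tau) * (np.linalg.cholesky(C) @ W)                      # x_i = sqrt(tau_i) C^{1/2} w_i

def u(t):
    return (1 + alpha) / (alpha + t)

C_hat = np.eye(N, dtype=complex)
for _ in range(50):                                                 # fixed-point iteration
    Q = np.linalg.inv(C_hat)
    d = np.real(np.einsum("ij,jk,ki->i", X.conj().T, Q, X)) / N     # (1/N) x_i^H C_hat^{-1} x_i
    C_new = (u(d) * X) @ X.conj().T / n
    if np.linalg.norm(C_new - C_hat) / np.linalg.norm(C_hat) < 1e-8:
        C_hat = C_new
        break
    C_hat = C_new

print("largest eigenvalues of C_hat:", np.round(np.linalg.eigvalsh(C_hat)[-3:].real, 2))
```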

75 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 85/142 Heuristic approach
Major issues with Ĉ_N:
- it is defined implicitly,
- it is a sum of non-independent rank-one matrices built from the vectors u((1/N) x_i^H Ĉ_N^{-1} x_i) x_i (Ĉ_N depends on all the x_j's).
But there is some hope. First remark: we can work with C_N = I_N without loss of generality! Denote
  Ĉ_(j) = (1/n) Σ_{i≠j} u( (1/N) x_i^H Ĉ_N^{-1} x_i ) x_i x_i^H.
Then, intuitively, Ĉ_(j) and x_j are only weakly dependent. We expect in particular (highly non-rigorous but intuitive!!)
  (1/N) x_i^H Ĉ_(i)^{-1} x_i ≈ τ_i (1/N) tr Ĉ_(i)^{-1} ≈ τ_i (1/N) tr Ĉ_N^{-1}.
Our heuristic approach:
- rewrite (1/N) x_i^H Ĉ_N^{-1} x_i as f( (1/N) x_i^H Ĉ_(i)^{-1} x_i ) for some function f (later called g^{-1}),
- deduce that Ĉ_N = (1/n) Σ_{i=1}^n (u ∘ f)( (1/N) x_i^H Ĉ_(i)^{-1} x_i ) x_i x_i^H,
- use (1/N) x_i^H Ĉ_(i)^{-1} x_i ≈ τ_i (1/N) tr Ĉ_N^{-1} to get Ĉ_N ≈ (1/n) Σ_{i=1}^n (u ∘ f)( τ_i (1/N) tr Ĉ_N^{-1} ) x_i x_i^H,
- use random matrix results to find a limiting value γ for (1/N) tr Ĉ_N^{-1}, and conclude Ĉ_N ≈ (1/n) Σ_{i=1}^n (u ∘ f)(τ_i γ) x_i x_i^H.

76 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 86/142 Heuristic approach in detail: f and γ
Determination of f: recall the identity (A + tvv^H)^{-1} v = A^{-1} v / (1 + t v^H A^{-1} v). Then
  (1/N) x_i^H Ĉ_N^{-1} x_i = (1/N) x_i^H Ĉ_(i)^{-1} x_i / ( 1 + c_N u((1/N) x_i^H Ĉ_N^{-1} x_i) (1/N) x_i^H Ĉ_(i)^{-1} x_i ),
so that
  (1/N) x_i^H Ĉ_(i)^{-1} x_i = (1/N) x_i^H Ĉ_N^{-1} x_i / ( 1 - c_N φ((1/N) x_i^H Ĉ_N^{-1} x_i) ).
Now the function g : x ↦ x/(1 - c_N φ(x)) is monotonically increasing (we use the assumption φ_∞ < c⁻¹!), hence, with f = g^{-1},
  (1/N) x_i^H Ĉ_N^{-1} x_i = g^{-1}( (1/N) x_i^H Ĉ_(i)^{-1} x_i ).

77 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 87/142 Heuristic approach in detail: f and γ (continued)
Determination of γ: from the previous calculus, we expect
  Ĉ_N ≈ (1/n) Σ_{i=1}^n (u ∘ g^{-1})( τ_i (1/N) tr Ĉ_N^{-1} ) x_i x_i^H ≈ (1/n) Σ_{i=1}^n (u ∘ g^{-1})(τ_i γ) x_i x_i^H.
Hence
  γ ≈ (1/N) tr Ĉ_N^{-1} ≈ (1/N) tr [ (1/n) Σ_{i=1}^n (u ∘ g^{-1})(τ_i γ) τ_i w_i w_i^H ]^{-1}.
Since the τ_i are independent of the w_i and γ is deterministic, this is a Bai-Silverstein model (1/n) WDW^H, with W = [w_1, ..., w_n] and D = diag(d_ii), d_ii = τ_i (u ∘ g^{-1})(τ_i γ). And we have
  (1/N) tr [ (1/n) WDW^H ]^{-1} = m_{(1/n)WDW^H}(0) ≈ ( ∫ t (u ∘ g^{-1})(tγ) / ( 1 + c t (u ∘ g^{-1})(tγ) m_{(1/n)WDW^H}(0) ) ν_n(dt) )^{-1}
                                = ( (1/n) Σ_{i=1}^n τ_i (u ∘ g^{-1})(τ_i γ) / ( 1 + c τ_i (u ∘ g^{-1})(τ_i γ) m_{(1/n)WDW^H}(0) ) )^{-1}.
Since γ ≈ m_{(1/n)WDW^H}(0), this defines γ as the solution of a fixed-point equation:
  γ = ( (1/n) Σ_{i=1}^n τ_i (u ∘ g^{-1})(τ_i γ) / ( 1 + c τ_i (u ∘ g^{-1})(τ_i γ) γ ) )^{-1}.

78 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 88/142 Main result
R. Couillet, F. Pascal, J. W. Silverstein, "The Random Matrix Regime of Maronna's M-estimator with elliptically distributed samples", (submitted to) Elsevier Journal of Multivariate Analysis.
Theorem (Asymptotic Equivalence). Under the assumptions defined earlier,
  ||Ĉ_N - Ŝ_N|| → 0 a.s.,  where  Ŝ_N ≜ (1/n) Σ_{i=1}^n v(τ_i γ) x_i x_i^H,
v(x) = (u ∘ g^{-1})(x), ψ(x) = x v(x), g(x) = x/(1 - cφ(x)), and γ > 0 is the unique solution of
  1 = (1/n) Σ_{i=1}^n ψ(τ_i γ) / (1 + c ψ(τ_i γ)).
Remarks:
- The theorem says: first order substitution of Ĉ_N by Ŝ_N is allowed for large N, n.
- It turns out that v resembles u and ψ resembles φ in their general behavior.
Corollaries:
  max_{1≤i≤N} | λ_i(Ŝ_N) - λ_i(Ĉ_N) | → 0 a.s.,
  (1/N) tr (Ĉ_N - zI_N)^{-1} - (1/N) tr (Ŝ_N - zI_N)^{-1} → 0 a.s.
Important features for detection and estimation.
Proof: so far in the tutorial, we do not have a rigorous proof!

79 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 89/142 Proof
Fundamental idea: show that all the quantities (1/(τ_i N)) x_i^H Ĉ_(i)^{-1} x_i converge to the same limit γ.
Technical trick: denote
  e_i ≜ v( (1/N) x_i^H Ĉ_(i)^{-1} x_i ) / v(τ_i γ)
and relabel the terms such that e_1 ≤ ... ≤ e_n. We shall prove that, for each l > 0,
  e_1 > 1 - l  and  e_n < 1 + l  for all large n a.s.
Some basic inequalities: denoting d_i ≜ (1/(τ_i N)) x_i^H Ĉ_(i)^{-1} x_i = (1/N) w_i^H Ĉ_(i)^{-1} w_i, we have
  e_j = v( τ_j (1/N) w_j^H [ (1/n) Σ_{i≠j} τ_i v(τ_i d_i) w_i w_i^H ]^{-1} w_j ) / v(τ_j γ)
      = v( τ_j (1/N) w_j^H [ (1/n) Σ_{i≠j} τ_i v(τ_i γ) e_i w_i w_i^H ]^{-1} w_j ) / v(τ_j γ)
      ≤ v( (τ_j/e_n) (1/N) w_j^H [ (1/n) Σ_{i≠j} τ_i v(τ_i γ) w_i w_i^H ]^{-1} w_j ) / v(τ_j γ).

80 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 90/142 Proof (continued)
Specialization to e_n:
  e_n ≤ v( (τ_n/e_n) (1/N) w_n^H [ (1/n) Σ_{i≠n} τ_i v(τ_i γ) w_i w_i^H ]^{-1} w_n ) / v(τ_n γ),
or equivalently, recalling ψ(x) = x v(x),
  (1/N) w_n^H [ (1/n) Σ_{i≠n} τ_i v(τ_i γ) w_i w_i^H ]^{-1} w_n / γ  ≤  ψ( (τ_n/e_n) (1/N) w_n^H [ (1/n) Σ_{i≠n} τ_i v(τ_i γ) w_i w_i^H ]^{-1} w_n ) / ψ(τ_n γ).
Random matrix results: by the trace lemma, we should have
  (1/N) w_n^H [ (1/n) Σ_{i≠n} τ_i v(τ_i γ) w_i w_i^H ]^{-1} w_n ≈ (1/N) tr [ (1/n) Σ_{i≠n} τ_i v(τ_i γ) w_i w_i^H ]^{-1} ≈ γ
(by the definition of γ as in the previous slides)...
DANGER: after relabeling, w_n is no longer independent of w_1, ..., w_{n-1}! The trace lemma is broken!
Solution: a uniform convergence result. By (higher order) moment bounds, Markov's inequality, and Borel-Cantelli, for all large n a.s.,
  max_{j ≤ n} | (1/N) w_j^H [ (1/n) Σ_{i≠j} τ_i v(τ_i γ) w_i w_i^H ]^{-1} w_j - γ | < ε.

81 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 91/142 Proof (continued)
Back to the original problem: for all large n a.s., we then have (using the growth of ψ)
  (γ - ε)/γ ≤ ψ( (τ_n/e_n)(γ + ε) ) / ψ(τ_n γ).
Proof by contradiction: assume e_n > 1 + l i.o.; then on a subsequence e_n > 1 + l always and
  (γ - ε)/γ ≤ ψ( (τ_n/(1 + l))(γ + ε) ) / ψ(τ_n γ).
Bounded support for the τ_i: if 0 < τ_- < τ_i < τ_+ < ∞ for all i, n, then on a subsequence where τ_n → τ_0,
  (γ - ε)/γ ≤ ψ( (τ_0/(1 + l))(γ + ε) ) / ψ(τ_0 γ),
where the left-hand side → 1 as ε → 0 while the right-hand side → ψ( τ_0 γ/(1 + l) ) / ψ(τ_0 γ) < 1 as ε → 0. CONTRADICTION!
Unbounded support for the τ_i: importance of the relative growth of τ_n versus the convergence of ψ to ψ_∞. The proof consists in dividing {τ_i} into two groups: a few large ones versus all the others.
Sufficient condition:
  lim sup_{t→∞} lim sup_n  ν_n((t, ∞)) / (φ(at) - φ(bt)) = 0.

82 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 92/142 Simulations
Figure: Histogram of the eigenvalues of (1/n) Σ_{i=1}^n x_i x_i^H (the SCM) against the limiting density, for n = 2500, N = 500, C_N = diag(I_125, 3I_125, 10I_250), τ_i with Γ(0.5, 2) distribution.

83 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 93/142 Simulations
Figure: Histogram of the eigenvalues of Ĉ_N (left) and Ŝ_N (right) against the limiting density, for n = 2500, N = 500, C_N = diag(I_125, 3I_125, 10I_250), τ_i with Γ(0.5, 2) distribution.
Remark/Corollary: the spectrum of Ĉ_N is a.s. bounded, uniformly in n.

84 Advanced Random Matrix Models for Robust Estimation/3.1 Robust Estimation of Scatter 94/142 Hint on potential applications
Spectrum boundedness: in impulsive noise scenarios,
- the SCM spectrum grows unbounded,
- the robust scatter estimator spectrum remains bounded,
- robust estimators improve spectrum separability (important for many of the statistical inference techniques seen previously).
Spiked model generalization: we may expect a generalization to spiked models,
- spikes are swallowed by the bulk in the SCM context,
- we expect spikes to re-emerge in the robust scatter context,
- we shall see that we get even better than this...
Application scenarios:
- radar detection in impulsive noise (non-Gaussian noise, possibly clutter),
- financial data analytics,
- any application where Gaussianity is too strong an assumption...

85 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 96/142 System setting
Signal model:
  y_i = Σ_{l=1}^L √p_l a_l s_{li} + √τ_i w_i = Ā_i w̄_i,
  Ā_i ≜ [ √p_1 a_1, ..., √p_L a_L, √τ_i I_N ],  w̄_i ≜ [s_{1i}, ..., s_{Li}, w_i^T]^T,
with y_1, ..., y_n ∈ C^N satisfying:
1. τ_1, ..., τ_n > 0 random, such that ν_n ≜ (1/n) Σ_{i=1}^n δ_{τ_i} → ν weakly and ∫ t ν(dt) = 1;
2. w_1, ..., w_n ∈ C^N random, independent, unitarily invariant, of norm √N;
3. L ∈ N and p_1 ≥ ... ≥ p_L ≥ 0 deterministic;
4. a_1, ..., a_L ∈ C^N deterministic or random with A^H A → diag(p_1, ..., p_L) a.s. as N → ∞, where A ≜ [√p_1 a_1, ..., √p_L a_L] ∈ C^{N×L};
5. s_{11}, ..., s_{Ln} ∈ C independent with zero mean and unit variance.
Relation to the previous model:
- if L = 0, y_i = √τ_i w_i,
- elliptical model with covariance a low-rank (L) perturbation of I_N,
- we expect a spiked version of the previous results.
Application contexts:
- wireless communications: signals s_{li} from L transmitters, N-antenna receiver; a_l random i.i.d. channels (a_l^H a_{l'} → δ_{l-l'}, e.g., a_l ~ CN(0, I_N/N));
- array processing: L sources emit signals s_{li} at steering vectors a_l = a(θ_l). For a ULA, [a(θ)]_j = N^{-1/2} exp(2πı d j sin θ).

86 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 97/142 Some intuition
Signal detection/estimation in impulsive environments: two scenarios: heavy-tailed noise (elliptical, Gaussian mixtures, etc.); Gaussian noise with spurious impulses.
Problems expected with the SCM: respectively, an unbounded limiting spectrum with no source separation (invalidates G-MUSIC), and isolated eigenvalues due to spikes in the time direction (false alarms induced by noise impulses!).
Our results: in a spiked model with noise impulses, whatever the impulse type, the spectrum of Ĉ_N remains bounded; isolated largest eigenvalues may appear, of two classes: isolated eigenvalues due to noise impulses CANNOT exceed a threshold, while all isolated eigenvalues beyond this threshold are due to signal.
Detection criterion: everything above the threshold is signal.

87 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 98/142 Theoretical results
Theorem (Extension to the spiked robust model). Under the same assumptions as in the previous section, ||Ĉ_N - Ŝ_N|| → 0 a.s., where
Ŝ_N ≜ (1/n) Σ_{i=1}^n v(τ_i γ) A_i w̃_i w̃_i^* A_i^*
with γ the unique solution to 1 = ∫ ψ(tγ) / (1 + c ψ(tγ)) ν(dt), and we recall A_i ≜ [√p_1 a_1, ..., √p_L a_L, √τ_i I_N], w̃_i = [s_{1,i}, ..., s_{L,i}, w_i^T]^T.
Remark: for L = 0, A_i = √τ_i I_N and A_i w̃_i becomes √τ_i w_i: we recover the previous result.

88 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 99/142 Localization of eigenvalues
Theorem (Eigenvalue localization). Denote u_k the eigenvector of the k-th largest eigenvalue of AA^* = Σ_{i=1}^L p_i a_i a_i^*, and û_k the eigenvector of the k-th largest eigenvalue of Ĉ_N. Also define δ(x) as the unique positive solution to
δ(x) = c ( x - ∫ t v_c(tγ) / (1 + δ(x) t v_c(tγ)) ν(dt) )^{-1}.
Further denote
p̄ ≜ lim_{x↓S_+} ( c ∫ δ(x) v_c(tγ) / (1 + δ(x) t v_c(tγ)) ν(dt) )^{-1},  S_+ ≜ φ_∞ (1 + √c)^2 / ( γ (1 - c φ_∞) ).
Then, if p_j > p̄, λ̂_j → Λ_j > S_+ a.s.; otherwise lim sup_n λ̂_j <= S_+ a.s., with Λ_j the unique positive solution to
c ∫ v_c(τγ) / ( δ(Λ_j) (1 + δ(Λ_j) τ v_c(τγ)) ) ν(dτ) = p_j.

89 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 100/142 Simulation
Figure: Histogram of the eigenvalues of (1/n) Σ_i y_i y_i^* against the limiting spectral measure, L = 2, p_1 = p_2 = 1, N = 200, n = 1000, Student-t impulses.

90 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 101/142 Simulation
Figure: Histogram of the eigenvalues of Ĉ_N against the limiting spectral measure (right edge of the support, S_+ and the Λ_j marked), for u(x) = (1 + α)/(α + x) with α = 0.2, L = 2, p_1 = p_2 = 1, N = 200, n = 1000, Student-t impulses.

91 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 102/142 Comments
SCM vs robust: spikes invisible in the SCM in impulsive noise are reborn in the robust estimate of scatter.
Largest eigenvalues: λ_i(Ĉ_N) > S_+ implies the presence of a source; λ_i(Ĉ_N) ∈ (sup(support), S_+) may be due to a source or to a noise impulse; for λ_i(Ĉ_N) < sup(support), as usual, nothing can be said.
This induces a natural source detection algorithm.

92 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 103/142 Eigenvalue and eigenvector projection estimates
Two scenarios: known ν = lim_n (1/n) Σ_{i=1}^n δ_{τ_i}, or unknown ν.
Theorem (Estimation under known ν).
1. Power estimation. For each p_j > p̄, c ∫ v_c(τγ) / ( δ(λ̂_j) (1 + δ(λ̂_j) τ v_c(τγ)) ) ν(dτ) → p_j a.s.
2. Bilinear form estimation. For each a, b ∈ C^N with ||a|| = ||b|| = 1, and p_j > p̄,
Σ_{k, p_k = p_j} a^* u_k u_k^* b - Σ_{k, p_k = p_j} w_k a^* û_k û_k^* b → 0 a.s.,
where
w_k = [ ∫ v_c(tγ) / (1 + δ(λ̂_k) t v_c(tγ))^2 ν(dt) ] / [ ∫ v_c(tγ) / (1 + δ(λ̂_k) t v_c(tγ)) ν(dt) - c δ(λ̂_k)^2 ∫ t^2 v_c(tγ)^2 / (1 + δ(λ̂_k) t v_c(tγ))^2 ν(dt) ].

93 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 104/142 Eigenvalue and eigenvector projection estimates
Theorem (Estimation under unknown ν).
1. Purely empirical power estimation. For each p_j > p̄, (1/N) Σ_{i=1}^n v(τ̂_i γ̂_n) / ( δ̂(λ̂_j) (1 + δ̂(λ̂_j) τ̂_i v(τ̂_i γ̂_n)) ) → p_j a.s.
2. Purely empirical bilinear form estimation. For each a, b ∈ C^N with ||a|| = ||b|| = 1, and each p_j > p̄,
Σ_{k, p_k = p_j} a^* u_k u_k^* b - Σ_{k, p_k = p_j} ŵ_k a^* û_k û_k^* b → 0 a.s.,
where
ŵ_k = [ (1/n) Σ_{i=1}^n v(τ̂_i γ̂) / (1 + δ̂(λ̂_k) τ̂_i v(τ̂_i γ̂))^2 ] / [ (1/n) Σ_{i=1}^n v(τ̂_i γ̂) / (1 + δ̂(λ̂_k) τ̂_i v(τ̂_i γ̂)) - δ̂(λ̂_k)^2 (1/N) Σ_{i=1}^n τ̂_i^2 v(τ̂_i γ̂)^2 / (1 + δ̂(λ̂_k) τ̂_i v(τ̂_i γ̂))^2 ],
γ̂ ≜ (1/n) Σ_{i=1}^n (1/N) y_i^* Ĉ_{(i)}^{-1} y_i,  τ̂_i ≜ (1/(γ̂ N)) y_i^* Ĉ_{(i)}^{-1} y_i,  and δ̂(x) defined as δ(x) but with (τ_i, γ) replaced by (τ̂_i, γ̂).

94 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 105/142 Application to G-MUSIC
Corollary (Robust G-MUSIC). Assume the model a_i = a(θ_i) with a(θ) = N^{-1/2} [exp(2πi d j sin(θ))]_{j=0}^{N-1}. Define η̂_RG(θ) and η̂_RG^emp(θ) as
η̂_RG(θ) = 1 - Σ_{k=1}^{|{j, p_j > p̄}|} w_k a(θ)^* û_k û_k^* a(θ),  η̂_RG^emp(θ) = 1 - Σ_{k=1}^{|{j, p_j > p̄}|} ŵ_k a(θ)^* û_k û_k^* a(θ).
Then, for each p_j > p̄, θ̂_j → θ_j a.s. and θ̂_j^emp → θ_j a.s., where
θ̂_j ≜ argmin_{θ ∈ R_j^κ} { η̂_RG(θ) },  θ̂_j^emp ≜ argmin_{θ ∈ R_j^κ} { η̂_RG^emp(θ) }.
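A sketch of how the localization function above can be evaluated numerically, assuming the isolated-eigenvalue eigenvectors û_k of Ĉ_N and the weights ŵ_k of the previous theorem are already available (both are taken as given inputs here), and assuming a half-wavelength ULA:

```python
import numpy as np

def robust_gmusic_spectrum(theta_grid_deg, U_hat, w_hat, d=0.5):
    """Evaluate eta(theta) = 1 - sum_k w_k |a(theta)^* u_hat_k|^2 on a grid of angles.
    U_hat: N x K matrix of spike eigenvectors of the robust scatter estimate;
    w_hat: length-K weights from the (empirical) bilinear-form theorem.
    Source angles are then estimated as the deepest minima of eta."""
    N, K = U_hat.shape
    j = np.arange(N)
    eta = np.empty(len(theta_grid_deg))
    for idx, t in enumerate(theta_grid_deg):
        a = np.exp(2j * np.pi * d * j * np.sin(np.deg2rad(t))) / np.sqrt(N)
        proj = np.abs(a.conj() @ U_hat) ** 2              # |a(theta)^* u_hat_k|^2, k = 1..K
        eta[idx] = 1.0 - float(np.real(w_hat @ proj))
    return eta

# theta_grid = np.linspace(-90, 90, 1801)
# eta = robust_gmusic_spectrum(theta_grid, U_hat, w_hat)
# doa_estimates = theta_grid[np.argsort(eta)[:2]]         # e.g. the two deepest values
```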

95 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 106/142 Simulations: single shot in elliptical noise
Figure: Random realization of the localization functions η̂_X(θ) for the various MUSIC estimators (robust G-MUSIC, empirical robust G-MUSIC, G-MUSIC, empirical G-MUSIC, robust MUSIC, MUSIC), with N = 20, n = 100, two sources at 10° and 12°, Student-t impulses with parameter β = 100, u(x) = (1 + α)/(α + x) with α = 0.2, powers p_1 = p_2 = -5 dB.

96 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 107/142 Simulations: elliptical noise
Figure: Mean square error E[|θ̂_1 - θ_1|^2] for the estimation of θ_1 = 10°, as a function of the powers p_1 = p_2 [dB], with N = 20, n = 100, two sources at 10° and 12°, Student-t impulses with parameter β = 10, u(x) = (1 + α)/(α + x) with α = 0.2.

97 Advanced Random Matrix Models for Robust Estimation/3.2 Spiked model extension and robust G-MUSIC 108/142 Simulations: spurious impulses
Figure: Mean square error E[|θ̂_1 - θ_1|^2] for the estimation of θ_1 = 10°, as a function of the powers p_1 = p_2 [dB], with N = 20, n = 100, two sources at 10° and 12°, sample outlier scenario τ_i = 1 for i < n, τ_n = 100, u(x) = (1 + α)/(α + x) with α = 0.2.

98 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 110/142 Context
Ledoit and Wolf, 2004. A well-conditioned estimator for large-dimensional covariance matrices.
Pascal, Chitour, Quek, 2013. Generalized robust shrinkage estimator and application to STAP data.
Chen, Wiesel, Hero, 2011. Robust shrinkage estimation of high-dimensional covariance matrices.
Shrinkage covariance estimation: for N > n or N ≈ n, the shrinkage estimator (1 - ρ) (1/n) Σ_{i=1}^n x_i x_i^* + ρ I_N, for some ρ ∈ [0, 1], allows for invertibility and better conditioning; ρ may be chosen to minimize an expected error metric.
Limitation of Maronna's estimator: Maronna and Tyler estimators are limited to N < n, otherwise they do not exist; introducing shrinkage in the robust estimator cannot do much harm anyhow...
Introducing the robust-shrinkage estimator: the literature proposes two such estimators:
Ĉ_N(ρ) = (1 - ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ( (1/N) x_i^* Ĉ_N(ρ)^{-1} x_i ) + ρ I_N,  ρ ∈ (max{0, 1 - n/N}, 1]  (Pascal)
Č_N(ρ) = B̌_N(ρ) / ( (1/N) tr B̌_N(ρ) ),  B̌_N(ρ) = (1 - ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ( (1/N) x_i^* Č_N(ρ)^{-1} x_i ) + ρ I_N,  ρ ∈ (0, 1]  (Chen)
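A minimal numerical sketch of the two estimators, obtained by iterating their defining fixed-point equations; the fixed iteration count is an illustrative choice:

```python
import numpy as np

def pascal_shrinkage(X, rho, n_iter=100):
    """Fixed point of C = (1-rho)/n sum_i x_i x_i^* / ((1/N) x_i^* C^{-1} x_i) + rho I."""
    N, n = X.shape
    C = np.eye(N, dtype=X.dtype)
    for _ in range(n_iter):
        d = np.einsum('in,in->n', X.conj(), np.linalg.solve(C, X)).real / N
        C = (1 - rho) * (X / d) @ X.conj().T / n + rho * np.eye(N)
    return C

def chen_shrinkage(X, rho, n_iter=100):
    """Fixed point of B = (1-rho)/n sum_i x_i x_i^* / ((1/N) x_i^* C^{-1} x_i) + rho I,
    with C = B / ((1/N) tr B)."""
    N, n = X.shape
    C = np.eye(N, dtype=X.dtype)
    for _ in range(n_iter):
        d = np.einsum('in,in->n', X.conj(), np.linalg.solve(C, X)).real / N
        B = (1 - rho) * (X / d) @ X.conj().T / n + rho * np.eye(N)
        C = B / (np.trace(B).real / N)
    return C
```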

99 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 111/142 Main theoretical result
Which estimator is better? Having asked the authors of both papers, each claimed their estimator was much better than the other's, but the arguments we received were quite vague...
Our result: in the random matrix regime, both estimators tend to be one and the same!
Assumptions: as before, an elliptical-like model x_i = √τ_i C_N^{1/2} w_i. This time, C_N cannot be taken equal to I_N (due to the +ρ I_N term)! Maronna-based shrinkage is possible but more involved...

100 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 112/142 Pascal's estimator
Theorem (Pascal's estimator). For ε ∈ (0, min{1, c^{-1}}), define R̂_ε ≜ [ε + max{0, 1 - c^{-1}}, 1]. Then, as N, n → ∞ with N/n → c ∈ (0, ∞), sup_{ρ ∈ R̂_ε} ||Ĉ_N(ρ) - Ŝ_N(ρ)|| → 0 a.s., where
Ĉ_N(ρ) = (1 - ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ( (1/N) x_i^* Ĉ_N(ρ)^{-1} x_i ) + ρ I_N,
Ŝ_N(ρ) = (1 - ρ) / ( γ̂(ρ) (1 - (1 - ρ) c) ) · (1/n) Σ_{i=1}^n C_N^{1/2} w_i w_i^* C_N^{1/2} + ρ I_N,
and γ̂(ρ) is the unique positive solution to the equation in γ̂: 1 = (1/N) Σ_{i=1}^N λ_i(C_N) / ( γ̂ ρ + (1 - ρ) λ_i(C_N) ).
Moreover, ρ ↦ γ̂(ρ) is continuous on (0, 1].

101 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 113/142 Chen's estimator
Theorem (Chen's estimator). For ε ∈ (0, 1), define Ř_ε ≜ [ε, 1]. Then, as N, n → ∞ with N/n → c ∈ (0, ∞), sup_{ρ ∈ Ř_ε} ||Č_N(ρ) - Š_N(ρ)|| → 0 a.s., where
Č_N(ρ) = B̌_N(ρ) / ( (1/N) tr B̌_N(ρ) ),  B̌_N(ρ) = (1 - ρ) (1/n) Σ_{i=1}^n x_i x_i^* / ( (1/N) x_i^* Č_N(ρ)^{-1} x_i ) + ρ I_N,
Š_N(ρ) = (1 - ρ) / (ρ + T_ρ) · (1/n) Σ_{i=1}^n C_N^{1/2} w_i w_i^* C_N^{1/2} + T_ρ / (ρ + T_ρ) · I_N,
in which T_ρ = ρ γ̌(ρ) F(γ̌(ρ); ρ) with, for all x > 0,
F(x; ρ) = (1/2) (ρ - c(1 - ρ)) + √( (1/4) (ρ - c(1 - ρ))^2 + (1 - ρ)/x ),
and γ̌(ρ) is the unique positive solution to the equation in γ̌: 1 = (1/N) Σ_{i=1}^N λ_i(C_N) / ( γ̌ ρ + (1 - ρ) λ_i(C_N) / ((1 - ρ) c + F(γ̌; ρ)) ).
Moreover, ρ ↦ γ̌(ρ) is continuous on (0, 1].

102 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 114/142 Asymptotic model equivalence
Theorem (Model Equivalence). For each ρ ∈ (0, 1], there exist unique ρ̂ ∈ (max{0, 1 - c^{-1}}, 1] and ρ̌ ∈ (0, 1] such that
Ŝ_N(ρ̂) / ( (1 - ρ̂) / ( γ̂(ρ̂) (1 - (1 - ρ̂) c) ) + ρ̂ ) = Š_N(ρ̌) = (1 - ρ) (1/n) Σ_{i=1}^n C_N^{1/2} w_i w_i^* C_N^{1/2} + ρ I_N.
Besides, the maps (0, 1] → (max{0, 1 - c^{-1}}, 1], ρ ↦ ρ̂ and (0, 1] → (0, 1], ρ ↦ ρ̌ are increasing and onto.
Up to normalization, both estimators behave the same! Both estimators behave the same as an impulse-free Ledoit-Wolf estimator.
About uniformity: uniformity over ρ in the theorems is essential to find optimal values of ρ.

103 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 115/142 Optimal shrinkage parameter
Chen et al. sought a Frobenius-norm-minimizing ρ but got stuck on the implicit nature of Č_N(ρ). Our results allow for a simplification of the problem for large N, n! Model equivalence says that only one problem needs to be solved.
Theorem (Optimal Shrinkage). For each ρ ∈ (0, 1], define
D̂_N(ρ) ≜ (1/N) tr[ ( Ĉ_N(ρ) / ( (1/N) tr Ĉ_N(ρ) ) - C_N )^2 ],  Ď_N(ρ) ≜ (1/N) tr[ ( Č_N(ρ) - C_N )^2 ].
Denote D* ≜ c (M_2 - 1) / (c + M_2 - 1), ρ* ≜ c / (c + M_2 - 1), with M_2 ≜ lim_N (1/N) Σ_{i=1}^N λ_i^2(C_N), and let ρ̂*, ρ̌* be the unique solutions to
ρ̂* / ( (1 - ρ̂*) / ( γ̂(ρ̂*) (1 - (1 - ρ̂*) c) ) + ρ̂* ) = T_{ρ̌*} / ( ρ̌* + T_{ρ̌*} ) = ρ*.
Then, for ε small enough,
inf_{ρ ∈ R̂_ε} D̂_N(ρ) → D* a.s.,  D̂_N(ρ̂*) → D* a.s.,  inf_{ρ ∈ Ř_ε} Ď_N(ρ) → D* a.s.,  Ď_N(ρ̌*) → D* a.s.
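A small numerical check of these limiting quantities, using the closed forms for D* and ρ* as written above (these expressions are reconstructed from the slide and assume (1/N) tr C_N = 1; treat them as illustrative):

```python
import numpy as np

def limiting_optimal_shrinkage(C, c):
    """D* and rho* from the eigenvalues of C_N, normalized so that (1/N) tr C_N = 1,
    following the limiting expressions of the Optimal Shrinkage theorem as stated above."""
    lam = np.linalg.eigvalsh(C)
    lam = lam / lam.mean()                    # enforce (1/N) tr C = 1
    M2 = np.mean(lam ** 2)
    rho_star = c / (c + M2 - 1)
    D_star = c * (M2 - 1) / (c + M2 - 1)
    return D_star, rho_star

# Example: [C]_ij = 0.7^{|i-j|} as in the simulations below, c = N/n = 1/2
N = 32
C = 0.7 ** np.abs(np.subtract.outer(np.arange(N), np.arange(N)))
print(limiting_optimal_shrinkage(C, c=0.5))
```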

104 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 116/142 Estimating ρ̂* and ρ̌*
The theorem is only useful if ρ̂* and ρ̌* can be estimated! A careful control of the proofs provides many ways to estimate these; the proposition below provides one example.
Optimal Shrinkage Estimate. Let ρ̂_N ∈ (max{0, 1 - c_N^{-1}}, 1] and ρ̌_N ∈ (0, 1] be solutions (not necessarily unique) to
( (ρ̌_N/n) Σ_{i=1}^n x_i^* Č_N(ρ̌_N)^{-1} x_i / ||x_i||^2 ) / ( 1 - ρ̌_N + (ρ̌_N/n) Σ_{i=1}^n x_i^* Č_N(ρ̌_N)^{-1} x_i / ||x_i||^2 ) = c_N / ( (1/N) tr[ ( (1/n) Σ_{i=1}^n x_i x_i^* / ( (1/N) ||x_i||^2 ) )^2 ] - 1 ),
ρ̂_N / ( (1/N) tr Ĉ_N(ρ̂_N) ) = c_N / ( (1/N) tr[ ( (1/n) Σ_{i=1}^n x_i x_i^* / ( (1/N) ||x_i||^2 ) )^2 ] - 1 ),
defined arbitrarily when no such solutions exist. Then
ρ̂_N → ρ̂* a.s.,  ρ̌_N → ρ̌* a.s.,  D̂_N(ρ̂_N) → D* a.s.,  Ď_N(ρ̌_N) → D* a.s.

105 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 117/142 Simulations
Figure: Performance of optimal shrinkage (normalized Frobenius norm versus n, log_2 scale), averaged over Monte Carlo simulations, for N = 32 and various values of n, [C_N]_{ij} = r^{|i-j|} with r = 0.7; curves: inf_{ρ∈(0,1]} Ď_N(ρ), Ď_N(ρ̌_N), D*, Ď_N(ρ̌_O), with ρ̌_N as above and ρ̌_O the clairvoyant estimator proposed in (Chen et al., 2011).

106 Advanced Random Matrix Models for Robust Estimation/3.3 Robust shrinkage and application to mathematical finance 118/142 Simulations
Figure: Shrinkage parameter ρ (versus n, log_2 scale) averaged over Monte Carlo simulations, for N = 32 and various values of n, [C_N]_{ij} = r^{|i-j|} with r = 0.7; ρ̂_N and ρ̌_N as above; ρ̌_O the clairvoyant estimator proposed in (Chen et al., 2011); also shown, ρ̂*, ρ̌* and the finite-N minimizers argmin_{ρ ∈ (max{0, 1 - c_N^{-1}}, 1]} D̂_N(ρ) and argmin_{ρ ∈ (0, 1]} Ď_N(ρ).

107 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 120/142 Context
Hypothesis testing problem: two sets of data. Initial pure-noise data x_1, ..., x_n, with x_i = √τ_i C_N^{1/2} w_i as before. New incoming datum y given by
y = x under H_0,  y = α p + x under H_1,
with x = √τ C_N^{1/2} w, p ∈ C^N deterministic and known, α unknown.
GLRT detection test: T_N(ρ) ≷_{H_0}^{H_1} Γ for some detection threshold Γ, where
T_N(ρ) ≜ |y^* Ĉ_N(ρ)^{-1} p| / √( (y^* Ĉ_N(ρ)^{-1} y) (p^* Ĉ_N(ρ)^{-1} p) )
and Ĉ_N(ρ) is defined in the previous section. In fact, the test was originally derived with Ĉ_N(0), which is only valid for N < n; introducing ρ may bring improvements for arbitrary N/n ratios.
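A direct numerical transcription of this statistic; Ĉ_N(ρ) can be obtained with, e.g., the Pascal fixed point sketched earlier:

```python
import numpy as np

def glrt_statistic(y, p, C_hat):
    """T_N(rho) = |y^* C^{-1} p| / sqrt((y^* C^{-1} y)(p^* C^{-1} p)),
    with C the regularized robust scatter estimate C_hat_N(rho)."""
    Ci_p = np.linalg.solve(C_hat, p)
    Ci_y = np.linalg.solve(C_hat, y)
    num = np.abs(y.conj() @ Ci_p)
    den = np.sqrt((y.conj() @ Ci_y).real * (p.conj() @ Ci_p).real)
    return float(num / den)

# p = np.ones(N) / np.sqrt(N); C_hat = pascal_shrinkage(X, rho=0.2); T = glrt_statistic(y, p, C_hat)
```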

108 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 121/142 Objectives and main results
Initial observation: as N, n → ∞ with N/n → c > 0, under H_0, T_N(ρ) → 0 a.s. A trivial result of little interest!
Natural question: for finite N, n and a given Γ, find ρ such that P(T_N(ρ) > Γ) is minimized. It turns out the correct non-trivial object is, for fixed γ > 0, P(√N T_N(ρ) > γ) = min.
Objectives: for each ρ, develop a central limit theorem to evaluate lim_{N,n→∞, N/n→c} P(√N T_N(ρ) > γ); determine the limiting minimizing ρ; empirically estimate the minimizing ρ.

109 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 122/142 What do we need? A CLT on Ĉ_N statistics
We know that ||Ĉ_N(ρ) - Ŝ_N(ρ)|| → 0 a.s. (the key result so far!). What about √N (Ĉ_N(ρ) - Ŝ_N(ρ))? It does not converge to zero!!! But there is hope:
√N ( a^* Ĉ_N(ρ) b - a^* Ŝ_N(ρ) b ) → 0 a.s.
This is our main result! It requires a much more delicate treatment, not discussed in this tutorial.

110 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 123/142 Main results
Theorem (Fluctuation of bilinear forms). Let a, b ∈ C^N with ||a|| = ||b|| = 1. Then, as N, n → ∞ with N/n → c > 0, for any ε > 0 and every k ∈ Z,
sup_{ρ ∈ R_κ} N^{1-ε} | a^* Ĉ_N^k(ρ) b - a^* Ŝ_N^k(ρ) b | → 0 a.s.,
where R_κ ≜ [κ + max{0, 1 - 1/c}, 1].

111 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 124/142 False alarm performance
Theorem (Asymptotic detector performance). As N, n → ∞ with N/n → c ∈ (0, ∞),
sup_{ρ ∈ R_κ} | P( √N T_N(ρ) > γ ) - exp( -γ^2 / (2 σ_N^2(ρ̂)) ) | → 0,
where ρ ↦ ρ̂ is the aforementioned mapping and
σ_N^2(ρ̂) ≜ (1/2) p^* C_N Q_N^2(ρ̂) p / [ p^* Q_N(ρ̂) p · (1/N) tr(C_N Q_N(ρ̂)) · ( 1 - c (1 - ρ̂)^2 m(-ρ̂)^2 (1/N) tr(C_N^2 Q_N^2(ρ̂)) ) ],
with Q_N(ρ̂) ≜ ( I_N + (1 - ρ̂) m(-ρ̂) C_N )^{-1}.
Limiting Rayleigh distribution: weak convergence of √N T_N(ρ) to a Rayleigh variable R_N(ρ̂).
Remark: σ_N and ρ̂ are not functions of γ, so there exists a uniformly optimal ρ!

112 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 125/142 Simulation
Figure: Histogram (left) and cumulative distribution function (right) of √N T_N(ρ) versus the density and distribution of R_N(ρ̂), for N = 20, p = N^{-1/2}[1, ..., 1]^T, C_N Toeplitz from an AR process of coefficient 0.7, c_N = 1/2, ρ = 0.2.

113 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 126/142 Simulation
Figure: Histogram (left) and cumulative distribution function (right) of √N T_N(ρ) versus the density and distribution of R_N(ρ̂), for N = 100, p = N^{-1/2}[1, ..., 1]^T, C_N Toeplitz from an AR process of coefficient 0.7, c_N = 1/2, ρ = 0.2.

114 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 127/142 Empirical estimation of the optimal ρ
The optimal ρ can be found by line search... but C_N is unknown! We shall successively: empirically estimate σ_N(ρ̂); minimize the estimate; prove by uniformity the asymptotic optimality of the estimate.
Theorem (Empirical performance estimation). For ρ ∈ (max{0, 1 - c_N^{-1}}, 1), let
σ̂_N^2(ρ̂) ≜ (1/2) · (1/N) p^* Ĉ_N^{-2}(ρ) p / [ (1 - c + c ρ̂) · p^* Ĉ_N^{-1}(ρ) p · (1/N) tr Ĉ_N^{-1}(ρ) · ( 1 - ρ̂ (1/N) tr Ĉ_N(ρ) / ( (1/N) tr Ĉ_N^{-1}(ρ) ) ) ].
Also let σ̂_N^2(1) ≜ lim_{ρ̂→1} σ̂_N^2(ρ̂). Then sup_{ρ ∈ R_κ} | σ_N^2(ρ̂) - σ̂_N^2(ρ̂) | → 0 a.s.

115 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 128/142 Final result
Theorem (Optimality of the empirical estimator). Define ρ̂_N ≜ argmin_{ρ ∈ R_κ} { σ̂_N^2(ρ̂) }. Then, for every γ > 0,
P( √N T_N(ρ̂_N) > γ ) - inf_{ρ ∈ R_κ} { P( √N T_N(ρ) > γ ) } → 0.
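A sketch of the resulting selection procedure as a plain grid search over ρ; the empirical variance estimate is passed in as a callable (named sigma2_hat here purely for illustration, standing for the σ̂²_N(ρ̂) of the previous slide):

```python
import numpy as np

def optimal_rho_by_line_search(sigma2_hat, rho_grid):
    """Grid search for rho_N = argmin_rho sigma2_hat(rho) over a discretization of R_kappa.
    sigma2_hat: callable rho -> empirical variance estimate (hypothetical helper standing
    for the expression of the previous slide)."""
    values = np.array([sigma2_hat(rho) for rho in rho_grid])
    best = int(np.argmin(values))
    return rho_grid[best], values

# Illustrative usage (sigma2_hat to be implemented from the previous slide's formula):
# rho_grid = np.linspace(0.05, 1.0, 96)
# rho_N, vals = optimal_rho_by_line_search(sigma2_hat, rho_grid)
```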

116 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 129/142 Simulations
Figure: False alarm rate P(√N T_N(ρ) > γ) as a function of ρ, for two values of γ (one of them γ = 2), comparing the limiting theory, the empirical estimator and the detector; N = 20, p = N^{-1/2}[1, ..., 1]^T, C_N Toeplitz from an AR process of coefficient 0.7, c_N = 1/2.

117 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 130/142 Simulations
Figure: False alarm rate P(√N T_N(ρ) > γ) as a function of ρ, for two values of γ (one of them γ = 2), comparing the limiting theory, the empirical estimator and the detector; N = 100, p = N^{-1/2}[1, ..., 1]^T, C_N Toeplitz from an AR process of coefficient 0.7, c_N = 1/2.

118 Advanced Random Matrix Models for Robust Estimation/3.4 Optimal robust GLRT detectors 131/142 Simulations
Figure: False alarm rate P(T_N(ρ) > Γ) as a function of Γ (limiting theory versus detector), for N = 20 and N = 100, p = N^{-1/2}[1, ..., 1]^T, [C_N]_{ij} = 0.7^{|i-j|}, c_N = 1/2.

119 Future Directions/4.1 Kernel matrices and kernel methods 134/142 Motivation: Spectral Clustering
N. El Karoui, The spectrum of kernel random matrices, The Annals of Statistics, 38(1):1-50, 2010.
Objective: clustering data x_1, ..., x_n ∈ C^N into k similarity classes, a classical machine learning problem brought here to big data! It assumes a similarity function, e.g. the Gaussian kernel f(x_i, x_j) = exp( -||x_i - x_j||^2 / (2σ^2) ), and naturally brings the kernel matrix W = [W_ij]_{1<=i,j<=n} = [f(x_i, x_j)]_{1<=i,j<=n}.
Letting x_1, ..., x_n be random leads naturally to studying kernel random matrices. Little is known on such random matrices, but for x_i i.i.d. with zero mean and covariance I_N,
|| W - α 1_n 1_n^T - (β/n) X^* X || → 0 a.s.
for some α, β depending on f and its derivatives, where X = [x_1, ..., x_n]. To first order, W is thus equivalent to a rank-one matrix.
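For concreteness, the kernel matrix construction in code (the bandwidth σ is a free parameter):

```python
import numpy as np

def gaussian_kernel_matrix(X, sigma):
    """W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)) for data columns x_1, ..., x_n of X (N x n)."""
    sq_norms = np.sum(np.abs(X) ** 2, axis=0)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2 * np.real(X.conj().T @ X)
    return np.exp(-np.maximum(d2, 0.0) / (2 * sigma ** 2))
```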

120 Future Directions/4.1 Kernel matrices and kernel methods 135/142 Motivation: Spectral Clustering
Clustering x_1, ..., x_n into k classes is often written as
(RatioCut)  min_{S_1 ∪ ... ∪ S_k = S, S_i ∩ S_j = ∅} Σ_{i=1}^k (1/|S_i|) Σ_{j ∈ S_i, j' ∈ S_i^c} f(x_j, x_{j'}),
but this is difficult to solve, NP-hard! It can be equivalently rewritten as
(RatioCut)  min_{M ∈ ℳ, M^T M = I_k} tr( M^T L M ),
where ℳ = { M = [m_ij]_{i<=n, j<=k}, m_ij = |S_j|^{-1/2} δ_{x_i ∈ S_j} } and
L = [L_ij]_{1<=i,j<=n} = [ -W + diag(W 1_n) ]_{1<=i,j<=n} = [ -f(x_i, x_j) + δ_{ij} Σ_{l=1}^n f(x_i, x_l) ]_{1<=i,j<=n}.
Relaxing M to have orthonormal columns leads to a simple eigenvalue/eigenvector problem: spectral clustering.
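A compact sketch of the relaxed procedure (unnormalized Laplacian, k smallest eigenvectors, then a basic k-means on the rows; the k-means details are illustrative choices):

```python
import numpy as np

def spectral_clustering(W, k, n_kmeans_iter=50, seed=0):
    """Relaxed RatioCut: take the k eigenvectors of L = diag(W 1) - W with smallest
    eigenvalues, then cluster their rows with a basic Lloyd (k-means) iteration."""
    n = W.shape[0]
    L = np.diag(W.sum(axis=1)) - W
    _, V = np.linalg.eigh(L)                  # eigenvalues in ascending order
    M = V[:, :k]                              # relaxed indicator matrix, rows = embedded points
    rng = np.random.default_rng(seed)
    centers = M[rng.choice(n, size=k, replace=False)]
    for _ in range(n_kmeans_iter):
        labels = np.argmin(((M[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = M[labels == j].mean(axis=0)
    return labels

# labels = spectral_clustering(gaussian_kernel_matrix(X, sigma=1.0), k=3)
```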

121 Future Directions/4.1 Kernel matrices and kernel methods 136/142 Objectives
Generalization to k distributions for x_1, ..., x_n should lead to asymptotically rank-k W matrices. If established, specific choices of known good kernels would be better understood. Eventually, find optimal kernel choices.

122 Future Directions/4.2 Neural networks 138/142 Echo-state neural networks
Neural network: input neuron signal s_t ∈ R (could be multivariate); output neuron signal y_t ∈ R (could be multivariate); N neurons with state x_t ∈ R^N at time t; connectivity matrix W ∈ R^{N×N}; connectivity vector to the input w_I ∈ R^N; connectivity vector to the output w_O ∈ R^N.
State evolution: x_0 = 0 (say) and x_{t+1} = S(W x_t + w_I s_t), with S an entry-wise sigmoid function. Output observation: y_t = w_O^T x_t.
Classical neural networks: learning phase: input-output data (s_t, y_t) used to learn W, w_O, w_I (via e.g. LS); interpolation phase: W, w_O, w_I fixed, we observe the output y_t from new data s_t. This poses overlearning problems, is difficult to set up, and demands lots of learning data.
Echo-state neural networks: to solve these problems, W and w_I are set to be random and are no longer learned; only w_O is learned. This reduces the amount of data needed for learning and shows striking performance in some scenarios.
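A toy implementation of the state recursion, taking tanh as the entry-wise sigmoid and an i.i.d. Gaussian reservoir scaled to a spectral radius of roughly 0.9 (all illustrative choices):

```python
import numpy as np

def esn_states(s, W, w_in, nonlinearity=np.tanh):
    """Run x_{t+1} = S(W x_t + w_in s_t) from x_0 = 0 and return the state trajectory."""
    N, T = W.shape[0], len(s)
    X = np.zeros((N, T))
    x = np.zeros(N)
    for t in range(T):
        x = nonlinearity(W @ x + w_in * s[t])
        X[:, t] = x
    return X

rng = np.random.default_rng(0)
N, T = 200, 1000
W = rng.standard_normal((N, N)) / np.sqrt(N) * 0.9     # i.i.d. entries, spectral radius ~ 0.9
w_in = rng.standard_normal(N)
s = np.sin(np.arange(T) / 10.0)                        # toy scalar input
X = esn_states(s, W, w_in)
```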

123 Future Directions/4.2 Neural networks 139/142 ESN and random matrices
W and w_I being random, the performance study involves random matrices. Stability, chaos regime, etc., involve the extreme eigenvalues of W; the main difficulty is the non-linearity caused by S.
Performance measures: MSE on training data; MSE on interpolated data. Optimization is to be performed on the regression method, e.g.
w_O = ( X_train X_train^T + γ I_N )^{-1} X_train y_train,
with X_train = [x_1, ..., x_T], y_train = [y_1, ..., y_T]^T, T the training period.
In a first approximation, S = Id. The MSE performance with stationary inputs then leads to studying
Σ_{j>=0} W^j w_I w_I^T (W^T)^j,
a new random matrix model, which can however be analyzed with the usual tools.
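A sketch of the readout training and of a truncated evaluation of the series above (the truncation length is an illustrative choice; the series converges when the spectral radius of W is below one):

```python
import numpy as np

def train_readout(X_train, y_train, gamma):
    """Ridge readout w_O = (X X^T + gamma I)^{-1} X y, as in the slide."""
    N = X_train.shape[0]
    return np.linalg.solve(X_train @ X_train.T + gamma * np.eye(N), X_train @ y_train)

def input_memory_matrix(W, w_in, n_terms=200):
    """Truncation of sum_{j>=0} W^j w_in w_in^T (W^T)^j, the matrix arising in the
    linearized (S = Id) stationary-input MSE analysis."""
    N = W.shape[0]
    M = np.zeros((N, N))
    v = w_in.copy()
    for _ in range(n_terms):
        M += np.outer(v, v)
        v = W @ v
    return M

# w_O = train_readout(X[:, :800], y[:800], gamma=1e-2)   # X, y from a training record
# y_hat = w_O @ X[:, 800:]                                # interpolation on new states
```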

124 Future Directions/4.2 Neural networks 140/142 Related bibliography
J. T. Kent, D. E. Tyler, Redescending M-estimates of multivariate location and scatter, 1991.
R. A. Maronna, Robust M-estimators of multivariate location and scatter, 1976.
Y. Chitour, F. Pascal, Exact maximum likelihood estimates for SIRV covariance matrix: Existence and algorithm analysis, 2008.
N. El Karoui, Concentration of measure and spectra of random matrices: applications to correlation matrices, elliptical distributions and beyond, 2009.
R. Couillet, F. Pascal, J. W. Silverstein, Robust M-Estimation for Array Processing: A Random Matrix Approach, 2012.
J. Vinogradova, R. Couillet, W. Hachem, Statistical Inference in Large Antenna Arrays under Unknown Noise Pattern, (submitted to) IEEE Transactions on Signal Processing, 2012.
F. Chapon, R. Couillet, W. Hachem, X. Mestre, On the isolated eigenvalues of large Gram random matrices with a fixed rank deformation, (submitted to) Electronic Journal of Probability, 2012, arXiv preprint.
R. Couillet, M. Debbah, Signal Processing in Large Systems: a New Paradigm, IEEE Signal Processing Magazine, vol. 30, no. 1, 2013.
P. Loubaton, P. Vallet, Almost sure localization of the eigenvalues in a Gaussian information plus noise model. Application to the spiked models, Electronic Journal of Probability, 2011.
P. Vallet, W. Hachem, P. Loubaton, X. Mestre, J. Najim, On the consistency of the G-MUSIC DOA estimator, IEEE Statistical Signal Processing Workshop (SSP), 2011.

125 Future Directions/4.2 Neural networks 141/142 To know more about all this
Our webpages:

126 Future Directions/4.2 Neural networks 142/142 Spraed it!
To download this presentation (PDF format): log in to your Spraed account and scan this QR code.
