Error Bound for Compound Wishart Matrices

Electronic Journal of Statistics Vol. 0 (2006) 1-8

Error Bound for Compound Wishart Matrices

Ilya Soloveychik, The Rachel and Selim Benin School of Computer Science and Engineering, The Hebrew University of Jerusalem, Givat Ram, Jerusalem, ilya.soloveychik@mail.huji.ac.il

This work was partially supported by the Kaete Klausner Scholarship, The Hebrew University of Jerusalem.

Abstract: In this paper we consider the non-asymptotic behavior of real compound Wishart matrices. This family of random matrices generalizes the classical real Wishart distribution. In particular, we consider matrices of the form $XBX^T$, where $X$ is a $p \times n$ matrix with centered Gaussian elements and $B$ is an arbitrary $n \times n$ matrix, as well as sequences of such matrices for varying $n$. We derive high-probability error bounds on the deviations of compound Wishart matrices from their mean.

MSC 2010 subject classifications: Primary 62E99.
Keywords and phrases: Compound Wishart distribution, correlated sample covariance matrix, concentration of Gaussian measure, sample complexity.

1. Introduction

For a real $p \times n$ matrix $X$ consisting of normally distributed elements, the $p \times p$ Wishart matrix, defined as $W = XX^T$, was introduced by Wishart [1], who also derived the law of its distribution. The Wishart law is of primary importance to statistics, see e.g. [2, 3]. In particular, the Wishart law describes exactly the distribution of the sample covariance matrix in Gaussian populations. As a natural generalization, compound Wishart matrices were introduced by Speicher [4].

Definition 1. (Compound Wishart Matrix) Let $X_i \sim \mathcal{N}(0, \Theta)$, $i = 1, \dots, n$, be independent and belong to $\mathbb{R}^p$, where $\Theta$ is a $p \times p$ real positive definite matrix, and let $B$ be an arbitrary real $n \times n$ matrix. We say that a random $p \times p$ matrix $W$ is compound Wishart with shape parameter $B$ and scale parameter $\Theta$ if

$$W = XBX^T, \qquad (1.1)$$

where $X = [X_1, \dots, X_n]$. We write $W \sim \mathcal{W}(\Theta, B)$.

Remark 1. It will sometimes be instructive to use the representation $W = \Theta^{1/2} Y B Y^T \Theta^{1/2}$, where $Y = [Y_1, \dots, Y_n]$ and $Y_i \sim \mathcal{N}(0, I)$.
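To make Definition 1 concrete, here is a minimal NumPy sketch of drawing a single compound Wishart matrix; the function name and all parameter values are illustrative choices, not part of the paper.

```python
import numpy as np

def compound_wishart(theta, b, rng):
    """Draw W = X B X^T with independent columns X_i ~ N(0, theta) (Definition 1)."""
    p, n = theta.shape[0], b.shape[0]
    # Sample X = L Y with Y standard Gaussian; any square root of theta
    # works for sampling, mirroring Remark 1.
    L = np.linalg.cholesky(theta)
    y = rng.standard_normal((p, n))
    x = L @ y
    return x @ b @ x.T

rng = np.random.default_rng(0)
p, n = 4, 100
theta = np.eye(p) + 0.3 * np.ones((p, p))     # an arbitrary positive definite scale
b = np.diag(rng.uniform(0.5, 1.5, size=n))    # an arbitrary (here diagonal) shape
w = compound_wishart(theta, b, rng)
print(w.shape)   # (4, 4)
```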

When $B$ is positive definite, [5] interprets $W$ as a scaled sample covariance matrix under correlated sampling; a similar interpretation for the complex case appears in [6]. The usual real Wishart matrices correspond to the choice $B = I$. The Wishart distribution and its generalization to the case of a positive definite matrix $B$ are widely used in economics, in particular in portfolio allocation theory [7]. When applying representation-theoretic machinery to the study of symmetries of normally distributed random vectors (see [8] for an example of such a setting), we encountered the compound Wishart distribution with a skew-symmetric $B$ of the form (2.2), and this problem motivated the current study.

Most of the literature concerning the Wishart distribution deals with the asymptotic, $n \to \infty$, and double-asymptotic, $n, p \to \infty$, regimes. In particular, a version of the Marchenko-Pastur law was generalized to this case; see [9] and references therein for a wide survey of the asymptotic behavior of compound Wishart matrices. In a different line of research, recent years have seen a number of prominent achievements in random matrix theory concerning the non-asymptotic regime. In particular, the Bernstein inequality and other concentration inequalities [10, 11] were generalized to the matrix case using Stein's method of exchangeable pairs and Lieb's theorem. The finite-sample performance of the sample covariance matrix has also been investigated in depth for a large class of distributions; see [12, 13, 14] and references therein.

Closely related to the Wishart family of distributions is partial estimation of covariance matrices by the sample covariance in Gaussian models. In particular, matrices of the form $M \circ XX^T$, where $\circ$ denotes the Hadamard (entry-wise) product of matrices, became attractive due to advances in structured covariance estimation [15, 16, 17, 18]. The matrix $M$ represents the a priori knowledge about the structure of the true covariance matrix in the form of a mask. The most widespread examples of such assumptions are banding, tapering, and thresholding, which assume that the elements of the mask $M$ belong to the interval $[0, 1]$ (a small numerical illustration is given in the sketch below). The non-asymptotic behavior of such masked sample covariance matrices was investigated in [19, 20]. It has been shown that the sample complexity (the minimal number of samples needed to achieve some predefined accuracy with a stated probability) is proportional to $m \log^c(2p)$, where $m$ is a sparsity parameter of $M$ and $c \geq 1$.

The main purpose of this contribution is to demonstrate non-asymptotic results concerning the average deviations of compound Wishart matrices from their mean. We actually consider two closely related problems.

A Single Compound Wishart Matrix. The first problem can be roughly described as follows: given the dimension $p$ and the shape matrix $B$, what accuracy can one obtain on average when approximating the expected value of the corresponding Wishart matrix by its empirical measurement?

A Sequence of Compound Wishart Matrices. Another setting arises from a different approach to the problem: assume we are given a sequence of matrices $B_n$ satisfying some assumptions stated below, and that the dimension $p$ is fixed. This data provides us with a sequence of Wishart matrices $W_n = XB_nX^T$, where the number of columns of $X$ changes appropriately (an exact definition is provided below). Assume in addition that all these Wishart matrices have a common expectation $W_0$. The natural question is how many measurements $n$ one needs to collect in order to estimate the mean value $W_0$ accurately.
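Before continuing, a side illustration of the masked estimators $M \circ XX^T$ discussed above; the banding mask and all dimensions here are my own illustrative assumptions, not taken from [19, 20].

```python
import numpy as np

def banding_mask(p, k):
    """Mask with M_ij = 1 if |i - j| <= k and 0 otherwise (a banding assumption)."""
    idx = np.arange(p)
    return (np.abs(idx[:, None] - idx[None, :]) <= k).astype(float)

rng = np.random.default_rng(1)
p, n, k = 6, 200, 1
x = rng.standard_normal((p, n))               # n standard Gaussian samples
sample_cov = x @ x.T / n
masked_cov = banding_mask(p, k) * sample_cov  # Hadamard product M o (XX^T / n)
print(masked_cov)
```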

Both of these problems arise in different areas of research, and non-asymptotic analysis is often required. The problem becomes especially critical when the values $p$ and $n$ are of the same order of magnitude. Below we provide a theorem answering the two posed questions. Although the result obtained in Corollary 2 concerns the case of fixed dimension $p$, it can be extended to a sequence $W_n^p$, where the dimension $p = p(n)$ varies with $n$, while keeping the spectral properties of the sequence of corresponding covariance matrices $\Theta_n$ controlled. In particular, a partial answer to the second problem can be formulated as follows: a number of samples proportional to $p \ln^4(2p)$ is needed to accurately estimate the expectation of the compound Wishart matrix.

The rest of the text is organized as follows. After the notations section we provide additional definitions and the statements of the results. Then a few examples demonstrating the applications are given. The proof of the theorem concludes the paper.

Notations

Capital last letters of the English alphabet ($W, X, Y, Z$) denote random entities; all other letters stand for deterministic entities. For an arbitrary rectangular matrix $A$, $\|A\|$ denotes its spectral norm and $\|A\|_F = \sqrt{\operatorname{Tr}(AA^T)}$ stands for its Frobenius norm. For two vectors $v, u$ lying in a Euclidean space, $(u, v)$ denotes their scalar product and $\|v\|_2$ the corresponding length. $I_m$ stands for the $m \times m$ identity matrix; when the dimension is obvious from the context the subscript is omitted. By $\mathcal{P}(p)$ we denote the open cone of $p \times p$ positive definite matrices. For brevity we write $[p] = \{1, \dots, p\}$, where $p \in \mathbb{N}$.

2. Problem Formulation and the Main Results

In addition to the definitions given above, we define the notion of a sequence of Wishart matrices, corresponding to the second question stated in the Introduction.

Definition 2. (Sequence of Wishart Matrices) Consider a sequence $\{B_n\}_{n=n_0}^{n_1}$ of real $n \times n$ deterministic matrices, where $n_0 \in \mathbb{N}$ and $n_1 \in \mathbb{N} \cup \{\infty\}$. Let $\Theta \in \mathcal{P}(p)$ and, for every $n = n_0, \dots, n_1$, let $X_i \sim \mathcal{N}(0, \Theta)$, $i = 1, \dots, n$, be independent. Then

$$W_n = XB_nX^T, \quad n = n_0, \dots, n_1, \qquad (2.1)$$

where $X = [X_1, \dots, X_n]$, is a sequence of compound Wishart matrices.
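A minimal sketch of Definition 2; the helper `wishart_sequence` is my own naming, and it draws a fresh $X$ (with the appropriate number of columns) for each shape matrix $B_n$.

```python
import numpy as np

def wishart_sequence(theta, shape_matrices, rng):
    """Yield W_n = X B_n X^T for a given sequence of shape matrices B_n (Definition 2)."""
    L = np.linalg.cholesky(theta)
    for b in shape_matrices:
        n = b.shape[0]
        x = L @ rng.standard_normal((theta.shape[0], n))  # fresh X for each n
        yield x @ b @ x.T

rng = np.random.default_rng(8)
p = 3
theta = np.eye(p)
# Example: diagonal shape matrices with Tr(B_n) = 1, for n = 10, 20, 40.
bs = [np.eye(n) / n for n in (10, 20, 40)]
for w in wishart_sequence(theta, bs, rng):
    print(np.round(w, 2))
```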

The same Remark 1 as above applies here as well. Note also that the dimension of $X$ depends on $n$, but this is not reflected by an additional subscript. To make this definition useful and meaningful we will have to make some assumptions on the sequence $\{W_n\}_{n \in S}$. In particular, we first want to answer the following question: what properties should we require of the sequence $\{B_n\}_{n \in S}$ to make the sequence $\{W_n\}_{n \in S}$ interesting to investigate? Below we fix the dimension $p$ and refer to the vectors $X_i$, $i = 1, \dots, n$, as measurements. So what actually changes from one matrix to another in the sequence $W_n$ is the underlying matrix $B_n$ and the respective number of measurements. Examples of sequences $\{B_n\}_{n \in S}$ are the following:

The most widely used is the sequence of diagonal matrices $B_n = \operatorname{diag}\{b_1, \dots, b_n\}$, $n \in S$. When $\operatorname{Tr}(B_n) = 1$ the expectation of the Wishart sequence coincides with the covariance matrix $\Theta$, as shown below.

Another common example is a sequence of skew-symmetric matrices of the form

$$B_n = \begin{pmatrix} 0 & I_{n/2} \\ -I_{n/2} & 0 \end{pmatrix}, \qquad (2.2)$$

where $n$ is assumed even. We encountered this case when investigating the group symmetry properties of sample covariance matrices.

In order to generalize the properties of $\{B_n\}_{n \in S}$ that we encountered in the application, we state an additional auxiliary result that we did not find in the literature.

Lemma 1. Let $B$ be a real $n \times n$ matrix and let $X_i \sim \mathcal{N}(0, I)$, $i = 1, \dots, n$, be independent. Then, for $X = [X_1, \dots, X_n]$ and $W = XBX^T$,

$$\mathbb{E}(W) = \operatorname{Tr}(B)\, I. \qquad (2.3)$$

Proof. Denote the expectation $\mathbb{E}(W)$ by $W^0$ and consider its elements:

$$W^0_{ij} = \mathbb{E} \sum_{k,l=1}^{n} X_{ik} B_{kl} X_{jl}. \qquad (2.4)$$

As all the $X_{ik}$, $i = 1, \dots, p$, $k = 1, \dots, n$, are independent, we get immediately that

$$W^0_{ij} = \sum_{k,l=1}^{n} B_{kl}\, \mathbb{E}(X_{ik} X_{jl}) = 0, \quad i \neq j, \qquad (2.5)$$

$$W^0_{ii} = \sum_{k,l=1}^{n} B_{kl}\, \mathbb{E}(X_{ik} X_{il}) = \sum_{k=1}^{n} B_{kk}\, \mathbb{E}(X_{ik}^2) = \operatorname{Tr}(B), \qquad (2.6)$$

and the statement follows.

Corollary 1. Let $B$ be a real $n \times n$ matrix, $\Theta \in \mathcal{P}(p)$, and let $X_i \sim \mathcal{N}(0, \Theta)$, $i = 1, \dots, n$, be independent. Then, for $X = [X_1, \dots, X_n]$ and $W = XBX^T$,

$$\mathbb{E}(W) = \operatorname{Tr}(B)\, \Theta. \qquad (2.7)$$
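Corollary 1 is easy to sanity-check by Monte Carlo; in the sketch below all sizes and tolerances are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, trials = 3, 50, 20000
theta = np.array([[2.0, 0.5, 0.0],
                  [0.5, 1.0, 0.3],
                  [0.0, 0.3, 1.5]])
b = np.eye(n) + 0.5 * rng.standard_normal((n, n))   # arbitrary real shape matrix
L = np.linalg.cholesky(theta)

acc = np.zeros((p, p))
for _ in range(trials):
    x = L @ rng.standard_normal((p, n))   # columns X_i ~ N(0, theta)
    acc += x @ b @ x.T
# Empirical mean should approach Tr(B) * theta (Corollary 1).
print(np.allclose(acc / trials, np.trace(b) * theta, rtol=0.1, atol=1.0))
```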

In particular, Lemma 1 implies that if $n_1 = \infty$, then to ensure that the sequence $\{W_n\}$ is consistent we should require $\operatorname{Tr}(B_n) \to \beta \in \mathbb{R}$. We actually make a stronger

Assumption 1. The traces $\operatorname{Tr}(B_n)$ are all equal:

$$\operatorname{Tr}(B_n) = \beta, \quad n = n_0, \dots, n_1. \qquad (2.8)$$

Since we can rescale the sequence $\{W_n\}_{n=n_0}^{n_1}$, without loss of generality we set $\operatorname{Tr}(B_n) = 1$.

3. The Main Results

In this section we state the main results of this paper.

Theorem 1. Let $\Theta \in \mathcal{P}(p)$ and let $X_i \sim \mathcal{N}(0, \Theta)$, $i = 1, \dots, n$, be independent. Let $B$ be an arbitrary real $n \times n$ matrix and denote $\kappa = \sqrt{n}\,\|B\|_F$, $\sigma = n\,\|B\|$. Then

$$\mathbb{E}\|W - W_0\| \leq 24\lceil \ln 2p \rceil^2\, \frac{\sqrt{p}\,(4\sigma + \kappa\sqrt{\pi})}{\sqrt{n}}\, \|\Theta\|.$$

Proof. The proof is covered by Sections 4 and 5.

Corollary 2. Let $\Theta \in \mathcal{P}(p)$ and let $\{B_n\}_{n \in S}$, $B_n \in \mathbb{R}^{n \times n}$, where $S \subset \mathbb{N}$ is ordered. For every $n = n_0, \dots, n_1$ let $X_i \sim \mathcal{N}(0, \Theta)$, $i = 1, \dots, n$, be independent. Assume that $\operatorname{Tr}(B_n) = 1$, $n = n_0, \dots, n_1$, and denote $\kappa = \max_{n_0 \leq n \leq n_1} \sqrt{n}\,\|B_n\|_F$ and $\sigma = \max_{n_0 \leq n \leq n_1} n\,\|B_n\|$. Then

$$\mathbb{E}\|W_n - W_0\| \leq 24\lceil \ln 2p \rceil^2\, \frac{\sqrt{p}\,(4\sigma + \kappa\sqrt{\pi})}{\sqrt{n}}\, \|\Theta\|.$$

4. Preliminaries

4.1. Proof outline

In the rest of this paper we prove Theorem 1 and Corollary 2. We shall observe that for a compound Wishart matrix $W$, the quadratic form $(Wx, y)$ is a Gaussian chaos (defined below) for fixed unit vectors $x$ and $y$ on the unit sphere $S^{p-1}$. We control the chaos uniformly over all $x, y$ by establishing concentration inequalities depending on the sparsity of $x$ and $y$. We do so by applying the techniques of decoupling and conditioning, and by applying concentration bounds for the Gaussian measure. After this we make use of covering arguments to count the sparse vectors $x, y$ on the sphere $S^{p-1}$. The general layout of the proof follows [19], while we modify and generalize their results.
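As a rough numerical illustration of Theorem 1 (under the reconstruction of the bound given above; the constants are very conservative), the following sketch compares a Monte Carlo estimate of $\mathbb{E}\|W - W_0\|$ with the right-hand side for the choice $B = I/n$, $\Theta = I$. All sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, trials = 5, 200, 500
b = np.eye(n) / n                        # Tr(B) = 1, so W_0 = Theta = I
w0 = np.eye(p)

devs = []
for _ in range(trials):
    x = rng.standard_normal((p, n))      # Theta = I, so X is standard Gaussian
    devs.append(np.linalg.norm(x @ b @ x.T - w0, 2))
empirical = np.mean(devs)

kappa = np.sqrt(n) * np.linalg.norm(b, 'fro')    # = 1 for B = I/n
sigma = n * np.linalg.norm(b, 2)                 # = 1 for B = I/n
bound = 24 * np.ceil(np.log(2 * p)) ** 2 * np.sqrt(p) \
        * (4 * sigma + kappa * np.sqrt(np.pi)) / np.sqrt(n)
print(empirical, bound)   # the empirical mean deviation sits far below the bound
```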

4.2. Decoupling

We start by considering bilinear forms in normally distributed vectors. The following definition will be useful in the sequel.

Definition 3. Let $Z \in \mathbb{R}^p$ be a centered Gaussian random vector and $B$ a square $p \times p$ matrix; then the bilinear form $(BZ, Z)$ is called a quadratic Gaussian chaos.

Lemma 2. (Decoupling of Gaussian Chaos) Let $Z \in \mathbb{R}^p$ be a centered normal random vector, let $Z'$ be its independent copy, and let $\mathcal{B}$ be a family of $p \times p$ matrices. Then

$$\mathbb{E} \sup_{B \in \mathcal{B}} \left[ (BZ, Z) - \mathbb{E}(BZ, Z) \right] \leq 2\, \mathbb{E} \sup_{B \in \mathcal{B}} (BZ, Z'). \qquad (4.1)$$

Proof. Without loss of generality assume $Z$ is standard; otherwise replace $B$ with $\Theta^{1/2} B \Theta^{1/2}$, where $\Theta$ is the covariance matrix of $Z$, and follow the same reasoning. We have

$$E := \mathbb{E}_Z \sup_B \left[ (BZ, Z) - \mathbb{E}(BZ, Z) \right] = \mathbb{E}_Z \sup_B \left[ (BZ, Z) - \mathbb{E}_{Z'}(BZ', Z') \right] \leq \mathbb{E}_{Z,Z'} \sup_B \left[ (BZ, Z) - (BZ', Z') \right],$$

where the equality is due to the fact that the distributions of $Z$ and $Z'$ are identical and the inequality is due to Jensen. In the calculation above we emphasize explicitly the variables of integration in the expectations to make the transitions clear. For an arbitrary $B$ note the identity

$$(BZ, Z) - (BZ', Z') = \left( B\,\frac{Z + Z'}{\sqrt{2}},\; \frac{Z - Z'}{\sqrt{2}} \right) + \left( B\,\frac{Z - Z'}{\sqrt{2}},\; \frac{Z + Z'}{\sqrt{2}} \right). \qquad (4.2)$$

By the rotation invariance of the standard Gaussian measure, the pair $\left( \frac{Z+Z'}{\sqrt{2}}, \frac{Z-Z'}{\sqrt{2}} \right)$ is distributed identically to $(Z, Z')$, hence we conclude that

$$E \leq \mathbb{E}_{Z,Z'} \sup_B \left[ (BZ, Z') + (BZ', Z) \right] \leq \mathbb{E}_{Z,Z'} \sup_B (BZ, Z') + \mathbb{E}_{Z,Z'} \sup_B (BZ', Z) = 2\, \mathbb{E}_{Z,Z'} \sup_B (BZ, Z'), \qquad (4.3)$$

and the statement follows.

Lemma 3. Let $X_1, \dots, X_n, X'_1, \dots, X'_n \sim \mathcal{N}(0, \Theta)$, $\Theta \in \mathcal{P}(p)$, be all independent. Consider the compound Wishart matrix and its decoupled counterpart

$$W = XBX^T, \quad W' = X'BX^T. \qquad (4.4)$$

Denote $W_0 = \mathbb{E}(W) = \operatorname{Tr}(B)\,\Theta$. Then

$$\mathbb{E}\|W - W_0\| \leq 2\, \mathbb{E}\|W'\|. \qquad (4.5)$$

Proof. Using the definition of the spectral norm we obtain

$$\mathbb{E}\|W - W_0\| = \mathbb{E} \sup_{x, y \in S^{p-1}} \left[ (Wx, y) - \mathbb{E}(Wx, y) \right]. \qquad (4.6)$$

Rewrite the inner product as

$$(Wx, y) = x^T W^T y = \operatorname{Tr}\left( XB^TX^T yx^T \right) = \operatorname{vec}(X)^T \left[ B^T \otimes (yx^T) \right] \operatorname{vec}(X) = \left( \left[ B^T \otimes (yx^T) \right] \operatorname{vec}(X),\; \operatorname{vec}(X) \right), \qquad (4.7)$$

where $\operatorname{vec}(X)$ is obtained from the matrix $X$ by stacking its columns into a single vector of height $pn$ and $\otimes$ denotes the Kronecker product. Since $\operatorname{vec}(X)$ is Gaussian, the right-hand side of the last equality is a quadratic Gaussian chaos and Lemma 2 applies.
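A quick Monte Carlo check of the decoupling inequality (4.5); for generic $B$ the factor-of-two bound should hold with a wide margin. The sizes below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, trials = 4, 60, 5000
b = rng.standard_normal((n, n))
w0 = np.trace(b) * np.eye(p)             # E(W) = Tr(B) * Theta with Theta = I

lhs = rhs = 0.0
for _ in range(trials):
    x = rng.standard_normal((p, n))
    x2 = rng.standard_normal((p, n))     # independent copy X'
    lhs += np.linalg.norm(x @ b @ x.T - w0, 2)
    rhs += np.linalg.norm(x2 @ b @ x.T, 2)
print(lhs / trials, 2 * rhs / trials)    # Lemma 3: E||W - W0|| <= 2 E||W'||
```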

4.3. Concentration

Lemma 4. Let $Z = (Z_1, \dots, Z_p)^T \sim \mathcal{N}(0, \Theta)$, $\Theta \in \mathcal{P}(p)$, and let $a = (a_1, \dots, a_p)^T \in \mathbb{R}^p$. Then $(a, Z)$ is a centered normal variable with standard deviation $\|\Theta^{1/2}a\|_2 \leq \|\Theta\|^{1/2}\,\|a\|_2$.

Next we state an auxiliary result from the theory of concentration of the Gaussian measure. Such concentration results are usually stated in terms of the standard normal distribution, but they can be easily generalized to an arbitrary normal distribution as follows.

Lemma 5. [21] Let $f: \mathbb{R}^p \to \mathbb{R}$ be a Lipschitz function with respect to the Euclidean metric with constant $L = \|f\|_L$ and let $Z \sim \mathcal{N}(0, \Theta)$, $\Theta \in \mathcal{P}(p)$. Then

$$\mathbb{P}\left( f(Z) \geq \mathbb{E}f(Z) + t \right) \leq \frac{1}{2} \exp\left( -\frac{t^2}{2L^2\|\Theta\|} \right), \quad t \geq 0.$$

4.4. Discretization

Recall that, given a Euclidean $p$-dimensional space and an operator over it with matrix $A$ in some orthonormal basis, the spectral norm of this operator (and of the matrix $A$) is defined as $\|A\| = \max_{x, y \in S^{p-1}} (Ax, y)$. We approximate the spectral norm of matrices by using $\varepsilon$-nets in the following way:

Lemma 6. [14] Let $A$ be a $p \times p$ matrix and let $N$ be a $\delta$-net of the sphere $S^{p-1}$ in the Euclidean space for some $\delta \in [0, 1)$. Then

$$\|A\| \leq \frac{1}{(1 - \delta)^2} \max_{x, y \in N} (Ax, y).$$
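Lemma 6 is easy to visualize in the plane, where an explicit $\delta$-net of $S^1$ can be built by discretizing the angle; this particular net construction is mine, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(5)
p, delta = 2, 0.1
# On S^1, points spaced by angle 2*arcsin(delta/2) are within chord distance
# delta of any point on the circle, hence they form a delta-net.
step = 2 * np.arcsin(delta / 2)
angles = np.arange(0.0, 2 * np.pi, step)
net = np.stack([np.cos(angles), np.sin(angles)], axis=1)

a = rng.standard_normal((p, p))
net_max = np.max(net @ a @ net.T)        # max of (Ax, y) over net points x, y
print(np.linalg.norm(a, 2) <= net_max / (1 - delta) ** 2)   # Lemma 6 bound
```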

Following [19], we introduce the notion of coordinate-wise sparse regular vectors.

Definition 4. For $s \in [p]$, the subset of regular vectors of the unit sphere $S^{p-1}$ is defined as

$$\operatorname{Reg}_p(s) = \left\{ x = (x_1, \dots, x_p)^T \in S^{p-1} : x_i^2 \in \{0, 1/s\},\ i = 1, \dots, p \right\}, \qquad \operatorname{Reg}_p = \bigcup_{s \in [p]} \operatorname{Reg}_p(s).$$

Lemma 7. [19] Let $A$ be a $p \times p$ matrix. Then

$$\|A\| \leq 12 \lceil \ln 2p \rceil^2 \max_{x, y \in \operatorname{Reg}_p} (Ax, y).$$

Proof. The proof can be found in [19]; it uses the regular vectors to construct a specific $\delta$-net and obtain the bound given in the statement.

5. The Proofs of Theorem 1 and Corollary 2

We partition the proof into a few parts.

5.1. Decoupling and Conditioning

Using Remark 1 we can rescale the random vectors and assume without loss of generality that $\Theta = I$. Now by Lemma 3 it suffices to estimate $\mathbb{E}\|W'\|$. From Lemma 7 we get that

$$\mathbb{P}(\|W'\| \geq t) \leq \mathbb{P}\left( 12 \lceil \ln 2p \rceil^2 \max_{x, y \in \operatorname{Reg}_p} (W'x, y) \geq t \right). \qquad (5.1)$$

Write the inner product coordinate-wise and rearrange the summands to obtain

$$(W'x, y) = \sum_{l=1}^{n} \sum_{i=1}^{p} y_i X'_{il} \sum_{m=1}^{n} \sum_{j=1}^{p} b_{lm} x_j X_{jm}. \qquad (5.2)$$

We now fix $x$ and $y$ and condition on the variables $X_{jm}$, $j = 1, \dots, p$, $m = 1, \dots, n$, so that expression (5.2) defines a centered normal random variable in the entries of $X'$. We want to estimate its standard deviation with the help of Lemma 4. Since we have assumed $\Theta = I$, the covariance matrix of $\operatorname{vec}(X')$ is also the identity matrix. Then Lemma 4 implies that (5.2) is a centered normal variable with standard deviation at most $\sigma_x(X)\,\|y\|_\infty$, where

$$\sigma_x(X) = \left[ \sum_{l=1}^{n} \sum_{i=1}^{p} \left( \sum_{m=1}^{n} \sum_{j=1}^{p} b_{lm} x_j X_{jm} \right)^2 \right]^{1/2} = \sqrt{p} \left[ \sum_{l=1}^{n} \left( \sum_{m=1}^{n} \sum_{j=1}^{p} b_{lm} x_j X_{jm} \right)^2 \right]^{1/2}.$$

We need to bound this quantity uniformly with respect to $x$.
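The conditioning step can be verified numerically: with $X$ frozen, $(W'x, y)$ should be centered normal with standard deviation $\|y\|_2\,\|BX^Tx\|_2$, which is at most $\sigma_x(X)\,\|y\|_\infty$. The sketch below (all sizes illustrative) checks this.

```python
import numpy as np

rng = np.random.default_rng(7)
p, n, trials = 4, 30, 50000
b = rng.standard_normal((n, n))
x = np.zeros(p); x[0] = 1.0                 # x in Reg_p(1)
y = np.full(p, 1 / np.sqrt(p))              # y in Reg_p(p)
X = rng.standard_normal((p, n))             # one fixed realization of X

vals = np.array([y @ (rng.standard_normal((p, n)) @ b @ X.T) @ x
                 for _ in range(trials)])   # (W'x, y) with X fixed, X' fresh
exact_std = np.linalg.norm(b @ X.T @ x)     # = ||y||_2 ||B X^T x||_2, ||y||_2 = 1
print(np.isclose(vals.std(), exact_std, rtol=0.05))
```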

5.2. Concentration

Let $x \in \operatorname{Reg}_p$; we estimate $\sigma_x(X)$ using concentration in Gauss space, Lemma 5. Due to Jensen's inequality,

$$\mathbb{E}\sigma_x(X) \leq \left( \mathbb{E}\sigma_x(X)^2 \right)^{1/2} = \sqrt{p} \left[ \mathbb{E} \sum_{l=1}^{n} \left( \sum_{m=1}^{n} \sum_{j=1}^{p} b_{lm} x_j X_{jm} \right)^2 \right]^{1/2} = \sqrt{p} \left[ \sum_{l=1}^{n} \sum_{m=1}^{n} \sum_{j=1}^{p} b_{lm}^2 x_j^2 \right]^{1/2} = \sqrt{p}\, \|B\|_F, \qquad (5.3)$$

where we used the independence of the $X_{jm}$ and the normalization $\sum_j x_j^2 = 1$.

Now we consider $\sigma_x: \mathbb{R}^{pn} \to \mathbb{R}$ as a function of $\operatorname{vec}(X)$ and compute its Lipschitz constant with respect to the Euclidean metric on $\mathbb{R}^{pn}$. Note that the Euclidean norm on this space coincides with the Frobenius norm on the linear space of $p \times n$ matrices. We have

$$\sigma_x(X) = \sqrt{p} \left[ \sum_{l=1}^{n} \left( \sum_{j=1}^{p} x_j \sum_{m=1}^{n} b_{lm} X_{jm} \right)^2 \right]^{1/2} = \sqrt{p}\, \|BX^Tx\|_2 \leq \sqrt{p}\, \|B\|\, \|X\| \leq \sqrt{p}\, \|B\|\, \|X\|_F, \qquad (5.4-5.5)$$

so for the Lipschitz constant we obtain

$$\|\sigma_x\|_L \leq \sqrt{p}\, \|B\|. \qquad (5.6)$$

Lemma 5 now implies that for all $x \in \operatorname{Reg}_p$ and $t \geq 0$,

$$\mathbb{P}\left( \sigma_x(X) \geq \sqrt{p}\, \|B\|_F + t \right) \leq \frac{1}{2} \exp\left( -\frac{t^2}{2p\|B\|^2} \right). \qquad (5.7)$$
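A numerical look at the concentration bound (5.7), with all sizes illustrative; the empirical tail frequency should fall below the Gaussian-concentration estimate.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, trials = 5, 40, 20000
b = rng.standard_normal((n, n))
x = np.zeros(p); x[:2] = 1 / np.sqrt(2)      # a vector in Reg_p(2)

# sigma_x(X) = sqrt(p) * ||B X^T x||_2, a Lipschitz function of vec(X).
samples = np.array([
    np.sqrt(p) * np.linalg.norm(b @ (rng.standard_normal((p, n)).T @ x))
    for _ in range(trials)
])
L = np.sqrt(p) * np.linalg.norm(b, 2)        # Lipschitz constant from (5.6)
t = 2.0 * L
thresh = np.sqrt(p) * np.linalg.norm(b, 'fro') + t
print(np.mean(samples >= thresh), 0.5 * np.exp(-t**2 / (2 * L**2)))
```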

5.3. Union bounds

We return to the estimation of the random variable $(W'x, y)$. Let us fix $u \geq 1$; for every $x \in \operatorname{Reg}_p$ we consider the event

$$E_x = \left\{ \sigma_x(X) \leq \sqrt{p}\, \|B\|_F + u\sqrt{p}\, \|B\| \right\}.$$

By (5.7) we have

$$\mathbb{P}(E_x^c) \leq \frac{1}{2} \exp(-u^2/2). \qquad (5.8)$$

Note that $\sigma_x(X)$, and thus $E_x$, are independent of $X'$. Let now $x \in \operatorname{Reg}_p(r)$ and $y \in \operatorname{Reg}_p(s)$. As we have observed above, conditioned on a realization of $X$ satisfying $E_x$, the random variable $(W'x, y)$ is distributed as a centered normal random variable $h$ whose standard deviation is bounded by

$$\sigma_x(X)\, \|y\|_\infty \leq \sqrt{\frac{p}{s}} \left( \|B\|_F + u\|B\| \right) =: \bar{\sigma}.$$

Then by the usual tail estimate for Gaussian random variables we have

$$\mathbb{P}\left( (W'x, y) \geq \varepsilon \mid E_x \right) \leq \frac{1}{2} \exp(-\varepsilon^2/2\bar{\sigma}^2).$$

Choose $\varepsilon = u\bar{\sigma}$ to obtain

$$\mathbb{P}\left( (W'x, y) \geq \varepsilon \mid E_x \right) \leq \frac{1}{2} \exp(-u^2/2), \quad x \in \operatorname{Reg}_p(r),\ y \in \operatorname{Reg}_p(s).$$

We would like to take the union bound in this estimate over all $y \in \operatorname{Reg}_p(s)$ for a fixed $s$. Note that

$$|\operatorname{Reg}_p(s)| = 2^s \binom{p}{s} \leq \exp\left( s \ln(2ep/s) \right), \qquad (5.9)$$

as there are exactly $\binom{p}{s}$ possibilities to choose the support and $2^s$ ways to choose the signs of the coefficients of a vector in $\operatorname{Reg}_p(s)$. Thus

$$\mathbb{P}\left( \max_{y \in \operatorname{Reg}_p(s)} (W'x, y) \geq \varepsilon \;\Big|\; E_x \right) \leq \frac{1}{2} \exp\left( s \ln(2ep/s) - u^2/2 \right);$$

in order for this bound to be non-trivial we assume $u \geq \sqrt{2s\ln(2ep/s)}$. Now, using (5.8), we obtain

$$\mathbb{P}\left( \max_{y \in \operatorname{Reg}_p(s)} (W'x, y) \geq \varepsilon \right) \leq \mathbb{P}\left( \max_{y \in \operatorname{Reg}_p(s)} (W'x, y) \geq \varepsilon \;\Big|\; E_x \right) + \mathbb{P}(E_x^c) \leq \frac{1}{2} \exp\left( s \ln(2ep/s) - u^2/2 \right) + \frac{1}{2}\exp(-u^2/2) \leq \exp\left( s \ln(2ep/s) - u^2/2 \right). \qquad (5.10)$$
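The cardinality bound (5.9) can be checked by brute-force enumeration of $\operatorname{Reg}_p(s)$ for small $p$ and $s$; the enumeration helper below is my own.

```python
import numpy as np
from itertools import combinations, product

def regular_vectors(p, s):
    """Enumerate Reg_p(s): unit vectors with s nonzero coordinates, all +-1/sqrt(s)."""
    vecs = []
    for support in combinations(range(p), s):
        for signs in product([-1.0, 1.0], repeat=s):
            v = np.zeros(p)
            v[list(support)] = np.array(signs) / np.sqrt(s)
            vecs.append(v)
    return np.array(vecs)

p, s = 6, 2
reg = regular_vectors(p, s)
print(len(reg))                                            # 2^s * C(p, s) = 60
print(len(reg) <= np.exp(s * np.log(2 * np.e * p / s)))    # bound (5.9)
```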

5.4. Gathering the bounds

We continue with formula (5.1):

$$\mathbb{P}(\|W'\| \geq t) \leq \mathbb{P}\left( 12\lceil\ln 2p\rceil^2 \max_{x, y \in \operatorname{Reg}_p} (W'x, y) \geq t \right) \leq \mathbb{P}\left( 12\lceil\ln 2p\rceil^2 \max_{r, s \in [p]} \max_{x \in \operatorname{Reg}_p(r),\, y \in \operatorname{Reg}_p(s)} (W'x, y) \geq t \right).$$

With the help of (5.10) and the bound (5.9) on the number of points in $\operatorname{Reg}_p(r)$ we obtain

$$\mathbb{P}\left( \max_{x \in \operatorname{Reg}_p(r),\, y \in \operatorname{Reg}_p(s)} (W'x, y) \geq \varepsilon \right) \leq \exp\left( r\ln(2ep/r) + s\ln(2ep/s) - u^2/2 \right),$$

for $u \geq \sqrt{2s\ln(2ep/s)}$. The function $x \mapsto x\ln(2ep/x)$ increases monotonically on the interval $[1, p]$, so

$$k\ln(2ep/k) \leq p\ln(2ep/p) = p\ln(2e) \leq 2p, \quad k \leq p.$$

Choose $u \geq 3\sqrt{p}$ to get the bound

$$\mathbb{P}\left( \max_{r, s \in [p]} \max_{x \in \operatorname{Reg}_p(r),\, y \in \operatorname{Reg}_p(s)} (W'x, y) \geq \varepsilon \right) \leq \exp(-u^2/4).$$

Finally, replace $t$ with $12\lceil\ln 2p\rceil^2\varepsilon$ to obtain

$$\mathbb{P}\left( \|W'\| \geq 12\lceil\ln 2p\rceil^2 \varepsilon \right) \leq \exp(-u^2/4), \qquad (5.11)$$

where

$$\varepsilon = u\sqrt{p}\, \|B\|_F + u^2\sqrt{p}\, \|B\|, \quad u \geq 3\sqrt{p}. \qquad (5.12)$$

Integration of (5.11) yields

$$\mathbb{E}\|W'\| \leq 12\lceil\ln 2p\rceil^2\, \frac{\sqrt{p}\,(4\sigma + \kappa\sqrt{\pi})}{\sqrt{n}}.$$

By Lemma 3 we obtain

$$\mathbb{E}\|W - W_0\| \leq 24\lceil\ln 2p\rceil^2\, \frac{\sqrt{p}\,(4\sigma + \kappa\sqrt{\pi})}{\sqrt{n}}.$$

Now multiply $W$ by $\Theta^{1/2}$ from left and right to undo the scaling and get the statement of Theorem 1.

For Corollary 2, assume we are given a sequence $\{B_n\}_{n \in S}$, $B_n \in \mathbb{R}^{n \times n}$, with $\operatorname{Tr}(B_n) = 1$. Applying Theorem 1 to each member of the sequence, for every corresponding Wishart matrix

$$\mathbb{E}\|W_n - W_0\| \leq 24\lceil\ln 2p\rceil^2\, \frac{\sqrt{p}\,(4\sigma_n + \kappa_n\sqrt{\pi})}{\sqrt{n}}\, \|\Theta\|,$$

where $\kappa_n = \sqrt{n}\,\|B_n\|_F$ and $\sigma_n = n\,\|B_n\|$. By bounding the values $\kappa_n \leq \kappa$ and $\sigma_n \leq \sigma$ from above, we get the desired inequality

$$\mathbb{E}\|W_n - W_0\| \leq 24\lceil\ln 2p\rceil^2\, \frac{\sqrt{p}\,(4\sigma + \kappa\sqrt{\pi})}{\sqrt{n}}\, \|\Theta\|.$$

6. Acknowledgment

The author is grateful to Dmitry Trushin for discussions of the proof and useful suggestions.

References

[1] J. Wishart. The generalised product moment distribution in samples from a normal multivariate population. Biometrika, 20A(1/2):32-52, 1928.
[2] T. W. Anderson. An Introduction to Multivariate Statistical Analysis. Wiley, 2003.
[3] R. J. Muirhead. Aspects of Multivariate Statistical Theory. Wiley, 1982.

[4] R. Speicher. Combinatorial theory of the free product with amalgamation and operator-valued free probability theory. Memoirs of the American Mathematical Society, 627, 1998.
[5] Z. Burda, A. Jarosz, M. A. Nowak, J. Jurkiewicz, G. Papp, and I. Zahed. Applying free random variables to random matrix analysis of financial data. Part I: The Gaussian case. Quantitative Finance, 11(7), 2011.
[6] C.-N. Chuah, D. N. C. Tse, J. M. Kahn, and R. A. Valenzuela. Capacity scaling in MIMO wireless systems under correlated fading. IEEE Transactions on Information Theory, 48(3):637-650, 2002.
[7] B. Collins, D. McDonald, and N. Saad. Compound Wishart matrices and noisy covariance matrices: Risk underestimation. arXiv preprint.
[8] P. Shah and V. Chandrasekaran. Group symmetry and covariance regularization. In 46th Annual Conference on Information Sciences and Systems (CISS), pages 1-6, 2012.
[9] W. Bryc. Compound real Wishart and q-Wishart matrices. arXiv preprint.
[10] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389-434, 2012.
[11] L. Mackey, M. I. Jordan, R. Y. Chen, B. Farrell, and J. A. Tropp. Matrix concentration inequalities via the method of exchangeable pairs. arXiv preprint.
[12] R. Vershynin. How close is the sample covariance matrix to the actual covariance matrix? Journal of Theoretical Probability, 25(3):655-686, 2012.
[13] N. Srivastava and R. Vershynin. Covariance estimation for distributions with 2+epsilon moments. arXiv preprint.
[14] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Compressed Sensing: Theory and Applications (Y. C. Eldar and G. Kutyniok, eds.), Cambridge University Press, 2012.
[15] P. J. Bickel and E. Levina. Covariance regularization by thresholding. The Annals of Statistics, 36(6):2577-2604, 2008.
[16] P. J. Bickel and E. Levina. Regularized estimation of large covariance matrices. The Annals of Statistics, 36(1):199-227, 2008.
[17] A. J. Rothman, E. Levina, and J. Zhu. Generalized thresholding of large covariance matrices. Journal of the American Statistical Association, 104(485):177-186, 2009.
[18] T. T. Cai, C.-H. Zhang, and H. H. Zhou. Optimal rates of convergence for covariance matrix estimation. The Annals of Statistics, 38(4):2118-2144, 2010.
[19] E. Levina and R. Vershynin. Partial estimation of covariance matrices. Probability Theory and Related Fields, 153(3-4):405-419, 2012.
[20] R. Y. Chen, A. Gittens, and J. A. Tropp. The masked sample covariance estimator: an analysis via the matrix Laplace transform. Technical report, 2012.
[21] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, 1991.
