A NOTE ON SUMS OF INDEPENDENT RANDOM MATRICES AFTER AHLSWEDE-WINTER
1. The method

Ahlswede and Winter [1] proposed a new approach to deviation inequalities for sums of independent random matrices. The purpose of this note is to indicate how this method implies Rudelson's sampling theorems for random vectors in isotropic position.

Let $X_1, \dots, X_n$ be independent random $d \times d$ real matrices, and let $S_n = X_1 + \cdots + X_n$. We will be interested in the magnitude of the deviation $\|S_n - \mathbb{E} S_n\|$ in the operator norm.

1.1. Real valued random variables. The Ahlswede-Winter method [1] is parallel to the classical approach to deviation inequalities for real valued random variables, which we briefly outline. Let $X_1, \dots, X_n$ be independent mean zero random variables. We are interested in the magnitude of $S_n = \sum_i X_i$. For simplicity, we shall assume that $|X_i| \le 1$ a.s.; this hypothesis can be relaxed to some control of the moments, precisely to having a sub-exponential tail. Fix $t > 0$ and let $\lambda > 0$ be a parameter to be chosen later. We want to estimate

$$p := \mathbb{P}(S_n > t) = \mathbb{P}(e^{\lambda S_n} > e^{\lambda t}).$$

By Markov's inequality and using independence, we have

$$p \le e^{-\lambda t}\, \mathbb{E} e^{\lambda S_n} = e^{-\lambda t} \prod_i \mathbb{E} e^{\lambda X_i}.$$

Next, Taylor's expansion and the mean zero and boundedness hypotheses can be used to show that, for every $i$,

$$\mathbb{E} e^{\lambda X_i} \le e^{\lambda^2 \operatorname{Var} X_i}, \qquad 0 \le \lambda \le 1.$$

This yields

$$p \le e^{-\lambda t + \lambda^2 \sigma^2}, \qquad \text{where } \sigma^2 := \sum_i \operatorname{Var} X_i.$$

The optimal choice of the parameter $\lambda = \min(t/2\sigma^2, 1)$ implies Chernoff's inequality

$$p \le \max\big(e^{-t^2/4\sigma^2},\ e^{-t/2}\big).$$
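For completeness, here is the elementary optimization over $\lambda$ behind the last step (a routine verification spelled out for the reader). If $t \le 2\sigma^2$, the choice $\lambda = t/2\sigma^2 \le 1$ is admissible and

$$-\lambda t + \lambda^2 \sigma^2 = -\frac{t^2}{2\sigma^2} + \frac{t^2}{4\sigma^2} = -\frac{t^2}{4\sigma^2};$$

if $t > 2\sigma^2$, we take $\lambda = 1$ and

$$-t + \sigma^2 < -t + \frac{t}{2} = -\frac{t}{2}.$$

In both cases $p \le \max(e^{-t^2/4\sigma^2}, e^{-t/2})$.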
1.2. Random matrices. Now we try to generalize this method to the case where the $X_i \in \mathcal{M}_d$ are independent mean zero random matrices; here $\mathcal{M}_d$ denotes the class of symmetric $d \times d$ matrices.

Some of the matrix calculus is straightforward. Thus, for $A \in \mathcal{M}_d$, the matrix exponential $e^A$ is defined as usual by the Taylor series. Recall that $e^A$ has the same eigenvectors as $A$, with eigenvalues $e^{\lambda_i(A)}$. The partial order $A \ge B$ means $A - B \ge 0$, i.e. $A - B$ is positive semidefinite.

The non-straightforward part is that, in general, $e^{A+B} \ne e^A e^B$. However, the Golden-Thompson inequality (see [3]) states that

$$\operatorname{tr} e^{A+B} \le \operatorname{tr}(e^A e^B)$$

holds for arbitrary $A, B \in \mathcal{M}_d$ (and in fact with an arbitrary unitarily invariant norm in place of the trace). Therefore, for $S_n = X_1 + \cdots + X_n$ and $I_d$ the identity in $\mathcal{M}_d$, we have

$$p := \mathbb{P}(S_n \not\le t I_d) = \mathbb{P}(e^{\lambda S_n} \not\le e^{\lambda t} I_d) \le \mathbb{P}(\operatorname{tr} e^{\lambda S_n} > e^{\lambda t}) \le e^{-\lambda t}\, \mathbb{E} \operatorname{tr} e^{\lambda S_n}.$$

This estimate is not sharp: $e^{\lambda S_n} \not\le e^{\lambda t} I_d$ means that the biggest eigenvalue of $e^{\lambda S_n}$ exceeds $e^{\lambda t}$, while $\operatorname{tr} e^{\lambda S_n} > e^{\lambda t}$ means that the sum of all $d$ eigenvalues exceeds the same bound. This is responsible for the (sometimes inevitable) loss of the $\log d$ factor in Rudelson's selection theorem.

Since $S_n = X_n + S_{n-1}$, we can use the Golden-Thompson inequality to separate the last term from the sum:

$$\mathbb{E} \operatorname{tr} e^{\lambda S_n} \le \mathbb{E} \operatorname{tr}\big(e^{\lambda X_n} e^{\lambda S_{n-1}}\big).$$

Now, using independence, the fact that expectation and trace commute, and the bound $\operatorname{tr}(AB) \le \|A\| \operatorname{tr} B$ for $B \ge 0$, we continue to write

$$\mathbb{E} \operatorname{tr}\big(e^{\lambda X_n} e^{\lambda S_{n-1}}\big) = \mathbb{E}_{n-1} \operatorname{tr}\big((\mathbb{E}_n e^{\lambda X_n})\, e^{\lambda S_{n-1}}\big) \le \big\|\mathbb{E} e^{\lambda X_n}\big\| \cdot \mathbb{E}_{n-1} \operatorname{tr} e^{\lambda S_{n-1}},$$

where $\mathbb{E}_k$ denotes the expectation with respect to $X_k$. Continuing by induction, we arrive (since $\operatorname{tr} I_d = d$) at

$$\mathbb{E} \operatorname{tr} e^{\lambda S_n} \le d \prod_{i=1}^n \big\|\mathbb{E} e^{\lambda X_i}\big\|.$$

We have proved that

$$\mathbb{P}(S_n \not\le t I_d) \le d\, e^{-\lambda t} \prod_{i=1}^n \big\|\mathbb{E} e^{\lambda X_i}\big\|.$$

Repeating the argument for $-S_n$ and using that $-t I_d \le S_n \le t I_d$ is equivalent to $\|S_n\| \le t$, we have shown that

$$(1) \qquad \mathbb{P}(\|S_n\| > t) \le 2d\, e^{-\lambda t} \prod_{i=1}^n \big\|\mathbb{E} e^{\lambda X_i}\big\|.$$
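The Golden-Thompson inequality is easy to probe numerically. The following sketch (not part of the note; it assumes NumPy and SciPy are available) checks $\operatorname{tr} e^{A+B} \le \operatorname{tr}(e^A e^B)$ on random symmetric matrices:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
d = 5

for _ in range(1000):
    # Draw random symmetric d x d matrices A and B.
    G, H = rng.standard_normal((d, d)), rng.standard_normal((d, d))
    A, B = (G + G.T) / 2, (H + H.T) / 2
    # Golden-Thompson: tr e^{A+B} <= tr(e^A e^B).
    lhs = np.trace(expm(A + B))
    rhs = np.trace(expm(A) @ expm(B))
    assert lhs <= rhs * (1 + 1e-9), (lhs, rhs)

print("Golden-Thompson held on all 1000 random pairs")
```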
Remark. As in the real valued case, full independence is never needed in the above argument; it works equally well for martingales.

The main result:

Theorem 1 (Chernoff-type inequality). Let $X_i \in \mathcal{M}_d$ be independent mean zero random matrices with $\|X_i\| \le 1$ for all $i$ a.s. Let $S_n = X_1 + \cdots + X_n$ and $\sigma^2 = \sum_{i=1}^n \|\operatorname{Var} X_i\|$. Then for every $t > 0$ we have

$$\mathbb{P}(\|S_n\| > t) \le 2d \max\big(e^{-t^2/4\sigma^2},\ e^{-t/2}\big).$$

To prove this theorem, we have to estimate $\|\mathbb{E} e^{\lambda X_i}\|$ in (1), which is easy. For example, if $Z \in \mathcal{M}_d$ and $\|Z\| \le 1$, then the Taylor series expansion shows that $e^Z \le I_d + Z + Z^2$. Therefore, we have:

Lemma 2. Let $Z \in \mathcal{M}_d$ be a mean zero random matrix with $\|Z\| \le 1$ a.s. Then $\mathbb{E} e^Z \le e^{\operatorname{Var} Z}$.

Proof. Using the mean zero assumption, we have

$$\mathbb{E} e^Z \le \mathbb{E}(I_d + Z + Z^2) = I_d + \operatorname{Var} Z \le e^{\operatorname{Var} Z}. \qquad \Box$$

Let $0 < \lambda \le 1$. Then, by the Theorem's hypotheses and Lemma 2,

$$\big\|\mathbb{E} e^{\lambda X_i}\big\| \le \big\|e^{\lambda^2 \operatorname{Var} X_i}\big\| = e^{\lambda^2 \|\operatorname{Var} X_i\|}.$$

Hence by (1),

$$\mathbb{P}(\|S_n\| > t) \le 2d\, e^{-\lambda t + \lambda^2 \sigma^2}.$$

With the optimal choice $\lambda := \min(t/2\sigma^2, 1)$, the conclusion of the Theorem follows. $\Box$

Problem: does the Theorem hold with $\sigma^2$ replaced by $\big\|\sum_{i=1}^n \operatorname{Var} X_i\big\|$? If so, this would generalize the Pisier-Lust-Piquard non-commutative Khinchine inequality.
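Theorem 1 can be probed by a small Monte Carlo experiment. The sketch below is our own illustration (the ensemble $X_i = \varepsilon_i\, v_i v_i^{\mathsf T}$ with random signs $\varepsilon_i$ and fixed unit vectors $v_i$ is chosen for convenience and is not from the note); it compares the empirical tail of $\|S_n\|$ with the bound $2d \max(e^{-t^2/4\sigma^2}, e^{-t/2})$:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, trials = 10, 100, 2000

# Ensemble (ours, for illustration): X_i = eps_i * v_i v_i^T with
# independent random signs eps_i and fixed unit vectors v_i.
# Then E X_i = 0, ||X_i|| = 1, Var X_i = v_i v_i^T, so sigma^2 = n.
V = rng.standard_normal((n, d))
V /= np.linalg.norm(V, axis=1, keepdims=True)
sigma2 = float(n)

norms = np.empty(trials)
for k in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = (V * eps[:, None]).T @ V      # S_n = sum_i eps_i v_i v_i^T
    norms[k] = np.linalg.norm(S, 2)   # operator norm of S_n

for t in (np.sqrt(sigma2), 2 * np.sqrt(sigma2), 3 * np.sqrt(sigma2)):
    empirical = float((norms > t).mean())
    bound = 2 * d * max(np.exp(-t**2 / (4 * sigma2)), np.exp(-t / 2))
    print(f"t = {t:6.2f}  empirical tail = {empirical:.4f}  bound = {min(bound, 1.0):.4f}")
```

In this regime the bound is conservative (it only becomes nontrivial once the exponent beats $\log 2d$), which is consistent with the $\log d$ loss discussed above.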
Corollary 3. Let $X_i \in \mathcal{M}_d$ be independent random matrices with $X_i \ge 0$ and $\|X_i\| \le 1$ for all $i$ a.s. Let $S_n = X_1 + \cdots + X_n$ and $E = \big\|\sum_{i=1}^n \mathbb{E} X_i\big\|$. Then for every $\varepsilon \in (0,1)$ we have

$$\mathbb{P}(\|S_n - \mathbb{E} S_n\| > \varepsilon E) \le 2d\, e^{-\varepsilon^2 E/4}.$$

Proof. Applying the Theorem to $X_i - \mathbb{E} X_i$, we have

$$(2) \qquad \mathbb{P}(\|S_n - \mathbb{E} S_n\| > t) \le 2d \max\big(e^{-t^2/4\sigma^2},\ e^{-t/2}\big).$$

Note that $\|X_i\| \le 1$ implies

$$\operatorname{Var} X_i \le \mathbb{E} X_i^2 \le \mathbb{E}(\|X_i\|\, X_i) \le \mathbb{E} X_i.$$

Therefore, $\sigma^2 \le E$. Now we use (2) with $t = \varepsilon E$, and note that $t^2/4\sigma^2 = \varepsilon^2 E^2/4\sigma^2 \ge \varepsilon^2 E/4$. $\Box$

Remark. The hypothesis $\|X_i\| \le 1$ can be relaxed throughout to $\|X_i\|_{\psi_1} \le 1$. One just has to be more careful with the Taylor series.

2. Applications

Let $x$ be a random vector in isotropic position in $\mathbb{R}^d$, i.e. $\mathbb{E}\, x \otimes x = I_d$. Denote $\|x\|_{\psi_1} = M$. Then the random matrix $X := M^{-2}\, x \otimes x$ satisfies the hypotheses of the Corollary (see the remark below it). Clearly,

$$\mathbb{E} X = M^{-2} I_d, \qquad E = n/M^2, \qquad \mathbb{E} S_n = (n/M^2)\, I_d.$$

Then the Corollary gives

$$\mathbb{P}\big(\|S_n - \mathbb{E} S_n\| > \varepsilon \|\mathbb{E} S_n\|\big) \le 2d\, e^{-\varepsilon^2 n/4M^2}.$$

We have thus proved:

Corollary 4. Let $x$ be a random vector in isotropic position in $\mathbb{R}^d$ such that $M := \|x\|_{\psi_1} < \infty$. Let $x_1, \dots, x_n$ be independent copies of $x$. Then for every $\varepsilon \in (0,1)$ we have

$$\mathbb{P}\Big( \Big\| \frac{1}{n} \sum_{i=1}^n x_i \otimes x_i - I_d \Big\| > \varepsilon \Big) \le 2d\, e^{-\varepsilon^2 n/4M^2}.$$

The probability in the Corollary is smaller than 1 provided the number of samples satisfies $n \gtrsim \varepsilon^{-2} M^2 \log d$. This is a version of Rudelson's sampling theorem [2], in which $M$ played the role of $(\mathbb{E} \|x\|^{\log n})^{1/\log n}$.
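Corollary 4 is easy to illustrate numerically. Here is a minimal sketch (our own, not from the note), taking $x$ to be a standard Gaussian vector, which is in isotropic position and has finite $\psi_1$ norm; the operator-norm deviation should shrink as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 20

# x standard Gaussian in R^d: E[x x^T] = I_d, i.e. isotropic position.
for n in (10**2, 10**3, 10**4, 10**5):
    X = rng.standard_normal((n, d))
    emp = X.T @ X / n                          # (1/n) sum_i x_i (x) x_i
    dev = np.linalg.norm(emp - np.eye(d), 2)   # operator-norm deviation
    print(f"n = {n:6d}   ||(1/n) sum x_i x_i^T - I_d|| = {dev:.4f}")
```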
One can also deduce the main Lemma of [2] from the Corollary in the previous section. Given vectors $x_i$ in $\mathbb{R}^d$, we are interested in the magnitude of

$$\Big\| \sum_{i=1}^N g_i\, x_i \otimes x_i \Big\|,$$

where the $g_i$ are independent Gaussians. For normalization, we can assume that $\sum_{i=1}^N x_i \otimes x_i = A$ with $\|A\| = 1$. Denote $M := \max_i \|x_i\|$, and consider the random operator $X$ that takes the value $M^{-2}\, x_i \otimes x_i$ with probability $1/N$ each. As before, let $X_1, X_2, \dots, X_n$ be independent copies of $X$, and let $S_n = X_1 + \cdots + X_n$. This time, we are going to let $n \to \infty$. By the Central Limit Theorem, the properly scaled sum $S_n - \mathbb{E} S_n$ converges to $\sum_{i=1}^N g_i\, x_i \otimes x_i$. One then chooses the parameters correctly to produce a version of the main Lemma of [2]. We omit the details.

References

[1] R. Ahlswede, A. Winter, Strong converse for identification via quantum channels, IEEE Trans. Information Theory 48 (2002).
[2] M. Rudelson, Random vectors in the isotropic position.
[3] A. Wigderson, D. Xiao, Derandomizing the Ahlswede-Winter matrix-valued Chernoff bound using pessimistic estimators, and applications, Theory of Computing 4 (2008).