A NOTE ON SUMS OF INDEPENDENT RANDOM MATRICES AFTER AHLSWEDE-WINTER

1. The method

Ahlswede and Winter [1] proposed a new approach to deviation inequalities for sums of independent random matrices. The purpose of this note is to indicate how this method implies Rudelson's sampling theorems for random vectors in isotropic position.

Let $X_1, \ldots, X_n$ be independent $d \times d$ real random matrices, and let $S_n = X_1 + \cdots + X_n$. We will be interested in the magnitude of the deviation $S_n - \mathbb{E} S_n$ in the operator norm.

1.1. Real valued random variables. The Ahlswede-Winter method [1] is parallel to the classical approach to deviation inequalities for real valued random variables, which we briefly outline. Let $X_1, \ldots, X_n$ be independent mean zero random variables. We are interested in the magnitude of $S_n = \sum_i X_i$. For simplicity, we shall assume that $|X_i| \le 1$ a.s.; this hypothesis can be relaxed to some control of the moments, precisely to a sub-exponential tail.

Fix $t > 0$ and let $\lambda > 0$ be a parameter to be chosen later. We want to estimate
$$p := \mathbb{P}(S_n > t) = \mathbb{P}(e^{\lambda S_n} > e^{\lambda t}).$$
By Markov's inequality and using independence, we have
$$p \le e^{-\lambda t}\, \mathbb{E} e^{\lambda S_n} = e^{-\lambda t} \prod_i \mathbb{E} e^{\lambda X_i}.$$
Next, Taylor's expansion together with the mean zero and boundedness hypotheses shows that, for every $i$,
$$\mathbb{E} e^{\lambda X_i} \le e^{\lambda^2 \operatorname{Var} X_i}, \qquad 0 \le \lambda \le 1.$$
This yields
$$p \le e^{-\lambda t + \lambda^2 \sigma^2}, \qquad \text{where } \sigma^2 := \sum_i \operatorname{Var} X_i.$$
The optimal choice of the parameter, $\lambda = \min(t/2\sigma^2, 1)$, implies Chernoff's inequality
$$p \le \max\big(e^{-t^2/4\sigma^2},\, e^{-t/2}\big).$$
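The following minimal numerical sketch (ours, not part of the note; Python with NumPy is our choice, and all parameters are arbitrary) illustrates the real valued bound: we take the $X_i$ to be independent Rademacher signs, so that $|X_i| \le 1$ and $\operatorname{Var} X_i = 1$, hence $\sigma^2 = n$, and compare the empirical tail $\mathbb{P}(S_n > t)$ with $\max(e^{-t^2/4\sigma^2}, e^{-t/2})$.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 50_000
sigma2 = float(n)                      # Var X_i = 1, so sigma^2 = n

# Each row holds one realization of (X_1, ..., X_n); S holds S_n per trial.
signs = 2 * rng.integers(0, 2, size=(trials, n)) - 1
S = signs.sum(axis=1)

for t in [10.0, 20.0, 30.0]:
    empirical = np.mean(S > t)
    bound = max(np.exp(-t**2 / (4 * sigma2)), np.exp(-t / 2))
    print(f"t = {t:4.0f}  empirical tail = {empirical:.4f}  bound = {bound:.4f}")
\end{verbatim}

The empirical tail sits below the bound for every $t$; the Gaussian branch $e^{-t^2/4\sigma^2}$ is the active one in the regime $t \le 2\sigma^2$.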
1.2. Random matrices. We now try to generalize this method to the case where the $X_i \in M_d$ are independent mean zero random matrices; here $M_d$ denotes the class of symmetric $d \times d$ matrices.

Some of the matrix calculus is straightforward. Thus, for $A \in M_d$, the matrix exponential $e^A$ is defined as usual by Taylor's series. Recall that $e^A$ has the same eigenvectors as $A$, and eigenvalues $e^{\lambda_i(A)}$. The partial order $A \preceq B$ means $B - A \succeq 0$, i.e. $B - A$ is positive semidefinite.

The non-straightforward part is that, in general, $e^{A+B} \ne e^A e^B$. However, Golden-Thompson's inequality (see [3]) states that
$$\operatorname{tr} e^{A+B} \le \operatorname{tr}(e^A e^B)$$
holds for arbitrary $A, B \in M_d$ (and in fact with an arbitrary unitarily invariant norm in place of the trace). Therefore, for $S_n = X_1 + \cdots + X_n$ and for $I_d$ the identity matrix, we have
$$p := \mathbb{P}(S_n \not\preceq t I_d) = \mathbb{P}(e^{\lambda S_n} \not\preceq e^{\lambda t} I_d) \le \mathbb{P}(\operatorname{tr} e^{\lambda S_n} > e^{\lambda t}) \le e^{-\lambda t}\, \mathbb{E} \operatorname{tr} e^{\lambda S_n}.$$
This estimate is not sharp: $e^{\lambda S_n} \not\preceq e^{\lambda t} I_d$ means that the biggest eigenvalue of $e^{\lambda S_n}$ exceeds $e^{\lambda t}$, while $\operatorname{tr} e^{\lambda S_n} > e^{\lambda t}$ means that the sum of all $d$ eigenvalues exceeds the same bound. This gap is responsible for the (sometimes inevitable) loss of the $\log d$ factor in Rudelson's selection theorem.

Since $S_n = X_n + S_{n-1}$, we can use Golden-Thompson's inequality to separate the last term from the sum:
$$\mathbb{E} \operatorname{tr} e^{\lambda S_n} \le \mathbb{E} \operatorname{tr}(e^{\lambda X_n} e^{\lambda S_{n-1}}).$$
Using independence, the fact that expectation and trace commute, and the bound $\operatorname{tr}(AB) \le \|A\| \operatorname{tr} B$ for $B \succeq 0$, we continue:
$$\mathbb{E} \operatorname{tr}(e^{\lambda X_n} e^{\lambda S_{n-1}}) = \mathbb{E}_{n-1} \operatorname{tr}\big( (\mathbb{E}_n e^{\lambda X_n})\, e^{\lambda S_{n-1}} \big) \le \|\mathbb{E} e^{\lambda X_n}\| \cdot \mathbb{E}_{n-1} \operatorname{tr} e^{\lambda S_{n-1}}.$$
Continuing by induction, and since $\operatorname{tr} I_d = d$, we arrive at
$$\mathbb{E} \operatorname{tr} e^{\lambda S_n} \le d \prod_i \|\mathbb{E} e^{\lambda X_i}\|.$$
We have proved that
$$\mathbb{P}(S_n \not\preceq t I_d) \le d\, e^{-\lambda t} \prod_i \|\mathbb{E} e^{\lambda X_i}\|.$$
Repeating the argument for $-S_n$ and using that $-t I_d \preceq S_n \preceq t I_d$ is equivalent to $\|S_n\| \le t$, we have shown that

(1)  $\mathbb{P}(\|S_n\| > t) \le 2d\, e^{-\lambda t} \prod_i \|\mathbb{E} e^{\lambda X_i}\|.$
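Golden-Thompson's inequality is the one genuinely non-elementary ingredient above, so a numerical sanity check may be reassuring. The sketch below (ours; matrix size and trial count are arbitrary, and scipy.linalg.expm is our choice of matrix exponential) tests $\operatorname{tr} e^{A+B} \le \operatorname{tr}(e^A e^B)$ on random symmetric matrices.

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(1)
d = 6
for _ in range(1000):
    A = rng.standard_normal((d, d)); A = (A + A.T) / 2   # random symmetric A
    B = rng.standard_normal((d, d)); B = (B + B.T) / 2   # random symmetric B
    lhs = np.trace(expm(A + B))
    rhs = np.trace(expm(A) @ expm(B))                    # rhs > 0 always
    assert lhs <= rhs * (1 + 1e-10), (lhs, rhs)          # Golden-Thompson
print("tr e^{A+B} <= tr(e^A e^B) held in all 1000 trials")
\end{verbatim}

As noted above, $e^{A+B} \ne e^A e^B$ in general; only the trace (or unitarily invariant norm) comparison survives, which is exactly why the argument passes through the trace.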
Remark. As in the real valued case, full independence is never needed in the above argument; it works just as well for martingales.

The main result:

Theorem 1 (Chernoff-type inequality). Let $X_i \in M_d$ be independent mean zero random matrices with $\|X_i\| \le 1$ for all $i$ a.s. Let $S_n = X_1 + \cdots + X_n$ and $\sigma^2 = \sum_{i=1}^n \|\operatorname{Var} X_i\|$. Then for every $t > 0$ we have
$$\mathbb{P}(\|S_n\| > t) \le 2d \cdot \max\big(e^{-t^2/4\sigma^2},\, e^{-t/2}\big).$$

To prove this theorem, we have to estimate $\|\mathbb{E} e^{\lambda X_i}\|$ in (1), which is easy. For example, if $Z \in M_d$ and $\|Z\| \le 1$, then Taylor's series shows that $e^Z \preceq I_d + Z + Z^2$. Therefore, we have:

Lemma 2. Let $Z \in M_d$ be a mean zero random matrix with $\|Z\| \le 1$ a.s. Then
$$\mathbb{E} e^Z \preceq e^{\operatorname{Var} Z}.$$

Proof. Using the mean zero assumption, we have
$$\mathbb{E} e^Z \preceq \mathbb{E}(I_d + Z + Z^2) = I_d + \operatorname{Var} Z \preceq e^{\operatorname{Var} Z}.$$

Proof of Theorem 1. Let $0 < \lambda \le 1$. By the Theorem's hypotheses and Lemma 2 (applied to $Z = \lambda X_i$),
$$\|\mathbb{E} e^{\lambda X_i}\| \le \|e^{\lambda^2 \operatorname{Var} X_i}\| = e^{\lambda^2 \|\operatorname{Var} X_i\|}.$$
Hence by (1),
$$\mathbb{P}(\|S_n\| > t) \le 2d\, e^{-\lambda t + \lambda^2 \sigma^2}.$$
With the optimal choice $\lambda := \min(t/2\sigma^2, 1)$, the conclusion of the Theorem follows.

Problem. Does the Theorem hold with $\sigma^2$ replaced by $\|\sum_{i=1}^n \operatorname{Var} X_i\|$? If so, this would generalize the Pisier-Lust-Piquard non-commutative Khinchine inequality.

Corollary 3. Let $X_i \in M_d$ be independent random matrices with $X_i \succeq 0$ and $\|X_i\| \le 1$ for all $i$ a.s. Let $S_n = X_1 + \cdots + X_n$ and $E = \sum_{i=1}^n \|\mathbb{E} X_i\|$. Then for every $\varepsilon \in (0,1)$ we have
$$\mathbb{P}(\|S_n - \mathbb{E} S_n\| > \varepsilon E) \le 2d\, e^{-\varepsilon^2 E/4}.$$

Proof. Applying the Theorem to the mean zero matrices $X_i - \mathbb{E} X_i$ (which satisfy $\|X_i - \mathbb{E} X_i\| \le 1$, since $0 \preceq X_i \preceq I_d$), we obtain

(2)  $\mathbb{P}(\|S_n - \mathbb{E} S_n\| > t) \le 2d \cdot \max(e^{-t^2/4\sigma^2}, e^{-t/2}).$

Note that $\|X_i\| \le 1$ and $X_i \succeq 0$ imply
$$\operatorname{Var} X_i \preceq \mathbb{E} X_i^2 \preceq \mathbb{E}(\|X_i\| X_i) \preceq \mathbb{E} X_i.$$
Therefore $\sigma^2 \le E$. Now we use (2) for $t = \varepsilon E$, and note that both exponents dominate $\varepsilon^2 E/4$: indeed $t^2/4\sigma^2 = \varepsilon^2 E^2/4\sigma^2 \ge \varepsilon^2 E/4$, and $t/2 = \varepsilon E/2 \ge \varepsilon^2 E/4$.
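Lemma 2 can be sanity-checked numerically in the simplest non-commutative setting (our illustration, not from the note): take $Z = \varepsilon A$ with $\varepsilon$ a Rademacher sign and $A$ a fixed symmetric matrix, $\|A\| = 1$. Then $\mathbb{E} e^Z = (e^A + e^{-A})/2$ and $\operatorname{Var} Z = A^2$, so the Lemma asserts $(e^A + e^{-A})/2 \preceq e^{A^2}$ in the positive semidefinite order.

\begin{verbatim}
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(2)
d = 5
for _ in range(1000):
    A = rng.standard_normal((d, d)); A = (A + A.T) / 2
    A /= np.linalg.norm(A, 2)              # normalize: operator norm ||A|| = 1
    mean_exp = (expm(A) + expm(-A)) / 2    # E e^Z for Z = eps * A
    bound = expm(A @ A)                    # e^{Var Z} = e^{A^2}
    # PSD order: every eigenvalue of (bound - mean_exp) must be >= 0.
    assert np.linalg.eigvalsh(bound - mean_exp).min() > -1e-9
print("E e^Z <= e^{Var Z} held in all 1000 trials")
\end{verbatim}

In this test case all matrices involved are functions of $A$, so the claim reduces to the scalar inequality $\cosh \lambda \le e^{\lambda^2}$ for $|\lambda| \le 1$; the code nevertheless checks the matrix statement directly.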
Remark. The hypothesis $\|X_i\| \le 1$ can be relaxed throughout to $\|\, \|X_i\| \,\|_{\psi_1} \le 1$. One just has to be more careful with the Taylor series.

2. Applications

Let $x$ be a random vector in isotropic position in $\mathbb{R}^d$, i.e. $\mathbb{E}\, x \otimes x = I_d$, and denote $M := \|x\|_{\psi_1}$. Then the random matrix $X := M^{-2}\, x \otimes x$ satisfies the hypotheses of Corollary 3, in the relaxed form given by the Remark above. Clearly,
$$\mathbb{E} X = M^{-2} I_d, \qquad E = n/M^2, \qquad \mathbb{E} S_n = (n/M^2)\, I_d.$$
Corollary 3 then gives
$$\mathbb{P}(\|S_n - \mathbb{E} S_n\| > \varepsilon E) \le 2d\, e^{-\varepsilon^2 n/4M^2}.$$
We have thus proved:

Corollary 4. Let $x$ be a random vector in isotropic position in $\mathbb{R}^d$ such that $M := \|x\|_{\psi_1} < \infty$, and let $x_1, \ldots, x_n$ be independent copies of $x$. Then for every $\varepsilon \in (0,1)$ we have
$$\mathbb{P}\Big( \Big\| \frac{1}{n} \sum_{i=1}^n x_i \otimes x_i - I_d \Big\| > \varepsilon \Big) \le 2d\, e^{-\varepsilon^2 n/4M^2}.$$

The probability in the Corollary is smaller than 1 provided the number of samples satisfies $n \gtrsim \varepsilon^{-2} M^2 \log d$. This is a version of Rudelson's sampling theorem [2], in which $M$ played the role of $(\mathbb{E} \|x\|^{\log n})^{1/\log n}$.
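For illustration (ours; the distribution and the sample sizes are arbitrary choices), the following sketch simulates Corollary 4 with $x$ uniform on the sphere of radius $\sqrt{d}$, an isotropic vector with $\|x\| = \sqrt{d}$ a.s.; the empirical matrix $\frac{1}{n} \sum_i x_i \otimes x_i$ should approach $I_d$ in operator norm as $n$ grows.

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(3)
d = 50

def sphere_sample(n, d):
    # Gaussian vectors rescaled to length sqrt(d); then E x x^T = I_d.
    g = rng.standard_normal((n, d))
    return g * (np.sqrt(d) / np.linalg.norm(g, axis=1, keepdims=True))

for n in [200, 2_000, 20_000]:
    x = sphere_sample(n, d)
    emp = x.T @ x / n                         # (1/n) sum_i x_i x_i^T
    err = np.linalg.norm(emp - np.eye(d), 2)  # operator norm deviation
    print(f"n = {n:6d}  ||(1/n) sum x_i x_i^T - I_d|| = {err:.3f}")
\end{verbatim}

The observed error decays roughly like $\sqrt{d/n}$, consistent with the Corollary: for this distribution $M^2 \asymp d$, so $n \gtrsim \varepsilon^{-2} d \log d$ samples suffice for error $\varepsilon$.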
One can also deduce the main Lemma of [2] from Corollary 3. Given vectors $x_i$ in $\mathbb{R}^d$, we are interested in the magnitude of
$$\Big\| \sum_{i=1}^N g_i\, x_i \otimes x_i \Big\|,$$
where the $g_i$ are independent standard Gaussian random variables. For normalization, we can assume that $\sum_{i=1}^N x_i \otimes x_i = A$ with $\|A\| = 1$. Denote $M := \max_i \|x_i\|$ and consider the random operator
$$X := M^{-2}\, x_i \otimes x_i \qquad \text{with probability } 1/N \text{ each.}$$
As before, let $X_1, X_2, \ldots$ be independent copies of $X$, and let $S_n = X_1 + \cdots + X_n$. This time we let $n \to \infty$: by the Central Limit Theorem, the properly scaled sum $S_n - \mathbb{E} S_n$ converges to $\sum_{i=1}^N g_i\, x_i \otimes x_i$. One then chooses the parameters correctly to produce a version of the main Lemma of [2]. We omit the details.

References

[1] R. Ahlswede, A. Winter, Strong converse for identification via quantum channels, IEEE Trans. Inform. Theory 48 (2002), 568–579.
[2] M. Rudelson, Random vectors in the isotropic position, J. Funct. Anal. 164 (1999), 60–72.
[3] A. Wigderson, D. Xiao, Derandomizing the Ahlswede-Winter matrix-valued Chernoff bound using pessimistic estimators, and applications, Theory of Computing 4 (2008), 53–76.