18.S096: Concentration Inequalities, Scalar and Matrix Versions


Topics in Mathematics of Data Science (Fall 2015)
Afonso S. Bandeira
bandeira@mit.edu
October 15, 2015

These are lecture notes, not in final form, and they will be continuously edited and/or corrected, as I am sure they contain many typos. Please let me know if you find any typo/mistake. Also, I am posting short descriptions of these notes, together with the open problems, on my Blog, see [Ban15a].

4.1 Large Deviation Inequalities

Concentration and large deviation inequalities are among the most useful tools for understanding the performance of some algorithms. In a nutshell, they control the probability of a random variable being very far from its expectation.

The simplest such inequality is Markov's inequality:

Theorem 4.1 (Markov's Inequality). Let $X \geq 0$ be a non-negative random variable with $\mathbb{E}[X] < \infty$. Then,
$$\operatorname{Prob}\{X > t\} \leq \frac{\mathbb{E}[X]}{t}.$$

Proof. Let $t > 0$. Define a random variable $Y_t$ as
$$Y_t = \begin{cases} 0 & \text{if } X \leq t \\ t & \text{if } X > t. \end{cases}$$
Clearly, $Y_t \leq X$, hence $\mathbb{E}[Y_t] \leq \mathbb{E}[X]$, and
$$t \operatorname{Prob}\{X > t\} = \mathbb{E}[Y_t] \leq \mathbb{E}[X],$$
concluding the proof.

Markov's inequality can be used to obtain many more concentration inequalities. Chebyshev's inequality is a simple inequality that controls fluctuations from the mean.

Theorem 4.2 (Chebyshev's inequality). Let $X$ be a random variable with $\mathbb{E}[X^2] < \infty$. Then,
$$\operatorname{Prob}\{|X - \mathbb{E} X| > t\} \leq \frac{\operatorname{Var}(X)}{t^2}.$$

Proof. Apply Markov's inequality to the random variable $(X - \mathbb{E}[X])^2$ to get:
$$\operatorname{Prob}\{|X - \mathbb{E} X| > t\} = \operatorname{Prob}\left\{ (X - \mathbb{E} X)^2 > t^2 \right\} \leq \frac{\mathbb{E}\left[ (X - \mathbb{E} X)^2 \right]}{t^2} = \frac{\operatorname{Var}(X)}{t^2}.$$
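
As a quick numerical sanity check (not part of the original argument), the following short simulation compares Chebyshev's bound with the true tail of a random variable; the choice of distribution (Exponential(1), so that $\mathbb{E}X = \operatorname{Var}X = 1$) and the sample size are arbitrary illustrative choices.

```python
# Empirical check of Chebyshev's inequality (Theorem 4.2): for X ~ Exponential(1)
# we have E[X] = 1 and Var(X) = 1, so Prob{|X - 1| > t} <= 1/t^2.
import numpy as np

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=1_000_000)
for t in [1.0, 2.0, 3.0, 5.0]:
    empirical = np.mean(np.abs(X - 1.0) > t)
    print(f"t = {t}: empirical tail = {empirical:.5f}, Chebyshev bound = {1.0 / t**2:.5f}")
```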

4.1.1 Sums of independent random variables

In what follows we will show two useful inequalities involving sums of independent random variables. The intuitive idea is that if we have a sum of independent random variables $X = X_1 + \cdots + X_n$, where the $X_i$ are i.i.d. centered random variables, then while the value of $X$ can be of order $O(n)$, it will very likely be of order $O(\sqrt{n})$ (note that this is the order of its standard deviation). The inequalities that follow are ways of very precisely controlling the probability of $X$ being larger than $O(\sqrt{n})$. While we could use, for example, Chebyshev's inequality for this, in the inequalities that follow the probabilities will be exponentially small, rather than quadratically small, which will be crucial in many applications to come.

Theorem 4.3 (Hoeffding's Inequality). Let $X_1, X_2, \ldots, X_n$ be independent bounded random variables, i.e., $|X_i| \leq a$ and $\mathbb{E}[X_i] = 0$. Then,
$$\operatorname{Prob}\left\{ \left| \sum_{i=1}^n X_i \right| > t \right\} \leq 2 \exp\left( -\frac{t^2}{2na^2} \right).$$

The inequality implies that fluctuations larger than $O(\sqrt{n})$ have small probability. For example, for $t = a\sqrt{2n\log n}$ we get that the probability is at most $\frac{2}{n}$.

Proof. We first get a probability bound for the event $\sum_{i=1}^n X_i > t$. The proof, again, will follow from Markov. Since we want an exponentially small probability, we use a classical trick that involves exponentiating with any $\lambda > 0$ and then choosing the optimal $\lambda$:
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} = \operatorname{Prob}\left\{ e^{\lambda \sum_{i=1}^n X_i} > e^{\lambda t} \right\} \leq \frac{\mathbb{E}\left[ e^{\lambda \sum_{i=1}^n X_i} \right]}{e^{t\lambda}} = e^{-t\lambda} \prod_{i=1}^n \mathbb{E}\left[ e^{\lambda X_i} \right], \qquad (4.3)$$
where the penultimate step follows from Markov's inequality and the last equality follows from independence of the $X_i$'s.

We now use the fact that $|X_i| \leq a$ to bound $\mathbb{E}[e^{\lambda X_i}]$. Because the function $f(x) = e^{\lambda x}$ is convex,
$$e^{\lambda x} \leq \frac{a+x}{2a} e^{\lambda a} + \frac{a-x}{2a} e^{-\lambda a}, \quad \text{for all } x \in [-a, a].$$
Since, for all $i$, $\mathbb{E}[X_i] = 0$, we get
$$\mathbb{E}\left[ e^{\lambda X_i} \right] \leq \mathbb{E}\left[ \frac{a+X_i}{2a} e^{\lambda a} + \frac{a-X_i}{2a} e^{-\lambda a} \right] = \frac{1}{2}\left( e^{\lambda a} + e^{-\lambda a} \right) = \cosh(\lambda a).$$
Note that
$$\cosh(x) \leq e^{x^2/2}, \quad \text{for all } x \in \mathbb{R}$$
(this follows immediately from the Taylor expansions $\cosh(x) = \sum_{n=0}^\infty \frac{x^{2n}}{(2n)!}$, $e^{x^2/2} = \sum_{n=0}^\infty \frac{x^{2n}}{2^n n!}$, and $(2n)! \geq 2^n n!$). Hence,
$$\mathbb{E}\left[ e^{\lambda X_i} \right] \leq e^{(\lambda a)^2/2}.$$
Together with (4.3), this gives
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} \leq e^{-t\lambda} \prod_{i=1}^n e^{(\lambda a)^2/2} = e^{-t\lambda} e^{n\frac{(\lambda a)^2}{2}}.$$
This inequality holds for any choice of $\lambda \geq 0$, so we choose the value of $\lambda$ that minimizes
$$\min_{\lambda}\left\{ n\frac{\lambda^2 a^2}{2} - t\lambda \right\}.$$
Differentiating readily shows that the minimizer is given by
$$\lambda^* = \frac{t}{na^2},$$
which satisfies $\lambda^* > 0$. For this choice of $\lambda^*$,
$$n\frac{(\lambda^*)^2 a^2}{2} - t\lambda^* = \frac{1}{n}\left( \frac{t^2}{2a^2} - \frac{t^2}{a^2} \right) = -\frac{t^2}{2na^2}.$$
Thus,
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} \leq e^{-\frac{t^2}{2na^2}}.$$
By using the same argument on $\sum_{i=1}^n (-X_i)$, and union bounding over the two events, we get
$$\operatorname{Prob}\left\{ \left| \sum_{i=1}^n X_i \right| > t \right\} \leq 2 e^{-\frac{t^2}{2na^2}}.$$
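
As a small illustration (not from the original notes), the following sketch compares the empirical tail of a sum of bounded centered random variables with the bound of Theorem 4.3. The distribution of the $X_i$ (uniform on $[-a,a]$), the value of $n$, and the thresholds are arbitrary choices.

```python
# Empirical check of Hoeffding's inequality (Theorem 4.3).
import numpy as np

rng = np.random.default_rng(0)
n, a, trials = 200, 1.0, 100_000
S = rng.uniform(-a, a, size=(trials, n)).sum(axis=1)   # independent copies of sum_i X_i
for t in [10.0, 20.0, 30.0]:
    empirical = np.mean(np.abs(S) > t)
    hoeffding = 2 * np.exp(-t**2 / (2 * n * a**2))
    print(f"t = {t}: empirical = {empirical:.5f}, Hoeffding bound = {hoeffding:.5f}")
```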

Remark 4.4. Let's say that we have random variables $r_1, \ldots, r_n$ i.i.d. distributed as
$$r_i = \begin{cases} -1 & \text{with probability } p/2 \\ 0 & \text{with probability } 1-p \\ 1 & \text{with probability } p/2. \end{cases}$$
Then, $\mathbb{E}\, r_i = 0$ and $|r_i| \leq 1$, so Hoeffding's inequality gives:
$$\operatorname{Prob}\left\{ \left| \sum_{i=1}^n r_i \right| > t \right\} \leq 2 \exp\left( -\frac{t^2}{2n} \right).$$
Intuitively, the smaller $p$ is, the more concentrated $\left| \sum_{i=1}^n r_i \right|$ should be; however, Hoeffding's inequality does not capture this behavior. A natural way to quantify this intuition is by noting that the variance of $\sum_{i=1}^n r_i$ depends on $p$, as $\operatorname{Var}(r_i) = p$.

The inequality that follows, Bernstein's inequality, uses the variance of the summands to improve over Hoeffding's inequality. The way this is going to be achieved is by strengthening the proof above; more specifically, in step (4.3) we will use the bound on the variance to get a better estimate on $\mathbb{E}[e^{\lambda X_i}]$, essentially by realizing that if $X_i$ is centered, $\mathbb{E} X_i^2 = \sigma^2$, and $|X_i| \leq a$, then, for $k \geq 2$, $\mathbb{E} X_i^k \leq \sigma^2 a^{k-2}$.

Theorem 4.5 (Bernstein's Inequality). Let $X_1, X_2, \ldots, X_n$ be independent centered bounded random variables, i.e., $|X_i| \leq a$ and $\mathbb{E}[X_i] = 0$, with variance $\mathbb{E}[X_i^2] = \sigma^2$. Then,
$$\operatorname{Prob}\left\{ \left| \sum_{i=1}^n X_i \right| > t \right\} \leq 2 \exp\left( -\frac{t^2}{2n\sigma^2 + \frac{2}{3}at} \right).$$

Remark 4.6. Before proving Bernstein's Inequality, note that on the example of Remark 4.4 we get
$$\operatorname{Prob}\left\{ \left| \sum_{i=1}^n r_i \right| > t \right\} \leq 2 \exp\left( -\frac{t^2}{2np + \frac{2}{3}t} \right),$$
which exhibits a dependence on $p$ and, for small values of $p$, is considerably smaller than what Hoeffding's inequality gives.
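
Before turning to the proof, here is a small numerical illustration (not part of the original notes) of the gap discussed in Remarks 4.4 and 4.6, using the sparse Rademacher variables of Remark 4.4; the values of $p$, $n$, $t$ and the number of trials are arbitrary.

```python
# Compare the Hoeffding bound (Remark 4.4) and the Bernstein bound (Remark 4.6)
# with the empirical tail of |sum_i r_i|.
import numpy as np

rng = np.random.default_rng(1)
n, p, trials = 1000, 0.01, 500_000
K = rng.binomial(n, p, size=trials)      # how many r_i are nonzero in each trial
S = 2 * rng.binomial(K, 0.5) - K         # sum of the nonzero +/-1 entries
for t in [10, 15, 20]:
    empirical = np.mean(np.abs(S) > t)
    hoeffding = 2 * np.exp(-t**2 / (2 * n))
    bernstein = 2 * np.exp(-t**2 / (2 * n * p + (2 / 3) * t))
    print(f"t = {t}: empirical = {empirical:.2e}, Hoeffding = {hoeffding:.3f}, Bernstein = {bernstein:.2e}")
```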

Proof. As before, we will prove
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} \leq \exp\left( -\frac{t^2}{2n\sigma^2 + \frac{2}{3}at} \right),$$
and then union bound with the same result for $-\sum_{i=1}^n X_i$, to prove the theorem.

For any $\lambda > 0$ we have
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} = \operatorname{Prob}\left\{ e^{\lambda \sum_i X_i} > e^{\lambda t} \right\} \leq \frac{\mathbb{E}\left[ e^{\lambda \sum_i X_i} \right]}{e^{\lambda t}} = e^{-\lambda t} \prod_{i=1}^n \mathbb{E}\left[ e^{\lambda X_i} \right].$$
Now comes the source of the improvement over Hoeffding's:
$$\mathbb{E}\left[ e^{\lambda X_i} \right] = \mathbb{E}\left[ 1 + \lambda X_i + \sum_{m=2}^{\infty} \frac{\lambda^m X_i^m}{m!} \right] \leq 1 + \sum_{m=2}^{\infty} \frac{\lambda^m a^{m-2} \sigma^2}{m!} = 1 + \frac{\sigma^2}{a^2} \sum_{m=2}^{\infty} \frac{(\lambda a)^m}{m!} = 1 + \frac{\sigma^2}{a^2}\left( e^{\lambda a} - 1 - \lambda a \right).$$
Therefore,
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} \leq e^{-\lambda t}\left[ 1 + \frac{\sigma^2}{a^2}\left( e^{\lambda a} - 1 - \lambda a \right) \right]^n.$$
We will use a few simple inequalities (that can be easily proved with calculus), such as
$$1 + x \leq e^x, \quad \text{for all } x \in \mathbb{R}$$
(in fact, $y = 1 + x$ is a tangent line to the graph of $f(x) = e^x$). This means that
$$1 + \frac{\sigma^2}{a^2}\left( e^{\lambda a} - 1 - \lambda a \right) \leq e^{\frac{\sigma^2}{a^2}\left( e^{\lambda a} - 1 - \lambda a \right)},$$
which readily implies
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} \leq e^{-\lambda t} e^{n\frac{\sigma^2}{a^2}\left( e^{\lambda a} - 1 - \lambda a \right)}.$$
As before, we try to find the value of $\lambda > 0$ that minimizes
$$\min_{\lambda}\left\{ -\lambda t + n\frac{\sigma^2}{a^2}\left( e^{\lambda a} - 1 - \lambda a \right) \right\}.$$
Differentiation gives
$$-t + n\frac{\sigma^2}{a^2}\left( a e^{\lambda a} - a \right) = 0,$$
which implies that the optimal choice of $\lambda$ is given by
$$\lambda^* = \frac{1}{a}\log\left( 1 + \frac{at}{n\sigma^2} \right).$$
If we set
$$u = \frac{at}{n\sigma^2}, \qquad (4.4)$$
then $\lambda^* = \frac{1}{a}\log(1+u)$. Now, the value of the minimum is given by
$$-\lambda^* t + n\frac{\sigma^2}{a^2}\left( e^{\lambda^* a} - 1 - \lambda^* a \right) = -\frac{n\sigma^2}{a^2}\left[ (1+u)\log(1+u) - u \right].$$
This means that
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} \leq \exp\left( -\frac{n\sigma^2}{a^2}\left\{ (1+u)\log(1+u) - u \right\} \right).$$
The rest of the proof follows by noting that, for every $u > 0$,
$$(1+u)\log(1+u) - u \geq \frac{u^2}{2 + \frac{2u}{3}}, \qquad (4.5)$$
which implies
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i > t \right\} \leq \exp\left( -\frac{n\sigma^2}{a^2} \frac{u^2}{2 + \frac{2u}{3}} \right) = \exp\left( -\frac{t^2}{2n\sigma^2 + \frac{2}{3}at} \right).$$

4.2 Gaussian Concentration

One of the most important results in concentration of measure is Gaussian concentration. Although it is a concentration result specific to normally distributed random variables, it will be very useful throughout these lectures. Intuitively, it says that if $F : \mathbb{R}^n \to \mathbb{R}$ is a function that is stable in terms of its input, then $F(g)$ is very well concentrated around its mean, where $g \sim \mathcal{N}(0, I)$. More precisely:

Theorem 4.7 (Gaussian Concentration). Let $X = [X_1, \ldots, X_n]^T$ be a vector with i.i.d. standard Gaussian entries and $F : \mathbb{R}^n \to \mathbb{R}$ a $\sigma$-Lipschitz function (i.e., $|F(x) - F(y)| \leq \sigma \|x - y\|$, for all $x, y \in \mathbb{R}^n$). Then, for every $t \geq 0$,
$$\operatorname{Prob}\left\{ |F(X) - \mathbb{E} F(X)| \geq t \right\} \leq 2 \exp\left( -\frac{t^2}{2\sigma^2} \right).$$

For the sake of simplicity we will show the proof for a slightly weaker bound (in terms of the constant inside the exponent):
$$\operatorname{Prob}\left\{ |F(X) - \mathbb{E} F(X)| \geq t \right\} \leq 2 \exp\left( -\frac{2}{\pi^2}\frac{t^2}{\sigma^2} \right).$$
This exposition follows closely the proof of Theorem 2.1.12 in [Tao12] and the original argument is due to Maurey and Pisier. For a proof with the optimal constants see, for example, Theorem 3.25 in these notes [vH14]. We will also assume the function $F$ is smooth — this is actually not a restriction, as a limiting argument can generalize the result from smooth functions to general Lipschitz functions.

Proof. If $F$ is smooth, then it is easy to see that the Lipschitz property implies that, for every $x \in \mathbb{R}^n$, $\|\nabla F(x)\| \leq \sigma$. By subtracting a constant from $F$, we can assume that $\mathbb{E} F(X) = 0$. Also, it is enough to show a one-sided bound
$$\operatorname{Prob}\left\{ F(X) - \mathbb{E} F(X) \geq t \right\} \leq \exp\left( -\frac{2}{\pi^2}\frac{t^2}{\sigma^2} \right),$$
since obtaining the same bound for $-F(X)$ and taking a union bound gives the result.

We start by using the same idea as in the proof of the large deviation inequalities above; for any $\lambda > 0$, Markov's inequality implies that
$$\operatorname{Prob}\{F(X) \geq t\} = \operatorname{Prob}\{\exp(\lambda F(X)) \geq \exp(\lambda t)\} \leq \frac{\mathbb{E}[\exp(\lambda F(X))]}{\exp(\lambda t)}.$$
This means we need to upper bound $\mathbb{E}[\exp(\lambda F(X))]$ using a bound on $\|\nabla F\|$.

The idea is to introduce a random independent copy $Y$ of $X$. Since $\exp(-\lambda x)$ is convex, Jensen's inequality implies that
$$\mathbb{E}[\exp(-\lambda F(Y))] \geq \exp(-\lambda \mathbb{E} F(Y)) = \exp(0) = 1.$$
Hence, since $X$ and $Y$ are independent,
$$\mathbb{E}\left[ \exp\left( \lambda\left[ F(X) - F(Y) \right] \right) \right] = \mathbb{E}[\exp(\lambda F(X))]\, \mathbb{E}[\exp(-\lambda F(Y))] \geq \mathbb{E}[\exp(\lambda F(X))].$$
Now we use the Fundamental Theorem of Calculus along a circular arc from $Y$ to $X$:
$$F(X) - F(Y) = \int_0^{\pi/2} \frac{\partial}{\partial \theta} F\left( Y\cos\theta + X\sin\theta \right) d\theta.$$
The advantage of using the circular arc is that, for any $\theta$, $X_\theta := Y\cos\theta + X\sin\theta$ is another random variable with the same distribution. Also, its derivative with respect to $\theta$, $X'_\theta = -Y\sin\theta + X\cos\theta$, also is. Moreover, $X_\theta$ and $X'_\theta$ are independent. In fact, note that they are jointly Gaussian and
$$\mathbb{E}\left[ X_\theta (X'_\theta)^T \right] = \mathbb{E}\left[ (Y\cos\theta + X\sin\theta)(-Y\sin\theta + X\cos\theta)^T \right] = 0.$$
We use Jensen's inequality again, now with respect to the integral:
$$\exp\left( \lambda\left[ F(X) - F(Y) \right] \right) = \exp\left( \frac{\pi}{2}\cdot\frac{1}{\pi/2}\int_0^{\pi/2} \lambda \frac{\partial}{\partial \theta} F(X_\theta)\, d\theta \right) \leq \frac{1}{\pi/2}\int_0^{\pi/2} \exp\left( \lambda\frac{\pi}{2}\frac{\partial}{\partial \theta} F(X_\theta) \right) d\theta.$$

Using the chain rule,
$$\exp\left( \lambda\left[ F(X) - F(Y) \right] \right) \leq \frac{2}{\pi}\int_0^{\pi/2} \exp\left( \lambda\frac{\pi}{2}\nabla F(X_\theta)\cdot X'_\theta \right) d\theta,$$
and taking expectations,
$$\mathbb{E}\exp\left( \lambda\left[ F(X) - F(Y) \right] \right) \leq \frac{2}{\pi}\int_0^{\pi/2} \mathbb{E}\exp\left( \lambda\frac{\pi}{2}\nabla F(X_\theta)\cdot X'_\theta \right) d\theta.$$
If we condition on $X_\theta$, then, since $\left\| \lambda\frac{\pi}{2}\nabla F(X_\theta) \right\| \leq \lambda\frac{\pi}{2}\sigma$, the random variable $\lambda\frac{\pi}{2}\nabla F(X_\theta)\cdot X'_\theta$ is Gaussian with variance at most $\left( \lambda\frac{\pi}{2}\sigma \right)^2$. This directly implies that, for every value of $X_\theta$,
$$\mathbb{E}_{X'_\theta}\exp\left( \lambda\frac{\pi}{2}\nabla F(X_\theta)\cdot X'_\theta \right) \leq \exp\left( \frac{1}{2}\left( \lambda\frac{\pi}{2}\sigma \right)^2 \right).$$
Taking expectation now in $X_\theta$, and putting everything together, gives
$$\mathbb{E}[\exp(\lambda F(X))] \leq \exp\left( \frac{1}{2}\left( \lambda\frac{\pi}{2}\sigma \right)^2 \right),$$
which means that
$$\operatorname{Prob}\{F(X) \geq t\} \leq \exp\left( \frac{1}{2}\left( \lambda\frac{\pi}{2}\sigma \right)^2 - \lambda t \right).$$
Optimizing for $\lambda$ gives $\lambda^* = \frac{t}{(\pi/2)^2\sigma^2}$, which gives
$$\operatorname{Prob}\{F(X) \geq t\} \leq \exp\left( -\frac{2}{\pi^2}\frac{t^2}{\sigma^2} \right).$$

4.2.1 Spectral norm of a Wigner Matrix

We give an illustrative example of the utility of Gaussian concentration. Let $W \in \mathbb{R}^{n\times n}$ be a standard Gaussian Wigner matrix, a symmetric matrix with (otherwise) independent Gaussian entries: the off-diagonal entries have unit variance and the diagonal entries have variance $2$. Then $\|W\|$ depends on $\frac{n(n+1)}{2}$ independent standard Gaussian random variables, and it is easy to see that it is a $\sqrt{2}$-Lipschitz function of these variables, since
$$\big|\, \|W_1\| - \|W_2\| \,\big| \leq \|W_1 - W_2\| \leq \|W_1 - W_2\|_F.$$
The symmetry of the matrix and the variance $2$ of the diagonal entries are responsible for the extra factor of $\sqrt{2}$. Using Gaussian Concentration (Theorem 4.7) we immediately get
$$\operatorname{Prob}\left\{ \|W\| \geq \mathbb{E}\|W\| + t \right\} \leq \exp\left( -\frac{t^2}{4} \right).$$
Since $\mathbb{E}\|W\| \leq 2\sqrt{n}$ (it is an excellent exercise to prove this using Slepian's inequality), we get:

Proposition 4.8. Let $W \in \mathbb{R}^{n\times n}$ be a standard Gaussian Wigner matrix, a symmetric matrix with (otherwise) independent Gaussian entries: the off-diagonal entries have unit variance and the diagonal entries have variance $2$. Then,
$$\operatorname{Prob}\left\{ \|W\| \geq 2\sqrt{n} + t \right\} \leq \exp\left( -\frac{t^2}{4} \right).$$

Note that this gives an extremely precise control of the fluctuations of $\|W\|$. In fact, for $t = 2\sqrt{\log n}$ this gives
$$\operatorname{Prob}\left\{ \|W\| \geq 2\sqrt{n} + 2\sqrt{\log n} \right\} \leq \exp\left( -\frac{4\log n}{4} \right) = \frac{1}{n}.$$
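
As an illustrative sanity check (not from the original notes), one can sample standard Gaussian Wigner matrices and compare $\|W\|$ with $2\sqrt{n}$, as in Proposition 4.8; the dimension and the number of samples below are arbitrary choices.

```python
# Sample standard Gaussian Wigner matrices and compare ||W|| with 2*sqrt(n).
import numpy as np

rng = np.random.default_rng(0)
n, trials = 500, 20
norms = []
for _ in range(trials):
    G = rng.standard_normal((n, n))
    W = (G + G.T) / np.sqrt(2)          # symmetric; off-diagonal variance 1, diagonal variance 2
    norms.append(np.linalg.norm(W, 2))  # spectral norm
print(f"2*sqrt(n) = {2*np.sqrt(n):.2f}, mean ||W|| = {np.mean(norms):.2f}, max ||W|| = {np.max(norms):.2f}")
```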

4.2.2 Talagrand's concentration inequality

A remarkable result by Talagrand [Tal95], Talagrand's concentration inequality, provides an analogue of Gaussian concentration for bounded random variables.

Theorem 4.9 (Talagrand concentration inequality, Theorem 2.1.13 in [Tao12]). Let $K > 0$, and let $X_1, \ldots, X_n$ be independent bounded random variables, $|X_i| \leq K$ for all $1 \leq i \leq n$. Let $F : \mathbb{R}^n \to \mathbb{R}$ be a $\sigma$-Lipschitz and convex function. Then, for any $t \geq 0$,
$$\operatorname{Prob}\left\{ |F(X) - \mathbb{E}[F(X)]| \geq tK \right\} \leq c_1 \exp\left( -c_2 \frac{t^2}{\sigma^2} \right),$$
for positive constants $c_1$ and $c_2$.

Other useful similar inequalities (with explicit constants) are available in [Mas00].

4.3 Other useful large deviation inequalities

This section contains, without proof, some scalar large deviation inequalities that I have found useful.

4.3.1 Additive Chernoff Bound

The additive Chernoff bound, also known as the Chernoff-Hoeffding theorem, concerns Bernoulli random variables.

Theorem 4.10. Given $0 < p < 1$ and $X_1, \ldots, X_n$ i.i.d. random variables distributed as a Bernoulli($p$) random variable (meaning that each is $1$ with probability $p$ and $0$ with probability $1-p$), then, for any $\varepsilon > 0$:
$$\operatorname{Prob}\left\{ \frac{1}{n}\sum_{i=1}^n X_i \geq p + \varepsilon \right\} \leq \left[ \left( \frac{p}{p+\varepsilon} \right)^{p+\varepsilon} \left( \frac{1-p}{1-p-\varepsilon} \right)^{1-p-\varepsilon} \right]^n,$$
$$\operatorname{Prob}\left\{ \frac{1}{n}\sum_{i=1}^n X_i \leq p - \varepsilon \right\} \leq \left[ \left( \frac{p}{p-\varepsilon} \right)^{p-\varepsilon} \left( \frac{1-p}{1-p+\varepsilon} \right)^{1-p+\varepsilon} \right]^n.$$
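
The following short sketch (not from the original notes) evaluates the additive Chernoff bound of Theorem 4.10 numerically and compares it with the empirical upper tail of the Bernoulli sample mean; the values of $n$, $p$, $\varepsilon$ and the number of trials are arbitrary illustrative choices.

```python
# Evaluate the additive Chernoff bound (Theorem 4.10) vs. the empirical upper tail.
import numpy as np

rng = np.random.default_rng(2)
n, p, eps, trials = 200, 0.3, 0.1, 200_000
means = rng.binomial(n, p, size=trials) / n
empirical = np.mean(means >= p + eps)
q = p + eps
chernoff = ((p / q) ** q * ((1 - p) / (1 - q)) ** (1 - q)) ** n
print(f"empirical = {empirical:.2e}, Chernoff bound = {chernoff:.2e}")
```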

4.3.2 Multiplicative Chernoff Bound

There is also a multiplicative version (see, for example, Lemma 2.3.3 in [Dur06]), which is particularly useful.

Theorem 4.11. Let $X_1, \ldots, X_n$ be independent random variables taking values in $\{0, 1\}$ (meaning they are Bernoulli distributed, but not necessarily identically distributed). Let $X = \sum_{i=1}^n X_i$ and $\mu = \mathbb{E} X$. Then, for any $\delta > 0$:
$$\operatorname{Prob}\left\{ X > (1+\delta)\mu \right\} < \left[ \frac{e^{\delta}}{(1+\delta)^{1+\delta}} \right]^{\mu},$$
$$\operatorname{Prob}\left\{ X < (1-\delta)\mu \right\} < \left[ \frac{e^{-\delta}}{(1-\delta)^{1-\delta}} \right]^{\mu}.$$

4.3.3 Deviation bounds on $\chi^2$ variables

A particularly useful deviation inequality is Lemma 1 in Laurent and Massart [LM00]:

Theorem 4.12 (Lemma 1 in Laurent and Massart [LM00]). Let $X_1, \ldots, X_n$ be i.i.d. standard Gaussian random variables ($\mathcal{N}(0,1)$), and $a_1, \ldots, a_n$ non-negative numbers. Let
$$Z = \sum_{k=1}^n a_k\left( X_k^2 - 1 \right).$$
The following inequalities hold for any $x > 0$:
$$\operatorname{Prob}\left\{ Z \geq 2\|a\|_2\sqrt{x} + 2\|a\|_\infty x \right\} \leq \exp(-x),$$
$$\operatorname{Prob}\left\{ Z \leq -2\|a\|_2\sqrt{x} \right\} \leq \exp(-x),$$
where $\|a\|_2^2 = \sum_{k=1}^n a_k^2$ and $\|a\|_\infty = \max_{1\leq k\leq n} a_k$.

Note that if $a_k = 1$ for all $k$, then $Z$ is a centered $\chi^2$ with $n$ degrees of freedom, so this theorem immediately gives a deviation inequality for $\chi^2$ random variables.
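
As a quick sanity check (not part of the original notes), the following sketch verifies the Laurent-Massart bounds empirically in the special case $a_k = 1$, i.e. $Z = \chi^2_n - n$; the values of $n$, $x$ and the number of trials are arbitrary.

```python
# Empirical check of Theorem 4.12 for a_k = 1, i.e. Z = chi^2_n - n.
import numpy as np

rng = np.random.default_rng(3)
n, trials = 50, 500_000
Z = rng.chisquare(n, size=trials) - n
a2, ainf = np.sqrt(n), 1.0                 # ||a||_2 and ||a||_inf for a_k = 1
for x in [1.0, 3.0, 6.0]:
    upper = np.mean(Z >= 2 * a2 * np.sqrt(x) + 2 * ainf * x)
    lower = np.mean(Z <= -2 * a2 * np.sqrt(x))
    print(f"x = {x}: upper tail = {upper:.2e}, lower tail = {lower:.2e}, bound exp(-x) = {np.exp(-x):.2e}")
```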

4.4 Matrix Concentration

In many important applications, some of which we will see in the coming lectures, one needs to use a matrix version of the inequalities above. Given $\{X_k\}_{k=1}^n$ independent random symmetric $d\times d$ matrices, one is interested in deviation inequalities for
$$\lambda_{\max}\left( \sum_{k=1}^n X_k \right).$$
For example, a very useful adaptation of Bernstein's inequality exists for this setting.

Theorem 4.13 (Theorem 1.4 in [Tro12]). Let $\{X_k\}_{k=1}^n$ be a sequence of independent random symmetric $d\times d$ matrices. Assume that each $X_k$ satisfies $\mathbb{E} X_k = 0$ and $\lambda_{\max}(X_k) \leq R$ almost surely. Then, for all $t \geq 0$,
$$\operatorname{Prob}\left\{ \lambda_{\max}\left( \sum_{k=1}^n X_k \right) \geq t \right\} \leq d\cdot\exp\left( -\frac{t^2}{2\sigma^2 + \frac{2}{3}Rt} \right), \quad \text{where } \sigma^2 = \left\| \sum_{k=1}^n \mathbb{E}\left( X_k^2 \right) \right\|.$$
Note that $\|A\|$ denotes the spectral norm of $A$.

In what follows we will state and prove various matrix concentration results, somewhat similar to Theorem 4.13. Motivated by the derivation of Proposition 4.8, which allowed us to easily transform bounds on the expected spectral norm of a random matrix into tail bounds, we will mostly focus on bounding the expected spectral norm. Tropp's monograph [Tro15b] is a nice introduction to matrix concentration and includes a proof of Theorem 4.13 as well as many other useful inequalities.
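
To make the statement concrete, here is a small numerical sketch (not from the original notes) in the special case $X_k = \varepsilon_k A_k$, with $\varepsilon_k$ Rademacher and $A_k$ fixed symmetric matrices; the dimension, the number of summands, and the matrices $A_k$ are arbitrary illustrative choices.

```python
# Empirical look at matrix Bernstein (Theorem 4.13) for X_k = eps_k * A_k.
import numpy as np

rng = np.random.default_rng(4)
d, n, trials = 20, 200, 2000
A = rng.standard_normal((n, d, d)) / np.sqrt(d)
A = (A + A.transpose(0, 2, 1)) / 2                      # fixed symmetric A_k
R = max(np.linalg.norm(A[k], 2) for k in range(n))      # lambda_max(X_k) <= ||A_k|| <= R
sigma2 = np.linalg.norm(sum(A[k] @ A[k] for k in range(n)), 2)   # ||sum E X_k^2||

lam_max = []
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    S = np.tensordot(eps, A, axes=1)                    # sum_k eps_k A_k
    lam_max.append(np.linalg.eigvalsh(S)[-1])
t = np.quantile(lam_max, 0.99)                          # an empirically "rare" level
bound = d * np.exp(-t**2 / (2 * sigma2 + (2 / 3) * R * t))
print(f"t = {t:.2f}: empirical Prob = {np.mean(np.array(lam_max) >= t):.3f}, Bernstein bound = {bound:.3f}")
```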

A particularly important inequality of this type is for Gaussian series; it is intimately related to the non-commutative Khintchine inequality [Pis03], and for that reason we will often refer to it as Non-commutative Khintchine (see, for example, (4.9) in [Tro12]).

Theorem 4.14 (Non-commutative Khintchine (NCK)). Let $A_1, \ldots, A_n \in \mathbb{R}^{d\times d}$ be symmetric matrices and $g_1, \ldots, g_n \sim \mathcal{N}(0,1)$ i.i.d., then:
$$\mathbb{E}\left\| \sum_{k=1}^n g_k A_k \right\| \leq \left( 2 + 2\log(2d) \right)^{1/2} \sigma, \quad \text{where } \sigma^2 = \left\| \sum_{k=1}^n A_k^2 \right\|. \qquad (4.6)$$

Note that, akin to Proposition 4.8, we can also use Gaussian concentration to get a tail bound on $\left\| \sum_{k=1}^n g_k A_k \right\|$. We consider the function
$$F : \mathbb{R}^n \to \mathbb{R}, \qquad F(g) = \left\| \sum_{k=1}^n g_k A_k \right\|.$$
We now estimate its Lipschitz constant; let $g, h \in \mathbb{R}^n$, then
$$\left| \left\| \sum_{k=1}^n g_k A_k \right\| - \left\| \sum_{k=1}^n h_k A_k \right\| \right| \leq \left\| \sum_{k=1}^n (g_k - h_k) A_k \right\| = \max_{v:\|v\|=1} v^T\left( \sum_{k=1}^n (g_k - h_k) A_k \right)v = \max_{v:\|v\|=1} \sum_{k=1}^n (g_k - h_k)\, v^T A_k v \leq \max_{v:\|v\|=1} \sqrt{\sum_{k=1}^n (g_k - h_k)^2}\, \sqrt{\sum_{k=1}^n \left( v^T A_k v \right)^2} = \left( \max_{v:\|v\|=1} \sqrt{\sum_{k=1}^n \left( v^T A_k v \right)^2} \right) \|g - h\|,$$
where the first inequality made use of the triangular inequality and the last one of the Cauchy-Schwarz inequality.

This motivates us to define a new parameter, the weak variance $\sigma_*$.

Definition 4.15 (Weak Variance (see, for example, [Tro15b])). Given symmetric matrices $A_1, \ldots, A_n \in \mathbb{R}^{d\times d}$, we define the weak variance parameter as
$$\sigma_*^2 = \max_{v:\|v\|=1} \sum_{k=1}^n \left( v^T A_k v \right)^2.$$

This means that, using Gaussian concentration (and setting $t = u\sigma_*$), we have
$$\operatorname{Prob}\left\{ \left\| \sum_{k=1}^n g_k A_k \right\| \geq \left( 2 + 2\log(2d) \right)^{1/2}\sigma + u\sigma_* \right\} \leq \exp\left( -\frac{1}{2}u^2 \right). \qquad (4.7)$$
This means that, although the expected value of $\left\| \sum_{k=1}^n g_k A_k \right\|$ is controlled by the parameter $\sigma$, its fluctuations seem to be controlled by $\sigma_*$. We compare the two quantities in the following proposition.

Proposition 4.16. Given symmetric matrices $A_1, \ldots, A_n \in \mathbb{R}^{d\times d}$, recall that
$$\sigma^2 = \left\| \sum_{k=1}^n A_k^2 \right\| \quad \text{and} \quad \sigma_*^2 = \max_{v:\|v\|=1} \sum_{k=1}^n \left( v^T A_k v \right)^2.$$
We have $\sigma_* \leq \sigma$.

Proof. Using the Cauchy-Schwarz inequality,
$$\sigma_*^2 = \max_{v:\|v\|=1} \sum_{k=1}^n \left( v^T\left[ A_k v \right] \right)^2 \leq \max_{v:\|v\|=1} \sum_{k=1}^n \|v\|^2\, \|A_k v\|^2 = \max_{v:\|v\|=1} \sum_{k=1}^n v^T A_k^2 v = \max_{v:\|v\|=1} v^T\left( \sum_{k=1}^n A_k^2 \right)v = \left\| \sum_{k=1}^n A_k^2 \right\| = \sigma^2.$$
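
The following sketch (not from the original notes) estimates $\mathbb{E}\left\| \sum_k g_k A_k \right\|$ by Monte Carlo for a generic choice of fixed symmetric matrices $A_k$, and compares it with the NCK bound of Theorem 4.14; the dimensions and the number of trials are arbitrary.

```python
# Monte Carlo estimate of E||sum_k g_k A_k|| vs. the NCK bound sqrt(2 + 2 log(2d)) * sigma.
import numpy as np

rng = np.random.default_rng(5)
d, n, trials = 30, 100, 500
A = rng.standard_normal((n, d, d))
A = (A + A.transpose(0, 2, 1)) / 2                       # fixed symmetric A_k
sigma = np.sqrt(np.linalg.norm(sum(A[k] @ A[k] for k in range(n)), 2))
norms = []
for _ in range(trials):
    g = rng.standard_normal(n)
    norms.append(np.linalg.norm(np.tensordot(g, A, axes=1), 2))
print(f"Monte Carlo E||sum g_k A_k|| = {np.mean(norms):.2f}")
print(f"NCK bound sqrt(2 + 2 log(2d)) * sigma = {np.sqrt(2 + 2*np.log(2*d)) * sigma:.2f}")
```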

4.5 Optimality of matrix concentration result for gaussian series

The following simple calculation is suggestive that the parameter $\sigma$ in Theorem 4.14 is indeed the correct parameter to understand $\mathbb{E}\left\| \sum_{k=1}^n g_k A_k \right\|$:
$$\mathbb{E}\left\| \sum_{k=1}^n g_k A_k \right\|^2 = \mathbb{E}\left\| \left( \sum_{k=1}^n g_k A_k \right)^2 \right\| = \mathbb{E}\max_{v:\|v\|=1} v^T\left( \sum_{k=1}^n g_k A_k \right)^2 v \geq \max_{v:\|v\|=1} \mathbb{E}\, v^T\left( \sum_{k=1}^n g_k A_k \right)^2 v = \max_{v:\|v\|=1} v^T\left( \sum_{k=1}^n A_k^2 \right)v = \sigma^2. \qquad (4.8)$$
But a natural question is whether the logarithmic term is needed. Motivated by this question we will explore a couple of examples.

Example 4.17. We can write a $d\times d$ Wigner matrix $W$ as a Gaussian series by taking, for $i \leq j$, the matrices $A^{(ij)}$ defined as
$$A^{(ij)} = e_i e_j^T + e_j e_i^T, \text{ if } i \neq j, \quad \text{and} \quad A^{(ii)} = \sqrt{2}\, e_i e_i^T.$$
It is not difficult to see that, in this case,
$$\sum_{i\leq j} \left( A^{(ij)} \right)^2 = (d+1)\, I_{d\times d},$$
meaning that $\sigma = \sqrt{d+1}$. This means that Theorem 4.14 gives us
$$\mathbb{E}\|W\| \lesssim \sqrt{d\log d},$$

however, we know that $\mathbb{E}\|W\| \asymp \sqrt{d}$, meaning that the bound given by NCK (Theorem 4.14) is, in this case, suboptimal by a logarithmic factor. (By $a \asymp b$ we mean $a \lesssim b$ and $a \gtrsim b$.)

The next example will show that the logarithmic factor is in fact needed in some examples.

Example 4.18. Consider $A_k = e_k e_k^T \in \mathbb{R}^{d\times d}$ for $k = 1, \ldots, d$. The matrix $\sum_{k=1}^d g_k A_k$ corresponds to a diagonal matrix with independent standard Gaussian random variables as diagonal entries, and so its spectral norm is given by $\max_k |g_k|$. It is known that $\mathbb{E}\max_{1\leq k\leq d} |g_k| \asymp \sqrt{\log d}$. On the other hand, a direct calculation shows that $\sigma = 1$. This shows that the logarithmic factor cannot, in general, be removed.

This motivates the question of trying to understand when it is that the extra dimensional factor is needed. For both these examples, the resulting matrix $X = \sum_{k} g_k A_k$ has independent entries (except for the fact that it is symmetric). The case of independent entries [RS13, Seg00, Lat05, BvH15] is now somewhat understood:

Theorem 4.19 ([BvH15]). If $X$ is a $d\times d$ random symmetric matrix with independent Gaussian entries (except for the symmetry constraint) whose entry $i,j$ has variance $b_{ij}$, then
$$\mathbb{E}\|X\| \lesssim \max_{1\leq i\leq d} \sqrt{\sum_{j=1}^d b_{ij}} + \max_{ij} \sqrt{b_{ij}}\, \sqrt{\log d}.$$

Remark 4.20. $X$ in the theorem above can be written in terms of a Gaussian series by taking
$$A^{(ij)} = \sqrt{b_{ij}}\left( e_i e_j^T + e_j e_i^T \right), \text{ for } i < j, \quad \text{and} \quad A^{(ii)} = \sqrt{b_{ii}}\, e_i e_i^T.$$
One can then compute $\sigma$ and $\sigma_*$:
$$\sigma^2 = \max_{1\leq i\leq d} \sum_{j=1}^d b_{ij} \quad \text{and} \quad \sigma_*^2 \lesssim \max_{ij} b_{ij}.$$
This means that, when the random matrix in NCK (Theorem 4.14) has independent entries (modulo symmetry), then
$$\mathbb{E}\|X\| \lesssim \sigma + \sqrt{\log d}\, \sigma_*. \qquad (4.9)$$
Theorem 4.19, together with a recent improvement of Theorem 4.14 by Tropp [Tro15c] (we briefly discuss this improvement in Remark 4.32), motivates the bold possibility of (4.9) holding in more generality.

Conjecture 4.21. Let $A_1, \ldots, A_n \in \mathbb{R}^{d\times d}$ be symmetric matrices and $g_1, \ldots, g_n \sim \mathcal{N}(0,1)$ i.i.d., then:
$$\mathbb{E}\left\| \sum_{k=1}^n g_k A_k \right\| \lesssim \sigma + (\log d)^{1/2}\sigma_*.$$

While it may very well be that this Conjecture 4.21 is false, no counterexample is known to date.

Open Problem 4.1 (Improvement on Non-Commutative Khintchine Inequality). Prove or disprove Conjecture 4.21.

I would also be pretty excited to see interesting examples that satisfy the bound in Conjecture 4.21 while such a bound would not trivially follow from Theorems 4.14 or 4.19.

4.5.1 An interesting observation regarding random matrices with independent entries

For the independent entries setting, Theorem 4.19 is tight (up to constants) for a wide range of variance profiles $\{b_{ij}\}_{i\leq j}$ — the details are available as Corollary 3.15 in [BvH15]; the basic idea is that if the largest variance is comparable to the variance of a sufficient number of entries, then the bound in Theorem 4.19 is tight up to constants.

However, the situation is not as well understood when the variance profiles $\{b_{ij}\}_{i\leq j}$ are arbitrary. Since the spectral norm of a matrix is always at least the $\ell_2$ norm of a row, the following lower bound holds (for $X$ a symmetric random matrix with independent Gaussian entries):
$$\mathbb{E}\|X\| \geq \mathbb{E}\max_k \|X e_k\|.$$
Observations in papers of Latała [Lat05] and Riemer and Schütt [RS13], together with the results in [BvH15], motivate the conjecture that this lower bound is always tight (up to constants).

Open Problem 4.2 (Latała-Riemer-Schütt). Given $X$ a symmetric random matrix with independent Gaussian entries, is the following true?
$$\mathbb{E}\|X\| \lesssim \mathbb{E}\max_k \|X e_k\|.$$

The results in [BvH15] answer this in the positive for a large range of variance profiles, but not in full generality. Recently, van Handel [vH15] proved this conjecture in the positive with an extra factor of $\log\log d$. More precisely, that
$$\mathbb{E}\|X\| \lesssim \log\log d\; \mathbb{E}\max_k \|X e_k\|,$$
where $d$ is the number of rows (and columns) of $X$.

4.6 A matrix concentration inequality for Rademacher Series

In what follows, we closely follow [Tro15a] and present an elementary proof of a few useful matrix concentration inequalities. We start with a Master Theorem of sorts for Rademacher series (the Rademacher analogue of Theorem 4.14).

Theorem 4.22. Let $H_1, \ldots, H_n \in \mathbb{R}^{d\times d}$ be symmetric matrices and $\varepsilon_1, \ldots, \varepsilon_n$ i.i.d. Rademacher random variables (meaning $= +1$ with probability $1/2$ and $= -1$ with probability $1/2$), then:
$$\mathbb{E}\left\| \sum_{k=1}^n \varepsilon_k H_k \right\| \leq \left( 1 + 2\lceil \log d \rceil \right)^{1/2}\sigma, \quad \text{where } \sigma^2 = \left\| \sum_{k=1}^n H_k^2 \right\|. \qquad (4.10)$$

Before proving this theorem, we first take a small detour in discrepancy theory, followed by derivations, using this theorem, of a couple of useful matrix concentration inequalities.

4.6.1 A small detour on discrepancy theory

The following conjecture appears in a nice blog post of Raghu Meka [Mek14].

Conjecture 4.23 (Matrix Six-Deviations Suffice). There exists a universal constant $C$ such that, for any choice of $n$ symmetric matrices $H_1, \ldots, H_n \in \mathbb{R}^{n\times n}$ satisfying $\|H_k\| \leq 1$ (for all $k = 1, \ldots, n$), there exist $\varepsilon_1, \ldots, \varepsilon_n \in \{\pm 1\}$ such that
$$\left\| \sum_{k=1}^n \varepsilon_k H_k \right\| \leq C\sqrt{n}.$$

Open Problem 4.3. Prove or disprove Conjecture 4.23.

Note that, when the matrices $H_k$ are diagonal, this problem corresponds to Spencer's Six Standard Deviations Suffice Theorem [Spe85].

Remark 4.24. Also, using Theorem 4.22, it is easy to show that if one picks $\varepsilon_i$ as i.i.d. Rademacher random variables, then with positive probability (via the probabilistic method) the inequality will be satisfied with an extra $\sqrt{\log n}$ term. In fact one has
$$\mathbb{E}\left\| \sum_{k=1}^n \varepsilon_k H_k \right\| \lesssim \sqrt{\log n}\, \left\| \sum_{k=1}^n H_k^2 \right\|^{1/2} \leq \sqrt{\log n}\, \sqrt{n\max_k \left\| H_k^2 \right\|} \leq \sqrt{\log n}\, \sqrt{n}.$$

Remark 4.25. Remark 4.24 motivates asking whether Conjecture 4.23 can be strengthened to ask for $\varepsilon_1, \ldots, \varepsilon_n$ such that
$$\left\| \sum_{k=1}^n \varepsilon_k H_k \right\| \lesssim \left\| \sum_{k=1}^n H_k^2 \right\|^{1/2}. \qquad (4.11)$$
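
The following small simulation (not from the original notes) illustrates Remark 4.24: for random Rademacher signs and matrices with $\|H_k\| \leq 1$, the norm $\left\| \sum_k \varepsilon_k H_k \right\|$ stays below the guaranteed $O(\sqrt{n\log n})$ level; the particular $H_k$ used are an arbitrary illustrative choice.

```python
# Remark 4.24 guarantees E||sum_k eps_k H_k|| <~ sqrt(n log n) when ||H_k|| <= 1;
# this prints the relevant quantities for comparison.
import numpy as np

rng = np.random.default_rng(6)
n, trials = 100, 50
H = rng.standard_normal((n, n, n))
H = (H + H.transpose(0, 2, 1)) / 2
H = np.array([Hk / np.linalg.norm(Hk, 2) for Hk in H])   # normalize so that ||H_k|| = 1
norms = []
for _ in range(trials):
    eps = rng.choice([-1.0, 1.0], size=n)
    norms.append(np.linalg.norm(np.tensordot(eps, H, axes=1), 2))
print(f"sqrt(n) = {np.sqrt(n):.1f}, sqrt(n log n) = {np.sqrt(n*np.log(n)):.1f}, "
      f"mean ||sum eps_k H_k|| = {np.mean(norms):.1f}")
```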

4.6.2 Back to matrix concentration

Using Theorem 4.22, we will prove the following theorem.

Theorem 4.26. Let $T_1, \ldots, T_n \in \mathbb{R}^{d\times d}$ be random independent positive semidefinite matrices, then
$$\mathbb{E}\left\| \sum_{i=1}^n T_i \right\| \leq \left[ \left\| \sum_{i=1}^n \mathbb{E}\, T_i \right\|^{1/2} + \sqrt{C(d)}\left( \mathbb{E}\max_i \|T_i\| \right)^{1/2} \right]^2, \quad \text{where } C(d) := 4 + 8\lceil \log d \rceil. \qquad (4.12)$$

A key step in the proof of Theorem 4.26 is an idea that is extremely useful in probability, the trick of symmetrization. For this reason we isolate it in a lemma.

Lemma 4.27 (Symmetrization). Let $T_1, \ldots, T_n$ be independent random matrices (note that they don't necessarily need to be positive semidefinite, for the sake of this lemma) and $\varepsilon_1, \ldots, \varepsilon_n$ i.i.d. Rademacher random variables (independent also from the matrices). Then,
$$\mathbb{E}\left\| \sum_{i=1}^n T_i \right\| \leq \left\| \sum_{i=1}^n \mathbb{E}\, T_i \right\| + 2\,\mathbb{E}\left\| \sum_{i=1}^n \varepsilon_i T_i \right\|.$$

Proof. The triangular inequality gives
$$\mathbb{E}\left\| \sum_{i=1}^n T_i \right\| \leq \left\| \sum_{i=1}^n \mathbb{E}\, T_i \right\| + \mathbb{E}\left\| \sum_{i=1}^n \left( T_i - \mathbb{E}\, T_i \right) \right\|.$$
Let us now introduce, for each $i$, a random matrix $T'_i$ identically distributed to $T_i$ and independent of everything else (all $2n$ matrices are independent). Then,
$$\mathbb{E}_T\left\| \sum_{i=1}^n \left( T_i - \mathbb{E}\, T_i \right) \right\| = \mathbb{E}_T\left\| \sum_{i=1}^n \left( T_i - \mathbb{E}_{T'}\, T'_i \right) \right\| = \mathbb{E}_T\left\| \mathbb{E}_{T'}\sum_{i=1}^n \left( T_i - T'_i \right) \right\| \leq \mathbb{E}_T\,\mathbb{E}_{T'}\left\| \sum_{i=1}^n \left( T_i - T'_i \right) \right\|,$$
where we use the notation $\mathbb{E}_a$ to mean that the expectation is taken with respect to the variable $a$, and the last step follows from Jensen's inequality with respect to $T'$.

Since $T_i - T'_i$ is a symmetric random variable, it is identically distributed to $\varepsilon_i\left( T_i - T'_i \right)$, which gives
$$\mathbb{E}\left\| \sum_{i=1}^n \left( T_i - T'_i \right) \right\| = \mathbb{E}\left\| \sum_{i=1}^n \varepsilon_i\left( T_i - T'_i \right) \right\| \leq \mathbb{E}\left\| \sum_{i=1}^n \varepsilon_i T_i \right\| + \mathbb{E}\left\| \sum_{i=1}^n \varepsilon_i T'_i \right\| = 2\,\mathbb{E}\left\| \sum_{i=1}^n \varepsilon_i T_i \right\|,$$
concluding the proof.

Proof (of Theorem 4.26). Using Lemma 4.27 and Theorem 4.22 we get
$$\mathbb{E}\left\| \sum_{i=1}^n T_i \right\| \leq \left\| \sum_{i=1}^n \mathbb{E}\, T_i \right\| + \sqrt{C(d)}\,\mathbb{E}\left\| \sum_{i=1}^n T_i^2 \right\|^{1/2}.$$
The trick now is to make a term like the one on the left-hand side appear on the right-hand side. For that we start by noting (you can see Fact 3 in [Tro15a] for an elementary proof) that, since $T_i \succeq 0$,
$$\left\| \sum_{i=1}^n T_i^2 \right\| \leq \max_i \|T_i\| \left\| \sum_{i=1}^n T_i \right\|.$$
This means that
$$\mathbb{E}\left\| \sum_{i=1}^n T_i \right\| \leq \left\| \sum_{i=1}^n \mathbb{E}\, T_i \right\| + \sqrt{C(d)}\,\mathbb{E}\left[ \left( \max_i \|T_i\| \right)^{1/2} \left\| \sum_{i=1}^n T_i \right\|^{1/2} \right].$$
Further applying the Cauchy-Schwarz inequality for the expectation gives
$$\mathbb{E}\left\| \sum_{i=1}^n T_i \right\| \leq \left\| \sum_{i=1}^n \mathbb{E}\, T_i \right\| + \sqrt{C(d)}\left( \mathbb{E}\max_i \|T_i\| \right)^{1/2} \left( \mathbb{E}\left\| \sum_{i=1}^n T_i \right\| \right)^{1/2}.$$
Now that the term $\mathbb{E}\left\| \sum_{i=1}^n T_i \right\|$ appears on the right-hand side, the proof can be finished with a simple application of the quadratic formula (see Section 6.1 in [Tro15a] for details).
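
To see how a bound of this shape gets used, here is a small illustration (not from the original notes) with rank-one summands $T_i = x_i x_i^T$, the sample-covariance setting; the distribution of the $x_i$, the dimensions, and the value of $C(d)$ used below follow the reconstruction above and are otherwise arbitrary choices.

```python
# Illustration of Theorem 4.26 for T_i = x_i x_i^T with ||x_i|| = 1 and E T_i = I/d.
import numpy as np

rng = np.random.default_rng(7)
d, n = 50, 2000
X = rng.choice([-1.0, 1.0], size=(n, d)) / np.sqrt(d)    # rows x_i with ||x_i|| = 1
S = X.T @ X                                              # sum_i x_i x_i^T
sum_ET = n / d                                           # ||sum_i E T_i|| = ||(n/d) I|| = n/d
max_T = 1.0                                              # ||T_i|| = ||x_i||^2 = 1
C_d = 4 + 8 * np.ceil(np.log(d))
bound = (np.sqrt(sum_ET) + np.sqrt(C_d) * np.sqrt(max_T)) ** 2
print(f"||sum T_i|| = {np.linalg.norm(S, 2):.1f}, bound from (4.12) = {bound:.1f}")
```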

We now show an inequality for general symmetric matrices.

Theorem 4.28. Let $Y_1, \ldots, Y_n \in \mathbb{R}^{d\times d}$ be random independent centered symmetric matrices, then
$$\mathbb{E}\left\| \sum_{i=1}^n Y_i \right\| \leq \sqrt{C(d)}\,\sigma + C(d)\, L, \quad \text{where, as in (4.12), } C(d) := 4 + 8\lceil \log d \rceil, \text{ and}$$
$$\sigma^2 = \left\| \sum_{i=1}^n \mathbb{E}\, Y_i^2 \right\| \quad \text{and} \quad L^2 = \mathbb{E}\max_i \|Y_i\|^2. \qquad (4.13)$$

Proof. Using the Symmetrization Lemma 4.27 and Theorem 4.22, we get
$$\mathbb{E}\left\| \sum_{i=1}^n Y_i \right\| \leq 2\,\mathbb{E}_Y\left[ \mathbb{E}_\varepsilon\left\| \sum_{i=1}^n \varepsilon_i Y_i \right\| \right] \leq \sqrt{C(d)}\,\mathbb{E}\left\| \sum_{i=1}^n Y_i^2 \right\|^{1/2}.$$
Jensen's inequality gives
$$\mathbb{E}\left\| \sum_{i=1}^n Y_i^2 \right\|^{1/2} \leq \left( \mathbb{E}\left\| \sum_{i=1}^n Y_i^2 \right\| \right)^{1/2},$$
and the proof can be concluded by noting that $Y_i^2 \succeq 0$ and using Theorem 4.26.

Remark 4.29 (The rectangular case). One can extend Theorem 4.28 to general rectangular matrices $S_1, \ldots, S_n \in \mathbb{R}^{d_1\times d_2}$ by setting
$$Y_i = \begin{bmatrix} 0 & S_i \\ S_i^T & 0 \end{bmatrix},$$
and noting that $\|Y_i\| = \|S_i\|$ and
$$Y_i^2 = \begin{bmatrix} S_i S_i^T & 0 \\ 0 & S_i^T S_i \end{bmatrix}, \quad \text{so} \quad \left\| Y_i^2 \right\| = \max\left\{ \left\| S_i^T S_i \right\|, \left\| S_i S_i^T \right\| \right\} = \|S_i\|^2.$$
We defer the details to [Tro15a].

In order to prove Theorem 4.22, we will use an AM-GM like inequality for matrices for which, unlike the one in Open Problem 0.2 in [Ban15b], an elementary proof is known.

Lemma 4.30. Given symmetric matrices $H, W, Y \in \mathbb{R}^{d\times d}$ and non-negative integers $r, q$ satisfying $q \leq 2r$,
$$\operatorname{Tr}\left[ H W^q H Y^{2r-q} \right] + \operatorname{Tr}\left[ H W^{2r-q} H Y^q \right] \leq \operatorname{Tr}\left[ H^2\left( W^{2r} + Y^{2r} \right) \right],$$
and summing over $q$ gives
$$\sum_{q=0}^{2r} \operatorname{Tr}\left[ H W^q H Y^{2r-q} \right] \leq \frac{2r+1}{2}\operatorname{Tr}\left[ H^2\left( W^{2r} + Y^{2r} \right) \right].$$

We refer to Fact 4 in [Tro15a] for an elementary proof, but note that it is a matrix analogue of the inequality
$$\mu^\theta \lambda^{1-\theta} + \mu^{1-\theta}\lambda^\theta \leq \lambda + \mu,$$
for $\mu, \lambda \geq 0$ and $0 \leq \theta \leq 1$, which can be easily shown by adding the two AM-GM inequalities $\mu^\theta\lambda^{1-\theta} \leq \theta\mu + (1-\theta)\lambda$ and $\mu^{1-\theta}\lambda^\theta \leq (1-\theta)\mu + \theta\lambda$.

Proof (of Theorem 4.22). Let $X = \sum_{k=1}^n \varepsilon_k H_k$. Then, for any positive integer $p$,
$$\mathbb{E}\|X\| \leq \left( \mathbb{E}\|X\|^{2p} \right)^{\frac{1}{2p}} = \left( \mathbb{E}\left\| X^{2p} \right\| \right)^{\frac{1}{2p}} \leq \left( \mathbb{E}\operatorname{Tr} X^{2p} \right)^{\frac{1}{2p}},$$
where the first inequality follows from Jensen's inequality and the last from $X^{2p} \succeq 0$ and the observation that the trace of a positive semidefinite matrix is at least its spectral norm. In the sequel, we upper bound $\mathbb{E}\operatorname{Tr} X^{2p}$. We introduce $X_{+i}$ and $X_{-i}$ as $X$ conditioned on $\varepsilon_i$ being, respectively, $+1$ or $-1$. More precisely,
$$X_{+i} = H_i + \sum_{j\neq i} \varepsilon_j H_j \quad \text{and} \quad X_{-i} = -H_i + \sum_{j\neq i} \varepsilon_j H_j.$$
Then, we have
$$\mathbb{E}\operatorname{Tr} X^{2p} = \mathbb{E}\operatorname{Tr}\left[ X X^{2p-1} \right] = \mathbb{E}\operatorname{Tr}\left[ \sum_{i=1}^n \varepsilon_i H_i X^{2p-1} \right].$$
Note that
$$\mathbb{E}_{\varepsilon_i}\operatorname{Tr}\left[ \varepsilon_i H_i X^{2p-1} \right] = \frac{1}{2}\operatorname{Tr}\left[ H_i\left( X_{+i}^{2p-1} - X_{-i}^{2p-1} \right) \right];$$
this means that
$$\mathbb{E}\operatorname{Tr} X^{2p} = \sum_{i=1}^n \frac{1}{2}\,\mathbb{E}\operatorname{Tr}\left[ H_i\left( X_{+i}^{2p-1} - X_{-i}^{2p-1} \right) \right],$$
where the expectation can be taken over $\varepsilon_j$ for $j\neq i$.

Now we rewrite $X_{+i}^{2p-1} - X_{-i}^{2p-1}$ as a telescopic sum:
$$X_{+i}^{2p-1} - X_{-i}^{2p-1} = \sum_{q=0}^{2p-2} X_{+i}^q\left( X_{+i} - X_{-i} \right) X_{-i}^{2p-2-q}.$$
This gives
$$\mathbb{E}\operatorname{Tr} X^{2p} = \sum_{i=1}^n \sum_{q=0}^{2p-2} \frac{1}{2}\,\mathbb{E}\operatorname{Tr}\left[ H_i X_{+i}^q\left( X_{+i} - X_{-i} \right) X_{-i}^{2p-2-q} \right].$$
Since $X_{+i} - X_{-i} = 2H_i$, we get
$$\mathbb{E}\operatorname{Tr} X^{2p} = \sum_{i=1}^n \sum_{q=0}^{2p-2} \mathbb{E}\operatorname{Tr}\left[ H_i X_{+i}^q H_i X_{-i}^{2p-2-q} \right]. \qquad (4.14)$$
We now make use of Lemma 4.30 (see Remark 4.32 regarding the suboptimality of this step) to get
$$\mathbb{E}\operatorname{Tr} X^{2p} \leq \sum_{i=1}^n \frac{2p-1}{2}\,\mathbb{E}\operatorname{Tr}\left[ H_i^2\left( X_{+i}^{2p-2} + X_{-i}^{2p-2} \right) \right]. \qquad (4.15)$$

Hence,
$$\sum_{i=1}^n \frac{2p-1}{2}\,\mathbb{E}\operatorname{Tr}\left[ H_i^2\left( X_{+i}^{2p-2} + X_{-i}^{2p-2} \right) \right] = (2p-1)\sum_{i=1}^n \mathbb{E}\operatorname{Tr}\left[ H_i^2\,\mathbb{E}_{\varepsilon_i} X^{2p-2} \right] = (2p-1)\sum_{i=1}^n \mathbb{E}\operatorname{Tr}\left[ H_i^2 X^{2p-2} \right] = (2p-1)\,\mathbb{E}\operatorname{Tr}\left[ \left( \sum_{i=1}^n H_i^2 \right) X^{2p-2} \right].$$
Since $X^{2p-2} \succeq 0$, we have
$$\operatorname{Tr}\left[ \left( \sum_{i=1}^n H_i^2 \right) X^{2p-2} \right] \leq \left\| \sum_{i=1}^n H_i^2 \right\| \operatorname{Tr} X^{2p-2} = \sigma^2 \operatorname{Tr} X^{2p-2}, \qquad (4.16)$$
which gives
$$\mathbb{E}\operatorname{Tr} X^{2p} \leq (2p-1)\,\sigma^2\,\mathbb{E}\operatorname{Tr} X^{2p-2}. \qquad (4.17)$$
Applying this inequality recursively, we get
$$\mathbb{E}\operatorname{Tr} X^{2p} \leq \left[ (2p-1)(2p-3)\cdots 3\cdot 1 \right]\sigma^{2p}\,\mathbb{E}\operatorname{Tr} X^0 = (2p-1)!!\,\sigma^{2p}\, d.$$
Hence,
$$\mathbb{E}\|X\| \leq \left( \mathbb{E}\operatorname{Tr} X^{2p} \right)^{\frac{1}{2p}} \leq \left[ (2p-1)!! \right]^{\frac{1}{2p}}\sigma\, d^{\frac{1}{2p}}.$$
Taking $p = \lceil \log d \rceil$ and using the fact that $(2p-1)!! \leq \left( \frac{2p+1}{e} \right)^p$ (see [Tro15a] for an elementary proof, consisting essentially of taking logarithms and comparing the sum with an integral), we get
$$\mathbb{E}\|X\| \leq \left( \frac{2\lceil \log d \rceil + 1}{e} \right)^{1/2}\sigma\, d^{\frac{1}{2\lceil \log d \rceil}} \leq \left( 1 + 2\lceil \log d \rceil \right)^{1/2}\sigma.$$

Remark 4.31. A similar argument can be used to prove Theorem 4.14 (the Gaussian series case) based on Gaussian integration by parts; see Section 7 in [Tro15c].

Remark 4.32. Note that, up until the step from (4.14) to (4.15), all steps are equalities, suggesting that this step may be the lossy step responsible for the suboptimal dimensional factor in several cases (although (4.16) can also potentially be lossy, it is not uncommon that $\sum_i H_i^2$ is a multiple of the identity matrix, which would render this step also an equality). In fact, Joel Tropp [Tro15c] recently proved an improvement over the NCK inequality that, essentially, consists in replacing inequality (4.15) with a tighter argument. In a nutshell, the idea is that, if the $H_i$'s are non-commutative, most summands in (4.14) are actually expected to be smaller than the ones corresponding to $q = 0$ and $q = 2p-2$, which are the ones that appear in (4.15).

4.7 Other Open Problems

4.7.1 Oblivious Sparse Norm-Approximating Projections

There is an interesting random matrix problem related to Oblivious Sparse Norm-Approximating Projections (OSNAPs) [NN], a form of dimension reduction useful for fast linear algebra. In a nutshell, the idea is to try to find random matrices $\Pi$ that achieve dimension reduction, meaning $\Pi \in \mathbb{R}^{m\times n}$ with $m \ll n$, and that preserve the norm of every point in a certain subspace [NN]; moreover, for the sake of computational efficiency, these matrices should be sparse, to allow for faster matrix-vector multiplication. In some sense, this is a generalization of the ideas of the Johnson-Lindenstrauss Lemma and Gordon's Escape through the Mesh Theorem that we will discuss in the next section.

Open Problem 4.4 (OSNAP [NN]). Let $s \leq d \leq m \leq n$.

1. Let $\Pi \in \mathbb{R}^{m\times n}$ be a random matrix with i.i.d. entries
$$\Pi_{ri} = \frac{\delta_{ri}\sigma_{ri}}{\sqrt{s}},$$
where $\sigma_{ri}$ is a Rademacher random variable and
$$\delta_{ri} = \begin{cases} 1 & \text{with probability } \frac{s}{m} \\ 0 & \text{with probability } 1 - \frac{s}{m}. \end{cases}$$
Prove or disprove: there exist positive universal constants $c_1$ and $c_2$ such that, for any $U \in \mathbb{R}^{n\times d}$ for which $U^T U = I_{d\times d}$,
$$\operatorname{Prob}\left\{ \left\| (\Pi U)^T \Pi U - I \right\| \geq \varepsilon \right\} < \delta,$$
for $m \geq c_1 \frac{d + \log\frac{1}{\delta}}{\varepsilon^2}$ and $s \geq c_2 \frac{\log\frac{d}{\delta}}{\varepsilon}$.

2. Same setting as in 1., but conditioning on
$$\sum_{r=1}^m \delta_{ri} = s, \text{ for all } i,$$
meaning that each column of $\Pi$ has exactly $s$ non-zero elements, rather than $s$ on average. The conjecture is then slightly different: Prove or disprove: there exist positive universal constants $c_1$ and $c_2$ such that, for any $U \in \mathbb{R}^{n\times d}$ for which $U^T U = I_{d\times d}$, and for $m \geq c_1 \frac{d + \log\frac{1}{\delta}}{\varepsilon^2}$ and $s \geq c_2 \frac{\log\frac{d}{\delta}}{\varepsilon}$,
$$\operatorname{Prob}\left\{ \left\| (\Pi U)^T \Pi U - I \right\| \geq \varepsilon \right\} < \delta.$$

3. The conjecture in 1., but for the specific choice of $U$:
$$U = \begin{bmatrix} I_{d\times d} \\ 0_{(n-d)\times d} \end{bmatrix}.$$
In this case, the object in question is a sum of rank-1 independent matrices. More precisely, $z_1, \ldots, z_m \in \mathbb{R}^d$ (corresponding to the first $d$ coordinates of each of the $m$ rows of $\Pi$) are i.i.d. random vectors with i.i.d. entries
$$(z_k)_j = \begin{cases} -\frac{1}{\sqrt{s}} & \text{with probability } \frac{s}{2m} \\ 0 & \text{with probability } 1 - \frac{s}{m} \\ \frac{1}{\sqrt{s}} & \text{with probability } \frac{s}{2m}. \end{cases}$$
Note that $\mathbb{E}\, z_k z_k^T = \frac{1}{m} I_{d\times d}$. The conjecture is then that there exist positive universal constants $c_1$ and $c_2$ such that
$$\operatorname{Prob}\left\{ \left\| \sum_{k=1}^m \left[ z_k z_k^T - \mathbb{E}\, z_k z_k^T \right] \right\| \geq \varepsilon \right\} < \delta,$$
for $m \geq c_1 \frac{d + \log\frac{1}{\delta}}{\varepsilon^2}$ and $s \geq c_2 \frac{\log\frac{d}{\delta}}{\varepsilon}$.

I think this would be an interesting question even for fixed $\delta$, say $\delta = 0.1$, or even simply to understand the value of
$$\mathbb{E}\left\| \sum_{k=1}^m \left[ z_k z_k^T - \mathbb{E}\, z_k z_k^T \right] \right\|.$$
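
The following sketch (not from the original notes) draws the sparse random vectors $z_k$ of part 3 and evaluates the quantity in question; since $\sum_k \mathbb{E}\left[ z_k z_k^T \right] = I$, the deviation is simply $\left\| \sum_k z_k z_k^T - I \right\|$. All parameter values ($d$, $m$, $s$, number of trials) are arbitrary illustrative choices.

```python
# Quick empirical look at the quantity in part 3 of Open Problem 4.4.
import numpy as np

rng = np.random.default_rng(8)
d, m, s, trials = 50, 1000, 5, 20
devs = []
for _ in range(trials):
    mask = rng.random((m, d)) < s / m                   # delta_ri, nonzero w.p. s/m
    signs = rng.choice([-1.0, 1.0], size=(m, d))        # Rademacher signs
    Z = mask * signs / np.sqrt(s)                       # rows are the vectors z_k
    devs.append(np.linalg.norm(Z.T @ Z - np.eye(d), 2))
print(f"mean of ||sum_k z_k z_k^T - I|| over {trials} trials: {np.mean(devs):.3f}")
```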

4.7.2 k-lifts of graphs

Given a graph $G$ on $n$ nodes and with maximum degree $\Delta$, and an integer $k$, a random $k$-lift $G^{(k)}$ of $G$ is a graph on $kn$ nodes obtained by replacing each edge of $G$ by a random $k\times k$ bipartite matching. More precisely, the adjacency matrix $A^{(k)}$ of $G^{(k)}$ is an $nk\times nk$ matrix with $k\times k$ blocks given by
$$A^{(k)}_{ij} = A_{ij}\,\Pi_{ij},$$
where $\Pi_{ij}$ is uniformly randomly drawn from the set of permutation matrices on $k$ elements, and all the $\Pi_{ij}$ are independent, except for the constraint $\Pi_{ji} = \Pi_{ij}^T$. In other words,
$$A^{(k)} = \sum_{i<j} A_{ij}\left( e_i e_j^T \otimes \Pi_{ij} + e_j e_i^T \otimes \Pi_{ij}^T \right),$$
where $\otimes$ corresponds to the Kronecker product. Note that
$$\mathbb{E}\, A^{(k)} = A \otimes \left( \frac{1}{k} J \right),$$
where $J = 11^T$ is the $k\times k$ all-ones matrix.

Open Problem 4.5 (Random k-lifts of graphs). Give a tight upper bound to
$$\mathbb{E}\left\| A^{(k)} - \mathbb{E}\, A^{(k)} \right\|.$$
Oliveira [Oli10] gives a bound that is essentially of the form $\sqrt{\Delta\log(nk)}$, while the results in [ABG12] suggest that one may expect more concentration for large $k$. It is worth noting that the case of $k = 2$ can essentially be reduced to a problem where the entries of the random matrix are independent, and the results in [BvH15] can be applied to, in some cases, remove the logarithmic factor.

4.8 Another open problem

Feige [Fei05] posed the following remarkable conjecture (we thank Francisco Unda and Philippe Rigollet for suggesting this problem):

Conjecture 4.33. Given $n$ independent random variables $X_1, \ldots, X_n$ such that, for all $i$, $X_i \geq 0$ and $\mathbb{E}\, X_i = 1$, we have
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i < n + 1 \right\} \geq e^{-1}.$$

Note that, if the $X_i$ are i.i.d. and $X_i = n+1$ with probability $\frac{1}{n+1}$ and $X_i = 0$ otherwise, then
$$\operatorname{Prob}\left\{ \sum_{i=1}^n X_i < n+1 \right\} = \left( 1 - \frac{1}{n+1} \right)^n \approx e^{-1}.$$

Open Problem 4.6. Prove or disprove Conjecture 4.33.

References

[ABG12] L. Addario-Berry and S. Griffiths. The spectrum of random lifts. Available at arXiv [math.CO], 2012.

[Ban15a] A. S. Bandeira. Relax and Conquer BLOG: 18.S096 Concentration Inequalities, Scalar and Matrix Versions. 2015.

[Ban15b] A. S. Bandeira. Relax and Conquer BLOG: Ten Lectures and Forty-two Open Problems in Mathematics of Data Science. 2015.

[BvH15] A. S. Bandeira and R. van Handel. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. Annals of Probability, to appear, 2015.

[Dur06] R. Durrett. Random Graph Dynamics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press, New York, NY, USA, 2006.

[Fei05] U. Feige. On sums of independent random variables with unbounded variance, and estimating the average degree in a graph. 2005.

[Lat05] R. Latała. Some estimates of norms of random matrices. Proc. Amer. Math. Soc., 133(5), electronic, 2005.

[LM00] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Ann. Statist., 2000.

[Mas00] P. Massart. About the constants in Talagrand's concentration inequalities for empirical processes. The Annals of Probability, 28, 2000.

[Mek14] R. Meka. Windows on Theory BLOG: Discrepancy and Beating the Union Bound. http://windowsontheory.org/2014/02/07/discrepancy-and-beating-the-union-bound/, 2014.

[NN] J. Nelson and L. Nguyen. OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings. Available at arXiv [cs.DS].

[Oli10] R. I. Oliveira. The spectrum of random k-lifts of large graphs (with possibly large k). Journal of Combinatorics, 2010.

[Pis03] G. Pisier. Introduction to Operator Space Theory, volume 294 of London Mathematical Society Lecture Note Series. Cambridge University Press, Cambridge, 2003.

[RS13] S. Riemer and C. Schütt. On the expectation of the norm of random matrices with non-identically distributed entries. Electron. J. Probab., 18, 2013.

[Seg00] Y. Seginer. The expected norm of random matrices. Combin. Probab. Comput., 9, 2000.

[Spe85] J. Spencer. Six standard deviations suffice. Trans. Amer. Math. Soc., 289, 1985.

[Tal95] M. Talagrand. Concentration of measure and isoperimetric inequalities in product spaces. Inst. Hautes Études Sci. Publ. Math., 81:73-205, 1995.

[Tao12] T. Tao. Topics in Random Matrix Theory. Graduate Studies in Mathematics. American Mathematical Society, 2012.

[Tro12] J. A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4), 2012.

[Tro15a] J. A. Tropp. The expected norm of a sum of independent random matrices: An elementary approach. Available at arXiv [math.PR], 2015.

[Tro15b] J. A. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning, 2015.

[Tro15c] J. A. Tropp. Second-order matrix concentration inequalities. In preparation, 2015.

[vH14] R. van Handel. Probability in high dimensions. ORF 570 Lecture Notes, Princeton University, 2014.

[vH15] R. van Handel. On the spectral norm of inhomogeneous random matrices. Available online at arXiv [math.PR], 2015.


Fast Angular Synchronization for Phase Retrieval via Incomplete Information Fast Angular Synchronization for Phase Retrieval via Incomplete Information Aditya Viswanathan a and Mark Iwen b a Department of Mathematics, Michigan State University; b Department of Mathematics & Department

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Assignment 4: Solutions

Assignment 4: Solutions Math 340: Discrete Structures II Assignment 4: Solutions. Random Walks. Consider a random walk on an connected, non-bipartite, undirected graph G. Show that, in the long run, the walk will traverse each

More information

Constructive bounds for a Ramsey-type problem

Constructive bounds for a Ramsey-type problem Constructive bounds for a Ramsey-type problem Noga Alon Michael Krivelevich Abstract For every fixed integers r, s satisfying r < s there exists some ɛ = ɛ(r, s > 0 for which we construct explicitly an

More information

Random matrices: Distribution of the least singular value (via Property Testing)

Random matrices: Distribution of the least singular value (via Property Testing) Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued

More information

Tail Inequalities Randomized Algorithms. Sariel Har-Peled. December 20, 2002

Tail Inequalities Randomized Algorithms. Sariel Har-Peled. December 20, 2002 Tail Inequalities 497 - Randomized Algorithms Sariel Har-Peled December 0, 00 Wir mssen wissen, wir werden wissen (We must know, we shall know) David Hilbert 1 Tail Inequalities 1.1 The Chernoff Bound

More information

CUT-OFF FOR QUANTUM RANDOM WALKS. 1. Convergence of quantum random walks

CUT-OFF FOR QUANTUM RANDOM WALKS. 1. Convergence of quantum random walks CUT-OFF FOR QUANTUM RANDOM WALKS AMAURY FRESLON Abstract. We give an introduction to the cut-off phenomenon for random walks on classical and quantum compact groups. We give an example involving random

More information

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5

Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5 Semidefinite and Second Order Cone Programming Seminar Fall 2001 Lecture 5 Instructor: Farid Alizadeh Scribe: Anton Riabov 10/08/2001 1 Overview We continue studying the maximum eigenvalue SDP, and generalize

More information

Graph Sparsification III: Ramanujan Graphs, Lifts, and Interlacing Families

Graph Sparsification III: Ramanujan Graphs, Lifts, and Interlacing Families Graph Sparsification III: Ramanujan Graphs, Lifts, and Interlacing Families Nikhil Srivastava Microsoft Research India Simons Institute, August 27, 2014 The Last Two Lectures Lecture 1. Every weighted

More information

Spectral Graph Theory Lecture 2. The Laplacian. Daniel A. Spielman September 4, x T M x. ψ i = arg min

Spectral Graph Theory Lecture 2. The Laplacian. Daniel A. Spielman September 4, x T M x. ψ i = arg min Spectral Graph Theory Lecture 2 The Laplacian Daniel A. Spielman September 4, 2015 Disclaimer These notes are not necessarily an accurate representation of what happened in class. The notes written before

More information

Generalized eigenvector - Wikipedia, the free encyclopedia

Generalized eigenvector - Wikipedia, the free encyclopedia 1 of 30 18/03/2013 20:00 Generalized eigenvector From Wikipedia, the free encyclopedia In linear algebra, for a matrix A, there may not always exist a full set of linearly independent eigenvectors that

More information

Invertibility of random matrices

Invertibility of random matrices University of Michigan February 2011, Princeton University Origins of Random Matrix Theory Statistics (Wishart matrices) PCA of a multivariate Gaussian distribution. [Gaël Varoquaux s blog gael-varoquaux.info]

More information

CS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms

CS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms CS261: A Second Course in Algorithms Lecture #18: Five Essential Tools for the Analysis of Randomized Algorithms Tim Roughgarden March 3, 2016 1 Preamble In CS109 and CS161, you learned some tricks of

More information

On Systems of Diagonal Forms II

On Systems of Diagonal Forms II On Systems of Diagonal Forms II Michael P Knapp 1 Introduction In a recent paper [8], we considered the system F of homogeneous additive forms F 1 (x) = a 11 x k 1 1 + + a 1s x k 1 s F R (x) = a R1 x k

More information

Matrices and Vectors

Matrices and Vectors Matrices and Vectors James K. Peterson Department of Biological Sciences and Department of Mathematical Sciences Clemson University November 11, 2013 Outline 1 Matrices and Vectors 2 Vector Details 3 Matrix

More information

October 25, 2013 INNER PRODUCT SPACES

October 25, 2013 INNER PRODUCT SPACES October 25, 2013 INNER PRODUCT SPACES RODICA D. COSTIN Contents 1. Inner product 2 1.1. Inner product 2 1.2. Inner product spaces 4 2. Orthogonal bases 5 2.1. Existence of an orthogonal basis 7 2.2. Orthogonal

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms 南京大学 尹一通 Martingales Definition: A sequence of random variables X 0, X 1,... is a martingale if for all i > 0, E[X i X 0,...,X i1 ] = X i1 x 0, x 1,...,x i1, E[X i X 0 = x 0, X 1

More information

1 Large Deviations. Korea Lectures June 2017 Joel Spencer Tuesday Lecture III

1 Large Deviations. Korea Lectures June 2017 Joel Spencer Tuesday Lecture III Korea Lectures June 017 Joel Spencer Tuesday Lecture III 1 Large Deviations We analyze how random variables behave asymptotically Initially we consider the sum of random variables that take values of 1

More information

Random Matrices: Invertibility, Structure, and Applications

Random Matrices: Invertibility, Structure, and Applications Random Matrices: Invertibility, Structure, and Applications Roman Vershynin University of Michigan Colloquium, October 11, 2011 Roman Vershynin (University of Michigan) Random Matrices Colloquium 1 / 37

More information

Signal Recovery from Permuted Observations

Signal Recovery from Permuted Observations EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,

More information

Positive entries of stable matrices

Positive entries of stable matrices Positive entries of stable matrices Shmuel Friedland Department of Mathematics, Statistics and Computer Science, University of Illinois at Chicago Chicago, Illinois 60607-7045, USA Daniel Hershkowitz,

More information