arxiv: v1 [math.st] 27 Sep 2018


Robust covariance estimation under $L_4$-$L_2$ norm equivalence

Shahar Mendelson, Nikita Zhivotovskiy

Abstract

Let $X$ be a centered random vector taking values in $\mathbb{R}^d$ and let $\Sigma = \mathbb{E}(X \otimes X)$ be its covariance matrix. We show that if $X$ satisfies an $L_4$-$L_2$ norm equivalence, there is a covariance estimator $\hat{\Sigma}$ that exhibits the optimal performance one would expect had $X$ been a gaussian vector. The procedure also improves the current state-of-the-art regarding high probability bounds in the subgaussian case (sharp results were only known in expectation or with constant probability). In both scenarios the new bound does not depend explicitly on the dimension $d$, but rather on the effective rank of the covariance matrix $\Sigma$.

1 Introduction

The question of estimating the covariance of a random vector has been studied extensively in recent years (see, e.g., [2, 4, 10, 11, 12] and references therein). To formulate the problem, let $X$ be a zero mean random vector taking its values in $\mathbb{R}^d$ and denote the covariance matrix by $\Sigma = \mathbb{E}(X \otimes X)$. Given a sample $X_1,\dots,X_N$ consisting of independent random vectors that are distributed according to $X$, the goal is to select a matrix $\hat{\Sigma}$ that approximates $\Sigma$. While there are various notions of approximation, the focus of this note is on approximation with respect to the ($\ell_2 \to \ell_2$) operator norm, which from here on we denote by $\|\cdot\|$.

One way of viewing the question of covariance estimation (with respect to any norm) is as a mean estimation problem. Indeed, if one sets $W = X \otimes X$, then $\mathbb{E}W = \Sigma$, and since one is given a sample $X_1,\dots,X_N$, the vectors $(X_i \otimes X_i)_{i=1}^N$ are independent copies of $W$. Thus, a matrix $\hat{W}$ that is a good approximation of the mean $\mathbb{E}W$ with respect to the underlying norm is a solution to the problem of estimating the covariance of $X$ with respect to that norm.
An immediate outcome of this simple observation is that the empirical mean $\hat{\Sigma} = \frac{1}{N}\sum_{i=1}^N W_i = \frac{1}{N}\sum_{i=1}^N X_i \otimes X_i$, which is the trivial choice for estimating the true mean, is a poor estimator unless the random vector $W$ has a nice tail behavior (see, for example, the discussion in [5]). An example of a positive result of that flavour is Theorem 9 in [2], and to formulate it we need the following definition.

Author affiliations. Shahar Mendelson: Mathematical Sciences Institute, The Australian National University and Department of Mathematics, Technion, I.I.T, shahar.mendelson@gmail.com. Nikita Zhivotovskiy: Department of Mathematics, Technion, I.I.T, Skoltech and Higher School of Economics, nikita.zhivotovskiy@phystech.edu.

Definition 1.1. The effective rank of a positive semidefinite square matrix $A \in \mathbb{R}^{d \times d}$ is given by
\[ r(A) = \frac{\mathrm{Tr}(A)}{\|A\|}. \tag{1.1} \]
Clearly, $r(A) \le d$, but the gap between $r(A)$ and $d$ may be substantial.

Theorem 1.2 ([2]). For every $L \ge 1$ there exists a constant $c(L)$ for which the following holds. Let $X$ be an $L$-subgaussian random vector. Then with probability at least $1-\delta$,
\[ \Bigl\| \frac{1}{N}\sum_{i=1}^N X_i \otimes X_i - \Sigma \Bigr\| \le c(L)\,\|\Sigma\| \Bigl( \sqrt{\frac{r(\Sigma)}{N}} + \frac{r(\Sigma)}{N} + \sqrt{\frac{\log(2/\delta)}{N}} + \frac{\log(2/\delta)}{N} \Bigr). \tag{1.2} \]

It was shown in [2] that if $G$ is a zero mean gaussian vector (and in particular it satisfies the conditions of Theorem 1.2) with covariance $\Sigma$ then
\[ \mathbb{E}\Bigl\| \frac{1}{N}\sum_{i=1}^N G_i \otimes G_i - \Sigma \Bigr\| \sim \|\Sigma\| \max\Bigl\{ \sqrt{\frac{r(\Sigma)}{N}},\ \frac{r(\Sigma)}{N} \Bigr\}. \]
Hence, there is no room for improvement in the deviation estimate of the empirical mean from the true one at the constant confidence level. Of course, that does not imply that the empirical mean is an optimal covariance estimator, even for a gaussian vector and at a constant level of confidence. Moreover, as we explain in what follows, there are far better covariance estimators than (1.2) when the confidence parameter $\delta$ is small.

Just as in the one-dimensional mean-estimation problem, once the problem is more heavy-tailed the performance of the empirical mean deteriorates quickly and a different procedure has to be used. That is also the case for covariance estimation. The current state-of-the-art for covariance estimation in heavy-tailed situations is [12] (see Corollary 4.1 there and similar results in [10, 11]), in which $X$ is assumed to satisfy an $L_4$-$L_2$ norm equivalence.

Definition 1.3. A random vector $X$ with mean $\mu$ satisfies an $L_4$-$L_2$ norm equivalence with constant $L$ if for every $t \in \mathbb{R}^d$,
\[ \bigl( \mathbb{E}\langle X-\mu, t\rangle^4 \bigr)^{1/4} \le L \bigl( \mathbb{E}\langle X-\mu, t\rangle^2 \bigr)^{1/2}. \]

Remark 1.4. Note that if $X$ is $L$-subgaussian then it satisfies an $L_4$-$L_2$ norm equivalence with constant $2L$.

Theorem 1.5 ([12]). For every $L \ge 1$ there are constants $c(L)$ and $c'(L)$ that depend only on $L$ and for which the following holds. Let $X$ satisfy an $L_4$-$L_2$ norm equivalence with constant $L$.
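For $0 < \delta < 1$ there is an estimator $\hat{\Sigma}_\delta$ that satisfies
\[ \|\hat{\Sigma}_\delta - \Sigma\| \le c(L)\,\|\Sigma\| \sqrt{\frac{r(\Sigma)\bigl(\log d + \log(1/\delta)\bigr)}{N}} \tag{1.3} \]
with probability at least $1-\delta$, provided that $N \ge c'(L)\, r(\Sigma)\bigl(\log d + \log(2/\delta)\bigr)$.

Remark 1.6. Let us mention that the procedure from [12] requires prior information on the values of $\|\Sigma\|$ and $r(\Sigma)$ up to some absolute multiplicative constant, an assumption we shall return to in what follows.

Recall that a random vector $X$ with values in $\mathbb{R}^d$ and with mean $\mu$ is $L$-subgaussian if for every $t \in \mathbb{R}^d$ and every $p \ge 2$,
\[ \bigl( \mathbb{E}|\langle X-\mu, t\rangle|^p \bigr)^{1/p} \le L\sqrt{p}\, \bigl( \mathbb{E}\langle X-\mu, t\rangle^2 \bigr)^{1/2}. \]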
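To make the gap between $r(A)$ and $d$ in Definition 1.1 concrete, here is a minimal sketch in plain Python. The helper names `operator_norm_psd` and `effective_rank` are illustrative and not taken from the paper; the operator norm is approximated by power iteration, which assumes the starting vector is not orthogonal to the top eigenvector.

```python
import math


def operator_norm_psd(a, iters=500):
    """Approximate the largest eigenvalue of a symmetric PSD matrix
    (given as a list of rows) by power iteration."""
    d = len(a)
    v = [1.0 / math.sqrt(d)] * d  # uniform unit start vector (an assumption)
    lam = 0.0
    for _ in range(iters):
        w = [sum(a[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm_w = math.sqrt(sum(x * x for x in w))
        if norm_w == 0.0:
            return 0.0
        v = [x / norm_w for x in w]
        lam = norm_w  # ||A v|| for a unit vector v converges to ||A||
    return lam


def effective_rank(a):
    """r(A) = Tr(A) / ||A|| as in Definition 1.1."""
    norm = operator_norm_psd(a)
    if norm == 0.0:
        raise ValueError("effective rank is undefined for the zero matrix")
    trace = sum(a[i][i] for i in range(len(a)))
    return trace / norm


# Diagonal covariance in dimension d = 100: one direction of variance 1
# and 99 directions of variance 0.01.
d = 100
sigma = [[(1.0 if i == 0 else 0.01) if i == j else 0.0 for j in range(d)]
         for i in range(d)]
rank = effective_rank(sigma)
```

In this toy case the trace is 1.99 and the operator norm is 1, so the effective rank stays below 2 although the ambient dimension is 100.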

Note that if $\delta$ is smaller than $1/d$, the error guaranteed by Theorem 1.5 is of the order of
\[ \|\Sigma\| \sqrt{\frac{r(\Sigma)}{N}}\, \sqrt{\log(2/\delta)}, \tag{1.4} \]
which will turn out to be far from optimal. To put (1.4) in some perspective, let us examine possible benchmarks for general mean estimation problems and see how those compare with (1.2), (1.3) and (1.4) when applied to covariance estimation.

1.1 Optimality in mean estimation

Let $W$ be a random vector with mean $\mu$ and set $\|\cdot\|$ to be an arbitrary norm. Let $B^\circ$ be the unit ball of the dual norm to $\|\cdot\|$, and denote by $\hat{\mu}$ a mean-estimator constructed using an independent sample $W_1,\dots,W_N$. As it happens, a lower bound on the performance of $\hat{\mu}$ is
\[ \frac{R}{\sqrt{N}} \sqrt{\log(2/\delta)}, \tag{1.5} \]
where
\[ R = \sup_{x^* \in B^\circ} \bigl( \mathbb{E}\bigl(x^*(W-\mu)\bigr)^2 \bigr)^{1/2}. \tag{1.6} \]
Indeed, for every $x^* \in B^\circ$,
\[ \|\hat{\mu} - \mu\| \ge |x^*(\hat{\mu} - \mu)| = |x^*(\hat{\mu}) - x^*(\mu)|; \]
therefore, if there is a procedure for which $\|\hat{\mu} - \mu\| \le \varepsilon$ with probability $1-\delta$, then on the same event the procedure automatically performs with accuracy $\varepsilon$ and confidence $1-\delta$ for each one of the real-valued mean-estimation problems associated with the random variables $x^*(W)$, $x^* \in B^\circ$. By a lower bound from [1] on real-valued mean estimation problems, the best possible mean-estimation error for each $x^*(W)$ is $\sqrt{\mathrm{var}(x^*(W))/N}\,\sqrt{\log(2/\delta)}$, and taking the worst $x^* \in B^\circ$ leads to (1.5).

Although (1.5) is part of the story, it is unlikely to be the whole story. Intuitively, (1.5) takes into account the effect of one-dimensional marginals of $W$ rather than the entire geometry of the distribution. It stands to reason that an additional global parameter is called for, one that reflects the entire structure of $W$ and the geometry of the norm. Moreover, that parameter should reflect the difficulty of the estimation problem at the constant confidence level. To give an example of such a result, a (sharp) lower bound from [] on the mean estimation problem when $W$ is a gaussian random vector is the following: if $\|\hat{\mu} - \mu\| \le \varepsilon$ with probability at least $1-\delta$ then
\[ \varepsilon \ge \frac{c}{\sqrt{N}} \Bigl( \mathbb{E}\|W - \mu\| + R\sqrt{\log(2/\delta)} \Bigr); \tag{1.7} \]
hence, the global parameter in the gaussian case is just the mean $\mathbb{E}\|W - \mu\|$.
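Let us examine (1.7) more carefully, in the hope that it would lead us towards the right answer for general random vectors. Note that by setting $\delta = \exp(-p)$, the gaussian random variable $W$ satisfies that
\[ \sqrt{\log(2/\delta)}\, \bigl( \mathbb{E}(x^*(W-\mu))^2 \bigr)^{1/2} \sim \sqrt{p}\, \bigl( \mathbb{E}(x^*(W-\mu))^2 \bigr)^{1/2} \sim \bigl( \mathbb{E}|x^*(W-\mu)|^p \bigr)^{1/p}. \]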

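At the same time, the strong-weak moment inequality (footnote 2) for gaussian vectors (see, e.g., [3]) implies that
\[
\begin{aligned}
\Bigl( \mathbb{E}\Bigl\| \frac{1}{\sqrt{N}}\sum_{i=1}^N (W_i - \mu) \Bigr\|^p \Bigr)^{1/p}
&\le \mathbb{E}\Bigl\| \frac{1}{\sqrt{N}}\sum_{i=1}^N (W_i - \mu) \Bigr\| + c \sup_{x^* \in B^\circ} \Bigl( \mathbb{E}\Bigl| x^*\Bigl( \frac{1}{\sqrt{N}}\sum_{i=1}^N (W_i - \mu) \Bigr) \Bigr|^p \Bigr)^{1/p} \\
&= \mathbb{E}\|W - \mu\| + c \sup_{x^* \in B^\circ} \bigl( \mathbb{E}|x^*(W-\mu)|^p \bigr)^{1/p} \\
&\le \mathbb{E}\|W - \mu\| + c'\sqrt{p} \sup_{x^* \in B^\circ} \bigl( \mathbb{E}(x^*(W-\mu))^2 \bigr)^{1/2},
\end{aligned}
\]
where $c$ and $c'$ are absolute constants. Thus, the lower bound of (1.7) implies that the best possible performance of a mean estimator of a gaussian vector matches a strong-weak moment inequality.

This leads to a natural conjecture: that the best possible performance in a general mean estimation problem is given by a gaussian-like strong-weak moment inequality, and that there is a procedure that performs with that accuracy/confidence tradeoff.

Recently, a general mean estimation procedure was introduced in [5] that exhibits this type of strong-weak behavior. To formulate the result, let $W$ be an arbitrary random vector taking values in $\mathbb{R}^d$ and with mean $\mu$, let $G$ be the zero mean gaussian random vector with the same covariance as $W$ and set $Y_N = \frac{1}{\sqrt{N}}\sum_{i=1}^N (W_i - \mu)$, where $W_1,\dots,W_N$ are independent copies of $W$. Let $\|\cdot\|$ be a norm, set $B^\circ$ to be the unit ball of the dual norm, and put $R = \sup_{x^* \in B^\circ} \bigl( \mathbb{E}(x^*(W-\mu))^2 \bigr)^{1/2}$.

Theorem 1.7 ([5]). For $0 < \delta < 1$ there is a procedure $\hat{\mu}_\delta$ such that, with probability at least $1-\delta$,
\[ \|\hat{\mu}_\delta - \mu\| \le \frac{c}{\sqrt{N}} \max\Bigl\{ \mathbb{E}\|Y_N\|,\ \mathbb{E}\|G\| + R\sqrt{\log(2/\delta)} \Bigr\}. \]

The mean estimation procedure is defined as follows: let $T = \mathrm{ext}(B^\circ)$ be the set of extreme points in $B^\circ$. For the wanted confidence parameter $0 < \delta < 1$, let $n = \log(2/\delta)$ and set $m = N/n$. Let $(I_j)_{j=1}^n$ be the natural partition of $\{1,\dots,N\}$ into blocks of cardinality $m$ and, given a sample $W_1,\dots,W_N$, set $Z_j = \frac{1}{m}\sum_{i \in I_j} W_i$. For $x^* \in T$ and $\varepsilon > 0$, set
\[ S_{x^*}(\varepsilon) = \bigl\{ y \in \mathbb{R}^d : |x^*(y) - x^*(Z_j)| \le \varepsilon \ \text{for more than } n/2 \text{ blocks} \bigr\}, \]
and define $S(\varepsilon) = \bigcap_{x^* \in T} S_{x^*}(\varepsilon)$. Set $\varepsilon_0 = \inf\{\varepsilon > 0 : S(\varepsilon) \ne \emptyset\}$, and let $\hat{\mu}_\delta$ be any vector in $\bigcap_{\varepsilon > \varepsilon_0} S(\varepsilon)$.

Footnote 2. By strong moment we mean the $L_1$ norm of $\|W - \mu\|$, while the weak moment is just the largest $L_p$ norm of a marginal $x^*(W - \mu)$ for $x^* \in B^\circ$.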
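The block construction above can be sketched in the simplest possible setting, a real-valued sample with the absolute value as the norm, where the extreme points of the dual ball are just $\pm 1$. In that case $S(\varepsilon)$ is nonempty exactly when some interval of length $2\varepsilon$ contains more than $n/2$ block means, so one valid choice of the estimator is the midpoint of the shortest such interval. The function names, the rounding of $n$ to an integer and the dropping of leftover sample points are assumptions made for this illustration, not part of the source.

```python
import math


def block_means(sample, n_blocks):
    """Means Z_j of n_blocks consecutive blocks of equal size m.
    Leftover points beyond n_blocks * m are dropped (an assumption)."""
    m = len(sample) // n_blocks
    return [sum(sample[j * m:(j + 1) * m]) / m for j in range(n_blocks)]


def tournament_mean_1d(sample, delta):
    """One-dimensional instance of the procedure: return a point y that is
    within the smallest possible eps of more than half of the block means."""
    n = max(1, math.ceil(math.log(2.0 / delta)))  # integer version of n = log(2/delta)
    n = min(n, len(sample))
    z = sorted(block_means(sample, n))
    k = n // 2 + 1  # "more than n/2 blocks"
    # Shortest window of k consecutive sorted block means; its half-width is eps_0.
    start = min(range(n - k + 1), key=lambda i: z[i + k - 1] - z[i])
    return 0.5 * (z[start] + z[start + k - 1])


# 99 clean observations equal to 1 and a single gross outlier.
sample = [1.0] * 99 + [1000.0]
robust = tournament_mean_1d(sample, delta=0.05)
naive = sum(sample) / len(sample)
```

With `delta = 0.05` the sample is split into 4 blocks of 25 points; three block means equal 1 and one is contaminated, so the procedure returns 1 while the empirical mean is pulled to about 10.99.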

The main result of this note (which is formulated in the next section) is that the right application of Theorem 1.7 leads to an (almost) optimal covariance estimator: the procedure performs as if $X$ were a gaussian vector even if it only satisfies an $L_4$-$L_2$ norm equivalence, and the accuracy/confidence tradeoff obeys the strong-weak inequality one would expect.

1.2 From mean estimation to covariance estimation

In what follows, we assume without loss of generality that $X$ is symmetric, not only zero mean. We may do so because if $X$ is a centered random vector and $X'$ is an independent copy of $X$ then $(X - X')/\sqrt{2}$ is symmetric and has the same covariance as $X$.

The natural choice of a random vector in Theorem 1.7 is $W = X \otimes X$, but as it happens, a better alternative is to use a truncated version of $X$ instead of the original one:

Definition 1.8. Let
\[ \alpha = \Bigl( \frac{N\, \mathrm{Tr}(\Sigma)\, \|\Sigma\|}{\gamma} \Bigr)^{1/4}, \]
and let $\tilde{X} = X \mathbb{1}_{\{\|X\|_2 \le \alpha\}}$. In the $L$-subgaussian case we set $\gamma = 1$, and when $X$ only satisfies an $L_4$-$L_2$ norm equivalence, let $\gamma = \log r(\Sigma)$. We also denote $\tilde{\Sigma} = \mathbb{E}(\tilde{X} \otimes \tilde{X})$.

The main result of this note is that the procedure described in Theorem 1.7 for $W = \tilde{X} \otimes \tilde{X}$ is an optimal (or very close to optimal) covariance estimation procedure. Specifically, we prove the following, which improves both (1.2) and (1.3).

Theorem 1.9. Let $X$ be a zero mean random vector with an (unknown) covariance matrix $\Sigma$, denote by $\|\cdot\|$ the operator norm and, using the notation of Definition 1.8, set
\[ R^2 = \sup_{u,v \in S^{d-1}} \mathbb{E}\bigl( v^T (\tilde{X} \otimes \tilde{X} - \tilde{\Sigma}) u \bigr)^2, \]
where $S^{d-1}$ is the Euclidean unit sphere in $\mathbb{R}^d$. For any $0 < \delta < 1$, there is a procedure that receives as data the sample $X_1,\dots,X_N$, returns a matrix $\hat{\Sigma}_\delta$ and satisfies that:

(1) If $X$ is $L$-subgaussian then with probability at least $1-\delta$,
\[ \|\hat{\Sigma}_\delta - \Sigma\| \le c(L) \Bigl[ \|\Sigma\| \Bigl( \sqrt{\frac{r(\Sigma)}{N}} + \frac{r(\Sigma)}{N} \Bigr) + \frac{R}{\sqrt{N}} \sqrt{\log(2/\delta)} \Bigr]; \]

(2) If $X$ satisfies an $L_4$-$L_2$ norm equivalence with constant $L$ and $N \ge c'(L)\, r(\Sigma) \log r(\Sigma)$ then with probability at least $1-\delta$,
\[ \|\hat{\Sigma}_\delta - \Sigma\| \le c(L) \Bigl( \|\Sigma\| \sqrt{\frac{r(\Sigma) \log r(\Sigma)}{N}} + \frac{R}{\sqrt{N}} \sqrt{\log(2/\delta)} \Bigr). \tag{1.8} \]

In both cases $R \le c(L)\|\Sigma\|$, and $c(L)$, $c'(L)$ are constants that depend only on $L$.

Remark 1.10.
Note that the estimates in Theorem 1.9 do not depend on the dimension $d$; instead, they depend only on $r(\Sigma)$, which may be small even if $d$ tends to infinity. This is important in view of recent results on covariance estimation in Banach spaces [2].
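As an illustration of the truncation step in Definition 1.8, where $\alpha = (N\,\mathrm{Tr}(\Sigma)\|\Sigma\|/\gamma)^{1/4}$ and a sample point is kept only if its Euclidean norm is at most $\alpha$, the following sketch may help. The function names are hypothetical, and the values of $\mathrm{Tr}(\Sigma)$ and $\|\Sigma\|$ are treated as known inputs; in practice they would only be available up to multiplicative constants.

```python
import math


def truncation_level(trace, op_norm, n_samples, gamma):
    """alpha = (N * Tr(Sigma) * ||Sigma|| / gamma) ** (1/4).
    gamma = 1 in the subgaussian case and gamma = log r(Sigma) under
    L4-L2 norm equivalence (which needs r(Sigma) > 1 so that gamma > 0)."""
    if gamma <= 0:
        raise ValueError("gamma must be positive")
    return (n_samples * trace * op_norm / gamma) ** 0.25


def truncate(x, alpha):
    """Return x if its Euclidean norm is at most alpha, else the zero vector."""
    norm = math.sqrt(sum(c * c for c in x))
    return list(x) if norm <= alpha else [0.0] * len(x)


# Toy numbers: Tr(Sigma) = 4, ||Sigma|| = 1, N = 4 and gamma = 1.
alpha = truncation_level(trace=4.0, op_norm=1.0, n_samples=4, gamma=1.0)
kept = truncate([1.0, 1.0], alpha)     # norm sqrt(2) does not exceed alpha
dropped = truncate([3.0, 0.0], alpha)  # norm 3 exceeds alpha
```

Note that the truncation replaces a large observation by the zero vector rather than rescaling it, which is what the indicator in the definition of $\tilde{X}$ prescribes.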

The estimate in Theorem 1.9 is actually a strong-weak moment inequality as if $X$ were gaussian (at least up to the logarithmic term in (1.8)). Indeed, let $G$ be the zero mean gaussian random vector that has the same covariance as $X$ and set $N \ge r(\Sigma)$. As noted previously,
\[ \|\Sigma\| \sqrt{\frac{r(\Sigma)}{N}} \sim \mathbb{E}\Bigl\| \frac{1}{N}\sum_{i=1}^N G_i \otimes G_i - \Sigma \Bigr\|, \]
with the left-hand side being the strong term from Theorem 1.9. Moreover, the term involving $R$ is actually the natural weak term associated with the operator norm. Indeed, recall the well-known fact that the dual norm to the operator norm is the nuclear norm. Since a linear functional $z$ acts on the matrix $x$ via trace duality, that is, $z(x) = [z,x] := \mathrm{Tr}(z^T x)$, it follows from [14] that the extreme points of the dual unit ball $B^\circ$ are $\{u \otimes v : u,v \in S^{d-1}\}$. Thus,
\[ R^2 = \sup_{x^* \in B^\circ} \mathbb{E}\bigl( x^*(\tilde{X} \otimes \tilde{X} - \tilde{\Sigma}) \bigr)^2 = \sup_{u,v \in S^{d-1}} \mathbb{E}\bigl( v^T(\tilde{X} \otimes \tilde{X} - \tilde{\Sigma})u \bigr)^2, \]
and in particular, by (1.5) the weak term $(R/\sqrt{N})\sqrt{\log(2/\delta)}$ appearing in Theorem 1.9 is sharp. Thus, up to the logarithmic factor in (2), Theorem 1.9 implies that the estimator $\hat{\Sigma}_\delta$ performs as if $X$ were gaussian, even though it can be very far from gaussian.

Let us compare the outcome of Theorem 1.9 to the current state of the art we mentioned previously. In the subgaussian setup Theorem 1.9 improves Theorem 1.2 because there are situations in which $R$ is significantly smaller than $\|\Sigma\|$ (see such an example in what follows). Under an $L_4$-$L_2$ norm equivalence scenario the improvement is more dramatic: on top of an improvement in the logarithmic factor appearing in the strong term, the weak term $(R/\sqrt{N})\sqrt{\log(2/\delta)}$ is significantly smaller than the corresponding estimate of $\|\Sigma\|\sqrt{r(\Sigma)/N}\,\sqrt{\log(2/\delta)}$ from Theorem 1.5.

The proof of Theorem 1.9 is presented in the following section. We end this introduction with some notation. Throughout, absolute constants are denoted by $c, c', \dots$; their value may change from line to line. Constants that depend on a parameter $L$ are denoted by $c(L)$; $a \lesssim b$ means that there is an absolute constant $c$ such that $a \le cb$, and $a \sim b$ means that $cb \le a \le c'b$.
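When the constants depend on $L$ we write $a \lesssim_L b$ and $a \sim_L b$ respectively. Finally, we define the $\psi_2$-norm of a real valued random variable $Y$ as $\|Y\|_{\psi_2} = \inf\{c > 0 : \mathbb{E}\exp(Y^2/c^2) \le 2\}$. In what follows we use the well known fact that
\[ \|Y\|_{\psi_2} \sim \sup_{p \ge 2} \frac{(\mathbb{E}|Y|^p)^{1/p}}{\sqrt{p}}. \]

2 Proof of Theorem 1.9

We require several observations on properties of $\tilde{X}$. First, note that by the symmetry of $X$, $\tilde{X}$ is symmetric as well. Second, for every $p \ge 2$ and any $u \in \mathbb{R}^d$,
\[ \|\langle \tilde{X}, u\rangle\|_{L_p} = \bigl( \mathbb{E}|\langle \tilde{X}, u\rangle|^p \bigr)^{1/p} \le \bigl( \mathbb{E}|\langle X, u\rangle|^p \bigr)^{1/p}. \]
Therefore, if $X$ is $L$-subgaussian then $\|\langle \tilde{X}, u\rangle\|_{L_p} \le L\sqrt{p}\, \|\langle X, u\rangle\|_{L_2}$, and if $X$ satisfies an $L_4$-$L_2$ norm equivalence with constant $L$ then $\|\langle \tilde{X}, u\rangle\|_{L_4} \le L \|\langle X, u\rangle\|_{L_2}$.

More important features of $\tilde{X}$ have to do with its covariance matrix $\tilde{\Sigma}$: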

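Lemma 2.1. Assume that $X$ is zero mean and satisfies an $L_4$-$L_2$ norm equivalence with constant $L$. Using the notation of Definition 1.8 we have that
\[ \|\tilde{\Sigma} - \Sigma\| \le c(L)\, \|\Sigma\| \sqrt{\frac{\gamma\, r(\Sigma)}{N}}, \tag{2.1} \]
and
\[ |\mathrm{Tr}(\tilde{\Sigma}) - \mathrm{Tr}(\Sigma)| \le c(L)\, \mathrm{Tr}(\Sigma) \sqrt{\frac{\gamma\, r(\Sigma)}{N}}, \tag{2.2} \]
where $c(L)$ is a constant that depends only on $L$.

Proof. Observe that
\[
\begin{aligned}
\|\Sigma - \tilde{\Sigma}\| &= \sup_{u,v \in S^{d-1}} u^T \bigl( \mathbb{E}(X \otimes X) - \mathbb{E}(\tilde{X} \otimes \tilde{X}) \bigr) v
\le 2 \sup_{u,v \in S^{d-1}} \mathbb{E}\bigl| \langle X,u\rangle \langle X - \tilde{X}, v\rangle \bigr| \\
&= 2 \sup_{u,v \in S^{d-1}} \mathbb{E}\bigl| \langle X,u\rangle \langle X,v\rangle \bigr| \mathbb{1}_{\{\|X\|_2 > \alpha\}}
\le 2 \sup_{u,v \in S^{d-1}} \bigl( \mathbb{E}\langle X,u\rangle^4 \bigr)^{1/4} \bigl( \mathbb{E}\langle X,v\rangle^4 \bigr)^{1/4} \Pr{}^{1/2}(\|X\|_2 > \alpha).
\end{aligned}
\]
By the $L_4$-$L_2$ norm equivalence,
\[ \sup_{u \in S^{d-1}} \bigl( \mathbb{E}\langle X,u\rangle^4 \bigr)^{1/4} \le L \sup_{u \in S^{d-1}} \bigl( \mathbb{E}\langle X,u\rangle^2 \bigr)^{1/2} = L\|\Sigma\|^{1/2}, \]
and, applying the Cauchy-Schwarz inequality to each term and then the norm equivalence to each of the two factors,
\[
\begin{aligned}
\mathbb{E}\|X\|_2^4 &= \mathbb{E}\Bigl( \sum_{i=1}^d \langle X,e_i\rangle^2 \Bigr)^2 = \sum_{i,j} \mathbb{E}\langle X,e_i\rangle^2 \langle X,e_j\rangle^2
\le \sum_{i,j} \bigl( \mathbb{E}\langle X,e_i\rangle^4 \bigr)^{1/2} \bigl( \mathbb{E}\langle X,e_j\rangle^4 \bigr)^{1/2} \\
&\le L^4 \sum_{i,j} \mathbb{E}\langle X,e_i\rangle^2\, \mathbb{E}\langle X,e_j\rangle^2 = L^4 \sum_{i,j} \Sigma_{ii}\Sigma_{jj} = L^4 \bigl( \mathrm{Tr}(\Sigma) \bigr)^2.
\end{aligned}
\tag{2.3}
\]
Recalling the definition of $\alpha$, we have that
\[ \Pr{}^{1/2}(\|X\|_2 > \alpha) \le \Bigl( \frac{\mathbb{E}\|X\|_2^4}{\alpha^4} \Bigr)^{1/2} \le \Bigl( \frac{L^4 \gamma \bigl(\mathrm{Tr}(\Sigma)\bigr)^2}{N\, \mathrm{Tr}(\Sigma)\, \|\Sigma\|} \Bigr)^{1/2} = L^2 \sqrt{\frac{\gamma\, \mathrm{Tr}(\Sigma)}{N \|\Sigma\|}} = L^2 \sqrt{\frac{\gamma\, r(\Sigma)}{N}}, \tag{2.4} \]
and combining the two observations,
\[ \|\Sigma - \tilde{\Sigma}\| \le c(L)\, \|\Sigma\| \sqrt{\frac{\gamma\, r(\Sigma)}{N}}, \tag{2.5} \]
as claimed.

Turning to the second part of the lemma, note that
\[ \mathrm{Tr}(\Sigma) = \sum_{i=1}^d \mathbb{E}\langle X,e_i\rangle^2 \quad \text{and} \quad \mathrm{Tr}(\tilde{\Sigma}) = \sum_{i=1}^d \mathbb{E}\langle X,e_i\rangle^2 \mathbb{1}_{\{\|X\|_2 \le \alpha\}}. \]
Therefore, by the $L_4$-$L_2$ norm equivalence and (2.4),
\[
\begin{aligned}
|\mathrm{Tr}(\tilde{\Sigma}) - \mathrm{Tr}(\Sigma)| &= \sum_{i=1}^d \mathbb{E}\langle X,e_i\rangle^2 \mathbb{1}_{\{\|X\|_2 > \alpha\}}
\le \sum_{i=1}^d \bigl( \mathbb{E}\langle X,e_i\rangle^4 \bigr)^{1/2} \Pr{}^{1/2}(\|X\|_2 > \alpha) \\
&\le L^2 \sum_{i=1}^d \bigl( \mathbb{E}\langle X,e_i\rangle^2 \bigr) \Pr{}^{1/2}(\|X\|_2 > \alpha) \le c(L)\, \mathrm{Tr}(\Sigma) \sqrt{\frac{\gamma\, r(\Sigma)}{N}}.
\end{aligned}
\]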

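Clearly, by the first part of Lemma 2.1 it suffices to address the covariance estimation problem for the random vector $\tilde{X}$, since
\[ \|\hat{\Sigma}_\delta - \Sigma\| \le \|\hat{\Sigma}_\delta - \tilde{\Sigma}\| + \|\tilde{\Sigma} - \Sigma\|, \]
and $\|\tilde{\Sigma} - \Sigma\|$ is smaller than the wanted accuracy. Thus, from here on we set $W = \tilde{X} \otimes \tilde{X} \in \mathbb{R}^{d \times d}$, and the norm $\|\cdot\|$ is the operator norm. As a result, the estimation procedure of Theorem 1.7 is:

Let $X_1,\dots,X_N$ be the given sample, let $\tilde{X}_i = X_i \mathbb{1}_{\{\|X_i\|_2 \le \alpha\}}$ where $\alpha$ is given in Definition 1.8, and set $0 < \delta < 1$.
Let $n = \log(2/\delta)$ and split the sample into $n$ blocks $I_j$, each of cardinality $m = N/n$; set $M_j = \frac{1}{m}\sum_{i \in I_j} \tilde{X}_i \otimes \tilde{X}_i$.
Let $T = \{(u,v) : u,v \in S^{d-1}\}$ and for $\varepsilon > 0$ and a pair $(u,v)$ let
\[ S_{u,v}(\varepsilon) = \bigl\{ Y \in \mathbb{R}^{d \times d} : |v^T(M_j - Y)u| \le \varepsilon \ \text{for more than } n/2 \text{ blocks} \bigr\}. \]
Set $S(\varepsilon) = \bigcap_{(u,v) \in T} S_{u,v}(\varepsilon)$.
Let $\varepsilon_0 = \inf\{\varepsilon > 0 : S(\varepsilon) \ne \emptyset\}$, and choose $\hat{\Sigma}_\delta$ to be any matrix in $\bigcap_{\varepsilon > \varepsilon_0} S(\varepsilon)$.

Thanks to Theorem 1.7, the proof of Theorem 1.9 follows once sufficient control on $\mathbb{E}\|Y_N\|$, $\mathbb{E}\|G\|$ and $R$ is established in the two cases we are interested in.

Controlling $R$

The required estimate on $R$ is presented in the next lemma.

Lemma 2.2. Assume that $X$ is zero mean and satisfies an $L_4$-$L_2$ norm equivalence with constant $L$. Then
\[ R \le v(X) \le L^2 \|\Sigma\|, \]
where $v^2(X) = \sup_{v \in S^{d-1}} \mathbb{E}\langle X,v\rangle^4$.

Proof. For every $u,v \in S^{d-1}$,
\[
\begin{aligned}
\mathbb{E}\bigl( v^T(\tilde{X} \otimes \tilde{X} - \tilde{\Sigma})u \bigr)^2 &= \mathbb{E}\langle \tilde{X},v\rangle^2 \langle \tilde{X},u\rangle^2 - (v^T \tilde{\Sigma} u)^2 \le \mathbb{E}\langle \tilde{X},v\rangle^2 \langle \tilde{X},u\rangle^2 \\
&\le \mathbb{E}\langle X,v\rangle^2 \langle X,u\rangle^2 \le \bigl( \mathbb{E}\langle X,v\rangle^4 \bigr)^{1/2} \bigl( \mathbb{E}\langle X,u\rangle^4 \bigr)^{1/2},
\end{aligned}
\]
where we have used the fact that $\mathbb{E}\langle \tilde{X},v\rangle \langle \tilde{X},u\rangle = v^T \tilde{\Sigma} u$. Thus, $R \le v(X)$. Also, recalling that $X$ satisfies an $L_4$-$L_2$ norm equivalence,
\[ \mathbb{E}\langle X,v\rangle^4 \le L^4 \bigl( \mathbb{E}\langle X,v\rangle^2 \bigr)^2 \le L^4 \|\Sigma\|^2, \]
implying that $v(X) \le L^2 \|\Sigma\|$, as claimed.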

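Controlling $\mathbb{E}\|G\|$ and $\mathbb{E}\|Y_N\|$

In the context of Theorem 1.7, $G$ is the zero mean gaussian vector on $\mathbb{R}^{d \times d}$ whose covariance coincides with that of $W = \tilde{X} \otimes \tilde{X}$. Instead of dealing with that vector directly, note that
\[ \mathbb{E}\|G\| \le \liminf_{N \to \infty} \mathbb{E}\|Y_N\|. \tag{2.6} \]
Indeed,
\[ \mathbb{E}\|G\| = \sup\Bigl\{ \mathbb{E}\max_{x^* \in T'} x^*(G) : T' \subset B^\circ,\ T' \text{ is finite} \Bigr\}, \]
and by the multivariate CLT, $\bigl\{ N^{-1/2}\sum_{i=1}^N x^*(W_i - \mathbb{E}W) : x^* \in T' \bigr\}$ converges weakly to $\{x^*(G) : x^* \in T'\}$. Hence, (2.6) follows from tail integration.

Thanks to (2.6), all that remains is to bound $\mathbb{E}\|Y_N\|$.

The subgaussian case

Fix an integer $N$ and note that
\[ \Bigl\| \frac{1}{N}\sum_{i=1}^N \tilde{X}_i \otimes \tilde{X}_i - \tilde{\Sigma} \Bigr\| = \sup_{u \in S^{d-1}} \Bigl| \frac{1}{N}\sum_{i=1}^N \langle \tilde{X}_i,u\rangle^2 - \mathbb{E}\langle \tilde{X},u\rangle^2 \Bigr|, \tag{2.7} \]
which is the supremum of a quadratic empirical process indexed by $S^{d-1}$. Such empirical processes have been studied extensively (see, e.g., [6, 7, 8]), mainly using chaining methods. As it happens, quadratic subgaussian processes may be controlled in terms of a natural metric invariant of the indexing class, the so-called $\gamma_2$ functional (footnote 3). In the case of (2.7), the indexing class is $S^{d-1}$, whose elements are viewed as linear functionals on $\mathbb{R}^d$, and the underlying metric is the $\psi_2$ norm endowed by the random vector $\tilde{X}$. By Corollary 1.9 from [8] we have that
\[ \mathbb{E}\sup_{u \in S^{d-1}} \Bigl| \frac{1}{N}\sum_{i=1}^N \langle \tilde{X}_i,u\rangle^2 - \mathbb{E}\langle \tilde{X}_i,u\rangle^2 \Bigr| \le c\Bigl( D\, \frac{\gamma_2(S^{d-1},\psi_2(\tilde{X}))}{\sqrt{N}} + \frac{\gamma_2^2(S^{d-1},\psi_2(\tilde{X}))}{N} \Bigr), \tag{2.8} \]
where $c$ is an absolute constant and
\[ D = D(S^{d-1},\psi_2) = \sup_{u \in S^{d-1}} \|\langle \tilde{X},u\rangle\|_{\psi_2} \sim \sup_{u \in S^{d-1}} \sup_{p \ge 2} \frac{(\mathbb{E}|\langle \tilde{X},u\rangle|^p)^{1/p}}{\sqrt{p}}. \]

Footnote 3. Rather than defining the $\gamma_2$ functional, we refer the reader to [15] for a detailed exposition on the topic, and to [8, 6, 7] for the study of the quadratic empirical process in this and more general situations.

To estimate (2.8) one requires two facts (see, e.g., [15] for more details). Firstly, a general property of the $\gamma_2$ functional is monotonicity in $d$: if $(T,d)$ is a metric space and $d'$ is another metric on $T$ which satisfies that for every $t_1,t_2 \in T$, $d(t_1,t_2) \le \kappa d'(t_1,t_2)$, then $\gamma_2(T,d) \le \kappa \gamma_2(T,d')$. Here, we have that for every $p \ge 2$ and every $u \in \mathbb{R}^d$,
\[ \bigl( \mathbb{E}|\langle \tilde{X},u\rangle|^p \bigr)^{1/p} \le \bigl( \mathbb{E}|\langle X,u\rangle|^p \bigr)^{1/p} \le L\sqrt{p}\, \bigl( \mathbb{E}\langle X,u\rangle^2 \bigr)^{1/2}, \]
implying that $\|\langle \tilde{X},u\rangle\|_{\psi_2} \lesssim L \|\langle X,u\rangle\|_{L_2}$; hence, $\gamma_2(S^{d-1},\psi_2(\tilde{X})) \lesssim L \gamma_2(S^{d-1},L_2(X))$.

Secondly, by Talagrand's majorizing measures theorem, if $G$ is a zero mean gaussian random vector with the same covariance as $X$ then
\[ \gamma_2(S^{d-1},L_2(X)) \le c\, \mathbb{E}\sup_{u \in S^{d-1}} \langle G,u\rangle \le c \bigl( \mathbb{E}\|G\|_2^2 \bigr)^{1/2} = c\sqrt{\mathrm{Tr}(\Sigma)}, \]
for a suitable absolute constant $c$. Finally, again thanks to the fact that $X$ is $L$-subgaussian,
\[ D \lesssim L \sup_{u \in S^{d-1}} \|\langle X,u\rangle\|_{L_2} = L\|\Sigma\|^{1/2}. \]
Therefore, by (2.8), for every $N$,
\[ \mathbb{E}\|Y_N\| \le c(L)\Bigl( \|\Sigma\|^{1/2}\sqrt{\mathrm{Tr}(\Sigma)} + \frac{\mathrm{Tr}(\Sigma)}{\sqrt{N}} \Bigr), \]
and in particular, $\liminf_{N \to \infty} \mathbb{E}\|Y_N\| \le c(L)\|\Sigma\|^{1/2}\sqrt{\mathrm{Tr}(\Sigma)}$. This completes the proof of the first part of Theorem 1.9.

$L_4$-$L_2$ norm equivalence

Just as in the subgaussian case, the key issue is finding a suitable estimate on $\mathbb{E}\|Y_N\|$. Thanks to the fact that $\tilde{X}$ is a truncated random vector one may apply a version of the matrix Bernstein inequality. We invoke a corollary from the survey [16] (which is a slightly modified version of the original result from [9]): if $Z$ is a random vector which satisfies that $\|Z \otimes Z\| \le \beta$ almost surely, and $B = \mathbb{E}(Z \otimes Z)^2$, then
\[ \mathbb{E}\Bigl\| \frac{1}{\sqrt{N}}\sum_{i=1}^N \bigl( Z_i \otimes Z_i - \mathbb{E}(Z \otimes Z) \bigr) \Bigr\| \le c\Bigl( \sqrt{\|B\| \log r(B)} + \frac{\beta \log r(B)}{\sqrt{N}} \Bigr). \tag{2.9} \]
In our case, $Z = X \mathbb{1}_{\{\|X\|_2 \le \alpha\}}$ for $\alpha$ as in Definition 1.8, and all that remains is to estimate $\|B\|$ and $r(B)$. It is straightforward to verify that
\[ c\, \|\tilde{\Sigma}\|\, \mathrm{Tr}(\tilde{\Sigma}) \le \|B\| \le c'(L)\, \|\Sigma\|\, \mathrm{Tr}(\Sigma) \quad \text{and} \quad \mathrm{Tr}(B) \le c'(L) \bigl( \mathrm{Tr}(\Sigma) \bigr)^2: \]
indeed, the upper estimates on $\|B\|$ and $\mathrm{Tr}(B)$ follow from a direct computation and the fact that $X$ satisfies an $L_4$-$L_2$ norm equivalence (see, e.g., Lemma 4.1 in [12]); the lower estimate is an outcome of the FKG inequality (see Corollary 5.1 in the supplementary material to [11]). Moreover, by Lemma 2.1 and using its notation, both $\|\tilde{\Sigma}\|$ and $\mathrm{Tr}(\tilde{\Sigma})$ are equivalent to $\|\Sigma\|$ and $\mathrm{Tr}(\Sigma)$ respectively, as long as $N \ge c(L)\gamma\, r(\Sigma)$; hence, $r(B) \lesssim_L r(\Sigma)$. Finally, observe that $\|Z \otimes Z\| = \|Z\|_2^2 \le \alpha^2$. By (2.9), the choice of the truncation level $\alpha$, and the fact that $N \gtrsim_L r(\Sigma)\log r(\Sigma)$,
\[ \mathbb{E}\|Y_N\| \le c(L)\Bigl( \|\Sigma\|^{1/2}\sqrt{\mathrm{Tr}(\Sigma)\log r(\Sigma)} + \frac{\alpha^2 \log r(\Sigma)}{\sqrt{N}} \Bigr) \le c(L)\, \|\Sigma\| \sqrt{r(\Sigma)\log r(\Sigma)}. \]
In particular, $\liminf_{N \to \infty} \mathbb{E}\|Y_N\| \le c(L)\sqrt{\|\Sigma\|\, \mathrm{Tr}(\Sigma) \log r(\Sigma)}$, which completes the proof of the second part of Theorem 1.9.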
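The last inequality in the proof may be easier to follow once the role of the truncation level is written out. The following worked step is a sketch under the reading that $\gamma = \log r(\Sigma)$ and $\alpha^4 = N\,\mathrm{Tr}(\Sigma)\|\Sigma\|/\gamma$ in Definition 1.8, and that the second term of the Bernstein-type bound is $\beta \log r(B)/\sqrt{N}$ with $\beta = \alpha^2$; it only uses the identity $\mathrm{Tr}(\Sigma) = r(\Sigma)\|\Sigma\|$.

```latex
\[
\alpha^2 = \sqrt{\frac{N\,\mathrm{Tr}(\Sigma)\,\|\Sigma\|}{\log r(\Sigma)}}
\quad\Longrightarrow\quad
\frac{\alpha^2 \log r(\Sigma)}{\sqrt{N}}
= \sqrt{\mathrm{Tr}(\Sigma)\,\|\Sigma\|\,\log r(\Sigma)}
= \|\Sigma\|\sqrt{r(\Sigma)\log r(\Sigma)},
\]
\[
\|\Sigma\|^{1/2}\sqrt{\mathrm{Tr}(\Sigma)\log r(\Sigma)}
= \|\Sigma\|^{1/2}\sqrt{r(\Sigma)\,\|\Sigma\|\,\log r(\Sigma)}
= \|\Sigma\|\sqrt{r(\Sigma)\log r(\Sigma)}.
\]
```

Both terms are therefore of the same order, which suggests that the truncation level is chosen precisely to balance the variance term and the almost-sure bound in the matrix Bernstein inequality.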

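Concluding remarks

The drawback of our estimator is that it requires prior information on $\mathrm{Tr}(\Sigma)$ and $\|\Sigma\|$. This issue has already been addressed in [11, 12] using Lepski's method. The alternative we present is to handle the problem by constructing appropriate median of means estimators $\hat{\varphi}_1$ and $\hat{\varphi}_2$, and for our purpose it suffices that $\hat{\varphi}_1 \sim \mathrm{Tr}(\Sigma)$ and $\hat{\varphi}_2 \sim \|\Sigma\|$ with high probability. The freedom to estimate the quantities in question up to an absolute multiplicative constant simplifies the problem considerably.

Consider the problem of trace estimation. Since $\mathrm{Tr}(\Sigma) = \mathbb{E}\sum_{i=1}^d \langle X,e_i\rangle^2$, a standard median-of-means estimator $\hat{\varphi}_1$ of $\mathbb{E}\sum_{i=1}^d \langle X,e_i\rangle^2$ satisfies that with probability at least $1-\delta$,
\[ |\hat{\varphi}_1 - \mathrm{Tr}(\Sigma)| \le c\sqrt{ \mathrm{Var}\Bigl( \sum_{i=1}^d \langle X,e_i\rangle^2 \Bigr) \frac{\log(1/\delta)}{N} } \]
(see [13] for what is by now a standard argument). Using (2.3) we have $\mathrm{Var}\bigl( \sum_{i=1}^d \langle X,e_i\rangle^2 \bigr) \le (L^4 - 1)\,\mathrm{Tr}(\Sigma)^2$. Therefore,
\[ |\hat{\varphi}_1 - \mathrm{Tr}(\Sigma)| \le c(L)\, \mathrm{Tr}(\Sigma) \sqrt{\frac{\log(1/\delta)}{N}}, \]
which implies that in the regime $N \ge c'(L)\log(1/\delta)$, one has $\hat{\varphi}_1 \sim \mathrm{Tr}(\Sigma)$ with probability at least $1-\delta$.

The estimation of $\|\Sigma\|$ may be addressed in a similar fashion. Because it is not the focus of this note and for the sake of brevity we just sketch an argument that leads to a bound that depends on the dimension $d$, rather than on $r(\Sigma)$. The more accurate estimate can be derived from Theorem 2 in [5]. Let $\mathcal{N}$ be a minimal $1/4$ cover of $S^{d-1}$ with respect to the Euclidean norm. Thus, $\|\Sigma\| \sim \sup_{u \in \mathcal{N}} u^T \Sigma u$. For any fixed $u \in \mathcal{N}$ the median of means estimator $\hat{\varphi}_{2,u}$ of $\mathbb{E}\, u^T (X \otimes X) u$ satisfies that with probability at least $1-\delta$,
\[ |\hat{\varphi}_{2,u} - u^T \Sigma u| \le c(L)\, \|\Sigma\| \sqrt{\frac{\log(1/\delta)}{N}}, \]
because $\mathrm{Var}\bigl( u^T (X \otimes X) u \bigr) \le L^4 \|\Sigma\|^2$. Finally, recalling that $|\mathcal{N}| \le 9^d$, the union bound shows that with probability at least $1-\delta$,
\[ \sup_{u \in \mathcal{N}} |\hat{\varphi}_{2,u} - u^T \Sigma u| \le c'(L)\, \|\Sigma\| \sqrt{\frac{d + \log(1/\delta)}{N}}. \]
Therefore, when $N \ge c''(L)(\log(1/\delta) + d)$, one has that $\sup_{u \in \mathcal{N}} \hat{\varphi}_{2,u} \sim \|\Sigma\|$ with probability at least $1-\delta$.

Finally, let us give an example showing that there could be a substantial gap between $R$ and $\|\Sigma\|$ (as well as between $R$ and $v(X)$), which is a reason for the sub-optimality of Theorem 1.2 (Theorem 9 in [2]).

Example 2.3.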
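Let $X = (X^{(1)},\dots,X^{(d)})$ where $X^{(i)} = \alpha_i \varepsilon_i$; $(\varepsilon_i)_{i=1}^d$ are independent, symmetric, $\{-1,1\}$-valued random variables; and $\alpha_1 > \dots > \alpha_d \ge 0$. Since the $X^{(i)}$'s are centered, independent and subgaussian with an absolute constant, $X$ is a centered, $L$-subgaussian random vector for some absolute constant $L$. If we set $\Sigma = \mathbb{E}(X \otimes X)$ then clearly $\|\Sigma\| = \alpha_1^2$, $r(\Sigma) = \sum_{i=1}^d \alpha_i^2/\alpha_1^2$ and
\[
\begin{aligned}
\mathbb{E}\bigl( v^T(X \otimes X - \Sigma)u \bigr)^2 &= \mathbb{E}\Bigl( \sum_{i \ne j} v_i u_j X^{(i)} X^{(j)} \Bigr)^2 = \sum_{i \ne j} \alpha_i^2 \alpha_j^2 \bigl( v_i^2 u_j^2 + v_i v_j u_i u_j \bigr) \\
&\le (\alpha_1\alpha_2)^2 \Bigl( \sum_{i,j} (v_i u_j)^2 + \sum_{i,j} |v_i v_j u_i u_j| \Bigr)
\le (\alpha_1\alpha_2)^2 \bigl( \|v\|_2^2 \|u\|_2^2 + \langle |v|,|u| \rangle^2 \bigr) \le 2(\alpha_1\alpha_2)^2,
\end{aligned}
\]
where $|v|$ and $|u|$ denote the vectors of coordinate-wise absolute values. Therefore, by a suitable choice of $\alpha_2$,
\[ R \le \sqrt{2}\,\alpha_1\alpha_2 \ll \alpha_1^2 = \|\Sigma\|, \tag{2.10} \]
and the gap between $R$ and $\|\Sigma\|$ may be arbitrarily large.

The inequality (2.10) is the best one can hope for. Indeed, let $Y$ be a centered random vector taking its values in $\mathbb{R}^d$ with $\Sigma = \mathbb{E}(Y \otimes Y)$. Then for $R = R(Y)$ it holds that
\[ \bigl\| \mathbb{E}(Y \otimes Y - \Sigma)^2 \bigr\| = \Bigl\| \sum_{i=1}^d \mathbb{E}(Y \otimes Y - \Sigma) e_i e_i^T (Y \otimes Y - \Sigma) \Bigr\| \le \sum_{i=1}^d \sup_{v \in S^{d-1}} \mathbb{E}\bigl( e_i^T(Y \otimes Y - \Sigma)v \bigr)^2 \le dR^2. \]
As before, Corollary 5.1 in [11] implies $\|\mathbb{E}(Y \otimes Y)^2\| \ge \mathrm{Tr}(\Sigma)\|\Sigma\|$. Therefore,
\[ dR^2 \ge \bigl\| \mathbb{E}(Y \otimes Y - \Sigma)^2 \bigr\| = \bigl\| \mathbb{E}(Y \otimes Y)^2 - \Sigma^2 \bigr\| \ge \mathrm{Tr}(\Sigma)\|\Sigma\| - \|\Sigma\|^2. \]
This gives the following general lower bound:
\[ R \ge \|\Sigma\| \sqrt{\frac{r(\Sigma) - 1}{d}}. \tag{2.11} \]
When all of $\alpha_2,\dots,\alpha_d$ are of the same order, $R = R(X)$ satisfies (2.11) up to multiplicative constant factors.

References

[1] O. Catoni. Challenging the empirical mean and empirical variance: a deviation study. Annales de l'Institut Henri Poincaré, Probabilités et Statistiques, pages 1148-1185, 2012.
[2] V. Koltchinskii and K. Lounici. Concentration inequalities and moment bounds for sample covariance operators. Bernoulli, 2017.
[3] R. Latała and J. O. Wojtaszczyk. On the infimum convolution inequality. Studia Mathematica.
[4] K. Lounici. High-dimensional covariance matrix estimation with missing observations. Bernoulli, 2014.
[5] G. Lugosi and S. Mendelson. Near-optimal mean estimators with respect to general norms.
[6] S. Mendelson. Empirical processes with a bounded $\psi_1$-diameter. Geometric and Functional Analysis.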

[7] S. Mendelson. Upper bounds on product and multiplier empirical processes. Stochastic Processes and their Applications, 126, 2016.
[8] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geometric and Functional Analysis, 17(4).
[9] S. Minsker. On some extensions of Bernstein's inequality for self-adjoint operators. Statistics and Probability Letters, 2017.
[10] S. Minsker. Sub-gaussian estimators of the mean of a random matrix with heavy-tailed entries. Annals of Statistics, 46, 2018.
[11] S. Minsker and X. Wei. Estimation of the covariance structure of heavy-tailed distributions. NIPS, 2017.
[12] S. Minsker and X. Wei. Robust modifications of U-statistics and applications to covariance estimation problems.
[13] A. Nemirovski and D. Yudin. Problem complexity and method efficiency in optimization. John Wiley and Sons Inc., 1983.
[14] W. So. Facial structure of Schatten p-norms. Linear and Multilinear Algebra, 1990.
[15] M. Talagrand. Upper and lower bounds for stochastic processes: modern methods and classical problems, volume 60. Springer Science & Business Media, 2014.
[16] J. Tropp. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning.
