A well-conditioned estimator for large-dimensional covariance matrices

Journal of Multivariate Analysis 88 (2004) 365-411

A well-conditioned estimator for large-dimensional covariance matrices

Olivier Ledoit (a,b) and Michael Wolf (c,*,1)

a Anderson Graduate School of Management, UCLA, USA
b Equities Division, Credit Suisse First Boston, USA
c Department of Economics and Business, Universitat Pompeu Fabra, Ramon Trias Fargas 25-27, Barcelona, Spain

Received 6 February 2001

Abstract

Many applied problems require a covariance matrix estimator that is not only invertible, but also well-conditioned (that is, inverting it does not amplify estimation error). For large-dimensional covariance matrices, the usual estimator, the sample covariance matrix, is typically not well-conditioned and may not even be invertible. This paper introduces an estimator that is both well-conditioned and more accurate than the sample covariance matrix asymptotically. This estimator is distribution-free and has a simple explicit formula that is easy to compute and interpret. It is the asymptotically optimal convex linear combination of the sample covariance matrix with the identity matrix. Optimality is meant with respect to a quadratic loss function, asymptotically as the number of observations and the number of variables go to infinity together. Extensive Monte Carlo simulations confirm that the asymptotic results tend to hold well in finite sample.
(c) 2003 Elsevier Inc. All rights reserved.

AMS 2000 subject classifications: 62H12; 62C12; 62J07

Keywords: Condition number; Covariance matrix estimation; Empirical Bayes; General asymptotics; Shrinkage

* Corresponding author. E-mail address: michael.wolf@econ.upf.es (M. Wolf).
1 Research supported by a DGES grant.

0047-259X/03/$ - see front matter (c) 2003 Elsevier Inc. All rights reserved. doi:10.1016/S0047-259X(03)00096-4

1. Introduction

Many applied problems require an estimate of a covariance matrix and/or of its inverse, where the matrix dimension is large compared to the sample size. Examples include selecting a mean-variance efficient portfolio from a large universe of stocks (see Markowitz [16]), running generalized least squares (GLS) regressions on large cross-sections (see, e.g., Kandel and Stambaugh [12]), and choosing an optimal weighting matrix in the generalized method of moments (GMM; see Hansen [10]) where the number of moment restrictions is large. In such situations, the usual estimator, the sample covariance matrix, is known to perform poorly. When the matrix dimension $p$ is larger than the number $n$ of observations available, the sample covariance matrix is not even invertible. When the ratio $p/n$ is less than one but not negligible, the sample covariance matrix is invertible but numerically ill-conditioned, which means that inverting it amplifies estimation error dramatically. For large $p$, it is difficult to find enough observations to make $p/n$ negligible, and therefore it is important to develop a well-conditioned estimator for large-dimensional covariance matrices.

If we want a well-conditioned estimator at any cost, we can always impose some ad hoc structure on the covariance matrix to force it to be well-conditioned, such as diagonality or a factor model. But, in the absence of prior information about the true structure of the matrix, this ad hoc structure will in general be misspecified, and the resulting estimator may be so biased that it bears little resemblance to the true covariance matrix. To the best of our knowledge, no existing estimator is both well-conditioned and more accurate than the sample covariance matrix. The contribution of this paper is to propose an estimator that possesses both these properties asymptotically.

One way to get a well-conditioned structured estimator is to impose the condition that all variances are the same and all covariances are zero. The estimator we recommend is a weighted average of this structured estimator and the sample covariance matrix. Our estimator inherits the good conditioning properties of the structured estimator and, by choosing the weight optimally according to a quadratic loss function, we ensure that our weighted average of the sample covariance matrix and the structured estimator is more accurate than either of them. The only difficulty is that the true optimal weight depends on the true covariance matrix, which is unobservable. We solve this difficulty by finding a consistent estimator of the optimal weight, and show that replacing the true optimal weight with a consistent estimator makes no difference asymptotically.

Standard asymptotics assume that the number of variables $p$ is finite and fixed, while the number of observations $n$ goes to infinity. Under standard asymptotics, the sample covariance matrix is well-conditioned (in the limit), and has some appealing optimality properties (e.g., it is the maximum likelihood estimator for normally distributed data). However, this is a bad approximation of many real-world situations where the number of variables $p$ is of the same order of magnitude as the number of observations $n$, and possibly larger. We use a different framework, called general asymptotics, where we allow the number of variables $p$ to go to infinity too. The only constraint is that the ratio $p/n$ must remain bounded. We see standard asymptotics as a special case where it is optimal to put (asymptotically) all the weight

on the sample covariance matrix and none on the structured estimator. In the general case, however, our estimator is asymptotically different from the sample covariance matrix, substantially more accurate, and of course well-conditioned. Note that the framework of general asymptotics is related to the one of Kolmogorov asymptotics, where it is assumed that the ratio $p/n$ tends to a positive, finite constant. Kolmogorov asymptotics is used by, among others, Aivazyan et al. [1] and Girko [5-7]. High-dimensional problems, where the number of variables $p$ is large compared to the sample size $n$, are also studied by Läuter [13] and Läuter et al. [14], but from the point of view of testing and inference, not estimation as in the present paper.

Extensive Monte-Carlo simulations indicate that: (i) the new estimator is more accurate than the sample covariance matrix, even for very small numbers of observations and variables, and usually by a lot; (ii) it is essentially as accurate or substantially more accurate than some estimators proposed in finite sample decision theory, as soon as there are at least ten variables and observations; (iii) it is better-conditioned than the true covariance matrix; and (iv) general asymptotics are a good approximation of finite sample behavior when there are at least 20 observations and variables.

The paper is organized as follows. The next section characterizes in finite sample the linear combination of the identity matrix and the sample covariance matrix which minimizes quadratic risk. Section 3 develops a linear shrinkage estimator with asymptotically uniformly minimum quadratic risk in its class (as the number of observations and the number of variables go to infinity together). In Section 4, Monte-Carlo simulations indicate that this estimator behaves well in finite samples, and Section 5 concludes. Appendix A contains the proofs of the technical results of Section 3.

2. Analysis in finite sample

The easiest way to explain what we do is to first analyze in detail the finite sample case. Let $X$ denote a $p \times n$ matrix of $n$ independent and identically distributed (iid) observations on a system of $p$ random variables with mean zero and covariance matrix $\Sigma$. Following the lead of Muirhead and Leung [19], we consider the Frobenius norm: $\|A\| = \sqrt{\mathrm{tr}(AA^t)/p}$. (Dividing by the dimension $p$ is not standard, but it does not matter in this section because $p$ remains finite. The advantages of this convention are that the norm of the identity matrix is simply one, and that it will be consistent with Definition 1 below.) Our goal is to find the linear combination $\Sigma^* = \rho_1 I + \rho_2 S$ of the identity matrix $I$ and the sample covariance matrix $S = XX^t/n$ whose expected quadratic loss $E[\|\Sigma^* - \Sigma\|^2]$ is minimum. Haff [8] studied this class of linear shrinkage estimators, but did not get any optimality results. The optimality result that we obtain in finite sample will come at a price: $\Sigma^*$ will not be a bona fide estimator, because it will require hindsight knowledge of four scalar functions of the true (and unobservable) covariance matrix $\Sigma$. This would seem like a high price to pay but, interestingly, it is not: In the next section, we are able to develop a bona fide estimator $S^*$ with the same properties as $\Sigma^*$ asymptotically as the number of

observations and the number of variables go to infinity together. Furthermore, extensive Monte-Carlo simulations will indicate that 20 observations and variables are enough for the asymptotic approximations to typically hold well in finite sample. Even the formulas for $\Sigma^*$ and $S^*$ will look the same and will have the same interpretations. This is why we study the properties of $\Sigma^*$ in finite sample as if it was a bona fide estimator.

2.1. Optimal linear shrinkage

The squared Frobenius norm $\|\cdot\|^2$ is a quadratic form whose associated inner product is: $\langle A_1, A_2 \rangle = \mathrm{tr}(A_1 A_2^t)/p$. Four scalars play a central role in the analysis: $\mu = \langle \Sigma, I \rangle$, $\alpha^2 = \|\Sigma - \mu I\|^2$, $\beta^2 = E[\|S - \Sigma\|^2]$, and $\delta^2 = E[\|S - \mu I\|^2]$. We do not need to assume that the random variables in $X$ follow a specific distribution, but we do need to assume that they have finite fourth moments, so that $\beta^2$ and $\delta^2$ are finite. The following relationship holds.

Lemma 2.1. $\alpha^2 + \beta^2 = \delta^2$.

Proof.
$$E[\|S - \mu I\|^2] = E[\|S - \Sigma + \Sigma - \mu I\|^2] \quad (1)$$
$$= E[\|S - \Sigma\|^2] + E[\|\Sigma - \mu I\|^2] + 2E[\langle S - \Sigma, \Sigma - \mu I \rangle] \quad (2)$$
$$= E[\|S - \Sigma\|^2] + \|\Sigma - \mu I\|^2 + 2\langle E[S] - \Sigma, \Sigma - \mu I \rangle. \quad (3)$$
Notice that $E[S] = \Sigma$; therefore, the third term on the right-hand side of Eq. (3) is equal to zero. This completes the proof of Lemma 2.1. □

The optimal linear combination $\Sigma^* = \rho_1 I + \rho_2 S$ of the identity matrix $I$ and the sample covariance matrix $S$ is the standard solution to a simple quadratic programming problem under a linear equality constraint.

Theorem 2.1. Consider the optimization problem:
$$\min_{\rho_1, \rho_2} \; E[\|\Sigma^* - \Sigma\|^2] \quad \text{s.t.} \quad \Sigma^* = \rho_1 I + \rho_2 S, \quad (4)$$
where the coefficients $\rho_1$ and $\rho_2$ are nonrandom. Its solution verifies:
$$\Sigma^* = \frac{\beta^2}{\delta^2}\,\mu I + \frac{\alpha^2}{\delta^2}\,S, \quad (5)$$
$$E[\|\Sigma^* - \Sigma\|^2] = \frac{\alpha^2 \beta^2}{\delta^2}. \quad (6)$$

Proof. By a change of variables, problem (4) can be rewritten as:
$$\min_{\rho, \nu} \; E[\|\Sigma^* - \Sigma\|^2] \quad \text{s.t.} \quad \Sigma^* = \rho \nu I + (1 - \rho) S. \quad (7)$$
With a little algebra, and using $E[S] = \Sigma$ as in the proof of Lemma 2.1, we can rewrite the objective as
$$E[\|\Sigma^* - \Sigma\|^2] = \rho^2 \|\Sigma - \nu I\|^2 + (1 - \rho)^2 E[\|S - \Sigma\|^2]. \quad (8)$$
Therefore, the optimal value of $\nu$ can be obtained as the solution to a reduced problem that does not depend on $\rho$: $\min_\nu \|\Sigma - \nu I\|^2$. Remember that the norm of the identity is one by convention, so the objective of this problem can be rewritten as $\|\Sigma - \nu I\|^2 = \|\Sigma\|^2 - 2\nu \langle \Sigma, I \rangle + \nu^2$. The first-order condition is: $-2\langle \Sigma, I \rangle + 2\nu = 0$. The solution is: $\nu = \langle \Sigma, I \rangle = \mu$. Replacing $\nu$ by its optimal value $\mu$ in Eq. (8), we can rewrite the objective of the original problem as $E[\|\Sigma^* - \Sigma\|^2] = \rho^2 \alpha^2 + (1 - \rho)^2 \beta^2$. The first-order condition is: $2\rho \alpha^2 - 2(1 - \rho)\beta^2 = 0$. The solution is: $\rho = \beta^2/(\alpha^2 + \beta^2) = \beta^2/\delta^2$. Note that $1 - \rho = \alpha^2/\delta^2$. At the optimum, the objective is equal to: $(\beta^2/\delta^2)^2 \alpha^2 + (\alpha^2/\delta^2)^2 \beta^2 = \alpha^2 \beta^2/\delta^2$. This completes the proof. □
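The following short simulation (our own illustration, not part of the paper; the log-normal design and all parameter choices are arbitrary) estimates $\mu$, $\alpha^2$, $\beta^2$ and $\delta^2$ numerically and checks Lemma 2.1 and the risk formula (6) up to Monte-Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, reps = 10, 40, 2000
I = np.eye(p)

def norm2(A):
    # squared normalized Frobenius norm: ||A||^2 = tr(A A^t) / p
    return np.trace(A @ A.T) / A.shape[0]

def sample_cov():
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n).T  # p x n, mean zero
    return X @ X.T / n

Sigma = np.diag(rng.lognormal(sigma=0.5, size=p))   # an arbitrary true covariance
mu = np.trace(Sigma) / p                            # mu = <Sigma, I>
alpha2 = norm2(Sigma - mu * I)                      # alpha^2 = ||Sigma - mu I||^2

# Estimate beta^2 = E||S - Sigma||^2 and delta^2 = E||S - mu I||^2 by simulation.
draws = [sample_cov() for _ in range(reps)]
beta2 = np.mean([norm2(S - Sigma) for S in draws])
delta2 = np.mean([norm2(S - mu * I) for S in draws])
print(alpha2 + beta2, delta2)                       # Lemma 2.1: alpha^2 + beta^2 = delta^2

# The oracle combination of Eq. (5) attains the risk of Eq. (6).
risk = np.mean([norm2((beta2 / delta2) * mu * I + (alpha2 / delta2) * S - Sigma)
                for S in (sample_cov() for _ in range(reps))])
print(risk, alpha2 * beta2 / delta2)                # agree up to Monte-Carlo error
```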

Note that $\mu I$ can be interpreted as a shrinkage target and the weight $\beta^2/\delta^2$ placed on $\mu I$ as a shrinkage intensity. The percentage relative improvement in average loss (PRIAL) over the sample covariance matrix is equal to
$$\frac{E[\|S - \Sigma\|^2] - E[\|\Sigma^* - \Sigma\|^2]}{E[\|S - \Sigma\|^2]} = \frac{\beta^2}{\delta^2}, \quad (9)$$
the same as the shrinkage intensity. Therefore, everything is controlled by the ratio $\beta^2/\delta^2$, which is a properly normalized measure of the error of the sample covariance matrix $S$. Intuitively, if $S$ is relatively accurate, then you should not shrink it too much, and shrinking it will not help you much either; if $S$ is relatively inaccurate, then you should shrink it a lot, and you also stand to gain a lot from shrinking.

2.2. Interpretations

The mathematics underlying Theorem 2.1 are so rich that we are able to provide four complementary interpretations of it. One is geometric and the others echo some of the most important ideas in finite sample multivariate statistics.

First, we can see Theorem 2.1 as a projection theorem in Hilbert space. The appropriate Hilbert space is the space of $p$-dimensional symmetric random matrices $A$ such that $E[\|A\|^2] < \infty$. The associated norm is, of course, $\sqrt{E[\|\cdot\|^2]}$, and the inner product of two random matrices $A_1$ and $A_2$ is $E[\langle A_1, A_2 \rangle]$. With this structure, Lemma 2.1 is just a rewriting of the Pythagorean Theorem. Furthermore, formula (5) can be justified as follows: In order to project the true covariance matrix $\Sigma$ onto the space spanned by the identity matrix $I$ and the sample covariance matrix $S$, we first project it onto the line spanned

by the identity, which yields the shrinkage target $\mu I$; then we project $\Sigma$ onto the line joining the shrinkage target $\mu I$ to the sample covariance matrix $S$. Whether the projection $\Sigma^*$ ends up closer to one end of the line ($\mu I$) or to the other ($S$) depends on which one of them $\Sigma$ was closer to. Fig. 1 provides a geometrical illustration.

The second way to interpret Theorem 2.1 is as a trade-off between bias and variance. We seek to minimize mean squared error, which can be decomposed into variance and squared bias:
$$E[\|\Sigma^* - \Sigma\|^2] = E[\|\Sigma^* - E[\Sigma^*]\|^2] + \|E[\Sigma^*] - \Sigma\|^2. \quad (10)$$
The mean squared error of the shrinkage target $\mu I$ is all bias and no variance, while for the sample covariance matrix $S$ it is exactly the opposite: all variance and no bias. $\Sigma^*$ represents the optimal trade-off between error due to bias and error due to variance. See Fig. 2 for an illustration. The idea of a trade-off between bias and variance was already central to the original James and Stein [11] shrinkage technique.

Fig. 1. Theorem 2.1 interpreted as a projection in Hilbert space.

Fig. 2. Theorem 2.1 interpreted as a trade-off between bias and variance: Shrinkage intensity zero corresponds to the sample covariance matrix $S$. Shrinkage intensity one corresponds to the shrinkage target $\mu I$. The optimal shrinkage intensity (represented by $*$) corresponds to the minimum expected loss combination $\Sigma^*$.
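The decomposition (10) is easy to reproduce numerically. The sketch below (our own illustration; the design and the grid of intensities are arbitrary) sweeps the shrinkage intensity $\rho$ and splits the risk of $\rho \mu I + (1-\rho)S$ into a variance term, which decreases in $\rho$, and a squared-bias term, which increases in $\rho$; the total is minimized strictly inside $(0,1)$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, n, reps = 10, 20, 2000
Sigma = np.diag(np.linspace(0.5, 1.5, p))           # arbitrary true covariance
mu = np.trace(Sigma) / p
I = np.eye(p)

draws = []
for _ in range(reps):
    X = rng.multivariate_normal(np.zeros(p), Sigma, size=n).T
    draws.append(X @ X.T / n)                       # sample covariance matrices
draws = np.asarray(draws)

for rho in (0.0, 0.25, 0.5, 0.75, 1.0):
    est = rho * mu * I + (1 - rho) * draws          # shrinkage with intensity rho
    variance = np.mean(np.sum((est - est.mean(axis=0)) ** 2, axis=(1, 2))) / p
    bias2 = np.sum((est.mean(axis=0) - Sigma) ** 2) / p
    print(rho, variance, bias2, variance + bias2)   # risk = variance + squared bias
```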

The third interpretation is Bayesian. $\Sigma^*$ can be seen as the combination of two signals: prior information and sample information. Prior information states that the true covariance matrix $\Sigma$ lies on the sphere centered around the shrinkage target $\mu I$ with radius $\alpha$. Sample information states that $\Sigma$ lies on another sphere, centered around the sample covariance matrix $S$ with radius $\beta$. Bringing together prior and sample information, $\Sigma$ must lie on the intersection of the two spheres, which is a circle. At the center of this circle stands $\Sigma^*$. The relative importance given to prior vs. sample information in determining $\Sigma^*$ depends on which one is more accurate. Strictly speaking, a full Bayesian approach would specify not only the support of the distribution of $\Sigma$, but also the distribution itself. We could assume that $\Sigma$ is uniformly distributed on the sphere, but it might be difficult to justify. Thus, $\Sigma^*$ should not be thought of as the expectation of the posterior distribution, as is traditional, but rather as the center of mass of its support. See Fig. 3 for an illustration. The idea of drawing inspiration from the Bayesian perspective to obtain an improved estimator of the covariance matrix was used by Haff [8].

The fourth and last interpretation involves the cross-sectional dispersion of covariance matrix eigenvalues. Let $\lambda_1, \ldots, \lambda_p$ denote the eigenvalues of the true covariance matrix $\Sigma$, and $l_1, \ldots, l_p$ those of the sample covariance matrix $S$. We can exploit the Frobenius norm's elegant relationship to eigenvalues. Note that
$$\mu = \frac{1}{p} \sum_{i=1}^p \lambda_i = E\left[\frac{1}{p} \sum_{i=1}^p l_i\right] \quad (11)$$

Fig. 3. Bayesian interpretation: The left sphere has center $\mu I$ and radius $\alpha$, and represents prior information. The right sphere has center $S$ and radius $\beta$, and represents sample information. The distance between sphere centers is $\delta$. If all we knew was that the true covariance matrix $\Sigma$ lies on the left sphere, our best guess would be its center: the shrinkage target $\mu I$. If all we knew was that the true covariance matrix $\Sigma$ lies on the right sphere, our best guess would be its center: the sample covariance matrix $S$. Putting together both pieces of information, the true covariance matrix $\Sigma$ must lie on the circle where the two spheres intersect; therefore, our best guess is its center: the optimal linear shrinkage $\Sigma^*$.

8 37 O. Ledoit, M. Wolf / Journal of Multivariate Analysis 88 (004) reresents the grand mean of both true and samle eigenvalues. Then Lemma.1 can be rewritten as " # 1 E X ðl i mþ ¼ 1 X ðl i mþ þ E½jjS Sjj Š: ð1þ In words, samle eigenvalues are more disersed around their grand mean than true ones, and the excess disersion is equal to the error of the samle covariance matrix. Excess disersion imlies that the largest samle eigenvalues are biased uwards, and the smallest ones downwards. See Fig. 4 for an illustration. Therefore, we can imrove uon the samle covariance matrix by shrinking its eigenvalues towards their grand mean, as in 8i ¼ 1; y; l i ¼ b d m þ a d l i: ð13þ Fig. 4. Samle vs. true eigenvalues: The solid line reresents the distribution of the eigenvalues of the samle covariance matrix. Eigenvalues are sorted from largest to smallest, then lotted against their rank. In this case, the true covariance matrix is the identity, that is, the true eigenvalues are all equal to one. The distribution of true eigenvalues is lotted as a dashed horizontal line at one. Distributions are obtained in the limit as the number of observations n and the number of variables both go to infinity with the ratio =n converging to a finite ositive limit. The four lots corresond to different values of this limit.

Note that $\lambda_1^*, \ldots, \lambda_p^*$ defined by Eq. (13) are precisely the eigenvalues of $\Sigma^*$. Surprisingly, their dispersion $E[\sum_{i=1}^p (\lambda_i^* - \mu)^2]/p = \alpha^4/\delta^2$ is even below the dispersion of the true eigenvalues. For the interested reader, the next subsection explains why. The idea that shrinking sample eigenvalues towards their grand mean yields an improved estimator of the covariance matrix was highlighted in Muirhead's [18] review paper.

2.3. Further results on sample eigenvalues

The following paragraphs contain additional insights about the eigenvalues of the sample covariance matrix, but the reader can skip them and go directly to Section 3 if he or she so wishes. We discuss: (1) why the eigenvalues of the sample covariance matrix are more dispersed than those of the true covariance matrix (Eq. (12)); (2) how important this effect is in practice; and (3) why we should use instead an estimator whose eigenvalues are less dispersed than those of the true covariance matrix (Eq. (13)). The explanation relies on a result from matrix algebra.

Theorem 2.2. The eigenvalues are the most dispersed diagonal elements that can be obtained by rotation.

Proof. Let $R$ denote a $p$-dimensional symmetric matrix and $V$ a $p$-dimensional rotation matrix: $VV^t = V^tV = I$. First, note that $(1/p)\,\mathrm{tr}(V^tRV) = (1/p)\,\mathrm{tr}(R)$: the average of the diagonal elements is invariant by rotation. Call it $r$. Let $v_i$ denote the $i$th column of $V$. The dispersion of the diagonal elements of $V^tRV$ is $(1/p)\sum_i (v_i^t R v_i - r)^2$. Note that $\sum_i (v_i^t R v_i - r)^2 + \sum_i \sum_{j \neq i} (v_i^t R v_j)^2 = \mathrm{tr}[(V^tRV - rI)^2] = \mathrm{tr}[(R - rI)^2]$ is invariant by rotation. Therefore, the rotation $V$ maximizes the dispersion of the diagonal elements of $V^tRV$ if and only if it minimizes $\sum_i \sum_{j \neq i} (v_i^t R v_j)^2$. This is achieved by setting $v_i^t R v_j$ to zero for all $i \neq j$. In this case, $V^tRV$ is a diagonal matrix, call it $D$. $V^tRV = D$ is equivalent to $R = VDV^t$. Since $V$ is a rotation and $D$ is diagonal, the columns of $V$ must contain the eigenvectors of $R$ and the diagonal of $D$ its eigenvalues. Therefore, the dispersion of the diagonal elements of $V^tRV$ is maximized when these diagonal elements are equal to the eigenvalues of $R$. This completes the proof of Theorem 2.2. □

Decompose the true covariance matrix into eigenvalues and eigenvectors: $\Sigma = \Gamma \Lambda \Gamma^t$, where $\Lambda$ is a diagonal matrix, and $\Gamma$ is a rotation matrix. The diagonal elements of $\Lambda$ are the eigenvalues $\lambda_1, \ldots, \lambda_p$, and the columns of $\Gamma$ are the eigenvectors $\gamma_1, \ldots, \gamma_p$. Similarly, decompose the sample covariance matrix into eigenvalues and eigenvectors: $S = GLG^t$, where $L$ is a diagonal matrix, and $G$ is a rotation matrix. The diagonal elements of $L$ are the eigenvalues $l_1, \ldots, l_p$, and the columns of $G$ are the eigenvectors $g_1, \ldots, g_p$. Since $S$ is unbiased and $\Gamma$ is nonstochastic, $\Gamma^t S \Gamma$ is an unbiased estimator of $\Lambda = \Gamma^t \Sigma \Gamma$.
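Theorem 2.2 is easy to probe numerically: no rotation produces diagonal elements more dispersed than the eigenvalues. The sketch below is our own check (Haar-distributed random rotations are drawn via the QR decomposition of a Gaussian matrix, a standard device not taken from the paper).

```python
import numpy as np

rng = np.random.default_rng(2)
p = 8
A = rng.standard_normal((p, p))
R = (A + A.T) / 2                              # a symmetric p-dimensional matrix

def diag_dispersion(M):
    d = np.diag(M)
    return np.mean((d - d.mean()) ** 2)        # dispersion around the average r

eigvals, V = np.linalg.eigh(R)                 # columns of V = eigenvectors of R
best = diag_dispersion(V.T @ R @ V)            # diagonal elements = eigenvalues

rand = max(diag_dispersion(Q.T @ R @ Q)
           for Q in (np.linalg.qr(rng.standard_normal((p, p)))[0]
                     for _ in range(2000)))
print(best >= rand)                            # True: eigenvalues are most dispersed
```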

The diagonal elements of $\Gamma^t S \Gamma$ are approximately as dispersed as the ones of $\Gamma^t \Sigma \Gamma$. For convenience, let us speak as if they were exactly as dispersed. By contrast, $L = G^t S G$ is not at all an unbiased estimator of $G^t \Sigma G$. This is because the errors of $G$ and $S$ interact. Theorem 2.2 shows us the effect of this interaction: the diagonal elements of $G^t S G$ are more dispersed than those of $\Gamma^t S \Gamma$ (and hence than those of $\Gamma^t \Sigma \Gamma$). This is why sample eigenvalues are more dispersed than true ones. See Table 1 for a summary.

Table 1
Dispersion of diagonal elements

$G^t \Sigma G$   ≺   $\Gamma^t \Sigma \Gamma$
$\Gamma^t S \Gamma$   ≈   $\Gamma^t \Sigma \Gamma$
$G^t S G$   ≻   $\Gamma^t S \Gamma$

This table compares the dispersion of the diagonal elements of certain products of matrices. The symbols ≺, ≈ and ≻ pertain to diagonal elements, and mean "less dispersed than", "approximately as dispersed as", and "more dispersed than", respectively.

We illustrate how important this effect is in a particular case: when the true covariance matrix is the identity matrix. Let us sort the eigenvalues of the sample covariance matrix from largest to smallest, and plot them against their rank. The shape of the plot depends on the ratio $p/n$, but does not depend on the particular realization of the sample covariance matrix, at least approximately when $p$ and $n$ are very large. Fig. 4 shows the distribution of sample eigenvalues for various values of the ratio $p/n$. This figure is based on the asymptotic formula proven by Marčenko and Pastur [17]. We notice that the largest sample eigenvalues are severely biased upwards, and the smallest ones downwards. The bias increases in $p/n$. This phenomenon is very general and is not limited to the identity case. It is similar to the effect observed by Brown [3] in Monte-Carlo simulations.

Finally, let us remark that the sample eigenvalues $l_i = g_i^t S g_i$ should not be compared to the true eigenvalues $\lambda_i = \gamma_i^t \Sigma \gamma_i$, but to $g_i^t \Sigma g_i$: we should compare estimated vs. true variance associated with the vector $g_i$. By Theorem 2.2 again, the diagonal elements of $G^t \Sigma G$ are even less dispersed than those of $\Gamma^t \Sigma \Gamma$. Not only are sample eigenvalues more dispersed than true ones, but they should be less dispersed. This effect is attributable to error in the sample eigenvectors. Intuitively: Statisticians should shy away from taking a strong stance on extremely small and extremely large eigenvalues, because they know that they have the wrong eigenvectors. The sample covariance matrix is guilty of taking an unjustifiably strong stance. The optimal linear shrinkage $\Sigma^*$ corrects for that.

3. Analysis under general asymptotics

In the previous section, we have shown that $\Sigma^*$ has an appealing optimality property and fits well in the existing literature. It has only one drawback: it is not a bona fide estimator, since it requires hindsight knowledge of four scalar functions of the true (and unobservable) covariance matrix $\Sigma$: $\mu$, $\alpha^2$, $\beta^2$ and $\delta^2$. We now address this problem.

The idea is that, asymptotically, there exist consistent estimators for $\mu_n$, $\alpha_n^2$, $\beta_n^2$ and $\delta_n^2$, hence for $\Sigma_n^*$ too. At this point, we need to choose an appropriate asymptotic framework. Standard asymptotics consider $p$ fixed while $n$ tends to infinity, implying that the optimal shrinkage intensity vanishes in the limit. This would be reasonable for situations where $p$ is very small in comparison to $n$. However, in the problems of interest to us $p$ tends to be of the same order as $n$ and can even be larger. Hence, we consider it more appropriate to use a framework that reflects this condition. This is achieved by allowing the number of variables $p$ to go to infinity at the same speed as the number of observations $n$. It is called general asymptotics. In this framework, the optimal shrinkage intensity generally does not vanish asymptotically but rather tends to a limiting constant that we will be able to estimate consistently. The idea then is to use the estimated shrinkage intensity in order to arrive at a bona fide estimator.

To the best of our knowledge, the framework of general asymptotics has not been used before to improve over the sample covariance matrix, but only to characterize the distribution of its eigenvalues, as in Silverstein [20]. We are also aware of the work of Girko [5-7] who employed the framework of Kolmogorov asymptotics, where the ratio $p/n$ tends to a positive, finite constant. Girko proves consistency and asymptotic normality of so-called G-estimators of certain functions of the covariance matrix. In particular, he demonstrates the element-wise consistency of a $G_3$-estimator of the inverse of the covariance matrix.

3.1. General asymptotics

Let $n = 1, 2, \ldots$ index a sequence of statistical models. For every $n$, $X_n$ is a $p_n \times n$ matrix of $n$ iid observations on a system of $p_n$ random variables with mean zero and covariance matrix $\Sigma_n$. The number of variables $p_n$ can change and even go to infinity with the number of observations $n$, but not too fast.

Assumption 1. There exists a constant $K_1$ independent of $n$ such that $p_n/n \leq K_1$.

Assumption 1 is very weak. It does not require $p_n$ to change and go to infinity; therefore, standard asymptotics are included as a particular case. It is not even necessary for the ratio $p_n/n$ to converge to any limit. Decompose the covariance matrix into eigenvalues and eigenvectors: $\Sigma_n = \Gamma_n \Lambda_n \Gamma_n^t$, where $\Lambda_n$ is a diagonal matrix, and $\Gamma_n$ a rotation matrix. The diagonal elements of $\Lambda_n$ are the eigenvalues $\lambda_1^n, \ldots, \lambda_{p_n}^n$, and the columns of $\Gamma_n$ are the eigenvectors $\gamma_1^n, \ldots, \gamma_{p_n}^n$. $Y_n = \Gamma_n^t X_n$ is a $p_n \times n$ matrix of $n$ iid observations on a system of $p_n$ uncorrelated random variables that spans the same space as the original system. We impose restrictions on the higher moments of $Y_n$. Let $(y_{11}^n, \ldots, y_{p_n 1}^n)^t$ denote the first column of the matrix $Y_n$.

Assumption 2. There exists a constant $K_2$ independent of $n$ such that $\frac{1}{p_n} \sum_{i=1}^{p_n} E[(y_{i1}^n)^8] \leq K_2$.

Assumption 3.
$$\lim_{n \to \infty} \frac{p_n^2}{n} \times \frac{\sum_{(i,j,k,l) \in Q_n} \left( \mathrm{Cov}[y_{i1}^n y_{j1}^n,\; y_{k1}^n y_{l1}^n] \right)^2}{\mathrm{Card}(Q_n)} = 0,$$
where $Q_n$ denotes the set of all the quadruples that are made of four distinct integers between 1 and $p_n$.

Assumption 2 states that the eighth moment is bounded (on average). Assumption 3 states that products of uncorrelated random variables are themselves uncorrelated (on average, in the limit). In the case where general asymptotics degenerate into standard asymptotics ($p_n/n \to 0$), Assumption 3 is trivially verified as a consequence of Assumption 2. Assumption 3 is verified when random variables are normally or even elliptically distributed, but it is much weaker than that. Assumptions 1-3 are implicit throughout the paper.

We suggest to use a matrix norm based on the Frobenius norm. The idea is that a norm of the $p_n$-dimensional matrix $A$ can be specified as $\|A\|_n^2 = f(p_n)\,\mathrm{tr}(AA^t)$, where $f(p_n)$ is a scalar function of the dimension. This defines a quadratic form on the linear space of $p_n$-dimensional symmetric matrices whose associated inner product is $\langle A_1, A_2 \rangle_n = f(p_n)\,\mathrm{tr}(A_1 A_2^t)$. The behavior of $\|\cdot\|_n$ across dimensions is controlled by the function $f(\cdot)$. The norm $\|\cdot\|_n$ is used mainly to define a notion of consistency. A given estimator will be called consistent if the norm of its difference with the true covariance matrix goes to zero (in quadratic mean) as $n$ goes to infinity. If $p_n$ remains bounded, then all positive functions $f(\cdot)$ generate equivalent notions of consistency. But this particular case, similar to standard asymptotics, is not very representative. If $p_n$ (or a subsequence) goes to infinity, then the choice of $f(\cdot)$ becomes much more important. If $f(p_n)$ is too large (small) as $p_n$ goes to infinity, then it will define too strong (weak) a notion of consistency. $f(\cdot)$ must define the notion of consistency that is just right under general asymptotics. Our solution is to define a relative norm: the norm of a $p_n$-dimensional matrix is divided by the norm of a benchmark matrix of the same dimension $p_n$. The benchmark must be chosen carefully. For lack of any other attractive candidate, we take the identity matrix as benchmark. Therefore, by convention, the identity matrix has norm one in every dimension. This determines the function $f(\cdot)$ uniquely as $f(p_n) = 1/p_n$.

Definition 1. Our norm of the $p_n$-dimensional matrix $A$ is: $\|A\|_n^2 = \mathrm{tr}(AA^t)/p_n$.

Intuitively, it seems that the norm of the identity matrix should remain bounded away from zero and from infinity as its dimension goes to infinity. All choices of $f(\cdot)$ satisfying this property would define equivalent notions of consistency. Therefore, our particular norm is equivalent to any norm that would make sense under general asymptotics.

An example might help familiarize the reader with Definition 1. Let $A_n$ be the $p_n \times p_n$ matrix with one in its top left entry and zeros everywhere else. Let $Z_n$ be the $p_n \times p_n$ matrix with zeros everywhere (i.e. the null matrix). $A_n$ and $Z_n$ differ in a way that is independent of $p_n$: the top left entry is not the same. Yet their squared distance $\|A_n - Z_n\|_n^2 = 1/p_n$ depends on $p_n$. This apparent paradox has an intuitive resolution. $A_n$ and $Z_n$ disagree on the first dimension, but they agree on the $p_n - 1$ others. The importance of their disagreement is relative to the extent of their agreement. If $p_n = 1$, then $A_n$ and $Z_n$ have nothing in common, and their distance is 1. If $p_n \to \infty$, then $A_n$ and $Z_n$ have almost everything in common, and their distance goes to 0. Thus, disagreeing on one entry can either be important (if this entry is the only one) or negligible (if this entry is just one among a large number of others).
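In code, Definition 1 and the $A_n$ versus $Z_n$ example look as follows (a small sketch of ours):

```python
import numpy as np

def norm2(A):
    # Definition 1: ||A||^2 = tr(A A^t) / p_n, so the identity has norm one
    return np.trace(A @ A.T) / A.shape[0]

for p in (1, 10, 100, 1000):
    A = np.zeros((p, p)); A[0, 0] = 1.0        # one in the top left entry
    Z = np.zeros((p, p))                       # the null matrix
    print(p, norm2(np.eye(p)), norm2(A - Z))   # prints 1.0 and 1/p_n
```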

3.2. The behavior of the sample covariance matrix

Define the sample covariance matrix $S_n = X_n X_n^t/n$. We follow the notation of Section 2, except that we add the subscript $n$ to signal that all results hold asymptotically. Thus, we have: $\mu_n = \langle \Sigma_n, I_n \rangle_n$, $\alpha_n^2 = \|\Sigma_n - \mu_n I_n\|_n^2$, $\beta_n^2 = E[\|S_n - \Sigma_n\|_n^2]$, and $\delta_n^2 = E[\|S_n - \mu_n I_n\|_n^2]$. These four scalars are well behaved asymptotically.

Lemma 3.1. $\mu_n$, $\alpha_n^2$, $\beta_n^2$ and $\delta_n^2$ remain bounded as $n \to \infty$.

They can go to zero in special cases, but in general they do not, in spite of the division by $p_n$ in the definition of the norm. The proofs of all the technical results of Section 3 are in Appendix A.

The most basic question is whether the sample covariance matrix is consistent under general asymptotics. Specifically, we ask whether $S_n$ converges in quadratic mean to the true covariance matrix, that is, whether $\beta_n^2$ vanishes. In general, the answer is no, as shown below. The results stated in Theorem 3.1 and Lemmata 3.2 and 3.3 are related to special cases of a general result proven by Yin [25]. But we work under weaker assumptions than he does. Also, his goal is to find the distribution of the eigenvalues of the sample covariance matrix, while ours is to find an improved estimator of the covariance matrix.

Theorem 3.1. Define $\theta_n = \mathrm{Var}\left[\frac{1}{p_n} \sum_{i=1}^{p_n} (y_{i1}^n)^2\right]$. Then $\theta_n$ is bounded as $n \to \infty$, and we have:
$$\lim_{n \to \infty} \; E[\|S_n - \Sigma_n\|_n^2] - \frac{p_n}{n}\left(\mu_n^2 + \theta_n\right) = 0.$$

Theorem 3.1 shows that the expected loss of the sample covariance matrix $E[\|S_n - \Sigma_n\|_n^2]$ is bounded, but it is at least of the order of $(p_n/n)\mu_n^2$, which does not usually vanish. Therefore, the sample covariance matrix is not consistent under general asymptotics, except in special cases.
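A small simulation makes Theorem 3.1 concrete. With normal data and $\Sigma_n = I$ (so $\mu_n = 1$ and $\theta_n$ negligible), the loss of the sample covariance matrix settles near $p_n/n$ instead of vanishing when $p_n$ grows proportionally to $n$. (Sketch of ours; the dimensions are arbitrary.)

```python
import numpy as np

rng = np.random.default_rng(3)
for p, n in [(10, 20), (50, 100), (250, 500)]:   # p/n held fixed at 1/2
    loss = 0.0
    for _ in range(50):
        X = rng.standard_normal((p, n))          # N(0, I) data: mu = 1, theta ~ 0
        S = X @ X.T / n
        loss += np.sum((S - np.eye(p)) ** 2) / p / 50   # ||S - Sigma||^2
    print(p, n, loss)                            # stays near p/n = 0.5, does not vanish
```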

The first special case is when $p_n/n \to 0$. For example, under standard asymptotics, $p_n$ is fixed, and it is well known that the sample covariance matrix is consistent. Theorem 3.1 shows that consistency extends to cases where $p_n$ is not fixed, not even necessarily bounded, as long as it is of order $o(n)$. The second special case is when $\mu_n \to 0$ and $\theta_n \to 0$: $\mu_n \to 0$ implies that most of the $p_n$ random variables have vanishing variances, i.e. they are asymptotically degenerate. The number of random variables escaping degeneracy must be negligible with respect to $n$. This is like the previous case, except that the $o(n)$ nondegenerate random variables can now be augmented with $O(n)$ degenerate ones. Overall, a loose condition for the consistency of the sample covariance matrix under general asymptotics is that the number of nondegenerate random variables be negligible with respect to the number of observations.

If the sample covariance matrix is not consistent under general asymptotics, it is because of its off-diagonal elements. Granted, the error on each one of them vanishes, but their number grows too fast. The accumulation of a large number of small errors off the diagonal prevents the sample covariance matrix from being consistent. By contrast, the contribution of the errors on the diagonal is negligible. This is apparent from the proof of Theorem 3.1. After all, it should in general not be possible to consistently estimate $p_n(p_n + 1)/2$ parameters from a data set of $p_n n$ random realizations if these two numbers are of the same order of magnitude. For this reason, we believe that there does not exist any consistent estimator of the covariance matrix under general asymptotics.

Theorem 3.1 also shows what factors determine the error of $S_n$. The first factor is the ratio $p_n/n$. It measures deviation from standard asymptotics. People often figure out whether they can use asymptotics by checking whether they have enough observations, but in this case it would be unwise: it is the ratio of observations to variables that needs to be big. Two hundred observations might seem like a lot, but it is not nearly enough if there are 100 variables: it would be about as bad as using two observations to estimate the variance of one random variable! The second factor $\mu_n$ simply gives the scale of the problem. The third factor $\theta_n$ measures covariance between the squared variables over and above what is implied by covariance between the variables themselves. For example, $\theta_n$ is zero in the normal case, but usually positive in the elliptic case. Intuitively, a cross-sectional law of large numbers could make the variance of $\frac{1}{p_n}\sum_i (y_{i1}^n)^2$ vanish asymptotically as $p_n \to \infty$ if the $(y_{i1}^n)^2$ were sufficiently uncorrelated with one another. But Assumption 3 is too weak to ensure that, so in general $\theta_n$ is not negligible, which might be more realistic sometimes.

This analysis enables us to answer another basic question: When does shrinkage matter? Remember that $\beta_n^2 = E[\|S_n - \Sigma_n\|_n^2]$ denotes the error of the sample covariance matrix, and that $\delta_n^2 = E[\frac{1}{p_n}\sum_i (l_i^n - \mu_n)^2]$ denotes the cross-sectional dispersion of the sample eigenvalues $l_1^n, \ldots, l_{p_n}^n$ around the expectation of their grand mean $\mu_n = E[\sum_i l_i^n/p_n]$. Theorem 2.1 states that shrinkage matters unless the ratio $\beta_n^2/\delta_n^2$ is negligible, but this answer is rather abstract. Theorem 3.1 enables us to rewrite it in more intuitive terms. Ignoring the presence of $\theta_n$, the error of the sample covariance matrix $\beta_n^2$ is asymptotically close to $(p_n/n)\mu_n^2$. Therefore, shrinkage matters unless the ratio of variables to observations $p_n/n$ is negligible with respect to $\delta_n^2/\mu_n^2$, which is a scale-free measure of the cross-sectional dispersion of sample eigenvalues. Fig. 5 provides a graphical illustration. This constitutes an easy diagnostic test to reveal whether our shrinkage method can substantially improve upon the sample covariance matrix. In our opinion, there are many important practical situations where shrinkage does matter according to this criterion. Also, it is rather exceptional for gains from shrinkage to be as large as $p_n/n$, because most of the time (for example in estimation of the mean) they are of the much smaller order $1/n$.

Fig. 5. Optimal shrinkage intensity and PRIAL as a function of eigenvalues dispersion and the ratio of variables to observations: Note that eigenvalues dispersion is measured by the scale-free ratio $\delta_n^2/\mu_n^2$.

3.3. A consistent estimator for $\Sigma_n^*$

$\Sigma_n^*$ is not a bona fide estimator because it depends on the true covariance matrix $\Sigma_n$, which is unobservable. Fortunately, computing $\Sigma_n^*$ does not require knowledge of the whole matrix $\Sigma_n$, but only of four scalar functions of $\Sigma_n$: $\mu_n$, $\alpha_n^2$, $\beta_n^2$ and $\delta_n^2$. Given the size of the data set ($p_n \times n$), we cannot estimate all of $\Sigma_n$ consistently, but we can estimate the optimal shrinkage target, the optimal shrinkage intensity, and even $\Sigma_n^*$ itself consistently. For $\mu_n$, a consistent estimator is its sample counterpart.

Lemma 3.2. Define $m_n = \langle S_n, I_n \rangle_n$. Then $E[m_n] = \mu_n$ for all $n$, and $m_n - \mu_n$ converges to zero in quartic mean (fourth moment) as $n$ goes to infinity. It implies that $m_n - \mu_n \to_{q.m.} 0$ and $m_n^2 - \mu_n^2 \to_{q.m.} 0$, where $\to_{q.m.}$ denotes convergence in quadratic mean as $n \to \infty$.

A consistent estimator for $\delta_n^2 = E[\|S_n - \mu_n I_n\|_n^2]$ is also its sample counterpart.

Lemma 3.3. Define $d_n^2 = \|S_n - m_n I_n\|_n^2$. Then $d_n^2 - \delta_n^2 \to_{q.m.} 0$.

Now let the $p_n \times 1$ vector $x_k^n$ denote the $k$th column of the observations matrix $X_n$, for $k = 1, \ldots, n$. $S_n = n^{-1} X_n X_n^t$ can be rewritten as $S_n = n^{-1} \sum_{k=1}^n x_k^n (x_k^n)^t$: $S_n$ is the average of the matrices $x_k^n (x_k^n)^t$. Since the matrices $x_k^n (x_k^n)^t$ are iid across $k$, we can estimate the error $\beta_n^2 = E[\|S_n - \Sigma_n\|_n^2]$ of their average by seeing how far each one of them deviates from the average.

Lemma 3.4. Define $\bar{b}_n^2 = \frac{1}{n^2} \sum_{k=1}^n \|x_k^n (x_k^n)^t - S_n\|_n^2$ and $b_n^2 = \min(\bar{b}_n^2, d_n^2)$. Then $\bar{b}_n^2 - \beta_n^2 \to_{q.m.} 0$ and $b_n^2 - \beta_n^2 \to_{q.m.} 0$.

We introduce the constrained estimator $b_n^2$ because $\beta_n^2 \leq \delta_n^2$ by Lemma 2.1. In general, this constraint is rarely binding. But it ensures that the following estimator of $\alpha_n^2$ is nonnegative.

Lemma 3.5. Define $a_n^2 = d_n^2 - b_n^2$. Then $a_n^2 - \alpha_n^2 \to_{q.m.} 0$.

The next step of the strategy is to replace the unobservable scalars in the formula defining $\Sigma_n^*$ with consistent estimators, and to show that the asymptotic properties are unchanged. This yields our bona fide estimator of the covariance matrix:
$$S_n^* = \frac{b_n^2}{d_n^2}\, m_n I_n + \frac{a_n^2}{d_n^2}\, S_n. \quad (14)$$
The next theorem shows that $S_n^*$ has the same asymptotic properties as $\Sigma_n^*$. Thus, we can neglect the error that we introduce when we replace the unobservable parameters $\mu_n$, $\alpha_n^2$, $\beta_n^2$ and $\delta_n^2$ by estimators.

Theorem 3.2. $S_n^*$ is a consistent estimator of $\Sigma_n^*$, i.e. $\|S_n^* - \Sigma_n^*\|_n \to_{q.m.} 0$. As a consequence, $S_n^*$ has the same asymptotic expected loss (or risk) as $\Sigma_n^*$, i.e. $E[\|S_n^* - \Sigma_n\|_n^2] - E[\|\Sigma_n^* - \Sigma_n\|_n^2] \to 0$.

This justifies our studying the properties of $\Sigma_n^*$ in Section 2 as if it was a bona fide estimator. It is interesting to recall the Bayesian interpretation of $\Sigma_n^*$ (see Section 2.2). From this point of view, $S_n^*$ is an empirical Bayesian estimator. Empirical Bayesians often ignore the fact that their prior contains estimation error because it comes from the data. Usually, this is done without any rigorous justification, and it requires sophisticated judgment to pick an empirical Bayesian prior whose estimation error is not too damaging. Here, we treat this issue rigorously instead: We give a set of conditions (Assumptions 1-3) under which it is legitimate to neglect the estimation error of our empirical Bayesian prior.

Finally, it is possible to estimate the expected quadratic loss of $\Sigma_n^*$ and $S_n^*$ consistently.

Lemma 3.6. $E\left[\left|\dfrac{a_n^2 b_n^2}{d_n^2} - \dfrac{\alpha_n^2 \beta_n^2}{\delta_n^2}\right|\right] \to 0$.
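Putting Lemmas 3.2-3.5 and Eq. (14) together gives a very short implementation. The sketch below is our own transcription (function and variable names are ours); it assumes, as in the text, that the observations have mean zero.

```python
import numpy as np

def shrinkage_estimator(X):
    """X: p x n matrix of n iid mean-zero observations on p variables.
    Returns S*_n of Eq. (14) and the estimated shrinkage intensity b2/d2."""
    p, n = X.shape
    S = X @ X.T / n                              # sample covariance matrix S_n
    I = np.eye(p)
    m = np.trace(S) / p                          # Lemma 3.2: m_n = <S_n, I_n>
    d2 = np.sum((S - m * I) ** 2) / p            # Lemma 3.3: d_n^2 = ||S_n - m_n I_n||^2
    bbar2 = sum(np.sum((np.outer(x, x) - S) ** 2) / p
                for x in X.T) / n**2             # Lemma 3.4: bbar_n^2
    b2 = min(bbar2, d2)                          # b_n^2 = min(bbar_n^2, d_n^2)
    a2 = d2 - b2                                 # Lemma 3.5: a_n^2 = d_n^2 - b_n^2
    return (b2 / d2) * m * I + (a2 / d2) * S, b2 / d2
```

The returned ratio $b_n^2/d_n^2$ consistently estimates the shrinkage intensity $\beta_n^2/\delta_n^2$, which by Eq. (9) is also the PRIAL of the oracle $\Sigma_n^*$; similarly, $a_n^2 b_n^2/d_n^2$ estimates the expected loss, by Lemma 3.6.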

3.4. Optimality property of the estimator $S_n^*$

The final step of our strategy is to demonstrate that $S_n^*$, which we obtained as a consistent estimator for $\Sigma_n^*$, possesses an important optimality property. We already know that $\Sigma_n^*$ (hence, $S_n^*$ in the limit) is optimal among the linear combinations of the identity and the sample covariance matrix with nonrandom coefficients (see Theorem 2.1). This is interesting, but only mildly so, because it excludes the other linear shrinkage estimators with random coefficients. In this section, we show that $S_n^*$ is still optimal within a bigger class: the linear combinations of $I_n$ and $S_n$ with random coefficients. This class includes both the linear combinations that represent bona fide estimators, and those with coefficients that require hindsight knowledge of the true (and unobservable) covariance matrix $\Sigma_n$. Let $\Sigma_n^{**}$ denote the linear combination of $I_n$ and $S_n$ with minimum quadratic loss. It solves:
$$\min_{\rho_1, \rho_2} \; \|\Sigma^* - \Sigma_n\|_n^2 \quad \text{s.t.} \quad \Sigma^* = \rho_1 I_n + \rho_2 S_n. \quad (15)$$
In contrast to the optimization problem in Theorem 2.1 with solution $\Sigma_n^*$, here we minimize the loss instead of the expected loss, and we allow the coefficients $\rho_1$ and $\rho_2$ to be random. It turns out that the formula for $\Sigma_n^{**}$ is a function of $\Sigma_n$; therefore, $\Sigma_n^{**}$ does not constitute a bona fide estimator. By construction, $\Sigma_n^{**}$ has lower loss than $\Sigma_n^*$ and $S_n^*$ almost surely (a.s.), but asymptotically it makes no difference.

Theorem 3.3. $S_n^*$ is a consistent estimator of $\Sigma_n^{**}$, i.e. $\|S_n^* - \Sigma_n^{**}\|_n \to_{q.m.} 0$. As a consequence, $S_n^*$ has the same asymptotic expected loss (or risk) as $\Sigma_n^{**}$, that is, $E[\|S_n^* - \Sigma_n\|_n^2] - E[\|\Sigma_n^{**} - \Sigma_n\|_n^2] \to 0$.

Both $\Sigma_n^*$ and $\Sigma_n^{**}$ have the same asymptotic properties as $S_n^*$; therefore, they also have the same asymptotic properties as each other. The most important result of this paper is the following: The bona fide estimator $S_n^*$ has uniformly minimum quadratic risk asymptotically among all the linear combinations of the identity with the sample covariance matrix, including those that are bona fide estimators, and even those that use hindsight knowledge of the true covariance matrix.

Theorem 3.4. For any sequence of linear combinations $\hat{S}_n$ of the identity and the sample covariance matrix, the estimator $S_n^*$ defined in Eq. (14) verifies:
$$\lim_{N \to \infty} \; \inf_{n \geq N} \left( E[\|\hat{S}_n - \Sigma_n\|_n^2] - E[\|S_n^* - \Sigma_n\|_n^2] \right) \geq 0. \quad (16)$$
In addition, every $\hat{S}_n$ that performs as well as $S_n^*$ is identical to $S_n^*$ in the limit:
$$\lim_{n \to \infty} \left( E[\|\hat{S}_n - \Sigma_n\|_n^2] - E[\|S_n^* - \Sigma_n\|_n^2] \right) = 0 \iff \|\hat{S}_n - S_n^*\|_n \to_{q.m.} 0. \quad (17)$$

Thus, it is legitimate to say that $S_n^*$ is an asymptotically optimal linear shrinkage estimator of the covariance matrix with respect to quadratic loss under general asymptotics. Typically, only maximum likelihood estimators have such a sweeping optimality property, so we believe that this result is unique in shrinkage theory.
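To see Theorem 3.3 at work, one can compute $\Sigma_n^{**}$ directly: its coefficients solve a two-dimensional least-squares problem in the inner product $\langle \cdot, \cdot \rangle_n$ (the normal equations below), which requires knowing $\Sigma_n$. For moderately large $p$ and $n$ it nearly coincides with the bona fide $S_n^*$. (Sketch of ours; the names and the simulation design are not from the paper.)

```python
import numpy as np

rng = np.random.default_rng(4)
p, n = 100, 200
Sigma = np.diag(rng.lognormal(sigma=0.5, size=p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n).T
S = X @ X.T / n
I = np.eye(p)

def ip(A, B):
    return np.trace(A @ B.T) / p                 # <A, B>_n = tr(A B^t) / p_n

# Sigma**_n: projection of Sigma onto span{I_n, S_n}, via the normal equations of (15)
G = np.array([[ip(I, I), ip(I, S)], [ip(S, I), ip(S, S)]])
c = np.array([ip(I, Sigma), ip(S, Sigma)])
rho1, rho2 = np.linalg.solve(G, c)
Sigma_2star = rho1 * I + rho2 * S                # needs Sigma: not a bona fide estimator

# S*_n from Eq. (14), using only the data
m = np.trace(S) / p
d2 = ip(S - m * I, S - m * I)
bbar2 = sum(ip(np.outer(x, x) - S, np.outer(x, x) - S) for x in X.T) / n**2
b2 = min(bbar2, d2)
S_star = (b2 / d2) * m * I + ((d2 - b2) / d2) * S

print(np.sqrt(ip(S_star - Sigma_2star, S_star - Sigma_2star)))  # small, per Theorem 3.3
```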

Yet another distinctive feature of $S_n^*$ is that, to the best of our knowledge, it is the only estimator of the covariance matrix to retain a rigorous justification when the number of variables $p_n$ exceeds the number of observations $n$. Not only that, but $S_n^*$ is guaranteed to be always invertible, even in the case $p_n > n$, where rank deficiency makes the sample covariance matrix singular. Estimating the inverse covariance matrix when variables outnumber observations is sometimes dismissed as impossible, but the existence of $(S_n^*)^{-1}$ certainly proves otherwise. The following theorem shows that $S_n^*$ is usually well-conditioned.

Theorem 3.5. Assume that the condition number of the true covariance matrix $\Sigma_n$ is bounded, and that the normalized variables $y_{i1}/\sqrt{\lambda_i}$ are iid across $i = 1, \ldots, p_n$. Then the condition number of the estimator $S_n^*$ is bounded in probability.

This result follows from powerful results proven recently by probabilists [2]. If the cross-sectional iid assumption is violated, it does not mean that the condition number goes to infinity, but rather that it is technically too difficult to find out anything about it. Interestingly, there is one case where the estimator $S_n^*$ is even better-conditioned than the true covariance matrix $\Sigma_n$: if the ill-conditioning of $\Sigma_n$ comes from eigenvalues close to zero (multicollinearity in the variables) and the ratio of variables to observations $p_n/n$ is not negligible. In this case, $S_n^*$ is well-conditioned because the sample observations do not provide enough information to update our prior belief that there is no multicollinearity.

Remark 3.1. An alternative technique to arrive at an invertible estimator of the covariance matrix is based on a maximum entropy (ME) approach; e.g., see Theil and Laitinen [23] and Vinod [24]. The motivation comes from considering variables subject to rounding or other measurement error; the measurement error corresponding to variable $i$ is quantified by a positive number $d_i$, $i = 1, \ldots, p$. The resulting covariance matrix estimator of Vinod [24], say, is then given by $S_n + D_n$, where $D_n$ is a diagonal matrix with $i$th element $d_i^2/3$. While this estimator certainly is invertible, it was not derived having in mind any optimality considerations with respect to estimating the true covariance matrix. For example, if the measurement errors, and hence the $d_i$, are small, $S_n + D_n$ will be close to $S_n$ and inherit its ill-conditioning. Therefore, the maximum entropy estimators do not seem very interesting for our purposes.

4. Monte-Carlo simulations

The goal is to compare the expected loss (or risk) of various estimators across a wide range of situations. The benchmark is the expected loss of the sample covariance matrix. We report the percentage relative improvement in average loss of $S^*$, defined as:
$$\mathrm{PRIAL}(S^*) = \frac{E[\|S - \Sigma\|^2] - E[\|S^* - \Sigma\|^2]}{E[\|S - \Sigma\|^2]} \times 100.$$
The subscript $n$ is omitted for brevity, since no confusion is possible. If the PRIAL is

positive (negative), then $S^*$ performs better (worse) than $S$. The PRIAL of the sample covariance matrix $S$ is zero by definition. The PRIAL cannot exceed 100%. We compare the PRIAL of $S^*$ to the PRIAL of other estimators from finite sample decision theory.

4.1. Other estimators

There are many estimators worthy of investigation, and we cannot possibly study all the interesting ones, but we attempt to compose a representative selection.

4.1.1. Empirical Bayesian

Haff [8] introduces an estimator with an empirical Bayesian inspiration. Like $S^*$, it is a linear combination of the sample covariance matrix and the identity. The difference lies in the coefficients of the combination. Haff's coefficients do not depend on the observations $X$, only on $p$ and $n$. If the criterion is the mean squared error, Haff's approach suggests
$$\hat{S}_{EB} = \frac{pn - 2n - 2}{pn^2}\, m_{EB}\, I + \frac{n}{n+1}\, S \quad (18)$$
with $m_{EB} = [\det(S)]^{1/p}$. When $p > n$ we take $m_{EB} = m$ because the regular formula would yield zero. The initials EB stand for empirical Bayesian.

4.1.2. Stein-Haff

Stein [21] proposes an estimator that keeps the eigenvectors of the sample covariance matrix and replaces its eigenvalues $l_1, \ldots, l_p$ by
$$\frac{n\, l_i}{n - p + 1 + 2 l_i \sum_{j \neq i} \frac{1}{l_i - l_j}}, \quad i = 1, \ldots, p. \quad (19)$$
These corrected eigenvalues need neither be positive nor in the same order as sample eigenvalues. To prevent this from happening, an ad hoc procedure called isotonic regression is applied before recombining corrected eigenvalues with sample eigenvectors.2 Haff [9] independently obtains a closely related estimator. In any given simulation, we call $\hat{S}_{SH}$ the better performing estimator of the two. The other one is not reported. The initials SH stand for Stein and Haff.3

2 Intuitively, isotonic regression restores the ordering by assigning the same value to a subsequence of corrected eigenvalues that would violate it. Lin and Perlman [15] explain it in detail.
3 When $p > n$ some of the terms $l_i - l_j$ in formula (19) result in a division by zero. We just ignore them. Nonetheless, when $p$ is too large compared to $n$, the isotonic regression does not converge. In this case, $\hat{S}_{SH}$ does not exist.
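For reference, the raw corrected eigenvalues of Eq. (19), before the isotonic regression step, can be computed as follows (sketch of ours; the function name is hypothetical, the eigenvalues are assumed distinct, and no $p > n$ guard is included):

```python
import numpy as np

def stein_haff_eigenvalues(l, n):
    """l: sample eigenvalues l_1, ..., l_p (all distinct); returns Eq. (19)."""
    p = len(l)
    out = np.empty(p)
    for i in range(p):
        s = np.sum(1.0 / (l[i] - np.delete(l, i)))        # sum over j != i
        out[i] = n * l[i] / (n - p + 1 + 2 * l[i] * s)
    return out
```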

4.1.3. Minimax

Stein [22] and Dey and Srinivasan [4] both derive the same estimator. Under a certain loss function, it is minimax, which means that no other estimator has lower worst-case error. The minimax criterion is sometimes criticized as overly pessimistic, since it looks at the worst case only. This estimator preserves sample eigenvectors and replaces sample eigenvalues by
$$\frac{n\, l_i}{n + p + 1 - 2i}, \quad (20)$$
where sample eigenvalues $l_1, \ldots, l_p$ are sorted in descending order. We call this estimator $\hat{S}_{MX}$, where the initials MX stand for minimax.

4.1.4. Computational cost

When the number of variables $p$ is very large, $\Sigma^*$ and $S^*$ take much less time to compute than $\hat{S}_{EB}$, $\hat{S}_{SH}$ and $\hat{S}_{MX}$, because they do not need eigenvalues and determinants. Indeed the number and nature of operations needed to compute $S^*$ are of the same order as for $S$. It can be an enormous advantage in practice. The only seemingly slow step is the estimation of $\bar{b}^2$, but it can be accelerated by writing
$$\bar{b}^2 = \frac{1}{p} \sum_{i=1}^p \sum_{j=1}^p \left( \frac{1}{n^2} \left[ (X^{\circ 2})(X^{\circ 2})^t \right]_{ij} - \frac{1}{n} \left[ \left( \tfrac{1}{n} X X^t \right)^{\circ 2} \right]_{ij} \right), \quad (21)$$
where $[\cdot]_{ij}$ denotes the entry $(i,j)$ of a matrix and the symbol $\circ 2$ denotes element-wise exponentiation, that is, $[A^{\circ 2}]_{ij} = ([A]_{ij})^2$ for any matrix $A$.

4.2. Experiment design

The random variables used in the simulations are normally distributed. The true covariance matrix $\Sigma$ is diagonal without loss of generality. Its eigenvalues are drawn according to a log-normal distribution. Their grand mean $\mu$ is set equal to one without loss of generality. The structure of the Monte-Carlo experiment is as follows. We identify three parameters as influential on the results: (1) the ratio $p/n$ of variables to observations, (2) the cross-sectional dispersion of population eigenvalues $\alpha^2$, and (3) the product $pn$, which measures how close we are to asymptotic conditions. For each one of these three parameters we choose a central value, namely 1/2 for $p/n$, 1/2 for $\alpha^2$, and 800 for $pn$. First, we run the simulations with each of the three parameters set to its central value. Then, we run three different sets of simulations, allowing one of the three parameters to vary around its central value, while the two others remain fixed at their central value. This enables us to study the influence of each parameter separately. To get a point of reference for the shrinkage estimator $S^*$, we can compute analytically its asymptotic PRIAL, as implied by Theorems 2.1, 3.1 and 3.2: it is $\frac{p/n}{p/n + \alpha^2} \times 100$. This is the "speed of light" that we would attain if we knew the true parameters $\mu$, $\alpha^2$, $\beta^2$, $\delta^2$, instead of having to estimate them.
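In code, Eq. (21) turns the loop over observations into two matrix products on element-wise squares; a quick check against the direct definition of $\bar{b}^2$ confirms that the two agree (sketch of ours):

```python
import numpy as np

def bbar2_fast(X):
    # Eq. (21): bbar^2 from element-wise squares, no loop over observations
    p, n = X.shape
    S = X @ X.T / n
    X2 = X ** 2
    return (np.sum(X2 @ X2.T) / n**2 - np.sum(S ** 2) / n) / p

def bbar2_direct(X):
    # definition of Lemma 3.4: (1/n^2) sum_k ||x_k x_k^t - S||^2
    p, n = X.shape
    S = X @ X.T / n
    return sum(np.sum((np.outer(x, x) - S) ** 2) / p for x in X.T) / n**2

X = np.random.default_rng(5).standard_normal((20, 40))
print(bbar2_fast(X), bbar2_direct(X))    # identical up to rounding error
```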


More information

Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process

Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process Using the Divergence Information Criterion for the Determination of the Order of an Autoregressive Process P. Mantalos a1, K. Mattheou b, A. Karagrigoriou b a.deartment of Statistics University of Lund

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2 STA 25: Statistics Notes 7. Bayesian Aroach to Statistics Book chaters: 7.2 1 From calibrating a rocedure to quantifying uncertainty We saw that the central idea of classical testing is to rovide a rigorous

More information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Asymptotically Optimal Simulation Allocation under Dependent Sampling Asymtotically Otimal Simulation Allocation under Deendent Samling Xiaoing Xiong The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1815, USA, xiaoingx@yahoo.com Sandee

More information

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules

CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules CHAPTER-II Control Charts for Fraction Nonconforming using m-of-m Runs Rules. Introduction: The is widely used in industry to monitor the number of fraction nonconforming units. A nonconforming unit is

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

arxiv: v3 [physics.data-an] 23 May 2011

arxiv: v3 [physics.data-an] 23 May 2011 Date: October, 8 arxiv:.7v [hysics.data-an] May -values for Model Evaluation F. Beaujean, A. Caldwell, D. Kollár, K. Kröninger Max-Planck-Institut für Physik, München, Germany CERN, Geneva, Switzerland

More information

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK

Towards understanding the Lorenz curve using the Uniform distribution. Chris J. Stephens. Newcastle City Council, Newcastle upon Tyne, UK Towards understanding the Lorenz curve using the Uniform distribution Chris J. Stehens Newcastle City Council, Newcastle uon Tyne, UK (For the Gini-Lorenz Conference, University of Siena, Italy, May 2005)

More information

Chapter 3. GMM: Selected Topics

Chapter 3. GMM: Selected Topics Chater 3. GMM: Selected oics Contents Otimal Instruments. he issue of interest..............................2 Otimal Instruments under the i:i:d: assumtion..............2. he basic result............................2.2

More information

Probability Estimates for Multi-class Classification by Pairwise Coupling

Probability Estimates for Multi-class Classification by Pairwise Coupling Probability Estimates for Multi-class Classification by Pairwise Couling Ting-Fan Wu Chih-Jen Lin Deartment of Comuter Science National Taiwan University Taiei 06, Taiwan Ruby C. Weng Deartment of Statistics

More information

Research Note REGRESSION ANALYSIS IN MARKOV CHAIN * A. Y. ALAMUTI AND M. R. MESHKANI **

Research Note REGRESSION ANALYSIS IN MARKOV CHAIN * A. Y. ALAMUTI AND M. R. MESHKANI ** Iranian Journal of Science & Technology, Transaction A, Vol 3, No A3 Printed in The Islamic Reublic of Iran, 26 Shiraz University Research Note REGRESSION ANALYSIS IN MARKOV HAIN * A Y ALAMUTI AND M R

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

Variable Selection and Model Building

Variable Selection and Model Building LINEAR REGRESSION ANALYSIS MODULE XIII Lecture - 38 Variable Selection and Model Building Dr. Shalabh Deartment of Mathematics and Statistics Indian Institute of Technology Kanur Evaluation of subset regression

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek

Use of Transformations and the Repeated Statement in PROC GLM in SAS Ed Stanek Use of Transformations and the Reeated Statement in PROC GLM in SAS Ed Stanek Introduction We describe how the Reeated Statement in PROC GLM in SAS transforms the data to rovide tests of hyotheses of interest.

More information

1 Extremum Estimators

1 Extremum Estimators FINC 9311-21 Financial Econometrics Handout Jialin Yu 1 Extremum Estimators Let θ 0 be a vector of k 1 unknown arameters. Extremum estimators: estimators obtained by maximizing or minimizing some objective

More information

Multiplicative group law on the folium of Descartes

Multiplicative group law on the folium of Descartes Multilicative grou law on the folium of Descartes Steluţa Pricoie and Constantin Udrişte Abstract. The folium of Descartes is still studied and understood today. Not only did it rovide for the roof of

More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material Robustness of classifiers to uniform l and Gaussian noise Sulementary material Jean-Yves Franceschi Ecole Normale Suérieure de Lyon LIP UMR 5668 Omar Fawzi Ecole Normale Suérieure de Lyon LIP UMR 5668

More information

Hidden Predictors: A Factor Analysis Primer

Hidden Predictors: A Factor Analysis Primer Hidden Predictors: A Factor Analysis Primer Ryan C Sanchez Western Washington University Factor Analysis is a owerful statistical method in the modern research sychologist s toolbag When used roerly, factor

More information

Plotting the Wilson distribution

Plotting the Wilson distribution , Survey of English Usage, University College London Setember 018 1 1. Introduction We have discussed the Wilson score interval at length elsewhere (Wallis 013a, b). Given an observed Binomial roortion

More information

Chapter 7 Rational and Irrational Numbers

Chapter 7 Rational and Irrational Numbers Chater 7 Rational and Irrational Numbers In this chater we first review the real line model for numbers, as discussed in Chater 2 of seventh grade, by recalling how the integers and then the rational numbers

More information

POINTS ON CONICS MODULO p

POINTS ON CONICS MODULO p POINTS ON CONICS MODULO TEAM 2: JONGMIN BAEK, ANAND DEOPURKAR, AND KATHERINE REDFIELD Abstract. We comute the number of integer oints on conics modulo, where is an odd rime. We extend our results to conics

More information

Generalized Coiflets: A New Family of Orthonormal Wavelets

Generalized Coiflets: A New Family of Orthonormal Wavelets Generalized Coiflets A New Family of Orthonormal Wavelets Dong Wei, Alan C Bovik, and Brian L Evans Laboratory for Image and Video Engineering Deartment of Electrical and Comuter Engineering The University

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces Abstract and Alied Analysis Volume 2012, Article ID 264103, 11 ages doi:10.1155/2012/264103 Research Article An iterative Algorithm for Hemicontractive Maings in Banach Saces Youli Yu, 1 Zhitao Wu, 2 and

More information

A Note on Guaranteed Sparse Recovery via l 1 -Minimization

A Note on Guaranteed Sparse Recovery via l 1 -Minimization A Note on Guaranteed Sarse Recovery via l -Minimization Simon Foucart, Université Pierre et Marie Curie Abstract It is roved that every s-sarse vector x C N can be recovered from the measurement vector

More information

CMSC 425: Lecture 4 Geometry and Geometric Programming

CMSC 425: Lecture 4 Geometry and Geometric Programming CMSC 425: Lecture 4 Geometry and Geometric Programming Geometry for Game Programming and Grahics: For the next few lectures, we will discuss some of the basic elements of geometry. There are many areas

More information

Estimating Time-Series Models

Estimating Time-Series Models Estimating ime-series Models he Box-Jenkins methodology for tting a model to a scalar time series fx t g consists of ve stes:. Decide on the order of di erencing d that is needed to roduce a stationary

More information

Uniform Law on the Unit Sphere of a Banach Space

Uniform Law on the Unit Sphere of a Banach Space Uniform Law on the Unit Shere of a Banach Sace by Bernard Beauzamy Société de Calcul Mathématique SA Faubourg Saint Honoré 75008 Paris France Setember 008 Abstract We investigate the construction of a

More information

MA3H1 TOPICS IN NUMBER THEORY PART III

MA3H1 TOPICS IN NUMBER THEORY PART III MA3H1 TOPICS IN NUMBER THEORY PART III SAMIR SIKSEK 1. Congruences Modulo m In quadratic recirocity we studied congruences of the form x 2 a (mod ). We now turn our attention to situations where is relaced

More information

Estimating function analysis for a class of Tweedie regression models

Estimating function analysis for a class of Tweedie regression models Title Estimating function analysis for a class of Tweedie regression models Author Wagner Hugo Bonat Deartamento de Estatística - DEST, Laboratório de Estatística e Geoinformação - LEG, Universidade Federal

More information

On parameter estimation in deformable models

On parameter estimation in deformable models Downloaded from orbitdtudk on: Dec 7, 07 On arameter estimation in deformable models Fisker, Rune; Carstensen, Jens Michael Published in: Proceedings of the 4th International Conference on Pattern Recognition

More information

Applied Mathematics and Computation

Applied Mathematics and Computation Alied Mathematics and Comutation 217 (2010) 1887 1895 Contents lists available at ScienceDirect Alied Mathematics and Comutation journal homeage: www.elsevier.com/locate/amc Derivative free two-oint methods

More information

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS #A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS Ramy F. Taki ElDin Physics and Engineering Mathematics Deartment, Faculty of Engineering, Ain Shams University, Cairo, Egyt

More information

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0

Solution sheet ξi ξ < ξ i+1 0 otherwise ξ ξ i N i,p 1 (ξ) + where 0 0 Advanced Finite Elements MA5337 - WS7/8 Solution sheet This exercise sheets deals with B-slines and NURBS, which are the basis of isogeometric analysis as they will later relace the olynomial ansatz-functions

More information

Lower Confidence Bound for Process-Yield Index S pk with Autocorrelated Process Data

Lower Confidence Bound for Process-Yield Index S pk with Autocorrelated Process Data Quality Technology & Quantitative Management Vol. 1, No.,. 51-65, 15 QTQM IAQM 15 Lower onfidence Bound for Process-Yield Index with Autocorrelated Process Data Fu-Kwun Wang * and Yeneneh Tamirat Deartment

More information

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1)

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1) CERTAIN CLASSES OF FINITE SUMS THAT INVOLVE GENERALIZED FIBONACCI AND LUCAS NUMBERS The beautiful identity R.S. Melham Deartment of Mathematical Sciences, University of Technology, Sydney PO Box 23, Broadway,

More information

DIFFERENTIAL evolution (DE) [3] has become a popular

DIFFERENTIAL evolution (DE) [3] has become a popular Self-adative Differential Evolution with Neighborhood Search Zhenyu Yang, Ke Tang and Xin Yao Abstract In this aer we investigate several self-adative mechanisms to imrove our revious work on [], which

More information

Covariance Matrix Estimation for Reinforcement Learning

Covariance Matrix Estimation for Reinforcement Learning Covariance Matrix Estimation for Reinforcement Learning Tomer Lancewicki Deartment of Electrical Engineering and Comuter Science University of Tennessee Knoxville, TN 37996 tlancewi@utk.edu Itamar Arel

More information

Lecture: Condorcet s Theorem

Lecture: Condorcet s Theorem Social Networs and Social Choice Lecture Date: August 3, 00 Lecture: Condorcet s Theorem Lecturer: Elchanan Mossel Scribes: J. Neeman, N. Truong, and S. Troxler Condorcet s theorem, the most basic jury

More information

7.2 Inference for comparing means of two populations where the samples are independent

7.2 Inference for comparing means of two populations where the samples are independent Objectives 7.2 Inference for comaring means of two oulations where the samles are indeendent Two-samle t significance test (we give three examles) Two-samle t confidence interval htt://onlinestatbook.com/2/tests_of_means/difference_means.ht

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Maxisets for μ-thresholding rules

Maxisets for μ-thresholding rules Test 008 7: 33 349 DOI 0.007/s749-006-0035-5 ORIGINAL PAPER Maxisets for μ-thresholding rules Florent Autin Received: 3 January 005 / Acceted: 8 June 006 / Published online: March 007 Sociedad de Estadística

More information

The spectrum of kernel random matrices

The spectrum of kernel random matrices The sectrum of kernel random matrices Noureddine El Karoui Deartment of Statistics, University of California, Berkeley Abstract We lace ourselves in the setting of high-dimensional statistical inference,

More information

Voting with Behavioral Heterogeneity

Voting with Behavioral Heterogeneity Voting with Behavioral Heterogeneity Youzong Xu Setember 22, 2016 Abstract This aer studies collective decisions made by behaviorally heterogeneous voters with asymmetric information. Here behavioral heterogeneity

More information

An Improved Generalized Estimation Procedure of Current Population Mean in Two-Occasion Successive Sampling

An Improved Generalized Estimation Procedure of Current Population Mean in Two-Occasion Successive Sampling Journal of Modern Alied Statistical Methods Volume 15 Issue Article 14 11-1-016 An Imroved Generalized Estimation Procedure of Current Poulation Mean in Two-Occasion Successive Samling G. N. Singh Indian

More information

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation

Uniformly best wavenumber approximations by spatial central difference operators: An initial investigation Uniformly best wavenumber aroximations by satial central difference oerators: An initial investigation Vitor Linders and Jan Nordström Abstract A characterisation theorem for best uniform wavenumber aroximations

More information

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar

Topic: Lower Bounds on Randomized Algorithms Date: September 22, 2004 Scribe: Srinath Sridhar 15-859(M): Randomized Algorithms Lecturer: Anuam Guta Toic: Lower Bounds on Randomized Algorithms Date: Setember 22, 2004 Scribe: Srinath Sridhar 4.1 Introduction In this lecture, we will first consider

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

A New Perspective on Learning Linear Separators with Large L q L p Margins

A New Perspective on Learning Linear Separators with Large L q L p Margins A New Persective on Learning Linear Searators with Large L q L Margins Maria-Florina Balcan Georgia Institute of Technology Christoher Berlind Georgia Institute of Technology Abstract We give theoretical

More information

Analysis of some entrance probabilities for killed birth-death processes

Analysis of some entrance probabilities for killed birth-death processes Analysis of some entrance robabilities for killed birth-death rocesses Master s Thesis O.J.G. van der Velde Suervisor: Dr. F.M. Sieksma July 5, 207 Mathematical Institute, Leiden University Contents Introduction

More information

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population

Chapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population Chater 7 and s Selecting a Samle Point Estimation Introduction to s of Proerties of Point Estimators Other Methods Introduction An element is the entity on which data are collected. A oulation is a collection

More information

Hotelling s Two- Sample T 2

Hotelling s Two- Sample T 2 Chater 600 Hotelling s Two- Samle T Introduction This module calculates ower for the Hotelling s two-grou, T-squared (T) test statistic. Hotelling s T is an extension of the univariate two-samle t-test

More information

On the asymptotic sizes of subset Anderson-Rubin and Lagrange multiplier tests in linear instrumental variables regression

On the asymptotic sizes of subset Anderson-Rubin and Lagrange multiplier tests in linear instrumental variables regression On the asymtotic sizes of subset Anderson-Rubin and Lagrange multilier tests in linear instrumental variables regression Patrik Guggenberger Frank Kleibergeny Sohocles Mavroeidisz Linchun Chen\ June 22

More information

Principal Components Analysis and Unsupervised Hebbian Learning

Principal Components Analysis and Unsupervised Hebbian Learning Princial Comonents Analysis and Unsuervised Hebbian Learning Robert Jacobs Deartment of Brain & Cognitive Sciences University of Rochester Rochester, NY 1467, USA August 8, 008 Reference: Much of the material

More information

AN OPTIMAL CONTROL CHART FOR NON-NORMAL PROCESSES

AN OPTIMAL CONTROL CHART FOR NON-NORMAL PROCESSES AN OPTIMAL CONTROL CHART FOR NON-NORMAL PROCESSES Emmanuel Duclos, Maurice Pillet To cite this version: Emmanuel Duclos, Maurice Pillet. AN OPTIMAL CONTROL CHART FOR NON-NORMAL PRO- CESSES. st IFAC Worsho

More information

Bayesian Spatially Varying Coefficient Models in the Presence of Collinearity

Bayesian Spatially Varying Coefficient Models in the Presence of Collinearity Bayesian Satially Varying Coefficient Models in the Presence of Collinearity David C. Wheeler 1, Catherine A. Calder 1 he Ohio State University 1 Abstract he belief that relationshis between exlanatory

More information

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES OHAD GILADI AND ASSAF NAOR Abstract. It is shown that if (, ) is a Banach sace with Rademacher tye 1 then for every n N there exists

More information

substantial literature on emirical likelihood indicating that it is widely viewed as a desirable and natural aroach to statistical inference in a vari

substantial literature on emirical likelihood indicating that it is widely viewed as a desirable and natural aroach to statistical inference in a vari Condence tubes for multile quantile lots via emirical likelihood John H.J. Einmahl Eindhoven University of Technology Ian W. McKeague Florida State University May 7, 998 Abstract The nonarametric emirical

More information

A continuous review inventory model with the controllable production rate of the manufacturer

A continuous review inventory model with the controllable production rate of the manufacturer Intl. Trans. in O. Res. 12 (2005) 247 258 INTERNATIONAL TRANSACTIONS IN OERATIONAL RESEARCH A continuous review inventory model with the controllable roduction rate of the manufacturer I. K. Moon and B.

More information

Extension of Minimax to Infinite Matrices

Extension of Minimax to Infinite Matrices Extension of Minimax to Infinite Matrices Chris Calabro June 21, 2004 Abstract Von Neumann s minimax theorem is tyically alied to a finite ayoff matrix A R m n. Here we show that (i) if m, n are both inite,

More information

Session 5: Review of Classical Astrodynamics

Session 5: Review of Classical Astrodynamics Session 5: Review of Classical Astrodynamics In revious lectures we described in detail the rocess to find the otimal secific imulse for a articular situation. Among the mission requirements that serve

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

On the Chvatál-Complexity of Knapsack Problems

On the Chvatál-Complexity of Knapsack Problems R u t c o r Research R e o r t On the Chvatál-Comlexity of Knasack Problems Gergely Kovács a Béla Vizvári b RRR 5-08, October 008 RUTCOR Rutgers Center for Oerations Research Rutgers University 640 Bartholomew

More information

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations

Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Positivity, local smoothing and Harnack inequalities for very fast diffusion equations Dedicated to Luis Caffarelli for his ucoming 60 th birthday Matteo Bonforte a, b and Juan Luis Vázquez a, c Abstract

More information

DEPARTMENT OF ECONOMICS ISSN DISCUSSION PAPER 20/07 TWO NEW EXPONENTIAL FAMILIES OF LORENZ CURVES

DEPARTMENT OF ECONOMICS ISSN DISCUSSION PAPER 20/07 TWO NEW EXPONENTIAL FAMILIES OF LORENZ CURVES DEPARTMENT OF ECONOMICS ISSN 1441-549 DISCUSSION PAPER /7 TWO NEW EXPONENTIAL FAMILIES OF LORENZ CURVES ZuXiang Wang * & Russell Smyth ABSTRACT We resent two new Lorenz curve families by using the basic

More information

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems

Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various Families of R n Norms and Some Open Problems Int. J. Oen Problems Comt. Math., Vol. 3, No. 2, June 2010 ISSN 1998-6262; Coyright c ICSRS Publication, 2010 www.i-csrs.org Various Proofs for the Decrease Monotonicity of the Schatten s Power Norm, Various

More information

Adaptive Estimation of the Regression Discontinuity Model

Adaptive Estimation of the Regression Discontinuity Model Adative Estimation of the Regression Discontinuity Model Yixiao Sun Deartment of Economics Univeristy of California, San Diego La Jolla, CA 9293-58 Feburary 25 Email: yisun@ucsd.edu; Tel: 858-534-4692

More information