arxiv: v2 [stat.me] 27 Apr 2018

Size: px

Start display at page:

Download "arxiv: v2 [stat.me] 27 Apr 2018"

Liliana Randall
5 years ago
Views:

1 Nonasymtotic estimation and suort recovery for high dimensional sarse covariance matrices arxiv: v [statme] 7 Ar Introduction Adam B Kashlak and Linglong Kong Deartment of Mathematical and Statistical Sciences, CAB 63 University of Alberta Edmonton, AB, Canada 6G G1 kashlak@ualbertaca url: htts://sitesualbertaca/~kashlak/ Abstract: We roose a general framework for nonasymtotic covariance matrix estimation making use of concentration inequality-based confidence sets We secify this framework for the estimation of large sarse covariance matrices through incororation of ast thresholding estimators with key emhasis on suort recovery his technique goes beyond ast results for thresholding estimators by allowing for a wide range of distributional assumtions beyond merely sub-gaussian tails his methodology can furthermore be adated to a wide range of other estimators and settings he usage of nonasymtotic dimension-free confidence sets yields good theoretical erformance hrough extensive simulations, it is demonstrated to have suerior erformance when comared with other such methods In the context of suort recovery, we are able to secify a false ositive rate and otimize to maximize the true recoveries MSC 010 subject classifications: Primary 6H1; secondary 6G15 Keywords and hrases: Concentration Inequality, Confidence Region, Random Matrix, Schatten Norm, Log Concave Measure, Sub-Exonential Measure Covariance matrices and accurate estimators of such objects are of critical imortance in statistics Various standard techniques including rincile comonents analysis and linear and quadratic discriminant analysis rely on an accurate estimate of the covariance structure of the data Alications can range from genetics and medical imaging data to climate and other tyes of data Furthermore, in the era of high dimensional data, classical asymtotic estimators erform oorly in alications Stein, 1975; Johnstone, 001 Hence, we roose a general methodology for nonasymtotic covariance matrix estimation making use of confidence balls constructed from concentration inequalities While this is a general framework with many otential alications, we secifically consider the use of thresholding estimators for sarse covariance matrices with a view towards suort recovery that is, determining which variable airs are correlated Many estimators for the covariance matrix have been roosed working under the assumtion of sarsity Pourahmadi, 011, which is, in a qualitative sense, the case when most of the off-diagonal entries are zero or negligible Beyond mere theoretical interest, the assumtion of sarsity is widely alicable to real data analysis as the ractitioner may believe that many of the variable airings will be uncorrelated hus, it is desirable to tailor covariance estimation rocedures given this assumtion of sarsity Sarsity in the simlest sense imlies some bound on the number of non-zero entries in the columns of a covariance matrix hus, given a Σ R d d with entries σ i,j for i, j = 1,, d, there exists some constant k > 0 such that max j=1,,d d 1[σ i,j 0] k his can be generalized to aroximate sarsity as in Rothman, Levina and Zhu 009 by max j=1,,d d σ i,j q k for some q [0, 1 Furthermore, Cai and Liu 011 define a broader aroximately sarse class by bounding weighted column sums of Σ In El Karoui 008, a similar notion referred to as β-sarsity is defined Such classes of sarse covariance matrices allow for good theoretical erformance of estimators One class of estimators are shrinkage estimators that follow a James-Stein aroach by shrinking estimated eigenvalues, eigenvectors, or the matrix itself towards some desired target Haff, 1980; Dey and Srinivasan, 1985; Daniels and Kass, 1999, 001; Ledoit and Wolf, 004; Hoff, 009; Johnstone and Lu, 01 Another class of sarse estimators are those that regularize the estimate with lasso-style enalties Rothman, 01; Bien and ibshirani, 011 Yet another class consists of thresholding estimators, which declare the covariance between 1

2 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation two variables to be zero, if the estimated value is smaller than some threshold Bickel and Levina, 008a,b; Rothman, Levina and Zhu, 009; Cai and Liu, 011 Beyond these, there are other methods such as banding and taering, which aly only when the variables are ordered or a notation of roximity exists eg satial, time series, or longitudinal data As we will not assume such an ordering and strive to construct a methodology that is ermutation invariant with resect to the variables, these aroaches will not be considered Lastly, there has also been substantial work into the estimation of the recision or inverse covariance matrix While it is easily ossible that our aroach could be adated to this setting, it will not be considered in this article and will, hence, be reserved for future research In this article, we roose of novel aroach to the estimation of sarse covariance matrices making use of concentration inequality based confidence sets such as those constructed in Kashlak, Aston and Nickl 017 for the functional data setting In short, consider a samle of real vector valued data X 1,, X n R d with mean zero and unknown covariance matrix Σ Concentration inequalities are used to construct a non-asymtotic confidence set for Σ about the emirical estimate of the unknown covariance matrix, ˆΣ em = n 1 n X i XX i X where X = n 1 n X i is the samle mean While, it has been noted for examle, see Cai and Liu 011 that ˆΣ em may be a oor estimator when the dimension d is large and Σ is sarse, the confidence set is still valid given a desired coverage of 1 α o construct a better estimator, we roose to search this confidence set for an estimator ˆΣ s which otimizes some sarsity criterion to be concretely defined later his estimation method adats to the uncertainty of ˆΣ em in the high dimensional setting, d n, by widening the confidence set and thus allowing our sarse estimator to lie far away from the emirical estimate Furthermore, given some distributional assumtions, the concentration inequalities rovide us with non-asymtotic dimension-free confidence sets allowing for very desirable convergence results Many established methods for sarse estimation make use of a regularization or enalization term incororated to enforce sarsity Rothman, 01; Bien and ibshirani, 011 In some sense, our roosed method can be considered to be in this class of estimators However, we do not enforce sarsity via some lasso-style enalization term, but enforce it through the choice of α he larger our 1 α-confidence set is, the sarser our estimator is allowed to be hus, as with other regularized estimators, the ractitioner will have to make a choice of α to achieve the desired level of sarsity A major contribution of this work is develoing a method with the ability to avoid costly cross-validation of the tuning arameter and maintain strong finite samle erformance he secific focus as discussed below and in Aendix B is accurate suort recovery, which is the identification of the non-zero entries in the covariance matrix Our methodology allows for fixing a false ositive rate ercentage of zero entries incorrectly said to be non-zero and otimizing over the true ositive rate ercentage of correctly identified non-zero entries Furthermore, our estimation technique imlements a binary search rocedure resulting in a highly efficient algorithm esecially when comared to the more laborious otimization required by lasso enalization In Section, the general estimation rocedure is outlined, and it is secified for tuning threshold estimators with concentration methods Section 3 discusses our aroach to fixing a certain false ositive rate when attemting to recover the suort of the covariance matrix In Section 4, three different tyes of concentration are considered for secifically log concave measures, sub-exonential distributions, and bounded random variables Lastly, Section 5 details comrehensive simulations comaring our concentration aroach to sarse estimation to standard techniques such as thresholding and enalization Beyond simulation exeriments, a real data set of gene exressions for small round blue cell tumours from the study of Khan et al 001 is considered 11 Notation and Definitions When defining a Banach sace matrices, there are many matrix norms that can be considered In the article, the main norms of interest are the -Schatten norms, which will be denoted and are defined as follows Definition 11 -Schatten Norm For an arbitrary matrix Σ R k l and [1,, the -Schatten norm is 1/ Σ = tr Σ Σ / min{k,l} 1/ = ν l = ν i

3 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 3 where ν = ν 1,, ν min{k,l} is the vector of singular values of Σ and where l is the standard l norm in R d In the covariance matrix case where Σ R d d is symmetric and ositive semi-definite, Σ = tr Σ 1/ = λ l where λ is the vector of eigenvalues of Σ he 1-Schatten norm is referred to as the trace norm and the -Schatten norm as the Hilbert-Schmidt or Frobenius norm For =, we have the usual oerator norm for Σ : R l R k with resect to the l norm, Σ = su Σu l = ν l = max ν i, u l =1,,min{k,l} which is similarly the maximal eigenvalue when Σ is symmetric ositive semi-definite he definition of the -Schatten norm involves taking the square root of a symmetric matrix In general, a matrix square root is only unique u to unitary transformations However, for symmetric ositive semi-definite matrices, we will only require the unique symmetric ositive semi-definite square root defined as follows Definition 1 Matrix Square Root Let A R d d be a symmetric ositive semi-definite matrix with eigen-decomosition A = UDU where U the orthonormal matrix of eigenvectors and D the diagonal matrix of eigenvalues, λ 1,, λ d hen, A = UD U where D is the diagonal matrix with entries λ 1,, λ d Another family of norms that will be used is the collection of entrywise matrix norms denoted, which are written in terms of l norms of the entries Definition 13, q-entrywise norm For an arbitrary matrix Σ R k l with entries σ i,j and, q [1, ], the, q-entrywise norm is 1/ k l Σ,q = σ q i,j /q, j=1 with the usual modification in the case that = and/or q = When = q, these are the l norms of a given matrix treated as a vector in R k,l Note that the -Schatten norm coincides with the, -entrywise norm 1 Main contributions and connections to ast work he main contribution of this work is the construction of a general framework for tuning threshold estimators for suort recovery and estimation of sarse covariance matrices for sub-gaussian, sub-exonential, and bounded data his framework has the otential to extend to other distributional assumtions and to other estimation techniques It offers finite samle guarantees and a much faster comute time than costly otimization and cross validation methods Past work on thresholding estimators for sarse covariance estimation began with solely considering Gaussian data and then extending to sub-gaussian tails Bickel and Levina, 008a,b; Rothman, Levina and Zhu, 009 he more recent work of Cai and Liu 011 also rovides theoretical results for sub-gaussian data as well as certain olynomial-tye tails However, only Gaussian data is considered in their numerical simulations In this article, we consider sub-gaussian, heavier tailed sub-exonential, and bounded data he rincial focus of this work is to use non-asymtotic concentration inequalities to guarantee finite samle erformance Past articles are focused on roximity of their estimator to truth in oerator norm as the main metric of success due to convergence in oerator norm imlying convergence of the eigenvalues and eigenvectors While asymtotically, such methods have elegant theoretical convergence roerties, for finite samles one can achieve better erformance in oerator norm distance by simly choosing the emirical diagonal matrix as an estimator ie the emirical estimator with off-diagonal entries set to zero In Aendix B, we rerun some of the numerical simulations from Rothman, Levina and Zhu 009 and demonstrate that for Gaussian data the emirical diagonal matrix achieves better erformance than all of the universal threshold estimators for data in R 500 for a samle of size n = 100 For sub-exonential data albeit outside of the scoe of their methodology resented in Rothman, Levina and Zhu 009 the emirical diagonal matrix dominated all threshold estimators in oerator norm distance even when d < n We thus strongly argue that the main metric of success for such sarse estimators is suort recovery of the true covariance matrix

4 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 4 able 1 Secific metrics d, and deviation thresholds r α given secific assumtions on the data X i discussed in subsequent sections Assumtion on X i dˆσ s, Σ r α Log Concave Measure ˆΣ s Σ /c0 n log α Bounded in norm ˆΣ s Σ U n log α Sub-Exonential Measure ˆΣ s Σ max{ K log α/ n, K log α} he main theoretical results of this work are heorem 3, which establishes how to fix a false ositive rate for threshold estimators devoid of any distributional assumtions, and heorem 45, which demonstrates suort recovery both zero and non-zero entries in the secific case that the data has a strongly log concave measure In the case that the data is instead sub-exonential or bounded, we do not achieve a similar limit theorem, but are still able to achieve good erformance in numerical simulations Of indeendent interest is Lemma 34, which establishes a symmetrization result for random sarse covariance estimators For suort recovery, let Σ be the true covariance matrix and ˆΣ s be the sarse estimator We define a true ositive to be a air i, j with i, j = 1,, d such that the i, jth entries σ i,j 0 and ˆσ s i,j 0 We similarly define a false ositive to be a air i, j such that σ i,j = 0 and ˆσ s i,j 0 As often in statistical ractice, we aim to maximize the true ositive rate while controlling the false ositive rate Unlike other methods that only rovide a single estimator with no ability to control the false ositive rate, our framework can be tuned to a desired false ositive rate Sarse Estimation Procedure Let X 1,, X n R d be a samle of n indeendent and identically distributed mean zero random vectors with unknown d d covariance matrix Σ Define the emirical estimate of Σ to be ˆΣ em = n 1 n X i XX i X where X = n 1 n X i he goal of the following rocedure is to construct a sarse estimator, ˆΣ s, for Σ by first constructing a non-asymtotic confidence set for Σ based on ˆΣ em and then searching this set for the sarsest member A search method using threshold estimators is outlined in Section 1 o construct such a confidence set about ˆΣ em, concentration inequalities are emloyed Secific inequalities are chosen based on data assumtions and are discussed in subsequent Sections 41 and 4 In general, the inequalities all take a similar form Let d, be some metric measuring the distance between two covariance matrices, and let ψ : R R be monotonically increasing hen, the general form of the concentration inequalities is P dσ, ˆΣ em EdΣ, ˆΣ em + r e ψr, which is a bound on the tail of the distribution of dσ, ˆΣ em as it deviates above its mean hus, to construct a 1 α-confidence set, the variable r = r α is chosen such that ex ψr α = α his r α will be referred to as the deviation threshold able 1 contains some exlicit choices for the metric and deviation threshold given secific assumtions on the X i, which are used to drive the choice of concentration inequality Log concave and sub-exonential measures are discussed searately in Section 4 Now, let ˆΣ s be our sarse estimator for Σ We want these two to be close in the sense of the above confidence set and therefore choose a ˆΣ s such that dˆσ s, ˆΣ em r α Consequently, we have that dˆσ s, Σ EdˆΣ em, Σ + r α P P dˆσ s, ˆΣ em + dˆσ em, Σ EdˆΣ em, Σ + r α P dˆσ em, Σ EdˆΣ em, Σ + r α ex ψr α = α Hence, we choose a ˆΣ s close enough to ˆΣ em to share its elegant concentration roerties, but far enough away to result in a better estimator for Σ

5 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 5 o actually identify such a ˆΣ s, we require some criterion to otimize over all elements of the confidence set Section 1 takes insiration from thresholding techniques for sarse covariance estimation Bickel and Levina, 008a; Rothman, Levina and Zhu, 009; Cai and Liu, 011 and imlements them inside this concentration framework It begins with ˆΣ em and attemts to threshold it as much as ossible while still remaining in the confidence set 1 hresholding within confidence sets A generalized thresholding oerator, as defined in Rothman, Levina and Zhu 009, is s λ : R R such that s λ z z, s λ z = 0 for z λ, and s λ z z λ, which will aly element-wise to a matrix In the ast, such an oerator is alied to the emirical estimate ˆΣ em for some λ generally chosen via cross validation Instead of directly choosing a threshold λ, our aroach is to choose a confidence level 1 α and find the largest λ such that ds λ ˆΣ em, ˆΣ em r α ˆΣ s 0 Set 0 = ˆΣ diag ˆΣ em ˆΣ diag to be the emirical estimator normalized to have a diagonal of s ones Initialize the threshold to λ = 05 and write ˆΣ λ = s λˆσ em Let k = 1 be the number of stes of the recursion Choose an α and comute r α In Section 3, we will use the desired false ositive rate to motivate a choice for α 1 Increase k k + 1, then udate λ as follows a if dˆσ s λ, ˆΣ em r α, set λ λ + k 1 b Otherwise, set λ λ k 1 Reeat ste 1 until k has reached the desired number of iterations Generally, as few as k = 10 will suffice 3 he resulting estimator is ˆΣ s = ˆΣ diag ˆΣ s λ ˆΣ diag s where ˆΣ λ is the final matrix resulting from this recursion Remark 1 Positive Definite Estimators If ˆΣ s is not ositive semi-definite, then it can be rojected onto the sace of ositive semi-definite matrices A standard ast aroach is to ma the negative eigenvalues to zero or to their absolute value, which maintains the eigen-structure However, such a rojection will have an adverse effect on the suort recovery roblem as the estimator will no longer be sarse An alternative is to ma ˆΣ s ˆΣ s + γi d for some γ > 0 large enough to make the result ositive definite his will not effect the recovered suort of the matrix More clever rojections may also be ossible In the case that the metric d, is a monotonically increasing function of the Hilbert-Schmidt / Frobenius norm ˆΣ s λ ˆΣ em or another entrywise norm, then the sequence dˆσ s λ, ˆΣ em will be increasing in λ Proosition In the context of the above algorithm, if λ 1 > λ, then for any, q, we have Proof As λ 1 > λ, the entries of the matrix entries of ˆΣ s λ 1 ˆΣ em,q ˆΣ s λ ˆΣ em,q ˆΣ s λ 1 ˆΣ em are equal to or larger than in absolute value the ˆΣ s λ ˆΣ em Hence ˆΣ s λ 1 ˆΣ em,q ˆΣ s λ ˆΣ em,q by definition 13 his roerty guarantees that the above algorithm will find the sarsest ˆΣ s in the confidence set in the sense of having the largest threshold ossible However, for an arbitrary metric or secifically other -Schatten norms, this sequence may not necessarily be strictly increasing in λ Another commonly used norm, which will be shown in Section 5 to give suerior erformance in simulation, is the oerator norm ˆΣ s λ ˆΣ em, which does not yield a monotonically increasing sequence hough, this sequence is roughly increasing in the sense that it is lower bounded by definition by the maximum l s norm of the columns of ˆΣ λ ˆΣ em, which is an increasing sequence Furthermore, it is uer bounded by the l 1 s norm of the columns of ˆΣ λ ˆΣ em, which follows from the Gershgorin circle theorem Iserles, 009, and which is also an increasing sequence

6 3 Fixing a false ositive rate AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 6 For many sarse matrix estimation methods, theorems demonstrating sarsistency are roved hese indicate that in some asymtotic sense, the correct suort of the true matrix will eventually be recovered generally as n and d grow together at some rate However, none rovide a method for fixing a false ositive rate and finding an estimator that satisfies such a rate, which is certainly of interest to any ractitioner with a finite fixed samle size Hence, we resent a method for tuning our arameter α to a desired false ositive rate for the covariance estimator Before roceeding, we will require a class of sarse matrices similar to those from Bickel and Levina 008a,b; Rothman, Levina and Zhu 009; Cai and Liu 011 Secifically, let Uk, δ = {Σ R d d : max,,d j=1 d 1[σ i,j 0] k, and if σ i,j 0, then σ i,j δ > 0} 31 For the results regarding the false ositive rate, we are not concerned with the lower bound δ and only with k, the maximum number of non-zero entries er column or row As long as k increases more slowly than the dimension d, which is made secific below, we can achieve a desired false ositive rate without interference For an estimator Σ R d d, the false ositive rate is ρ Σ = #{ σ i,j 0 σ i,j = 0, i > j} dd 1/ where σ i,j is the ijth entry of the true covariance matrix and σ i,j is the ijth entry of the estimator Σ Hence, we are counting the number of non-zero entries in our estimator that should have been zero For notation, let ˆΣ em em be the usual emirical estimate of the covariance matrix Let ˆΣ 0 be the emirical estimator with all off em diagonal entries set to zero thus guaranteeing a false ositive rate of zero For η 05, let ˆΣ η be the emirical estimator after alication of the strong threshold oerator with threshold M η = quantile ˆσ i,j, η : i > j}, which removes 1001 η% of the off diagonal entries achieving a false ositive rate of aroximately 1 η due to the following lemma Lemma 31 Let Σ Uk, δ from Equation 31 Let the η [05, 1 threshold, M η, be the η quantile of ˆσ i,j em with i > j, and let the corresonding thresholded estimator be ˆΣ η = s Mη ˆΣ em with ijth entry denoted ˆσ η i,j hen, #{i, j i > j, ˆσ η i,j > 0, σ i,j = 0} as η dd 1/ as d as long as k = Od ν for ν < 1 Proof For Σ Uk, δ, we have that #{i, j σ i,j 0} kd = Od 1+ν and #{i, j σ i,j = 0} = Od Hence, the ratio of true zeros to true non-zeros is of order Od ν 1, which decays to zero as d he robustness of the η-quantile gives the desired convergence as long as no more than 1 η of the samle is contaminated By the sarsity assumtion that k = od, we have the desired result We cannot extend this lemma from η 05 to arbitrary quantiles ie false ositive rates without some additional assumtion on k/d, the relative amount of true ositive entries However, using the matrices, ˆΣ em em η and ˆΣ 0, as reference oints, we can interolate via the following theorem to achieve any desired false ositive rate heorem 3 Let Σ Uk, δ from Equation 31 with k = Od ν for ν < Given a desired false ositive rate, ρ 0, 05], and η = ρ a 05, 1] for some a Z + em, let ˆΣ ρ be the hard thresholded emirical estimator that achieves a false ositive rate of ρ hen, em ˆΣ ρ Σ 0 η ˆΣ em ρ Σ 0 K 1nρ d 1/4 + K nρ 1/4 d + ond where K 1, K are universal constants η

7 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 7 Remark 33 he above heorem 3 is wholly uninteresting for large values of n However, its ower arises in the non-asymtotic realm of interest namely when d n and also from highlighting the interlay between the dimension, samle size, and ρ, the sarseness of the estimator Furthermore, this result does not require any distributional assumtion It also does not require any assumtion on the lower bound δ on the non-zero σ i,j as it is only concerned with the σ i,j that are zero he roof of the above theorem relies on the following lemma involving symmetrization of random covariance matrices, which may be of indeendent interest Lemma 34 Let R R d d be a real valued symmetric random matrix with zero diagonal and mean zero entries bounded by 1 and not necessarily iid, and let B {0, 1} d d be an iid symmetric Bernoulli random matrix with entries b i,j = b j,i Bernoulli ρ for ρ 0, 1 Denoting the entrywise or Hadamard roduct by, let A = R B Let E { 1, 1} d d be a symmetric random matrix with iid Rademacher entries ε i,j for j < i and ε i,j = ε j,i hen, E A E K 1 d ρ 1/4 + K d 3/4 ρ where K 1, K are universal constants 4 Estimation of sarse covariance matrices he following three subsections detail different assumtions on the data under scrutiny and the secific concentration results that aly in these cases We consider sub-gaussian concentration for log concave measures and for bounded random variables We also consider sub-exonential concentration However, this collection is by no means exhaustive Given the wide variety of concentration inequalities being develoed, our aroach can be alied much more widely than to merely these three settings 41 Log Concave Measures In this section, the general methods from Section are secialized for an iid samle X 1,, X n R d whose common measure µ is strongly log-concave his roerty imlies dimension-free sub-gaussian concentration and includes such common distributions as the multivariate Gaussian, Chi, and Dirichlet distributions Definition 41 Strongly log-concave measure A measure µ on R d is strongly log-concave if there exists a c > 0 such that dµ = e Ux dx and HessU ci d 0 ie is non-negative definite where HessU is the d d matrix of second derivatives From Corollary D5, let X 1,, X n R d have measures µ 1,, µ n, which are all strongly log-concave with coefficients c 1,, c n Let ν = µ 1 µ n be the roduct measure on R d n hen, for any 1-Lischitz φ : R d n R and for any r > 0, P φx 1,, X n EφX 1,, X n + r e mini cir / his follows from heorem D4 and the other results contained within Aendix D1 For a detailed exosition of how Gaussian concentration is established for log concave measures, see Chater 5 of Ledoux 001 Examle 4 Multivariate Gaussian measure For an arbitrary Gaussian measure on R d with mean zero and covariance Σ, we have that Ux = x Σ 1 x/ hus, HessU = Σ 1, and letting {λ i } d be the eigenvalues of Σ, any 0 < c min,,d λ 1 i = max i λ i 1 satisfies the above definition Alying Proosition D5, we have that for X 1,, X n R d iid multivariate Gaussian random variables, where λ 0 = max,,d λ i P φx 1,, X n EφX 1,, X n + r e r /λ 0 Examle 43 Dirichlet Distribution For X Dirichlet α 1,, α d with α i > 1 for all i, the exonent U = d 1 α i log x i and HessU is a diagonal matrix with entries α 1 1/x 1,, α d 1/x d hus, the maximum c to ensure that HessU ci d 0 for all values of x i 0, 1 with d x i = 1 is c = min,,d α i 1

8 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 8 o make use of the above result, we must choose a suitable Lischitz function φ Let X 1,, X n, X R d be indeendent and identically distributed random variables with covariance Σ and with a common strongly log-concave measure µ with coefficient c > 0 Let λ 1 λ n be the eigenvalues of Σ and Λ = λ 1,, λ n For some [1, ], let be the -Schatten norm, which in this case is Σ = Λ l Note that XX = X l for any [1, ] Define the function φ to be 1 φx 1,, X n = n X i EXX i EX For {1,, }, we have that φ is Lischitz with coefficient φ Li = n with resect to the Frobenius or Hilbert-Schmidt metric, which is established in Proosition C5 for = and = and in Proosition C for = 1 hat is, let X 1,, X n, Y 1,, Y n R d, and denote X = X 1,, X n and Y = Y 1,, Y n, then φx φy n d, X, Y = 1 n X i Y i l From here, the rocedure outlined in Section can be imlemented with the given φ and r α = /nc 0 log α In many cases, including the two examles above, the constructed confidence set is comletely dimensionfree hus, even mild assumtions on the relationshi between the samle size n and the dimension d, such as log d = on 1/3 from the adative soft thresholding estimator of Cai and Liu 011, are not needed to rove consistency in our setting Furthermore, the concentration inequalities immediately give us a fast rate of convergence as long as log α = on with a roof rovided in Aendix A heorem 44 Let X 1,, X n R d be iid with common measure µ Let µ be strictly log concave with some fixed constant c 0 from Definition 41 hen, for α 0, 1, [1, ], and r α = /nc 0 log α, ˆΣs su P Σ O n 1 + n 1/4 log α α ˆΣ s : ˆΣ s ˆΣ em r α A second and arguably more imortant issue, see Aendix B, in the setting of sarse covariance estimation is that of suort recovery or sarsistency Lam and Fan, 009; Rothman, Levina and Zhu, 009 o recover the suort of a covariance matrix that is, determine which entries σ i,j 0 we will require a class of sarse matrices from Equation 31 In ast work, a notation of aroximate sarsity is considered where the first condition in Uk, δ is relaced with max,,d d σ i,j q < k for q [0, 1 However, once we bound the non-zero entries away from zero by some δ, such aroximate sarsity imlies standard sarsity with q = 0 It is worth noting that the above Proosition 44 does not require such a sarsity class, because our estimator is forced to remain close enough to ˆΣ em to follow ˆΣ em s convergence to Σ heorem 45 Let X 1,, X n R d be iid with common measure µ Let µ be strictly log concave with some fixed constant c 0 from Definition 41 Furthermore, let Σ Uk, δ and let δ = On 1+ε for any ε > 0 hen, for ˆΣ s denoting the concentration estimator using the hard thresholding estimation from Section 1 with the oerator norm metric, lim P suˆσ s suσ = 0 n where suσ = {i, j : σ i,j 0} Remark 46 Note that the condition that δ = On 1+ε allows for a much quicker decay of the non-zero entries of Σ than in El Karoui 008 where the lower bound is of the form Cn α0 with 0 < α 0 < It is also much quicker than the similar rate achieved in Rothman, Levina and Zhu 009 where the lower bound is any τ such that nτ increases faster than logd with the enforced asymtotic condition that logd/n = o1 resulting in a rate no faster than n

9 4 Sub-Exonential Distributions AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 9 Comared with the reviously discussed measures with sub-gaussian concentration, there exists a larger class of measures with sub-exonential concentration Such measures can be secified as those that satisfy the Poincaré or sectral ga inequality Bobkov and Ledoux, 1997; Ledoux, 001; Gozlan, 010 For a random variable X on R d with measure µ, this is Var fx C f dµ for some C > 0 and for all locally Lischitz functions f If X satisfies such an inequality, then see heorem D6 or Chater 5 of Ledoux 001 for for X 1,, X n R d iid coies of X and for some Lischitz function φ : R d n R, P φx 1,, X n EφX 1,, X n + r ex 1K { } rb min, r where K > 0 in a constant deending only on C and a i φ, b max iφ,,n As in Section 41, φ is chosen to be 1 φx 1,, X n = n X i EXX i EX, which is Lischitz with constant n his results in values of a = 1 and b = n for the above coefficients Hence, the deviation threshold in this setting is r α = max{ K log α/ n, K log α} While an otimal or reasonable value for K may not be known, it makes little difference given the roosed rocedure for choosing α detailed in Section 3 for a desired false ositive rate his is because the term K log α will be equivalently tuned to determine the otimal size of the constructed confidence set As the deviation threshold in this setting is bounded below by a constant K log α, we do not achieve the nice convergence results as in the log concave setting However, the dimension-free concentration still allows for good erformance in simulation settings as will be seen in the Section 5 43 Bounded Random Variables In this section, we consider random variables that are bounded in some norm Consider a Banach sace B, and a collection of iid random variables X 1,, X n B such that X i U for all i = 1,, n Given only this assumtion, the bounded differences inequality, detailed in Aendix D3 and in Section 334 of Giné and Nickl 016, can be alied in this secific setting It rovides sub-gaussian concentration for such random variables Secifically, let X 1,, X n R d be iid with X i l U for i = 1,, n hen, for any [1, ], X i X i U, and ˆΣem P Σ E ˆΣ em Σ + r e nr /U his follows immediately from heorem D8 Hence, for any collection of real valued random vectors bounded in Euclidean norm, the bounded differences inequality can be alied to the emirical estimate for any of the -Schatten norms he deviation threshold is r α = U n log α However, unlike in the revious setting, the bounds may not necessarily be dimension free Examle 47 Distributions on the Hyercube If the comonents X i,j 1 such as for multivariate uniform or Rademacher random variables, then U = d Consequently, the deviation threshold r α = O d/n is not dimension free While this makes estimation with resect to oerator norm distance challenging, we can still use heorem 3 to fix the false ositive rate a

10 5 Numerical simulations AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 10 In the following subsections, we aly the methods from the revious sections to three multivariate distributions of interest: the Gaussian, Lalace, and Rademacher distributions In doing so, we aly heorem 3 to analytically determine the ideal confidence ball radius in order to construct a sarse estimator of Σ We also comare the suort recovery of our aroach against enalized estimators and standard alication of universal threshold estimators As mentioned before, our roosed concentration confidence set based method has a similar feel to regularized / enalized estimators as the larger the constructed confidence set is, the sarser the returned estimator will be hus, we comare our aroach with the following lasso style estimator from the R ackage PDSCE Rothman, 013, which otimizes ˆΣ PDS = arg min Σ 0 { Σ ˆΣ } em τ log detσ + λ Σ l 1 with τ, λ > 0 Here, the log det term is used to enforce ositive definiteness of the final solution, and l 1 is the lasso style enalty, which enforces sarsity he similar method from the R ackage scov Bien and ibshirani, 01, which uses a majorize-minimize algorithm to determine { ˆΣ MMA ˆΣem = arg min tr Σ 1 } log detσ 1 + λ Σ l 1 Σ 0 for some enalization λ > 0, was also considered but roved to run too slowly on high dimensional matrices ie d 00 to be included in the numerical exeriments Of course, we also comare our method against the four universal thresholding estimators alied to the emirical covariance matrix from Rothman, Levina and Zhu, 009, Hard, Soft, SCAD, and Adative LASSO: ˆΣ Hard λ ˆΣ SCAD λ = ˆΣ Soft λ ˆΣ Adt λ = {ˆσ i,j 1[ˆσ i,j > λ]} i,j ˆσ Soft i,j a 1 a ˆσ i,j λ + λ ˆσ Hard i,j = {signˆσ i,j ˆσ i,j λ + } i,j for ˆσ i,j λ for λ < ˆσ i,j aλ for ˆσ i,j > aλ = {signˆσ i,j ˆσ i,j λ η+1 ˆσ i,j η + } i,j where ˆσ i,j is the i, jth entry of the emirical covariance estimate, a = 37, and η = 1 he arameter λ > 0 is the threshold, which is chosen in ractice via cross validation with resect to the Hilbert-Schmidt norm Briefly, the data is slit in half, two emirical estimators are formed, one is thresholded, and λ is selected to minimize the Hilbert-Schmidt distance between the one emirical estimate and the other thresholded estimate 51 Multivariate Gaussian Data Let X 1,, X n R d be indeendent and identically distributed mean zero random vectors with a strictly log concave measure and covariance matrix Σ By Corollary D5, there exists a constant c 0 > 0 such that P φx EφX + r e nr /c 0 where φx = ˆΣ em Σ where ˆΣ em = n 1 n X i XX i X is the emirical estimate of the covariance matrix his results in the size 1 α confidence set for Σ { C 1 α = Σ : ˆΣ em Σ E ˆΣ em Σ + } c 0 /n log α for α 0, 1 In the notation of Section, r α = c 0 /n log α

11 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 11 able Percentage of false and true ositives for multivariate Gaussian data and Σ tri-diagonal with diagonal entries 1 and off-diagonal entries 03 False Positive % rue Positive % Dimension CoM 1% CoM 5% PDS Hard Soft SCAD Adt able 3 Percentage of false and true ositives for multivariate Lalace data and Σ tri-diagonal with diagonal entries 1 and off-diagonal entries 03 False Positive % rue Positive % Dimension CoM 1% CoM 5% PDS Hard Soft SCAD Adt In the multivariate Gaussian case, c 0 is the maximal eigenvalue of the covariance matrix Σ We avoid any issues of estimating c 0 in ractice Regardless of our choice for c 0 tuning the regularization arameter α to a secific false ositive rate negates the need for an accurate estimate of c 0 able dislays false ositive and true ositive ercentages for seven sarse estimators comuted over 100 relications of a random samle of size n = 50 of d = 50, 100, 00, 500 dimensional multivariate Gaussian data with a tri-diagonal covariance matrix Σ whose diagonal entries are 1 and whose off-diagonal entries are 03 We can clearly see that the concentration-based estimator aroaches the desired false ositive rate either 1% or 5% as the dimension increases In contrast, the thresholding estimators with threshold λ chosen via cross validation generally start with higher false ositive ercentages, which tend to zero as the dimension increases As noted in revious work, hard thresholding is overly aggressive he PDS method is very stable across changes in the dimension and maintains a constant 34% false ositive rate and 50% true ositive rate 5 Multivariate Lalace Data here are many ossible ways to extend the univariate Lalace distribution, also referred to as the double exonential distribution, onto R d For the following simulation study, we choose the extension detailed in Eltoft, Kim and Lee 006 Namely, let Z N 0, σ and let V Exonential 1 hen, X = V Z Lalace σ/, which has df fx = σ 1 ex x /σ and variance Var X = σ For the multivariate setting, now let Z R d be multivariate Gaussian with zero mean and covariance Σ and, once again, let V Exonential 1 hen, we declare X = V Z to have a multivariate Lalace distribution with zero mean and covariance Σ able 3 dislays false ositive and true ositive ercentages for seven sarse estimators comuted over 100 relications of a random samle of size n = 50 of d = 50, 100, 00, 500 dimensional multivariate Lalace data with a tri-diagonal covariance matrix Σ whose diagonal entries are 1 and whose off-diagonal entries are 03 Similarly to the revious setting, the concentration-based estimator aroaches the desired false ositive rate either 1% or 5% as the dimension increases All universal thresholding estimators set most of the entries in the matrix to zero when threshold λ chosen via cross validation he PDS method is still stable across changes in the dimension but fixates on a much higher false ositive rate around 15% and a similar true ositive rate of 51%

12 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 1 Multivariate Gaussian Data Multivariate Lalace Data rue Positive % Adt Hard SCAD Soft PDS rue Positive % All hresholds PDS False Positive % False Positive % Multivariate Rademacher Data rue Positive % Soft SCAD PDS 10 0 Adt Hard False Positive % Fig 1 A line demarcating the trade-off between false and true ositive recoveries for multivariate Gaussian to left, Lalace to right, and Rademacher bottom data from 100 relications of samle size n = 50 and dimension d = 100 he lines were formed by varying the false ositive rate for the concentration estimator he blue dots indicate the erformance of ast methods for sarse covariance estimation

13 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 13 able 4 Percentage of false and true ositives for multivariate Rademacher data and Σ tri-diagonal with diagonal entries 1 and off-diagonal entries 03 False Positive % rue Positive % Dimension CoM 1% CoM 5% PDS Hard Soft SCAD Adt High Dimensional Binary Vectors Random binary vectors fall into the category of bounded random variables, which have sub-gaussian concentration as a consequence of the bounded differences inequality an extension of Hölder s inequality as discussed in Section 43 he result is a slightly different form for the deviation threshold And while the concentration behaviour in this setting relies on the dimension and is oor for roducing an estimator that is close in oerator or Hilbert-Schmidt norm, our suort recovery methodology is still able to erform well in this setting able 4 dislays false ositive and true ositive ercentages for seven sarse estimators comuted over 100 relications of a random samle of size n = 50 of d = 50, 100, 00, 500 dimensional multivariate Rademacher data with a tri-diagonal covariance matrix Σ whose diagonal entries are 1 and whose off-diagonal entries are 03 As a consequence of the bounded differences inequality, this case also exhibits sub-gaussian behaviour As a consequence, able 4 and similar to able Concentration estimators erform better as d increases; hreshold estimators are overly aggressive as d increases; And the PDS method s suort recover is unaffected by the change in d 54 Small Round Blue-Cell umour Data Following the same analysis erformed in Rothman, Levina and Zhu 009 and subsequently in Cai and Liu 011, we will consider the dataset resulting from the small round blue-cell tumour SRBC microarray exeriment Khan et al, 001 he data set consists of a training set of 64 vectors containing 308 gene exressions he data contains four tyes of tumours denoted EWS, BL-NHL, NB, and RMS As erformed in the two revious aers, the genes are ranked by their resective amount of discriminative information according to their F -statistic 1 k m=1 n m x m x F = k 1 1 n k k m=1 n m 1ˆσ m where x is the samle mean, k = 4 is the number of classes, n = 64 is the samle size, n m is the samle size of class m, and likewise, x m and ˆσ m are, resectively, the samle mean and variance of class m he to 40 and bottom 160 scoring genes were selected to rovide a mix of the most and least informative genes able 5 dislays the results of alying the four threshold estimators with cross validation, the PDS method, and our concentration-based thresholding with the sub-gaussian formula and with false ositive rates of 10, 5, and 1 ercent he ercentage of matrix entries that are retained for the most informative block and the least informative block are tabulated Deending on the chosen false ositive rate, our concentration-based estimators give similar results to Soft and SCAD thresholding PDS is the least conservative of the methods as it kees the most entries Hard and Adative LASSO thresholding are the most aggressive methods It is also worth noting that our method is comutationally efficient enough to consider the entire matrix at once In fact, it took only 1313 seconds to comute ˆΣ s on an Intel i7-7567u CPU, 350GHz In contrast, the PDS method, which still has significantly faster run times than cross validating the threshold estimators, took over 101 minutes to finish False ositive rates of 5%, 1%, and 01% were tested he fraction of non-zero entries in ˆΣ s was 86%, 0%, and 0%, resectively For comarison, the fraction of non-zero

AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 14 able 5 he ercentages of non-zero off-diagonal entries in the six resective covariance estimates artitioned into two arts: the

54% 7% 04% 156% Hard Soft SCAD Adt Informative 60% 47% 13% 99% Uninformative 03% 3% 18% 07% Fig A density lot of the number of non-zero entries in ˆΣ s R 308 308 artitioned into 1 1 blocks for false

14 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 14 able 5 he ercentages of non-zero off-diagonal entries in the six resective covariance estimates artitioned into two arts: the informative art is the block of the highest scoring genes; the uninformative art is the remaining matrix entries non-zero % CoM 10% CoM 5% CoM 1% PDS Informative 303% 56% 85% 473% Uninformative 54% 7% 04% 156% Hard Soft SCAD Adt Informative 60% 47% 13% 99% Uninformative 03% 3% 18% 07% Fig A density lot of the number of non-zero entries in ˆΣ s R artitioned into 1 1 blocks for false ositive rates of 5%, 1%, and 01% entries retained by PDS was 177% If such an analysis is meant to lead to follow-u research on secific gene airings, then culling as many false ositives as ossible is of critical imortance he sarse covariance estimator was artitioned into 1 1 blocks and the number of non-zero entries was tabulated for each he results are dislayed in Figure 6 Summary and Extensions In this article, we were concerned with the suort recovery and estimation of an unknown sarse covariance matrix through the search of non-asymtotic confidence sets constructed via concentration inequalities Our method is able to construct an estimator with a secific false ositive rate agnostic to distributional assumtions via heorem 3 Furthermore, this aroach was secifically shown in heorem 45 to give desirable convergence results for recovery of the suort in the case of log concave measures Numerical erformance on simulated data detailed in Section 5 rovided similar and often suerior erformance to ast aroaches in the log concave, sub-exonential, and bounded data settings While our focus was on sarsity, the generic rocedure of searching the confidence set with some structural criterion in mind can be alied in much more generality Given some assumed form or roerty of the unknown true covariance matrix, a similar search rocedure may be formulated to recover such matrices Our method is similar to the enalization estimators such as lasso in the sense that the arameter α can be tweaked to determine the amount of sarsity to allow However, unlike some of the comlicated otimizations surrounding lasso enalization, our estimator is fast to comute even in high dimensions where such comlicated otimization stes required by lasso enalties can be comutationally intractable his was the case with the majorize-minimization algorithm whose R instantiation became intractably slow when d = 00 As was mentioned in the introduction, the emirical covariance is known in cases of sarsity and high dimension to be a oor estimator We nevertheless chose it as our starting oint for the search rocedure as its form, being a sum of iid random variables, fits nicely into the concentration aradigm However, this does not imly that other starting oints should not be considered Constructing concentration inequality

15 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 15 based confidence sets about more comlicated estimators may not be straightforward But, if some such set can be constructed, then alying our search technique will likely result in even better estimates of the true covariance matrix his methodology is also extendable beyond the setting of covariance matrices to other matrix estimation settings of interest Most readily would be the case of matrix regression For examle, let X R n q, Y R n, E R n, and B R q We could consider the model Y = XB + E where B is the unknown regression oerator and the entries of E are assumed to be a matrix of n centred iid random vectors in R with either sub-gaussian or sub-exonential tails and common covariance matrix Σ R In this case, we could adat the resented theory to the suort recovery and estimation of B under the assumtion of sarsity Even with starting at the emirical covariance matrix, the method of Section 1 was shown in the numerical simulations to have suerior erformance in suort recovery when comared with other methods as it is able to be tuned to a desired false ositive rate as we done in Section 3 It is feasible that a hybrid method is ossible where our method is used to recover the suort of Σ followed by some other technique for imroving uon the non-zero entries Aendix A: Proofs Proof of Lemma 34 his roof follows from the result of Lata la 005 heorem also found in heorem 38 of ao 01 without the assumtion of iid entries in the random matrix We first aly the exectation with resect to E and use the result from Lata la 005 E A E = E A E E A E [ E A E E A E d K 1 E max,,d ] j=1 a i,j + K E d i,j=1 a 4 i,j with K 1, K universal constants For the second term in the above equation, we have via Jensen s inequality and the fact that a i,j 1 that d d E d ρ = dρ i,j=1 a 4 i,j i,j=1 Ea 4 i,j For the first term in the above equation, we make use of the fact that a i,j 1 and that only ρ are non-zero resulting in d d d E max E a i,j,,d j=1 a i,j j=1 d E a i,j + i,j d i,j k=1 d ρ + d 3 d ρ dρ + d 3/ ρ a i,j a i,k Combining the above results and udating the constants K 1, K as necessary gives the desired result [ E A E K 1 dρ + K d ρ] 3/ K1 d ρ 1/4 + K d 3/4 ρ

16 AB Kashlak and L Kong/Nonasymtotic suort recovery and estimation 16 Proof of heorem 3 Without loss of generality, we can normalize ˆΣ em such that the diagonal entries are em 1 hus ˆΣ 0 = I d, the d dimensional identity matrix, and the off-diagonal entries of all matrices considered will be bounded in absolute value by one For the emirical covariance estimator, ˆΣ em em ˆΣ 0 = ˆΣ em 1 We can decomose ˆΣ em into three arts: the diagonal of ones; the off-diagonal terms corresonding to σ i,j 0; and the off-diagonal terms corresonding to σ i,j = 0 he number of non-zero off-diagonal terms is bounded in each row/column by k Hence, ˆΣ em em ˆΣ 0 ˆΣ em 0 em + ˆΣ =0 em k + ˆΣ =0 em where ˆΣ =0 has entries ˆσ i,j such that Eˆσ i,j = 0 Let the entrywise or Hadamard roduct of two similar matrices A and B be A B with entry ijth entry em a i,j b i,j i,j For ease of notation, we denote Π 0 = ˆΣ =0 Let Π 1 be the result of randomly removing half of the entries from Π 0, which is Π 1 = Π 0 B where B {0, 1} d d is a symmetric random matrix with iid Bernoulli entries Considering the corresonding symmetric Rademacher random matrix, E = B 1, we then have that E Π 1 = E Π 0 B = 1 E Π 0 ± Π 0 E where the ± comes from the symmetry of E hus, E Π 1 1 E Π 0 1 E Π 0 E his idea can be iterated Let Π m = Π 0 B 1 B m with the B i iid coies of B from before hen, similarly, Moreover, E Π m 1 E Π m Π m 1 E m E Π m 1 E Π m 1 1 Π m 1 E m m 1 E Πm m E Π 0 m+j E Π j E Alying Lemma 34 m times and udating universal constants K 1, K as necessary results in hus, for ρ = m, we have m 1 E Π [ m m E Π 0 m+j K 1 d j/4 + K d 3/4 j/] j=0 j=0 K 1 d m/4 + K d 3/4 m/ E Π m ρe Π 0 K 1 d ρ 1/4 + K d 3/4 ρ em We want to relace the Π m with ˆΣ ρ Σ 0 and similarly for Π 0 he off-diagonal entries such that σ i,j 0 can contribute at most k = od ν, ν < 1, to the oerator norm Hence, E ˆΣ em ρ Σ 0 ρe ˆΣ em Σ 0 K 1 d ρ 1/4 + K d 3/4 ρ ρod ν We lastly aly the crude but effective in the non-asymtotic setting bound ˆΣ em d/n almost surely Dividing by E ˆΣ em Σ 0 results in E ˆΣ em ρ Σ 0 ρ E ˆΣ em Σ 0 K 1nd ρ 1/4 + K nd 1/4 ρ + ond ν 1

Estimation of the large covariance matrix with two-step monotone missing data

Estimation of the large covariance matrix with two-ste monotone missing data Masashi Hyodo, Nobumichi Shutoh 2, Takashi Seo, and Tatjana Pavlenko 3 Deartment of Mathematical Information Science, Tokyo