
Posterior convergence rates for estimating large precision matrices using graphical models

BY SAYANTAN BANERJEE
Department of Statistics, North Carolina State University, 5219 SAS Hall, Campus Box 8203, Raleigh, NC 27695, USA
sbaner5@ncsu.edu

SUBHASHIS GHOSAL
Department of Statistics, North Carolina State University, 4276 SAS Hall, Campus Box 8203, Raleigh, NC 27695, USA
sghosal@ncsu.edu

SUMMARY

We consider Bayesian estimation of a $p \times p$ precision matrix, when $p$ can be much larger than the available sample size $n$. It is well known that consistent estimation in such ultra-high dimensional situations requires regularisation such as banding, tapering or thresholding. We consider a banding structure in the model and induce a prior distribution on a banded precision matrix through a Gaussian graphical model, where an edge is present only when two vertices are within a given distance. We show that, under a very mild growth condition and a proper choice of the order of the graph, the posterior distribution based on the graphical model is consistent in the $L_\infty$-operator norm uniformly over a class of precision matrices, even if the true precision matrix may not have a banded structure.

Along the way to the proof, we also establish that the maximum likelihood estimator (MLE) based on the graphical model is consistent under the same set of conditions, which is of independent interest. The consistency and the convergence rate of the posterior distribution of the precision matrix given the data are also studied. We also conduct a simulation study to compare the finite sample performance of the Bayes estimator and the MLE based on the graphical model with that obtained by using a banding operation on the sample covariance matrix.

Some key words: Precision matrix; G-Wishart; posterior consistency; convergence rate.

1. INTRODUCTION

Estimating a covariance matrix or a precision matrix (inverse covariance matrix) is one of the most important problems in multivariate analysis. Of special interest are situations where the number of underlying variables $p$ is much larger than the sample size $n$. Such situations are common in gene expression data, fMRI data and several other modern applications. Special care needs to be taken in such high-dimensional scenarios. Conventional estimators like the sample covariance matrix or the maximum likelihood estimator behave poorly when the dimensionality is much higher than the sample size.

Different regularisation based methods have been proposed and developed in recent years for dealing with high-dimensional data. These include banding, thresholding, tapering and penalisation based methods, to name a few; see, for example, Ledoit & Wolf (2004); Huang et al. (2006); Yuan & Lin (2007); Bickel & Levina (2008a,b); Karoui (2008); Friedman et al. (2008); Rothman et al. (2008); Lam & Fan (2009); Rothman et al. (2009); Cai et al. (2010, 2011). Most of these regularisation based methods for high dimensions impose a sparse structure on the covariance or the precision matrix, as in Bickel & Levina (2008a), where a rate of convergence has been derived for the estimator obtained by banding the sample covariance matrix, or by banding the Cholesky factor of the inverse sample covariance matrix, as long as $n^{-1}\log p \to 0$.

Cai et al. (2010) obtained the minimax rate under the operator norm and constructed a tapering estimator which attains the minimax rate over a smoothness class of covariance matrices. Cai & Liu (2011) proposed an adaptive thresholding procedure. More recently, Cai & Yuan (2012) introduced a data-driven block-thresholding estimator which is shown to be optimally rate adaptive over some smoothness class of covariance matrices.

There are only a few relevant works on Bayesian inference for such problems. Ghosal (2000) studied asymptotic normality of posterior distributions for exponential families when the dimension $p$ grows to infinity, but restricting to $p \ll n$. Recently, Pati et al. (2012) considered sparse Bayesian factor models for dimensionality reduction in high dimensional problems and showed consistency in the $L_2$-operator norm (also known as the spectral norm) by using a point mass mixture prior on the factor loadings, assuming such a factor model representation of the true covariance matrix.

Graphical models (Lauritzen, 1996) serve as an excellent tool for sparse covariance or inverse covariance estimation, see Dobra et al. (2004); Meinshausen & Bühlmann (2006); Yuan & Lin (2007); Friedman et al. (2008), as they capture the conditional dependency between the variables by means of a graph. Bayesian methods for inference using graphical models have also been developed, as in Roverato (2000); Atay-Kayis & Massam (2005); Letac & Massam (2007). For a complete graph corresponding to the saturated model, clearly the Wishart distribution is the conjugate prior for the precision matrix Ω; see Diaconis & Ylvisaker (1979). For an incomplete decomposable graph, a conjugate family of priors is given by the G-Wishart prior (Roverato, 2000). The equivalent prior on the covariance matrix is termed the hyper inverse Wishart distribution in Dawid & Lauritzen (1993). Letac & Massam (2007) introduced a more general family of conjugate priors for the precision matrix, known as the $W_{P_G}$-Wishart family of distributions, which also has the conjugacy property.

The properties of this family of distributions were further explored in Rajaratnam et al. (2008), who also obtained expressions for the corresponding Bayes estimators.

In this paper, we consider Bayesian estimation of the precision matrix working with a G-Wishart prior induced by a Gaussian graphical model, which has a Markov property with respect to a decomposable graph $G$. For estimators arising from the resulting conjugacy structure, we establish their consistency and derive their posterior convergence rates. More specifically, we work with a Gaussian graphical model structure which induces banding in the corresponding precision matrix. Using this graphical model ensures the decomposability of the graph, along with the presence of a perfect set of cliques, as explained in Section 2. For such a G-Wishart prior, we can compute the explicit expression of the normalising constant of the corresponding marginal distribution of the graph. For arbitrary decomposable graphs, the computation of the normalising constant requires Markov chain Monte Carlo (MCMC) based methods; see Atay-Kayis & Massam (2005); Carvalho et al. (2007); Carvalho & Scott (2009); Lenkoski & Dobra (2011); Dobra et al. (2011).

The paper is organised as follows. In the next section, we discuss some preliminaries on graphical models. In Section 3, we formulate the estimation problem and describe the corresponding model assumptions. Section 4 deals with the main results related to posterior consistency and convergence rates. We extend the results of Section 4 to estimators using a reference prior on the covariance parameter in Section 5. In Section 6, we compare the performance of the Bayesian estimator with the graphical maximum likelihood estimator (MLE) and the banding estimators proposed by Bickel & Levina (2008b). Proofs of the main results are presented in Section 7. Some auxiliary lemmas and their proofs are included in the Appendix.

2. NOTATIONS AND PRELIMINARIES ON GRAPHICAL MODELS

We first describe the notations to be used in this paper. By $t_n = O(\delta_n)$ (respectively, $o(\delta_n)$), we mean that $t_n/\delta_n$ is bounded (respectively, $t_n/\delta_n \to 0$ as $n \to \infty$). For a random sequence $X_n$, $X_n = O_P(\delta_n)$ (respectively, $X_n = o_P(\delta_n)$) means that $P(|X_n| \le M\delta_n) \to 1$ for some constant $M$ (respectively, $P(|X_n| < \epsilon\delta_n) \to 1$ for all $\epsilon > 0$). For numerical sequences $r_n$ and $s_n$, by $r_n \ll s_n$ (or, $s_n \gg r_n$) we mean that $r_n = o(s_n)$, while by $s_n \lesssim r_n$ we mean that $s_n = O(r_n)$. By $r_n \asymp s_n$, we mean that $r_n = O(s_n)$ and $s_n = O(r_n)$, while $r_n \sim s_n$ stands for $r_n/s_n \to 1$. The indicator function is denoted by $\mathbb{1}$.

We define the following norms for a vector $x \in \mathbb{R}^p$: $\|x\|_r = (\sum_{j=1}^p |x_j|^r)^{1/r}$, $\|x\|_\infty = \max_j |x_j|$. For a matrix $A = (a_{ij})$, $a_{ij}$ stands for the $(i,j)$th entry of $A$. If $A$ is a symmetric $p \times p$ matrix, let $\mathrm{eig}_1(A), \ldots, \mathrm{eig}_p(A)$ stand for its eigenvalues. We consider the following norms on $p \times p$ matrices:
$$\|A\|_r = \Big(\sum_{i=1}^p\sum_{j=1}^p |a_{ij}|^r\Big)^{1/r}, \ 1 \le r < \infty, \qquad \|A\|_\infty = \max_{i,j}|a_{ij}|, \qquad \|A\|_{(r,s)} = \sup\{\|Ax\|_s : \|x\|_r = 1\},$$
by respectively viewing $A$ as a vector in $\mathbb{R}^{p^2}$ and as an operator from $(\mathbb{R}^p, \|\cdot\|_r)$ to $(\mathbb{R}^p, \|\cdot\|_s)$, where $1 \le r, s \le \infty$. This gives $\|A\|_{(1,1)} = \max_j \sum_i |a_{ij}|$, $\|A\|_{(\infty,\infty)} = \max_i \sum_j |a_{ij}|$, $\|A\|_{(2,2)} = [\max\{\mathrm{eig}_i(A^TA) : 1 \le i \le p\}]^{1/2}$, and, for symmetric matrices, $\|A\|_{(2,2)} = \max\{|\mathrm{eig}_i(A)| : 1 \le i \le p\}$ and $\|A\|_{(1,1)} = \|A\|_{(\infty,\infty)}$. The norm $\|\cdot\|_{(r,r)}$ will be referred to as the $L_r$-operator norm. For two matrices $A$ and $B$, we say that $A \ge B$ (respectively, $A > B$) if $A - B$ is nonnegative definite (respectively, positive definite). Thus $A > 0$ for a positive definite matrix $A$, where $0$ stands for the zero matrix in such cases. The identity matrix of order $p$ will be denoted by $I_p$.
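As a concrete illustration of these norm conventions, the following minimal sketch (in Python with NumPy; the function names are chosen here purely for illustration) computes the $L_1$-, $L_2$- and $L_\infty$-operator norms of a small symmetric matrix and checks that $\|A\|_{(1,1)} = \|A\|_{(\infty,\infty)}$ in the symmetric case.

import numpy as np

def op_norm(A, r):
    """L_r-operator norm ||A||_(r,r) for r in {1, 2, np.inf}."""
    if r == 1:
        return np.abs(A).sum(axis=0).max()   # maximum absolute column sum
    if r == np.inf:
        return np.abs(A).sum(axis=1).max()   # maximum absolute row sum
    if r == 2:
        return np.linalg.eigvalsh(A.T @ A).max() ** 0.5  # spectral norm
    raise ValueError("r must be 1, 2 or inf")

rng = np.random.default_rng(0)
B = rng.standard_normal((5, 5))
A = (B + B.T) / 2                            # symmetric test matrix

print(op_norm(A, 1), op_norm(A, np.inf))     # equal for symmetric A
print(op_norm(A, 2), np.abs(np.linalg.eigvalsh(A)).max())  # spectral norm = max |eigenvalue|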

For a set $T$, we denote the cardinality, that is, the number of elements in $T$, by $\#T$. We denote the submatrix of the matrix $A$ induced by the set $T \subset \{1, \ldots, p\}$ by $A_T$, i.e., $A_T = (a_{ij} : i, j \in T)$. By $A_T^{-1}$, we mean the inverse $(A_T)^{-1}$ of the submatrix $A_T$. For a $p \times p$ matrix $A = (a_{ij})$, let $(A_T)^0 = (a^*_{ij})$ denote the $p$-dimensional matrix such that $a^*_{ij} = a_{ij}$ for $(i,j) \in T \times T$, and $0$ otherwise. Also, we denote the banded version of $A$ by $B_k(A) = \{a_{ij}\,\mathbb{1}(|i-j| \le k)\}$, corresponding to banding parameter $k$, $k < p$.

Now we discuss some preliminaries on graphical models. An undirected graph $G$ consists of a non-empty vertex set $V = \{1, \ldots, p\}$ along with an edge set $E \subset \{(i,j) \in V \times V : i < j\}$. The vertices in $V$ are the indices of the components of a $p$-dimensional random vector $X = (X_1, \ldots, X_p)^T$. The absence of an edge $(i,j)$ corresponds to the conditional independence of $X_i$ and $X_j$ given the rest. For a Gaussian random vector $X$ with precision matrix $\Omega = (\omega_{ij})$, this is equivalent to $\omega_{ij} = 0$. Figure 1 illustrates the connection between a banded precision matrix and the corresponding graphical model. Following the notation in Letac & Massam (2007), we restrict the canonical parameter $\Omega$ to $P_G$, where $P_G$ is the cone of positive definite symmetric matrices of order $p$ having zero entries corresponding to each missing edge in $E$. Denoting the linear space of symmetric matrices of order $p$ by $M$, let $M_p^+ \subset M$ be the cone of positive definite matrices. The linear space of symmetric incomplete matrices $A = (a_{ij})$ with missing entries $a_{ij}$, $(i,j) \notin E$, will be denoted by $I_G$. The parameter space of the Gaussian graphical model can be described by the set of incomplete matrices $\Sigma = \kappa(\Omega^{-1})$, $\Omega \in P_G$; here $\kappa : M \to I_G$ is the projection of $M$ onto $I_G$; see Letac & Massam (2007).

A subgraph $G'$ of $G$ consists of a subset $V'$ of $V$ and $E' = \{(i,j) \in E : i, j \in V'\}$. A maximal saturated subgraph of $G$ is called a clique. A path in a graph is a collection of adjacent edges.
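The banding operator $B_k(\cdot)$ and the embedding $(A_T)^0$ defined above can be made concrete with a short Python/NumPy sketch; this is an added illustration with function names chosen here, not notation from the paper.

import numpy as np

def band(A, k):
    """Banded version B_k(A): keep entries with |i - j| <= k, zero out the rest."""
    p = A.shape[0]
    i, j = np.indices((p, p))
    return np.where(np.abs(i - j) <= k, A, 0.0)

def embed(A, T):
    """(A_T)^0: the p x p matrix agreeing with A on T x T and zero elsewhere."""
    out = np.zeros_like(A, dtype=float)
    idx = np.ix_(T, T)
    out[idx] = A[idx]
    return out

A = np.arange(36, dtype=float).reshape(6, 6)
print(band(A, 2))
print(embed(A, [1, 2, 3]))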

Fig. 1. [Left] Structure of a banded precision matrix with shaded non-zero entries. [Right] The graphical model corresponding to a banded precision matrix of dimension 6 and banding parameter 3.

A subset $S$ of $V$ is called a separator of two cliques $C_1$ and $C_2$ if every path from $C_1$ to $C_2$ passes through $S$. A graph is called decomposable if it is possible to find a set of cliques covering all vertices, connected by a set of separators. We shall only deal with decomposable graphs in this paper. For detailed concepts and notations for graphical models, we refer the reader to Lauritzen (1996). A set of cliques $\mathcal{C} = \{C_1, \ldots, C_r\}$ is said to be in perfect order if the following holds: for
$$H_1 = R_1 = C_1, \quad H_j = C_1 \cup \cdots \cup C_j, \quad R_j = C_j \setminus H_{j-1}, \quad S_j = H_{j-1} \cap C_j, \quad (j = 2, \ldots, r), \tag{1}$$
$\mathcal{S} = \{S_j : j = 2, \ldots, r\}$ is the set of minimal separators of $G$. For a decomposable graph, a perfect order of the cliques always exists. For a decomposable graph $G$ with a perfect order of the cliques $\{C_1, \ldots, C_r\}$ and precision matrix $\Omega$ given to lie in $P_G$, the incomplete matrix $\Sigma$ is defined in terms of the submatrices corresponding to the cliques; in particular, $\Sigma_{C_j}$ is positive definite for each $j = 1, \ldots, r$.
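For the banded graph of Fig. 1 ($p = 6$, $k = 3$), the quantities in (1) can be computed directly; the following short Python sketch (an illustration only) prints the cliques together with the resulting separators and residuals.

# Illustrative computation of H_j, R_j, S_j in equation (1) for the graph of Fig. 1.
p, k = 6, 3
cliques = [set(range(j, j + k + 1)) for j in range(1, p - k + 1)]  # C_j = {j, ..., j+k}

H = cliques[0]
print("C_1 =", sorted(cliques[0]))
for j, C in enumerate(cliques[1:], start=2):
    S = H & C                 # separator S_j = H_{j-1} intersect C_j
    R = C - H                 # residual  R_j = C_j \ H_{j-1}
    H = H | C                 # history   H_j = C_1 union ... union C_j
    print(f"C_{j} = {sorted(C)}, S_{j} = {sorted(S)}, R_{j} = {sorted(R)}")

The output reproduces the separators $S_j = \{j, \ldots, j+k-1\}$ used for the banded graph throughout the paper.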

Thus we have the parameter space for decomposable Gaussian graphical models restricted to the two cones
$$P_G = \{A = (a_{ij}) \in M_p^+ : a_{ij} = 0, \ (i,j) \notin E\}, \tag{2}$$
$$Q_G = \{B \in I_G : B_{C_j} > 0, \ (j = 1, \ldots, r)\}, \tag{3}$$
respectively for $\Omega$ and $\Sigma$.

The $W_{P_G}$-Wishart distribution $W_{P_G}(\alpha, \beta, D)$ has three sets of parameters $\alpha$, $\beta$ and $D$, where $\alpha = (\alpha_1, \ldots, \alpha_r)^T$ and $\beta = (\beta_2, \ldots, \beta_r)^T$ are suitable functions defined on the cliques and separators of the graph respectively, and $D$ is a scaling matrix. The G-Wishart distribution $W_G(\delta, D)$ is a special case of the $W_{P_G}$-Wishart family where
$$\alpha_i = -\frac{\delta + \#C_i - 1}{2}, \quad (i = 1, \ldots, r), \qquad \beta_i = -\frac{\delta + \#S_i - 1}{2}, \quad (i = 2, \ldots, r). \tag{4}$$

3. MODEL ASSUMPTION AND PRIOR SPECIFICATION

Let $X_1, \ldots, X_n$ be independent and identically distributed (i.i.d.) random $p$-vectors with mean zero and covariance matrix $\Sigma$. Write $X_i = (X_{i1}, \ldots, X_{ip})^T$, and assume that the $X_i$, $(i = 1, \ldots, n)$, are multivariate Gaussian. Consistent estimators for the covariance matrix were obtained in Bickel & Levina (2008b) by banding the sample covariance matrix, assuming a certain sparsity structure on the true covariance. Our aim is to obtain consistency of the graphical MLE and Bayes estimates of the precision matrix $\Omega = \Sigma^{-1}$ under the condition $n^{-1}\log p \to 0$, where $\Omega$ ranges over some fairly natural families. For a given positive sequence $\gamma(k) \to 0$, we consider the class of positive definite symmetric matrices $\Omega = (\omega_{ij})$ given by
$$\mathcal{U}(\epsilon_0, \gamma) = \Big\{\Omega : \max_i \sum_j\{|\omega_{ij}| : |i-j| > k\} \le \gamma(k) \text{ for all } k > 0,\; 0 < \epsilon_0 \le \mathrm{eig}_{\min}(\Omega) \le \mathrm{eig}_{\max}(\Omega) \le \epsilon_0^{-1} < \infty\Big\}. \tag{5}$$
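To illustrate the two defining conditions of $\mathcal{U}(\epsilon_0, \gamma)$ in (5), the following Python/NumPy sketch checks the eigenvalue bounds and the banded tail sums for a candidate matrix; the specific choices here ($\epsilon_0 = 0.2$, an AR(1)-type precision matrix, a near-zero $\gamma$) are illustrative only.

import numpy as np

def tail_sum(Omega, k):
    """max_i sum_j {|omega_ij| : |i - j| > k}, i.e. the (inf,inf)-norm of Omega - B_k(Omega)."""
    p = Omega.shape[0]
    i, j = np.indices((p, p))
    outside = np.where(np.abs(i - j) > k, np.abs(Omega), 0.0)
    return outside.sum(axis=1).max()

def in_U(Omega, eps0, gamma):
    """Check the two conditions defining U(eps0, gamma) in (5), for k = 1, ..., p-1."""
    eigs = np.linalg.eigvalsh(Omega)
    if eigs.min() < eps0 or eigs.max() > 1.0 / eps0:
        return False
    p = Omega.shape[0]
    return all(tail_sum(Omega, k) <= gamma(k) for k in range(1, p))

p, rho = 10, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # AR(1) covariance
Omega = np.linalg.inv(Sigma)
print(in_U(Omega, eps0=0.2, gamma=lambda k: 1e-8))

Here the AR(1)-type precision matrix is exactly banded with banding parameter 1, so it falls under the exact banding case listed below.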

We also define another class of positive definite symmetric matrices as
$$\mathcal{V}(K, \gamma) = \Big\{\Omega : \max_i \sum_j\{|\omega_{ij}| : |i-j| > k\} \le \gamma(k) \text{ for all } k > 0,\; \max\big(\|\Omega^{-1}\|_{(\infty,\infty)}, \|\Omega\|_{(\infty,\infty)}\big) \le K\Big\}. \tag{6}$$
These two classes are closely related, as shown by the following lemma.

LEMMA 1. For every $\epsilon_0$, there exist $K_1 \le K_2$ such that
$$\mathcal{V}(K_1, \gamma) \subset \mathcal{U}(\epsilon_0, \gamma) \subset \mathcal{V}(K_2, \gamma). \tag{7}$$

The sequence $\gamma(k)$, which bounds $\|\Omega - B_k(\Omega)\|_{(\infty,\infty)}$, has been kept flexible so as to include a number of matrix classes.
1. Exact banding: $\gamma(k) = 0$ for all $k \ge k_0$, which means that the true precision matrix is banded, with banding parameter $k_0$. For instance, any autoregressive process has such a form of precision matrix.
2. Exponential decay: $\gamma(k) = e^{-ck}$. For instance, any moving average process has such a form of precision matrix.
3. Polynomial decay: $\gamma(k) = \gamma k^{-\alpha}$, $\alpha > 0$. This class of matrices has been considered in Bickel & Levina (2008b).

We shall work with these two general classes $\mathcal{U}(\epsilon_0, \gamma)$ and $\mathcal{V}(K, \gamma)$ for estimating $\Omega$. A banding structure in the precision matrix can be induced by a Gaussian graphical model. Since $\omega_{ij} = 0$ implies that the components $X_i$ and $X_j$ of $X$ are conditionally independent given the others, we can define a Gaussian graphical model $G = (V, E)$, where $V = \{1, \ldots, p\}$ indexes the $p$ components $X_1, \ldots, X_p$, and $E$ is the corresponding edge set defined by $E = \{(i,j) : |i-j| \le k\}$, where $k$ is the size of the band. This describes a parameter space for precision matrices consisting of $k$-banded matrices, and can be used for the maximum likelihood or the Bayesian approach, where for the latter a prior distribution on these matrices must be specified.

Clearly, $G$ is an undirected, decomposable graphical model for which a perfect order of cliques exists, given by $\mathcal{C} = \{C_1, \ldots, C_{p-k}\}$, $C_j = \{j, \ldots, j+k\}$, $(j = 1, \ldots, p-k)$. The corresponding separators are given by $\mathcal{S} = \{S_2, \ldots, S_{p-k}\}$, $S_j = \{j, \ldots, j+k-1\}$, $(j = 2, \ldots, p-k)$. The choice of the perfect set of cliques is not unique, but the estimator for the precision matrix $\Omega$ remains the same under all choices of the order.

The $W_{P_G}$-family, as a prior distribution for $\Sigma$, is conjugate: if the prior distribution on $\Omega/2$ is $W_{P_G}(\alpha, \beta, D)$, then the posterior distribution of $\Omega/2$ given the sample covariance matrix $S = n^{-1}\sum_{i=1}^n X_iX_i^T$ is $W_{P_G}\{\alpha - (n/2)(1, \ldots, 1)^T, \beta - (n/2)(1, \ldots, 1)^T, D + \kappa(nS)\}$. As mentioned earlier, the G-Wishart $W_G(\delta, D)$ is a special case of $W_{P_G}(\alpha, \beta, D)$ for a suitable choice of the functions $\alpha$ and $\beta$. In our case, $\#C_j = k+1$, $(j = 1, \ldots, p-k)$, and $\#S_j = k$, $(j = 2, \ldots, p-k)$. Thus
$$\alpha_j = -\frac{\delta + k}{2}, \quad (j = 1, \ldots, p-k), \qquad \beta_j = -\frac{\delta + k - 1}{2}, \quad (j = 2, \ldots, p-k). \tag{8}$$
The posterior mean of $\Omega$, given $S$, is
$$E(\Omega \mid S) = -2\sum_{j=1}^{p-k}\Big(\alpha_j - \frac{n}{2}\Big)\Big[\big\{(D + \kappa(nS))_{C_j}\big\}^{-1}\Big]^0 + 2\sum_{j=2}^{p-k}\Big(\beta_j - \frac{n}{2}\Big)\Big[\big\{(D + \kappa(nS))_{S_j}\big\}^{-1}\Big]^0. \tag{9}$$

Taking $D = I_p$, the identity matrix of order $p$, and plugging in the values of $\alpha$ and $\beta$, the above estimator reduces to the Bayes estimator $\hat\Omega_B$ with respect to the G-Wishart prior $W_G(\delta, I_p)$:
$$\hat\Omega_B = \frac{\delta + k + n}{n}\Big[\sum_{j=1}^{p-k}\big\{(n^{-1}I_{k+1} + S_{C_j})^{-1}\big\}^0 - \sum_{j=2}^{p-k}\big\{(n^{-1}I_{k} + S_{S_j})^{-1}\big\}^0\Big] + \frac{1}{n}\sum_{j=2}^{p-k}\big\{(n^{-1}I_{k} + S_{S_j})^{-1}\big\}^0. \tag{10}$$
The graphical MLE for $\Omega$ under the graphical model with banding parameter $k$ is given by
$$\hat\Omega_M = \sum_{j=1}^{p-k}\big(S^{-1}_{C_j}\big)^0 - \sum_{j=2}^{p-k}\big(S^{-1}_{S_j}\big)^0. \tag{11}$$
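A minimal Python/NumPy sketch of the two estimators (10) and (11) for the banded graph is given below; it is an illustration under naming conventions and an arbitrary choice $\delta = 3$ introduced here, not the implementation used for the numerical results in Section 6.

import numpy as np

def embed(M, idx, p):
    """(M)^0: place the |idx| x |idx| matrix M into a p x p zero matrix at rows/cols idx."""
    out = np.zeros((p, p))
    out[np.ix_(idx, idx)] = M
    return out

def clique_sep_indices(p, k):
    cliques = [list(range(j, j + k + 1)) for j in range(p - k)]   # C_j = {j, ..., j+k}
    seps = [list(range(j, j + k)) for j in range(1, p - k)]       # S_j = {j, ..., j+k-1}
    return cliques, seps

def graphical_mle(S, k):
    """Equation (11): embedded inverted clique blocks minus embedded inverted separator blocks."""
    p = S.shape[0]
    cliques, seps = clique_sep_indices(p, k)
    est = sum(embed(np.linalg.inv(S[np.ix_(c, c)]), c, p) for c in cliques)
    est -= sum(embed(np.linalg.inv(S[np.ix_(s, s)]), s, p) for s in seps)
    return est

def bayes_estimator(S, k, n, delta=3):
    """Equation (10): posterior mean under the G-Wishart prior W_G(delta, I_p)."""
    p = S.shape[0]
    cliques, seps = clique_sep_indices(p, k)
    C_term = sum(embed(np.linalg.inv(np.eye(k + 1) / n + S[np.ix_(c, c)]), c, p) for c in cliques)
    S_term = sum(embed(np.linalg.inv(np.eye(k) / n + S[np.ix_(s, s)]), s, p) for s in seps)
    return (delta + k + n) / n * (C_term - S_term) + S_term / n

# Toy check: with n much larger than p, both estimators are close to the true banded precision.
rng = np.random.default_rng(1)
p, k, n = 20, 1, 2000
Sigma0 = 0.5 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))   # AR(1) covariance
X = rng.multivariate_normal(np.zeros(p), Sigma0, size=n)
S = X.T @ X / n
Omega0 = np.linalg.inv(Sigma0)
print(np.abs(graphical_mle(S, k) - Omega0).sum(axis=1).max())
print(np.abs(bayes_estimator(S, k, n) - Omega0).sum(axis=1).max())

The printed quantities are the $L_\infty$-operator norm errors of the two estimates in this toy setting.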

4. MAIN RESULTS

In this section, we determine the convergence rate of the Bayes estimator of the precision matrix. An important step towards this goal is to find the convergence rate of the graphical MLE, which is also of independent interest. In high-dimensional situations, even when the sample covariance matrix is singular, the graphical MLE will be positive definite if the number of elements in the cliques of the corresponding graphical model is less than the sample size. Analogous results for banded empirical covariance (or precision) matrices, or for estimators based on thresholding approaches, are typically given in terms of the $L_2$-operator norm in the literature. We, however, use the stronger $L_\infty$-operator norm (or, equivalently, the $L_1$-operator norm), so the implication of a convergence rate in our theorems is stronger.

THEOREM 1. Let $X_1, \ldots, X_n$ be random samples from a $p$-dimensional Gaussian distribution with mean zero and precision matrix $\Omega_0 \in \mathcal{U}(\epsilon_0, \gamma)$ for some $\epsilon_0 > 0$ and $\gamma(\cdot)$. Then the graphical MLE $\hat\Omega_M$ of $\Omega$, corresponding to the Gaussian graphical model with banding parameter $k$, has convergence rate given by
$$\|\hat\Omega_M - \Omega_0\|_{(\infty,\infty)} = O_P\big[\max\big\{k^{2}(n^{-1}\log p)^{1/2}, \gamma(k)\big\}\big]. \tag{12}$$
In particular, $\hat\Omega_M$ is consistent in the $L_\infty$-operator norm if $k \to \infty$ such that $k^4 n^{-1}\log p \to 0$.

The proof will use the explicit form of the graphical MLE and proceed by bounding the mean squared error in the $L_\infty$-operator norm. However, as the graphical MLE involves $(k+1)(p - k/2) = O(p)$ terms, a naive approach will lead to a factor $p$ in the estimate, which will not be able to establish consistency or a convergence rate in the truly high dimensional situation $p \gg n$. We overcome this obstacle by looking more carefully at the structure of the graphical MLE, and note that for any row $i$, the number of terms in (11) which have a non-zero $i$th row is at most $(2k+1) \ll p$. This, along with the description of the $L_\infty$-operator norm in terms of row sums, gives rise to a much smaller factor than $p$.

Now we treat Bayes estimators. Consider the G-Wishart prior $W_G(\delta, I_p)$ for $\Omega$, where the graph $G$ has banding of order $k$ and $\delta$ is a positive integer. The following result bounds the difference between $\hat\Omega_M$ and $\hat\Omega_B$.

LEMMA 2. Assume the conditions of Theorem 1 and suppose that $\Omega$ is given the G-Wishart prior $W_G(\delta, I_p)$, where the graph $G$ has banding of order $k$. Then $\|\hat\Omega_B - \hat\Omega_M\|_{(\infty,\infty)} = O_P(k^2/n)$.

The proof of the above lemma is given in Section 7. Theorem 1 and Lemma 2 together lead to the following result for the convergence rate of the Bayes estimator under the G-Wishart prior in the $L_\infty$-operator norm.

THEOREM 2. In the setting of Lemma 2, the Bayes estimator satisfies
$$\|\hat\Omega_B - \Omega_0\|_{(\infty,\infty)} = O_P\big[\max\big\{k^{2}(n^{-1}\log p)^{1/2}, \gamma(k)\big\}\big]. \tag{13}$$
In particular, the Bayes estimator $\hat\Omega_B$ is consistent in the $L_\infty$-operator norm if $k \to \infty$ such that $k^4 n^{-1}\log p \to 0$.

We now study the consistency and the convergence rate of the posterior distribution of the precision matrix given the data. The following theorem describes the behaviour of the entire posterior distribution.

THEOREM 3. In the setting of Lemma 2, the posterior distribution of the precision matrix $\Omega$ satisfies
$$E_0\big\{\mathrm{pr}\big(\|\Omega - \Omega_0\|_{(\infty,\infty)} > M\epsilon_{n,k} \mid X\big)\big\} \to 0 \tag{14}$$
for $\epsilon_{n,k} = \max\{k^{2}(n^{-1}\log p)^{1/2}, \gamma(k)\}$ and a sufficiently large constant $M > 0$. In particular, the posterior distribution is consistent in the $L_\infty$-operator norm if $k \to \infty$ such that $k^4 n^{-1}\log p \to 0$.

Remarks on the convergence rates. Observe that the convergence rates of the graphical MLE, the Bayes estimator and the posterior distribution obtained above are the same. The obtained rates can be optimised by choosing $k$ appropriately, as in a bias-variance trade-off. The fastest possible rates obtained from the theorems may be summarised for the different decay rates of $\gamma(k)$ as follows. If the true precision matrix is banded with banding parameter $k_0$, then the optimal rate of convergence $n^{-1/2}(\log p)^{1/2}$ is obtained by choosing any fixed $k \ge k_0$. When $\gamma(k)$ decays exponentially, the rate of convergence $n^{-1/2}(\log p)^{1/2}(\log n)^{2}$ can be obtained by choosing $k_n$ approximately proportional to $\log n$ with a sufficiently large constant of proportionality. If $\gamma(k)$ decays polynomially with index $\alpha$, as in Bickel & Levina (2008b), we get the consistency rate $(n^{-1}\log p)^{\alpha/(2\alpha+4)}$ corresponding to $k_n \asymp (n^{-1}\log p)^{-1/(2\alpha+4)}$.

It is to be noted that we have not assumed that the true structure of the precision matrix arises from a graphical model. The graphical model is a convenient tool to generate useful estimators through the maximum likelihood and Bayesian approaches, but the graphical model itself may be a misspecified model.
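The polynomial-decay rate quoted above follows from balancing the two terms of the bound in (12); the short calculation is:
$$k^{2}(n^{-1}\log p)^{1/2} \asymp k^{-\alpha} \;\Longleftrightarrow\; k^{\alpha+2} \asymp (n^{-1}\log p)^{-1/2} \;\Longleftrightarrow\; k_n \asymp (n^{-1}\log p)^{-1/(2\alpha+4)},$$
$$\text{so that}\quad k_n^{2}(n^{-1}\log p)^{1/2} \asymp (n^{-1}\log p)^{1/2 - 2/(2\alpha+4)} = (n^{-1}\log p)^{\alpha/(2\alpha+4)}.$$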

Further, it can be seen from the proofs of the theorems that the Gaussianity assumption on the true distribution of the observations is not essential, although the graphical model assumes Gaussianity to generate the estimators. The Gaussianity assumption is used to control certain probabilities by applying the probability inequality of Lemma A.3 of Bickel & Levina (2008b). However, it was also observed by Bickel & Levina (2008b) that one only requires bounds on the moment generating function of $X_i^2$, $(i = 1, \ldots, p)$. In particular, any thinner tailed distribution, such as one with a bounded support, will allow the arguments to go through.

5. ESTIMATION USING A REFERENCE PRIOR

A reference prior for the covariance matrix $\Sigma$, obtained in Rajaratnam et al. (2008), can also be used to induce a prior on $\Omega$. This corresponds to an improper $W_{P_G}(\alpha, \beta, 0)$ distribution for $\Omega/2$ with
$$\alpha_i = 0, \quad (i = 1, \ldots, r), \qquad \beta_2 = \tfrac{1}{2}(c_1 + c_2) - s_2, \qquad \beta_j = \tfrac{1}{2}(c_j - s_j), \quad (j = 3, \ldots, r), \tag{15}$$
where $c_j = \#C_j$ and $s_j = \#S_j$. By Corollary 4.1 in Rajaratnam et al. (2008), the posterior mean $\hat\Omega_R$ of the precision matrix is given by
$$\hat\Omega_R = \sum_{j=1}^{r}\big(S^{-1}_{C_j}\big)^0 - \big\{1 - n^{-1}(c_1 + c_2 - 2s_2)\big\}\big(S^{-1}_{S_2}\big)^0 - \sum_{j=3}^{r}\big\{1 - n^{-1}(c_j - s_j)\big\}\big(S^{-1}_{S_j}\big)^0. \tag{16}$$
Using this prior, we have an improvement in the $L_\infty$-operator norm of the difference between the Bayes estimator $\hat\Omega_R$ and the graphical MLE $\hat\Omega_M$. However, this does not lead to any faster convergence rate of the Bayes estimator.

THEOREM 4. Under the reference prior mentioned above,
$$\|\hat\Omega_R - \hat\Omega_M\|_{(\infty,\infty)} = O_P\big(k^{3/2}/n\big). \tag{17}$$
A sketch of the proof is given in Section 7.

6. NUMERICAL RESULTS

We check the performance of the Bayes estimator of the precision matrix and compare it with the graphical MLE and the banded estimators proposed in Bickel & Levina (2008b). We compare the Bayes estimator of the precision matrix and the corresponding estimator of the covariance matrix with the respective estimates given by the other two methods mentioned above. Data are simulated from $N(0, \Sigma)$, assuming specific structures of the covariance $\Sigma$ or the precision $\Omega$. For all simulations, we compute the $L_2$-operator norm of the difference between the estimate and the true parameter for sample sizes $n = 50, 100, 200$ and $p = n/2, n, 2n, 5n$, representing cases like $p < n$, $p \approx n$, $p > n$ and $p \gg n$. We simulate 100 replications in each case. Some of the simulation models are the same as those in Bickel & Levina (2008b).

Example 1 (Autoregressive process: AR(1) covariance structure). Let the true covariance matrix have entries given by
$$\sigma_{ij} = \rho^{|i-j|}, \quad 1 \le i, j \le p, \tag{18}$$
with $\rho = 0.5$ in our simulation experiment. The precision matrix is banded in this case, with banding parameter 1.
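As a quick numerical check of this banding claim (a Python/NumPy illustration with a small $p$ chosen here and $\rho = 0.5$ as in the experiment), the precision matrix of the AR(1) covariance in (18) is tridiagonal:

import numpy as np

p, rho = 8, 0.5
Sigma = rho ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))  # sigma_ij = rho^|i-j|
Omega = np.linalg.inv(Sigma)
print(np.round(Omega, 6))
# Entries with |i - j| > 1 are zero up to rounding, so the banding parameter is 1.
mask = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) > 1
print(np.max(np.abs(Omega[mask])))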

Example 2 (Autoregressive process: AR(4) covariance structure). The elements of the true precision matrix are given by
$$\omega_{ij} = \mathbb{1}(|i-j| = 0) + \mathbb{1}(|i-j| = 1) + \mathbb{1}(|i-j| = 2) + \mathbb{1}(|i-j| = 3) + \mathbb{1}(|i-j| = 4). \tag{19}$$
This is the precision matrix corresponding to an AR(4) process.

Example 3 (Long range dependence). We consider a fractional Gaussian noise process, that is, the increment process of fractional Brownian motion. The elements of the true covariance matrix are given by
$$\sigma_{ij} = \tfrac{1}{2}\big[(|i-j|+1)^{2H} - 2|i-j|^{2H} + \big||i-j|-1\big|^{2H}\big], \quad 1 \le i, j \le p, \tag{20}$$
where $H \in [0.5, 1]$ is the Hurst parameter. We take $H = 0.7$ in the simulation example. The corresponding precision matrix does not fall in the polynomial smoothness class used in the theorems. We include this example in the simulation study to check how the proposed method performs when the assumptions of the theorems are not met.

Table 1 shows the simulation results for the different scenarios and compares the performance of the Bayes estimator with the MLE and the banded estimator (denoted by BL) in terms of the $L_2$-operator norm of the difference of the precision and covariance matrices from their respective true values. The maximum likelihood and Bayes estimates of the covariance matrix are obtained by inverting the estimated precision matrix, while for the banding approach the estimate of the covariance matrix is obtained by banding the sample covariance matrix and that of the precision matrix is obtained by the Cholesky-based method as in Bickel & Levina (2008b). The first column of the table, with entries $\Omega_1$, $\Omega_2$, $\Omega_3$, indicates that the data generating mechanism follows the process in Example 1, Example 2 and Example 3, respectively. The estimates for the first two examples have been computed using the value of the banding parameter of the true precision matrix. For Example 3, we used $k = 1$, the value which apparently gave the best result.
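The fractional Gaussian noise covariance in (20) can be generated directly; the following Python/NumPy sketch (an illustration added here, with $H = 0.7$ as above and a dimension chosen only for display) also reports how slowly the off-band mass of the corresponding precision matrix decays.

import numpy as np

def fgn_cov(p, H):
    """Fractional Gaussian noise covariance, equation (20)."""
    h = np.abs(np.subtract.outer(np.arange(p), np.arange(p))).astype(float)
    return 0.5 * ((h + 1) ** (2 * H) - 2 * h ** (2 * H) + np.abs(h - 1) ** (2 * H))

p = 50
Sigma = fgn_cov(p, H=0.7)
Omega = np.linalg.inv(Sigma)
# Off-band entries of the precision matrix decay slowly but are not exactly zero.
for k in (1, 5, 10):
    mask = np.abs(np.subtract.outer(np.arange(p), np.arange(p))) > k
    print(k, np.abs(np.where(mask, Omega, 0.0)).sum(axis=1).max())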

7. PROOFS

In this section we provide the proofs of the theorems and lemmas stated in Section 4. The proofs will require some additional lemmas, which we include in the Appendix.

Proof of Theorem 1. The $L_\infty$-operator norm of the difference between the graphical MLE $\hat\Omega_M$ and the true precision matrix $\Omega_0$ can be bounded as
$$\|\hat\Omega_M - \Omega_0\|_{(\infty,\infty)} \le \|\hat\Omega_M - B_k(\Omega_0)\|_{(\infty,\infty)} + \|\Omega_0 - B_k(\Omega_0)\|_{(\infty,\infty)}. \tag{21}$$
As shown in Lauritzen (1996), in a graphical model
$$\Omega = \sum_{j=1}^{p-k}\big(\Omega_{C_j}\big)^0 - \sum_{j=2}^{p-k}\big(\Omega_{S_j}\big)^0 = \sum_{j=1}^{p-k}\big(\Sigma^{-1}_{C_j}\big)^0 - \sum_{j=2}^{p-k}\big(\Sigma^{-1}_{S_j}\big)^0.$$
Hence the first term can be bounded by
$$\Big\|\sum_{j=1}^{p-k}\big\{(S^{-1}_{C_j})^0 - (\Sigma^{-1}_{C_j})^0\big\}\Big\|_{(\infty,\infty)} + \Big\|\sum_{j=2}^{p-k}\big\{(S^{-1}_{S_j})^0 - (\Sigma^{-1}_{S_j})^0\big\}\Big\|_{(\infty,\infty)}.$$

Let us first bound the first term. Using the fact that only $(2k+1)$ of the terms inside the norms in the above expression have a given row non-zero, it follows that
$$\Big\|\sum_{j=1}^{p-k}\big\{(S^{-1}_{C_j})^0 - (\Sigma^{-1}_{C_j})^0\big\}\Big\|_{(\infty,\infty)} = \max_l \sum_{l'}\Big|\sum_{j=1}^{p-k}\big\{(S^{-1}_{C_j})^0 - (\Sigma^{-1}_{C_j})^0\big\}_{(l,l')}\Big| \le (2k+1)\max_j\max_l \sum_{l'}\big|\big(S^{-1}_{C_j} - \Sigma^{-1}_{C_j}\big)_{(l,l')}\big| = (2k+1)\max_j\big\|S^{-1}_{C_j} - \Sigma^{-1}_{C_j}\big\|_{(\infty,\infty)}, \tag{22}$$
where the subscript $(l, l')$ on the matrices above stands for their respective $(l, l')$th entries. Using the multiplicative inequality $\|AB\| \le \|A\|\,\|B\|$ of operator norms, we have
$$\max_j\big\|S^{-1}_{C_j} - \Sigma^{-1}_{C_j}\big\|_{(\infty,\infty)} = \max_j\big\|\Sigma^{-1}_{C_j}(\Sigma_{C_j} - S_{C_j})S^{-1}_{C_j}\big\|_{(\infty,\infty)} \le \max_j\Big(\big\|\Sigma^{-1}_{C_j}\big\|_{(\infty,\infty)}\,\big\|\Sigma_{C_j} - S_{C_j}\big\|_{(\infty,\infty)}\,\big\|S^{-1}_{C_j}\big\|_{(\infty,\infty)}\Big). \tag{23}$$
By the assumption on the class of matrices and Lemma 1, $\|\Sigma^{-1}_{C_j}\|_{(\infty,\infty)}$ is bounded by $K_2$. From Lemma 5,
$$\mathrm{pr}\Big(\max_j\big\|S^{-1}_{C_j}\big\|_{(\infty,\infty)} \ge M_1\Big) \le p\,\max_j\,\mathrm{pr}\Big(\big\|S^{-1}_{C_j}\big\|_{(\infty,\infty)} \ge M_1\Big) \le M_1' p k^2 \exp(-m_1 n k^{-2})$$
for some constants $M_1, M_1', m_1 > 0$, while from Lemma 4,
$$\mathrm{pr}\Big(\max_j\big\|\Sigma_{C_j} - S_{C_j}\big\|_{(\infty,\infty)} \ge t\Big) \le M_2 p k^2 \exp(-m_2 n k^{-2} t^2)$$
for $|t| \le m_2'$, for some constants $M_2, m_2, m_2' > 0$.

We choose $t = Ak(n^{-1}\log p)^{1/2}$ for some sufficiently large $A$ to get the bound
$$\Big\|\sum_{j=1}^{p-k}\big\{(S^{-1}_{C_j})^0 - (\Sigma^{-1}_{C_j})^0\big\}\Big\|_{(\infty,\infty)} = O_P\big\{k^{2}(n^{-1}\log p)^{1/2}\big\}. \tag{24}$$
By a similar argument, we can establish
$$\Big\|\sum_{j=2}^{p-k}\big\{(S^{-1}_{S_j})^0 - (\Sigma^{-1}_{S_j})^0\big\}\Big\|_{(\infty,\infty)} = O_P\big\{k^{2}(n^{-1}\log p)^{1/2}\big\}. \tag{25}$$
Therefore, in view of the assumption $\|\Omega_0 - B_k(\Omega_0)\|_{(\infty,\infty)} \le \gamma(k)$, we obtain the result.

Proof of Lemma 2. The $L_\infty$-operator norm of $\hat\Omega_B - \hat\Omega_M$ can be bounded by
$$\frac{1}{n}\Big\|\sum_{j=2}^{p-k}\big\{(n^{-1}I_k + S_{S_j})^{-1}\big\}^0\Big\|_{(\infty,\infty)} \tag{26}$$
$$+\ \frac{\delta + k + n}{n}\Big\|\sum_{j=1}^{p-k}\big\{(n^{-1}I_{k+1} + S_{C_j})^{-1}\big\}^0 - \sum_{j=1}^{p-k}\big(S^{-1}_{C_j}\big)^0\Big\|_{(\infty,\infty)} \tag{27}$$
$$+\ \frac{\delta + k + n}{n}\Big\|\sum_{j=2}^{p-k}\big\{(n^{-1}I_{k} + S_{S_j})^{-1}\big\}^0 - \sum_{j=2}^{p-k}\big(S^{-1}_{S_j}\big)^0\Big\|_{(\infty,\infty)} \tag{28}$$
$$+\ \Big(\frac{\delta + k + n}{n} - 1\Big)\Big\|\sum_{j=1}^{p-k}\big(S^{-1}_{C_j}\big)^0 - \sum_{j=2}^{p-k}\big(S^{-1}_{S_j}\big)^0\Big\|_{(\infty,\infty)}. \tag{29}$$

Now, (26) above is
$$\frac{1}{n}\max_l\sum_{l'}\Big|\Big[\sum_{j=2}^{p-k}\big\{(n^{-1}I_k + S_{S_j})^{-1}\big\}^0\Big]_{(l,l')}\Big| \le \frac{2k+1}{n}\max_j\max_l\sum_{l'}\Big|\big\{(n^{-1}I_k + S_{S_j})^{-1}\big\}_{(l,l')}\Big| = \frac{2k+1}{n}\max_j\big\|(n^{-1}I_k + S_{S_j})^{-1}\big\|_{(\infty,\infty)},$$
which is bounded by a multiple of
$$\frac{k^{3/2}}{n}\max_j\big\|(n^{-1}I_k + S_{S_j})^{-1}\big\|_{(2,2)} \le \frac{k^{3/2}}{n}\max_j\big\|S^{-1}_{S_j}\big\|_{(2,2)} \le \frac{k^{3/2}}{n}\max_j\big\|S^{-1}_{S_j}\big\|_{(\infty,\infty)}. \tag{30}$$
In view of Lemma 5, we have that, for some $M_3, M_3', m_3 > 0$,
$$\mathrm{pr}\Big(\max_j\big\|S^{-1}_{S_j}\big\|_{(\infty,\infty)} \ge M_3\Big) \le M_3' p k^2 \exp(-m_3 n k^{-2}),$$
which converges to zero if $k^2(\log p)/n \to 0$. This leads to the estimate
$$\frac{1}{n}\Big\|\sum_{j=2}^{p-k}\big\{(n^{-1}I_k + S_{S_j})^{-1}\big\}^0\Big\|_{(\infty,\infty)} = O_P\big(k^{3/2}/n\big). \tag{31}$$
For (27), we observe that
$$\Big\|\sum_{j=1}^{p-k}\big\{(n^{-1}I_{k+1} + S_{C_j})^{-1}\big\}^0 - \sum_{j=1}^{p-k}\big(S^{-1}_{C_j}\big)^0\Big\|_{(\infty,\infty)} \le (2k+1)\max_j\big\|(n^{-1}I_{k+1} + S_{C_j})^{-1} - S^{-1}_{C_j}\big\|_{(\infty,\infty)} \lesssim k^{3/2}\max_j\big\|(n^{-1}I_{k+1} + S_{C_j})^{-1} - S^{-1}_{C_j}\big\|_{(2,2)},$$
and that
$$\big\|(n^{-1}I_{k+1} + S_{C_j})^{-1} - S^{-1}_{C_j}\big\|_{(2,2)} \le \big\|(n^{-1}I_{k+1} + S_{C_j})^{-1}\big\|_{(2,2)}\,\big\|n^{-1}I_{k+1}\big\|_{(2,2)}\,\big\|S^{-1}_{C_j}\big\|_{(2,2)} \le n^{-1}\big\|S^{-1}_{C_j}\big\|^2_{(2,2)} \le n^{-1}\big\|S^{-1}_{C_j}\big\|^2_{(\infty,\infty)}.$$
Now, under $k^2(\log p)/n \to 0$, an application of Lemma 5 leads to the bound $O_P(k^{3/2}/n)$ for (27). A similar argument gives rise to the same $O_P(k^{3/2}/n)$ bound for (28).

Finally, consider (29). As argued in bounding (26), we have that
$$\Big\|\sum_{j=1}^{p-k}\big(S^{-1}_{C_j}\big)^0 - \sum_{j=2}^{p-k}\big(S^{-1}_{S_j}\big)^0\Big\|_{(\infty,\infty)} \le (2k+1)\Big\{\max_j\big\|S^{-1}_{C_j}\big\|_{(\infty,\infty)} + \max_j\big\|S^{-1}_{S_j}\big\|_{(\infty,\infty)}\Big\} = O_P(k),$$
under the assumption $k^2(\log p)/n \to 0$, by another application of Lemma 5. Since $n^{-1}(\delta + k + n) - 1 = O(k/n)$, it follows that (29) is $O_P(k^2/n)$, which is the weakest estimate among all the terms in the bound for $\|\hat\Omega_B - \hat\Omega_M\|_{(\infty,\infty)}$. The result thus follows.

Proof of Theorem 2. The proof directly follows from Theorem 1 and Lemma 2 using the triangle inequality.

Proof of Theorem 3. The posterior distribution of the precision matrix $\Omega$ given the data $X$ is a G-Wishart distribution $W_G(\delta + n, I_p + nS)$. We can write $\Omega$ as
$$\Omega = \sum_{j=1}^{p-k}\big(\Omega_{C_j}\big)^0 - \sum_{j=2}^{p-k}\big(\Omega_{S_j}\big)^0 = \sum_{j=1}^{p-k}\big\{(\Sigma^{-1})_{C_j}\big\}^0 - \sum_{j=2}^{p-k}\big\{(\Sigma^{-1})_{S_j}\big\}^0 = \sum_{j=1}^{p-k}\big(W_{C_j}\big)^0 - \sum_{j=2}^{p-k}\big(W_{S_j}\big)^0, \tag{32}$$
where $W_{C_j} = (\Sigma_{C_j})^{-1}$, $W_{S_j} = (\Sigma_{S_j})^{-1}$, and the equality of the expressions follows from Lauritzen (1996). Note that the equality of the expressions for $\Omega$ and $W$ also implies that $E(\Omega \mid X) = E(W \mid X)$. The submatrix $\Sigma_{C_j}$ for any clique $C_j$ has an inverse Wishart distribution with parameters $\delta + n$ and scale matrix $(I_p + nS)_{C_j}$, $(j = 1, \ldots, p-k)$. Thus, $W_{C_j} = (\Sigma_{C_j})^{-1}$ has a Wishart distribution induced by the corresponding inverse Wishart distribution. In particular, if $i \in C_j$, then $\tau_{in}^{-1}w_{ii}$ has a chi-square distribution with $(\delta + n)$ degrees of freedom, where $\tau_{in}$ is the $(i,i)$th entry of $\{(I + nS_{C_j})^{-1}\}^0$.

Fix a clique $C = C_j$ and define $T_n = \mathrm{diag}(\tau_{in} : i \in C)$. For $i, j \in C$, let $w^*_{ij} = w_{ij}/(\tau_{in}\tau_{jn})^{1/2}$ and $W^*_C = (w^*_{ij} : i, j \in C)$. Then $W^*_C$ given $X$ has a Wishart distribution with parameters $\delta + n$ and scale matrix $T_n^{-1/2}\{(I_{k+1} + nS_C)\}^{-1}T_n^{-1/2}$.

We first note that $\max_i \tau_{in} = O_P(n^{-1})$. To see this, observe that $(I_{k+1} + nS_C)^{-1} \le n^{-1}S_C^{-1}$, so that $\max_i \tau_{in} \le n^{-1}\|S_C^{-1}\|_{(2,2)} = O_P(n^{-1})$ in view of Lemma 5. On the other hand, from Lemma 4 it follows that $\max_j\|S_{C_j}\|_{(2,2)} = O_P(1)$, so that, with probability tending to one, $S_{C_j} \le LI_{C_j}$ for some constant $L > 0$, and hence $(I + nS)^{-1}_{C_j} \ge (1 + nL)^{-1}I_{C_j}$, simultaneously for all cliques. Hence $\max_i \tau_{in}^{-1} \le (1 + nL) = O_P(n)$. Consequently, with probability tending to one, the maximum eigenvalue of $T_n^{-1/2}\{(I_{k+1} + nS_C)\}^{-1}T_n^{-1/2}$ is bounded by a constant depending only on $\epsilon_0$, simultaneously for all cliques. Hence, applying Lemma A.3 of Bickel & Levina (2008b), it follows that for all $i, j$,
$$\mathrm{pr}\big\{|w^*_{ij} - E(w^*_{ij} \mid X)| \ge t\big\} \le M_4\exp\{-m_4(\delta + n)t^2\}, \quad |t| \le m_4', \tag{33}$$
for some constants $M_4, m_4, m_4' > 0$ depending on $\epsilon_0$ only. Now, as a G-Wishart prior gives rise to a $k$-banded structure, arguing as in the bounding of (26) and using (32), we have that, for some $M_5, m_5, m_5' > 0$ and all $t \le m_5'$,
$$\mathrm{pr}\big(\|\Omega - \hat\Omega_B\|_{(\infty,\infty)} \ge t \mid X\big) \le M_5\,p\exp\{-m_5 n(2k+1)^{-2}t^2\}. \tag{34}$$
The reduction in the number of terms in the rows from $p$ to $(2k+1)$ arises from the fact that the G-Wishart posterior preserves the banded structure of the precision matrix. Choosing $t = Ak(n^{-1}\log p)^{1/2}$, with $A$ sufficiently large, we get
$$E_0\big[\mathrm{pr}\big\{\|\Omega - \hat\Omega_B\|_{(\infty,\infty)} \ge Ak(n^{-1}\log p)^{1/2} \mid X\big\}\big] \to 0. \tag{35}$$

Therefore, using Theorem 2,
$$E_0\big\{\mathrm{pr}\big(\|\Omega - \Omega_0\|_{(\infty,\infty)} > 2\epsilon_n \mid X\big)\big\} \le \mathrm{pr}_0\big(\|\hat\Omega_B - \Omega_0\|_{(\infty,\infty)} > \epsilon_n\big) + E_0\big\{\mathrm{pr}\big(\|\Omega - \hat\Omega_B\|_{(\infty,\infty)} > \epsilon_n \mid X\big)\big\},$$
which converges to zero if $\epsilon_n = \max\{Ak(n^{-1}\log p)^{1/2}, \gamma(k)\}$.

Proof of Theorem 4. In our scenario, the Bayes estimator under the reference prior is given by the expression
$$\hat\Omega_R = E(\Omega \mid S) = \sum_{j=1}^{p-k}\big(S^{-1}_{C_j}\big)^0 - (1 - n^{-1})\big(S^{-1}_{S_2}\big)^0 - (1 - n^{-1})\sum_{j=3}^{p-k}\big(S^{-1}_{S_j}\big)^0.$$
Therefore
$$\|\hat\Omega_R - \hat\Omega_M\|_{(\infty,\infty)} = \Big\|n^{-1}\big(S^{-1}_{S_2}\big)^0 + n^{-1}\sum_{j=3}^{p-k}\big(S^{-1}_{S_j}\big)^0\Big\|_{(\infty,\infty)} = n^{-1}\Big\|\sum_{j=2}^{p-k}\big(S^{-1}_{S_j}\big)^0\Big\|_{(\infty,\infty)}.$$
The rest of the proof proceeds as in Lemma 2.

A. PROOFS OF AUXILIARY RESULTS

In this section we give proofs of some lemmas we have used in the paper, which are of some general interest. The first lemma deals with various equivalence relations between matrix norms and is easily found in standard textbooks.

LEMMA 3. For a symmetric matrix $A$ of order $k$, we have the following:
1. $\|A\|_{(2,2)} \le \|A\|_{(\infty,\infty)} \le \sqrt{k}\,\|A\|_{(2,2)}$;
2. $\|A\|_\infty \le \|A\|_{(2,2)} \le \|A\|_{(\infty,\infty)} \le k\|A\|_\infty$.
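A quick numerical spot-check of the inequalities in Lemma 3 (an illustrative Python/NumPy sketch with an arbitrary random symmetric matrix chosen here):

import numpy as np

rng = np.random.default_rng(2)
k = 7
B = rng.standard_normal((k, k))
A = (B + B.T) / 2

spec = np.abs(np.linalg.eigvalsh(A)).max()      # ||A||_(2,2)
linf_op = np.abs(A).sum(axis=1).max()           # ||A||_(inf,inf)
entry_max = np.abs(A).max()                     # ||A||_inf

assert spec <= linf_op <= np.sqrt(k) * spec + 1e-12              # item 1
assert entry_max <= spec and linf_op <= k * entry_max + 1e-12    # item 2
print("Lemma 3 inequalities hold for this draw.")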

Now we prove the lemma concerning the equivalence of the classes of matrices considered for the precision matrix Ω.

Proof of Lemma 1. We rewrite the class of matrices defined in (5) as
$$\mathcal{U}(\epsilon_0, \gamma) = \Big\{\Omega : \max_i \sum_j\{|\omega_{ij}| : |i-j| > k\} \le \gamma(k) \text{ for all } k > 0,\; \max\big(\|\Omega^{-1}\|_{(2,2)}, \|\Omega\|_{(2,2)}\big) \le \epsilon_0^{-1}\Big\}. \tag{A1}$$
Now, $\max(\|\Omega^{-1}\|_{(\infty,\infty)}, \|\Omega\|_{(\infty,\infty)}) \le K_1$ implies $\max(\|\Omega^{-1}\|_{(2,2)}, \|\Omega\|_{(2,2)}) \le \epsilon_0^{-1}$ for $K_1 = \epsilon_0^{-1}$, using Lemma 3. Thus $\mathcal{V}(K_1, \gamma) \subset \mathcal{U}(\epsilon_0, \gamma)$.

To see the other way, note that, for any fixed $k_0$,
$$\|\Omega\|_{(\infty,\infty)} \le \|\Omega - B_{k_0}(\Omega)\|_{(\infty,\infty)} + \|B_{k_0}(\Omega)\|_{(\infty,\infty)} \le \gamma(k_0) + (2k_0+1)\|\Omega\|_\infty \le \gamma(k_0) + (2k_0+1)\|\Omega\|_{(2,2)} \le \gamma(k_0) + (2k_0+1)\epsilon_0^{-1}. \tag{A2}$$
Choosing $K_2 = \gamma(k_0) + (2k_0+1)\epsilon_0^{-1}$ gives $\mathcal{U}(\epsilon_0, \gamma) \subset \mathcal{V}(K_2, \gamma)$.

LEMMA 4. Let $Z_i$, $(i = 1, \ldots, n)$, be i.i.d. $k$-dimensional random vectors distributed as $N(0, D)$ with $\max(\|D^{-1}\|_{(\infty,\infty)}, \|D\|_{(\infty,\infty)}) \le K$. Then for the sample variance $S_n = n^{-1}\sum_{i=1}^n Z_iZ_i^T$, we have
$$\mathrm{pr}\big(\|S_n - D\|_{(\infty,\infty)} \ge t\big) \le Mk^2\exp(-mnk^{-2}t^2), \quad |t| \le m', \tag{A3}$$
where $M, m, m' > 0$ depend on $K$ only. In particular, if $k^2(\log k)/n \to 0$, then $\|S_n\|_{(\infty,\infty)} = O_P(1)$.

Proof. The proof directly follows from Lemma A.3 of Bickel & Levina (2008b) on noting from Lemma 3 that $\|S_n - D\|_{(\infty,\infty)} \le k\|S_n - D\|_\infty$.
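As a quick Monte Carlo illustration of Lemma 4 (a sketch assuming NumPy, with dimension, sample size and scale matrix chosen here only for illustration), the $(\infty,\infty)$-norm deviation of the sample covariance from $D$ stays small when $k^2\log k/n$ is small:

import numpy as np

rng = np.random.default_rng(3)
k, n, reps = 10, 2000, 200
D = 0.5 ** np.abs(np.subtract.outer(np.arange(k), np.arange(k)))  # a well-conditioned choice of D

devs = []
for _ in range(reps):
    Z = rng.multivariate_normal(np.zeros(k), D, size=n)
    Sn = Z.T @ Z / n
    devs.append(np.abs(Sn - D).sum(axis=1).max())   # ||S_n - D||_(inf,inf)
print(np.mean(devs), np.quantile(devs, 0.95))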

LEMMA 5. Let $Z_i$, $(i = 1, \ldots, n)$, be i.i.d. $k$-dimensional random vectors distributed as $N(0, D)$ with $\max(\|D^{-1}\|_{(\infty,\infty)}, \|D\|_{(\infty,\infty)}) \le K$. Then for the sample variance $S_n = n^{-1}\sum_{i=1}^n Z_iZ_i^T$, we have
$$\mathrm{pr}\big(\|S_n^{-1}\|_{(\infty,\infty)} \ge M\big) \le M'k^2\exp(-mnk^{-2}C^2), \tag{A4}$$
where $M > K$, $C = K^{-1} - M^{-1}$, and $M', m > 0$ depend on $M$ and $K$ only.

Proof. Note that
$$\|S_n^{-1}\|_{(\infty,\infty)} \le \|D^{-1}\|_{(\infty,\infty)} + \|S_n^{-1} - D^{-1}\|_{(\infty,\infty)} = \|D^{-1}\|_{(\infty,\infty)} + \|D^{-1}(S_n - D)S_n^{-1}\|_{(\infty,\infty)} \le K\big(1 + \|S_n - D\|_{(\infty,\infty)}\|S_n^{-1}\|_{(\infty,\infty)}\big). \tag{A5}$$
This implies that
$$\|S_n^{-1}\|_{(\infty,\infty)} \le \frac{K}{1 - K\|S_n - D\|_{(\infty,\infty)}}.$$
Thus, using Lemma 4, we obtain
$$\mathrm{pr}\big(\|S_n^{-1}\|_{(\infty,\infty)} \ge M\big) \le \mathrm{pr}\Big(\frac{K}{1 - K\|S_n - D\|_{(\infty,\infty)}} \ge M\Big) \le \mathrm{pr}\big(\|S_n - D\|_{(\infty,\infty)} \ge K^{-1} - M^{-1}\big) \le M'k^2\exp(-mnk^{-2}). \tag{A6}$$

REFERENCES

ATAY-KAYIS, A. & MASSAM, H. (2005). A Monte-Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika 92.
BICKEL, P. & LEVINA, E. (2008a). Covariance regularization by thresholding. Ann. Statist. 36.
BICKEL, P. & LEVINA, E. (2008b). Regularized estimation of large covariance matrices. Ann. Statist. 36.
CAI, T. & LIU, W. (2011). Adaptive thresholding for sparse covariance matrix estimation. J. Amer. Statist. Assoc. 106.
CAI, T., LIU, W. & LUO, X. (2011). A constrained l1-minimization approach to sparse precision matrix estimation. J. Amer. Statist. Assoc. 106.
CAI, T. & YUAN, M. (2012). Adaptive covariance matrix estimation through block thresholding. Ann. Statist. 40.
CAI, T., ZHANG, C. & ZHOU, H. (2010). Optimal rates of convergence for covariance matrix estimation. Ann. Statist. 38.
CARVALHO, C., MASSAM, H. & WEST, M. (2007). Simulation of hyper-inverse Wishart distributions in graphical models. Biometrika 94.
CARVALHO, C. & SCOTT, J. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika.
DAWID, A. & LAURITZEN, S. (1993). Hyper Markov laws in the statistical analysis of decomposable graphical models. Ann. Statist. 21.
DIACONIS, P. & YLVISAKER, D. (1979). Conjugate priors for exponential families. Ann. Statist. 7.
DOBRA, A., HANS, C., JONES, B., NEVINS, J., YAO, G. & WEST, M. (2004). Sparse graphical models for exploring gene expression data. J. Multivariate Anal. 90.
DOBRA, A., LENKOSKI, A. & RODRIGUEZ, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. J. Amer. Statist. Assoc. 106.
FRIEDMAN, J., HASTIE, T. & TIBSHIRANI, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9.
GHOSAL, S. (2000). Asymptotic normality of posterior distributions for exponential families when the number of parameters tends to infinity. J. Multivariate Anal. 74.
HUANG, J., LIU, N., POURAHMADI, M. & LIU, L. (2006). Covariance matrix selection and estimation via penalised normal likelihood. Biometrika 93.
KAROUI, N. (2008). Operator norm consistent estimation of large-dimensional sparse covariance matrices. Ann. Statist. 36.
LAM, C. & FAN, J. (2009). Sparsistency and rates of convergence in large covariance matrix estimation. Ann. Statist. 37.
LAURITZEN, S. (1996). Graphical Models, vol. 17. Oxford University Press, USA.
LEDOIT, O. & WOLF, M. (2004). A well-conditioned estimator for large-dimensional covariance matrices. J. Multivariate Anal. 88.
LENKOSKI, A. & DOBRA, A. (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. J. Comput. Graphical Statist. 20.
LETAC, G. & MASSAM, H. (2007). Wishart distributions for decomposable graphs. Ann. Statist. 35.
MEINSHAUSEN, N. & BÜHLMANN, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist. 34.
PATI, D., BHATTACHARYA, A., PILLAI, N. & DUNSON, D. (2012). Posterior contraction in sparse Bayesian factor models for massive covariance matrices.
RAJARATNAM, B., MASSAM, H. & CARVALHO, C. (2008). Flexible covariance estimation in graphical Gaussian models. Ann. Statist. 36.
ROTHMAN, A., BICKEL, P., LEVINA, E. & ZHU, J. (2008). Sparse permutation invariant covariance estimation. Electron. J. Statist. 2.
ROTHMAN, A., LEVINA, E. & ZHU, J. (2009). Generalized thresholding of large covariance matrices. J. Amer. Statist. Assoc. 104.
ROVERATO, A. (2000). Cholesky decomposition of a hyper inverse Wishart matrix. Biometrika 87.
YUAN, M. & LIN, Y. (2007). Model selection and estimation in the Gaussian graphical model. Biometrika 94.

Table 1. Simulation results for different structures of precision matrices. (Rows: models $\Omega_1$, $\Omega_2$, $\Omega_3$ of Examples 1-3, for the sample sizes $n$ and dimensions $p$ described in Section 6; columns: $\|\hat\Omega - \Omega\|_{(2,2)}$ and $\|\hat\Sigma - \Sigma\|_{(2,2)}$ for the MLE, Bayes and BL estimators.)



Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure Inverse Covariance Estimation with Missing Data using the Concave-Convex Procedure Jérôme Thai 1 Timothy Hunter 1 Anayo Akametalu 1 Claire Tomlin 1 Alex Bayen 1,2 1 Department of Electrical Engineering

More information

Nonparametric Bayesian Methods (Gaussian Processes)

Nonparametric Bayesian Methods (Gaussian Processes) [70240413 Statistical Machine Learning, Spring, 2015] Nonparametric Bayesian Methods (Gaussian Processes) Jun Zhu dcszj@mail.tsinghua.edu.cn http://bigml.cs.tsinghua.edu.cn/~jun State Key Lab of Intelligent

More information

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty

Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Learning Multiple Tasks with a Sparse Matrix-Normal Penalty Yi Zhang and Jeff Schneider NIPS 2010 Presented by Esther Salazar Duke University March 25, 2011 E. Salazar (Reading group) March 25, 2011 1

More information

Sparse Covariance Matrix Estimation with Eigenvalue Constraints

Sparse Covariance Matrix Estimation with Eigenvalue Constraints Sparse Covariance Matrix Estimation with Eigenvalue Constraints Han Liu and Lie Wang 2 and Tuo Zhao 3 Department of Operations Research and Financial Engineering, Princeton University 2 Department of Mathematics,

More information

Algebraic Representations of Gaussian Markov Combinations

Algebraic Representations of Gaussian Markov Combinations Submitted to the Bernoulli Algebraic Representations of Gaussian Markov Combinations M. SOFIA MASSA 1 and EVA RICCOMAGNO 2 1 Department of Statistics, University of Oxford, 1 South Parks Road, Oxford,

More information

Sparse Permutation Invariant Covariance Estimation

Sparse Permutation Invariant Covariance Estimation Sparse Permutation Invariant Covariance Estimation Adam J. Rothman University of Michigan Ann Arbor, MI 48109-1107 e-mail: ajrothma@umich.edu Peter J. Bickel University of California Berkeley, CA 94720-3860

More information

Bayesian model selection in graphs by using BDgraph package

Bayesian model selection in graphs by using BDgraph package Bayesian model selection in graphs by using BDgraph package A. Mohammadi and E. Wit March 26, 2013 MOTIVATION Flow cytometry data with 11 proteins from Sachs et al. (2005) RESULT FOR CELL SIGNALING DATA

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

VARIABLE SELECTION AND INDEPENDENT COMPONENT

VARIABLE SELECTION AND INDEPENDENT COMPONENT VARIABLE SELECTION AND INDEPENDENT COMPONENT ANALYSIS, PLUS TWO ADVERTS Richard Samworth University of Cambridge Joint work with Rajen Shah and Ming Yuan My core research interests A broad range of methodological

More information

sparse and low-rank tensor recovery Cubic-Sketching

sparse and low-rank tensor recovery Cubic-Sketching Sparse and Low-Ran Tensor Recovery via Cubic-Setching Guang Cheng Department of Statistics Purdue University www.science.purdue.edu/bigdata CCAM@Purdue Math Oct. 27, 2017 Joint wor with Botao Hao and Anru

More information

A multiple testing approach to the regularisation of large sample correlation matrices

A multiple testing approach to the regularisation of large sample correlation matrices A multiple testing approach to the regularisation of large sample correlation matrices Natalia Bailey Department of Econometrics and Business Statistics, Monash University M. Hashem Pesaran University

More information

Log Covariance Matrix Estimation

Log Covariance Matrix Estimation Log Covariance Matrix Estimation Xinwei Deng Department of Statistics University of Wisconsin-Madison Joint work with Kam-Wah Tsui (Univ. of Wisconsin-Madsion) 1 Outline Background and Motivation The Proposed

More information

A note on profile likelihood for exponential tilt mixture models

A note on profile likelihood for exponential tilt mixture models Biometrika (2009), 96, 1,pp. 229 236 C 2009 Biometrika Trust Printed in Great Britain doi: 10.1093/biomet/asn059 Advance Access publication 22 January 2009 A note on profile likelihood for exponential

More information

8.1 Concentration inequality for Gaussian random matrix (cont d)

8.1 Concentration inequality for Gaussian random matrix (cont d) MGMT 69: Topics in High-dimensional Data Analysis Falll 26 Lecture 8: Spectral clustering and Laplacian matrices Lecturer: Jiaming Xu Scribe: Hyun-Ju Oh and Taotao He, October 4, 26 Outline Concentration

More information

High dimensional Ising model selection

High dimensional Ising model selection High dimensional Ising model selection Pradeep Ravikumar UT Austin (based on work with John Lafferty, Martin Wainwright) Sparse Ising model US Senate 109th Congress Banerjee et al, 2008 Estimate a sparse

More information

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania

DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING. By T. Tony Cai and Linjun Zhang University of Pennsylvania Submitted to the Annals of Statistics DISCUSSION OF INFLUENTIAL FEATURE PCA FOR HIGH DIMENSIONAL CLUSTERING By T. Tony Cai and Linjun Zhang University of Pennsylvania We would like to congratulate the

More information

A Note on Auxiliary Particle Filters

A Note on Auxiliary Particle Filters A Note on Auxiliary Particle Filters Adam M. Johansen a,, Arnaud Doucet b a Department of Mathematics, University of Bristol, UK b Departments of Statistics & Computer Science, University of British Columbia,

More information

STAT 200C: High-dimensional Statistics

STAT 200C: High-dimensional Statistics STAT 200C: High-dimensional Statistics Arash A. Amini May 30, 2018 1 / 59 Classical case: n d. Asymptotic assumption: d is fixed and n. Basic tools: LLN and CLT. High-dimensional setting: n d, e.g. n/d

More information

Chris Bishop s PRML Ch. 8: Graphical Models

Chris Bishop s PRML Ch. 8: Graphical Models Chris Bishop s PRML Ch. 8: Graphical Models January 24, 2008 Introduction Visualize the structure of a probabilistic model Design and motivate new models Insights into the model s properties, in particular

More information

arxiv: v1 [stat.me] 16 Feb 2018

arxiv: v1 [stat.me] 16 Feb 2018 Vol., 2017, Pages 1 26 1 arxiv:1802.06048v1 [stat.me] 16 Feb 2018 High-dimensional covariance matrix estimation using a low-rank and diagonal decomposition Yilei Wu 1, Yingli Qin 1 and Mu Zhu 1 1 The University

More information

arxiv: v1 [math.st] 31 Jan 2008

arxiv: v1 [math.st] 31 Jan 2008 Electronic Journal of Statistics ISSN: 1935-7524 Sparse Permutation Invariant arxiv:0801.4837v1 [math.st] 31 Jan 2008 Covariance Estimation Adam Rothman University of Michigan Ann Arbor, MI 48109-1107.

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Nonconcave Penalized Likelihood with A Diverging Number of Parameters

Nonconcave Penalized Likelihood with A Diverging Number of Parameters Nonconcave Penalized Likelihood with A Diverging Number of Parameters Jianqing Fan and Heng Peng Presenter: Jiale Xu March 12, 2010 Jianqing Fan and Heng Peng Presenter: JialeNonconcave Xu () Penalized

More information

arxiv: v2 [math.st] 2 Jul 2017

arxiv: v2 [math.st] 2 Jul 2017 A Relaxed Approach to Estimating Large Portfolios Mehmet Caner Esra Ulasan Laurent Callot A.Özlem Önder July 4, 2017 arxiv:1611.07347v2 [math.st] 2 Jul 2017 Abstract This paper considers three aspects

More information

Gaussian Graphical Models and Graphical Lasso

Gaussian Graphical Models and Graphical Lasso ELE 538B: Sparsity, Structure and Inference Gaussian Graphical Models and Graphical Lasso Yuxin Chen Princeton University, Spring 2017 Multivariate Gaussians Consider a random vector x N (0, Σ) with pdf

More information

QUASI-BAYESIAN ESTIMATION OF LARGE GAUSSIAN GRAPHICAL MODELS. (Apr. 2018; first draft Dec. 2015) 1. Introduction

QUASI-BAYESIAN ESTIMATION OF LARGE GAUSSIAN GRAPHICAL MODELS. (Apr. 2018; first draft Dec. 2015) 1. Introduction QUASI-BAYESIAN ESTIMATION OF LARGE GAUSSIAN GRAPHICAL MODELS YVES F. ATCHADÉ Apr. 08; first draft Dec. 05 Abstract. This paper deals with the Bayesian estimation of high dimensional Gaussian graphical

More information

Minimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data

Minimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 9-2016 Minimax Rate-Optimal Estimation of High- Dimensional Covariance Matrices with Incomplete Data T. Tony Cai University

More information

high-dimensional inference robust to the lack of model sparsity

high-dimensional inference robust to the lack of model sparsity high-dimensional inference robust to the lack of model sparsity Jelena Bradic (joint with a PhD student Yinchu Zhu) www.jelenabradic.net Assistant Professor Department of Mathematics University of California,

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Large Sample Properties of Estimators in the Classical Linear Regression Model

Large Sample Properties of Estimators in the Classical Linear Regression Model Large Sample Properties of Estimators in the Classical Linear Regression Model 7 October 004 A. Statement of the classical linear regression model The classical linear regression model can be written in

More information

A General Framework for High-Dimensional Inference and Multiple Testing

A General Framework for High-Dimensional Inference and Multiple Testing A General Framework for High-Dimensional Inference and Multiple Testing Yang Ning Department of Statistical Science Joint work with Han Liu 1 Overview Goal: Control false scientific discoveries in high-dimensional

More information

Partitioned Covariance Matrices and Partial Correlations. Proposition 1 Let the (p + q) (p + q) covariance matrix C > 0 be partitioned as C = C11 C 12

Partitioned Covariance Matrices and Partial Correlations. Proposition 1 Let the (p + q) (p + q) covariance matrix C > 0 be partitioned as C = C11 C 12 Partitioned Covariance Matrices and Partial Correlations Proposition 1 Let the (p + q (p + q covariance matrix C > 0 be partitioned as ( C11 C C = 12 C 21 C 22 Then the symmetric matrix C > 0 has the following

More information

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming

High Dimensional Inverse Covariate Matrix Estimation via Linear Programming High Dimensional Inverse Covariate Matrix Estimation via Linear Programming Ming Yuan October 24, 2011 Gaussian Graphical Model X = (X 1,..., X p ) indep. N(µ, Σ) Inverse covariance matrix Σ 1 = Ω = (ω

More information

Testing Equality of Natural Parameters for Generalized Riesz Distributions

Testing Equality of Natural Parameters for Generalized Riesz Distributions Testing Equality of Natural Parameters for Generalized Riesz Distributions Jesse Crawford Department of Mathematics Tarleton State University jcrawford@tarleton.edu faculty.tarleton.edu/crawford April

More information

Likelihood Analysis of Gaussian Graphical Models

Likelihood Analysis of Gaussian Graphical Models Faculty of Science Likelihood Analysis of Gaussian Graphical Models Ste en Lauritzen Department of Mathematical Sciences Minikurs TUM 2016 Lecture 2 Slide 1/43 Overview of lectures Lecture 1 Markov Properties

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

On the inconsistency of l 1 -penalised sparse precision matrix estimation

On the inconsistency of l 1 -penalised sparse precision matrix estimation On the inconsistency of l 1 -penalised sparse precision matrix estimation Otte Heinävaara Helsinki Institute for Information Technology HIIT Department of Computer Science University of Helsinki Janne

More information

Causal Inference: Discussion

Causal Inference: Discussion Causal Inference: Discussion Mladen Kolar The University of Chicago Booth School of Business Sept 23, 2016 Types of machine learning problems Based on the information available: Supervised learning Reinforcement

More information

Areal data models. Spatial smoothers. Brook s Lemma and Gibbs distribution. CAR models Gaussian case Non-Gaussian case

Areal data models. Spatial smoothers. Brook s Lemma and Gibbs distribution. CAR models Gaussian case Non-Gaussian case Areal data models Spatial smoothers Brook s Lemma and Gibbs distribution CAR models Gaussian case Non-Gaussian case SAR models Gaussian case Non-Gaussian case CAR vs. SAR STAR models Inference for areal

More information

Random matrices: Distribution of the least singular value (via Property Testing)

Random matrices: Distribution of the least singular value (via Property Testing) Random matrices: Distribution of the least singular value (via Property Testing) Van H. Vu Department of Mathematics Rutgers vanvu@math.rutgers.edu (joint work with T. Tao, UCLA) 1 Let ξ be a real or complex-valued

More information