QUASI-BAYESIAN ESTIMATION OF LARGE GAUSSIAN GRAPHICAL MODELS (Apr. 2018; first draft Dec. 2015)
YVES F. ATCHADÉ

Abstract. This paper deals with the Bayesian estimation of high-dimensional Gaussian graphical models. We develop a quasi-Bayesian implementation of the neighborhood selection method of Meinshausen and Bühlmann (2006) for the estimation of large Gaussian graphical models. The method produces a product-form quasi-posterior distribution that can be efficiently explored by parallel computing. We derive a non-asymptotic bound on the contraction rate of the quasi-posterior distribution. The result shows that the proposed quasi-posterior distribution contracts towards the true precision matrix at a rate given by the worst contraction rate of the linear regressions that are involved in the neighborhood selection. We develop a Markov Chain Monte Carlo algorithm for approximate computations, following an approach from Atchadé (2015). We illustrate the methodology with a simulation study.

1. Introduction

We consider the problem of fitting large Gaussian graphical models with a diverging number of parameters from limited data. This amounts to estimating a sparse precision matrix $\vartheta \in \mathcal{M}_p^+$ from $p$-dimensional Gaussian observations $y^{(i)} \in \mathbb{R}^p$, $i = 1, \dots, n$, where $\mathcal{M}_p^+$ denotes the cone of symmetric positive definite matrices in $\mathbb{R}^{p\times p}$. The frequentist approach to this problem has generated an impressive literature over the last decade or so (see for instance Bühlmann and van de Geer (2011), Hastie et al. (2015), and the references therein). There is an interest, particularly in biomedical research, in statistical methodologies that allow practitioners to incorporate external information in fitting such graphical models (Mukherjee and Speed (2008); Peterson et al. (2015)). This problem naturally calls for a Bayesian formulation, and significant progress has been made in

2000 Mathematics Subject Classification: 60F5, 60G4. Key words and phrases:
Gaussian graphical models, quasi-Bayesian inference, pseudo-likelihood, posterior contraction, forward-backward approximation, Markov Chain Monte Carlo. Y. F. Atchadé: University of Michigan, 1085 South University, Ann Arbor, 48109, MI, United States. E-mail address: yvesa@umich.edu.
recent years (Dobra et al. (2011); Lenkoski and Dobra (2011); Khondker et al. (2013); Peterson et al. (2015); Banerjee and Ghosal (2015)). However, most existing Bayesian methods for fitting graphical models do not scale well with the number of nodes in the graph. The main difficulty is computational, and hinges on the ability to handle interesting prior distributions on $\mathcal{M}_p^+$ when $p$ is large. The most commonly used class of prior distributions for Gaussian graphical models is the class of G-Wishart distributions (Atay-Kayis and Massam (2005)). However, G-Wishart distributions have intractable normalizing constants, and become impractical for inferring large graphical models, due to the cost of approximating those normalizing constants (Dobra et al. (2011); Lenkoski and Dobra (2011)). Following the development of the Bayesian lasso of Park and Casella (2008) and other Bayesian shrinkage priors for linear regressions (Carvalho et al. (2010)), several authors have proposed prior distributions on $\mathcal{M}_p^+$ obtained by putting conditionally independent shrinkage priors on the entries of the matrix, subject to a positive definiteness constraint (Khondker et al. (2013)). However, this approach does not give a direct estimation of the graph structure, which in many applications is the key quantity of interest. Furthermore, dealing with the positive definiteness constraint in the posterior distribution requires careful MCMC design, and becomes a limiting factor for large $p$.

The above discussion suggests that when dealing with large graphical models, some form of approximate inference is inescapable. Building on Atchadé (2017), we propose a quasi-Bayesian approach for fitting large Gaussian graphical models. Our general approach to the problem consists in working with a larger pseudo-model $\{\check f_\theta,\ \theta \in \check\Theta\}$, where $\check\Theta \supseteq \mathcal{M}_p^+$. By pseudo-model we mean that the function $z \mapsto \check f_\theta(z)$ is typically not a density on $\mathbb{R}^p$, but $\check f_\theta$ is chosen such that the function $\theta \mapsto \sum_{i=1}^n \log \check f_\theta(y^{(i)})$ is a good candidate for M-estimation of $\vartheta$.
The enlargement of the model space from $\mathcal{M}_p^+$ to $\check\Theta$ allows us to relax the positive definiteness constraint. With a prior distribution $\Pi$ on $\check\Theta$, we obtain a quasi-posterior distribution denoted $\check\Pi_{n,p}$. $\check\Pi_{n,p}$ is not a posterior distribution in the strict sense of the term, since $\check f_\theta$ is not a proper likelihood function. In the specific case of Gaussian graphical models, we take $\check\Theta$ as the space of matrices with positive diagonals, and take $z \mapsto \check f_\theta(z)$ as the pseudo-model underpinning the neighborhood selection method of Meinshausen and Bühlmann (2006). This choice gives a quasi-posterior distribution $\check\Pi_{n,p}$ that factorizes, and leads to a drastic improvement in the computing time needed for MCMC computation when a parallel computing architecture is used. We illustrate the method in Section 4 using simulated data where the number of nodes in the graph is $p \in \{100, 500, 1000\}$.

The idea of replacing the likelihood function by a pseudo-likelihood or quasi-likelihood function in a Bayesian inference is not new, and has been developed in other
contexts, such as Bayesian semi-parametric inference (Kato (2013); Li and Jiang (2014), and the references therein), and approximate Bayesian computation (Fearnhead and Prangle (2012)). See also Atchadé (2017) for more references, and for contraction properties.

We study the contraction properties of the quasi-posterior distribution $\check\Pi_{n,p}$ as $n, p \to \infty$. Under the assumption that there exists a well-conditioned true precision matrix, we show that $\check\Pi_{n,p}$ contracts at the rate $s\sqrt{\log(p)/n}$ (see Theorem 3 for a precise statement), where $s$ can be viewed as an upper bound on the largest degree in the undirected graph defined by the true precision matrix. This convergence rate corresponds to the worst convergence rate that one gets from the Bayesian analysis of the linear regressions involved in the neighborhood selection. The condition on the sample size $n$ for the results mentioned above to hold is $n \gtrsim s^2\log(p)$, which shows that the quasi-posterior distribution can concentrate around the true precision matrix even in cases where $p$ exceeds $n$.

The rest of the paper is organized as follows. Section 2 provides a general discussion of quasi-models and quasi-Bayesian inference. The section ends with the introduction of the proposed quasi-posterior distribution, based on the neighborhood selection method of Meinshausen and Bühlmann (2006). We specialize the discussion to Gaussian graphical models in Section 3. The theoretical analysis focuses on the Gaussian case, and is presented in Section 3, but the proofs are postponed to Section 5. The simulation study is presented in Section 4. A MATLAB implementation of the method is available from the author's website.

2. Quasi-Bayesian inference of graphical models

For integers $p \ge 1$, and $i \in \{1, \dots, p\}$, let $\mathcal{Y}_i$ be a nonempty subset of $\mathbb{R}$, and set $\mathcal{Y} = \mathcal{Y}_1 \times \cdots \times \mathcal{Y}_p$, that we assume is equipped with a reference sigma-finite product measure $dy$.
We first consider a class of Markov random field distributions $\{f_\omega,\ \omega \in \Omega\}$ for the joint modeling of $\mathcal{Y}$-valued random variables. Let $\mathcal{M}_p$ denote the set of all real symmetric $p \times p$ matrices, equipped with the inner product $\langle A, B\rangle_F = \sum_{i \le j} A_{ij} B_{ij}$, and the norm $\|A\|_F = \sqrt{\langle A, A\rangle_F}$. As above, $\mathcal{M}_p^+$ denotes the subset of $\mathcal{M}_p$ of positive definite matrices. For $i = 1, \dots, p$, and $1 \le j < k \le p$, let $B_i : \mathcal{Y}_i \to \mathbb{R}$ and $B_{jk} : \mathcal{Y}_j \times \mathcal{Y}_k \to \mathbb{R}$ be non-zero measurable functions that we assume known. (Footnote: the contraction rate stated in the introduction is measured in the $\mathrm{L}_{2,\infty}$ matrix norm, defined as the largest $\ell_2$ column norm.)
From these functions we define an $\mathcal{M}_p$-valued function $B : \mathcal{Y} \to \mathcal{M}_p$ by

$$B(y)_{ij} = \begin{cases} B_i(y_i) & \text{if } i = j, \\ B_{ij}(y_i, y_j) & \text{if } i < j, \\ B_{ji}(y_j, y_i) & \text{if } j < i. \end{cases}$$

These functions define the parameter space

$$\Omega = \left\{ \omega \in \mathcal{M}_p :\ Z(\omega) = \int_{\mathcal{Y}} e^{\langle \omega, B(y)\rangle_F}\, dy < \infty \right\}.$$

We assume that $\mathcal{Y}$ and $B$ are such that $\Omega$ is non-empty, and we consider the exponential family $\{f_\omega,\ \omega \in \Omega\}$ of densities $f_\omega$ on $\mathcal{Y}$ given by

$$f_\omega(y) = \exp\left( \langle \omega, B(y)\rangle_F - \log Z(\omega) \right), \quad y \in \mathcal{Y}. \tag{1}$$

The model $\{f_\omega,\ \omega \in \Omega\}$ can be useful to capture the dependence structure between a set of $p$ random variables taking values in $\mathcal{Y}$. If $(Y_1, \dots, Y_p) \sim f_\omega$, then the parameter $\omega$ encodes the conditional independence structure among the $p$ variables $Y_1, \dots, Y_p$. In particular, for $i \neq j$, $\omega_{ij} = 0$ means that $Y_i$ and $Y_j$ are conditionally independent given all the other variables. The random variables $Y_1, \dots, Y_p$ can then be represented by an undirected graph with an edge between $i$ and $j$ if and only if $\omega_{ij} \neq 0$. This type of model is very useful in practice to tease out direct and indirect connections between sets of random variables. The version posited in (1) can accommodate mixed measurements, where some of the $y_i$ take discrete values while others take continuous values.

Example 1 (Gaussian graphical models). One recovers the Gaussian graphical model by taking $\mathcal{Y}_i = \mathbb{R}$, $B_i(x) = -x^2/2$, $B_{ij}(x, y) = -xy$, $i < j$. In this case $\mathcal{Y} = \mathbb{R}^p$ equipped with the Lebesgue measure, and $\Omega = \mathcal{M}_p^+$.

Example 2 (Potts models). For an integer $M \ge 2$, one recovers the $M$-state Potts model by taking $\mathcal{Y}_i = \{1, \dots, M\}$. In this case $\mathcal{Y} = \{1, \dots, M\}^p$ equipped with the counting measure. Since $\mathcal{Y}$ is a finite set, we have $\Omega = \mathcal{M}_p$. An important special case of the Potts model is a version of the Ising model, where $M = 2$, $B_i(x) = x$, and $B_{ij}(x, y) = xy$.

Suppose that we observe data $y^{(1)}, \dots, y^{(n)}$, where $y^{(i)} = (y^{(i)}_1, \dots, y^{(i)}_p) \in \mathcal{Y}$ is viewed as a column vector. We set $x = [y^{(1)}, \dots, y^{(n)}]' \in \mathbb{R}^{n\times p}$. Given a prior distribution $\Pi$ on $\Omega$, and given the data $x$, the resulting posterior distribution for learning $\omega$ is

$$\Pi_n(A \mid x) = \frac{\int_A \prod_{i=1}^n f_\omega(y^{(i)})\, \Pi(d\omega)}{\int_\Omega \prod_{i=1}^n f_\omega(y^{(i)})\, \Pi(d\omega)}, \quad A \subseteq \Omega. \tag{2}$$
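To make Example 2 concrete, the following numpy sketch (ours, not from the paper) evaluates the Ising-type sufficient statistic $B(y)$ and the density (1), computing the normalizer $Z(\omega)$ by brute-force enumeration, which is feasible only for very small $p$. For simplicity the code sums over all entries of the matrix, which double-counts the off-diagonal terms relative to the convention above; this only rescales the off-diagonal parameters.

```python
import itertools
import numpy as np

def B_ising(y):
    """Sufficient-statistic matrix B(y) for the Ising-type model:
    B_i(x) = x on the diagonal, B_ij(x, y) = x*y off-diagonal."""
    y = np.asarray(y, dtype=float)
    B = np.outer(y, y)
    np.fill_diagonal(B, y)   # B(y)_ii = y_i, B(y)_ij = y_i * y_j
    return B

def logZ_brute(omega, p):
    """Exact normalizer Z(omega) by enumerating all 2^p spin configurations."""
    states = itertools.product([-1.0, 1.0], repeat=p)
    energies = [np.sum(omega * B_ising(s)) for s in states]
    return float(np.log(np.sum(np.exp(energies))))

def log_f(omega, y, logZ):
    """log f_omega(y) = <omega, B(y)> - log Z(omega)."""
    return np.sum(omega * B_ising(y)) - logZ

p = 3
rng = np.random.default_rng(0)
A = rng.normal(size=(p, p))
omega = (A + A.T) / 2        # a symmetric parameter in M_p
logZ = logZ_brute(omega, p)
probs = [np.exp(log_f(omega, s, logZ))
         for s in itertools.product([-1.0, 1.0], repeat=p)]
```

By construction the probabilities over the $2^p$ states sum to one, which is a quick sanity check of the normalizer.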
However, as discussed in the introduction, this posterior distribution is typically intractable. In the frequentist literature, a commonly used approach to circumvent computational difficulties with graphical models consists in replacing the likelihood function by a pseudo-likelihood function. For $\omega \in \mathcal{M}_p$, let $\omega_{\cdot j}$ denote the $j$-th column of $\omega$. Note that in the present case, if $(Y_1, \dots, Y_p) \sim f_\omega$, then for $1 \le j \le p$, the conditional distribution of $Y_j$ given $\{Y_k,\ k \neq j\}$ depends on $\omega$ only through the $j$-th column $\omega_{\cdot j}$. We write this conditional distribution as $u \mapsto f_j(u \mid \omega_{\cdot j}, y_{-j})$, where for $y \in \mathcal{Y}$, $y_{-j} = (y_1, \dots, y_{j-1}, y_{j+1}, \dots, y_p)$, with obvious modifications when $j = 1, p$. Let

$$\bar\Omega = \left\{ \omega \in \mathcal{M}_p :\ u \mapsto f_j(u \mid \omega_{\cdot j}, y_{-j}) \text{ is a well-defined density on } \mathcal{Y}_j, \text{ for all } y \in \mathcal{Y} \text{ and all } 1 \le j \le p \right\}.$$

Note that $\Omega \subseteq \bar\Omega$. The most commonly used pseudo-likelihood method consists in replacing the initial likelihood contribution $f_\omega(y^{(i)})$ by

$$\bar f_\omega(y^{(i)}) = \prod_{j=1}^p f_j\left( y^{(i)}_j \mid \omega_{\cdot j}, y^{(i)}_{-j} \right), \quad \omega \in \bar\Omega.$$

This pseudo-likelihood approach typically brings important simplifications. For instance, in the Gaussian case, the parameter space $\bar\Omega$ corresponds to the space of symmetric matrices with positive diagonal elements, which has a simpler geometry compared to $\mathcal{M}_p^+$. And in the case of discrete graphical models, the conditional models typically have tractable normalizing constants. The idea goes back at least to Besag (1974), and penalized versions of pseudo-likelihood functions have been employed by several authors to fit high-dimensional graphical models. In a Bayesian setting this approach works well for small to moderate size graphs. The issue is that the dimension of the space $\bar\Omega \subseteq \mathcal{M}_p$ grows as $O(p^2)$, and MCMC simulation for exploring probability distributions on such very large spaces is inherently a difficult problem (for example, already for $p = 200$ the dimension of $\bar\Omega$ is larger than $10^4$). A related pseudo-likelihood for this problem is suggested by the neighborhood selection method of Meinshausen and Bühlmann (2006).
The idea consists in relaxing the symmetry constraint in $\bar\Omega$. For $1 \le j \le p$, we set

$$\Omega_j = \left\{ \theta \in \mathbb{R}^p :\ u \mapsto f_j(u \mid \theta, y_{-j}) \text{ is a well-defined density on } \mathcal{Y}_j, \text{ for all } y \in \mathcal{Y} \right\}.$$

We note that if $\omega \in \bar\Omega$, then $\omega_{\cdot j} \in \Omega_j$. Hence these sets $\Omega_j$ are nonempty, and we define $\check\Omega = \Omega_1 \times \cdots \times \Omega_p$, that we identify as a subset of the space of $p \times p$ real
matrices $\mathbb{R}^{p\times p}$. In particular, if $\omega \in \check\Omega$, and consistently with our notation above, $\omega_{\cdot j}$ denotes the $j$-th column of $\omega$. We consider the pseudo-model $\{\check f_\omega,\ \omega \in \check\Omega\}$, where

$$\check f_\omega(y) = \prod_{j=1}^p f_j\left( y_j \mid \omega_{\cdot j}, y_{-j} \right), \quad \omega \in \check\Omega,\ y \in \mathcal{Y}. \tag{3}$$

One can then maximize a penalized version of $\omega \mapsto \sum_{i=1}^n \log \check f_\omega(y^{(i)})$, and this corresponds to the neighborhood selection method of Meinshausen and Bühlmann (2006); see also Sun and Zhang (2013). Notice that by definition $\check\Omega$ is a product space, whereas $\bar\Omega$ is not, due to the symmetry constraint. This implies that $\omega \mapsto \check f_\omega(y)$ factorizes along the columns of $\omega$, whereas $\omega \mapsto \bar f_\omega(y)$ typically does not. One can take advantage of this separability for fast computation if the penalty is also separable. With a prior distribution $\Pi$ on $\check\Omega$, the quasi-likelihood function $\omega \mapsto \check f_\omega$ leads to a quasi-posterior distribution given by

$$\check\Pi_{n,p}(A \mid x) = \frac{\int_A \prod_{i=1}^n \check f_\omega(y^{(i)})\, \Pi(d\omega)}{\int_{\check\Omega} \prod_{i=1}^n \check f_\omega(y^{(i)})\, \Pi(d\omega)}, \quad A \subseteq \check\Omega.$$

Let us assume that the prior distribution factorizes: $\Pi(d\omega) = \prod_{j=1}^p \Pi_j(d\omega_{\cdot j})$. Then we are led to the quasi-posterior distribution

$$\check\Pi_{n,p}(du_1, \dots, du_p \mid x) = \prod_{j=1}^p \check\Pi_{n,p,j}(du_j \mid x), \tag{4}$$

where

$$\check\Pi_{n,p,j}(du \mid x) = \frac{\prod_{i=1}^n f_j\left( y^{(i)}_j \mid u, y^{(i)}_{-j} \right) \Pi_j(du)}{\int_{\Omega_j} \prod_{i=1}^n f_j\left( y^{(i)}_j \mid u, y^{(i)}_{-j} \right) \Pi_j(du)}$$

is a probability measure on $\Omega_j$. Basically, relaxing the symmetry allows us to factorize the quasi-likelihood function, and this leads to a factorized quasi-posterior distribution, as in (4). Each component of this quasi-posterior distribution can then be explored independently. Despite its simplicity, when used in a parallel computing environment, this approach increases by one order of magnitude the size of the graphical models that can be estimated.

3. Gaussian graphical models

Here we consider the Gaussian case where $\mathcal{Y}_i = \mathbb{R}$, $B_i(x) = -x^2/2$, and $B_{ij}(x, y) = -xy$. Hence in this case $\Omega = \mathcal{M}_p^+$, $\bar\Omega$ corresponds to the set of symmetric matrices with positive diagonal elements, and $\check\Omega$ is the space of $p \times p$ real matrices (not necessarily symmetric) with positive diagonal.
Assuming that the diagonal elements are known and given, we shall identify $\check\Omega$ with the matrix space $\mathbb{R}^{(p-1)\times p}$.
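In this Gaussian setting, the quasi-likelihood defined next is, up to constants, a product of $p$ decoupled least-squares terms, one per column of the data matrix. A minimal numpy sketch of this factorization (function and variable names are ours, not from the paper's MATLAB code):

```python
import numpy as np

def col_loglik(j, theta_j, x, sigma2_j):
    """Quasi-log-likelihood contribution of column j: a Gaussian
    least-squares term regressing x_j on the remaining columns."""
    x_j = x[:, j]
    x_mj = np.delete(x, j, axis=1)   # x with the j-th column removed
    resid = x_j - x_mj @ theta_j
    return -0.5 * np.sum(resid**2) / sigma2_j

def pseudo_loglik(theta, x, sigma2):
    """log q(theta; x) = sum_j log q_j(theta_{.j}; x): the product-form
    quasi-likelihood, so each column can be handled in parallel."""
    n, p = x.shape
    return sum(col_loglik(j, theta[:, j], x, sigma2[j]) for j in range(p))

rng = np.random.default_rng(1)
n, p = 50, 5
x = rng.normal(size=(n, p))
theta = rng.normal(size=(p - 1, p))  # one (p-1)-vector per node
sigma2 = np.ones(p)
total = pseudo_loglik(theta, x, sigma2)
parts = [col_loglik(j, theta[:, j], x, sigma2[j]) for j in range(p)]
```

Because the objective is an exact sum over columns, the per-column terms can be computed on separate workers and simply added, which is the separability the text exploits.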
If $\vartheta \in \mathcal{M}_p^+$, and $(Y_1, \dots, Y_p) \sim f_\vartheta$, it is well known that for all $j \in \{1, \dots, p\}$, the conditional distribution of $Y_j$ given all the other $Y_k = y_k$, $k \neq j$, is

$$\mathbf{N}\left( -\sum_{k \neq j} \frac{\vartheta_{kj}}{\vartheta_{jj}}\, y_k,\ \frac{1}{\vartheta_{jj}} \right), \tag{5}$$

where $\mathbf{N}(\mu, \sigma^2)$ denotes the Gaussian distribution with mean $\mu$ and variance $\sigma^2$. Given data $x \in \mathbb{R}^{n\times p}$, given $\sigma_j^2 > 0$, and given these conditional distributions, the product of the quasi-model (3) across the data set gives (up to normalizing constants that we ignore) the quasi-likelihood

$$q(\theta; x) = \prod_{j=1}^p q_j(\theta_{\cdot j}; x), \quad\text{with}\quad q_j(\theta_j; x) = \exp\left( -\frac{1}{2\sigma_j^2}\, \left\| x_j - x_{-j}\theta_j \right\|_2^2 \right), \quad \theta \in \mathbb{R}^{(p-1)\times p}, \tag{6}$$

where $x_{-j} \in \mathbb{R}^{n\times(p-1)}$ is the matrix obtained from $x$ by removing the $j$-th column, and $x_j$ (resp. $\theta_{\cdot j}$) denotes the $j$-th column of $x$ (resp. $\theta$). Given (5), it is clear that $\sigma_j^2$ is a proxy for $1/\vartheta_{jj}$. For the time being, we shall assume that the variance terms $\vartheta_{jj}$ are known, and we will set $\sigma_j^2 = 1/\vartheta_{jj}$. In practice we use an empirical Bayes approach, described below, whereby $\vartheta_{jj}$ is obtained from the data. We combine (6) with a prior distribution $\Pi(d\theta) = \prod_{j=1}^p \Pi_j(d\theta_{\cdot j})$ to obtain a quasi-posterior distribution on $\mathbb{R}^{(p-1)\times p}$ given by

$$\check\Pi_{n,p}(d\theta \mid x) = \prod_{j=1}^p \check\Pi_{n,p,j}(d\theta_{\cdot j} \mid x, \sigma_j^2), \tag{7}$$

where $\check\Pi_{n,p,j}(\cdot \mid x, \sigma_j^2)$ is the probability measure on $\mathbb{R}^{p-1}$ given by

$$\check\Pi_{n,p,j}(dz \mid x, \sigma_j^2) \propto q_j(z; x)\, \Pi_j(dz).$$

Again, the main appeal of $\check\Pi_{n,p}$ is its factorized form, which implies that Monte Carlo samples from $\check\Pi_{n,p}$ can be obtained by sampling in parallel from the $p$ distributions $\check\Pi_{n,p,j}$.

3.1. Prior distribution. We address here the choice of the prior distribution $\Pi_j$. For each $j \in \{1, \dots, p\}$, we build the prior $\Pi_j$ on $\mathbb{R}^{p-1}$ as in Castillo et al. (2015). First, let $\Delta_{p-1} = \{0, 1\}^{p-1}$, and let $\{\pi_\delta,\ \delta \in \Delta_{p-1}\}$ denote a discrete probability distribution on $\Delta_{p-1}$, which we assume to be the same for all components $j$. We take $\Pi_j$ as
the distribution of the random variable $u \in \mathbb{R}^{p-1}$ obtained as follows:

$$\delta_{1:(p-1)} \overset{\text{i.i.d.}}{\sim} \mathbf{Ber}(q); \text{ given } \delta,\ u_1, \dots, u_{p-1} \text{ are conditionally independent, with } u_k \mid \delta \sim \begin{cases} \mathbf{Dirac}(0) & \text{if } \delta_k = 0, \\ \mathbf{Laplace}(\rho_j \sigma_j) & \text{if } \delta_k = 1, \end{cases} \tag{8}$$

where $q \in (0, 1)$ and $\rho_j > 0$ are hyper-parameters, $\sigma_j^2$ is as in (6), $\mathbf{Dirac}(0)$ is the Dirac measure on $\mathbb{R}$ with mass at $0$, and for $\rho > 0$, $\mathbf{Laplace}(\rho)$ denotes the Laplace distribution with density $(\rho/2)\, e^{-\rho|x|}$, $x \in \mathbb{R}$.

3.2. Posterior contraction and rate. We study here the behavior of the posterior distribution given in (7), for large $n, p$, and when the prior is as in (8). We will assume that the observed data matrix $x \in \mathbb{R}^{n\times p}$ is a realization of a random matrix $X$ whose rows are i.i.d. random vectors from a mean-zero Gaussian distribution on $\mathbb{R}^p$ with precision matrix $\vartheta$, with known diagonal elements. More precisely:

H1. For some $\vartheta \in \mathcal{M}_p^+$, $X = Z\vartheta^{-1/2}$, where $Z \in \mathbb{R}^{n\times p}$ is a random matrix with i.i.d. standard normal entries.

From the true precision matrix $\vartheta$, we now derive the true value of the parameter $\theta_\star \in \mathbb{R}^{(p-1)\times p}$ towards which $\check\Pi_{n,p}$ is expected to converge. For $j = 1, \dots, p$,

$$\theta_{\star,kj} = -\frac{\vartheta_{kj}}{\vartheta_{jj}} \ \text{ for } k = 1, \dots, j-1, \quad\text{and}\quad \theta_{\star,kj} = -\frac{\vartheta_{(k+1)j}}{\vartheta_{jj}} \ \text{ for } k = j, \dots, p-1.$$

Let $\delta_\star \in \{0, 1\}^{(p-1)\times p}$ be the sparsity structure of $\theta_\star$, defined as $\delta_{\star,kj} = \mathbb{1}\{|\theta_{\star,kj}| > 0\}$. We set

$$s_j = \sum_{k=1}^{p-1} \mathbb{1}\{|\theta_{\star,kj}| > 0\}, \quad j = 1, \dots, p, \quad\text{and}\quad s = \max_{1\le j\le p} s_j.$$

Hence $s_j$ is the degree of node $j$, and $s$ is the maximum node degree in the undirected graph defined by $\vartheta$. The asymptotic behavior of $\check\Pi_{n,p}$ depends crucially on certain restricted and $m$-sparse eigenvalues of the true precision matrix $\vartheta$, that we introduce next. We set

$$\underline\kappa = \inf\left\{ \frac{u'\vartheta u}{\|u\|_2^2} :\ u \in \mathbb{R}^p,\ u \neq 0,\ \text{s.t. } \sum_{k:\ \delta_{\star,kj}=0} |u_k| \le 7 \sum_{k:\ \delta_{\star,kj}=1} |u_k| \ \text{for some } j \right\}, \tag{9}$$

and for $1 \le s \le p$,

$$\underline\kappa(s) = \inf\left\{ \frac{u'\vartheta u}{\|u\|_2^2} :\ u \in \mathbb{R}^p,\ u \neq 0,\ \|u\|_0 \le s \right\}, \quad \bar\kappa(s) = \sup\left\{ \frac{u'\vartheta u}{\|u\|_2^2} :\ u \in \mathbb{R}^p,\ u \neq 0,\ \|u\|_0 \le s \right\}. \tag{10}$$

In the above equations, we use the conventions $\inf \emptyset = +\infty$ and $\sup \emptyset = 0$.
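For illustration, sampling one column vector from the spike-and-slab prior (8) is straightforward. In this hedged sketch the Laplace rate is a plain hyper-parameter `rho` rather than the data-driven $\rho_j\sigma_j$ used in the theory, and the function name is ours:

```python
import numpy as np

def sample_prior(p, u=0.5, rho=1.0, rng=None):
    """Draw one column vector from the spike-and-slab prior (8):
    delta_k ~ Bernoulli(q) with q = 1/p^(u+1); u_k = 0 when delta_k = 0,
    and u_k ~ Laplace(rho), density (rho/2) exp(-rho|x|), when delta_k = 1."""
    rng = rng or np.random.default_rng()
    q = p ** -(u + 1.0)
    delta = rng.random(p - 1) < q       # sparsity pattern
    # numpy's Laplace uses a scale parameter b = 1/rho
    vals = rng.laplace(loc=0.0, scale=1.0 / rho, size=p - 1)
    return np.where(delta, vals, 0.0), delta
```

With $q = p^{-(u+1)}$ the expected number of active coordinates is below one for large $p$, so draws from this prior are extremely sparse, which is the point of the construction.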
We study the contraction of $\check\Pi_{n,p}$ in the norm $\|\theta\|_{2,\infty} = \max_{1\le j\le p} \|\theta_{\cdot j}\|_2$.

Theorem 3. Assume H1 and the prior (8) with $q = \frac{1}{p^{u+1}}$ for some absolute constant $u > 0$, and

$$\rho_j = 4 \max_{1\le k\le p} \|X_{\cdot k}\|_2\, \sqrt{\vartheta_{jj}\,\log(p)}, \quad 1 \le j \le p.$$

For $1 \le j \le p$, suppose that $\sigma_j^2 = 1/\vartheta_{jj}$, and set

$$\zeta_j = \frac{4(u+2)}{u}\, \frac{\bar\kappa(s_j)}{\underline\kappa}\, s_j + \frac{4\,\bar\kappa(s_j)\log(p)}{\underline\kappa}\, s_j, \quad \bar s_j = s_j + \zeta_j, \quad\text{and}\quad \bar s = \max_{1\le j\le p} \bar s_j. \tag{11}$$

If $\min(\underline\kappa, \underline\kappa(\bar s)) > 0$, then there exist absolute constants $a_0 > 0$, $a_1 > 0$, $a_2 > 0$, $M_0$ such that for all $p \ge a_0$ and

$$n \ge a_1 \left( \bar s^2 + \frac{\bar\kappa(\bar s)}{\underline\kappa^2} \right) \log(p), \tag{12}$$

the following two statements hold:

$$\mathbb{E}\left[ \check\Pi_{n,p}\left( \left\{ \theta \in \mathbb{R}^{(p-1)\times p} :\ \|\theta_{\cdot j}\|_0 \ge \zeta_j \ \text{for some } j \right\} \,\middle|\, X \right) \right] \le 3\left( e^{-a_2 n} + \frac{1}{p} \right), \tag{13}$$

$$\mathbb{E}\left[ \check\Pi_{n,p}\left( \left\{ \theta \in \mathbb{R}^{(p-1)\times p} :\ \|\theta - \theta_\star\|_{2,\infty} > M_0\,\epsilon \right\} \,\middle|\, X \right) \right] \le 3\left( e^{-a_2 n} + \frac{1}{p} \right), \tag{14}$$

where $\epsilon > 0$ is given by

$$\epsilon = \frac{\bar\kappa(\bar s)}{\underline\kappa(\bar s)}\, \sqrt{\frac{\bar s\,\log(p)}{n}}.$$

Proof. See Section 5.2.

Remark 4. Under H1 and the assumed prior, (13) says that for $n, p$ large, if $\theta \sim \check\Pi_{n,p}(\cdot \mid X)$, then with high probability $\|\theta_{\cdot j}\|_0 < \zeta_j$ for all $j \in \{1, \dots, p\}$. Note that if $\bar\kappa(\bar s)/\underline\kappa$ and $\bar\kappa(\bar s)\log(p)/\underline\kappa$ are small (meaning $\vartheta$ is well-conditioned), then $\zeta_j$ is of the same order as $s_j$. In other words, the main conclusion of (13) is that $\check\Pi_{n,p}(\cdot \mid X)$ concentrates most of its probability mass on matrices that are sparse, with a sparsity structure that mirrors that of $\vartheta$, provided that $\vartheta$ is well-conditioned. The behavior of the posterior distribution in practice suggests that the large constant 69 appearing in the proof is most likely an artifact of the techniques used there, and can probably be improved. (14) says that the contraction rate of $\check\Pi_{n,p}$ towards $\theta_\star$ in the $\|\cdot\|_{2,\infty}$ norm is $O(\epsilon)$. This corresponds to the worst among the rates of contraction of the $p$ linear regression
problems performed during the neighborhood selection procedure. This rate is similar to the rate of convergence of the frequentist neighborhood selection method of Meinshausen and Bühlmann (2006), which is of order

$$s\sqrt{\frac{\log(p)}{n}}; \tag{15}$$

see the discussion in Section 3.4 of Ravikumar et al. (2011).

4. Numerical experiments

4.1. Fully Bayesian quasi-posterior distribution. Recall that the prior distribution of $\delta_k$ is $\delta_k \sim \mathbf{Ber}(q)$, with $q = p^{-(u+1)}$. The posterior distribution $\check\Pi_{n,p}$ is fairly robust to the choice of $u$, so throughout the simulations we set $u = 0.5$. In contrast, $\check\Pi_{n,p}$ is sensitive to $\rho_j$, so we use a fully Bayesian approach, with a prior distribution $\rho_j \sim \phi$, where $\phi$ is the uniform distribution $\mathbf{U}(a_1, a_2)$ for $a_1 = 10^{-5}$ and $a_2 = 10^5$. Given $\sigma_j^2$, we obtain a fully specified quasi-posterior distribution

$$\prod_{j=1}^p \Pi_{n,p,j}(\delta, d\theta, d\rho_j \mid x, \sigma_j^2), \tag{16}$$

where the $j$-th component $\Pi_{n,p,j}(\cdot \mid x, \sigma_j^2)$ can be written as follows. For $\delta \in \Delta_{p-1}$, let $\mu_\delta$ be the product measure on $\mathbb{R}^{p-1}$ defined as $\mu_\delta(du) = \prod_{k=1}^{p-1} \nu_{\delta_k}(du_k)$, where $\nu_0(dz)$ is the Dirac mass at $0$, and $\nu_1(dz)$ is the Lebesgue measure on $\mathbb{R}$. Then

$$\Pi_{n,p,j}(\delta, d\theta, d\rho_j \mid x, \sigma_j^2) \propto q_j(\theta; x)\, q^{\|\delta\|_0} (1-q)^{p-1-\|\delta\|_0} \left( \frac{\rho_j\sigma_j}{2} \right)^{\|\delta\|_0} e^{-\rho_j\sigma_j \|\theta\|_1}\, \phi(\rho_j)\, \mu_\delta(d\theta)\, d\rho_j. \tag{17}$$

The quasi-posterior distribution (17) depends on the choice of $\sigma_j^2$. Ideally we would like to set $\sigma_j^2 = 1/\vartheta_{jj}$. However, this quantity is unknown. In the simulations we choose $\sigma_j^2$ by empirical Bayes. More precisely, following Reid et al. (2013), we estimate $\sigma_j^2$ by

$$\hat\sigma_j^2 = \frac{1}{n - \hat s_{\lambda_n}} \left\| x_j - x_{-j} \hat\beta_{\lambda_n} \right\|_2^2, \tag{18}$$

where $\hat\beta_\lambda$ is the lasso estimate at regularization level $\lambda$ in the linear regression of $x_j$ (the $j$-th column of $x$) on $x_{-j}$ (the remaining columns). In the procedure, $\lambda_n$ is selected by 10-fold cross-validation, and $\hat s_{\lambda_n}$ is the number of non-zero components of $\hat\beta_{\lambda_n}$. We explore this approach in the simulations.

Given $j \in \{1, \dots, p\}$, sampling from the distribution $\Pi_{n,p,j}(\cdot \mid x)$ given in (17) is a difficult computational task, due to the discrete-continuous mixture prior on $\delta$. Here we follow the approach developed by the author in Atchadé (2015), which produces
11 QUASI-BAYESIAN ESTIMATION OF LARGE GAUSSIAN GRAPHICAL MODELS approximate samples from 7 by sampling from its forward-backward approximation γ denoted Π n,p,j δ, dθ, dρ j x, σj however other approximations schemes could be used as well Narisetty and He 04; Schreck et al. 03. The parameter γ 0, /4] controls the quality of the approximation. In all the simulation below, we use γ = Simulation set ups. We generate a data matrix x R n p with i.i.d. rows from N0, ϑ, ϑ M + p. Throughout we set the sample size to n = 50, and we consider three settings. a: ϑ is generated as in Setting c below, but using p = 00 nodes. b: In this case p = 500, and we take ϑ from the R-package space based on the work Peng et al These authors have designed a precision matrix ϑ that is modular with 5 modules of 00 nodes each. Inside each module, there are 3 hubs with degree around 5, and 97 other nodes with degree at most 4. The total number of edges is 587. The resulting partial correlations fall within 0.67, 0.0] [0.0, As explained in Peng et al. 009, this type of networks are useful models for biological networks. c: In this case p =, 000, and we build ϑ as follows. First we generate a symmetric sparse matrix B such that the number of off-diagonal non-zeros entries is roughly p. We magnified the signal by adding 3 to all the nonzeros entries of B subtracting 3 for negative non-zero entries. Then we set ϑ = B + ɛ λ min BI p, where λ min B is the smallest eigenvalue of B, with ɛ =. In this example, values of the partial correlations are typically in the range 0.46, 0.8] [0.8, To evaluate the effect of the hyper-parameter σj, we report two sets of results. One where σj = /ϑ jj, and another set of results where ϑ jj is assumed unknown and we select σj from the data, using the cross-validation estimator described in 8. In order to mitigate the uncertainty in some of the results reported below, we repeat all the MCMC simulations 0 times. 
Hence, to summarize: for each setting a, b, and c, we generate one precision matrix $\vartheta$. Given $\vartheta$, we generate 10 datasets, and for each dataset we run two MCMC samplers: one where the $\sigma_j^2$'s are taken as the $1/\vartheta_{jj}$'s, and one where they are estimated from the data. (The precision matrix used in Setting b corresponds to the example "Hub network" in Section 3 of Peng et al. (2009). A non-sparse version of $\vartheta$ is attached to the space package.)
4.3. Computation details. All the simulations were performed on a high-performance computer using 100 cores and Matlab 7.4. To simulate from $\Pi^\gamma_{n,p,j}(\cdot \mid x, \sigma_j^2)$ for a given $j$, we run the MCMC sampler for 50,000 iterations and discard the first 10,000 iterations as burn-in. From the MCMC output, we estimate the structure $\delta \in \{0, 1\}^{p\times p}$ as follows. We set the diagonal of $\hat\delta$ to one, and for each off-diagonal entry $(i, j)$, we set $\hat\delta_{ij} = 1$ if the sample average estimate of $\delta_{ij}$ from the $j$-th chain and the sample average estimate of $\delta_{ji}$ from the $i$-th chain are both larger than $0.5$; otherwise $\hat\delta_{ij} = 0$. Obviously, other symmetrization rules could be adopted. Given the estimate $\hat\delta$, say, of $\delta$, we estimate $\vartheta \in \mathbb{R}^{p\times p}$ as follows. We set the diagonal of $\hat\vartheta$ to $1/\sigma_j^2$. For $i \neq j$, if $\hat\delta_{ij} = 0$, we set $\hat\vartheta_{ij} = \hat\vartheta_{ji} = 0$. Otherwise we estimate $\hat\vartheta_{ij} = \hat\vartheta_{ji}$ as $0.5\,(\sigma_j^{-2}\,\tilde\vartheta_{ij} + \sigma_i^{-2}\,\tilde\vartheta_{ji})$, where $\tilde\vartheta_{ij}$ (resp. $\tilde\vartheta_{ji}$) is the Monte Carlo sample average estimate of $\vartheta_{ij}$ from the $j$-th chain (resp. $i$-th chain). For all off-diagonal components $(i, j)$ such that $\hat\delta_{ij} = 1$, Bayesian posterior intervals can also be produced by taking the union of the 95% posterior intervals from the $i$-th and $j$-th chains. When $\hat\delta_{ij} = 0$, those intervals are set to $\{0\}$.

4.4. Results. We evaluate the behavior of the quasi-posterior distribution (17) on three simulated datasets. As a benchmark, we also report the results obtained using the elastic net estimator

$$\hat\vartheta_{\text{glasso}} = \operatorname*{Argmin}_{\theta \in \mathcal{M}_p^+}\ \left\{ -\log\det(\theta) + \mathrm{Tr}(\theta S) + \lambda \sum_{i,j} \left( \alpha\,|\theta_{ij}| + \frac{1-\alpha}{2}\, \theta_{ij}^2 \right) \right\},$$

where $S = (1/n)\, x'x$, $\alpha = 0.9$, and $\lambda > 0$ is a regularization parameter. We choose $\lambda$ by minimizing

$$-\log\det(\hat\theta_\lambda) + \mathrm{Tr}(\hat\theta_\lambda S) + \frac{\log(n)}{n} \sum_{i<j} \mathbb{1}\{|\hat\theta_{\lambda,ij}| > 0\}$$

over a finite set of values of $\lambda$. Our goal is not to compare the quasi-Bayesian method to the graphical lasso, since the former utilizes vastly more computing power than the latter. Rather, we report these numbers as references that help better understand the behavior of the proposed methodology.
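The AND-rule symmetrization just described can be sketched as follows, assuming the per-chain posterior averages have been collected into $p \times p$ arrays (all names are ours):

```python
import numpy as np

def symmetrize(incl, theta_bar, sigma2, thresh=0.5):
    """AND-rule symmetrization:
    incl[i, j]      = posterior mean of delta_ij from chain j,
    theta_bar[i, j] = posterior mean of the (i, j) precision entry, chain j.
    Keep edge (i, j) only if both chains give inclusion probability > thresh,
    then average the two chains' precision estimates."""
    delta_hat = (incl > thresh) & (incl.T > thresh)   # symmetric by design
    np.fill_diagonal(delta_hat, True)                 # diagonal fixed to one
    avg = 0.5 * (theta_bar / sigma2[None, :] + theta_bar.T / sigma2[:, None])
    theta_hat = np.where(delta_hat, avg, 0.0)
    np.fill_diagonal(theta_hat, 1.0 / sigma2)
    return delta_hat, theta_hat
```

The AND rule makes the estimated graph symmetric automatically, since `(incl > t) & (incl.T > t)` is invariant under transposition, and the averaged matrix inherits symmetry from the two mirrored terms.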
We look at the performance of the method by computing the relative Frobenius-norm error, the sensitivity, and the precision of the estimated matrix as obtained above. These quantities are defined respectively as

$$\mathcal{E} = \frac{\|\hat\vartheta - \vartheta\|_F}{\|\vartheta\|_F}, \quad \mathrm{SEN} = \frac{\sum_{i<j} \mathbb{1}\{|\vartheta_{ij}| > 0\}\, \mathbb{1}\{\mathrm{sign}(\hat\vartheta_{ij}) = \mathrm{sign}(\vartheta_{ij})\}}{\sum_{i<j} \mathbb{1}\{|\vartheta_{ij}| > 0\}},$$

and

$$\mathrm{PREC} = \frac{\sum_{i<j} \mathbb{1}\{|\hat\vartheta_{ij}| > 0\}\, \mathbb{1}\{\mathrm{sign}(\hat\vartheta_{ij}) = \mathrm{sign}(\vartheta_{ij})\}}{\sum_{i<j} \mathbb{1}\{|\hat\vartheta_{ij}| > 0\}}. \tag{19}$$
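A direct numpy transcription of (19), assuming exact zeros encode absent edges:

```python
import numpy as np

def metrics(theta_hat, theta_true):
    """Relative Frobenius error, sensitivity and precision as in (19),
    computed over the strict upper triangle (i < j)."""
    E = np.linalg.norm(theta_hat - theta_true) / np.linalg.norm(theta_true)
    iu = np.triu_indices_from(theta_true, k=1)
    h, t = theta_hat[iu], theta_true[iu]
    # an edge is "correctly recovered" if both entries are nonzero
    # with matching signs
    correct = (h != 0) & (t != 0) & (np.sign(h) == np.sign(t))
    sen = correct.sum() / max((t != 0).sum(), 1)
    prec = correct.sum() / max((h != 0).sum(), 1)
    return E, sen, prec
```

The `max(..., 1)` guards keep the ratios defined when the true or estimated graph is empty; the paper's settings never hit that corner case.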
We average these statistics over the 10 simulation replications. We also compute the same quantities for the elastic net estimator $\hat\vartheta_{\text{glasso}}$. The results are reported in Tables 1-3. They suggest that the quasi-Bayesian procedure generally has good contraction properties in the Frobenius norm, and hence in the $\mathrm{L}_{2,\infty}$ norm. The results also suggest that the quasi-Bayesian procedure tends to produce high false-negative rates, but has excellent false-positive rates, even with $p = 1000$. The same conclusion seems to hold across all three network settings considered in the simulations. We also notice with satisfaction that there seems to be little difference between the results where $\vartheta_{jj}$ is assumed known and the results where $\vartheta_{jj}$ is estimated from the data.

Table 1. Relative error $\mathcal{E}$, sensitivity SEN, and precision PREC (in %), as defined in (19), for Setting a with $p = 100$ nodes, for the three estimators: $\vartheta_{jj}$ known, empirical Bayes, and glasso. Based on 10 simulation replications; each MCMC run is 50,000 iterations.

Table 2. Same quantities as Table 1, for Setting b with $p = 500$ nodes.

Table 3. Same quantities as Table 1, for Setting c with $p = 1000$ nodes.
5. Proof of Theorem 3

First we shall establish, from first principles, some contraction properties for posterior distributions in linear regression models. We will then reduce the proof of Theorem 3 to the linear regression case.

5.1. Posterior contraction of high-dimensional linear regression models. Let $X \in \mathbb{R}^{n\times p}$ be a design matrix, $\theta_\star \in \mathbb{R}^p$, and $\sigma_0^2 > 0$. Suppose that

$$Z \sim \mathbf{N}(X\theta_\star,\ \sigma_0^2 I_n). \tag{20}$$

We will write $\mathbb{P}$ and $\mathbb{E}$ for the probability measure and expectation operator under the distribution of $Z$ assumed in (20). Let $\Delta = \{0, 1\}^p$, and let $\{\omega_\delta,\ \delta \in \Delta\}$ be a probability distribution on $\Delta$. For $\sigma^2 > 0$, $\lambda > 0$, we consider the posterior distribution

$$\Pi_n(d\theta \mid Z) = C_n(Z)^{-1}\, e^{-\frac{1}{2\sigma^2}\|Z - X\theta\|_2^2} \sum_{\delta \in \Delta} \omega_\delta \left( \frac{\lambda}{2} \right)^{\|\delta\|_0} e^{-\lambda\|\theta\|_1}\, \mu_\delta(d\theta). \tag{21}$$

We make the following assumption on the distribution $\{\omega_\delta,\ \delta \in \Delta\}$.

H2. For all $\delta \in \Delta$,

$$\omega_\delta = \frac{g_{\|\delta\|_0}}{\binom{p}{\|\delta\|_0}}.$$

Furthermore, there exist universal constants $c_1, c_2, c_3, c_4$ such that for all $s = 1, \dots, p$,

$$\frac{c_1}{p^{c_3}}\, g_{s-1} \le g_s \le \frac{c_2}{p^{c_4}}\, g_{s-1}.$$

Remark 5. We note here that if $\delta_k \overset{\text{i.i.d.}}{\sim} \mathbf{Ber}(q)$, where $q = \frac{1}{p^{u+1}}$ as assumed in (8), then

$$w_\delta = \frac{g_{\|\delta\|_0}}{\binom{p}{\|\delta\|_0}}, \quad\text{where}\quad g_s = \binom{p}{s} q^s (1-q)^{p-s}.$$

Furthermore, it is easy to check that $\{g_s\}$ satisfies the double inequality in H2 with $c_1 = 0.5$, $c_2 = 2$, $c_3 = u + 2$, and $c_4 = u$. In other words, the prior distribution chosen in (8) satisfies H2.

Let $\delta_\star$ denote the sparsity structure of $\theta_\star$; that is, for all $j \in \{1, \dots, p\}$, $\delta_{\star,j} = 1$ if and only if $|\theta_{\star,j}| > 0$. Set $s_\star = \|\theta_\star\|_0$,

$$\mathcal{C}_\star = \left\{ \theta \in \mathbb{R}^p :\ \sum_{j:\ \delta_{\star,j}=0} |\theta_j| \le 7 \sum_{j:\ \delta_{\star,j}=1} |\theta_j| \right\},$$

and define

$$\underline v = \inf\left\{ \frac{u' X' X u}{n\,\|u\|_2^2} :\ u \neq 0,\ u \in \mathcal{C}_\star \right\}.$$

For an integer $s \ge 1$, we define

$$\underline v(s) = \inf\left\{ \frac{u' X' X u}{n\,\|u\|_2^2} :\ u \neq 0,\ \|u\|_0 \le s \right\}, \quad \bar v(s) = \sup\left\{ \frac{u' X' X u}{n\,\|u\|_2^2} :\ u \neq 0,\ \|u\|_0 \le s \right\}.$$
Theorem 6. Assume (20) and H2, and suppose that $\underline v > 0$. Then there exists a constant $A_0$, depending only on the constants $c_1, c_2, c_3, c_4$ of H2, such that the following statements hold.

(1) For all $p \ge A_0$ and $\zeta > 0$, with $\underline\kappa_1 = n\underline v/\sigma^2$,

$$\mathbb{E}\left[ \Pi_n\left( \{\theta \in \mathbb{R}^p :\ \|\theta\|_0 \ge s_\star + \zeta\} \mid Z \right) \right] \le 2p\, e^{-\frac{\lambda^2\sigma^4}{8\sigma_0^2 \max_{1\le j\le p}\|X_{\cdot j}\|_2^2}} + 2\cdot 4^{s_\star}\, e^{\frac{2\lambda^2 s_\star}{\underline\kappa_1}} \left( 1 + \frac{\sqrt{n\,\bar v(s_\star)}}{\lambda\sigma} \right)^{s_\star} \left( \frac{4c_2}{p^{c_4}} \right)^{\zeta}.$$

(2) For all $p \ge A_0$, $M \ge 96$, and integer $s \ge s_\star$ such that $\underline v(s) > 0$, set

$$\underline\kappa_1 = \frac{n\,\underline v(s)}{\sigma^2}, \quad \epsilon = \frac{\lambda\sqrt{s}}{\underline\kappa_1}, \quad A_\epsilon = \left\{ \theta \in \mathbb{R}^p :\ \|\theta - \theta_\star\|_0 \le s,\ \|\theta - \theta_\star\|_2 \ge M\epsilon \right\}.$$

Then

$$\mathbb{E}\left[ \Pi_n(A_\epsilon \mid Z) \right] \le 2p\, e^{-\frac{\lambda^2\sigma^4}{8\sigma_0^2\max_{1\le j\le p}\|X_{\cdot j}\|_2^2}} + \frac{(9p)^{s}\, e^{-\underline\kappa_1 M^2\epsilon^2/3}}{1 - e^{-\underline\kappa_1 M^2\epsilon^2/3}} + 2\, p^{c_3 s_\star + 1} \left( 1 + \frac{\sqrt{n\,\bar v(s_\star)}}{\lambda\sigma} \right)^{s_\star} e^{-\frac{\underline\kappa_1 M^2\epsilon^2}{64}}.$$

Proof. We start the proof with some notation. We set

$$f_{n,\theta}(z) = \left( 2\pi\sigma^2 \right)^{-n/2} e^{-\frac{1}{2\sigma^2}\|z - X\theta\|_2^2}, \quad z \in \mathbb{R}^n,\ \theta \in \mathbb{R}^p,$$

and

$$\mathcal{L}_{n,\theta_\star}(\theta; z) = \log f_{n,\theta}(z) - \log f_{n,\theta_\star}(z) - \left\langle \nabla_\theta \log f_{n,\theta_\star}(z),\ \theta - \theta_\star \right\rangle = -\frac{1}{2\sigma^2}\, (\theta - \theta_\star)' X' X (\theta - \theta_\star).$$

We will need the following lemmas, which are special cases of Lemma 2 and Lemma 4 of Atchadé (2017), respectively.
Lemma 7. The normalizing constant of $\Pi_n$ satisfies, for all $z \in \mathbb{R}^n$,

$$C_n(z) \ge \omega_{\delta_\star}\, e^{-\lambda\|\theta_\star\|_1} \left( \frac{\lambda}{\lambda + \sqrt{n\,\bar v(s_\star)}/\sigma} \right)^{s_\star} e^{-\frac{1}{2\sigma^2}\|z - X\theta_\star\|_2^2},$$

where $s_\star = \|\theta_\star\|_0$.

Lemma 8 (Existence of tests). Fix $M \ge 2$ and an integer $s \ge s_\star$, and suppose that $\underline v(s) > 0$. Set

$$\underline\kappa_1 = \frac{n\,\underline v(s)}{\sigma^2}, \quad \epsilon = \frac{\lambda\sqrt{s}}{\underline\kappa_1}.$$

There exists a measurable function $\phi : \mathbb{R}^n \to [0, 1]$ such that

$$\mathbb{E}\left[ \phi(Z) \right] \le \frac{(9p)^{s}\, e^{-\underline\kappa_1 M^2\epsilon^2/3}}{1 - e^{-\underline\kappa_1 M^2\epsilon^2/3}}.$$

Furthermore, for any $\theta \in \mathbb{R}^p$ such that $\|\theta - \theta_\star\|_0 \le s$ and $\|\theta - \theta_\star\|_2 > jM\epsilon$ for some $j \ge 1$,

$$\int \left( 1 - \phi(z) \right) f_{n,\theta}(z)\, dz \le e^{-\frac{\underline\kappa_1}{3}(jM\epsilon)^2}.$$
Proof of Theorem 6, Part (1). Set $B = \{\theta \in \mathbb{R}^p :\ \|\theta\|_0 \ge s_\star + \zeta\}$, and

$$\mathcal{E} = \left\{ z \in \mathbb{R}^n :\ \left\| \nabla_\theta \log f_{n,\theta_\star}(z) \right\|_\infty \le \frac{\lambda}{2} \right\}.$$

We set $\underline\kappa_1 = n\underline v/\sigma^2$. By Lemma 7 and Fubini's theorem,

$$\mathbb{E}\left[ \Pi_n(B \mid Z) \right] \le \mathbb{P}(Z \notin \mathcal{E}) + \left( 1 + \frac{\sqrt{n\,\bar v(s_\star)}}{\lambda\sigma} \right)^{s_\star} \sum_{\delta:\ \|\delta\|_0 \ge s_\star + \zeta} \frac{\omega_\delta}{\omega_{\delta_\star}} \left( \frac{\lambda}{2} \right)^{\|\delta\|_0} \int_{\mathbb{R}^p} \mathbb{E}\left[ \frac{f_{n,\theta}(Z)}{f_{n,\theta_\star}(Z)}\, \mathbb{1}_{\mathcal{E}}(Z) \right] e^{\lambda\|\theta_\star\|_1 - \lambda\|\theta\|_1}\, \mu_\delta(d\theta).$$

The integrand of the integral in the last display is upper bounded by

$$\Psi(\theta) = \exp\left( \frac{\lambda}{2}\|\theta - \theta_\star\|_1 + \lambda\|\theta_\star\|_1 - \lambda\|\theta\|_1 \right) \mathbb{E}\left[ e^{\mathcal{L}_{n,\theta_\star}(\theta; Z)}\, \mathbb{1}_{\mathcal{E}}(Z) \right].$$

We have

$$\frac{\lambda}{2}\|\theta - \theta_\star\|_1 + \lambda\|\theta_\star\|_1 - \lambda\|\theta\|_1 \le -\frac{\lambda}{2} \sum_{j:\ \delta_{\star,j}=0} |\theta_j - \theta_{\star,j}| + \frac{3\lambda}{2} \sum_{j:\ \delta_{\star,j}=1} |\theta_j - \theta_{\star,j}|.$$

Hence, if $\theta - \theta_\star \notin \mathcal{C}_\star$, using the concavity of $\mathcal{L}_{n,\theta_\star}$,

$$\Psi(\theta) \le e^{-\frac{\lambda}{4}\|\theta - \theta_\star\|_1}.$$

However, if $\theta - \theta_\star \in \mathcal{C}_\star$, then

$$\mathbb{E}\left[ e^{\mathcal{L}_{n,\theta_\star}(\theta; Z)}\, \mathbb{1}_{\mathcal{E}}(Z) \right] \le e^{-\frac{n\underline v}{2\sigma^2}\|\theta - \theta_\star\|_2^2},$$

and

$$\Psi(\theta) \le e^{2\lambda\sqrt{s_\star}\,\|\theta - \theta_\star\|_2 - \frac{n\underline v}{2\sigma^2}\|\theta - \theta_\star\|_2^2}\, e^{-\frac{\lambda}{4}\|\theta - \theta_\star\|_1} \le e^{\frac{2\lambda^2 s_\star}{\underline\kappa_1}}\, e^{-\frac{\lambda}{4}\|\theta - \theta_\star\|_1}.$$

We conclude, using $\int e^{-\frac{\lambda}{4}\|\theta - \theta_\star\|_1}\,\mu_\delta(d\theta) \le (8/\lambda)^{\|\delta\|_0}$, that

$$\mathbb{E}\left[ \Pi_n(B \mid Z) \right] \le \mathbb{P}(Z \notin \mathcal{E}) + e^{\frac{2\lambda^2 s_\star}{\underline\kappa_1}} \left( 1 + \frac{\sqrt{n\,\bar v(s_\star)}}{\lambda\sigma} \right)^{s_\star} \sum_{\delta:\ \|\delta\|_0 \ge s_\star + \zeta} \frac{\omega_\delta}{\omega_{\delta_\star}}\, 4^{\|\delta\|_0}.$$

Using H2,

$$\sum_{\delta:\ \|\delta\|_0 \ge s_\star + \zeta} \frac{\omega_\delta}{\omega_{\delta_\star}}\, 4^{\|\delta\|_0} \le 4^{s_\star} \sum_{j \ge s_\star + \zeta} \left( \frac{4c_2}{p^{c_4}} \right)^{j - s_\star},$$

and for $p$ large enough so that $4c_2\, p^{-c_4} < 1/2$, the latter sum is bounded by $2\left( \frac{4c_2}{p^{c_4}} \right)^{\zeta}$. It remains only to bound $\mathbb{P}(Z \notin \mathcal{E})$. Since $\nabla_\theta \log f_{n,\theta_\star}(z) = X'(z - X\theta_\star)/\sigma^2$, and since $Z - X\theta_\star \sim \mathbf{N}(0, \sigma_0^2 I_n)$, standard Gaussian tail bounds give

$$\mathbb{P}(Z \notin \mathcal{E}) \le 2p\, \exp\left( -\frac{\lambda^2\sigma^4}{8\sigma_0^2 \max_{1\le j\le p}\|X_{\cdot j}\|_2^2} \right).$$

Proof of Theorem 6, Part (2). We set $\underline\kappa_1 = n\,\underline v(s)/\sigma^2$ and $\epsilon = \lambda\sqrt{s}/\underline\kappa_1$. We also set $A_\epsilon = \{\theta \in \mathbb{R}^p :\ \|\theta - \theta_\star\|_0 \le s,\ \|\theta - \theta_\star\|_2 > M\epsilon\}$. We have

$$\Pi_n(A_\epsilon \mid Z) \le \mathbb{1}_{\mathcal{E}^c}(Z) + \phi(Z) + \mathbb{1}_{\mathcal{E}}(Z)\, (1 - \phi(Z))\, \Pi_n(A_\epsilon \mid Z).$$
Then by Lemma 7 and Fubini's theorem,

$$\mathbb{E}\left[ \Pi_n(A_\epsilon \mid Z) \right] \le \mathbb{P}(Z \notin \mathcal{E}) + \mathbb{E}[\phi(Z)] + \left( 1 + \frac{\sqrt{n\,\bar v(s_\star)}}{\lambda\sigma} \right)^{s_\star} \sum_{\delta \in \Delta} \frac{\omega_\delta}{\omega_{\delta_\star}} \left( \frac{\lambda}{2} \right)^{\|\delta\|_0} \int_{A_\epsilon} \mathbb{E}\left[ (1 - \phi(Z))\, \frac{f_{n,\theta}(Z)}{f_{n,\theta_\star}(Z)}\, \mathbb{1}_{\mathcal{E}}(Z) \right] e^{\lambda\|\theta_\star\|_1 - \lambda\|\theta\|_1}\, \mu_\delta(d\theta).$$

We write $A_\epsilon = \cup_{j \ge 1} A_\epsilon(j)$, where

$$A_\epsilon(j) = \left\{ \theta \in \mathbb{R}^p :\ \|\theta - \theta_\star\|_0 \le s,\ jM\epsilon < \|\theta - \theta_\star\|_2 \le (j+1)M\epsilon \right\}.$$

Therefore, using Lemma 8, and given that $M \ge 96$,

$$\int_{A_\epsilon(j)} \mathbb{E}\left[ (1 - \phi(Z))\, \frac{f_{n,\theta}(Z)}{f_{n,\theta_\star}(Z)}\, \mathbb{1}_{\mathcal{E}}(Z) \right] e^{\lambda\|\theta_\star\|_1 - \lambda\|\theta\|_1}\, \mu_\delta(d\theta) \le e^{-\frac{\underline\kappa_1}{3}(jM\epsilon)^2}\, e^{3\lambda\sqrt{s}\,(j+1)M\epsilon} \left( \frac{2}{\lambda} \right)^{\|\delta\|_0} \le e^{-\frac{\underline\kappa_1}{64}(jM\epsilon)^2} \left( \frac{2}{\lambda} \right)^{\|\delta\|_0}.$$

It is easy to check using H2 that

$$\sum_{\delta \in \Delta} \frac{\omega_\delta}{\omega_{\delta_\star}} \le p\, p^{c_3 s_\star}.$$

We can then conclude that

$$\mathbb{E}\left[ \Pi_n(A_\epsilon \mid Z) \right] \le \mathbb{P}(Z \notin \mathcal{E}) + \mathbb{E}[\phi(Z)] + p^{c_3 s_\star + 1} \left( 1 + \frac{\sqrt{n\,\bar v(s_\star)}}{\lambda\sigma} \right)^{s_\star} \sum_{j \ge 1} e^{-\frac{\underline\kappa_1}{64}(jM\epsilon)^2},$$

as claimed.

5.2. Proof of Theorem 3. We rely on the behavior of some restricted and $m$-sparse eigenvalues that we introduce first. For $z \in \mathbb{R}^{n\times q}$, $q \ge 1$, and $s \ge 1$, we define

$$\kappa(s, z) = \inf_{\delta \in \{0,1\}^q:\ \|\delta\|_0 \le s}\ \inf\left\{ \frac{\theta' z'z\, \theta}{n\,\|\theta\|_2^2} :\ \theta \neq 0,\ \sum_{k:\ \delta_k=0} |\theta_k| \le 7 \sum_{k:\ \delta_k=1} |\theta_k| \right\},$$

$$\underline\kappa(s, z) = \inf\left\{ \frac{\theta' z'z\, \theta}{n\,\|\theta\|_2^2} :\ \theta \neq 0,\ \|\theta\|_0 \le s \right\}, \quad \bar\kappa(s, z) = \sup\left\{ \frac{\theta' z'z\, \theta}{n\,\|\theta\|_2^2} :\ \theta \neq 0,\ \|\theta\|_0 \le s \right\}.$$
In the above definition, we adopt the conventions $\inf \emptyset = +\infty$ and $\sup \emptyset = 0$. We are interested in the behavior of $\kappa(s, X)$, $\underline\kappa(s, X)$ and $\bar\kappa(s, X)$, when $X$ is the random matrix obtained from assumption H. We will use the following result, taken from Raskutti et al. (2010), Theorem 1, and Rudelson and Zhou (2013), Theorem 3.1, which relates the behavior of $\kappa(s, X)$, $\underline\kappa(s, X)$ and $\bar\kappa(s, X)$ to the corresponding terms $\kappa$, $\underline\kappa(s)$ and $\bar\kappa(s)$ of the true precision matrix $\vartheta$ introduced in (9)-(10).

Lemma 9. Assume H. Then there exist finite universal constants $a_1 > 0$, $a_2 > 0$ such that the following hold.
(1) If $\kappa > 0$, then for all $n \ge a_1\, \kappa^{-2}\, s \log p$,
\[
\mathbb{P}\left[64\, \kappa(s, X) < \kappa\right] \le e^{-a_2 n}.
\]
(2) Let $s \le p$ be such that $\bar\kappa(s) > 0$. Then for all $n \ge a_1\, s \log p$,
\[
\mathbb{P}\left[4\, \underline\kappa(s, X) < \underline\kappa(s)\ \text{ or }\ 4\, \bar\kappa(s, X) > 9\, \bar\kappa(s)\right] \le e^{-a_2 n}.
\]

5.2.1. Proof of Theorem 3, Part (1). We have
\[
\check\Pi_{n,p}(\mathrm{d}\theta \mid X) = \prod_{j=1}^p \check\Pi_{n,p,j}(\mathrm{d}\theta_j \mid X),
\]
where for $j \in \{1, \dots, p\}$, $\check\Pi_{n,p,j}(\mathrm{d}\theta_j \mid X)$ is given by
\[
\check\Pi_{n,p,j}(\mathrm{d}u \mid X) \propto q_j(u; X) \sum_{\delta} \pi_\delta\, \rho_j^{\|\delta\|_0}\, e^{-\rho_j \sigma_j \|u\|_1}\, \mu_\delta(\mathrm{d}u),
\quad \text{and} \quad
\nabla \log q_j(u; X) = \frac{1}{\sigma_j^2}\, X_{-j}'\left(X_j - X_{-j} u\right).
\]
For $q \ge 1$, we define
\[
\mathcal{G}_{n,q} = \left\{z \in \mathbb{R}^{n \times q}:\ \bar\kappa(\bar s, z) \le \frac{9}{4}\, \bar\kappa(\bar s),\ \bar\kappa(1, z) \le \frac{9}{4}\, \bar\kappa(1),\ \text{and}\ \kappa(s, z) \ge \frac{\kappa}{64}\right\}.
\]
For any $k_j \ge 0$, we start by noting that
\[
\mathbb{E}\left[\check\Pi_{n,p}\left(\left\{\theta \in \mathbb{R}^{p \times p}:\ \|\theta_j\|_0 > k_j\ \text{for some}\ j\right\}\ \middle|\ X\right)\right]
\le \mathbb{P}\left(X \notin \mathcal{G}_{n,p}\right) + \sum_{j=1}^p \mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p}}(X)\, \check\Pi_{n,p,j}(A_j \mid X)\right],
\]
where $A_j = \{u \in \mathbb{R}^{p-1}:\ \|u\|_0 > k_j\}$. We notice that if $X \in \mathcal{G}_{n,p}$, then $X_{-j} \in \mathcal{G}_{n,p-1}$ for any $1 \le j \le p$. We recall that the notation $X_{-j}$ denotes the matrix obtained by
removing the $j$-th column of $X$. Hence
\[
\mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p}}(X)\, \check\Pi_{n,p,j}(A_j \mid X)\right]
\le \mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p-1}}(X_{-j})\, \check\Pi_{n,p,j}(A_j \mid X)\right]
= \mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p-1}}(X_{-j})\, \mathbb{E}\left(\check\Pi_{n,p,j}(A_j \mid X)\ \middle|\ X_{-j}\right)\right].
\]
We conclude that
\[
\mathbb{E}\left[\check\Pi_{n,p}\left(\left\{\theta \in \mathbb{R}^{p \times p}:\ \|\theta_j\|_0 > k_j\ \text{for some}\ j\right\}\ \middle|\ X\right)\right]
\le \mathbb{P}\left(X \notin \mathcal{G}_{n,p}\right) + \sum_{j=1}^p \mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p-1}}(X_{-j})\, T_j\right], \tag{3}
\]
where $T_j = \mathbb{E}\left(\check\Pi_{n,p,j}(A_j \mid X) \mid X_{-j}\right)$.

The key idea of the proof is to notice that $T_j$ is an expected quasi-posterior probability in the linear regression model $X_j = X_{-j}\beta + \eta$, where $\eta \sim \mathbf{N}(0, (1/\vartheta_{jj}) I_n)$. Therefore, by Theorem 6, Part (1), we have
\[
T_j \le 2p \exp\left(-\frac{\vartheta_{jj}\, \rho_j^2}{8 \max_{k \ne j} \|X_k\|_2^2}\right)
+ 2 \cdot 4^{s_j} \left(1 + \frac{\sigma_j^2 L_j}{\rho_j^2 s_j}\right) e^{\frac{\rho_j^2 s_j}{\tau_j \sigma_j^2}} \left(\frac{4 c_1}{p^{c_4}}\right)^{k_j - s_j}, \tag{4}
\]
where $L_j = n\, \bar\kappa(\bar s, X_{-j})$, and $\tau_j = n\, \kappa(s, X_{-j})$. Given the choice of $\rho_j$, we see that the first term on the right-hand side of (4) is bounded by $2p\, e^{-3\log p} = 2/p^2$. Using the fact that for $X_{-j} \in \mathcal{G}_{n,p-1}$ we have $L_j \le \frac{9}{4}\, n\, \bar\kappa(\bar s)$ and $\tau_j \ge \frac{1}{64}\, n\, \kappa$, it is easy to show that the second term on the right-hand side of (4) is bounded by
\[
\exp\left[s_j \log p \left(\frac{169\, \sigma_j^2 \vartheta_{jj}\, \bar\kappa(\bar s)}{\kappa} + \frac{\sigma_j^2 \vartheta_{jj}\, \bar\kappa(\bar s)}{4 \kappa \log p} + \log(4e)\right) - c_4 \left(k_j - s_j\right) \log p\right].
\]
With $k_j = \zeta_j$ as given in the statement of the theorem, this latter expression is bounded by $1/p^2$. This concludes the proof.
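The reduction used above — each $T_j$ is a quasi-posterior probability in the regression of column $X_j$ on the remaining columns $X_{-j}$ — is the neighborhood-selection device of Meinshausen and Bühlmann (2006) that underlies the whole construction. A minimal frequentist sketch of that device may help fix ideas: nodewise lasso regressions combined by the OR-rule, with a hand-rolled coordinate descent. The tuning parameter, dimensions, and the chain-graph example are hypothetical, not taken from the paper's simulation study.

```python
import numpy as np

def lasso_cd(A, y, lam, n_iter=200):
    """Coordinate-descent lasso: minimize (1/2n)||y - A b||_2^2 + lam ||b||_1."""
    n, q = A.shape
    b = np.zeros(q)
    col2 = (A ** 2).sum(axis=0) / n          # per-coordinate curvature A_k' A_k / n
    r = y - A @ b                             # current residual
    for _ in range(n_iter):
        for k in range(q):
            if col2[k] == 0.0:
                continue
            rho = A[:, k] @ r / n + col2[k] * b[k]                    # partial correlation
            new = np.sign(rho) * max(abs(rho) - lam, 0.0) / col2[k]   # soft-thresholding
            r += A[:, k] * (b[k] - new)       # keep the residual in sync
            b[k] = new
    return b

def neighborhood_selection(X, lam):
    """Regress each column on the others; symmetrize the estimated graph by the OR-rule."""
    n, p = X.shape
    adj = np.zeros((p, p), dtype=bool)
    for j in range(p):
        idx = [k for k in range(p) if k != j]
        b = lasso_cd(X[:, idx], X[:, j], lam)
        adj[j, idx] = b != 0.0
    return adj | adj.T

# Hypothetical example: data from a chain-graph precision matrix.
rng = np.random.default_rng(1)
p, n = 10, 400
Theta = np.eye(p) + np.diag(0.45 * np.ones(p - 1), 1) + np.diag(0.45 * np.ones(p - 1), -1)
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Theta), size=n)
adj = neighborhood_selection(X, lam=0.2)
print(adj.sum())   # total count of recovered (symmetrized) edge indicators
```

The product-form quasi-posterior studied in the paper replaces each of these $p$ penalized regressions by a quasi-posterior distribution, which is what allows the parallel exploration mentioned in the introduction.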
5.2.2. Proof of Theorem 3, Part (2). We use the same approach as above. We define $\bar s_j = s_j + \zeta_j$ ($\bar s_j = 1$ if $s_j = 0$), and $\bar s = \max_j \bar s_j$, and we set
\[
\mathcal{G}_{n,q} = \left\{z \in \mathbb{R}^{n \times q}:\ \bar\kappa(\bar s, z) \le \frac{9}{4}\, \bar\kappa(\bar s),\ \text{and}\ \underline\kappa(\bar s, z) \ge \frac{1}{4}\, \underline\kappa(\bar s)\right\}.
\]
We also define $\mathcal{U} = \{\theta \in \mathbb{R}^{p \times p}:\ \|\theta_j - \theta_{\star,j}\|_2 > \epsilon_j\ \text{for some}\ j\}$, $\bar{\mathcal{U}} = \mathcal{U} \cap \{\theta \in \mathbb{R}^{p \times p}:\ \|\theta_j - \theta_{\star,j}\|_0 \le s_j + \zeta_j\ \text{for all}\ j\}$, and
\[
\check\Pi_{n,p}(\mathcal{U} \mid X) \le \check\Pi_{n,p}\left(\left\{\theta \in \mathbb{R}^{p \times p}:\ \|\theta_j - \theta_{\star,j}\|_0 > s_j + \zeta_j\ \text{for some}\ j\right\} \mid X\right) + \mathbf{1}_{\mathcal{G}_{n,p}^c}(X) + \mathbf{1}_{\mathcal{G}_{n,p}}(X)\, \check\Pi_{n,p}(\bar{\mathcal{U}} \mid X). \tag{5}
\]
If for some $j$, $\|\theta_j - \theta_{\star,j}\|_0 > s_j + \zeta_j$, then we necessarily have $\|\theta_j\|_0 > \zeta_j$. Therefore, by Theorem 3, Part (1), we have
\[
\mathbb{E}\left[\check\Pi_{n,p}\left(\left\{\theta \in \mathbb{R}^{p \times p}:\ \|\theta_j - \theta_{\star,j}\|_0 > s_j + \zeta_j\ \text{for some}\ j\right\} \mid X\right)\right] \le e^{-a_2 n} + \frac{4}{p}. \tag{6}
\]
By Lemma 9, for $n \ge a_1\, \bar s \log p$,
\[
\mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p}^c}(X)\right] = \mathbb{P}\left[X \notin \mathcal{G}_{n,p}\right] \le e^{-a_2 n}. \tag{7}
\]
It remains to control the last term on the right-hand side of (5). To do so, we note that if $X \in \mathcal{G}_{n,p}$, then $X_{-j} \in \mathcal{G}_{n,p-1}$ for all $1 \le j \le p$. Hence
\[
\mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p}}(X)\, \check\Pi_{n,p}(\bar{\mathcal{U}} \mid X)\right]
\le \sum_{j=1}^p \mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p-1}}(X_{-j})\, \check\Pi_{n,p,j}(A_j \mid X)\right]
\le \sum_{j=1}^p \mathbb{E}\left[\mathbf{1}_{\mathcal{G}_{n,p-1}}(X_{-j})\, \mathbb{E}\left(\check\Pi_{n,p,j}(A_j \mid X) \mid X_{-j}\right)\right], \tag{8}
\]
where $A_j = \{u \in \mathbb{R}^{p-1}:\ \|u - \theta_{\star,j}\|_2 > \epsilon_j,\ \text{and}\ \|u - \theta_{\star,j}\|_0 \le \bar s_j\}$. As in the proof of Theorem 3, Part (1), we note that under the conditional distribution of $X_j$ given $X_{-j}$, the term $\check\Pi_{n,p,j}(A_j \mid X)$ can be viewed as the posterior distribution in the linear regression model $X_j = X_{-j}\beta + \eta$, where $\eta \sim \mathbf{N}(0, (1/\vartheta_{jj}) I_n)$. Therefore, using Theorem 6, Part (2), and for any constant $M_0 \ge 96$, we have
\[
\mathbb{E}\left(\check\Pi_{n,p,j}(A_j \mid X) \mid X_{-j}\right)
\le 2p \exp\left(-\frac{\vartheta_{jj}\, \rho_j^2}{8 \max_{k \ne j} \|X_k\|_2^2}\right)
+ \frac{e^{\bar s_j \log(9p)}\, e^{-M_0^2 \tau_j \epsilon_j^2 / 3}}{1 - e^{-M_0^2 \tau_j \epsilon_j^2 / 3}}
+ 2 p^{c_3} \left(1 + \frac{\sigma_j^2 L_j}{\rho_j^2 s_j}\right) \frac{e^{-M_0^2 \tau_j \epsilon_j^2 / 64}}{1 - e^{-M_0^2 \tau_j \epsilon_j^2 / 64}}, \tag{9}
\]
where $\epsilon_j = \rho_j s_j^{1/2} / \tau_j$, $\tau_j = n\, \underline\kappa(\bar s_j, X_{-j})$, and $L_j = n\, \bar\kappa(\bar s_j, X_{-j})$. As seen in the proof of Theorem 3, Part (1), the first term on the right-hand side of (9) is upper bounded by $2/p^2$. We have
\[
\frac{M_0^2\, \tau_j\, \epsilon_j^2}{3} \ge \frac{54\, M_0^2}{3}\, \sigma_j^2 \vartheta_{jj}\, s_j \log p.
\]
Hence, for $p \ge 4e$ and $\frac{54 M_0^2}{3}\, \sigma_j^2 \vartheta_{jj} \ge 4$, the second term on the right-hand side of (9) is also upper bounded by $1/p^2$. For $\frac{54 M_0^2}{64}\, \sigma_j^2 \vartheta_{jj} \ge 4$, the third term is upper bounded by
\[
4 \exp\left[-s_j \log p \left(\frac{54 M_0^2}{64}\, \sigma_j^2 \vartheta_{jj} - c_3 - c_4 - \frac{\sigma_j^2 \vartheta_{jj}\, \bar\kappa(\bar s_j)}{4 \kappa \log p}\right)\right] \le \frac{1}{p^2},
\]
by choosing $M_0$ large enough. This concludes the proof.

Acknowledgements. The author would like to thank Shuheng Zhou for very helpful conversations. This work is partially supported by the NSF.

References

Atay-Kayis, A. and Massam, H. (2005). A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika.
Atchadé, Y. A. (2017). On the contraction properties of some high-dimensional quasi-posterior distributions. Ann. Statist.
Atchadé, Y. F. (2015). A Moreau-Yosida approximation scheme for high-dimensional quasi-posterior distributions. ArXiv e-prints.
Banerjee, S. and Ghosal, S. (2015). Bayesian structure learning in graphical models. Journal of Multivariate Analysis.
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. Roy. Statist. Soc. Ser. B.
Bühlmann, P. and van de Geer, S. (2011). Statistics for high-dimensional data: Methods, theory and applications. Springer Series in Statistics, Springer, Heidelberg.
Carvalho, C. M., Polson, N. G. and Scott, J. G. (2010). The horseshoe estimator for sparse signals. Biometrika.
Castillo, I., Schmidt-Hieber, J. and van der Vaart, A. (2015). Bayesian linear regression with sparse priors. Ann. Statist.
Dobra, A., Lenkoski, A. and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. J. Amer. Statist. Assoc.
Fearnhead, P. and Prangle, D. (2010). Constructing summary statistics for approximate Bayesian computation: semi-automatic ABC. Technical Report, Lancaster University, UK.
Hastie, T., Tibshirani, R. and Wainwright, M. (2015). Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC.
Kato, K. (2013). Quasi-Bayesian analysis of nonparametric instrumental variables models. Ann. Statist.
Khondker, Z. S., Zhu, H., Chu, H., Lin, W. and Ibrahim, J. G. (2013). The Bayesian covariance lasso. Stat. Interface.
Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. J. Comput. Graph. Statist. Supplementary material available online.
Li, C. and Jiang, W. (2014). Model selection for likelihood-free Bayesian methods based on moment conditions: theory and numerical examples. ArXiv e-prints.
Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. Ann. Statist.
Mukherjee, S. and Speed, T. P. (2008). Network inference using informative priors. Proceedings of the National Academy of Sciences.
Narisetty, N. and He, X. (2014). Bayesian variable selection with shrinking and diffusing priors. Ann. Statist.
Park, T. and Casella, G. (2008). The Bayesian lasso. J. Amer. Statist. Assoc.
Peng, J., Wang, P., Zhou, N. and Zhu, J. (2009). Partial correlation estimation by joint sparse regression models. Journal of the American Statistical Association.
Peterson, C., Stingo, F. C. and Vannucci, M. (2015). Bayesian inference of multiple Gaussian graphical models. Journal of the American Statistical Association.
Raskutti, G., Wainwright, M. J. and Yu, B. (2010). Restricted eigenvalue properties for correlated Gaussian designs. J. Mach. Learn. Res.
Ravikumar, P., Wainwright, M. J., Raskutti, G. and Yu, B. (2011). High-dimensional covariance estimation by minimizing l1-penalized log-determinant divergence. Electron. J. Stat.
Reid, S., Tibshirani, R. and Friedman, J. (2013). A study of error variance estimation in lasso regression. ArXiv e-prints.
Rudelson, M. and Zhou, S. (2013). Reconstruction from anisotropic random measurements. IEEE Trans. Inf. Theory.
Schreck, A., Fort, G., Le Corff, S. and Moulines, E. (2013). A shrinkage-thresholding Metropolis adjusted Langevin algorithm for Bayesian variable selection. ArXiv e-prints.
Sun, T. and Zhang, C.-H. (2013). Sparse matrix inversion with scaled lasso. Journal of Machine Learning Research.
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 7 Approximate
More informationVariational Learning : From exponential families to multilinear systems
Variational Learning : From exponential families to multilinear systems Ananth Ranganathan th February 005 Abstract This note aims to give a general overview of variational inference on graphical models.
More informationECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference
ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Sparse Recovery using L1 minimization - algorithms Yuejie Chi Department of Electrical and Computer Engineering Spring
More informationNear Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing
Near Ideal Behavior of a Modified Elastic Net Algorithm in Compressed Sensing M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas M.Vidyasagar@utdallas.edu www.utdallas.edu/ m.vidyasagar
More informationSparsity Models. Tong Zhang. Rutgers University. T. Zhang (Rutgers) Sparsity Models 1 / 28
Sparsity Models Tong Zhang Rutgers University T. Zhang (Rutgers) Sparsity Models 1 / 28 Topics Standard sparse regression model algorithms: convex relaxation and greedy algorithm sparse recovery analysis:
More informationRegularized Estimation of High Dimensional Covariance Matrices. Peter Bickel. January, 2008
Regularized Estimation of High Dimensional Covariance Matrices Peter Bickel Cambridge January, 2008 With Thanks to E. Levina (Joint collaboration, slides) I. M. Johnstone (Slides) Choongsoon Bae (Slides)
More informationdans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés
Inférence pénalisée dans les modèles à vraisemblance non explicite par des algorithmes gradient-proximaux perturbés Gersende Fort Institut de Mathématiques de Toulouse, CNRS and Univ. Paul Sabatier Toulouse,
More information