Gaussian graphical model determination based on birth-death MCMC inference

Abdolreza Mohammadi
Johann Bernoulli Institute, University of Groningen, Netherlands
a.mohammadi@rug.nl

Ernst C. Wit
Johann Bernoulli Institute, University of Groningen, Netherlands
e.c.wit@rug.nl

February 5, 2013

Abstract

We propose an efficient Bayesian methodology for model determination in Gaussian graphical models, covering both decomposable and non-decomposable cases. The proposed methodology is a trans-dimensional MCMC approach that makes use of a spatial birth-death process. The birth-death process jumps through all possible graphical models by adding a new edge in a birth event or deleting an edge in a death event. The proposed method is easy to implement and computationally feasible for large graphical models. We illustrate the efficiency of the proposed methodology on simulated and real datasets. In addition, we have implemented the proposed methodology in an R package, called BDgraph, which is freely available online.

Keywords: Bayesian model selection, Gaussian graphical models, Non-decomposable graphs, Birth-death process, Markov chain Monte Carlo, G-Wishart.

1 Introduction

In archetypal high-dimensional inference problems, a large number of variables is recorded on a relatively small number of observations. Examples of such high-dimensional problems are detecting neurological associations in fMRI data, inferring gene networks from genomic data, and predicting movie preferences from sparse film-rating data. The simplest way to describe these types of multivariate data is by means of a multivariate Gaussian distribution. In high-dimensional problems, either by design or by necessity, it can be of interest to consider multivariate Gaussian distributions with a reduced parameter space. Covariance-selection models, or Gaussian graphical models, offer a potent set of tools for shrinkage and regularization of covariance matrices in these kinds of high-dimensional problems.

Dempster (1972) proposed a method that reduces the number of parameters in Gaussian graphical models by setting selected elements of the precision matrix to zero. In addition, the dependency patterns among the variables in the model can be summarized visually by means of an undirected graph $G = (V, E)$, in which each variable is associated with a vertex in $V = \{1, \ldots, p\}$ and the edge set satisfies $E \subseteq V \times V$. If the underlying variables are multivariate normal, the off-diagonal elements of the precision matrix that are unequal to zero correspond to the edges that link the vertices. The graphical model is undirected because the precision matrix is symmetric.

A graph with $p$ nodes has $m = p(p-1)/2$ possible edges. As a result, there are $2^m$ possible graphical models, corresponding to all combinations of individual edges being included in or excluded from the model. For instance, for a graph with $p = 8$ we have $m = 28$, and hence $2^{28} \approx 2.7 \times 10^8$, i.e. more than 250 million, structurally different graphical models. In typical high-dimensional problems, such as genetic networks, there are hundreds of nodes. This motivates the development of efficient, scalable search methodologies that are able to move through all possible graphical models in order to infer a model close to the true one, or at least to distinguish a set of true edges from irrelevant ones.

Roverato (2002), Jones et al. (2005) and Lenkoski and Dobra (2011) proposed Bayesian approaches for computing the posterior distribution of the graph based on the G-Wishart prior distribution. The ability to focus on the graph alone allows for the development of various search algorithms that visit the high-probability regions of the model space. However, determining the graphical models with the highest posterior probability requires knowledge of the normalizing constants of all possible graphical models, and these normalizing constants are not available analytically unless the graph is decomposable. Such methods are unsuitable as the basis of an MCMC sampling scheme for even moderate $p$, because there is a huge number of possible graphical models and only a small fraction of them are decomposable. Some approaches try to approximate the normalizing constant for non-decomposable graphical models: Roverato (2002) proposed importance sampling, Atay-Kayis and Massam (2005) proposed Monte Carlo sampling, and Lenkoski and Dobra (2011) proposed a Laplace approximation. Nevertheless, the normalizing-constant approximation is often the crucial part of the computation.

An alternative is a trans-dimensional MCMC methodology, in which the MCMC algorithm moves through all possible models, not only selecting the best model but simultaneously estimating its parameters. One special and popular case is the reversible-jump MCMC (RJMCMC) approach proposed by Green (1995). Reversible-jump methods allow for the construction of an ergodic Markov chain with the joint posterior distribution of the parameters and the model as its stationary distribution. Moves between models are achieved by periodically proposing a move to a different model and rejecting it with an appropriate probability, to ensure that the chain possesses the required stationary distribution. Ideally these proposed moves are designed to have a high probability of acceptance, so that the algorithm explores the model space efficiently, though this is not always easy.

Giudici and Green (1999) applied this methodology to decomposable Gaussian graphical models, but their method only works for low-dimensional models. Dobra et al. (2011) proposed another RJMCMC method, based on the Cholesky decomposition of the precision matrix, which requires a computationally intensive matrix completion. Moreover, both methods still require computing the normalizing constant.

Another trans-dimensional MCMC methodology is the birth-death MCMC (BDMCMC) approach, which is based on a continuous-time Markov process. This methodology was developed by Stephens (2000) for use in finite mixture models with variable dimension, following earlier proposals by Preston (1976), Ripley (1977) and Geyer and Møller (1994). In this method, the time between jumps to a larger dimensional model (births) or to a smaller dimensional model (deaths) is taken to be a random variable with a specific rate. The choice of the birth and death rates determines the birth-death process, and they are chosen in such a way that the stationary distribution is precisely the posterior distribution of interest. In contrast to the RJMCMC approach, moves between models are always accepted, which can make the BDMCMC approach extremely efficient.

In this paper, we propose a novel Bayesian framework for Gaussian graphical models based on the BDMCMC methodology. In our proposed BDMCMC method, we add or remove an edge via a birth or a death event, respectively. The birth and death events are modeled as independent Poisson processes, so the time of a birth or death is exponentially distributed. Our proposed methodology applies to general graphical models, i.e. both decomposable and non-decomposable models. It can be used for high-dimensional problems, i.e. graphical models with more than 120 nodes; see example 4.2.

In section 2, after briefly introducing the notation and preliminary background material related to graphical models from a Bayesian point of view, we propose our Bayesian model selection method for Gaussian graphical models based on the BDMCMC methodology. In addition, for the proposed BDMCMC algorithm, we consider two different death rates, for low- and high-dimensional cases, respectively. Section 3 contains the specific implementation of the proposed method, including suitable prior distributions, an algorithm for sampling the precision matrix, and the computation of the death rates. In section 4 we demonstrate the performance of the proposed methodology in several simulations and on real data. We conclude the paper with a discussion of various possible extensions of the proposed methodology.

2 Birth-death MCMC inference for Gaussian graphical models

2.1 Bayesian graphical models

We briefly introduce some notation and the structure of undirected Gaussian graphical models. For a comprehensive introduction to Gaussian graphical models see Lauritzen (1996) and Whittaker (1990).

Let $G = (V, E)$ be an undirected graph, where $V = \{1, 2, \ldots, p\}$ is the set of $p$ vertices and $E$ is the edge set. Let $\mathcal{W} = \{(i, j) : i, j \in V, i \le j\}$, $\mathcal{V} = \{(i, j) : i \le j, \text{ such that } i = j \text{ or } (i, j) \in E\}$, and $\bar{E} = \mathcal{W} \setminus \mathcal{V}$, the set of absent edges. We define Gaussian graphical models with respect to the graph $G$ and zero mean as
$$\mathcal{M}_G = \{ N_p(0, \Sigma) \mid K = \Sigma^{-1} \in \mathcal{P}_G \},$$
where $\mathcal{P}_G$ denotes the space of $p \times p$ positive definite matrices with entries $(i, j)$ equal to zero whenever $(i, j) \in \bar{E}$, that is,
$$\mathcal{P}_G = \{ K \in \mathcal{P} \mid k_{ij} = 0 \text{ for } (i, j) \in \bar{E};\ G = (V, E) \},$$
where $K = \{k_{ij}\}$ and $\mathcal{P}$ denotes the space of $p \times p$ positive definite matrices. Note that $\mathcal{P}_G \subseteq \mathcal{P}$.

Let $\mathbf{x} = (x_1, \ldots, x_n)$ be an independent and identically distributed sample of size $n$ from a Gaussian graphical model $\mathcal{M}_G$. The likelihood is then
$$p(\mathbf{x} \mid K, G) = (2\pi)^{-np/2} |K|^{n/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(KS) \right\}, \qquad (1)$$
where $S = \mathbf{x}'\mathbf{x}$.

In our graphical model we are dealing with two kinds of uncertainty: (a) uncertainty about the structure of the underlying conditional independence graph and (b) uncertainty about the parameters of the graphical model. Our aim is to propose a Bayesian framework that deals with both of these uncertainties. It is natural to define the joint prior on the graph and the precision matrix via the product rule, $p(G, K) = p(G)\,p(K \mid G)$. The joint posterior distribution of $(G, K)$ is therefore
$$p(K, G \mid \mathbf{x}) \propto p(\mathbf{x} \mid K, G)\, p(K, G) \propto p(\mathbf{x} \mid K, G)\, p(K \mid G)\, p(G). \qquad (2)$$
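To make the likelihood (1) concrete, the following minimal R sketch evaluates its logarithm up to the additive constant $-(np/2)\log(2\pi)$; the function name and the zero-mean $n \times p$ data-matrix convention are illustrative assumptions, not part of the BDgraph implementation.

```r
# Minimal sketch: log of likelihood (1), dropping the constant -(n*p/2)*log(2*pi).
# x is the n-by-p data matrix (zero mean assumed), K the precision matrix.
loglik <- function(x, K) {
  n <- nrow(x)
  S <- t(x) %*% x                                   # S = x'x, as in (1)
  as.numeric((n / 2) * determinant(K, logarithm = TRUE)$modulus -
               sum(diag(K %*% S)) / 2)
}
```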

2.2 Birth-death process for Gaussian graphical models

Here we propose a continuous-time Markov birth-death process for Gaussian graphical model determination, based on the theory derived by Preston (1976, Section 5). The birth and death events of the edges occur in continuous time, with rates determined by the stationary distribution of the process. Let $G = (V, E)$ be the current state of the process. If it is the time for the birth of an edge $\xi = (i, j) \in \bar{E}$, the process jumps from the current graph to a new graph $G^{+\xi} = (V, E \cup \xi)$. If, however, it is the time for a death, an edge $\xi = (i, j) \in E$ is removed and the process jumps from the current graph to a new graph $G^{-\xi} = (V, E \setminus \xi)$.

Suppose at time $t$ the process is at state $\mathcal{M}_G$, where $G = (V, E)$, with precision matrix $K \in \mathcal{P}_G$, and let $\Omega = \bigcup_{G \in \mathcal{G}} \mathcal{P}_G$, where $\mathcal{G}$ denotes the set of all possible graphical models. We consider the following continuous-time Markov birth-death process on $\Omega$.

Death: When the process is at state $\mathcal{M}_G$, each edge $\xi = (i, j) \in E$ dies independently of the others as a Poisson process with rate $\delta_\xi(K)$. The overall death rate is therefore $\delta(K) = \sum_{\xi \in E} \delta_\xi(K)$. When the death of an edge $\xi = (i, j) \in E$ occurs, the parameter $k_\xi$ dies and the process jumps from $K$ to $K^{-\xi} = K \setminus k_\xi$. We define the matrix $K^{-\xi}$ to be equal to $K$ except for the entries in positions $\{(i, j), (j, i), (j, j)\}$. We put 0 in positions $(i, j)$ and $(j, i)$. To guarantee that the new precision matrix is positive definite, we set the $(j, j)$ entry of $K^{-\xi}$ to $k_{jj} - c + c^-$, where
$$c = K_{j, V \setminus j} \left( K_{V \setminus j, V \setminus j} \right)^{-1} K_{V \setminus j, j} \quad \text{and} \quad c^- = K^{-\xi}_{j, V \setminus j} \left( K^{-\xi}_{V \setminus j, V \setminus j} \right)^{-1} K^{-\xi}_{V \setminus j, j}.$$
The idea behind this modification comes from the block Gibbs sampling method; see (8).
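The death move just described can be sketched in R as follows; this is a minimal illustration of the positive-definiteness-preserving adjustment of the $(j, j)$ entry, with kill_edge being our own illustrative name rather than a function of the BDgraph package.

```r
# Sketch: remove edge (i, j) from K, zeroing k_ij = k_ji and resetting the
# (j, j) entry to k_jj - c + c^- so that the result stays positive definite.
kill_edge <- function(K, i, j) {
  p <- nrow(K)
  rest <- setdiff(seq_len(p), j)                    # V \ j
  c_old <- drop(K[j, rest] %*% solve(K[rest, rest]) %*% K[rest, j])
  K_new <- K
  K_new[i, j] <- 0                                  # the parameter k_xi dies
  K_new[j, i] <- 0
  c_new <- drop(K_new[j, rest] %*% solve(K_new[rest, rest]) %*% K_new[rest, j])
  K_new[j, j] <- K[j, j] - c_old + c_new            # k_jj - c + c^-
  K_new
}
```

By this construction the Schur complement $k_{jj} - c$ is unchanged, which is exactly why positive definiteness is preserved.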

Birth: When the process is at state $\mathcal{M}_G$, a new edge $\xi = (i, j) \in \bar{E}$ is born independently of the others as a Poisson process with rate $\beta_\xi(K)$, with the new element $k_\xi$ proposed from the density $b_\xi(k_\xi; K)$ with respect to Lebesgue measure on $\mathbb{R}$. The overall birth rate is therefore $\beta(K) = \sum_{\xi \in \bar{E}} \beta_\xi(K)$. When a birth of $\xi \in \bar{E}$ occurs, the process jumps from $K$ to $K^{+\xi} = K \cup k_\xi$. We define the matrix $K^{+\xi}$ to be equal to $K$ except for the entries in positions $\{(i, j), (j, i), (j, j)\}$. We put the non-zero value $k_\xi$ in positions $(i, j)$ and $(j, i)$. To guarantee that the new precision matrix is positive definite, we set the $(j, j)$ entry of $K^{+\xi}$ to $k_{jj} - c + c^+$, where
$$c^+ = K^{+\xi}_{j, V \setminus j} \left( K^{+\xi}_{V \setminus j, V \setminus j} \right)^{-1} K^{+\xi}_{V \setminus j, j}.$$

Thus a death decreases the number of parameters by one, while a birth increases it by one. In our birth-death process, the time to the next birth/death event is exponentially distributed with mean $1/(\beta(K) + \delta(K))$. Consequently, the next event is the death of an edge $\xi \in E$ with probability $\delta_\xi(K)/(\beta(K) + \delta(K))$, and likewise the birth of an edge $\xi \in \bar{E}$ with probability $\beta_\xi(K)/(\beta(K) + \delta(K))$.

To show that the stationary distribution of the birth-death process is precisely the posterior distribution $p(K, G \mid \mathbf{x})$, we require the following sufficient condition on the birth and death rates.

Theorem 2.1. The birth-death process defined above has stationary distribution $p(K, G \mid \mathbf{x})$ if, for each $\xi \in \bar{E}$,
$$\beta_\xi(K)\, b_\xi(k_\xi; K)\, p(G, K \mid \mathbf{x}) = \delta_\xi(K^{+\xi})\, p(G^{+\xi}, K^{+\xi} \mid \mathbf{x}). \qquad (3)$$

Proof. See the Appendix.

2.3 The proposed BDMCMC algorithm

Here we propose a BDMCMC algorithm based on a specific choice of birth and death rates that satisfies Theorem 2.1. Consider the birth-death process obtained by setting the birth rates to a fixed value, $\beta_\xi(K) = \beta_0$ for each $\xi \in \bar{E}$, where $\beta_0$ is an arbitrary fixed number in $\mathbb{R}^+$. According to (3), the death rates are then
$$\delta_\xi(K) = \beta_0\, \frac{b_\xi(k_\xi; K^{-\xi})\, p(G^{-\xi}, K^{-\xi} \mid \mathbf{x})}{p(G, K \mid \mathbf{x})}, \quad \text{for each } \xi \in E. \qquad (4)$$

Based on these birth and death rates, the BDMCMC algorithm for Gaussian graphical models is as follows.

Algorithm 2.1. BDMCMC algorithm. Starting with an initial graphical model $\mathcal{M}_G$, where $G = (V, E)$, with precision matrix $K$, iterate the following steps:

1. Set the birth rates $\beta_\xi(K) = \beta_0$ for each edge $\xi \in \bar{E}$.
2. Calculate the total birth rate $\beta(K) = |\bar{E}|\, \beta_0$.
3. Calculate the death rates by
$$\delta_\xi(K) = \beta_0\, \frac{b_\xi(k_\xi; K^{-\xi})\, p(G^{-\xi}, K^{-\xi} \mid \mathbf{x})}{p(G, K \mid \mathbf{x})}, \quad \text{for each } \xi \in E.$$
4. Calculate the total death rate $\delta(K) = \sum_{\xi \in E} \delta_\xi(K)$.
5. Calculate the expected waiting time $\lambda(K) = 1/(\beta(K) + \delta(K))$.
6. Simulate the type of jump, a birth or a death, with respective probabilities
$$p(\text{birth of element } \xi) = \beta_\xi(K)\, \lambda(K), \quad \text{for each } \xi \in \bar{E},$$
$$p(\text{death of element } \xi) = \delta_\xi(K)\, \lambda(K), \quad \text{for each } \xi \in E.$$
7. According to the type of jump, sample from the posterior distribution of the new precision matrix.

For step 7, in subsection 3.4 we explain how to sample from the posterior distribution of the precision matrix.
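A minimal R sketch of one scan of Algorithm 2.1 is given below, assuming a user-supplied function death_rate(K, i, j) implementing (4) or (7); all names are illustrative. In a full sampler, the selected edge would then be added or removed (step 7) by updating $K$ as in subsections 2.2 and 3.4.

```r
# Sketch of steps 1-6 of Algorithm 2.1; adj is the 0/1 adjacency matrix of the
# current graph G and beta0 the constant birth rate.
bdmcmc_step <- function(K, adj, beta0, death_rate) {
  pairs   <- which(upper.tri(adj), arr.ind = TRUE)
  absent  <- pairs[adj[pairs] == 0, , drop = FALSE]   # candidate births
  present <- pairs[adj[pairs] == 1, , drop = FALSE]   # candidate deaths
  b_rates <- rep(beta0, nrow(absent))                 # steps 1-2
  d_rates <- vapply(seq_len(nrow(present)),           # steps 3-4
                    function(r) death_rate(K, present[r, 1], present[r, 2]),
                    numeric(1))
  lambda  <- 1 / (sum(b_rates) + sum(d_rates))        # step 5: waiting time
  probs   <- c(b_rates, d_rates) * lambda             # step 6
  pick    <- sample.int(length(probs), 1, prob = probs)
  birth   <- pick <= nrow(absent)
  edge    <- if (birth) absent[pick, ] else present[pick - nrow(absent), ]
  list(edge = edge, birth = birth, waiting_time = lambda)
}
```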

Figure 1: Illustration of sampling from the BDMCMC algorithm in continuous time; the algorithm is sampled at the jump times $\{t_1, t_2, t_3, t_4, \ldots\}$.

2.4 Sampling from the BDMCMC algorithm in continuous time

In the RJMCMC approach, and in other kinds of Metropolis-Hastings algorithms, the output is typically monitored after each iteration. In our continuous-time BDMCMC algorithm there are several possible sampling schemes. For example, we can sample from the continuous-time Markov process at regular times, as in Stephens (2000). Another way is to take a sample at each jump to a new state, as we do in this paper; see figure 1. When computing sample means, we then effectively put a weight on each state visited by the algorithm, equal to the expected length of the holding time in that state. In other words, if the process is in state $\mathcal{M}_G$ with precision matrix $K$, the holding-time weight of this state is $\lambda(K) = 1/(\beta(K) + \delta(K))$; see (9). In this way, the variances of estimators built from the sampler output are decreased; for more details see Cappé et al. (2003), subsection 2.5.
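The weighting scheme above can be sketched in R as follows; graphs is assumed to be the list of 0/1 adjacency matrices visited by the sampler and waits the corresponding holding-time weights (both names are ours). The result is the matrix of posterior edge-inclusion probabilities, anticipating formula (9) of Section 4.1.

```r
# Sketch: waiting-time-weighted posterior edge-inclusion probabilities.
edge_inclusion <- function(graphs, waits) {
  num <- Reduce(`+`, Map(`*`, waits, graphs))   # sum_t lambda_t * I(edge in G_t)
  num / sum(waits)
}
```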

3 Specific implementation of the BDMCMC algorithm

3.1 Proposed prior distributions

In our proposed Bayesian methodology, we embed the joint inference problem naturally in the structure of a Bayesian hierarchical model. Given a prior $p(G)$ on the graph, we set a prior distribution $p(K \mid G)$ for its precision matrix.

For the prior distribution of the graph we propose two alternatives. One is the discrete uniform distribution on the graph space $\mathcal{G}$,
$$p(G) = \frac{1}{|\mathcal{G}|}, \quad \text{for each } G \in \mathcal{G},$$
where $\mathcal{G}$ denotes the set of all possible graphical models. A second prior distribution for the graph is given by a truncated Poisson distribution on the graph degree, $\mathrm{degree}(G) \sim TP(\gamma)$, with the probabilities of the graphs proportional to
$$p(G) \propto \frac{\gamma^{|E|}}{|E|!}, \quad \text{for each } G \in \mathcal{G},$$
where $|E|$ is the number of edges in the graph $G$. For simplicity in computing the death rates, we can set $\gamma = \beta_0$, where $\beta_0$ is the birth rate in our BDMCMC algorithm.

For the prior distribution of the precision matrix, we use the G-Wishart distribution. The G-Wishart distribution is extremely attractive, since it is the conjugate prior for normally distributed data, and it conveniently places no probability mass on the absent edges of the graph. A zero-constrained random matrix $K \in \mathcal{P}_G$ has the G-Wishart distribution $W_G(b, D)$ if
$$p(K \mid G) = \frac{1}{I_G(b, D)} |K|^{(b-2)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(DK) \right\},$$
where $b > 2$ is the degrees of freedom, $D$ is a symmetric positive definite matrix, and $I_G(b, D)$ is the normalizing constant, namely
$$I_G(b, D) = \int_{\mathcal{P}_G} |K|^{(b-2)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(DK) \right\} dK.$$
Hence, conditional on a specific graph and an observed dataset $\mathbf{x}$, the posterior distribution of $K$ is
$$p(K \mid \mathbf{x}, G) = \frac{1}{I_G(b^*, D^*)} |K|^{(b^*-2)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(D^* K) \right\},$$
where $b^* = b + n$ and $D^* = D + S$. This posterior distribution is also G-Wishart, $W_G(b^*, D^*)$.
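For later reference, the G-Wishart log-density up to the intractable normalizing constant $I_G(b, D)$ can be written as a short R sketch; passing $b^*$ and $D^*$ evaluates the posterior instead of the prior. The function name is ours.

```r
# Sketch: unnormalized log-density of W_G(b, D); K must satisfy the zero
# constraints of G (the term -log I_G(b, D) is omitted).
log_gwishart_unnorm <- function(K, b, D) {
  as.numeric(((b - 2) / 2) * determinant(K, logarithm = TRUE)$modulus -
               sum(diag(D %*% K)) / 2)
}
```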

δ ξ (K) = p(g ξ, K ξ x) β 0 b ξ (k ξ ; K ξ ) p(g, K x) = I ( G(b, D) K ξ ) (b 2)/2 exp { 1 2 tr(d K ξ ) } I G ξ(b, D) K exp { 1 2 tr(d K) } β 0b ξ (k ξ ; K ξ ) = I ( G(b, D) K ξ ) (b 2)/2 { exp 1 } I G ξ(b, D) K 2 tr(d (K ξ K)) β 0 b ξ (k ξ ; K ξ ). For the proposal density, b ξ (k ξ ; K ξ ), according to the G-Wishart distribution property, we propose a Normal distribution k ξ N( d ij d k ii, k ii jj d ), for each ξ E, jj which d ij is the (i, j) entry of matrix D. For computing above death rates, we need to compute the ratio of normalizing constants of G-Wishart distribution. There is no direct way to obtain the exact value of the normalizing constant. This is the biggest computational bottleneck not only in our Bayesian approach, but also in the Bayesian Gaussian graphical model literature; see Atay-Kayis and Massam (2005), Lenkoski and Dobra (2011), and Dobra et al. (2011). To approximate the normalizing constant of the G-Wishart distribution, Atay-Kayis and Massam (2005) proposed a Monte Carlo method based on the Cholesky decomposition, described below. Let G = (V, E) be an arbitrary Gaussian graphical model with precision matrix K that K W G (b, D). According to Cholesky decomposition we have K = T Ψ ΨT that Ψ = (ψ ij ) p p and D 1 = T T. Then, for i = 1,..., p, ψ 2 ii has the chi-square distribution with b + ν i degrees of freedom and, for (i, j) E, ψ ij has standard Normal distribution that all are mutually independent. For the ψ ij that (i, j) E, are well-defined functions of Ψ ν, as ( j 1 ψ ij = ψ ik h kj 1 i 1 i ( j ψ rl h li) ψ rl h li ), ψ ii k=i r=1 l=r l=r where h ij = t ij /t jj and t ij being the (i, j) entry of matrix T. In particular, for i = 1 and (1, j) E j 1 ψ 1j = ψ 1k h kj. k=1 According to Atay-Kayis and Massam (2005), the normalizing constant of G-Wishart distribution W G (b, D) is I G (b, D) = C b,t E G [f T (Ψ ν )], 9

According to Atay-Kayis and Massam (2005), the normalizing constant of the G-Wishart distribution $W_G(b, D)$ is
$$I_G(b, D) = C_{b,T}\, E_G[f_T(\Psi^\nu)],$$
where
$$f_T(\Psi^\nu) = \exp\left\{ -\frac{1}{2} \sum_{\xi \in \bar{E}} \psi^2_\xi \right\}$$
and
$$C_{b,T} = 2^{\,pb/2 + \sum_{i=1}^p \nu_i} \prod_{i=1}^{p} \pi^{\nu_i/2}\, \Gamma\left( \frac{b + \nu_i}{2} \right) t_{ii}^{\,b + \tau_i},$$
in which $\nu_i$ is the number of neighbors of node $i$ subsequent to it in the ordering of the vertices, and $\tau_i$ is the total number of neighbors of node $i$. We can approximate $E_G[f_T(\Psi^\nu)]$ following Atay-Kayis and Massam (2005), as below.

Algorithm 3.1. Monte Carlo method. Given an arbitrary graph $G = (V, E)$:

1. Sample $\Psi$ following Steps 1, 2, 3 and 4 in Section 4.2 of Atay-Kayis and Massam (2005).
2. Compute $f^{(k)}_T(\Psi^\nu) = \exp\left\{ -\frac{1}{2} \sum_{\xi \in \bar{E}} (\psi^{(k)}_\xi)^2 \right\}$, for $k = 1, \ldots, N$ iterations.
3. Compute
$$E_G[f_T(\Psi^\nu)] \approx \frac{1}{N} \sum_{k=1}^{N} \exp\left\{ -\frac{1}{2} \sum_{\xi \in \bar{E}} (\psi^{(k)}_\xi)^2 \right\}.$$

With some computation, we can write the ratio of normalizing constants, for each $\xi = (i, j) \in E$, as
$$\frac{I_G(b, D)}{I_{G^{-\xi}}(b, D)} = 2\sqrt{\pi}\; t_{ii}\, t_{jj}\; \frac{\Gamma\left( \frac{b + \nu_i}{2} \right)}{\Gamma\left( \frac{b + \nu_i - 1}{2} \right)}\; \frac{E_G[f_T(\Psi^\nu)]}{E_{G^{-\xi}}[f_T(\Psi^\nu)]}. \qquad (5)$$
As a result, we can write the death rates as
$$\delta_\xi(K) = 2\sqrt{\pi}\; t_{ii}\, t_{jj}\; \frac{\Gamma\left( \frac{b + \nu_i}{2} \right)}{\Gamma\left( \frac{b + \nu_i - 1}{2} \right)}\; \frac{E_G[f_T(\Psi^\nu)]}{E_{G^{-\xi}}[f_T(\Psi^\nu)]} \left( \frac{|K^{-\xi}|}{|K|} \right)^{(b^*-2)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\big(D^*(K^{-\xi} - K)\big) \right\} \beta_0\, b_\xi(k_\xi; K^{-\xi}), \qquad (6)$$
in which the ratio of expectations is computed by Algorithm 3.1. We should mention that the BDMCMC algorithm with these death rates is slow (see example 4.1) and is therefore not suitable for high-dimensional cases.
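Only the ratio of expectations in (5) requires Monte Carlo; the remaining factor is available in closed form. A sketch in R, under our reconstruction of (5) above and with illustrative names:

```r
# Sketch: the closed-form factor of (5), optionally multiplied by a Monte
# Carlo estimate of E_G[f_T] / E_{G^-xi}[f_T] (mc_ratio, from Algorithm 3.1);
# lgamma is used for numerical stability.
norm_const_ratio <- function(b, nu_i, t_ii, t_jj, mc_ratio = 1) {
  2 * sqrt(pi) * t_ii * t_jj *
    exp(lgamma((b + nu_i) / 2) - lgamma((b + nu_i - 1) / 2)) * mc_ratio
}
```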

Figure 2: The ratio of expectations in (5) for graphs with the same structure, as the dimension of the graph (the number of nodes) increases. The plot shows that for high-dimensional cases this ratio of expectations converges to one.

3.3 Computing death rates for high-dimensional cases

Computing the ratio of normalizing constants is the main computational bottleneck, not only in our proposed BDMCMC algorithm but also in other Bayesian approaches in this area. For example, Wang and Li (2012) proposed a double Metropolis-Hastings algorithm to avoid the computationally expensive normalizing constants, which made their algorithm much faster. Nevertheless, instead of computing the ratio of normalizing constants, their methodology requires sampling from the precision matrix, which is time-consuming in high-dimensional cases.

Here, we propose a simple way to compute the ratio of normalizing constants, which makes sense especially for high-dimensional cases. We can write the ratio of normalizing constants according to (5), which includes a ratio of expectations. Fortunately, our results indicate that for high-dimensional cases this ratio of expectations converges to one. To show that this assumption makes sense in the high-dimensional case, we computed this ratio of expectations, according to Algorithm 3.1, for graphs with the same structure while increasing the dimension of the graph ($p$). The result is shown in figure 2: the ratio of expectations in (5) converges to one for high-dimensional cases. Note that this ratio does not depend on the data. Therefore, for high-dimensional graphs, the death rates become
$$\delta_\xi(K) = 2\sqrt{\pi}\; t_{ii}\, t_{jj}\; \frac{\Gamma\left( \frac{b + \nu_i}{2} \right)}{\Gamma\left( \frac{b + \nu_i - 1}{2} \right)} \left( \frac{|K^{-\xi}|}{|K|} \right)^{(b^*-2)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}\big(D^*(K^{-\xi} - K)\big) \right\} \beta_0\, b_\xi(k_\xi; K^{-\xi}). \qquad (7)$$
In simulation examples 4.1 and 4.2 we show that our proposed BDMCMC algorithm with these death rates is fast and accurate for high-dimensional cases.
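Putting the pieces together, the high-dimensional death rate (7) can be sketched in R as below, reusing kill_edge() from the sketch in subsection 2.2; as before, the names and interfaces are illustrative and do not correspond to the internals of the BDgraph package.

```r
# Sketch of death rate (7): bstar = b + n, Dstar = D + S, t_ii and t_jj from
# the Cholesky factor T of D^{-1}, nu_i as in Section 3.2; the Monte Carlo
# ratio of expectations is set to one.
death_rate_hd <- function(K, i, j, b, nu_i, t_ii, t_jj, bstar, Dstar, beta0) {
  K_minus <- kill_edge(K, i, j)
  log_det <- ((bstar - 2) / 2) *
    as.numeric(determinant(K_minus, logarithm = TRUE)$modulus -
                 determinant(K, logarithm = TRUE)$modulus)
  log_tr  <- -sum(diag(Dstar %*% (K_minus - K))) / 2
  b_xi    <- dnorm(K[i, j],                         # proposal density b_xi
                   mean = -Dstar[i, j] / Dstar[j, j] * K[i, i],
                   sd   = sqrt(K[i, i] / Dstar[j, j]))
  2 * sqrt(pi) * t_ii * t_jj *
    exp(lgamma((b + nu_i) / 2) - lgamma((b + nu_i - 1) / 2) +
          log_det + log_tr) * beta0 * b_xi
}
```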

3.4 Sampling from the precision matrix

In our proposed BDMCMC algorithm, we need to sample from the conditional distribution of the precision matrix. Several methods have been proposed for sampling a precision matrix under a G-Wishart distribution: block Gibbs sampling (Wang and Li (2012)), a Metropolis-Hastings method (Mitsakakis et al. (2011)), and an accept-reject algorithm (Wang and Carvalho (2010)). Wang and Li (2012) review all existing methods and show that the block Gibbs sampler generally outperforms the other proposed methods. Here, we briefly review the block Gibbs sampler.

Let $G = (V, E)$ be an arbitrary graph with precision matrix $K \sim W_G(b, D)$, and let $l \subset V$ denote a complete subset of the graph. Roverato (2002, Lemma 1) shows that for any such complete subset $l$,
$$K_{l,l} - c \mid K \setminus K_{l,l} \sim W(b + p - |l|,\ D_{l,l}), \qquad (8)$$
where $c = K_{l, V \setminus l} (K_{V \setminus l, V \setminus l})^{-1} K_{V \setminus l, l}$, $|l|$ is the size of the complete subset $l$, and $W$ denotes a standard Wishart distribution. Following Wang and Li (2012), we can summarize the block Gibbs sampler as follows.

Algorithm 3.2. Block Gibbs sampler. Given an arbitrary graph $G = (V, E)$, construct a sequence of complete subsets $l = \{l_k\}$, with $l_k \subset V$, such that the union of the $l_k$ covers the edge set $E$; then, for each $l$ in turn:

1. Generate $A \sim W(b + p - |l|,\ D_{l,l})$.
2. Set $K_{l,l} = A + K_{l, V \setminus l} (K_{V \setminus l, V \setminus l})^{-1} K_{V \setminus l, l}$.

For an arbitrary graph $G$, the choice of the complete subsets $l$ is not necessarily unique. Lenkoski and Dobra (2011) proposed the special case where $l$ is the collection of maximal cliques; this requires an algorithm for maximal clique decomposition, which is computationally expensive. The other extreme is proposed by Wang and Li (2012), where $l$ is the collection of edges in $E$; they call this the edgewise block Gibbs sampler.
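The edgewise version of Algorithm 3.2 can be sketched in R as follows. One caveat on conventions: the Wishart $W(b', D_{l,l})$ in (8) has density proportional to $|A|^{(b'-2)/2} \exp\{-\frac{1}{2}\mathrm{tr}(D_{l,l} A)\}$, which under our reading corresponds to stats::rWishart with df $= b' + |l| - 1$ and scale $D_{l,l}^{-1}$; with $|l| = 2$ and $b' = b + p - 2$, this gives df $= b + p - 1$. This mapping is an assumption worth verifying against the conventions in Wang and Li (2012).

```r
# Sketch of the edgewise block Gibbs sampler (Algorithm 3.2): one sweep over
# the edges of G, drawing each 2x2 block from its conditional Wishart.
block_gibbs_sweep <- function(K, adj, b, D) {
  p <- nrow(K)
  for (i in 1:(p - 1)) for (j in (i + 1):p) {
    if (adj[i, j] == 1) {
      l    <- c(i, j)
      rest <- setdiff(1:p, l)
      A    <- rWishart(1, b + p - 1, solve(D[l, l]))[, , 1]  # see caveat above
      K[l, l] <- A + K[l, rest] %*% solve(K[rest, rest]) %*% K[rest, l]
    }
  }
  K
}
```

For sampling from the posterior, the same sweep is used with $b^*$ and $D^*$ in place of $b$ and $D$.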

4 Statistical performance of the proposed methodology

In this section we present the results of analyses of one real dataset and two simulated datasets, covering both high- and low-dimensional cases. All computations were carried out with an R package called BDgraph, freely available from the Comprehensive R Archive Network at http://CRAN.R-project.org/package=BDgraph. All timings were carried out on an Intel(R) Core(TM) i5 CPU 2.67GHz processor.

4.1 Simulation example 1: Graph with 8 nodes

We consider a graph with 8 nodes, for which there are more than 250 million possible graphical models. We assume the true graphical model is
$$\mathcal{M}_G = \{ N_8(0, \Sigma) \mid K = \Sigma^{-1} \in \mathcal{P}_G \},$$
in which the precision matrix has $k_{ii} = 1$, $k_{i,i+1} = k_{i+1,i} = 0.5$, $k_{18} = k_{81} = 0.4$, and $k_{ij} = 0$ otherwise; showing the upper triangle of the symmetric matrix,

K =
  1   0.5  0    0    0    0    0    0.4
       1   0.5  0    0    0    0    0
            1   0.5  0    0    0    0
                 1   0.5  0    0    0
                      1   0.5  0    0
                           1   0.5  0
                                1   0.5
                                     1

We sample from the true graph with $n = 100$. For the prior distribution of the graph we place a uniform distribution, and for the prior distribution of the precision matrix we use the G-Wishart distribution $W_G(3, I_8)$.

First, we run the BDMCMC algorithm (Algorithm 2.1) with death rates according to (7), for 10000 iterations with the first 5000 iterations as burn-in; this takes 55 seconds. We calculate the posterior edge-inclusion probabilities as
$$\hat{p}_\xi = \frac{\sum_{t=1}^{N} I(\xi \in G^{(t)})\, \lambda(K^{(t)})}{\sum_{t=1}^{N} \lambda(K^{(t)})}, \quad \text{for each } \xi \in \mathcal{W}, \qquad (9)$$
where $N$ is the number of iterations, $I(\xi \in G^{(t)})$ is the indicator function, equal to 1 if $\xi \in G^{(t)}$ and zero otherwise, and $\lambda(K^{(t)})$ is the waiting time in the graph $G^{(t)}$ with precision matrix $K^{(t)}$; see figure 1. By this formula, the posterior mean estimates for all edges $\xi = (i, j) \in \mathcal{W}$ are (upper triangle)

p̂ =
  1   1    0.03 0.06 0.02 0.02 0.03 1
       1   1    0.04 0.03 0.02 0.03 0.03
            1   1    0.06 0.04 0.06 0.03
                 1   1    0.05 0.04 0.03
                      1   1    0.05 0.13
                           1   1    0.13
                                1   1
                                     1

Moreover, the posterior probability of the true graph is 0.36, making it the most probable graph.
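This example can be reproduced along the following lines with the BDgraph package. The sketch follows the interface of recent CRAN releases (bdgraph.sim, bdgraph, plinks), which may differ in detail from the version described in this paper.

```r
# Hedged sketch of Example 4.1 with BDgraph; argument names follow recent
# CRAN documentation and are not guaranteed to match the original release.
library(BDgraph)
sim <- bdgraph.sim(n = 100, p = 8, graph = "circle")   # true graph as above
fit <- bdgraph(data = sim$data, iter = 10000, burnin = 5000)
plinks(fit)                                            # estimates of (9)
```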

With the output of the BDMCMC algorithm we can also estimate the covariance matrix and the precision matrix. The estimated precision matrix is (upper triangle)

K̂ =
  1.3  0.6  0    0    0    0    0    0.5
        1.4  0.5  0    0    0    0    0
             1    0.5  0    0    0    0
                  1.2  0.6  0    0    0
                       1.3  0.4  0    0
                            0.9  0.5  0
                                 0.9  0.4
                                      1

Figure 3: (Left) Posterior distribution of the graphs as a function of their number of edges. (Right) Cumulative occupancy fractions of all possible edges, used to check the convergence of our BDMCMC algorithm; the algorithm converges after roughly 4000 iterations.

The left panel of figure 3 shows the estimated posterior distribution of the graphs as a function of their number of edges. The figure shows that the posterior probability of most graphs is zero. Furthermore, in terms of the number of edges, the most probable graphs are those with 8 edges, with total probability 0.36; this probability includes the probability of the true graph, 0.35, which is quite reasonable.

In comparison with other Bayesian methodologies in this area, such as RJMCMC, one advantage of our proposed BDMCMC algorithm is its fast convergence. A useful check of convergence is given by the plot of the cumulative occupancy fraction of each edge against the iteration number, shown in the right panel of figure 3. As the figure shows, our BDMCMC algorithm converges after roughly 4000 iterations, so our burn-in of 5000 iterations is more than adequate to achieve stability in the occupancy fractions.
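The occupancy-fraction diagnostic in the right panel of figure 3 amounts to the following one-line computation for each edge; graphs is again the list of visited adjacency matrices (an illustrative name).

```r
# Sketch: cumulative occupancy fraction of edge (i, j) across iterations.
occupancy <- function(graphs, i, j) {
  cumsum(vapply(graphs, function(g) g[i, j], numeric(1))) / seq_along(graphs)
}
```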

Effect of sample size. To check the sensitivity of our BDMCMC algorithm to the sample size, we run the algorithm in the same setting but with different numbers of observations. The results are shown in table 1. The first row of the table gives the number of observations; the second row gives the posterior probability of the true graph for each number of observations; the third row (false discovery) gives the number of true zeros estimated as non-zero; and the fourth row (false negative) gives the number of off-diagonal non-zero elements estimated as zero. Table 1 shows that our BDMCMC algorithm is sensitive to the number of observations: as the number of observations increases, the results of the BDMCMC algorithm become more accurate.

Table 1: Simulation results for different numbers of observations, for the graph with 8 nodes. The accuracy of the BDMCMC algorithm depends on the number of observations; for 30 or more observations the algorithm selects the true graph as the best graph.

n                       20     30     40     60    80    100   150   200
p(true graph | data)    0.018  0.067  0.121  0.2   0.22  0.35  0.43  0.49
false discovery         1      0      0      0     0     0     0     0
false negative          0      0      0      0     0     0     0     0

Sensitivity to the priors. To evaluate the sensitivity of the BDMCMC algorithm to the prior distributions, we first check the results for different values of $b$, the degrees-of-freedom parameter of the G-Wishart prior $W_G(b, D)$ on the precision matrix. We then evaluate the results for different prior distributions on the graph: in this example we placed a uniform prior distribution on the graph, and for comparison we also check the results under the truncated Poisson distribution. Table 2 shows the results for different values of $b$; they indicate that our BDMCMC algorithm is not very sensitive to the value of $b$. Table 3 shows the results for different values of $\beta_0$ (the birth rate, which here is also the rate of the truncated Poisson prior, $\mathrm{degree}(G) \sim TP(\beta_0)$).

Table 2: Simulation results for different values of $b$ in $W_G(b, D)$, for the graph with 8 nodes. The results of our proposed BDMCMC algorithm are almost insensitive to the value of $b$.

b                       3     10    20    50
p(true graph | data)    0.35  0.39  0.33  0.25

Table 3: Simulation results for different values of $\beta_0$ in $\mathrm{degree}(G) \sim TP(\beta_0)$, for the graph with 8 nodes. The results of our proposed BDMCMC algorithm are almost insensitive to the value of $\beta_0$.

beta_0                  1     5     8     10    20    50
p(true graph | data)    0.52  0.37  0.33  0.33  0.30  0.22

According to the results in tables 1, 2 and 3, our BDMCMC algorithm improves with sample size but is not very sensitive to the prior distributions.

Comparison of the two different death rates. To compare and check the accuracy of our BDMCMC algorithm under the two proposed death rates, (6) for low-dimensional and (7) for high-dimensional cases, we also run the BDMCMC algorithm in the same setting with death rates according to (6). The results are given in table 4, which shows the CPU time in minutes and the accuracy of the BDMCMC algorithm (with death rates (6)) as a function of the number of iterations of the Monte Carlo approximation in Algorithm 3.1.

Table 4: Results of our proposed BDMCMC algorithm with the death rates of Section 3.2, i.e. (6), for different numbers of Monte Carlo iterations, for the graph with 8 nodes. With 100 or more MC iterations the BDMCMC algorithm is accurate, but it is not fast.

MC iterations           1      10    50    100   200   500
CPU time (min.)         3      29    140   270   460   910
p(true graph | data)    0.003  0.20  0.22  0.41  0.43  0.44
false discovery         2      0     0     0     0     0
false negative          0      0     0     0     0     0

The example shows that the results of the BDMCMC algorithm with death rates according to (7) are almost the same as with death rates according to (6); the main difference is computation time. The BDMCMC algorithm with death rates according to (7) gives almost the same result in less than one minute, whereas table 4 shows that the algorithm with death rates according to (6) is slow and therefore not suitable for high-dimensional problems. The BDMCMC algorithm with death rates according to (7), on the other hand, is fast and accurate, so we can use it for high-dimensional problems.

4.2 Simulation example 2: Graph with 120 nodes

To check the accuracy of the BDMCMC algorithm for high-dimensional problems, we consider a sparse circle graph with $p = 120$. We assume the true graphical model is given by
$$\mathcal{M}_G = \{ N_{120}(0, \Sigma) \mid K = \Sigma^{-1} \in \mathcal{P}_G \},$$
in which the elements of the precision matrix are $k_{ii} = 1$, $k_{ij} = 0.5$ for $|i - j| = 1$, $k_{1p} = k_{p1} = 0.4$, and $k_{ij} = 0$ otherwise. We sample from the true graphical model with $n = 1000$. We place a uniform prior distribution on the graph and the G-Wishart prior distribution $W_G(3, I_{120})$ on the precision matrix $K$. We run the BDMCMC algorithm with death rates according to (7) for 10000 iterations, with the first 5000 iterations as burn-in. It takes only 190 minutes, which shows that the algorithm is fast and outperforms other Bayesian approaches in this area.

The posterior probability of the true graph is 0.4, making it the most probable graphical model. The posterior edge-inclusion probabilities are calculated for each $\xi \in \mathcal{W}$ according to (9): every true edge has estimated inclusion probability 1, while the highest estimated probability among the absent edges is 0.047, which is quite reasonable. Figure 4 shows the estimated posterior distribution of the graphs as a function of their number of edges.

Figure 4: Posterior distribution of the graphs as a function of their number of edges.

4.3 Real example: Cell signaling data

Here we consider a flow cytometry dataset with 11 proteins from Sachs et al. (2005). Using Bayesian network inference, they fitted a directed acyclic graph (DAG) to the data, producing the network shown in the right panel of figure 5. Friedman et al. (2008) applied undirected graph estimation via the graphical lasso to the same data, for different values of the penalty parameter.

In our Bayesian approach, we place a uniform prior distribution on the graph and the G-Wishart prior distribution $W_G(3, I_{11})$ on the precision matrix $K$. We run the BDMCMC algorithm with death rates according to (7) for 10000 iterations, with the first 5000 iterations as burn-in; running the algorithm takes less than 2 minutes. The left panel of figure 5 shows the graph with the highest posterior probability, which is 0.125; this graph has 31 edges.

Figure 5: (Left) Cell-signaling dataset: the most probable undirected graphical model according to the results of the BDMCMC algorithm. (Right) The result from Sachs et al. (2005): the directed graph for the cell-signaling dataset, obtained by Bayesian network inference.

Moreover, according to (9), the posterior mean estimates for all edges $\xi = (i, j) \in \mathcal{W}$ are (upper triangle)

p̂ =
  1   1    1    0.05 1    1    1    0.67 0.28 0.14 1
       1   1    0.01 1    1    1    0.96 0.04 1    1
            1   1    1    1    1    0.93 0.99 1    1
                 1   1    0.01 0.01 0.00 0.11 0.01 0.01
                      1   0.00 0.04 0.00 0.00 0.00 0.10
                           1   1    1    0.01 0.07 0.01
                                1   1    0.02 1    0.00
                                     1   0.05 0.94 0.00
                                          1   1    1
                                               1   1
                                                    1

5 Discussion

In this article we have proposed a Bayesian method for determining Gaussian conditional independence graphs, based on birth-death MCMC inference. We derived the conditions under which the balance conditions of the birth-death MCMC methodology hold, and in accordance with those conditions we proposed a convenient BDMCMC algorithm. If we use the exact death rates (6), we show in example 4.1 that the BDMCMC algorithm is accurate but not fast. Our proposed BDMCMC algorithm with death rates (7), however, is fast, and for high-dimensional problems its approximation actually improves: a so-called blessing of dimensionality. Our examples demonstrate that a scalable Bayesian inference methodology exists which, precisely in the case of large graphs, is able to distinguish important edges from irrelevant ones and to detect the true model with high accuracy.

The resulting graphical model is reasonably robust to the modelling assumptions and the priors used. A possible disadvantage of our BDMCMC algorithm is that the birth rates are constant across the edges, so that the algorithm relies on the death rates to converge to the most probable graph. However, this feature has the advantage of allowing fast mixing across the model space. Other Bayesian approaches in this area, such as the RJMCMC of Giudici and Green (1999), mix very slowly, since they randomly pick new edges but only add them if they are consistent with the data. Our algorithm therefore has the advantage of being able to mix quickly across the full model space, especially for high-dimensional problems.

There are several conceptual extensions. For our BDMCMC methodology, we proposed a uniform or a truncated Poisson prior for the conditional independence graph and a G-Wishart prior for the precision matrix. One can also use different prior distributions for both the graph and the precision matrix, as was done in Wong et al. (2003), Chan and Jeliazkov (2009) and Wang and Pillai (2011). Furthermore, our methodology is general for any type of graphical model and does not rely on the normality of the variables; we can therefore use it for other families of graphical models. We hope this work opens a window for new developments in MCMC approaches to efficient inference for general, high-dimensional graphical models.

A Appendix: The proof of Theorem 2.1

Our proof of Theorem 2.1 is based on the theory derived by Preston (1976, Sections 7 and 8). Preston proposed a special birth-death process in which the birth and death rates depend on the positions of the individuals in the underlying space. The process evolves by jumps, of which only a finite number can occur in a finite time. The jumps are of two types: a birth is defined as the appearance of a single individual, whereas a death is the removal of a single individual. By considering the solution of the backward Kolmogorov equation, Preston (1976, Theorem 7.1) showed that under certain conditions the process exists and is temporally ergodic, that is, there exists a unique stationary distribution. He showed that if the detailed balance conditions hold, the birth-death process converges to a unique stationary distribution, which for our proposed method is the joint posterior distribution of the graph and the precision matrix.

Before we derive the detailed balance conditions for our proposed BDMCMC algorithm, we introduce some notation. Assume the process is at state $\mathcal{M}_G$, where $G = (V, E)$, with precision matrix $K \in \mathcal{P}_G$. The behavior of the process is defined by the birth rates $\beta_\xi(K)$, the death rates $\delta_\xi(K)$, and the birth and death transition kernels $T^G_{\beta_\xi}(K; \cdot)$ and $T^G_{\delta_\xi}(K; \cdot)$. For each $\xi \in \bar{E}$, $T^G_{\beta_\xi}(K; \cdot)$ denotes the probability that the process jumps from state $\mathcal{M}_G$ to a point in the new state $\mathcal{M}_{G^{+\xi}}$. Hence, if $\mathcal{F} \subseteq \mathcal{P}_{G^{+\xi}}$, then we have

$$T^G_{\beta_\xi}(K; \mathcal{F}) = \frac{\beta_\xi(K)}{\beta(K)} \int_{\{k_\xi :\, K \cup k_\xi \in \mathcal{F}\}} b_\xi(k_\xi; K)\, dk_\xi. \qquad (10)$$

Likewise, for each $\xi \in E$, $T^G_{\delta_\xi}(K; \cdot)$ denotes the probability that the process jumps from state $\mathcal{M}_G$ to a point in the new state $\mathcal{M}_{G^{-\xi}}$. Therefore, if $\mathcal{F} \subseteq \mathcal{P}_{G^{-\xi}}$, then
$$T^G_{\delta_\xi}(K; \mathcal{F}) = \sum_{\eta \in E :\, K \setminus k_\eta \in \mathcal{F}} \frac{\delta_\eta(K)}{\delta(K)} = \frac{\delta_\xi(K)}{\delta(K)}\, I(K^{-\xi} \in \mathcal{F}). \qquad (11)$$

In our model, one specific way to satisfy the detailed balance conditions is to match the birth events from graph $G$ to all possible graphs with one more edge, and the death events from all possible graphs with one more edge back to graph $G$. This is described by the following definition; see also Preston (1976, equations 8.4 and 8.5).

Detailed balance conditions. In our birth-death process, $p(K, G \mid \mathbf{x})$ satisfies the detailed balance conditions if
$$\int_{\mathcal{F}} \beta(K)\, dp(K, G \mid \mathbf{x}) = \sum_{\xi \in \bar{E}} \int_{\mathcal{P}_{G^{+\xi}}} \delta(K^{+\xi})\, T^{G^{+\xi}}_{\delta_\xi}(K^{+\xi}; \mathcal{F})\, dp(K^{+\xi}, G^{+\xi} \mid \mathbf{x}) \qquad (12)$$
and
$$\int_{\mathcal{F}} \delta(K)\, dp(K, G \mid \mathbf{x}) = \sum_{\xi \in E} \int_{\mathcal{P}_{G^{-\xi}}} \beta(K^{-\xi})\, T^{G^{-\xi}}_{\beta_\xi}(K^{-\xi}; \mathcal{F})\, dp(K^{-\xi}, G^{-\xi} \mid \mathbf{x}), \qquad (13)$$
for all $\mathcal{F} \subseteq \mathcal{P}_G$. The first part (Eq. 12) means that the rate at which the process leaves the current graph through birth events is precisely matched by the rate at which the process enters this graph through all possible death events; the second part (Eq. 13) states the same with the roles of births and deaths interchanged.

To prove the first part of the detailed balance conditions (Eq. 12), we have for the left-hand side
$$\text{LHS} = \int_{\mathcal{F}} \beta(K)\, dp(G, K \mid \mathbf{x}) = \int_{\mathcal{P}_G} I(K \in \mathcal{F})\, \beta(K)\, dp(G, K \mid \mathbf{x}) = \sum_{\xi \in \bar{E}} \int_{\mathcal{P}_G} I(K \in \mathcal{F})\, \beta_\xi(K)\, dp(G, K \mid \mathbf{x}).$$
Since the proposal density integrates to one, $\int b_\xi(k_\xi; K)\, dk_\xi = 1$, this equals
$$\sum_{\xi \in \bar{E}} \int_{\mathcal{P}_G} I(K \in \mathcal{F})\, \beta_\xi(K) \left[ \int b_\xi(k_\xi; K)\, dk_\xi \right] dp(G, K \mid \mathbf{x}) = \sum_{\xi \in \bar{E}} \int I(K \in \mathcal{F})\, \beta_\xi(K)\, b_\xi(k_\xi; K)\, p(G, K \mid \mathbf{x})\, dk_\xi \prod_{\zeta \in \mathcal{V}} dk_\zeta.$$
For the right-hand side, by using Eq. (11) we have
$$\text{RHS} = \sum_{\xi \in \bar{E}} \int_{\mathcal{P}_{G^{+\xi}}} \delta(K^{+\xi})\, T^{G^{+\xi}}_{\delta_\xi}(K^{+\xi}; \mathcal{F})\, dp(G^{+\xi}, K^{+\xi} \mid \mathbf{x}) = \sum_{\xi \in \bar{E}} \int_{\mathcal{P}_{G^{+\xi}}} I(K \in \mathcal{F})\, \delta_\xi(K^{+\xi})\, dp(G^{+\xi}, K^{+\xi} \mid \mathbf{x})$$
$$= \sum_{\xi \in \bar{E}} \int I(K \in \mathcal{F})\, \delta_\xi(K^{+\xi})\, p(G^{+\xi}, K^{+\xi} \mid \mathbf{x})\, dk_\xi \prod_{\zeta \in \mathcal{V}} dk_\zeta.$$
Hence LHS = RHS if
$$\beta_\xi(K)\, b_\xi(k_\xi; K)\, p(G, K \mid \mathbf{x}) = \delta_\xi(K^{+\xi})\, p(G^{+\xi}, K^{+\xi} \mid \mathbf{x}),$$
which is equivalent to the condition in Theorem 2.1. In the same way, the second part of the detailed balance conditions, Eq. (13), can be shown to hold.

References

Atay-Kayis, A. and H. Massam (2005). A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika 92(2), 317–335.

Cappé, O., C. Robert, and T. Rydén (2003). Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 65(3), 679–700.

Chan, J. and I. Jeliazkov (2009). MCMC estimation of restricted covariance matrices. Journal of Computational and Graphical Statistics 18(2), 457–480.

Dempster, A. (1972). Covariance selection. Biometrics 28(1), 157–175.

Dobra, A., A. Lenkoski, and A. Rodriguez (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association 106(496), 1418–1433.

Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9(3), 432–441.

Geyer, C. and J. Møller (1994). Simulation procedures and likelihood inference for spatial point processes. Scandinavian Journal of Statistics 21(4), 359–373.

Giudici, P. and P. Green (1999). Decomposable graphical Gaussian model determination. Biometrika 86(4), 785–801.

Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 82(4), 711–732.

Jones, B., C. Carvalho, A. Dobra, C. Hans, C. Carter, and M. West (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science 20(4), 388–400.

Lauritzen, S. (1996). Graphical Models, Volume 17. Oxford University Press, USA.

Lenkoski, A. and A. Dobra (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. Journal of Computational and Graphical Statistics 20(1), 140–157.

Mitsakakis, N., H. Massam, and M. D. Escobar (2011). A Metropolis-Hastings based method for sampling from the G-Wishart distribution in Gaussian graphical models. Electronic Journal of Statistics 5, 18–30.

Preston, C. J. (1976). Special birth-and-death processes. Bull. Inst. Internat. Statist. 46, 371–391.

Ripley, B. (1977). Modelling spatial patterns. Journal of the Royal Statistical Society. Series B (Methodological) 39(2), 172–212.

Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics 29(3), 391–411.

Sachs, K., O. Perez, D. Pe'er, D. Lauffenburger, and G. Nolan (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science 308(5721), 523.

Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics 28(1), 40–74.

Wang, H. and C. Carvalho (2010). Simulation of hyper-inverse Wishart distributions for non-decomposable graphs. Electronic Journal of Statistics 4, 1470–1475.

Wang, H. and S. Li (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics 6, 168–198.

Wang, H. and N. Pillai (2011). On a class of shrinkage priors for covariance matrix estimation. arXiv preprint arXiv:1109.3409.

Whittaker, J. (1990). Graphical Models in Applied Multivariate Statistics, Volume 16. Wiley, New York.

Wong, F., C. Carter, and R. Kohn (2003). Efficient estimation of covariance selection models. Biometrika 90(4), 809–830.

Sachs, K., O. Perez, D. Pe er, D. Lauffenburger, and G. Nolan (2005). Causal protein-signaling networks derived from multiparameter single-cell data. Science s STKE 308 (5721), 523. Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components-an alternative to reversible jump methods. Annals of Statistics 28 (1), 40 74. Wang, H. and C. Carvalho (2010). Simulation of hyper-inverse wishart distributions for non-decomposable graphs. Electronic Journal of Statistics 4, 1470 1475. Wang, H. and S. Li (2012). Efficient gaussian graphical model determination under g-wishart prior distributions. Electronic Journal of Statistics 6, 168 198. Wang, H. and N. Pillai (2011). On a class of shrinkage priors for covariance matrix estimation. Arxiv preprint arxiv:1109.3409. Whittaker, J. (1990). Graphical models in applied multivariate statistics, Volume 16. Wiley New York. Wong, F., C. Carter, and R. Kohn (2003). Efficient estimation of covariance selection models. Biometrika 90 (4), 809 830. 23