Bayesian Model Determination in Complex Systems


Bayesian Model Determination in Complex Systems

PhD thesis

to obtain the degree of PhD at the University of Groningen
on the authority of the Rector Magnificus Prof. E. Sterken
and in accordance with the decision by the College of Deans.

This thesis will be defended in public on 24 April 2015

by

Abdolreza Mohammadi

born on 6 February 1982 in Shiraz, Iran

Supervisor: Prof. E. C. Wit

Assessment Committee: Prof. N. Friel, Prof. N. Petkov, Prof. G. Kauermann

ISBN:

To my wife and my family


Contents

Chapter 1: Introduction
  1.1 Motivation
  1.2 Bayesian model determination in graphical models
  1.3 Trans-dimensional Markov chain Monte Carlo
  1.4 Outline of thesis contribution
  References

Chapter 2: Bayesian Structure Learning in Sparse Gaussian Graphical Models
  2.1 Abstract
  2.2 Introduction
  2.3 Bayesian Gaussian graphical models
  2.4 The birth-death MCMC method
    2.4.1 Proposed BDMCMC algorithm
    2.4.2 Step 1: Computing the birth and death rates
    2.4.3 Step 2: Direct sampler from precision matrix
  2.5 Statistical performance
    2.5.1 Simulation study
    2.5.2 Application to human gene expression data
    2.5.3 Extension to time course data
  2.6 Discussion
  References

Chapter 3: Bayesian Modelling of Dupuytren Disease Using Gaussian Copula Graphical Models
  3.1 Abstract
  3.2 Introduction
  3.3 Methodology
    3.3.1 Gaussian graphical models
    3.3.2 Gaussian copula graphical models
    3.3.3 Bayesian Gaussian copula graphical models
  3.4 Analysis of Dupuytren disease dataset
    3.4.1 Inference for Dupuytren disease with risk factors
    3.4.2 Severity of Dupuytren disease between pairs of fingers
    3.4.3 Fit of model to Dupuytren data
  3.5 Conclusion
  References

Chapter 4: BDgraph: An R Package for Bayesian Structure Learning in Graphical Models
  4.1 Abstract
  4.2 Introduction
  4.3 User interface
  4.4 Methodological background
    4.4.1 Bayesian Gaussian graphical models
    4.4.2 Gaussian copula graphical models
  4.5 The BDgraph environment
    4.5.1 Description of the bdgraph function
    4.5.2 Description of the bdgraph.sim function
    4.5.3 Description of the plotcoda and traceplot functions
    4.5.4 Description of the phat and select functions
    4.5.5 Description of the compare and plotroc functions
  4.6 User interface by toy example
    4.6.1 Running the BDMCMC algorithm
    4.6.2 Convergence check
    4.6.3 Comparison and goodness-of-fit
  4.7 Application to real data sets
    4.7.1 Application to Labor force survey data
    4.7.2 Application to Human gene expression
  4.8 Conclusion
  References

Chapter 5: Bayesian Modelling of the Exponential Random Graph Models with Non-observed Networks
  5.1 Abstract
  5.2 Introduction
  5.3 Exponential families and graphical models
    5.3.1 Exponential random graph models
    5.3.2 Graphical models
  5.4 Bayesian hierarchical model for ERGMs with non-observed networks
    5.4.1 Model for prior specification on graph
    5.4.2 Prior specification on precision matrix
    5.4.3 MCMC sampling scheme
  5.5 Discussion
  References

Chapter 6: Using Mixture of Gammas for Bayesian Analysis in an M/G/1 Queue with Optional Second Service
  6.1 Abstract
  6.2 Introduction
  6.3 The M/G/1 queuing system with optional second service
    6.3.1 Some performance measures in this queuing system
  6.4 Bayesian inference
    6.4.1 Bayesian inference for finite Gamma mixture
    6.4.2 BD-MCMC algorithm
    6.4.3 Model identification
    6.4.4 Predictive densities
    6.4.5 Estimation of some performance measures via the BD-MCMC output
  6.5 Simulations
    6.5.1 Simulation study: mixture of Gammas
    6.5.2 Simulation study: mixture of truncated Normal and Log-Normal
  6.6 Discussion and future directions
  References

Summary

Samenvatting


Chapter 1

Introduction

1.1 Motivation

One of the main challenges in modern science is to model complex systems. The difficulty of modeling complex systems lies partly in their topology: their components form rather complex networks. In neuroscience, for example, we are interested in understanding how the various regions of the brain interact with one another; in genetics, the cell is a network of chemicals linked by chemical reactions, and we are interested in modeling this network. From this perspective, our interest in modeling networks is part of a broader current of research on complex systems. The main contribution of this thesis is to develop Bayesian statistical methods that jointly model the underlying network (or graph) and the dependence structure among the variables in the system. Graphical models provide powerful tools for modeling and making statistical inferences about complex relationships among variables. Their key feature is the close relationship between their probabilistic properties and the topology of the underlying graph, which allows an intuitive understanding of complex systems.

1.2 Bayesian model determination in graphical models

In graphical models, nodes represent variables and edges represent pairwise dependencies, with the edge set defining the global conditional independence structure of the distribution. The methodological issues faced as the dimension grows include questions of the nature and consistency of prior specification: priors over the graph space, and over the parameters of any single, specified graph. The challenging problem is then to search the space of graphs for subsets of interest under the theoretically implied posterior distributions.

This represents a complex model selection problem. The analysis challenge is inherently one of model uncertainty and model selection: we wish to explore graph space and identify graphs that are most appropriate for a given dataset. Inference on variable dependencies and prediction is then based on parametric inference within the set of selected graphs. In the Bayesian paradigm, we need the posterior distribution of the graph G given the data,

$$\Pr(G \mid \text{data}) = \frac{\Pr(G)\,\Pr(\text{data} \mid G)}{\sum_{G' \in \mathcal{G}} \Pr(G')\,\Pr(\text{data} \mid G')}, \qquad (1.1)$$

in which $\mathcal{G}$ is the graph space. Computing this posterior distribution is computationally infeasible, since the denominator requires a sum over the entire graph space, which grows super-exponentially with the number of variables: a graph with p nodes has p(p-1)/2 possible edges, and hence there are $2^{p(p-1)/2}$ different possible graphs, corresponding to all combinations of individual edges being in or out of the graph. For example, a graph with only 10 variables already yields more than $3.5 \times 10^{13}$ possible graphical models. This motivates us to develop effective search algorithms for exploring graphical model uncertainty. To be accurate and scalable, the key is to design search algorithms which quickly move towards regions of high posterior probability and which take advantage of local computation. One solution is the trans-dimensional MCMC methodology (4); a quick back-of-the-envelope computation of the graph-space size is sketched below.
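To make the super-exponential growth concrete, the following short R sketch (R is the language used throughout this thesis via the BDgraph package; the helper names here are ours, not from the thesis) evaluates $2^{p(p-1)/2}$:

```r
# Number of possible graphs on p nodes: 2^(p(p-1)/2).
graph_space_size <- function(p) 2^(p * (p - 1) / 2)
graph_space_size(10)            # 2^45 = 35184372088832, i.e. > 3.5e13

# For larger p the count overflows double precision, so work on the log scale:
log10_graph_space <- function(p) (p * (p - 1) / 2) * log10(2)
log10_graph_space(50)           # about 369, i.e. ~10^369 graphs
```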

1.3 Trans-dimensional Markov chain Monte Carlo

This section focuses on Bayesian model selection via Markov chain Monte Carlo (MCMC) methods for what can be called trans-dimensional problems: those where the dynamic variable of the simulation, the unknowns in the Bayesian set-up, does not have fixed dimension. One conceptually elegant method is reversible jump MCMC (RJMCMC), proposed by (3) and also termed trans-dimensional MCMC, in which the model itself is conceived as another unknown and the MCMC algorithm is enlarged to allow jumps between all possible models. A prior is required over the model space, but with judicious selection of jumps the number of models does not need to be specified in advance and each model does not require separate estimation. These moves require (reversible) bridges to be built between the parameters of models of different dimensions. The posterior probability of a model is then estimated by the proportion of time that the particular model is accepted in the MCMC run. This method has been employed and discussed for model selection in many contexts. In particular, (2) used this approach for model selection in decomposable undirected Gaussian graphical models; more recently, (1, 6, 5) used it for the more general case of non-decomposable graphical models.

An alternative trans-dimensional MCMC approach is the birth-death MCMC (BDMCMC) algorithm, which is based on a continuous-time Markov birth-death process. In this method, the time between jumps to a larger dimension (birth) or a smaller one (death) is taken to be a random variable with a specific rate. The choice of birth and death rates determines the birth-death process and is made in such a way that the stationary distribution is precisely the posterior distribution of interest. Contrary to the RJMCMC approach, moves between models are always accepted, which makes the BDMCMC approach extremely efficient. In the context of finite mixture distributions with variable dimension, this method has been used by (15).

1.4 Outline of thesis contribution

In this thesis we consider the problem of Bayesian inference in the following statistical models: graphical models (Chapters 2, 3 and 4), exponential random graph models (Chapter 5) and queuing systems (Chapter 6).

In Chapter 2 we introduce a novel and efficient Bayesian framework for Gaussian graphical model determination, covering the theory and computational details of the proposed method. We carry out the posterior inference using an efficient sampling scheme, a trans-dimensional MCMC approach based on a birth-death process. It is easy to implement and computationally feasible for high-dimensional graphs. We show that our method outperforms alternative Bayesian approaches in terms of convergence and computing time. Unlike frequentist approaches, it gives a principled and, in practice, sensible approach to structure learning. We apply the method to large-scale real applications from human and mammary gland gene expression studies to show its empirical usefulness. The result of this chapter is published in (13).

The method proposed in Chapter 2 is limited to data satisfying the Gaussianity assumption. In Chapter 3 we propose a Bayesian approach for graphical

model determination based on a Gaussian copula approach that can deal with continuous, discrete, or mixed data. We embed a graph selection procedure inside a semi-parametric Gaussian copula, and carry out the posterior inference using an efficient sampling scheme, a trans-dimensional MCMC approach based on the birth-death process. We apply our approach to discovering potential risk factors related to Dupuytren disease. The contents of this chapter correspond to the manuscripts (7, 8) and the paper (11).

In Chapter 4 we introduce the R package BDgraph (12), which contains functions to perform Bayesian structure learning in high-dimensional graphical models with either continuous or discrete variables. This package efficiently implements the Bayesian approaches proposed in Chapters 2 and 3; the core of the BDgraph package is implemented in C++ to maximize computational speed. The contents of this chapter correspond to the manuscript (14).

In Chapter 5 we introduce a comprehensive Bayesian graphical modeling framework for new features of exponential random graph models (ERGMs). The method increases the range and applicability of the ERGM as a tool for statistical inference in network structure learning.

In Chapter 6 we introduce a Bayesian framework for an M/G/1 queuing system with an optional second service. A semi-parametric model based on a finite mixture of Gamma distributions is considered to approximate both the general service and re-service time densities in this queuing system. We estimate the system parameters, predictive densities and some performance measures related to this queuing system, such as the stationary system size and waiting time. The results of this chapter are published in (9, 10).

References

[1] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496).

[2] Giudici, P. and Green, P. (1999). Decomposable graphical Gaussian model determination. Biometrika, 86(4), 785-801.

[3] Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4), 711-732.

[4] Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. Oxford Statistical Science Series.

[5] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1).

[6] Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. Journal of Computational and Graphical Statistics, 20(1).

[7] Mohammadi, A., Abegaz Yazew, F., van den Heuvel, E., and Wit, E. C. (2015). Bayesian modeling of Dupuytren disease using Gaussian copula graphical models. arXiv preprint.

[8] Mohammadi, A., Abegaz Yazew, F., and Wit, E. C. (2014). Bayesian copula Gaussian graphical modelling. Proceedings of the 29th International Workshop on Statistical Modelling (IWSM 14), 1.

[9] Mohammadi, A., Salehi-Rad, M., and Wit, E. (2013). Using mixture of Gamma distributions for Bayesian analysis in an M/G/1 queue with optional second service. Computational Statistics, 28(2).

[10] Mohammadi, A. and Salehi-Rad, M. R. (2012). Bayesian inference and prediction in an M/G/1 with optional second service. Communications in Statistics - Simulation and Computation, 41(3).

[11] Mohammadi, A. and Wit, E. (2014). Contributed discussion on article by Finegold and Drton. Bayesian Analysis, 9(3).

[12] Mohammadi, A. and Wit, E. (2015a). BDgraph: Graph Estimation Based on Birth-Death MCMC Approach. R package.

[13] Mohammadi, A. and Wit, E. C. (2015b). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1), 109-138.

[14] Mohammadi, A. and Wit, E. C. (2015c). BDgraph: Bayesian structure learning of graphs in R. arXiv preprint.

[15] Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics, 28(1), 40-74.

Chapter 2

Bayesian Structure Learning in Sparse Gaussian Graphical Models¹

Abstract

Decoding complex relationships among large numbers of variables with relatively few observations is one of the crucial issues in science. One approach to this problem is Gaussian graphical modeling, which describes conditional independence of variables through the presence or absence of edges in the underlying graph. In this paper, we introduce a novel and efficient Bayesian framework for Gaussian graphical model determination: a trans-dimensional Markov chain Monte Carlo (MCMC) approach based on a continuous-time birth-death process. We cover the theory and computational details of the method. It is easy to implement and computationally feasible for high-dimensional graphs. We show that our method outperforms alternative Bayesian approaches in terms of convergence, mixing in the graph space and computing time. Unlike frequentist approaches, it gives a principled and, in practice, sensible approach to structure learning. We illustrate the efficiency of the method on a broad range of simulated data, and then apply it to large-scale real applications from human and mammary gland gene expression studies to show its empirical usefulness. In addition, we have implemented the method in the R package BDgraph, which is freely available on CRAN.

Key words: Bayesian model selection; Sparse Gaussian graphical models; Non-decomposable graphs; Birth-death process; Markov chain Monte Carlo; G-Wishart.

¹ Published as: Mohammadi, A. and E. C. Wit (2015). Bayesian Structure Learning in Sparse Gaussian Graphical Models, Bayesian Analysis, 10(1), 109-138.

2.2 Introduction

Statistical inference of complex relationships among large numbers of variables, based on a relatively small number of observations, appears in many circumstances. Biologists want to recover the underlying genomic network between thousands of genes from at most a few hundred observations; in market basket analysis, analysts try to find relationships among the purchases of individual customers from only a small number of transactions (17). One approach to these tasks is probabilistic graphical modeling (25), which is based on the conditional independencies between variables. Graphical models offer fundamental tools to describe the underlying conditional correlation structure, and they have recently gained popularity in both statistics and machine learning with the rise of high-dimensional data (22, 13, 31, 49, 15, 38, 52, 47, 48).

For the purpose of structure learning, Bayesian approaches provide a straightforward tool, explicitly incorporating the underlying graph uncertainty. In this paper, we focus on Bayesian structure learning in Gaussian graphical models for both decomposable and non-decomposable cases. Gaussian graphical model determination can be viewed as a covariance selection problem (11), where the non-zero off-diagonal entries of the precision matrix correspond to the edges of the graph. For a p-dimensional variable there are in total $2^{p(p-1)/2}$ possible conditional independence graphs; even with a moderate number of variables, the model space is astronomical in size. The methodological problem as the dimension grows is to search this graph space to identify regions of high posterior probability. High-dimensional regimes, such as genetic networks, have hundreds of nodes: already for p = 100 there are more than $10^{1490}$ possible graphs. This motivates us to construct an efficient search algorithm which explores the graph space to distinguish important edges from irrelevant ones and to detect the underlying graph with high accuracy. One solution is the trans-dimensional MCMC methodology (20).

In the trans-dimensional MCMC methodology, the MCMC algorithm explores the model space to identify models of high posterior probability and estimates their parameters simultaneously. A special case is the reversible-jump MCMC (RJMCMC) approach, proposed by Green (19). This method constructs an ergodic discrete-time Markov chain whose stationary distribution is the joint posterior distribution of the model and the parameters. The process transits among models using an acceptance probability which guarantees convergence to the target posterior distribution. If this probability is high, the process explores the model space efficiently; for the high-dimensional regime, however, this is not always the case. Giudici and Green (18) extended this method to decomposable Gaussian graphical models. Dobra et al. (13) developed it based on the Cholesky

decomposition of the precision matrix. Lenkoski (26), Wang and Li (49), and Cheng et al. (9) developed RJMCMC algorithms which combine the exchange algorithm (34) and the double Metropolis-Hastings algorithm (29) to avoid the intractable normalizing constant calculation.

An alternative trans-dimensional MCMC methodology is the birth-death MCMC (BDMCMC) approach, which is based on a continuous-time Markov process. In this method, the time between jumps to a larger dimension (birth) or a smaller one (death) is taken to be a random variable with a specific rate. The choice of birth and death rates determines the birth-death process, and is made in such a way that the stationary distribution is precisely the posterior distribution of interest. Contrary to the RJMCMC approach, moves between models are always accepted, which makes the BDMCMC approach extremely efficient. In the context of finite mixture distributions with variable dimension, this method has been used by (45), following earlier proposals by Ripley (39) and Geyer and Møller (16).

The main contribution of this paper is to introduce a novel Bayesian framework for Gaussian graphical model determination and to design a BDMCMC algorithm that performs both structure learning (graph estimation) and parameter learning (parameter estimation). In our BDMCMC method, we add or remove edges via birth or death events, modeled as independent Poisson processes; the time between two successive birth or death events therefore has an exponential distribution. The birth and death events occur in continuous time, and the relative rates at which they occur determine the stationary distribution of the process. The relationship between these rates and the stationary distribution is formalized in Section 2.4 (Theorem 2.1).

The outline of this paper is as follows. In Section 2.3, we introduce the notation and preliminary background material, such as suitable prior distributions for the graph and precision matrix. In Section 2.4, we propose our Bayesian framework and design our BDMCMC algorithm; this section also contains the specific implementation of our method, including an efficient way of computing the birth and death rates and a direct sampler from the G-Wishart distribution for the precision matrix. In Section 2.5, we show the performance of the proposed method in several comprehensive simulation studies and in large-scale real-world examples from human gene expression data and a mouse mammary gland microarray experiment.

2.3 Bayesian Gaussian graphical models

We introduce some notation and the structure of undirected Gaussian graphical models; for a comprehensive introduction see Lauritzen (25). Let G = (V, E) be an undirected graph, where V = {1, 2, ..., p} is the set of nodes and E ⊂ V × V is the set of existing edges. Let

$$\mathcal{W} = \{(i, j) \mid i, j \in V,\ i < j\},$$

and let $\bar{E} = \mathcal{W} \setminus E$ denote the set of non-existing edges. We define a zero-mean Gaussian graphical model with respect to the graph G as

$$\mathcal{M}_G = \left\{ N_p(0, \Sigma) \mid K = \Sigma^{-1} \in \mathbb{P}_G \right\},$$

where $\mathbb{P}_G$ denotes the space of p × p positive definite matrices with entries (i, j) equal to zero whenever $(i, j) \in \bar{E}$. Let $\mathbf{x} = (x_1, \ldots, x_n)$ be an independent and identically distributed sample of size n from model $\mathcal{M}_G$. Then the likelihood is

$$P(\mathbf{x} \mid K, G) \propto |K|^{n/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}(KS)\right\}, \qquad (2.1)$$

where $S = \mathbf{x}'\mathbf{x}$. The joint posterior distribution is given as

$$P(G, K \mid \mathbf{x}) \propto P(\mathbf{x} \mid G, K)\, P(K \mid G)\, P(G). \qquad (2.2)$$

For the prior distribution of the graph there are many options, of which we propose two. In the absence of any prior beliefs about the graph structure, one choice is a discrete uniform distribution over the graph space $\mathcal{G}$,

$$P(G) = \frac{1}{|\mathcal{G}|}, \quad \text{for each } G \in \mathcal{G}.$$

Alternatively, we propose a truncated Poisson distribution on the graph size |E| with parameter γ,

$$P(G) \propto \frac{\gamma^{|E|}}{|E|!}, \quad \text{for each } G = (V, E) \in \mathcal{G}.$$

Other choices of priors for the graph structure involve modelling the joint state of the edges as a multivariate discrete distribution (7, 43), encouraging sparse graphs (22),

or having multiple testing correction properties (42).

For the prior distribution of the precision matrix, we use the G-Wishart distribution (40, 28), which is attractive since it is the conjugate prior for normally distributed data and places no probability mass on the zero entries of the precision matrix. A zero-constrained random matrix $K \in \mathbb{P}_G$ has the G-Wishart distribution $W_G(b, D)$ if

$$P(K \mid G) = \frac{1}{I_G(b, D)} |K|^{(b-2)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}(DK)\right\},$$

where b > 2 is the number of degrees of freedom, D is a symmetric positive definite matrix, and $I_G(b, D)$ is the normalizing constant,

$$I_G(b, D) = \int_{\mathbb{P}_G} |K|^{(b-2)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}(DK)\right\} dK.$$

When G is complete, the G-Wishart distribution $W_G(b, D)$ reduces to the Wishart distribution $W_p(b, D)$, and its normalizing constant has an explicit form (33). If G is decomposable, we can calculate $I_G(b, D)$ explicitly (40). For non-decomposable graphs, however, $I_G(b, D)$ has no explicit form; it can be approximated numerically by a Monte Carlo method (3) or a Laplace approximation (27).

The G-Wishart prior is conjugate to the likelihood (2.1); hence, conditional on the graph G and the observed data $\mathbf{x}$, the posterior distribution of K is

$$P(K \mid \mathbf{x}, G) = \frac{1}{I_G(b^*, D^*)} |K|^{(b^*-2)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}(D^*K)\right\},$$

where $b^* = b + n$ and $D^* = D + S$; that is, a $W_G(b^*, D^*)$ distribution. Other choices of priors for the precision matrix are considered in a class of shrinkage priors (50) based on the graphical lasso approach (47, 48); they place constant priors on the nonzero entries of the precision matrix and no probability mass on the zero entries. (A short example of sampling from the G-Wishart distribution is sketched below.) In the following section, we describe an efficient trans-dimensional MCMC sampling scheme for the joint posterior distribution (2.2).
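Draws from the G-Wishart distribution can be obtained with the rgwish() function of our BDgraph package (described in Chapter 4). The sketch below is a minimal example; the exact argument names are assumptions and may differ between package versions.

```r
# Minimal sketch: one draw from W_G(b = 3, D = I_p) for a path graph G.
library(BDgraph)

p   <- 5
adj <- matrix(0, p, p)
adj[cbind(1:(p - 1), 2:p)] <- 1      # edges (1,2), (2,3), ..., (p-1,p)
adj <- adj + t(adj)                  # symmetric adjacency matrix of G

K <- rgwish(n = 1, adj = adj, b = 3, D = diag(p))
```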

2.4 The birth-death MCMC method

Here we construct a continuous-time birth-death Markov process specifically for Gaussian graphical model selection. The process explores the graph space by adding or removing an edge in a birth or death event; births and deaths occur in continuous time, with rates determined by the stationary distribution of the process.

Suppose the birth-death process at time t is at state (G, K), in which G = (V, E) with precision matrix $K \in \mathbb{P}_G$. Let $\Omega = \bigcup_{G \in \mathcal{G}} \{G\} \times \mathbb{P}_G$, where $\mathcal{G}$ denotes the set of all possible graphs. We consider the following continuous-time birth-death Markov process on Ω:

Death: Each edge $e \in E$ dies independently of the others as a Poisson process with rate $\delta_e(K)$; the overall death rate is $\delta(K) = \sum_{e \in E} \delta_e(K)$. If the death of an edge $e = (i, j) \in E$ occurs, the process jumps to a new state $(G^{-e}, K^{-e})$, in which $G^{-e} = (V, E \setminus \{e\})$ and $K^{-e} \in \mathbb{P}_{G^{-e}}$. We take $K^{-e}$ equal to K except for the entries in positions (i, j), (j, i) and (j, j). Note we can distinguish i from j since, by our definition of an edge, i < j.

Birth: A new edge $e \in \bar{E}$ is born independently of the others as a Poisson process with rate $\beta_e(K)$; the overall birth rate is $\beta(K) = \sum_{e \in \bar{E}} \beta_e(K)$. If the birth of an edge $e = (i, j) \in \bar{E}$ occurs, the process jumps to a new state $(G^{+e}, K^{+e})$, in which $G^{+e} = (V, E \cup \{e\})$ and $K^{+e} \in \mathbb{P}_{G^{+e}}$. We take $K^{+e}$ equal to K except for the entries in positions (i, j), (j, i) and (j, j).

The birth and death processes are independent Poisson processes, so the time between two successive events is exponentially distributed with mean $1/(\beta(K) + \delta(K))$. Therefore, the probability of the next birth/death event is

$$P(\text{birth of edge } e) = \frac{\beta_e(K)}{\beta(K) + \delta(K)}, \quad \text{for each } e \in \bar{E}, \qquad (2.3)$$

$$P(\text{death of edge } e) = \frac{\delta_e(K)}{\beta(K) + \delta(K)}, \quad \text{for each } e \in E. \qquad (2.4)$$

The following theorem gives a sufficient condition under which the stationary distribution of our birth-death process is precisely the joint posterior distribution of the graph and precision matrix.

Theorem 2.1. The above birth-death process has stationary distribution $P(K, G \mid \mathbf{x})$ if, for each $e \in \mathcal{W}$,

$$\delta_e(K)\, P(G, K \setminus (k_{ij}, k_{jj}) \mid \mathbf{x}) = \beta_e(K^{-e})\, P(G^{-e}, K^{-e} \setminus k_{jj} \mid \mathbf{x}). \qquad (2.5)$$

Proof. Our proof is based on the theory derived by Preston (37, Sections 7 and 8). Preston proposed a special birth-death process in which the birth and death rates are functions of the state.

The process evolves by two types of jumps: a birth is the appearance of a single individual, whereas a death is the removal of a single individual. This process converges to a unique stationary distribution if the balance conditions hold (37, Theorem 7.1). We construct our method in such a way that the stationary distribution equals the joint posterior distribution of the graph and the precision matrix. See the appendix for a detailed proof.

2.4.1 Proposed BDMCMC algorithm

Our proposed BDMCMC algorithm is based on a specific choice of birth and death rates that satisfies Theorem 2.1. We take the birth and death rates to be

$$\beta_e(K) = \frac{P(G^{+e}, K^{+e} \setminus (k_{ij}, k_{jj}) \mid \mathbf{x})}{P(G, K \setminus k_{jj} \mid \mathbf{x})}, \quad \text{for each } e \in \bar{E}, \qquad (2.6)$$

$$\delta_e(K) = \frac{P(G^{-e}, K^{-e} \setminus k_{jj} \mid \mathbf{x})}{P(G, K \setminus (k_{ij}, k_{jj}) \mid \mathbf{x})}, \quad \text{for each } e \in E. \qquad (2.7)$$

Based on these rates, our BDMCMC algorithm is as follows.

Algorithm 2.1. BDMCMC algorithm. Given a graph G = (V, E) with a precision matrix K, iterate the following steps:

Step 1. Birth and death process
  1.1. Calculate the birth rates by (2.6) and $\beta(K) = \sum_{e \in \bar{E}} \beta_e(K)$.
  1.2. Calculate the death rates by (2.7) and $\delta(K) = \sum_{e \in E} \delta_e(K)$.
  1.3. Calculate the waiting time by $w(K) = 1/(\beta(K) + \delta(K))$.
  1.4. Simulate the type of jump (birth or death) by (2.3) and (2.4).
Step 2. According to the type of jump, sample the new precision matrix.

The main computational parts of our BDMCMC algorithm are computing the birth and death rates (steps 1.1 and 1.2) and sampling from the posterior distribution of the precision matrix (step 2). In Section 2.4.2, we illustrate how to calculate the birth and death rates; in Section 2.4.3, we explain a direct sampling algorithm from the G-Wishart distribution for the posterior of the precision matrix. A sketch of step 1 in code is given below.
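A minimal R sketch of step 1, assuming the vectors birth_rates (over $e \in \bar{E}$) and death_rates (over $e \in E$) have already been computed from (2.6) and (2.7); all names are hypothetical.

```r
# Step 1 of Algorithm 2.1: waiting time and type of the next jump.
rates <- c(birth_rates, death_rates)        # all possible moves, births first
w     <- 1 / sum(rates)                     # waiting time (step 1.3)
move  <- sample(length(rates), size = 1,    # which edge is born or dies,
                prob = rates / sum(rates))  # with probabilities (2.3)-(2.4)
is_birth <- move <= length(birth_rates)
```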

In our continuous-time BDMCMC algorithm we sample at each jump to a new state (e.g. at the jumping times {t_1, t_2, t_3, ...} in Figure 2.1). For inference, we weight each visited state by the length of its waiting time ({w_1, w_2, w_3, ...} in Figure 2.1), which gives a Rao-Blackwellized estimator of the sample mean (6, subsection 2.5); see e.g. (2.13). Based on these waiting times, we estimate the posterior distribution of the graphs as the proportion of the total waiting time spent in each graph (see Figure 2.1, right, and Figure 2.3). For more detail about sampling from continuous-time Markov processes see Cappé et al. (6, subsection 2.5).

Fig. 2.1 (Left) Continuous-time BDMCMC algorithm, where {t_1, t_2, t_3, ...} are jumping times and {w_1, w_2, w_3, ...} are waiting times. (Right) Posterior probability estimation of the graphs based on the proportions of their waiting times.

2.4.2 Step 1: Computing the birth and death rates

In step 1 of our BDMCMC algorithm, the main task is calculating the birth and death rates (steps 1.1 and 1.2); the other steps are straightforward. Here we illustrate how to calculate the death rates; the birth rates are calculated in a similar manner, since both (2.6) and (2.7) are ratios of conditional posterior densities.

For each $e = (i, j) \in E$, the numerator of the death rate is

$$P(G^{-e}, K^{-e} \setminus k_{jj} \mid \mathbf{x}) = \frac{P(G^{-e}, K^{-e} \mid \mathbf{x})}{P(k_{jj} \mid K^{-e} \setminus k_{jj}, G^{-e}, \mathbf{x})}.$$

The full conditional posterior of $k_{jj}$ is (see Roverato 40, Lemma 1)

$$k_{jj} - c \mid K^{-e} \setminus k_{jj}, G^{-e}, \mathbf{x} \sim W(b^*, D^*_{jj}),$$

where $c = K_{j, V \setminus j} (K_{V \setminus j, V \setminus j})^{-1} K_{V \setminus j, j}$. Following Wang and Li (49), after some simplification

we have

$$P(G^{-e}, K^{-e} \setminus k_{jj} \mid \mathbf{x}) = \frac{P(G^{-e})}{P(\mathbf{x})} \frac{I(b^*, D^*_{jj})}{I_{G^{-e}}(b, D)} \left|K^0_{V \setminus j, V \setminus j}\right|^{(b^*-2)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}(K^0 D^*)\right\}, \qquad (2.8)$$

where $I(b^*, D^*_{jj})$ is the normalizing constant of a G-Wishart distribution for p = 1, and $K^0$ equals K except for an entry 0 in positions (i, j) and (j, i) and an entry c in position (j, j).

For the denominator of the death rate we have

$$P(G, K \setminus (k_{ij}, k_{jj}) \mid \mathbf{x}) = \frac{P(G, K \mid \mathbf{x})}{P(k_{ij}, k_{jj} \mid K \setminus (k_{ij}, k_{jj}), G, \mathbf{x})},$$

in which we need the full conditional distribution of $(k_{ij}, k_{jj})$. We can obtain the conditional posterior of $(k_{ii}, k_{ij}, k_{jj})$ and, by further conditioning on $k_{ii}$ and using the proposition in the appendix, evaluate the full conditional distribution of $(k_{ij}, k_{jj})$. Following Wang and Li (49), after some simplification we have

$$P(K \setminus (k_{ij}, k_{jj}), G \mid \mathbf{x}) = \frac{P(G)}{P(\mathbf{x})} \frac{J(b^*, D^*_{ee}, K)}{I_G(b, D)} \left|K^1_{V \setminus e, V \setminus e}\right|^{(b^*-2)/2} \exp\left\{-\tfrac{1}{2}\operatorname{tr}(K^1 D^*)\right\}, \qquad (2.9)$$

where

$$J(b^*, D^*_{ee}, K) = \left(\frac{2\pi}{D^*_{jj}}\right)^{1/2} I(b^*, D^*_{jj})\, (k_{ii} - k^1_{ii})^{(b^*-2)/2} \exp\left\{-\tfrac{1}{2}\left(D^*_{ii} - \frac{(D^*_{ij})^2}{D^*_{jj}}\right)(k_{ii} - k^1_{ii})\right\},$$

and $K^1$ equals K except for the entries $K_{e, V \setminus e} (K_{V \setminus e, V \setminus e})^{-1} K_{V \setminus e, e}$ in the positions corresponding to e = (i, j).

Plugging (2.8) and (2.9) into the death rates (2.7) yields

$$\delta_e(K) = \frac{P(G^{-e})}{P(G)} \frac{I_G(b, D)}{I_{G^{-e}}(b, D)}\, H(K, D^*, e), \qquad (2.10)$$

in which

$$H(K, D^*, e) = \left(\frac{D^*_{jj}}{2\pi (k_{ii} - k^1_{ii})}\right)^{1/2} \exp\left\{-\tfrac{1}{2}\left[\operatorname{tr}(D^*(K^0 - K^1)) - \left(D^*_{ii} - \frac{(D^*_{ij})^2}{D^*_{jj}}\right)(k_{ii} - k^1_{ii})\right]\right\}. \qquad (2.11)$$

Computing the above death rates requires the prior normalizing constants, which is the main computational burden; calculating the remaining elements is extremely

fast.

Coping with evaluation of prior normalizing constants

Murray et al. (34) proved that the exchange algorithm based on exact sampling is a powerful tool for general MCMC algorithms whose likelihoods have additional parameter-dependent normalization terms, such as the posterior over parameters of an undirected graphical model. Wang and Li (49) and Lenkoski (26) illustrate how to use the idea behind the exchange algorithm to circumvent the intractable normalizing constants in (2.10). Given a direct sampler of the G-Wishart distribution, Lenkoski (26) used a modification of the exchange algorithm to deal with the ratio of prior normalizing constants. Suppose (G, K) is the current state of our algorithm and we wish to calculate the death rates (2.10). First we sample $\tilde{K}$ from $W_G(b, D)$ via the exact sampler, Algorithm 2.2 below. Then we replace the death rates with

$$\delta_e(K) = \frac{P(G^{-e})}{P(G)} \frac{H(K, D^*, e)}{H(\tilde{K}, D, e)}, \qquad (2.12)$$

in which the intractable ratio of prior normalizing constants has been replaced by an evaluation of H (given in (2.11)) under the prior, evaluated at $\tilde{K}$. For theoretical justification of this procedure, see Murray et al. (34) and Liang (29).

2.4.3 Step 2: Direct sampler from precision matrix

Lenkoski (26) developed an exact sampler for the precision matrix, which borrows ideas from Hastie et al. (21). Throughout, we use this direct sampler for sampling from the posterior distribution of the precision matrix K. The algorithm is as follows.

Algorithm 2.2. Direct sampler from the precision matrix (Lenkoski (26)). Given a graph G = (V, E) with precision matrix K and $\Sigma = K^{-1}$:

Step 1. Set Ω = Σ.
Step 2. Repeat for i = 1, ..., p, until convergence:
  2.1. Let $N_i \subset V$ be the set of neighbours of node i in graph G. Form $\Omega_{N_i}$ and $\Sigma_{N_i, i}$ and solve $\hat{\beta}^*_i = \Omega_{N_i}^{-1} \Sigma_{N_i, i}$.
  2.2. Form $\hat{\beta}_i \in \mathbb{R}^{p-1}$ by padding the elements of $\hat{\beta}^*_i$ into the appropriate locations, with zeroes in the locations not connected to i in G.
  2.3. Update $\Omega_{-i, i}$ and $\Omega_{i, -i}$ with $\Omega_{-i, -i}\, \hat{\beta}_i$.
Step 3. Return $K = \Omega^{-1}$.

A sketch of this completion step in code follows.
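A minimal R sketch of Algorithm 2.2, assuming adj is the adjacency matrix of G and Sigma the inverse of an unconstrained draw K (so that the cycling enforces the zero constraints of G); the function and variable names are hypothetical, not from the package.

```r
# Sketch of Algorithm 2.2: iterative completion producing K in P_G.
gwish_complete <- function(adj, Sigma, tol = 1e-8, max_iter = 1000) {
  p <- nrow(Sigma)
  Omega <- Sigma                                 # Step 1
  for (iter in 1:max_iter) {
    Omega_old <- Omega
    for (i in 1:p) {                             # Step 2
      Ni <- which(adj[i, ] == 1)                 # neighbours of node i in G
      beta_hat <- numeric(p)
      if (length(Ni) > 0)                        # Step 2.1: restricted solve
        beta_hat[Ni] <- solve(Omega[Ni, Ni, drop = FALSE], Sigma[Ni, i])
      # Steps 2.2-2.3: pad with zeroes and update row/column i of Omega
      upd <- Omega[-i, -i, drop = FALSE] %*% beta_hat[-i]
      Omega[-i, i] <- upd
      Omega[i, -i] <- upd
    }
    if (max(abs(Omega - Omega_old)) < tol) break # stop at convergence
  }
  solve(Omega)                                   # Step 3: K = Omega^{-1}
}
```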

2.5 Statistical performance

Here we present the results of three comprehensive simulation studies and two applications to real data sets. In Section 2.5.1, we show that our method outperforms alternative Bayesian approaches in terms of convergence, mixing in the graph space and computing time; moreover, its model selection properties compare favorably with frequentist alternatives. In Section 2.5.2, we illustrate our method on a large-scale real data set of human gene expression. In Section 2.5.3, we demonstrate the extension of our method to graphical models involving time series data, showing how graphs can be useful in modeling real-world problems such as gene expression time course data. We performed all computations with the R package BDgraph (32).

2.5.1 Simulation study

Graph with 6 nodes

We illustrate the performance of our methodology, and compare it with two alternative Bayesian methods, on a small toy simulation example taken from Wang and Li (49). We consider a data generating mechanism with p = 6 within

$$\mathcal{M}_G = \left\{ N_6(0, \Sigma) \mid K = \Sigma^{-1} \in \mathbb{P}_G \right\},$$

where the 6 × 6 precision matrix K is as specified in Wang and Li (49). Just like Wang and Li (49), we let $S = nK^{-1}$ with n = 18, which represents 18 samples from the true model $\mathcal{M}_G$. As a non-informative prior, we take a uniform distribution for

the graph and a G-Wishart $W_G(3, I_6)$ for the precision matrix. To evaluate the performance of our BDMCMC algorithm, we run it for 60,000 iterations with 30,000 as burn-in. All computations for this example were carried out on an Intel(R) Core(TM) i5 CPU 2.67GHz processor.

We calculate the posterior pairwise edge inclusion probabilities based on Rao-Blackwellization (6, subsection 2.5) as

$$\hat{p}_e = \frac{\sum_{t=1}^{N} I(e \in G^{(t)})\, w(K^{(t)})}{\sum_{t=1}^{N} w(K^{(t)})}, \quad \text{for each } e \in \mathcal{W}, \qquad (2.13)$$

where N is the number of iterations, $I(e \in G^{(t)})$ is an indicator function such that $I(e \in G^{(t)}) = 1$ if $e \in G^{(t)}$ and zero otherwise, and $w(K^{(t)})$ is the waiting time in graph $G^{(t)}$ with precision matrix $K^{(t)}$; see Figure 2.1. A one-line implementation of this estimator is given after Figure 2.3. We report the resulting posterior pairwise edge inclusion probabilities $\hat{p}_e$ for all edges $e = (i, j) \in \mathcal{W}$ and the posterior mean of the precision matrix $\hat{K}$ (matrices not reproduced here).

We compare the performance of our BDMCMC algorithm with two recently proposed trans-dimensional MCMC approaches: the algorithm proposed by Lenkoski (26), which we call Lenkoski, and the algorithm proposed by Wang and Li (49), which we call WL. The R code for the WL approach is available online (projects/bgraph/).

Compared to the other Bayesian approaches, our BDMCMC algorithm is highly efficient due to its fast convergence speed. One useful test of convergence is given by the plot of

the cumulative occupancy fraction for all possible edges, shown in Figure 2.2. Figure 2.2a shows that our BDMCMC algorithm converges after approximately 10,000 iterations. Figures 2.2b and 2.2c show that the Lenkoski algorithm converges after approximately 30,000 iterations, whereas the WL algorithm still has not converged after 60,000 iterations.

Fig. 2.2 Plot of the cumulative occupancy fractions of all possible edges, to check convergence in simulation example 2.5.1: (a) BDMCMC algorithm, (b) Lenkoski algorithm (26), (c) WL algorithm (49).

Figure 2.3 reports the estimated posterior distribution of the graphs for the BDMCMC, Lenkoski, and WL algorithms. Figure 2.3a indicates that our algorithm visited around 450 different graphs and that the estimated posterior probability of the true graph is 0.66, the highest posterior probability among all visited graphs. Figure 2.3b shows that the Lenkoski algorithm visited around 400 different graphs and that the estimated posterior probability of the true graph is 0.36. Figure 2.3c shows that the WL algorithm visited only 23 different graphs and that the estimated posterior probability of the true graph is 0.33.

Fig. 2.3 Plot of the estimated posterior probability of graphs in simulation example 2.5.1: (a) BDMCMC algorithm, (b) Lenkoski algorithm (26), (c) WL algorithm (49); the true graph is marked in each panel.
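For concreteness, the estimator (2.13) is a one-liner given the sampler output. In the sketch below, graphs is an assumed list of p × p adjacency matrices $G^{(t)}$ and w the vector of waiting times $w(K^{(t)})$; the names are hypothetical.

```r
# Rao-Blackwellized edge inclusion probabilities (2.13):
# weighted average of adjacency matrices, weights = waiting times.
phat_matrix <- Reduce(`+`, Map(`*`, w, graphs)) / sum(w)
```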

To assess the performance of the graph structure estimation, we compute the posterior probability of the true graph, and a calibration error (CE) measure, defined as

$$\text{CE} = \sum_{e \in \mathcal{W}} \left| \hat{p}_e - I(e \in G_{\text{true}}) \right|, \qquad (2.14)$$

where, for each $e \in \mathcal{W}$, $\hat{p}_e$ is the posterior pairwise edge inclusion probability in (2.13) and $G_{\text{true}}$ is the true graph. The CE is positive, with a minimum at 0, and smaller is better. Table 2.1 compares our method with the two other Bayesian approaches (WL and Lenkoski), reporting mean values over 50 replications of the entire simulation, with standard errors in parentheses. The first and second columns show the performance of the algorithms; our algorithm performs better due to its faster convergence. The third column shows the acceptance probability (α), the probability of moving to a new graphical model. The fourth column shows that our algorithm is slower than the Lenkoski algorithm and faster than the WL approach. However, for a fair comparison, note that our method obtains 60,000 effective samples in 25 minutes, while the Lenkoski algorithm in 14 minutes obtains only 60,000 × 0.058 = 3,480 effective samples. For fairness we performed all simulations in R; our package BDgraph, however, implements the algorithm in C++ code linked to R. For 60,000 iterations, the C++ implementation takes only 17 seconds instead of 25 minutes in R, around 90 times faster, which makes our algorithm computationally feasible for high-dimensional graphs.

            P(true G | data)   CE            α                CPU time (min)
BDMCMC      0.66 (0.00)        0.47 (0.01)   1                25 (0.14)
Lenkoski    0.36 (0.02)        1.17 (0.08)   0.058 (0.001)    14 (0.13)
WL          0.33 (0.12)        1.25 (0.46)   n/a (0.0003)     37 (0.64)

Table 2.1 Summary of performance measures in simulation example 2.5.1 for the BDMCMC approach, Lenkoski (26), and WL (49): average posterior probability of the true graph, average calibration error (CE) as defined in (2.14), average acceptance probability (α), and average computing time in minutes, over 50 replications, with standard deviations in parentheses (n/a: entry not available).

Extensive comparison with Bayesian methods

We now perform a comprehensive simulation with respect to different graph structures to evaluate the performance of our Bayesian method and compare it with the two recently proposed trans-dimensional MCMC algorithms, WL (49) and Lenkoski (26). Corresponding to different sparsity patterns, we consider 7 different kinds of synthetic graphical models:

1. Circle: A graph with $k_{ii} = 1$, $k_{i,i-1} = k_{i-1,i} = 0.5$, $k_{1p} = k_{p1} = 0.4$, and $k_{ij} = 0$ otherwise.
2. Star: A graph in which every node is connected to the first node, with $k_{ii} = 1$, $k_{1i} = k_{i1} = 0.1$, and $k_{ij} = 0$ otherwise.
3. AR(1): A graph with $\sigma_{ij} = 0.7^{|i-j|}$.
4. AR(2): A graph with $k_{ii} = 1$, $k_{i,i-1} = k_{i-1,i} = 0.5$, $k_{i,i-2} = k_{i-2,i} = 0.25$, and $k_{ij} = 0$ otherwise.
5. Random: A graph in which the edge set E is randomly generated from independent Bernoulli distributions with probability 2/(p-1), and the corresponding precision matrix is generated from $K \sim W_G(3, I_p)$.
6. Cluster: A graph in which the number of clusters is $\max\{2, [p/20]\}$. Each cluster has the same structure as a random graph. The corresponding precision matrix is generated from $K \sim W_G(3, I_p)$.
7. Scale-free: A graph generated by the B-A algorithm (2). The resulting graph has p-1 edges. The corresponding precision matrix is generated from $K \sim W_G(3, I_p)$.

For each graphical model, we consider four different scenarios: (1) dimension p = 10 and sample size n = 30, (2) p = 10 and n = 100, (3) p = 50 and n = 100, (4) p = 50 and n = 500. For each generated sample, we fit our Bayesian method and the two other Bayesian approaches (WL and Lenkoski) with a uniform prior for the graph and the G-Wishart prior $W_G(3, I_p)$ for the precision matrix. We run the three algorithms from the same starting points for 60,000 iterations with 30,000 as burn-in. Computation for this example was performed in parallel on 235 batch nodes with 12 cores and 24 GB of memory, running Linux.

To assess the performance of the graph structure estimation, we compute the calibration error (CE) measure defined in (2.14) and the $F_1$-score measure (4, 36), defined as

$$F_1\text{-score} = \frac{2\,\text{TP}}{2\,\text{TP} + \text{FP} + \text{FN}}, \qquad (2.15)$$

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The $F_1$-score lies between 0 and 1, where 1 stands for perfect identification and 0 for bad identification. A code sketch for both measures follows.
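Both measures are straightforward to compute from the sampler output. The sketch below assumes Gtrue and Ghat are p × p 0/1 adjacency matrices and phat the matrix of estimated inclusion probabilities from (2.13); the names are hypothetical.

```r
# F1-score (2.15) of a selected graph, and calibration error (2.14).
ut <- upper.tri(Gtrue)                    # count each edge once
TP <- sum(Ghat[ut] == 1 & Gtrue[ut] == 1) # true positives
FP <- sum(Ghat[ut] == 1 & Gtrue[ut] == 0) # false positives
FN <- sum(Ghat[ut] == 0 & Gtrue[ut] == 1) # false negatives
F1 <- 2 * TP / (2 * TP + FP + FN)
CE <- sum(abs(phat[ut] - Gtrue[ut]))
```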

Table 2.2 compares our method with the two other Bayesian approaches, where we repeat the experiments 50 times and report the average $F_1$-score and CE, with standard errors in parentheses. Our method performs well overall, as its $F_1$-score and its CE are the best in most cases, mainly because of its fast convergence rate. Both our method and the Lenkoski approach perform better than the WL approach. The main reason is that the WL approach uses a double Metropolis-Hastings step (based on a block Gibbs sampler), which is an approximation of the exchange algorithm, whereas both our method and the Lenkoski approach use the exchange algorithm based on exact sampling of the precision matrix. As expected, the Lenkoski approach converges more slowly than our method. The reason appears to be the dependence of the Lenkoski approach on the choice of the tuning parameter $\sigma_g^2$ (26, step 3 of the algorithm on p. 124). In our simulations we found that the convergence rate (as well as the acceptance probability) of the Lenkoski algorithm depends on the choice of $\sigma_g^2$; here we chose $\sigma_g^2 = 0.1$ as a default. From a theoretical point of view, both the BDMCMC and the Lenkoski algorithms converge to the true posterior distribution if run for a sufficient amount of time; the results in this table thus indicate only how quickly the algorithms converge.

Table 2.3 reports the average running time and acceptance probability (α), with standard errors in parentheses, across all 7 graphs with 50 replications each. It shows that our method is slower than the Lenkoski approach. The reason is that our method scans all possible edges to calculate the birth/death rates, which is computationally expensive, whereas in the Lenkoski algorithm a new graph is proposed by randomly choosing one edge, which is computationally fast but not efficient. The table also shows that the acceptance probability (α) for both WL and Lenkoski is small, especially for the WL approach. Note that α here is the probability that the algorithm moves to a new graphical model; it is not related to the double Metropolis-Hastings algorithm. The extremely small α of the WL approach is likely caused by the approximation used for the ratio of prior normalizing constants; as Murray et al. (34) pointed out, these kinds of algorithms can suffer from high rejection rates. For the Lenkoski approach, α is relatively small but much better than for the WL method; since a new graph is proposed by randomly choosing one edge, the acceptance probability remains relatively small.

Comparison with frequentist methods

We also compare the performance of our Bayesian method with two popular frequentist methods: the graphical lasso (glasso) (15) and the Meinshausen-Bühlmann graph estimation (mb) (31). We consider the same 7 graphical models and the same scenarios as in the previous example. For each generated sample, we fit our Bayesian method with a uniform prior for the graph and the G-Wishart prior $W_G(3, I_p)$ for the precision matrix.

             F1-score                                  CE
             BDMCMC      Lenkoski    WL          BDMCMC      Lenkoski    WL
p=10, n=30
 circle      0.95 (0.00) 0.93 (0.01) 0.24 (0.01) 2.5 (1.6)   4.9 (2.1)   15.8 (5)
 star        0.15 (0.02) 0.17 (0.02) 0.16 (0.01) 11.3 (2.1)  14 (1.4)    13.6 (3.4)
 AR(1)       n/a (0.01)  0.70 (0.01) 0.34 (0.02) 4.4 (2.2)   9.7 (2.4)   12.5 (8.8)
 AR(2)       n/a (0.01)  0.59 (0.02) 0.36 (0.01) 11.5 (3.5)  12.8 (3.1)  16.3 (6.2)
 random      0.57 (0.03) 0.50 (0.01) 0.34 (0.01) 11.4 (8.0)  15.3 (6.3)  14.1 (14.0)
 cluster     0.61 (0.02) 0.49 (0.01) 0.33 (0.01) 10.3 (9.4)  14.3 (7.3)  13.5 (9.8)
 scale-free  0.53 (0.03) 0.45 (0.02) 0.31 (0.02) 11.8 (8.8)  15.6 (6.9)  13.3 (6.5)
p=10, n=100
 circle      0.99 (0.00) 0.98 (0.00) 0.26 (0.01) 1.0 (0.4)   2.2 (0.5)   15.6 (6.7)
 star        0.21 (0.02) 0.18 (0.02) 0.25 (0.02) 9.3 (1.6)   11.4 (1.3)  11.4 (3.4)
 AR(1)       n/a (0.00)  0.95 (0.00) 0.34 (0.01) 1.5 (0.4)   5.2 (0.5)   13.0 (5.7)
 AR(2)       n/a (0.01)  0.90 (0.01) 0.47 (0.01) 4.1 (3.7)   5.6 (2.7)   14.0 (7.5)
 random      0.76 (0.01) 0.65 (0.02) 0.35 (0.01) 7.0 (5.6)   10.7 (6.3)  13.8 (10.5)
 cluster     0.74 (0.02) 0.67 (0.02) 0.37 (0.02) 6.4 (7.2)   9.9 (7.8)   12.4 (9.4)
 scale-free  0.69 (0.02) 0.56 (0.02) 0.33 (0.02) 7.9 (8.0)   11.6 (7.0)  13.0 (7.8)
p=50, n=100
 circle      0.99 (0.01) 0.55 (0.10) 0.00 (0.00) 2.5 (0.9)   75.3 (7.2)  50 (0.0)
 star        0.17 (0.04) 0.09 (0.04) 0.00 (0.00) 68.8 (4.4)  n/a (4.5)   49 (0.0)
 AR(1)       n/a (0.04)  0.33 (0.09) 0.00 (0.00) 19.0 (4.5)  n/a (5.4)   49 (0.0)
 AR(2)       n/a (0.04)  0.49 (0.17) 0.00 (0.00) 28.6 (5.7)  n/a (5.4)   97 (0.0)
 random      0.51 (0.09) 0.21 (0.07) 0.00 (0.00) 73.2 (18.5) n/a (36.7)  49.2 (5.6)
 cluster     0.55 (0.11) 0.18 (0.07) 0.00 (0.00) 72.8 (18.2) n/a (44.4)  47.8 (8.4)
 scale-free  0.49 (0.11) 0.19 (0.07) 0.00 (0.00) 72.4 (22.5) n/a (47.8)  49 (0.0)
p=50, n=500
 circle      1.00 (0.01) 0.72 (0.09) 0.00 (0.00) 1.7 (0.6)   55.8 (5.4)  50 (0.0)
 star        0.65 (0.05) 0.35 (0.05) 0.00 (0.00) 31.7 (4.5)  92.3 (3.6)  49 (0.0)
 AR(1)       n/a (0.02)  0.54 (0.07) 0.00 (0.00) 7.2 (1.9)   84.9 (4.0)  49 (0.0)
 AR(2)       n/a (0.01)  0.78 (0.11) 0.00 (0.00) 4.8 (1.8)   61.7 (4.4)  97 (0.0)
 random      0.73 (0.09) 0.34 (0.10) 0.00 (0.00) 34.3 (11.2) n/a (28.8)  50.7 (7.0)
 cluster     0.74 (0.09) 0.32 (0.13) 0.00 (0.00) 32.2 (10.6) n/a (27.3)  48.5 (5.9)
 scale-free  0.73 (0.10) 0.33 (0.08) 0.00 (0.00) 35.3 (13.6) n/a (26.9)  49 (0.0)

Table 2.2 Summary of performance measures in the simulation of Section 2.5.1 for the BDMCMC approach, Lenkoski (26), and WL (49): $F_1$-score (2.15) and CE (2.14) for different models, with 50 replications and standard deviations in parentheses (n/a: entry not available). The $F_1$-score reaches its best score at 1 and its worst at 0; the CE is positive valued, with 0 as minimum, and smaller is better.

           BDMCMC       Lenkoski     WL
p = 10
 α         1            n/a (0.001)  8.8e-06 (4.6e-11)
 Time      97 (628)     40 (225)     380 (11361)
p = 50
 α         1            n/a (0.045)  n/a (0.0000)
 Time      5408 (1694)  1193 (1000)  9650 (1925)

Table 2.3 Comparison of our BDMCMC algorithm with the WL (49) and Lenkoski (26) approaches: average computing time in minutes and average probability of acceptance (α), with standard deviations in parentheses (n/a: entry not available).

To fit the glasso and mb methods, however, we must specify a regularization parameter λ that controls the sparsity of the graph. The choice of λ is critical, since different λ's may lead to different graphs. We consider the glasso method with three different regularization parameter selection approaches: the stability approach to regularization selection (stars) (30), the rotation information criterion (ric) (53), and the extended Bayesian information criterion (ebic) (14). Similarly, we consider the mb method with two regularization parameter selection approaches, namely stars and ric. We repeat all the experiments 50 times.

Table 2.4 compares all approaches, reporting the average $F_1$-score with standard errors in parentheses. Our Bayesian approach performs well, as its $F_1$-score typically outperforms all frequentist methods, except in the unlikely scenario of a high number of observations, where it roughly equals the performance of the mb method with the stars criterion. All the other approaches perform well in some cases and fail in others. For instance, for p = 50 the mb method with ric is the best for the AR(1) graph and the worst for the circle graph.

To assess the performance of the precision matrix estimation, we use the Kullback-Leibler divergence (23), given by

$$\text{KL} = \frac{1}{2}\left( \operatorname{tr}\left(K_{\text{true}}^{-1} \hat{K}\right) - p - \log\left|K_{\text{true}}^{-1} \hat{K}\right| \right), \qquad (2.16)$$

where $K_{\text{true}}$ is the true precision matrix and $\hat{K}$ is the estimate of the precision matrix; a code sketch is given below. Table 2.5 compares all methods, reporting the average KL with standard errors in parentheses. Based on KL, the overall performance of our Bayesian approach is good, as its KL is the best in all scenarios except one.
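The KL measure (2.16) is easy to evaluate via the log-determinant. A minimal sketch, assuming Ktrue and Khat are p × p positive definite precision matrices:

```r
# Kullback-Leibler divergence (2.16) between the true and estimated models.
kl_divergence <- function(Ktrue, Khat) {
  M <- solve(Ktrue, Khat)                 # K_true^{-1} %*% K_hat
  0.5 * (sum(diag(M)) - nrow(Ktrue) -
         as.numeric(determinant(M, logarithm = TRUE)$modulus))
}
```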

             BDMCMC      glasso                                 mb
                         stars       ric          ebic         stars       ric
p=10, n=30
 circle      0.95 (0.00) 0.00 (0.00) 0.01 (0.01)  0.48 (0.00)  0.42 (0.01) 0.01 (0.01)
 star        0.15 (0.02) 0.01 (0.00) 0.15 (0.02)  0.00 (0.00)  0.01 (0.02) 0.14 (0.02)
 AR(1)       n/a (0.01)  0.20 (0.13) 0.61 (0.01)  0.17 (0.07)  0.46 (0.01) 0.83 (0.01)
 AR(2)       n/a (0.01)  0.09 (0.02) 0.19 (0.02)  0.00 (0.00)  0.07 (0.02) 0.19 (0.02)
 random      0.57 (0.03) 0.36 (0.06) 0.48 (0.02)  0.08 (0.04)  0.45 (0.03) 0.53 (0.03)
 cluster     0.61 (0.02) 0.45 (0.05) 0.54 (0.02)  0.07 (0.04)  0.50 (0.02) 0.54 (0.02)
 scale-free  0.53 (0.03) 0.30 (0.05) 0.40 (0.02)  0.06 (0.02)  0.36 (0.03) 0.46 (0.03)
p=10, n=100
 circle      0.99 (0.00) 0.00 (0.00) 0.50 (0.08)  0.45 (0.00)  0.89 (0.08) 0.81 (0.09)
 star        0.21 (0.02) 0.08 (0.02) 0.29 (0.03)  0.01 (0.00)  0.07 (0.03) 0.29 (0.03)
 AR(1)       n/a (0.00)  0.90 (0.01) 0.57 (0.00)  0.56 (0.00)  0.94 (0.00) 0.85 (0.00)
 AR(2)       n/a (0.01)  0.34 (0.06) 0.63 (0.00)  0.08 (0.05)  0.41 (0.01) 0.64 (0.01)
 random      0.76 (0.01) 0.61 (0.02) 0.57 (0.01)  0.45 (0.07)  0.68 (0.02) 0.61 (0.02)
 cluster     0.74 (0.02) 0.66 (0.03) 0.59 (0.02)  0.53 (0.07)  0.68 (0.03) 0.61 (0.03)
 scale-free  0.69 (0.02) 0.56 (0.02) 0.48 (0.008) 0.34 (0.07)  0.63 (0.02) 0.52 (0.02)
p=50, n=100
 circle      0.99 (0.01) 0.28 (0.05) 0.00 (0.00)  0.28 (0.01)  0.00 (0.00) 0.00 (0.00)
 star        0.17 (0.04) 0.14 (0.06) 0.06 (0.05)  0.00 (0.00)  0.15 (0.04) 0.05 (0.041)
 AR(1)       n/a (0.04)  0.56 (0.04) 0.59 (0.03)  0.49 (0.05)  0.82 (0.02) 0.98 (0.02)
 AR(2)       n/a (0.04)  0.59 (0.02) 0.02 (0.02)  0.00 (0.00)  0.66 (0.02) 0.02 (0.02)
 random      0.51 (0.09) 0.52 (0.10) 0.40 (0.16)  0.04 (0.13)  0.61 (0.21) 0.49 (0.21)
 cluster     0.55 (0.11) 0.54 (0.06) 0.42 (0.18)  0.13 (0.24)  0.64 (0.22) 0.50 (0.22)
 scale-free  0.49 (0.11) 0.48 (0.10) 0.32 (0.18)  0.02 (0.09)  0.60 (0.23) 0.40 (0.23)
p=50, n=500
 circle      1.00 (0.01) 0.27 (0.05) 0.00 (0.00)  0.25 (0.01)  0.00 (0.00) 0.00 (0.00)
 star        0.65 (0.05) 0.29 (0.12) 0.60 (0.07)  0.01 (0.02)  0.31 (0.07) 0.60 (0.07)
 AR(1)       n/a (0.02)  0.57 (0.02) 0.54 (0.02)  0.44 (0.02)  0.97 (0.02) 0.98 (0.01)
 AR(2)       n/a (0.01)  0.69 (0.03) 0.64 (0.01)  0.66 (0.04)  0.89 (0.02) 0.69 (0.02)
 random      0.73 (0.09) 0.62 (0.12) 0.46 (0.15)  0.56 (0.13)  0.82 (0.24) 0.61 (0.24)
 cluster     0.74 (0.09) 0.65 (0.10) 0.51 (0.17)  0.58 (0.10)  0.82 (0.23) 0.64 (0.23)
 scale-free  0.73 (0.10) 0.57 (0.14) 0.41 (0.15)  0.47 (0.15)  0.82 (0.24) 0.62 (0.24)

Table 2.4 Summary of performance measures in the simulation for the BDMCMC approach, glasso (15) with three criteria, and the mb method (31) with two criteria: $F_1$-score (2.15) for different models, with 50 replications and standard deviations in parentheses (n/a: entry not available). The $F_1$-score reaches its best score at 1 and its worst at 0.

             BDMCMC      glasso
                         stars        ric          ebic
p=10, n=30
 circle      0.73 (0.12) n/a (0.03)   n/a (1.33)   n/a
 star        0.57 (0.08) 0.31 (0.00)  0.22 (0.00)  0.33 (0.01)
 AR(1)       n/a (0.10)  3.63 (0.07)  1.59 (0.06)  2.77 (2.34)
 AR(2)       n/a (0.07)  1.27 (0.00)  1.26 (0.00)  1.28 (0.00)
 random      0.67 (0.08) 8.32 (305)   n/a (1637)   n/a
 cluster     0.61 (0.06) 4.90 (2.37)  3.74 (3.23)  5.72 (7.35)
 scale-free  0.65 (0.07) 5.83 (12.35) n/a (26.62)  n/a
p=10, n=100
 circle      0.14 (0.00) n/a (0.01)   n/a (0.56)   n/a
 star        0.13 (0.00) 0.15 (0.00)  0.10 (0.00)  0.17 (0.00)
 AR(1)       n/a (0.00)  2.88 (0.16)  0.81 (0.01)  0.37 (0.00)
 AR(2)       n/a (0.01)  1.24 (0.01)  1.14 (0.00)  1.25 (0.02)
 random      0.16 (0.00) 4.47 (1.09)  3.30 (0.76)  3.92 (2.55)
 cluster     0.13 (0.00) 4.46 (12.62) 3.62 (8.17)  4.47 (30.31)
 scale-free  0.16 (0.00) 4.14 (1.27)  3.01 (0.70)  3.68 (1.94)
p=50, n=100
 circle      0.67 (0.13) n/a (32.12)  n/a (5.88)   n/a
 star        1.75 (0.21) 1.05 (0.08)  1.27 (0.07)  1.49 (0.16)
 AR(1)       n/a (0.23)  8 (1.10)     8.92 (0.50)  6.20 (0.93)
 AR(2)       n/a (0.33)  6.56 (0.19)  7.29 (0.06)  7.27 (0.09)
 random      2.01 (0.42) n/a (6.44)   n/a (12.54)  n/a
 cluster     1.94 (0.42) n/a (3.75)   n/a (5.08)   n/a
 scale-free  1.96 (0.45) n/a (6.71)   n/a (8.40)   n/a
p=50, n=500
 circle      0.11 (0.01) n/a (30.56)  n/a (3.26)   n/a
 star        0.34 (0.04) 0.78 (0.04)  0.63 (0.03)  0.96 (0.06)
 AR(1)       n/a (0.02)  4.87 (0.45)  3.51 (0.23)  1.75 (0.08)
 AR(2)       n/a (0.03)  5.50 (0.2)   6.42 (0.13)  4.01 (0.10)
 random      0.26 (0.06) n/a (5.58)   n/a (6.59)   n/a (8.79)
 cluster     0.24 (0.05) n/a (21.51)  n/a (28.62)  n/a
 scale-free  0.25 (0.07) n/a (7.69)   n/a (16.65)  n/a

Table 2.5 Summary of performance measures in the simulation for the BDMCMC approach and glasso (15) with three criteria: the KL measure (2.16) for different models, with 50 replications and standard deviations in parentheses (n/a: entry not available). The KL is positive valued, with 0 as minimum, and smaller is better.
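The simulation pipeline of this section can be reproduced with the BDgraph package described in Chapter 4. The sketch below uses the function names listed there (bdgraph.sim, bdgraph, select, compare); the exact arguments are version-dependent assumptions.

```r
# One replication of the simulation pipeline, as a sketch.
library(BDgraph)

sim <- bdgraph.sim(p = 10, n = 30, graph = "random")   # data + true graph
fit <- bdgraph(data = sim$data, iter = 60000, burnin = 30000)

select(fit)          # graph selected from the posterior sample
compare(sim, fit)    # performance measures against the true graph
```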

2.5.2 Application to human gene expression data

We apply our proposed method to analyze a large-scale human gene expression data set, originally described by Bhadra and Mallick (5), Chen et al. (8), and Stranger et al. (46). The data were collected by Stranger et al. (46) using Illumina's Sentrix Human-6 Expression BeadChips to measure gene expression in B-lymphocyte cells from Utah (CEU) individuals of Northern and Western European ancestry. They consider 60 unrelated individuals whose genotypes are available from the Sanger Institute website (ftp://ftp.sanger.ac.uk/pub/genevar). The genotype is coded as 0, 1, and 2 for rare homozygous, heterozygous, and common homozygous alleles. The focus here is on the 3,125 Single Nucleotide Polymorphisms (SNPs) found in the 5' UTR (untranslated region) of mRNA (messenger RNA) with a minor allele frequency of at least 0.1. Since the UTR has been subject to investigation previously, it should have an important role in the regulation of gene expression. There were four replicates for each individual. The raw data were background corrected, quantile normalized across the replicates of each individual, and then median normalized across all individuals. We chose the 100 most variable probes among the 47,293 total available probes corresponding to different Illumina TargetIDs; each selected probe corresponds to a different transcript. Thus, we have n = 60 and p = 100. The data are available in the R package BDgraph.

Bhadra and Mallick (5) analyzed these data by adjusting for the effect of SNPs using an expression quantitative trait loci (eQTL) mapping study; they found 54 significant interactions among the 100 traits considered. Previous studies have shown that these data are an interesting case study for prediction.

We place a uniform distribution as an uninformative prior on the graph and the G-Wishart $W_G(3, I_{100})$ on the precision matrix, and run our BDMCMC algorithm for 60,000 iterations with a 30,000-sweep burn-in (a code sketch of this analysis is given after Figure 2.5). The graph with the highest posterior probability has 281 edges and includes almost all the significant interactions discovered by Bhadra and Mallick (5). Figure 2.4 shows the selected graph with 86 edges, those whose posterior inclusion probabilities (2.13) are greater than 0.6; edges in the graph show the interactions among the genes. Figure 2.5 shows an image of all posterior inclusion probabilities for visualization.

Fig. 2.4 The inferred graph for the human gene expression data set: the selected graph with 86 significant edges whose posterior inclusion probabilities (2.13) exceed 0.6.

Fig. 2.5 Image visualization of the posterior pairwise edge inclusion probabilities for all possible edges in the graph.
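A sketch of how the reported graph is extracted from the sampler output, using the phat function named in Chapter 4; the data set name geneExpr and the exact arguments are assumptions.

```r
# Fit the model to the gene expression data and keep the edges with
# posterior inclusion probability (2.13) above 0.6.
library(BDgraph)

fit  <- bdgraph(data = geneExpr, iter = 60000, burnin = 30000)
p_ij <- phat(fit)               # matrix of estimated inclusion probabilities
adj  <- 1 * (p_ij > 0.6)        # adjacency of the graph shown in Figure 2.4
```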

Fig. 2.4 The inferred graph for the human gene expression data set: the selected graph with 86 significant edges, for which the posterior inclusion probabilities are greater than 0.6.

Fig. 2.5 Image visualization of the posterior pairwise edge inclusion probabilities for all possible edges in the graph.

2.5.3 Extension to time course data

Here, to demonstrate how well our proposed methodology can be extended to other types of graphical models, we focus on graphical models involving time series data (10, 1). We show how graphs can be useful in modeling real-world problems such as gene expression time course data.

Suppose we have a T time point longitudinal microarray study across p genes. We assume a stable dynamic graph structure for the time course data as follows:

x_t ~ N_p(f(t), K^{-1}), for t = 1, ..., T,   (2.17)

in which f(t) = {f_i(t)}_{i=1}^p with f_i(t) = β_i' h(t) = Σ_{r=1}^m β_{ir} h_r(t), β_i = (β_{i1}, ..., β_{im})', h(t) = (h_1(t), ..., h_m(t))', and m is the number of basis elements. Here h(t) is a cubic spline basis, which is continuous with continuous first and second derivatives (21, chapter 5). The aim of this model is to find a parsimonious description of both the time dynamics and the gene interactions.

For this model, we place a uniform distribution as an uninformative prior on the graph and a G-Wishart W_G(3, I_p) on the precision matrix. As prior distribution for β_i we choose N(µ_{0i}, B_{0i}), for i = 1, ..., p. Thus, based on our likelihood and priors, the conditional distribution of β_i is

β_i | x, K, G ~ N(µ_i, B_i),   (2.18)

in which

B_i = ( B_{0i}^{-1} + K_{ii} Σ_{t=1}^T h(t) h'(t) )^{-1},
µ_i = B_i ( B_{0i}^{-1} µ_{0i} + Σ_{t=1}^T h(t) ( x_t' K_{V,i} − K_{i,V∖i} f_{V∖i}(t) ) ).

Thus, to account for the time effect in our model, we require one more Gibbs sampling step in the BDMCMC algorithm, updating β_i for i = 1, ..., p; a minimal sketch of this update is given below.
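The sketch assumes a fixed T × m spline basis matrix H (for example from splines::ns), the current precision matrix K, the T × p data matrix x, the current p × m coefficient matrix beta, and prior mean mu0 and covariance B0; all of these object names are ours, introduced for illustration.

    # Sketch of the Gibbs update (2.18) for beta_i; H[t, ] plays the role of h(t).
    library(MASS)  # for mvrnorm

    sample_beta_i <- function(i, x, K, H, beta, mu0, B0) {
      B0_inv <- solve(B0)
      # B_i = (B0^{-1} + K_ii * sum_t h(t) h(t)')^{-1}
      B_i <- solve(B0_inv + K[i, i] * crossprod(H))
      f <- H %*% t(beta)                            # T x p matrix of fitted f_j(t)
      # sum_t h(t) (x_t' K_{V,i} - K_{i,V\i} f_{-i}(t)), stacked over t
      lin <- t(H) %*% (x %*% K[, i] - f[, -i, drop = FALSE] %*% K[i, -i])
      mu_i <- B_i %*% (B0_inv %*% mu0 + lin)
      mvrnorm(1, mu_i, B_i)                         # one draw from N(mu_i, B_i)
    }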

To evaluate the efficiency of the method, we focus on the mammary gland gene expression time course data from Stein et al. (44). The data report a large time course Affymetrix microarray experiment across different developmental stages, performed using mammary tissue from female mice. There are 12,488 probe sets representing 8600 genes. In total, the probe sets are measured across 54 arrays, with three mice used at each of 18 time points. The time points fall into four main stages: virgin (6, 10, and 12 weeks); pregnancy (1, 2, 3, 8.5, 12.5, 14.5, and 17.5); lactation (1, 3, and 7); and involution (1, 2, 3, 4, and 20). Using cluster analysis, we identify 30 genes which provide the best partition among the developmental stages; those genes play a role in the transitions across the main developmental events. The mammary data are available in the R package smida; for more details about the data see Wit and McClure (51, chapter one). Abegaz and Wit (1) analyze these data based on a sparse time series chain graphical model.

By using our proposed methodology, we infer the interactions between these crucial genes. Placing a uniform prior on the graph and the G-Wishart W_G(3, I_30) on the precision matrix, we run our BDMCMC algorithm for 60,000 iterations using 30,000 as burn-in. Figure 2.6 shows the selected graph based on the output of our BDMCMC algorithm; it contains the edges with a posterior inclusion probability greater than 0.6. As we can see in Figure 2.6, the genes with the highest number of edges are LCN2, HSD17B, CRP1, and RABEP1, each with 7 edges.

Schmidt-Ott et al. (41) suggested that the gene LCN2 (lipocalin 2) plays an important role in the innate immune response to bacterial infection and also functions as a growth factor. For the gene HSD17B (17-β hydroxysteroid dehydrogenase), past studies suggest that this gene family provides each cell with the necessary mechanisms to control the level of intracellular androgens and/or estrogens (24). Gene CRP1 was identified by Abegaz and Wit (1) as a likely hub.

Fig. 2.6 The inferred graph for the mammary gland gene expression data set: the selected graph with 56 significant edges, for which the posterior inclusion probabilities are greater than 0.6.

Fig. 2.7 Image visualization of the posterior pairwise edge inclusion probabilities of all possible edges in the graph.

2.6 Discussion

We introduced a Bayesian approach for graph structure learning based on Gaussian graphical models, using a trans-dimensional MCMC methodology built on a birth-death process. In Theorem 2.1, we derived the conditions under which the balance conditions of the birth-death MCMC method hold. Based on those conditions, we proposed a convenient BDMCMC algorithm whose stationary distribution is our joint posterior distribution. We showed that a scalable Bayesian method exists which, also in the case of large graphs, is able to distinguish important edges from irrelevant ones and to detect the true model with high accuracy. The resulting graphical model is reasonably robust to the modeling assumptions and to the priors used.

As we have shown in our simulation studies (Subsection 2.5.1), in Gaussian graphical models any trans-dimensional MCMC algorithm based on a discrete time Markov process (such as the reversible jump algorithms of 49 and 26) can suffer from high rejection rates, especially for high-dimensional graphs. In our BDMCMC algorithm, by contrast, moves between graphs are always accepted. In general, although our trans-dimensional MCMC algorithm incurs significant additional computing cost for the birth and death rates, it has clear benefits over reversible jump style moves when graph structure learning in a non-hierarchical setting is of primary interest.

In Gaussian graphical models, Bayesian structure learning faces several computational and methodological issues as the dimension grows: (1) convergence, (2) computation of the prior normalizing constant, and (3) sampling from the posterior distribution of the precision matrix. Our Bayesian approach efficiently addresses these problems. Regarding convergence, Cappé et al. (6) demonstrate the strong similarity of reversible jump and continuous time methodologies by showing that, on appropriate rescaling of time, the reversible jump chain converges to a limiting continuous time birth-death process; in Subsection 2.5.1 we show the fast convergence of our BDMCMC algorithm. For the second problem, in Subsection 2.4.2, using ideas from Wang and Li (49) and Lenkoski (26), we show that the exchange algorithm circumvents the intractable normalizing constant. For the third problem, we use the exact sampler algorithm proposed by Lenkoski (26).

Our proposed method provides a flexible framework for handling the graph structure, and it can be extended to different types of priors for the graph and the precision matrix. In Subsection 2.5.3, we illustrated how our proposed model can be integrated into other types of graphical models, such as multivariate time series graphical models. Although we have focused on normally distributed data, the proposed method can in general be extended to other types of graphical models, such as log-linear models (see e.g. 13 and 27), non-Gaussian data by using copula transformations (see e.g. 12), or copula regression models (see e.g. 35). This will be considered in future work.

Appendix 1: Proof of Theorem 2.1

Before we derive the detailed balance conditions for our BDMCMC algorithm, we introduce some notation. Assume the process is at state (G, K), in which G = (V, E) with precision matrix K ∈ P_G. The process behavior is defined by the birth rates β_e(K), the death rates δ_e(K), and the birth and death transition kernels T^G_{β_e}(K; ·) and T^G_{δ_e}(K; ·). For each e ∈ E̅, T^G_{β_e}(K; ·) denotes the probability that the process jumps from state (G, K) to a point in the new state (G^{+e}, K*) with K* ∈ P_{G^{+e}}. Hence, if F ⊂ P_{G^{+e}} we have

T^G_{β_e}(K; F) = (β_e(K)/β(K)) ∫_{k_e : K ∪ k_e ∈ F} b_e(k_e; K) dk_e.   (2.19)

Likewise, for each e ∈ E, T^G_{δ_e}(K; ·) denotes the probability that the process jumps from

state (G, K) to a point in the new state (G^{-e}, K*) with K* ∈ P_{G^{-e}}. Therefore, if F ⊂ P_{G^{-e}} we have

T^G_{δ_e}(K; F) = Σ_{η ∈ E : K ∖ k_η ∈ F} δ_η(K)/δ(K) = (δ_e(K)/δ(K)) I(K^{-e} ∈ F).

Detailed balance conditions. In our birth-death process, P(K, G | x) satisfies the detailed balance conditions if

∫_F δ(K) dP(K, G | x) = Σ_{e ∈ E} ∫_{P_{G^{-e}}} β(K^{-e}) T^{G^{-e}}_{β_e}(K^{-e}; F) dP(K^{-e}, G^{-e} | x),   (2.20)

and

∫_F β(K) dP(K, G | x) = Σ_{e ∈ E̅} ∫_{P_{G^{+e}}} δ(K^{+e}) T^{G^{+e}}_{δ_e}(K^{+e}; F) dP(K^{+e}, G^{+e} | x),   (2.21)

where F ⊂ P_G. The first expression says that edges that enter the set F due to deaths must be matched by edges that leave that set due to births, and vice versa for the second part.

To prove the first part (2.20), we have

LHS = ∫_F δ(K) dP(G, K | x)
    = ∫_{P_G} I(K ∈ F) δ(K) dP(G, K | x)
    = ∫_{P_G} I(K ∈ F) Σ_{e ∈ E} δ_e(K) dP(G, K | x)
    = Σ_{e ∈ E} ∫_{P_G} I(K ∈ F) δ_e(K) dP(G, K | x)
    = Σ_{e ∈ E} ∫ I(K ∈ F) δ_e(K) P(G, K | x) Π_{i=1}^p dk_ii Π_{(i,j) ∈ E} dk_ij.

For the RHS, by using (2.19) we have

RHS = Σ_{e ∈ E} ∫_{P_{G^{-e}}} β(K^{-e}) T^{G^{-e}}_{β_e}(K^{-e}; F) dP(K^{-e}, G^{-e} | x)
    = Σ_{e ∈ E} ∫_{P_{G^{-e}}} ∫_{k_e : K^{-e} ∪ k_e ∈ F} β_e(K^{-e}) b_e(k_e; K^{-e}) dk_e dP(K^{-e}, G^{-e} | x)
    = Σ_{e ∈ E} ∫_{P_{G^{-e}}} ∫_{k_e} I(K ∈ F) β_e(K^{-e}) b_e(k_e; K^{-e}) dk_e dP(K^{-e}, G^{-e} | x)
    = Σ_{e ∈ E} ∫ I(K ∈ F) β_e(K^{-e}) b_e(k_e; K^{-e}) P(K^{-e}, G^{-e} | x) Π_{i=1}^p dk_ii Π_{(i,j) ∈ E} dk_ij.

By setting

δ_e(K) P(G, K | x) = β_e(K^{-e}) b_e(k_e; K^{-e}) P(K^{-e}, G^{-e} | x),

we have LHS = RHS. Now, in the above equation,

P(G, K | x) = P(G, K ∖ (k_ij, k_jj) | x) P((k_ij, k_jj) | K ∖ (k_ij, k_jj), G, x),

and

P(G^{-e}, K^{-e} | x) = P(G^{-e}, K^{-e} ∖ k_jj | x) P(k_jj | K^{-e} ∖ k_jj, G^{-e}, x).

We simply choose the proposal density for the new element k_e = k_ij as

b_e(k_e; K) = P((k_ij, k_jj) | K ∖ (k_ij, k_jj), G, x) / P(k_jj | K^{-e} ∖ k_jj, G^{-e}, x).

Therefore, we reach the expression in Theorem 2.1. The proof of the second part (2.21) is analogous.

Appendix 2: Proposition

Let A be a 2 × 2 random matrix with Wishart distribution W(b, D), that is,

P(A) = (1 / I(b, D)) |A|^{(b−2)/2} exp{ −(1/2) tr(DA) },

where

A = [ a_11  a_12 ; a_12  a_22 ]  and  D = [ d_11  d_12 ; d_12  d_22 ].

Then:

(i) a_11 ~ W(b + 1, D_{11.2}), where D_{11.2} = d_11 − d_21^2 d_22^{-1};

(ii) P(a_12, a_22 | a_11) = P(A) / P(a_11) = (1 / J(b, D, a_11)) |A|^{(b−2)/2} exp{ −(1/2) tr(DA) },

where

J(b, D, a_11) = (2π / d_22)^{1/2} I(b, d_22) a_11^{(b−1)/2} exp{ −(1/2) D_{11.2} a_11 }.

Proof. For the proof of part (i), see Muirhead (33). The result for part (ii) is immediate by using part (i).
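Part (i) can be checked numerically. The sketch below uses stats::rWishart, whose density |A|^{(df−p−1)/2} exp{−tr(Σ^{-1}A)/2} matches the parametrization above when df = b + 1 (for 2 × 2 matrices) and Σ = D^{-1}; in that case part (i) says a_11 ~ Gamma(shape = (b + 1)/2, rate = D_{11.2}/2).

    # Monte Carlo check of part (i) of the proposition.
    set.seed(1)
    b <- 5
    D <- matrix(c(2, 0.5, 0.5, 1), 2, 2)
    A <- rWishart(10000, df = b + 1, Sigma = solve(D))  # draws under P(A) above
    a11 <- A[1, 1, ]
    D112 <- D[1, 1] - D[1, 2]^2 / D[2, 2]               # Schur complement D_{11.2}
    ks.test(a11, "pgamma", shape = (b + 1) / 2, rate = D112 / 2)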

References

[1] Abegaz, F. and Wit, E. (2013). Sparse time series chain graphical models for reconstructing genetic networks. Biostatistics, 14(3).
[2] Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47.
[3] Atay-Kayis, A. and Massam, H. (2005). A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92(2).
[4] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5).
[5] Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2).
[6] Cappé, O., Robert, C., and Rydén, T. (2003). Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3).
[7] Carvalho, C. M. and Scott, J. G. (2009). Objective Bayesian model selection in Gaussian graphical models. Biometrika, 96(3).
[8] Chen, L., Tong, T., and Zhao, H. (2008). Considering dependence among genes and markers for false discovery control in eQTL mapping. Bioinformatics, 24(18).
[9] Cheng, Y., Lenkoski, A., et al. (2012). Hierarchical Gaussian graphical models: Beyond reversible jump. Electronic Journal of Statistics, 6.
[10] Dahlhaus, R. and Eichler, M. (2003). Causality and graphical models in time series analysis. Oxford Statistical Science Series.
[11] Dempster, A. (1972). Covariance selection. Biometrics, 28(1).
[12] Dobra, A., Lenkoski, A., et al. (2011a). Copula Gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics, 5(2A).

[13] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011b). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496).
[14] Foygel, R. and Drton, M. (2010). Extended Bayesian information criteria for Gaussian graphical models. In Lafferty, J., Williams, C. K. I., Shawe-Taylor, J., Zemel, R., and Culotta, A., editors, Advances in Neural Information Processing Systems 23.
[15] Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3).
[16] Geyer, C. J. and Møller, J. (1994). Simulation procedures and likelihood inference for spatial point processes. Scandinavian Journal of Statistics.
[17] Giudici, P. and Castelo, R. (2003). Improving Markov chain Monte Carlo model search for data mining. Machine Learning, 50(1-2).
[18] Giudici, P. and Green, P. (1999). Decomposable graphical Gaussian model determination. Biometrika, 86(4).
[19] Green, P. (1995). Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82(4).
[20] Green, P. J. (2003). Trans-dimensional Markov chain Monte Carlo. Oxford Statistical Science Series.
[21] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.
[22] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4).
[23] Kullback, S. and Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1).
[24] Labrie, F., Luu-The, V., Lin, S.-X., Claude, L., Simard, J., Breton, R., and Bélanger, A. (1997). The key role of 17β-hydroxysteroid dehydrogenases in sex steroid biology. Steroids, 62(1).

[25] Lauritzen, S. (1996). Graphical Models, volume 17. Oxford University Press, USA.
[26] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1).
[27] Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. Journal of Computational and Graphical Statistics, 20(1).
[28] Letac, G. and Massam, H. (2007). Wishart distributions for decomposable graphs. The Annals of Statistics, 35(3).
[29] Liang, F. (2010). A double Metropolis-Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80(9).
[30] Liu, H., Roeder, K., and Wasserman, L. (2010). Stability approach to regularization selection (StARS) for high dimensional graphical models. In Advances in Neural Information Processing Systems.
[31] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3).
[32] Mohammadi, A. and Wit, E. (2015). BDgraph: Graph Estimation Based on Birth-Death MCMC Approach. R package.
[33] Muirhead, R. (1982). Aspects of Multivariate Statistical Theory, volume 42. Wiley.
[34] Murray, I., Ghahramani, Z., and MacKay, D. (2012). MCMC for doubly-intractable distributions. arXiv preprint.
[35] Pitt, M., Chan, D., and Kohn, R. (2006). Efficient Bayesian inference for Gaussian copula regression models. Biometrika, 93(3).
[36] Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1).
[37] Preston, C. J. (1976). Special birth-and-death processes. Bulletin of the International Statistical Institute, 46.

[38] Ravikumar, P., Wainwright, M. J., Lafferty, J. D., et al. (2010). High-dimensional Ising model selection using l1-regularized logistic regression. The Annals of Statistics, 38(3).
[39] Ripley, B. D. (1977). Modelling spatial patterns. Journal of the Royal Statistical Society, Series B (Methodological).
[40] Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3).
[41] Schmidt-Ott, K. M., Mori, K., Li, J. Y., Kalandadze, A., Cohen, D. J., Devarajan, P., and Barasch, J. (2007). Dual action of neutrophil gelatinase-associated lipocalin. Journal of the American Society of Nephrology, 18(2).
[42] Scott, J. G. and Berger, J. O. (2006). An exploration of aspects of Bayesian multiple testing. Journal of Statistical Planning and Inference, 136(7).
[43] Scutari, M. (2013). On the prior and posterior distributions used in graphical modelling. Bayesian Analysis, 8(1):1-28.
[44] Stein, T., Morris, J. S., Davies, C. R., Weber-Hall, S. J., Duffy, M.-A., Heath, V. J., Bell, A. K., Ferrier, R. K., Sandilands, G. P., and Gusterson, B. A. (2004). Involution of the mouse mammary gland is associated with an immune cascade and an acute-phase response, involving LBP, CD14 and STAT3. Breast Cancer Research, 6(2).
[45] Stephens, M. (2000). Bayesian analysis of mixture models with an unknown number of components: an alternative to reversible jump methods. Annals of Statistics, 28(1).
[46] Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., et al. (2007). Population genomics of human gene expression. Nature Genetics, 39(10).
[47] Wang, H. (2012). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4).
[48] Wang, H. (2014). Scaling it up: Stochastic search structure learning in graphical models.

[49] Wang, H. and Li, S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6.
[50] Wang, H. and Pillai, N. S. (2013). On a class of shrinkage priors for covariance matrix estimation. Journal of Computational and Graphical Statistics, 22(3).
[51] Wit, E. and McClure, J. (2004). Statistics for Microarrays: Design, Analysis and Inference. John Wiley & Sons.
[52] Zhao, P. and Yu, B. (2006). On model selection consistency of lasso. The Journal of Machine Learning Research, 7.
[53] Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2012). The huge package for high-dimensional undirected graph estimation in R. The Journal of Machine Learning Research, 13(1).

Chapter 3

Bayesian Modelling of Dupuytren Disease Using Gaussian Copula Graphical Models

3.1 Abstract

Dupuytren disease is a fibroproliferative disorder with unknown etiology that often progresses and can eventually cause permanent contractures of the affected fingers. Most research on the severity and phenotype of this disease consists of observational studies without concrete statistical analyses, and a multivariate analysis of the disease that takes potential risk factors into account is lacking. In this paper, we provide a novel Bayesian framework to discover potential risk factors and to investigate which fingers are jointly affected. Gaussian copula graphical modelling is one potential way to discover the underlying conditional independence structure of variables in mixed data. Our Bayesian approach is based on Gaussian copula graphical models: we embed a graph selection procedure inside a semiparametric Gaussian copula, and we carry out the posterior inference using an efficient sampling scheme, a trans-dimensional MCMC approach based on a birth-death process. We have implemented the method in the R package BDgraph.

Key words: Dupuytren disease; Risk factors; Bayesian inference; Gaussian copula graphical models; Bayesian model selection; Latent variable models; Birth-death process; Markov chain Monte Carlo.

3.2 Introduction

Dupuytren disease is an inherited disorder that presents worldwide and is more prevalent in people of northern European ancestry (3). This incurable fibroproliferative disorder alters the palmar hand and may cause progressive and permanent flexion contractures of the fingers. At the first stage of the disease, skin pitting and subcutaneous nodules develop in the palm (Figure 3.1, left). At a later stage, cords appear that connect the nodules and may contract the fingers into a flexed position (Figure 3.1, right). A contracture can arise in a single ray in isolation or in multiple rays, which it affects in decreasing order. The disease mostly appears on the ulnar side of the hand, i.e. it affects the little and ring fingers most frequently (see Figure 3.2). The only available treatment is surgical intervention. The main questions are: (1) Can we determine which variables affect the disease, and how? (2) Do we need to intervene on multiple fingers when a surgical intervention takes place? The first question is an epidemiological one, while the second is a clinical one.

Empirical research has described the patterns of occurrence of Dupuytren disease in multiple fingers. (19) stated that the combination of affected ring and little fingers occurred most often, followed by the combination of an affected third, fourth, and fifth finger. (31) found that Dupuytren disease rarely affects the radial side in isolation, and that radial involvement is often associated with an affected ulnar side. (20) noticed that patients who had required surgery because of a firmly affected thumb were on average 8 years older and had suffered significantly longer from the disease than patients with a mildly affected radial side. Moreover, these patients suffered from ulnar disease that had repeatedly required surgery, suggesting an intractable form of the disease. More recently, (13), using a multivariate ordinal logit model, suggested that the middle finger is substantially correlated with the other fingers on the ulnar side, and that the thumb and index finger are correlated. They took age and sex into account and tested hypotheses of independence between groups of fingers. However, a multivariate analysis of the disease that takes potential risk factors into account is still lacking.

The cause of Dupuytren disease has been variably attributed to both phenotypic and genotypic factors (29). Essential risk factors include genetic predisposition and ethnicity, as well as sex and age. Family and twin studies suggest that Dupuytren disease has potential genetic risk factors; however, it is still unclear whether Dupuytren disease is a complex oligogenic or a simple monogenic Mendelian disorder. Several life-style risk factors (some considered controversial) include

smoking, excessive alcohol consumption, manual work, and hand trauma. In addition, several diseases, such as diabetes mellitus and epilepsy, are thought to play a role in the cause of Dupuytren disease. However, the role of these risk factors and diseases has not been fully elucidated, and the results of different studies are occasionally conflicting (12).

(a) With a flexed position  (b) Without a flexed position

Fig. 3.1 Left: the hand of a patient with Dupuytren disease whose finger has been affected by the disease (a flexed position). Right: the hand of a patient with Dupuytren disease whose fingers have not yet been affected, showing palmar nodules and small cords without signs of contracture.

In this paper we analyse data collected in the north of the Netherlands from patients who have Dupuytren disease. Both hands of the patients were examined for signs of Dupuytren disease: tethering of the skin, nodules, cords, and finger contractures in patients with cords. Severity of the disease is measured by the total angle of each of the 10 fingers. For the potential risk factors, we additionally inquired about smoking habits, alcohol consumption, whether participants had performed manual labor during a significant part of their life, and whether they had sustained hand injury in the past, including surgery. We also inquired about the presence of Ledderhose disease, diabetes, epilepsy, Peyronie disease, knuckle pads, and liver disease, and about familial occurrence of Dupuytren disease, defined as a first-degree relative with Dupuytren disease. Our dataset contains 279 patients, 79 of whom have the disease in at least one of their fingers; the data therefore contain many zeros, as shown in Figure 3.2. Besides the finger measurements, there are 13 potential risk factors. This mixed dataset contains binary (disease factors), discrete (alcohol and hand injury), and continuous (total angles of the fingers) variables.

The primary aim of this paper is to model the relationships between the risk factors

and the disease indicators for Dupuytren disease based on this mixed dataset. We propose an efficient Bayesian statistical methodology based on Gaussian copula graphical models that can be applied to binary, ordinal, and continuous variables simultaneously. We embed the graphical model inside a semiparametric framework, using the extended rank likelihood (10). We carry out posterior inference for the graph structure and the precision matrix using an efficient sampling scheme, a trans-dimensional MCMC approach based on a continuous-time birth-death process (22).

(a) Boxplot  (b) Histogram

Fig. 3.2 Left: boxplots of the total angles of all 10 fingers. Right: the occurrence of rays affected with Dupuytren disease for all 10 fingers.

Graphical models provide an effective way to describe statistical patterns in data. In this context, undirected Gaussian graphical models are commonly used, since inference in such models is often tractable. In undirected Gaussian graphical models, the graph structure is characterized by the precision matrix (the inverse of the covariance matrix): the non-zero entries of the precision matrix correspond to the edges of the graph. In the real world, data are often non-Gaussian. Non-Gaussian continuous data can be transformed to Gaussian latent variables. For discrete data, however, the situation is more convoluted: there is no one-to-one transformation into latent Gaussian variables. A common approach is to apply a Markov chain Monte Carlo (MCMC) method to simulate both the latent Gaussian variables and the posterior distributions (10). A related approach is the Gaussian copula graphical model developed by (7), whose sampler is based on a reversible-jump MCMC approach. Here, in our proposed method

we implement the birth-death MCMC approach (22), which has several computational advantages compared with the reversible-jump MCMC approach; see (22, Section 4).

The paper is organized as follows. In Section 3.3, we introduce a comprehensive Bayesian framework based on Gaussian copula graphical models; in addition, we show the performance of our methodology and compare it with state-of-the-art alternatives. In Section 3.4, we analyse the Dupuytren disease dataset with our proposed Bayesian methodology: we first discover the potential phenotype risk factors for Dupuytren disease, and we then consider the severity of Dupuytren disease between pairs of the 10 hand fingers; the results help surgeons decide whether they should operate on one finger or on multiple fingers simultaneously. In the last section, we discuss connections to existing methods and possible future directions.

3.3 Methodology

3.3.1 Gaussian graphical models

In graphical models, conditional dependence relationships among random variables are presented as a graph G. A graph G = (V, E) specifies a set of vertices V = {1, 2, ..., p}, where each vertex corresponds to a random variable, and a set of existing edges E ⊂ V × V (15); E̅ denotes the set of non-existing edges. We focus here on undirected graphical models, in which (i, j) ∈ E is equivalent to (j, i) ∈ E, also known as Markov random fields. The absence of an edge between two vertices specifies the pairwise conditional independence of the two corresponding variables given the remaining variables, while an edge between two variables indicates their conditional dependence. For a p-dimensional variable there are in total 2^{p(p−1)/2} possible conditional independence graphs; even with a relatively small number of variables, the size of the graph space is enormous. The graph space can be explored by stochastic search algorithms (22, 8, 11), which add or delete one edge at each step, a strategy known as a neighborhood search.

A graphical model that follows a multivariate normal distribution is called a Gaussian graphical model, also known as a covariance selection model (6). Zero entries in the precision matrix correspond to the absence of edges in the graph and to conditional independence between pairs of random variables given all other variables. We define a zero

mean Gaussian graphical model with respect to the graph G as

M_G = { N_p(0, Σ) | K = Σ^{-1} ∈ P_G },

where P_G denotes the space of p × p positive definite matrices with entries (i, j) equal to zero whenever (i, j) ∉ E. Let z = (z_1, ..., z_n) be an independent and identically distributed sample of size n from the model M_G, where z_i is a p-dimensional vector of variables. Then the likelihood is

P(z | K, G) ∝ |K|^{n/2} exp{ −(1/2) tr(KS) },   (3.1)

where S = z'z.

3.3.2 Gaussian copula graphical models

A copula is a multivariate cumulative distribution function whose marginals are uniform on the interval [0, 1]. Copulas provide a flexible tool for understanding dependence among random variables, in particular for non-Gaussian multivariate data. By Sklar's theorem (30), there exists a copula C such that any p-dimensional distribution function H can be completely specified by its marginal distributions and a copula C satisfying

H(y_1, ..., y_p) = C( F_1(y_1), ..., F_p(y_p) ),

where the F_j are the univariate marginal distributions of H. If all F_j are continuous, then C is unique; otherwise it is uniquely determined on Ran(F_1) × ... × Ran(F_p), the Cartesian product of the ranges of the F_j. Conversely, a copula function can be extracted from any p-dimensional distribution function H and marginal distributions F_j by

C(u_1, ..., u_p) = H( F_1^{-1}(u_1), ..., F_p^{-1}(u_p) ),

where F_j^{-1}(s) = inf{ t : F_j(t) ≥ s } are the pseudo-inverses of the F_j.

The decomposition of a joint distribution into marginal distributions and a copula suggests that the copula captures the essential features of dependence between random variables. Moreover, the copula measure of dependence is invariant to any monotone transformation of the random variables. Thus, copulas allow one to model the marginal distributions and the dependence structure of multivariate random variables separately. In copula modelling, Genest et al. (9) developed a popular semiparametric estimation approach,

or rank-likelihood-based estimation, in which the association among variables is represented with a parametric copula while the marginals are treated as nuisance parameters. The marginals are estimated non-parametrically using the scaled empirical distribution function

F̂_j(y) = (n / (n+1)) F_nj(y), where F_nj(y) = (1/n) Σ_{i=1}^n I{ y_ij ≤ y }.

As a result, estimation and inference are robust to misspecification of the marginal distributions. The semiparametric estimators are well-behaved for continuous data but fail for discrete data, for which the distribution of the ranks depends on the univariate marginal distributions, making them somewhat inappropriate for the analysis of mixed continuous and discrete data (10). To remedy this, Hoff (10) proposed the extended rank likelihood, a type of marginal likelihood that does not depend on the marginal distributions of the observed variables. Under the extended rank likelihood approach, the ranks are free of the nuisance parameters (the marginal distributions) of the discrete data. This makes the extended rank likelihood approach more focused on the determination of graphical models (or multivariate association) and avoids the difficult problem of modelling the marginal distributions (7).

In the case of ordered discrete and continuous variables, a Gaussian copula has been used to describe the dependence pattern between heterogeneous variables via the extended rank likelihood in Gaussian copula graphical modelling (7). Let Y be a collection of continuous, binary, ordinal, or count variables, with F_j the marginal distribution of Y_j and F_j^{-1} its pseudo-inverse. To construct a joint distribution of Y, we introduce multivariate normal latent variables

Z_1, ..., Z_n ~iid N(0, Γ),

and define the observed data as Y_ij = F_j^{-1}(Φ(Z_ij)), where Γ is the correlation matrix of the Gaussian copula. The joint distribution of Y is given by

P( Y_1 ≤ y_1, ..., Y_p ≤ y_p ) = C( F_1(y_1), ..., F_p(y_p) | Γ ),

where C(·) is the Gaussian copula, given by

C(u_1, ..., u_p | Γ) = Φ_p( Φ^{-1}(u_1), ..., Φ^{-1}(u_p) | Γ ),

where Φ_p(·) is the cumulative distribution function of a multivariate normal distribution and Φ(·) is the cumulative distribution function of a univariate normal distribution. Hence the joint cumulative distribution function is

P( Y_1 ≤ y_1, ..., Y_p ≤ y_p ) = Φ_p( Φ^{-1}(F_1(y_1)), ..., Φ^{-1}(F_p(y_p)) | Γ ).   (3.2)

In the semiparametric copula estimation, since the marginals are treated as nuisance parameters, the joint distribution in (3.2) is parametrized only by the correlation matrix of the Gaussian copula, Γ. Our aim is to infer the underlying graph structure G of the observed variables Y implied by the continuous latent variables Z. Since the Z are unobservable, we follow the idea of (10) and relate them to the observed data as follows. Given the observed data y from a sample of n observations, the latent samples z are constrained to belong to the set

A(y) = { z ∈ R^{n×p} : l_j^r(z) < z_j^{(r)} < u_j^r(z), r = 1, ..., n; j = 1, ..., p },   (3.3)

where

l_j^r(z) = max{ z_j^{(s)} : y_j^{(s)} < y_j^{(r)} } and u_j^r(z) = min{ z_j^{(s)} : y_j^{(r)} < y_j^{(s)} }.

Further, (10) suggests that inference on the latent space can be performed by substituting the observed data y with the event A(y). For a given graph G and precision matrix K = Γ^{-1}, the likelihood is defined as

P(y | K, G, F_1, ..., F_p) = P(y, z ∈ A(y) | K, G, F_1, ..., F_p) = P(z ∈ A(y) | K, G) P(y | z ∈ A(y), K, G, F_1, ..., F_p).

The only part of the observed data likelihood relevant for inference on K and G is P(z ∈ A(y) | K, G). Thus, the extended rank likelihood, as termed by (10), is given by

P(z ∈ A(y) | K, G) = ∫_{A(y)} P(z | K, G) dz,

where the expression inside the integral, for the Gaussian-copula-based distribution given by (3.2), takes a form similar to (3.1).
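For a concrete view of the set A(y), the following sketch computes the truncation bounds l_j^r(z) and u_j^r(z) in (3.3) for a single latent entry, assuming y is the observed n × p data matrix and z the current latent matrix (both object names are ours).

    # Sketch of the bounds in (3.3) for observation r and variable j.
    latent_bounds <- function(z, y, r, j) {
      below <- y[, j] < y[r, j]        # observations ranked strictly below y_rj
      above <- y[, j] > y[r, j]        # observations ranked strictly above y_rj
      l <- if (any(below)) max(z[below, j]) else -Inf
      u <- if (any(above)) min(z[above, j]) else  Inf
      c(lower = l, upper = u)
    }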

Therefore, we can infer (K, G) through the posterior distribution

P(K, G | z ∈ A(y)) ∝ P(z ∈ A(y) | K, G) P(K | G) P(G),

which is discussed in detail in the next sections. Moreover, we evaluate the results induced by the latent variables using posterior predictive analysis on the scale of the original mixed variables.

3.3.3 Bayesian Gaussian copula graphical models

Prior specification

In this section we discuss the specification of prior distributions for the graph G and the precision matrix K. For the prior distribution of the graph, we propose to use the discrete uniform distribution over the graph space, P(G) ∝ 1, as a non-informative prior. Other choices of priors for the graph structure have been considered, by modelling the joint state of the edges (28), encouraging sparse graphs (11), or placing a truncated Poisson distribution on the graph size (22).

We consider the G-Wishart distribution (27) as the prior distribution of the precision matrix. The G-Wishart distribution is conjugate for normally distributed data and places no probability mass on the zero entries of the precision matrix. A matrix K ∈ P_G has the G-Wishart distribution W_G(b, D) if

P(K | G) = (1 / I_G(b, D)) |K|^{(b−2)/2} exp{ −(1/2) tr(DK) },

where b > 2 is the number of degrees of freedom, D is a symmetric positive definite matrix, and I_G(b, D) is the normalizing constant,

I_G(b, D) = ∫_{P_G} |K|^{(b−2)/2} exp{ −(1/2) tr(DK) } dK.

If the graph G is complete, the G-Wishart distribution reduces to the usual Wishart distribution, whose normalizing constant has an explicit form (24). If a graph is decomposable, I_G(b, D) can be calculated explicitly (27). For non-decomposable graphs, I_G(b, D) can be approximated by a Monte Carlo approach (2) or a Laplace approximation (17).

The G-Wishart prior is conjugate to the likelihood (3.1); hence, the posterior distribution of K is

P(K | Z ∈ A(y), G) = (1 / I_G(b*, D*)) |K|^{(b*−2)/2} exp{ −(1/2) tr(D* K) },

where b* = b + n and D* = D + S; that is, W_G(b*, D*). For other choices of priors for the precision matrix see (36, 34, 33, 37).

Posterior inference

Consider the joint posterior distribution of K ∈ P_G and the graph G, given by

P(K, G | Z ∈ A(y)) ∝ P(Z ∈ A(y) | K) P(K | G) P(G).   (3.4)

Sampling from this joint posterior distribution can be done by the computationally efficient birth-death MCMC sampler proposed in Mohammadi and Wit (22) for Gaussian graphical models. Here we extend their algorithm to the more general case of Gaussian copula graphical models. The algorithm is based on a continuous time birth-death Markov process that explores the graph space by adding or removing an edge in a birth or death event; the birth and death rates of edges occur in continuous time, with the rates determined by the stationary distribution of the process. The algorithm is constructed in such a way that its stationary distribution equals the target joint posterior distribution of the graph and the precision matrix (3.4).

The birth and death events are independent Poisson processes, so the time between two successive events has an exponential distribution, and the probabilities of birth and death events are proportional to their rates. Mohammadi and Wit (22, Section 3) prove that, with the following birth and death rates, the birth-death MCMC sampling algorithm converges to the target joint posterior distribution of the graph and the precision matrix:

β_e(K) = P(G^{+e}, K^{+e} ∖ (k_ij, k_jj) | Z ∈ A(y)) / P(G, K ∖ k_jj | Z ∈ A(y)), for each e ∈ E̅,   (3.5)

δ_e(K) = P(G^{-e}, K^{-e} ∖ k_jj | Z ∈ A(y)) / P(G, K ∖ (k_ij, k_jj) | Z ∈ A(y)), for each e ∈ E,   (3.6)

in which G^{+e} = (V, E ∪ {e}) with K^{+e} ∈ P_{G^{+e}}, and similarly G^{-e} = (V, E ∖ {e}) with K^{-e} ∈ P_{G^{-e}}. The extended birth-death MCMC algorithm for Gaussian copula graphical models is summarized in Algorithm 3.1.

Algorithm 3.1 Given a graph G = (V, E) with a precision matrix K, iterate the following steps:

1. Sample the latent data. For each r ∈ V and j ∈ {1, 2, ..., n}, update the latent value z_r^{(j)} from its full conditional distribution

   Z_r | K, Z_{V∖{r}} = z_{V∖{r}}^{(j)} ~ N( −Σ_{r' ≠ r} K_{r r'} z_{r'}^{(j)} / K_rr, 1/K_rr ),

   truncated to the interval (l_r^j, u_r^j) in (3.3).

2. Sample the graph based on the birth and death process.
   2.1. Calculate the birth rates by equation (3.5) and β(K) = Σ_{e ∈ E̅} β_e(K).
   2.2. Calculate the death rates by equation (3.6) and δ(K) = Σ_{e ∈ E} δ_e(K).
   2.3. Calculate the waiting time W(K) = 1/(β(K) + δ(K)).
   2.4. Calculate the type of jump (birth or death).

3. Sample the new precision matrix, according to the type of jump, based on Algorithm 3.2.

In Algorithm 3.1, the first step samples the latent variables given the observed data; a minimal sketch of this step is given below. Based on this sample, we then calculate the birth and death rates and the waiting time, and from the rates we determine the type of jump; details of how to calculate the birth and death rates efficiently are discussed in the next subsection. Finally, in step 3, according to the new state of the jump, we sample a new precision matrix using a direct sampling scheme from the G-Wishart distribution, described in Algorithm 3.2. To calculate the posterior probability of a graph, we compute the Rao-Blackwellized sample mean (4, subsection 2.5): the Rao-Blackwellized estimate of the posterior probability of a graph is the proportion of the total waiting time spent in that graph (see Figure 3.3, bottom right), with weights equal to the lengths of the waiting times in each state (e.g. {W_1, W_2, W_3, ...} in Figure 3.3).
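The sketch below implements step 1 of Algorithm 3.1 by inverse-CDF sampling from the truncated normal full conditional, with the bounds l and u computed as in (3.3) (for instance with the latent_bounds helper sketched earlier); the function and argument names are ours.

    # Sketch of step 1 of Algorithm 3.1: update latent z[j, r] (observation j,
    # variable r) from its truncated normal full conditional on (l, u).
    update_latent <- function(z, K, j, r, l, u) {
      mu <- -sum(K[r, -r] * z[j, -r]) / K[r, r]   # conditional mean
      s  <- sqrt(1 / K[r, r])                     # conditional standard deviation
      # inverse-CDF sampling restricted to the interval (l, u)
      z[j, r] <- qnorm(runif(1, pnorm(l, mu, s), pnorm(u, mu, s)), mu, s)
      z
    }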

Fig. 3.3 Visualization of Algorithm 3.1. (Bottom left) The continuous time BDMCMC algorithm, where {W_1, W_2, ...} denote waiting times and {t_1, t_2, ...} denote jumping times. (Bottom right) The estimated posterior probabilities of the graphs, which are proportional to the sums of their waiting times.

Computing the birth and death rates

Calculating the birth and death rates (3.5) and (3.6) is the bottleneck of our BDMCMC algorithm. Here we explain how to calculate the death rates efficiently; the birth rates are calculated in a similar manner. Following (22), and after some simplification, for each e = (i, j) ∈ E we have

δ_e(K) = (P(G^{-e}) / P(G)) (I_G(b, D) / I_{G^{-e}}(b, D)) ( D*_jj / (2π (k_ii − k^1_11)) )^{1/2} H(K, D*),   (3.7)

where

H(K, D*) = exp{ −(1/2) [ tr(D*_{e,e} (K^0 − K^1)) − ( D*_ii − (D*_ij)^2 / D*_jj ) (k_ii − k^1_11) ] },

in which

K^0 = [ k_ii  0 ; 0  K_{j,V∖j} (K_{V∖j,V∖j})^{-1} K_{V∖j,j} ]  and  K^1 = K_{e,V∖e} (K_{V∖e,V∖e})^{-1} K_{V∖e,e}.

The computational bottleneck in (3.7) is the ratio of normalizing constants.

Dealing with the calculation of normalizing constants

Calculating the ratio of normalizing constants has been a major issue in recent literature (32, 35, 22). To compute the normalizing constant of a G-Wishart, (27) proposed an importance sampling algorithm, while (2) developed a Monte Carlo method. These methods can be computationally expensive and numerically unstable (11, 35). (35), (5), and (22) developed an alternative approach, which borrows ideas from the exchange algorithm (25) and the double Metropolis-Hastings algorithm (18) to compute the ratio of such normalizing constants; however, when the dimension of the problem is high, the curse of dimensionality may be a serious difficulty for the double MH sampler (18). More recently, (32) derived an explicit representation of the normalizing constant ratio.

Theorem 3.1 (Uhler et al. 32). Let G = (V, E) be an undirected graph and let G^{-e} = (V, E ∖ {e}) denote the graph G with one less edge e. Then

I_G(b, I_p) / I_{G^{-e}}(b, I_p) = 2 √π Γ((b + d + 1)/2) / Γ((b + d)/2),

where d denotes the number of triangles formed by the edge e and two other edges in G, and I_p denotes the p-dimensional identity matrix.

Proof. Immediate from Uhler et al. (32, Theorem 3.7).

Therefore, for the case D = I_p, we have a simplified expression for the death rates, given by

δ_e(K) = (P(G^{-e}) / P(G)) (Γ((b + d + 1)/2) / Γ((b + d)/2)) ( 2 D*_jj / (k_ii − k^1_11) )^{1/2} H(K, D*).
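Theorem 3.1 makes this ratio trivial to evaluate; a direct sketch, working on the log scale to avoid overflow of the gamma function for large b + d:

    # Normalizing-constant ratio I_G(b, I_p) / I_{G^{-e}}(b, I_p) of Theorem 3.1;
    # d is the number of triangles formed by the edge e in G.
    log_const_ratio <- function(b, d) {
      log(2) + 0.5 * log(pi) + lgamma((b + d + 1) / 2) - lgamma((b + d) / 2)
    }
    const_ratio <- function(b, d) exp(log_const_ratio(b, d))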

Sampling from the posterior distribution of the precision matrix

Several methods for sampling from a G-Wishart distribution have been proposed; for a review of existing methods see (35) and (16). Here we use the exact sampler algorithm developed by (16), summarized in Algorithm 3.2.

Algorithm 3.2 Direct sampler from the precision matrix (16). Given a graph G = (V, E) with precision matrix K:

1. Set Σ = K^{-1}.
2. Repeat for j = 1, ..., p, until convergence:
   2.1. Let N_j ⊂ V be the set of variables connected to j in G. Form Σ_{N_j} and K^{-1}_{N_j, j} and solve β̂*_j = Σ_{N_j}^{-1} K^{-1}_{N_j, j}.
   2.2. Form β̂_j ∈ R^{p−1} by placing zeroes in the locations not connected to j in G and padding the remaining locations with the elements of β̂*_j.
   2.3. Replace Σ_{−j, j} and Σ_{j, −j} with Σ_{−j, −j} β̂_j.
3. Return K = Σ^{-1}.

Simulation study

We perform a comprehensive simulation study with respect to different graph structures to evaluate the performance of our method and to compare it with the alternative approach proposed by Dobra and Lenkoski (7), referred to as DL. We generate mixed data from a latent Gaussian copula model with 5 different types of variables: Gaussian, non-Gaussian, ordinal, count, and binary. We performed all computations with our extended R package BDgraph (21, 23). Corresponding to different sparsity patterns, we consider 4 different kinds of synthetic graphical models:

1. Random graph: a graph in which the edge set E is randomly generated from independent Bernoulli distributions with probability 2/(p − 1), and the corresponding precision matrix is generated from K ~ W_G(3, I_p).
2. Cluster graph: a graph in which the number of clusters is max{2, [p/20]}. Each cluster has the same structure as a random graph, and the corresponding precision matrix is generated from K ~ W_G(3, I_p).
3. Scale-free graph: a graph with a power-law degree distribution generated by the Barabasi-Albert algorithm (1), with the corresponding precision matrix generated from K ~ W_G(3, I_p).
4. Hub graph: a graph in which every node is connected to one node, with the corresponding precision matrix generated from K ~ W_G(3, I_p).

For each graphical model, we consider four different scenarios: (1) dimension p = 10 and sample size n = 30, (2) p = 10 and n = 100, (3) p = 30 and n = 100, and (4) p = 30 and n = 500. For each mixed data set, we fit our method and the DL approach using a uniform prior for the graph and the G-Wishart prior W_G(3, I_p) for the precision matrix. We run the two algorithms from the same starting points for 100,000 iterations with 50,000 as burn-in. Computations for this example were performed in parallel on 235 batch nodes with 12 cores and 24 GB of memory, running Linux.
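A minimal sketch of one replication of this study, using the bdgraph.sim, bdgraph, and compare functions of the BDgraph package described in Chapter 4; the exact argument names (in particular the "mixed" data type and the "gcgm" method label) follow recent package documentation and may differ from the version used here.

    # Sketch of one simulation replication: generate mixed data from a latent
    # Gaussian copula model on a random graph, fit the copula BDMCMC sampler,
    # and compare the estimated graph with the truth.
    library(BDgraph)
    sim <- bdgraph.sim(n = 30, p = 10, graph = "random", type = "mixed")
    fit <- bdgraph(data = sim$data, method = "gcgm", iter = 100000, burnin = 50000)
    compare(sim, fit)   # true-versus-estimated graph summary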

To assess the performance of the graph structure learning, we compute the F_1-score measure (26) for the MAP graph, defined as

F_1-score = 2TP / (2TP + FP + FN),   (3.8)

where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively. The F_1-score lies between 0 and 1, where 1 stands for perfect identification and 0 for worst identification. We also use the mean square error (MSE), defined as

MSE = Σ_e ( p̂_e − I(e ∈ G_true) )^2,   (3.9)

where p̂_e is the posterior pairwise edge inclusion probability and I(e ∈ G_true) is an indicator function, such that I(e ∈ G_true) = 1 if e ∈ G_true and zero otherwise. For our BDMCMC algorithm, we calculate the posterior pairwise edge inclusion probabilities by Rao-Blackwellization (4, subsection 2.5): for each possible edge e = (i, j),

p̂_e = Σ_{t=1}^N I(e ∈ G^{(t)}) W(K^{(t)}) / Σ_{t=1}^N W(K^{(t)}),   (3.10)

where N is the number of iterations and W(K^{(t)}) is the waiting time in the graph G^{(t)} with the precision matrix K^{(t)}; see (22).

Table 3.1 reports comparisons of our method with DL (7), where we repeat the experiments 50 times and report the average F_1-score and MSE with their standard errors in parentheses. Our method performs well overall, as its F_1-score and MSE are better in most of the cases, mainly because of its faster convergence rate. As expected, the DL approach converges more slowly than our method. From a theoretical point of view, both algorithms converge to the true posterior distribution if run for a sufficient amount of time; thus, the results in this table mainly indicate how quickly the algorithms converge.
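A small sketch of these two measures, assuming G_est and G_true are p × p adjacency matrices and p_hat a matrix of the posterior edge inclusion probabilities (3.10); all names are ours.

    # Sketch of the F1-score (3.8) and MSE (3.9); only the upper triangle is
    # used, since the graphs are undirected.
    f1_and_mse <- function(G_est, G_true, p_hat) {
      ut <- upper.tri(G_true)
      tp <- sum(G_est[ut] == 1 & G_true[ut] == 1)
      fp <- sum(G_est[ut] == 1 & G_true[ut] == 0)
      fn <- sum(G_est[ut] == 0 & G_true[ut] == 1)
      c(F1  = 2 * tp / (2 * tp + fp + fn),
        MSE = sum((p_hat[ut] - G_true[ut])^2))
    }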

                          F1-score                    MSE
p   n    graph            BDMCMC       DL             BDMCMC       DL
10  30   Random           0.37 (0.17)  0.33 (0.11)     7.0 (2.0)    9.2 (1.1)
         Cluster          0.35 (0.16)  0.30 (0.11)     6.9 (1.8)    9.6 (1.4)
         Scale-free       0.31 (0.13)  0.34 (0.08)     7.6 (1.2)    9.7 (1.0)
         Hub              0.26 (0.11)  0.31 (0.10)     8.3 (0.8)   10.2 (0.8)
10  100  Random           0.33 (0.17)  0.32 (0.11)     7.7 (1.7)    9.9 (1.0)
         Cluster          0.30 (0.17)  0.28 (0.09)     6.8 (1.5)    9.5 (1.3)
         Scale-free       0.33 (0.16)  0.32 (0.12)     7.3 (1.4)    9.6 (1.0)
         Hub              0.26 (0.12)  0.31 (0.09)     8.3 (0.9)   10.0 (1.0)
30  100  Random           0.54 (0.06)  0.44 (0.04)    52.3 (9.9)   59.1 (8.7)
         Cluster          0.56 (0.05)  0.47 (0.04)    48.0 (6.5)   54.4 (8.1)
         Scale-free       0.53 (0.17)  0.30 (0.05)    27.7 (14.6)  25.8 (1.7)
         Hub              0.39 (0.08)  0.31 (0.04)    34.8 (6.8)   30.5 (4.5)
30  500  Random           0.79 (0.04)  0.63 (0.07)    25.8 (6.5)   41.1 (14.3)
         Cluster          0.79 (0.05)  0.66 (0.05)    26.3 (5.2)   35.1 (7.9)
         Scale-free       0.81 (0.07)  0.59 (0.06)     9.4 (3.2)   11.7 (3.0)
         Hub              0.73 (0.07)  0.53 (0.08)    12.7 (3.4)   13.5 (3.2)

Table 3.1 Summary of performance measures in the simulation study for our method and DL (7). The table presents the F_1-score, defined in (3.8), and the MSE, defined in (3.9), with 100 replications and standard deviations in parentheses. The F_1-score reaches its best value at 1 and its worst at 0. The MSE is non-negative, with 0 as its minimum; smaller is better. The best model for each of the F_1-score and MSE is boldfaced.

3.4 Analysis of Dupuytren disease dataset

The dataset we analyse here was collected from patients in the north of the Netherlands who have Dupuytren disease. Both hands of patients who were willing to participate and signed an informed consent form were examined for signs of Dupuytren disease and knuckle pads. Signs of Dupuytren disease include tethering of the skin, nodules, cords, and finger contractures in patients with cords. Participants who had at least one of these features were labeled as having Dupuytren disease. Severity of the disease is measured by the total angle of each of the 10 fingers; the total angle is the sum of the angles of the metacarpophalangeal joint and the two interphalangeal joints (the thumb has only two interphalangeal joints). As potential risk factors, information is additionally available about smoking habits, alcohol consumption, whether participants performed manual labor during a significant part of their life, and whether they had sustained hand injury in the past, including surgery. In addition, disease history information is available about the presence of Ledderhose disease, diabetes, epilepsy, Peyronie disease, knuckle pads, or liver disease, and about familial occurrence of Dupuytren disease (defined as a first-degree relative with Dupuytren disease).

The data consist of 279 patients who have Dupuytren disease (n = 279); among those patients, 79 have an irreversible flexion contracture in at least one of their fingers. The severity of the disease in all 10 fingers of the patients is measured by the total angle

of each finger. To study the potential phenotype risk factors of this disease, we consider the above-mentioned factors. (13) analyzed Dupuytren disease with a multivariate ordinal logit model, taking age and sex into account, and tested hypotheses of independence between groups of fingers. However, most of the studies on the phenotype of this disease have been observational studies without comprehensive statistical analyses. Phenotype risk factors previously described include alcohol consumption, smoking, manual labor, hand trauma, diabetes mellitus, and epilepsy (29, 14). In Subsection 3.4.1, we infer the Dupuytren disease network with 14 potential risk factors based on our Bayesian approach. In Subsection 3.4.2, we consider only the 10 fingers, to infer the interactions between the fingers.

3.4.1 Inference for Dupuytren disease with risk factors

We consider the severity of disease in all 10 fingers of the patients and 14 potential phenotype risk factors of the disease, so p = 24. The factors are: age, sex, smoking, amount of alcohol (Alcohol), familial occurrence (Relative), number of hand injuries (HandInjury), manual labour (Labour), Ledderhose disease (Ledderhose), diabetes (Diabetes), epilepsy (Epilepsy), liver disease (LiverDisease), Peyronie disease (Peyronie), and knuckle pads (Knucklepads). For each finger, we measure the angles of the metacarpophalangeal joint and the two interphalangeal joints (for the thumb we measure only the two interphalangeal joints), and we sum those angles for each finger. The total angle could vary from 0 to 270 degrees; in this dataset the minimum is 0 degrees and the maximum 157 degrees. The age of participants ranges from 40 to 89 years, with an average of 66 years. Smoking is binned into 3 ordered categories (never, stopped, and smoking). Alcohol consumption is binned into 8 ordered categories (ranging from no alcohol to 20 consumptions per week). The other variables are binary.

We apply our Bayesian framework to infer the conditional (in)dependence among the variables, in order to identify the potential risk factors of Dupuytren disease and discover how they affect the disease. We place a uniform distribution as an uninformative prior on the graph and the G-Wishart W_G(3, I_24) on the precision matrix. We run our BDMCMC algorithm for 2,000K iterations with a 1,000K sweep burn-in. The graph with the highest posterior probability is a graph with 42 edges. Figure 3.4 shows the selected graph with 26 edges, for which the posterior inclusion probabilities in (3.10) are greater than 0.5; the edges in the graph show the interactions among the variables. Figure 3.5 shows an image of all posterior inclusion probabilities for visualization.
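The following sketch shows how this analysis maps onto the BDgraph interface, assuming the mixed data are held in a data frame dupuytren with the 10 finger angles and the risk factors as columns; this data frame is a placeholder name of ours, not an object shipped with the package, and the "gcgm" method label follows recent package documentation.

    # Sketch of the Dupuytren analysis with the Gaussian copula graphical model.
    library(BDgraph)
    fit <- bdgraph(data = dupuytren, method = "gcgm", iter = 2e6, burnin = 1e6)
    phat(fit)                 # posterior edge inclusion probabilities (3.10)
    select(fit, cut = 0.5)    # edges with inclusion probability > 0.5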

Fig. 3.4 The inferred graph for the Dupuytren disease dataset, based on the 14 risk factors and the total degrees of flexion of all 10 fingers. It shows the selected graph with 24 significant edges, for which the posterior inclusion probabilities (3.10) are greater than 0.5.

The results (Figures 3.4 and 3.5) show that the factors Age, Ledderhose, HandInjury, and Alcohol have a significant effect on the severity of Dupuytren disease. Figure 3.4 also shows that the factor Age is a hub in this graph: it plays a significant role, as it affects the severity of the disease both directly and indirectly, through its influence on other risk factors such as Ledderhose.

3.4.2 Severity of Dupuytren disease between pairs of fingers

Here, we consider the severity of Dupuytren disease between pairs of the 10 hand fingers. The interaction between fingers is important because it helps surgeons decide whether they should operate on one finger or on multiple fingers simultaneously. The main idea is that if

fingers are almost independent in terms of the severity of Dupuytren disease, there is no reason to operate on them simultaneously.

Fig. 3.5 Image visualization of the posterior pairwise edge inclusion probabilities of all possible edges in the graph, for the 10 fingers and the 14 risk factors.

To apply our Bayesian framework to these 10 variables (fingers), we place a uniform distribution as an uninformative prior on the graph and the G-Wishart W_G(3, I_10) on the precision matrix. We run our BDMCMC algorithm for 2,000K iterations with a 1,000K burn-in. The graph with the highest posterior probability is a graph with 12 edges. Figure 3.6 shows the selected graph with 8 edges, for which the posterior inclusion probabilities in (3.10) are greater than 0.5; the edges show the interactions among the variables under consideration. Figure 3.7 shows an image of all posterior inclusion probabilities.

The results (Figures 3.6 and 3.7) show that there are significant co-occurrences of Dupuytren disease in the ring and middle fingers of both hands. Therefore, we can infer that the middle finger is substantially associated with the ulnar side of the hand.

Fig. 3.6 The inferred graph for the Dupuytren disease dataset based on the total degrees of flexion in all 10 fingers. It reports the selected graph with 9 significant edges, whose posterior inclusion probabilities (4.12) are more than 0.5.

Fig. 3.7 Image visualization of the posterior pairwise edge inclusion probabilities of all possible edges in the graph, for the 10 fingers.

Surprisingly, our result shows that there is a significant relationship between the middle fingers of both hands. This result supports the hypothesis that the disease has genetic or other biological factors that affect fingers simultaneously. Moreover, it also shows that the joint interactions between fingers in both hands are almost symmetric.
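Heatmaps such as Figures 3.5 and 3.7 can be drawn directly from the matrix of posterior edge inclusion probabilities. A minimal sketch, assuming output is the bdgraph fit for the 10 fingers and using only base R graphics:

# p_link: 10 x 10 matrix of posterior edge inclusion probabilities
p_link <- as.matrix( phat( output, round = 3 ) )

# grey-scale image of the probabilities (cf. Figure 3.7); darker means more probable
image( 1:10, 1:10, p_link, col = grey( seq( 1, 0, length = 64 ) ),
       xlab = "", ylab = "", main = "Posterior Edge Inclusion Probabilities" )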

3.4.3 Fit of model to Dupuytren data

Posterior predictive checks can be used to check whether the proposed Bayesian approach fits the Dupuytren data set. If the model fits the Dupuytren data set, then simulated data generated under the model should look like the observed data. Therefore, first, based on our estimated graph from the BDMCMC algorithm in section 3.4.1, we draw simulated values from the posterior predictive distribution of replicated data. Then, we compare the samples to our observed data. Any systematic differences between the simulations and the data indicate potential failings of the model. In this regard, and based on the results in subsection 3.4.1 (Figures 3.4 and 3.5), for both simulated and observed data we obtain the conditional distributions of the potential risk factors and fingers. We show the result for finger 4 in the right hand and the risk factor age in Figure 3.8, for finger 5 in the right hand and the risk factor relative in Figure 3.9, and for finger 2 in the right hand and the risk factor Ledderhose in Figure 3.10. Figure 3.8 plots the empirical and predictive distribution of finger 4 in the right hand conditional on the variable age in four categories {(40, 50), (50, 60), (60, 70), (70, 90)}. The variable finger 4 in the right hand is categorized, based on the Tubiana classification, into 5 categories (category 1: total angle of 0 degrees; 2: angle in (1, 45); 3: angle in (46, 90); 4: angle in (90, 135); 5: angle of more than 135 degrees). Figure 3.9 plots the empirical and predictive distribution of finger 5 in the right hand conditional on the variable Relative. Figure 3.10 plots the empirical and predictive distribution of finger 2 in the right hand conditional on the variable Ledderhose. Figures 3.9 and 3.10 show that the fit is good, since the predicted conditional distributions are, in general, the same as the empirical distributions.

3.5 Conclusion

In this paper we have proposed a Bayesian methodology for discovering the effects of potential risk factors of Dupuytren disease and the underlying relationships between fingers simultaneously. The results of the case study clearly demonstrate that age, alcohol, relative, and Ledderhose disease all affect Dupuytren disease directly. Other risk factors only affect Dupuytren disease indirectly. Another important result is that the severity of the Dupuytren disease in the fingers is correlated, in particular between the middle finger and the ring finger. This implies that

Fig. 3.8 Empirical and predictive conditional distributions of the total angle of finger 4 in the right hand, conditional on four categories of the variable age.

Fig. 3.9 Empirical and predictive conditional distributions of the total angle of finger 5 in the right hand, conditional on the variable Relative.

a surgical intervention on either the ring or the middle finger should preferably be executed on both fingers simultaneously.

Fig. 3.10 Empirical and predictive conditional distributions of the total angle of finger 2 in the right hand, conditional on the Ledderhose disease variable.

Of course, our proposed Bayesian methodology is not limited to this type of data. It can potentially be applied to any kind of mixed data where the observed variables are binary, ordinal or continuous. Our method does not, however, work with discrete variables that are neither binary nor ordinal. We compare our Bayesian approach with an alternative Bayesian approach (7) using a simulation study on various types of network structures. Although both approaches converge to the same posterior distribution, our approach has some clear advantages in finite MCMC runs. This difference is mainly due to our implementation of a computationally efficient algorithm. Our method is computationally more efficient for two reasons. Firstly, our sampling algorithm is based on a birth-death process, which is much more efficient than the RJMCMC implemented in (7). Secondly, based on the theory derived by (32), we use exact values for the ratio of normalizing constants, which had been a computational bottleneck in the Bayesian approach. Moreover, we implemented the code for our method in C++, linked to R; it is freely available online in the R package BDgraph at http://cran.r-project.org/packages=BDgraph.

References

[1] Albert, R. and Barabási, A.-L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47.
[2] Atay-Kayis, A. and Massam, H. (2005). A Monte Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92(2).
[3] Bayat, A. and McGrouther, D. (2006). Management of Dupuytren's disease: clear advice for an elusive condition. Annals of the Royal College of Surgeons of England, 88(1):3.
[4] Cappé, O., Robert, C., and Rydén, T. (2003). Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3).
[5] Cheng, Y., Lenkoski, A., et al. (2012). Hierarchical Gaussian graphical models: Beyond reversible jump. Electronic Journal of Statistics, 6.
[6] Dempster, A. (1972). Covariance selection. Biometrics, 28(1).
[7] Dobra, A. and Lenkoski, A. (2011). Copula Gaussian graphical models and their application to modeling functional disability data. The Annals of Applied Statistics, 5(2A).
[8] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496).
[9] Genest, C., Ghoudi, K., and Rivest, L.-P. (1995). A semiparametric estimation procedure of dependence parameters in multivariate families of distributions. Biometrika, 82(3).
[10] Hoff, P. D. (2007). Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics.
[11] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4).

[12] Lanting, R., Broekstra, D. C., Werker, P. M., and van den Heuvel, E. R. (2014a). A systematic review and meta-analysis on the prevalence of Dupuytren disease in the general population of Western countries. Plastic and Reconstructive Surgery, 133(3).
[13] Lanting, R., Nooraee, N., Werker, P., and van den Heuvel, E. (2014b). Patterns of Dupuytren disease in fingers; studying correlations with a multivariate ordinal logit model. Plastic and Reconstructive Surgery.
[14] Lanting, R., van den Heuvel, E. R., Westerink, B., and Werker, P. M. (2013). Prevalence of Dupuytren disease in the Netherlands. Plastic and Reconstructive Surgery, 132(2).
[15] Lauritzen, S. (1996). Graphical Models, volume 17. Oxford University Press, USA.
[16] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1).
[17] Lenkoski, A. and Dobra, A. (2011). Computational aspects related to inference in Gaussian graphical models with the G-Wishart prior. Journal of Computational and Graphical Statistics, 20(1).
[18] Liang, F. (2010). A double Metropolis-Hastings sampler for spatial models with intractable normalizing constants. Journal of Statistical Computation and Simulation, 80(9).
[19] Meyerding, H. W. (1936). Dupuytren's contracture. Archives of Surgery, 32(2).
[20] Milner, R. (2003). Dupuytren's disease affecting the thumb and first web of the hand. Journal of Hand Surgery (British and European Volume), 28(1).
[21] Mohammadi, A. and Wit, E. (2015a). BDgraph: Graph Estimation Based on Birth-Death MCMC Approach. R package.
[22] Mohammadi, A. and Wit, E. C. (2015b). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1).
[23] Mohammadi, A. and Wit, E. C. (2015c). BDgraph: Bayesian structure learning of graphs in R. arXiv preprint, v2.
[24] Muirhead, R. (1982). Aspects of Multivariate Statistical Theory, volume 42. Wiley Online Library.

[25] Murray, I., Ghahramani, Z., and MacKay, D. (2012). MCMC for doubly-intractable distributions. arXiv preprint.
[26] Powers, D. M. (2011). Evaluation: from precision, recall and F-measure to ROC, informedness, markedness & correlation. Journal of Machine Learning Technologies, 2(1).
[27] Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3).
[28] Scutari, M. (2013). On the prior and posterior distributions used in graphical modelling. Bayesian Analysis, 8(1):1-28.
[29] Shih, B. and Bayat, A. (2010). Scientific understanding and clinical management of Dupuytren disease. Nature Reviews Rheumatology, 6(12).
[30] Sklar, M. (1959). Fonctions de répartition à n dimensions et leurs marges. Université Paris 8.
[31] Tubiana, R., Simmons, B., and DeFrenne, H. (1982). Location of Dupuytren's disease on the radial aspect of the hand. Clinical Orthopaedics and Related Research, (168):222.
[32] Uhler, C., Lenkoski, A., and Richards, D. (2014). Exact formulas for the normalizing constants of Wishart distributions for graphical models. arXiv preprint.
[33] Wang, H. (2012). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4).
[34] Wang, H. (2014). Scaling it up: Stochastic search structure learning in graphical models.
[35] Wang, H. and Li, S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6.
[36] Wang, H. and Pillai, N. S. (2013). On a class of shrinkage priors for covariance matrix estimation. Journal of Computational and Graphical Statistics, 22(3).
[37] Wong, F., Carter, C. K., and Kohn, R. (2003). Efficient estimation of covariance selection models. Biometrika, 90(4).

Chapter 4

BDgraph: An R Package for Bayesian Structure Learning in Graphical Models

4.1 Abstract

We introduce an R package BDgraph which performs Bayesian structure learning in high-dimensional graphical models with either continuous or discrete variables. The package efficiently implements recent improvements in the Bayesian literature, including (24) and Mohammadi et al. (21). The core of the BDgraph package consists of two main MCMC sampling algorithms efficiently implemented in C++ to maximize computational speed. In this paper, we give a brief overview of the methodology and illustrate the package's functionality in both toy examples and applications.

Key words: Bayesian structure learning, Gaussian graphical models, Gaussian copula, Covariance selection, Birth-death process, Markov chain Monte Carlo, G-Wishart, R.

4.2 Introduction

Graphical models (17) can be used to describe the conditional independence relationships among large numbers of variables. In graphical models, each random variable is associated with a node in a graph, and links represent conditional dependencies between variables, whereas the absence of a link implies that the variables are independent conditional on the rest of the variables (the pairwise Markov property). In recent years, significant progress has been made in designing efficient algorithms to discover graph structures from high-dimensional multivariate data (7, 15, 24, 21, 9, 20, 26). In this regard, Bayesian approaches provide a principled alternative to various penalized approaches.

In this paper, we describe the BDgraph package (23) in R (28) for Bayesian structure learning in undirected graphical models. The package can deal with Gaussian, non-Gaussian, discrete and mixed data sets. The package includes several functional modules, including data generation for simulation, a search algorithm, a graph estimation routine, a convergence check and a visualization tool; see Figure 4.1. The package efficiently implements recent improvements in the Bayesian literature, including (24, 21, 19, 32, 13). For a Bayesian framework of Gaussian graphical models, we implement the method developed by (24), and for Gaussian copula graphical models we use the method described by (21). To make our Bayesian methods computationally feasible for high-dimensional data, we efficiently implement the BDgraph package in C++ linked to R. To make the package easy to use, we use the S3 class. The package is available under the general public license (GPL 3) from the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org/packages=BDgraph.

In the Bayesian literature, BDgraph is the only software available online for Gaussian graphical models and Gaussian copula graphical models. In the frequentist literature, on the other hand, existing packages include huge (34), glasso (10), bnlearn (30), pcalg (16) and grain (14). We compare the performance of our package with several of these packages.

The article is organized as follows. In Section 4.3 we illustrate the user interface of the BDgraph package. In Section 4.4 we explain some methodological background of the package. In this regard, in Section 4.4.1, we briefly explain the Bayesian framework for Gaussian graphical models for continuous data. In Section 4.4.2 we briefly describe the Bayesian framework of Gaussian copula graphical models for data that do not follow the Gaussianity assumption, such as non-Gaussian continuous, discrete or mixed data. In Section 4.5 we describe the main functions implemented in the BDgraph package.

In Section 4.6, by a simple simulation example, we explain the user interface and the performance of the package, and we compare it with the huge package. In Section 4.7, by using the functions implemented in the BDgraph package, we study two real data sets.

4.3 User interface

In the R environment, we can access and load the BDgraph package by using the following commands

R> install.packages( "BDgraph" )
R> library( "BDgraph" )

Loading the package automatically loads the igraph (6) and Matrix (3) packages, since the BDgraph package depends on them. These packages are available on the Comprehensive R Archive Network (CRAN) at http://cran.r-project.org. We use the igraph package for graph visualization and the Matrix package for memory optimization using sparse matrix output. In our package, all the functions operate on or return Matrix objects.

To maximize computational speed, we efficiently implement the BDgraph package with C++ code linked to R. For the C++ code, we use the highly optimized LAPACK (1) and BLAS (18) linear algebra libraries on systems that provide them. We significantly improve program speed by using these libraries.

We designed the BDgraph package to provide a Bayesian framework for high-dimensional undirected graph estimation for different types of data sets, such as continuous, discrete or mixed data. The package facilitates a flexible pipeline for analysis by three functional modules; see Figure 4.1. These modules are as follows:

Module 1. Data simulation: Function bdgraph.sim simulates multivariate Gaussian, discrete and mixed data with different undirected graph structures, including random, cluster, hub, scale-free, circle and fixed graphs. Users can set the sparsity of the graph structure. Users can generate mixed data, including count, ordinal, binary, Gaussian and non-Gaussian variables.

Module 2. Method: The function bdgraph provides two estimation methods with two different algorithms for sampling from the posterior distributions:

1. Graph estimation for normally distributed data, based on Gaussian graphical models (GGMs), using the birth-death MCMC sampling algorithm described in (24).

Fig. 4.1 Configuration of the BDgraph package, which includes three main parts: (M1) data generation, (M2) the algorithm for sampling from the joint posterior distribution and (M3) several functions for checking the convergence of the BDMCMC algorithm, estimating the true graph, comparison and model checking, and graph visualization.

2. Graph estimation for non-Gaussian, discrete, and mixed data, based on Gaussian copula graphical models (GCGMs), using the birth-death MCMC sampling algorithm described in (21).

Module 3. Result: This module includes four types of functions, as follows:

Convergence check: The functions plotcoda and traceplot provide several visualization plots to monitor the convergence of the sampling algorithms.

Graph selection: The functions select, phat and prob provide the selected graph, the posterior link inclusion probabilities and the posterior probability of each graph, respectively.

Comparison and goodness-of-fit: The functions compare and plotroc provide several comparison measures and a ROC plot for model comparison.

Visualization: The plotting functions plot.bdgraph and plot.simulate provide visualizations of the simulated data and estimated graphs.

4.4 Methodological background

In Section 4.4.1, we briefly explain the Gaussian graphical model for normal data. Then we explain the birth-death MCMC algorithm for sampling from the joint posterior distribution in Gaussian graphical models; for more detail see (24). In Section 4.4.2, we briefly describe the Gaussian copula graphical model, which can deal with non-Gaussian, discrete or mixed data. Then we explain the birth-death MCMC algorithm designed for the Gaussian copula graphical models; for more detail see (21).

4.4.1 Bayesian Gaussian graphical models

In graphical models, each variable is associated with a node, and conditional dependence relationships among random variables are presented as a graph G = (V, E), in which V = {1, 2, ..., p} specifies a set of nodes and E ⊆ V × V a set of existing links (17). Our focus here is on undirected graphs, where (i, j) ∈ E is equivalent to (j, i) ∈ E. The absence of a link between two nodes specifies the pairwise conditional independence of those two variables given the remaining variables, while a link between two variables indicates their conditional dependence.

In Gaussian graphical models (GGMs), we assume the observed data follow a multivariate Gaussian distribution N_p(μ, K^{-1}); here we assume μ = 0. Let Z = (Z^{(1)}, ..., Z^{(n)})^T be the observed data of n independent samples; then the likelihood function is

\[ \Pr(Z \mid K, G) \propto |K|^{n/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(KU) \right\}, \tag{4.1} \]

where U = Z^T Z. In GGMs, conditional independence is implied by the form of the precision matrix: by the pairwise Markov property, variables i and j are conditionally independent given the remaining variables if and only if K_{ij} = 0. This property implies that the links in the graph G = (V, E) correspond to the nonzero elements of the precision matrix K, that is, E = {(i, j) : K_{ij} ≠ 0}. Given a graph G, the precision matrix K is constrained to the cone P_G of symmetric positive definite matrices with elements K_{ij} equal to zero for all (i, j) ∉ E.

We consider the G-Wishart distribution W_G(b, D) as a prior distribution for the precision matrix K, with density

\[ \Pr(K \mid G) = \frac{1}{I_G(b, D)} |K|^{(b-2)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(DK) \right\} \mathbb{1}(K \in P_G), \tag{4.2} \]

where b > 2 is the degrees of freedom, D is a symmetric positive definite matrix, I_G(b, D) is the normalizing constant with respect to the graph G, and 1(x) evaluates to 1 if x holds and to 0 otherwise. The G-Wishart distribution is a well-known prior for the precision matrix, since it is the conjugate prior for normally distributed data. When G is complete, the G-Wishart distribution reduces to the standard Wishart distribution; hence, its normalizing constant has an explicit form (25). Also, for decomposable graphs, the normalizing constant has an explicit form (29). However, for non-decomposable graphs, the normalizing constant does not have an explicit form.

Since the G-Wishart prior is conjugate to the likelihood (4.1), the posterior

distribution of K is

\[ \Pr(K \mid Z, G) = \frac{1}{I_G(b^*, D^*)} |K|^{(b^*-2)/2} \exp\left\{ -\frac{1}{2} \mathrm{tr}(D^* K) \right\}, \]

where b^* = b + n and D^* = D + U; that is, W_G(b^*, D^*).

Direct sampler from G-Wishart

Several sampling methods from the G-Wishart distribution have been proposed; for a review of existing methods see Wang and Li (33). More recently, Lenkoski (19) developed an exact sampling algorithm for G-Wishart variates, which borrows an idea from Hastie et al. (12). The algorithm is as follows.

Algorithm 1: Given a graph G = (V, E) with precision matrix K and Σ = K^{-1}
1: Set Ω = Σ
2: repeat
3:   for i = 1, ..., p do
4:     Let N_i ⊆ V be the set of neighbours of node i in G. Form Ω_{N_i} and Σ_{N_i, i} and solve β̂*_i = Ω_{N_i}^{-1} Σ_{N_i, i}
5:     Form β̂_i ∈ R^{p-1} by padding the elements of β̂*_i into the appropriate locations, with zeros in those locations not connected to i in G
6:     Update Ω_{i, -i} and Ω_{-i, i} with Ω_{-i, -i} β̂_i
7:   end for
8: until convergence
9: return K = Ω^{-1}

In the BDgraph package, we use Algorithm 1 to sample from the posterior distribution of the precision matrix. In addition, we implement the algorithm for general purposes in our package as the function rgwish; see the R code below for an illustration.

R> G <- toeplitz( c( 0, 1, rep( 0, 2 ) ) )
R> G   # adjacency matrix of a graph with 4 nodes and 3 links
     [,1] [,2] [,3] [,4]
[1,]    0    1    0    0
[2,]    1    0    1    0
[3,]    0    1    0    1
[4,]    0    0    1    0

R> sample <- rgwish( n = 1, G = G, b = 3, D = diag(4) )
R> round( sample, 2 )   # a 4 x 4 draw from W_G(3, I_4); the entries vary from run to run

BDMCMC algorithm for GGMs

Consider the joint posterior distribution of K and the graph G, given by

\[ \Pr(K, G \mid Z) \propto \Pr(Z \mid K)\, \Pr(K \mid G)\, \Pr(G). \tag{4.3} \]

For the graph prior, as a non-informative prior, we consider a uniform distribution over the graph space, Pr(G) ∝ 1. For the prior distribution of K conditional on the graph G, we consider the G-Wishart W_G(b, D).

Here, we consider the computationally efficient birth-death MCMC sampling algorithm proposed by (24) for Gaussian graphical models. The algorithm is based on a continuous-time birth-death Markov process in which the algorithm explores the graph space by adding or removing a link in a birth or death event. In the birth-death process, at a particular pair of graph G = (V, E) and precision matrix K, each link dies independently of the rest as a Poisson process with death rate δ_e(K). Since the links are independent, the overall death rate is δ(K) = Σ_{e ∈ G} δ_e(K). Birth rates β_e(K) for e ∉ G are defined similarly, so the overall birth rate is β(K) = Σ_{e ∉ G} β_e(K). Since the birth and death events are independent Poisson processes, the time between two successive events is exponentially distributed with mean 1/(β(K) + δ(K)). The time between successive events can be considered as the support for a particular instance of the graph G. The probabilities of the birth and death events are

\[ \Pr(\text{birth of link } e) = \frac{\beta_e(K)}{\beta(K) + \delta(K)}, \quad \text{for each } e \notin G, \tag{4.4} \]

\[ \Pr(\text{death of link } e) = \frac{\delta_e(K)}{\beta(K) + \delta(K)}, \quad \text{for each } e \in G. \tag{4.5} \]

The births and deaths of links occur in continuous time, with rates determined by the stationary distribution of the process. The algorithm is designed in such a way that the stationary distribution equals the target joint posterior distribution of the graph and the precision matrix (4.3). Mohammadi and Wit (24, section 3) prove that, if the birth and death rates are taken as the following ratios of joint posterior distributions, the birth-death MCMC sampling algorithm converges to the target joint posterior distribution of the graph and the precision matrix:

\[ \beta_e(K) = \frac{\Pr(G^{+e}, K^{+e} \mid Z)}{\Pr(G, K \mid Z)}, \quad \text{for each } e \notin G, \tag{4.6} \]

\[ \delta_e(K) = \frac{\Pr(G^{-e}, K^{-e} \mid Z)}{\Pr(G, K \mid Z)}, \quad \text{for each } e \in G, \tag{4.7} \]

in which G^{+e} = (V, E ∪ {e}) and K^{+e} ∈ P_{G^{+e}}, and similarly G^{-e} = (V, E \ {e}) and K^{-e} ∈ P_{G^{-e}}. Based on the above rates, we determine our BDMCMC algorithm as below.

Algorithm 2: Given a graph G = (V, E) with a precision matrix K
for N iterations do
  1. Sample from the graph, based on the birth-death process:
     1.1. Calculate the birth rates by (4.6) and β(K) = Σ_{e ∉ G} β_e(K)
     1.2. Calculate the death rates by (4.7) and δ(K) = Σ_{e ∈ G} δ_e(K)
     1.3. Calculate the waiting time by W(K) = 1/(β(K) + δ(K))
     1.4. Simulate the type of jump (birth or death) by (4.4) and (4.5)
  2. Sample from the precision matrix, using Algorithm 1.
end for

We design our BDMCMC sampling algorithm in such a way that we sample from (G, K) at each step of jumping to a new state, e.g. at the jump times {t_1, t_2, ...} in Figure 4.2. For efficient inference, we compute the sample means based on the Rao-Blackwellized estimator (5, subsection 2.5); see e.g. (4.12). Note that our main aim is to estimate the posterior distribution of the graphs given the data, Pr(G | data). With the Rao-Blackwellized estimator, the estimated posterior probability of each graph is proportional to its total waiting time; see Figure 4.2.
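To make the Rao-Blackwellized estimator concrete, the following sketch computes the estimated posterior probability of each visited graph from hypothetical BDMCMC output; here graphs and waits are stand-ins for the sampled graph identifiers and the waiting times W(K^(t)), not actual fields of the package output.

# hypothetical BDMCMC output: graph visited at each jump and its waiting time
graphs <- c( "g1", "g2", "g1", "g3", "g1" )   # sampled graph identifiers
waits  <- c( 0.8, 0.1, 1.2, 0.3, 0.6 )        # waiting times W(K^(t))

# Rao-Blackwellized estimate: Pr(G | data) is proportional to total waiting time
post_graph <- tapply( waits, graphs, sum ) / sum( waits )
round( post_graph, 3 )
#    g1    g2    g3
# 0.867 0.033 0.100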

Fig. 4.2 Visualization of Algorithm 2. The left side shows the continuous-time BDMCMC sampling algorithm, where {W_1, W_2, ...} denote waiting times and {t_1, t_2, ...} denote jump times. The right side shows the estimated posterior probabilities of the graphs, which are proportional to their total waiting times, according to the Rao-Blackwellized estimator.

4.4.2 Gaussian copula graphical models

In practice, we encounter both discrete and continuous variables; Gaussian copula graphical modelling has been proposed to describe dependencies between such heterogeneous variables. Let Y (the observed data) be a collection of continuous, binary, ordinal or count variables, with F_j the marginal distribution of Y_j and F_j^{-1} its pseudo-inverse. Towards constructing a joint distribution of Y, we introduce multivariate Gaussian latent variables as follows:

\[ Z_1, \ldots, Z_n \overset{iid}{\sim} \mathcal{N}(0, \Gamma(K)), \qquad Y_{ij} = F_j^{-1}(\Phi(Z_{ij})), \tag{4.8} \]

where Γ(K) is the correlation matrix for a given precision matrix K. The joint distribution of Y is given by

\[ \Pr(Y_1 \le y_1, \ldots, Y_p \le y_p) = C(F_1(y_1), \ldots, F_p(y_p) \mid \Gamma(K)), \tag{4.9} \]

where C(·) is the Gaussian copula given by

\[ C(u_1, \ldots, u_p \mid \Gamma) = \Phi_p\left( \Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_p) \mid \Gamma \right), \]

with u_v = F_v(y_v); here Φ_p(·) is the cumulative distribution function of the multivariate Gaussian distribution and Φ(·) that of the univariate Gaussian distribution. It follows that Y_v = F_v^{-1}(Φ(Z_v)) for v = 1, ..., p. If all variables are continuous, then the margins are unique; thus zeros in K imply conditional independence, as in Gaussian graphical models (13). For discrete variables, the margins are not unique but still well-defined (27). In semiparametric copula estimation, the marginals are treated as nuisance parameters and estimated by the rescaled empirical distribution. The joint distribution in (4.9) is then parametrized only by the correlation matrix of the Gaussian copula. We are interested in inferring the underlying graph structure of the observed variables Y implied by the continuous latent variables Z. Since Z is unobservable, we follow the idea of Hoff (13) and relate it to the observed data as follows. Given the observed data Y from a sample of n observations, we constrain the latent samples Z to belong to the set

\[ D(Y) = \{ Z \in \mathbb{R}^{n \times p} : L_j^r(Z) < z_j^{(r)} < U_j^r(Z), \; r = 1, \ldots, n; \; j = 1, \ldots, p \}, \]

where

\[ L_j^r(Z) = \max\left\{ z_j^{(s)} : y_j^{(s)} < y_j^{(r)} \right\} \quad \text{and} \quad U_j^r(Z) = \min\left\{ z_j^{(s)} : y_j^{(r)} < y_j^{(s)} \right\}. \tag{4.10} \]

Following (13), we infer on the latent space by substituting the observed data Y with the event D(Y), and define the likelihood as

\[ \Pr(Y \mid K, G, F_1, \ldots, F_p) = \Pr(Z \in D(Y) \mid K, G)\, \Pr(Y \mid Z \in D(Y), K, G, F_1, \ldots, F_p). \]

The only part of the observed-data likelihood relevant for inference on K is Pr(Z ∈ D(Y) | K, G). Thus, the likelihood function is given by

\[ \Pr(Z \in D(Y) \mid K, G) = \int_{D(Y)} \Pr(Z \mid K, G)\, dZ, \tag{4.11} \]

where Pr(Z | K, G) is defined in (4.1).

BDMCMC algorithm for GCGMs

The joint posterior distribution of the graph G and the precision matrix K for our Gaussian copula graphical model is as follows:

\[ \Pr(K, G \mid Z \in D(Y)) \propto \Pr(K, G)\, \Pr(Z \in D(Y) \mid K, G). \]

Sampling from this posterior distribution can be done using the birth-death MCMC algorithm. (21) have developed and extended the birth-death MCMC algorithm to the more general case of GCGMs. We summarize their algorithm as follows.

Algorithm 3: Given a graph G = (V, E) with a precision matrix K
for N iterations do
  1. Sample the latent data. For each r ∈ V and j ∈ {1, ..., n}, update the latent value from its full conditional distribution,
     \[ Z_r^{(j)} \mid Z_{V \setminus \{r\}}^{(j)} = z_{V \setminus \{r\}}^{(j)}, K \;\sim\; \mathcal{N}\left( -\sum_{r' \neq r} K_{r r'} z_{r'}^{(j)} / K_{rr}, \; 1/K_{rr} \right), \]
     truncated to the interval [L_j^r(Z), U_j^r(Z)] in (4.10).
  2. Sample from the graph: the same as step 1 of Algorithm 2.
  3. Sample from the precision matrix, using Algorithm 1.
end for

In each iteration of Algorithm 3, we first sample the latent variables Z conditional on the observed data Y. The other steps are the same as in Algorithm 2.

Remark. When all variables are continuous, we do not need to sample the latent variables in each iteration, since all margins in the Gaussian copula are unique. Therefore, in that case, we first transform the non-Gaussian data to Gaussian data based on the Gaussian copula approach and then run Algorithm 2.

4.5 The BDgraph environment

The BDgraph package provides a set of comprehensive tools related to Bayesian graphical models; here, we describe the essential functions available in the package.

4.5.1 Description of the bdgraph function

The main function of the package is bdgraph, which includes two Bayesian frameworks (GGMs and GCGMs). This function, based on an underlying sampling engine (Algorithms 2 and 3), takes the model definition and returns a sequence of samples from the joint posterior distribution of the graph and precision matrix (4.3), given the supplied data. By using the function

bdgraph( data, n = NULL, method = "ggm", iter = 5000, burnin = iter / 2,
         b = 3, D = NULL, Gstart = "empty" )

we can sample from our target joint posterior distribution. bdgraph returns an object of the S3 class type bdgraph. The functions plot, print and summary work with objects of this class. The input data can be a matrix or a data.frame (n × p), or a covariance matrix; it can also be an object of class simulate, which is the output of the function bdgraph.sim.

The argument method determines the type of BDMCMC algorithm (Algorithms 2 and 3). The option ggm implements the BDMCMC algorithm based on Gaussian graphical models (Algorithm 2); it is designed for data that satisfy the Gaussianity assumption. The option gcgm implements the BDMCMC algorithm based on Gaussian copula graphical models (Algorithm 3); it is designed for data that do not satisfy the Gaussianity assumption, such as non-Gaussian continuous, discrete or mixed data. This option can deal with missing data.

4.5.2 Description of the bdgraph.sim function

The function bdgraph.sim is designed to simulate different types of data sets with different graph structures. The function

bdgraph.sim( n = 2, p = 10, graph = "random", size = NULL, prob = 0.2,
             class = NULL, type = "Gaussian", cut = 4, b = 3, D = diag(p),
             K = NULL, sigma = NULL, mean = 0, vis = FALSE )

can simulate multivariate Gaussian, non-Gaussian, discrete and mixed data with different undirected graph structures, including random, cluster, hub, scale-free, circle and fixed graphs. Users can determine the sparsity level of the obtained graph via the option prob. With this function, users can generate mixed data from count, ordinal, binary, Gaussian and non-Gaussian distributions. bdgraph.sim returns an object of the S3 class type simulate; the plot, print and summary functions work with this object type.

4.5.3 Description of the plotcoda and traceplot functions

In general, convergence in MCMC approaches can be difficult to evaluate. From a theoretical point of view, the sampling distribution will converge to the target joint posterior distribution as the number of iterations increases to infinity. We normally have little theoretical insight into how quickly convergence kicks in; therefore we rely on post hoc

testing of the sampled output. In general, the sample is divided into two parts: a burn-in part and the remainder, in which the chain is considered to have converged sufficiently close to the target posterior distribution. Two questions then arise: how many samples are sufficient, and how long should the burn-in period be?

The functions plotcoda and traceplot are two visualization functions in the BDgraph package for checking the convergence of the BDMCMC algorithm. The function

plotcoda( output, thin = NULL, main = NULL, links = TRUE, ... )

provides the traces of the posterior inclusion probabilities of all possible links, to check the convergence of the BDMCMC algorithm. The function

traceplot( output, acf = FALSE, pacf = FALSE, main = NULL, ... )

provides the trace of the graph size, to check the convergence of the BDMCMC algorithm. The input of these two functions is the output of the bdgraph function, i.e. the sample drawn from the joint posterior distribution of the graph and precision matrix.

4.5.4 Description of the phat and select functions

In the BDgraph package, the functions phat and select provide tools for statistical inference from the samples drawn from the joint posterior distribution. The function

phat( output, round = 3 )

provides the estimated posterior link inclusion probabilities for all possible links. These probabilities, for each possible link e = (i, j) in graph G, can be calculated by Rao-Blackwellization (5, subsection 2.5) as

\[ \Pr(e \in G \mid \text{data}) = \frac{\sum_{t=1}^{N} \mathbb{1}(e \in G^{(t)})\, W(K^{(t)})}{\sum_{t=1}^{N} W(K^{(t)})}, \tag{4.12} \]

where N is the number of iterations and W(K^{(t)}) is the waiting time of graph G^{(t)} with precision matrix K^{(t)}. The function

select( output, cut = NULL, vis = FALSE )

provides the inferred graph, by default the graph with the highest posterior probability. With the option cut, users can instead select the inferred graph based on the estimated posterior link inclusion probabilities.
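As a usage illustration, assuming output is a fitted bdgraph object, the two selection routes might look as follows:

# posterior link inclusion probabilities, rounded to two digits
p_link <- phat( output, round = 2 )

# route 1: graph with the highest posterior probability (the default)
graph_map <- select( output )

# route 2: graph keeping all links with inclusion probability above 0.9
graph_cut <- select( output, cut = 0.9, vis = TRUE )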

4.5.5 Description of the compare and plotroc functions

The functions compare and plotroc are designed to check and compare the performance of the selected graph, which is particularly useful for simulation studies. With the function

compare( G, est, est2 = NULL, colnames = NULL, vis = FALSE )

we can check the performance of our BDMCMC algorithm and compare it with alternative approaches. This function provides several measures, such as the balanced F-score measure (2), defined as

\[ F_1\text{-score} = \frac{2\mathrm{TP}}{2\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{4.13} \]

where TP, FP and FN are the numbers of true positives, false positives and false negatives, respectively. The F_1-score lies between 0 and 1, where 1 corresponds to perfect identification and 0 to the worst case. The function

plotroc( G, prob, prob2 = NULL, cut = 20, smooth = FALSE )

provides a ROC plot for visual comparison based on the estimated posterior link inclusion probabilities (4.12).

4.6 User interface by toy example

In this section, we illustrate the user interface of the BDgraph package with a simple simulation example. We perform all the computations on an Intel(R) Core(TM) i5 CPU 2.67GHz processor. By using the function bdgraph.sim, we simulate 60 observations (n = 60) from a multivariate Gaussian distribution with 8 variables (p = 8) and a scale-free graph structure, as below

R> data.sim <- bdgraph.sim( n = 60, p = 8, graph = "scale-free", type = "Gaussian" )
R> round( head( data.sim $ data, 4 ), 2 )   # first four rows of the simulated data
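The simulated object also carries the true graph, which can be inspected before fitting. A small sketch (the call to set.seed is not in the original example and is only to make the illustration reproducible):

set.seed( 10 )   # for reproducibility of the simulated data
data.sim <- bdgraph.sim( n = 60, p = 8, graph = "scale-free", type = "Gaussian" )

# visualize the true scale-free graph underlying the simulated data
plot( data.sim )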

4.6.1 Running the BDMCMC algorithm

Since the generated data satisfy the Gaussianity assumption, we run the BDMCMC algorithm based on Gaussian graphical models. Therefore, we choose method = "ggm", as follows

R> output.ggm <- bdgraph( data = data.sim, method = "ggm", iter = 50000 )

Running this function takes around 5 seconds, which is computationally fast, as the main sampling part is implemented in C++. Since the function bdgraph returns an object of an S3 class, users can see the summary results of the BDMCMC algorithm as follows

R> summary( output.ggm )
$selected_graph
...
$phat
...
$Khat
...

The summary results are the adjacency matrix of the graph with the highest posterior probability (selected_graph), the estimated posterior probabilities of all possible links (phat) and the estimated precision matrix (Khat). In addition, the function summary reports a visualization summary of the results, as we can see in Figure 4.3. In the top-left is the graph with the highest posterior probability. In the top-right are the estimated posterior probabilities of all the graphs visited by the BDMCMC algorithm; it indicates that our algorithm visits more than 600 different graphs, and it reports the posterior probability of the selected graph. In the bottom-left are the estimated posterior probabilities of the sizes of the graphs; it indicates that our algorithm visited mainly graphs with between 6 and 14 links. In the bottom-right is the trace of our algorithm based on the size of the graphs.

4.6.2 Convergence check

In our simulation example, we run the BDMCMC algorithm for 50,000 iterations, 25,000 of which as burn-in. To check whether this number of iterations is enough, we run

R> plotcoda( output.ggm )

The results in Figure 4.4 indicate that our BDMCMC algorithm, roughly speaking, converges after 20,000 iterations and that a burn-in of 25,000 is sufficient.

4.6.3 Comparison and goodness-of-fit

The function compare provides several measures for checking the performance of our algorithms and comparing them with alternative approaches, with respect to the true graph structure. To check the performance of both our algorithms (Algorithms 2 and 3), we also run Algorithm 3 under the same conditions, as below

Fig. 4.3 Visualization summary of the simulated data based on the output of the bdgraph function. Top-left: the inferred graph with the highest posterior probability. Top-right: the estimated posterior probabilities of all visited graphs. Bottom-left: the estimated posterior probabilities of all visited graphs based on the size of the graph. Bottom-right: the trace of our algorithm based on the size of the graphs.

R> output.gcgm <- bdgraph( data = data.sim, method = "gcgm", iter = 50000 )

where output.gcgm contains 50,000 samples from the joint posterior distribution, generated based on the Gaussian copula graphical models. Users can compare the performance of these two algorithms by using the code

R> plotroc( data.sim, output.ggm, output.gcgm )

As expected, the result indicates that the GGM approach performs slightly better than the GCGM, since the data are generated from a Gaussian graphical model.

Fig. 4.4 Plot for monitoring the convergence of the BDMCMC algorithm. It shows the traces of the cumulative posterior inclusion probabilities of all possible links of the graph.

Fig. 4.5 ROC plot to compare the performance of the GGM and GCGM approaches.

Here, we also compare our approach to the Meinshausen-Buhlmann approach mb (20) by using the huge package (34). We consider the following code

R> install.packages( "huge" )
R> library( huge )
R> mb <- huge( data.sim $ data, method = "mb" )
R> output.huge <- huge.select( mb )

R> compare( data.sim, output.ggm, output.gcgm, output.huge, vis = TRUE )

[Table: performance measures (true/false positives and negatives, true positive rate, false positive rate, accuracy, balanced F-score and positive predictive value) for BDgraph(ggm), BDgraph(gcgm) and huge(mb) against the true graph.]

This result indicates that the huge package, based on the mb approach, discovers 5 true links; in addition, it finds 7 extra links which are not in the true graph. See Figure 4.6 for a visualization. For more comparisons, see Mohammadi and Wit (24, section 4).

4.7 Application to real data sets

In this section, we analyze two data sets from biology and sociology, using the functions available in the BDgraph package. In Section 4.7.1, we analyze a labor force survey data set, which is mixed data. In Section 4.7.2, we analyze human gene expression data, which are high-dimensional data that do not follow the Gaussianity assumption. Both data sets are available in our package.

4.7.1 Application to labor force survey data

(13) analyzes the multivariate dependencies among income, education and family background, using data concerning 1002 males in the U.S. labor force. The data are available in our package; users can call the data via

R> data( surveydata )
R> dim( surveydata )
[1] 1002    7

Fig. 4.6 Comparing the performance of the BDgraph and huge packages. Top-left: the true graph. Top-right: the selected graph based on Gaussian graphical models. Bottom-left: the selected graph based on Gaussian copula graphical models. Bottom-right: the selected graph based on the mb approach in the huge package.

R> head( surveydata, 5 )
     income degree children pincome pdegree pchildren age

Missing data are indicated by NA and, in general, the rate of missing data is around 0.09, with higher rates for the variables income and pincome. In this data set, we have seven observed variables, as follows:

income: An ordinal variable indicating the respondent's income in 1000s of dollars.

degree: An ordinal variable with five categories indicating the respondent's highest educational degree.

children: A count variable indicating the number of children of the respondent.

pincome: An ordinal variable with five categories indicating the financial status of the respondent's parents.

pdegree: An ordinal variable with five categories indicating the highest educational degree of the respondent's parents.

pchildren: A count variable indicating the number of children of the respondent's parents.

age: A count variable indicating the respondent's age in years.

Since these variables are measured on various scales, the marginal distributions are heterogeneous, which makes the study of their joint distribution very challenging. However, we can apply our Bayesian framework based on the Gaussian copula graphical models to this data set; thus, we run the function bdgraph with option method = "gcgm". For the prior distributions of the graph and precision matrix, as the defaults of the function bdgraph, we place a uniform distribution as an uninformative prior on the graph and a G-Wishart W_G(3, I_7) on the precision matrix. We run our function for 50,000 iterations with 25,000 as burn-in.

R> output <- bdgraph( data = surveydata, method = "gcgm", iter = 50000 )
R> summary( output )
$selected_graph
...
$phat
...

$Khat
...

The results of the function summary are the adjacency matrix of the graph with the highest posterior probability (selected_graph), the estimated posterior probabilities of all possible links (phat) and the estimated precision matrix (Khat).

Figure 4.7 shows the selected graph, i.e. the graph with the highest posterior probability. Links in the graph are indicated by the signs + and -, which represent positively and negatively correlated relationships between the associated variables, respectively. The results indicate that education, fertility and age determine income, since there are highly positively correlated relationships between income and those three variables, with posterior probability equal to one for all. The respondent's education and fertility are negatively correlated, with high posterior probability. The respondent's education is certainly related to his parents' education, since there is a positively correlated relationship with posterior probability equal to one. Moreover, the results indicate that the relationships between income, education and fertility hold across generations.

For this data set, (13) estimated the conditional independence between variables by inspecting whether the 95% credible intervals for the associated regression parameters do not contain zero. Our result is the same as Hoff's result except for one link: our result indicates a strong relationship between parents' education (pdegree) and fertility (children), which is not selected by Hoff.
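The + and - signs in Figure 4.7 correspond to the signs of the partial correlations, which can be recovered from the estimated precision matrix via rho_ij = -K_ij / (K_ii K_jj)^{1/2}. A sketch, assuming Khat holds the estimated precision matrix reported by summary:

# convert the estimated precision matrix into partial correlations
partial_cor <- - cov2cor( as.matrix( Khat ) )   # rho_ij = -K_ij / sqrt(K_ii K_jj)
diag( partial_cor ) <- 1

# signs of the conditional dependencies (cf. the +/- labels in Figure 4.7)
sign( round( partial_cor, 2 ) )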

Fig. 4.7 Graph with the highest posterior probability for the labor force survey data, based on the output from bdgraph. The sign + represents a positively correlated relationship between the associated variables and - a negatively correlated relationship.

4.7.2 Application to human gene expression

Here, using the functions available in the BDgraph package, we study structure learning of sparse graphs applied to the large-scale human gene expression data originally described by (31). The data were collected to measure gene expression in B-lymphocyte cells from Utah individuals of Northern and Western European ancestry. They consider 60 individuals whose genotypes are available online at ftp://ftp.sanger.ac.uk/pub/genevar. Here the focus is on the 3125 Single Nucleotide Polymorphisms (SNPs) that have been found in the 5' UTR (untranslated region) of mRNA (messenger RNA) with a minor allele frequency of at least 0.1. Since the 5' UTR of mRNA has been the subject of investigation previously, it should have an important role in the regulation of gene expression. The raw data were background corrected, then quantile normalized across replicates of a single individual, and then median normalized across all individuals. Following (4), among the 47,293 total available probes, we consider the 100 most variable probes that correspond to different Illumina TargetID transcripts. The data for these 100 probes are available in our package. To see the data, users can run the code

R> data( geneexpression )
R> dim( geneexpression )

[1]  60 100

The data consist of only 60 observations (n = 60) across 100 genes (p = 100). This data set is an interesting case study for graph structure learning, as it has been used in (4, 24, 11). In this data set, all the variables are continuous, but they do not follow the Gaussianity assumption, as can be seen in Figure 4.8. Thus, we apply the Gaussian copula graphical models and run the function bdgraph with option method = "gcgm". For the prior distributions of the graph and precision matrix, as the defaults of the function bdgraph, we place a uniform distribution as an uninformative prior on the graph and the G-Wishart W_G(3, I_100) on the precision matrix.

Fig. 4.8 Univariate histograms of the first 6 genes in the human gene data set.

We run our function for 100,000 iterations with 50,000 as burn-in, as follows

R> output <- bdgraph( data = geneexpression, method = "gcgm", iter = 100000 )
R> select( output )

This function takes around 13 hours. The function select gives by default the graph with the highest posterior probability, which has 991 links. We use the following code to visualize the links that we believe have posterior probabilities larger than 0.995:

R> select( output, cut = 0.995, vis = TRUE )

By using the option vis = TRUE, this function plots the selected graph. Figure 4.9 shows the selected graph with 266 links, for which the posterior inclusion probabilities (4.12) are greater than 0.995.

Fig. 4.9 The inferred graph for the human gene expression data using Gaussian copula graphical models. This graph consists of links with posterior probabilities (4.12) larger than 0.995.

The function phat reports the estimated posterior probabilities of all possible links in the graph. For our data, the output of this function is a 100 × 100 matrix; Figure 4.10 visualizes that matrix. Most of the links in our selected graph (the graph with the highest posterior probability) conform to the results of previous studies. For instance, (4) found 54 significant interactions between genes, most of which are covered by our method. In addition, our approach indicates additional gene interactions with high posterior probabilities that are missed by previous work; thus our method may complement the analysis of human gene interaction networks.
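A convenient way to list the individual high-probability links behind Figure 4.9 is to threshold the phat matrix directly. A sketch in plain R, assuming output is the fitted object and that the probability matrix carries the gene names as dimnames:

# matrix of posterior link inclusion probabilities
p_link <- as.matrix( phat( output, round = 3 ) )

# indices of links with inclusion probability above 0.995 (upper triangle only)
high <- which( p_link > 0.995 & upper.tri( p_link ), arr.ind = TRUE )

# report the gene pairs together with their posterior probabilities
data.frame( gene1 = rownames( p_link )[ high[ , 1 ] ],
            gene2 = colnames( p_link )[ high[ , 2 ] ],
            prob  = p_link[ high ] )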

Fig. 4.10 Image visualization of the estimated posterior probabilities of all possible links in the graph for the human gene expression data.

4.8 Conclusion

The BDgraph package aims to help researchers in two ways. Firstly, the package provides a Bayesian framework which can potentially be extended, customized and adapted to address different requirements in graphical models. Secondly, it is currently the only R package that provides a simple and complete range of tools for conducting Bayesian inference for graphical modelling based on conditional independence graph estimation.

We plan to maintain and develop the package in the future. Future versions of the package will contain more options for the prior distributions of the graph and precision matrix. One possible extension of our package is to deal with outliers, using robust Bayesian graphical modelling with Dirichlet t-distributions (8, 22). An implementation of this method would be desirable in real applications.

References

[1] Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., and Sorensen, D. (1999). LAPACK Users' Guide. Society for Industrial and Applied Mathematics, Philadelphia, PA, third edition.
[2] Baldi, P., Brunak, S., Chauvin, Y., Andersen, C. A., and Nielsen, H. (2000). Assessing the accuracy of prediction algorithms for classification: an overview. Bioinformatics, 16(5).
[3] Bates, D. and Maechler, M. (2014). Matrix: Sparse and Dense Matrix Classes and Methods. R package.
[4] Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2).
[5] Cappé, O., Robert, C., and Rydén, T. (2003). Reversible jump, birth-and-death and more general continuous time Markov chain Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 65(3).
[6] Csardi, G. and Nepusz, T. (2006). The igraph software package for complex network research. InterJournal, Complex Systems:1695.
[7] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496).
[8] Finegold, M. and Drton, M. (2014). Robust Bayesian graphical modeling using Dirichlet t-distributions. Bayesian Analysis, 9(3).
[9] Friedman, J., Hastie, T., and Tibshirani, R. (2008). Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3).
[10] Friedman, J., Hastie, T., and Tibshirani, R. (2014). glasso: Graphical lasso - estimation of Gaussian graphical models. R package version 1.8.
[11] Gu, Q., Cao, Y., Ning, Y., and Liu, H. (2015). Local and global inference for high dimensional Gaussian copula graphical models. arXiv preprint.
[12] Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer.

[13] Hoff, P. D. (2007). Extending the rank likelihood for semiparametric copula estimation. The Annals of Applied Statistics.
[14] Højsgaard, S. (2012). Graphical independence networks with the gRain package for R. Journal of Statistical Software, 46(i10).
[15] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4).
[16] Kalisch, M., Mächler, M., Colombo, D., Maathuis, M. H., and Bühlmann, P. (2012). Causal inference using graphical models with the R package pcalg. Journal of Statistical Software, 47(11):1-26.
[17] Lauritzen, S. (1996). Graphical Models, volume 17. Oxford University Press, USA.
[18] Lawson, C. L., Hanson, R. J., Kincaid, D. R., and Krogh, F. T. (1979). Basic linear algebra subprograms for Fortran usage. ACM Transactions on Mathematical Software (TOMS), 5(3).
[19] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1).
[20] Meinshausen, N. and Buhlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3).
[21] Mohammadi, A., Abegaz Yazew, F., van den Heuvel, E., and Wit, E. C. (2015). Bayesian modeling of Dupuytren disease using Gaussian copula graphical models. arXiv preprint, v2.
[22] Mohammadi, A. and Wit, E. (2014). Contributed discussion on article by Finegold and Drton. Bayesian Analysis, 9(3).
[23] Mohammadi, A. and Wit, E. (2015a). BDgraph: Graph Estimation Based on Birth-Death MCMC Approach. R package.
[24] Mohammadi, A. and Wit, E. C. (2015b). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1).
[25] Muirhead, R. (1982). Aspects of Multivariate Statistical Theory, volume 42. Wiley Online Library.

[26] Murray, I. and Ghahramani, Z. (2004). Bayesian learning in undirected graphical models: approximate MCMC algorithms. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press.
[27] Nelsen, R. B. (2007). An Introduction to Copulas. Springer Science & Business Media.
[28] R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.
[29] Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3).
[30] Scutari, M. (2010). Learning Bayesian networks with the bnlearn R package. Journal of Statistical Software, 35(3):22.
[31] Stranger, B. E., Nica, A. C., Forrest, M. S., Dimas, A., Bird, C. P., Beazley, C., Ingle, C. E., Dunning, M., Flicek, P., Koller, D., et al. (2007). Population genomics of human gene expression. Nature Genetics, 39(10).
[32] Uhler, C., Lenkoski, A., and Richards, D. (2014). Exact formulas for the normalizing constants of Wishart distributions for graphical models. arXiv preprint.
[33] Wang, H. and Li, S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6.
[34] Zhao, T., Liu, H., Roeder, K., Lafferty, J., and Wasserman, L. (2014). huge: High-dimensional Undirected Graph Estimation. R package.

104

Chapter 5

Bayesian Modelling of the Exponential Random Graph Models with Non-observed Networks

5.1 Abstract

Across the sciences, one of the main objectives is network modelling: discovering complex relationships among many variables. Among the most promising statistical models for network modelling are exponential random graph models (ERGMs). These models provide an insightful probabilistic framework for representing a variety of structural tendencies that define complicated dependence patterns which are hardly captured by other probabilistic models. However, ERGMs are restricted to settings in which the network itself is observed. In the present paper, we develop a novel Bayesian statistical framework which combines the class of ERGMs with graphical models and is thereby capable of modelling non-observed networks. Our proposed method greatly extends the scope of ERGMs to a wider range of applied research areas. We discuss possible extensions of the method.

Key words: Bayesian inference, Exponential random graph models, Graphical models, Covariance selection, Birth-death process, Markov chain Monte Carlo, G-Wishart.

5.2 Introduction

Network modelling pervades all of science, since one of the main objectives of science is to discover complex relationships among large numbers of variables. For the prevention of epidemics, social science relies on a keen understanding of interactions among individuals (20, 7). Similarly, in biology, curing complex diseases requires an understanding of how genes talk to each other (1). One way to describe these kinds of complex relationships is by means of an abstract network. For a general overview of statistical models and methods for network modelling, see (10).

Exponential random graph models (34, 22) are a promising and flexible family of statistical models for modelling network topology. They have been used mainly in the social science literature, since they make it possible to statistically account for the complexity inherent in many network data (27, 26). The basic assumption in ERGMs is that the topological structure of an observed network can be explained by a vector of network statistics that capture the generative structures in the network (27). Up till now, however, these models have been restricted to observed network data. In many applications the data are not observed networks: examples are microarray gene expression data and cell signalling data in biology, and fMRI data in neuroscience. Is it possible to extend ERGMs to those types of data? A possible solution is to combine the class of ERGMs with graphical models.

Graphical models (12) provide an appealing and insightful way to obtain uncertainty estimates when inferring network structure. The close relationship between the topology of the underlying graph and its probabilistic properties is a central aspect of graphical models, and it provides the tools to interpret complex models. In this regard, Bayesian approaches provide a rather straightforward toolkit, and much recent progress has been made in graphical models (9, 5, 14, 32, 16). More recently, (16) have developed a search algorithm based on a birth-death MCMC approach that works with high-dimensional data. However, graphical models are only designed to estimate the underlying graph structure; they are not designed to model the network-generating features themselves.

In this paper, we develop a new Bayesian statistical framework for ERGMs which is capable of network modelling for non-observed networks. The proposed method greatly extends the scope of ERGMs to more applied research areas, not limited to social science. To apply ERGMs to non-observed network data, we combine the class of ERGMs with graphical models. In particular, in our Bayesian framework we design a computationally efficient search algorithm that explores the graph space to identify not only important edges

but also key network features, and thereby to detect the underlying graph structure. The search algorithm builds on the birth-death Markov chain Monte Carlo algorithm proposed by (16) for Gaussian graphical models.

The paper is organized as follows. In Section 5.3 we briefly review exponential random graph models and graphical models. In Section 5.4 we introduce our proposed Bayesian framework for exponential random graph models with non-observed networks.

5.3 Exponential families and graphical models

In this section we briefly review the two probabilistic families of models that we use in our proposed methodology; for more details see e.g. (29).

5.3.1 Exponential random graph models

Exponential random graph models (ERGMs), or p* models (34), are a family of statistical models that provide a flexible way to model the complex dependence structure of networks. They are the most popular models for social networks and are also used in physics and biology (25). The aim is to model data consisting of observed networks of nodes and edges, which in the social network context represent actors and the relationships between these actors, respectively. There has been comparatively little research on Bayesian inference for the parameters of ERGMs, besides the recent articles by (26, 11, 2, 3). See (21) and (22) for an overview of ERGMs.

In an ERGM, the random matrix $G = \{g_{ij}\}$ is defined over the graph space on a set of $p$ nodes, with each element of $G$ representing the presence or absence of a particular edge ($g_{ij} = 1$ if there is a link from $i$ to $j$, and $g_{ij} = 0$ otherwise). Edges connecting a node to itself are not allowed, so $g_{ii} = 0$. For a graph $G$ (which consists of a set of edges over the set of nodes), the ERGM is given by

\[ P(G \mid \theta) = \frac{1}{Z(\theta)} \exp\{\theta^t S(G)\}, \tag{5.1} \]

where $\theta \in \Theta$ represents a vector of unknown parameters and $Z(\theta)$ is the normalizing constant,

\[ Z(\theta) = \sum_{G \in \mathcal{G}_p} \exp\{\theta^t S(G)\}. \]

The term $S(G)$ is a vector of network statistics that gives the ERGM much of its explanatory power: it contains statistics that capture the generative structures of connectivity in the network. It can include, for instance, the number of edges ($\sum_{i,j} g_{ij}$) to capture network density, the number of triangles ($\sum_{i,j,l} g_{ij} g_{il} g_{jl}$) to capture transitivity, and the number of 2-stars ($\sum_{i,j,l} g_{il} g_{jl}$), where a $k$-star ($k \geq 2$) is a node with $k$ neighbours, that is, a node of degree $k$; a wide variety of other endogenous structures is possible (27).
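To make the role of $S(G)$ concrete, the following R sketch computes these three statistics from a symmetric 0/1 adjacency matrix; the function name ergm_stats and the toy graph are our own illustration, not part of the model.

# Sketch: three common ERGM statistics for an undirected graph, given a
# symmetric 0/1 adjacency matrix g with zero diagonal.
ergm_stats <- function(g) {
  stopifnot(isSymmetric(g), all(diag(g) == 0))
  n_edges  <- sum(g) / 2                    # number of edges
  deg      <- rowSums(g)                    # node degrees
  n_2stars <- sum(choose(deg, 2))           # 2-stars, i.e. paths of length two
  n_tri    <- sum(diag(g %*% g %*% g)) / 6  # triangles: closed 3-walks / 6
  c(edges = n_edges, twostars = n_2stars, triangles = n_tri)
}

# Example: a 4-node graph with one triangle (1-2-3) and a pendant edge (3-4)
g <- matrix(0, 4, 4)
g[1, 2] <- g[2, 3] <- g[1, 3] <- g[3, 4] <- 1
g <- g + t(g)
ergm_stats(g)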

Estimating the parameters of an ERGM is a challenging problem, since the analytic form of the normalizing constant $Z(\theta)$ is unknown, owing to the combinatorial complexity of summing over all $2^{p(p-1)/2}$ possible graphs in $\mathcal{G}_p$ (2, 3, 8). Although ERGMs are difficult to handle in practice, they are quite popular in the literature, since they are conceived to capture the complex dependence structure of the graph and allow a reasonable interpretation of the observed data. At the basis of this class of models lies the dependence hypothesis that edges self-organize into specific structures called configurations (e.g. triangles, $k$-stars). Thanks to this flexibility to adapt to different network contexts, a broad range of network configurations is possible (34, 22). A positive value of $\theta_i \in \theta$ results in a tendency for the configuration corresponding to $S_i(G)$ to be observed in the data more often than would be expected by chance.

Note that in ERGMs the data are observed networks, which is a strong limitation. In many applications the data are measurements on variables, such as gene expression data and cell signalling data. The question we intend to answer is whether it is possible to extend the idea of ERGMs to those types of data. We extend ERGMs to non-observed networks by combining them with graphical models.

5.3.2 Graphical models

Graphical models (12) use the concept of a graph to represent conditional dependence relationships among random variables whose underlying network is not observed. When the observed data come from noisy measurements of the variables, graphical models present an appealing and insightful way to describe graph-based dependencies between the random variables. A graph $G = (V, E)$ consists of a set of vertices $V = \{1, 2, \ldots, p\}$, where each node corresponds to a random variable, and a set of existing edges $E$; $\bar{E}$ denotes the set of non-existing edges. We are interested in undirected graphical models, in which $(i, j) \in E$ is equivalent to $(j, i) \in E$; these are also known as Markov random fields (24). In this class of models, the absence of an edge between two nodes indicates that the two corresponding variables are conditionally independent given the remaining variables, while an edge between two nodes indicates conditional dependence of the variables.

Graphical models in which the variables follow a multivariate Gaussian distribution are called Gaussian graphical models (GGMs), also known as covariance selection models (4). In GGMs, zero entries in the precision matrix correspond to the absence of edges in the graph, and hence to conditional independence between pairs of variables given the rest. With respect to the graph $G$, a zero-mean Gaussian graphical model is

\[ \mathcal{M}_G = \left\{ N_p(0, \Sigma) \;\middle|\; K = \Sigma^{-1} \in P_G \right\}, \]

where $P_G$ denotes the space of $p \times p$ positive definite matrices with entries equal to zero for the non-existing edges of the graph $G$. Let $X = (X_1, \ldots, X_n)$ be an independent and identically distributed sample of size $n$ from the model $\mathcal{M}_G$. Then the likelihood is

\[ P(X \mid K, G) \propto |K|^{n/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(K U) \right\}, \tag{5.2} \]

where $U = X^t X$.

For a graph with $p$ nodes, the size of the graph space is $2^{p(p-1)/2}$ in total, which grows super-exponentially with the number of nodes. Bayesian inference over the whole graph space is therefore severely limited by the sheer size of this space. In this regard, there are efficient stochastic search algorithms that can explore the graph space (16, 5, 9). These algorithms explore the graph space by adding or removing one edge in each step, and are known as neighbourhood search algorithms. They can potentially work with graphs of more than 100 nodes (9, 16, 17).
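As a small numerical illustration of the likelihood (5.2), the base-R sketch below evaluates the unnormalized log-likelihood for a precision matrix whose zero entries encode the missing edges; the function name ggm_loglik and the example values are our own.

# Sketch: unnormalized GGM log-likelihood (5.2),
# log P(X | K, G) = (n/2) log|K| - tr(K U)/2 + const, with U = t(X) %*% X.
ggm_loglik <- function(X, K) {
  U <- crossprod(X)
  n <- nrow(X)
  (n / 2) * as.numeric(determinant(K, logarithm = TRUE)$modulus) -
    sum(K * U) / 2   # tr(K U) = sum(K * U) for symmetric K and U
}

# Example: p = 3 variables with an edge only between variables 1 and 2
set.seed(1)
K <- rbind(c(1, 0.4, 0), c(0.4, 1, 0), c(0, 0, 1))
X <- matrix(rnorm(50 * 3), 50, 3)
ggm_loglik(X, K)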

5.4 Bayesian hierarchical model for ERGMs with non-observed networks

In this section, we propose a hierarchical Bayesian methodology to discover the underlying network structure and the network features that are important. We can display the hierarchical model schematically as

\[ \theta \;\longrightarrow\; G \;\longrightarrow\; K \;\longrightarrow\; X = (X_1, \ldots, X_n). \]

Thus, we consider the joint posterior distribution of the parameters,

\[ P(\theta, G, K \mid X) \propto P(X \mid \theta, G, K) \, P(K \mid G) \, P(G \mid \theta) \, P(\theta). \tag{5.3} \]

In our methodology, we assume that the observed data follow a multivariate Gaussian distribution.

5.4.1 Model for prior specification on the graph

Following the idea of exponential random graphs, we use a prior on the graph of the form

\[ P(G \mid \theta) = \frac{1}{Z(\theta)} \exp\{\theta^t S(G)\}, \tag{5.4} \]

where $S(G)$ is a vector of statistics of the graph (e.g. the number of edges, triangles, etc.) and $\theta \in \Theta$ denotes the parameter vector of the model.

5.4.2 Prior specification on the precision matrix

For the prior distribution of the precision matrix, we use the G-Wishart distribution (23). In Gaussian graphical models, the G-Wishart prior is highly attractive, since it is conjugate to normally distributed data and places no probability mass on the zero entries of the precision matrix. The G-Wishart distribution $W_G(b, D)$ for a random matrix $K$ is defined as

\[ P(K \mid G) = \frac{1}{I_G(b, D)} |K|^{(b-2)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(D K) \right\}, \quad K \in P_G, \]

where $b > 2$ is the degrees of freedom, $D$ is a symmetric positive definite matrix, and $I_G(b, D)$ is the normalizing constant,

\[ I_G(b, D) = \int_{P_G} |K|^{(b-2)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(D K) \right\} dK. \]

For a complete graph $G$, the G-Wishart distribution reduces to the Wishart distribution, whose normalizing constant has an explicit form (18). For decomposable graphs, $I_G(b, D)$ also has an explicit form (23). For non-decomposable graphs, however, $I_G(b, D)$ has an intractable form (28).
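In practice, simulation from the G-Wishart distribution is available, for example, through the rgwish() function of our BDgraph package; the following is a minimal sketch, assuming a recent version of the package and its n/adj/b/D argument names.

# Sketch: one draw from W_G(b, D) for a 4-cycle graph via BDgraph::rgwish().
library(BDgraph)

adj <- matrix(0, 4, 4)                        # adjacency matrix of G
adj[1, 2] <- adj[2, 3] <- adj[3, 4] <- adj[1, 4] <- 1
adj <- adj + t(adj)

K <- rgwish(n = 1, adj = adj, b = 3, D = diag(4))
round(K, 2)  # zeros appear at the non-edges (1,3) and (2,4)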

Since the G-Wishart prior is conjugate to the likelihood (5.2), the posterior distribution of $K$ is

\[ P(K \mid X, G) = \frac{1}{I_G(b^*, D^*)} |K|^{(b^*-2)/2} \exp\left\{ -\tfrac{1}{2} \mathrm{tr}(D^* K) \right\}, \]

where $b^* = b + n$ and $D^* = D + U$; that is, $W_G(b^*, D^*)$. For other possible priors on the precision matrix, see e.g. (33, 31, 30) and (35); these alternatives place no probability mass on exact zero entries and put stable priors on the non-zero entries of the precision matrix.

5.4.3 MCMC sampling scheme

The high dimensionality of the graph space calls for an MCMC sampling algorithm that can potentially explore the whole space. Specifically, we introduce an MCMC algorithm that simulates from the joint posterior distribution (5.3). The proposed MCMC algorithm consists of three steps:

Step 1: Sample from $\theta$, based on the exchange algorithm (19).

Step 2: Sample from the graph space, based on the birth-death MCMC sampling algorithm proposed by (16).

Step 3: Sample from the precision matrix, based on the exact sampling algorithm from the G-Wishart distribution proposed by (13).

For step 1, we illustrate below how to sample from the conditional distribution of $\theta$ based on the exchange algorithm (19). For step 2, we use the computationally efficient birth-death MCMC sampler proposed by (16) for Gaussian graphical models. This algorithm explores the graph space by adding or deleting an edge in a birth or death event, where the events follow a continuous-time birth-death Markov process. In a graph $G = (V, E)$ with precision matrix $K$, each edge $e = (i, j) \in E$ dies independently of the others as a Poisson process with death rate $\delta_e(K)$. Since the events are independent, the overall death rate is $\delta(K) = \sum_{e \in E} \delta_e(K)$. Analogously, each non-edge $e = (i, j) \in \bar{E}$ appears independently as a Poisson process with birth rate $\beta_e(K)$, and the overall birth rate is $\beta(K) = \sum_{e \in \bar{E}} \beta_e(K)$. The birth and death events occur in continuous time, with rates determined by the stationary distribution of the process; the algorithm is constructed in such a way that this stationary distribution equals the target posterior distribution. Since the birth-death processes are independent Poisson processes, the time between two successive events is exponentially distributed with mean $1/(\beta(K) + \delta(K))$; this can be interpreted as the time the process spends in any particular instance of the graph space. The probabilities of birth and death events are therefore proportional to their rates,

\[ P(\text{birth for edge } e) = \frac{\beta_e(K)}{\beta(K) + \delta(K)}, \quad \text{for each } e \in \bar{E}, \tag{5.5} \]

\[ P(\text{death for edge } e) = \frac{\delta_e(K)}{\beta(K) + \delta(K)}, \quad \text{for each } e \in E. \tag{5.6} \]

Mohammadi and Wit (16, Section 3) prove that the birth-death MCMC sampling algorithm converges to the target posterior distribution if the birth and death rates are chosen as

\[ \beta_e(K) = \frac{P(G^{+e}, K^{+e} \mid X, \theta)}{P(G, K \mid X, \theta)}, \quad \text{for each } e \in \bar{E}, \tag{5.7} \]

\[ \delta_e(K) = \frac{P(G^{-e}, K^{-e} \mid X, \theta)}{P(G, K \mid X, \theta)}, \quad \text{for each } e \in E, \tag{5.8} \]

in which $G^{+e} = (V, E \cup \{e\})$ and $K^{+e} \in P_{G^{+e}}$, and similarly $G^{-e} = (V, E \setminus \{e\})$ and $K^{-e} \in P_{G^{-e}}$.

Based on the above, we summarize the proposed sampling algorithm as follows.

Algorithm 5.1. Given the current state $(\theta, G, K)$:

1. Update $\theta$, using a Metropolis step:
   1.1. Draw $\theta'$ from a symmetric proposal distribution $h(\theta' \mid \theta)$.
   1.2. Draw $G' \sim P(G \mid \theta')$.
   1.3. Accept the exchange move with probability
   \[ \alpha(\theta' \mid \theta) = \min\left\{ 1, \frac{q_{\theta'}(G) \, P(\theta') \, q_{\theta}(G')}{q_{\theta}(G) \, P(\theta) \, q_{\theta'}(G')} \right\}, \]
   where $q_\theta(G) = \exp\{\theta^t S(G)\}$ denotes the unnormalized version of (5.4).
2. Update $G$, conditional on $\theta$, by the birth-death process:
   2.1. Calculate the birth rates by equation (5.7) and $\beta(K) = \sum_{e \in \bar{E}} \beta_e(K)$.
   2.2. Calculate the death rates by equation (5.8) and $\delta(K) = \sum_{e \in E} \delta_e(K)$.
   2.3. Calculate the waiting time $W(K) = 1/(\beta(K) + \delta(K))$.
   2.4. Determine the type of jump from (5.5) and (5.6).
3. Update $K$, conditional on the new graph $G$, by sampling a new precision matrix.

In Algorithm 5.1, the first step samples $\theta$ using the exchange algorithm, which we explain below. In the second step, we calculate the birth and death rates and the waiting time, and based on the rates we determine the type of jump. Details of how to efficiently calculate the birth and death rates are also discussed below.
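The jump mechanics of step 2 are straightforward to express in code. The sketch below assumes the rates (5.7) and (5.8) have already been evaluated and are passed in as plain vectors; bd_jump is a hypothetical helper name, not part of the BDgraph package.

# Sketch: one jump of the continuous-time birth-death mechanism of
# Algorithm 5.1, given birth rates (one per non-edge) and death rates
# (one per edge).
bd_jump <- function(birth_rates, death_rates) {
  total        <- sum(birth_rates) + sum(death_rates)
  waiting_time <- rexp(1, rate = total)                 # mean 1/(beta + delta)
  probs        <- c(birth_rates, death_rates) / total   # (5.5)-(5.6)
  j            <- sample(length(probs), 1, prob = probs)
  type         <- if (j <= length(birth_rates)) "birth" else "death"
  list(type = type, index = j, waiting_time = waiting_time)
}

# Example with arbitrary rates for 3 non-edges and 2 edges
bd_jump(birth_rates = c(0.2, 0.1, 0.4), death_rates = c(0.3, 0.5))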

Finally, in the third step, according to the new state of the jump, we sample a new precision matrix using the direct sampling scheme from the G-Wishart distribution proposed by (13).

Step 1: Updating the parameter $\theta$

In step 1 of Algorithm 5.1, we wish to sample from the conditional distribution of $\theta$,

\[ P(\theta \mid G) \propto \frac{1}{Z(\theta)} \exp\{\theta^t S(G)\} \, p(\theta). \tag{5.9} \]

Sampling from this conditional distribution is difficult, since it requires the evaluation of the intractable normalizing constant $Z(\theta) = \sum_{G \in \mathcal{G}_p} \exp\{\theta^t S(G)\}$. Murray et al. (19) introduced the exchange algorithm, based on exact sampling, for general MCMC algorithms in which the target posterior distribution has such an additional intractable normalizing constant, and Caimo and Friel (2) explain how to use the idea behind the exchange algorithm in the ERGM setting. Suppose that $G$ is the current state of our algorithm and we would like to sample $\theta$ from (5.9). First we draw $\theta'$ from a symmetric proposal distribution $h(\theta' \mid \theta)$. Then we draw $G'$, which is the difficult step of the algorithm, since it requires a draw from (5.4); note that the exchange algorithm requires exact sampling of $G'$. Following (2), we approximate the exact simulation by sampling $G'$ from $P(G \mid \theta')$ using an MCMC run that is long enough for the final point to be treated as if it were simulated exactly from $P(G \mid \theta')$. They suggest that 500 iterations constitute a long enough run; (6) shows that as few as 50 or 100 iterations are usually sufficient.
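As a concrete illustration of the exchange move, consider the toy case in which $S(G)$ is just the number of edges. Then $P(G \mid \theta)$ factorizes into independent Bernoulli edges with success probability $e^\theta/(1+e^\theta)$, the required draw of $G'$ is exact, and no inner MCMC run is needed. The sketch below implements one such update; the function name, the Normal prior on $\theta$ and the example values are our own choices and do not cover the general algorithm.

# Sketch: one exchange update (19) for theta in the prior (5.4), in the toy
# case S(G) = number of edges, where G' can be simulated exactly.
exchange_update <- function(theta, S_G, m, sd_prop = 0.1,
                            log_prior = function(t) dnorm(t, 0, 10, log = TRUE)) {
  # m = p(p-1)/2 possible edges; S_G = number of edges of the current graph
  theta_new <- rnorm(1, theta, sd_prop)        # symmetric proposal h
  S_new     <- rbinom(1, m, plogis(theta_new)) # exact draw of S(G') | theta_new
  log_alpha <- (theta_new - theta) * S_G +     # exchange ratio: the intractable
               (theta - theta_new) * S_new +   # Z(theta) terms cancel
               log_prior(theta_new) - log_prior(theta)
  if (log(runif(1)) < log_alpha) theta_new else theta
}

# Example: 10 nodes (m = 45 possible edges), current graph with 12 edges
exchange_update(theta = 0, S_G = 12, m = 45)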

Computing the birth and death rates

Calculating the birth and death rates (5.7) and (5.8) is the bottleneck of our BDMCMC algorithm. Here we explain how to calculate the death rates efficiently; the birth rates follow in a similar manner. For more details see (15, 16). Following (16), after some simplification, for each $e = (i, j) \in E$ we have

\[ \delta_e(K) = \frac{P(G^{-e} \mid \theta)}{P(G \mid \theta)} \, \frac{I_G(b, D)}{I_{G^{-e}}(b, D)} \left( \frac{D^*_{jj}}{2\pi (k_{ii} - k^1_{11})} \right)^{1/2} H(K, D^*), \]

where

\[ \frac{P(G^{-e} \mid \theta)}{P(G \mid \theta)} = e^{\theta^t [S(G^{-e}) - S(G)]} \]

and

\[ H(K, D^*) = \exp\left\{ -\tfrac{1}{2} \left[ \mathrm{tr}\big(D^*_{e,e}(K^0 - K^1)\big) - \Big(D^*_{ii} - \frac{(D^*_{ij})^2}{D^*_{jj}}\Big)(k_{ii} - k^1_{11}) \right] \right\}, \]

in which $K^0$ is a $2 \times 2$ diagonal matrix with $k^0_{11} = k_{ii}$ and $k^0_{22} = K_{j, V\setminus j}(K_{V\setminus j, V\setminus j})^{-1} K_{V\setminus j, j}$, and $K^1 = K_{e, V\setminus e}(K_{V\setminus e, V\setminus e})^{-1} K_{V\setminus e, e}$. Evaluating the ratio of prior normalizing constants is the computational bottleneck in computing the death and birth rates.

Explicit form for the ratio of prior normalizing constants

Calculating the ratio of prior normalizing constants has been a major computational bottleneck in the Bayesian literature (28, 32, 16, 15). Recently, (28) provided an explicit representation of this intractable normalizing constant, and (15) implemented this result in a sampling algorithm for graphical models. By the theorem derived by (28, Theorem 3.7), for the special case in which $D$ is an identity matrix, we have

\[ \frac{I_G(b, I_p)}{I_{G^{-e}}(b, I_p)} = 2\sqrt{\pi} \, \frac{\Gamma((b + d + 1)/2)}{\Gamma((b + d)/2)}, \]

where $d$ denotes the number of triangles formed by the edge $e$ and two other edges in $G$, and $I_p$ denotes the $p$-dimensional identity matrix. Therefore, for the case $D = I_p$, we have the simplified formula for the death rates

\[ \delta_e(K) = e^{\theta^t [S(G^{-e}) - S(G)]} \, \frac{\Gamma((b + d + 1)/2)}{\Gamma((b + d)/2)} \left( \frac{2 D^*_{jj}}{k_{ii} - k^1_{11}} \right)^{1/2} H(K, D^*). \]

5.5 Discussion

We have proposed a Bayesian methodology for exponential random graph models with non-observed networks, which opens up a large toolbox for network modelling of non-observed network data. By combining exponential random graph models with graphical models, we have developed a hierarchical Bayesian framework. The framework proposed here is not limited to data that satisfy the Gaussianity assumption: one possible extension of our work is to Gaussian copula graphical models (15). Moreover, while we have focused on undirected graphical models, the work can be extended to directed graphical models as well. This is ongoing research.

References

[1] Bhadra, A. and Mallick, B. K. (2013). Joint high-dimensional Bayesian variable and covariance selection with an application to eQTL analysis. Biometrics, 69(2).

[2] Caimo, A. and Friel, N. (2011). Bayesian inference for exponential random graph models. Social Networks, 33(1).

[3] Caimo, A. and Friel, N. (2013). Bayesian model selection for exponential random graph models. Social Networks, 35(1).

[4] Dempster, A. (1972). Covariance selection. Biometrics, 28(1).

[5] Dobra, A., Lenkoski, A., and Rodriguez, A. (2011). Bayesian inference for general Gaussian graphical models with application to multivariate lattice data. Journal of the American Statistical Association, 106(496).

[6] Everitt, R. G. (2012). Bayesian parameter estimation for latent Markov random fields and social networks. Journal of Computational and Graphical Statistics, 21(4).

[7] Goodreau, S. M. (2007). Advances in exponential random graph (p*) models applied to a large social network. Social Networks, 29(2).

[8] He, R. and Zheng, T. (2015). GLMLE: graph-limit enabled fast computation for fitting exponential random graph models to large social networks. Social Network Analysis and Mining, 5(1):1-19.

[9] Jones, B., Carvalho, C., Dobra, A., Hans, C., Carter, C., and West, M. (2005). Experiments in stochastic computation for high-dimensional graphical models. Statistical Science, 20(4).

[10] Kolaczyk, E. D. (2009). Statistical Analysis of Network Data: Methods and Models. Springer Science & Business Media.

[11] Koskinen, J. H., Robins, G. L., and Pattison, P. E. (2010). Analysing exponential random graph (p-star) models with missing data using Bayesian data augmentation. Statistical Methodology, 7(3).

[12] Lauritzen, S. (1996). Graphical Models, volume 17. Oxford University Press, USA.

[13] Lenkoski, A. (2013). A direct sampler for G-Wishart variates. Stat, 2(1).

[14] Meinshausen, N. and Bühlmann, P. (2006). High-dimensional graphs and variable selection with the lasso. The Annals of Statistics, 34(3).

[15] Mohammadi, A., Abegaz Yazew, F., van den Heuvel, E., and Wit, E. C. (2015). Bayesian modelling of Dupuytren disease using Gaussian copula graphical models. arXiv preprint, v2.

[16] Mohammadi, A. and Wit, E. C. (2015a). Bayesian structure learning in sparse Gaussian graphical models. Bayesian Analysis, 10(1).

[17] Mohammadi, A. and Wit, E. C. (2015b). BDgraph: Bayesian structure learning of graphs in R. arXiv preprint, v2.

[18] Muirhead, R. (1982). Aspects of Multivariate Statistical Theory, volume 42. Wiley.

[19] Murray, I., Ghahramani, Z., and MacKay, D. (2012). MCMC for doubly-intractable distributions. arXiv preprint.

[20] Newman, M. E., Watts, D. J., and Strogatz, S. H. (2002). Random graph models of social networks. Proceedings of the National Academy of Sciences, 99(suppl 1).

[21] Robins, G., Pattison, P., Kalish, Y., and Lusher, D. (2007a). An introduction to exponential random graph (p*) models for social networks. Social Networks, 29(2).

[22] Robins, G., Snijders, T., Wang, P., Handcock, M., and Pattison, P. (2007b). Recent developments in exponential random graph (p*) models for social networks. Social Networks, 29(2).

[23] Roverato, A. (2002). Hyper inverse Wishart distribution for non-decomposable graphs and its application to Bayesian inference for Gaussian graphical models. Scandinavian Journal of Statistics, 29(3).

[24] Rue, H. and Held, L. (2005). Gaussian Markov Random Fields: Theory and Applications. CRC Press.

[25] Saul, Z. M. and Filkov, V. (2007). Exploring biological network structure using exponential random graph models. Bioinformatics, 23(19).

[26] Snijders, T. A. (2002). Markov chain Monte Carlo estimation of exponential random graph models. Journal of Social Structure, 3(2):1-40.

[27] Snijders, T. A., Pattison, P. E., Robins, G. L., and Handcock, M. S. (2006). New specifications for exponential random graph models. Sociological Methodology, 36(1).

[28] Uhler, C., Lenkoski, A., and Richards, D. (2014). Exact formulas for the normalizing constants of Wishart distributions for graphical models. arXiv preprint.

[29] Wainwright, M. J. and Jordan, M. I. (2008). Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2).

[30] Wang, H. (2012). Bayesian graphical lasso models and efficient posterior computation. Bayesian Analysis, 7(4).

[31] Wang, H. (2014). Scaling it up: Stochastic search structure learning in graphical models.

[32] Wang, H. and Li, S. (2012). Efficient Gaussian graphical model determination under G-Wishart prior distributions. Electronic Journal of Statistics, 6.

[33] Wang, H. and Pillai, N. S. (2013). On a class of shrinkage priors for covariance matrix estimation. Journal of Computational and Graphical Statistics, 22(3).

[34] Wasserman, S. and Pattison, P. (1996). Logit models and logistic regressions for social networks: I. An introduction to Markov graphs and p*. Psychometrika, 61(3).

[35] Wong, F., Carter, C. K., and Kohn, R. (2003). Efficient estimation of covariance selection models. Biometrika, 90(4).


Chapter 6

Using Mixture of Gammas for Bayesian Analysis in an M/G/1 Queue with Optional Second Service

Abstract

This paper proposes a Bayesian framework for an M/G/1 queuing system with optional second service. A semi-parametric model based on a finite mixture of Gamma distributions is used to approximate both the general service and re-service time densities in this queuing system. A Bayesian procedure based on birth-death MCMC methodology is proposed to estimate the system parameters, predictive densities and some performance measures related to this queuing system, such as the stationary system size and waiting time. The approach is illustrated with several numerical examples based on simulation studies.

Key words: Gamma mixtures, Bayesian inference, MCMC, Birth-death MCMC, Predictive distribution, M/G/1 queue, Optional service.

1 Published as: Mohammadi A., M. Salehi-Rad, and E. Wit (2013). Using Mixture of Gammas for Bayesian Analysis in an M/G/1 Queue with Optional Second Service, Computational Statistics 28(2).

6.2 Introduction

Bayesian analysis of queuing systems is a relatively recent research area; some useful references are (1, 2, 17, 3, 4, 5, 20). In the queuing systems considered there, customers depart the system after receiving their service. In many applied queuing systems, however, some customers need to be serviced again after their main service. For example, in a production line some items might fail and require repair; such items must be re-serviced. The primary aim of this paper is to propose a Bayesian inference scheme for an M/G/1 queuing system in which some customers, with probability p, need re-servicing. In this queuing system we have a single service unit, customers arrive following a Poisson process and demand service with a general distribution, and some customers need re-service. From a classical queuing theory perspective, this system has been studied by (23, 24), who considered three alternatives for re-servicing and obtained the mean busy period, the probability of the idle period and the probability generating function (pgf) of the steady-state system size.

The main contribution of this paper is to introduce a semi-parametric model for the general service and re-service densities based on a mixture of Gamma distributions, providing an alternative Bayesian approach for approximating general distributions in queuing systems. Secondly, we introduce a Bayesian algorithm based on the birth-death MCMC approach of (26) in order to fit this model to data.

The use of finite mixture distributions is very common, and the Bayesian approach provides an important tool in semi-parametric density estimation; see for instance (9, 22). Recently, MCMC methods for fully Bayesian mixture models of unknown dimension have been proposed; see (10). (13) introduced the reversible jump technique (RJ-MCMC), and (21) used this methodology to analyse Normal mixtures. This type of algorithm was used by (17) for mixtures of Exponential distributions, by (29) for mixtures of Gamma distributions and by (3) for mixtures of Erlang distributions. More recently, (26) rekindled interest in the use of continuous-time birth-death methodology (BD-MCMC) for variable-dimension problems; this methodology was used by (4) for mixtures of Erlang distributions and by (20) for mixtures of Pareto distributions. Moreover, (6) investigated the similarity between the reversible jump and birth-death methodologies.

The paper is structured as follows. In Section 6.3 we describe the M/G/1 queuing system with optional second service, where we consider a mixture of Gamma distributions to approximate the general densities of the service and re-service times; we then use results obtained by (23, 24), which allow us to estimate the mean number of customers in the system, the mean busy period and the probability of the idle period for this queuing system. In Section 6.4 we explain our Bayesian approach by defining prior distributions and deriving the conditional distributions, and we propose a birth-death MCMC algorithm to obtain a sample from the posterior distributions of the parameters of the service and re-service time distributions. In Section 6.5 we explain how to approximate the general densities of the service and re-service times using the data generated by the birth-death MCMC algorithm. In Section 6.6 we demonstrate how to estimate the system parameters and some performance measures of our queuing system from the BD-MCMC output. In Section 6.7 we illustrate our methodology in several simulation studies. We conclude the paper with a discussion of various extensions in Section 6.8.

6.3 The M/G/1 queuing system with optional second service

Throughout, we consider an M/G/1 queue with a First Come First Served discipline and independence between inter-arrival and service times, in which some customers need re-service with probability p. In this queuing system, failed items are stockpiled in a failed queue (FQ) and re-serviced only after all customers in the main queue (MQ) have been serviced. After completing the re-service of all items in FQ, the server returns to MQ if any customers are waiting there; otherwise, the server is idle. Thus, in this queuing system we have two queues and one server.

The variable $T$ is the inter-arrival time, with an exponential distribution. For the service times, we suppose that the service ($S$) and re-service ($\tilde{S}$) times are independent and have general distributions, denoted by $B_1(\cdot)$ and $B_2(\cdot)$, with means $\mu_1$, $\mu_2$ and variances $\delta_1$, $\delta_2$, respectively. For these general distributions we need a model that is flexible enough to deal with the typical features of service and re-service time distributions (skewness, multimodality, a large amount of mass near zero, possibly even a mode at zero) and that permits the usual computations in queuing applications. We therefore propose a semi-parametric model based on a mixture of Gamma distributions within a Bayesian framework. If $S$ is a service time, we assume

\[ B_1(s \mid \theta_1) = \sum_{i=1}^{k_1} \pi_{1i} \, G(s \mid \alpha_{1i}, \beta_{1i}), \quad s > 0, \]

where $\theta_1 = (k_1, \pi_1, \alpha_1, \beta_1)$, $k_1$ is the unknown number of mixture components, $\pi_1 = (\pi_{11}, \pi_{12}, \ldots, \pi_{1k_1})$ are the mixture weights, and $G(s \mid \alpha_{1i}, \beta_{1i})$ denotes the Gamma density with shape $\alpha_{1i}$ and rate $\beta_{1i}$ for $i = 1, \ldots, k_1$, that is,

\[ G(s \mid \alpha_{1i}, \beta_{1i}) = \frac{\beta_{1i}^{\alpha_{1i}}}{\Gamma(\alpha_{1i})} s^{\alpha_{1i} - 1} e^{-\beta_{1i} s}, \quad s > 0. \]
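A minimal R sketch of this mixture density, reused in later sketches, could look as follows; dmixgamma is our own helper name, with the shape/rate parameterization of $G(s \mid \alpha, \beta)$ above.

# Sketch: density of a Gamma mixture, sum_i w_i G(s | shape_i, rate_i).
dmixgamma <- function(s, weights, shape, rate) {
  rowSums(mapply(function(w, a, b) w * dgamma(s, shape = a, rate = b),
                 weights, shape, rate))
}

# Example: the two-component mixture 0.6 G(12, 1) + 0.4 G(3, 2)
s    <- seq(0.01, 25, length.out = 200)
dens <- dmixgamma(s, weights = c(0.6, 0.4), shape = c(12, 3), rate = c(1, 2))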

Likewise, if $\tilde{S}$ is a re-service time, we have

\[ B_2(\tilde{s} \mid \theta_2) = \sum_{i=1}^{k_2} \pi_{2i} \, G(\tilde{s} \mid \alpha_{2i}, \beta_{2i}), \quad \tilde{s} > 0, \]

where $\theta_2 = (k_2, \pi_2, \alpha_2, \beta_2)$, and $k_2$ and $\pi_2 = (\pi_{21}, \pi_{22}, \ldots, \pi_{2k_2})$ have the same interpretation as for the service time density.

Thus, we have a queuing system with two queues, a main queue and a failed queue, and one server. The parameters of this queuing system are $(\lambda, \theta_1, \theta_2, p)$, in which $\lambda$ is the rate of the inter-arrival times and $p$ is the probability that an item needs re-service.

6.3.1 Some performance measures in this queuing system

We assume that the queuing system is in equilibrium, which is equivalent to assuming that the traffic intensity $\rho$ is less than one (18). For this queuing system, $\rho = \rho_1 + p\rho_2$, in which $\rho_1 = \lambda\mu_1$ is the traffic intensity in MQ and $\rho_2 = \lambda\mu_2$ is the traffic intensity in FQ. As a result,

\[ \rho = \lambda(\mu_1 + p\mu_2). \tag{6.1} \]

Under this steady-state condition, the other performance measures can be obtained.

The expectation of the busy period and the probability of the idle period

For the expectation of the busy period of this queuing system we have

E[busy period] = E[busy period in MQ] + p E[busy period in FQ],

where the first expression equals $\mu_1/(1 - \lambda\mu_1)$ and the second equals $p\mu_2/(1 - \lambda\mu_1)$. Therefore, after some computation,

\[ E(\text{busy period}) = \frac{\mu_1 + p^2\mu_2}{1 - \lambda\mu_1}. \tag{6.2} \]

Furthermore, the probability of the idle period is

\[ \Pr(\text{idle period}) = \frac{1 - \lambda\mu_1}{1 + p^2\lambda\mu_2}. \tag{6.3} \]
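The steady-state quantities (6.1)-(6.3) are simple functions of $(\lambda, p, \mu_1, \mu_2)$; a small sketch, with hypothetical parameter values:

# Sketch: traffic intensity (6.1), mean busy period (6.2) and idle-period
# probability (6.3); mu1 and mu2 are the mean service and re-service times.
perf_measures <- function(lambda, p, mu1, mu2) {
  rho <- lambda * (mu1 + p * mu2)                               # (6.1)
  list(rho       = rho,
       stable    = rho < 1,
       busy_mean = (mu1 + p^2 * mu2) / (1 - lambda * mu1),      # (6.2)
       idle_prob = (1 - lambda * mu1) / (1 + p^2 * lambda * mu2))  # (6.3)
}

# Example with hypothetical values
perf_measures(lambda = 0.1, p = 0.3, mu1 = 5, mu2 = 3)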

For more details see (23, 24).

The expectation of the system size

For our queuing system, suppose that $X_n$ is the number of customers remaining in MQ at the completion of the $n$th customer's service time, and $Y_n$ is the number of customers remaining in FQ at the completion of the $n$th customer's service time, in the steady state. Using the joint probability generating function of $(X_n, Y_n)$, (19) obtained the following expressions for the mean system size.

Theorem 6.1. (19) The mean numbers of customers in MQ and FQ are

i)
\[ E(X_n) = \rho_1 + \frac{\lambda^2\delta_1 + \rho_1(2 - \rho_1)}{2(1 - \rho_1)} + \frac{p\big[\lambda^2\delta_2 + \rho_2(2 - \rho_2)\big] + \dfrac{p\rho_2\big[\lambda^2\delta_1 + \rho_1(2 - \rho_1)\big]}{1 - \rho_1}}{2\big(p\rho_2 + (1 - \rho_1)\Psi(1 - p + pB_2^*(\lambda))\big)}, \tag{6.4} \]

ii)
\[ E(Y_n) = \frac{p\big(2(1 - \rho_1)^2 + \lambda^2\delta_1 + \rho_1(1 - \rho_1)\big)}{2\big(p\rho_2 + (1 - \rho_1)\Psi(1 - p + pB_2^*(\lambda))\big)}, \tag{6.5} \]

respectively, where $\Psi(u) = u B_1^*[\lambda(1 - \Psi(u))]$, and $B_1^*(\cdot)$ and $B_2^*(\cdot)$ are the Laplace-Stieltjes transforms (LSTs) of the service and re-service time densities, respectively.

For the mixture-of-Gammas model of the service and re-service times, the LSTs of the service and re-service times are given by

\[ B_j^*(t) = \sum_{i=1}^{k_j} \pi_{ji} \left( \frac{\beta_{ji}}{t + \beta_{ji}} \right)^{\alpha_{ji}}, \quad j = 1, 2, \]

and the variances of the service and re-service times are given by

\[ \delta_j = \sum_{i=1}^{k_j} \pi_{ji}^2 \, \frac{\alpha_{ji}}{\beta_{ji}^2}, \quad j = 1, 2. \]

6.4 Bayesian inference

In this section we propose a Bayesian approach to infer the system parameters $(\lambda, \theta_1, \theta_2, p)$. We observe $n_t$ inter-arrival times $t = \{t_i\}_{i=1}^{n_t}$, $n_{s_1}$ service times $s_1 = \{s_{1i}\}_{i=1}^{n_{s_1}}$, $n_{s_2}$ re-service times $s_2 = \{s_{2i}\}_{i=1}^{n_{s_2}}$ and $n_p$ indicators $u = \{u_i\}_{i=1}^{n_p}$, in which $u_i = 1$ if customer $i$ needs re-service and $u_i = 0$ otherwise. We assume independence between the arrivals, service times, re-service times and the probability of re-service.

For the arrival rate $\lambda$, we assume a Gamma prior distribution, $\lambda \sim G(\xi, \psi)$. Conditional on the arrival data, the posterior distribution of $\lambda$ is then also Gamma, $G(\xi + n_t, \psi + \sum_{i=1}^{n_t} t_i)$. For the parameter $p$, we assume a Beta prior distribution, $p \sim \text{Beta}(a, b)$. The posterior distribution of $p$ given $u$ is $\text{Beta}(a + \sum_{i=1}^{n_p} u_i, \; b + n_p - \sum_{i=1}^{n_p} u_i)$.
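These two conjugate updates are immediate to implement; a minimal sketch, with hypothetical hyperparameter values and simulated data:

# Sketch: posterior draws for the arrival rate lambda and the re-service
# probability p under the conjugate priors G(xi, psi) and Beta(a, b).
post_lambda <- function(t, xi = 1, psi = 1, ndraw = 1000)
  rgamma(ndraw, shape = xi + length(t), rate = psi + sum(t))

post_p <- function(u, a = 1, b = 1, ndraw = 1000)
  rbeta(ndraw, a + sum(u), b + length(u) - sum(u))

# Example: 200 simulated inter-arrival times, 150 simulated re-service flags
t <- rexp(200, rate = 0.3)
u <- rbinom(150, 1, 0.25)
mean(post_lambda(t)); mean(post_p(u))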

In the following section, we propose a Bayesian framework based on mixtures of Gamma distributions to approximate the general distributions of the service and re-service times, $B_1(\cdot)$ and $B_2(\cdot)$.

6.4.1 Bayesian inference for finite Gamma mixtures

To determine a Bayesian framework for the general distributions, based on a mixture of Gamma distributions, we assume that

\[ B(s \mid \theta) = \sum_{i=1}^{k} \pi_i \, G(s \mid \alpha_i, \beta_i), \quad s > 0, \]

where $\theta = (k, \pi, \alpha, \beta)$. First, as is usually done in mixture models (e.g. (9)), we use a data augmentation algorithm, introducing for each datum $S_j$ a component indicator variable $Z_j$ such that

\[ P(Z_j = i \mid k, \pi) = \pi_i, \quad i = 1, \ldots, k. \]

The conditional service time density of $S_j$, given $Z_j$, is then

\[ S_j \mid Z_j = i \sim G(s_j \mid \alpha_i, \beta_i), \quad j = 1, \ldots, n_s. \]

Following (21), we assume that the joint prior distribution of the mixture parameters and labels factorizes as

\[ f(k, \pi, \alpha, \beta, z) = f(k) f(\pi \mid k) f(z \mid \pi, k) f(\alpha \mid k) f(\beta \mid k). \]

To determine the prior distributions of the mixture parameters, we first assume a truncated Poisson distribution for the mixture size $k$,

\[ P(K = k) \propto \frac{\gamma^k}{k!}, \quad k = 1, \ldots, k_{\max}. \tag{6.6} \]

Given $k$, we define prior distributions for the remaining parameters as

\[ \pi \mid k \sim D(\phi_1, \ldots, \phi_k), \tag{6.7} \]
\[ \alpha_i \mid k \sim G(\nu, \upsilon), \quad i = 1, \ldots, k, \tag{6.8} \]
\[ \beta_i \mid k \sim G(\eta, \tau), \quad i = 1, \ldots, k, \tag{6.9} \]

where $D(\phi_1, \ldots, \phi_k)$ denotes a Dirichlet distribution with parameters $\phi_i > 0$. Typically, we might set $\phi_i = 1$ for all $i = 1, \ldots, k$, giving a uniform $U(0, 1)$ prior for the weights.

Given $k$ and the data $s$, it is straightforward to show that the posterior conditional distributions required for the MCMC algorithm are

\[ P(Z_j = i \mid s, k, \pi, \alpha, \beta) \propto \pi_i \, G(s_j \mid \alpha_i, \beta_i), \quad i = 1, \ldots, k, \]
\[ \pi \mid s, z, k \sim D(\phi_1 + n_1, \ldots, \phi_k + n_k), \]
\[ \beta_i \mid s, z, k, \alpha \sim G\Big(\eta + n_i\alpha_i, \; \tau + \sum_{j: z_j = i} s_j\Big), \quad i = 1, \ldots, k, \]
\[ f(\alpha_i \mid s, z, k, \beta) \propto \left( \frac{\beta_i^{\alpha_i}}{\Gamma(\alpha_i)} \right)^{n_i} \Big( \prod_{j: z_j = i} s_j \Big)^{\alpha_i} \alpha_i^{\nu - 1} e^{-\upsilon\alpha_i}, \tag{6.10} \]

where $n_i = \#\{j : z_j = i\}$ for $i = 1, \ldots, k$.

This mixture model is invariant to permutation of the labels $i = 1, \ldots, k$. For identifiability, it is important to adopt a unique labelling. Unless stated otherwise, we require that the $\pi_i$ are increasing; thus the prior distribution of the parameters is $k!$ times the product of the individual densities, restricted to the set $\pi_1 < \ldots < \pi_k$; for more details see (27) and (25).
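The fixed-$k$ conditionals in (6.10) translate directly into Gibbs updates. The sketch below implements the updates of $z$, $\pi$ and $\beta$ (the $\alpha$ update requires the Metropolis-Hastings step described below); gibbs_zpib is our own helper name, and the Dirichlet draw uses normalized Gamma variates.

# Sketch: Gibbs updates (6.10) for the labels z, weights wt and rates beta,
# conditional on the mixture size k and the shapes alpha.
gibbs_zpib <- function(s, k, wt, alpha, beta, phi = 1, eta = 1, tau = 1) {
  # labels: P(z_j = i | ...) proportional to wt_i G(s_j | alpha_i, beta_i)
  pr <- sapply(seq_len(k), function(i) wt[i] * dgamma(s, alpha[i], rate = beta[i]))
  z  <- apply(pr, 1, function(row) sample(k, 1, prob = row))
  n  <- tabulate(z, nbins = k)
  # weights: Dirichlet(phi + n_1, ..., phi + n_k) via normalized Gammas
  g  <- rgamma(k, shape = phi + n, rate = 1)
  wt <- g / sum(g)
  # rates: beta_i | ... ~ G(eta + n_i alpha_i, tau + sum_{j: z_j = i} s_j)
  ssum <- vapply(seq_len(k), function(i) sum(s[z == i]), numeric(1))
  beta <- rgamma(k, shape = eta + n * alpha, rate = tau + ssum)
  list(z = z, wt = wt, beta = beta)
}

# Example with hypothetical values
out <- gibbs_zpib(s = rgamma(100, 3, 1), k = 2, wt = c(0.5, 0.5),
                  alpha = c(2, 6), beta = c(1, 1))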

To sample from the posterior distributions, there are two main approaches in the context of mixture modelling with an unknown number of components: the reversible jump MCMC (RJ-MCMC) methodology of (13), and the continuous-time birth-death MCMC (BD-MCMC) methodology of (26), who introduced birth-death processes for variable-dimension problems. (6) showed that the essential mechanism in the latter approach is the same as in the RJ-MCMC algorithm. Here we apply BD-MCMC, which is simpler to implement and which we have found to give better results in practice. We briefly outline this algorithm in the following section; for more details, including the construction of the BD-MCMC methodology, see (26) and (10).

6.4.2 BD-MCMC algorithm

In this subsection, we obtain a sample from the posterior distributions of the parameters $\theta = (k, \pi, \alpha, \beta)$ by a BD-MCMC algorithm. This algorithm is based on a birth-death process and was introduced by (26) in the context of Normal mixtures. In this approach, the model parameters are interpreted as observations from a marked point process, and the mixture size $k$ changes through births and deaths of mixture components that occur in continuous time. The rates at which these events happen determine the stationary distribution of the process.

In the birth-death process, births of mixture components occur at a constant rate, which we might set equal to the parameter $\gamma$ of the prior distribution of $k$ in (6.6). A birth increases the number of mixture components by one. Whenever a new component is born, its weight $\tilde{\pi}$ is generated from a Beta distribution with parameters $(1, k)$ and its remaining parameters are sampled from their posterior distributions. To include the new component, the old component weights are scaled down proportionally so that all the weights, including the new one, sum to 1, i.e. $\pi_i := \pi_i/(1 + \tilde{\pi})$. The death rate $\delta_j$ of each mixture component is the likelihood ratio of the model without and with this component, given by

\[ \delta_j = \prod_{r=1}^{n_s} \frac{B(s_r) - \pi_j \, G(s_r \mid \alpha_j, \beta_j)}{(1 - \pi_j) \, B(s_r)}, \quad j = 1, \ldots, k. \tag{6.11} \]

The total death rate of the process at any time, $\delta = \sum_j \delta_j$, is the sum of the individual death rates. A death decreases the number of mixture components by one. The birth and death processes are independent Poisson processes; thus the time to the next birth/death event is exponentially distributed with mean $1/(\delta + \gamma)$, and a birth or death occurs with probabilities proportional to $\gamma$ and $\delta$, respectively. With this in place, we define a BD-MCMC algorithm based on (26) as follows. Step one of the algorithm is the birth-death process described above. Following (26), we have chosen $t_0 = 1$ and a birth rate equal to the parameter $\gamma$. As expected, we have found in practice that larger values of the birth rate produce better mixing, but require more computation time.
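The death rates (6.11) are products over all observations and are best accumulated on the log scale; a sketch, again with our own helper name and illustrative values:

# Sketch: death rates (6.11) for every component of the current mixture,
# computed on the log scale for numerical stability.
death_rates <- function(s, wt, alpha, beta) {
  k    <- length(wt)
  comp <- sapply(seq_len(k), function(i) wt[i] * dgamma(s, alpha[i], rate = beta[i]))
  mixd <- rowSums(comp)                        # B(s_r) at every observation
  sapply(seq_len(k), function(j)
    exp(sum(log(mixd - comp[, j]) - log(1 - wt[j]) - log(mixd))))
}

# Example: a component far from the data receives a large death rate
s <- rgamma(200, shape = 3, rate = 1)
death_rates(s, wt = c(0.5, 0.4, 0.1), alpha = c(3, 3, 50), beta = c(1, 1, 1))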

Algorithm 3.1. Starting with initial values $k^{(0)}, \pi^{(0)}, \alpha^{(0)}$ and $\beta^{(0)}$, iterate the following steps:

1. Run the birth-death process for a fixed time $t_0$ with birth rate $\gamma$:
   1.1. Compute the death rate $\delta_j$ for each component and the total death rate $\delta = \sum_j \delta_j$.
   1.2. Simulate the time to the next jump from an exponential distribution with mean $1/(\delta + \gamma)$.
   1.3. If the accumulated run time is less than $t_0$, continue; otherwise proceed to step 2.
   1.4. Simulate the type of jump, birth or death, with probabilities
   \[ \Pr(\text{birth}) = \frac{\gamma}{\gamma + \delta}, \qquad \Pr(\text{death}) = \frac{\delta}{\gamma + \delta}. \]
   1.5. Adjust the mixture components.

MCMC steps conditional on $k$:

2. Update the latent variables by sampling from $z^{(i+1)} \sim z \mid s, k^{(i+1)}, \pi^{(i)}, \alpha^{(i)}, \beta^{(i)}$.
3. Update the weights by sampling from $\pi^{(i+1)} \sim \pi \mid s, k^{(i+1)}, z^{(i+1)}$.
4. For $r = 1, \ldots, k^{(i+1)}$:
   4.1. Update the rate parameters by sampling from $\beta_r^{(i+1)} \sim \beta_r \mid s, k^{(i+1)}, z^{(i+1)}$.
   4.2. Update $\alpha_r$ using a Metropolis-Hastings step.
5. Set $i = i + 1$ and go to step 1.

Steps 2, 3 and 4.1 are standard Gibbs sampling moves, in which the model parameters are updated conditional on the mixture size $k$. The only complicated step is 4.2, where we introduce a Metropolis-Hastings step (16) to sample from the posterior distribution of $\alpha_r$. Motivated by the shape of the target distribution (6.10), we propose from a Gamma distribution with parameters $G(\nu, \upsilon)$. With this proposal distribution, the acceptance probability for a candidate point $\alpha'$ becomes

\[ \Pr(\alpha_r, \alpha') = \min\left\{ 1, \left( \frac{\Gamma(\alpha_r)}{\Gamma(\alpha')} \right)^{n_r} \beta_r^{n_r(\alpha' - \alpha_r)} \Big( \prod_{j: z_j = r} s_j \Big)^{\alpha' - \alpha_r} \right\}. \]

Remark 3.1. Owing to the overall similarity in shape between the proposal distribution and the target distribution, the acceptance probability behaves much better than in previous work (29, 3): in the simulation studies of Section 6.7 we obtain acceptance probabilities of around 20%, compared to around 1% elsewhere.
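Step 4.2 can be sketched as follows, proposing from the prior $G(\nu, \upsilon)$ so that the prior terms cancel and the acceptance probability reduces to the expression above; the hyperparameter values in the example are hypothetical.

# Sketch: Metropolis-Hastings update of alpha_r (step 4.2), with an
# independent G(nu, upsilon) proposal, i.e. the prior (6.8).
update_alpha <- function(alpha_r, beta_r, s_r, nu = 2, upsilon = 1) {
  n_r       <- length(s_r)
  alpha_new <- rgamma(1, shape = nu, rate = upsilon)
  log_acc   <- n_r * (lgamma(alpha_r) - lgamma(alpha_new)) +
               (alpha_new - alpha_r) * (n_r * log(beta_r) + sum(log(s_r)))
  if (log(runif(1)) < log_acc) alpha_new else alpha_r
}

# Example with hypothetical component data
update_alpha(alpha_r = 3, beta_r = 1, s_r = rgamma(50, 3, 1))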

Algorithm 3.1 produces a sample from the posterior distributions. We can therefore run this BD-MCMC algorithm for the parameters of the service and re-service time densities, $\theta_1 = (k_1, \pi_1, \alpha_1, \beta_1)$ and $\theta_2 = (k_2, \pi_2, \alpha_2, \beta_2)$, respectively.

6.4.3 Model identification

The parameters $k_1$ and $k_2$ are model parameters, identifying models of a particular complexity. In this subsection we briefly discuss how to perform Bayesian model identification and compare it with other approaches. There are many model selection criteria, such as AIC, BIC, DIC, DIC+, MDL, Bayesian p-values and posterior predictive checks (for more details see (8)), but most of them are either unsuitable for mixture models or complex (7). This diversity of approaches, especially for variable-dimension parameters, reflects the different flavours of the model determination question that statisticians face. In reality, there are a number of reasons why the simple idealized view fails to reflect practical applications. We briefly describe some fundamental issues that face the practitioner wishing to perform model choice in a real Bayesian problem; we omit further details, as they are covered in (14) and (15).

First of all, prior model probabilities may be fictional: the ideal Bayesian has real prior probabilities reflecting scientific judgement or belief across the model space, but in practice such priors may not be available. Secondly, Bayesian models have little chance of passing the test of a sensitivity analysis: in ordinary parametric problems we commonly find that inferences are rather insensitive to moderately large variations in prior assumptions, except when there are very few data; the opposite case, of high sensitivity, poses a greater challenge to the non-Bayesian as well, as perhaps the data carry less information than hoped. Moreover, there may be problems with improper parameter priors: in ordinary parametric problems it is commonly safe to use improper priors, specifically when the posterior distributions are well-defined as limits of a sequence of approximating proper priors; when comparing models, however, improper parameter priors make Bayes factors indeterminate.

For these reasons, we use the marginal posterior probabilities of $k_1$ and $k_2$ for model inference. In fact, rather than selecting a single best model, these probabilities allow the Bayesian to use model averaging strategies or more qualitative model comparisons.

6.5 Predictive densities

Using the BD-MCMC algorithm we can produce samples from the posterior distributions of the service and re-service time distributions. Given the BD-MCMC output

of size $N$ after a burn-in period, for $\theta_1$ and $\theta_2$, and under suitable regularity conditions (see, e.g., (28, page 65)), the quantities of interest can be consistently estimated by sample path averages.

We first estimate the mixture sizes of the service and re-service distributions, $k_1$ and $k_2$. The estimates of the marginal posterior distributions of $k_1$ and $k_2$ are

\[ \widehat{\Pr}(k_r = k \mid \text{data}) = \frac{1}{N} \, \#\left\{ n : k_r^{(n)} = k \right\}, \quad r = 1, 2, \tag{6.12} \]

which converge to $\Pr(k_r = k \mid \text{data})$ as $N \to \infty$. These probabilities provide a tool for determining the number of phases of the service and re-service distributions.

We can estimate the predictive densities of the service and re-service time distributions using

\[ \hat{B}_r(t \mid s_r) \approx \frac{1}{N} \sum_{j=1}^{N} \sum_{i=1}^{k_r^{(j)}} \pi_{ri}^{(j)} \, G\big(t \mid \alpha_{ri}^{(j)}, \beta_{ri}^{(j)}\big), \quad r = 1, 2. \tag{6.13} \]

Alternatively, the predictive densities can be estimated by the plug-in rule

\[ \hat{B}_r(t \mid \hat{\theta}_r) = \sum_{i=1}^{\hat{k}_r} \hat{\pi}_{ri} \, G(t \mid \hat{\alpha}_{ri}, \hat{\beta}_{ri}), \quad t > 0, \]

where $\hat{k}_r$ maximizes the posterior probability (6.12). Note that when the posterior distributions of $k_r$, $\alpha_r$ and $\beta_r$ are fairly spread out, or even multi-modal, these plug-in estimates would give a poor approximation of the predictive densities.
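The estimate (6.13) averages the mixture density over the stored draws; a sketch, where the list structure draws and its field names are our own convention for the BD-MCMC output, not a fixed interface:

# Sketch: predictive density estimate (6.13) from N stored draws, each a
# list holding the weights wt, shapes alpha and rates beta of that iteration
# (assuming at least two components per draw, for the matrix reshaping).
predictive_density <- function(t, draws) {
  rowMeans(sapply(draws, function(d)
    rowSums(mapply(function(w, a, b) w * dgamma(t, a, rate = b),
                   d$wt, d$alpha, d$beta))))
}

# Example with two hypothetical stored draws
draws <- list(list(wt = c(0.6, 0.4), alpha = c(12, 3), beta = c(1, 2)),
              list(wt = c(0.5, 0.5), alpha = c(11, 3), beta = c(0.9, 2)))
t        <- seq(0.1, 25, length.out = 100)
dens_hat <- predictive_density(t, draws)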

6.6 Estimation of some performance measures via the BD-MCMC output

Given a sample realization of the MCMC output, together with samples from $f(\lambda \mid t)$ and $f(p \mid u)$ of equal size, we can estimate the performance measures. For example, given sample data, we would like to assess whether or not the system is stable. The system is stable if and only if the traffic intensity $\rho$ is less than one. Thus, the estimate of the probability of having a stable system is

\[ \Pr(\rho < 1 \mid t, u, s_1, s_2) \approx \frac{1}{N} \, \#\left\{ n : \rho^{(n)} < 1 \right\}, \tag{6.14} \]

where, according to (6.1),

\[ \rho^{(n)} = \lambda^{(n)} \left( \sum_{i=1}^{k_1^{(n)}} \pi_{1i}^{(n)} \frac{\alpha_{1i}^{(n)}}{\beta_{1i}^{(n)}} + p^{(n)} \sum_{i=1}^{k_2^{(n)}} \pi_{2i}^{(n)} \frac{\alpha_{2i}^{(n)}}{\beta_{2i}^{(n)}} \right), \]

in which $\{(k_1^{(n)}, \pi_1^{(n)}, \alpha_1^{(n)}, \beta_1^{(n)})\}_{n=1}^N$ and $\{(k_2^{(n)}, \pi_2^{(n)}, \alpha_2^{(n)}, \beta_2^{(n)})\}_{n=1}^N$ are the samples of size $N$ obtained from the BD-MCMC algorithm, and $\{\lambda^{(n)}\}_{n=1}^N$ and $\{p^{(n)}\}_{n=1}^N$ are samples of size $N$ generated from the posterior distributions of $\lambda$ and $p$, respectively. A consistent estimator of the traffic intensity is

\[ E(\rho \mid t, u, s_1, s_2) \approx E(\lambda \mid t) \, \frac{1}{N} \sum_{n=1}^{N} \left( \sum_{i=1}^{k_1^{(n)}} \pi_{1i}^{(n)} \frac{\alpha_{1i}^{(n)}}{\beta_{1i}^{(n)}} + p^{(n)} \sum_{i=1}^{k_2^{(n)}} \pi_{2i}^{(n)} \frac{\alpha_{2i}^{(n)}}{\beta_{2i}^{(n)}} \right), \tag{6.15} \]

where $E(\lambda \mid t) = (\xi + n_t)/(\psi + \sum_{i=1}^{n_t} t_i)$.

Moreover, using the MCMC estimates of the system parameters $(\lambda, \theta_1, \theta_2, p)$, we can estimate $\rho_1$, $\rho_2$, $\rho$, $\delta_1$, $\delta_2$, the mean system sizes (6.4) and (6.5), the mean busy period (6.2) and the probability of an idle period (6.3), as illustrated in the simulation study below.
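Given stored posterior draws, the estimates (6.14) and (6.15) reduce to simple averages; the sketch below again uses our own convention for the names of the stored fields.

# Sketch: Monte Carlo estimates (6.14)-(6.15); each element of draws holds
# that iteration's lambda, p and the two mixtures (wt1, alpha1, beta1 and
# wt2, alpha2, beta2).
rho_draws <- function(draws) {
  sapply(draws, function(d)
    d$lambda * (sum(d$wt1 * d$alpha1 / d$beta1) +
                d$p * sum(d$wt2 * d$alpha2 / d$beta2)))
}

# Example with one hypothetical draw; with real output, mean(rho < 1)
# estimates (6.14) and mean(rho) estimates (6.15)
draws <- list(list(lambda = 0.1, p = 0.3, wt1 = c(0.6, 0.4), alpha1 = c(12, 3),
                   beta1 = c(2, 2), wt2 = 1, alpha2 = 3, beta2 = 1))
rho <- rho_draws(draws); mean(rho < 1); mean(rho)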

6.7 Simulations

This section illustrates the accuracy of the Bayesian methodology in two simulated examples of the M/G/1 queuing system with optional second service. In the first example, we assume that the true densities of both the service and re-service times are mixtures of Gamma distributions: a mixture of two Gamma distributions for the service times and a mixture of three Gamma distributions for the re-service times. In order to test how our methodology deals with model misspecification, the second example considers a more complicated setting, in which the true service time density is a mixture of two truncated Normal distributions and the true re-service time density is a Log-Normal distribution. The R code is available under the research link and will soon be available as an R package.

6.7.1 Simulation study: mixture of Gammas

Without loss of generality, we assume that the inter-arrival rate $\lambda$ is known and equal to 0.26, and that the probability of re-service $p$ is also known and equal to 0.3. We consider a sample of 1,000 service times from the mixture of two Gamma distributions

\[ B_1(s) = 0.6\,G(12, 1) + 0.4\,G(3, 2). \]

For the re-service times, 1,000 observations were simulated from the mixture of three Gamma distributions

\[ B_2(\tilde{s}) = 0.6\,G(100, 100/3) + 0.3\,G(200, 50) + 0.1\,G(300, 60). \]

We assumed a Poisson prior distribution for $k_1$ with parameter $\gamma = 2$, truncated at 100. For the remaining parameters we set $\phi_1 = \ldots = \phi_k = 1$ in (6.7), chose $\nu = \bar{s}_1^2/\sigma^2_{s_1}$ and $\upsilon$ accordingly in (6.8), and chose $\eta$ and $\tau$ from the corresponding empirical moments in (6.9). For the re-service data we take $\gamma = 3$, with the other hyperparameters chosen in the same way as their service-time equivalents.

For each service and re-service data set, we carried out the Bayesian approach described in Section 6.4. We ran 200,000 iterations of the BD-MCMC algorithm, with 100,000 iterations as burn-in. From the diagnostics it is clear that these numbers exceed what is needed for reliable results; methods for choosing a burn-in time and the number of iterations to use after burn-in are discussed in (12).
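For reference, the simulated service and re-service data of this study can be generated as follows; rmixgamma is our own helper and the seed is arbitrary.

# Sketch: simulating from the two Gamma mixtures of this study.
rmixgamma <- function(n, weights, shape, rate) {
  i <- sample(length(weights), n, replace = TRUE, prob = weights)
  rgamma(n, shape = shape[i], rate = rate[i])
}

set.seed(42)  # arbitrary seed, for reproducibility of this sketch
s1 <- rmixgamma(1000, c(0.6, 0.4), shape = c(12, 3), rate = c(1, 2))
s2 <- rmixgamma(1000, c(0.6, 0.3, 0.1), shape = c(100, 200, 300),
                rate = c(100 / 3, 50, 60))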

Fig. 6.1 Predictive densities (solid line) and true densities (dotted line) for (left) the service time data and (right) the re-service time data.

Figure 6.1 shows histograms of the generated data sets together with the estimates of the predictive densities for the service and re-service times, obtained from formula (6.13). The predictive densities for the service and re-service times compare quite well with the true densities.

Fig. 6.2 (Left) A trace of $k_1$ for 200,000 iterations after 100,000 burn-in iterations, and (right) the same for $k_2$.

Figure 6.2 illustrates the mixing properties of the algorithm in terms of the evolution of the mixture sizes $k_1$ (left) and $k_2$ (right). An essential element of the performance of our BD-MCMC algorithm is its ability to move between different values of $k_1$ and $k_2$. The chains appear to mix quite well, visiting many states, for both the service and re-service times.

Fig. 6.3 (Left) The estimated posterior distribution of $k_1$, and (right) that of $k_2$.

Figure 6.3 shows the estimated posterior distribution of $k_1$ (left) and of $k_2$ (right), obtained from formula (6.12). A useful check on stationarity is given by the plot of the cumulative occupancy fractions for different values of $k_1$ and $k_2$ against the number of iterations. These are shown in Figure 6.4 for the service and re-service data sets, where it can be seen that the burn-in is more than adequate to achieve stability in the occupancy fractions.

Fig. 6.4 The cumulative occupancy fractions for (left) the service data, $k_1$, and (right) the re-service data, $k_2$, for a complete run including burn-in.

In the first and second rows of Table 6.1, we tabulate the true and estimated values of, respectively, the mean busy period, the probability of an idle period, the expected number of customers in MQ, the expected number of customers in FQ, the probability of having a stable system and the traffic intensity; these have been obtained from formulas (6.2), (6.3), (6.4), (6.5), (6.14) and (6.15), respectively. The third row of the table shows the standard deviations (SD) of these estimates.

Table 6.1 True values and estimates of the mean busy period $E(\text{busy period})$, the probability of an idle period $P(\text{idle period})$, the mean numbers of customers in the system $E(X_n)$ and $E(Y_n)$, the probability that the system is stable $P(\rho < 1 \mid \text{data})$ and the traffic intensity $E(\rho \mid \text{data})$; the third row gives the standard deviation (SD) of each estimate.

The true value of the traffic intensity $\rho$ follows from formula (6.1). A 95% credible interval is roughly given by the estimate $\pm 1.96$ SD; by this criterion, all the true values lie inside their 95% credible intervals.

6.7.2 Simulation study: mixture of truncated Normals and a Log-Normal

In this section we assess the effect of model misspecification on our estimation procedure. As in the previous example, we assume that the inter-arrival rate and the probability of re-service are known, $\lambda = 0.28$ and $p = 0.25$. We consider a sample of 1,000 service times from a mixture of two Normal distributions truncated to the interval $(0, \infty)$,

\[ B_1(s) = 0.4\,TN_{(0,\infty)}(1.4, 2.3) + 0.6\,TN_{(0,\infty)}(0.2, 0.3). \]

For the re-service times, 1,000 observations were simulated from a single Log-Normal distribution,

\[ B_2(\tilde{s}) = LN(1, 0.5). \]

With the same prior assumptions as in the previous example, we ran 200,000 iterations of the BD-MCMC algorithm, with 100,000 iterations as burn-in.

Figure 6.5 shows histograms of the generated data sets with the true densities for the service and re-service times, together with the predictive densities obtained from formula (6.13). Despite the fact that the true model is not part of our model class, the mixture-of-Gammas model is sufficiently rich to approximate quite general densities.

Figure 6.6 illustrates the mixing properties of the algorithm in terms of the evolution of the mixture sizes $k_1$ (left) and $k_2$ (right). An essential element of the performance of our BD-MCMC algorithm is its ability to move between different values of $k_1$ and $k_2$; the chains appear to mix quite well, visiting many states, for both the service and re-service times. Figure 6.7 shows the estimated posterior distribution of $k_1$ (left) and of $k_2$ (right), obtained from formula (6.12). To check stationarity, Figure 6.8 shows the cumulative occupancy fractions for different values of $k_1$ and $k_2$ against the number of iterations; these are shown for the service and re-service data sets, and the burn-in again appears more than adequate to achieve stability in the occupancy fractions.

Fig. 6.5 Predictive densities (solid line) and true densities (dotted line) for (left) the service time data and (right) the re-service time data.

Fig. 6.6 (Left) A trace of $k_1$ for 200,000 iterations after 100,000 burn-in iterations, and (right) the same for $k_2$.

Fig. 6.7 (Left) The estimated posterior distribution of $k_1$, and (right) that of $k_2$.

Fig. 6.8 The cumulative occupancy fractions for (left) the service data, $k_1$, and (right) the re-service data, $k_2$, for a complete run including burn-in.


More information

ABC methods for phase-type distributions with applications in insurance risk problems

ABC methods for phase-type distributions with applications in insurance risk problems ABC methods for phase-type with applications problems Concepcion Ausin, Department of Statistics, Universidad Carlos III de Madrid Joint work with: Pedro Galeano, Universidad Carlos III de Madrid Simon

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning MCMC and Non-Parametric Bayes Mark Schmidt University of British Columbia Winter 2016 Admin I went through project proposals: Some of you got a message on Piazza. No news is

More information

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 14. Sampling Methods Prof. Daniel Cremers 14. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Bayesian Nonparametric Regression for Diabetes Deaths

Bayesian Nonparametric Regression for Diabetes Deaths Bayesian Nonparametric Regression for Diabetes Deaths Brian M. Hartman PhD Student, 2010 Texas A&M University College Station, TX, USA David B. Dahl Assistant Professor Texas A&M University College Station,

More information

CPSC 540: Machine Learning

CPSC 540: Machine Learning CPSC 540: Machine Learning Undirected Graphical Models Mark Schmidt University of British Columbia Winter 2016 Admin Assignment 3: 2 late days to hand it in today, Thursday is final day. Assignment 4:

More information

FastGP: an R package for Gaussian processes

FastGP: an R package for Gaussian processes FastGP: an R package for Gaussian processes Giri Gopalan Harvard University Luke Bornn Harvard University Many methodologies involving a Gaussian process rely heavily on computationally expensive functions

More information

Graphical Models for Collaborative Filtering

Graphical Models for Collaborative Filtering Graphical Models for Collaborative Filtering Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Sequence modeling HMM, Kalman Filter, etc.: Similarity: the same graphical model topology,

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 7, 04 Reading: See class website Eric Xing @ CMU, 005-04

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ Work with: Iain Murray and Hyun-Chul

More information

Machine Learning for Data Science (CS4786) Lecture 24

Machine Learning for Data Science (CS4786) Lecture 24 Machine Learning for Data Science (CS4786) Lecture 24 Graphical Models: Approximate Inference Course Webpage : http://www.cs.cornell.edu/courses/cs4786/2016sp/ BELIEF PROPAGATION OR MESSAGE PASSING Each

More information

arxiv: v1 [stat.ap] 27 Mar 2015

arxiv: v1 [stat.ap] 27 Mar 2015 Submitted to the Annals of Applied Statistics A NOTE ON THE SPECIFIC SOURCE IDENTIFICATION PROBLEM IN FORENSIC SCIENCE IN THE PRESENCE OF UNCERTAINTY ABOUT THE BACKGROUND POPULATION By Danica M. Ommen,

More information

An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application

An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application An Introduction to Reversible Jump MCMC for Bayesian Networks, with Application, CleverSet, Inc. STARMAP/DAMARS Conference Page 1 The research described in this presentation has been funded by the U.S.

More information

ST 740: Markov Chain Monte Carlo

ST 740: Markov Chain Monte Carlo ST 740: Markov Chain Monte Carlo Alyson Wilson Department of Statistics North Carolina State University October 14, 2012 A. Wilson (NCSU Stsatistics) MCMC October 14, 2012 1 / 20 Convergence Diagnostics:

More information

Probabilistic Graphical Models

Probabilistic Graphical Models 2016 Robert Nowak Probabilistic Graphical Models 1 Introduction We have focused mainly on linear models for signals, in particular the subspace model x = Uθ, where U is a n k matrix and θ R k is a vector

More information

Coupled Hidden Markov Models: Computational Challenges

Coupled Hidden Markov Models: Computational Challenges .. Coupled Hidden Markov Models: Computational Challenges Louis J. M. Aslett and Chris C. Holmes i-like Research Group University of Oxford Warwick Algorithms Seminar 7 th March 2014 ... Hidden Markov

More information

VCMC: Variational Consensus Monte Carlo

VCMC: Variational Consensus Monte Carlo VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object

More information

Bayesian (conditionally) conjugate inference for discrete data models. Jon Forster (University of Southampton)

Bayesian (conditionally) conjugate inference for discrete data models. Jon Forster (University of Southampton) Bayesian (conditionally) conjugate inference for discrete data models Jon Forster (University of Southampton) with Mark Grigsby (Procter and Gamble?) Emily Webb (Institute of Cancer Research) Table 1:

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage

BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage BAGUS: Bayesian Regularization for Graphical Models with Unequal Shrinkage Lingrui Gan, Naveen N. Narisetty, Feng Liang Department of Statistics University of Illinois at Urbana-Champaign Problem Statement

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods Prof. Daniel Cremers 11. Sampling Methods Sampling Methods Sampling Methods are widely used in Computer Science as an approximation of a deterministic algorithm to represent uncertainty without a parametric

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo

Computer Vision Group Prof. Daniel Cremers. 11. Sampling Methods: Markov Chain Monte Carlo Group Prof. Daniel Cremers 11. Sampling Methods: Markov Chain Monte Carlo Markov Chain Monte Carlo In high-dimensional spaces, rejection sampling and importance sampling are very inefficient An alternative

More information

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010

Chris Fraley and Daniel Percival. August 22, 2008, revised May 14, 2010 Model-Averaged l 1 Regularization using Markov Chain Monte Carlo Model Composition Technical Report No. 541 Department of Statistics, University of Washington Chris Fraley and Daniel Percival August 22,

More information

Study Notes on the Latent Dirichlet Allocation

Study Notes on the Latent Dirichlet Allocation Study Notes on the Latent Dirichlet Allocation Xugang Ye 1. Model Framework A word is an element of dictionary {1,,}. A document is represented by a sequence of words: =(,, ), {1,,}. A corpus is a collection

More information

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet

Stat 535 C - Statistical Computing & Monte Carlo Methods. Lecture 18-16th March Arnaud Doucet Stat 535 C - Statistical Computing & Monte Carlo Methods Lecture 18-16th March 2006 Arnaud Doucet Email: arnaud@cs.ubc.ca 1 1.1 Outline Trans-dimensional Markov chain Monte Carlo. Bayesian model for autoregressions.

More information

13: Variational inference II

13: Variational inference II 10-708: Probabilistic Graphical Models, Spring 2015 13: Variational inference II Lecturer: Eric P. Xing Scribes: Ronghuo Zheng, Zhiting Hu, Yuntian Deng 1 Introduction We started to talk about variational

More information

Inference in Bayesian Networks

Inference in Bayesian Networks Andrea Passerini passerini@disi.unitn.it Machine Learning Inference in graphical models Description Assume we have evidence e on the state of a subset of variables E in the model (i.e. Bayesian Network)

More information

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models

An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Proceedings 59th ISI World Statistics Congress, 25-30 August 2013, Hong Kong (Session CPS023) p.3938 An Algorithm for Bayesian Variable Selection in High-dimensional Generalized Linear Models Vitara Pungpapong

More information

arxiv: v1 [stat.co] 2 Nov 2017

arxiv: v1 [stat.co] 2 Nov 2017 Binary Bouncy Particle Sampler arxiv:1711.922v1 [stat.co] 2 Nov 217 Ari Pakman Department of Statistics Center for Theoretical Neuroscience Grossman Center for the Statistics of Mind Columbia University

More information

Comment on Article by Scutari

Comment on Article by Scutari Bayesian Analysis (2013) 8, Number 3, pp. 543 548 Comment on Article by Scutari Hao Wang Scutari s paper studies properties of the distribution of graphs ppgq. This is an interesting angle because it differs

More information

Learning in Bayesian Networks

Learning in Bayesian Networks Learning in Bayesian Networks Florian Markowetz Max-Planck-Institute for Molecular Genetics Computational Molecular Biology Berlin Berlin: 20.06.2002 1 Overview 1. Bayesian Networks Stochastic Networks

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

13 : Variational Inference: Loopy Belief Propagation and Mean Field

13 : Variational Inference: Loopy Belief Propagation and Mean Field 10-708: Probabilistic Graphical Models 10-708, Spring 2012 13 : Variational Inference: Loopy Belief Propagation and Mean Field Lecturer: Eric P. Xing Scribes: Peter Schulam and William Wang 1 Introduction

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models Brown University CSCI 295-P, Spring 213 Prof. Erik Sudderth Lecture 11: Inference & Learning Overview, Gaussian Graphical Models Some figures courtesy Michael Jordan s draft

More information

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA)

Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) Estimating the marginal likelihood with Integrated nested Laplace approximation (INLA) arxiv:1611.01450v1 [stat.co] 4 Nov 2016 Aliaksandr Hubin Department of Mathematics, University of Oslo and Geir Storvik

More information

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo

Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Probabilistic Graphical Models Lecture 17: Markov chain Monte Carlo Andrew Gordon Wilson www.cs.cmu.edu/~andrewgw Carnegie Mellon University March 18, 2015 1 / 45 Resources and Attribution Image credits,

More information

Lecture 20: Reversible Processes and Queues

Lecture 20: Reversible Processes and Queues Lecture 20: Reversible Processes and Queues 1 Examples of reversible processes 11 Birth-death processes We define two non-negative sequences birth and death rates denoted by {λ n : n N 0 } and {µ n : n

More information

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures

Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures 17th Europ. Conf. on Machine Learning, Berlin, Germany, 2006. Variational Bayesian Dirichlet-Multinomial Allocation for Exponential Family Mixtures Shipeng Yu 1,2, Kai Yu 2, Volker Tresp 2, and Hans-Peter

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods Tomas McKelvey and Lennart Svensson Signal Processing Group Department of Signals and Systems Chalmers University of Technology, Sweden November 26, 2012 Today s learning

More information

Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment

Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment Yong Huang a,b, James L. Beck b,* and Hui Li a a Key Lab

More information

Marginal Specifications and a Gaussian Copula Estimation

Marginal Specifications and a Gaussian Copula Estimation Marginal Specifications and a Gaussian Copula Estimation Kazim Azam Abstract Multivariate analysis involving random variables of different type like count, continuous or mixture of both is frequently required

More information

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R

The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R The picasso Package for Nonconvex Regularized M-estimation in High Dimensions in R Xingguo Li Tuo Zhao Tong Zhang Han Liu Abstract We describe an R package named picasso, which implements a unified framework

More information

an introduction to bayesian inference

an introduction to bayesian inference with an application to network analysis http://jakehofman.com january 13, 2010 motivation would like models that: provide predictive and explanatory power are complex enough to describe observed phenomena

More information

Introduction to Machine Learning CMU-10701

Introduction to Machine Learning CMU-10701 Introduction to Machine Learning CMU-10701 Markov Chain Monte Carlo Methods Barnabás Póczos & Aarti Singh Contents Markov Chain Monte Carlo Methods Goal & Motivation Sampling Rejection Importance Markov

More information

The ratio of normalizing constants for Bayesian graphical Gaussian model selection Mohammadi, A.; Massam, Helene; Letac, Gerard

The ratio of normalizing constants for Bayesian graphical Gaussian model selection Mohammadi, A.; Massam, Helene; Letac, Gerard UvA-DARE (Digital Academic Repository The ratio of normalizing constants for Bayesian graphical Gaussian model selection Mohammadi, A.; Massam, Helene; Letac, Gerard Published in: ArXiv e-prints Link to

More information

MCMC algorithms for fitting Bayesian models

MCMC algorithms for fitting Bayesian models MCMC algorithms for fitting Bayesian models p. 1/1 MCMC algorithms for fitting Bayesian models Sudipto Banerjee sudiptob@biostat.umn.edu University of Minnesota MCMC algorithms for fitting Bayesian models

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Gentle Introduction to Infinite Gaussian Mixture Modeling

Gentle Introduction to Infinite Gaussian Mixture Modeling Gentle Introduction to Infinite Gaussian Mixture Modeling with an application in neuroscience By Frank Wood Rasmussen, NIPS 1999 Neuroscience Application: Spike Sorting Important in neuroscience and for

More information

Learning Markov Network Structure using Brownian Distance Covariance

Learning Markov Network Structure using Brownian Distance Covariance arxiv:.v [stat.ml] Jun 0 Learning Markov Network Structure using Brownian Distance Covariance Ehsan Khoshgnauz May, 0 Abstract In this paper, we present a simple non-parametric method for learning the

More information

On Bayesian Computation

On Bayesian Computation On Bayesian Computation Michael I. Jordan with Elaine Angelino, Maxim Rabinovich, Martin Wainwright and Yun Yang Previous Work: Information Constraints on Inference Minimize the minimax risk under constraints

More information

Stochastic Proximal Gradient Algorithm

Stochastic Proximal Gradient Algorithm Stochastic Institut Mines-Télécom / Telecom ParisTech / Laboratoire Traitement et Communication de l Information Joint work with: Y. Atchade, Ann Arbor, USA, G. Fort LTCI/Télécom Paristech and the kind

More information

Contents. Part I: Fundamentals of Bayesian Inference 1

Contents. Part I: Fundamentals of Bayesian Inference 1 Contents Preface xiii Part I: Fundamentals of Bayesian Inference 1 1 Probability and inference 3 1.1 The three steps of Bayesian data analysis 3 1.2 General notation for statistical inference 4 1.3 Bayesian

More information

The Ising model and Markov chain Monte Carlo

The Ising model and Markov chain Monte Carlo The Ising model and Markov chain Monte Carlo Ramesh Sridharan These notes give a short description of the Ising model for images and an introduction to Metropolis-Hastings and Gibbs Markov Chain Monte

More information

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture)

18 : Advanced topics in MCMC. 1 Gibbs Sampling (Continued from the last lecture) 10-708: Probabilistic Graphical Models 10-708, Spring 2014 18 : Advanced topics in MCMC Lecturer: Eric P. Xing Scribes: Jessica Chemali, Seungwhan Moon 1 Gibbs Sampling (Continued from the last lecture)

More information

10708 Graphical Models: Homework 2

10708 Graphical Models: Homework 2 10708 Graphical Models: Homework 2 Due Monday, March 18, beginning of class Feburary 27, 2013 Instructions: There are five questions (one for extra credit) on this assignment. There is a problem involves

More information

Part 1: Expectation Propagation

Part 1: Expectation Propagation Chalmers Machine Learning Summer School Approximate message passing and biomedicine Part 1: Expectation Propagation Tom Heskes Machine Learning Group, Institute for Computing and Information Sciences Radboud

More information

Probabilistic Time Series Classification

Probabilistic Time Series Classification Probabilistic Time Series Classification Y. Cem Sübakan Boğaziçi University 25.06.2013 Y. Cem Sübakan (Boğaziçi University) M.Sc. Thesis Defense 25.06.2013 1 / 54 Problem Statement The goal is to assign

More information

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data

A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data A Sequential Bayesian Approach with Applications to Circadian Rhythm Microarray Gene Expression Data Faming Liang, Chuanhai Liu, and Naisyin Wang Texas A&M University Multiple Hypothesis Testing Introduction

More information

Markov Chain Monte Carlo

Markov Chain Monte Carlo Markov Chain Monte Carlo (and Bayesian Mixture Models) David M. Blei Columbia University October 14, 2014 We have discussed probabilistic modeling, and have seen how the posterior distribution is the critical

More information

FAV i R This paper is produced mechanically as part of FAViR. See for more information.

FAV i R This paper is produced mechanically as part of FAViR. See  for more information. Bayesian Claim Severity Part 2 Mixed Exponentials with Trend, Censoring, and Truncation By Benedict Escoto FAV i R This paper is produced mechanically as part of FAViR. See http://www.favir.net for more

More information

Dynamic Matrix-Variate Graphical Models A Synopsis 1

Dynamic Matrix-Variate Graphical Models A Synopsis 1 Proc. Valencia / ISBA 8th World Meeting on Bayesian Statistics Benidorm (Alicante, Spain), June 1st 6th, 2006 Dynamic Matrix-Variate Graphical Models A Synopsis 1 Carlos M. Carvalho & Mike West ISDS, Duke

More information

Bayesian Learning in Undirected Graphical Models

Bayesian Learning in Undirected Graphical Models Bayesian Learning in Undirected Graphical Models Zoubin Ghahramani Gatsby Computational Neuroscience Unit University College London, UK http://www.gatsby.ucl.ac.uk/ and Center for Automated Learning and

More information

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig

Genetic Networks. Korbinian Strimmer. Seminar: Statistical Analysis of RNA-Seq Data 19 June IMISE, Universität Leipzig Genetic Networks Korbinian Strimmer IMISE, Universität Leipzig Seminar: Statistical Analysis of RNA-Seq Data 19 June 2012 Korbinian Strimmer, RNA-Seq Networks, 19/6/2012 1 Paper G. I. Allen and Z. Liu.

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Kernel adaptive Sequential Monte Carlo

Kernel adaptive Sequential Monte Carlo Kernel adaptive Sequential Monte Carlo Ingmar Schuster (Paris Dauphine) Heiko Strathmann (University College London) Brooks Paige (Oxford) Dino Sejdinovic (Oxford) December 7, 2015 1 / 36 Section 1 Outline

More information

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018

Markov Chain Monte Carlo Inference. Siamak Ravanbakhsh Winter 2018 Graphical Models Markov Chain Monte Carlo Inference Siamak Ravanbakhsh Winter 2018 Learning objectives Markov chains the idea behind Markov Chain Monte Carlo (MCMC) two important examples: Gibbs sampling

More information

Probabilistic and Bayesian Machine Learning

Probabilistic and Bayesian Machine Learning Probabilistic and Bayesian Machine Learning Day 4: Expectation and Belief Propagation Yee Whye Teh ywteh@gatsby.ucl.ac.uk Gatsby Computational Neuroscience Unit University College London http://www.gatsby.ucl.ac.uk/

More information

An Expectation Conditional Maximization approach for Gaussian graphical models

An Expectation Conditional Maximization approach for Gaussian graphical models An Expectation Conditional Maximization approach for Gaussian graphical models Zehang Richard Li Department of Statistics, University of Washington arxiv:1709.06970v2 [stat.ml] 28 May 2018 Tyler H. McCormick

More information

Time-Sensitive Dirichlet Process Mixture Models

Time-Sensitive Dirichlet Process Mixture Models Time-Sensitive Dirichlet Process Mixture Models Xiaojin Zhu Zoubin Ghahramani John Lafferty May 25 CMU-CALD-5-4 School of Computer Science Carnegie Mellon University Pittsburgh, PA 523 Abstract We introduce

More information

Statistical Inference

Statistical Inference Statistical Inference Liu Yang Florida State University October 27, 2016 Liu Yang, Libo Wang (Florida State University) Statistical Inference October 27, 2016 1 / 27 Outline The Bayesian Lasso Trevor Park

More information

Brief introduction to Markov Chain Monte Carlo

Brief introduction to Markov Chain Monte Carlo Brief introduction to Department of Probability and Mathematical Statistics seminar Stochastic modeling in economics and finance November 7, 2011 Brief introduction to Content 1 and motivation Classical

More information

An ABC interpretation of the multiple auxiliary variable method

An ABC interpretation of the multiple auxiliary variable method School of Mathematical and Physical Sciences Department of Mathematics and Statistics Preprint MPS-2016-07 27 April 2016 An ABC interpretation of the multiple auxiliary variable method by Dennis Prangle

More information

STA 4273H: Sta-s-cal Machine Learning

STA 4273H: Sta-s-cal Machine Learning STA 4273H: Sta-s-cal Machine Learning Russ Salakhutdinov Department of Computer Science! Department of Statistical Sciences! rsalakhu@cs.toronto.edu! h0p://www.cs.utoronto.ca/~rsalakhu/ Lecture 2 In our

More information

Bayesian Nonparametric Learning of Complex Dynamical Phenomena

Bayesian Nonparametric Learning of Complex Dynamical Phenomena Duke University Department of Statistical Science Bayesian Nonparametric Learning of Complex Dynamical Phenomena Emily Fox Joint work with Erik Sudderth (Brown University), Michael Jordan (UC Berkeley),

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 Sequential parallel tempering With the development of science and technology, we more and more need to deal with high dimensional systems. For example, we need to align a group of protein or DNA sequences

More information

3 Undirected Graphical Models

3 Undirected Graphical Models Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 3 Undirected Graphical Models In this lecture, we discuss undirected

More information

MCMC: Markov Chain Monte Carlo

MCMC: Markov Chain Monte Carlo I529: Machine Learning in Bioinformatics (Spring 2013) MCMC: Markov Chain Monte Carlo Yuzhen Ye School of Informatics and Computing Indiana University, Bloomington Spring 2013 Contents Review of Markov

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Nonparametric Bayesian Models --Learning/Reasoning in Open Possible Worlds Eric Xing Lecture 7, August 4, 2009 Reading: Eric Xing Eric Xing @ CMU, 2006-2009 Clustering Eric Xing

More information

Markov-Chain Monte Carlo

Markov-Chain Monte Carlo Markov-Chain Monte Carlo CSE586 Computer Vision II Spring 2010, Penn State Univ. References Recall: Sampling Motivation If we can generate random samples x i from a given distribution P(x), then we can

More information

Or How to select variables Using Bayesian LASSO

Or How to select variables Using Bayesian LASSO Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO x 1 x 2 x 3 x 4 Or How to select variables Using Bayesian LASSO On Bayesian Variable Selection

More information

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang

Chapter 4 Dynamic Bayesian Networks Fall Jin Gu, Michael Zhang Chapter 4 Dynamic Bayesian Networks 2016 Fall Jin Gu, Michael Zhang Reviews: BN Representation Basic steps for BN representations Define variables Define the preliminary relations between variables Check

More information

Probabilistic Graphical Models

Probabilistic Graphical Models School of Computer Science Probabilistic Graphical Models Gaussian graphical models and Ising models: modeling networks Eric Xing Lecture 0, February 5, 06 Reading: See class website Eric Xing @ CMU, 005-06

More information

Session-Based Queueing Systems

Session-Based Queueing Systems Session-Based Queueing Systems Modelling, Simulation, and Approximation Jeroen Horters Supervisor VU: Sandjai Bhulai Executive Summary Companies often offer services that require multiple steps on the

More information

Monte Carlo Methods. Handbook of. University ofqueensland. Thomas Taimre. Zdravko I. Botev. Dirk P. Kroese. Universite de Montreal

Monte Carlo Methods. Handbook of. University ofqueensland. Thomas Taimre. Zdravko I. Botev. Dirk P. Kroese. Universite de Montreal Handbook of Monte Carlo Methods Dirk P. Kroese University ofqueensland Thomas Taimre University ofqueensland Zdravko I. Botev Universite de Montreal A JOHN WILEY & SONS, INC., PUBLICATION Preface Acknowledgments

More information