An alternative marginal likelihood estimator for phylogenetic models


Serena Arima (1) and Luca Tardella (2)

(1) Dipartimento di studi geoeconomici, linguistici, statistici e storici per l'analisi regionale, SAPIENZA Università di Roma
(2) Dipartimento di Statistica, Probabilità e Statistiche Applicate, SAPIENZA Università di Roma

arXiv preprint, v1 [stat.CO], January 13, 2010

Abstract

Bayesian phylogenetic methods are generating noticeable enthusiasm in the field of molecular systematics. Several phylogenetic models are often at stake, and different approaches are used to compare them within a Bayesian framework. The Bayes factor, defined as the ratio of the marginal likelihoods of two competing models, plays a key role in Bayesian model selection. However, its computation is still a challenging problem: several computational solutions have been proposed, none of which outperforms the others simultaneously in terms of simplicity of implementation, computational burden, and precision of the estimates. Available Bayesian phylogenetic software has so far privileged the simplicity of the harmonic mean (HM) and arithmetic mean (AM) estimators. However, it is known that the resulting estimates of the Bayesian evidence in favor of one model are often biased and inaccurate, up to having an infinite variance, so that the reliability of the corresponding conclusions is doubtful. We focus on an alternative generalized harmonic mean (GHM) estimator which, by recycling MCMC simulations from the posterior, shares the computational simplicity of the original HM estimator but, unlike it, overcomes the infinite variance issue. We show that the Inflated Density Ratio (IDR) estimator, when applied to some standard phylogenetic benchmark data, produces fully satisfactory results, outperforming the simple estimators currently provided by most publicly available software.

Keywords: Bayes factor, harmonic mean, importance sampling, marginal likelihood, phylogenetic models.

Corresponding author: Serena Arima, Dipartimento di studi geoeconomici, linguistici, statistici e storici per l'analisi regionale, SAPIENZA Università di Roma, via del Castro Laurenziano 9, Roma; serena.arima@uniroma1.it

1 Introduction

The theory of evolution states that all organisms are related through a history of common ancestors and that life on Earth diversified in a tree-like pattern connecting all living species. Phylogenetics aims at inferring the tree that best represents the evolutionary relationships among species by studying differences and similarities in their genomic sequences. Since genes evolve by accumulating changes, the larger the number of differences in the genetic code of two species, the larger their evolutionary distance is likely to be. Alternative tree estimation methods have been proposed, such as parsimony methods (see [10], chapter 7) and distance methods [11, 6]; in this paper we focus on stochastic models for substitution rates and address model choice in a fully Bayesian framework with an alternative model evidence estimation procedure.

1.1 Substitution models: a brief overview

Comparing two species, we call a substitution the replacement, at the same site, of a nucleotide in one species by another nucleotide in the other species. The stochastic models describing this replacement process are called substitution models.
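As an aside on the link between observed differences and evolutionary distance, the following Python sketch (illustrative only; the two aligned sequences are made up) computes the proportion of differing sites between two sequences and the corresponding JC69-corrected distance d = -(3/4) log(1 - (4/3)p):

```python
import math

def p_distance(seq1: str, seq2: str) -> float:
    """Proportion of sites at which two aligned sequences differ."""
    assert len(seq1) == len(seq2), "sequences must be aligned"
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)

def jukes_cantor_distance(p: float) -> float:
    """JC69 correction: expected substitutions per site given the p-distance."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)

s1 = "ACGTACGTACGTACGT"   # toy aligned sequences (hypothetical)
s2 = "ACGTACGAACGTACTT"
p = p_distance(s1, s2)
d = jukes_cantor_distance(p)
```

The correction always inflates the raw proportion of differences, since multiple substitutions at the same site are unobservable.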
A phylogeny, or phylogenetic tree, is a representation of the genealogical relationships among species, also called taxa. The tips (leaves or external nodes) represent the present-day species, while the internal nodes usually represent extinct ancestors for which genomic sequences are no longer available. The ancestor of all sequences is the root of the tree. The branching pattern of a tree is called its topology, denoted by τ, while the lengths ν_τ of the branches of the tree τ represent the time periods covered by the branches.

DNA substitution models are probabilistic models of the changes between nucleotides in homologous DNA strands; replacements within these sequences are modeled by a 4-state Markov process, in which each state represents a nucleotide. The nucleotide sites are usually assumed to evolve independently of each other. Substitutions at any particular site are described by a Markov chain with the four nucleotides as states. The evolution of this Markov chain is completely specified by a substitution rate matrix Q = {r_ij}, which defines the rates of substitution among the four bases: each off-diagonal element r_ij, i ≠ j, is the instantaneous rate of substitution from nucleotide i to nucleotide j. The diagonal elements of the rate matrix are defined as r_ii = -Σ_{j≠i} r_ij, so that Σ_{j=1}^4 r_ij = 0 for every i. The transition probability matrix P(t) = {p_ij(t)} gives the probability of change from state i to state j over a time interval of length t, that is p_ij(t) = Pr{X(t) = j | X(0) = i}. It is traditionally assumed that the substitution process is stationary, with equilibrium distribution Π = (π_A, π_C, π_G, π_T), and time-reversible, that is

    π_i r_ij = π_j r_ji    (1.1)

where π_i is the proportion of time the Markov chain spends in state i and π_i r_ij is the amount of flow from state i to state j. Equation (1.1) is known as the detailed-balance condition and means that the flow between any two states is the same in both directions. The substitution process is assumed homogeneous, with the parameters of the substitution process constant over time, so that Q(t) = Q for all t.
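As a self-contained sketch (with made-up numbers, not estimates from any data set), one standard way to construct a rate matrix satisfying the detailed-balance condition (1.1) is to set r_ij = s_ij π_j for a symmetric matrix of exchangeabilities s, and then obtain P(t) = exp(Qt):

```python
n = 4
pi = [0.1, 0.2, 0.3, 0.4]                      # hypothetical stationary frequencies
s = [[0, 1, 2, 3],
     [1, 0, 4, 5],
     [2, 4, 0, 6],
     [3, 5, 6, 0]]                             # hypothetical symmetric exchangeabilities

# Off-diagonal rates r_ij = s_ij * pi_j; diagonal fixed so every row sums to 0.
Q = [[s[i][j] * pi[j] for j in range(n)] for i in range(n)]
for i in range(n):
    Q[i][i] = -sum(Q[i][j] for j in range(n) if j != i)

def mat_mult(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transition_matrix(Q, t, terms=40):
    """P(t) = exp(Qt) via a plain truncated power series (fine for small Qt)."""
    Qt = [[Q[i][j] * t for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in P]
    for k in range(1, terms):
        term = mat_mult(term, [[x / k for x in row] for row in Qt])
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return P

P = transition_matrix(Q, 0.5)
```

Each row of P(t) is then a probability distribution, π_i Q_ij = π_j Q_ji holds by construction, and Π is invariant under P(t).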
Following the notation in [17], we define r_ij = ρ_ij π_j, i ≠ j, where ρ_ij is the exchangeability rate between nucleotides i and j. This reparameterization is particularly useful for the specification of substitution models, since it makes clear the distinction between the nucleotide frequencies π_A, π_C, π_G, π_T and the substitution rates ρ_ij, allowing for different assumptions on evolutionary patterns. Several substitution sub-models, reflecting specific biological assumptions, have been proposed. The most general time-reversible nucleotide substitution model is the so-called GTR, defined by the following rate matrix

        ( -          ρ_AC π_C   ρ_AG π_G   ρ_AT π_T )
    Q = ( ρ_AC π_A   -          ρ_CG π_G   ρ_CT π_T )    (1.2)
        ( ρ_AG π_A   ρ_CG π_C   -          ρ_GT π_T )
        ( ρ_AT π_A   ρ_CT π_C   ρ_GT π_G   -        )

and more thoroughly illustrated in [23]. Several substitution models, such as K80 and HKY85, can be obtained by simplifying the Q matrix according to specific biological assumptions: the simplest one is the JC69 model, originally proposed in [21], which assumes that all nucleotides are interchangeable and have the same rate of change, that is ρ_ij = ρ for all i, j and π_A = π_C = π_G = π_T. In this paper, for illustrative purposes, we consider the GTR and JC69 models; for a wider illustration of alternative substitution models see for example [10, 8].

1.2 Bayesian inference for substitution models

The parameter space of a phylogenetic model can be represented as Ω = {τ, ν_τ, θ}, where τ ∈ T is the tree topology, ν_τ is the set of branch lengths of topology τ, and θ = (ρ, π) collects the parameters of the rate matrix; N_T denotes the number of possible topologies. Notice that N_T is a huge number even for few species. Given a tree topology τ and branch lengths ν_τ, one can compute the likelihood L = P(X | τ, ν_τ, θ) using the pruning algorithm, a practical and efficient recursive algorithm proposed in [9]. One can then make inference on the unknown model parameters by looking for the parameters which maximize the likelihood. In a Bayesian framework, instead, the parameter space is endowed with a prior distribution π(τ, ν_τ, θ) and the likelihood is used to update the prior uncertainty about model parameters following the Bayes rule:

    p(τ, ν_τ, θ | X, M) = p(X | τ, ν_τ, θ) π(τ, ν_τ, θ) / [ Σ_{i=1}^{N_T} ∫_{ν_{τ_i}} ∫_θ p(X | τ_i, ν_{τ_i}, θ) π(τ_i, ν_{τ_i}, θ) dν_{τ_i} dθ ].

The resulting distribution p(τ, ν_τ, θ | X, M) is the posterior distribution, which coherently combines prior beliefs and data information. This distribution is clearly not analytically computable, but it can be approached through appropriate approximations. Indeed, over the last ten years, powerful numerical methods based on Markov chain Monte Carlo (MCMC) have been developed, allowing one to carry out Bayesian inference under a large class of probabilistic models, even when the dimension of the parameter space is very large. Several ad-hoc MCMC algorithms have been developed for phylogenetic models, such as [24, 27], and are currently implemented in publicly available software such as MrBayes [40] and PHASE [14].

2 Model selection for substitution models

Given the variety of substitution models, an important issue in any model-based phylogenetic analysis is to select the one that is most supported by the data. Several model selection procedures have been proposed. A standard approach is to perform the hierarchical likelihood ratio test (LRT) [35] for choosing between alternative nested models. A number of popular programs allow users to compare pairs of models using this test, such as PAUP [44], PAML [50] and the R package APE [36]. However, as shown in [34], performing systematic LRTs is not an optimal choice for model selection in phylogenetics, because the model that is finally selected can depend on the order in which the pairwise comparisons are performed [33]. Moreover, it is well known that the LRT tends to favor parameter-rich models. The Akaike Information Criterion (AIC) is another model-selection criterion commonly used in phylogenetics [34]: one of its advantages is that it allows non-nested models to be compared, and it can be easily implemented. However, the AIC, like the LRT, tends to favor parameter-rich models.
A slightly different approach proposed to overcome this selection bias is the Bayesian Information Criterion (BIC) [41], which penalizes parameter-rich models. All these criteria can, in practice, select very different substitution models, as shown in [1]; moreover, they compare ratios of likelihood values penalized for an increase in the dimension of one of the models, without directly accounting for the uncertainty in the estimates of model parameters. This aspect is fully accounted for within a Bayesian framework through the use of the Bayes factor. In fact, the Bayes factor directly incorporates this uncertainty, and it is more intuitive than other methods since it can be directly used to assess the comparative evidence provided by the data in terms of the most probable model under equal prior model probabilities. The Bayes factor can be thought of as the Bayesian counterpart of the likelihood ratio test. In past decades its use was limited to simple models because of analytical intractability and computational problems in most complex models. However, thanks to the development of effective computational tools combined with more efficient simulation methods, the Bayes factor is becoming more and more common in many fields of application [38]. Its use for comparing phylogenetic models was first introduced in [42] and [43]; since then its popularity has grown, and some publicly available software now outputs approximations of the Bayes factor. Still, the complexity of these models and the computational burden implied by a high-dimensional parameter space make the problem of finding alternative and more efficient computational strategies for the Bayes factor still open and in continuous development [25].

Suppose we aim at comparing two competing substitution models M_0 and M_1. The Bayes factor is defined as the ratio of the marginal likelihoods of the two models, that is

    BF_01 = p(Y | M_1) / p(Y | M_0)

where, for i = 0, 1,

    p(Y | M_i) = Σ_{τ=1}^{N_T} ∫_{ν_τ^(i)} ∫_{θ^(i)} p(Y | τ, ν_τ^(i), θ^(i), M_i) p(τ, ν_τ^(i), θ^(i) | M_i) dν_τ^(i) dθ^(i).

The marginal likelihood corresponds to the denominator of the Bayes rule formula and is indeed the normalizing constant of the unnormalized density obtained as the product of prior and likelihood at the numerator. [22] gives numerical guidelines for interpreting the evidence scale.
Values of BF_01 > 1 (log(BF_01) > 0) are interpreted as evidence in favor of M_1, but only values of BF_01 > 100 (log(BF_01) > 4.6) can be considered decisive.
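Since marginal likelihood estimators usually return values on the log scale, the comparison is most conveniently done there; a minimal sketch (the two marginal log-likelihood values below are hypothetical, and the cutpoints are only the ones quoted in the text):

```python
import math

def bf_evidence(log_bf_01: float) -> str:
    """Map log(BF_01) = log p(Y|M_1) - log p(Y|M_0) onto the rough scale quoted
    above: > 0 favors M_1, > log(100) ~ 4.6 is decisive."""
    if log_bf_01 <= 0:
        return "no evidence for M_1"
    if log_bf_01 > math.log(100):
        return "decisive evidence for M_1"
    return "some evidence for M_1"

log_m1, log_m0 = -2100.3, -2112.9   # hypothetical marginal log-likelihoods
category = bf_evidence(log_m1 - log_m0)
```

Working with differences of log marginal likelihoods also avoids the underflow that would occur if the marginal likelihoods themselves were exponentiated.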

3 Alternative computational tools for Bayesian model evidence

It is clear that the computation of the marginal likelihood for phylogenetic models is not straightforward: it involves a summation over all N_T possible topologies and the solution of a k-dimensional integral over the branch length parameters ν_τ and the substitution rate parameters θ = (ρ, π). Most of the marginal likelihood estimation methods proposed in the literature have also been applied extensively in molecular phylogenetics [28, 25, 43]. Many of these methods are valid only under very specific conditions. For instance, the Savage-Dickey density ratio [46], applied in phylogenetics in [43], assumes nested models. The Laplace estimator [22] and the Bayesian Information Criterion [41], first applied in phylogenetics in [28], require large-sample approximations around the maximum likelihood, which can be difficult to compute or approximate for very complex models. A recent appealing variation of the Laplace approximation has been proposed in [39]; however, its use is endangered when the posterior distribution deviates from normality, and the maximization of the likelihood can be neither straightforward nor accurate. The reversible jump approach [16, 5], where the MCMC is devised to jump between models according to a Metropolis-Hastings rule, is, in principle, a wholly general algorithm. In practice, the Metropolis-Hastings proposal moves between models have to be accepted at a sufficient rate for the method to be practical. The algorithm has been extended to phylogenetic model selection in [19]; however, its implementation is not straightforward for the end user and often requires delicate tuning. Moreover, it suffers extra implementation difficulties when comparing models based on entirely different parametric rationales [25].
Currently, to our knowledge, the most widely used methods for estimating the marginal likelihood of phylogenetic models are thermodynamic integration, reviewed in [13] and first applied in a phylogenetic context in [26], and the harmonic mean approach, originally proposed in [29]. Thermodynamic integration produces reliable estimates of Bayes factors in a large variety of phylogenetic models. Although it has the advantage of general applicability, it can incur high computational costs and may require specific adjustments; for certain model comparisons, a full thermodynamic integration may take weeks on a modern desktop computer, even under a fixed tree topology for small single-protein data sets [39]. On the other hand, the HM estimator can be easily computed and does not demand computational effort beyond that already made to draw inference on model parameters, since it only needs simulations from the posterior distribution. For this reason, in the next sections we focus on generalizations and alternative instances of the HM approach.

4 HM: Harmonic Mean estimators

In this section we introduce the basic ideas and formulae for the Harmonic Mean (HM) estimator and its generalized version, the Generalized Harmonic Mean (GHM) estimator. Since the marginal likelihood is nothing but the normalizing constant of the unnormalized density obtained as the product of prior and likelihood, we illustrate the derivation of the GHM as follows. Let the normalizing constant of a non-negative, integrable function g be defined as

    c = ∫_Ω g(θ) dθ

where θ ∈ Ω ⊆ R^k and g(θ) is the unnormalized version of the probability density g(θ)/c. The GHM estimator of c is based on the following identity

    c = 1 / E_g[ f(θ) / g(θ) ]    (4.1)

where f is a Lebesgue integrable function such that ∫_Ω f(θ) dθ = 1, and E_g denotes expectation with respect to the normalized density g(θ)/c. The GHM estimator, denoted ĉ_GHM, is the empirical counterpart of identity (4.1), namely

    ĉ_GHM = 1 / [ (1/T) Σ_{t=1}^T f(θ_t) / g(θ_t) ].    (4.2)

In Bayesian inference the very first instance of such a GHM estimator was introduced to estimate the marginal likelihood, regarded as the normalizing constant of the unnormalized density g(θ) = π(θ) L(θ). Hence, taking f(θ) = π(θ), one obtains as a special case of (4.2) the Harmonic Mean estimator

    ĉ_HM = 1 / [ (1/T) Σ_{t=1}^T 1 / L(θ_t) ]    (4.3)

which can be easily computed by recycling simulations θ_1, ..., θ_T from the target posterior distribution, available from an MC or MCMC sampling scheme. This probably explains the original enthusiasm in favor of ĉ_HM, which was indeed considered a potentially convenient competitor of the standard Monte Carlo importance sampling estimate given by the (prior) Arithmetic Mean (AM) estimator

    ĉ_AM = (1/T) Σ_{t=1}^T L(θ_t)    (4.4)

where θ_1, ..., θ_T are sampled from the prior π. The simplicity of implementation, combined with a light computational burden (recycling posterior simulations reduces computing time by a factor in the hundreds with respect to thermodynamic integration), has favored the widespread use of the Harmonic Mean estimator. In fact, the Harmonic Mean estimator is implemented in several Bayesian phylogenetic software packages, as shown in Table 1, and recent biological papers [49, 48, 30] report the HM as a routinely used model selection tool. In this paper we rely on the MrBayes software [40] to implement the MCMC machinery and obtain the posterior distribution of the parameters of interest.

    Software      Marginal likelihood estimation method
    MrBayes       Harmonic Mean
    PhyloBayes    Thermodynamic Integration under normal approximation
    PHASE         Reversible Jump and Harmonic Mean
    BEAST         Harmonic Mean or bootstrapped Harmonic Mean

Table 1: Marginal likelihood estimation methods in the most widely used phylogenetic software: despite its inaccuracy, the harmonic mean estimator is still the most widespread marginal likelihood estimation tool.

However, both ĉ_AM and ĉ_HM can end up with a very large variance and unstable behavior. For ĉ_AM this is simply explained by the fact that the likelihood usually gives support to a region with low prior weight; hence sampling from the prior yields a low chance of hitting the high-likelihood region and a large chance of hitting much lower likelihood regions, ending up in a large variance of the estimate ĉ_AM. For ĉ_HM, starting from the original paper [29] (see in particular R. Neal's discussion), it has been recognized that even in very simple and standard models the HM estimator may have infinite variance and lead to an unreliable approximation. Several generalizations and improved alternatives have been proposed and recently reviewed in [37]. These facts raise the question of whether such estimators are reliable tools, and they have certainly encouraged researchers to look for alternative solutions. In the following sections we consider a new marginal likelihood estimator, the Inflated Density Ratio (IDR) estimator, proposed in [32], which is a particular instance of the Generalized Harmonic Mean (GHM) approach as proposed in [7] and [12]. This new estimator shares the original simplicity and computational feasibility of the HM estimator but, unlike it, guarantees important theoretical properties, such as finiteness of the variance.

5 IDR: Inflated Density Ratio estimator

The IDR estimator is an alternative way of estimating the normalizing constant of a density function. It is obtained via a basic identity involving the ratio of two functions: the unnormalized target density and a perturbed version of it. Like the HM estimator, it can be easily computed by recycling MC or MCMC simulations from the target distribution. In [32] the theoretical properties of the IDR are investigated: it is shown to be a consistent estimator with guaranteed finite variance under fairly mild sufficient conditions, which are shown to hold for many typical families of densities with tail behavior ranging from Gaussian to double-exponential up to Cauchy. An asymptotic estimator of the relative mean square error is provided, as well as a simple confidence interval formula.
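Before detailing the IDR, the instability of the HM (4.3) and prior AM (4.4) estimators can be seen on a toy conjugate model where the exact marginal likelihood is available in closed form (everything below is an assumption chosen for illustration, not one of the paper's examples):

```python
import math
import random

random.seed(1)

# Toy conjugate model:  y_i | theta ~ N(theta, 1), i = 1..n,  theta ~ N(0, 1).
n = 20
y = [random.gauss(1.0, 1.0) for _ in range(n)]
ybar = sum(y) / n
s2 = sum(yi * yi for yi in y)

def log_lik(theta):
    return -0.5 * n * math.log(2 * math.pi) - 0.5 * sum((yi - theta) ** 2 for yi in y)

# Exact log marginal likelihood: y ~ N_n(0, I + 11'), |I + 11'| = n + 1.
log_m_exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(n + 1)
               - 0.5 * (s2 - n * n * ybar * ybar / (n + 1)))

T = 50_000
post_mean, post_sd = n * ybar / (n + 1), math.sqrt(1.0 / (n + 1))
post_draws = [random.gauss(post_mean, post_sd) for _ in range(T)]  # stands in for MCMC
prior_draws = [random.gauss(0.0, 1.0) for _ in range(T)]

def log_mean_exp(vals):
    """log of the average of exp(vals), computed stably (log-sum-exp)."""
    m = max(vals)
    return m + math.log(sum(math.exp(v - m) for v in vals) / len(vals))

log_hm = -log_mean_exp([-log_lik(t) for t in post_draws])   # harmonic mean, eq. (4.3)
log_am = log_mean_exp([log_lik(t) for t in prior_draws])    # prior arithmetic mean, eq. (4.4)
```

In this easy one-dimensional setting the prior AM is close to the truth, while repeated runs of the HM typically show an upward bias and erratic jumps, reflecting the heavy-tailed distribution of 1/L under the posterior.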
A different formulation of the Generalized Harmonic Mean estimator, based on a particular choice of f(θ), is proposed in [32]. The function f(θ) is obtained through a perturbation of the original target function g(θ): the perturbed version of g, which we denote g_Pk, is defined so that its total mass has a known functional relation to the total mass c of the target density g. In particular, g_Pk is obtained as a parametric inflation of g so that

    ∫_Ω g_Pk(θ) dθ = c + k    (5.1)

where k is a known inflation constant which can be arbitrarily fixed. The perturbation method has been widely described in [31] and [32] for unidimensional and multidimensional densities. In the unidimensional case the perturbed density is defined as follows

              | g(θ + r_k)   if θ < -r_k
    g_Pk(θ) = | g(0)         if -r_k ≤ θ ≤ r_k    (5.2)
              | g(θ - r_k)   if θ > r_k

where 2 r_k = k / g(0) is the length of the interval centered around the origin. Figure 1 helps visualizing how the perturbation acts. The Inflated Density Ratio estimator ĉ_IDR of c is defined as

    ĉ_IDR = k / [ (1/T) Σ_{t=1}^T g_Pk(θ_t) / g(θ_t) - 1 ]    (5.3)

where θ_1, ..., θ_T is a sample from the normalized target density g/c.

Figure 1: Left panel: original density g. Right panel: perturbed density g_Pk defined as in (5.2). The total mass of the perturbed density is c + k; the shaded area corresponds to the inflated mass k, with k = 2 r_k g(0) as in (5.2).

The use of the perturbed density as importance function leads to some advantages with respect to the other instances of ĉ_GHM proposed in the literature. Indeed, the implied function f_k(θ) = (g_Pk(θ) - g(θ)) / k need not be positive, and it is shown to yield a finite-variance estimator for a wide range of target densities g; two simple sufficient conditions for the finiteness of the variance are introduced in [32]. Moreover, the use of a parametric perturbation makes the method more flexible and efficient with a moderate extra computational effort. Like all methods based on importance sampling strategies, the properties of the estimator ĉ_IDR strongly depend on the ratio g_Pk(θ)/g(θ): in particular, the asymptotic relative mean square error of the estimator is

    RMSE_ĉIDR = E_g[ ((ĉ_IDR - c) / c)^2 ] ≈ (1/T) (c/k)^2 Var_g[ g_Pk(θ) / g(θ) ].    (5.4)

A necessary and sufficient condition for the finiteness of RMSE_ĉIDR is the finiteness of the second moment of the ratio g_Pk(θ)/g(θ), that is

    RMSE_ĉIDR < ∞  ⟺  Var_g[ g_Pk(θ)/g(θ) ] < ∞  ⟺  E_g[ (g_Pk(θ)/g(θ))^2 ] < ∞.

The relative mean square error can be estimated as follows:

    RMSE_ĉIDR ≈ (1/T) (ĉ_IDR/k)^2 Vâr_g[ g_Pk(θ)/g(θ) ]    (5.5)

where Vâr_g is the sample variance of the ratio. Expression (5.5) clarifies the key role of the choice of k with respect to the error of the estimator: for k → 0 the variance of the ratio g_Pk(θ)/g(θ) tends to 0, since g_Pk is very close to g, but the factor (c/k)^2 tends to infinity. In other words, while the variance term favors values of k as small as possible, the factor 1/k^2 acts in the opposite direction. In order to address the choice of k, [32] suggest choosing the perturbation which minimizes the estimated error: in practice, one computes the estimator on a grid of perturbation values, ĉ_IDR(k), k ∈ {k_1, ..., k_K}, and chooses the optimal k_opt as the k for which RMSE_ĉIDR(k) is minimum. This calibration of k requires iterative evaluation of ĉ_IDR(k) and is hence somewhat heavier than the HM estimator, but it requires no extra simulations, which in the phylogenetic context are often the most time-consuming part. The computational cost is further alleviated by the fact that one uses the same sample from g throughout, and the only quantity to be evaluated K times is the inflated density g_Pk; once the ratio of the perturbed and original densities is available, the computation of ĉ_IDR(k) is straightforward. For practical purposes, the computation of the inflated density when the support of g is the whole R^k can be easily implemented in R using a publicly available function. In [32] and [4] it has been shown that, in order to improve the precision of ĉ_IDR, it is recommended to evaluate a (local) mode m̂_0 and the sample variance-covariance matrix Ŝ based on the simulated θ_1, ..., θ_T, and to apply the one-to-one affine transformation Ŝ^{-1/2}(θ_i - m̂_0), so that the correspondingly transformed density (which, after the appropriate change of measure, has the same normalizing constant) has a local mode at the origin, where the inflation occurs, and an approximately standard variance-covariance matrix. This is automatically implemented in the publicly available R code.
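The estimator (5.3) and the grid calibration of k just described can be sketched in a few lines; here we use a one-dimensional standard normal target with known mass c = sqrt(2π) (an illustration under assumed settings, not the authors' R implementation):

```python
import math
import random

random.seed(42)

# Unnormalized 1-D target g(x) = exp(-x^2 / 2), with true mass c = sqrt(2 pi).
def g(x):
    return math.exp(-0.5 * x * x)

c_true = math.sqrt(2 * math.pi)

def g_perturbed(x, r_k):
    # Inflated density of (5.2): plateau of height g(0) on [-r_k, r_k].
    if x < -r_k:
        return g(x + r_k)
    if x > r_k:
        return g(x - r_k)
    return g(0.0)

def idr_estimate(draws, k):
    """IDR estimate (5.3) with a plug-in error estimate in the spirit of (5.5)."""
    r_k = k / (2.0 * g(0.0))                  # added mass 2 r_k g(0) equals k
    ratios = [g_perturbed(x, r_k) / g(x) for x in draws]
    mean_ratio = sum(ratios) / len(ratios)    # estimates (c + k) / c
    var_ratio = sum((r - mean_ratio) ** 2 for r in ratios) / (len(ratios) - 1)
    c_hat = k / (mean_ratio - 1.0)
    rmse_hat = (c_hat / k) ** 2 * var_ratio / len(ratios)
    return c_hat, rmse_hat

draws = [random.gauss(0.0, 1.0) for _ in range(100_000)]

# Calibrate k on a grid, keeping the value with the smallest estimated error.
grid = [0.25, 0.5, 1.0, 2.0, 4.0]
results = {k: idr_estimate(draws, k) for k in grid}
k_opt = min(grid, key=lambda k: results[k][1])
c_hat = results[k_opt][0]
```

Note that the same posterior sample is reused for every value of k; only the perturbed density is re-evaluated, which is exactly what keeps the calibration cheap.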
Some adaptations are needed in contexts, such as phylogenetic evolutionary models, where the parameter space is constrained or of mixed nature; in the present paper we illustrate how one can successfully implement those adaptations and make ĉ_IDR available as a simple and effective alternative evaluation of model evidence in standard phylogenetic software.

Preliminary investigations of the effectiveness and comparative performance of ĉ_IDR have already been carried out in [32] and [4], and we briefly summarize the main encouraging findings. In order to assess the true effectiveness of the IDR method, it has been applied to data simulated from different distributions for which the normalizing constant is known. As shown in [32], the estimator produces fully convincing results for several known distributions, even for a 100-dimensional multivariate normal: in terms of estimator precision, these results are comparable with those obtained in [26] with the thermodynamic integration method. In [4], simple antithetic variates tricks allow the IDR estimator to perform well even for distributions with severe departures from the symmetric Gaussian case, such as asymmetric and even some multimodal distributions. Table 2 reports the estimates obtained by applying the IDR method in several scenarios: univariate and 100-dimensional normal distributions, univariate and multivariate (5- and 30-dimensional) skew normal distributions, and multivariate (2-, 3- and 10-dimensional) two-component normal mixtures. The method correctly reproduces the true value of the normalizing constant for different shapes and dimensions of the target function. Some real data implementations with standard generalized linear models have also been reported in [32]. These results encouraged us to extend the use of ĉ_IDR to more complex settings such as phylogenetic models.

6 Implementing IDR for substitution models with fixed topology

The Inflated Density Ratio approach can be extended in order to compute the marginal likelihood of more complex models, such as phylogenetic models. The main difficulty in computing the marginal likelihood for these models is that they involve parameters defined on a mixed parameter space: the parameters of the substitution model θ and the branch lengths ν_τ are defined on a continuous space, while the tree topology τ lives on a discrete one. We now show how to extend the flexibility of the IDR approach, whether or not the topology is fixed.
For a fixed topology τ and a sequence alignment X, the parameter space of a phylogenetic model M is defined by Ω = (θ, ν_τ); the joint posterior distribution on Ω is given by

    p(θ, ν_τ | X, M) = p(X | θ, ν_τ) p(θ, ν_τ) / m(X | M)    (6.1)

    Distribution       log c    k_opt    log ĉ_IDR    RMSE_ĉIDR
    N(µ, σ)
    N_100(µ, σ)
    SN(µ, σ, τ)
    SN_5(µ, σ, τ)
    SN_30(µ, σ, τ)
    Mix N_2
    Mix N_3
    Mix N_10

Table 2: IDR approach for normal distributions (univariate and 100-dimensional), univariate and multivariate (5- and 30-dimensional) skew normal distributions, and multivariate (2-, 3- and 10-dimensional) mixtures of two normal components. The value log c is the logarithm of the true normalizing constant; k_opt is the optimal inflation coefficient minimizing the relative mean square error RMSE_ĉIDR computed as in (5.5); log ĉ_IDR is the estimated normalizing constant on the logarithmic scale. A sensitivity study showed that the performance of the method is robust to changes of the (µ, σ, τ) parameter choices.

where

    m(X | M) = ∫_θ ∫_{ν_τ} p(X | θ, ν_τ) p(θ, ν_τ) dθ dν_τ    (6.2)

is the marginal likelihood we aim at estimating. When the topology is fixed, the parameter space Ω is homogeneous, in the sense that it consists of parameters all defined on a continuous support. In order to apply the Inflated Density Ratio we only need the following two ingredients:

- a sample (θ^(1), ν_τ^(1)), ..., (θ^(n), ν_τ^(n)) from the posterior distribution p(θ, ν_τ | X, M);
- the likelihood and the prior density evaluated at each posterior sampled value (θ^(k), ν_τ^(k)), that is p(X | θ^(k), ν_τ^(k)) and p(θ^(k), ν_τ^(k)).

The first ingredient is just the usual output of the Markov chain Monte Carlo simulations derived from model M and the data X. The computation of the likelihood and the joint prior is indeed already coded within the available software: the former is usually accomplished through the pruning algorithm, while computing the prior is straightforward.

A necessary condition for the inflation idea to work as prescribed in [32] is that the posterior density must have full support on the whole real k-dimensional space. In our phylogenetic models this is not always the case, and we explain here simple, fully automatic remedies to overcome this kind of obstacle. We start with parameters, like the branch lengths, that are constrained to lie on the positive half-line. In that case the remedy is straightforward: a simple logarithmic transformation reparameterizes the posterior density in terms of

    ν̃_τ = log(ν_τ)    (6.3)

so that the support of the reparameterized density is unconstrained. Obviously, the log(ν_τ) reparameterization calls for the appropriate change-of-measure Jacobian when evaluating the corresponding transformed density. For model parameters with linear constraints, like the substitution parameters θ = {ρ, π}, a little less obvious transformation is needed.
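Before turning to the linearly constrained parameters, the log reparameterization of a branch length can be checked numerically: with φ = log(ν), the transformed density is the original one times the Jacobian dν/dφ = exp(φ). The toy check below assumes a hypothetical Exp(1) prior on a single branch length (not the paper's code) and verifies that the total mass is preserved:

```python
import math

def log_density_nu(nu):
    """Log of the Exp(1) density on the positive half-line."""
    return -nu if nu > 0 else float("-inf")

def log_density_phi(phi):
    nu = math.exp(phi)
    return log_density_nu(nu) + phi      # + log|Jacobian| = log(exp(phi)) = phi

# Riemann sum over a wide phi-grid should recover the original mass (= 1 here).
lo, hi, m = -30.0, 10.0, 400_000
h = (hi - lo) / m
total = sum(math.exp(log_density_phi(lo + i * h)) for i in range(m + 1)) * h
```

Forgetting the Jacobian term is a classic source of silently wrong marginal likelihood values, which is why the check is worth automating.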
In this case θ = {ρ, π} are subject to the following set of constraints:

    Σ_{i ∈ {A,C,G,T}} π_i = 1,    Σ_{j ∈ {A,C,G,T}} ρ_ij π_j = 0  for each i ∈ {A, C, G, T}.
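Simplex constraints of this kind can be mapped to an unconstrained Euclidean space with the additive log-ratio transform; a minimal sketch (the frequency values are made up for illustration), including the change-of-measure term:

```python
import math

def alr(x):
    """Additive log-ratio transform: simplex point -> R^{D-1}."""
    return [math.log(xi / x[-1]) for xi in x[:-1]]

def alr_inverse(y):
    """Additive logistic transform: R^{D-1} -> simplex point."""
    expy = [math.exp(yi) for yi in y] + [1.0]
    s = sum(expy)
    return [e / s for e in expy]

def log_jacobian_alr_inverse(y):
    """log |dx/dy| of the additive logistic transform: sum of log(x_i)
    over all D components of x (needed for the change of measure)."""
    x = alr_inverse(y)
    return sum(math.log(xi) for xi in x)

# Round trip on hypothetical nucleotide frequencies:
pi = [0.1, 0.2, 0.3, 0.4]
y = alr(pi)
pi_back = alr_inverse(y)
```

The transformed coordinates are unconstrained, so the inflation of the density can be carried out on the whole real space as the method requires.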

Indeed, it is well known that, similarly to the first simplex constraint, the last set of constraints together with reversibility can be rephrased as another simplex constraint concerning only the off-diagonal entries of the substitution rate matrix (1.2), namely ρ_AC + ρ_AG + ρ_AT + ρ_CG + ρ_CT + ρ_GT = 1. We have relied on the so-called additive logistic transformation [45, 2], a one-to-one transformation from R^{D-1} to the (D-1)-dimensional simplex S^D = {(x_1, ..., x_D) : x_1 > 0, ..., x_D > 0; x_1 + ... + x_D = 1}. Hence we can use its inverse, called the additive log-ratio transformation, which is defined as follows

    y_i = log(x_i / x_D),    i = 1, ..., D-1

for any x = (x_1, ..., x_D) ∈ S^D. Applying these transformations to the nucleotide frequencies and to the exchangeability parameters, the transformed parameters take values on the entire real support and the IDR estimator can be computed. Again, the reparameterization calls for the appropriate change-of-measure Jacobian when evaluating the corresponding transformed density; details for this particular reparameterization are given in [2]. In this work, the application of the IDR method has been performed using the MCMC output of the simulations from the posterior distribution obtained with the MrBayes software, together with the R packages PML, for the evaluation of the likelihood, and HI, for the IDR perturbation f_k(θ). Some ad-hoc R functions have been developed to implement the aforementioned parameter transformations; R code is available upon request from the first author.

7 Numerical examples and comparative performance

In this section the successful implementation of the IDR estimator is illustrated in some typical phylogenetic models with literature benchmark datasets. Although it requires a little extra computation with respect to model evidence

18 estimators such as HM, IDR can produce more reliable and accurate model evidence assessments keeping the simplicity of the original HM idea. For illustrative purposes we restrict ourselves to check the comparative performance of the IDR with respect to two of the simplest and most favorite model evidence output in the publicly available software (yet most problematic ones): the Harmonic Mean estimator and the Arithmetic Mean estimator. Indeed, while the former is guaranteed to be a consistent estimate of the marginal likelihood but possibly with infinite variance, the latter one is consistent only when formula (4.4) is applied when θ 1,..., θ T are sampled from the prior. Since it is known such a prior AM is very unstable and unreliable it has often been replaced by a posterior AM where θ 1,..., θ T are sampled from the posterior rather than from the prior. In that case the resulting quantity can be interpreted only as a surrogate evidence in favor of one model and by no means confused with the rigorous concept of marginal likelihood and its essential relation with Bayes Factor. We now show how the performance of IDR in two phylogenetic examples where simulated data is used to have a better control of what one should expect from marginal likelihood and comparative evidence of alternative models. 7.1 Hadamard 1: marginal likelihood computation We first use the synthetic data set Hadamard 1 in [10]. It consists of a sequence 1000 of amino acid alignments of six species, A, B, C, D, E and F simulated from a GT R + Γ model. The true tree in shown in the left Panel of Figure 2. MrBayes software [3] uses the Metropolis Coupling algorithm simulate a Markov Chain which is then used as a sample from the posterior distribution in order to make posterior inference on the model parameters and provide estimates of the marginal likelihood via AM and HM. Appropriate constraints have been considered in order to fix the topology. 
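Returning to the reparameterization step described above, the additive logistic transformation and its inverse, the additive log-ratio transform, can be sketched as follows (Python for illustration; the paper's own helper functions are ad-hoc R code):

```python
import numpy as np

def alr(x):
    """Additive log-ratio transform: maps a point of the D-part simplex
    (strictly positive entries summing to one) to R^(D-1),
    y_i = log(x_i / x_D)."""
    x = np.asarray(x, dtype=float)
    return np.log(x[:-1] / x[-1])

def alr_inv(y):
    """Additive logistic transform (the inverse map): R^(D-1) -> open simplex."""
    e = np.exp(np.append(y, 0.0))   # append y_D = 0 for the reference part
    return e / e.sum()

pi = np.array([0.3, 0.2, 0.25, 0.25])   # e.g. nucleotide frequencies
y = alr(pi)                              # unconstrained, lives in R^3
assert np.allclose(alr_inv(y), pi)       # round trip recovers the frequencies
```

The same pair of maps applies to the six exchangeability parameters on their simplex; the Jacobian of the transformation must be included when evaluating the transformed posterior density, as noted above.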
In this case, the parameter space Ω consists of 18 parameters. In order to reduce the autocorrelation and improve the convergence of the sampled values, an initial portion of the chain has been discarded as burn-in and the remaining draws have been thinned with rate equal to 10. We have recycled this MCMC output to estimate the marginal likelihood of the GTR + Γ model for the known true topology through the IDR method.

Figure 2: True phylogenetic trees for simulated data: the tree in the left panel refers to the Hadamard 1 data and the tree in the right panel to Hadamard 2.

Results for different perturbation values k are shown in Table 3: the table reports the values of the IDR estimator for different perturbation values, the relative mean square error (RMSE_ĉIDR,Delta) and the corresponding confidence interval (ĈI). Since the smallest error corresponds to the perturbation value k_opt = 10^7, the IDR estimate (on a logarithmic scale) for the GTR + Γ model is the corresponding value of log ĉ_IDR in Table 3. We compare the results of the IDR method with those obtained with the Harmonic Mean (HM) and the posterior Arithmetic Mean (AM). The three methods produce somewhat different quantities, although they are sometimes compatible once the estimation error is accounted for. For each method, the Monte Carlo error of the estimates has been computed by simulating and re-estimating the model 10 times (RMSE_ĉ,MC). However, it is known that under critical conditions for HM the MC error cannot fully guarantee its precision, although it remains a necessary premise for an accurate estimate. Hence we look at three different precision assessments, based respectively on the asymptotic delta-method approximation of the RMSE (RMSE_ĉ,Delta), the estimated RMSE_ĉ,MC based on independent MC replicates of the posterior simulations and of ĉ, and an evaluation of the RMSE (RMSE_ĉ,Boot) based on bootstrap replication from a single simulation stream. We estimate the ratio bias of the estimator using a bootstrap strategy: B bootstrap samples have been considered and the marginal likelihood ĉ_b has been estimated for each bootstrap sample.
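The bootstrap assessment just described can be sketched as follows (Python for illustration; the ratio form ĉ = k / (mean of density ratios − 1) for the IDR estimator and the toy ratios are assumptions made here for the sketch):

```python
import numpy as np

rng = np.random.default_rng(1)

def bootstrap_rel_rmse(ratios, k, B=1000):
    """Bootstrap relative error of a ratio-type marginal likelihood
    estimator of the IDR form c_hat = k / (mean(ratios) - 1), where
    'ratios' are inflated-to-posterior density ratios evaluated at the
    posterior draws. Resamples the ratios, re-estimates c, and returns
    sqrt(mean((c_b / c_hat - 1)^2))."""
    r = np.asarray(ratios, dtype=float)
    c_hat = k / (r.mean() - 1.0)
    c_boot = np.empty(B)
    for b in range(B):
        rb = rng.choice(r, size=r.size, replace=True)
        c_boot[b] = k / (rb.mean() - 1.0)
    return np.sqrt(np.mean((c_boot / c_hat - 1.0) ** 2))

# toy density ratios (all >= 1, as for an inflated density) and a small k
ratios = 1.0 + rng.gamma(2.0, 0.05, size=2000)
rel_err = bootstrap_rel_rmse(ratios, k=0.1)
assert 0.0 < rel_err < 1.0
```

Because the bootstrap resamples a single simulation stream, it is far cheaper than the 10 independent re-estimations used for RMSE_ĉ,MC, at the price of ignoring the Monte Carlo variability of the chain itself.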

Table 3: Inflated Density Ratio method for the Hadamard 1 data (the parameters lie in an 18-dimensional space, that is Ω = R^18). The table reports the values of the IDR estimator log ĉ_IDR for a small grid of perturbation values k, the corresponding relative mean square errors RMSE_ĉIDR,Delta as in (5.5) without accounting for autocorrelation, the relative mean square errors corrected for autocorrelation (RMSE*_ĉIDR,Delta) and the corresponding confidence intervals (ĈI). Since the smallest error corresponds to the perturbation value k_opt = 10^7, the selected IDR estimate (on a logarithmic scale) for the GTR + Γ model is the corresponding log ĉ_IDR. RMSE*_ĉIDR,Delta has been obtained by applying formula (5.5) and considering an upper bound for the discounting factor of the sample variance as in (7.2), due to the autocorrelated simulations. In this application the discounting factor more than doubles the uncorrected RMSE_ĉIDR,Delta.
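To fix ideas about the estimator whose grid of perturbation values Table 3 explores, here is a one-dimensional toy version of the IDR idea (Python; the radial inflation around zero and the Gaussian target are purely illustrative assumptions, not the paper's 18-dimensional implementation):

```python
import numpy as np

rng = np.random.default_rng(7)

def idr_estimate(theta, log_g, r):
    """One-dimensional Inflated Density Ratio sketch. The unnormalised
    posterior g is inflated around 0 by radius r:
        g_r(t) = g(0)                    for |t| <= r,
        g_r(t) = g(sign(t) (|t| - r))    otherwise,
    which adds known mass k = 2 r g(0) to the total mass c of g.
    Since E_post[g_r/g] = (c + k)/c, we get c = k / (mean(g_r/g) - 1)."""
    theta = np.asarray(theta, dtype=float)
    k = 2.0 * r * np.exp(log_g(0.0))
    inflated = np.where(np.abs(theta) <= r,
                        log_g(0.0),
                        log_g(np.sign(theta) * (np.abs(theta) - r)))
    ratios = np.exp(inflated - log_g(theta))
    return k / (ratios.mean() - 1.0)

# toy check: g(t) = 5 * N(t; 0, 1), so the true normalising constant is 5
c_true = 5.0
log_g = lambda t: np.log(c_true) - 0.5 * np.asarray(t) ** 2 - 0.5 * np.log(2 * np.pi)
theta = rng.normal(size=200_000)      # draws from the normalised posterior
c_hat = idr_estimate(theta, log_g, r=0.5)
assert abs(c_hat / c_true - 1.0) < 0.05
```

Note that the density ratios g_r/g are bounded in the tails far more gently than the reciprocal likelihoods used by HM, which is the intuition behind the finite-variance property discussed later.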

The ratio bias B̂_boot and the ratio standard deviation Ŝ_boot are estimated from the bootstrap replicates ĉ_b and combined into the bootstrap relative root mean square error

RMSE_ĉ,Boot = [ (1/B) Σ_{b=1}^B ( ĉ_b / ĉ − 1 )² ]^{1/2}.   (7.1)

Table 4 shows the obtained results.

Table 4: Hadamard 1 data (Ω = R^18): marginal likelihood estimates obtained with the Inflated Density Ratio method ĉ_IDR, with the Harmonic Mean approach ĉ_HM and with the Arithmetic Mean approach ĉ_AM. Four different RMSE estimates are provided: RMSE_ĉ,Delta has been computed as in (5.5); RMSE_ĉ,MC comes from 10 independent Monte Carlo replicates of the estimation; RMSE_ĉ,Boot is a bootstrap approximation of the RMSE (B = 1000); RMSE*_ĉ,Delta comes from a delta-method approximation corrected for autocorrelation as in (7.2).

Indeed the MC errors are also very different. In fact, the smallest MC error is obtained by simply applying the arithmetic mean of the θ_t values simulated from the posterior. We have already stressed that the posterior AM does not aim at estimating the marginal likelihood, but we have nonetheless considered it in the following to verify how distant the corresponding values are and how different the conclusions can be when comparing alternative models via the AM estimator. For the Harmonic Mean method the MC error is quite high, reflecting the instability of the HM estimator. On the other hand, the Inflated Density Ratio approach seems to be a good compromise in terms of the order of magnitude of the error of the estimate ĉ_IDR and the robustness of its relative error estimation, ranging from 0.15 with independent Monte Carlo re-estimation to 0.30 with the delta method. Notice that, in order to take into account the autocorrelation of the posterior simulated values, a correction has been applied to the errors estimated with the delta and bootstrap methods. In particular, RMSE*_ĉIDR,Delta has been estimated as in (5.5) with a variance estimate inflated, in the MCMC context, to take into account the correlated samples, leading to a protective effective sample size given by

n_ESS = n / ( 1 + 2 Σ_{k=1}^K ρ̂_k ).   (7.2)

7.2 Hadamard 2: Bayes Factor computation

We have considered a second benchmark synthetic data set from [10], referred to as Hadamard 2: it consists of 200 sites for four species, A, B, C and D; these data have been simulated from the Jukes-Cantor model (JC69) and the true tree is shown in the right panel of Figure 2. For the true topology, we compute the marginal likelihood for the JC69 model and for the GTR+Γ model: the parameters lie in a 5-dimensional space for the JC69 model and in a 14-dimensional space for the GTR+Γ model. The simulated values from the Metropolis-Coupled algorithm have been rearranged to compute the marginal likelihoods for both models using the IDR method, the HM and the AM approach. As for the Hadamard 1 data, Monte Carlo errors have been computed by repeating the estimates 10 times. Tables 5 and 6 show the results obtained for the GTR+Γ and the JC69 models, respectively.

Table 5: Hadamard 2 data: marginal likelihood estimates of the GTR+Γ model (Ω = R^14) obtained with the IDR method ĉ_IDR, with the HM approach ĉ_HM and with the AM approach ĉ_AM. Four different RMSE estimates are provided: RMSE_ĉ,Delta has been computed as in (5.5); RMSE_ĉ,MC comes from 10 independent Monte Carlo replicates of the estimation; RMSE_ĉ,Boot is a bootstrap approximation of the RMSE (B = 1000); RMSE*_ĉ,Delta comes from a delta-method approximation of the RMSE corrected for autocorrelation as in (7.2).

All methods produce somewhat different results in terms of marginal likelihood; as in the previous example, the smallest Monte Carlo error is associated with the Arithmetic Mean method. Also in this case, the MC error of the Harmonic Mean method is much larger than the one produced by the Inflated Density Ratio method.

Table 6: Hadamard 2 data: marginal likelihood estimates of the Jukes-Cantor model (Ω = R^5) obtained with the IDR method ĉ_IDR, with the HM approach ĉ_HM and with the AM approach ĉ_AM. Four different RMSE estimates are provided: RMSE_ĉ,Delta has been computed as in (5.5); RMSE_ĉ,MC comes from 10 independent Monte Carlo replicates of the estimation; RMSE_ĉ,Boot is a bootstrap approximation of the RMSE (B = 1000); RMSE*_ĉ,Delta comes from a delta-method approximation of the RMSE corrected for autocorrelation as in (7.2).

The corresponding Bayes Factors (on a logarithmic scale) for JC69 against GTR+Γ are shown in Table 7: considering the reference values for the Bayes Factor defined in [22], all methods strongly support the Jukes-Cantor model, which is the true model. However, the strongest evidence in favor of the (known) correct model corresponds to the Bayes Factor as estimated by the Inflated Density Ratio.

Method   log(BF^JC69_GTR+Γ)   MC-ĈI log(BF)
IDR                           [      ,       ]
HM                            [5.0206,       ]
AM                            [1.0617,       ]

Table 7: Hadamard 2 data: Bayes Factors computed with the IDR, HM and AM approaches. MC-ĈI is the confidence interval obtained by considering the MC errors of 10 replications of the estimates.

7.3 Hadamard 2: tree selection

We now show how it is possible to extend the IDR approach to deal with non-fixed topologies and to select among competing trees. For a fixed substitution model, competing trees can be compared by considering the evidence of the data for a fixed tree topology: a tree topology τ_i can be evaluated by computing its posterior probability π(τ_i | X), derived from Bayes' theorem as

p(τ_i | X) = p(X | τ_i) p(τ_i) / p(X).

Unlike the topology, the parameters θ of the evolutionary process are continuous; in order to compare tree topologies, the continuous parameters are treated as nuisance parameters and are integrated out:

p(X | τ_i) = ∫_Ω p(X | τ_i, θ) p(θ | τ_i) dθ.

In testing tree τ_i against τ_j, the Bayes Factor is computed as the ratio of the probabilities of the data under each topology:

BF_ij = p(X | τ_i) / p(X | τ_j)   (7.3)

where p(X | τ_i) is the marginal likelihood of the tree τ_i with respect to the other parameters [43]. Consider the Hadamard 2 data: in the previous subsection, we verified the feasibility of the Inflated Density Ratio approach in comparing the true model (Jukes-Cantor) with the GTR+Γ model under a fixed topology, and the method produced more stable and plausible results. For the same data, we now fix the Jukes-Cantor model and compute the Bayes Factors in order to compare the alternative topologies (N_T = 3 possible topologies) with the true one (τ = 1). Results are shown in Table 8.

τ           Size   MC-ĈI log(BF_IDR)   MC-ĈI log(BF_HM)   MC-ĈI log(BF_AM)
log(BF_21)  10^4   [3.511, 3.599]      [2.37, 4.066]      [2.929, 2.989]
log(BF_31)  10^4   [3.817, 3.901]      [2.503, 3.163]     [2.053, 3.131]

Table 8: Hadamard 2 data: the Bayes Factor is computed in order to compare competing topologies with the true one. The Bayes Factor is approximated with the IDR method (BF_IDR), the HM (BF_HM) and the AM (BF_AM) approach. Confidence intervals have been constructed by considering the MC errors of 10 replications of the estimates.

Also in this case, the Inflated Density Ratio gains in precision and robustness of the estimate with respect to the Harmonic Mean estimator.
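The effective-sample-size discount in (7.2), used above to correct the delta-method and bootstrap errors for autocorrelation, can be sketched as follows (Python; the truncation lag K = 100 is a hypothetical choice, not taken from the paper):

```python
import numpy as np

def n_ess(x, K=100):
    """Effective sample size n / (1 + 2 * sum_{k=1}^K rho_hat_k), the
    factor used to discount the sample variance computed from
    autocorrelated MCMC draws, as in (7.2)."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    var = xc @ xc / n
    rho = [(xc[:-k] @ xc[k:]) / (n * var) for k in range(1, min(K, n - 1) + 1)]
    return n / (1.0 + 2.0 * np.sum(rho))

# toy check on an AR(1) chain with phi = 0.5, whose true ESS is about n/3
rng = np.random.default_rng(0)
n, phi = 50_000, 0.5
x = np.empty(n)
x[0] = rng.normal()
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()
assert n / 4 < n_ess(x) < n / 2.4
```

In practice the truncation lag K (or an upper bound on the autocorrelation sum, as the paper uses) must be chosen conservatively, since summing noisy high-lag autocorrelations can destabilise the estimate.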

8 Extending IDR to unknown topology

In the previous sections, we applied the Inflated Density Ratio approach to estimate the marginal likelihood for competing substitution models under a fixed topology and to compare competing topologies under a fixed substitution model. In both cases, the parameter space involves only continuous parameters, so the IDR method can be applied directly as described in [32]. Moreover, the Inflated Density Ratio approach can also be extended to compute the marginal likelihood when both the substitution model parameters and the tree topology are unknown, that is

c = Σ_{i=1}^{N_T} ∫_{ν_{τ_i}} ∫_θ p(X | τ_i, ν_{τ_i}, θ) p(τ_i, ν_{τ_i}, θ) dν_{τ_i} dθ   (8.1)

where N_T is the number of possible tree topologies. Equation (8.1) involves the integration over the substitution model parameters and the branch lengths, and the summation over all possible tree topologies. The Inflated Density Ratio approach can be extended to compute the quantity in (8.1) as follows. To simplify notation, we denote by θ = (π, ρ, ν) the set of continuous parameters and by τ ∈ {1, 2, ..., N_T} the discrete parameter ranging over the tree space, so that

c = Σ_τ ∫_θ g(θ, τ) dθ   (8.2)

where g(θ, τ) is the unnormalized posterior. Let c_i = ∫_θ g(θ, τ_i) dθ and let g_{P_k}(θ, τ_i) denote the density obtained by perturbing g(·, τ_i) on the continuous space of θ conditionally on τ = τ_i, adding mass k_i, with k = k_1 + k_2 + ... + k_{N_T}. Then

c + k = Σ_τ ∫_θ g_{P_k}(θ, τ) dθ
      = ∫_θ g_{P_k}(θ, τ_1) dθ + ... + ∫_θ g_{P_k}(θ, τ_{N_T}) dθ
      = c_1 E_{g(θ,τ_1)}[ g_{P_k}(θ, τ_1) / g(θ, τ_1) ] + ... + c_{N_T} E_{g(θ,τ_{N_T})}[ g_{P_k}(θ, τ_{N_T}) / g(θ, τ_{N_T}) ].

Let c_1/c = w_1, c_2/c = w_2, ..., c_{N_T}/c = w_{N_T}; dividing both sides by c and solving for c yields

c = k / ( w_1 E_{g(θ,τ_1)}[ g_{P_k}(θ,τ_1)/g(θ,τ_1) ] + ... + w_{N_T} E_{g(θ,τ_{N_T})}[ g_{P_k}(θ,τ_{N_T})/g(θ,τ_{N_T}) ] − 1 )   (8.3)

in which each term w_i E_{g(θ,τ_i)}[·] can be estimated from the posterior sample as

(T_i / n) · (1 / T_i) Σ_{t=1}^{T_i} g_{P_k}(θ^{(t)}, τ_i) / g(θ^{(t)}, τ_i)   (8.4)

where T_i is the observed frequency of topology τ_i among the n posterior draws. The marginal likelihood estimator in Equation (8.3) has been applied to the Hadamard 2 data under the JC69 model. Results are shown in Table 9.

Table 9: Hadamard 2 data: marginal likelihood for the JC69 model with unknown topology. RMSEs have been estimated with the delta method (RMSE_ĉ,Delta) and with 10 Monte Carlo replicates (RMSE_ĉ,MC).

Also in this case, the MC error and the RMSE of the estimate are quite low, confirming the results obtained under a fixed topology. Moreover, in this way the method can be extended to more complex models defined on higher-dimensional spaces. The extension of the Inflated Density Ratio method to a mixed parameter space is thus straightforward; to our knowledge, the same extension is not straightforward for other marginal likelihood estimation methods. Indeed the path sampling in [26] is applied to very complex models, much more complex than those considered in this work, but the marginal likelihood is always computed under a fixed topology (in particular, the consensus tree topology). This fact testifies to the flexibility of the Inflated Density Ratio method and its possible applicability to very different and complex models.
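The estimator in (8.3)–(8.4) can be sketched as follows (Python; the two-topology toy target, with a Gaussian posterior within each topology and a one-dimensional radial inflation, is an assumption made here purely to exercise the formulas):

```python
import numpy as np

def idr_mixed(ratios, topologies, k):
    """Mixed-space IDR sketch: given per-draw density ratios
    g_Pk(theta_i, tau_i) / g(theta_i, tau_i) and the topology tau_i of
    each draw, estimate each w_t * E_t[.] term by
    (T_t / n) * (1 / T_t) * sum over the draws with topology t, as in
    (8.4), and combine the terms as in (8.3)."""
    ratios = np.asarray(ratios, dtype=float)
    topologies = np.asarray(topologies)
    n = ratios.size
    total = 0.0
    for t in np.unique(topologies):
        mask = topologies == t
        T_t = mask.sum()                       # observed frequency of topology t
        total += (T_t / n) * ratios[mask].mean()
    return k / (total - 1.0)

# toy target: two "topologies" with masses c1, c2; N(0,1) posterior within each
rng = np.random.default_rng(3)
c1, c2, r, n = 3.0, 2.0, 0.5, 200_000
tau = (rng.random(n) < c1 / (c1 + c2)).astype(int)   # topology indicator
theta = rng.normal(size=n)                            # within-topology draws
# density ratios for a radius-r inflation around 0 (independent of c1, c2)
ratios = np.where(np.abs(theta) <= r,
                  np.exp(0.5 * theta ** 2),
                  np.exp(0.5 * (theta ** 2 - (np.abs(theta) - r) ** 2)))
k = 2.0 * r * (c1 + c2) / np.sqrt(2 * np.pi)          # total added mass k1 + k2
c_hat = idr_mixed(ratios, tau, k)
assert abs(c_hat / (c1 + c2) - 1.0) < 0.05
```

Since the per-topology terms are weighted by the observed topology frequencies T_t/n, the combined sum reduces to the overall mean of the density ratios, so no separate estimate of the weights w_t is needed.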

9 Discussion

In this paper, we investigate the possibility of using simple effective recipes for estimating the marginal likelihood of complex models such as phylogenetic models. In a Bayesian framework, several methods have been proposed to estimate the model evidence of competing models and then eventually evaluate the Bayes Factor. To our knowledge, the most widely used methods for estimating the marginal likelihood of phylogenetic models are thermodynamic integration and the harmonic mean approach. Thermodynamic integration has been shown to produce more reliable estimates of Bayes Factors for a large variety of phylogenetic models. Although this method has the advantage of general applicability, it can involve substantial computational costs and may require tuning and adjustments. Moreover, for certain model comparisons, a full thermodynamic integration may take weeks on a modern desktop computer, even under a fixed tree topology for small single-protein data sets. Indeed, simplicity of implementation combined with a relatively low computational burden are the two appealing features which explain why the HM is still currently one of the favorite options for routine implementation (see [47]). However, the simplicity of HM is often not matched by its accuracy, and recent literature has highlighted the unreliability of HM estimators in phylogenetic models [26] as well as in more general biological applications [15]. In this paper, we have provided evidence of the effectiveness of a simple alternative marginal likelihood estimator, the Inflated Density Ratio estimator (IDR), belonging to the class of generalized harmonic mean estimators. It shares the original simplicity and computational feasibility of the HM estimator but, unlike it, enjoys important theoretical properties, such as the finiteness of the variance.
Moreover, it allows one to recycle the posterior simulations, which is particularly appealing in contexts, such as phylogenetic models, where the computational burden of the simulation is heavier than the evaluation of the likelihood, posterior densities and the like. Like all importance sampling techniques based on a single stream of simulations, the computational burden can be shared in a parallel computing environment, reducing the computing time. Also the grid search for optimizing the estimated RMSE can be sped up with a parallel evaluation for each inflated density.


More information

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013

Eco517 Fall 2013 C. Sims MCMC. October 8, 2013 Eco517 Fall 2013 C. Sims MCMC October 8, 2013 c 2013 by Christopher A. Sims. This document may be reproduced for educational and research purposes, so long as the copies contain this notice and are retained

More information

Constructing Evolutionary/Phylogenetic Trees

Constructing Evolutionary/Phylogenetic Trees Constructing Evolutionary/Phylogenetic Trees 2 broad categories: Distance-based methods Ultrametric Additive: UPGMA Transformed Distance Neighbor-Joining Character-based Maximum Parsimony Maximum Likelihood

More information

Evolutionary Models. Evolutionary Models

Evolutionary Models. Evolutionary Models Edit Operators In standard pairwise alignment, what are the allowed edit operators that transform one sequence into the other? Describe how each of these edit operations are represented on a sequence alignment

More information

Bayesian Regression Linear and Logistic Regression

Bayesian Regression Linear and Logistic Regression When we want more than point estimates Bayesian Regression Linear and Logistic Regression Nicole Beckage Ordinary Least Squares Regression and Lasso Regression return only point estimates But what if we

More information

COS513 LECTURE 8 STATISTICAL CONCEPTS

COS513 LECTURE 8 STATISTICAL CONCEPTS COS513 LECTURE 8 STATISTICAL CONCEPTS NIKOLAI SLAVOV AND ANKUR PARIKH 1. MAKING MEANINGFUL STATEMENTS FROM JOINT PROBABILITY DISTRIBUTIONS. A graphical model (GM) represents a family of probability distributions

More information

Monte Carlo in Bayesian Statistics

Monte Carlo in Bayesian Statistics Monte Carlo in Bayesian Statistics Matthew Thomas SAMBa - University of Bath m.l.thomas@bath.ac.uk December 4, 2014 Matthew Thomas (SAMBa) Monte Carlo in Bayesian Statistics December 4, 2014 1 / 16 Overview

More information

Phylogenetic Tree Reconstruction

Phylogenetic Tree Reconstruction I519 Introduction to Bioinformatics, 2011 Phylogenetic Tree Reconstruction Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB Evolution theory Speciation Evolution of new organisms is driven

More information

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics

Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics Regularization Parameter Selection for a Bayesian Multi-Level Group Lasso Regression Model with Application to Imaging Genomics arxiv:1603.08163v1 [stat.ml] 7 Mar 016 Farouk S. Nathoo, Keelin Greenlaw,

More information

Sequential Monte Carlo Methods

Sequential Monte Carlo Methods University of Pennsylvania Bradley Visitor Lectures October 23, 2017 Introduction Unfortunately, standard MCMC can be inaccurate, especially in medium and large-scale DSGE models: disentangling importance

More information

Bayes: All uncertainty is described using probability.

Bayes: All uncertainty is described using probability. Bayes: All uncertainty is described using probability. Let w be the data and θ be any unknown quantities. Likelihood. The probability model π(w θ) has θ fixed and w varying. The likelihood L(θ; w) is π(w

More information

Non-homogeneous Markov Mixture of Periodic Autoregressions for the Analysis of Air Pollution in the Lagoon of Venice

Non-homogeneous Markov Mixture of Periodic Autoregressions for the Analysis of Air Pollution in the Lagoon of Venice Non-homogeneous Markov Mixture of Periodic Autoregressions for the Analysis of Air Pollution in the Lagoon of Venice Roberta Paroli 1, Silvia Pistollato, Maria Rosa, and Luigi Spezia 3 1 Istituto di Statistica

More information

Lecture 7 and 8: Markov Chain Monte Carlo

Lecture 7 and 8: Markov Chain Monte Carlo Lecture 7 and 8: Markov Chain Monte Carlo 4F13: Machine Learning Zoubin Ghahramani and Carl Edward Rasmussen Department of Engineering University of Cambridge http://mlg.eng.cam.ac.uk/teaching/4f13/ Ghahramani

More information

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem?

Who was Bayes? Bayesian Phylogenetics. What is Bayes Theorem? Who was Bayes? Bayesian Phylogenetics Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison October 6, 2011 The Reverand Thomas Bayes was born in London in 1702. He was the

More information

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny

Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny 008 by The University of Chicago. All rights reserved.doi: 10.1086/588078 Appendix from L. J. Revell, On the Analysis of Evolutionary Change along Single Branches in a Phylogeny (Am. Nat., vol. 17, no.

More information

Bayesian Phylogenetics

Bayesian Phylogenetics Bayesian Phylogenetics Bret Larget Departments of Botany and of Statistics University of Wisconsin Madison October 6, 2011 Bayesian Phylogenetics 1 / 27 Who was Bayes? The Reverand Thomas Bayes was born

More information

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA

BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA BAYESIAN METHODS FOR VARIABLE SELECTION WITH APPLICATIONS TO HIGH-DIMENSIONAL DATA Intro: Course Outline and Brief Intro to Marina Vannucci Rice University, USA PASI-CIMAT 04/28-30/2010 Marina Vannucci

More information

Bayesian Networks in Educational Assessment

Bayesian Networks in Educational Assessment Bayesian Networks in Educational Assessment Estimating Parameters with MCMC Bayesian Inference: Expanding Our Context Roy Levy Arizona State University Roy.Levy@asu.edu 2017 Roy Levy MCMC 1 MCMC 2 Posterior

More information

Notes on pseudo-marginal methods, variational Bayes and ABC

Notes on pseudo-marginal methods, variational Bayes and ABC Notes on pseudo-marginal methods, variational Bayes and ABC Christian Andersson Naesseth October 3, 2016 The Pseudo-Marginal Framework Assume we are interested in sampling from the posterior distribution

More information

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics

Bayesian phylogenetics. the one true tree? Bayesian phylogenetics Bayesian phylogenetics the one true tree? the methods we ve learned so far try to get a single tree that best describes the data however, they admit that they don t search everywhere, and that it is difficult

More information

Markov Chain Monte Carlo methods

Markov Chain Monte Carlo methods Markov Chain Monte Carlo methods By Oleg Makhnin 1 Introduction a b c M = d e f g h i 0 f(x)dx 1.1 Motivation 1.1.1 Just here Supresses numbering 1.1.2 After this 1.2 Literature 2 Method 2.1 New math As

More information

Bayesian inference: what it means and why we care

Bayesian inference: what it means and why we care Bayesian inference: what it means and why we care Robin J. Ryder Centre de Recherche en Mathématiques de la Décision Université Paris-Dauphine 6 November 2017 Mathematical Coffees Robin Ryder (Dauphine)

More information

17 : Markov Chain Monte Carlo

17 : Markov Chain Monte Carlo 10-708: Probabilistic Graphical Models, Spring 2015 17 : Markov Chain Monte Carlo Lecturer: Eric P. Xing Scribes: Heran Lin, Bin Deng, Yun Huang 1 Review of Monte Carlo Methods 1.1 Overview Monte Carlo

More information

Seminar über Statistik FS2008: Model Selection

Seminar über Statistik FS2008: Model Selection Seminar über Statistik FS2008: Model Selection Alessia Fenaroli, Ghazale Jazayeri Monday, April 2, 2008 Introduction Model Choice deals with the comparison of models and the selection of a model. It can

More information

Bayes Factors, posterior predictives, short intro to RJMCMC. Thermodynamic Integration

Bayes Factors, posterior predictives, short intro to RJMCMC. Thermodynamic Integration Bayes Factors, posterior predictives, short intro to RJMCMC Thermodynamic Integration Dave Campbell 2016 Bayesian Statistical Inference P(θ Y ) P(Y θ)π(θ) Once you have posterior samples you can compute

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

STA414/2104 Statistical Methods for Machine Learning II

STA414/2104 Statistical Methods for Machine Learning II STA414/2104 Statistical Methods for Machine Learning II Murat A. Erdogdu & David Duvenaud Department of Computer Science Department of Statistical Sciences Lecture 3 Slide credits: Russ Salakhutdinov Announcements

More information

Likelihood-Based Methods

Likelihood-Based Methods Likelihood-Based Methods Handbook of Spatial Statistics, Chapter 4 Susheela Singh September 22, 2016 OVERVIEW INTRODUCTION MAXIMUM LIKELIHOOD ESTIMATION (ML) RESTRICTED MAXIMUM LIKELIHOOD ESTIMATION (REML)

More information

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016

Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016 Biol 206/306 Advanced Biostatistics Lab 12 Bayesian Inference Fall 2016 By Philip J. Bergmann 0. Laboratory Objectives 1. Learn what Bayes Theorem and Bayesian Inference are 2. Reinforce the properties

More information

An Extended BIC for Model Selection

An Extended BIC for Model Selection An Extended BIC for Model Selection at the JSM meeting 2007 - Salt Lake City Surajit Ray Boston University (Dept of Mathematics and Statistics) Joint work with James Berger, Duke University; Susie Bayarri,

More information

SC7/SM6 Bayes Methods HT18 Lecturer: Geoff Nicholls Lecture 2: Monte Carlo Methods Notes and Problem sheets are available at http://www.stats.ox.ac.uk/~nicholls/bayesmethods/ and via the MSc weblearn pages.

More information

Variational Principal Components

Variational Principal Components Variational Principal Components Christopher M. Bishop Microsoft Research 7 J. J. Thomson Avenue, Cambridge, CB3 0FB, U.K. cmbishop@microsoft.com http://research.microsoft.com/ cmbishop In Proceedings

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Phylogenetics: Building Phylogenetic Trees

Phylogenetics: Building Phylogenetic Trees 1 Phylogenetics: Building Phylogenetic Trees COMP 571 Luay Nakhleh, Rice University 2 Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary model should

More information

Inferring Speciation Times under an Episodic Molecular Clock

Inferring Speciation Times under an Episodic Molecular Clock Syst. Biol. 56(3):453 466, 2007 Copyright c Society of Systematic Biologists ISSN: 1063-5157 print / 1076-836X online DOI: 10.1080/10635150701420643 Inferring Speciation Times under an Episodic Molecular

More information

Markov chain Monte Carlo

Markov chain Monte Carlo 1 / 26 Markov chain Monte Carlo Timothy Hanson 1 and Alejandro Jara 2 1 Division of Biostatistics, University of Minnesota, USA 2 Department of Statistics, Universidad de Concepción, Chile IAP-Workshop

More information

Penalized Loss functions for Bayesian Model Choice

Penalized Loss functions for Bayesian Model Choice Penalized Loss functions for Bayesian Model Choice Martyn International Agency for Research on Cancer Lyon, France 13 November 2009 The pure approach For a Bayesian purist, all uncertainty is represented

More information

Bayesian Inference and MCMC

Bayesian Inference and MCMC Bayesian Inference and MCMC Aryan Arbabi Partly based on MCMC slides from CSC412 Fall 2018 1 / 18 Bayesian Inference - Motivation Consider we have a data set D = {x 1,..., x n }. E.g each x i can be the

More information

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm

On the Optimal Scaling of the Modified Metropolis-Hastings algorithm On the Optimal Scaling of the Modified Metropolis-Hastings algorithm K. M. Zuev & J. L. Beck Division of Engineering and Applied Science California Institute of Technology, MC 4-44, Pasadena, CA 925, USA

More information

Phylogenetic Assumptions

Phylogenetic Assumptions Substitution Models and the Phylogenetic Assumptions Vivek Jayaswal Lars S. Jermiin COMMONWEALTH OF AUSTRALIA Copyright htregulation WARNING This material has been reproduced and communicated to you by

More information

Development of Stochastic Artificial Neural Networks for Hydrological Prediction

Development of Stochastic Artificial Neural Networks for Hydrological Prediction Development of Stochastic Artificial Neural Networks for Hydrological Prediction G. B. Kingston, M. F. Lambert and H. R. Maier Centre for Applied Modelling in Water Engineering, School of Civil and Environmental

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University Phylogenetics: Building Phylogenetic Trees COMP 571 - Fall 2010 Luay Nakhleh, Rice University Four Questions Need to be Answered What data should we use? Which method should we use? Which evolutionary

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018

Maximum Likelihood Tree Estimation. Carrie Tribble IB Feb 2018 Maximum Likelihood Tree Estimation Carrie Tribble IB 200 9 Feb 2018 Outline 1. Tree building process under maximum likelihood 2. Key differences between maximum likelihood and parsimony 3. Some fancy extras

More information

A Review of Pseudo-Marginal Markov Chain Monte Carlo

A Review of Pseudo-Marginal Markov Chain Monte Carlo A Review of Pseudo-Marginal Markov Chain Monte Carlo Discussed by: Yizhe Zhang October 21, 2016 Outline 1 Overview 2 Paper review 3 experiment 4 conclusion Motivation & overview Notation: θ denotes the

More information

Model comparison. Christopher A. Sims Princeton University October 18, 2016

Model comparison. Christopher A. Sims Princeton University October 18, 2016 ECO 513 Fall 2008 Model comparison Christopher A. Sims Princeton University sims@princeton.edu October 18, 2016 c 2016 by Christopher A. Sims. This document may be reproduced for educational and research

More information

Stat 451 Lecture Notes Numerical Integration

Stat 451 Lecture Notes Numerical Integration Stat 451 Lecture Notes 03 12 Numerical Integration Ryan Martin UIC www.math.uic.edu/~rgmartin 1 Based on Chapter 5 in Givens & Hoeting, and Chapters 4 & 18 of Lange 2 Updated: February 11, 2016 1 / 29

More information

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method

Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Slice Sampling with Adaptive Multivariate Steps: The Shrinking-Rank Method Madeleine B. Thompson Radford M. Neal Abstract The shrinking rank method is a variation of slice sampling that is efficient at

More information

7. Estimation and hypothesis testing. Objective. Recommended reading

7. Estimation and hypothesis testing. Objective. Recommended reading 7. Estimation and hypothesis testing Objective In this chapter, we show how the election of estimators can be represented as a decision problem. Secondly, we consider the problem of hypothesis testing

More information

VCMC: Variational Consensus Monte Carlo

VCMC: Variational Consensus Monte Carlo VCMC: Variational Consensus Monte Carlo Maxim Rabinovich, Elaine Angelino, Michael I. Jordan Berkeley Vision and Learning Center September 22, 2015 probabilistic models! sky fog bridge water grass object

More information

Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment

Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment Bayesian System Identification based on Hierarchical Sparse Bayesian Learning and Gibbs Sampling with Application to Structural Damage Assessment Yong Huang a,b, James L. Beck b,* and Hui Li a a Key Lab

More information

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters

Exercises Tutorial at ICASSP 2016 Learning Nonlinear Dynamical Models Using Particle Filters Exercises Tutorial at ICASSP 216 Learning Nonlinear Dynamical Models Using Particle Filters Andreas Svensson, Johan Dahlin and Thomas B. Schön March 18, 216 Good luck! 1 [Bootstrap particle filter for

More information

Bagging During Markov Chain Monte Carlo for Smoother Predictions

Bagging During Markov Chain Monte Carlo for Smoother Predictions Bagging During Markov Chain Monte Carlo for Smoother Predictions Herbert K. H. Lee University of California, Santa Cruz Abstract: Making good predictions from noisy data is a challenging problem. Methods

More information