Bayesian Analysis of Elapsed Times in Continuous-Time Markov Chains

Marco A. R. Ferreira (1), Marc A. Suchard (2,3,4)

(1) Department of Statistics, University of Missouri at Columbia, USA
(2) Department of Biomathematics and (3) Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, USA
(4) Department of Biostatistics, School of Public Health, University of California, Los Angeles, USA

Summary. We explore Bayesian analysis for continuous-time Markov chain (CTMC) models based on a conditional reference prior. For CTMC models, inference of the elapsed time between chain observations depends heavily on the rate of decay of the prior as the elapsed time increases. Moreover, improper priors on the elapsed time may lead to improper posterior distributions. In addition to the elapsed time, an infinitesimal rate matrix also characterizes the CTMC. Usually, experts have good prior knowledge about the parameters of the infinitesimal rate matrix, and thus can provide well-informed priors. We show that the use of a proper prior for the rate matrix parameters together with the conditional reference prior for the elapsed time yields a proper posterior distribution. Finally, we demonstrate that, when compared to analyses based on priors previously proposed in the literature, Bayesian analysis of the elapsed time based on the conditional reference prior possesses better frequentist properties. The conditional reference prior therefore represents a better default prior choice for widely-used estimation software.

Keywords: Conditional reference prior; phylogenetic reconstruction; prior for branch length; frequentist coverage; mean square error.

1. Introduction

Continuous-time Markov chains (CTMCs) are ubiquitous modeling tools. The chains define a stochastic process {Z(t) : t ≥ 0}, where Z(t) realizes one of s discrete values from a state-space set S as elapsed time t proceeds. The chains also satisfy Markovian behavior, such that the probability distribution of Z(t_3) is independent of Z(t_1) conditional on Z(t_2) for t_3 > t_2 > t_1. Given this conditionally memoryless property, an infinitesimal rate matrix Q completely governs the process, where Q = {q_{ij}} is an s × s matrix with non-negative off-diagonal elements and rows summing to 0. The conditional probability distribution of Z(t) naturally follows

Pr(Z(t) = j | Z(0) = i) = { e^{tQ} }_{ij},   (1)

where { }_{ij} represents the ij-th matrix element, matrix exponentiation takes the form e^A = I + \sum_{k=1}^{\infty} A^k / k!, and I is the s × s identity matrix.

To gain intuition into the process, one can consider a graph composed of vertices, one for each state in S, and edges with weights q_{ij} connecting vertices for i ≠ j. Random variable Z(t) then transforms into the location indicator of a particle on this graph at time t as the particle drifts from vertex to vertex. At any given time, the particle first waits an Exponential amount of time, with rate equal to the sum of the edge weights leaving the particle's current state. After this waiting time, the particle jumps to its next location with probability proportional to the edge weight connecting the current and destination vertices.
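To make Equation (1) and the particle interpretation concrete, here is a minimal sketch (our illustration, not part of the paper) for a hypothetical two-state chain; the rate matrix Q, the time horizon and the use of numpy/scipy are assumptions made only for this example.

```python
import numpy as np
from scipy.linalg import expm

# Toy infinitesimal rate matrix for a 2-state chain (assumption, not from the paper):
# off-diagonal entries are non-negative and each row sums to zero.
Q = np.array([[-0.3, 0.3],
              [0.7, -0.7]])

def transition_matrix(Q, t):
    """Transition probabilities Pr(Z(t) = j | Z(0) = i) = {exp(tQ)}_ij, Equation (1)."""
    return expm(t * Q)

def simulate_ctmc(Q, start, t_max, rng):
    """Simulate one path: wait an Exponential time with rate -q_ii, then jump
    to state j with probability proportional to q_ij."""
    state, time = start, 0.0
    while True:
        rate_out = -Q[state, state]
        wait = rng.exponential(1.0 / rate_out)
        if time + wait > t_max:
            return state
        time += wait
        probs = Q[state].clip(min=0.0)
        probs /= probs.sum()
        state = rng.choice(len(probs), p=probs)

rng = np.random.default_rng(1)
t = 2.0
empirical = np.mean([simulate_ctmc(Q, start=0, t_max=t, rng=rng) == 1
                     for _ in range(20000)])
print(transition_matrix(Q, t)[0, 1], empirical)  # the two should roughly agree
```

The empirical frequency from the jump-process simulation and the corresponding entry of e^{tQ} should agree up to Monte Carlo error, which is exactly the content of Equation (1).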

Here we consider the case, often found in biology and linguistics, in which we observe several replications of the pair {Z(0), Z(t)} and the primary interest lies in inferring the elapsed time t.

Biology and linguistics are two seemingly disparate fields that make considerable use of CTMCs. Both fields exploit CTMCs to infer evolutionary histories. In the case of biology, researchers reconstruct the histories relating molecular sequences, such as short segments of DNA, genes or entire genomes, while glottochronology aims to infer the ancestral relationships between languages as an approach to understanding pre-historical human migration. The sequences in this latter example are strings of presence/absence indicators of critical words in a language. Often, analysis requires inferring only the times separating sequences on a pairwise basis and not the entire underlying history.

The first and most important step in any reconstruction is a description of how a character in one sequence relates to a possibly different character in the corresponding site of a second sequence (Figure 1). Here, CTMCs come to the rescue. Let S equal the set of possible characters at a single site. For nucleotide sequences, s = 4, containing adenine (A), guanine (G), cytosine (C) and thymine (T). For amino-acid sequences, s = 20; codon-based models have 64 states; and s = 2 naturally describes the glottochronology state-space. Then the two related characters at a site form realizations from a single CTMC observed at two different moments, and one typically assumes the chains at different sites are independent and identically distributed. Statistical inference reduces to estimating the elapsed time t between the observed sequences and, potentially, the infinitesimal rate matrix Q.

To attack this statistical problem, let the data Y = (n_{11}, ..., n_{1s}, n_{21}, ..., n_{2s}, ..., n_{s1}, ..., n_{ss}) count the observed number of transitions between pairs of states, n_{ij} for i, j ∈ S. While up to s(s − 1) free parameters may characterize Q, one often employs structured matrices that are biologically motivated and contain far fewer free parameters φ ∈ Φ. This yields the complete parameter vector θ = (t, φ).

Usually, experts have good prior knowledge about the parameters of the infinitesimal rate matrix. Experts either fix φ to empirically estimated quantities derived from large databases, such as the PAM (Dayhoff et al., 1972), JTT (Jones et al., 1992) and WAG (Whelan and Goldman, 2001) models for amino acid chains, or can provide well-informed priors. Such is not the case for the elapsed time t a priori. Frequently, t is the most important quantity to be estimated, for example when the expert wishes to estimate divergence times between molecular sequences or languages. Viewing inference of t as paramount differentiates CTMC use in reconstruction from the use of CTMCs to analyze panel data in the social sciences and econometrics (Kalbfleisch and Lawless, 1985; Geweke et al., 1986). In these latter models, the elapsed time between observations is known and inference focuses on the infinitesimal rate matrix parameters φ. Kalbfleisch and Lawless (1985) introduce maximum likelihood estimators for φ and Geweke et al. (1986) furnish Bayesian estimators under a uniform prior on rate matrices.
Neither maximum likelihood nor Bayesian estimation of the elapsed time t is trivial. Assuming the chain is irreducible, as t → ∞ the probability distribution of Z(t) converges to the stationary distribution of Q. As a consequence, the data likelihood function f(Y | θ) of the CTMC converges to a constant greater than zero. Thus, a Bayesian analysis with a marginally improper prior for t leads to a useless improper posterior distribution. Moreover, f(Y | θ) may be strictly increasing in t. Consequently, with positive probability the maximum likelihood estimate of t may not exist, and the prior choice for t can impart substantial influence on estimates.
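The flat right tail is easy to see numerically. The sketch below is ours, not the authors'; it assumes a toy four-state equal-rates matrix and made-up count data, and simply evaluates the likelihood through Equation (1). The log-likelihood levels off as t grows, so the data carry essentially no information about large elapsed times.

```python
import numpy as np
from scipy.linalg import expm

# Toy illustration (not from the paper): a 4-state chain with rate 0.25 between
# every pair of states, and hypothetical pairwise count data n_ij.
Q = 0.25 * (np.ones((4, 4)) - 4 * np.eye(4))
counts = np.array([[20, 4, 3, 3],
                   [5, 18, 2, 4],
                   [2, 3, 22, 5],
                   [4, 2, 3, 20]])   # n_ij: state i in sequence 1, state j in sequence 2

def log_lik(t):
    """log f(Y | t) = sum_ij n_ij log Pr(Z(t) = j | Z(0) = i)."""
    P = expm(t * Q)
    return float(np.sum(counts * np.log(P)))

for t in [0.1, 0.5, 1.0, 5.0, 20.0, 100.0]:
    print(t, round(log_lik(t), 3))
# The log-likelihood approaches sum_ij n_ij log(1/4), a constant, so large
# elapsed times are essentially indistinguishable from one another.
```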

To briefly demonstrate the poor tail behavior of the likelihood function, we consider a subset of the glottochronology language data analyzed in Gray and Atkinson (2003). These data are binary characters indicating the presence/absence of cognates in Indo-European languages. Cognates are words that share a common origin in a predecessor language. One difficulty in these data is estimating the divergence time of 16 Romance languages from the Germanic languages, represented by Modern German. As German is the outgroup among the data, this distance becomes the evolutionary tree height. Figure 2 plots posterior histograms of the tree height under six different prior choices (note that the scales differ across plots). Both Exponential and Uniform priors are standard in MrBayes (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003), a popular Bayesian sampler for evolutionary reconstruction problems. As the likelihood function flattens out, little information is available in the data and the prior dictates the right-tail behavior of the tree height. This seriously affects Bayesian estimation and can lead to disastrous consequences in model selection problems (Suchard et al., 2001).

Even in moderate-sized phylogenetic problems, the elicitation of a joint prior distribution on the parameters θ is practically infeasible. Thus, automatic or semi-automatic methods may find use. An attractive strategy to overcome these difficulties, and one that has seen significant development in recent years, is to use an objective Bayesian analysis, by which we mean the use of default priors derived from formal rules that exploit the structure of the problem at hand but do not require subjective prior elicitation. Two of the most popular of these methods are the Jeffreys prior (Jeffreys, 1961) and the reference prior (Bernardo, 1979; Berger and Bernardo, 1989, 1992).

In this paper, we develop a default prior for the elapsed time t of CTMC models for use when little or no prior information is available on t but there is prior information on the parameters of the infinitesimal rate matrix. Specifically, we derive an explicit expression for the conditional reference prior (Sun and Berger, 1998) on t and establish the propriety of the resulting posterior distribution. We also study the small-sample frequentist properties of Bayesian procedures based on the conditional reference prior and compare these properties to results obtained under alternative priors currently exploited by evolutionary reconstruction practitioners.

2. General Time Model

We begin the development of a default prior by considering the CTMC model in its most general, irreducible form, in which φ contains s(s − 1) parameters. We refer to this as the general time (GT) model. Later we return to more restricted models, such as the Jukes-Cantor (Jukes and Cantor, 1969, JC69) and Kimura (Kimura, 1980, K80) models commonly employed in evolutionary reconstruction.

To proceed, we first require an understanding of the tail behavior of the data likelihood f(Y | θ). Consider the spectral decomposition Q = B(φ) Λ(φ) B^{-1}(φ), where Λ(φ) is the diagonal matrix of eigenvalues and the columns of B(φ) are the corresponding right-eigenvectors of Q. To simplify notation and ease exposition, we drop the implicit dependence of Λ and B on φ, order the eigenvalues in decreasing order, such that

0 = λ_1 > λ_2 ≥ ... ≥ λ_s,   (2)

and write B = {r_{ik}} and B^{-1} = {c_{kj}}.
Interestingly, the largest eigenvalue λ_1 equals 0 because the row sums of Q are all 0. The largest eigenvalue's corresponding right-eigenvector is (r_{11}, ..., r_{s1}) = (1, ..., 1), and we refer to the largest eigenvalue's left-eigenvector (c_{11}, ..., c_{1s}) as a stationary distribution π of the CTMC.

CTMCs are trivially aperiodic, and considering only irreducible chains enforces that π is unique through the Ergodic theorem. Since π is unique, the remaining eigenvalues are strictly less than λ_1, i.e. negative. Using the spectral decomposition, we re-write the conditional probability distribution of Z(t) given in Equation (1), such that the probability of transitioning from state i to state j in time t is

P_{ij}(t) = { B e^{Λt} B^{-1} }_{ij} = \sum_{k=1}^{s} r_{ik} e^{λ_k t} c_{kj} = \sum_{k=1}^{s} d_{ijk}.   (3)

Given the structure of the eigenvalues in Equation (2) and taking t → ∞, P_{ij}(t) converges to the stationary probability c_{1j} = π_j, which does not depend on the starting state i. Note that π implicitly depends on φ. Returning to the data Y, the log-likelihood function becomes

\log f(Y | θ) = \sum_{i,j} n_{ij} \log \left( \sum_{k=1}^{s} d_{ijk} \right).   (4)

Proposition 2.1. For the GT model, the log-likelihood function (4) is a continuous function on [0, ∞) satisfying:

(i) \lim_{t → 0^+} f(Y | θ) = 1 if n_{ij} = 0 for all i ≠ j, and 0 otherwise;   (5)

(ii) \lim_{t → ∞} \log f(Y | θ) = \sum_{i,j} n_{ij} (\log π_i + \log π_j); and   (6)

(iii) for \sum_{i ≠ j} n_{ij} > 0 and small t,

f(Y | θ) ≈ t^{\sum_{i ≠ j} n_{ij}} \prod_{i ≠ j} q_{ij}^{n_{ij}}.   (7)

Proof. See Appendix.

3. Conditional reference prior

There are at least two possible forms of noninformative priors for the CTMC parameters θ = (t, φ). The first is the joint Jeffreys-rule prior, given by π(θ) ∝ \sqrt{\det I(θ)}, where I(θ) is the Fisher information matrix. The ij-th entry of the Fisher information matrix is

{I(θ)}_{ij} = -E\left[ \frac{∂^2}{∂θ_i ∂θ_j} \log f(Y | θ) \right], with θ_1 = t, θ_2 = φ,   (8)

where E[·] refers to an expectation with respect to the conditional distribution of Y given θ. The joint Jeffreys prior for θ could be a reasonable choice if prior information were available on neither t nor φ, but usually there exists prior biological information on φ.
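As an aside, the spectral form in Equations (3)-(4) is easy to evaluate directly. The sketch below is ours, under an assumed toy equal-rates matrix and made-up counts, and it also illustrates the limiting behavior described in Proposition 2.1.

```python
import numpy as np

# Spectral-decomposition form of Equations (3)-(4) for a symmetric toy rate
# matrix (assumption: real eigenvalues, so numpy's eig suffices here).
Q = 0.25 * (np.ones((4, 4)) - 4 * np.eye(4))
lam, B = np.linalg.eig(Q)          # Q = B diag(lam) B^{-1}
Binv = np.linalg.inv(B)

def P(t):
    """P_ij(t) = sum_k r_ik exp(lambda_k t) c_kj, Equation (3)."""
    return (B * np.exp(lam * t)) @ Binv

def log_lik(counts, t):
    """Equation (4): log f(Y | theta) = sum_ij n_ij log sum_k d_ijk."""
    return float(np.sum(counts * np.log(P(t))))

counts = np.array([[20, 4, 3, 3],
                   [5, 18, 2, 4],
                   [2, 3, 22, 5],
                   [4, 2, 3, 20]])    # hypothetical pairwise transition counts

diag_only = np.diag([10, 10, 10, 10])
print(np.exp(log_lik(diag_only, 1e-9)))   # ~1, cf. Proposition 2.1(i)
print(P(200.0).round(6))                  # every row ~ stationary pi = 1/4
print(log_lik(counts, 50.0), log_lik(counts, 500.0))   # flat right tail
```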

Typically, no prior information on t is available a priori, while expert opinion remains quite strong regarding reasonable values for φ. As discussed in Section 1 and illustrated in Figure 2, posterior analysis inferring t can be highly influenced by the prior on t. Consequently, we wish to incorporate prior information about φ and, at the same time, use a noninformative prior for t. This leads us to a conditional reference prior for t (Sun and Berger, 1998).

Let π(φ), with \int_Φ π(φ) dφ = 1, characterize the prior information on the infinitesimal rate matrix parameter φ. Following Sun and Berger (1998), the conditional reference prior for t is π_r(t | φ) ∝ \sqrt{I(t | φ)}, where

I(t | φ) = -E\left[ \frac{∂^2}{∂t^2} \log f(Y | θ) \right].   (9)

Then, the joint prior density for θ becomes

π(φ) π_r(t | φ).   (10)

Theorem 3.1. Starting from the GT model with log-likelihood shown in Equation (4), the conditional reference prior for the elapsed time t given φ is

π_r(t | φ) ∝ g(t | φ) = \sqrt{ \sum_{i,j} \frac{ \left( \sum_{k=2}^{s} λ_k r_{ik} e^{λ_k t} c_{kj} \right)^2 - \left( \sum_{k=1}^{s} r_{ik} e^{λ_k t} c_{kj} \right) \left( \sum_{k=2}^{s} λ_k^2 r_{ik} e^{λ_k t} c_{kj} \right) }{ \sum_{k=1}^{s} r_{ik} e^{λ_k t} c_{kj} } },   (11)

where we remind readers that λ_k, r_{ik} and c_{kj} all implicitly depend on φ.

Proof. See Appendix.

An interesting feature of the unnormalized conditional reference prior g(t | φ) given in Equation (11) is that its behavior close to zero is independent of the infinitesimal rate matrix Q, and its behavior for large t depends only on the second largest eigenvalue of Q. The following corollary describes the tail behavior of g(t | φ).

Corollary 3.1. The unnormalized conditional reference prior of t given φ shown in Equation (11) is a non-negative, continuous function on [0, ∞) that satisfies:

(i) g(t | φ) = O(t^{-1/2}) as t → 0, and   (12)

(ii) g(t | φ) = O(e^{λ_2 t}) as t → ∞.   (13)

Proof. See Appendix.

As a consequence of Corollary 3.1, and recalling that λ_2 < 0, the conditional reference prior π_r(t | φ) is proper. Therefore, there are no normalization concerns (Sun and Berger, 1998), and the conditional reference prior can be written as

π_r(t | φ) = K(φ) g(t | φ),   (14)

where the normalizing constant K(φ) = \left\{ \int_0^∞ g(t | φ) dt \right\}^{-1} can be easily computed using standard one-dimensional numerical integration.
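In practice, Equation (11) and the normalizing constant of Equation (14) are straightforward to evaluate numerically. The sketch below is our illustration under an assumed toy reversible rate matrix; it is not software from the paper.

```python
import numpy as np
from scipy.integrate import quad

# Toy reversible rate matrix (assumption for illustration); rows sum to zero.
Q = np.array([[-1.0, 0.6, 0.2, 0.2],
              [0.6, -1.0, 0.2, 0.2],
              [0.2, 0.2, -1.0, 0.6],
              [0.2, 0.2, 0.6, -1.0]])
lam, B = np.linalg.eig(Q)
order = np.argsort(lam)[::-1]       # lambda_1 = 0 > lambda_2 >= ... >= lambda_s
lam, B = lam[order], B[:, order]
Binv = np.linalg.inv(B)

def g(t):
    """Unnormalized conditional reference prior, Equation (11)."""
    d = B[:, None, :] * np.exp(lam * t) * Binv.T[None, :, :]   # d[i, j, k] = r_ik e^{lam_k t} c_kj
    p = d.sum(axis=2)                                          # P_ij(t)
    dp = (d[:, :, 1:] * lam[1:]).sum(axis=2)                   # sum over k >= 2 of lam_k d_ijk
    d2p = (d[:, :, 1:] * lam[1:] ** 2).sum(axis=2)             # sum over k >= 2 of lam_k^2 d_ijk
    val = np.sum((dp ** 2 - p * d2p) / p)
    return np.sqrt(max(val, 0.0))    # guard against tiny negative roundoff at large t

# Normalizing constant K(phi) of Equation (14) by one-dimensional quadrature,
# splitting the range to handle the integrable 1/sqrt(t) behavior near zero.
K = 1.0 / (quad(g, 0.0, 1.0)[0] + quad(g, 1.0, np.inf)[0])
print(K, K * g(0.5))                 # normalized prior density at t = 0.5
```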

Corollary 3.2. The joint prior distribution for θ induced by the product of the informative prior distribution π(φ) and the conditional reference prior π_r(t | φ) given in Equation (11) yields a proper posterior density.

Proof. This follows directly from the fact that π(φ) and π_r(t | φ) are proper probability measures.

Given the tail behavior described in Corollary 3.1, a Gamma(1/2, −λ_2) distribution can serve as a reasonable approximation of π_r(t | φ) for implementation in software that allows only standard distributions. A comparison of the performance of Bayesian procedures based on the conditional reference prior and its Gamma(1/2, −λ_2) approximation is presented in Section 5.

4. Commonly Employed CTMC Models

We consider two restricted cases of the GT model that find general use across binary, nucleotide and amino acid sequences. The first parameterization of Q considers that all transitions of the CTMC occur with the same infinitesimal rate α, such that q_{ij} = α for all i ≠ j. Although Jukes and Cantor (1969) first endorse this model for nucleotide substitution processes, the mathematical properties that we derive are shared with all standard models for amino acid sequences. Amino acid CTMC models follow empirically estimated infinitesimal rate matrices, resulting in the same number of free parameters as the JC69 model. We also consider the K80 (Kimura, 1980) CTMC for nucleotide sequences. This model assumes that the states of the chain divide into two disjoint sets and that the rates of transition within and between sets differ. Strong expert opinion exists about the relative rates of within- and between-set events, suggesting use of a conditional reference prior.

4.1. Reference prior for the JC69 model

Under an equal-rates model, sufficient statistics of the data Y are the total number of sites N and the number of observed changes n = \sum_{i ≠ j} n_{ij}. Using these sufficient statistics, the likelihood function under the JC69 model becomes

f(N, n | t, α) = \left[ \frac{1}{4} + \frac{3}{4} e^{-αt} \right]^{N-n} \left[ \frac{3}{4} \left( 1 - e^{-αt} \right) \right]^{n}.   (15)

As the likelihood function provides information only about the product αt, the infinitesimal rate α is fixed a priori. Different choices of α lead to differing scalings of t. For example, if α = 1/3 (the usual choice among phylogeneticists) then t scales in terms of the expected number of changes per site given the chain starts at stationarity. To appreciate this, we count the expected number of changes that occur for t ∈ [0, 1): \sum_i π_i (−q_{ii}) = 1, where π_i = 1/4 under the JC69 model.

Under the JC69 model, a maximum likelihood estimate of t does not exist for n/N > 3/4, and a Bayesian approach with a poor prior choice may not fare any better. As t → ∞, the distribution of the number of observed changes between sequences converges to a Binomial distribution with sample size N and probability of success 3/4. Likewise, the data likelihood converges to a positive constant. This behavior causes major problems for Bayesian estimation of t, as the inference will depend heavily on the tail behavior of the prior for t. In particular, improper priors will lead to useless improper posterior distributions, and inference based on truncated uniform priors depends heavily on where the prior is truncated.
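A short numerical illustration of Equation (15) (ours, with the α-scaling exactly as written above and hypothetical counts) makes the non-existence of the MLE for n/N > 3/4 visible: the log-likelihood then increases monotonically toward its limiting constant.

```python
import numpy as np

def jc69_log_lik(t, N, n, alpha=1.0):
    """JC69 log-likelihood of Equation (15): N sites, n observed changes."""
    p = 0.75 * (1.0 - np.exp(-alpha * t))     # probability a site shows a change
    return (N - n) * np.log(1.0 - p) + n * np.log(p)

ts = [0.5, 1.0, 2.0, 5.0, 20.0]
print([round(jc69_log_lik(t, N=100, n=60), 2) for t in ts])  # interior maximum: MLE exists
print([round(jc69_log_lik(t, N=100, n=80), 2) for t in ts])  # n/N > 3/4: increasing in t, no MLE
```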

From Theorem 3.1, the reference prior for the elapsed time t under the JC69 model is

π_r(t | φ) ∝ \sqrt{I(t | φ)} ∝ \frac{ e^{-αt} }{ \sqrt{ (1 + 3 e^{-αt})(1 - e^{-αt}) } }.   (16)

As the second largest eigenvalue of the infinitesimal rate matrix Q of the JC69 model is −α, the prior above behaves as e^{-αt} for large t.

4.2. Reference prior for the Kimura Model

The infinitesimal rate matrix Q under the K80 model for nucleotides is

Q = α \begin{pmatrix} -(κ+2) & κ & 1 & 1 \\ κ & -(κ+2) & 1 & 1 \\ 1 & 1 & -(κ+2) & κ \\ 1 & 1 & κ & -(κ+2) \end{pmatrix},   (17)

where we arbitrarily have ordered the states in Matrix (17) as {A, G, C, T}. Nucleotides A and G contain purine side-groups, while C and T contain pyrimidines. Purines and pyrimidines differ in the size of their aromatic hetero-cycles. Due to steric differences, CTMC jumps within groups, confusingly called transitions by evolutionary biologists, occur with infinitesimal rate κα. This rate may differ from that of changes across groups, called transversions, which occur at rate α. Following the JC69 model formulation, one fixes α such that t scales in terms of the expected number of changes per site. This choice implies α = (κ + 2)^{-1}. Sufficient statistics of the data are, again, the total number of sites N and the numbers of observed transitions n_s and transversions n_v. The data likelihood function under the K80 model reduces to

f(N, n_s, n_v | t, α, κ) = \left[ \frac{1}{4} + \frac{1}{4} e^{-ακt} + \frac{1}{2} e^{-\frac{α(κ+1)t}{2}} \right]^{N - n_s - n_v} \left[ \frac{1}{2} - \frac{1}{2} e^{-ακt} \right]^{n_v} \left[ \frac{1}{4} + \frac{1}{4} e^{-ακt} - \frac{1}{2} e^{-\frac{α(κ+1)t}{2}} \right]^{n_s}.   (18)

By Theorem 3.1, the conditional reference prior for t under the K80 model is

π_r(t | φ) ∝ e^{\frac{1}{2}ακt} \sqrt{ \frac{ 2κ^2 e^{α(κ+1)t} + (κ+1)^2 e^{2ακt} - 4κ^2 e^{ακt} - 4κ e^{ακt} + 2κ^2 e^{αt} - (κ-1)^2 }{ \left( e^{ακt} - 1 \right) \left( e^{\frac{α(3κ+1)t}{2}} + e^{\frac{α(κ+1)t}{2}} - 2 e^{ακt} \right) \left( e^{\frac{α(3κ+1)t}{2}} + e^{\frac{α(κ+1)t}{2}} + 2 e^{ακt} \right) } }.   (19)

From Corollary 3.1, π_r(t | φ) is approximately proportional to e^{λ_2 t} for large t, where λ_2 is the second largest eigenvalue of Q under the K80 model.

Usually, there is strong expert opinion about κ. For example, fixing κ = 2 regularly occurs in phylogenetic software (Felsenstein, 1995), while estimates of κ range from as low as 1.4 in regions of the human immunodeficiency virus (Leitner et al., 1997) to a median of approximately 4, with a variance of 10, across mammalian gene sequences (Rosenberg et al., 2003). Taking π(κ) to be a log-normal density fits these observations well.

For richer infinitesimal rate matrix parameterizations, the most popular CTMC for nucleotides is arguably the HKY85 (Hasegawa et al., 1985) model. This model extends the K80 chain by allowing for a non-uniform stationary distribution π. While varying π affects the eigenvectors in B, it does not change the eigenvalues. Commonly, one fixes π at its empirically observed estimate (Li et al., 2000), as the maximum likelihood estimates rarely differ by an appreciable amount. Then κ remains the only free parameter, and all properties derived for the K80 model hold under HKY85 assumptions.
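Returning to the log-normal prior for κ suggested above, matching its hyperparameters to the quoted rough summaries (a median near 4 and a variance near 10) is a short calculation. The sketch below is our illustration; the target values are only the ballpark figures cited from Rosenberg et al. (2003).

```python
import numpy as np

# Match a log-normal prior pi(kappa) to a median of about 4 and a variance of
# about 10 (rough summaries quoted in the text).
median, var = 4.0, 10.0
mu = np.log(median)                                      # log-normal median = exp(mu)
x = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * var / median**2))   # x = exp(sigma^2) solves x(x - 1) = var / median^2
sigma = np.sqrt(np.log(x))
print(mu, sigma)                                         # ~1.386 and ~0.60

# Quick check by simulation.
draws = np.random.default_rng(0).lognormal(mu, sigma, size=200_000)
print(np.median(draws), draws.var())                     # ~4 and ~10
```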

5. Frequentist properties

The study of frequentist properties of Bayesian procedures has been proposed as one way to evaluate default priors (Berger et al., 2001, and references therein). In this section, we carry out a simulation study to examine the frequentist properties of Bayesian procedures based on the conditional reference prior and on priors previously proposed in the literature for the elapsed time t. The frequentist properties considered here are the mean squared error (MSE) of parameter estimates and the frequentist coverage of 95% credible intervals for the elapsed time t.

We compare the analyses based on the conditional reference prior and its Gamma(1/2, −λ_2) approximation with analyses under two priors previously proposed in the literature: a Uniform(0, 10), as implemented in MrBayes version 2 (Huelsenbeck and Ronquist, 2001), and an Exponential(10) with mean 10^{-1}, as implemented in MrBayes version 3 (Ronquist and Huelsenbeck, 2003). We simulate data under the JC69 model discussed in Section 4.1 with the parameter t equal to one of 100 values ranging from 0.01 to 1. For each parameter value, we simulate 1000 datasets, and from each dataset we compute the posterior mean and the 95% credible interval using the reference, Gamma, Uniform and Exponential priors. We then compute the (estimated) MSE of the Bayesian estimators and the (estimated) frequentist coverage of the equal-tailed 95% credible intervals.

Figure 3 shows the relative MSE of the Bayesian estimates and the frequentist coverage of the 95% credible intervals as a function of the true value of t. In the range of values considered for t, estimation based on the Exponential(10) prior is slightly better in terms of relative MSE than estimation based on the reference prior. The performance of estimation based on the Uniform(0, 10) prior deteriorates very quickly for values of t larger than 0.7. In terms of frequentist coverage, for all considered values of t the reference prior yields credible intervals with coverage close to nominal. This nominal coverage is consistent with Welch and Peers (1963), who demonstrate that a univariate Jeffreys prior can serve as a first-order probability matching prior. The Uniform prior yields credible intervals with coverage slightly below the nominal value, whereas the coverage of the Exponential-prior-induced credible intervals quickly drops below the nominal value as the true value of t increases.

A decomposition of the relative MSE in terms of variance and squared bias (Figure 4) sheds light on the different behaviors of analyses based on the Exponential(10) and reference priors. For the Exponential(10) prior, the frequentist variance of the posterior mean is fairly insensitive to the true value of t, whereas the bias increases fairly fast as a function of t. Conversely, for the reference prior, the frequentist variance of the estimate increases with the true value of t, whereas the bias remains close to zero. When variance and squared bias are combined into the MSE, both analyses have similar performance.
Nevertheless, the poor frequentist coverage of the Exponential-prior-based analyses results from the fact that the prior strongly favors small values of t. Thus, when no prior information is available, we recommend against using the Exponential(10) prior.
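A stripped-down version of such a simulation study is easy to reproduce. The sketch below is ours, not the authors' code; the number of sites, the replicate count and the grid used for posterior summaries are arbitrary placeholders. It estimates MSE and coverage under the conditional reference prior of Equation (16) for a few true values of t.

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, N, reps = 1.0, 100, 200           # placeholder settings, far smaller than the paper's study
grid = np.linspace(1e-4, 20.0, 4000)     # discretized t for posterior summaries

def ref_log_prior(t):
    e = np.exp(-alpha * t)               # Equation (16)
    return -alpha * t - 0.5 * np.log((1.0 + 3.0 * e) * (1.0 - e))

def posterior_summaries(n):
    """Posterior mean and equal-tailed 95% interval under the reference prior."""
    p = 0.75 * (1.0 - np.exp(-alpha * grid))
    logpost = ref_log_prior(grid) + (N - n) * np.log(1.0 - p) + n * np.log(p)
    w = np.exp(logpost - logpost.max())
    w /= w.sum()
    cdf = np.cumsum(w)
    mean = float(np.sum(w * grid))
    lo, hi = grid[np.searchsorted(cdf, 0.025)], grid[np.searchsorted(cdf, 0.975)]
    return mean, lo, hi

for t_true in [0.1, 0.5, 1.0]:
    n_sim = rng.binomial(N, 0.75 * (1.0 - np.exp(-alpha * t_true)), size=reps)
    out = np.array([posterior_summaries(n) for n in n_sim])
    mse = np.mean((out[:, 0] - t_true) ** 2)
    cover = np.mean((out[:, 1] <= t_true) & (t_true <= out[:, 2]))
    print(t_true, round(mse, 4), round(cover, 3))
```

Swapping ref_log_prior for the log density of a Uniform or Exponential prior reproduces the kind of comparison summarized in Figures 3 and 4.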

Overall, the reference-prior-based analyses are more robust to the true value of t when compared with the Uniform- and Exponential-prior analyses. Finally, Figures 3 and 4 show that the conditional reference prior for t and the corresponding Gamma(1/2, −λ_2) approximation yield analyses with similar frequentist properties. Therefore, the Gamma(1/2, −λ_2) prior is a good alternative for implementation in evolutionary reconstruction packages.

6. Discussion

In this work, we describe Bayesian analyses for CTMCs based on default priors. For the default prior, we focus on a conditional reference prior for the elapsed time t coupled with a proper prior for the parameters φ of the infinitesimal rate matrix Q that characterizes the CTMC. We have derived a general explicit expression for the conditional reference prior and have shown that the resulting posterior distribution is proper. We also investigate, through simulation under the JC69 model, the frequentist properties of analyses based on the conditional reference prior and on two priors previously proposed in the literature. In terms of MSE, parameter estimates based on the conditional reference prior are comparable to estimates based on the Exponential prior and are much better than estimates based on the Uniform prior. In terms of frequentist coverage, the credible interval based on the conditional reference prior is comparable to the credible interval based on the Uniform prior and is much better than the credible interval based on the Exponential prior. The lower-than-nominal coverage of the credible interval based on the Exponential prior exposes a severe underestimation of the posterior uncertainty. Therefore, when there is no prior information on t, we recommend the use of the conditional reference prior. Many Bayesian estimation packages allow for only standard distributions; in these situations, the Gamma(1/2, −λ_2) approximation should suffice.

While considering frequentist properties of Bayesian procedures may smell of heresy, those properties are indeed relevant in the evaluation of default priors. Default priors purposefully find themselves implemented in standard estimation software. Amongst evolutionary biologists, these standard programs are often employed without much consideration of their underlying modeling assumptions. Each use of the program then represents an independent experimental replication. Over many different replications, it is reassuring if the estimators possess good frequentist properties.

In our default prior construction, we have started with relatively simple CTMC models. Several extensions of these CTMCs find considerable use and warrant exploration. Amongst these directions, two are notable. Although pairwise distances are most popular in molecular biology studies, phylogenetic reconstruction of the evolutionary histories relating three or more sequences dominates in evolutionary biology. Such a history consists of multiple, correlated branch lengths. Here, Yang and Rannala (2005) explore independent Exponential priors with two fixed rates, one for internal and one for external branches, while Suchard et al. (2001) introduce a hierarchical Exponential prior by simultaneously estimating the hyperprior rate. Neither is noninformative nor considers the correlation across branches, necessitating the development of a default prior over their joint distribution.
Finally, introducing rate variation relaxes the assumption of identical distributions across sites (Yang, 1996), leading to CTMC mixture models; how to construct appropriate default priors for these mixtures remains an open question.

Acknowledgements

We thank the generous support of the Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil (grant 402010/2003-5), in fostering this collaboration. M.A.S. is an Alfred P. Sloan Research Fellow.

Appendix

Proof of Theorem 3.1

Note that

\frac{∂}{∂t} d_{ijk} = \begin{cases} 0, & k = 1, \\ λ_k d_{ijk}, & k > 1. \end{cases}   (20)

Then, the first derivative of the log-likelihood function becomes

\frac{∂}{∂t} \log f(Y | θ) = \sum_{i,j} n_{ij} \frac{ \sum_{k=2}^{s} λ_k d_{ijk} }{ \sum_{k=1}^{s} d_{ijk} },   (21)

and the second derivative takes the form

\frac{∂^2}{∂t^2} \log f(Y | θ) = \sum_{i,j} n_{ij} \frac{ \left( \sum_{k=1}^{s} d_{ijk} \right) \left( \sum_{k=2}^{s} λ_k^2 d_{ijk} \right) - \left( \sum_{k=2}^{s} λ_k d_{ijk} \right)^2 }{ \left( \sum_{k=1}^{s} d_{ijk} \right)^2 }.   (22)

Using the fact that E(n_{ij}) = N \sum_{k=1}^{s} d_{ijk}, the expected Fisher information is

I(t | φ) = N \sum_{i,j} \frac{ \left( \sum_{k=2}^{s} λ_k d_{ijk} \right)^2 - \left( \sum_{k=1}^{s} d_{ijk} \right) \left( \sum_{k=2}^{s} λ_k^2 d_{ijk} \right) }{ \sum_{k=1}^{s} d_{ijk} }.   (23)

Therefore, the conditional reference prior of the elapsed time t for the GT model is

π_r(t | φ) ∝ \sqrt{ \sum_{i,j} \frac{ \left( \sum_{k=2}^{s} λ_k r_{ik} e^{λ_k t} c_{kj} \right)^2 - \left( \sum_{k=1}^{s} r_{ik} e^{λ_k t} c_{kj} \right) \left( \sum_{k=2}^{s} λ_k^2 r_{ik} e^{λ_k t} c_{kj} \right) }{ \sum_{k=1}^{s} r_{ik} e^{λ_k t} c_{kj} } }.   (24)

Proof of Proposition 2.1

We may write the data likelihood function as

f(Y | θ) = \prod_{i,j} \left( \sum_{k} r_{ik} e^{λ_k t} c_{kj} \right)^{n_{ij}}.   (25)

(i) Recall that B = {r_{ik}}, B^{-1} = {c_{kj}}, and B B^{-1} is an identity matrix. Then, if n_{ij} = 0 for all i ≠ j,

\lim_{t → 0} f(Y | θ) = \lim_{t → 0} \prod_{i} \left( \sum_{k} r_{ik} e^{λ_k t} c_{ki} \right)^{n_{ii}} = \prod_{i} \left( \sum_{k} r_{ik} c_{ki} \right)^{n_{ii}} = 1.   (26)

Conversely, if \sum_{i ≠ j} n_{ij} > 0, then

\lim_{t → 0} f(Y | θ) = \lim_{t → 0} \prod_{i ≠ j} \left( \sum_{k} r_{ik} e^{λ_k t} c_{kj} \right)^{n_{ij}} = \prod_{i ≠ j} \left( \sum_{k} r_{ik} c_{kj} \right)^{n_{ij}} = 0.   (27)

(ii) The tail behavior as t → ∞ follows directly from the continuity of P_{ij}(t) in Equation (3) and by considering the joint distribution of the data at stationarity, i.e. that the chain's initial state is drawn from π.

(iii) If \sum_{i ≠ j} n_{ij} > 0, then as t → 0,

f(Y | θ) ≈ \prod_{i ≠ j} \left( \sum_{k} r_{ik} e^{λ_k t} c_{kj} \right)^{n_{ij}} = \prod_{i ≠ j} \left( \sum_{k} r_{ik} [1 + λ_k t + O(t^2)] c_{kj} \right)^{n_{ij}} = \prod_{i ≠ j} \left[ q_{ij} t + O(t^2) \right]^{n_{ij}} ≈ t^{\sum_{i ≠ j} n_{ij}} \prod_{i ≠ j} q_{ij}^{n_{ij}}.   (28)

Proof of Corollary 3.1

From Proposition 2.1(iii), for small t the likelihood function behaves as t^{\sum_{i ≠ j} n_{ij}} as a function of t, so that the expected Fisher information (23) grows like 1/t. Therefore, when t → 0, the conditional reference prior of the elapsed time t for the GT model behaves as 1/\sqrt{t}.

We now consider the behavior of the conditional reference prior for t → ∞. In this case, the first two terms in \sum_{k=1}^{s} r_{ik} e^{λ_k t} c_{kj} will dominate the sum, so we approximate it by π_j + r_{i2} e^{λ_2 t} c_{2j}. Applying this approximation, the conditional reference prior satisfies

π_r(t | φ) ∝ g(t | φ) ≈ \sqrt{ \sum_{i,j} \frac{ \left( λ_2 r_{i2} e^{λ_2 t} c_{2j} \right)^2 - \left( π_j + r_{i2} e^{λ_2 t} c_{2j} \right) λ_2^2 r_{i2} e^{λ_2 t} c_{2j} }{ π_j + r_{i2} e^{λ_2 t} c_{2j} } } = |λ_2| \sqrt{ - \sum_{i,j} \frac{ π_j r_{i2} e^{λ_2 t} c_{2j} }{ π_j + r_{i2} e^{λ_2 t} c_{2j} } } = |λ_2| \sqrt{ - \frac{ \sum_{i,j} π_j r_{i2} e^{λ_2 t} c_{2j} \prod_{(l,m) ≠ (i,j)} \left( π_m + r_{l2} e^{λ_2 t} c_{2m} \right) }{ \prod_{i,j} \left( π_j + r_{i2} e^{λ_2 t} c_{2j} \right) } } = O(e^{λ_2 t}),

where the last step substitutes π_j + r_{i2} e^{λ_2 t} c_{2j} ≈ π_j for large t in the denominator and eliminates the leading e^{λ_2 t} term in the numerator through \sum_{i,j} r_{i2} c_{2j} = 0. The latter equality follows from

\sum_{j} r_{i2} c_{2j} = r_{i2} (c_{21}, \ldots, c_{2s}) (1, \ldots, 1)^T = r_{i2} (c_{21}, \ldots, c_{2s}) (r_{11}, \ldots, r_{s1})^T = 0, as B^{-1} B = I.   (29)
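The derivative formulas above are easy to check numerically. The following sketch is ours, using an assumed toy rate matrix and hypothetical counts; it compares Equation (22) with a finite-difference second derivative of the log-likelihood.

```python
import numpy as np

# Finite-difference check of Equation (22) for an assumed toy reversible Q and counts.
Q = np.array([[-1.0, 0.6, 0.2, 0.2],
              [0.6, -1.0, 0.2, 0.2],
              [0.2, 0.2, -1.0, 0.6],
              [0.2, 0.2, 0.6, -1.0]])
counts = np.array([[20, 4, 3, 3],
                   [5, 18, 2, 4],
                   [2, 3, 22, 5],
                   [4, 2, 3, 20]])
lam, B = np.linalg.eig(Q)
order = np.argsort(lam)[::-1]
lam, B = lam[order], B[:, order]
Binv = np.linalg.inv(B)

def d(t):
    """d[i, j, k] = r_ik exp(lambda_k t) c_kj."""
    return B[:, None, :] * np.exp(lam * t) * Binv.T[None, :, :]

def log_lik(t):
    return float(np.sum(counts * np.log(d(t).sum(axis=2))))

def second_derivative(t):
    """Equation (22), written in terms of d_ijk and lambda_k."""
    dk = d(t)
    p = dk.sum(axis=2)
    dp = (dk * lam).sum(axis=2)        # lambda_1 = 0, so the k = 1 term vanishes
    d2p = (dk * lam ** 2).sum(axis=2)
    return float(np.sum(counts * (p * d2p - dp ** 2) / p ** 2))

t, h = 0.7, 1e-4
fd = (log_lik(t + h) - 2 * log_lik(t) + log_lik(t - h)) / h ** 2
print(second_derivative(t), fd)        # the two values should agree closely
```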

References

Berger, J. O. and Bernardo, J. M. (1989). Estimating a product of means: Bayesian analysis with reference priors. Journal of the American Statistical Association, 84, 200-207.

Berger, J. O. and Bernardo, J. M. (1992). On the development of the reference prior method. In Bayesian Statistics IV, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, 35-60. Oxford: Oxford University Press.

Berger, J. O., de Oliveira, V., and Sansó, B. (2001). Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association, 96, 1361-1374.

Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). Journal of the Royal Statistical Society, Series B, 41, 113-147.

Dayhoff, M., Eck, R., and Park, C. (1972). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5, 89-99. Washington, DC: National Biomedical Research Foundation.

Felsenstein, J. (1995). PHYLIP (Phylogenetic Inference Package), Version 3.57. Seattle, WA: Distributed by the author, Department of Genetics, University of Washington.

Geweke, J., Marshall, R., and Zarkin, G. (1986). Exact inference for continuous time Markov chain models. Review of Economic Studies, 53, 653-669.

Gray, R. and Atkinson, Q. (2003). Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426, 435-439.

Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22, 160-174.

Huelsenbeck, J. and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogeny. Bioinformatics, 17, 754-755.

Jeffreys, H. (1961). Theory of Probability, 3rd edition. London: Oxford University Press.

Jones, D., Taylor, W., and Thornton, J. (1992). The rapid generation of mutation data matrices from protein sequences. CABIOS, 8, 275-282.

Jukes, T. and Cantor, C. (1969). Evolution of protein molecules. In Mammalian Protein Metabolism, ed. H. Munro, 21-132. New York: Academic Press.

Kalbfleisch, J. and Lawless, J. (1985). The analysis of panel data under a Markov assumption. Journal of the American Statistical Association, 80, 863-871.

Kimura, M. (1980). A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16, 111-120.

Leitner, T., Kumar, S., and Albert, J. (1997). Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. Journal of Virology, 71, 4761-4770.

Li, S., Pearl, D., and Doss, H. (2000). Phylogenetic tree construction using Markov chain Monte Carlo. Journal of the American Statistical Association, 95, 493-508.

Ronquist, F. and Huelsenbeck, J. (2003). MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19, 1572-1574.

Rosenberg, M., Subramanian, S., and Kumar, S. (2003). Patterns of transitional mutation biases within and among mammalian genomes. Molecular Biology and Evolution, 20, 988-993.

Suchard, M., Weiss, R., and Sinsheimer, J. (2001). Bayesian selection of continuous-time Markov chain evolutionary models. Molecular Biology and Evolution, 18, 1001-1013.

Sun, D. and Berger, J. O. (1998). Reference priors with partial information. Biometrika, 85, 55-71.

Welch, B. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society, Series B, 25, 318-329.

Whelan, S. and Goldman, N. (2001). A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular Biology and Evolution, 18, 691-699.

Yang, Z. (1996). Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology and Evolution, 11, 367-372.

Yang, Z. and Rannala, B. (2005). Branch-length prior influences Bayesian posterior probability of phylogeny. Systematic Biology, 54, 455-470.

Fig. 1. Pairwise alignment of two nucleotide sequences with continuous-time Markov chain state-space S = {A, G, C, T}. One homologous site is illustrated by a shaded box; sites are independent and identically distributed along the entire alignment. The observed data Y consist of the counts n_{ij} of the character i ∈ S in a homologous site in sequence #1 ending as character j ∈ S in sequence #2. For example, n_{AA} = 2 for the shown sites.

Fig. 2. Estimates of the elapsed time between German and the most recent common ancestor of the Romance languages under six standard priors. Histograms summarize the posterior, while the priors are overlaid as dashed lines.

Fig. 3. Analyses under the Jukes-Cantor model based on the conditional reference prior (solid black line), a Gamma approximation to the conditional reference prior (dotted-dashed green line), an Exponential(10) prior (dashed red line) and a Uniform(0,10) prior (dotted blue line). The first panel compares the log prior densities. The second panel plots the relative mean square error (MSE) of the Bayesian estimates. The final panel describes the frequentist coverage of the Bayesian 95% credible intervals.

Fig. 4. Decomposition of the relative mean square error (MSE) under the Jukes-Cantor model for the conditional reference prior (solid black line), the Gamma approximation to the conditional reference prior (dotted-dashed green line), the Exponential(10) prior (dashed red line) and the Uniform(0,10) prior (dotted blue line). The first panel traces out the posterior estimate variance scaled by the squared true elapsed time, t², and the second panel represents the posterior estimate squared bias scaled by t².