Bayesian Analysis of Elapsed Times in Continuous-Time Markov Chains


Marco A. R. Ferreira (1), Marc A. Suchard (2,3,4)

(1) Department of Statistics, University of Missouri at Columbia, USA
(2) Department of Biomathematics and (3) Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, USA
(4) Department of Biostatistics, School of Public Health, University of California, Los Angeles, USA

Summary. We explore Bayesian analysis for continuous-time Markov chain (CTMC) models based on a conditional reference prior. For CTMC models, inference of the elapsed time between chain observations depends heavily on the rate of decay of the prior as the elapsed time increases. Moreover, improper priors on the elapsed time may lead to improper posterior distributions. In addition to the elapsed time, an infinitesimal rate matrix also characterizes the CTMC. Usually, experts have good prior knowledge about the parameters of the infinitesimal rate matrix, and thus can provide well-informed priors. We show that the use of a proper prior for the rate matrix parameters together with the conditional reference prior for the elapsed time yields a proper posterior distribution. Finally, we demonstrate that, when compared to analyses based on priors previously proposed in the literature, Bayesian analysis of the elapsed time based on the conditional reference prior possesses better frequentist properties. The conditional reference prior therefore represents a better default prior choice for widely-used estimation software.

Keywords: Conditional reference prior; phylogenetic reconstruction; prior for branch length; frequentist coverage; mean square error.

1. Introduction

Continuous-time Markov chains (CTMCs) are ubiquitous modeling tools. The chains define a stochastic process $\{Z(t) : t \ge 0\}$, where $Z(t)$ realizes one of $s$ discrete values from a state-space set $S$ as elapsed time $t$ proceeds.
The chains also satisfy Markovian behavior, such that the probability distribution of $Z(t_3)$ is independent of $Z(t_1)$ conditional on $Z(t_2)$ for $t_3 > t_2 > t_1$. Given this conditionally memoryless property, an infinitesimal rate matrix $Q$ completely governs the process, where $Q = \{q_{ij}\}$ is an $s \times s$ matrix with non-negative off-diagonal elements and rows summing to 0. The conditional probability distribution of $Z(t)$ naturally follows

$$\Pr(Z(t) = j \mid Z(0) = i) = \{e^{tQ}\}_{ij}, \qquad (1)$$

where $\{\cdot\}_{ij}$ represents the $ij$-th matrix element, matrix exponentiation takes the form $e^{A} = I + \sum_{k=1}^{\infty} A^k / k!$ and $I$ is the $s \times s$ identity matrix.

To gain intuition into the process, one can consider a graph composed of vertices, one for each state in $S$, and edges with weights $q_{ij}$ connecting vertices for $i \ne j$. Random variable $Z(t)$ then transforms into the location indicator of a particle on this graph at time $t$ as the particle drifts from vertex to vertex. At any given time, the particle first waits an Exponential amount of time, with rate equal to the sum of the edge weights leaving the particle's current state. After this waiting time, the particle jumps to its next location with probability proportional to the edge weight connecting the current and destination vertices.

Here we consider the case, often found in biology and linguistics, when we observe several replications of the pair $\{Z(0), Z(t)\}$, and the primary interest lies in inferring the elapsed time $t$. Biology and linguistics are two seemingly disparate fields that make considerable use of CTMCs. Both fields exploit CTMCs to infer evolutionary histories. In the case of biology, researchers reconstruct the histories relating molecular sequences, such as short segments of DNA, genes or entire genomes, while glottochronology aims to infer the ancestral relationships between languages as an approach to understand pre-historical human migration. The sequences in this latter example are strings of presence/absence indicators of critical words in a language. Often, analysis requires inferring only the times separating sequences on a pairwise basis and not the entire underlying history.

The first and most important step in any reconstruction is a description of how a character in one sequence relates to a possibly different character in the corresponding site of a second sequence (Figure 1). Here, CTMCs come to the rescue. Let $S$ equal the set of possible characters at a single site. For nucleotide sequences, $s = 4$, with $S$ containing adenine (A), guanine (G), cytosine (C) and thymine (T). For amino-acid sequences, $s = 20$; codon-based models have 64 states; and $s = 2$ naturally describes the glottochronology state space. Then, the two related characters at a site form realizations from a single CTMC observed at two different moments, and one typically assumes the chains at different sites are independent and identically distributed.
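To make Equation (1) concrete, the following Python sketch evaluates $e^{tQ}$ with a truncated power series, after scaling and squaring, for a four-state chain in which every off-diagonal rate equals a common value. The helper names (`mat_mul`, `mat_exp`, `equal_rate_matrix`) and the values of `alpha` and `t` are assumptions chosen for illustration; they are not taken from the original analysis.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices stored as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, t, terms=12, squarings=10):
    """Approximate exp(t*Q): Taylor series on a scaled matrix, then repeated squaring."""
    n = len(q)
    scale = t / (2 ** squarings)
    a = [[scale * q[i][j] for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity I
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term now holds a^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result

def equal_rate_matrix(s, alpha):
    """Rate matrix with q_ij = alpha for i != j and rows summing to zero."""
    return [[-(s - 1) * alpha if i == j else alpha for j in range(s)]
            for i in range(s)]

alpha, t = 1.0 / 3.0, 0.5          # hypothetical values
P = mat_exp(equal_rate_matrix(4, alpha), t)

# For this equal-rate four-state chain the diagonal has the closed form
# 1/4 + 3/4 * exp(-4 * alpha * t), which gives an independent check.
closed_form = 0.25 + 0.75 * math.exp(-4.0 * alpha * t)
print(P[0][0], closed_form)
```

Each row of `P` is a probability distribution over the destination state, so the rows should sum to one up to rounding error.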
Statistical inference reduces to estimating the elapsed time $t$ between the observed sequences and, potentially, the infinitesimal rate matrix $Q$. To attack this statistical problem, let the data $Y = (n_{11}, \ldots, n_{1s}, n_{21}, \ldots, n_{2s}, \ldots, n_{s1}, \ldots, n_{ss})$ count the observed number of transitions $n_{ij}$ between pairs of states $i, j \in S$. While up to $s(s-1)$ free parameters may characterize $Q$, one often employs structured matrices that are biologically motivated and contain far fewer free parameters $\phi \in \Phi$. This yields the complete parameter vector $\theta = (t, \phi)$.

Usually, experts have good prior knowledge about the parameters of the infinitesimal rate matrix. Experts either fix $\phi$ to empirically estimated quantities derived from large databases, such as the PAM (Dayhoff et al., 1972), JTT (Jones et al., 1992), and WAG (Whelan and Goldman, 2001) models for amino acid chains, or can provide well-informed priors. Such is not the case for the elapsed time $t$. Frequently, $t$ is the most important quantity to be estimated as, for example, when the expert wishes to estimate divergence times between molecular sequences or languages. Viewing inference of $t$ as paramount differentiates CTMC use in reconstruction from CTMCs used to analyze panel data in the social sciences and econometrics (Kalbfleisch and Lawless, 1985; Geweke et al., 1986). In these latter models, the elapsed time between observations is known and inference focuses on the infinitesimal rate matrix parameters $\phi$. Kalbfleisch and Lawless (1985) introduce maximum likelihood estimators for $\phi$ and Geweke et al. (1986) furnish Bayesian estimators under a uniform prior on rate matrices.

Neither maximum likelihood nor Bayesian estimation of the elapsed time $t$ is trivial. Assuming the chain is irreducible, as $t \to \infty$ the probability distribution on $Z(t)$ converges to the stationary distribution of $Q$. As a consequence, the data likelihood function $f(y \mid \theta)$ of the CTMC converges to a constant greater than zero.
Thus, a Bayesian analysis with a marginal improper prior for $t$ leads to a useless improper posterior distribution. Moreover, $f(y \mid \theta)$ may be strictly increasing in $t$. Consequently, with positive probability, the maximum likelihood estimate of $t$ may not exist, and the prior choice for $t$ can exert substantial influence on estimates.
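The flattening of the likelihood can be seen in a small Python sketch. It assumes a symmetric two-state chain ($s = 2$, as in the glottochronology setting) with a hypothetical common rate `a`, for which the probability that the two observed characters differ is $\tfrac{1}{2}(1 - e^{-2at})$. The site and change counts are made up for illustration.

```python
import math

def prob_differ(t, a=1.0):
    """P(characters differ after time t) for a symmetric two-state chain with rate a."""
    return 0.5 * (1.0 - math.exp(-2.0 * a * t))

def log_likelihood(t, n_sites, n_changes, a=1.0):
    """Log-likelihood of observing n_changes differing sites out of n_sites."""
    p = prob_differ(t, a)
    return n_changes * math.log(p) + (n_sites - n_changes) * math.log(1.0 - p)

times = [0.1, 0.5, 1.0, 2.0, 5.0, 10.0]
few_changes = [log_likelihood(t, 100, 30) for t in times]    # 30% of sites differ
many_changes = [log_likelihood(t, 100, 60) for t in times]   # 60% of sites differ
plateau = 100 * math.log(0.5)                                # limit as t grows

for t, lo, hi in zip(times, few_changes, many_changes):
    print(f"t={t:5.1f}  30 changes: {lo:9.3f}  60 changes: {hi:9.3f}")
print("limiting value:", plateau)
```

With 30 changes the log-likelihood peaks at a finite $t$ and then levels off at the limiting value; with 60 changes it keeps increasing toward that same limit, so no finite maximizer exists and the tail of the prior governs the tail of the posterior.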

To briefly demonstrate the poor tail behavior of the likelihood function, we consider a subset of glottochronology language data analyzed in Gray and Atkinson (2003). These data are binary characters indicating the presence/absence of cognates in Indo-European languages. Cognates are words that share a common origin in a predecessor language. One difficulty in these data is estimating the divergence time of 16 Romance languages from Germanic languages, represented by Modern German. As German is the outgroup among the data, this distance becomes the evolutionary tree height. Figure 2 plots posterior histograms of the tree height under six different prior choices (note that the scales are different for each plot). Both Exponential and Uniform priors are standard in MrBayes (Huelsenbeck and Ronquist, 2001; Ronquist and Huelsenbeck, 2003), a popular Bayesian sampler for evolutionary reconstruction problems. As the likelihood function flattens out, little information is available in the data and the prior dictates the right-tail behavior of the tree height. This seriously affects Bayesian estimation and can lead to disastrous consequences in model selection problems (Suchard et al., 2001).

Even in moderate-sized phylogenetic problems, the elicitation of a joint prior distribution on the parameters $\theta$ is practically infeasible. Thus, automatic or semi-automatic methods may find use. An attractive strategy to overcome these difficulties, and one that has seen significant development in recent years, is objective Bayesian analysis, by which we mean the use of default priors derived by formal rules that use the structure of the problem at hand but do not require subjective prior elicitation. Two of the most popular of these methods are the Jeffreys prior (Jeffreys, 1961) and the reference prior (Bernardo, 1979; Berger and Bernardo, 1989, 1992).
In this paper, we develop a default prior for the elapsed time $t$ of CTMC models for use when little or no prior information is available on $t$ but there is prior information on the parameters of the infinitesimal transition matrix. Specifically, we derive an explicit expression for the conditional reference prior (Sun and Berger, 1998) on $t$ and establish the propriety of the resulting posterior distribution. We also study the small-sample frequentist properties of Bayesian procedures based on the conditional reference prior and compare these properties to results obtained under alternative priors currently used by evolutionary reconstruction practitioners.

2. General Time Model

We begin the development of a default prior by considering the CTMC model in its most general, irreducible form, in which $\phi$ contains $s(s-1)$ parameters. We refer to this as the general time (GT) model. Later we return to more restricted models, such as the Jukes-Cantor (Jukes and Cantor, 1969, JC69) and Kimura (Kimura, 1980, K80) models commonly employed in evolutionary reconstruction.

To proceed, we first require an understanding of the tail behavior of the data likelihood $f(y \mid \theta)$. Consider the spectral decomposition $Q = B(\phi)\Lambda(\phi)B^{-1}(\phi)$, where $\Lambda(\phi)$ is the diagonal matrix of eigenvalues and the columns of $B(\phi)$ are the corresponding right-eigenvectors of $Q$. To simplify notation and ease exposition, we drop the implicit dependence of $\Lambda$ and $B$ on $\phi$, order the eigenvalues in decreasing order, such that

$$0 = \lambda_1 > \lambda_2 \ge \cdots \ge \lambda_s, \qquad (2)$$

and write $B = \{r_{ik}\}$ and $B^{-1} = \{c_{kj}\}$. The largest eigenvalue $\lambda_1$ is equal to 0 because the row sums of $Q$ are all 0. The largest eigenvalue's corresponding right-eigenvector is $(r_{11}, \ldots, r_{s1})' = (1, \ldots, 1)'$, and we refer to the largest eigenvalue's left-eigenvector $(c_{11}, \ldots, c_{1s})$ as a stationary distribution $\pi$ of the CTMC. CTMCs are trivially aperiodic, and considering only irreducible chains enforces that $\pi$ is unique through the Ergodic theorem. Since $\pi$ is unique, the remaining eigenvalues are strictly less than $\lambda_1$, i.e. negative.

Using the spectral decomposition, we re-write the conditional probability distribution on $Z(t)$ given in Equation (1), such that the probability of transitioning from state $i$ to state $j$ in time $t$ is

$$P_{ij}(t) = \{B e^{\Lambda t} B^{-1}\}_{ij} = \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} = \sum_{k=1}^{s} d_{ijk}, \qquad (3)$$

where $d_{ijk} = r_{ik} e^{\lambda_k t} c_{kj}$. Given the structure of the eigenvalues in Equation (2) and taking $t \to \infty$, $P_{ij}(t)$ converges to the stationary distribution $c_{1j} = \pi_j$, which does not depend on the starting state $i$. Note that $\pi$ implicitly depends on $\phi$. Returning to the data $Y$, the log-likelihood function becomes

$$\log f(y \mid \theta) = \sum_{i,j} n_{ij} \log\left( \sum_{k=1}^{s} d_{ijk} \right). \qquad (4)$$

Proposition 2.1. For the GT model, the log-likelihood function (4) is a continuous function on $[0, \infty)$ satisfying:

(i) $$\lim_{t \to 0^+} f(y \mid \theta) = \begin{cases} 1, & \text{if } n_{ij} = 0 \ \forall\, i \ne j, \\ 0, & \text{otherwise;} \end{cases} \qquad (5)$$

(ii) $$\lim_{t \to \infty} \log f(y \mid \theta) = \sum_{i,j} n_{ij} \left( \log \pi_i + \log \pi_j \right); \text{ and} \qquad (6)$$

(iii) for $\sum_{i \ne j} n_{ij} > 0$ and small $t$,
$$f(y \mid \theta) \approx t^{\sum_{i \ne j} n_{ij}} \prod_{i \ne j} q_{ij}^{n_{ij}}. \qquad (7)$$

Proof. See Appendix.

3. Conditional reference prior

There are at least two possible forms of noninformative priors for the CTMC parameters $\theta = (t, \phi)$. The first is the joint Jeffreys-rule prior given by $\pi(\theta) \propto \sqrt{\det I(\theta)}$, where $I(\theta)$ is the Fisher information matrix. The $ij$-th entry of the Fisher information matrix is

$$\{I(\theta)\}_{ij} = E\left[ -\frac{\partial^2}{\partial \theta_i\, \partial \theta_j} \log f(y \mid \theta) \right], \quad \theta_1 = t, \ \theta_2 = \phi, \qquad (8)$$

where $E[\cdot]$ refers to an expectation with respect to the conditional distribution of $Y$ given $\theta$. The joint Jeffreys prior for $\theta$ could be a reasonable choice if prior information were available on neither $t$ nor $\phi$, but usually there exists prior biological information on $\phi$.

Typically, no prior information on $t$ is available, while expert opinion remains quite strong regarding reasonable values for $\phi$. As discussed in Section 1 and illustrated in Figure 2, posterior inference on $t$ can be highly influenced by the prior on $t$. Consequently, we wish to incorporate prior information about $\phi$ and, at the same time, use a noninformative prior for $t$. This leads us to a conditional reference prior for $t$ (Sun and Berger, 1998).

Let $\pi(\phi)$, with $\int_{\Phi} \pi(\phi)\, d\phi = 1$, characterize the prior information on the infinitesimal rate matrix parameter $\phi$. Following Sun and Berger (1998), the conditional reference prior for $t$ is $\pi^r(t \mid \phi) \propto \sqrt{I(t \mid \phi)}$, where

$$I(t \mid \phi) = E\left[ -\frac{\partial^2}{\partial t^2} \log f(y \mid \theta) \right]. \qquad (9)$$

Then, the joint prior density for $\theta$ becomes

$$\pi(\phi)\, \pi^r(t \mid \phi). \qquad (10)$$

Theorem 3.1. Starting from the GT model with log-likelihood shown in Equation (4), the conditional reference prior for the elapsed time $t$ given $\phi$ is

$$\pi^r(t \mid \phi) \propto g(t \mid \phi) = \sqrt{ \sum_{i,j} \frac{ \left( \sum_{k=2}^{s} \lambda_k r_{ik} e^{\lambda_k t} c_{kj} \right)^2 - \left( \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} \right) \left( \sum_{k=2}^{s} \lambda_k^2 r_{ik} e^{\lambda_k t} c_{kj} \right) }{ \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} } }, \qquad (11)$$

where we remind readers that $\lambda_k$, $r_{ik}$ and $c_{kj}$ all implicitly depend on $\phi$.

Proof. See Appendix.

An interesting feature of the unnormalized conditional reference prior $g(t \mid \phi)$ given in Equation (11) is that its behavior close to zero is independent of the infinitesimal rate matrix $Q$ and its behavior for large $t$ depends only on the second largest eigenvalue of $Q$. The following corollary describes the tail behavior of $g(t \mid \phi)$.

Corollary 3.1. The unnormalized conditional reference prior of $t$ given $\phi$ shown in Equation (11) is a non-negative, continuous function on $(0, \infty)$ that satisfies:

(i) $$g(t \mid \phi) = O\left( \frac{1}{\sqrt{t}} \right) \text{ as } t \to 0, \text{ and} \qquad (12)$$

(ii) $$g(t \mid \phi) = O\left( e^{\lambda_2 t} \right) \text{ as } t \to \infty. \qquad (13)$$

Proof. See Appendix.

As a consequence of Corollary 3.1 and recalling that $\lambda_2 < 0$, the conditional reference prior $\pi^r(t \mid \phi)$ is proper.
Therefore, there are no normalization concerns (Sun and Berger, 1998), and the conditional reference prior can be written as

$$\pi^r(t \mid \phi) = K(\phi)\, g(t \mid \phi), \qquad (14)$$

where the normalizing constant $K(\phi) = \left\{ \int_0^{\infty} g(t \mid \phi)\, dt \right\}^{-1}$ can be easily computed using standard one-dimensional numerical integration.
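As a minimal sketch of that one-dimensional integration, the Python code below computes $K(\phi)$ for a symmetric two-state chain with a hypothetical rate `a`. Two choices here are assumptions of the sketch rather than the authors' procedure. First, $g$ is evaluated as $\sqrt{\sum_{i,j} P'_{ij}(t)^2 / P_{ij}(t)}$, which equals the Theorem 3.1 expression because the second-derivative terms cancel when each row of $P(t)$ sums to one. Second, the substitution $t = u^2$ is used with a midpoint rule so that the $t^{-1/2}$ behavior near zero causes no difficulty.

```python
import math

def g(t, a):
    """Unnormalized conditional reference prior for a symmetric two-state chain."""
    e = math.exp(-2.0 * a * t)
    p_same = 0.5 * (1.0 + e)
    p_diff = -0.5 * math.expm1(-2.0 * a * t)   # 0.5 * (1 - e), accurate for tiny t
    dp = a * e                                  # |dP/dt| is the same for every entry
    # Two diagonal and two off-diagonal entries contribute to the sum over (i, j).
    return math.sqrt(2.0 * dp * dp / p_same + 2.0 * dp * dp / p_diff)

def normalizing_constant(a, u_max=6.0, n=20000):
    """K = 1 / integral of g over (0, inf), using t = u**2 and the midpoint rule."""
    h = u_max / n
    integral = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        integral += g(u * u, a) * 2.0 * u * h   # dt = 2 u du
    return 1.0 / integral

K = normalizing_constant(a=1.0)
print("K =", K)
```

The upper limit `u_max` truncates the integral where the integrand is already negligible for the rates used here; a chain with a much smaller rate would need a larger limit.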

Corollary 3.2. The joint prior distribution for $\theta$ induced by the product of the informative prior distribution $\pi(\phi)$ and the conditional reference prior $\pi^r(t \mid \phi)$ given in Equation (11) yields a proper posterior density.

Proof. This follows directly from the fact that $\pi(\phi)$ and $\pi^r(t \mid \phi)$ are proper probability measures.

Given the tail behavior described in Corollary 3.1, a Gamma$(1/2, -\lambda_2)$ distribution (shape $1/2$, rate $-\lambda_2$) can serve as a reasonable approximation of $\pi^r(t \mid \phi)$ for implementation in software that allows only standard distributions. A comparison of the performance of Bayesian procedures based on the conditional reference prior and its Gamma$(1/2, -\lambda_2)$ approximation is presented in Section 5.

4. Commonly Employed CTMC Models

We consider two restricted cases of the GT model that find general use across binary, nucleotide and amino acid sequences. The first parameterization of $Q$ considers that all transitions of the CTMC occur with the same infinitesimal rate $\alpha$, such that $q_{ij} = \alpha$ for all $i \ne j$. Although Jukes and Cantor (1969) first endorsed this model for nucleotide substitution processes, the mathematical properties that we derive are shared with all standard models for amino acid sequences. Amino acid CTMC models follow empirically estimated infinitesimal rate matrices, resulting in the same number of free parameters as the JC69 model. We also consider the K80 (Kimura, 1980) CTMC for nucleotide sequences. This model assumes that the states of the chain divide into two disjoint sets and that rates of transition within and between sets differ. Strong expert opinion exists about the relative rates of within and between events, suggesting use of a conditional reference prior.

4.1. Reference prior for the JC69 model

Under an equal rates model, sufficient statistics of the data $Y$ are the total number of sites $N$ and the number of observed changes $n = \sum_{i \ne j} n_{ij}$.
Using these sufficient statistics, the likelihood function under the JC69 model becomes

$$f(N, n \mid t, \alpha) = \left[ \frac{1}{4} + \frac{3}{4} e^{-4\alpha t} \right]^{N-n} \left[ \frac{3}{4} \left( 1 - e^{-4\alpha t} \right) \right]^{n}. \qquad (15)$$

As the likelihood function provides information only for the product $\alpha t$, the infinitesimal rate $\alpha$ is fixed a priori. Different choices of $\alpha$ lead to differing scalings of $t$. For example, if $\alpha = 1/3$ (the usual choice among phylogeneticists) then $t$ scales in terms of the expected number of changes per site given the chain starts at stationarity. To appreciate this, we count the expected number of changes that occur for $t \in [0, 1)$, $\sum_i \pi_i(-q_{ii}) = 1$, where $\pi_i = 1/4$ under the JC69 model.

Under the JC69 model, a maximum likelihood estimate of $t$ does not exist for $n/N > 3/4$ and a Bayesian approach with a poor prior choice may not fare any better. As $t \to \infty$, the distribution of the number of observed changes between sequences converges to a Binomial distribution with sample size $N$ and probability of success $3/4$. Likewise, the data likelihood converges to a positive constant. This behavior causes major problems for Bayesian estimation of $t$, as the inference will depend heavily on the tail behavior of the prior for $t$. In particular, improper priors will lead to useless improper posterior distributions, and inference based on truncated uniform priors depends heavily on where the prior is truncated.

From Theorem 3.1, the reference prior for the elapsed time $t$ under the JC69 model is

$$\pi^r(t \mid \phi) \propto \sqrt{I(t \mid \phi)} \propto \frac{e^{-4\alpha t}}{\sqrt{\left( 1 + 3e^{-4\alpha t} \right) \left( 1 - e^{-4\alpha t} \right)}}. \qquad (16)$$

As the second largest eigenvalue of the infinitesimal transition matrix $Q$ of the JC69 model is $-4\alpha$, the prior above behaves as $e^{-4\alpha t}$ for large $t$.

4.2. Reference prior for the Kimura model

The infinitesimal rate matrix $Q$ under the K80 model for nucleotides is

$$Q = \alpha \begin{pmatrix} -(\kappa + 2) & \kappa & 1 & 1 \\ \kappa & -(\kappa + 2) & 1 & 1 \\ 1 & 1 & -(\kappa + 2) & \kappa \\ 1 & 1 & \kappa & -(\kappa + 2) \end{pmatrix}, \qquad (17)$$

where we arbitrarily have ordered the states in Matrix (17) as $\{A, G, C, T\}$. Nucleotides A and G contain purine side-groups, while C and T contain pyrimidines. Purines and pyrimidines differ in the size of their aromatic hetero-cycles. Due to steric differences, CTMC jumps within groups, confusingly called transitions by evolutionary biologists, occur with infinitesimal rate $\kappa\alpha$. This rate may differ from that of changes across groups, called transversions, which occur at rate $\alpha$. Following the JC69 model formulation, one fixes $\alpha$ such that $t$ scales in terms of the expected number of changes per site. This choice implies $\alpha = (\kappa + 2)^{-1}$. Sufficient statistics of the data are, again, the total number of sites $N$ and the numbers of observed transitions $n_s$ and transversions $n_v$. The data likelihood function under the K80 model reduces to

$$f(N, n_s, n_v \mid t, \alpha, \kappa) = \left[ \frac{1}{4} + \frac{1}{4} e^{-4\alpha t} + \frac{1}{2} e^{-2\alpha(\kappa+1)t} \right]^{N - n_s - n_v} \left[ \frac{1}{2} - \frac{1}{2} e^{-4\alpha t} \right]^{n_v} \left[ \frac{1}{4} + \frac{1}{4} e^{-4\alpha t} - \frac{1}{2} e^{-2\alpha(\kappa+1)t} \right]^{n_s}. \qquad (18)$$

By Theorem 3.1, the conditional reference prior for $t$ under the K80 model is

$$\pi^r(t \mid \phi) \propto e^{2\alpha t} \sqrt{ \frac{ 2e^{4\alpha(\kappa+1)t} + (\kappa+1)^2 e^{8\alpha t} - 4(\kappa+1) e^{4\alpha t} + 2e^{4\alpha\kappa t} - (\kappa-1)^2 }{ \left( e^{4\alpha t} - 1 \right) \left( e^{2\alpha(\kappa+3)t} + e^{2\alpha(\kappa+1)t} - 2e^{4\alpha t} \right) \left( e^{2\alpha(\kappa+3)t} + e^{2\alpha(\kappa+1)t} + 2e^{4\alpha t} \right) } }. \qquad (19)$$

For $\kappa = 1$, Equation (19) reduces to the JC69 prior in Equation (16). From Corollary 3.1, and as the second largest eigenvalue of $Q$ under the K80 model equals $\max\{-4\alpha, -2\alpha(\kappa+1)\}$, $\pi^r(t \mid \phi)$ is approximately proportional to $\exp[\max\{-4\alpha, -2\alpha(\kappa+1)\}\, t]$ for large $t$.

Usually, there is strong expert opinion about $\kappa$. For example, fixing $\kappa = 2$ regularly occurs in phylogenetic software (Felsenstein, 1995), while estimates of $\kappa$ range from as low as 1.4 in regions of the human immunodeficiency virus (Leitner et al., 1997) to a median of approximately 4 with variance of 10 across mammalian gene sequences (Rosenberg et al., 2003). Taking $\pi(\kappa)$ as a log-normal density fits these observations well.

For richer infinitesimal rate matrix parameterization, the most popular CTMC for nucleotides is arguably the HKY85 (Hasegawa et al., 1985) model. This model extends the K80 chain by allowing for a non-uniform stationary distribution $\pi$. While varying $\pi$ affects the eigenvectors in $B$, it does not change the eigenvalues. Commonly one fixes $\pi$ to its empirically observed estimates (Li et al., 2000), as the maximum likelihood estimates rarely differ by an appreciable amount. Then $\kappa$ remains the only free parameter and all properties for the K80 model hold under HKY85 assumptions.

5. Frequentist properties

The study of frequentist properties of Bayesian procedures has been proposed as one way to evaluate default priors (Berger et al., 2001, and references therein). In this section, we carry out a simulation study to examine the frequentist properties of Bayesian procedures based on the conditional reference prior and on priors previously proposed in the literature for the elapsed time $t$. The frequentist properties considered here are mean squared error (MSE) of parameter estimates and frequentist coverage of 95% credible intervals for the elapsed time $t$.

We compare the analyses based on the conditional reference prior and its Gamma$(1/2, -\lambda_2)$ approximation with analyses under two priors previously proposed in the literature: a Uniform(0,10) as implemented in MrBayes version 2 (Huelsenbeck and Ronquist, 2001) and an Exponential(10) with mean $10^{-1}$ as implemented in MrBayes version 3 (Ronquist and Huelsenbeck, 2003). We simulate data under the JC69 model discussed in Section 4.1 with the parameter $t$ equal to one of 100 values ranging from 0.01 to 1. For each parameter value, we simulate 1000 datasets and from each dataset we compute the posterior mean and the 95% credible interval using the reference, Gamma, Uniform and Exponential priors. We then compute the (estimated) MSE of the Bayesian estimators and the (estimated) frequentist coverage of the equal-tailed 95% credible intervals.
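The simulation design can be sketched in Python as follows. This is an illustrative reduction, not the authors' code. It assumes $\alpha = 1/3$ and a hypothetical alignment length of 500 sites (the text does not state the number of sites). It approximates each posterior on a fixed grid over $(0, 10)$, uses far fewer replicates than the 1000 described above, and writes the reference prior in the closed form of Equation (16) with rate $4\alpha$. All function names are invented for the example.

```python
import math
import random

ALPHA = 1.0 / 3.0            # t measured in expected changes per site
RATE = 4.0 * ALPHA           # -lambda_2 for the equal-rate four-state chain
GRID = [(k + 0.5) * 0.005 for k in range(2000)]   # midpoints covering (0, 10)

def prob_change(t):
    """Probability that a site differs between the two sequences under JC69."""
    return 0.75 * (1.0 - math.exp(-RATE * t))

def log_reference_prior(t):
    e = math.exp(-RATE * t)
    return -RATE * t - 0.5 * math.log((1.0 + 3.0 * e) * (1.0 - e))

def log_exponential_prior(t, rate=10.0):
    return -rate * t

def posterior_summary(n_changes, n_sites, log_prior, grid=GRID):
    """Posterior mean and equal-tailed 95% interval of t, approximated on a grid."""
    logw = []
    for t in grid:
        p = prob_change(t)
        logw.append(log_prior(t) + n_changes * math.log(p)
                    + (n_sites - n_changes) * math.log(1.0 - p))
    top = max(logw)
    w = [math.exp(v - top) for v in logw]
    total = sum(w)
    mean = sum(t * x for t, x in zip(grid, w)) / total
    cum, lower, upper, have_lower = 0.0, grid[0], grid[-1], False
    for t, x in zip(grid, w):
        cum += x / total
        if not have_lower and cum >= 0.025:
            lower, have_lower = t, True
        if cum >= 0.975:
            upper = t
            break
    return mean, lower, upper

def coverage_and_mse(true_t, log_prior, n_sites=500, reps=200, seed=1):
    """Estimated interval coverage and MSE of the posterior mean at one true t."""
    rng = random.Random(seed)
    p = prob_change(true_t)
    hits, sq_err = 0, 0.0
    for _ in range(reps):
        n_changes = sum(rng.random() < p for _ in range(n_sites))
        mean, lower, upper = posterior_summary(n_changes, n_sites, log_prior)
        hits += lower <= true_t <= upper
        sq_err += (mean - true_t) ** 2
    return hits / reps, sq_err / reps

if __name__ == "__main__":
    for name, prior in (("reference", log_reference_prior),
                        ("exponential(10)", log_exponential_prior)):
        print(name, coverage_and_mse(0.5, prior, reps=100))
```

Repeating the call over a range of true values of $t$, and adding the Uniform and Gamma priors as further `log_prior` functions, would produce curves of the kind summarized in the figures.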
Figure 3 shows the relative MSE of the Bayesian estimates and the frequentist coverage of 95% credible intervals as a function of the true value of $t$. In the range of values considered for $t$, the estimation based on the Exponential(10) prior is slightly better in terms of relative MSE than the estimation based on the reference prior. The performance of estimation based on the Uniform(0,10) prior deteriorates very fast for values of $t$ larger than 0.7. In terms of frequentist coverage, for all considered values of $t$ the reference prior yields credible intervals with coverage close to nominal. This near-nominal coverage is consistent with Welch and Peers (1963), who demonstrate that a univariate Jeffreys prior can serve as a first-order probability matching prior. The Uniform prior yields credible intervals with coverage slightly below the nominal value, whereas the coverage of the credible intervals induced by the Exponential prior quickly drops below the nominal value as the true value of $t$ increases.

A decomposition of the relative MSE in terms of variance and squared bias (Figure 4) sheds light on the different behaviors of analyses based on the Exponential(10) and reference priors. For the Exponential(10) prior, the frequentist variance of the posterior mean is fairly insensitive to the true value of $t$, whereas bias increases fairly fast as a function of $t$. Conversely, for the reference prior, the frequentist variance of the estimate increases with the true value of $t$, whereas the bias remains close to zero. When variance and squared bias are combined into the MSE, both analyses have similar performance. Nevertheless, the poor frequentist coverage of the Exponential-prior-based analyses results from the fact that the prior strongly favors small values of $t$. Thus, when no prior information is available we recommend against using the Exponential(10) prior. Overall, the reference-prior-based analyses are more robust to the true value of $t$ when compared with the Uniform and Exponential prior analyses.

Finally, Figures 3 and 4 show that the conditional reference prior for $t$ and the corresponding Gamma$(1/2, -\lambda_2)$ approximation yield analyses with similar frequentist properties. Therefore, the Gamma$(1/2, -\lambda_2)$ prior is a good alternative for implementation in evolutionary reconstruction packages.

6. Discussion

In this work, we describe Bayesian analyses for CTMCs based on default priors. For the default prior, we focus on a conditional reference prior for the elapsed time $t$ coupled with a proper prior for the parameters $\phi$ of the infinitesimal rate matrix $Q$ that characterizes the CTMC. We have derived a general explicit expression for the conditional reference prior and have shown that the resulting posterior distribution is proper. We also investigate, through simulation under the JC69 model, the frequentist properties of analyses based on the conditional reference prior and on two priors previously proposed in the literature. In terms of MSE, parameter estimates based on the conditional reference prior are comparable to estimates based on the Exponential prior and are much better than estimates based on the Uniform prior. In terms of frequentist coverage, the credible interval based on the conditional reference prior is comparable to the credible interval based on the Uniform prior and is much better than the credible interval based on the Exponential prior. The lower than nominal coverage of the credible interval based on the Exponential prior exposes a severe underestimation of the posterior uncertainty. Therefore, when there is no prior information on $t$, we recommend the use of the conditional reference prior. Many Bayesian estimation packages allow for only standard distributions; in these situations, the Gamma$(1/2, -\lambda_2)$ prior should suffice.
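For readers who want to see how close the two densities are, the Python sketch below tabulates the normalized reference prior next to a Gamma density with shape $1/2$ and rate $-\lambda_2$. It assumes the JC69 chain with $\alpha = 1/3$, so that $-\lambda_2 = 4\alpha$, and the closed form of Equation (16). The midpoint-rule integrator with the substitution $t = u^2$ is a convenience chosen for this example.

```python
import math

ALPHA = 1.0 / 3.0
RATE = 4.0 * ALPHA     # -lambda_2 under the equal-rate four-state chain

def g_reference(t):
    """Unnormalized JC69 conditional reference prior."""
    e = math.exp(-RATE * t)
    one_minus_e = -math.expm1(-RATE * t)      # 1 - e, accurate for tiny t
    return e / math.sqrt((1.0 + 3.0 * e) * one_minus_e)

def gamma_half_density(t, rate=RATE):
    """Gamma(shape=1/2, rate) density: sqrt(rate/pi) * t**(-1/2) * exp(-rate*t)."""
    return math.sqrt(rate / math.pi) * math.exp(-rate * t) / math.sqrt(t)

def integrate(f, u_max=8.0, n=20000):
    """Integral of f over (0, inf) via t = u**2 and the midpoint rule."""
    h = u_max / n
    total = 0.0
    for k in range(n):
        u = (k + 0.5) * h
        total += f(u * u) * 2.0 * u * h       # dt = 2 u du
    return total

K = 1.0 / integrate(g_reference)              # normalizing constant of Equation (14)

for t in (0.01, 0.1, 0.5, 1.0, 2.0):
    print(f"t={t:4.2f}  reference={K * g_reference(t):8.4f}  "
          f"gamma={gamma_half_density(t):8.4f}")
```

Both densities share the $t^{-1/2}$ spike at zero and the exponential decay rate, which is the sense in which the Gamma prior approximates the reference prior; they are not identical.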
While considering frequentist properties of Bayesian procedures may smell of heresy, those properties are indeed relevant in the evaluation of default priors. Default priors purposefully find themselves implemented in standard estimation software. Amongst evolutionary biologists, these standard programs are often employed without much consideration of their underlying modeling assumptions. Each use of the program then represents an independent experimental replication. Over many different replications, it is reassuring that the estimators possess good frequentist properties.

In our default prior construction, we have started with relatively simple CTMC models. Several extensions of the CTMCs find considerable use and warrant exploration. Amongst these directions, two are notable. Although pairwise distances are most popular in molecular biology studies, phylogenetic reconstruction of the evolutionary histories relating three or more sequences dominates in evolutionary biology. Such a history consists of multiple, correlated branch lengths. Here, Yang and Rannala (2005) explore independent Exponential priors with two fixed rates, one for internal and one for external branches, while Suchard et al. (2001) introduce a hierarchical Exponential prior by simultaneously estimating the hyperprior rate. Neither prior is noninformative, and neither considers the correlation across branches, necessitating the development of a default prior over their joint distribution. Finally, introducing rate variation relaxes the assumption of identical distributions across sites (Yang, 1996), leading to CTMC mixture models; how to construct appropriate default priors for these mixtures remains an open question.

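Acknowledgements

We gratefully acknowledge the generous support of the Conselho Nacional de Desenvolvimento Científico e Tecnológico, Brazil, (grant /2003-5) in fostering this collaboration. M.A.S. is an Alfred P. Sloan Research Fellow.

Appendix

Proof of Theorem 3.1

Note that

$$\frac{\partial}{\partial t} d_{ijk} = \begin{cases} 0, & k = 1, \\ \lambda_k d_{ijk}, & k > 1. \end{cases} \qquad (20)$$

Then, the first derivative of the log-likelihood function becomes

$$\frac{\partial}{\partial t} \log f(y \mid \theta) = \sum_{i,j} n_{ij} \frac{\sum_{k=2}^{s} \lambda_k d_{ijk}}{\sum_{k=1}^{s} d_{ijk}}, \qquad (21)$$

and the second derivative takes the form

$$\frac{\partial^2}{\partial t^2} \log f(y \mid \theta) = \sum_{i,j} n_{ij} \frac{ \left( \sum_{k=1}^{s} d_{ijk} \right) \left( \sum_{k=2}^{s} \lambda_k^2 d_{ijk} \right) - \left( \sum_{k=2}^{s} \lambda_k d_{ijk} \right)^2 }{ \left( \sum_{k=1}^{s} d_{ijk} \right)^2 }. \qquad (22)$$

Using the fact that $E(n_{ij}) = N \sum_{k=1}^{s} d_{ijk}$, the expected Fisher information is

$$I(t \mid \phi) = N \sum_{i,j} \frac{ \left( \sum_{k=2}^{s} \lambda_k d_{ijk} \right)^2 - \left( \sum_{k=1}^{s} d_{ijk} \right) \left( \sum_{k=2}^{s} \lambda_k^2 d_{ijk} \right) }{ \sum_{k=1}^{s} d_{ijk} }. \qquad (23)$$

Therefore, the conditional reference prior of the elapsed time $t$ for the GT model is

$$\pi^r(t \mid \phi) \propto \sqrt{ \sum_{i,j} \frac{ \left( \sum_{k=2}^{s} \lambda_k r_{ik} e^{\lambda_k t} c_{kj} \right)^2 - \left( \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} \right) \left( \sum_{k=2}^{s} \lambda_k^2 r_{ik} e^{\lambda_k t} c_{kj} \right) }{ \sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj} } }. \qquad (24)$$

Proof of Proposition 2.1

We may write the data likelihood function as

$$f(y \mid \theta) = \prod_{i,j} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{kj} \right)^{n_{ij}}. \qquad (25)$$

(i) Recall that $B = \{r_{ik}\}$, $B^{-1} = \{c_{kj}\}$, and $BB^{-1}$ is an identity matrix. Then, if $n_{ij} = 0$ for all $i \ne j$,

$$\lim_{t \to 0} f(y \mid \theta) = \lim_{t \to 0} \prod_{i} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{ki} \right)^{n_{ii}} = \prod_{i} \left( \sum_{k} r_{ik} c_{ki} \right)^{n_{ii}} = 1. \qquad (26)$$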

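Conversely, if $\sum_{i \ne j} n_{ij} > 0$, then

$$\lim_{t \to 0} f(y \mid \theta) = \lim_{t \to 0} \prod_{i \ne j} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{kj} \right)^{n_{ij}} = \prod_{i \ne j} \left( \sum_{k} r_{ik} c_{kj} \right)^{n_{ij}} = 0. \qquad (27)$$

(ii) The tail behavior as $t \to \infty$ follows directly from the continuity of $P_{ij}(t)$ in Equation (3) and by considering the joint distribution of the data at stationarity, i.e. that the chain's initial state draws from $\pi$.

(iii) If $\sum_{i \ne j} n_{ij} > 0$, then as $t \to 0$

$$f(y \mid \theta) \approx \prod_{i \ne j} \left( \sum_{k} r_{ik} e^{\lambda_k t} c_{kj} \right)^{n_{ij}} = \prod_{i \ne j} \left( \sum_{k} r_{ik} \left[ 1 + \lambda_k t + O(t^2) \right] c_{kj} \right)^{n_{ij}} = \prod_{i \ne j} \left[ q_{ij} t + O(t^2) \right]^{n_{ij}} \approx t^{\sum_{i \ne j} n_{ij}} \prod_{i \ne j} q_{ij}^{n_{ij}}, \qquad (28)$$

where the third expression uses $\sum_k r_{ik} c_{kj} = 0$ for $i \ne j$ and $\sum_k r_{ik} \lambda_k c_{kj} = q_{ij}$.

Proof of Corollary 3.1

From Proposition 2.1(iii), for small $t$, the likelihood function behaves as $t^{\sum_{i \ne j} n_{ij}}$ as a function of $t$. Therefore, when $t \to 0$, the conditional reference prior of the elapsed time $t$ for the GT model will behave as $1/\sqrt{t}$.

We now consider the behavior of the conditional reference prior for $t \to \infty$. In this case, the first two terms in $\sum_{k=1}^{s} r_{ik} e^{\lambda_k t} c_{kj}$ will dominate the sum, so we approximate it by $\pi_j + r_{i2} e^{\lambda_2 t} c_{2j}$. Applying this approximation, the conditional reference prior satisfies

$$\pi^r(t \mid \phi) \propto g(t \mid \phi) \approx \sqrt{ \sum_{i,j} \frac{ \left( \lambda_2 r_{i2} e^{\lambda_2 t} c_{2j} \right)^2 - \left( \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \right) \left( \lambda_2^2 r_{i2} e^{\lambda_2 t} c_{2j} \right) }{ \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} } }$$

$$\propto \sqrt{ - \sum_{i,j} \frac{ \pi_j r_{i2} e^{\lambda_2 t} c_{2j} }{ \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} } } = \sqrt{ - \frac{ \sum_{i,j} \pi_j r_{i2} e^{\lambda_2 t} c_{2j} \prod_{(l,m) \ne (i,j)} \left( \pi_m + r_{l2} e^{\lambda_2 t} c_{2m} \right) }{ \prod_{i,j} \left( \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \right) } }$$

$$\approx \sqrt{ - \frac{ \sum_{i,j} \left[ r_{i2} c_{2j} e^{\lambda_2 t} \left( \prod_{(l,m)} \pi_m \right) + r_{i2} c_{2j} e^{2\lambda_2 t} \sum_{(l,m) \ne (i,j)} r_{l2} c_{2m} \left( \prod_{(l',m') \ne (l,m)} \pi_{m'} \right) \right] }{ \prod_{i,j} \left( \pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \right) } } = O\left( e^{\lambda_2 t} \right), \qquad (29)$$

where the last step substitutes $\pi_j + r_{i2} e^{\lambda_2 t} c_{2j} \approx \pi_j$ for large $t$ in the denominator and eliminates the leading $e^{\lambda_2 t}$ term through $\sum_{i,j} r_{i2} c_{2j} = 0$. The latter equality follows from $\sum_j r_{i2} c_{2j} = r_{i2} (c_{21}, \ldots, c_{2s})(1, \ldots, 1)' = r_{i2} (c_{21}, \ldots, c_{2s})(r_{11}, \ldots, r_{s1})' = 0$ as $B^{-1}B = I$.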
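The two limits in Corollary 3.1 can be checked numerically on a small example. The Python sketch below uses a two-state chain with hypothetical rates `a` (state 0 to 1) and `b` (state 1 to 0), for which $\lambda_2 = -(a+b)$ and $\pi = (b, a)/(a+b)$. As a simplification made for this sketch, $g$ is computed as $\sqrt{\sum_{i,j} P'_{ij}(t)^2 / P_{ij}(t)}$, which matches Equation (24) because each row of $P(t)$ sums to one.

```python
import math

def g_two_state(t, a, b):
    """Unnormalized conditional reference prior for a two-state chain."""
    lam = -(a + b)                       # second eigenvalue, lambda_2
    pi0, pi1 = b / (a + b), a / (a + b)  # stationary distribution
    e = math.exp(lam * t)
    one_minus_e = -math.expm1(lam * t)   # 1 - e, accurate for tiny t
    # Entries ordered (0,0), (0,1), (1,0), (1,1).
    p = [pi0 + pi1 * e, pi1 * one_minus_e, pi0 * one_minus_e, pi1 + pi0 * e]
    dp = [pi1 * lam * e, -pi1 * lam * e, -pi0 * lam * e, pi0 * lam * e]
    return math.sqrt(sum(d * d / q for d, q in zip(dp, p)))

a, b = 0.3, 0.7                          # hypothetical rates
lam2 = -(a + b)

# Corollary 3.1(i): g(t) * sqrt(t) should settle to a constant as t -> 0.
for t in (1e-2, 1e-4, 1e-6, 1e-8):
    print(f"t={t:.0e}  g*sqrt(t)={g_two_state(t, a, b) * math.sqrt(t):.6f}")

# Corollary 3.1(ii): g(t) * exp(-lambda_2 * t) should settle to a constant as t grows.
for t in (5.0, 10.0, 20.0):
    print(f"t={t:4.1f}  g*exp(-lam2*t)={g_two_state(t, a, b) * math.exp(-lam2 * t):.6f}")
```

Both printed columns stabilize, in line with the $O(1/\sqrt{t})$ and $O(e^{\lambda_2 t})$ rates stated in the corollary.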

References

Berger, J. O. and Bernardo, J. M. (1989). Estimating a product of means: Bayesian analysis with reference priors. Journal of the American Statistical Association, 84.

Berger, J. O. and Bernardo, J. M. (1992). On the development of the reference prior method. In Bayesian Statistics IV, eds. J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith. Oxford: Oxford University Press.

Berger, J. O., de Oliveira, V., and Sansó, B. (2001). Objective Bayesian analysis of spatially correlated data. Journal of the American Statistical Association, 96(456).

Bernardo, J. M. (1979). Reference posterior distributions for Bayesian inference (with discussion). Journal of the Royal Statistical Society, Series B, 41.

Dayhoff, M., Eck, R., and Park, C. (1972). A model of evolutionary change in proteins. In Atlas of Protein Sequence and Structure, vol. 5. Washington, DC: National Biomedical Research Foundation.

Felsenstein, J. (1995). PHYLIP (Phylogenetic Inference Package), Version Seattle, WA: Distributed by the author. Department of Genetics, University of Washington.

Geweke, J., Marshall, R., and Zarkin, G. (1986). Exact inference for continuous time Markov chain models. Review of Economic Studies, 53.

Gray, R. and Atkinson, Q. (2003). Language-tree divergence times support the Anatolian theory of Indo-European origin. Nature, 426.

Hasegawa, M., Kishino, H., and Yano, T. (1985). Dating the human-ape splitting by a molecular clock of mitochondrial DNA. Journal of Molecular Evolution, 22.

Huelsenbeck, J. and Ronquist, F. (2001). MRBAYES: Bayesian inference of phylogeny. Bioinformatics, 17.

Jeffreys, H. (1961). Theory of Probability, 3rd edition. London: Oxford University Press.

Jones, D., Taylor, W., and Thornton, J. (1992). The rapid generation of mutation data matrices from protein sequences. CABIOS, 8.

Jukes, T. and Cantor, C. (1969). Evolution of protein molecules. In Mammalian Protein Metabolism, ed. H. Munro. New York: Academic Press.

Kalbfleisch, J. and Lawless, J. (1985). The analysis of panel data under a Markov assumption. Journal of the American Statistical Association, 80.

Kimura, M. (1980). A simple model for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. Journal of Molecular Evolution, 16.

Leitner, T., Kumar, S., and Albert, J. (1997). Tempo and mode of nucleotide substitutions in gag and env gene fragments in human immunodeficiency virus type 1 populations with a known transmission history. Journal of Virology, 71.

Li, S., Pearl, D., and Doss, H. (2000). Phylogenetic tree construction using Markov chain Monte Carlo. Journal of the American Statistical Association, 95.

Ronquist, F. and Huelsenbeck, J. (2003). MRBAYES 3: Bayesian phylogenetic inference under mixed models. Bioinformatics, 19.

Rosenberg, M., Subramanian, S., and Kumar, S. (2003). Patterns of transitional mutation biases within and among mammalian genomes. Molecular Biology and Evolution, 20.

Suchard, M., Weiss, R., and Sinsheimer, J. (2001). Bayesian selection of continuous-time Markov chain evolutionary models. Molecular Biology and Evolution, 18.

Sun, D. and Berger, J. O. (1998). Reference priors with partial information. Biometrika, 85.

Welch, B. and Peers, H. W. (1963). On formulae for confidence points based on integrals of weighted likelihoods. Journal of the Royal Statistical Society, Series B, 25.

Whelan, S. and Goldman, N. (2001). A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Molecular Biology and Evolution, 18.

Yang, Z. (1996). Among-site rate variation and its impact on phylogenetic analyses. Trends in Ecology and Evolution, 11.

Yang, Z. and Rannala, B. (2005). Branch-length prior influences Bayesian posterior probability of phylogeny. Systematic Biology, 54.

[Figure 1: two aligned nucleotide strings labeled Sequence 1 and Sequence 2.]

Fig. 1. Pairwise alignment of two nucleotide sequences with continuous-time Markov chain state space $S = \{A, G, C, T\}$. One homologous site is illustrated by a shaded box; sites are independent and identically distributed along the entire alignment. The observed data $Y$ consist of the counts $n_{ij}$ of the character $i \in S$ in a homologous site in sequence #1 ending as character $j \in S$ in sequence #2. For example, $n_{AA} = 2$ for the shown sites.
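The counts $n_{ij}$ described in the caption can be tabulated directly from a gap-free pairwise alignment. The following is a minimal sketch, not taken from the paper: the function name `transition_counts` and the two example sequences are hypothetical, and sites containing any character outside $S$ are simply ignored.

```python
from collections import Counter

STATES = "AGCT"  # state space S = {A, G, C, T}


def transition_counts(seq1, seq2):
    """Return {(i, j): n_ij}, the number of sites with i in seq1 and j in seq2."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have equal length")
    # Pair up homologous sites and count each (i, j) combination.
    observed = Counter(zip(seq1, seq2))
    # Report every pair in S x S, including pairs that were never observed.
    return {(i, j): observed.get((i, j), 0) for i in STATES for j in STATES}


if __name__ == "__main__":
    # Made-up four-site alignment, not the one drawn in Fig. 1.
    counts = transition_counts("AGCT", "AGTT")
    for (i, j), n in counts.items():
        if n > 0:
            print(f"n_{i}{j} = {n}")
```

Because sites are treated as independent and identically distributed, these sixteen counts are all the likelihood needs from the alignment.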

[Figure 2: panels titled Exponential(10), Exponential(1), Exponential(0.5), Exponential(0.1), Uniform(0,1), Uniform(0,2), Uniform(0,10) and Uniform(0,100); each plots Density against Branch length.]

Fig. 2. Estimates of the elapsed time between German and the most recent common ancestor of Romance languages under six standard priors. Histograms summarize the posterior, while we overlay the priors (dashed lines).

[Figure 3: three panels with vertical axes Log(Density), Relative MSE and Coverage, each plotted against $t$.]

Fig. 3. Analyses under the Jukes-Cantor model based on the conditional reference prior (solid black line), a Gamma approximation to the conditional reference prior (dotted-dashed green line), an Exponential(10) prior (dashed red line) and a Uniform(0,10) prior (dotted blue line). The first panel compares the log prior densities. The second panel plots the relative mean square error (MSE) of the Bayesian estimates. The final panel describes the frequentist coverage of the Bayesian 95% credible intervals.

[Figure 4: two panels with vertical axes $\mathrm{Var}/t^2$ and $(\mathrm{Bias}/t)^2$, each plotted against $t$.]

Fig. 4. Decomposition of the relative mean square error (MSE) under the Jukes-Cantor model for the conditional reference prior (solid black line), the Gamma approximation to the conditional reference prior (dotted-dashed green line), the Exponential(10) prior (dashed red line) and the Uniform(0,10) prior (dotted blue line). The first panel traces out the variance of the posterior estimate scaled by the squared true elapsed time $t^2$, and the second panel shows the squared bias of the posterior estimate scaled by $t^2$.


More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance. Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance. Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine May 22, 2009 Abstract This paper provides

More information

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks!

Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Some of these slides have been borrowed from Dr. Paul Lewis, Dr. Joe Felsenstein. Thanks! Paul has many great tools for teaching phylogenetics at his web site: http://hydrodictyon.eeb.uconn.edu/people/plewis

More information

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics.

Major questions of evolutionary genetics. Experimental tools of evolutionary genetics. Theoretical population genetics. Evolutionary Genetics (for Encyclopedia of Biodiversity) Sergey Gavrilets Departments of Ecology and Evolutionary Biology and Mathematics, University of Tennessee, Knoxville, TN 37996-6 USA Evolutionary

More information

Counting labeled transitions in continuous-time Markov models of evolution

Counting labeled transitions in continuous-time Markov models of evolution Journal of Mathematical Biology manuscript No. (will be inserted by the editor) Counting labeled transitions in continuous-time Markov models of evolution Vladimir N. Minin Marc A. Suchard Received: date

More information

Evolutionary Analysis of Viral Genomes

Evolutionary Analysis of Viral Genomes University of Oxford, Department of Zoology Evolutionary Biology Group Department of Zoology University of Oxford South Parks Road Oxford OX1 3PS, U.K. Fax: +44 1865 271249 Evolutionary Analysis of Viral

More information

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap

Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Bayesian Interpretations of Heteroskedastic Consistent Covariance Estimators Using the Informed Bayesian Bootstrap Dale J. Poirier University of California, Irvine September 1, 2008 Abstract This paper

More information

Reconstruire le passé biologique modèles, méthodes, performances, limites

Reconstruire le passé biologique modèles, méthodes, performances, limites Reconstruire le passé biologique modèles, méthodes, performances, limites Olivier Gascuel Centre de Bioinformatique, Biostatistique et Biologie Intégrative C3BI USR 3756 Institut Pasteur & CNRS Reconstruire

More information

Contrasts for a within-species comparative method

Contrasts for a within-species comparative method Contrasts for a within-species comparative method Joseph Felsenstein, Department of Genetics, University of Washington, Box 357360, Seattle, Washington 98195-7360, USA email address: joe@genetics.washington.edu

More information

Modeling Noise in Genetic Sequences

Modeling Noise in Genetic Sequences Modeling Noise in Genetic Sequences M. Radavičius 1 and T. Rekašius 2 1 Institute of Mathematics and Informatics, Vilnius, Lithuania 2 Vilnius Gediminas Technical University, Vilnius, Lithuania 1. Introduction:

More information

Letter to the Editor. Department of Biology, Arizona State University

Letter to the Editor. Department of Biology, Arizona State University Letter to the Editor Traditional Phylogenetic Reconstruction Methods Reconstruct Shallow and Deep Evolutionary Relationships Equally Well Michael S. Rosenberg and Sudhir Kumar Department of Biology, Arizona

More information

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive.

Additive distances. w(e), where P ij is the path in T from i to j. Then the matrix [D ij ] is said to be additive. Additive distances Let T be a tree on leaf set S and let w : E R + be an edge-weighting of T, and assume T has no nodes of degree two. Let D ij = e P ij w(e), where P ij is the path in T from i to j. Then

More information

Lecture 3: Markov chains.

Lecture 3: Markov chains. 1 BIOINFORMATIK II PROBABILITY & STATISTICS Summer semester 2008 The University of Zürich and ETH Zürich Lecture 3: Markov chains. Prof. Andrew Barbour Dr. Nicolas Pétrélis Adapted from a course by Dr.

More information

STA 294: Stochastic Processes & Bayesian Nonparametrics

STA 294: Stochastic Processes & Bayesian Nonparametrics MARKOV CHAINS AND CONVERGENCE CONCEPTS Markov chains are among the simplest stochastic processes, just one step beyond iid sequences of random variables. Traditionally they ve been used in modelling a

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California

Ronald Christensen. University of New Mexico. Albuquerque, New Mexico. Wesley Johnson. University of California, Irvine. Irvine, California Texts in Statistical Science Bayesian Ideas and Data Analysis An Introduction for Scientists and Statisticians Ronald Christensen University of New Mexico Albuquerque, New Mexico Wesley Johnson University

More information

Appendix: Modeling Approach

Appendix: Modeling Approach AFFECTIVE PRIMACY IN INTRAORGANIZATIONAL TASK NETWORKS Appendix: Modeling Approach There is now a significant and developing literature on Bayesian methods in social network analysis. See, for instance,

More information

Integrated Objective Bayesian Estimation and Hypothesis Testing

Integrated Objective Bayesian Estimation and Hypothesis Testing Integrated Objective Bayesian Estimation and Hypothesis Testing José M. Bernardo Universitat de València, Spain jose.m.bernardo@uv.es 9th Valencia International Meeting on Bayesian Statistics Benidorm

More information