Phylogenetics and Darwin. An Introduction to Phylogenetics. Tree of Life. Darwin's Trees


An Introduction to Phylogenetics
Bret Larget (larget@stat.wisc.edu)
Departments of Botany and of Statistics, University of Wisconsin-Madison
February 4, 2008

Phylogenetics and Darwin

A phylogeny is a tree diagram that shows the evolutionary relationships among a group of species. The first phylogeny is due to Charles Darwin. In 1837, shortly after his famous five-year voyage as naturalist on the Beagle, Darwin sketched a tree diagram in one of his notebooks. This simple sketch is remarkably similar to modern diagrams of phylogenies. In addition, the sole figure in The Origin of Species is a phylogeny.

Darwin's Trees

[Figures: Darwin's 1837 sketch; figure from The Origin of Species]

Tree of Life

In The Origin of Species, Darwin describes a Tree of Life that represents the true evolutionary history of life:

"The affinities of all the beings of the same class have sometimes been represented by a great tree. I believe this simile largely speaks the truth. ... The limbs divided into great branches, and these into lesser and lesser branches, were themselves once, when the tree was small, budding twigs; and this connexion of the former and present buds by ramifying branches may well represent the classification of all extinct and living species in groups subordinate to groups. ... As buds give rise by growth to fresh buds, and these, if vigorous, branch out and overtop on all sides many a feebler branch, so by generation I believe it has been with the Tree of Life, which fills with its dead and broken branches the crust of the earth, and covers the surface with its ever branching and beautiful ramifications."

Early Phylogenetics

- Shortly after the 1859 publication of The Origin of Species, many biologists came to accept the truth of a universal Tree of Life.
- Ernst Haeckel and many others created highly stylized trees that were based on expert opinion.
- A century passed before the development of formal scientific methods for estimating phylogenies began.

Haeckel's Stylized Trees

[Figure: one of Haeckel's stylized trees]

Modern Phylogenetics

- Phylogenies are usually estimated from aligned DNA sequence data.
- Phylogenetics is the primary tool for systematics.
- Phylogenetics is used for studying viruses such as HIV.
- Phylogenetics has been used in court for forensic purposes.
- Phylogenetics is being used increasingly in comparative genomics and the study of gene function.

Phylogenetics and Systematics

Phylogenetic methods, particularly for molecular sequence data, have become the primary tool for systematists to determine evolutionary relationships. These tools have been used to confirm expected relationships (for example, that chimpanzees are the closest living relatives of humans) and have also been key in revealing several more surprising findings, including:

- birds are descended from dinosaurs;
- polar bears form a monophyletic group within brown bears;
- the land mammal most closely related to whales is the hippopotamus.

[Figure: phylogenetic tree whose tip labels are HIV isolate names such as HXB2, JRCSF, and YU2, with bootstrap support values at some nodes]

Phylogenetic Tree of Whales

[Figure: phylogenetic tree of whales]

Phylogenetics and Forensics

Phylogenetic trees have been used in several instances in the courts to provide evidence about the likely transmission of HIV. Examples include:

- confirming that a nurse contracted HIV from a mishap with a broken glass blood collection tube from an infected patient and not from an alternative source;
- providing evidence of deliberate infection in a criminal case;
- indicating that an infected friend was likely not the direct source of infection in a case.

Forensic Phylogenetic Tree

[Figure: HIV forensics tree, with tips for the index case, the alleged recipient, local controls, and GenBank sequences]

Figure 2. Neighbor-joining phylogram representing the reconstruction of the phylogenetic relationships between the env (C2-V5) sequences obtained from the index case (A31-44), the alleged recipient (B22-29), three local controls (LC45 and LC48; LC46 and LC47; and LC49 and LC50) and 48 sequences chosen from GenBank. Ten iterations of random sequence addition were used. Scale bar represents 10% genetic distance. Bootstrap values are shown at nodes with greater than 70% support.

DNA Data from a Sample of Birds

First 24 bases of 1558 from the Cox I gene.

Taxon        Bases 1-24
Alligator    GTG AAC TTC CAC --- CGT TGA CTC ...
Emu          GTG ACA TTC ATT ACT CGA TGA TTT ...
Kiwi         GTG ACC TTT ACT ACT CGA TGA CTC ...
Ostrich      GTG ACC TTC ATT ACT CGA TGA CTT ...
Swan         GTG ACC TTC ATC AAC CGA TGA CTA ...
Goose        GTG ACC TTC ATC AAC CGA TGA CTA ...
Chicken      GTG ACC TTC ATC AAC CGA TGA TTA ...
Woodpecker   GTG ACC TTC ATC AAC CGA TGA TTA ...
Finch        ATG ACA TAC ATT AAC CGA TGA TTA ...
Ibis         GTG ACC TTC ATC AAC CGA TGA CTA ...
Stork        GTG ACC TTC ATT ACC CGA TGA CTA ...
Osprey       ATG ACA TTC ATC AAC CGA TGA CTA ...
Falcon       GTG ACC TTC ATC AAC CGA TGA CTA ...
Vulture      ATG ACA TTC ATC AAT CGA TGA CTA ...
Penguin      GTG ACC TTC ATT AAC CGA TGA CTA ...

An Estimated Phylogeny

[Figure: estimated phylogeny of the sampled taxa; tip labels in the order shown: Penguin, Vulture, Stork, Ibis, Woodpecker, Osprey, Finch, Falcon, Chicken, Goose, Swan, Ostrich, Kiwi, Emu, Alligator]

Activity 1: Example Tree

[Figure: example tree with taxa A to F]

- How many descendant taxa does the common ancestor of taxa A and C have?
- Which taxon is sister to A?
- Which taxa are more closely related, A and C or C and D?
- Which taxa are more closely related, A and E or D and E?

Activity 2: Compare Trees

[Figure: several trees on taxa A to F]

Which trees have the same tree topology?

Activity 3: Unrooted Trees

[Figure: unrooted tree with taxa A to E]

Some methods estimate unrooted trees.

- If C is the outgroup, what is the rooted tree topology?
- If taxon C is the outgroup, which node is sister to B?
- If taxon A is the outgroup, which node is sister to B?
- How many rooted tree topologies are consistent with this unrooted tree topology?

How Many Trees?

# of Taxa   # Unrooted Trees   # Rooted Trees
 1                        1                1
 2                        1                1
 3                        1                3
 4                        3               15
 5                       15              105
 6                      105              945
 7                      945            10395
 8                    10395           135135
 9                   135135          2027025
10                  2027025         34459425
11                 34459425        654729075
12                654729075      13749310575
13              13749310575     316234143225

Formula for Counting Trees

The number of rooted tree topologies with $n$ taxa is

$$1 \times 3 \times \cdots \times (2n-3) \equiv (2n-3)!! \quad \text{for } n \ge 3.$$

There are more rooted trees with 51 species ($2.7 \times 10^{78}$) than the estimated number of hydrogen atoms in the universe ($1.3 \times 10^{77}$). Biologists often estimate trees with more than 100 species.

Probabilistic Framework

"Essentially, all models are wrong, but some are useful." (George Box)

- Commonly used models of molecular evolution treat sites as independent.
- These common models just need to describe the substitutions among four bases A, C, G, and T at a single site over time.
- The substitution process is modeled as a continuous-time Markov chain.

Markov Property

Use the notation $X(t)$ to represent the base at time $t$, with $X(t) \in \{A, C, G, T\}$ for DNA.

Formal statement:

$$P\{X(s+t) = j \mid X(s) = i,\ X(u) = x(u) \text{ for } u < s\} = P\{X(s+t) = j \mid X(s) = i\}$$

Informal understanding: given the present, the past is independent of the future.

If the expression does not depend on the time $s$, the Markov process is called homogeneous.
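The counting formula can be checked against the table with a few lines of Python. This is a minimal sketch; the function names are illustrative and not part of the original slides. It relies on the pattern visible in the table that the number of unrooted topologies on n taxa equals the number of rooted topologies on n - 1 taxa.

```python
def rooted_topologies(n: int) -> int:
    """Number of rooted binary tree topologies on n labeled taxa: (2n-3)!!."""
    if n < 1:
        raise ValueError("n must be at least 1")
    count = 1
    for odd in range(3, 2 * n - 2, 2):  # multiplies 3, 5, ..., 2n-3
        count *= odd
    return count


def unrooted_topologies(n: int) -> int:
    """Number of unrooted binary topologies on n taxa, i.e. (2n-5)!!.

    Uses the table's pattern: unrooted(n) == rooted(n - 1).
    """
    return rooted_topologies(n - 1) if n >= 2 else 1


for n in (4, 10, 13):
    print(n, unrooted_topologies(n), rooted_topologies(n))

# Order of magnitude for 51 species, printed in scientific notation.
print(f"{rooted_topologies(51):.2e}")
```

Python integers have arbitrary precision, so the counts stay exact even for 51 or more taxa.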

Rate Matrix

A stationary, homogeneous, continuous-time, finite-state-space Markov chain is parameterized by a rate matrix where:

- off-diagonal rates are nonnegative;
- diagonal terms are negative row sums of off-diagonal elements;
- consequently, row sums are zero.

Example (rows and columns in the order A, C, G, T):

$$Q = \{q_{ij}\} = \begin{pmatrix} -1.1 & 0.3 & 0.6 & 0.2 \\ 0.2 & -1.1 & 0.3 & 0.6 \\ 0.4 & 0.3 & -0.9 & 0.2 \\ 0.2 & 0.9 & 0.3 & -1.4 \end{pmatrix}$$

Alarm Clock Interpretation

How to simulate a continuous-time Markov chain beginning in state $i$:

- the time to the next transition is distributed $\mathrm{Exp}(q_i)$, where $q_i \equiv -q_{ii}$;
- the transition is to state $j$ with probability $q_{ij} / \sum_{k \ne i} q_{ik}$.

Path Probability Density Calculation

Example: Begin at A, change to G at time 0.3, change to C at time 0.8, and then no more changes before time $t = 1$.

$$P\{\text{path}\} = P\{\text{begin at A}\} \times \left(1.1\,e^{-(1.1)(0.3)} \times \frac{0.6}{1.1}\right) \times \left(0.9\,e^{-(0.9)(0.5)} \times \frac{0.3}{0.9}\right) \times \left(e^{-(1.1)(0.2)}\right)$$

Probability Transition Matrices

The transition matrix is $P(t) = e^{Qt}$, where

$$e^{A} = \sum_{k=0}^{\infty} \frac{A^k}{k!} = I + A + \frac{A^2}{2} + \frac{A^3}{6} + \cdots$$

A probability transition matrix has non-negative values and each row sums to one. Each row contains the probabilities from a probability distribution on the possible states of the Markov process.
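The alarm-clock recipe and the path density calculation can be sketched in plain Python. This is an illustration, not the lecturer's code: the function names are invented here, and the example rate matrix is entered with negative diagonal entries, as the definition of a rate matrix requires.

```python
import math
import random

STATES = "ACGT"
# Example rate matrix from the text (rows and columns ordered A, C, G, T).
Q = [
    [-1.1, 0.3, 0.6, 0.2],
    [0.2, -1.1, 0.3, 0.6],
    [0.4, 0.3, -0.9, 0.2],
    [0.2, 0.9, 0.3, -1.4],
]


def simulate_path(start, t_end, rng):
    """Alarm-clock simulation: return [(jump time, new state), ...] on [0, t_end)."""
    i = STATES.index(start)
    t = 0.0
    path = [(0.0, start)]
    while True:
        rate = -Q[i][i]                 # q_i = -q_ii
        t += rng.expovariate(rate)      # waiting time ~ Exp(q_i)
        if t >= t_end:
            return path
        # Jump to j != i with probability q_ij / sum_{k != i} q_ik.
        others = [j for j in range(4) if j != i]
        weights = [Q[i][j] for j in others]
        i = rng.choices(others, weights=weights)[0]
        path.append((t, STATES[i]))


def path_density(path, t_end):
    """Density of a path, conditional on its starting state."""
    density = 1.0
    for k, (t, state) in enumerate(path):
        i = STATES.index(state)
        rate = -Q[i][i]
        if k + 1 < len(path):
            t_next, next_state = path[k + 1]
            j = STATES.index(next_state)
            # Exp(q_i) density of the waiting time, times the jump probability.
            density *= rate * math.exp(-rate * (t_next - t)) * Q[i][j] / rate
        else:
            # Probability of no further change before t_end.
            density *= math.exp(-rate * (t_end - t))
    return density


example = [(0.0, "A"), (0.3, "G"), (0.8, "C")]
print(path_density(example, 1.0))                 # excludes P{begin at A}
print(simulate_path("A", 1.0, random.Random(1)))  # one random path
```

For the example path the three factors multiply to $0.6 \times 0.3 \times e^{-1}$, since the exponents $-0.33$, $-0.45$, and $-0.22$ add up to $-1$.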

Examples

$$P(0.1) = \begin{pmatrix} 0.897 & 0.029 & 0.055 & 0.019 \\ 0.019 & 0.899 & 0.029 & 0.053 \\ 0.037 & 0.029 & 0.916 & 0.019 \\ 0.019 & 0.080 & 0.029 & 0.872 \end{pmatrix} \qquad P(0.5) = \begin{pmatrix} 0.605 & 0.118 & 0.199 & 0.079 \\ 0.079 & 0.629 & 0.118 & 0.174 \\ 0.132 & 0.118 & 0.671 & 0.079 \\ 0.079 & 0.261 & 0.118 & 0.542 \end{pmatrix}$$

$$P(1) = \begin{pmatrix} 0.407 & 0.190 & 0.276 & 0.126 \\ 0.126 & 0.464 & 0.190 & 0.219 \\ 0.184 & 0.190 & 0.500 & 0.126 \\ 0.126 & 0.329 & 0.190 & 0.355 \end{pmatrix} \qquad P(10) = \begin{pmatrix} 0.200 & 0.300 & 0.300 & 0.200 \\ 0.200 & 0.300 & 0.300 & 0.200 \\ 0.200 & 0.300 & 0.300 & 0.200 \\ 0.200 & 0.300 & 0.300 & 0.200 \end{pmatrix}$$

Spectral Decomposition

- The matrix $Q$ can be factored as $V \Lambda V^{-1}$, where $\Lambda$ is a diagonal matrix of the eigenvalues and $V$ is the matrix whose columns are the corresponding eigenvectors.
- All rate matrices $Q$ have an eigenvalue 0 with an eigenvector of all 1s, as the rows sum to 0 by construction.
- Our example rate matrix $Q$ has eigenvalues $0$, $-1$, $-1.5$, and $-2$.
- The probability transition matrix is of the form $P(t) = V e^{\Lambda t} V^{-1}$.
- This means that each probability can be written as a linear combination of exponential functions of the product of the time $t$ and an eigenvalue: each entry of $P(t)$ has the form $\sum_i w_i e^{\lambda_i t}$.

Numerical Example

$$Q = V \Lambda V^{-1}$$

$$\begin{pmatrix} -1.1 & 0.3 & 0.6 & 0.2 \\ 0.2 & -1.1 & 0.3 & 0.6 \\ 0.4 & 0.3 & -0.9 & 0.2 \\ 0.2 & 0.9 & 0.3 & -1.4 \end{pmatrix} = \begin{pmatrix} 1 & 1 & 3 & 0 \\ 1 & -1 & 0 & 2 \\ 1 & 1 & -2 & 0 \\ 1 & -1 & 0 & -3 \end{pmatrix} \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & -1 & 0 & 0 \\ 0 & 0 & -1.5 & 0 \\ 0 & 0 & 0 & -2 \end{pmatrix} \begin{pmatrix} 0.2 & 0.3 & 0.3 & 0.2 \\ 0.2 & -0.3 & 0.3 & -0.2 \\ 0.2 & 0.0 & -0.2 & 0.0 \\ 0.0 & 0.2 & 0.0 & -0.2 \end{pmatrix}$$

(Each eigenvector is determined only up to sign; the signs shown are one choice for which the product reproduces $Q$.)

Stationary Distribution

- Well-behaved continuous-time Markov chains have a stationary distribution $\pi$. (For finite-state-space chains, irreducibility is sufficient.)
- When the time $t$ is large enough, the probability $P_{ij}(t)$ will be close to $\pi_j$ for each $i$. (See $P(10)$ above.)
- The stationary distribution can be thought of as a long-run average: the proportion of time the chain spends in state $i$ converges to $\pi_i$.
- The stationary distribution satisfies $\pi^\top Q = 0^\top$. Also, $\pi^\top P(t) = \pi^\top$ for any time $t$.
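The spectral form $P(t) = V e^{\Lambda t} V^{-1}$ can be evaluated directly for the running example. In the sketch below the eigenvalues are those quoted in the text. The signs of the entries of `V` and `V_INV` are a reconstruction, chosen so that $V \Lambda V^{-1}$ reproduces the example $Q$, and should be read as an assumption. The function name is illustrative.

```python
import math

# Eigen-decomposition of the example Q (rows/columns ordered A, C, G, T).
# The signs of the eigenvector entries are an assumed, self-consistent choice.
V = [
    [1.0,  1.0,  3.0,  0.0],
    [1.0, -1.0,  0.0,  2.0],
    [1.0,  1.0, -2.0,  0.0],
    [1.0, -1.0,  0.0, -3.0],
]
EIGENVALUES = [0.0, -1.0, -1.5, -2.0]
V_INV = [
    [0.2,  0.3,  0.3,  0.2],   # first row is the stationary distribution
    [0.2, -0.3,  0.3, -0.2],
    [0.2,  0.0, -0.2,  0.0],
    [0.0,  0.2,  0.0, -0.2],
]


def transition_matrix(t):
    """P(t) = V exp(Lambda t) V^{-1}, computed entry by entry."""
    decay = [math.exp(lam * t) for lam in EIGENVALUES]
    return [
        [sum(V[i][k] * decay[k] * V_INV[k][j] for k in range(4)) for j in range(4)]
        for i in range(4)
    ]


for t in (0.1, 0.5, 1.0, 10.0):
    print(f"P({t}):")
    for row in transition_matrix(t):
        print("  " + " ".join(f"{p:.3f}" for p in row))
```

Each entry is a sum of four exponentials in $t$, which is why every row of $P(10)$ is already indistinguishable from the stationary distribution at three decimals.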

Numerical Example

$$\pi^\top Q = \begin{pmatrix} 0.2 & 0.3 & 0.3 & 0.2 \end{pmatrix} \begin{pmatrix} -1.1 & 0.3 & 0.6 & 0.2 \\ 0.2 & -1.1 & 0.3 & 0.6 \\ 0.4 & 0.3 & -0.9 & 0.2 \\ 0.2 & 0.9 & 0.3 & -1.4 \end{pmatrix} = \begin{pmatrix} 0 & 0 & 0 & 0 \end{pmatrix} = 0^\top$$

Usual Parameterization

The matrix $Q = \{q_{ij}\}$ is typically scaled and parameterized as

$$q_{ij} = r_{ij} \pi_j / \mu \quad \text{for } i \ne j, \qquad \text{where } \mu = \sum_i \pi_i \sum_{j \ne i} r_{ij} \pi_j,$$

which guarantees that $\pi$ will be the stationary distribution when $r_{ij} = r_{ji}$. With this scaling, there is one expected transition per unit time.

Time-reversibility

- A continuous-time Markov chain is time-reversible if the probability of a sequence of events is the same going forward as it is going backwards.
- The matrix $Q$ is the matrix for a time-reversible Markov chain when $\pi_i q_{ij} = \pi_j q_{ji}$ for all $i$ and $j$.
- That is, the overall rate of substitutions from $i$ to $j$ equals the overall rate of substitutions from $j$ to $i$ for every pair of states $i$ and $j$.
- The matrix equivalent is $\Pi Q = Q^\top \Pi$, where $\Pi = \mathrm{diag}(\pi)$.

General Time-Reversible Model

The GTR model is the most general basic time-reversible continuous-time Markov model for nucleotide substitution. The model is typically parameterized with 8 free parameters, where

$$q_{ij} = \begin{cases} r_{ij} \pi_j / \mu & \text{for } i \ne j \\ -\sum_{j \ne i} q_{ij} & \text{for } i = j \end{cases} \qquad \text{with } \mu = \sum_i \pi_i \sum_{j \ne i} r_{ij} \pi_j.$$

- The stationary distribution $\pi$ has three free parameters, as $\pi$ sums to one.
- The vector $r = (r_{AC}, r_{AG}, \ldots, r_{GT})$ is usually constrained to five degrees of freedom (either by setting $r_{GT} = 1$ or by constraining the sum).
- Many other popular models are special cases. These models are often named by the initials of the authors and the year in which they were published.
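A short sketch of the GTR parameterization follows. The function name and the dictionary layout for the exchangeabilities are choices made here for illustration. The exchangeability values are back-calculated from the running example matrix (as $r_{ij} \propto q_{ij}/\pi_j$) and are not stated in the slides, so treat them as an assumption; with them, the scaled matrix comes out as the example $Q$ divided by 1.1.

```python
BASES = "ACGT"


def gtr_rate_matrix(pi, r):
    """Build a scaled GTR rate matrix.

    pi: stationary frequencies in the order A, C, G, T.
    r:  symmetric exchangeabilities keyed by unordered base pairs, e.g. r["AC"].
    """
    def exch(i, j):
        return r["".join(sorted(BASES[i] + BASES[j]))]

    # mu = sum_i pi_i * sum_{j != i} r_ij * pi_j
    mu = sum(pi[i] * exch(i, j) * pi[j]
             for i in range(4) for j in range(4) if i != j)
    q = [[0.0] * 4 for _ in range(4)]
    for i in range(4):
        for j in range(4):
            if i != j:
                q[i][j] = exch(i, j) * pi[j] / mu
        q[i][i] = -sum(q[i])   # diagonal is the negative sum of the off-diagonals
    return q


# Assumed inputs: pi from the running example, r back-calculated from its Q.
PI = [0.2, 0.3, 0.3, 0.2]
R = {"AC": 1.0, "AG": 2.0, "AT": 1.0, "CG": 1.0, "CT": 3.0, "GT": 1.0}
Q_GTR = gtr_rate_matrix(PI, R)

for row in Q_GTR:
    print(" ".join(f"{x:8.4f}" for x in row))

# Expected number of substitutions per unit time at stationarity.
print(-sum(PI[i] * Q_GTR[i][i] for i in range(4)))
```

Dividing by $\mu$ is what makes branch lengths interpretable as expected substitutions per site.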

Other Common Models

Long name         Short name   π         r
Jukes-Cantor      JC69         uniform   r_AC = r_AG = r_AT = r_CG = r_CT = r_GT
Kimura 80         K80          uniform   r_AG = r_CT; r_AC = r_AT = r_CG = r_GT
Felsenstein 81    F81          free      r_AC = r_AG = r_AT = r_CG = r_CT = r_GT
Felsenstein 84    F84          free      r_AC = r_AT = r_CG = r_GT; r_AG = (1 + κ/(π_A + π_G)) r_AC; r_CT = (1 + κ/(π_C + π_T)) r_AC
Hasegawa et al.   HKY85        free      r_AC = r_AT = r_CG = r_GT; r_AG = r_CT = κ r_AC
Tamura-Nei 93     TN93         free      r_AC = r_AT = r_CG = r_GT; r_AG = κ_1 r_AC; r_CT = κ_2 r_AC

Transition Probabilities

- There are closed form solutions to the probability transition matrices for each of the previous models except for GTR.
- All but GTR are special cases of Tamura-Nei.

Tamura-Nei Model

The rate matrix for TN93 (rows and columns in the order A, C, G, T) is

$$Q = \mu^{-1} \begin{pmatrix} -(\kappa_R \pi_G + \pi_Y) & \pi_C & \kappa_R \pi_G & \pi_T \\ \pi_A & -(\kappa_Y \pi_T + \pi_R) & \pi_G & \kappa_Y \pi_T \\ \kappa_R \pi_A & \pi_C & -(\kappa_R \pi_A + \pi_Y) & \pi_T \\ \pi_A & \kappa_Y \pi_C & \pi_G & -(\kappa_Y \pi_C + \pi_R) \end{pmatrix}$$

where

$$\pi_R = \pi_A + \pi_G; \qquad \pi_Y = \pi_C + \pi_T; \qquad \mu = 2(\kappa_R \pi_A \pi_G + \kappa_Y \pi_C \pi_T + \pi_R \pi_Y).$$

Here $\kappa_R$ and $\kappa_Y$ play the roles of $\kappa_1$ and $\kappa_2$ in the table of models.

Tamura-Nei Model

The transition probabilities for TN93 are

$$P(t) = \begin{pmatrix} \pi_A + \frac{\pi_A \pi_Y}{\pi_R}\beta_2 + \frac{\pi_G}{\pi_R}\beta_3 & \pi_C(1-\beta_2) & \pi_G + \frac{\pi_G \pi_Y}{\pi_R}\beta_2 - \frac{\pi_G}{\pi_R}\beta_3 & \pi_T(1-\beta_2) \\ \pi_A(1-\beta_2) & \pi_C + \frac{\pi_C \pi_R}{\pi_Y}\beta_2 + \frac{\pi_T}{\pi_Y}\beta_4 & \pi_G(1-\beta_2) & \pi_T + \frac{\pi_T \pi_R}{\pi_Y}\beta_2 - \frac{\pi_T}{\pi_Y}\beta_4 \\ \pi_A + \frac{\pi_A \pi_Y}{\pi_R}\beta_2 - \frac{\pi_A}{\pi_R}\beta_3 & \pi_C(1-\beta_2) & \pi_G + \frac{\pi_G \pi_Y}{\pi_R}\beta_2 + \frac{\pi_A}{\pi_R}\beta_3 & \pi_T(1-\beta_2) \\ \pi_A(1-\beta_2) & \pi_C + \frac{\pi_C \pi_R}{\pi_Y}\beta_2 - \frac{\pi_C}{\pi_Y}\beta_4 & \pi_G(1-\beta_2) & \pi_T + \frac{\pi_T \pi_R}{\pi_Y}\beta_2 + \frac{\pi_C}{\pi_Y}\beta_4 \end{pmatrix}$$

where

$$\beta_2 = \exp(-t/\mu); \qquad \beta_3 = \exp(-(\pi_R \kappa_1 + \pi_Y)\,t/\mu); \qquad \beta_4 = \exp(-(\pi_Y \kappa_2 + \pi_R)\,t/\mu).$$
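The closed-form TN93 transition probabilities translate almost line for line into code. The sketch below is an illustration with invented names and made-up parameter values. It takes $\kappa_R$ to be the A/G (purine) transition multiplier and $\kappa_Y$ the C/T (pyrimidine) multiplier, matching $\kappa_1$ and $\kappa_2$ in the table of models.

```python
import math


def tn93_transition_matrix(t, pi, kappa_r, kappa_y):
    """Closed-form P(t) under TN93; rows and columns ordered A, C, G, T."""
    pa, pc, pg, pt = pi
    pr, py = pa + pg, pc + pt                       # purine, pyrimidine totals
    mu = 2.0 * (kappa_r * pa * pg + kappa_y * pc * pt + pr * py)
    b2 = math.exp(-t / mu)
    b3 = math.exp(-(pr * kappa_r + py) * t / mu)
    b4 = math.exp(-(py * kappa_y + pr) * t / mu)
    tv = 1.0 - b2                                   # shared transversion factor
    return [
        [pa + pa * py / pr * b2 + pg / pr * b3, pc * tv,
         pg + pg * py / pr * b2 - pg / pr * b3, pt * tv],
        [pa * tv, pc + pc * pr / py * b2 + pt / py * b4,
         pg * tv, pt + pt * pr / py * b2 - pt / py * b4],
        [pa + pa * py / pr * b2 - pa / pr * b3, pc * tv,
         pg + pg * py / pr * b2 + pa / pr * b3, pt * tv],
        [pa * tv, pc + pc * pr / py * b2 - pc / py * b4,
         pg * tv, pt + pt * pr / py * b2 + pc / py * b4],
    ]


# Hypothetical parameter values, chosen only for illustration.
PI_TN = (0.1, 0.2, 0.3, 0.4)
for row in tn93_transition_matrix(0.5, PI_TN, kappa_r=4.0, kappa_y=2.0):
    print(" ".join(f"{p:.4f}" for p in row))
```

Setting `kappa_r == kappa_y` gives HKY85, and additionally using uniform frequencies gives K80, which is one way to see why a single closed form covers the simpler models in the table.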

Rate Variation Among Sites

A common extension to the standard CTMC models is to assume that there is rate variation among sites: at each site, the $Q$ matrix is multiplied by a site-specific rate. The two most popular extensions are:

- Invariant sites: some sites have rate 0.
- Gamma-distributed rates: rates are drawn from a mean 1 gamma distribution.

For computational tractability, the Gamma distribution is typically replaced by a mean 1 discrete distribution with four distinct rates based on quantiles of a Gamma distribution.

Other Extensions

- There are many other model extensions in common use and under development.
- It is common to partition sites (by gene, by codon position, by genomic location) and to use different models for each part.
- The covarion model allows different lineages to have different rates at the same site. This is typically modeled with a hidden Markov model where the site can turn off.
- There are models for amino acid substitution, models for codons, models for RNA pairs, models that incorporate protein structure information, and so on.
- Current models still do not capture many of the important biological processes that affect the evolution of molecular sequences.

Distance Between Pairs of Taxa

In a two-taxon tree, the distance between two taxa can be estimated under any model by maximum likelihood. If the distance is $t$ and at site $j$ one species has base A and the other has base C, the contribution to the likelihood at this site is

$$L_j(t) = \pi_A P_{AC}(t) = \pi_C P_{CA}(t)$$

for a time-reversible model. The overall likelihood is

$$L(t) = \prod_j L_j(t)$$

and the log-likelihood is

$$\ell(t) = \sum_j \log L_j(t) = \sum_j \left( \log \pi_{x[j]} + \log P_{x[j]\,y[j]}(t) \right).$$

Distance Between Pairs of Taxa

- For models with free $\pi$, it is common to estimate $\pi$ with observed base frequencies.
- Other parameters are usually estimated by maximum likelihood.
- The simplest models have closed form solutions; others require numerical optimization.
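As a worked illustration of pairwise distance estimation, the sketch below fits the JC69 distance between the Emu and Kiwi fragments from the bird data, once by numerically maximizing the log-likelihood and once with the well-known JC69 closed form $\hat t = -\tfrac{3}{4}\log(1 - \tfrac{4}{3}p)$, where $p$ is the proportion of differing sites. The JC69 transition probabilities assume the scaling of one expected substitution per unit time. The optimizer (a golden-section search) and all function names are choices made here, not something the slides prescribe.

```python
import math

# First 24 bases of the Cox I gene for two taxa from the bird data.
EMU = "GTGACATTCATTACTCGATGATTT"
KIWI = "GTGACCTTTACTACTCGATGACTC"


def jc69_log_likelihood(t, x, y):
    """Log-likelihood of distance t under JC69 (pi uniform, rate scaled to 1)."""
    decay = math.exp(-4.0 * t / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return sum(math.log(0.25) + math.log(p_same if a == b else p_diff)
               for a, b in zip(x, y))


def maximize(f, lo, hi, tol=1e-8):
    """Golden-section search for the maximum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) > f(d):
            b = d          # maximum lies in [a, d]
        else:
            a = c          # maximum lies in [c, b]
    return (a + b) / 2.0


# Numerical maximum likelihood estimate (t must be > 0 so that p_diff > 0).
t_hat = maximize(lambda t: jc69_log_likelihood(t, EMU, KIWI), 1e-6, 5.0)

# Closed-form JC69 estimate from the proportion of differing sites.
p = sum(a != b for a, b in zip(EMU, KIWI)) / len(EMU)
t_closed = -0.75 * math.log(1.0 - 4.0 * p / 3.0)

print(p, t_hat, t_closed)
```

For richer models such as GTR the same numerical approach applies, with the log-likelihood evaluated from that model's $P(t)$.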

Notation for the Alignment

- An alignment of $m$ taxa and $n$ sites will have $mn$ nucleotide bases.
- Let the observed base for the $i$th taxon and the $j$th site be $x_{ij}$.

Notation for the Tree

- With a time-reversible model, the location of a root (where the CTMC begins at stationarity) does not affect the likelihood calculation. We can assume an unrooted tree without loss of generality.
- An unrooted tree with $m$ taxa will have $m - 2$ internal nodes.
- Number these nodes $i = 1, \ldots, 2m - 2$, with the first $m$ for leaf nodes and the last $m - 2$ for internal nodes.
- For calculation purposes, we will denote node $\rho$ (which could be any node) as the root.
- There are $2m - 3$ edges in the tree, numbered $e = 1, \ldots, 2m - 3$.
- Relative to root node $\rho$, edge $e$ connects parent node $p(e)$ and child node $c(e)$, where $p(e)$ is closer to $\rho$ than $c(e)$.
- Edge $e$ has length $t_e$.

Notation for Unobserved Data

- The likelihood for a tree is computed by summing over all possible bases at the internal nodes for each of the $n$ sites.
- For each site, there are $4^{m-2}$ possible allocations of bases at internal nodes, which we will index by $k$.
- Internal node $i$ is set to nucleotide $b_{ik}$ at the $k$th allocation, $i = m + 1, \ldots, 2m - 2$.
- Let $z(i, j, k)$ be the nucleotide at node $i$, site $j$, and allocation $k$:

$$z(i, j, k) = \begin{cases} x_{ij} & \text{if } i \le m \text{ ($i$ is a leaf node)} \\ b_{ik} & \text{if } i > m \text{ ($i$ is an internal node)} \end{cases}$$

Likelihood of a Tree

Let $P(t)$ be the $4 \times 4$ probability transition matrix over an edge of length $t$. The likelihood of the tree is

$$\prod_j \sum_k \left( \pi_{z(\rho,j,k)} \prod_e P_{z(p(e),j,k)\,z(c(e),j,k)}(t_e) \right)$$

Notice that the sum is over the $4^{m-2}$ possible allocations. A naive calculation would not be tractable for large trees.

Felsenstein's Pruning Algorithm

- Felsenstein's pruning algorithm is an example of dynamic programming.
- By saving partial calculations, the time complexity of the likelihood evaluation grows linearly with the number of taxa, not exponentially.
- For each site and node, the algorithm depends on calculating the probability in the subtree rooted at that node for each possible base.
- The algorithm begins at the leaves of the tree and recurses to the root.
- The likelihood of the site is a weighted average of the conditional subtree probabilities at the root, weighted by the stationary distribution.

The Algorithm for One Site

Define $f_j(i, b)$ to be the probability of the data at site $j$ in the subtree rooted at node $i$, conditional on the nucleotide at this node being $b$.

For a leaf node,

$$f_j(i, b) = \mathbf{1}\{x_{ij} = b\}$$

For an internal node with children nodes indexed by $c$ attached by edges of length $t_c$,

$$f_j(i, b) = \prod_c \left( \sum_z P_{bz}(t_c)\, f_j(c, z) \right)$$

The likelihood at site $j$ is

$$L_j = \sum_b \pi_b\, f_j(\rho, b)$$

Example

Do an example with five taxa for one site. See chalk board for example.

$$P_1 = \begin{pmatrix} 0.81 & 0.05 & 0.10 & 0.04 \\ 0.04 & 0.81 & 0.05 & 0.10 \\ 0.07 & 0.05 & 0.84 & 0.04 \\ 0.04 & 0.14 & 0.05 & 0.77 \end{pmatrix} \qquad P_2 = \begin{pmatrix} 0.66 & 0.10 & 0.17 & 0.07 \\ 0.07 & 0.68 & 0.10 & 0.15 \\ 0.11 & 0.10 & 0.72 & 0.07 \\ 0.07 & 0.23 & 0.10 & 0.60 \end{pmatrix}$$

Example

Values of $f$ by node (rows) and base (columns):

Node   A             C              G             T
1      1.000000000   0.000000e+00   0.000000000   0.000000e+00
2      0.000000000   0.000000e+00   1.000000000   0.000000e+00
3      1.000000000   0.000000e+00   0.000000000   0.000000e+00
4      0.000000000   0.000000e+00   1.000000000   0.000000e+00
5      0.000000000   0.000000e+00   1.000000000   0.000000e+00
6      0.081000000   2.000000e-03   0.058800000   2.000000e-03
7      0.058052700   3.200000e-04   0.003866940   3.200000e-04
8      0.000806449   1.403328e-05   0.004439667   1.403328e-05
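The recursion can be sketched in a few lines. The slides leave the tree for the five-taxon example to the chalk board, so the tree shape and the assignment of $P_1$ and $P_2$ to edges below are an inference: they were chosen because they reproduce the tabulated values of $f$ for nodes 6, 7, and 8. The stationary distribution used for the final site likelihood is also hypothetical (uniform), since the example does not state one.

```python
BASES = "ACGT"

P1 = [
    [0.81, 0.05, 0.10, 0.04],
    [0.04, 0.81, 0.05, 0.10],
    [0.07, 0.05, 0.84, 0.04],
    [0.04, 0.14, 0.05, 0.77],
]
P2 = [
    [0.66, 0.10, 0.17, 0.07],
    [0.07, 0.68, 0.10, 0.15],
    [0.11, 0.10, 0.72, 0.07],
    [0.07, 0.23, 0.10, 0.60],
]

# Observed bases at the leaves (nodes 1-5), read off the table of f values.
LEAVES = {1: "A", 2: "G", 3: "A", 4: "G", 5: "G"}

# ASSUMED tree: children[node] = [(child, transition matrix on that edge), ...].
# Node 8 is used as the root and has three children (unrooted tree).
CHILDREN = {
    6: [(1, P1), (2, P1)],
    7: [(6, P1), (3, P1)],
    8: [(7, P1), (4, P1), (5, P2)],
}


def conditional(node):
    """f(node, b): probability of the data below `node` given base b at `node`."""
    if node in LEAVES:
        return [1.0 if b == LEAVES[node] else 0.0 for b in BASES]
    f = [1.0] * 4
    for child, P in CHILDREN[node]:
        f_child = conditional(child)
        for b in range(4):
            f[b] *= sum(P[b][z] * f_child[z] for z in range(4))
    return f


def site_likelihood(root, pi):
    """L = sum_b pi_b f(root, b)."""
    return sum(p * f for p, f in zip(pi, conditional(root)))


for node in (6, 7, 8):
    print(node, conditional(node))

# Hypothetical uniform stationary distribution, for illustration only.
print(site_likelihood(8, [0.25, 0.25, 0.25, 0.25]))
```

A production implementation would cache each node's vector rather than recompute it, and would work with logarithms or rescaling to avoid underflow on large trees.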

Maximum Likelihood Estimation for One Tree

- For a single tree topology, ML estimation requires optimization of the branch lengths and of any parameters in the substitution model.
- Numerical optimization methods are required even for simple models and small trees.

Tree Search

- The search for the maximum likelihood tree conceptually requires obtaining the maximum likelihood for each possible tree topology and then picking the best of these.
- For more than a dozen or so taxa, exhaustive search is infeasible.
- Heuristic search algorithms typically define a neighborhood structure for possible topologies.
- The search goes through neighbors and jumps to the first neighbor with a higher likelihood.
- When all neighbors are inferior to the current tree, the search stops.
- Much improvement has been made in recent years (RAxML and GARLI are two modern ML programs).

Bayesian Inference

In Bayesian inference, the posterior distribution is proportional to the product of the likelihood and the prior distribution. For parameters $\theta$ and data $D$,

$$P\{\theta \mid D\} = \frac{P\{D \mid \theta\}\, P\{\theta\}}{P\{D\}}$$

The denominator is the marginal likelihood of the data, which is the integral of the likelihood against the prior distribution.

Bayesian Phylogenetics

For a phylogenetic problem, the parameter $\theta$ typically includes the tree topology, the edge lengths, and parameters for the substitution model:

$$\theta = (\tau, \nu, \phi)$$

Often we assume independence of these components: $P\{\theta\} = P\{\tau\}\, P\{\nu\}\, P\{\phi\}$.

In a typical phylogenetic problem, the marginal likelihood cannot be computed, because

$$P\{D\} = \int_\Theta P\{D \mid \theta\}\, P\{\theta\}\, d\theta$$

is a sum of very many terms (one for each topology), where each term is a high-dimensional integral of a complicated function.

Phylogenetic Inference

- We may be interested in the posterior distribution of the tree topology, $P\{\tau \mid D\}$.
- When this posterior distribution is diffuse, we can summarize it by computing posterior distributions of clades.
- The posterior probability of a clade $C$ is the sum of the posterior probabilities of all tree topologies that contain it:

$$P\{C \mid D\} = \sum_{\tau : C \in \tau} P\{\tau \mid D\}$$

- A consensus tree which includes as many clades with high posterior probability as possible is often used as a single tree summary of a distribution of the tree topology.

Sample-based Inference

- Any aspect of a posterior distribution can be estimated from a sample drawn from the distribution.
- For example, the sample proportion of trees with topology $\tau_0$ is an estimate of $P\{\tau_0 \mid D\}$.
- Also, the sample mean of a transition/transversion parameter $\kappa$ is an estimate of the posterior mean $E[\kappa \mid D]$.
- But how do we sample from a complicated posterior distribution?

Markov Chain Monte Carlo

- Markov chain Monte Carlo (MCMC) is a mathematical method for obtaining dependent samples from a target distribution (such as a posterior distribution).
- The idea is to construct a Markov chain whose state space is the parameter space $\Theta$ and whose stationary distribution matches the target distribution, say $P\{\theta \mid D\}$.
- Simulating the Markov chain produces a sample $\theta_0, \theta_1, \ldots$ which, after discarding an initial burn-in portion, may be treated as a dependent sample from the target distribution.

Metropolis-Hastings

- For notational convenience, let the target distribution be $\pi(\theta) = P\{\theta \mid D\}$.
- The most common form of MCMC uses the Metropolis-Hastings algorithm, in which a proposal distribution $q$, which can depend on the most recently sampled $\theta_i$, generates a proposal $\theta^*$ which is accepted with some probability.
- When accepted, $\theta_{i+1} = \theta^*$. When rejected, $\theta_{i+1} = \theta_i$.
- The proposal distribution $q$ is essentially arbitrary provided it can move around the entire space $\Theta$.
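Sample-based estimation of topology and clade probabilities amounts to counting. The sketch below represents each sampled rooted tree by the set of its non-trivial clades; the ten-tree "posterior sample" is made up purely for illustration, and the helper names are hypothetical.

```python
from collections import Counter


def tree(*clades):
    """Represent a rooted topology by its set of non-trivial clades."""
    return frozenset(frozenset(c) for c in clades)


def topology_posteriors(sample):
    """Sample proportion of each topology: estimates P{tau | D}."""
    counts = Counter(sample)
    return {topology: c / len(sample) for topology, c in counts.items()}


def clade_posteriors(sample):
    """Sample proportion of trees containing each clade: estimates P{C | D}."""
    counts = Counter()
    for topology in sample:
        counts.update(topology)          # each clade counted once per tree
    return {clade: c / len(sample) for clade, c in counts.items()}


# Made-up sample of rooted trees on taxa A, B, C, D (illustration only).
T1 = tree("AB", "CD")      # ((A,B),(C,D))
T2 = tree("AB", "ABC")     # (((A,B),C),D)
T3 = tree("AC", "BD")      # ((A,C),(B,D))
SAMPLE = [T1] * 5 + [T2] * 3 + [T3] * 2

for clade, prob in sorted(clade_posteriors(SAMPLE).items(),
                          key=lambda item: -item[1]):
    print("".join(sorted(clade)), prob)
```

In this made-up sample no single topology has more than half the posterior mass, yet the clade {A, B} is well supported because it appears in two different topologies. This is the situation in which clade summaries are more informative than topology frequencies.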

Metropolis-Hastings Algorithm

The acceptance probability is

$$\min\left\{ 1,\ \frac{\pi(\theta^*)}{\pi(\theta)} \times \frac{q(\theta \mid \theta^*)}{q(\theta^* \mid \theta)} \times |J| \right\}$$

where $J$ is a Jacobian.

Notice that the target density appears only as a ratio. This means that it need only be known up to a normalizing constant, and we can simply evaluate $h(\theta) = P\{D \mid \theta\}\, P\{\theta\}$, since

$$\frac{\pi(\theta^*)}{\pi(\theta)} = \frac{P\{D \mid \theta^*\}\, P\{\theta^*\} / P\{D\}}{P\{D \mid \theta\}\, P\{\theta\} / P\{D\}} = \frac{h(\theta^*)}{h(\theta)}$$

Note that the proposal ratio $q(\theta \mid \theta^*) / q(\theta^* \mid \theta)$ can be tricky to compute.

MCMC Example

[Figure: target distribution]

[Figure: initial point]

[Figure: proposal distribution]
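A minimal Metropolis-Hastings sampler for a one-dimensional target shows the acceptance rule in action. Everything specific here is an assumption made for illustration: the target (an unnormalized standard normal density standing in for $h(\theta)$), the symmetric uniform random-walk proposal, the step size, and the function names. Because the proposal is symmetric and no change of variables is involved, the proposal ratio and the Jacobian are both 1, so only $h(\theta^*)/h(\theta)$ remains.

```python
import math
import random


def log_h(theta):
    """Log of an unnormalized target density; a standard normal stands in for h."""
    return -0.5 * theta * theta


def metropolis_hastings(log_target, start, n_steps, step_size, rng):
    """Random-walk Metropolis-Hastings with a symmetric uniform proposal."""
    theta = start
    sample = []
    for _ in range(n_steps):
        proposal = theta + rng.uniform(-step_size, step_size)
        # Symmetric proposal: q(theta | theta*) / q(theta* | theta) = 1, |J| = 1.
        log_ratio = log_target(proposal) - log_target(theta)
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            theta = proposal             # accept: theta_{i+1} = theta*
        sample.append(theta)             # on rejection theta_{i+1} = theta_i
    return sample


chain = metropolis_hastings(log_h, start=0.0, n_steps=50_000,
                            step_size=2.5, rng=random.Random(0))
kept = chain[1_000:]                     # discard an initial burn-in portion
mean = sum(kept) / len(kept)
variance = sum((x - mean) ** 2 for x in kept) / len(kept)
print(mean, variance)
```

Working with log densities avoids numerical underflow, which matters in phylogenetics where likelihoods are products over thousands of sites. The sample mean and variance should land close to the target's 0 and 1, though the draws are dependent.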

First Proposal

[Figure: first proposal. Accept with probability 1]

Second Proposal

[Figure: second proposal. Accept with probability 0.153]

Third Proposal

[Figure: third proposal. Accept with probability 0.144]

Beginning of Sample

[Figure: sample so far]

Larger Sample

[Figure: larger sample]

Comparison to Target

[Figure: comparison of the sample to the target distribution]

Subtree Pruning Regrafting

See example from board. More details will be posted in a separate document.

Rescaling a Tree

More details will be posted in a separate document.

Cautions

- MCMC does not always converge.
- One should always run several chains with different random numbers and compare answers.
- If the true tree has some very short internal edges, Bayesian inference can mislead.
- Different likelihood models can lead to different results.

Bayesian Inference

- Development of Bayesian methods has led to continual improvement in our ability to model and learn about molecular evolution.
- Bayesian inference uses likelihood, but requires a prior distribution.
- Bayesian inference is computationally intensive, but can be less so than ML plus bootstrapping.
- Bayesian inference directly measures items of interest on an easily interpretable probability scale.
- Some folks dislike the requirement of specifying a prior distribution.