Phylogenetic Inference: Building rees Philippe Lemey and Marc. Suchard Rega Institute Department of Microbiology and Immunology K.U. Leuven, Belgium, and Departments of Biomathematics and Human Genetics David Geffen School of Medicine at UL Department of Biostatistics UL School of Public Health SISMID p.1 Intra-Host Viral Evolution Retroviruses (and HBV) exist as a quasi-species within infected patients: Shared substitutions may be insufficient to resolve intra-host phylogenies Improve resolution using joint model: 1195 env sequences from 9 HIV+ patients [taken from Rambaut et al. (2004)] Indel rates substitution rates Opportunity to detect intra-host recombination SISMID p.2
Reconstructing the ree of Life: re Humans Just Big Slime Molds? Branches lade Support.50.74.75.99 >.99 Scale.10 hermotoga hloroflexus hermus Escheria Streptomyces Pyrococcus Euryarchaeota Nanoarchaeum rchaeoglobus Ferroplasma Methanosarcina Halobacterium Methanothermobacter Methanococcus eropyrum Sulfolobus Pyrobaculum Eubacteria Eocytes Giardia Eukaryotes Entamoeba Saccharum Euglena rypanasoma Dictyostelium Saccharomyces Homo B Uncertain D ertain hermotoga ILVV D GPMPQREHVLLRQVEVPYMIVFINKDM VD DPELIDL hloroflex ILVVS PD GPMPQREHILLRQVQVPIVVFLNKVDM MD DPELLEL Escheria ILVV D GPMPQREHILLGRQVGVPYIIVFLNKDM VD DEELLEL Pyrococcus VLVV D GVMPQKEHFLRLGIKHIIVINKMDM VNYNQKRFEE Halobacterium VLVV DD GVPQREHVFLSRLGIDELIVVNKMDV VDYDESKYNE Methanococus VLVVNV DDKSGIQPQREHVFLIRLGVRQLVVNKMD VNFSEDYNE eropyrum ILVVSRKGEFEGMSE G QREHLLLRMGIEQIIVVNKMDPDVNYDQKRYEF Sulfol obus ILVVSKKGEYEGMSE G QREHIILSKMGINQVIVINKMDLDPYDEKRFKE Giardia ILVVGQGEFEGISKD G QREHLNLGIKMIIVNKMDDGQVKYSKERYDE Euglena VLVIDSGGFEGISKD G QREHLLYLGVKQMIVNKFDDKVKYSQRYEE Dictyostelium VLVISPGEFEGIKN G QREHLLYLGVKQMIVINKMDEKSNYSQRYDE Homo VLIVGVGEFEGISKN G QREHLLYLGVKQLIVGVNKMDSEPPYSQKRYEE hermotoga hloroflex Escheria Pyrococcus Halobacterium Methanococus eropyrum Sulfol obus Giardia Euglena Dictyostelium Homo GIDKDEVERGQVL PGSIKPHKRFKQIYVLKKEEGGRHPFKGYKPQFYIRDV GIERDVERGQVL PGSIKPHKKFEQVYVLKKEEGGRHPFFSGYRPQFYIRDV GIKREEIERGQVL KPGIKPHKFESEVYILSKDEGGRHPFFKGYRPQFYFRDV GVSKNDIKRGDVGHN PPVVRKDFKQIIVLNH PIVGYSPVLHHQV GIGKDDIRRGDVGPDD PPSV DFQQVVVMQH PSVIGYPVFHHQV GVGKKDIKRGDVLGHN PPV DFQIVVLQH PSVLDGYPVFHHQI GVSKSDIKRGDVGHLDK PPV EEFERIFVIWH PSIVGYPVIHVHSV GVEKKDVKRGDVGSVQN PPV DEFQVIVIWH PVGVGYPVLHVHSI GLVKDLKKGYVVGDVNDPPVG KSFQVIVMNH PKKIQPGYPVIDHHI NVSVKDIRRGYVSNKNDPKE DFQVIILNH PGQIGNGYPVLDHHI NVSVKEIKRGMVGDSKNDPPQE EKFVQVIVLNH PGQIHGYSPVLDHHI NVSVKDVRRGNVGDSKNDPPME GFQVIILNH PGQISGYPVLDHHI MP ree Partial u Plot ontentious issue among paleobiologists: Do rchaea (Euryarchaeota/Eocytes) form one or two domains? Weekly World News calls humans slime molds. SISMID p.3 he hicken or the (Small) Genome: Which ame First? Reptiles Mammals Birds log Genome Size 0.2 0.4 0.6 0.8 1.0 1.2 1.4 Evolutionary History and Genome Sizes of Reptiles, Dinosaurs, Birds and Mammals Issue: Bird genomes are markedly smaller than those from other vertebrates. Question: Did small genomes precede flight or co-evolve? SISMID p.4
Maximum Parsimony (MP) Most often used best, not even statistically consistent, but fast, fast, fast...if youknowthetree Key: Findtreewithminimal#of suspected substitutions (internal states are not observed, 0/1 model process) ounting minimum # of substitutions is easy Enumerating (searching through) all possible trees is hard Human GG himp GG Mouse Fly G G Site:1 2 3 4 5 6 7 8 9 10 long Molecular Sequence Sites are independent Pre Order raversal 5 6 7 1 2 3 4 {,G} X {,} G X Post Order raversal SISMID p.5 Maximum Parsimony (MP) littlehistory: nthony Edwards/Luca avalli-sforza (1963,1964) Both students of R. Fisher Introduced both parsimony and likelihood methods (for continuous quantities, e.g. gene frequencies) in one paper amin and Sokal (1965) provide first program for molecular sequences Fitch and Margoliash (1967) provide efficient algorithm Pre Order raversal 7 6 5 1 2 3 4 {,} {,G} X X G Post Order raversal SISMID p.6
Maximum Parsimony lgorithm procedure Fitch and Margoliash (1967) lgorithm cost 0 {Initialization} pointer k 2N 1 {at the root node} o obtain the set R k of possible states at node k {Recursion} if k is leaf then R k observed character for taxon k else ompute R i,r j for daughters i, j of k if R i R j 0then R k R i R j 7 else R k R i R 6 j +1 5 end if end if minimum cost is {ermination} Pre Order raversal 1 2 3 4 {,} {,G} X X G Post Order raversal SISMID p.7 Searching for the MP ree omplexity: Find MP score is NP-complete 5 Branches B 3 Branches 5 Branches B B B D B D 7 Branches Find MP tree is NP-hard B B D 5 Branches Recall that # of N-taxon rooted trees is 3 5 2N 3 ttack exponential-order space Branch-and-Bound: Monotonic order: min PS 2 min PS 3... Bound if min PS k > best n-taxon PS found so far. SISMID p.8
Neighbor-Joining (Saitou and Nei, 1987) omputational algorithm: alignment single tree Site (data) Pairwise distances = 0 = 1 = 1 1 2 3 2 3 4 10 12 15 30 20 25 Join (closest) neighbors 12* 1 2 Recompute distances (using 12*,3,4) dvantages: veryfast,greatfor1000sofsequences Disadvantages: nosite-to-siteratevariation,nonatural ways to compare trees/measure data support SISMID p.9 Neighbor-Joining aveat: Pairs i, j with min d ij are not necessarily nearest neighbors. E.g., d B =3<d =5 1 B 1 1 4 4 Solution: Subtractofftheaveragedistancestoallother leaves via 1 D ij = d ij (r i + r j ), r i = d ik, L 2 k L where L is the current set of leaves. Proof in Studier and Keppler (1988). omputational: O(N 3 ) D SISMID p.10
Likelihood-based Methods (Felsenstein, 1973) Statistical technique: assumes an unknown tree and a stochastic model for character change along the tree Site (data) state G ontinuous ime Markov hain time Unknown tree/internal node states dvantages: site-to-siterate/treevariationiseasy,can formulate probability statements Disadvantages: must search tree-space slow Foundation of Bayesian Phylogenetics SISMID p.11 raditional Phylogenetic Reconstruction Human himp Mouse Fly Reconstruction Example Site: 1 Human himp G G G G G G 2 3 4 5 6 7 8 9 10 long Molecular Sequence Mouse Fly Substitution: singleresidue replaces another Insertion/deletion: residues are inserted or deleted Statistical Model ssume: Homologous sites are iid and site patterns (e.g. dotted box) XY...Z Multinomial(p XY...Z ) where p XY...Z is determined by an unknown tree τ, branch lengths t and continuous-time Markov chain model (for residue substitution) given by infinitesimal rate matrix Q P(X Y in time t) =e tq SISMID p.12
M(Q) =ɛ Normal(µ, σ 2 ) of Phylogenetics ontinuous in elapsed time t, discrete in starting/ending state! Memory-less process in which the probability that state b replaces state a during (t, t + s) =sq ab + o(s) state G time Infinitesimal generator matrix Q has off-diagonal entries q ab and row sums =0 hink: ExponentialwaitingtimewithrateR a = b q ab until chain leaves a. hen the new state b is independently chosen with probabilities q ab /R a SISMID p.13 From Infinitesimal to Finite ime Let p ab (t) =the finite-time probability of the chain moving from state a at time 0 to state b at time t, thenmatrix P (t) ={p ab (t)} satisfies with solution d P (t) =P (t)q where P (0) = I dt P (t) =e tq = I + tq + 1 2 (tq)2 + = k=0 1 k! (tq)k as d dt etq = Qe tq = e tq Q for t real SISMID p.14
Example: wo-state Model onsider purines (R) pyrimidines (Y). Kolmogorov forward equation: R α Y p RY (t + s) =p RR (t)αs + p RY (t)(1 s)+o(s) yielding d dt p RY(t) =αp RR (t) p RY (t) Q = ( α α ) Solutions of P (t) = e tq have the form with eigenvalues 0 and (α + ) c + de (α+)t SISMID p.15 Standard Ms for Phylogenetics Jukes and antor (J69), π a = 1 4, κ 1 = κ 2 =1 Kimura (K80), π a = 1 4, κ 1 = κ 2 Hasegawa, Kishino and Yano (HKY85), κ 1 = κ 2 (most common) amura and Nei (N93), right κ 1 κ 2 General ime Reversible (GR) Note identifiability concern in e tq.ommonsolutionistofix 1d.f.suchthat q aa π a = 1 a Scaling: t =1 1 expected substitution per site π π G π G π SISMID p.16
Explicit Parameterization of N93 Nucleotides mutate according to a Markovian process Pr(X Y in time t) =e tq Nuc where Q Nuc is a 4x4 infinitesimal rate matrix and t is a branch length. π π π G κ 1 0 G κ 1 π G π π κ 1 π π π Q Nuc = B @ π π G κ 2 π κ 2 π π G κ 2 π π 1, where κ 1, κ 2 are transition:transversion rate ratios and π is the stationary distribution of {,G,,}. controls the overall rate and can vary from site-to-site. SISMID p.17 Site-to-Site Rate Variation Variation occurs quite naturally and is also an important inference short range: codon phase (slow-slow-fast) long range: enzymatic active sites, protein folding, immunological pressures/selection ssume: infinitesimalratesforsitek are r k t q ab.variouspriorson r k with E(r k )=1.ImplicitlyBayesian Yang (1994) discretized Gamma distribution SISMID p.18
General ime Reversible M Let Q = RD π where R is symmetric and D π is a diagonal matrix composed of the stationary distribution π. Detailed balance π a q ab = π b q ba.balance+ irreducibility reversible Note Q is similar to R, asd 1/2 QD 1/2 = R Hence, Q must have real eigenvalues and real eigenvectors he properties speed up computation of the finite-time transition matrix P (t) =e tq SISMID p.19 alculating the Probability of a Single Site Pattern Y i Given the tree and unobserved internal node states, the probability is the product of the finite time mutation probabilities over all branches: t1 t 2 t Y t 5 4 X t 3 G L(Y i ) p G = Pr(Y,t 1 )Pr(X G,t 2 ) X Y Pr(X,t 3 )Pr(Y,t 4 )Pr(X Y,t 5 )π X (1) Number of sumants grow rapidly in N sum-product/peeling algorithm to distribute sums across the product SISMID p.20
Pruning lgorithm Felsenstein (1981) Let P (L k a) =likelihoodofleavesbelownodek given k is in state a. hen,recursively compute P (L k a) given P (L i b) and P (L j c) for daughters i, j of k: Set pointer k 2N 1 {the root, initialization} Node k ompute P (L k a) a as follows: {recursion} if k is a leaf node then if a is observed then P(L P (L k a) =1 i b) else P (L k a) =0 P(L end if j c) else ompute P (L i a) and P (L j a) a for daughters i, j of k {post-order traversal} P (L k a) = P P b c Pr(a b, t i)p (L i b) Pr(a c, t j )P (L j c) end if L(Y i ) P a P (L 2n 1 a)π a {termination} a b a c SISMID p.21 ML ree or MP ree? Reporting uncertainty on tree estimates: he Bootstrap Most common ssumes evolutionary events are reproducible. If I went back out to the field and recollected exchangeable data... Bayesian inference Returns the probability of a tree given the observed data and model Requires MM (e.g., MrBayes or BES) dvantages Does not rely on asymptotics (hypothesis testing) Naturally incorporates uncertainty in all parameters (including discrete quantities: trees, site-classifications, etc.) rguably faster algorithms Disadvantages Must specify (justifiable) prior distributions SISMID p.22