Estimating Evolutionary Trees. Phylogenetic Methods


Estimating Evolutionary Trees
- if the data are consistent with infinite sites, then all methods should yield the same tree
- it gets more complicated when there is homoplasy, i.e., parallel or convergent mutations at the same position
- more than one tree may be equally good as a hypothesis of the genealogical history

Phylogenetic Methods
- UPGMA (single-pass algorithm)
- neighbor-joining (single-pass algorithm)
- parsimony
  - search more or less exhaustively for the tree with the smallest number of steps (mutations) required to explain the data
- maximum likelihood
  - search more or less exhaustively for the tree (topology and branch lengths) that maximizes the likelihood of the observed data
- Bayesian MCMC methods
  - summarize the posterior distribution of trees to estimate the probability of clades in the tree

Does it matter for pop gen?
- we don't need to know the genealogy for each locus to make inferences or estimate population genetic parameters
- but analyzing data that are not consistent with infinite sites requires more complex coalescent and/or mutation models

Gene Trees versus Species Trees
- reciprocal monophyly

Gene Trees versus Species Trees
- incomplete lineage sorting

The Lineage Sorting Process
- speciation at time X
  - ancestral polymorphism retained
- the gene tree is polyphyletic for both species between times X and Y
- the gene tree is paraphyletic within species between times Y and Z
- reciprocal monophyly at time Z

Gene Trees versus Species Trees
- incongruence
  - between gene tree and species tree
  - and between different gene trees

Probability of Incongruence

$$P(\text{incongruence}) = \frac{2}{3}\, e^{-t/2N}$$

- for the simple 3-taxon case, where t is the number of generations between speciation events, with one sample per taxon
- also applies when lineage sorting is complete within each of the terminal taxa
  - incongruence as a result of incomplete lineage sorting in the past
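As a rough numerical illustration of the three-taxon formula, the sketch below evaluates (2/3)·exp(−t/2N) for a few internode lengths. The function name `prob_incongruence` and the value N = 10,000 are illustrative assumptions, not part of the lecture.

```python
import math


def prob_incongruence(t_generations: float, n_effective: float) -> float:
    """Probability that a gene tree disagrees with the species tree
    in the 3-taxon case: (2/3) * exp(-t / (2N))."""
    return (2.0 / 3.0) * math.exp(-t_generations / (2.0 * n_effective))


if __name__ == "__main__":
    N = 10_000  # hypothetical effective population size
    for t_over_2n in (0.0, 0.5, 1.0, 2.0, 5.0):
        t = t_over_2n * 2 * N  # internode length in generations
        print(f"t/2N = {t_over_2n}: P(incongruence) = {prob_incongruence(t, N):.3f}")
```

The probability starts at 2/3 when the internode length is zero (all three topologies equally likely) and decays exponentially as the internode gets longer relative to 2N.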

The lasting effects of incomplete lineage sorting
[Figure: three species (S1, S2, S3) descending from an ancestral population; probability of mtDNA and nuclear gene trees matching the species tree as a function of internode length. Moore 1995, Evolution 49, 718-726]

Interpreting Single Gene Trees?
- human mtDNA tree
  - consistent with the "out of Africa" hypothesis

Avise et al. 1990 Evolution

Other causes of incongruence
- hybridization/introgression/horizontal transfer
- balancing selection
- gene duplication and loss

Introgression plus Selective Sweep
[Figure: species tree of A, B, C through time compared with the gene tree for A, B, C; introgression followed by a selective sweep]

Balancing Selection
- results in a balanced allele frequency maintained by frequency-dependent selection
- can maintain pre-existing alleles over long stretches of time
[Figure: gene trees for H, C, G. From Klein, Takahata, Ayala 1993]

Gene Duplication and Loss
[Figure: gene duplication; actual phylogeny of a, b, c, d compared with the apparent phylogeny of a, b, c, d]

[Figure: phylogeny of α subunits of voltage-gated calcium channels. Piontkivska & Hughes, 2003, JME]

Approaches for making inferences/estimating parameters
- direct estimates from summary statistics
  - e.g., $F_{ST} = \dfrac{1}{1 + 4Nm}$, so that $4Nm = \dfrac{1 - F_{ST}}{F_{ST}}$
  - but this typically requires significant assumptions
  - genetic equilibrium, constant population size, etc.
- simple coalescent simulations to generate confidence intervals
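The F_ST relationship can be inverted directly. This is a minimal sketch, assuming the equilibrium assumptions noted on the slide hold; the function names are hypothetical.

```python
def fst_from_nm(nm: float) -> float:
    """Expected F_ST for a given number of migrants per generation (Nm):
    F_ST = 1 / (1 + 4Nm)."""
    if nm < 0:
        raise ValueError("Nm must be non-negative")
    return 1.0 / (1.0 + 4.0 * nm)


def nm_from_fst(fst: float) -> float:
    """Invert the same relationship: Nm = (1 - F_ST) / (4 * F_ST)."""
    if not 0.0 < fst <= 1.0:
        raise ValueError("F_ST must be in (0, 1]")
    return (1.0 - fst) / (4.0 * fst)


if __name__ == "__main__":
    for fst in (0.05, 0.2, 0.5):
        print(f"F_ST = {fst}: Nm = {nm_from_fst(fst):.2f}")
```

Because the mapping is steeply nonlinear at low F_ST, small errors in F_ST translate into large errors in Nm, which is one reason the slide cautions about the underlying assumptions.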

[Figure: distribution of θ_S estimates, with histograms for k = 10 and k = 20]

More sophisticated approaches for making inferences/estimating parameters
- start with historical model

MIGRATE-N
- simulates N populations connected by gene flow
- estimates population sizes and migration rates (both scaled by N and µ)
- equilibrium model
- coalescence of all samples requires migration between demes because populations do not merge as you go back in time

Beerli P, Felsenstein J (1999) Maximum-likelihood estimation of migration rates and effective population numbers in two populations using a coalescent approach. Genetics 152, 763-773.
Beerli P, Felsenstein J (2001) Maximum likelihood estimation of a migration matrix and effective population sizes in n subpopulations by using a coalescent approach. PNAS 98, 4563-4568.

IM - Isolation with Migration
- model of population divergence with gene flow
- estimates population sizes, migration rates and divergence time(s)

Approaches for making inferences/estimating parameters
- Bayesian MCMC analyses to estimate demographic and historical parameters
  - based either on maximum likelihood and the Felsenstein equation or on summary statistics (Approximate Bayesian Computation, ABC)
  - the Felsenstein equation gives the likelihood of the data given a set of model parameters, Θ

$$\Pr(X \mid \Theta) = \int_G \Pr(X \mid G)\, p(G \mid \Theta)\, dG$$

where X is the data, Θ is the set of model parameters, and G is the set of all possible genealogies given Θ
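The term p(G | Θ) is the distribution of genealogies under the model. As a minimal sketch of one ingredient, assuming the simplest case of a single constant-size population (not the migration or divergence models named above): while j lineages remain, the waiting time to the next coalescence is exponential with rate j(j−1)/2 when time is measured in units of 2N generations. The function name and seed are illustrative.

```python
import random


def coalescence_times(k: int, rng: random.Random) -> list[float]:
    """Draw waiting times T_k, T_(k-1), ..., T_2 for a sample of k alleles
    under the standard constant-size coalescent (time in units of 2N generations)."""
    times = []
    for j in range(k, 1, -1):
        rate = j * (j - 1) / 2.0  # number of lineage pairs that can coalesce
        times.append(rng.expovariate(rate))
    return times


if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed only so the sketch is repeatable
    times = coalescence_times(10, rng)
    print("waiting times:", [round(t, 3) for t in times])
    print("time to most recent common ancestor:", round(sum(times), 3))
```

Averaged over many replicates, the summed times approach the expected time to the most recent common ancestor, 2(1 − 1/k) in these units.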

Calculating the likelihood of the data for a given genealogy
- given a model of sequence evolution, a tree (= genealogy) with branch lengths, and observed character states (DNA sequences in the samples)...
- we can calculate the likelihood (probability) of the data at a given sequence position

[Figure: a tree/genealogy with branch lengths t_1, ..., t_8 and the data (A, C, C, C, G) at a single DNA sequence position; x, y, z, w are the unobserved states at the internal nodes, with x at the root]

$$\Pr(X_i \mid G) = \sum_x \sum_y \sum_z \sum_w \Pr(A, C, C, C, G, x, y, z, w \mid G)$$

where each term in the sum is

$$\Pr(x)\,\Pr(y \mid x, t_6)\,\Pr(A \mid y, t_1)\,\Pr(C \mid y, t_2)\,\Pr(z \mid x, t_8)\,\Pr(C \mid z, t_3)\,\Pr(w \mid z, t_7)\,\Pr(C \mid w, t_4)\,\Pr(G \mid w, t_5)$$

and Pr(x) is the probability of state x at the root.

  - in this example, this quantity is summed over 256 (= 4^4) possible combinations of x, y, z, w
  - the number of calculations increases exponentially with more taxa, but computational shortcuts are employed
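The single-site calculation can be sketched by brute force, summing over all 256 assignments of x, y, z, w. The slide does not name a substitution model, so this sketch assumes Jukes-Cantor (JC69) with a uniform root distribution, and the branch lengths are made-up values in expected substitutions per site.

```python
import itertools
import math

BASES = "ACGT"


def jc69(i: str, j: str, t: float) -> float:
    """JC69 probability of state j at the end of a branch of length t,
    given state i at the start (t in expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(t: dict[int, float]) -> float:
    """Likelihood of tips (A, C, C, C, G) on the slide's tree.
    t maps 1..8 to the branch lengths t_1..t_8."""
    total = 0.0
    for x, y, z, w in itertools.product(BASES, repeat=4):
        total += (
            0.25                      # Pr(x): uniform root state (assumption)
            * jc69(x, y, t[6]) * jc69(y, "A", t[1]) * jc69(y, "C", t[2])
            * jc69(x, z, t[8]) * jc69(z, "C", t[3])
            * jc69(z, w, t[7]) * jc69(w, "C", t[4]) * jc69(w, "G", t[5])
        )
    return total


if __name__ == "__main__":
    branch_lengths = {i: 0.1 for i in range(1, 9)}  # hypothetical values
    print("site likelihood:", site_likelihood(branch_lengths))
```

Real implementations avoid the exponential sum with Felsenstein's pruning algorithm, which computes conditional likelihoods node by node from the tips toward the root.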

Calculating the likelihood of the data for a given genealogy
- the overall likelihood of the data is the product of the likelihoods for individual sites, or the sum of the ln likelihoods

$$L = \Pr(X \mid G) = \prod_{i=1}^{m} \Pr(X_i \mid G), \qquad \ln L = \sum_{i=1}^{m} \ln L_i$$

In practice
- for a sample of k alleles, draw random coalescence times from the exponential distribution, as appropriate given the historical and demographic model parameters
- estimate the likelihood (probability) of the observed DNA sequences for genealogies generated under the model

$$\Pr(X \mid \Theta) = \int_G \Pr(X \mid G)\, p(G \mid \Theta)\, dG$$

- change a model parameter (according to carefully designed rules), generate a new set of genealogies and calculate the likelihood
- we now have two results

In practice
- if the new result is better, accept the new set of model parameters (x′) and continue the process by taking another step in the Markov chain (i.e., updating a model parameter, generating genealogies, etc.)
- if the result is worse, either accept the new set of model parameters (x′) or go back to the previous set of parameters (x), with the coin-flip probabilities as defined by the Metropolis-Hastings algorithm

$$A(x \to x') = \min\left(1,\ \frac{P(x')}{P(x)} \cdot \frac{g(x \mid x')}{g(x' \mid x)}\right)$$

- repeat millions of times

Markov Chain Monte Carlo methods
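The accept/reject rule can be sketched on a toy problem. This is not the genealogy sampler used by the programs discussed here: the target is an unnormalized standard normal density, the proposal is a Gaussian random walk, and all names are illustrative. Densities are handled on the log scale to avoid underflow.

```python
import math
import random


def metropolis_hastings(log_target, propose, log_proposal, x0, n_steps, rng):
    """Generic Metropolis-Hastings chain.
    log_proposal(a, b) is log g(a | b), the log density of proposing a from b."""
    x = x0
    chain = [x]
    for _ in range(n_steps):
        x_new = propose(x, rng)
        # log of [P(x')/P(x)] * [g(x | x') / g(x' | x)]
        log_ratio = (log_target(x_new) - log_target(x)
                     + log_proposal(x, x_new) - log_proposal(x_new, x))
        # Better proposals are always accepted; worse ones with probability exp(log_ratio)
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = x_new
        chain.append(x)
    return chain


def log_std_normal(x):
    return -0.5 * x * x  # unnormalized; the constant cancels in the ratio


def random_walk(x, rng):
    return x + rng.gauss(0.0, 1.0)


def log_random_walk_density(a, b):
    return -0.5 * (a - b) ** 2  # symmetric, so the Hastings term cancels here


if __name__ == "__main__":
    rng = random.Random(0)
    chain = metropolis_hastings(log_std_normal, random_walk,
                                log_random_walk_density, 0.0, 20000, rng)
    mean = sum(chain) / len(chain)
    print("sample mean:", round(mean, 3))
```

With a symmetric proposal the g terms cancel and the rule reduces to the plain Metropolis ratio P(x′)/P(x); the general form is kept so that an asymmetric proposal can be swapped in.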

Gutenkunst RN, Hernandez RD, Williamson SH, Bustamante CD (2009) Inferring the joint demographic history of multiple populations from multidimensional SNP frequency data. PLoS Genetics 5, e1000695.
- ∂a∂i uses the joint allele frequency distribution (j.a.f.d.) as the observed input data
- uses the diffusion approximation to estimate the expected j.a.f.d. for a given set of model parameters
- and then calculates the likelihood of the observed data based on the above

[Figure: observed joint allele frequency distribution (SNP counts per cell); x-axis: West African allele frequency, 0 to 20 (n = 10 birds, 20 alleles); y-axis: East African allele frequency, 0 to 20 (n = 10 birds, 20 alleles)]
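To make the input data concrete, the sketch below tabulates a joint allele frequency distribution from per-SNP allele counts in two populations and scores it against an expected distribution. This does not use the ∂a∂i API; the function names are hypothetical, the SNP counts are invented, and the Poisson form of the likelihood (each cell treated as an independent Poisson count) is an assumption of this sketch.

```python
import math


def joint_sfs(counts_pop1, counts_pop2, n1, n2):
    """Count SNPs by (allele count in pop 1, allele count in pop 2).
    n1 and n2 are the numbers of sampled alleles in each population."""
    sfs = [[0] * (n2 + 1) for _ in range(n1 + 1)]
    for i, j in zip(counts_pop1, counts_pop2):
        sfs[i][j] += 1
    return sfs


def poisson_log_likelihood(observed, expected):
    """Sum of per-cell Poisson log probabilities; cells with zero
    expectation are skipped."""
    ll = 0.0
    for obs_row, exp_row in zip(observed, expected):
        for obs, exp in zip(obs_row, exp_row):
            if exp > 0:
                ll += obs * math.log(exp) - exp - math.lgamma(obs + 1)
    return ll


if __name__ == "__main__":
    # Invented allele counts for four SNPs, with 2 and 3 sampled alleles
    pop1 = [1, 1, 0, 2]
    pop2 = [0, 0, 3, 2]
    observed = joint_sfs(pop1, pop2, n1=2, n2=3)
    expected = [[0.5] * 4 for _ in range(3)]  # placeholder model expectation
    print(observed)
    print(poisson_log_likelihood(observed, expected))
```

In an actual analysis the expected table would come from the demographic model, and the parameters would be adjusted to maximize this likelihood.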