Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Outline
- Basic Concepts
- Tree Construction Methods
  - Distance-based methods
  - Character-based methods
- Bootstrap Analysis

What is Phylogenetics
The term phylogenetics derives from the Greek phyle (φυλή), "tribe" or "race", and genetikos (γενετικός), "relative to birth". Phylogenetics is the study or estimation of the evolutionary history that underlies biological diversity. This organization is described visually through "trees".

Molecular phylogenetics
Molecular phylogenetics is the study of evolutionary relationships among organisms or genes by using molecular data (e.g., DNA or protein sequences) and statistical techniques. It is called molecular systematics when the relationships among organisms are the concern.

Why Study Molecular Phylogenetics? Provides insights into relationships among organisms - through "species" trees. Provides insights into the evolution and history of genes - through "gene" trees.

Example: Phylogeny of human and apes
Apes
- Great apes: chimpanzee, bonobo (pygmy chimp), gorilla, orangutan
- Lesser apes: gibbon

White-handed gibbon Hylobates lar

Borneo orangutan

Western lowland gorilla

Common chimpanzee

Homo sapiens

Traditional view
[Figure: species tree of human, chimpanzee, gorilla, orangutan, and gibbon]

Same tree, different presentations
[Figure: the same tree of human, chimpanzee, gorilla, orangutan, and gibbon drawn as a rectangular tree and as a curved tree]

An example of a gene tree (Lin et al., PNAS 2006)

Objectives
- To understand the basic concepts and terminology of molecular phylogenetics
- To understand tree topologies and how to read them
- To understand the basic concepts of different tree-building methods

Tree Terminology Topology: the branching pattern of a tree

Nodes and OTUs
- External (or terminal) nodes: nodes at the tips of the tree (nodes A, B, C, D, and E). They represent extant taxonomic units, called operational taxonomic units (OTUs).
- Internal nodes: all others (F, G, H, I). Internal nodes represent ancestral units.
[Figure: tree with external nodes A to E and internal nodes F to I]

(a) Rooted tree: one sequence (the root, R) is the most basal ancestor of the tree.
(b) Unrooted tree: unrooted trees do not imply a known ancestral root.
[Figure: (a) rooted tree with root R, drawn along a time axis, and (b) unrooted tree of OTUs A to E]

The branches of a phylogenetic tree may be represented in two different ways:
- Unscaled: branch lengths are not proportional to the number of changes on the branches; the tree shows topology only.
- Scaled: branch lengths are proportional to the number of changes that have occurred on the branches.
[Figure: the same tree drawn (a) unscaled, with the number of changes written on each branch, and (b) scaled, with a 1-unit scale bar]

Bifurcating vs. Multifurcating
- Bifurcating tree: in a rooted tree, every node has exactly two immediate descendant lineages.
- Multifurcating tree: in a rooted tree, some nodes have more than two immediate descendant lineages.
[Figure: bifurcating and multifurcating trees of human, chimpanzee, gorilla, orangutan, and gibbon]

Classification of Tree Reconstruction Methods
1. Distance methods (distance matrix methods)
2. Character state methods
Inferring a phylogeny is an estimation procedure.

[Flowchart: data set collection, then multiple sequence alignment, then tree construction, then a test of the tree's reliability by analytical and/or resampling procedures. Tree construction branches into character-based methods (parsimony, maximum likelihood) and distance-based methods (UPGMA, neighbor joining, Fitch-Margoliash, Kitch), with an "optimal criteria" label among the methods.]

Inference Procedure: Two steps (1) Define an optimality criterion, or objective function, i.e., the value that is assigned to a tree and used to compare one tree to another (2) Develop algorithms to compute the value of the objective function, thereby to identify the tree or a set of trees that have the best values according to this criterion.

Optimality criterion Global optimality criterion: Consider all possible trees Local optimality criterion: Consider only a limited number of trees at each stage of tree reconstruction.

Types of data used in phylogenetic inference
- Character-based methods: use the aligned characters, such as DNA or protein sequences, directly during tree inference.
- Distance-based methods: transform the sequence data into pairwise distances, and use the resulting matrix during tree building.

Types of Data: Characters and distances
Character: a nucleotide at a site in a DNA sequence, or an amino acid at a site in a protein sequence.

Sequence 1: GACTGGTAC-A
Sequence 2: GATTGGTAC-A
Sequence 3: GATAGGCACTA
Sequence 4: GACAAGCACTA

- Binary state characters: insertions and deletions
- Multi-state characters: 4 nucleotides or 20 amino acids

Distance Matrix Methods The evolutionary distances for all pairs of taxa are presented in a matrix. Algorithms based on some functional relationships among the distance values
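As a minimal sketch of how such a matrix could be filled in, the snippet below counts mismatched sites for every pair of the four example sequences from the "Characters and distances" slide. Two things are assumptions rather than anything the slides specify: the distance is taken to be a raw mismatch count, and any column where either sequence has a gap is skipped. The function name `mismatch_counts` is purely illustrative.

```python
from itertools import combinations


def mismatch_counts(seqs):
    """Return {(name1, name2): number of differing sites} for aligned sequences.

    Assumption: columns where either sequence has a gap ('-') are ignored.
    """
    dist = {}
    for (n1, s1), (n2, s2) in combinations(seqs.items(), 2):
        if len(s1) != len(s2):
            raise ValueError("sequences must be aligned to equal length")
        dist[(n1, n2)] = sum(
            1 for a, b in zip(s1, s2) if a != b and a != "-" and b != "-"
        )
    return dist


# The four example sequences from the slide on characters and distances.
seqs = {
    "seq1": "GACTGGTAC-A",
    "seq2": "GATTGGTAC-A",
    "seq3": "GATAGGCACTA",
    "seq4": "GACAAGCACTA",
}
distances = mismatch_counts(seqs)
for pair, d in distances.items():
    print(pair, d)
```

In practice the raw counts would usually be converted to a corrected evolutionary distance before tree building; that step is left out here.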

Distance Methods - Example
[Figure: distances between sequences summarized in a distance table]

Distance-based method Unweighted Pair-Group Method with Arithmetic Mean (UPGMA)

Unweighted Pair-Group Method with Arithmetic Mean (UPGMA) Sequential clustering algorithm Local topological relationships are inferred in order of decreasing similarity and a phylogenetic tree is built in a stepwise manner.

Distance Methods - UPGMA
Unweighted Pair Group Method with Arithmetic mean
- A molecular clock is assumed.
- Clustering: all leaves are iteratively merged according to their distance.
- The result is a distance tree.

A  -GCTTGTCCGTTACGAT
B  ACTTGTCTGTTACGAT
C  ACTTGTCCGAAACGAT
D  -ACTTGACCGTTTCCTT
E  AGATGACCGTTTCGAT
F  -ACTACACCCTTATGAG

     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

Molecular Clocks
The Molecular Clock Hypothesis, proposed by Zuckerkandl and Pauling (1965):
- The rate of evolution in a given protein (or DNA) molecule is approximately constant over time and among evolutionary lineages.
- The amount of genetic difference between sequences is statistically proportional to the time since separation.

OTU   A           B           C
B     $d_{AB}$
C     $d_{AC}$    $d_{BC}$
D     $d_{AD}$    $d_{BD}$    $d_{CD}$

Assume that $d_{AB}$ has the smallest value. OTUs A and B are the first to be clustered, with the branching point at a distance of $d_{AB}/2$. OTUs A and B are now considered as a single OTU, called a composite OTU.

OTU   (AB)           C
C     $d_{(AB)C}$
D     $d_{(AB)D}$    $d_{CD}$

$d_{(AB)C} = (d_{AC} + d_{BC})/2$
$d_{(AB)D} = (d_{AD} + d_{BD})/2$

Assume that $d_{(AB)C}$ is the smallest: the branching point is at a distance of $d_{(AB)C}/2$.

[Figure: stepwise construction of the UPGMA tree: (a) A and B joined at $d_{AB}/2$; (b) C joined to (AB) at $d_{(AB)C}/2$; (c) D joined to (ABC) at $d_{(ABC)D}/2$]

Distance between two Composite OTUs
For composite OTUs (ij) and (mn):

$d_{(ij)(mn)} = (d_{im} + d_{in} + d_{jm} + d_{jn})/4$

This is an arithmetic mean, with the same weight for all pairs of OTUs.

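Distance Methods - UPGMA: first round
Choose the most similar pair, cluster them together, and calculate the new distance matrix.

     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

A and B are the closest pair ($d_{AB} = 2$), so they are joined with branch lengths of 1 each. The new distances are:

$d_{(AB)C} = (d_{AC} + d_{BC})/2 = 4$
$d_{(AB)D} = (d_{AD} + d_{BD})/2 = 6$
$d_{(AB)E} = (d_{AE} + d_{BE})/2 = 6$
$d_{(AB)F} = (d_{AF} + d_{BF})/2 = 8$

      A,B  C  D  E
C     4
D     6    6
E     6    6  4
F     8    8  8  8

Distance Methods - UPGMA: second round
D and E are joined (distance 4, branch lengths of 2 each). The new distances are:

$d_{(DE)(AB)} = (d_{D(AB)} + d_{E(AB)})/2 = 6$
$d_{(DE)C} = (d_{DC} + d_{EC})/2 = 6$
$d_{(DE)F} = (d_{DF} + d_{EF})/2 = 8$

      A,B  C  D,E
C     4
D,E   6    6
F     8    8  8

Distance Methods - UPGMA: third round
(AB) and C are joined (distance 4, branching point at 2). The new distances are:

$d_{(ABC)(DE)} = (d_{(AB)(DE)} + d_{C(DE)})/2 = 6$
$d_{(ABC)F} = (d_{(AB)F} + d_{CF})/2 = 8$

      AB,C  D,E
D,E   6
F     8     8

Distance Methods - UPGMA: fourth round
(ABC) and (DE) are joined (distance 6, branching point at 3).

      ABC,DE
F     8

Distance Methods - UPGMA: fifth round
$d_{(ABCDE)F} = (d_{(ABC)F} + d_{(DE)F})/2 = 8$

F is joined last (distance 8, branching point at 4).
[Figure: the UPGMA tree of A to F after each round, with branch lengths labeled]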
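The rounds above can be reproduced with a short script. This is a sketch under stated assumptions, not the slides' own implementation: the distance between two clusters is the arithmetic mean over all pairs of original OTUs, as in the composite-OTU formula, which for this particular matrix gives the same values as the pairwise averages shown in the rounds. Ties are broken by scan order, so C may join (AB) before D joins E; the resulting heights are the same. The names `upgma` and `cluster_distance` are illustrative.

```python
def upgma(names, dist):
    """Cluster OTUs with UPGMA.

    dist maps frozenset({x, y}) -> distance between original OTUs x and y.
    Returns a list of (cluster_a, cluster_b, height) merges, where height is
    half the distance between the two clusters at the time they are joined.
    """
    clusters = [(name,) for name in names]
    merges = []

    def cluster_distance(c1, c2):
        # Arithmetic mean over all pairs of original OTUs in the two clusters.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = cluster_distance(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d / 2))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges


# Lower-triangular distance matrix from the worked example (OTUs A to F).
names = "ABCDEF"
lower = {
    "B": [2],
    "C": [4, 4],
    "D": [6, 6, 6],
    "E": [6, 6, 6, 4],
    "F": [8, 8, 8, 8, 8],
}
dist = {
    frozenset((row, names[col])): value
    for row, values in lower.items()
    for col, value in enumerate(values)
}

merges = upgma(names, dist)
heights = [h for _, _, h in merges]
for a, b, h in merges:
    print("".join(a), "+", "".join(b), "at height", h)
```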

Pairwise distances among human and apes:

          Human  Chimp  Gorilla
Chimp     1.24
Gorilla   1.62   1.63
Orang     3.08   3.12   3.09

[Figure: tree of human, chimpanzee, gorilla, and orangutan, labeled with branch lengths 0.62, 0.20, and 0.73 and with divergence-time estimates of 4.8 ~ 6.4 Mya, 6.3 ~ 8.5 Mya, and 12 ~ 16 Mya]
Mya: million years before present

Common chimpanzee: our closest relative

Suppose that you have the following tree: since the divergence of A and B, B has accumulated mutations at a much higher rate than A. http://evolution-textbook.org/content/free/book/toc.html

Conclusion: The unequal rates of mutation have led to a completely different tree topology.

UPGMA
- Good for explaining some basic concepts and principles in tree reconstruction.
- The UPGMA clustering method is very sensitive to unequal evolutionary rates. When one of the OTUs has incorporated more mutations over time than another, one may end up with a tree that has the wrong topology.
- There are better methods.

Distance-based method Neighbors-relation method

Neighbors
Two OTUs are said to be neighbors if they are connected through a single internal node. In Fig. (a), A and B are neighbors, and so are D and E, but A and C are not.
[Figure (a): unrooted tree of OTUs A, B, C, D, and E]

However, if A and B are treated as a composite OTU, then (AB) and C are neighbors.
[Figure: the same tree with A and B replaced by the composite OTU (AB)]

Additive Trees
A molecular clock defines additive distances. A tree is said to be additive if the distance between any two nodes is equal to the sum of the lengths of all the branches connecting them, as in the example below.

     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

[Figure: tree of A to F whose branch lengths add up to the distances in the matrix]

Four-Point Condition
Consider a tree in which A and B are neighbors (terminal branches of lengths a and b), C and D are neighbors (terminal branches of lengths c and d), and x is the length of the internal branch. If additivity holds,

$d_{AC} + d_{BD} = d_{AD} + d_{BC} = (a + x + c) + (b + x + d) = a + b + c + d + 2x = d_{AB} + d_{CD} + 2x$,

so

$d_{AB} + d_{CD} < d_{AC} + d_{BD}$
$d_{AB} + d_{CD} < d_{AD} + d_{BC}$

These two inequalities are the four-point condition. Conversely, for four OTUs with unknown phylogeny, the two conditions can be used to identify the neighbors. Once the two pairs of neighbors are determined, the tree topology is determined.

Sattath and Tversky (1977): Neighbors relation method
First, compute a distance matrix. For every possible quadruple, say OTUs i, j, m, and n, compute
(1) $d_{ij} + d_{mn}$
(2) $d_{im} + d_{jn}$
(3) $d_{in} + d_{jm}$
and choose the two neighbor pairs from the smallest of the three sums.

Neighbors relation method
Each pair that is chosen as neighbors receives a score of 1; every other pair receives 0. After every possible quadruple has been considered, the pair with the highest score is chosen as the first pair of neighbors. They are then treated as a composite OTU, and a new distance matrix is computed to search for the next pair of neighbors. This procedure is repeated until the number of OTUs is reduced to 3.

Distance matrix (n_XY below denotes the distance between OTUs X and Y):

     A   B   C   D
B    22
C    39  41
D    39  41  18
E    41  43  20  10

Chosen set of 4   Sums of distances              Pairs chosen
ABCD              n_AB + n_CD = 22 + 18 = 40     AB, CD
                  n_AC + n_BD = 39 + 41 = 80
                  n_AD + n_BC = 39 + 41 = 80
ABCE              n_AB + n_CE = 22 + 20 = 42     AB, CE
                  n_AC + n_BE = 39 + 43 = 82
                  n_AE + n_BC = 41 + 41 = 82
ABDE              n_AB + n_DE = 22 + 10 = 32     AB, DE
                  n_AD + n_BE = 39 + 43 = 82
                  n_AE + n_BD = 41 + 41 = 82
ACDE              n_AC + n_DE = 39 + 10 = 49     AC, DE
                  n_AD + n_CE = 39 + 20 = 59
                  n_AE + n_CD = 41 + 18 = 59
BCDE              n_BC + n_DE = 41 + 10 = 51     BC, DE
                  n_BD + n_CE = 41 + 20 = 61
                  n_BE + n_CD = 43 + 18 = 61

Scores: AB (3), DE (3), CD (1), CE (1), AC (1), BC (1). AB and DE are therefore the closest neighbors.
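The scoring over quadruples can be sketched in a few lines. The distances are the ones used in the sums of the table above (for example, 41 for both A-E and B-D); the function name `neighbor_scores` and the data layout are illustrative assumptions, and the sketch covers only the first scoring pass, not the later rebuilding of the matrix with composite OTUs.

```python
from collections import Counter
from itertools import combinations


def neighbor_scores(names, d):
    """Count how often each pair is chosen as neighbors over all quadruples.

    d maps frozenset({x, y}) -> distance between OTUs x and y.
    """
    def dist(x, y):
        return d[frozenset((x, y))]

    scores = Counter()
    for i, j, m, n in combinations(names, 4):
        # The three ways to split four OTUs into two pairs.
        pairings = [((i, j), (m, n)), ((i, m), (j, n)), ((i, n), (j, m))]
        # Four-point condition: the true neighbor pairs give the smallest sum.
        best = min(pairings, key=lambda p: dist(*p[0]) + dist(*p[1]))
        for pair in best:
            scores[frozenset(pair)] += 1
    return scores


names = "ABCDE"
lower = {
    "B": [22],
    "C": [39, 41],
    "D": [39, 41, 18],
    "E": [41, 43, 20, 10],
}
d = {
    frozenset((row, names[col])): value
    for row, values in lower.items()
    for col, value in enumerate(values)
}

scores = neighbor_scores(names, d)
for pair, score in scores.most_common():
    print("".join(sorted(pair)), score)
```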

Distance-based method Neighbor-joining Method

Neighbor-joining Method Minimum evolution tree: a tree with the smallest sum of branch lengths. The neighbor-joining method finds neighbors sequentially that may minimize the total length of the tree.

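Procedure
Start with a star-like tree, in which all m OTUs (here 1 to 8) are connected to a single internal node X. With $L_{iX}$ the length of the branch between OTU i and node X, the sum of branch lengths is

$S_0 = \sum_{i=1}^{m} L_{iX} = \frac{1}{m-1} \sum_{i<j} d_{ij} = \frac{T}{m-1}$, where $T = \sum_{i<j} d_{ij}$.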

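Procedure
Try one pair of neighbors (here OTUs 1 and 2, joined at node Y), but put the rest in a star cluster around node X. With $T = \sum_{i<j} d_{ij}$ the sum of all pairwise distances, the sum of branch lengths is

$S_{12} = \frac{2T - R_1 - R_2}{2(m-2)} + \frac{d_{12}}{2}$, where $R_1 = \sum_{i=1}^{m} d_{1i}$ and $R_2 = \sum_{i=1}^{m} d_{2i}$.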
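To make the sums concrete, the sketch below evaluates $S_0 = T/(m-1)$ and $S_{ij} = (2T - R_i - R_j)/(2(m-2)) + d_{ij}/2$ for every pair. The four-OTU matrix is hypothetical, not taken from the slides: it was built from an additive tree with terminal branches A = 1, B = 2, C = 4, D = 5 and an internal branch of 3, so the true total tree length is 15. The function name `nj_pair_sums` is also an assumption for illustration.

```python
from itertools import combinations


def nj_pair_sums(names, d):
    """Return (S0, S) where S[(i, j)] is the sum of branch lengths when
    OTUs i and j are joined and all other OTUs stay in a star cluster.

    d maps frozenset({x, y}) -> distance between OTUs x and y.
    """
    m = len(names)
    # T: sum of all pairwise distances.
    T = sum(d[frozenset(p)] for p in combinations(names, 2))
    # R[x]: sum of distances from OTU x to every other OTU.
    R = {x: sum(d[frozenset((x, y))] for y in names if y != x) for x in names}
    S = {}
    for i, j in combinations(names, 2):
        d_ij = d[frozenset((i, j))]
        S[(i, j)] = (2 * T - R[i] - R[j]) / (2 * (m - 2)) + d_ij / 2
    return T / (m - 1), S


# Hypothetical additive distances (see the lead-in for the generating tree).
names = "ABCD"
d = {
    frozenset("AB"): 3,
    frozenset("AC"): 8,
    frozenset("AD"): 9,
    frozenset("BC"): 9,
    frozenset("BD"): 10,
    frozenset("CD"): 9,
}

S0, S = nj_pair_sums(names, d)
best_pair = min(S, key=S.get)
print("S0 =", S0)
print("best pair:", best_pair, "with S =", S[best_pair])
```

With four OTUs, joining A with B implies joining C with D, so both pairs give the same sum, and here that sum equals the length of the generating tree.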

Procedure
Try all possible m(m - 1)/2 pairs. Choose the pair with the smallest sum of branch lengths as neighbors (a composite OTU).
[Figure: three of the candidate trees for OTUs 1 to 8, each with a different pair joined at node Y and the remaining OTUs in a star cluster around node X]

Convert matrix
Original distance matrix:

     1          2          3          4          5          6          7
2    $d_{12}$
3    $d_{13}$   $d_{23}$
4    $d_{14}$   $d_{24}$   $d_{34}$
5    $d_{15}$   $d_{25}$   $d_{35}$   $d_{45}$
6    $d_{16}$   $d_{26}$   $d_{36}$   $d_{46}$   $d_{56}$
7    $d_{17}$   $d_{27}$   $d_{37}$   $d_{47}$   $d_{57}$   $d_{67}$
8    $d_{18}$   $d_{28}$   $d_{38}$   $d_{48}$   $d_{58}$   $d_{68}$   $d_{78}$

New distance matrix:

     1          2          3          4          5          6          7
2    $S_{12}$
3    $S_{13}$   $S_{23}$
4    $S_{14}$   $S_{24}$   $S_{34}$
5    $S_{15}$   $S_{25}$   $S_{35}$   $S_{45}$
6    $S_{16}$   $S_{26}$   $S_{36}$   $S_{46}$   $S_{56}$
7    $S_{17}$   $S_{27}$   $S_{37}$   $S_{47}$   $S_{57}$   $S_{67}$
8    $S_{18}$   $S_{28}$   $S_{38}$   $S_{48}$   $S_{58}$   $S_{68}$   $S_{78}$

An example (Nei and Kumar, Molecular Evolution and Phylogenetics, 2000)

Advantages and disadvantages of the neighbor-joining method
Advantages:
- It is fast and thus suited for large datasets and for bootstrap analysis.
- It permits lineages with largely different branch lengths.
- It permits correction for multiple substitutions.
Disadvantages:
- Sequence information is reduced.
- It gives only one possible tree.
- It is strongly dependent on the model of evolution used.

Character-based method Maximum Parsimony Method

Maximum Parsimony Method Uses character state data Search for a tree that requires the smallest number of evolutionary changes to explain the data. A tree thus inferred is called the maximum parsimony tree

Maximum Parsimony Methods
- Use character state data for a given multiple sequence alignment.
- Give a score to a phylogenetic tree. The score is a measure of the number of evolutionary changes that would be required to generate the data given that particular tree.
- Of the possible trees, the one considered most likely to represent the true history of the OTUs is the one with the lowest score (i.e., the one requiring the fewest evolutionary changes).
- A tree thus inferred is called the maximum parsimony tree.

Informative sites
A site is parsimony-informative if it favors some trees over the others; that is, it contains at least two types of nucleotides (or amino acids), and at least two of them occur with a minimum frequency of two.

            Sites
Sequence    1  2  3  4  5  6  7  8  9
1           A  A  G  A  G  T  G  C  A
2           A  G  C  C  G  T  G  C  G
3           A  G  A  T  A  T  C  C  A
4           A  G  A  G  A  T  C  C  G
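The definition translates directly into a check on each alignment column. The sketch below applies it to the four sequences in the table; the helper name `is_parsimony_informative` is illustrative.

```python
from collections import Counter


def is_parsimony_informative(column):
    """True if at least two character states each occur at least twice."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


# The four sequences from the table, nine sites each.
alignment = [
    "AAGAGTGCA",  # sequence 1
    "AGCCGTGCG",  # sequence 2
    "AGATATCCA",  # sequence 3
    "AGAGATCCG",  # sequence 4
]

# zip(*alignment) yields one tuple per site (column); sites are numbered from 1.
informative = [
    site
    for site, column in enumerate(zip(*alignment), start=1)
    if is_parsimony_informative(column)
]
print(informative)
```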

Invariable site (site 1 of the alignment: A, A, A, A): each of the three possible trees for the four sequences requires 0 changes.
[Figure: the three trees for sequences 1 to 4 at an invariable site]

Uninformative site (site 2: A, G, G, G): each of the three possible trees requires 1 change.
[Figure: the three trees for sequences 1 to 4 at an uninformative site]

Informative site (site 5: G, G, A, A): the tree that groups sequences 1 and 2 requires 1 change, whereas the other two trees require 2 changes each.
[Figure: the three trees for sequences 1 to 4 at an informative site]
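The change counts for the three trees can be checked with Fitch's small-parsimony algorithm. The slides do not name the counting method, so Fitch's algorithm is a choice made here for illustration; each unrooted four-sequence tree is rooted on its internal branch, which does not affect the count. The names `fitch_score` and `trees` are hypothetical.

```python
def fitch_score(tree, states):
    """Minimum number of changes for one site on a rooted binary tree.

    tree: nested 2-tuples of leaf names, e.g. ((1, 2), (3, 4)).
    states: dict mapping leaf name -> character state at this site.
    """
    changes = 0

    def visit(node):
        nonlocal changes
        if not isinstance(node, tuple):
            return {states[node]}  # leaf: its observed state
        left, right = visit(node[0]), visit(node[1])
        common = left & right
        if common:
            return common  # children agree on at least one state: no change
        changes += 1  # children disagree: one change is needed here
        return left | right

    visit(tree)
    return changes


# The three possible trees for four sequences.
trees = {
    "(1,2),(3,4)": ((1, 2), (3, 4)),
    "(1,3),(2,4)": ((1, 3), (2, 4)),
    "(1,4),(2,3)": ((1, 4), (2, 3)),
}

site1 = {1: "A", 2: "A", 3: "A", 4: "A"}  # invariable
site2 = {1: "A", 2: "G", 3: "G", 4: "G"}  # uninformative
site5 = {1: "G", 2: "G", 3: "A", 4: "A"}  # informative

for label, site in (("site 1", site1), ("site 2", site2), ("site 5", site5)):
    print(label, {name: fitch_score(t, site) for name, t in trees.items()})
```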

[Figure: example in which character 5 is an invariant, and therefore uninformative, site]

[Figure: example in which character 9 is uninformative]

Number of possible trees
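The slide's own table is not part of this transcript. As a standard combinatorial result rather than something taken from the slides, n OTUs admit (2n - 5)!! unrooted and (2n - 3)!! rooted bifurcating tree topologies, where k!! is the product k(k - 2)(k - 4)...1. A small sketch, with illustrative function names, shows how quickly these numbers grow:

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... * 1 for odd k >= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result


def num_unrooted_trees(n):
    """Number of unrooted bifurcating topologies for n >= 3 OTUs."""
    if n < 3:
        raise ValueError("need at least 3 OTUs")
    return odd_double_factorial(2 * n - 5)


def num_rooted_trees(n):
    """Number of rooted bifurcating topologies for n >= 2 OTUs."""
    if n < 2:
        raise ValueError("need at least 2 OTUs")
    return odd_double_factorial(2 * n - 3)


for n in (3, 4, 5, 10):
    print(n, num_unrooted_trees(n), num_rooted_trees(n))
```

This growth is why exhaustive search is only feasible for small numbers of OTUs.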

Tree search methods
- Exhaustive: examine all trees, get the best tree (guaranteed).
- Branch-and-Bound: examine some trees, get the best tree (guaranteed).
- Heuristic: examine some trees, get a tree that may or may not be the best tree.

Exhaustive

Branch-and-Bound
1. Obtain a tree by a fast method (e.g., the neighbor-joining method).
2. Compute the minimum number of substitutions (L) for the obtained tree.
3. Use L as an upper bound.
4. If the local parsimony score of the current incomplete tree $T_0$ is greater than or equal to the best global score of any complete tree seen so far, do not generate or search the enumeration subtree below $T_0$.

Branch-and-Bound

Branch-and-Bound: an example
Assume we are given an MSA $A = \{a_1, a_2, \ldots, a_5\}$ and that at a given position i the characters are $a_{1i} = A$, $a_{2i} = A$, $a_{3i} = C$, $a_{4i} = C$, and $a_{5i} = A$. Assume that the neighbor-joining tree on A looks like this:
[Figure: neighbor-joining tree of the five sequences]
Parsimony score of the alignment: PS = 2

Only the first of the three trees fulfills the bound criterion and we do not pursue the other two trees.

The second step: The first three trees are optimal. Note that the bound criterion in the first step reduced the number of full trees to be considered by two thirds.

Heuristic

Maximum Likelihood Methods Well-established approach in statistics The ML method requires a probabilistic model for the process of nucleotide substitution. That is, we must specify the transition probability from one nucleotide state to another in a time interval in each branch.

Maximum Likelihood Method Likelihood provides probabilities of the sequences given a model of their evolution on a particular tree. The more probable the sequences given the tree, the more the tree is preferred. All possible trees are considered; computationally intense. Because the user can choose a model of evolution, the method can be useful for widely divergent groups or other difficult situations.

One-Parameter Model
$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
$P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$ (for $i \neq j$)
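A quick numerical check of the one-parameter model, $P_{ii}(t) = 1/4 + (3/4)e^{-4\alpha t}$ and $P_{ij}(t) = 1/4 - (1/4)e^{-4\alpha t}$, is sketched below. The function names and the example values of alpha and t are illustrative assumptions.

```python
import math


def p_same(alpha, t):
    """P_ii(t): probability that a site shows the same nucleotide after time t."""
    return 0.25 + 0.75 * math.exp(-4 * alpha * t)


def p_diff(alpha, t):
    """P_ij(t), i != j: probability of one specific different nucleotide."""
    return 0.25 - 0.25 * math.exp(-4 * alpha * t)


alpha = 0.01  # hypothetical substitution rate parameter
for t in (0, 10, 100, 1000):
    same, diff = p_same(alpha, t), p_diff(alpha, t)
    # Each row of the transition matrix sums to 1: one "same" plus three "different".
    print(t, round(same, 4), round(diff, 4), round(same + 3 * diff, 4))
```

At t = 0 no change has occurred, and for large t every nucleotide becomes equally likely (probability 1/4).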

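Likelihood function
Example: four sequences, with a constant rate of substitution.

[Figure: rooted tree for sequences 1 to 4 with nucleotides i, j, k, and l at the tips. The root has nucleotide x. Sequence 4 is separated from the root by time $t_1 + t_2 + t_3$. The ancestral node y of sequences 1, 2, and 3 lies at time $t_1$ from the root, and sequence 3 is separated from y by $t_2 + t_3$. The ancestral node z of sequences 1 and 2 lies at time $t_2$ from y, and sequences 1 and 2 are each separated from z by $t_3$.]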

The likelihood function for a site with nucleotides i, j, k, and l in sequences 1, 2, 3, and 4: if the nucleotide at the root was x, the probability of having nucleotide l in sequence 4 is $P_{xl}(t_1 + t_2 + t_3)$, because $t_1 + t_2 + t_3$ is the total amount of time between the two nodes. The probability of having nucleotide y at the common ancestral node of sequences 1, 2, and 3 is $P_{xy}(t_1)$, and so on.

Since we do not know the ancestral nucleotide, we can only assign it a probability $g_x$, usually the frequency of nucleotide x in the sequences. Noting that x, y, and z can each be any of the 4 nucleotides, we sum over all possibilities and obtain the following likelihood function:

$h(i, j, k, l) = \sum_{x} g_x P_{xl}(t_1 + t_2 + t_3) \sum_{y} P_{xy}(t_1) P_{yk}(t_2 + t_3) \sum_{z} P_{yz}(t_2) P_{zi}(t_3) P_{zj}(t_3)$
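The nested sums can be evaluated directly. The sketch below does so for the four-sequence tree, under assumptions that are not fixed by the slides: the one-parameter model supplies the transition probabilities, the root frequencies $g_x$ default to 1/4 each, and the branch times and alpha are arbitrary example values. The names `transition_prob` and `site_likelihood` are illustrative.

```python
import math
from itertools import product

NUCLEOTIDES = "ACGT"


def transition_prob(a, b, t, alpha):
    """One-parameter model: probability of nucleotide b after time t, starting from a."""
    e = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e


def site_likelihood(i, j, k, l, t1, t2, t3, alpha, g=None):
    """h(i, j, k, l): likelihood of one site, summed over ancestral states x, y, z."""
    if g is None:
        g = {n: 0.25 for n in NUCLEOTIDES}  # assumed equal root frequencies
    total = 0.0
    for x in NUCLEOTIDES:  # state at the root
        sum_y = 0.0
        for y in NUCLEOTIDES:  # ancestor of sequences 1, 2, and 3
            sum_z = sum(  # z: ancestor of sequences 1 and 2
                transition_prob(y, z, t2, alpha)
                * transition_prob(z, i, t3, alpha)
                * transition_prob(z, j, t3, alpha)
                for z in NUCLEOTIDES
            )
            sum_y += (
                transition_prob(x, y, t1, alpha)
                * transition_prob(y, k, t2 + t3, alpha)
                * sum_z
            )
        total += g[x] * transition_prob(x, l, t1 + t2 + t3, alpha) * sum_y
    return total


# Hypothetical branch times and rate, for illustration only.
params = dict(t1=1.0, t2=2.0, t3=3.0, alpha=0.05)
print(site_likelihood("A", "A", "G", "C", **params))

# Sanity check: the likelihoods of all 4**4 possible site patterns sum to 1.
total_probability = sum(
    site_likelihood(i, j, k, l, **params)
    for i, j, k, l in product(NUCLEOTIDES, repeat=4)
)
print(total_probability)
```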

The above formula is for a single site. The likelihood for all sites is the product of the likelihoods for individual sites if all the nucleotide sites evolve independently.

For a given set of data, one computes the maximum likelihood value for each tree topology; this procedure essentially amounts to finding the branch lengths that give the largest value of the likelihood function. Finally, one chooses the topology with the highest maximum likelihood value as the best tree, which is called the maximum likelihood tree.

Estimation of Branch Lengths
- Tree topology: phylogenetic relationships
- Branch lengths: degree of separation
- Topology and branch lengths are estimated at the same time by UPGMA, maximum parsimony, and maximum likelihood.

Tree Reliability Tests-Bootstrap analysis Reliability refers to the probability that members of a clade will be part of the true tree. Bootstrapping is the most common reliability test. In bootstrapping, re-sampling of the sites in the alignment is used to build new trees. These extra samples are created with "replacement" - it is possible that some positions will be repeated in the subsample, while some positions will be left out. Multiple re-samples (hundreds to thousands) are run.
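The resampling step can be sketched as follows: alignment columns are drawn with replacement until the pseudo-alignment has the same number of sites as the original. The example alignment reuses the four nine-site sequences from the informative-sites slide, and the function names, the seed, and the number of replicates are assumptions for illustration. Building a tree from each replicate is left to whichever method is being tested.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Resample the columns of an alignment with replacement.

    alignment: dict mapping sequence name -> aligned sequence (equal lengths).
    """
    n_sites = len(next(iter(alignment.values())))
    # Some columns will be picked more than once, others not at all.
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {
        name: "".join(seq[c] for c in columns) for name, seq in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates, seed=0):
    """Generate n_replicates pseudo-alignments from a seeded random generator."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


alignment = {
    "1": "AAGAGTGCA",
    "2": "AGCCGTGCG",
    "3": "AGATATCCA",
    "4": "AGAGATCCG",
}
replicates = bootstrap_replicates(alignment, n_replicates=100)
print(replicates[0])
```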

Bootstrap analysis
Thus bootstrap analysis:
- is a statistical method for obtaining an estimate of error;
- is used to evaluate the reliability of a tree;
- is used to examine how often a particular cluster in a tree appears when nucleotides or amino acids are re-sampled.

The closer the score is to 100, the more strongly supported the grouping. Bootstrapping can be used with distance, parsimony, and likelihood methods.
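The score for a grouping is the percentage of bootstrap trees in which that grouping appears. A minimal sketch, assuming each bootstrap tree has already been reduced to the set of clusters (clades) it contains; the four replicate trees below are made up for illustration, and `bootstrap_support` is a hypothetical helper name.

```python
def bootstrap_support(cluster, replicate_trees):
    """Percentage of replicate trees that contain the given cluster.

    replicate_trees: list of sets, each holding the clusters (as frozensets
    of OTU names) found in one bootstrap tree.
    """
    cluster = frozenset(cluster)
    hits = sum(1 for tree in replicate_trees if cluster in tree)
    return 100.0 * hits / len(replicate_trees)


# Hypothetical clusters from four bootstrap trees of OTUs A to D.
replicate_trees = [
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AB"), frozenset("CD")},
    {frozenset("AC"), frozenset("BD")},
    {frozenset("AB"), frozenset("CD")},
]

print(bootstrap_support("AB", replicate_trees))
print(bootstrap_support("AC", replicate_trees))
```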

An example of bootstrap analysis

[Figure: bootstrap datasets resampled from the original alignment, numbered up to 999]