Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center


Outline
- Basic Concepts
- Tree Construction Methods
  - Distance-based methods
  - Character-based methods
- Bootstrap Analysis

What is Phylogenetics
- The term phylogenetics derives from the Greek phyle (φυλή), "tribe" or "race", and genetikos (γενετικός), "relative to birth".
- "Phylogenetics" is the study or estimation of the evolutionary history that underlies biological diversity.
- This history is described visually through "trees".

Molecular phylogenetics
- Molecular phylogenetics: the study of evolutionary relationships among organisms or genes by using molecular data (e.g., DNA or protein sequences) and statistical techniques.
- It is called molecular systematics if the relationships of organisms are the concern.

Why Study Molecular Phylogenetics? Provides insights into relationships among organisms - through "species" trees. Provides insights into the evolution and history of genes - through "gene" trees.

Example: Phylogeny of human and apes
Apes
- Great apes: chimpanzee, bonobo (pygmy chimp), gorilla, orangutan
- Lesser apes: gibbon

White-handed gibbon Hylobates lar

Borneo orangutan

Western lowland gorilla

Common chimpanzee

Homo sapiens

Traditional view
[Figure: species tree of human, chimpanzee, gorilla, orangutan, and gibbon]

Same tree, different presentations
[Figure: the same tree of human, chimpanzee, gorilla, orangutan, and gibbon drawn as a rectangular tree and as a curved tree]

An example of a gene tree (Lin et al., PNAS 2006)

Objectives
- To understand the basic concepts and terminology of molecular phylogenetics
- To understand tree topologies and how to read them
- To understand the basic concepts of different tree-building methods

Tree Terminology Topology: the branching pattern of a tree

Nodes and OTUs
- External (or terminal) nodes: nodes at the tips of the tree (A, B, C, D, and E). They represent extant taxonomic units, called operational taxonomic units (OTUs).
- Internal nodes: all others (F, G, H, I). Internal nodes represent ancestral units.
[Figure: tree with terminal nodes A-E and internal nodes F-I]

(a) Rooted tree: one node, the root (R), is the most basal ancestor of the tree.
(b) Unrooted tree: unrooted trees do not imply a known ancestral root.
[Figure: (a) rooted tree with root R drawn along a time axis; (b) unrooted tree of OTUs A-E]

The branches of a phylogenetic tree may be represented in two different ways:
- Unscaled: branch lengths are not proportional to the number of changes on the branches; the tree shows topology only.
- Scaled: branch lengths are proportional to the number of changes that have occurred along the branches.
[Figure: (a) unscaled and (b) scaled drawings of a tree of OTUs A-E with numbered branch lengths and a 1-unit scale bar]

Bifurcating vs. Multifurcating
- Bifurcating tree: each node has only two immediate descendant lineages in a rooted tree.
- Multifurcating tree: some nodes have more than two immediate descendant lineages in a rooted tree.
[Figure: bifurcating and multifurcating trees of human, chimpanzee, gorilla, orangutan, and gibbon]

Classification of Tree Reconstruction Methods
1. Distance methods (distance matrix methods)
2. Character state methods
Inferring a phylogeny is an estimation procedure.

[Flowchart: workflow of a phylogenetic analysis]
1. Data set collection
2. Multiple sequence alignment
3. Tree construction
   - Character-based (optimality criteria): maximum parsimony, maximum likelihood
   - Distance-based: UPGMA, neighbor joining, Fitch-Margoliash, Kitch
4. Test the reliability of the tree by analytical and/or resampling procedures

Inference Procedure: Two steps
(1) Define an optimality criterion, or objective function, i.e., the value that is assigned to a tree and used to compare one tree with another.
(2) Develop algorithms to compute the value of the objective function, and thereby identify the tree or set of trees with the best value according to this criterion.

Optimality criterion
- Global optimality criterion: consider all possible trees.
- Local optimality criterion: consider only a limited number of trees at each stage of tree reconstruction.

Types of data used in phylogenetic inference
- Character-based methods: use the aligned characters, such as DNA or protein sequences, directly during tree inference.
- Distance-based methods: transform the sequence data into pairwise distances, and use the resulting matrix during tree building.

Types of Data: Characters and distances
Character: a nucleotide at a site in a DNA sequence, or an amino acid at a site in a protein sequence.
Sequence 1: GACTGGTAC-A
Sequence 2: GATTGGTAC-A
Sequence 3: GATAGGCACTA
Sequence 4: GACAAGCACTA
- Binary state characters: insertions and deletions
- Multi-state characters: 4 nucleotides or 20 amino acids

Distance Matrix Methods
- The evolutionary distances for all pairs of taxa are presented in a matrix.
- The tree-building algorithms are based on functional relationships among the distance values.

Distance Methods - Example
[Figure: distances between sequences, summarized in a distance table]
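The step from aligned characters to a distance table can be sketched in a few lines. This is a minimal illustration, not the method used on the slide: it assumes the simplest possible distance (a raw count of mismatched sites, skipping any site where either sequence has a gap) and reuses the four sequences from the "Characters and distances" slide. The function names are hypothetical.

```python
from itertools import combinations

# The four aligned sequences from the "Characters and distances" slide.
alignment = {
    "Sequence 1": "GACTGGTAC-A",
    "Sequence 2": "GATTGGTAC-A",
    "Sequence 3": "GATAGGCACTA",
    "Sequence 4": "GACAAGCACTA",
}

def mismatch_count(seq_a, seq_b):
    """Count aligned sites where both sequences have a base and the bases differ.

    Assumption: sites with a gap ('-') in either sequence are skipped.
    """
    return sum(
        1 for a, b in zip(seq_a, seq_b)
        if a != "-" and b != "-" and a != b
    )

def distance_table(seqs):
    """Return {(name_1, name_2): distance} for every pair of sequences."""
    return {
        (m, n): mismatch_count(seqs[m], seqs[n])
        for m, n in combinations(seqs, 2)
    }

table = distance_table(alignment)
for (m, n), d in table.items():
    print(f"{m} vs {n}: {d}")
```

In practice the raw counts are usually converted to proportions and corrected for multiple substitutions under a chosen model before tree building.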

Distance-based method Unweighted Pair-Group Method with Arithmetic Mean (UPGMA)

Unweighted Pair-Group Method with Arithmetic Mean (UPGMA) Sequential clustering algorithm Local topological relationships are inferred in order of decreasing similarity and a phylogenetic tree is built in a stepwise manner.

Distance Methods - UPGMA
- Unweighted Pair Group Method with Arithmetic mean
- A molecular clock is assumed
- Clustering: all leaves are iteratively merged according to their distance
- Construct a distance tree

Sequences:
A -GCTTGTCCGTTACGAT
B ACTTGTCTGTTACGAT
C ACTTGTCCGAAACGAT
D -ACTTGACCGTTTCCTT
E AGATGACCGTTTCGAT
F -ACTACACCCTTATGAG

Distance matrix:
     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8

Molecular Clocks
The Molecular Clock Hypothesis
- Proposed by Zuckerkandl and Pauling (1965).
- The rate of evolution in a given protein (or DNA) molecule is approximately constant over time and among evolutionary lineages.
- The amount of genetic difference between sequences is statistically proportional to the time since separation.

OTU  A         B         C
B    $d_{AB}$
C    $d_{AC}$  $d_{BC}$
D    $d_{AD}$  $d_{BD}$  $d_{CD}$
- Assume that $d_{AB}$ has the smallest value. OTUs A and B are the first to be clustered.
- Branching point: at a distance of $d_{AB}/2$ (Fig. 4).
- OTUs A and B are now considered as a single OTU, called a composite OTU.

OTU  (AB)         C
C    $d_{(AB)C}$
D    $d_{(AB)D}$  $d_{CD}$
$d_{(AB)C} = (d_{AC} + d_{BC})/2$
$d_{(AB)D} = (d_{AD} + d_{BD})/2$
Assume that $d_{(AB)C}$ is the smallest: the branching point is at a distance of $d_{(AB)C}/2$.

[Figure: stepwise UPGMA clustering. (a) A and B joined at $d_{AB}/2$; (b) C joined to (AB) at $d_{(AB)C}/2$; (c) D joined to (ABC) at $d_{(ABC)D}/2$]

Distance between two Composite OTUs
For composite OTUs (ij) and (mn): $d_{(ij)(mn)} = (d_{im} + d_{in} + d_{jm} + d_{jn})/4$
- Arithmetic mean
- Same weight for all pairs of OTUs

Distance Methods - UPGMA
First round
     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8
Choose the most similar pair (A and B), cluster them together, and calculate the new distance matrix:
$d_{(AB)C} = (d_{AC} + d_{BC})/2 = 4$
$d_{(AB)D} = (d_{AD} + d_{BD})/2 = 6$
$d_{(AB)E} = (d_{AE} + d_{BE})/2 = 6$
$d_{(AB)F} = (d_{AF} + d_{BF})/2 = 8$
     A,B  C  D  E
C    4
D    6    6
E    6    6  4
F    8    8  8  8
[Figure: A and B joined, each on a branch of length 1]

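Distance Methods - UPGMA
Second round
     A,B  C  D  E
C    4
D    6    6
E    6    6  4
F    8    8  8  8
D and E are clustered:
$d_{(DE)(AB)} = (d_{D(AB)} + d_{E(AB)})/2 = 6$
$d_{(DE)C} = (d_{DC} + d_{EC})/2 = 6$
$d_{(DE)F} = (d_{DF} + d_{EF})/2 = 8$
Third round
     A,B  C  D,E
C    4
D,E  6    6
F    8    8  8
(AB) and C are clustered:
$d_{(ABC)(DE)} = (d_{(AB)(DE)} + d_{C(DE)})/2 = 6$
$d_{(ABC)F} = (d_{(AB)F} + d_{CF})/2 = 8$
[Figure: partial trees after each round. A and B on branches of length 1; D and E on branches of length 2; C joined to (AB) on a branch of length 2]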

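Distance Methods - UPGMA
Fourth round
     AB,C  D,E
D,E  6
F    8     8
(ABC) and (DE) are clustered.
Fifth round
     ABC,DE
F    8
$d_{(ABCDE)F} = (d_{(ABC)F} + d_{(DE)F})/2 = 8$
[Figure: final UPGMA tree. Terminal branches: A and B of length 1; C, D, and E of length 2; F of length 4; each internal branch of length 1]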
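The five rounds above can be reproduced with a short script. This is a sketch under stated assumptions rather than a reference implementation: cluster-to-cluster distances are computed as the arithmetic mean over all pairs of member OTUs (the composite-OTU formula given earlier), ties are broken by whichever pair is encountered first, and the function names are hypothetical. In this example the all-pairs mean gives the same values as the simple halving used in the worked rounds.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E", "F"]
# Lower-triangular distance matrix from the UPGMA example.
lower = {
    "B": [2],
    "C": [4, 4],
    "D": [6, 6, 6],
    "E": [6, 6, 6, 4],
    "F": [8, 8, 8, 8, 8],
}
d = {
    frozenset((row, labels[col])): value
    for row, values in lower.items()
    for col, value in enumerate(values)
}

def cluster_distance(c1, c2):
    """Arithmetic mean of the distances between all member pairs."""
    total = sum(d[frozenset((a, b))] for a in c1 for b in c2)
    return total / (len(c1) * len(c2))

def upgma(names):
    """Return the merges as (cluster_1, cluster_2, node_height) tuples."""
    clusters = [(name,) for name in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(*pair),
        )
        # The branching point sits at half the cluster distance.
        merges.append((c1, c2, cluster_distance(c1, c2) / 2))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

merges = upgma(labels)
for c1, c2, height in merges:
    print("".join(c1), "+", "".join(c2), "at height", height)
```

Because (AB)-C and D-E are tied at a distance of 4, the order in which those two merges are reported may differ from the slides, but the resulting tree is the same.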

         Human  Chimp  Gorilla
Chimp    1.24
Gorilla  1.62   1.63
Orang    3.08   3.12   3.09

     H     C     G
C    1.24
G    1.62  1.63
O    3.08  3.12  3.09
[Figure: tree of human, chimpanzee, gorilla, and orangutan, with branch-length labels 0.73, 0.20, and 0.62 and divergence-time labels 4.8-6.4 Mya, 6.3-8.5 Mya, and 12-16 Mya]
Mya: million years before present

Common chimpanzee: our closest relative

Suppose that you have the following tree: since the divergence of A and B, B has accumulated mutations at a much higher rate than A. http://evolution-textbook.org/content/free/book/toc.html

Conclusion: The unequal rates of mutation have led to a completely different tree topology.

UPGMA
- Good for explaining some basic concepts and principles in tree reconstruction.
- The UPGMA clustering method is very sensitive to unequal evolutionary rates.
- When one OTU has accumulated more mutations over time than another, one may end up with a tree that has the wrong topology.
- There are better methods.

Distance-based method Neighbors-relation method

Neighbors
- Two OTUs are said to be neighbors if they are connected through a single internal node.
- In Fig. (a), A and B are neighbors, and so are D and E, but A and C are not.
[Figure (a): unrooted tree of OTUs A, B, C, D, and E]

However, if A and B are treated as a composite OTU, then (AB) and C are neighbors.
[Figure: unrooted tree of the composite OTU (AB) with C, D, and E]

Additive Trees
- A molecular clock defines additive distances.
- A tree is said to be additive if the distance between any two nodes is equal to the sum of the lengths of all the branches connecting them, as shown below.
     A  B  C  D  E
B    2
C    4  4
D    6  6  6
E    6  6  6  4
F    8  8  8  8  8
[Figure: tree of OTUs A-F. Terminal branches: A and B of length 1; C, D, and E of length 2; F of length 4; each internal branch of length 1]

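Four-Point Condition
[Figure: unrooted tree in which A and B (terminal branches a and b) are separated from C and D (terminal branches c and d) by an internal branch of length x]
If additivity holds:
$d_{AC} + d_{BD} = d_{AD} + d_{BC} = (a + x + c) + (b + x + d) = a + b + c + d + 2x = d_{AB} + d_{CD} + 2x$,
so $d_{AC} + d_{BD} > d_{AB} + d_{CD}$.
x: the length of the internal branch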

$d_{AB} + d_{CD} < d_{AC} + d_{BD}$
$d_{AB} + d_{CD} < d_{AD} + d_{BC}$
These two inequalities are the four-point condition for the tree in which A and B are neighbors and C and D are neighbors.
Conversely, for four OTUs with unknown phylogeny, the two conditions can be used to identify the neighbors. Once the two pairs of neighbors are determined, the tree topology is determined.

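Sattath and Tversky (1977): Neighbors relation method
- First, compute a distance matrix.
- For every possible quadruple, say OTUs i, j, m, and n, compute
  (1) $d_{ij} + d_{mn}$,
  (2) $d_{im} + d_{jn}$, and
  (3) $d_{in} + d_{jm}$.
- Choose the two neighbor pairs given by the smallest of the three sums.
[Figure: the quadruple i, j, m, n and its distance matrix with entries $d_{ij}$, $d_{im}$, $d_{jm}$, $d_{in}$, $d_{jn}$, and $d_{mn}$]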

Neighbors relation method
- A pair that is chosen as neighbors in a quadruple receives a score of 1; otherwise, it receives 0.
- After every possible quadruple has been considered, the pair with the highest total score is chosen as the first pair of neighbors. It is then treated as a composite OTU.
- A new distance matrix is computed to search for the next pair of neighbors.
- This procedure is repeated until the number of OTUs is reduced to 3.

Distance matrix:
     A   B   C   D
B    22
C    39  41
D    39  41  18
E    41  43  20  10

Chosen set of 4   Sum of distances                  Pairs chosen
ABCD              n_ab + n_cd = 22 + 18 = 40        AB, CD
                  n_ac + n_bd = 39 + 41 = 80
                  n_ad + n_bc = 39 + 41 = 80
ABCE              n_ab + n_ce = 22 + 20 = 42        AB, CE
                  n_ac + n_be = 39 + 43 = 82
                  n_ae + n_bc = 41 + 41 = 82
ABDE              n_ab + n_de = 22 + 10 = 32        AB, DE
                  n_ad + n_be = 39 + 43 = 82
                  n_ae + n_bd = 41 + 41 = 82
ACDE              n_ac + n_de = 39 + 10 = 49        AC, DE
                  n_ad + n_ce = 39 + 20 = 59
                  n_ae + n_cd = 41 + 18 = 59
BCDE              n_bc + n_de = 41 + 10 = 51        BC, DE
                  n_bd + n_ce = 41 + 20 = 61
                  n_be + n_cd = 43 + 18 = 61

Scores: AB (3), DE (3), CD (1), CE (1), AC (1), BC (1). AB and DE are therefore the closest neighbors.
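The scoring of quadruples can be checked mechanically. The sketch below is an illustration only: it takes the pairwise values that appear in the worked sums (d_AB = 22, d_AC = 39, d_AD = 39, d_AE = 41, d_BC = 41, d_BD = 41, d_BE = 43, d_CD = 18, d_CE = 20, d_DE = 10), picks the split with the smallest sum for each quadruple, and counts how often each pair is chosen. The helper names are hypothetical.

```python
from collections import Counter
from itertools import combinations

labels = "ABCDE"
# Pairwise distances as used in the worked sums of the example.
pair_distance = {
    "AB": 22, "AC": 39, "AD": 39, "AE": 41,
    "BC": 41, "BD": 41, "BE": 43,
    "CD": 18, "CE": 20, "DE": 10,
}

def d(x, y):
    """Look up the distance between two OTUs regardless of order."""
    return pair_distance["".join(sorted(x + y))]

def neighbor_scores(names):
    """Score each pair by how many quadruples choose it as neighbors."""
    scores = Counter()
    for i, j, m, n in combinations(names, 4):
        # The three ways to split a quadruple into two pairs.
        splits = [((i, j), (m, n)), ((i, m), (j, n)), ((i, n), (j, m))]
        best = min(splits, key=lambda s: d(*s[0]) + d(*s[1]))
        for pair in best:
            scores["".join(pair)] += 1
    return scores

scores = neighbor_scores(labels)
for pair, score in scores.most_common():
    print(pair, score)
```

With these distances, AB and DE each collect the top score, in agreement with the table.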

Distance-based method Neighbor-joining Method

Neighbor-joining Method Minimum evolution tree: a tree with the smallest sum of branch lengths. The neighbor-joining method finds neighbors sequentially that may minimize the total length of the tree.

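Procedure
Start with a star-like tree in which all m OTUs are joined at a single node X. The sum of branch lengths is
$S_0 = \sum_{i=1}^{m} L_{iX} = \frac{1}{m-1} \sum_{i<j} d_{ij} = \frac{T}{m-1}$, where $T = \sum_{i<j} d_{ij}$.
[Figure: star tree of OTUs 1-8 joined at node X]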

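Procedure
Try one pair of neighbors (here OTUs 1 and 2), but put the rest in a star cluster. The sum of branch lengths is
$S_{12} = \frac{2T - R_1 - R_2}{2(m-2)} + \frac{d_{12}}{2}$, where $R_1 = \sum_{i=1}^{m} d_{1i}$, $R_2 = \sum_{i=1}^{m} d_{2i}$, and $T = \sum_{i<j} d_{ij}$ is the sum of all pairwise distances.
[Figure: OTUs 1 and 2 joined at node X; the remaining OTUs 3-8 joined at node Y]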

Procedure
- Try all possible m(m - 1)/2 pairs of the m OTUs.
- Choose the pair with the smallest sum of branch lengths as neighbors (a composite OTU).
[Figure: three alternative trees, each with a different pair of OTUs joined at node X and the remaining OTUs joined at node Y]

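Convert matrix
Original distance matrix:
     1         2         3         4         5         6         7
2    $d_{12}$
3    $d_{13}$  $d_{23}$
4    $d_{14}$  $d_{24}$  $d_{34}$
5    $d_{15}$  $d_{25}$  $d_{35}$  $d_{45}$
6    $d_{16}$  $d_{26}$  $d_{36}$  $d_{46}$  $d_{56}$
7    $d_{17}$  $d_{27}$  $d_{37}$  $d_{47}$  $d_{57}$  $d_{67}$
8    $d_{18}$  $d_{28}$  $d_{38}$  $d_{48}$  $d_{58}$  $d_{68}$  $d_{78}$
New distance matrix:
     1         2         3         4         5         6         7
2    $S_{12}$
3    $S_{13}$  $S_{23}$
4    $S_{14}$  $S_{24}$  $S_{34}$
5    $S_{15}$  $S_{25}$  $S_{35}$  $S_{45}$
6    $S_{16}$  $S_{26}$  $S_{36}$  $S_{46}$  $S_{56}$
7    $S_{17}$  $S_{27}$  $S_{37}$  $S_{47}$  $S_{57}$  $S_{67}$
8    $S_{18}$  $S_{28}$  $S_{38}$  $S_{48}$  $S_{58}$  $S_{68}$  $S_{78}$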
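The conversion from a distance matrix to a matrix of branch-length sums can be sketched as follows. This is a minimal illustration of the first neighbor-joining step only, assuming the standard forms S_0 = T/(m - 1) and S_ij = (2T - R_i - R_j)/(2(m - 2)) + d_ij/2. The five-OTU distances are borrowed from the neighbors-relation example purely for illustration, and the function names are hypothetical.

```python
from itertools import combinations

labels = ["A", "B", "C", "D", "E"]
# Illustrative distances (taken from the neighbors-relation example).
d = {
    ("A", "B"): 22, ("A", "C"): 39, ("A", "D"): 39, ("A", "E"): 41,
    ("B", "C"): 41, ("B", "D"): 41, ("B", "E"): 43,
    ("C", "D"): 18, ("C", "E"): 20, ("D", "E"): 10,
}

def dist(x, y):
    """Symmetric lookup with zero distance from an OTU to itself."""
    return 0 if x == y else d[tuple(sorted((x, y)))]

def nj_sums(names):
    """Return S_0 for the star tree and S_ij for every candidate pair."""
    m = len(names)
    total = sum(dist(x, y) for x, y in combinations(names, 2))   # T
    row = {x: sum(dist(x, y) for y in names) for x in names}     # R_i
    s0 = total / (m - 1)
    s = {
        (x, y): (2 * total - row[x] - row[y]) / (2 * (m - 2)) + dist(x, y) / 2
        for x, y in combinations(names, 2)
    }
    return s0, s

s0, s = nj_sums(labels)
best_pair = min(s, key=s.get)
print("S_0 =", s0)
print("first pair of neighbors:", best_pair, "with S =", round(s[best_pair], 3))
```

A full neighbor-joining run would then replace the chosen pair with a composite OTU, recompute the distances, and repeat until the tree is resolved.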

An example (Nei and Kumar, Molecular Evolution and Phylogenetics, 2000)

Advantages and disadvantages of the neighbor-joining method
Advantages
- It is fast and thus suited for large datasets and for bootstrap analysis.
- It permits lineages with largely different branch lengths.
- It permits correction for multiple substitutions.
Disadvantages
- Sequence information is reduced.
- It gives only one possible tree.
- It is strongly dependent on the model of evolution used.

Character-based method Maximum Parsimony Method

Maximum Parsimony Method Uses character state data Search for a tree that requires the smallest number of evolutionary changes to explain the data. A tree thus inferred is called the maximum parsimony tree

Maximum Parsimony Methods
- Use character state data for a given multiple sequence alignment.
- Give a score to a phylogenetic tree. The score is a measure of the number of evolutionary changes that would be required to generate the data given that particular tree.
- Of the possible trees, the one considered most likely to represent the true history of the OTUs is the one with the lowest score (i.e., the one requiring the fewest evolutionary changes).
- A tree thus inferred is called the maximum parsimony tree.

Informative sites
A site is parsimony-informative if it favors some trees over the others; that is, it contains at least two types of nucleotides (or amino acids), and at least two of them occur with a minimum frequency of two.
                Sites
Sequence  1  2  3  4  5  6  7  8  9
1         A  A  G  A  G  T  G  C  A
2         A  G  C  C  G  T  G  C  G
3         A  G  A  T  A  T  C  C  A
4         A  G  A  G  A  T  C  C  G
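The definition above translates directly into a column-by-column check. The following sketch applies it to the four-sequence, nine-site alignment on this slide; the function name is hypothetical.

```python
from collections import Counter

# Rows are sequences 1-4; columns are sites 1-9 of the example alignment.
alignment = [
    "AAGAGTGCA",
    "AGCCGTGCG",
    "AGATATCCA",
    "AGAGATCCG",
]

def is_parsimony_informative(column):
    """True if at least two character states each occur at least twice."""
    counts = Counter(column)
    return sum(1 for count in counts.values() if count >= 2) >= 2

informative_sites = [
    position + 1  # report sites with 1-based numbering, as on the slide
    for position, column in enumerate(zip(*alignment))
    if is_parsimony_informative(column)
]
print(informative_sites)
```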

Invariable site
Site 1 of the alignment (A, A, A, A) requires 0 changes on each of the three possible trees.
[Figure: the three possible unrooted trees for sequences 1-4 with site 1 mapped onto the tips; 0 changes on each tree]

Uninformative sites
Site 2 of the alignment (A, G, G, G) requires 1 change on each of the three possible trees, so it does not favor any tree.
[Figure: the three possible unrooted trees for sequences 1-4 with site 2 mapped onto the tips; 1 change on each tree]

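Informative sites
Site 5 of the alignment (G, G, A, A) requires 1 change on the tree that groups sequences 1 and 2 apart from sequences 3 and 4, and 2 changes on each of the other two trees.
[Figure: the three possible unrooted trees for sequences 1-4 with site 5 mapped onto the tips; 1 change, 2 changes, and 2 changes]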
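The change counts on the three trees can be reproduced with Fitch's set-based counting, which is one standard way to score a tree under parsimony (the slides do not say which algorithm was used, so this is an assumption). The sketch scores each site of the example alignment on the three possible four-taxon trees and sums over sites. The names are hypothetical, and sequence indices are 0-based in the code.

```python
alignment = [
    "AAGAGTGCA",  # sequence 1
    "AGCCGTGCG",  # sequence 2
    "AGATATCCA",  # sequence 3
    "AGAGATCCG",  # sequence 4
]

# The three unrooted trees for four sequences, written as two pairs of neighbors.
topologies = {
    "((1,2),(3,4))": ((0, 1), (2, 3)),
    "((1,3),(2,4))": ((0, 2), (1, 3)),
    "((1,4),(2,3))": ((0, 3), (1, 2)),
}

def fitch_join(left, right):
    """Combine two state sets; a change is counted when they do not overlap."""
    common = left & right
    return (common, 0) if common else (left | right, 1)

def site_changes(states, pairing):
    """Minimum number of changes for one site on one four-taxon tree."""
    (a, b), (c, e) = pairing
    set_ab, cost_ab = fitch_join({states[a]}, {states[b]})
    set_ce, cost_ce = fitch_join({states[c]}, {states[e]})
    _, cost_mid = fitch_join(set_ab, set_ce)
    return cost_ab + cost_ce + cost_mid

columns = list(zip(*alignment))
tree_scores = {
    name: sum(site_changes(col, pairing) for col in columns)
    for name, pairing in topologies.items()
}
for name, score in tree_scores.items():
    print(name, score)
```

Only the informative sites (5, 7, and 9) contribute different counts to different trees; every other site adds the same amount to all three scores.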

[Figure: a further example alignment, not reproduced here]
Character 5 is an invariant site (uninformative).

Character 9 is uninformative.

Number of possible trees
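The figure for this slide is not reproduced, so the sketch below relies on the standard counting formulas for strictly bifurcating trees, stated here as an assumption about what the slide showed: (2n - 5)!! unrooted trees for n ≥ 3 OTUs and (2n - 3)!! rooted trees for n ≥ 2 OTUs, where k!! is the product k(k - 2)(k - 4)...1. The function names are hypothetical.

```python
def odd_double_factorial(k):
    """Product k * (k - 2) * (k - 4) * ... down to 1; returns 1 for k <= 1."""
    result = 1
    while k > 1:
        result *= k
        k -= 2
    return result

def count_unrooted_trees(n_otus):
    """Number of unrooted bifurcating trees for n_otus >= 3."""
    return odd_double_factorial(2 * n_otus - 5)

def count_rooted_trees(n_otus):
    """Number of rooted bifurcating trees for n_otus >= 2."""
    return odd_double_factorial(2 * n_otus - 3)

for n in (3, 4, 5, 10, 20):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

The counts grow so quickly that an exhaustive search is feasible only for a small number of OTUs, which motivates the search strategies that follow.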

Tree search methods
- Exhaustive = examine all trees, get the best tree (guaranteed).
- Branch-and-Bound = examine some trees, get the best tree (guaranteed).
- Heuristic = examine some trees, get a tree that may or may not be the best tree.

Exhaustive

Branch-and-Bound
- Obtain a tree by a fast method (e.g., the neighbor-joining method).
- Compute the minimum number of substitutions (L) for the obtained tree.
- Use L as an upper bound.
- If the local parsimony score of the current incomplete tree $T_0$ is larger than or equal to the best global score of any complete tree seen so far, do not generate or search the part of the enumeration that extends $T_0$.

Branch-and-Bound

Branch-and-Bound: an example
Assume we are given an MSA $A = \{a_1, a_2, \ldots, a_5\}$, and at a given position i the characters are $a_{1i}$ = A, $a_{2i}$ = A, $a_{3i}$ = C, $a_{4i}$ = C, and $a_{5i}$ = A.
Assume that the neighbor-joining tree on A looks like this:
[Figure: neighbor-joining tree of $a_1$ to $a_5$]
Parsimony score of the alignment: PS = 2

Only the first of the three trees fulfills the bound criterion and we do not pursue the other two trees.

The second step: The first three trees are optimal. Note that the bound criterion in the first step reduced the number of full trees to be considered by two thirds.

Heuristic

Maximum Likelihood Methods
- A well-established approach in statistics.
- The ML method requires a probabilistic model for the process of nucleotide substitution. That is, we must specify the transition probability from one nucleotide state to another in a time interval in each branch.

Maximum Likelihood Method Likelihood provides probabilities of the sequences given a model of their evolution on a particular tree. The more probable the sequences given the tree, the more the tree is preferred. All possible trees are considered; computationally intense. Because the user can choose a model of evolution, the method can be useful for widely divergent groups or other difficult situations.

One-Parameter Model
$P_{ii}(t) = \frac{1}{4} + \frac{3}{4} e^{-4\alpha t}$
$P_{ij}(t) = \frac{1}{4} - \frac{1}{4} e^{-4\alpha t}$, for $i \neq j$
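As a quick numerical check of the one-parameter model, the sketch below evaluates the two transition probabilities, P_ii(t) = 1/4 + (3/4)e^(-4αt) and P_ij(t) = 1/4 - (1/4)e^(-4αt), for a hypothetical substitution rate. The value of alpha and the function names are assumptions chosen only for illustration.

```python
import math

def p_same(alpha, t):
    """Probability that a site shows the same nucleotide after time t."""
    return 0.25 + 0.75 * math.exp(-4 * alpha * t)

def p_diff(alpha, t):
    """Probability of one specific different nucleotide after time t."""
    return 0.25 - 0.25 * math.exp(-4 * alpha * t)

alpha = 0.01  # hypothetical substitution rate per unit time
for t in (0, 1, 10, 100, 1000):
    same, diff = p_same(alpha, t), p_diff(alpha, t)
    # Each row of the transition matrix sums to 1: one "same" plus three "different".
    print(t, round(same, 4), round(diff, 4), round(same + 3 * diff, 4))
```

At t = 0 no change has occurred, and as t grows both probabilities approach 1/4, so the nucleotide at a site becomes independent of its starting state.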

Likelihood function Example: Four sequences A constant rate of substitution

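[Figure: rooted tree for sequences 1, 2, 3, and 4 with observed nucleotides i, j, k, and l. The root x leads to sequence 4 along a branch of length $t_1 + t_2 + t_3$ and to node y along a branch of length $t_1$; y leads to sequence 3 along a branch of length $t_2 + t_3$ and to node z along a branch of length $t_2$; z leads to sequences 1 and 2 along branches of length $t_3$]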

The likelihood function for a site with nucleotides i, j, k, and l in sequences 1, 2, 3, and 4: if the nucleotide at the root was x, the probability of having nucleotide l in sequence 4 is $P_{xl}(t_1 + t_2 + t_3)$, because $t_1 + t_2 + t_3$ is the total amount of time between the two nodes. The probability of having nucleotide y at the common ancestral node of sequences 1, 2, and 3 is $P_{xy}(t_1)$, and so on.

Since we do not know the ancestral nucleotide, we can only assign it a probability $g_x$, usually the frequency of nucleotide x in the sequences. Noting that x, y, and z can be any of the 4 nucleotides, we sum over all possibilities and obtain the following likelihood function:

$h(i, j, k, l) = \sum_x g_x P_{xl}(t_1 + t_2 + t_3) \sum_y P_{xy}(t_1) P_{yk}(t_2 + t_3) \sum_z P_{yz}(t_2) P_{zi}(t_3) P_{zj}(t_3)$

The above formula is for a single site. The likelihood for all sites is the product of the likelihoods for individual sites if all the nucleotide sites evolve independently.
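The single-site likelihood can be evaluated by summing over the unknown ancestral nucleotides x, y, and z. The sketch below is an illustration under stated assumptions: it uses the one-parameter model for the transition probabilities, equal root frequencies (g_x = 1/4) by default, and hypothetical values for α, t_1, t_2, and t_3. The function names are also hypothetical.

```python
import math
from itertools import product

BASES = "ACGT"

def p(x, y, t, alpha):
    """One-parameter model: probability of going from base x to base y in time t."""
    decay = math.exp(-4 * alpha * t)
    return 0.25 + 0.75 * decay if x == y else 0.25 - 0.25 * decay

def site_likelihood(i, j, k, l, t1, t2, t3, alpha, g=None):
    """h(i, j, k, l): sum over the ancestral states x (root), y, and z."""
    g = g or {base: 0.25 for base in BASES}  # assumed equal root frequencies
    total = 0.0
    for x in BASES:
        sum_y = 0.0
        for y in BASES:
            sum_z = sum(
                p(y, z, t2, alpha) * p(z, i, t3, alpha) * p(z, j, t3, alpha)
                for z in BASES
            )
            sum_y += p(x, y, t1, alpha) * p(y, k, t2 + t3, alpha) * sum_z
        total += g[x] * p(x, l, t1 + t2 + t3, alpha) * sum_y
    return total

# Hypothetical parameter values, for illustration only.
params = dict(t1=1.0, t2=2.0, t3=3.0, alpha=0.05)
print(site_likelihood("A", "A", "G", "C", **params))

# Summed over all 4**4 possible site patterns, the likelihoods add up to 1.
pattern_total = sum(site_likelihood(*pattern, **params) for pattern in product(BASES, repeat=4))
print(round(pattern_total, 12))
```

For a whole alignment the per-site values are multiplied, or their logarithms summed, and the branch lengths are adjusted to maximize the result for each topology.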

For a given set of data, one computes the maximum likelihood value for each tree topology; this amounts to finding the branch lengths that give the largest value of the likelihood function. Finally, one chooses the topology with the highest maximum likelihood value as the best tree, which is called the maximum likelihood tree.

Estimation of Branch Lengths
- Tree topology: phylogenetic relationships
- Branch lengths: degree of separation
- Topology and branch lengths are estimated at the same time in:
  - UPGMA
  - Maximum parsimony
  - Maximum likelihood

Tree Reliability Tests: Bootstrap analysis
- Reliability refers to the probability that members of a clade will be part of the true tree.
- Bootstrapping is the most common reliability test.
- In bootstrapping, re-sampling of the sites in the alignment is used to build new trees.
- These extra samples are created with "replacement": some positions may be repeated in a re-sample, while other positions are left out.
- Multiple re-samples (hundreds to thousands) are run.

Bootstrap analysis
Thus bootstrap analysis:
- is a statistical method for obtaining an estimate of error
- is used to evaluate the reliability of a tree
- is used to examine how often a particular cluster in a tree appears when nucleotides or amino acids are re-sampled

The closer the score is to 100, the more strongly the grouping is supported. Bootstrapping can be used with distance, parsimony, and likelihood methods.
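The re-sampling step can be sketched as follows. This is a toy illustration, not a full bootstrap pipeline: it re-samples the columns of the four-sequence alignment from the parsimony example with replacement, and uses a deliberately simple stand-in for tree building, counting which of the three site patterns (xxyy, xyxy, xyyx) is most frequent, which for four sequences identifies the most parsimonious grouping. The function names, the seed, and the number of replicates are assumptions.

```python
import random

alignment = ["AAGAGTGCA", "AGCCGTGCG", "AGATATCCA", "AGAGATCCG"]

def bootstrap_replicate(seqs, rng):
    """Re-sample alignment columns with replacement, keeping the alignment length."""
    n_sites = len(seqs[0])
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    return ["".join(seq[p] for p in picks) for seq in seqs]

def favors_12_34(seqs):
    """Toy criterion: does parsimony strictly prefer grouping (1,2) apart from (3,4)?"""
    columns = list(zip(*seqs))
    n_12 = sum(1 for a, b, c, e in columns if a == b != c == e)
    n_13 = sum(1 for a, b, c, e in columns if a == c != b == e)
    n_14 = sum(1 for a, b, c, e in columns if a == e != b == c)
    return n_12 > max(n_13, n_14)

def support_percent(seqs, supports_grouping, n_replicates=1000, seed=0):
    """Percentage of bootstrap replicates in which the grouping is recovered."""
    rng = random.Random(seed)
    hits = sum(
        supports_grouping(bootstrap_replicate(seqs, rng))
        for _ in range(n_replicates)
    )
    return 100.0 * hits / n_replicates

support = support_percent(alignment, favors_12_34)
print(f"bootstrap support for (1,2)|(3,4): {support:.1f}%")
```

With a real dataset, `favors_12_34` would be replaced by a full tree reconstruction on each replicate followed by a check for the clade of interest.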

An example of bootstrap analysis

[Figure: re-sampled alignments, labeled "Bootstrap dataset 3" through "Bootstrap dataset 999"]