Phylogeny and Molecular Evolution Introduction 1
Credit: Serafim Batzoglou (UPGMA slides), http://www.stanford.edu/class/cs262/slides; notes by Nir Friedman, Dan Geiger, Shlomo Moran, Ron Shamir, Sagi Snir, and Michal Ziv-Ukelson; Durbin et al.; Jones and Pevzner's lecture notes
Outline: Evolutionary Tree Reconstruction; Character Based Phylogeny; Small Parsimony Problem; Fitch and Sankoff Algorithms
Characterizing Evolution: Since Darwin, anatomical and behavioral features were the dominant criteria used to derive evolutionary relationships between species. Because the analysis rested on these relatively subjective observations, the evolutionary relationships derived from them were often inconclusive or later proved incorrect.
How did the panda evolve? For roughly 100 years, scientists were unable to figure out which family the giant panda belongs to. In 1870, Père Armand David returned to Paris from China with the bones of the mysterious creature, which he called simply the "black and white bear". Biologists examined the bones and concluded that they more closely resembled the bones of a red panda (grouped with the raccoons) than those of bears. In 1985, Steve O'Brien et al. solved the giant panda classification problem using DNA sequences: the giant panda is a bear. [Figure: giant panda and red panda]
Evolutionary Tree of Bears and Raccoons (O'Brien 1985): O'Brien's study used about 500,000 nucleotides to construct the evolutionary tree of bears and raccoons. Note that bears and raccoons diverged just 35 million years ago and share many morphological features.
Evolutionary Tree of Humans: Around the time the giant panda riddle was solved, a DNA-based model of the human evolutionary tree led to the Out of Africa Hypothesis, which claims that our most ancient ancestor lived in Africa roughly 200,000 years ago.
Human Evolutionary Tree (cont'd): Based on the mitochondrial DNA (16,587 bp) of 53 individuals. http://www.mun.ca/biology/scarr/out_of_africa2.htm
The Origin of Humans: Out of Africa vs Multiregional Hypothesis
Out of Africa: Humans evolved in Africa ~150,000 years ago. Humans migrated out of Africa, replacing other humanoids around the globe. There is no direct descent from Neanderthals.
Multiregional: Humans evolved over the last two million years as a single species. Modern traits appeared independently in different areas. Humans migrated out of Africa, mixing with other humanoids on the way. There is genetic continuity from Neanderthals to humans.
Human Migration Out of Africa http://www.becominghuman.org
mtDNA analysis supports the Out of Africa Hypothesis. The African origin of humans is inferred from the following: the African population was the most diverse (sub-populations had more time to diverge); the evolutionary tree separated one group of Africans from a group containing all five populations; the tree was rooted on the branch between the groups of greatest difference.
Evolutionary Tree of Humans (mtDNA): The evolutionary tree separates one group of Africans from a group containing all five populations. Vigilant, Stoneking, Harpending, Hawkes, and Wilson (1991)
Two Neanderthal Discoveries: Feldhofer, Germany, and Mezmaiskaya, Caucasus. Distance: about 2,500 km
Two Neanderthal Discoveries: Is there a connection between Neanderthals and today's Europeans? If humans did not evolve from Neanderthals, whom did we evolve from?
Multiregional Hypothesis? It may predict some genetic continuity from the Neanderthals through the Cro-Magnons up to today's Europeans. It can explain the occurrence of varying regional characteristics.
Sequencing Neanderthal's mtDNA: mtDNA from Neanderthal bone is used because it is up to 1,000x more abundant than nuclear DNA. DNA decays over time, and only a small amount of ancient DNA can be recovered (upper limit: 100,000 years). PCR of mtDNA: fragments are too short, and human DNA may be mixed in.
Neanderthals vs Humans: surprisingly large divergence. AMH vs Neanderthal: 22 substitutions and 6 indels in a 357 bp region. AMH vs AMH: only 8 substitutions. AMH = Anatomically Modern Human.
New Fossil (Manot Cave) Supports OOA
This means several things.
1. First, unless and until other fossil evidence is found, AMHs, once they left Africa, came through the Sinai and Levant region, stopping in what is now modern-day Israel before migrating outwards into Europe and the rest of Asia.
2. Secondly, this discovery conclusively shows that AMHs were indeed living near, and perhaps even next to, Neanderthals as early as 60,000 years ago.
3. Thirdly, the Out of Africa (OOA) hypothesis becomes the best-evidenced hypothesis regarding how early humans migrated and conquered the planet.
Study evolution? If two sequences from different organisms are similar, they may have a common ancestor (homologues). So sequence alignment (both pairwise and multiple) can help construct the phylogenetic tree.
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Edit Distance = 4
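The aligned pair on this slide can be inspected column by column. The sketch below is not from the original notes; the function name `alignment_columns` is hypothetical. It simply tallies matching columns, mismatching columns, and columns containing a gap. For this alignment the gap count comes out as 4, which coincides with the slide's "Edit Distance = 4"; whether the slide also intends substitutions to be counted is not stated in the source.

```python
def alignment_columns(row_a: str, row_b: str, gap: str = "-") -> dict:
    """Count matches, mismatches and gap columns in a pairwise alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    counts = {"match": 0, "mismatch": 0, "gap": 0}
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            counts["gap"] += 1        # insertion or deletion (indel)
        elif a == b:
            counts["match"] += 1
        else:
            counts["mismatch"] += 1   # substitution
    return counts


# The two aligned rows shown on the slide.
counts = alignment_columns("-GCGC-ATGGATTGAGCGA", "TGCGCCATTGAT-GACC-A")
print(counts)  # {'match': 13, 'mismatch': 2, 'gap': 4}
```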
Phylogenetic Tree Reconstruction
Input: four nucleotide sequences, AAG, AAA, GGA, and AGA, taken from four species.
Question: Which evolutionary tree best explains these sequences?
One answer (the parsimony principle): pick a tree that has the minimum total number of substitutions of symbols between species and their originator in the evolutionary tree (also called a phylogenetic tree).
[Figure: tree with leaves AAG, AAA, GGA, AGA and internal vertices all labeled AAA; edge substitution counts 1, 2, 1; total #substitutions = 4]
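Once every vertex of a tree carries a sequence, the total number of substitutions is the sum of per-edge differences. The sketch below illustrates that arithmetic for the slide's example, in which all internal vertices are labeled AAA. The vertex names (`root`, `u`, `v`, `s1` to `s4`) and the exact edge layout are assumptions made for illustration, since the figure's topology is not recoverable from the text; with every internal vertex labeled AAA the total is 4 regardless of how the leaves are attached.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def tree_substitutions(labels: dict, edges: list) -> int:
    """Sum of substitutions over all edges of a fully labeled tree."""
    return sum(hamming(labels[u], labels[v]) for u, v in edges)


# Hypothetical vertex names; the sequences are the ones on the slide.
labels = {
    "root": "AAA", "u": "AAA", "v": "AAA",               # internal vertices
    "s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA",  # leaves
}
edges = [("root", "u"), ("root", "v"),
         ("u", "s1"), ("u", "s2"),
         ("v", "s3"), ("v", "s4")]

total = tree_substitutions(labels, edges)
print(total)  # 4  (1 for AAG, 0 for AAA, 2 for GGA, 1 for AGA)
```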
Example Continued: There are many trees possible. [Figure: two candidate trees over the leaves AAG, AAA, GGA, AGA, with internal vertices labeled AAA or AGA]
Example Continued: There are many trees possible. For example, the left tree needs a total of 3 substitutions and the right tree needs 4, so the left tree is better than the right tree. [Figure: left tree with edge substitution counts 1, 1, 1 (total 3); right tree with edge substitution counts 1, 1, 2 (total 4)]
Questions:
Does this principle yield realistic phylogenetic trees? (Evolution)
How can we compute the best tree efficiently? (Computer Science)
What is the probability of substitutions given the data? (Learning)
Is the best tree found significantly better than others? (Statistics)
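For the computational question, one standard way to score a fixed topology is Fitch's algorithm, which the outline names. The sketch below is a minimal illustration rather than the notes' own implementation: it represents a rooted binary tree as nested tuples with sequences at the leaves (a representation chosen here for brevity), scores each alignment column independently, and tries the three possible groupings of the four example sequences. With only four leaves, brute-force enumeration is feasible; it does not scale to many species.

```python
def fitch_site(tree, site):
    """Return (candidate state set, substitution count) for one column."""
    if isinstance(tree, str):              # leaf: its state is known
        return {tree[site]}, 0
    left, right = tree
    left_set, left_cost = fitch_site(left, site)
    right_set, right_cost = fitch_site(right, site)
    common = left_set & right_set
    if common:                             # children agree: no new change
        return common, left_cost + right_cost
    # children disagree: take the union and charge one substitution
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Minimum number of substitutions for this topology over all columns."""
    return sum(fitch_site(tree, i)[1] for i in range(length))


# The three ways to pair up four leaves (placing the root on the middle edge).
topologies = {
    "(AAG,AAA),(GGA,AGA)": (("AAG", "AAA"), ("GGA", "AGA")),
    "(AAG,GGA),(AAA,AGA)": (("AAG", "GGA"), ("AAA", "AGA")),
    "(AAG,AGA),(AAA,GGA)": (("AAG", "AGA"), ("AAA", "GGA")),
}
scores = {name: parsimony_score(t, 3) for name, t in topologies.items()}
best = min(scores, key=scores.get)
print(scores)  # {'(AAG,AAA),(GGA,AGA)': 3, '(AAG,GGA),(AAA,AGA)': 4, '(AAG,AGA),(AAA,GGA)': 4}
print(best)    # (AAG,AAA),(GGA,AGA)
```

The best score of 3 agrees with the better of the two trees in the example.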
Tree Reconstruction: How are these trees built from sequences? First, a little background.
Rooted Trees: A rooted tree infers an evolutionary ancestor. Leaves represent existing species; internal vertices represent hypothetical ancestors; the tree can be viewed as directed from the root to the leaves.
Tree of life
Revolutionizing the Classification of Life
In the very beginning, life was classified as plants and animals. When bacteria were discovered, they were initially classified as plants. Ernst Haeckel (1866) placed all unicellular organisms in a kingdom called Protista, separate from Plantae and Animalia.
When electron microscopes were developed, it was found that Protista in fact include cells both with and without a nucleus. Also, fungi were found to differ from plants, since they are heterotrophs (they do not synthesize their food). Thus, life was classified into 5 kingdoms: Procaryotes, Plants, Animals, Protists, and Fungi.
Later, plants, animals, protists and fungi were collectively called the Eucarya domain, and the procaryotes were moved from a kingdom to a domain of their own, Bacteria. [Figure: Domains: Bacteria, Eucarya; Kingdoms within Eucarya: Plants, Animals, Protists, Fungi] Even later, a new domain was discovered.
rRNA was sequenced from a great number of organisms to study phylogeny. The translation apparatus is universal and probably already existed in the beginning.
Carl R. Woese and rRNA phylogeny
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
Edit Distance = 4
A distance matrix was computed over every pair of organisms. In a very influential paper (1977), Woese and Fox showed that methanogenic bacteria are as distant from Bacteria as they are from Eucaryota.
One sentence about methanogenic bacteria: There exists a third kingdom which, to date, is represented solely by the methanogenic bacteria, a relatively unknown class of anaerobes that possess a unique metabolism based on the reduction of carbon dioxide to methane. These "bacteria" appear to be no more related to typical bacteria than they are to eucaryotic cytoplasms.
From sequence analysis alone, it was thus established that life is divided into three domains: Bacteria, Archaea, and Eucarya.
The rRNA phylogenetic tree
Jill Banfield
Binary trees: Biologists often work with binary weighted trees, in which every internal vertex has degree 3 (if the tree is rooted, the root has degree 2) and every edge has a positive weight (or length).
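The three conditions in this definition can be checked mechanically. The sketch below is an illustration under stated assumptions: the tree is given as a list of `(u, v, weight)` edges, the function name `is_binary_weighted` is hypothetical, and the input is assumed to already be a tree (connectivity and acyclicity are not verified).

```python
from collections import Counter


def is_binary_weighted(edges, root=None):
    """Check the degree and weight conditions of a binary weighted tree.

    Assumes `edges` already form a tree; only degrees and weights are tested.
    """
    if any(weight <= 0 for _, _, weight in edges):
        return False                       # every edge needs a positive weight
    degree = Counter()
    for u, v, _ in edges:
        degree[u] += 1
        degree[v] += 1
    for vertex, deg in degree.items():
        if deg == 1:
            continue                       # leaf
        expected = 2 if vertex == root else 3
        if deg != expected:
            return False
    return True


# Hypothetical example trees (vertex names and weights are made up).
unrooted = [("a", "x", 1), ("b", "x", 2), ("x", "y", 3),
            ("c", "y", 1), ("d", "y", 4)]
rooted = [("r", "x", 1), ("r", "y", 1),
          ("x", "a", 2), ("x", "b", 2),
          ("y", "c", 3), ("y", "d", 3)]

print(is_binary_weighted(unrooted))          # True
print(is_binary_weighted(rooted, root="r"))  # True
print(is_binary_weighted(rooted))            # False: r has degree 2 but is not declared the root
```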
Distances in Trees: Edges may have weights, which reflect the number of mutations on the evolutionary path from one species to another, or a time estimate for the evolution of one species into another. In a tree T with n leaves, we often compute the length of the path between leaves i and j, $d_{ij}(T)$. Here $d_{ij}$ refers to the distance between i and j and is the sum of the weights of the edges on the path between i and j.
Distance in Trees (cont'd): For i = 1, j = 4, $d_{ij}$ is: $d(1,4) = 12 + 13 + 14 + 17 + 13 = 69$ [Figure: weighted tree with leaves i and j marked]
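Because a tree has exactly one path between any two vertices, $d_{ij}$ can be found by a simple traversal that accumulates edge weights. The sketch below assumes a hypothetical tree: only the five weights on the path from leaf 1 to leaf 4 (12, 13, 14, 17, 13) come from the slide, while the internal vertex names and the remaining leaves and weights are invented to complete the example.

```python
from collections import defaultdict


def tree_distance(edges, i, j):
    """Sum of edge weights on the unique path between i and j in a tree."""
    adjacency = defaultdict(list)
    for u, v, w in edges:
        adjacency[u].append((v, w))
        adjacency[v].append((u, w))
    stack = [(i, None, 0)]                 # (vertex, parent, distance so far)
    while stack:
        vertex, parent, dist = stack.pop()
        if vertex == j:
            return dist
        for neighbour, w in adjacency[vertex]:
            if neighbour != parent:        # never walk back up the same edge
                stack.append((neighbour, vertex, dist + w))
    raise ValueError("no path between the given vertices")


# Path 1 -> a -> b -> c -> d -> 4 uses the slide's weights;
# leaves 2, 3, 5, 6 and their weights are hypothetical.
edges = [(1, "a", 12), ("a", "b", 13), ("b", "c", 14),
         ("c", "d", 17), ("d", 4, 13),
         (2, "a", 5), (3, "b", 7), (5, "c", 6), (6, "d", 9)]

print(tree_distance(edges, 1, 4))  # 69
```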
Types of Tree Reconstruction
Character-based: the input is a multiple alignment of the sequences at the leaves (find the topology that best explains the evolution of the leaf sequences via mutations).
Distance-based: the input is a matrix of distances between species.
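The two input types are related: a distance matrix can be derived from aligned sequences. The sketch below shows one simple way to do so, using the number of differing positions (Hamming distance) between each pair. That choice of distance is an assumption made here for illustration; the notes do not say how the distances are obtained. The species names `s1` to `s4` are hypothetical, and the sequences are the four from the earlier parsimony example.

```python
def hamming(a: str, b: str) -> int:
    """Number of positions at which two aligned sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def distance_matrix(seqs: dict) -> list:
    """Pairwise distance matrix, rows and columns in the dict's key order."""
    names = list(seqs)
    return [[hamming(seqs[p], seqs[q]) for q in names] for p in names]


# Character-based input: one aligned sequence per (hypothetically named) species.
seqs = {"s1": "AAG", "s2": "AAA", "s3": "GGA", "s4": "AGA"}

# Distance-based input derived from it.
matrix = distance_matrix(seqs)
for name, row in zip(seqs, matrix):
    print(name, row)
# s1 [0, 1, 3, 2]
# s2 [1, 0, 2, 1]
# s3 [3, 2, 0, 1]
# s4 [2, 1, 1, 0]
```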
Character Based Tree Reconstruction
Parsimony: find a tree with the minimum total number of character changes between nodes.
Maximum likelihood: find the best Bayesian network of a tree shape. The method of choice nowadays.