MOLECULAR PHYLOGENY
"Nothing in biology makes sense except in the light of evolution." Theodosius Dobzhansky
EVOLUTION - theory that groups of organisms change over time so that descendants differ structurally and functionally from their ancestors - biological process by which organisms inherit morphological and physiological features that define a species. Darwin 1859: On the Origin of Species by Means of Natural Selection, or the Preservation of Favoured Races in the Struggle for Life http://www.literature.org/authors/darwin-charles/the-origin-of-species/
Principles of evolution At the molecular level evolution is a process of mutation with selection Reproduction Variation Competition/selective pressure
Phylogeny Inference of evolutionary relationships Molecular phylogeny uses sequence information (as opposed to other characteristics frequently used in the past such as morphological features) * Lesk chapter 4 * Handout * Lecture slides
How is evolutionary time related to molecular changes in DNA and protein? Comparison of protein and gene sequences from different organisms gave rise to the molecular clock hypothesis: for every given gene or protein, the rate of molecular evolution is approximately constant.
Molecular clock hypothesis * Rates of change are different for each protein * These differences reflect functional constraints imposed by natural selection Richard Dickerson, 1971
A molecular clock may be used to estimate the time of divergence between two species:
r = K / (2T) or T = K / (2r)
where
r = rate of nucleotide substitution (known from fossil records)
K = number of substitutions between the two homologous sequences
T = time of divergence between the two species
The factor 2 reflects that substitutions accumulate independently along both lineages after the split.
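The relation T = K / (2r) can be sketched in Python as follows. This is only an illustration: the function name `divergence_time` and the input numbers are assumptions made up for the example, and K is taken as substitutions per site so that it matches a rate expressed per site per year.

```python
def divergence_time(k, r):
    """Return the divergence time T = K / (2r).

    k: substitutions per site between the two homologous sequences
    r: substitution rate per site per year along one lineage
    """
    if r <= 0:
        raise ValueError("the substitution rate must be positive")
    # Divide by 2r because both lineages accumulate changes after the split.
    return k / (2.0 * r)


# Hypothetical values chosen only for illustration (not from the lecture).
k = 0.1    # substitutions per site
r = 1e-9   # substitutions per site per year
print(f"Estimated divergence time: {divergence_time(k, r):.3g} years")
```

The estimate is only as good as the calibration of r and the assumption that the rate was constant in both lineages.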
However, there are cases where the molecular clock is very inaccurate: - The rate of evolution varies among different organisms. Examples: * viral sequences tend to change very rapidly as compared to other life forms * Rodents have a faster molecular clock than primates
Goals of molecular phylogeny Deduce the correct trees for all species of life
Nomenclature of trees * nodes: external (OTUs, operational taxonomic units), internal, root * branch: connects 2 nodes * OTUs are existing sequences / species / populations / individuals * an internal node is an inferred ancestor (not observed) * trees may be unscaled or scaled
Cladogram: branches are unscaled (OTUs aligned in a vertical column). Phylogram: branches are scaled; branch lengths are proportional to the number of amino acid or nucleotide changes that occurred between sequences.
Goals of molecular phylogeny Deduce the correct trees for all species of life * Topology * Branch lengths
A tree is multifurcated if it has a node with three or more daughter branches. (In a bifurcated tree any branch that divides splits into two daughter branches.)
Root: common ancestor of all sequences in the tree. Rooted tree * Unique path from the root to each of the other nodes * Direction of each path corresponds to evolutionary time. Unrooted tree * No root * No complete definition of evolutionary path * Direction of time is not determined
Two methods of rooting an unrooted tree: Outgroup. A phylogenetically distant organism is added to the set of sequences, and the root is placed on the branch leading to it. Midpoint rooting. The root is placed at the midpoint of the longest path between two OTUs.
Comparing the numbers of rooted and unrooted trees: 3 OTUs give 1 unrooted and 3 rooted trees; 4 OTUs give 3 unrooted and 15 rooted trees.
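The number of possible trees grows very quickly with the number of OTUs. A minimal Python sketch, using the standard counting formulas for strictly bifurcating trees, (2n-5)!! unrooted and (2n-3)!! rooted topologies for n OTUs; the function names are illustrative and not part of the lecture material:

```python
def odd_double_factorial(m):
    """Return m * (m - 2) * ... * 3 * 1 for odd m (1 if m <= 1)."""
    result = 1
    while m > 1:
        result *= m
        m -= 2
    return result


def count_unrooted_trees(n_otus):
    """Number of unrooted bifurcating topologies: (2n - 5)!!, for n >= 3."""
    if n_otus < 3:
        raise ValueError("need at least 3 OTUs")
    return odd_double_factorial(2 * n_otus - 5)


def count_rooted_trees(n_otus):
    """Number of rooted bifurcating topologies: (2n - 3)!!, for n >= 2."""
    if n_otus < 2:
        raise ValueError("need at least 2 OTUs")
    return odd_double_factorial(2 * n_otus - 3)


for n in (3, 4, 5, 10):
    print(n, count_unrooted_trees(n), count_rooted_trees(n))
```

Already at 10 OTUs there are millions of candidate topologies, which is why exhaustive evaluation of all trees is rarely feasible.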
Phylogenetic analysis - Selection of sequences for analysis - Multiple sequence alignment - Construction of tree - Evaluation of tree
What sequences to use? DNA? RNA? protein?
Slowly changing sequences * Protein * Ribosomal RNA, for instance 16S rRNA. Useful for comparing widely divergent species. Ribosomal database (rdp.cme.msu.edu) > 50,000 aligned sequences.
More rapidly changing sequences * DNA * Mitochondrial DNA. Useful for comparing more closely related species or populations within a species.
Two homologous protein sequences are more similar than the corresponding DNA sequences. This is to a large extent related to the degeneracy of the genetic code.
Seq 1 GGC AAG CGA AGU
Seq 2 GGA AGA CGU UCA
Seq 1 G K R S
Seq 2 G R R S
Synonymous and non-synonymous changes
Human atg gga caa aag
Mouse atg ggc caa gag
Human M G Q K
Mouse M G Q E
Comparison of the rates of nonsynonymous substitution (N) versus synonymous substitution (S) may reveal evidence of positive or negative selection.
S > N Negative selection. Change in amino acid sequence is restricted because the sequence is important for protein function.
N > S Positive selection. Example: a duplicated gene is under pressure to evolve a new function.
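Whether a codon change is synonymous can be checked by translating both codons with the standard genetic code. The sketch below is a rough illustration only: it classifies differing codons one by one and does not estimate substitution rates per synonymous or nonsynonymous site, which is what a real N versus S comparison requires. The names `CODON_TABLE`, `split_codons` and `classify_changes` are assumptions introduced for this example.

```python
from itertools import product

# Standard genetic code, with codons ordered by base T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(seq):
    """Normalise a coding sequence (RNA or DNA, spaces allowed) into codons."""
    seq = seq.upper().replace("U", "T").replace(" ", "")
    usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def classify_changes(seq_a, seq_b):
    """List differing codon pairs as synonymous or nonsynonymous."""
    changes = []
    for codon_a, codon_b in zip(split_codons(seq_a), split_codons(seq_b)):
        if codon_a == codon_b:
            continue
        aa_a, aa_b = CODON_TABLE[codon_a], CODON_TABLE[codon_b]
        kind = "synonymous" if aa_a == aa_b else "nonsynonymous"
        changes.append((codon_a, codon_b, aa_a, aa_b, kind))
    return changes


human = "atg gga caa aag"
mouse = "atg ggc caa gag"
for change in classify_changes(human, mouse):
    print(change)
```

For the human and mouse fragments, the gga/ggc difference leaves glycine unchanged, while aag/gag replaces lysine with glutamate.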
Approximate rates of substitution (number of substitutions per site per billion years)
rRNA ~0.1
Protein 0.01-10
Hypervariable regions in mitochondria 10
HIV (RNA virus) >1000
Step 2. Producing the multiple alignment Phylogeny is one of many applications of multiple alignments: * Identify conserved motifs - patterns (PROSITE) * Profiles (Pfam) * Prediction of protein secondary structure
Step 2. Producing the multiple alignment Alignment may be produced * using software such as CLUSTAL or T-Coffee * or using protein three-dimensional structure information (structural alignment)
Inspecting and processing the multiple alignment
Critical issues
- alignment should contain only homologous sequences
- no partial sequences
- overall identity should be significant, ensuring that the alignment is correct
- gaps should be avoided
- columns containing gaps may well be removed prior to phylogenetic analysis
GGGCGGCGAGGCATTTATCGGGGGGTTGCAAAAT
GGGCGGTGAGGCATTTATCGGGGGGTTGCAAAAT
GGGCGGCGAAGCATAAATCGGGGAGTTGCAAAAT
GGGCGGCGAGGCATTTATCGGGGGGTTGCGAAAT
GGGCGGCGAGGCATTTATCGGGGGGCTGCAAAAT
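Removing gap-containing columns before tree construction can be sketched in a few lines of Python. This is a minimal illustration that assumes the alignment is given as a list of equal-length strings and that gaps are written as "-" or "."; the function name `drop_gap_columns` and the toy alignment are invented for the example.

```python
def drop_gap_columns(alignment, gap_chars="-."):
    """Return the alignment with every column containing a gap removed."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all aligned sequences must have the same length")
    # Keep the index of each column in which no sequence has a gap character.
    keep = [
        i for i, column in enumerate(zip(*alignment))
        if not any(char in gap_chars for char in column)
    ]
    return ["".join(seq[i] for i in keep) for seq in alignment]


# Toy alignment (hypothetical), with gaps in the third and fifth columns.
toy_alignment = ["GGGC-GCGA", "GGGCGGTGA", "GG-CGGCGA"]
print(drop_gap_columns(toy_alignment))
```

Dropping columns discards information, so this is usually reserved for alignments where gapped regions are few or unreliable.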
Step 3. Construction of the phylogenetic tree * Distance methods * Character methods: maximum parsimony, maximum likelihood
Distance methods Simplest distance measure: consider every pair of sequences in the multiple alignment and count the number of differences. Degree of divergence = normalized Hamming distance (D)
D = n/N where N = alignment length, n = number of sites with differences
Example:
AGGCTTTTCA
AGCCTTCTCA
D = 2/10 = 0.2
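The proportion of differing sites is straightforward to compute. A minimal Python sketch, assuming two sequences taken from the same alignment (equal length, no gap handling); the function name `p_distance` is illustrative:

```python
def p_distance(seq_a, seq_b):
    """Fraction of aligned sites at which two sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if not seq_a:
        raise ValueError("sequences must not be empty")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


print(p_distance("AGGCTTTTCA", "AGCCTTCTCA"))  # 2 differences in 10 sites -> 0.2
```

Applying this to every pair of sequences in the alignment yields the distance matrix that distance-based tree methods take as input.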
Problem with this distance measure: as the distance between two sequences increases, the probability increases that more than one mutation has occurred at any one site, so the observed number of differences underestimates the true number of changes. Methods have therefore been developed to compensate for this.
Corrected distances * Jukes and Cantor: a single rate for all substitutions * Kimura two-parameter model: the rate of transitions is different from the rate of transversions. P = the fraction of sequence positions differing by a transition, Q = the fraction of sequence positions differing by a transversion.
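Both corrections have closed-form expressions. The sketch below uses the textbook formulas d = -(3/4) ln(1 - 4p/3) for Jukes-Cantor (p is the observed fraction of differing sites) and d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q) for the Kimura two-parameter model. It assumes gap-free DNA sequences containing only A, C, G and T; all function names are illustrative.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def transition_transversion_fractions(seq_a, seq_b):
    """Return (P, Q): fractions of sites differing by a transition / transversion."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    transitions = transversions = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == b:
            continue
        # Transition: purine <-> purine or pyrimidine <-> pyrimidine.
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    n_sites = len(seq_a)
    return transitions / n_sites, transversions / n_sites


def jukes_cantor_distance(p):
    """Jukes-Cantor corrected distance from the observed proportion p."""
    x = 1.0 - 4.0 * p / 3.0
    if x <= 0:
        raise ValueError("p is too large for the Jukes-Cantor correction")
    return -0.75 * math.log(x)


def kimura_2p_distance(P, Q):
    """Kimura two-parameter corrected distance from P and Q."""
    a = 1.0 - 2.0 * P - Q
    b = 1.0 - 2.0 * Q
    if a <= 0 or b <= 0:
        raise ValueError("P and Q are too large for the Kimura correction")
    return -0.5 * math.log(a) - 0.25 * math.log(b)


P, Q = transition_transversion_fractions("AGGCTTTTCA", "AGCCTTCTCA")
print("P, Q:", P, Q)
print("Jukes-Cantor:", round(jukes_cantor_distance(P + Q), 4))
print("Kimura 2P:", round(kimura_2p_distance(P, Q), 4))
```

Both corrected values are larger than the observed proportion of differences, and the gap widens as sequences become more divergent.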
Distance methods * UPGMA (unweighted pair group method with arithmetic mean) * Neighbor-joining
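UPGMA repeatedly joins the two closest clusters and averages their distances to all remaining clusters, weighting by cluster size. The following is a small Python sketch under stated assumptions, not a reference implementation: the input is a symmetric distance matrix given as a list of lists, the tree is returned as nested tuples `(left, right, height)` (a representation chosen only for this example), and node height is taken as half the distance between the joined clusters, which presumes a constant molecular clock. Neighbor-joining is not shown.

```python
from itertools import combinations


def upgma(names, matrix):
    """Build a UPGMA tree as nested tuples (left, right, height)."""
    # Active clusters: id -> (subtree, number of OTUs in the cluster).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    dist = {
        frozenset((i, j)): matrix[i][j]
        for i, j in combinations(range(len(names)), 2)
    }
    next_id = len(names)
    while len(clusters) > 1:
        # Find the pair of active clusters with the smallest distance.
        pair = min(
            (frozenset(p) for p in combinations(clusters, 2)),
            key=lambda p: dist[p],
        )
        i, j = sorted(pair)
        tree_i, size_i = clusters.pop(i)
        tree_j, size_j = clusters.pop(j)
        height = dist[pair] / 2.0
        # Distance from the new cluster to each remaining cluster is the
        # size-weighted mean of the two old distances (arithmetic mean over OTUs).
        for k in clusters:
            d_ik = dist[frozenset((i, k))]
            d_jk = dist[frozenset((j, k))]
            dist[frozenset((next_id, k))] = (
                (size_i * d_ik + size_j * d_jk) / (size_i + size_j)
            )
        clusters[next_id] = ((tree_i, tree_j, height), size_i + size_j)
        next_id += 1
    tree, _ = next(iter(clusters.values()))
    return tree


# Hypothetical distance matrix for four OTUs, for illustration only.
names = ["A", "B", "C", "D"]
matrix = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 10],
    [10, 10, 10, 0],
]
print(upgma(names, matrix))
```

In this toy matrix A and B are joined first, then C, then D. Because UPGMA assumes equal rates in all lineages, it can give a misleading topology when the molecular clock does not hold.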