The use of molecular tools for taxonomic research in zoology & botany

Outline: Why employ molecular genetic markers? Brief historical overview of DNA research. Molecular techniques for genetic analysis. DNA sequence analysis: collect data, retrieve homologous sequences, multiple sequence alignment. DNA sequence alignment. Terminology: phylogenetic trees. Phylogenetic inference.

Why employ Molecular Genetic Markers? Systematics: the biological discipline that is devoted to characterizing the diversity of life and organizing our knowledge about this diversity. Tools: morphology, physiology, behaviour, embryology, other organismal characteristics, genomic information. Carolus Linnaeus (Carl von Linné, 1707-1778): Swedish scientist who laid the foundation for modern taxonomy.

Why employ Molecular Genetic Markers? Genomic information. Human genome: 3,000,000,000 bp (3 billion bp); 20,000 to 25,000 genes; 1.5% coding for proteins. Fungi, plants, animals: 10 million bp to 200 billion bp. Bacterial genomes: 0.5 million bp to 10 million bp. Protists: 20 million bp to 500 billion bp.

Why employ Molecular Genetic Markers? Levels of genetic variation: Randomly drawn pairs of homologous DNA sequences from the human gene pool typically differ at about 0.1% of nucleotide positions. Two random human genomes therefore differ at approximately 3 million nucleotide positions (0.1% of 3 billion bp). Most other species display higher levels of nucleotide diversity.

Why NOT employ Molecular Genetic Markers? Molecular laboratory; trained staff; genetic analysis; data analysis; cost.

Historical overview
1944: experimental evidence that DNA is the genetic material
1953: Watson and Crick propose a molecular model for DNA structure
1966: Margoliash determines the amino acid sequence of cytochrome c in several taxa and generates the first phylogenetic tree
1968: Kimura proposes the neutral theory of molecular evolution
1977: Maxam & Gilbert and Sanger et al. describe laboratory methods for DNA sequencing
1979: Avise et al. and Brown et al. introduce mtDNA approaches to study natural populations
1981: Palmer et al. initiate the use of cpDNA for molecular phylogenetic reconstruction in plants

Historical overview
1985: Saiki et al. and Mullis et al. report the enzymatic in vitro amplification of DNA via the polymerase chain reaction (PCR)
1989: Kocher et al. discover conserved PCR primers to amplify mtDNA fragments from many species
2001: Publication of the draft sequence of the human genome by Lander et al. and Venter et al.
2005: Margulies et al. develop a high-throughput parallel sequencing technology for sequencing full genomes (454 sequencing)
May 31, 2007: 454 Life Sciences Corporation, in collaboration with scientists at the Human Genome Sequencing Center, Baylor College of Medicine, announced today in Houston, Texas, the completion of a project to sequence the genome of James D. Watson, Ph.D., co-discoverer of the double-helix structure of DNA. The mapping of Dr. Watson's genome was completed using the Genome Sequencer FLX system and marks the first individual genome to be sequenced for less than $1 million. "When we began the Human Genome Project, we anticipated it would take 15 years to sequence the 3 billion base pairs and identify all the genes," said Richard Gibbs, Ph.D., director, Human Genome Sequencing Center, Baylor College of Medicine. "We completed it in 13 years, in 2003, coinciding with the 50th anniversary of the publication of the work of Watson and Dr. Francis Crick that described the double helix. Today, we give James Watson a DVD containing his personal genome, a project completed in only two months. It demonstrates how far sequencing technology has come in a short time."

Historical overview Interactive Timelines: HUGO (http://www.genome.gov/25019887) http://www.dnai.org/timeline/index.html

Molecular techniques. Protein immunology (since 1904): immunological distance between taxa; first method used for phylogenetics. Protein electrophoresis (mid-1960s): starch-gel electrophoresis (SGE); allozyme polymorphisms.

Molecular techniques. DNA technology. DNA-DNA hybridization: yields mean genetic differences across a large fraction of any two genomes; source of phylogenetic information (30 000 DNA-DNA hybridizations on 1700 avian species). Restriction analysis: discovery of restriction endonucleases (1968), which cleave duplex DNA at particular oligonucleotide sequences (EcoRI: 5'-GAATTC-3'). RFLP: Restriction Fragment Length Polymorphism.

Molecular techniques. DNA technology:
RAPD: Randomly Amplified Polymorphic DNA
AFLP: Amplified Fragment-Length Polymorphism
SSCP: Single-Strand Conformational Polymorphism
SINE: Short Interspersed Elements
STR: Short Tandem Repeat (microsatellites)
SNP: Single Nucleotide Polymorphism
DNA sequencing

DNA sequencing

DNA sequence alignment [Figure: example multiple alignment of five DNA sequences, with gaps shown as "-"; columns are marked "*" or "?".]

What is a Multiple Alignment? An alignment is a hypothesis of positional homology between nucleotide bases or amino acids.

Multiple Sequence Alignment - Methods: manual, automatic, combined.

Overview of ClustalW procedure
Step 1. Quick pairwise alignment: calculate distance matrix
              1      2      3      4      5
1 Hbb_Human   -
2 Hbb_Horse  .17     -
3 Hba_Human  .59    .60     -
4 Hba_Horse  .59    .59    .13     -
5 Myg_Whale  .77    .77    .75    .75     -
Step 2. Neighbor-joining tree (guide tree) [Figure: guide tree joining Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale.]
Step 3. Progressive alignment following the guide tree [Figure: the five protein sequences aligned in guide-tree order.]

ClustalW - First pair: Align the two most closely related sequences first. This alignment is then fixed and will never change. If a gap is to be introduced subsequently, then it will be introduced in the same place in both sequences, but their relative alignment remains unchanged.

ClustalW - Decision time: Next, consult the guide tree to see which alignment is performed next. Either two different sequences are aligned together, or a third sequence is aligned to the first two. [Figure: guide tree of Hbb_Human, Hbb_Horse, Hba_Human, Hba_Horse and Myg_Whale.]

ClustalW - Alternative 1: If a third sequence is aligned to the first two, then when a gap has to be introduced to improve the alignment, the two entities (the existing pairwise alignment and the new sequence) are each treated as a single sequence.

ClustalW - Alternative 2: If, on the other hand, two separate sequences have to be aligned together, then the first pairwise alignment is placed to one side and the pairwise alignment of the other two is carried out.

ClustalW- Progression The alignment is progressively built up in this way, with each step being treated as a pairwise alignment, sometimes with each member of a pair having more than one sequence.

ClustalW - Good points/bad points. Advantages: speed. Disadvantages: no objective function; no way of quantifying whether or not the alignment is good; no way of knowing if the alignment is correct.

ClustalW - User-supplied values: Two penalties are set by the user (there are default values, but you should know that it is possible to change these). GOP (Gap Opening Penalty) is the cost of opening a gap in an alignment. GEP (Gap Extension Penalty) is the cost of extending this gap.
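The effect of the two penalties can be illustrated with a minimal sketch that scores an already aligned pair of sequences. The function name, the match and mismatch scores, the penalty values and the convention that the first gap position costs GOP and every further position costs GEP are all assumptions made for illustration; they are not ClustalW's actual scoring scheme or defaults.

```python
def alignment_score(row_a, row_b, match=1.0, mismatch=-1.0, gop=10.0, gep=0.5):
    """Score two aligned rows with an affine gap penalty (illustrative values only)."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    score = 0.0
    gap_in = None  # which row currently has an open gap: "a", "b" or None
    for a, b in zip(row_a, row_b):
        if a == "-" and b == "-":
            continue  # a column holding only gaps contributes nothing
        if a == "-" or b == "-":
            side = "a" if a == "-" else "b"
            # Opening a gap costs GOP, extending an open gap costs GEP.
            score -= gep if gap_in == side else gop
            gap_in = side
        else:
            score += match if a == b else mismatch
            gap_in = None
    return score


one_long_gap = alignment_score("AC--GT", "ACTTGT")    # 4 matches, one gap of length 2
two_short_gaps = alignment_score("A-C-GT", "ATCTGT")  # 4 matches, two gaps of length 1
print(one_long_gap, two_short_gaps)
```

With a GOP that is much larger than the GEP, one long gap is penalised less than several short ones, which is why changing these two values can change the resulting alignment.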

Advice on alignments: Treat cautiously. Can be improved by eye (usually). It often helps to have colour-coding. Depending on the use, the user should be able to make a judgement on which regions are reliable and which are not. For phylogeny reconstruction, only use those positions whose hypothesis of positional homology is unimpeachable.

Terminology: A tree is a mathematical structure that is used to model the actual evolutionary history of a group of sequences or organisms. It represents the phylogenetic relationships between organisms or genes and consists of nodes connected by branches. Terminal nodes are also called leaves, OTUs (Operational Taxonomic Units) or terminal taxa. Internal nodes represent hypothetical ancestors.

Terminology: Types of phylogenetic trees. Cladogram: shows relative recency of common ancestry. Additive trees (or metric trees or phylograms): contain additional information, namely branch lengths, which correspond to the amount of evolutionary change. Ultrametric trees (or dendrograms): a special kind of additive tree in which the tips are all equidistant from the root.

Terminology: Rooted versus unrooted trees. Rooted tree: has a root node, so the direction of the branches corresponds to evolutionary time. Unrooted tree: specifies the relationships between OTUs but does not define the evolutionary path.

Terminology

Terminology. Homoplasy: parallel evolution, convergent evolution, secondary loss. Similarity: Any 2 sequences can be compared and the similarity computed (% nucleotide identity). Allowing gaps, 2 non-homologous nt sequences can have a similarity of up to 50%; for aa sequences this can be up to 20%.

Phylogenetic Inference: Commonly used methods are usually classified into four major groups: parsimony methods, distance methods, likelihood methods and Bayesian methods.

Phylogenetic Inference

Cluster methods vs. search methods. Cluster methods use an algorithm (set of steps) to generate a tree: they are easy to implement, computationally efficient and produce a single tree, but the tree depends upon the order in which we add sequences to the tree. Search methods use some sort of optimality criterion to choose among the set of all possible trees. The optimality criterion gives each tree a score that is based on the comparison of the tree to the data. Advantage: search methods use an explicit function relating the trees to the data. Disadvantage: computationally very expensive (NP-complete problem).

Maximum Parsimony. Aims to find the tree topology that can be explained with the smallest number of character changes. The most parsimonious (simplest) explanation is assumed to be the most likely one in evolutionary terms. Given a set of characters, such as aligned sequences, parsimony analysis works by determining the fit (number of steps) of each character on a given tree. The sum over all characters is called the tree length. Most parsimonious trees (MPTs) have the minimum tree length needed to explain the observed characters. Finding them requires evaluating the tree length for all possible topologies.

Maximum Parsimony: worked example
[Figure: character matrix of four sequences (seq 1 to seq 4) at five sites, and the three possible unrooted trees for four taxa.]
Number of steps per site on each tree:
Tree            Site 1  Site 2  Site 3  Site 4  Site 5  Total
((1,2),(3,4))     1       1       2       1       0       5
((1,3),(2,4))     2       2       1       1       0       6
((1,4),(2,3))     2       2       2       1       0       7
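Counting the steps of each character on a tree can be sketched with Fitch's algorithm, one common way of computing the tree length for unordered characters (the slides do not name the algorithm, so this choice is an assumption). The four sequences below are hypothetical: they were chosen only so that the three four-taxon trees receive lengths of 5, 6 and 7 steps, the totals given in the worked example.

```python
def fitch_steps(tree, states):
    """Return (possible ancestral states, steps) for one character on a nested-tuple tree."""
    if isinstance(tree, str):          # a leaf: its state set is the observed base
        return {states[tree]}, 0
    left, right = tree
    left_set, left_steps = fitch_steps(left, states)
    right_set, right_steps = fitch_steps(right, states)
    common = left_set & right_set
    if common:                          # children agree: no extra change needed
        return common, left_steps + right_steps
    return left_set | right_set, left_steps + right_steps + 1


def tree_length(tree, alignment):
    """Sum the Fitch step counts over all sites of the alignment."""
    n_sites = len(next(iter(alignment.values())))
    return sum(
        fitch_steps(tree, {name: seq[i] for name, seq in alignment.items()})[1]
        for i in range(n_sites)
    )


# Hypothetical sequences, for illustration only.
ALIGNMENT = {"1": "ATATT", "2": "ATCGT", "3": "GCAGT", "4": "GCCGT"}
TREES = {
    "((1,2),(3,4))": (("1", "2"), ("3", "4")),
    "((1,3),(2,4))": (("1", "3"), ("2", "4")),
    "((1,4),(2,3))": (("1", "4"), ("2", "3")),
}
lengths = {name: tree_length(tree, ALIGNMENT) for name, tree in TREES.items()}
print(lengths)
```

The tree with the smallest length is the most parsimonious one; for unordered characters the position of the root used in the nested tuples does not change the length.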

Maximum Parsimony. Results: one or more most parsimonious trees; hypotheses of character evolution associated with each tree (where and how changes have occurred); branch lengths (amounts of change associated with branches); various tree and character statistics describing the fit between tree and data.

Maximum Parsimony. Advantages: it is a simple method whose operation is easily understood; it does not seem to depend on an explicit model of evolution; it gives trees and associated hypotheses of character evolution; it gives reliable results if the data are well structured and homoplasy is either rare or widely (randomly) distributed on the tree. Disadvantages: it may give misleading results if homoplasy is common; it underestimates branch lengths; the model of evolution is implicit, so the behaviour of the method is not well understood.

Distance methods

Distance methods. Distance estimates attempt to estimate the mean number of changes per site since 2 taxa last shared a common ancestor. During evolution, multiple hits can have happened at a single position, so the evolutionary distance is almost always larger than the dissimilarity (% nucleotide divergence). [Figure: sequence difference against time/evolutionary distance, comparing the observed difference with the expected difference based on the number of mutations that happened; the gap between the two is the correction.]

Distance methods: computation of evolutionary distances
[Figure: three aligned DNA sequences, numbered 1 to 3.]
Dissimilarity:
       1      2
2    0.266
3    0.333  0.333
Convert dissimilarity to evolutionary distance by correcting for multiple events per site according to a certain model of evolution.
Evolutionary distance:
       1      2
2    0.328
3    0.44   0.44
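As a minimal sketch of this conversion, the code below computes the dissimilarity (p-distance) of two aligned sequences and applies the Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3). The slide does not say which model was used; Jukes-Cantor is assumed here because it turns 0.266 and 0.333 into about 0.328 and 0.44, the values shown. The function names are hypothetical.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of aligned sites at which two equally long sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    differences = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return differences / len(seq_a)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance for an observed dissimilarity p (requires p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


for p in (0.266, 0.333):
    print(p, round(jukes_cantor(p), 3))
```

The corrected distance is always at least as large as the dissimilarity, and the difference grows as the sequences become more divergent.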

Distance methods: model of evolution. [Figure: substitution diagram between purines (A, G) and pyrimidines (C, T) in which every substitution is labelled α.] All substitution rates are equal (α).

Distance methods. 4 possible transitions: A<->G and C<->T (each in both directions). 8 possible transversions: A<->C, A<->T, G<->C and G<->T (each in both directions). Thus, if mutations were random, transversions would be 2 times more likely than transitions. Due to steric hindrance and chemical properties, the opposite is true: in general, transitions occur about 2 times more often. Transversions result in more disruptive amino acid changes.
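The counts of 4 transitions and 8 transversions can be checked by enumerating all 12 directed changes between different bases. This is only an illustrative sketch; the helper name is hypothetical.

```python
from itertools import permutations

PURINES = {"A", "G"}  # C and T are the pyrimidines


def is_transition(x, y):
    """A change is a transition when both bases belong to the same chemical class."""
    return (x in PURINES) == (y in PURINES)


changes = list(permutations("ACGT", 2))  # all ordered pairs of different bases
transitions = [c for c in changes if is_transition(*c)]
transversions = [c for c in changes if not is_transition(*c)]
print(len(transitions), len(transversions))
```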

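Distance methods: model of evolution. [Figure: substitution diagram between purines (A, G) and pyrimidines (C, T); transitions are labelled α and transversions β.] The rate for transitions (α) is different from the rate for transversions (β).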
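A model with separate rates for transitions and transversions leads to a distance that treats the two kinds of difference separately. The sketch below uses the standard Kimura 2-parameter formula d = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the observed proportions of transitional and transversional differences. The function names and the example sequences are hypothetical.

```python
import math

PURINES = {"A", "G"}


def is_transition(x, y):
    """True for A<->G and C<->T differences."""
    return x != y and ((x in PURINES) == (y in PURINES))


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned sequences (gap columns skipped)."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x in "ACGT" and y in "ACGT"]
    n = len(pairs)
    p = sum(is_transition(x, y) for x, y in pairs) / n                 # transitions
    q = sum(x != y and not is_transition(x, y) for x, y in pairs) / n  # transversions
    return -0.5 * math.log((1.0 - 2.0 * p - q) * math.sqrt(1.0 - 2.0 * q))


# Hypothetical pair: one transition and one transversion in ten sites.
d = k2p_distance("AAAAAAAAAA", "GAAAAAAAAC")
print(round(d, 4))
```

Here the observed dissimilarity is 0.2, and the corrected distance is somewhat larger because some substitutions are assumed to be hidden by multiple hits.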

Distance methods: nucleotide substitution models
Jukes-Cantor (JC) model: equal base frequencies; all substitutions equally likely.
  From JC to K2P: allow for transition/transversion bias.
  From JC to F81: allow base frequencies to vary.
Felsenstein (F81) model: unequal base frequencies; all substitutions equally likely.
  From F81 to HKY85: allow for transition/transversion bias.
Kimura 2-parameter (K2P) model: equal base frequencies; transversions and transitions have different substitution rates.
  From K2P to HKY85: allow base frequencies to vary.
Hasegawa et al. (HKY85) model: unequal base frequencies; transversions and transitions have different substitution rates.
  From HKY85 to GTR: allow all six pairs of substitutions to have different rates.
General reversible (GTR) model: unequal base frequencies; all six pairs of substitutions have different rates.

Distance methods. Advantages: fast, and therefore suitable for analysing data sets which are too large for ML; a large number of models are available with many parameters, which improves the estimation of distances. Disadvantages: information is lost, since given only the distances it is impossible to derive the original sequences; only through character-based analyses (ML, parsimony) can the most informative positions be inferred; generally outperformed by maximum likelihood methods in choosing the correct tree in computer simulations.

Maximum likelihood methods. Maximum likelihood methods of phylogenetic inference evaluate a hypothesis about evolutionary history (the branching order and branch lengths of a tree) in terms of the probability that a proposed model of the evolutionary process and the hypothesised history (tree) would give rise to the data we observe. The likelihood of observing a given set of sequence data for a specific substitution model is maximized for each topology, and the topology that gives the highest maximum likelihood is chosen as the final tree. The method requires a probabilistic model for the process of nucleotide substitution. Maximum likelihood methods of tree building must solve two problems: (1) For a given topology, what set of branch lengths makes the observed data most likely (what is the maximum likelihood value for that tree)? (2) Which tree of all the possible trees has the greatest likelihood?

Maximum likelihood methods. Given a set of aligned nucleotide sequences for four OTUs and a tree: what is the probability that this tree could have generated the data under our chosen model of evolution? Under the assumption that nucleotide sites evolve independently, we can calculate the likelihood for each site separately and combine the likelihoods into a total value. To calculate the likelihood for some site j, consider all possible scenarios for the unobserved nucleotides at the internal nodes: there are 16 possibilities to consider (4 x 4). Having calculated the likelihoods at each site, the joint probability that the tree and model confer upon all sites is computed as the product of the individual site likelihoods. Because the probability of any single observation is an extremely small number, we almost always evaluate the log of the likelihood instead, so the probabilities are accumulated as the sum of the logs of the single-site likelihoods.
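The site-by-site calculation can be sketched for a four-OTU tree ((1,2),(3,4)) with two internal nodes. The sketch assumes the Jukes-Cantor model, equal base frequencies of 0.25 and the same length t for all five branches; these choices, the function names and the example columns are assumptions made for illustration, and branch lengths are not optimised here.

```python
import math
from itertools import product

BASES = "ACGT"


def jc_prob(i, j, t):
    """Jukes-Cantor probability of observing base j after time t, starting from base i."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e


def site_likelihood(column, t=0.1):
    """Likelihood of one alignment column on the tree ((1,2),(3,4))."""
    a, b, c, d = column
    total = 0.0
    # Sum over the 16 combinations of states at the two internal nodes x and y.
    for x, y in product(BASES, repeat=2):
        total += (0.25 * jc_prob(x, a, t) * jc_prob(x, b, t)
                  * jc_prob(x, y, t)
                  * jc_prob(y, c, t) * jc_prob(y, d, t))
    return total


def log_likelihood(columns, t=0.1):
    """Sum of the log site likelihoods (sites are assumed to evolve independently)."""
    return sum(math.log(site_likelihood(col, t)) for col in columns)


columns = ["AAAA", "AAGG", "ACGT"]  # hypothetical alignment columns
print(log_likelihood(columns))
```

A full analysis would repeat this calculation while adjusting the branch lengths for each topology and then compare the maximised log-likelihoods of the topologies.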

Maximum likelihood methods. Advantages: mathematically rigorous and performs well in computer simulations; allows investigation of the fit between model and data; provides a simple way of comparing trees according to their likelihoods (difference tests, e.g. the Kishino-Hasegawa test). Disadvantages: maximum likelihood will only be consistent (converge on the true tree) if evolution proceeds according to the assumed model, so the question is how well the model fits the data; it becomes computationally infeasible with many taxa or many model parameters.

Choosing Models. Models can be made more parameter-rich to increase their realism, but the more parameters you estimate from the data, the more time is needed for an analysis and the more sampling error accumulates. One might have a realistic model but large sampling errors. Realism comes at a cost in time and precision! Fewer parameters may give an inaccurate estimate, but more parameters decrease the precision of the estimate. In general, use the simplest model which fits the data. Compare nested models incorporating additional parameters for their likelihoods.
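One common way to compare nested models by their likelihoods is a likelihood ratio test: twice the difference in maximised log-likelihood is compared with a chi-square distribution whose degrees of freedom equal the number of extra parameters. The slides do not name a specific test, so this is an assumed choice. The log-likelihood values below are hypothetical, and the table holds the standard 5% chi-square critical values for 1 to 4 degrees of freedom.

```python
# 5% critical values of the chi-square distribution, indexed by degrees of freedom.
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488}


def likelihood_ratio_test(lnl_simple, lnl_complex, extra_params):
    """Return the test statistic and whether the richer model fits significantly better."""
    statistic = 2.0 * (lnl_complex - lnl_simple)
    return statistic, statistic > CHI2_CRITICAL_5PCT[extra_params]


# Hypothetical maximised log-likelihoods for a simple model and a nested richer model
# that has one additional free parameter.
statistic, richer_is_better = likelihood_ratio_test(-2510.4, -2498.1, extra_params=1)
print(round(statistic, 1), richer_is_better)
```

If the improvement is not significant, the simpler model is kept, in line with the advice to use the simplest model which fits the data.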

Cluster methods. UPGMA (unweighted pair group method with arithmetic means): clustering is done by searching for the smallest distance in the pairwise distance matrix; only one tree is obtained. Neighbour-joining: the NJ algorithm uses as its criterion the distance between two OTUs corrected by the average distance of each to all other OTUs; unequal branch lengths are allowed; only one tree is obtained.

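Cluster methods: UPGMA worked example
Suppose a matrix of pairwise distances:
      A   B   C   D   E
B     2
C     4   4
D     6   6   6
E     6   6   6   4
F     8   8   8   8   8
The smallest distance is d(A,B) = 2, so A and B are joined, each with a branch length of 1. Compute new distances between (AB) and the other OTUs:
d(AB)C = (dAC + dBC)/2 = 4
d(AB)D = (dAD + dBD)/2 = 6
d(AB)E = (dAE + dBE)/2 = 6
d(AB)F = (dAF + dBF)/2 = 8

Cluster methods: UPGMA (continued)
      (AB)  C   D   E
C     4
D     6     6
E     6     6   4
F     8     8   8   8
D and E are joined, each with a branch length of 2. Compute new distances between (DE) and the other OTUs:
d(DE)(AB) = (dD(AB) + dE(AB))/2 = 6
d(DE)C = (dDC + dEC)/2 = 6
d(DE)F = (dDF + dEF)/2 = 8

Cluster methods: UPGMA (continued)
      (AB)  C   (DE)
C     4
(DE)  6     6
F     8     8   8
(AB) and C are joined at a height of 2. Compute new distances between (ABC) and the other OTUs:
d(ABC)(DE) = (d(AB)(DE) + dC(DE))/2 = 6
d(ABC)F = (d(AB)F + dCF)/2 = 8
In general UPGMA weights each cluster by the number of OTUs it contains; with these data the simple average gives the same values.

Cluster methods: UPGMA (continued)
      (ABC)  (DE)
(DE)  6
F     8      8
(ABC) and (DE) are joined at a height of 3. Compute the new distance between (ABCDE) and OTU F:
d(ABCDE)F = (d(ABC)F + d(DE)F)/2 = 8

Cluster methods: UPGMA (continued)
      (ABC),(DE)
F     8
F is joined to (ABCDE) at a height of 4. [Figure: final tree with tip branch lengths A = 1, B = 1, C = 2, D = 2, E = 2, F = 4.]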
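The clustering steps of the worked example can be sketched in a few lines. The function below is a minimal illustration, not a reference implementation: it weights each cluster by its number of OTUs when averaging distances, and it breaks ties by taking the first minimal pair, so equally close pairs may be merged in a different order than in the example without changing the resulting tree. The function name and the label format are hypothetical; the distances are those of the example with OTUs A to F.

```python
from itertools import combinations


def upgma(labels, distances):
    """Cluster OTUs by UPGMA; return a list of (cluster_a, cluster_b, node_height)."""
    sizes = {label: 1 for label in labels}  # cluster label -> number of OTUs
    d = {frozenset(pair): dist for pair, dist in distances.items()}
    merges = []
    while len(sizes) > 1:
        # Find the pair of clusters with the smallest distance.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        new = f"({a}{b})"
        for other in sizes:
            if other not in (a, b):
                # Average the distances, weighted by cluster size.
                d[frozenset((new, other))] = (
                    sizes[a] * d[frozenset((a, other))]
                    + sizes[b] * d[frozenset((b, other))]
                ) / (sizes[a] + sizes[b])
        sizes[new] = sizes[a] + sizes[b]
        del sizes[a], sizes[b]
        merges.append((a, b, height))
    return merges


DISTANCES = {
    ("A", "B"): 2, ("A", "C"): 4, ("B", "C"): 4,
    ("A", "D"): 6, ("B", "D"): 6, ("C", "D"): 6,
    ("A", "E"): 6, ("B", "E"): 6, ("C", "E"): 6, ("D", "E"): 4,
    ("A", "F"): 8, ("B", "F"): 8, ("C", "F"): 8, ("D", "F"): 8, ("E", "F"): 8,
}
merges = upgma("ABCDEF", DISTANCES)
for a, b, height in merges:
    print(a, b, height)
```

Each node height is half the distance between the two clusters that are joined, which is why the resulting tree is ultrametric.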

Search methods. Exhaustive search: guaranteed to find the minimum tree because all tree topologies are evaluated; not feasible for more than about 10 sequences. Branch and bound: guaranteed to find the minimum tree without evaluating all tree topologies; a larger number of taxa can be evaluated, but the number is still limited (depends on the dataset). Heuristic searches: not guaranteed to find the minimum tree; they use stepwise addition of taxa and a rearrangement process (branch swapping).

Search methods: number of possible trees. [Figure: all unrooted topologies obtained as taxa A to E are added one at a time.] The number of unrooted bifurcating trees grows as 1, 3, 15, 105, ... with the number of taxa n:
$$N_U = \frac{(2n-5)!}{2^{n-3}\,(n-3)!}$$
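The growth in the number of trees can be checked with a short sketch that evaluates (2n-5)! / (2^(n-3) (n-3)!), the standard count of unrooted bifurcating trees for n taxa, which reproduces the series 1, 3, 15, 105. The function name is hypothetical.

```python
from math import factorial


def unrooted_trees(n):
    """Number of unrooted bifurcating tree topologies for n >= 3 taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in range(3, 11):
    print(n, unrooted_trees(n))
```

With 10 taxa there are already more than two million topologies, which is why exhaustive searches are limited to small data sets.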

Branch and bound search methods. [Figure: stepwise addition of taxa A to E; the three trees with 4 taxa require 13, 16 and 17 substitutions, and a tree with 5 taxa built from the 13-substitution tree requires 15 substitutions.] Do not retain topologies that already require more substitutions than a complete tree found in a later step: here only 5 topologies have to be investigated instead of 15!
--- Introductory seminar on the use of molecular tools in natural history collections - 6-7 November 2007, RMC ---

Search methods: heuristic. Start with stepwise addition. Perform branch swapping, e.g. Tree Bisection Reconnection (TBR). [Figure: TBR rearrangements of a tree with taxa A to G.]

search methods Heuristic

Bootstrapping

Bayesian phylogenetics
Prior probability Pr[Tree i]: the probability of the tree before observations have been made.
Likelihood Pr[Data | Tree i]: proportional to the probability of the observations (= alignment); requires specific assumptions about the process generating the observations (= parameters of the evolutionary model).
Posterior probability Pr[Tree i | Data]: the probability of the tree conditional on the observations (= alignment); obtained by combining prior and likelihood for each tree using Bayes' formula:
$$\Pr[\text{Tree}_i \mid \text{Data}] = \frac{\Pr[\text{Data} \mid \text{Tree}_i]\,\Pr[\text{Tree}_i]}{\sum_j \Pr[\text{Data} \mid \text{Tree}_j]\,\Pr[\text{Tree}_j]}$$
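Combining prior and likelihood can be sketched for a small set of candidate trees: each likelihood is multiplied by its prior and the products are normalised so that they sum to one. The log-likelihood values and the flat prior below are hypothetical, and the function name is an assumption. Real analyses cannot enumerate all trees and rely on MCMC sampling instead.

```python
import math


def posterior(log_likelihoods, priors):
    """Posterior probability of each tree from its log-likelihood and prior probability."""
    largest = max(log_likelihoods)
    # Subtract the largest log-likelihood before exponentiating to avoid underflow.
    weights = [math.exp(lnl - largest) * prior
               for lnl, prior in zip(log_likelihoods, priors)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical log-likelihoods for three candidate trees with a flat prior.
post = posterior([-1000.0, -1001.0, -1005.0], [1 / 3, 1 / 3, 1 / 3])
print([round(p, 3) for p in post])
```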

Bayesian phylogenetics. The optimal tree is the one that maximizes the posterior probability. Bayesian methods allow complex models of evolution to be implemented (ML methods have problems when the ratio of data points to parameters is low). Bayesian methods rely on an algorithm (MCMC, Markov Chain Monte Carlo) that does not attempt to find the highest point in the space of all parameters. They treat parameters in a different way compared to ML methods (marginal vs joint estimation). They provide support measures directly (no bootstrapping).

Bayesian phylogenetics

Summary: Holder & Lewis (2003), Nature Reviews Genetics 4: 275-283.

Terminology: Gene trees and species trees. The divergence of genes is older than the divergence of the species. The topology of a gene tree can differ from the species tree due to lineage sorting, which depends on: long-term effective population size; generation time; the interval between successive speciations. When a speciation event occurs only every 1 or 2 million years, it is unlikely that the species tree differs from the gene tree.

Maximum likelihood methods: a non-biological example, coin tossing. If the probability of an event X dependent on model parameters p is written P(X | p), then we would talk about the likelihood L(p | X), that is, the likelihood of the parameters given the data. Likelihood is the hypothetical probability that an event that has already occurred would yield a specific outcome. The concept differs from that of a probability in that a probability refers to the occurrence of future events, while a likelihood refers to past events with known outcomes.
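The coin-tossing example can be sketched with a binomial model: for a fixed, already observed outcome, the likelihood is evaluated for many candidate values of p and the value with the highest likelihood is taken as the estimate. The observed counts (7 heads in 10 tosses) and the grid of candidate values are hypothetical.

```python
from math import comb


def likelihood(p, heads, tosses):
    """Binomial likelihood L(p | X) of heads-probability p for an observed outcome."""
    return comb(tosses, heads) * p ** heads * (1 - p) ** (tosses - heads)


# Hypothetical data: 7 heads in 10 tosses; candidate values of p on a grid.
grid = [i / 100 for i in range(101)]
best = max(grid, key=lambda p: likelihood(p, 7, 10))
print(best, round(likelihood(best, 7, 10), 4))
```

The maximum lies at the observed proportion of heads, and a fair coin (p = 0.5) has a lower, but not negligible, likelihood for the same data.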