G4120: Introduction to Computational Biology

Multiple Sequence Alignment (MSA)
A multiple sequence alignment is an alignment of a set of sequences in which structurally similar and evolutionarily homologous residues are aligned in columns. In an ideal alignment, the amino acid residues in each column would occupy similar locations in the 3D structure of the protein and would descend from a common ancestral residue. In theory, an unambiguously correct evolutionary alignment exists, but it can be difficult to infer and computationally intensive to calculate. Where structural data is lacking or limited, as is generally the case, structurally similar positions cannot be identified unambiguously. Thus, defining a single ideal alignment can be very difficult.

Alignment Algorithms

Dynamic Programming vs. Heuristic Alignment
Using dynamic programming algorithms (such as Smith-Waterman or Needleman-Wunsch) to perform an optimal alignment of more than a few sequences is computationally intensive, and generally impractical for large sets of sequences or lengthy sequences. As a result, most commonly used multiple sequence alignment algorithms take a heuristic approach. One common heuristic approach is progressive alignment, in which the problem is broken down into a series of pairwise alignments. The details of how to choose the initial pair to align, how to score alignments, how to align subsequent sequences, and whether subfamilies of alignments should be created can all vary.

MSA (Dynamic)
This algorithm uses a technique that reduces the complexity of dynamic programming when applied to multiple sequences, and can give an optimal alignment for no more than ten short (200-300 a.a.) protein sequences in a reasonable amount of time. For alignments with more or longer sequences, a heuristic approach is more practical.

Feng-Doolittle (Heuristic)
One of the first progressive alignment algorithms. It does not take advantage of profiles, which can increase the accuracy of the alignment.

ClustalW (Heuristic)
A profile-based progressive alignment algorithm which uses a number of heuristics to rapidly generate multiple sequence alignments, including phylogeny and scalable gap penalties.
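The pairwise dynamic programming step that these methods build on can be sketched as follows. This is a minimal Needleman-Wunsch global alignment score, assuming a simple match/mismatch score and a linear gap penalty; the function name and the default scoring values are illustrative assumptions, not values used by any of the programs above.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-2):
    """Return the optimal global alignment score of sequences a and b."""
    # previous[j] is the best score for aligning the rows processed so far
    # against the first j characters of b; row 0 is all gaps.
    previous = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        current = [i * gap]  # aligning a[:i] against an empty prefix of b
        for j in range(1, len(b) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = previous[j - 1] + pair   # align a[i-1] with b[j-1]
            up = previous[j] + gap              # gap in b
            left = current[j - 1] + gap         # gap in a
            current.append(max(diagonal, up, left))
        previous = current
    return previous[-1]


print(needleman_wunsch_score("ACGT", "AGT"))
```

The table filled here has one cell per pair of positions, so the work grows with the product of the two sequence lengths. Extending the same table to N sequences needs one dimension per sequence, which is why exact dynamic programming becomes impractical beyond a handful of sequences.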

Sequence Definitions

Identity
The extent to which two sequences are invariant.

Similarity
The extent to which sequences are related, based on sequence identity and/or conservation.

Conservation
Changes in an amino acid sequence that preserve the biochemical properties of the original residue. Most sequence comparison algorithms measure this with substitution matrices, in which the score for each residue pair is derived from the observed frequencies of substitutions in blocks of local alignments of related proteins.
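As a rough illustration of these definitions, the sketch below computes percent identity and percent similarity for two already-aligned sequences. The groups of "similar" amino acids are an assumption made for this example only; real programs score conservation with a substitution matrix, and conventions for the denominator (here, columns without gaps) differ between tools.

```python
# Illustrative groups of biochemically similar amino acids (an assumption
# for this sketch, not a standard definition).
SIMILAR_GROUPS = [set("ILVM"), set("FWY"), set("KRH"), set("DE"), set("ST"), set("NQ")]
GAP_CHARS = "-."


def identity_and_similarity(seq1, seq2):
    """Return (% identity, % similarity) over the ungapped aligned columns."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    pairs = [
        (x, y)
        for x, y in zip(seq1.upper(), seq2.upper())
        if x not in GAP_CHARS and y not in GAP_CHARS
    ]
    if not pairs:
        raise ValueError("no ungapped columns to compare")
    identical = sum(x == y for x, y in pairs)
    similar = sum(
        x == y or any(x in group and y in group for group in SIMILAR_GROUPS)
        for x, y in pairs
    )
    return 100.0 * identical / len(pairs), 100.0 * similar / len(pairs)
```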

Alignment with Text

Monospaced or Fixed Width vs. Variable Width Fonts
Each character in a monospaced (or fixed width) font takes up the same amount of horizontal space, like early typewriter fonts, allowing multiple sequence alignments to properly align. Variable width fonts can throw off multiple sequence alignments.

Fixed width fonts in OS X: Andale Mono, Courier, Courier New, Monaco, V100

Fixed Width Font Alignment (Courier):
...mshNqfqfiGnLtrD
MAsRGvNKVILVGnLGqD
MAvRGINKVILVGRLGkD

Variable Width Font Alignment (Times):
[Figure: the same three sequences set in Times, where the columns no longer line up]

Displaying Sequence Data

Displaying Information
Take care with your choice of fixed or variable width fonts.
Use fonts carefully and consistently. Avoid overuse or arbitrary use of fonts.
Use black or dark text against a white or very light background (no more than 20% color) to maximize comprehension. Avoid text that blends with a background, and be cautious in using light text on a dark background.
Use shading, case, bold, italic or color when appropriate, to add emphasis, contrast, or draw attention to a feature. Avoid displays where everything blends together or lacks contrast.
Align items to each other to establish a visual connection. Related items should be grouped in close proximity. Avoid simply placing items arbitrarily.
Use color logically and aesthetically. Avoid the overuse of color.

References
The Mac is Not a Typewriter and The Non-Designer's Design Book by Robin Williams
The Visual Display of Quantitative Information by Edward R. Tufte
Type & Layout by Colin Wheildon

Alignment with Excel

[Figure: multiple sequence alignment of seven sequences (RK2, E. coli, F, ColIb-P9, R64, pip71a, pip231a), positions 1-181, laid out one residue per spreadsheet cell in blocks of 50 columns, with conserved residues in upper case]

Sequence    % Identity   % Similarity
RK2         100.0        100.0
E. coli      32.8         56.0
F            28.5         54.3
ColIb-P9     30.2         52.6
R64          30.2         52.6
pip71a       29.3         55.2
pip231a      29.3         54.3

Excel can be used with any font, as it allows you to manually adjust the alignment.

ClustalW and ClustalX

ClustalW
ClustalW first generates a pairwise distance matrix for all the sequences by pairwise dynamic programming alignment. It then estimates evolutionary distance from similarity scores and constructs a guide tree using the neighbor joining distance matrix method. Dynamic programming is then used to align the most closely related pairs of sequences. A sequence profile is constructed from these alignments, and the remaining sequences are progressively aligned to each other in order of decreasing similarity by profile-profile, profile-sequence or sequence-sequence alignment, until a complete multiple sequence alignment has been generated.

ClustalW automatically chooses the scoring matrix for protein alignments based on whether the sequences are close or distant neighbors in the tree. Thus it might use BLOSUM62 (optimal for close relationships) for close neighbors, and BLOSUM45 (optimal for distant relationships) for distant neighbors.

ClustalW also allows for scalable gap penalties in protein profile alignments. For example, a gap opening next to a highly conserved residue can be penalized more heavily than a gap opening next to an unconserved residue.

ClustalX
This is a version of ClustalW with a graphical user interface, which is more intuitive to use, though the formatting requirements for input files need to be followed closely. It can display multiple sequence alignments onscreen, or output them as PostScript, which can automatically be converted to PDF format by OS X 10.3.

Alignment with ClustalX

[Figure: CLUSTAL X (1.82) MULTIPLE SEQUENCE ALIGNMENT, File: tadafasta.ps, Date: Wed Apr 2 12:19:01 2003, Page 1 of 2. PostScript output showing alignment columns 1-450 in three blocks of 150 for twelve sequences (V_fisch1, V_fisch2, V_vulnII1_6, Y_pes, Y_ent, A_act, H_aph, P_mul, H_duc, A_pleur, V_vulnI8_11, V_vulnI6_11), with a line of conservation marks (* : .) above each block and a ruler below it]

Mutation Rate

$r = \frac{K}{2T}$

r = rate of substitution
K = number of substitutions per site
T = divergence time

That is, the rate of substitution is the number of substitutions per site divided by twice the divergence time. The factor of two accounts for changes accumulating independently along both lineages since they diverged.

When substitutions are common, a particular site may have undergone multiple changes. Thus, alignments between sequences with many differences will underestimate the true number of substitutions that have occurred. The true number of substitutions can be estimated by

$K = -\frac{3}{4} \ln\left(1 - \frac{4}{3}p\right)$

where p is the fraction of nucleotides that differ between the two sequences.
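The two formulas above can be worked through in a few lines. This is a minimal sketch, assuming two aligned nucleotide sequences of equal length with "-" for gaps; the function names are illustrative.

```python
import math


def corrected_substitutions(seq1, seq2):
    """Estimate K, substitutions per site, from the observed fraction p."""
    compared = [(x, y) for x, y in zip(seq1, seq2) if x != "-" and y != "-"]
    p = sum(x != y for x, y in compared) / len(compared)
    if p >= 0.75:
        # The logarithm is undefined once p reaches 3/4 (saturation).
        raise ValueError("sequences are too divergent for this correction")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p)


def substitution_rate(k, divergence_time):
    """r = K / 2T, in substitutions per site per unit of time."""
    return k / (2.0 * divergence_time)


# One difference in ten sites: p = 0.1, K is slightly larger than p.
k = corrected_substitutions("ACGTACGTAC", "ACGTACGTAT")
print(round(k, 4))
```

For small p the corrected value is close to p itself; the correction only becomes large as the sequences approach saturation.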

Constraints on Mutations

Transitions vs. Transversions
Transitions (exchanging one purine (A or G) for the other purine, or one pyrimidine (C, T or U) for another pyrimidine) are three times as common as transversions (purine for pyrimidine or vice versa).

Functional Constraints
Functional constraints in coding or regulatory regions also impact the rate of change.

Divergence Among Human, Mouse, Rabbit and Cow Globin Genes by Region
Region            Substitutions/site/10^9 years
Noncoding         3.33
Coding            1.58
5' untranslated   1.86
3' untranslated   3.00
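The transition/transversion distinction can be made concrete with a short sketch that classifies each difference between two aligned nucleotide sequences. It assumes both sequences use the same alphabet (both DNA or both RNA); the function names are illustrative.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T", "U"}


def classify_substitution(base1, base2):
    """Return 'identical', 'transition' or 'transversion' for two bases."""
    b1, b2 = base1.upper(), base2.upper()
    if b1 == b2:
        return "identical"
    same_class = (b1 in PURINES and b2 in PURINES) or (
        b1 in PYRIMIDINES and b2 in PYRIMIDINES
    )
    return "transition" if same_class else "transversion"


def count_ts_tv(seq1, seq2):
    """Count transitions and transversions, skipping gaps and ambiguity codes."""
    valid = PURINES | PYRIMIDINES
    counts = {"transition": 0, "transversion": 0}
    for x, y in zip(seq1, seq2):
        if x.upper() in valid and y.upper() in valid:
            kind = classify_substitution(x, y)
            if kind in counts:
                counts[kind] += 1
    return counts
```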

Synonymous vs. Nonsynonymous Substitutions

Nondegenerate Sites
Codon positions where any nucleotide mutation would cause a change in the amino acid (a nonsynonymous substitution). Example: Phenylalanine first and second codon positions (UUU)

Twofold Degenerate Sites
Codon positions where one nucleotide mutation would not cause a change in the amino acid (but the two other possible mutations would). Example: Aspartic acid third codon position (GAU, GAC)

Fourfold Degenerate Sites
Codon positions where no nucleotide mutation would cause a change in the amino acid (every change is a synonymous substitution). Example: Glycine third codon position (GGG, GGA, GGU, GGC).

Divergence Among Human and Rabbit Globin Genes by Sites
Site                  Substitutions/site/10^9 years
Nondegenerate         0.56
Twofold Degenerate    1.67
Fourfold Degenerate   2.35
vs.
Noncoding             3.33
Coding                1.58
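The degeneracy of a codon position can be checked mechanically against the standard genetic code. The sketch below is one way to do it, assuming the standard code and treating U and T as equivalent; the compact table layout and the function name are choices made for this example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def site_degeneracy(codon, position):
    """Return how many of the four nucleotides at this codon position
    (0, 1 or 2) encode the same amino acid: 1 = nondegenerate,
    2 = twofold degenerate, 4 = fourfold degenerate."""
    codon = codon.upper().replace("U", "T")
    original = CODON_TABLE[codon]
    synonymous = 0
    for base in BASES:
        if base == codon[position]:
            continue
        mutant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[mutant] == original:
            synonymous += 1
    return synonymous + 1


for codon in ("UUU", "GAU", "GGG"):
    print(codon, [site_degeneracy(codon, i) for i in range(3)])
```

A few positions, such as the third position of the isoleucine codons, return 3 because three of the four nucleotides encode the same amino acid.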

Variations in Evolutionary Rates

Variations in Substitution Rates
Substitution rates do not appear to be constant even within the genomes of closely related species, and they also vary from species to species.

Relative Rate Test
It is possible to estimate the overall rate of substitution in different lineages without knowing the exact divergence time by determining a relative rate of substitution. To compare the relative rate of substitution in species A and species B, one designates a less related species, C, as an outgroup. This allows you to estimate the amount of divergence that has taken place in species A and B since they last shared a common ancestor.

Relative Rates of Synonymous Substitutions in Genes
Mammalian Fibrinopeptide   4
Mammalian Hemoglobin       1
Mammalian Cytochrome c     0.2
Influenza NS               1,000,000

Relative Rates of Synonymous Substitution in Genomes
Human genome         1
Plant genome         1
Mouse genome         2
Rat genome           2
Human mitochondria   10
Plant chloroplast    0.3
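The relative rate test reduces to simple arithmetic on three pairwise distances. The sketch below uses the usual three-taxon relations, assuming the pairwise distances are additive along the tree; the function and argument names are illustrative.

```python
def relative_rate_test(d_ab, d_ac, d_bc):
    """Return (d_a, d_b): the divergence along the lineages leading to A and B
    since their last common ancestor, using outgroup C.

    d_ab, d_ac, d_bc are pairwise distances (substitutions per site).
    """
    d_a = (d_ab + d_ac - d_bc) / 2.0
    d_b = (d_ab + d_bc - d_ac) / 2.0
    return d_a, d_b


# B is further from the outgroup than A is, so B's lineage has changed more.
d_a, d_b = relative_rate_test(d_ab=0.3, d_ac=0.5, d_bc=0.6)
print(d_a, d_b)
```

Because A and B have been diverging for the same amount of time, the ratio of the two values is the ratio of their substitution rates, with no divergence date required.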

A Brief History of Phylogeny

1735   Taxonomy             Karl von Linné
1750   Phenetic Taxonomy    Michel Adanson
1859   Evolution            Charles Darwin
1866   Phylogeny            Ernst Haeckel
1950   Cladistic Taxonomy   Willi Hennig

"[in time] we shall have very fairly true genealogical trees of each great kingdom of nature"
Charles Darwin, 1857, letter to T. H. Huxley

"The History of the Germ is an epitome of the History of the Descent, or, in other words: that Ontogeny is a recapitulation of Phylogeny"
Ernst Haeckel, 1897, The Evolution of Man

"The universal phylogenetic tree not only spans all extant life, but its root and earliest branchings represent stages in the evolutionary process before modern cell types had come into being. The evolution of the cell is an interplay between vertically derived and horizontally acquired variation. Primitive cellular entities were necessarily simpler and more modular in design than are modern cells. Consequently, horizontal gene transfer early on was pervasive, dominating the evolutionary dynamic. The root of the universal phylogenetic tree represents the first stage in cellular evolution when the evolving cell became sufficiently integrated and stable to the erosive effects of horizontal gene transfer that true organismal lineages could exist."
Carl Woese, 2000, Interpreting the Universal Phylogenetic Tree

Taxonomy Taxonomy is the classification of organisms into an ordered system that indicates natural relationships. Karl von Linné, a.k.a. Caroli Linnaei or Linnaeus (1707-1778), invented modern taxonomy by developing a hierarchy of taxa (kingdom, class, order, genus, and species) and a system of binomial nomenclature based on genus and a characteristic feature of a species.

Phylogeny
Phylogeny is the sequence of events involved in the evolutionary development of a species or taxonomic group. Ernst Haeckel (1834-1919) was originally trained as a physician, but devoted himself to the study of evolution after reading Darwin's Origin of Species. He was made famous by his own phrase, "ontogeny recapitulates phylogeny." He coined the term phylogeny and created the first phylogenetic trees.

Phenetic vs. Cladistic Approaches

Phylogenetic Reconstruction
Phylogenetic reconstruction attempts to estimate the phylogeny for some data. Any collection of sequences will share some ancestral relationship, and the data within the sequences contains information that can be used to reconstruct or infer these ancestral relationships. A phylogenetic tree is a branching structure which illustrates the relationships between the sequences. Nearly any approach currently used in phylogenetic reconstruction has adherents and detractors. At the moment, there is some disagreement as to best practices and principles for phylogenetic analysis.

Phenetic Approach
Phenetic taxonomy was invented in 1750 by Michel Adanson. In the phenetic approach, a tree is constructed by considering the phenotypic similarities of the species without trying to understand the evolutionary pathways of the species, and thus may or may not be the correct phylogeny. Trees constructed by this method are called phenograms or dendrograms.

Cladistic Approach
Cladistic taxonomy was invented by the German entomologist Willi Hennig in 1950. It involves the rigorous application of the concept of evolution to taxonomy. Taxa are defined by what distinctive features their members have, not what features they share with others. In the cladistic approach, a phylogenetic tree is reconstructed by considering the various possible pathways of evolution and choosing from amongst these the best possible tree, that is, the tree that involves the fewest changes, and thus the least amount of convergent evolution. Trees reconstructed by this method are called cladograms.

Phylogenetic Trees

Rooted Trees
In a rooted tree, a single node is designated as a common ancestor, and a unique path leads from it through evolutionary time to all other nodes. It thus provides information about the common ancestry of sequences and the direction of evolution, and is the most common type of tree used to study evolutionary relationships.

[Figure: Rooted Tree with Scaled Branches]

Unrooted Trees
Unrooted trees specify only the relationship between nodes, and nothing about the direction in which evolution occurred. A root can be assigned to an unrooted tree through the use of an outgroup, for example a species that unambiguously diverged earlier than the other species being compared (e.g. baboon, when comparing humans and gorillas).

Source: Krane & Raymer, Fundamental Concepts of Bioinformatics, NCBI

Tree Topology

Operational Taxonomic Unit (OTU)
OTUs correspond to the terminal nodes of a phylogenetic tree (also known as leaves, tips or external nodes). They represent the genes, organisms, families, species or populations, as appropriate, for which you have data.

Internal Node
Internal nodes correspond to points within a phylogenetic tree where interior branches meet (also known as vertices). These represent inferred ancestors.

Outgroup
An OTU or taxon included for the purpose of rooting a tree.

Rooted Tree Reconstruction

The number of possible unrooted trees is one step behind the number of rooted trees: n OTUs have as many unrooted trees as n - 1 OTUs have rooted trees (e.g. 5 species or OTUs give 15 unrooted trees). This is still an enormous number when there are many species or OTUs. For bifurcating trees, the number of possible trees for n OTUs is

$N_{rooted} = \frac{(2n-3)!}{2^{n-2}(n-2)!}$

$N_{unrooted} = \frac{(2n-5)!}{2^{n-3}(n-3)!}$

(Brian Golding, Reconstructing Phylogenies).
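To get a feel for how quickly these counts grow, the two formulas can be evaluated directly. This is a small sketch using exact integer arithmetic; the function names are illustrative.

```python
from math import factorial


def rooted_tree_count(n):
    """Number of bifurcating rooted trees for n OTUs (n >= 2)."""
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))


def unrooted_tree_count(n):
    """Number of bifurcating unrooted trees for n OTUs (n >= 3)."""
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))


for n in (3, 5, 10, 20):
    print(n, rooted_tree_count(n), unrooted_tree_count(n))
```

With 5 OTUs there are 105 rooted and 15 unrooted trees; with 10 OTUs there are already more than 34 million rooted trees, which is why exhaustive evaluation is only feasible for small data sets.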

Phylogenetic Terminology

Homologs
Genes with a common ancestral sequence. They may have been separated by speciation (orthologs) or duplication (paralogs).

Orthologs
Homologous genes in different species that arose from a common ancestor. They tend to have similar structure and function.

Paralogs
Similar genes within a single species that are the result of a gene duplication. They tend to have different but related functions.

Xenologs
Genes acquired by horizontal transfer between species, typically mediated by a plasmid, transposable element, or virus.

Synapomorphy
Having characters that are both derived from a common ancestor and uniquely shared by a group. This is essential to clearly establishing a phylogeny. Having characters that are only derived or only shared is not sufficient to establish a phylogeny.

Homoplasies
Convergences of a particular character at a particular site. These typically pose the most difficulty in attempting to reconstruct the ancestral phylogenetic tree.

Phylogenetic Tree Terminology

Monophyletic
A group descended from a single common ancestor that contains all of the descendants of that ancestor, and only those descendants.

Paraphyletic
A group descended from a single common ancestor that does not contain all the descendants of that ancestor.

Polyphyletic
A group whose members are not descended from a single common ancestor.

Gene Tree
A phylogenetic tree based on divergence observed within a single homologous gene in different species. It may accurately represent the evolutionary history of that gene, but not necessarily of the species. Species trees are best based on the comparison of numerous genes.

Bootstrapping
A method for checking the robustness of a given phylogenetic tree by checking whether every portion of the alignment equally supports the structure of the tree.

Newick Tree Format
A common text file format for representing simple phylogenetic trees as a set of nested parentheses, e.g. (B,(A,C,E),D); or with branch lengths included, (B:6.0,(A:5.0,C:7.0,E:4.0):4.0,D:10.0);
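The nesting in a Newick string maps directly onto a recursive data structure. The sketch below parses the simple form shown above (names, optional branch lengths, nested parentheses) into tuples. It is an illustrative reader only and does not handle quoted labels or comments, which the full format allows; the function names and the tuple layout are assumptions of this example.

```python
def parse_newick(text):
    """Parse a simple Newick string into (name, branch_length, children)."""
    text = "".join(text.split())  # ignore whitespace and line breaks
    pos = 0

    def read_until(stop_chars):
        nonlocal pos
        start = pos
        while pos < len(text) and text[pos] not in stop_chars:
            pos += 1
        return text[start:pos]

    def parse_node():
        nonlocal pos
        children = []
        if pos < len(text) and text[pos] == "(":
            pos += 1  # consume "("
            children.append(parse_node())
            while pos < len(text) and text[pos] == ",":
                pos += 1  # consume ","
                children.append(parse_node())
            if pos >= len(text) or text[pos] != ")":
                raise ValueError("expected ')' at position %d" % pos)
            pos += 1  # consume ")"
        name = read_until(",():;")
        length = None
        if pos < len(text) and text[pos] == ":":
            pos += 1  # consume ":"
            length = float(read_until(",();"))
        return (name, length, children)

    root = parse_node()
    if text[pos:] != ";":
        raise ValueError("expected ';' at the end of the tree")
    return root


def leaf_names(node):
    """List the OTU names at the tips of a parsed tree, left to right."""
    name, _, children = node
    if not children:
        return [name]
    return [leaf for child in children for leaf in leaf_names(child)]


tree = parse_newick("(B:6.0,(A:5.0,C:7.0,E:4.0):4.0,D:10.0);")
print(leaf_names(tree))
```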

Distance Matrix Methods

Distance Method
Distance based methods attempt to construct trees based on measures of distance between OTUs (e.g. genes or species). In contrast, character based methods evaluate particular features (e.g. DNA sequence, amino acid sequence, number of legs, etc.).

Unweighted-Pair-Group Method with Arithmetic Mean (UPGMA)
A clustering algorithm which constructs a distance matrix, then clusters together the least distant pair of Operational Taxonomic Units (OTUs), followed by successively more distant OTUs. At each step of the algorithm, the number of OTUs declines by one, replaced by a joint OTU, from which subsequent distances from other OTUs are calculated, until the algorithm finishes by clustering the last pair of OTUs. This method assumes that the rate of evolutionary change between all branches of the tree is the same, which is generally not a valid assumption. In nature, examples of rates of evolution varying between taxa are common. As a result, corrections to this assumption are often used with this approach.

Neighbor Joining Method
This attempts to correct for the assumption made by UPGMA that the same rate of evolutionary change applies to all branches of the tree. It is otherwise similar to UPGMA, but generally gives better results. It yields an unrooted tree.

Fitch and Margoliash
This method attempts to find an optimal tree of minimal distance. It yields an unrooted tree.
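The UPGMA clustering loop described above can be sketched compactly. This is a minimal illustration rather than a production implementation: distances are supplied in a dictionary keyed by frozenset pairs, clusters are labeled with Newick-like strings, and each merged node records its height (half the distance between the clusters it joins). All of these representation choices are assumptions made for the example.

```python
from itertools import combinations


def upgma(names, distances):
    """Cluster OTUs by UPGMA.

    distances maps frozenset({a, b}) to the distance between OTUs a and b.
    Returns nested tuples (left_subtree, right_subtree, node_height).
    """
    # Each active cluster: label -> (subtree, number of OTUs it contains)
    clusters = {name: (name, 1) for name in names}
    d = dict(distances)
    while len(clusters) > 1:
        # Join the least distant pair of clusters.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        (tree_a, size_a), (tree_b, size_b) = clusters.pop(a), clusters.pop(b)
        merged = "(%s,%s)" % (a, b)
        height = d[frozenset((a, b))] / 2
        # Distance from the joint cluster to every other cluster is the
        # average over all member OTUs (size-weighted arithmetic mean).
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b, height), size_a + size_b)
    (tree, _size), = clusters.values()
    return tree


example = {
    frozenset("AB"): 2, frozenset("AC"): 6, frozenset("BC"): 6,
    frozenset("AD"): 10, frozenset("BD"): 10, frozenset("CD"): 10,
}
print(upgma(["A", "B", "C", "D"], example))
```

Because every node is placed at half the distance between the clusters it joins, all tips end up equidistant from the root, which is the constant-rate assumption the text warns about.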

Maximum Parsimony Methods

Maximum Parsimony
The maximum parsimony method involves evaluating as many trees as possible, giving each a score that is used to choose between different trees. The highest scoring, or most parsimonious, tree is the one with the minimum number of evolutionary changes. A number of different methods can be used to calculate scoring.

Fitch Parsimony
For a particular tree, traverse from the leaves toward the root of the tree. At each internal node, determine the set of possible states (e.g. nucleotides). Then, traverse the tree from the root towards the leaves, picking ancestral states for each internal node to minimize the number of changes required. The Fitch algorithm assumes position independence, and that any state is equally likely to change to any other state. Variations which weight the costs of changes differently exist.

Dollo Parsimony
Assumes that a derived character state cannot be lost and then regained. Hence, a derived state can be lost many times throughout evolution, but cannot be inferred to have evolved more than once. The tree with maximum parsimony is the one in which derived characters have been lost the fewest number of times. This method has been used with restriction fragment length polymorphism (RFLP) data, since restriction sites are difficult to gain, but easy to lose. It may be more useful when dealing with non-sequence data, for example, complex phenotypes, which are unlikely to have evolved more than once.

Source: Brian Golding, Reconstructing Phylogenies
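The leaves-to-root pass of Fitch parsimony is enough to score a tree for one site, and can be sketched as follows. The sketch assumes a bifurcating tree written as nested pairs of leaf names and a single character per leaf; it counts the minimum number of changes and does not perform the second, root-to-leaves pass that assigns ancestral states. The function name and tree representation are illustrative.

```python
def fitch_score(tree, states):
    """Return (minimum number of changes, state set at the root) for one site.

    tree is a leaf name (str) or a pair (left_subtree, right_subtree);
    states maps each leaf name to its observed state, e.g. a nucleotide.
    """
    changes = 0

    def state_set(node):
        nonlocal changes
        if isinstance(node, str):
            return {states[node]}
        left, right = (state_set(child) for child in node)
        common = left & right
        if common:
            return common      # children agree: no change needed here
        changes += 1           # children disagree: one change is required
        return left | right

    root_states = state_set(tree)
    return changes, root_states


site = {"A": "G", "B": "G", "C": "T", "D": "T"}
print(fitch_score((("A", "B"), ("C", "D")), site))
print(fitch_score((("A", "C"), ("B", "D")), site))
```

For this site the first tree needs one change and the second needs two, so the first is the more parsimonious of the pair. A full analysis would sum the score over all sites and compare many candidate trees.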

The Principle of Parsimony

Occam's Razor
"Pluralitas non est ponenda sine necessitate" (Do not increase the number of entities required to explain anything beyond what is strictly necessary)
William of Occam (or Ockham) (1284-1347)

[Figure: example trees, captioned "Requires fewer changes than its neighbor" and "These two trees are equally parsimonious"]

Other Methods

Maximum Likelihood
The method of maximum likelihood attempts to reconstruct a phylogeny using an explicit model of evolution. It specifies values for the likelihood of a given trait evolving within a lineage, and chooses the most likely tree, given these values. It attempts to predict the most likely interior nodes given the OTUs, then the most likely tree. Theoretically, this may be the most powerful method available. For a given model of evolution, no other method will perform as well nor provide you with as much information about the tree. Unfortunately, this is computationally difficult to do and hence, the model of evolution must be a simple one. Even with simple models of evolutionary change the computational task is enormous and this is the slowest of all methods.

Compatibility
Compatibility methods recode data involving multi-state characters to include knowledge of the ancestral states of characters, and from this determine what changes are compatible. Compatibility methods are more accurate when there are slow rates of evolutionary change. Both compatibility and parsimony assume that homoplasies will be rare.

Source: Brian Golding, Reconstructing Phylogenies

Rules of Thumb for Phylogeny

Use more than one method. Each one will provide a phylogenetic history biased by that method's assumptions.
Bootstrap or jackknife your data to test the quality of your tree. When bootstrapping, use at least several hundred iterations of resampling and tree generation.
Run your analysis with different subsets of taxa to see if the trees thus generated are congruent. Dropping a single OTU should not dramatically change your tree.
Treat long branches with caution. They tend to attract each other.
Beware of non-orthologous genes, horizontal gene transfers, or recombinant sequences. Standard phylogenetic methods do not handle them well.
When using outgroups, consider including more than one outgroup taxon, and choose outgroup species that are evenly spaced on the tree.
Including intermediate taxa can help resolve even the relationship of a few taxa.
When the number of substitutions per site is unusually high or low, distance methods may perform better than parsimony methods.
If you expect homoplasies to be scattered at random throughout the sequence data, then a parsimony method will perform best.
If homoplasies are expected to be concentrated in a few characters, whose identities are known in advance, then compatibility will perform better than parsimony.
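The resampling half of bootstrapping is easy to sketch: each replicate is built by drawing alignment columns at random, with replacement, until it is as long as the original alignment. The code below shows only that step, assuming the alignment is held as a dictionary of equal-length strings; the function names, the default of 500 replicates, and the fixed seed are choices made for this example. Building a tree from each replicate would be done with whichever tree method is being tested.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap replicate: columns resampled with replacement."""
    length = len(next(iter(alignment.values())))
    columns = [rng.randrange(length) for _ in range(length)]
    return {
        name: "".join(sequence[c] for c in columns)
        for name, sequence in alignment.items()
    }


def bootstrap_replicates(alignment, n_replicates=500, seed=0):
    """Generate n_replicates resampled alignments (seeded for repeatability)."""
    rng = random.Random(seed)
    return [bootstrap_alignment(alignment, rng) for _ in range(n_replicates)]


alignment = {"A": "ACGTAC", "B": "ACGTTC", "C": "AGGTAC"}
replicates = bootstrap_replicates(alignment, n_replicates=3)
for replicate in replicates:
    print(replicate)
```

The same column indices are used for every sequence, so each replicate column is a genuine column of the original alignment. The fraction of replicate trees that contain a given grouping is then reported as its bootstrap support.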

Phylogenetic Software Packages

PHYLIP
The leading free package for phylogeny. It includes programs to carry out parsimony, distance matrix methods, maximum likelihood, and other methods on a variety of types of data, including DNA and RNA sequences, protein sequences, restriction sites, 0/1 discrete characters data, gene frequencies, continuous characters and distance matrices. Although it is free, it can be complex to use.
http://evolution.genetics.washington.edu/phylip.html

Phylodendron
A free web-based tree drawing program with a simple user interface and many output options.
http://iubio.bio.indiana.edu/treeapp/treeprint-form.html

ClustalX
A free multiple sequence alignment program that includes the ability to create phylogenetic trees based on the Neighbor Joining Method.
http://www-igbmc.u-strasbg.fr/bioinfo/clustalx/top.html

PAUP
The leading commercial package for phylogeny. It includes parsimony, distance matrix, invariants, and maximum likelihood methods and many indices and statistical tests.
http://paup.csit.fsu.edu/ and http://www.sinauer.com/

MacClade
A commercial package for interactive analysis of evolution of a variety of character types, including discrete characters and molecular sequence. It works well with PAUP.
http://www.sinauer.com/