
Introduction to MEGA
Thomas Randall, PhD
tarandal@email.unc.edu
Download at: http://www.megasoftware.net/index.html
Manual at: www.megasoftware.net/mega4

Use of phylogenetic analysis software tools
Source: Bioinformatics software for biologists in the genomics era, Sudhir Kumar and Joel Dudley, Bioinformatics 23: 1713-1717, Fig 1(B)
Fig 1(B): Relative impacts of evolutionary analysis software packages over the last 10 years. Only non-commercial software packages available on-line (without fee) are included, except for two available for a nominal fee (shown with dashed line). Data for both panels were obtained from the Web of Science (February 2007 edition). For panel B, the numbers of new citations were generated using the Cited References facility with the search arguments for author name, cited work and citation year kindly provided by Joe Felsenstein for MEGA (www.megasoftware.net), PAUP (paup.csit.fsu.edu), PHYLIP (evolution.genetics.washington.edu/phylip.html), MrBayes (mrbayes.csit.fsu.edu), Puzzle (www.tree-puzzle.de), PhyML (atgc.lirmm.fr/phyml) and PAML (abacus.gene.ucl.ac.uk/software/paml.html).

MEGA contains all elements necessary for building a tree
  Import and editing of sequences/chromatograms
  Clustalw for alignment
  Various options for constructing a phylogeny
  Several options for generating statistical significance
  Tree viewing function

Basic steps to build a phylogeny
  1. Import and align sequences
  2. Select tree building option
  3. Select distance matrix
  4. Choose type of bootstrapping
  5. Manipulate tree with tree viewer

Phylogeny options in MEGA4
  Distance methods: UPGMA, Neighbor joining, Minimum evolution
  Maximum parsimony
General rules
  Build the tree with two independent methodologies for confirmation; in MEGA, use one distance method plus parsimony.
  Maximum parsimony is less effective for more distantly related sequences due to homoplasy (multiple substitutions at the same site can accumulate over time).
WARNING: "Phylogenetics has a long history of heated arguments about the relative merits of different methods ... researchers in the field seem preadapted for ideological warfare" (Huelsenbeck et al., Syst. Biol. 51: 673)

UPGMA (Unweighted Pair Group Method with Arithmetic Mean)
UPGMA employs a sequential clustering algorithm (neighbor joining), in which pairwise distances between sequences are computed and the phylogenetic tree is built in a stepwise manner. We first identify, from among all the sequences, the two that are most similar to each other and then treat these as a new single branch. Subsequently, from among the remaining sequences, we identify the pair with the highest similarity, and so on.
Assumes equal evolutionary rates (a clock).
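The stepwise clustering described above can be sketched in a few lines of Python. This is a minimal illustration, not MEGA's implementation: the function name upgma and the dictionary layout are assumptions made for the example, and the four distances are taken from the primate distance matrix shown later in this handout.

```python
from itertools import combinations


def upgma(names, dist):
    """Cluster taxa by UPGMA.

    names: list of taxon labels.
    dist:  dict mapping frozenset({a, b}) -> pairwise distance.
    Returns (tree as nested tuples, list of (a, b, node_height) merges).
    """
    # Each cluster stores (subtree, number_of_leaves); start with single taxa.
    clusters = {name: (name, 1) for name in names}
    d = dict(dist)
    merges = []
    while len(clusters) > 1:
        # Join the two current clusters separated by the smallest distance.
        a, b = min(combinations(clusters, 2), key=lambda pair: d[frozenset(pair)])
        height = d[frozenset((a, b))] / 2.0  # clock assumption: equal branch lengths
        tree_a, size_a = clusters.pop(a)
        tree_b, size_b = clusters.pop(b)
        merged = f"({a},{b})"
        # Distance to the merged cluster is the size-weighted mean, i.e. the
        # arithmetic mean over all pairs of original sequences.
        for other in clusters:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))] + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        clusters[merged] = ((tree_a, tree_b), size_a + size_b)
        merges.append((a, b, height))
    tree, _size = next(iter(clusters.values()))
    return tree, merges


# Distances copied from the handout's primate matrix (four taxa only).
names = ["Homo_sapie", "Pan", "Gorilla", "Pongo"]
dist = {
    frozenset(("Homo_sapie", "Pan")): 0.094328,
    frozenset(("Homo_sapie", "Gorilla")): 0.110803,
    frozenset(("Homo_sapie", "Pongo")): 0.182639,
    frozenset(("Pan", "Gorilla")): 0.113612,
    frozenset(("Pan", "Pongo")): 0.195508,
    frozenset(("Gorilla", "Pongo")): 0.189484,
}
tree, merges = upgma(names, dist)
print(tree)
```

With these four taxa the first pair joined is Homo_sapie and Pan, the closest pair in the matrix; Gorilla and then Pongo are added in later steps.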

Neighbor-Joining
An algorithm for constructing phylogenetic trees using distance data. Once a distance measurement between a set of sequences has been determined, a neighbor joining algorithm will find the two closest, group them, then look for the next closest until all sequences are fit into a tree. Different algorithms for doing this have been written that either do or do not consider evolutionary distance.
Examples: clustalw, UPGMA, neighbor (phylip)
The difference between this and UPGMA (also a neighbor joining method) is that it does not assume a constant evolutionary rate in all lineages.
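One way to see how neighbor joining avoids the clock assumption is the pair-selection step of the Saitou and Nei method, sketched below. The five-taxon matrix is hypothetical illustration data (not from this handout), and the function names are assumptions for the example. The criterion Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j corrects each distance for how far both taxa are from everything else.

```python
def closest_pair(names, D):
    """Pair with the smallest raw distance (what UPGMA would join first)."""
    n = len(names)
    pairs = [(D[i][j], names[i], names[j]) for i in range(n) for j in range(i + 1, n)]
    return min(pairs)


def nj_pick_pair(names, D):
    """Pair minimising the neighbor-joining Q criterion."""
    n = len(names)
    row_sum = [sum(row) for row in D]
    best = None
    for i in range(n):
        for j in range(i + 1, n):
            q = (n - 2) * D[i][j] - row_sum[i] - row_sum[j]
            if best is None or q < best[0]:
                best = (q, names[i], names[j])
    return best


# Hypothetical distances for five taxa a..e (illustration only).
names = ["a", "b", "c", "d", "e"]
D = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(closest_pair(names, D))   # smallest raw distance
print(nj_pick_pair(names, D))   # smallest Q value
```

In this example the smallest raw distance is between d and e, but the Q criterion selects a and b, so the two methods start building the tree from different pairs.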

Minimum evolution
All possible trees are produced; the tree with the smallest total branch length is chosen as the best tree. Branch length is proportional to the distance between each sequence.
Maximum Parsimony
The selection of the phylogenetic tree requiring the least number of substitutions from among all possible phylogenetic trees as the most likely to be the true phylogenetic tree. Usefulness declines with increasing evolutionary distance.

Informative sites in parsimony
(Source: Johns Hopkins University, Fall 2003, Phylogenetics & Computational Genomics 410.640.71, Lecture #7)
  Site    1  2  3  4  5  6  7  8  9  10
  OTU 1   T  C  A  G  A  T  C  T  A  G
  OTU 2   T  T  A  G  A  A  C  T  A  G
  OTU 3   T  T  C  G  A  T  C  G  A  G
  OTU 4   T  T  C  T  A  A  G  G  A  C
Invariant sites are not used in parsimony (they yield no information on character state changes).
Informative sites (at least two different kinds of residues, each present at least two times) are used by parsimony because they discriminate between topologies, i.e. different topologies require different numbers of changes between residues.
Singleton sites cannot be used to discriminate between topologies (they require 1 change for all topologies).
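The three site categories can be checked mechanically. The sketch below applies the definitions above to the four OTUs in the table; the function name classify_site and the label "uninformative" (used here for variable sites that are not informative, which covers singleton sites) are choices made for this example.

```python
from collections import Counter


def classify_site(column):
    """Classify one alignment column for parsimony."""
    counts = Counter(column)
    if len(counts) == 1:
        return "invariant"
    # Informative: at least two residue types, each seen at least twice.
    if sum(1 for c in counts.values() if c >= 2) >= 2:
        return "informative"
    return "uninformative"  # variable, but only singleton changes


# The four OTUs from the table above.
otus = ["TCAGATCTAG", "TTAGAACTAG", "TTCGATCGAG", "TTCTAAGGAC"]
columns = ["".join(col) for col in zip(*otus)]
labels = {site: classify_site(col) for site, col in enumerate(columns, start=1)}
informative = [s for s, lab in labels.items() if lab == "informative"]
invariant = [s for s, lab in labels.items() if lab == "invariant"]
print("informative sites:", informative)
print("invariant sites:", invariant)
```

For this table only sites 3, 6 and 8 are informative; sites 1, 5 and 9 are invariant and the remaining four are singleton sites.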

Maximum parsimony (MP) options
Exhaustive
  Not an option here. All possible trees are searched; in practice this takes too much time, so various shortcuts (branch and bound, heuristic) have been developed.
Branch and bound *
  A method of searching through tree space in order to find optimal trees. It does not evaluate every tree: trees with a total length longer than the best already examined are not considered, reducing the complexity of the search. Guaranteed to find all MP trees. Becomes time consuming if more than 20 sequences are considered.
Heuristic
  An approximate search, still using a branch and bound approach but making more assumptions. More useful for larger trees, but with no guarantee of finding the MP tree with the shortest length.
CNI (Close-Neighbor-Interchange)
  In any method, examining all possible topologies is very time consuming. This algorithm reduces the time spent searching by first producing a temporary tree, and then examining all of the topologies that are different from this temporary tree by a topological distance of dT = 2 and 4. If this is repeated many times, and all the topologies previously examined are avoided, one can usually obtain the tree being sought.

Statistical tests of significance
Bootstrapping *
  A method of attempting to estimate confidence levels of inferred relationships. The bootstrap proceeds by resampling the original data matrix with replacement of the characters. It is analogous to cutting the data matrix into individual columns of data and throwing the characters into a hat. A character is then drawn at random from this hat and it becomes the first character of the new data matrix. The character is then replaced in the hat, the hat is shaken and another character is drawn from the hat. This process is repeated until the new pseudoreplicate is the same size as the original. Some characters will be sampled more than once and some will not be sampled at all. This process is repeated many times (say, 100-1,000) and phylogenies are reconstructed each time. After the bootstrap procedure is finished, a majority-rule consensus tree is constructed from the optimal tree from each bootstrap sample. The bootstrap support for any internal branch is the number of times it was recovered during the bootstrapping procedure.
Interior Branch Test
  Similar to bootstrapping but unwieldy with a large number of taxa. A t-test, which is computed using the bootstrap procedure, is constructed based on the interior branch length and its standard error; it is available only for the NJ and Minimum Evolution trees. MEGA shows the confidence probability in the Tree Explorer; if this value is greater than 95% for a given branch, then the inferred length for that branch is considered significantly positive.
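The "characters in a hat" resampling step can be sketched as follows. This covers only the construction of pseudoreplicates; building a tree from each one is left to whichever method is chosen. The function name, the fixed random seed and the choice of 100 replicates are assumptions for the example, and the short alignment is the four-OTU table from the parsimony slide.

```python
import random


def bootstrap_replicate(alignment, rng):
    """Draw alignment columns with replacement until the replicate has the
    same number of sites as the original. Returns (replicate, drawn_columns)."""
    n_sites = len(next(iter(alignment.values())))
    picks = [rng.randrange(n_sites) for _ in range(n_sites)]
    replicate = {
        name: "".join(seq[i] for i in picks) for name, seq in alignment.items()
    }
    return replicate, picks


alignment = {
    "OTU1": "TCAGATCTAG",
    "OTU2": "TTAGAACTAG",
    "OTU3": "TTCGATCGAG",
    "OTU4": "TTCTAAGGAC",
}
rng = random.Random(42)  # fixed seed so the run is repeatable
replicates = [bootstrap_replicate(alignment, rng) for _ in range(100)]

first, picks = replicates[0]
print("columns drawn:", picks)
for name, seq in first.items():
    print(name, seq)
```

Because whole columns are drawn, every sequence in a pseudoreplicate receives the same resampled sites, so the alignment between sequences is preserved.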

Other phylogeny software
  PHYLIP
  MrBayes: Bayesian inference of phylogeny
  TREE-PUZZLE 5.2: Maximum likelihood analysis
MEGA has no ability to do either maximum likelihood analysis or Bayesian inference. These are more sophisticated and computationally intensive (and can be more accurate for distantly related sequences).

Distance
Distance is a phylogenetic method that considers the additive differences between either nucleotides or amino acids along the entire length of sequence. A distance measurement is made considering each type of substitution (either transversion or transition) weighted differently, depending on the distance algorithm and weighting matrix used. As distances are re-computed for all possible pairs of sequences during each step of the assembly, this can be computationally intensive.

   12   898
Homo_sapie AAGCTTCACC GGCGCAGTCA TTCTCATAAT CGCCCACGGG CTTACATCCT
Pan        AAGCTTCACC GGCGCAATTA TCCTCATAAT CGCCCACGGA CTTACATCCT
Gorilla    AAGCTTCACC GGCGCAGTTG TTCTTATAAT TGCCCACGGA CTTACATCAT
Pongo      AAGCTTCACC GGCGCAACCA CCCTCATGAT TGCCCATGGA CTCACATCCT
Hylobates  AAGCTTTACA GGTGCAACCG TCCTCATAAT CGCCCACGGA CTAACCTCTT
Macaca_fus AAGCTTTTCC GGCGCAACCA TCCTTATGAT CGCTCACGGA CTCACCTCTT
M_mulatta  AAGCTTTTCT GGCGCAACCA TCCTCATGAT TGCTCACGGA CTCACCTCTT
M_fascicul AAGCTTCTCC GGCGCAACCA CCCTTATAAT CGCCCACGGG CTCACCTCTT
M_sylvanus AAGCTTCTCC GGTGCAACTA TCCTTATAGT TGCCCATGGA CTCACCTCTT
Saimiri_sc saagcttcac CGGCGCAATG ATCCTAATAA TCGCTCACGG GTTTACTTCG
Tarsius_sy aaagtttcat TGGAGCCACC ACTCTTATAA TTGCCCATGG CCTCACCTCC
Lemur_catt AAGCTTCATA GGAGCAACCA TTCTAATAAT CGCACATGGC CTTACATCAT

   12
Homo_sapie   0.000000  0.094328  0.110803  0.182639  0.210562  0.286715  0.288560  0.310181  0.321059 -1.000000 -1.000000  0.431062
Pan          0.094328  0.000000  0.113612  0.195508  0.219479  0.303507  0.315343  0.339246  0.311692 -1.000000 -1.000000  0.432920
Gorilla      0.110803  0.113612  0.000000  0.189484  0.219367  0.292586  0.291143  0.329470  0.304045 -1.000000 -1.000000  0.403571
Pongo        0.182639  0.195508  0.189484  0.000000  0.220062  0.306528  0.309930  0.330862  0.302154 -1.000000 -1.000000  0.401607
Hylobates    0.210562  0.219479  0.219367  0.220062  0.000000  0.308618  0.297051  0.322962  0.301975 -1.000000 -1.000000  0.407699
Macaca_fus   0.286715  0.303507  0.292586  0.306528  0.308618  0.000000  0.036582  0.088360  0.135182 -1.000000 -1.000000  0.382417
M_mulatta    0.288560  0.315343  0.291143  0.309930  0.297051  0.036582  0.000000  0.098273  0.129816 -1.000000 -1.000000  0.393103
M_fascicul   0.310181  0.339246  0.329470  0.330862  0.322962  0.088360  0.098273  0.000000  0.133409 -1.000000 -1.000000  0.407353
M_sylvanus   0.321059  0.311692  0.304045  0.302154  0.301975  0.135182  0.129816  0.133409  0.000000 -1.000000 -1.000000  0.390241
Saimiri_sc  -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000  0.000000  0.483555 -1.000000
Tarsius_sy  -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000  0.483555  0.000000 -1.000000
Lemur_catt   0.431062  0.432920  0.403571  0.401607  0.407699  0.382417  0.393103  0.407353  0.390241 -1.000000 -1.000000  0.000000

Matrix listing all pairwise differences

DNA Distance matrices
[Figure: substitution diagram for the four nucleotides A, G, C, T]
Jukes-Cantor distance
  In the Jukes-Cantor model, the rate of nucleotide substitution is the same for all pairs of the four nucleotides A, T, C, and G.
Many more models exist, with increasing complexity.
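As a worked illustration, the standard Jukes-Cantor correction d = -(3/4) ln(1 - 4p/3), where p is the proportion of sites that differ, can be computed as below. The sequences are the first 50 sites of Homo_sapie and Pan from the alignment excerpt in this handout; the function names are assumptions for the example, and the result will not match the 898-site matrix because only an excerpt is used.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of aligned sites at which the two sequences differ."""
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def jukes_cantor(p):
    """Jukes-Cantor distance: expected substitutions per site, correcting
    the observed proportion p for multiple hits at the same site."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# First 50 sites from the handout's alignment excerpt.
homo = "AAGCTTCACC GGCGCAGTCA TTCTCATAAT CGCCCACGGG CTTACATCCT".replace(" ", "")
pan = "AAGCTTCACC GGCGCAATTA TCCTCATAAT CGCCCACGGA CTTACATCCT".replace(" ", "")

p = p_distance(homo, pan)
d = jukes_cantor(p)
print(f"p = {p:.4f}, Jukes-Cantor d = {d:.4f}")
```

The corrected distance is always a little larger than p, and the gap widens as sequences diverge, because more of the true substitutions are hidden by repeated changes at the same site.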

Distance matrices in MEGA
Kimura 2-parameter distance
  Kimura's two-parameter model corrects for different substitution rates between transitions (purine to purine or pyrimidine to pyrimidine) and transversions (purine to pyrimidine or the reverse).
Tamura-Nei distance
  The Tamura-Nei model (1993) corrects for multiple hits, taking into account the differences in substitution rate between nucleotides and the inequality of nucleotide frequencies. It distinguishes transitional substitution rates between purines from those between pyrimidines, and treats transversions separately. It also assumes equality of substitution rates among sites (see related gamma model).
Also: # differences, Tamura 3-parameter, LogDet
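A small sketch of the Kimura 2-parameter calculation, using the standard formula d = -(1/2) ln(1 - 2P - Q) - (1/4) ln(1 - 2Q), where P and Q are the proportions of sites showing transitions and transversions. The sequences are the first 50 sites of Homo_sapie and Gorilla from the alignment excerpt in this handout; the function names are assumptions for the example.

```python
import math

PURINES = {"A", "G"}


def count_ts_tv(seq1, seq2):
    """Count transitions and transversions between two aligned sequences."""
    ts = tv = 0
    for a, b in zip(seq1, seq2):
        if a == b:
            continue
        # Same chemical class (both purines or both pyrimidines) = transition.
        if (a in PURINES) == (b in PURINES):
            ts += 1
        else:
            tv += 1
    return ts, tv


def kimura_2p(P, Q):
    """Kimura 2-parameter distance from transition (P) and transversion (Q)
    proportions."""
    return -0.5 * math.log(1.0 - 2.0 * P - Q) - 0.25 * math.log(1.0 - 2.0 * Q)


homo = "AAGCTTCACC GGCGCAGTCA TTCTCATAAT CGCCCACGGG CTTACATCCT".replace(" ", "")
gorilla = "AAGCTTCACC GGCGCAGTTG TTCTTATAAT TGCCCACGGA CTTACATCAT".replace(" ", "")

ts, tv = count_ts_tv(homo, gorilla)
n = len(homo)
d = kimura_2p(ts / n, tv / n)
print(f"transitions = {ts}, transversions = {tv}, K2P d = {d:.4f}")
```

In this excerpt transitions outnumber transversions five to one, which is the kind of bias the two-parameter model is meant to account for.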

Which DNA distance matrix is appropriate?
  When the Jukes-Cantor * estimate of the number of nucleotide substitutions per site (d) between different sequences is about 0.05 or less (d < 0.05), use the Jukes-Cantor distance whether there is a transition/transversion bias or not and whether the substitution rate (λ) varies with nucleotide site or not. In this case, the Kimura distance or the gamma distance gives essentially the same value as the Jukes-Cantor distance. One may also use the p-distance for constructing a topology.
  When 0.05 < d < 0.3, use the Jukes-Cantor distance unless the transition/transversion ratio (R) is high, say R > 5. When this ratio is high and the number of nucleotides examined is large (>10K), use the Kimura distance or the gamma distances for Kimura's 2-parameter model.
  When 0.3 < d < 1 and there is evidence that λ varies extensively with site, use gamma distances. In general, one may choose different gamma distances, estimating a from data.
  When 0.3 < d < 1 and the frequencies of the four nucleotides (A, T, C, G) deviate substantially from equality but there is no strong transition/transversion bias, use the Tajima-Nei distance.
  When there are strong transition/transversion and G+C content biases, use the Tamura or Tamura-Nei distance.
  When d > 1 for many pairs of sequences, the phylogenetic tree estimated is not reliable for a number of reasons (e.g., large standard errors of d's and sequence alignment errors). We therefore suggest that these sets of data should not be used.

Protein Distance matrices in MEGA
p-distance
  This distance is the proportion (p) of amino acid sites at which the two sequences to be compared are different. It is obtained by dividing the number of amino acid differences by the total number of sites compared. It does not make any correction for multiple substitutions at the same site or differences in evolutionary rates among sites.
Equal Input Model (Amino acids)
  In real data, frequencies usually vary among different kinds of amino acids. In this case, the correction based on the equal input model gives a better estimate of the number of amino acid substitutions than the Poisson correction distance. Note that this assumes an equality of substitution rates among sites and the homogeneity of substitution patterns between lineages.
Poisson correction
  The Poisson correction distance assumes equality of substitution rates among sites and equal amino acid frequencies while correcting for multiple substitutions at the same site.
PAM & JTT *
  The PAM and JTT distances correct for multiple substitutions based on a model of amino acid substitution described as substitution-rate matrices.
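The two simplest protein distances can be computed directly. The sketch below uses the standard Poisson correction d = -ln(1 - p); the two ten-residue peptides are hypothetical sequences made up for illustration, and the function names are assumptions for the example.

```python
import math


def p_distance(seq1, seq2):
    """Proportion of amino acid sites that differ between two sequences."""
    diffs = sum(a != b for a, b in zip(seq1, seq2))
    return diffs / len(seq1)


def poisson_correction(p):
    """Poisson correction distance: corrects p for multiple substitutions at
    the same site, assuming equal rates among sites."""
    return -math.log(1.0 - p)


# Hypothetical peptides (not from the handout), differing at 2 of 10 sites.
pep1 = "MKTAYIAKQR"
pep2 = "MKSAYIAKQK"

p = p_distance(pep1, pep2)
d = poisson_correction(p)
print(f"p-distance = {p:.3f}, Poisson-corrected d = {d:.3f}")
```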

ModelTest
Does a likelihood analysis on your data to determine the most appropriate DNA substitution matrix.
WARNING: only for advanced users; also requires PAUP for an input

FindModel
Web-based version of ModelTest
Input is a concatenated fasta file
http://hcv.lanl.gov/content/hcv-db/findmodel/findmodel.html

FindModel output
Result: MODEL CONSIDERED:
  Model   Description                          Model no.   AIC            lnl
  JC      Jukes-Cantor                         1           27875.89594    -13937.947970
  JC+G    Jukes-Cantor plus Gamma              3           27877.899848   -13937.949924
  F81     Felsenstein 1981                     5           27352.654274   -13673.327137
  F81+G   Felsenstein 1981 plus Gamma          7           27354.660556   -13673.330278
  K80     Kimura 2-parameter                   9           27871.085794   -13934.542897
  K80+G   Kimura 2-parameter plus Gamma        11          27872.977786   -13934.488893
  HKY     Hasegawa-Kishino-Yano                13          27336.418362   -13664.209181
  HKY+G   Hasegawa-Kishino-Yano plus Gamma     15          27338.425764   -13664.212882
  TrN     Tamura-Nei                           21          27338.336148   -13664.168074
  TrN+G   Tamura-Nei plus Gamma                23          27340.335138   -13664.167569
  GTR     General Time Reversible              53          27342.287716   -13663.143858
  GTR+G   General Time Reversible plus Gamma   55          27344.30355    -13663.151775
AIC-SELECTED MODEL: HKY : Hasegawa-Kishino-Yano (model 13) lnl = -13664.209181 AIC = 27336.418362
AIC = Akaike Information Criterion
lnl = log of the maximum likelihood
AICi = -2 ln Li + 2ki, where ki is the number of free parameters in model i
Model favored is the one with the lowest AIC
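The model selection can be reproduced from the log-likelihoods with the standard definition AIC = -2 lnL + 2k. In the sketch below the lnl values are copied from the FindModel output; the parameter counts k are not printed in that output and are an assumption here, chosen as the usual number of free substitution parameters for each model.

```python
# (lnL from the FindModel output, assumed number of free parameters k)
models = {
    "JC": (-13937.947970, 0),
    "JC+G": (-13937.949924, 1),
    "F81": (-13673.327137, 3),
    "F81+G": (-13673.330278, 4),
    "K80": (-13934.542897, 1),
    "K80+G": (-13934.488893, 2),
    "HKY": (-13664.209181, 4),
    "HKY+G": (-13664.212882, 5),
    "TrN": (-13664.168074, 5),
    "TrN+G": (-13664.167569, 6),
    "GTR": (-13663.143858, 8),
    "GTR+G": (-13663.151775, 9),
}


def aic(lnl, k):
    """Akaike Information Criterion: lower is better."""
    return -2.0 * lnl + 2.0 * k


scores = {name: aic(lnl, k) for name, (lnl, k) in models.items()}
best = min(scores, key=scores.get)

for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name:6s} AIC = {score:.6f}")
print("AIC-selected model:", best)
```

GTR has the highest likelihood, but its extra parameters are penalised, so a simpler model ends up with the lowest AIC.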

DNA Substitution models in FindModel
Reduced set:
  JC : Jukes-Cantor (model 1)
  JC+G : Jukes-Cantor plus Gamma (model 3)
  F81 : Felsenstein 1981 (model 5)
  F81+G : Felsenstein 1981 plus Gamma (model 7)
  K80 : Kimura 2-parameter (model 9)
  K80+G : Kimura 2-parameter plus Gamma (model 11)
  HKY : Hasegawa-Kishino-Yano (model 13)
  HKY+G : Hasegawa-Kishino-Yano plus Gamma (model 15)
  TrN : Tamura-Nei (model 21)
  TrN+G : Tamura-Nei plus Gamma (model 23)
  GTR : General Time Reversible (model 53)
  GTR+G : General Time Reversible plus Gamma (model 55)
Red indicates models available in MEGA.
If a model in black is suggested, use the one immediately below.
If GTR is suggested, use LogDet.

parallelized clustalw non parallelized clustalw http://cbsuapps.tc.cornell.edu/clustalw.aspx http://inquiry.unc.edu/inquiry/

Many MSA algorithms PLOS Comp. Biol. 3: e123

Alternative alignment tools
FACT: in published comparisons between alignment tools, clustalw usually comes out close to the bottom
  T Coffee: better, more computationally intensive (http://cbsuapps.tc.cornell.edu/t_coffee.aspx)
  Muscle: better, less intensive than T Coffee (http://www.drive5.com/muscle/)
  Promals: designed to optimize alignment for distantly related sequences (http://prodata.swmed.edu/promals/promals.php)
Outputs for the above need to be put in an appropriate format (.aln, .phy, .nex)

Displaying extensions on a PC
  My Computer > Tools > Folder Options > View > uncheck Hide Extensions
  Also: Control Panels > Folder Options > View > uncheck Hide Extensions

Test data sets Nature 442: 37 Science 320: 499

Computing d
  1) Compute the Jukes-Cantor distance; examine the distance matrix. If d < 0.05, stop and use the Jukes-Cantor substitution model.
  2) If 0.05 < d < 0.3, check R also: use the Kimura 2-parameter option for computing d; change the Substitutions to Include option from "d: transitions + transversions" to "R = s/v" and calculate.
  3) Choose the model based on the guide on the previous page.

Analysis Preferences: Setting up an analysis User defined options

Analysis Preferences (Distance Computation)
Substitution Model
  In this set of options, you choose the various attributes of the substitution models.
Model
  Here you select a stochastic model for estimating evolutionary distance by clicking on the ellipses to the right of the currently selected model (click on the lime square to select this row first). This will reveal a menu containing many different distance methods and models.
Substitutions to Include
  Depending on the distance model or method selected, the evolutionary distance can be teased into two or more components. By clicking on the drop-down button (first click on the lime square to select this row), you will be provided with a list of components relevant to the chosen model.
Transition/Transversion Ratio
  This option will be visible if the chosen model requires you to provide a value for the Transition/Transversion ratio (R).
Pattern among Lineages
  This option becomes available if the selected model has formulas that allow the relaxation of the assumption of homogeneity of substitution patterns among lineages.
Rates among Sites
  This option becomes available if the selected distance model has formulas that allow rate variation among sites. If you choose gamma-distributed rates, then the Gamma parameter option becomes visible.

Treatment of gaps
Gaps are often inserted during the alignment of homologous regions of sequences and represent deletions or insertions (indels). They introduce some complications in distance estimation. Furthermore, sites with missing information sometimes result from experimental difficulties; they present the same alignment problems as gaps. In the following discussion, both of these situations are treated in the same way.
In MEGA, there are two ways to treat gaps. One is to delete all of these sites from the data analysis. This option, called Complete-Deletion, is generally desirable because different regions of DNA or amino acid sequences evolve under different evolutionary forces. The second method is relevant if the number of nucleotides involved in a gap is small and if the gaps are distributed more or less randomly. In that case it may be possible to compute a distance for each pair of sequences, ignoring only those gaps that are involved in the comparison; this option is called Pairwise-Deletion.
Options: Complete-Deletion *, Pairwise-Deletion
[Table: effect of the two options on distance estimation for three example sequences]
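The difference between the two options can be shown with a small p-distance calculation. The three sequences below are hypothetical, made up for this sketch (they are not the sequences from the original table), and the function names are assumptions for the example.

```python
def gapped_columns(seqs):
    """Indices of alignment columns that contain a gap in any sequence."""
    return {i for seq in seqs for i, ch in enumerate(seq) if ch == "-"}


def p_distance(seq1, seq2, skip=frozenset()):
    """p-distance over sites where neither sequence has a gap, optionally
    skipping extra columns (used for Complete-Deletion)."""
    pairs = [
        (a, b)
        for i, (a, b) in enumerate(zip(seq1, seq2))
        if i not in skip and a != "-" and b != "-"
    ]
    return sum(a != b for a, b in pairs) / len(pairs)


# Hypothetical three-sequence alignment (illustration only).
seqs = {
    "A": "ACGT-ACGTA",
    "B": "ACGTTACG-A",
    "C": "ACCTTACGTA",
}

# Complete-Deletion: drop every column with a gap in any sequence.
skip = gapped_columns(seqs.values())
complete = p_distance(seqs["A"], seqs["C"], skip)

# Pairwise-Deletion: drop only columns gapped in the pair being compared.
pairwise = p_distance(seqs["A"], seqs["C"])

print(f"A vs C, Complete-Deletion: {complete:.4f}")
print(f"A vs C, Pairwise-Deletion: {pairwise:.4f}")
```

Here the gap in sequence B removes a site from the A versus C comparison under Complete-Deletion (1 difference in 8 sites) but not under Pairwise-Deletion (1 difference in 9 sites), so the two options give slightly different distances for the same pair.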

Uniform Rates vs. Gamma distribution
Ignore this option, as MEGA has no way to calculate a, the gamma parameter.
A gamma distribution reflects that the substitution rate differs among amino acid/nucleotide sites: when a is small (a = 1 or less), substitution rate variation is very high; as a approaches infinity, all sites are equally likely to change.

Tree Explorer
Save tree as .emf file (for ppt or word)
[Figure: example tree of the H5N1 sequences with support values, as displayed in Tree Explorer]
Save tree as .nwk file (for opening in other tree viewers)
((((northnigeria:0.00240616,((turkey_turkey2005:0.00314559,swan_czech2006:0.00255065)0.96:0.00324648,(mallard_bavaria2006:0.00405158,swan_mongolia2005:0.00164881)0.27:0.00003413)0.27:0.00003360)0.57:0.00084134,swan_astrakhan2005:0.00402428)0.49:0.00081094,turkey_suzdalka2005:0.00717544)0.37:0.00028133,swan_iran2006:0.00464073,mallard_italy2005:0.09386135);
Save tree as .mts file (for opening in MEGA)

Comparison of methods on the same data set
Dataset from Nature 442: 37, Multiple introductions of H5N1 in Nigeria
  MEGA NJ bootstrapping: <1 min, laptop, 1G RAM
  Mr Bayes, 1,000,000 generations: 1.5 hrs, cluster
  PHYLIP dnapars bootstrapping: 30 min, laptop, 1G RAM
  Tree-Puzzle maximum likelihood, 10,000 steps: <1 min, laptop, 1G RAM
[Figure: the four resulting trees with support values on the branches]

Tree Explorer
Condensed Trees
  When several interior branches of a phylogenetic tree have low statistical support (PC or PB) values, it often is useful to produce a multifurcating tree by assuming that all such interior branches have a branch length equal to 0. We call this multifurcating tree a condensed tree. In MEGA, condensed trees can be produced for any level of PC or PB value. For example, if there are several branches with PC or PB values of less than 50%, a condensed tree with the 50% PC or PB level will be a multifurcating tree with the lengths of all those branches reduced to 0.
Consensus Tree
  The MP method often produces many equally parsimonious trees. Choosing this command produces a composite tree that is a consensus among all such trees, either as a strict consensus, in which all conflicting branching patterns among the trees are resolved by making those nodes multifurcating, or as a Majority-Rule consensus, in which conflicting branching patterns are resolved by selecting the pattern seen in more than 50% of the trees.
Importing trees from other phylogenetic tools
  Works: outtrees from phylip; .dnd and .phb files from clustalw; TreePuzzle; Mr Bayes (.con file needs a little processing)

MEGA4 Caption
The View Caption function gives a publication-quality summary of the analysis and suggested references for publication.

About authors
  Gene Duplication and Gene Substitution in Evolution. Masatoshi Nei. Nature 221: 40
  Evolution by the Birth-and-Death Process in Multigene Families of the Vertebrate Immune System. Nei, M., et al. Proc. Natl. Acad. Sci. USA 94: 7799
  MEGA3: Integrated software for Molecular Evolutionary Genetics Analysis and sequence alignment. Sudhir Kumar, K Tamura, and M Nei. Briefings in Bioinformatics 5: 150-163
  The Neighbor-joining Method: A New Method for Reconstructing Phylogenetic Trees. Naruya Saitou and Masatoshi Nei. Mol. Biol. Evol. 4: 406
Much of the material in this handout is derived from: Molecular Evolution and Phylogenetics. M Nei, S Kumar. Oxford Univ. Press, New York, 2000