Effects of Gap Open and Gap Extension Penalties

Similar documents
Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Large Grain Size Stochastic Optimization Alignment

PSODA: Better Tasting and Less Filling Than PAUP

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Phylogenetic hypotheses and the utility of multiple sequence alignment

Phylogenetic analyses. Kirsi Kostamo

An open source phylogenetic search and alignment package

Copyright 2000 N. AYDIN. All rights reserved. 1

Phylogenetic Tree Reconstruction

Phylogenetic Trees. What They Are Why We Do It & How To Do It. Presented by Amy Harris Dr Brad Morantz

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

A Comparative Analysis of Popular Phylogenetic. Reconstruction Algorithms

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Tools and Algorithms in Bioinformatics

Letter to the Editor. Department of Biology, Arizona State University

Bioinformatics 1. Sepp Hochreiter. Biology, Sequences, Phylogenetics Part 4. Bioinformatics 1: Biology, Sequences, Phylogenetics

Phylogenetic Inference and Parsimony Analysis

Dr. Amira A. AL-Hosary

Single alignment: Substitution Matrix. 16 march 2017

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Evaluation Measures of Multiple Sequence Alignments. Gaston H. Gonnet, *Chantal Korostensky and Steve Benner. Institute for Scientic Computing

9/30/11. Evolution theory. Phylogenetic Tree Reconstruction. Phylogenetic trees (binary trees) Phylogeny (phylogenetic tree)

Unsupervised Learning in Spectral Genome Analysis

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Pairwise & Multiple sequence alignments

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

In-Depth Assessment of Local Sequence Alignment

Comparison of Cost Functions in Sequence Alignment. Ryan Healey

A Fitness Distance Correlation Measure for Evolutionary Trees

Sequence Alignment Techniques and Their Uses

MOLECULAR PHYLOGENY AND GENETIC DIVERSITY ANALYSIS. Masatoshi Nei"

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

Multiple sequence alignment

EVOLUTIONARY DISTANCES

Phylogenetic Trees. Phylogenetic Trees Five. Phylogeny: Inference Tool. Phylogeny Terminology. Picture of Last Quagga. Importance of Phylogeny 5.

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Multiple sequence alignment accuracy and phylogenetic inference

Thanks to Paul Lewis, Jeff Thorne, and Joe Felsenstein for the use of slides

Algorithmic Methods Well-defined methodology Tree reconstruction those that are well-defined enough to be carried out by a computer. Felsenstein 2004,

DNA Phylogeny. Signals and Systems in Biology Kushal EE, IIT Delhi

A profile-based protein sequence alignment algorithm for a domain clustering database

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Evaluating phylogenetic hypotheses

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

Quantifying sequence similarity

POPULATION GENETICS Winter 2005 Lecture 17 Molecular phylogenetics

Assessing an Unknown Evolutionary Process: Effect of Increasing Site- Specific Knowledge Through Taxon Addition

Algorithms in Bioinformatics

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

CHAPTERS 24-25: Evidence for Evolution and Phylogeny

Constructing Evolutionary/Phylogenetic Trees

Improving divergence time estimation in phylogenetics: more taxa vs. longer sequences

Likelihood Ratio Tests for Detecting Positive Selection and Application to Primate Lysozyme Evolution

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Phylogenetic Networks, Trees, and Clusters

Constructing Evolutionary/Phylogenetic Trees

Phylogenetic inference

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

Introduction to Bioinformatics Online Course: IBT

ChemAlign: Biologically Relevant Multiple Sequence Alignment Using Physicochemical Properties

Parsimony via Consensus

Bayesian Inference using Markov Chain Monte Carlo in Phylogenetic Studies

Examining Phylogenetic Reconstruction Algorithms

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

Algorithms in Bioinformatics

Sequence Alignment (chapter 6)

Sequence analysis and Genomics

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

Graph Alignment and Biological Networks

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Cladistics. The deterministic effects of alignment bias in phylogenetic inference. Mark P. Simmons a, *, Kai F. Mu ller b and Colleen T.

Molecular phylogeny - Using molecular sequences to infer evolutionary relationships. Tore Samuelsson Feb 2016

Phylogenetic relationship among S. castellii, S. cerevisiae and C. glabrata.

Overview Multiple Sequence Alignment

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Phylogenetic analysis of Cytochrome P450 Structures. Gowri Shankar, University of Sydney, Australia.

An Introduction to Sequence Similarity ( Homology ) Searching

Computational approaches for functional genomics

Bootstrapping and Tree reliability. Biol4230 Tues, March 13, 2018 Bill Pearson Pinn 6-057

Pairwise sequence alignment

Phylogenetics: Building Phylogenetic Trees

A (short) introduction to phylogenetics

Introduction to Bioinformatics Introduction to Bioinformatics

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Objectives. Comparison and Analysis of Heat Shock Proteins in Organisms of the Kingdom Viridiplantae. Emily Germain 1,2 Mentor Dr.

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Multiple Sequence Alignment: A Critical Comparison of Four Popular Programs

Phylogenetics: Building Phylogenetic Trees. COMP Fall 2010 Luay Nakhleh, Rice University

Lecture 5,6 Local sequence alignment

Phylogenetic molecular function annotation

BINF6201/8201. Molecular phylogenetic methods

Lecture Notes: Markov chains

Application of new distance matrix to phylogenetic tree construction

Motivating the need for optimal sequence alignments...

Transcription:

Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See next page for additional authors Follow this and additional works at: https://scholarsarchive.byu.edu/facpub Part of the Computer Sciences Commons Original Publication Citation Effects of Gap Open and Gap Extension Penalties, Hyrum Carroll, Perry Ridge, Mark Clement, Quinn Snell, Biotechnology and Bioinformatics Symposium (BIOT), Provo, Utah, October 2, pp 19. BYU ScholarsArchive Citation Carroll, Hyrum; Clement, Mark J.; Ridge, Perry; and Snell, Quinn O., "Effects of Gap Open and Gap Extension Penalties" (200). All Faculty Publications. 290. https://scholarsarchive.byu.edu/facpub/290 This Peer-Reviewed Article is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for inclusion in All Faculty Publications by an authorized administrator of BYU ScholarsArchive. For more information, please contact scholarsarchive@byu.edu, ellen_amatangelo@byu.edu.

Authors Hyrum Carroll, Mark J. Clement, Perry Ridge, and Quinn O. Snell This peer-reviewed article is available at BYU ScholarsArchive: https://scholarsarchive.byu.edu/facpub/290

Effects of Gap Open and Gap Extension Penalties Hyrum Carroll, Perry Ridge 1, Mark Clement, Quinn Snell Computer Science Department, Brigham Young University Provo, Utah 02, USA {hdc,clement,snell}@cs.byu.edu, perry.ridge@gmail.com Abstract Fundamental to multiple sequence alignment algorithms is modeling insertions and deletions (gaps). The most prevalent model is to use gap open and gap extension penalties. While gap open and gap extension penalties are well understood conceptually, their effects on multiple sequence alignment, and consequently on phylogeny scores are not as well understood. We use exhaustive phylogeny searching to explore the effects of varying the gap open and gap extension penalties for three nuclear ribosomal data sets. Particular attention is given to optimal phylogeny scores for 200 alignments of a range of gap open and gap extension penalties and their respective distribution of phylogeny scores. Keywords: Alignment, gap penalties I. INTRODUCTION The explosion in DNA sequence data has revolutionized the way scientists perform biological and genetic analysis. By analyzing sequence data for different species, researchers can determine which species are most closely related and make conservation decisions based on these results [1]. Multiple sequence alignment (MSA) is frequently the first step in determining where active regions in proteins are located and plays a critical role in understanding the function of genes and how they govern life. Alignment also plays a central role in sequence analysis as the first step in comparing corresponding regions in the genomes of different organisms (comparative genomics). Since a refined multiple sequence alignment is crucial to so many different types of life-saving research, it is surprising that multiple sequence alignment does not receive more attention from the research community. Multiple sequence alignment can be performed with DNA nucleotide or amino acid sequences. MSA algorithms insert gaps in order to align the sequences to maximize similarity according to the evolutionary model summarized in the substitution matrix [2]. Gaps correspond to an insertion or deletion of a substring (sometimes a single residue). Gaps can occur because of single mutations, unequal crossover in meiosis, DNA slippage in the replication process or translocation of DNA between chromosomes. One of the most popular algorithms for MSA is the progressive sequence alignment algorithm [2], [3]. In a progressive sequence alignment algorithm, the substitution matrix is used to determine the likelihood of an observed mismatch (the mismatch may be the result of a mutation or sequencing error). The algorithm then decides to either insert a gap or allow the 1 Present Address: University of Nebraska - Lincoln, Lincoln, Nebraska 505, USA Yes Input sequences Calculate all pairwise similarity scores Neighbor-Joining cluster analysis Take two most similar (remaining) sequences or clusters 2-way alignment Output a consensus and the sequences with gaps inserted More sequences? Fig. 1. No ClustalW flowchart Output final alignment mismatch to remain in the alignment. In progressive sequence alignment algorithms, inserted gaps are never removed. The popular alignment program, ClustalW [2], is used in this research. ClustalW utilizes the progressive sequence alignment algorithm (see Figure 1). There are two main phases to progressive alignment. First, a distance matrix is calculated from similarity scores for every possible pair of sequences. ClustalW uses the Wilbur and Lipman algorithm [] to calculate the distances. These similarity scores are only very general approximations, but work as a starting point []. The similarity scores are clustered together with a modified version of the Needleman-Wunsch algorithm [5], producing a guide tree. The second phase consists of following the topology of the guide tree, and at every node aligning the sequences in each of the subtrees until all sequences have been included in the alignment. The first phase generally requires the vast majority of the time and can be skipped by supplying a guide tree. Since there are several accepted methods for computing a multiple sequence alignment, it is difficult to evaluate the accuracy of an alignment. The alignment score is dependent on the substitution matrix and gap penalties. ClustalW provides 19

Data Number of Average Max Set Gene Sequences Length Length 1 1S 12 15.25 199 2 2S 12 332.3 30 3 2S 12 57.03 710 TABLE I CHARACTERISTICS OF THE THREE RIBOSOMAL DATA SETS USED. AVERAGE LENGTH AND MAX LENGTH ARE MEASURED IN BASE PAIRS. an alignment score for each multiple sequence alignment performed. However, since this score is dependent on the substitution matrix and gap penalties it cannot be used to compare different alignments of the same data set. The minimum cost for a phylogeny inferred from a given MSA has been suggested as an unbiased measure of the quality of the alignment [] [9]. Because no better, unbiased, metric has been commonly used, this research uses the minimum cost phylogeny to determine alignment quality. Although most phylogenetic search applications use a given multiple sequence alignment as a starting point [10] [13], multiple sequence alignment has received much less attention than phylogenetic search algorithms [9]. The importance of a quality alignment for the phylogeny search must not be minimized [1], [15]. Morrison et al. [1] has even suggested that the resulting phylogeny is affected more by the method used for performing the multiple sequence alignment than the method used to perform the phylogeny search itself. A. Related Work The effects of varying parameters for MSA applications was first covered by Fitch and Smith [1] and Williams and Fitch [17]. Several researchers have looked at the effects of parameters for both MSA and phylogenetic search algorithms. These sensitivity analyses have shown that differences in input parameters for MSA have had a greater impact on the phylogeny score then varying the phylogeny search application [1]. Other studies have focused on nodal support and nodal stability [1]. II. DATA SETS We used three data sets to study the effects of gap open and gap extension penalties (see Table I). The three data sets cover two nuclear ribosomal genes for a wide diversity of hexapod species and are provided by Michael Whiting, a researcher in the Biology Department at Brigham Young University. Data set 1 has twelve 1S gene sequences, while data sets 2 and 3 each of have twelve 2S gene sequences. The sequences in each data set were randomly chosen from larger data sets. We limited the number of sequences included in this study due to the intrinsic computational time limitations of exhaustive searching. III. RESULTS Varying the gap open and gap extension costs not only produces very different alignments [1] but produces different distributions of phylogeny scores. Figures 2- plot optimal phylogeny scores for gap open penalties (GOP) ranging from 1.0 to 20.0 and gap extension penalties (GEP) evenly distributed between 0 and one half of the respective GOP. For each of the 200 data points in each graph, we used ClustalW to produce the alignment, and then PAUP* [13] to exhaustively generate the phylogenies. While heuristic searches are commonly used, an exhaustive search is necessary to ensure the optimality of the phylogeny score for an alignment. In each of these graphs, the default parameters for ClustalW (GOP 15.0, GEP.) are labeled. Although it is expected that ClustalW s defaults do not produce the optimal alignment with the lowest phylogeny score, it is noteworthy that these scores are 11.0% worse (data set 3) and 10 steps (data set 1) than the best optimal phylogeny score found. Also, the plot for data set 2 clearly reveals that local minima of optimal phylogeny scores exist. For that data set, the local minima has a GOP of 3.0 and a GEP of 1.2. The phylogeny score at that point is 30. Attempts to search over the GOP-GEP space need to incorporate some sort of hill-climbing feature to overcome such local minima. In addition to gap open and gap extension penalties affecting optimal phylogeny scores, they also greatly affect the distribution of phylogeny scores. Figure 5 illustrate histograms of the phylogeny scores for data sets 1-3. A clear example of the difference in phylogenies scores is exhibited by data set 3. 9% of the possible phylogenies with a GOP of 20.0 and a GEP of 10.0 have of parsimony score worse than any of the phylogenies with a GOP of 2.0 and a GEP of 0.. In general, varying the GEP parameters shifts the histograms of parsimony scores. The shifted distribution retains its general shape and features. For example, the histograms presented for data set 2 each have a second hump aside from another much larger one. Adjusting the gap parameters causes substantial change to both the optimal phylogeny score and its distribution. IV. CONCLUSION Gap open and gap extension penalties have long been used to model insertions and deletions. We explored the effects of independently varying the gap open and gap extension penalties for three nuclear ribosomal gene data sets. We employed exhaustive phylogeny searching to guarantee optimal maximum parsimony phylogenies. Varying these parameters not only yields very different optimal phylogenies, but also greatly effects the distribution of possible phylogeny scores. Furthermore, algorithms for traversing the GOP-GEP space need to employ hill-climbing techniques to avoid local minima. REFERENCES [1] W.-H. Li, Molecular Evolution. Sunderland, Massachusetts: Sinauer Associates, 1997. [2] J. D. Thompson, D. G. Higgins, and T. J. Gibson, Clustal W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice, Nucleic Acids Research, vol. 22, pp. 73 0, 199. [3] D. Feng and R. F. Doolittle, Progressive sequence alignment as a prerequisite to correct phylogenetic trees, J. Mol. Evol., vol. 0, pp. 351 30, 197. 20

Most Parsimonious Tree Scores (per alignment parameters) (Data Set 1) 10 10 120 100 150 150 150 1520 ClustalW Defaults 10 10 120 100 150 150 150 1520 20 0 1 1 1 2 1 3 12 10 5 GOP 7 GEP 2 9 0 10 Fig. 2. Optimal parsimony scores for 200 alignments of data set 1. The minimum optimal parsimony score of 1531 has a gap open penalty of 1.0 and two gap extension penalties of 0. and 0.5. ClustalW default parameters yield an optimal phylogeny score of 13. Most Parsimonious Tree Scores (per alignment parameters) (Data Set 2) 392 390 3 3 3 32 30 37 37 ClustalW Defaults 392 390 3 3 3 32 30 37 37 20 0 1 1 1 2 1 3 12 10 5 GOP 7 GEP 2 9 0 10 Fig. 3. Optimal parsimony scores for 200 alignments of data set 2. The minimum optimal parsimony score of 37 has a gap open penalty of.0 and a gap extension penalty of 3.2. ClustalW default parameters yield an optimal phylogeny score of 3. Most Parsimonious Tree Scores (per alignment parameters) (Data Set 3) 720 710 700 90 0 70 0 50 0 30 20 10 ClustalW Defaults 720 710 700 90 0 70 0 50 0 30 20 10 20 0 1 1 1 2 1 3 12 10 5 GOP 7 GEP 2 9 0 10 Fig.. Optimal parsimony scores for 200 alignments of data set 3. The minimum optimal parsimony score of 1 has a gap open penalty of 2.0 and a gap extension penalty of 0.. ClustalW default parameters yield an optimal phylogeny score of. 21

All s (Data Set 1) Number of Phylogenies 3500000 3000000 2500000 2000000 1500000 1000000 500000 GOP 2, GEP 0. GOP 5, GEP 1.5 GOP 10, GEP 3.0 GOP 15, GEP. GOP 20, GEP 10.0 0 1500 100 1700 100 1900 2000 2100 2200 2300 All s (Data Set 2) Number of Phylogenies 20000000 15000000 10000000 5000000 GOP 2, GEP 0. GOP 5, GEP 1.5 GOP 10, GEP 3.0 GOP 15, GEP. GOP 20, GEP 10.0 0 30 00 20 0 0 0 500 520 50 50 50 All s (Data Set 3) Number of Phylogenies 10000000 5000000 GOP 2, GEP 0. GOP 5, GEP 1.5 GOP 10, GEP 3.0 GOP 15, GEP. GOP 20, GEP 10.0 0 550 00 50 700 750 00 50 900 Fig. 5. Representative histograms of all parsimony scores for various alignments parameters for data sets 1-3. 22

[] W. Wilbur and D. Lipman, The context dependent comparison of biological sequences, SIAM J. Appl. Math, vol., pp. 557 57, 19. [5] S. B. Needleman and C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol., vol., pp. 3 53, 1970. [] A. G. Kluge, A Concern for Evidence and a Phylogenetic Hypothesis of Relationships Among Epicrates (Boidae, Serpentes), Systematic Zoology, vol. 3, no. 1, pp. 7 25, 199. [7] W. C. Wheeler and D. S. Gladstein, MALIGN: A multiple sequence alignment program, J. Hered, vol. 5, pp. 17 1, 199. [] W. C. Wheeler, Optimization alignment: the end of multiple sequence alignment in phylogenetics? Cladistics, vol. 12, pp. 1 9, 199. [9] A. Phillips, D. Janies, and W. C. Wheeler, Multiple sequence alignment in phylogenetic analysis, Molecular Phylogenetics and Evolution, vol. 1, no. 3, pp. 317 330, September 2000. [10] J. S. Farris, HENNIG, version 1.5, Program and Documentation: Port Jefferson Station, New York, 19. [11] J. Felsenstein, PHYLIP phylogeny inference package (version 3.2), Cladistics, vol. 5, pp. 1 1, 199. [12] P. Goloboff, Analyzing large datasets in reasonable times: Solutions for composite optima, Cladistics, vol. 15, pp. 15 2, 1999. [13] D. L. Swofford, PAUP*. Phylogenetic Analysis Using Parsimony (* and Other Methods). Version. Sunderland, Massachusetts: Sinauer Associates, 2003. [1] D. A. Morrison and J. T. Ellis, Effects of nucleotide sequence alignment on phylogeny estimation: A case study of 1s rdnas of apicomplexa, Molecular Biology and Evolution, vol. 1, no., pp. 2 1, 1997. [15] N. Mugridge, D. Morrison, T. Jakel, A. Heckeroth, A. Tenter, and A. Johnson, Effects of sequence alignment and structural domains of ribosomal dna on phylogeny reconstruction for the protozoan family sarcocystidae, Molecular Biology and Evolution, vol. 17, pp. 12 153, 2000. [1] W. M. Fitch and T. F. Smith, Optimal sequence alignments, Proc. Natl. Acad. Sci. USA, vol. 0, pp. 132 13, March 193. [17] P. L. Williams and W. M. Fitch, The Hierarchy of Life. Elsevier, Amsterdam, 199, ch. Finding the minimal change for a given tree. [1] G. Giribet, Stability in phylogenetic formulations and its relationship to nodal support, Systematic Biology, vol. 52, no., pp. 55 5, 2003. 23