Biologically significant sequence alignments using Boltzmann probabilities


P. Clote
Department of Biology, Boston College
Gasson Hall 416, Chestnut Hill MA 02467
clote@bc.edu
May 7, 2003

Abstract

In this paper, we give a dynamic programming algorithm with quadratic time and space complexity to compute the partition function for both global and local sequence alignments of two peptides, thus providing an efficient computation of the Boltzmann probability that a particular pair of amino acids is aligned. As proof of concept, our probabilistic refinement of both the Needleman-Wunsch [16] global and Smith-Waterman [19] local alignment algorithms is then compared with pairwise BLAST to determine an optimal local alignment of bovine trypsin and pig elastase, an example considered in Lipman et al. [14]. A web server for our prototype tool is currently available [5].

Key words: dynamic programming, sequence alignment, Smith-Waterman algorithm, Boltzmann probability, partition function

1 Introduction

Sequence alignment is one of the most important initial steps taken in trying to understand the function, evolutionary relationship, and general biology (e.g. binding sites) of an amino acid or nucleotide sequence. Using dynamic programming, Needleman and Wunsch [16] designed a quadratic time/space algorithm to determine an optimal global sequence alignment of given sequences $a = a_1, \ldots, a_n$ and $b = b_1, \ldots, b_m$, provided that the cost of $k$ successive gaps is $k \cdot g$, for some fixed constant $g$.¹ Building on this algorithm, Smith and Waterman [19] later provided a quadratic time/space algorithm to determine an optimal local sequence alignment of (convex) subwords from $a$ with subwords from $b$, again with the restriction to linear gap penalty. A year later, Gotoh [9] introduced a clever trick to compute global and local alignments with affine gap penalty $g(k) = g_i + g_e \cdot (k-1)$ in quadratic time and space.

When aligning a sequence with all sequences from a database, quadratic time is prohibitive, so the BLAST algorithm of Altschul et al. [2] was introduced as a heuristic to approximate the Smith-Waterman algorithm. The advantage of BLAST over Smith-Waterman is that the expected run time is linear² in sequence and database size and that statistical significance (E-value, p-value) can be computed by virtue of the Karlin-Altschul [12, 13] result that the distribution of BLAST hits is the Fisher-Tippett (a.k.a. extreme-value or Gumbel) distribution.

Multiple sequence alignment is a difficult (NP-complete) problem, for which several different approaches have been developed: the Carrillo-Lipman algorithm [4, 14], hidden Markov models [8], ClustalW [20], etc. More recently, in order to detect distantly related proteins, Altschul et al. developed PSI-BLAST [3], which iteratively builds a profile [10], then searches databases with the profile. Despite its success, PSI-BLAST depends heavily on the quality of the multiple sequence alignment obtained from pairwise BLAST hits in order to build a correct profile. For additional background on computational biology, see the Clote-Backofen text [6], and for additional remarks on algorithmic complexity for both sequential and parallel algorithms, see the recent Clote-Kranakis monograph [7].
In this paper, we adapt an idea of McCaskill [15], who extended the Zuker-Sankoff [24] energy minimization algorithm for RNA secondary structure prediction to give an efficient computation of the partition function for the ensemble of RNA secondary structures. Our contribution in this paper is to extend the Needleman-Wunsch, Smith-Waterman and Gotoh algorithms so as to compute the partition function $Z = \sum_{A} e^{\mathrm{Sim}(A)/kT}$ of optimal global and local pairwise alignments using an affine gap penalty, where $\mathrm{Sim}(A)$ denotes the similarity score of alignment $A$. This allows us to give a mathematically rigorous notion of biological significance to whether particular residue pairs $(a_i, b_j)$, or residues and gaps $(a_i, -)$, $(-, b_j)$, are likely to be reliably aligned. In future work, we plan to extend these notions to multiple sequence alignments, structural alignments and to a prototype version of PSI-BLAST with Boltzmann probabilities.

¹ Sequence alignment distance using a linear gap penalty is known in computer science as edit distance. Though the Needleman-Wunsch and Gotoh algorithms were originally formulated in terms of distance, rather than similarity, each can be trivially reformulated for a similarity measure.

² Note that BLAST has worst-case quadratic run time, though this is not generally encountered in practice.

2 Global alignment partition function for linear gap penalty

Let $a = a_1, \ldots, a_n$ and $b = b_1, \ldots, b_m$ be two given amino acid sequences.³ Throughout, let $s(a_i, b_j)$ denote the similarity of residue $a_i$ with $b_j$; for instance, in Section 5, we use the PAM250 similarity matrix [17], though of course BLOSUM62 [11] or any other similarity matrix could have been used. For didactic reasons, in this section we present the gist of our quadratic time/space algorithm to compute the partition function for global alignments using a linear gap penalty $g(k) = k \cdot g$, where $g$ is a constant. In this case, the Boltzmann probability $\Pr[a_i, b_j]$ that $a_i$ is aligned with $b_j$, formally defined later, is

$$\Pr[a_i, b_j] = \frac{Z^F_{i-1,j-1} \cdot e^{s(a_i,b_j)/kT} \cdot Z^B_{i+1,j+1}}{Z}$$

where

$$Z^F_{i-1,j-1} = \sum_{A} e^{\mathrm{Sim}(A)/kT}, \qquad Z^B_{i+1,j+1} = \sum_{A'} e^{\mathrm{Sim}(A')/kT}, \qquad Z = \sum_{A''} e^{\mathrm{Sim}(A'')/kT},$$

and $A$ ranges over all alignments of $a_1 \cdots a_{i-1}$ with $b_1 \cdots b_{j-1}$, $A'$ ranges over all alignments of $a_{i+1} \cdots a_n$ with $b_{j+1} \cdots b_m$, and $A''$ over all possible alignments of $a$ with $b$. Here $k$ is Boltzmann's constant and $T$ is temperature.

An approximate, but incorrect, intuition for the probability $\Pr[a_i, b_j]$ would be to consider all exponentially many global alignments of $a$ with $b$, and to return the number of times that $a_i$ is aligned with $b_j$ divided by the number of alignments. This intuition would be essentially correct if we were to weight each count by a factor deriving from Boltzmann's criterion, so that the weight of an alignment that places long runs of residues opposite gaps would be close to 0. An explicit exponential time computation of the partition function can be avoided by noting that since the similarity score for subwords is additive, the partition function is multiplicative. We now proceed to the details.

³ Our implementation actually handles any finite alphabet for which a similarity matrix is provided; in particular, our code applies to the alignment of nucleotide sequences.
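To make the weighted-counting intuition concrete, the following is a minimal Python sketch that enumerates every global alignment of two very short sequences and computes the Boltzmann pair probability directly from the definition. It is an illustration only, not the author's implementation: the function names, the toy similarity function `toy_sim`, the gap value and the choice `kT = 1.0` are all assumptions made for this example, and the enumeration is exponential, which is exactly what the dynamic programming of the following algorithms avoids.

```python
import math

GAP = None  # marks a gap in an alignment column


def alignments(n, m, i=0, j=0):
    """Yield every global alignment of a_1..a_n with b_1..b_m.

    An alignment is a list of columns (p, q); p and q are 1-based
    positions in a and b, or GAP when that row holds a gap.
    """
    if i == n and j == m:
        yield []
        return
    if i != n and j != m:  # align a_{i+1} with b_{j+1}
        for rest in alignments(n, m, i + 1, j + 1):
            yield [(i + 1, j + 1)] + rest
    if i != n:  # align a_{i+1} with a gap
        for rest in alignments(n, m, i + 1, j):
            yield [(i + 1, GAP)] + rest
    if j != m:  # align a gap with b_{j+1}
        for rest in alignments(n, m, i, j + 1):
            yield [(GAP, j + 1)] + rest


def similarity(a, b, alignment, sim, g):
    """Additive score: sim(x, y) per residue pair, g per gap column."""
    total = 0.0
    for p, q in alignment:
        if p is GAP or q is GAP:
            total += g
        else:
            total += sim(a[p - 1], b[q - 1])
    return total


def brute_force_pair_probability(a, b, i, j, sim, g, kT=1.0):
    """Boltzmann-weighted fraction of alignments in which a_i pairs with b_j."""
    Z = 0.0
    weight_with_pair = 0.0
    for alignment in alignments(len(a), len(b)):
        w = math.exp(similarity(a, b, alignment, sim, g) / kT)
        Z += w
        if (i, j) in alignment:
            weight_with_pair += w
    return weight_with_pair / Z


def toy_sim(x, y):
    """Hypothetical similarity: +2 for identical residues, -1 otherwise."""
    return 2.0 if x == y else -1.0


if __name__ == "__main__":
    print(brute_force_pair_probability("ACG", "AG", 1, 1, toy_sim, g=-1.0))
```

Because every alignment is visited, this is only usable for sequences of a handful of residues; its value is as a reference against which a quadratic-time implementation can be checked.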

The Needleman-Wunsch algorithm computes the $(n+1) \times (m+1)$ path matrix $M$, where for $0 \le i \le n$ and $0 \le j \le m$, $M_{i,j}$ is the maximum similarity score between $a_1 \cdots a_i$ and $b_1 \cdots b_j$.⁴ Let $g$ be the (negative) penalty for a gap, and let $g_i$ be the cost for gap initiation and $g_e$ the cost for gap extension. Typical values of $g_i$ and $g_e$ for BLAST with PAM250 are given in the original (the numerical values are not legible in this transcription; Section 5 uses a gap initiation cost of 14 and a gap extension cost of 2). A linear gap penalty is $g(k) = k \cdot g$, while an affine gap penalty is $g(k) = g_i + g_e \cdot (k-1)$, both for a gap of size $k$, where $k \ge 1$.

Algorithm 1 (Needleman-Wunsch [16] global pairwise alignment with linear gap penalty)
For $1 \le i \le n$ and $1 \le j \le m$, let $M_{0,0} = 0$, $M_{i,0} = i \cdot g$, $M_{0,j} = j \cdot g$, and define $M_{i,j}$ by

$$M_{i,j} = \max \begin{cases} M_{i-1,j-1} + s(a_i, b_j) \\ M_{i-1,j} + g \\ M_{i,j-1} + g \end{cases}$$

Since each entry in the array requires constant time to be computed, the Needleman-Wunsch algorithm runs in $O(nm)$ time and space, that is, quadratic time and space when $n$ and $m$ are of comparable size. By construction, $M_{n,m}$ is the maximum similarity score of any alignment of $a$ with $b$. This optimal alignment can be obtained by the usual method of tracebacks (for details, see Clote-Backofen [6]).

Note that we could have computed a reverse path matrix $M'$, defined for $1 \le i \le n+1$ and $1 \le j \le m+1$ by setting $M'_{i,j}$ to be the maximum similarity score of any alignment of $a_i \cdots a_n$ with $b_j \cdots b_m$. This observation, lifted to the calculation of a forward and backward partition function, is crucial for our computation of the Boltzmann probabilities.

In Algorithm 2, $Z^F$ is the forward partition function, defined for $0 \le i \le n$ and $0 \le j \le m$ by

$$Z^F_{i,j} = \sum_{A} e^{\mathrm{Sim}(A)/kT}$$

where $A$ ranges over all possible alignments of $a_1 \cdots a_i$ with $b_1 \cdots b_j$, $\mathrm{Sim}(A)$ is the similarity score of $A$, $k$ is Boltzmann's constant and $T$ is temperature.⁵

⁴ The Needleman-Wunsch algorithm was originally formulated in terms of distance, rather than similarity. The use of similarity, along with minor changes in the base and inductive cases and the definition of traceback, yields the Smith-Waterman local alignment algorithm.

⁵ In our implementation, we experimented with several values of $kT$ (the numerical values are not legible in this transcription), the last of which corresponds to replacing $e^{\mathrm{Sim}(A)/kT}$ by an exponential of the same form with a different base.
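Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the paper's code: the function names, the integer-valued toy similarity function and the gap value are hypothetical, and the traceback returns a single optimal alignment, breaking ties in favor of a residue pair.

```python
def needleman_wunsch(a, b, sim, g):
    """Fill the (n+1) x (m+1) path matrix M for a linear gap penalty g (g negative)."""
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        M[i][0] = i * g  # a_1..a_i aligned entirely against gaps
    for j in range(1, m + 1):
        M[0][j] = j * g  # b_1..b_j aligned entirely against gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                M[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # a_i paired with b_j
                M[i - 1][j] + g,                            # a_i above a gap
                M[i][j - 1] + g,                            # b_j below a gap
            )
    return M


def traceback(a, b, M, sim, g):
    """Recover one optimal global alignment as a pair of gapped strings.

    Exact equality tests are safe here because the toy scores are integers;
    with floating-point scores a tolerance would be needed.
    """
    i, j = len(a), len(b)
    top, bottom = [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and M[i][j] == M[i - 1][j - 1] + sim(a[i - 1], b[j - 1]):
            top.append(a[i - 1])
            bottom.append(b[j - 1])
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or M[i][j] == M[i - 1][j] + g):
            top.append(a[i - 1])
            bottom.append("-")
            i -= 1
        else:
            top.append("-")
            bottom.append(b[j - 1])
            j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


def toy_sim(x, y):
    """Hypothetical similarity: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1


if __name__ == "__main__":
    M = needleman_wunsch("AC", "AGC", toy_sim, -2)
    print(M[2][3], traceback("AC", "AGC", M, toy_sim, -2))
```

In practice `sim` would look up a substitution matrix such as PAM250 or BLOSUM62 instead of the toy function used here.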

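Algorithm 2 (Forward partition function for linear gap penalty)
For $1 \le i \le n$ and $1 \le j \le m$, define $Z^F_{0,0} = 1$, $Z^F_{i,0} = e^{i \cdot g/kT}$, $Z^F_{0,j} = e^{j \cdot g/kT}$, and define $Z^F_{i,j}$ by

$$Z^F_{i,j} = Z^F_{i-1,j-1} \cdot e^{s(a_i,b_j)/kT} + Z^F_{i-1,j} \cdot e^{g/kT} + Z^F_{i,j-1} \cdot e^{g/kT}$$

Analogously, we compute the backward partition function $Z^B$, defined for $1 \le i \le n+1$ and $1 \le j \le m+1$ by

$$Z^B_{i,j} = \sum_{A} e^{\mathrm{Sim}(A)/kT}$$

where $A$ ranges over all possible alignments of $a_i \cdots a_n$ with $b_j \cdots b_m$.

Algorithm 3 (Backward partition function for linear gap penalty)
For $1 \le i \le n+1$ and $1 \le j \le m+1$, let $Z^B_{n+1,m+1} = 1$, $Z^B_{i,m+1} = e^{(n-i+1) \cdot g/kT}$, $Z^B_{n+1,j} = e^{(m-j+1) \cdot g/kT}$, and define $Z^B_{i,j}$ to be

$$Z^B_{i,j} = Z^B_{i+1,j+1} \cdot e^{s(a_i,b_j)/kT} + Z^B_{i+1,j} \cdot e^{g/kT} + Z^B_{i,j+1} \cdot e^{g/kT}$$

One can easily check that $Z^F_{n,m} = Z^B_{1,1}$ and that this value is $Z = \sum_{A} e^{\mathrm{Sim}(A)/kT}$, where $A$ ranges over all alignments of $a$ with $b$.⁶

The Boltzmann probability $\Pr[a_i, b_j]$ that $a_i$ will be aligned with $b_j$ is then

$$\Pr[a_i, b_j] = \frac{Z^F_{i-1,j-1} \cdot e^{s(a_i,b_j)/kT} \cdot Z^B_{i+1,j+1}}{Z}$$

Similarly, the Boltzmann probability that $a_i$ will be aligned above a gap $-$, while $a_1 \cdots a_{i-1}$ is aligned with $b_1 \cdots b_j$, is given by

$$\frac{Z^F_{i-1,j} \cdot e^{g/kT} \cdot Z^B_{i+1,j+1}}{Z}$$

Finally, the Boltzmann probability that $b_j$ will be aligned below a gap $-$, while $a_1 \cdots a_i$ is aligned with $b_1 \cdots b_{j-1}$, is given by

$$\frac{Z^F_{i,j-1} \cdot e^{g/kT} \cdot Z^B_{i+1,j+1}}{Z}$$

⁶ It should be noted that in any implementation, these values will be different because the sum of many (large) numbers from left to right is not the same as the sum from right to left, a well-known phenomenon due to limited machine precision and truncation error. For this reason, it is more useful when debugging to verify that the relative error $|Z^F_{n,m} - Z^B_{1,1}| / Z^F_{n,m}$ is very close to 0.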
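Algorithms 2 and 3, together with the three probability formulas, can be sketched in Python as below. This is a hedged illustration and not the author's implementation: the function names, the array layout (an extra row and column so that indices $n+1$ and $m+1$ exist), the toy similarity function and `kT = 1.0` are assumptions of this example. For long sequences the raw exponentials can overflow or underflow, so a practical version would likely need rescaling or log-space arithmetic, which is not shown.

```python
import math


def forward_partition(a, b, sim, g, kT=1.0):
    """ZF[i][j]: sum of Boltzmann weights over alignments of a_1..a_i with b_1..b_j."""
    n, m = len(a), len(b)
    w_gap = math.exp(g / kT)
    ZF = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZF[0][0] = 1.0
    for i in range(1, n + 1):
        ZF[i][0] = ZF[i - 1][0] * w_gap
    for j in range(1, m + 1):
        ZF[0][j] = ZF[0][j - 1] * w_gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            ZF[i][j] = (ZF[i - 1][j - 1] * math.exp(sim(a[i - 1], b[j - 1]) / kT)
                        + ZF[i - 1][j] * w_gap
                        + ZF[i][j - 1] * w_gap)
    return ZF


def backward_partition(a, b, sim, g, kT=1.0):
    """ZB[i][j]: sum of Boltzmann weights over alignments of a_i..a_n with b_j..b_m."""
    n, m = len(a), len(b)
    w_gap = math.exp(g / kT)
    ZB = [[0.0] * (m + 2) for _ in range(n + 2)]  # row 0 and column 0 are unused
    ZB[n + 1][m + 1] = 1.0
    for i in range(n, 0, -1):
        ZB[i][m + 1] = ZB[i + 1][m + 1] * w_gap
    for j in range(m, 0, -1):
        ZB[n + 1][j] = ZB[n + 1][j + 1] * w_gap
    for i in range(n, 0, -1):
        for j in range(m, 0, -1):
            ZB[i][j] = (ZB[i + 1][j + 1] * math.exp(sim(a[i - 1], b[j - 1]) / kT)
                        + ZB[i + 1][j] * w_gap
                        + ZB[i][j + 1] * w_gap)
    return ZB


def boltzmann_probabilities(a, b, sim, g, kT=1.0):
    """Return (match, gap_a, gap_b, Z) with 1-based residue indices.

    match[i][j]: a_i aligned with b_j.
    gap_a[i][j]: a_i above a gap, with a_1..a_{i-1} aligned to b_1..b_j.
    gap_b[i][j]: b_j below a gap, with a_1..a_i aligned to b_1..b_{j-1}.
    """
    n, m = len(a), len(b)
    ZF = forward_partition(a, b, sim, g, kT)
    ZB = backward_partition(a, b, sim, g, kT)
    Z = ZF[n][m]
    w_gap = math.exp(g / kT)
    match = [[0.0] * (m + 1) for _ in range(n + 1)]
    gap_a = [[0.0] * (m + 1) for _ in range(n + 1)]
    gap_b = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(0, m + 1):
            gap_a[i][j] = ZF[i - 1][j] * w_gap * ZB[i + 1][j + 1] / Z
            if j >= 1:
                w_pair = math.exp(sim(a[i - 1], b[j - 1]) / kT)
                match[i][j] = ZF[i - 1][j - 1] * w_pair * ZB[i + 1][j + 1] / Z
    for i in range(0, n + 1):
        for j in range(1, m + 1):
            gap_b[i][j] = ZF[i][j - 1] * w_gap * ZB[i + 1][j + 1] / Z
    return match, gap_a, gap_b, Z


def toy_sim(x, y):
    """Hypothetical similarity: +2 for identical residues, -1 otherwise."""
    return 2.0 if x == y else -1.0


if __name__ == "__main__":
    match, gap_a, gap_b, Z = boltzmann_probabilities("ACG", "AG", toy_sim, -1.0)
    print(Z, match[1][1])
```

A useful sanity check, in the spirit of footnote 6, is that for each residue $a_i$ the pair probabilities over all $b_j$ plus the probabilities of $a_i$ standing above a gap sum to 1.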

3 Local alignment partition function for linear gap penalty

At first thought, one could attempt to define a partition function with respect to all local alignments. After initial investigation, this is clearly not the most reasonable choice (note that it is possible that two optimal local alignments are disjoint). Instead, on input $a$ and $b$, we first obtain the optimal local alignment $A$ of subwords $a_{i_0+1} \cdots a_{i_1}$ and $b_{j_0+1} \cdots b_{j_1}$, then determine the forward and backward partition functions for these subwords, where $Z^F$ and $Z^B$ are computed by the technique of the previous section in performing a global alignment on $a_{i_0+1} \cdots a_{i_1}$ and $b_{j_0+1} \cdots b_{j_1}$.

Algorithm 4 (Smith-Waterman algorithm for local alignment with linear gap function)
For $0 \le i \le n$ and $0 \le j \le m$, let $M_{i,0} = 0$ and $M_{0,j} = 0$, and define $M_{i,j}$ to be

$$M_{i,j} = \max \begin{cases} 0 \\ M_{i-1,j-1} + s(a_i, b_j) \\ M_{i-1,j} + g \\ M_{i,j-1} + g \end{cases}$$

Determine the indices $i_1, j_1$ where $M_{i_1,j_1}$ achieves a maximum, and perform the traceback until indices $i_0, j_0$ where $M_{i_0,j_0} = 0$. This determines the local alignment of $a_{i_0+1} \cdots a_{i_1}$ with $b_{j_0+1} \cdots b_{j_1}$.

Algorithm 5 (Partition function for local alignments with linear gap penalty)
Given amino acid or nucleotide sequences $a$, $b$:

1. Use Algorithm 4 to determine an optimal local alignment $A$ of subwords $a' = a_{i_0+1} \cdots a_{i_1}$ with $b' = b_{j_0+1} \cdots b_{j_1}$.
2. Use Algorithms 2 and 3 to compute the partition functions $Z^F$, $Z^B$ for $a'$ and $b'$.
3. Suppose that the resulting optimal local alignment $A$ of $a'$ with $b'$ is of the form $(x_1, y_1), \ldots, (x_N, y_N)$, where the $x_k, y_k$ are either $-$ or single-letter residue codes, such that $a'$ [resp. $b'$] is obtained after removing $-$ from $x_1 \cdots x_N$ [resp. $y_1 \cdots y_N$]. For $1 \le k \le N$, compute the Boltzmann pair probabilities $\Pr[x_k, y_k]$ in the manner described after Algorithm 3.
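Step 1 of Algorithm 5, that is, Algorithm 4 with its traceback, can be sketched in Python as follows. The sketch is illustrative and makes several assumptions that are not in the original: the function name, the returned tuple layout (half-open 0-based index ranges for the two subwords), the integer-valued toy similarity function, and the tie-breaking order in the traceback. The returned subwords would then be passed to forward and backward partition function routines for steps 2 and 3.

```python
def smith_waterman(a, b, sim, g):
    """Optimal local alignment with linear gap penalty g (g negative).

    Returns (score, (i0, i1), (j0, j1), columns), where a[i0:i1] and b[j0:j1]
    are the aligned subwords and columns is the list of aligned pairs
    (residue or '-', residue or '-').
    """
    n, m = len(a), len(b)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    best, i1, j1 = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            M[i][j] = max(
                0,                                          # start a new local alignment
                M[i - 1][j - 1] + sim(a[i - 1], b[j - 1]),  # a_i paired with b_j
                M[i - 1][j] + g,                            # a_i above a gap
                M[i][j - 1] + g,                            # b_j below a gap
            )
            if M[i][j] > best:
                best, i1, j1 = M[i][j], i, j
    # Trace back from the maximum until a zero entry is reached.
    # Exact equality tests assume integer scores, as in the toy example.
    i, j = i1, j1
    columns = []
    while i > 0 and j > 0 and M[i][j] > 0:
        if M[i][j] == M[i - 1][j - 1] + sim(a[i - 1], b[j - 1]):
            columns.append((a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif M[i][j] == M[i - 1][j] + g:
            columns.append((a[i - 1], "-"))
            i -= 1
        else:
            columns.append(("-", b[j - 1]))
            j -= 1
    columns.reverse()
    return best, (i, i1), (j, j1), columns


def toy_sim(x, y):
    """Hypothetical similarity: +2 for identical residues, -1 otherwise."""
    return 2 if x == y else -1


if __name__ == "__main__":
    a, b = "XXACGTXX", "YYACGTYY"
    score, (i0, i1), (j0, j1), cols = smith_waterman(a, b, toy_sim, -2)
    print(score, a[i0:i1], b[j0:j1])
```

When several cells share the maximum score, this sketch keeps the first one encountered in row-major order; the paper does not say how such ties are resolved.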

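[Figure 1: Local alignment Boltzmann probabilities of portions of bovine trypsin and pig elastase (see text). Vertical axis: Boltzmann probability, 0 to 1; horizontal axis: alignment position, 0 to 250.]

4 Quadratic time algorithm for affine gap penalty

Let $g(k)$ denote the penalty for $k$ successive gaps. In the following sections, we assume that $g(k) = g_i + g_e \cdot (k-1)$ is an affine function, where $g_i$ [resp. $g_e$] denotes the gap initiation [resp. gap extension] cost. Let $P_{i,j}$, $Q_{i,j}$, $R_{i,j}$ be the maximum alignment score of any alignment of $a_1 \cdots a_i$ with $b_1 \cdots b_j$ (for local alignment, of a suffix of $a_1 \cdots a_i$ with a suffix of $b_1 \cdots b_j$), where in the last column

1. $a_i$ is aligned with $-$ in the case of $P_{i,j}$,
2. $-$ is aligned with $b_j$ in the case of $Q_{i,j}$,
3. $a_i$ is aligned with $b_j$ in the case of $R_{i,j}$.

Algorithm 6 (Needleman-Wunsch-Gotoh [9] global alignment with affine gap penalty)
For $1 \le i \le n$ and $1 \le j \le m$, let $R_{0,0} = 0$ and $P_{0,0} = Q_{0,0} = -\infty$; let $P_{i,0} = g_i + g_e \cdot (i-1)$ and $Q_{i,0} = R_{i,0} = -\infty$; and let $Q_{0,j} = g_i + g_e \cdot (j-1)$ and $P_{0,j} = R_{0,j} = -\infty$. Define the inductive cases for $P$, $Q$, $R$ as follows:

$$P_{i,j} = \max \begin{cases} P_{i-1,j} + g_e \\ Q_{i-1,j} + g_i \\ R_{i-1,j} + g_i \end{cases} \qquad Q_{i,j} = \max \begin{cases} P_{i,j-1} + g_i \\ Q_{i,j-1} + g_e \\ R_{i,j-1} + g_i \end{cases} \qquad R_{i,j} = s(a_i, b_j) + \max \begin{cases} P_{i-1,j-1} \\ Q_{i-1,j-1} \\ R_{i-1,j-1} \end{cases}$$

In contrast to $P$, $Q$, $R$, values of the $(n+1) \times (m+1)$ matrix $T$ are ordered pairs $T_{i,j} = (x, S)$, where $x = \max\{P_{i,j}, Q_{i,j}, R_{i,j}\}$ is the maximal score of an alignment of a suffix of $a_1 \cdots a_i$ with a suffix of $b_1 \cdots b_j$, and $S \subseteq \{P, Q, R\}$. For scores $x = 0$, $S = \emptyset$, while for $x > 0$, $S$ contains P if $x = P_{i,j}$, contains Q if $x = Q_{i,j}$, and contains R if $x = R_{i,j}$. In other words, $T$ gives not only the maximal score of an alignment, but indicates as well how that score is obtained.

Note that the details of this algorithm, as well as of Algorithm 7, apart from the fact that we are dealing with similarity rather than distance matrices, are different from those given in Gotoh's original work [9] as well as in the exposition in Clote-Backofen [6]. In particular, in addition to $P$ and $Q$, we have an additional matrix $R$, along with different traceback information in $T$. This explicit separation into three distinct cases $P$, $Q$, $R$ is crucial to avoid overcounting in the computation of the forward and backward partition functions in the case of affine gap penalty.

Algorithm 6 can easily be modified to yield the following local alignment algorithm for affine gap penalties, whose time and space complexity is $O(nm)$.

Algorithm 7 (Smith-Waterman-Gotoh local alignment with affine gap penalty)
For $0 \le i \le n$ and $0 \le j \le m$, let $P_{i,0} = 0$, $P_{0,j} = 0$, $T_{i,0} = (0, \emptyset)$, $T_{0,j} = (0, \emptyset)$, and similarly for $Q$ and $R$. For $1 \le i \le n$ and $1 \le j \le m$, define $P$, $Q$, $R$ and $T$ as in Algorithm 6, except that the maximum includes the value 0. To define the local alignment, proceed as follows.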

1. Determine $i_0, j_0$ such that $T_{i_0,j_0} = (x, S)$ and $x$ is the maximum such score possible.
2. Set $i = i_0$ and $j = j_0$.
3. Set the alignment list $L$ to be empty, then repeat:

    while x > 0, where (x, S) = T[i, j]:
        choose the first matrix in S        // ties broken in a fixed order of preference
        if R is chosen:       append (a_i, b_j) to front of L;  i = i - 1;  j = j - 1
        else if P is chosen:  append (a_i, -) to front of L;    i = i - 1
        else if Q is chosen:  append (-, b_j) to front of L;    j = j - 1

We are now in a position to give the pseudocode for the computation of the forward partition function.

Algorithm 8 (Forward partition functions for affine gap penalty)

    ZP[0,0] = 0;  ZQ[0,0] = 0;  ZR[0,0] = 1
    for i = 1 to n:
        ZP[i,0] = exp((g_i + (i-1) g_e) / kT);  ZQ[i,0] = 0;  ZR[i,0] = 0
    for j = 1 to m:
        ZP[0,j] = 0;  ZQ[0,j] = exp((g_i + (j-1) g_e) / kT);  ZR[0,j] = 0
    for i = 1 to n:
        for j = 1 to m:
            ZP[i,j] = ZP[i-1,j] exp(g_e/kT) + ZQ[i-1,j] exp(g_i/kT) + ZR[i-1,j] exp(g_i/kT)
            ZQ[i,j] = ZP[i,j-1] exp(g_i/kT) + ZQ[i,j-1] exp(g_e/kT) + ZR[i,j-1] exp(g_i/kT)
            ZR[i,j] = exp(s(a_i,b_j)/kT) (ZP[i-1,j-1] + ZQ[i-1,j-1] + ZR[i-1,j-1])
    for i = 0 to n:
        for j = 0 to m:
            Z[i,j] = ZP[i,j] + ZQ[i,j] + ZR[i,j]
    return ZP, ZQ, ZR, Z

In an analogous manner, the backward partition functions can be defined. Algorithm 8, along with our explicit algorithms in the earlier treatment of linear gap penalty, should provide sufficient detail for any reader to fill in the code of our current implementation. With this, we conclude that the partition functions, and hence the Boltzmann probabilities, can be computed in $O(nm)$ time and space.

5 Example

Let us compare the output of pairwise BLAST at the NCBI server [18] on two biologically related proteins, bovine trypsin (PDB identity 1TGB) and pig elastase (chain A with SwissProt accession 1C1MA). These sequences were chosen because they were used by Lipman et al. [14] to illustrate the improvement that Carrillo-Lipman multiple sequence alignment provides over dynamic programming local pairwise alignment. The BLAST alignment using PAM250 with gap initiation cost of 14 and gap extension cost of 2 is given by

HFCGGSLINSQWVVSAAHCYKSGIQVRL--GEDNINVVEGNEQFISASKSIVHPSYNSNT
HTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDD

LNN--DIMLIKLKSAASLNSRVASISLPTSCA--SAGTQCLISGWGNTKSSGTSYPDVLK
VAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQLAQTLQQ

CLKAPILSDSSCKSAY-PGQITSNMFCAGYLEGGKDSCQGDSGGPVVCSGK----LQGIV
AYLPTVDYAICSSSSYWGSTVKNSMVCAG-GDGVRSGCQGDSGGPLHCLVNGQYAVHGVT

SWGS--GCAQKNKPGVYTKVCNYVSWIKQTIAS
SFVSRLGCNVTRKPTVFTRVSAYISWINNVIAS

The alignment given by my implementation of the Smith-Waterman local alignment algorithm with the same gap initiation and gap extension parameters with PAM250 (using Gotoh's trick to ensure quadratic time and space complexity) is as follows:

HFCGGSLINSQWVVSAAHCYKSGIQVR--LGEDNINVVEGNEQFISASKSIVHPSYNSNT
HTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDD

L-NN-DIMLIKLKSAASLNSRVASISLP-TSCA-SAGTQCLISGWGNTKSSGTSYPDVLK
VAAGYDIALLRLAQSVTLNSYVQLGVLPRAGTILANNSPCYITGWGLTRTNGQLAQTLQQ

CLKAPILSDSSCKSA-YPGQITSNMFCAGYLEGGKDSCQGDSGGPVVC--SGK--LQGIV
AYLPTVDYAICSSSSYWGSTVKNSMVCAG-GDGVRSGCQGDSGGPLHCLVNGQYAVHGVT

SW-GS-GCAQKNKPGVYTKVCNYVSWIKQTIAS
SFVSRLGCNVTRKPTVFTRVSAYISWINNVIAS

Both methods align the subsequence of bovine trypsin starting at position 29 through 238 with the subsequence of pig elastase starting at position 28 through 239. The BLAST record indicates which positions in the alignment involve identical residues (with the residue name written between the aligned residues) or similar residues (with a + written between the aligned residues). Thus the first line of the BLAST output is as follows (the column spacing of the middle line is not preserved in this transcription):

HFCGGSLINSQWVVSAAHCYKSGIQVRL--GEDNINVVEGNEQFISASKSIVHPSYNSNT
H CGG+LI +WV++AAHC + R+ GE+N+N +G EQ+V+ K VVHP N++
HTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDD

In contrast, in our alignment, + designates a Boltzmann probability of 75%-100%, a second symbol (lost in this transcription) corresponds to 50%-75%, - to 25%-50%, and nothing to 0%-25%.

HFCGGSLINSQWVVSAAHCYKSGIQVR--LGEDNINVVEGNEQFISASKSIVHPSYNSNT
++++++++++++++++++++++++++++   ++++++++      +++++++++++++++
HTCGGTLIRQNWVMTAAHCVDRELTFRVVVGEHNLNQNDGTEQYVGVQKIVVHPYWNTDD

The Boltzmann probabilities of the first 60 aligned positions are given in Figure 2. These probabilities are graphically displayed in Figure 1 in the initial portion from 1 to 60 of the x-axis.

6 Discussion

The significance, in terms of Boltzmann probability, of how well two residues (or a residue and a gap) are aligned in an optimal scoring alignment, as developed in this paper, is quite distinct from any Viterbi probability or sum-of-all-paths probability from a trained hidden Markov model. Using publicly available HMMs, it is easy to find a pair of sequences whose HMM alignment differs from the Needleman-Wunsch or Smith-Waterman alignment; hence HMMs have little to do with the concepts developed in this paper. As well, the algorithms of Waterman [22] and Waterman and Eggert [23] concern subsequent modifications of the path matrix after the optimal alignment is found, and hence have nothing to do with our approach. Finally, the method of threading, discussed in Clote-Backofen [6], concerns sampling k-mer conformations from the PDB, assuming that the resulting distribution is Boltzmann distributed, and taking the negative logarithm of these frequencies as a suitable pseudo-energy. In threading, there is no computation of the partition function, and the alignment of certain k-mers (i.e. the threading of convex subwords of the peptide) does not admit gaps within the k-mers, nor does it consider the partition function over all such possible alignments of k-mers. Thus, to the best of our knowledge, our results are new and bear little in common with HMMs, suboptimal alignment algorithms, or threading.
7 Conclusions and future work

In this work, we have designed and implemented a new quadratic time and space algorithm to compute the partition function for global and local sequence alignments of two peptides, thus obtaining an efficient computation of the Boltzmann probability that a particular pair of amino acid residues, or a gap and a residue, are aligned. Additionally, we have created a web server to make the algorithm available for testing. Our prototype programs and CGI scripts are written in the platform-independent, object-oriented scripting language Python [21].

We are currently extending the Boltzmann probability computation to multiple sequence alignments (Feng-Doolittle and ClustalW algorithms), to dynamic time warping of cDNA microarray data as implemented in Aach-Church [1], to structural alignments, etc. To address efficiency issues, a collaborator is beginning the translation of our Python code into C/C++. We are currently investigating both the FSSP and 3dAli structural alignment databases, to calibrate our method of using Boltzmann probabilities to correlate the biological significance of certain portions of an alignment.

Acknowledgements

I'd like to thank Stephen H. Bryant for a brief suggestion that we contrast our method with that of profile hidden Markov models, E-values, threading and suboptimal alignments.

References

[1] J. Aach and G. Church. Aligning gene expression time series with time warping algorithms. Bioinformatics, 17(6):495-508, 2001.
[2] S.F. Altschul, W. Gish, W. Miller, E.W. Myers, and D.J. Lipman. Basic local alignment search tool. J. Mol. Biol., 215:403-410, 1990.
[3] S.F. Altschul, T.L. Madden, A.A. Schäffer, J. Zhang, W. Miller, and D.J. Lipman. Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucleic Acids Res., 25:3389-3402, 1997.
[4] H. Carrillo and D. Lipman. The multiple sequence alignment problem in biology. SIAM J. Appl. Math., 48(5):1073-1082, 1988.
[5] P. Clote. Boltzmann alignment server. cslab.bc.edu:8080/compbio/boltzmannalignment.html. This is only a prototype implementation; an expanded web server (currently under construction) will be hosted elsewhere.
[6] P. Clote and R. Backofen. Computational Molecular Biology: An Introduction. John Wiley & Sons, 2000. 286 pages.
[7] P. Clote and E. Kranakis. Boolean Functions and Computation Models. Springer-Verlag, 2002. 601 pages.

[8] S.R. Eddy. Hidden Markov models and large-scale genome analysis. In C. Rawlings et al., editor, Proc. Third Int. Conf. Intelligent Systems for Molecular Biology, pages 114-12[?]. AAAI Press, Menlo Park, 1995.
[9] O. Gotoh. An improved algorithm for matching biological sequences. J. Mol. Biol., 162:705-708, 1982.
[10] M. Gribskov, A.D. McLachlan, and D. Eisenberg. Profile analysis: Detection of distantly related proteins. Proc. Natl. Acad. Sci. USA, 84:4355-4358, 1987.
[11] S. Henikoff and J.G. Henikoff. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. USA, 89:10915-10919, 1992.
[12] S. Karlin and S.F. Altschul. Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. USA, 87:2264-2268, 1990.
[13] S. Karlin and S.F. Altschul. Applications and statistics for multiple high-scoring segments in molecular sequences. Proc. Natl. Acad. Sci. USA, 90:5873-5877, 1993.
[14] D.J. Lipman, S.F. Altschul, and J.D. Kececioglu. A tool for multiple sequence alignment. Proc. Natl. Acad. Sci. USA, 86:4412-4415, 1989.
[15] J.S. McCaskill. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers, 29:1105-1119, 1990.
[16] S.B. Needleman and C.D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol., 48:443-453, 1970.
[17] R.M. Schwartz and M.O. Dayhoff. Matrices for detecting distant relationships. In M.O. Dayhoff, editor, Atlas of Protein Sequence and Structure, volume 2, pages 353-358. Natl. Biomed. Res. Found., Washington, DC, 1978. Vol. 5, Suppl. 3.
[18] BLAST server. http://www.ncbi.nlm.nih.gov/blast/bl2seq/bl2.html
[19] T.F. Smith and M.S. Waterman. Identification of common molecular subsequences. J. Mol. Biol., 147:195-197, 1981.
[20] J. Thompson, D. Higgins, and T. Gibson. ClustalW: Improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Research, 22:4673-4680, 1994.
[21] G. van Rossum. Python programming language. www.python.org.
[22] M.S. Waterman. Sequence alignments in the neighborhood of the optimum with general application to dynamic programming. Proc. Natl. Acad. Sci. USA, 80:3123-3124, 1983.
[23] M.S. Waterman and M. Eggert. A new algorithm for best subsequence alignments with applications to tRNA-rRNA. J. Mol. Biol., 197:723-728, 1987.
[24] M. Zuker and D. Sankoff. RNA secondary structures and their prediction. Bulletin of Mathematical Biology, 46(4):591-621, 1984.

Appendix

Trypsin  Elastase  Boltzmann probability
H        H         1.0
F        T         1.0
C        C         1.0
G        G         0.999999999998
G        G         0.999999999998
S        T         0.999999999999
L        L         1.0
I        I         1.0
N        R         1.0
S        Q         1.0
Q        N         1.0
W        W         0.999999999997
V        V         0.999999999997
V        M         1.0
S        T         1.0
A        A         1.0
A        A         1.0
H        H         1.0
C        C         0.999999999999
Y        V         1.0
K        D         0.999999999997
S        R         0.999999999901
G        E         0.999999999893
I        L         0.999999999879
Q        T         0.999999999916
V        F         0.999999999978
R        R         0.999667207824
-        V         0.991942420438
-        V         6.9360833833e-06
L        V         6.4148782130e-06
G        G         6.40441406993e-06
E        E         0.999863362217
D        H         0.99998809778
N        N         0.999991840169
I        L         0.99999218143
N        N         0.9999944211
V        Q         0.99466786612
V        N         0.9876394744
E        D         0.8608987441
G        G         .3697020189e-06
N        T         0.109119031826
E        E         0.1231478343
Q        Q         0.120933244927
F        Y         0.120927638301
I        V         1.8287487771e-0
S        G         0.9999942629
A        V         0.99999437029
S        Q         0.999999866331
K        K         0.999999999747
S        I         1.0
I        V         1.0
V        V         1.0
H        H         1.0
P        P         1.0
S        Y         1.0
Y        W         1.0
N        N         0.99999998466
S        T         0.9999999763
N        D         0.99999996082
T        D         0.999996614662

Figure 2: Probabilities for fragment 1-60 of the local alignment Boltzmann probabilities of portions of bovine trypsin and pig elastase (see text)
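As a companion to Algorithm 8 in Section 4, the forward partition functions for an affine gap penalty can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a transcription of the author's code: the names `ZP`, `ZQ`, `ZR`, the parameter names `g_open` and `g_ext` (standing for the negative costs $g_i$ and $g_e$), the toy similarity function and `kT = 1.0` are choices made for this example. The three tables are indexed by the type of the last alignment column, which is what prevents an alignment from being counted more than once.

```python
import math


def forward_partition_affine(a, b, sim, g_open, g_ext, kT=1.0):
    """Forward partition functions for gap penalty g(k) = g_open + g_ext * (k - 1).

    ZP[i][j]: alignments of a_1..a_i with b_1..b_j ending with a_i above a gap.
    ZQ[i][j]: alignments ending with b_j below a gap.
    ZR[i][j]: alignments ending with a_i paired with b_j.
    Returns (ZP, ZQ, ZR, Z), where Z is the total over all global alignments.
    """
    n, m = len(a), len(b)
    w_open = math.exp(g_open / kT)
    w_ext = math.exp(g_ext / kT)
    ZP = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZQ = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZR = [[0.0] * (m + 1) for _ in range(n + 1)]
    ZR[0][0] = 1.0  # the empty alignment
    for i in range(1, n + 1):
        ZP[i][0] = w_open * w_ext ** (i - 1)  # one gap of length i
    for j in range(1, m + 1):
        ZQ[0][j] = w_open * w_ext ** (j - 1)  # one gap of length j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Extend an existing gap of the same kind, or open a new one.
            ZP[i][j] = ZP[i - 1][j] * w_ext + (ZQ[i - 1][j] + ZR[i - 1][j]) * w_open
            ZQ[i][j] = ZQ[i][j - 1] * w_ext + (ZP[i][j - 1] + ZR[i][j - 1]) * w_open
            # Pair a_i with b_j after any kind of preceding column.
            w_pair = math.exp(sim(a[i - 1], b[j - 1]) / kT)
            ZR[i][j] = w_pair * (ZP[i - 1][j - 1] + ZQ[i - 1][j - 1] + ZR[i - 1][j - 1])
    Z = ZP[n][m] + ZQ[n][m] + ZR[n][m]
    return ZP, ZQ, ZR, Z


def toy_sim(x, y):
    """Hypothetical similarity: +2 for identical residues, -1 otherwise."""
    return 2.0 if x == y else -1.0


if __name__ == "__main__":
    _, _, _, Z = forward_partition_affine("ACG", "AG", toy_sim, -3.0, -1.0)
    print(Z)
```

One design point worth noting: in this sketch a gap in one sequence that directly follows a gap in the other is charged a fresh opening cost. When `g_open` equals `g_ext`, the total `Z` reduces to the linear-gap partition function of Algorithm 2, which gives a convenient consistency check.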