SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING METHODS /(X) ¼ I Q(X)

Size: px
Start display at page:

Download "SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING METHODS /(X) ¼ I Q(X)"

Transcription

1 JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 5, 20 # Mary Ann Liebert, Inc. Pp DOI: 0.089/cmb Research Articles Sequence Alignment as Hypothesis Testing LU MENG, FENGZHU SUN, 2 XUEGONG ZHANG, and MICHAEL S. WATERMAN 2 ABSTRACT Sequence alignment depends on the scoring function that defines similarity between pairs of letters. For local alignment, the computational algorithm searches for the most similar segments in the sequences according to the scoring function. The choice of this scoring function is important for correctly detecting segments of interest. We formulate sequence alignment as a hypothesis testing problem, and conduct extensive simulation experiments to study the relationship between the scoring function and the distribution of aligned pairs within the aligned segment under this framework. We cut through the many ways to construct scoring functions and showed that any scoring function with negative expectation used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. The results indicate that the log-likelihood ratio scoring function is statistically most powerful and has the highest accuracy for detecting the segments of interest that are defined by the statistical distribution of aligned letter pairs. Key words: hypothesis testing, local alignment, power, scoring function, sequence alignment.. INTRODUCTION Sequence alignment is one of the most important problems in computational biology. Similar segments in gene or protein sequences often indicate evolutionary homology or functional relationships between the genes or proteins. Sequence alignment tasks are generally categorized into three different types: () global sequence alignment, which determines the best alignment of the sequences with their entire lengths by adjusting their relative positions and inserting gaps when necessary; (2) local sequence alignment, which determines segments in the sequences that are most similar with each other; and (3) semi-global or fit alignment, which searches for the occurrence of a short query sequence in a large sequence database (Waterman, 995). In any case, a scoring function needs to be defined to evaluate the similarity of the sequences and a computational algorithm is employed to search for the best alignment. The classical algorithms are based on dynamic programming: the Smith-Waterman algorithm (Smith and Waterman, 98) for local alignment and the Needleman-Wunsch algorithm (Needleman and Wunsch, 970) for global alignment. Many heuristic algorithms have been proposed to speed up the search procedure, such as BLAST (Altschul et al., 990), FASTA (Pearson, 990), CLUSTAL W (Thompson et al., 994), PSI-BLAST (Altschul et al., MOE Key Lab of Bioinformatics and Bioinformatics Division of TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China. 2 Molecular and Computational Biology, University of Southern California, Los Angeles, California; and TNLIST/ Department of Automation, Tsinghua University, Beijing, P.R. China. 677

2 678 MENG ET AL. 997), BLAT (Kent, 2002), BALSA (Zhu et al., 998; Webb et al., 2002), ProbCons (Do et al., 2005), DIALIGN-T(X) (Subramanian et al., 2005, 2008), and Bowtie (Langmead et al., 2009). No matter what algorithm is used, the choice of a scoring function is the key to producing good alignments. Several scoring functions have been introduced, most of which are implicitly log-odds matrices (Altschul, 99), i.e., loglikelihood ratio scoring functions. Dayhoff s PAM (Dayhoff et al., 978) and Henikoff s BLOSUM (Henikoff and Henikoff, 992) matrices are generally considered the standard in many applications. The PAM matrices were initially derived based on an explicit evolutionary model of closely related sequences and the observed mutations in these sequences. Dayhoff et al. (978) separated proteins into families, constructed phylogenetic trees for each family, and examined every branch of the resulting trees for substitutions. The score for two letters is defined by the ratio of probability of substitution between the two letters over the expected probability. Some scoring functions such as the JTT ( Jones et al., 992) and the GCB (Gonnet et al., 992) matrices were derived by extrapolation from closely related sequences based on PAM evolutionary model to increase accuracy of homology searches. The VTML matrix (Müler et al., 2002) was obtained by maximum likelihood estimation of Dayhoff s parameters. The BLOSUM matrices were derived based on known multiple aligned blocks of sequences with blocks being the aligned regions without gaps. The score between two amino acids was defined as 2 times log-ratio of the probability that the two amino acids are aligned in the blocks over the corresponding expected probability assuming the sequences are independent. The PMB matrix (Veerassamy et al., 2003) is based on the blocks which BLOSUM used, but added evolutionary distances to form an evolutionary model. Based on the fact that proteins have complicated three-dimensional (3-D) structures, some scoring functions make use of the structure information, such as the STR (Overington et al., 992) and the STROMA (Qian and Goldstein, 2002) matrices. Some scoring functions like SDM (Prlić et al., 2000) were derived from protein pairs of similar structure instead of sequence similarity. The scoring functions used in Fugue (Shi et al., 200) and Wurst (Torda et al., 2004) are based on sequence-structure homology. Bayesian methods have also been employed for constructing scoring functions for multiple sequence alignment, such as BILD (Altschul et al., 200). Many scoring functions are designed for specific applications. Adachi and Hasegawa (996) established a model of amino acid substitution matrix for mitochondrial DNA encoded protein sequences, estimating the score matrix by the maximum likelihood method from mtdna data. Cao et al. (2009) developed scoring functions for DNA sequences based on information theory in the expectation maximization framework. Among these scoring functions, determining which is most powerful to detect the truly related segments for a given sequence study is an important problem. Usually, the scoring function is empirically selected based on some assumptions about the sequences to be compared. Altschul (99) argued that PAM20, among the PAM matrices, is probably most appropriate if only one matrix is used, based on information content of the score matrix measured in relative entropy. Henikoff and Henikoff (993) evaluated the performance of some commonly used substitution matrices, and found that log-likelihood ratio based scoring functions derived directly from multiple alignment data are better for detecting distant relationships than matrices based on PAM evolutionary model and the STR matrix achieved similar performance to BLOSUM62. The performance was evaluated through the efficiency for detecting true amino acid sequences belonging to particular protein families. Several investigators evaluated different scoring functions by comparing the alignments derived from the computer algorithm with the alignments generated by simulations through fidelity, confidence, and overall correctness (Polyanovsky et al., 2008; Holmes and Durbin, 998). In our study, we also compare different scoring functions using these quantities, however with terms that are more commonly used in the field of classification studies (see Section 2. for details). These investigators studied global alignments. In our study, we consider local alignment and general scoring functions without gaps. In this article, we consider sequence alignment as a statistical hypothesis testing problem and define scores using the log-likelihood ratio statistic based on the segments we intend to find. We focus on the relationship between the scoring function and the best aligned segment in the scenario of local sequence alignment. According to the Neyman-Pearson lemma (Neyman and Pearson, 933), the likelihood-ratio statistic is most powerful to distinguish a given distribution of optimal alignment from the background distribution for fixed global alignments. The test statistic provides a form of a scoring function which can be used in the sequence alignment. We conjecture that this log-likelihood ratio scoring function is statistically the most powerful one to detect the best aligned segment. Under the assumption that the compared sequences are independent and identically distributed (i.i.d.) letters sampled from some background distributions, we can treat the sequence alignment problem as a hypothesis testing problem. The null hypothesis is that the sequences are

3 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 679 independent, and the alternative hypothesis is that they are related due to some shared segments which have a given distribution of letter pairs. We choose the score of the best aligned segment, i.e., the highest score during local alignment, as the test statistic. We use the power of the statistic for the hypothesis test as a measurement of the performance of scoring functions. The higher the power, the better the scoring function is. We also take a classification perspective for detecting the aligned segments and use true positive rate (TPR), false discovery rate (FDR), and the f-statistic (a weighted average of TPR and FDR) as alternative criteria for evaluating the scoring functions. We show that the log-likelihood ratio scoring function is most powerful to detect aligned segments following the distribution derived from the scoring function. It applies to both DNA sequences and amino acid sequences. The aim of this article is to cut through the many ways to construct scoring functions and show that any scoring function used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. 2.. Theoretical motivation 2. METHODS In this section, we present the basis for the analysis we perform. The first result that motivated our work was the Neyman-Pearson Lemma (Neyman and Pearson, 933). This remarkable result, which has an elegant proof, is central to statistical theory and practice. The setting is hypothesis testing and we present the most elementary form of the Neyman-Pearson Lemma. The data are X, X 2,..., X i.i.d. (independent and identically distributed) from model 0 with distribution P (the null hypothesis H 0 ) or from model with distribution Q (the alternate hypothesis H ). Do the data X ¼ X, X 2,..., X come from model 0 or? A test function f satisfies f(x) in {0, } where we say H 0 is rejected if f(x) ¼. The size of the test or level of significance is P(f(X) ¼ ). Our ideal test function is one which has small size P(f(X) ¼ ) ¼ a and the largest possible power b ¼ Q(f(X) ¼ ). That is, we want a test statistic that has a small probability of rejecting a true null hypothesis, but the largest possible power or probability of rejecting H 0 when H is true. The Neyman-Pearson Lemma states that the most powerful test statistic is /(X) ¼ I Q(X) P(X) t, when P(/(X) ¼ ) ¼ a, where I( ) ¼ when the argument is true and 0 otherwise. This is one reason for the widespread use of likelihood ratio statistics. In sequence alignment, we are interested in pairs of aligned letters from finite alphabet L such as a b.an alignment A with length of n of letters from sequences of random letters A ¼ A A 2...A and B ¼ B B 2...B is represented as A A 2 A B B 2 B The null hypothesis is that all the 2n letters are i.i.d. P with P(A ¼ a) ¼ p a, and we call such an alignment P-distributed. The alternate hypothesis Q is a distribution over aligned pairs a b with Q A ¼ a B b ¼ qab.in contrast P A B ¼ a b ¼ pa p b. Thus, we are testing the statistical distribution of letter pairs in the alignment: P-distributed alignments versus Q-distributed alignments. Following the Neyman-Pearson Lemma, to test the hypothesis that the alignment distribution is we should use the likelihood ratio H 0 : P vs H : Q Y i¼ q Ai B i p Ai p Bi : For convenience, we will use the logarithm of this statistic to form an equivalent test.

4 680 MENG ET AL. / ¼ I X i ¼! q Ai B log i t : p Ai p Bi Thus, we have derived the log-likelihood scoring of alignments using alphabet scoring function q ab s(a, b) ¼ s P, Q (a, b) ¼ log, () p a p b which for sequence alignment goes back at least to Dayhoff et al. (978) and was more recently employed by Henikoff and Henikoff (992) and others. This article will explore the implications of this approach to scoring and its connection to hypothesis testing. If we consider P as the background distribution and Q as the alternate distribution for the alignment, we should use the log-likelihood ratio scoring function s ¼ s P,Q as the scoring function to best distinguish Q from P. However it is less evident what should be done for local alignment. We conjecture this scoring function is most powerful to detect Q distributed local alignments aligned fragments with distribution Q. Now if there were one given, fixed-length alignment, our previous discussion would be the conclusion of the matter. Instead there are, for two random sequences of length n, O(n 3 ) possible local alignments (we exclude indels where this number is much larger), and they are dependent in a subtle way. Is there any reason to be optimistic that log-likelihood scoring function is best for detecting local alignments of distribution Q from the background P? The mathematical result described next gives some hope for this. The following result first appeared in Arratia et al. (988) and was stated more generally in Karlin and Altschul (990) with a form that was proven in Dembo et al. (994). We give a version fitting our setup and do not completely repeat the notation we have defined above. Let A ¼ A A 2 A n and B ¼ B B 2 B n be i.i.d. random sequences with background distribution P. Assume s(a,b) is a scoring function that satisfies the conditions (i) max a, b2l s(a, b) 4 0 and (ii) E P s(a, B) 5 0, where E P s(a, B) ¼ X p a p b s(a, b): Let r > 0 be the largest real root of a, b2l f (k) ¼ E(k s(a, B) ) ¼ 0: (2) Then the proportion of letter a from sequence A aligned with letter b from sequence B in the optimal alignment segment converges to q ab ¼ p a p b r s(a, b) as sequence length n tends to infinity. We now take the asymptotic distribution of the theorem and solve for s(a, b). s(a, b) ¼ log =r q ab p a p b : As positive multiples (cs(a,b) versus s(a,b) for any c > 0, for example) do not affect the results of local alignment, the numerical value of r > 0 is irrelevant. Therefore for any scoring function satisfying the hypotheses of the theorem, there is an asymptotic log-likelihood scoring function. It is easy to show (see Appendix A) that for any log-likelihood scoring function, the conditions of the theorem are satisfied so long as P and Q are not identical, that is for some (a,b), q ab = p a p b. Thus, we have a duality between scoring and likelihood ratio statistics. The theorem assures us that in the i.i.d. case even with the complexity of O(n 3 ) competing local alignments, with a given scoring function, a local alignment algorithm searches for Q distributed alignments. From this point of view, sequence alignment is hypothesis testing where H 0 : The sequences A A 2 A n and B B 2 B m are from P i.i.d. letters. H : The sequences A A 2 A n and B B 2 B m are mixture of P i.i.d. letters and a Q distributed local alignment at an unknown location.

5 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 68 Because under either hypothesis the alignment algorithm is rewarding Q distributed local alignments, how do we determine signal from background? The answer is that this is not possible until the signal is significantly larger than the background. Fortunately, there is a well-studied basis for statistical significance in local alignments; the most famous is used in BLAST (Altschul et al., 990) and is closely related to the theorem presented above, in addition to there being rigorous Poisson approximation methods which are equivalent (Waterman and Vingron, 994). For our purposes, it will suffice to note that the growth of the alignment length of an optimal alignment, via an Erdös-Renyi law (Arratia et al., 988), for two sequences of length n is k ¼ log (nm) H(Q, P) ; where H(Q, P) ¼ X p a p b log (p a p b =q ab ): a, b2l Let S P,Q (A, B), the maximum local alignment score under the scoring function s P,Q defined by P and Q, be the test statistic. For a given size a, we choose a threshold t a, so that We define the power of the alignment test statistic as P(S P, Q (A; B) t a ) ¼ a: (3) power ¼ Q(S P, Q (A, B) t a ): (4) A scoring function s P,Q yielding the highest power is preferred in local sequence alignment. A natural way to evaluate the scoring function s P,Q is to see if the local aligned segment identified by the algorithm can find the signal of interest. A signal can be locally aligned segments. In our simulations, the signal is inserted at random positions of the two sequences. We treat the aligned position pairs of the signal as actual positives. However, the actual negatives are less easy to define because the other bases are not aligned. We denote the inserted signal by p* and let k* be the length of the signal p*. Similarly, the predicted positives are the aligned position pairs in the identified local aligned segment, which we refer to as p 0. Let the length of identified alignment be k 0. Table shows the relationship among the terms. The predicted negatives are difficult to define though. We use similar notation as in standard classification problems, and use TP, FP, and FN to represent true positive, false positive, and false negative, respectively. TPR (also referred to as sensitivity) is the fraction of true positives among the actual positives, i.e., TPR ¼ TP k ¼ jp0 \ p j k : The positive predictive value (PPV) or precision is defined by and FDR is defined as PPV ¼ TP k 0 ¼ jp0 \ p j k 0, FDR ¼ PPV: Table. TP, FP, and FN by Comparing Algorithmic Alignment p 0 with the Signal p* Signal p* Positive Negative Predicted alignment p 0 Positive True positive ¼jp 0 \ p*j False positive ¼ k jp 0 \ p*j Negative False negative ¼ k* jp 0 \ p*j True negative

6 682 MENG ET AL. Scoring functions yielding high TPR and high PPV (and low FDR) are preferred. Another commonly used measure to evaluate the performance of a classification problem is the f-statistic defined as f ¼ 2TP k þ k 0 ¼ 2 jp0 \ p j k þ k 0 : Note that the f-statistic is a weighted sum of TPR and PPV. In this study, we use TPR, FDR, and the f-statistic to evaluate the scoring function from the classification point of view Simulation studies We carry out extensive simulations to show that when the scoring function used for sequence comparison is the log-likelihood ratio score defined in equation, the test statistic has the highest power, TPR, PPV, and the f-statistic. To achieve this objective, we carry out simulations as follows. First, we choose a set of P distributions as the background distribution of letters. Second, we define a set of Q-distributions which define how letter pairs align with each other in the simulated signal region. For DNA sequences, we choose three P s and five Q s. The three P distributions are: uniform, A rich, and GC rich. The five Q distributions are: all matches have equal probability which is higher than the probability for mismatches, AA pair rich, AA pair poor, GG & CC rich, and GG & CC poor. For amino acid sequences, we choose two P s and three commonly used score matrices: BLOSUM45, BLOSUM62, and BLOSUM80. The Q s (Q, Q 2, Q 3 ) corresponding to the three score matrices are derived by solving equation 2, and the corresponding Q-distribution is given by q ab ¼ p a p b r s(a, b). The two P s are: equal probability for the 20 amino acids and the observed amino acid frequencies in vertebrates. For details about these choices, see Appendix B. Third, for a given size a, we calculate a threshold t a (P, Q), as in equation 3, when a scoring function, s P,Q, defined by P and Q in equation, is used to align the two sequences as follows:. Generate i.i.d. random sequences A and B of length n with background distribution P. 2. Do local alignment of sequence A and sequence B with scoring function s P,Q using the Smith- Waterman algorithm (Smith and Waterman, 98). 3. Repeat steps -2 for R ¼ 0, 000 times and rank the resulting local sequence alignment scores in ascending order. Approximate the value of t a (P, Q) by the upper a percentile of the local alignment scores. Fourth, we approximate the power of testing the hypotheses H 0 versus H when a Q*-distributed alignment is inserted in the two random sequences as follows. The Q* distribution is referred to as the target distribution. We simultaneously calculate the approximate values of TPR, FDR, and the f-statitics with the procedure. The objective of this study is to identify an optimal scoring function to detect the relationship between sequences related through Q*-local alignment. The simulation steps are as follows:. Generate i.i.d. random sequences A and B of length n with background distribution P. 2. For a specific target distribution Q* from the group of Q s, generate a length k* aligned pairs with Q* distribution with k log (n 2 ) ¼ ( þ e) H(Q, P 2 ), (5) where e is a factor making the length of Q* segment somewhat larger than the expected length under the null model. To generate a Q* segment, we create a potential local alignment by independently drawing k* letter pairs from the Q* distribution. We refer to the generated aligned segment as a Q*-local alignment. 3. Let the generated Q*-local alignment be p ¼ p, where p p and p 2 are sequences of length k*. 2 Replace part of sequences A and B at random positions with p and p 2, respectively as shown in Figure. Define the resulting sequences as A* and B*. 4. Do local sequence alignment between sequence A* and sequence B* with scoring function s P,Q using the Smith-Waterman algorithm.

7 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 683 FIG.. Insert Q*-local alignment into the sequences A and B. We repeat the above four steps for R 2 ¼,000 times, and the power of detecting the relationship between the two sequences with the Q*-local alignment inserted using score s P,Q is approximated by the fraction of times that the resulting local alignment score is at least t a (P, Q). From the classification point of view, we are interested in the expectation of TPR, FDR and the f-statistic. Let TP r, k r and k 0 r be the estimated corresponding values of TP, k* and k 0 in the r-th experiments, r ¼, 2,, R 2. Then TPR r, FDR r and f r, r ¼, 2,, R 2 are estimated by TPR r ¼ TP r k r, FDR r ¼ TP r kr 0, f r ¼ 2TP r kr þ : k0 r The expectation of TPR, FDR and f-statistic can be approximated by ^E(TPR) ¼ R 2 X R 2 r¼ ^E(FDR) ¼ R 2 X R 2 ^E(f ) ¼ R 2 X R 2 r¼ TP r kr, r¼ TP r kr 0, 2TP r kr þ : k0 r 3. RESULTS In the simulations, we let the size a be 0.0 and 0.05 and e in equation 5 be The length n of sequences is 0,000. The simulation results are shown in Tables 2 7. In each table, the column direction represents a condition when one target Q*-local alignment is inserted, and the row direction represents the results using the log-likelihood scoring function derived from one P and one Q. From Table 2, by comparing the power among different Qs under the same P and Q*, it can be seen that the test has the largest power when Q equals to Q*. In other words, the highest power appears diagonally. For example, the power of the tests using loglikelihood ratio scoring functions corresponding to Q to Q 5 when P ¼ P 2 and Q* ¼ Q 3 are 0.6, 0.5, 0.7, 0.68, and 0.55, respectively, for test size a ¼ 0.0. The largest power among the five tests is 0.7 when Q ¼ Q 3. The other tables can be viewed similarly. Tables 2 4 are for DNA sequences. When Q ¼ Q*, we obtain the highest power, TPR, the f-statistic, and the lowest FDR. That is, the scoring function derived from Q* is the most powerful scoring function to detect Q*-local alignment. Tables 5 7 are for amino acid sequences. Similar conclusions as for DNA sequences are obtained. The power of the test based on scoring function derived from Q reach the highest when Q ¼ Q*, no matter whether a ¼ 0.0 or a ¼ It can also be seen from Table 5 that the power of the test based on s P,Q decreases as the distance between Q and Q* increases. For example, when Q* ¼ Q corresponding to BLOSSOM45 and P ¼ P 2, the power of the tests based on BLOSUM45, BLOSSOM62, and BLOSSOM80 is 0.83, 0.75, and 0.56, respectively. Table 6 gives the TPR and FDR for the tests using different scoring functions. When the target distribution is Q 3 corresponding to BLOSUM80, the TPR of the test using scoring functions s P, Q, s P, Q2, and s P, Q3 is close to 80%. On the other hand, Tables 6 and 7 show that the FDR is lowest and the f-statistic is the highest when the scoring function s P, Q is used.

8 684 MENG ET AL. Table 2. Power of the Tests Based on Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted Target Q*, a ¼ 0.0 Target Q*, a ¼ 0.05 Scoring function Q Q 2 Q 3 Q 4 Q 5 Q Q 2 Q 3 Q 4 Q 5 s PQ s PQ s PQ s PQ s PQ s P2Q s P2Q s P2Q s P2Q s P2Q s P3Q s P3Q s P3Q s P3Q s P3Q Test size a ¼ 0.0 or 0.05 (DNA sequences). 4. DISCUSSION Sequence alignments have been widely used to compare nucleotide and amino acid sequences. For a given scoring function, the local alignment score between two sequences is first obtained through a dynamic programming algorithm or a method such as BLAST (Altschul et al., 990), and a p-value or E-value can be calculated. Log-likelihood ratio scoring functions based on known aligned sequences were derived for sequence comparisons by (Dayhoff et al., 978) and (Henikoff and Henikoff, 992). Previous studies showed the superiority of the log-likelihood ratio scoring function by evaluating whether it can successfully identify genes within the same family (Henikoff and Henikoff, 993). It has also been argued that all reasonable substitution scoring functions are implicitly log-odds scoring functions (Karlin and Altschul, 990; Karlin Table 3. True Positive Rate (TPR) and False Discovery Rate (FDR) Using Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (DNA Sequences) Target Q*, TPR Target Q*, FDR Scoring function Q Q 2 Q 3 Q 4 Q 5 Q Q 2 Q 3 Q 4 Q 5 s PQ s PQ s PQ s PQ s PQ s P2Q s P2Q s P2Q s P2Q s P2Q s P3Q s P3Q s P3Q s P3Q s P3Q

9 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 685 Table 4. f-statistic Using Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (DNA Sequences) Target distribution Q* Scoring function Q Q 2 Q 3 Q 4 Q 5 s PQ s PQ s PQ s PQ s PQ s P2Q s P2Q s P2Q s P2Q s P2Q s P3Q s P3Q s P3Q s P3Q s P3Q Table 5. Power of Tests Based on Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted Target Q*, a ¼ 0.0 Target Q*, a ¼ 0.05 Scoring function Q Q 2 Q 3 Q Q 2 Q 3 s PQ s PQ s PQ s P2Q s P2Q s P2Q Test size a ¼ 0.0 or 0.05 (amino acid sequences). The target distributions Q, Q 2, and Q 3 correspond to BLOSUM45, BLOSUM62, and BLOSUM80, respectively. Table 6. True Positive Rate (TPR) and False Discovery Rate (FDR) of the Tests Based on Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (Amino Acid Sequences) Target Q*, TPR Target Q*, FDR Scoring function Q Q 2 Q 3 Q Q 2 Q 3 s PQ s PQ s PQ s P2Q s P2Q s P2Q The target distributions Q, Q 2, and Q 3 correspond to BLOSUM45, BLOSUM62, and BLOSUM80, respectively.

10 686 MENG ET AL. Table 7. f-statistic of the Tests Based Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (Amino Acid Sequences) Target Q* Scoring function Q Q 2 Q 3 s PQ s PQ s PQ s P2Q s P2Q s P2Q The target distributions Q, Q 2, and Q 3 correspond to BLOSUM45, BLOSUM62, and BLOSUM80, respectively. et al., 990; Altschul, 99), i.e., log-likelihood ratio scoring function. For a given scoring function s(, ), it has been shown that the probability that a is aligned to b in the best aligned segment is p a p b r s(a, b) with r being the largest root of the equation 2. For a given distribution Q for the aligned segment, it is possible to define a scoring function by the log-likelihood ratio between the Q distribution and the P distribution. Thus, scoring functions and target Q-distributions are coupled. Suppose that two sequences are related through a target distribution Q* in an aligned segment. Intuitively, the scoring function defined by the log-likelihood ratio between Q* and P distributions should be used. However, to the best of our knowledge, no studies have been carried out to prove or dispute this claim. In this article, we regard sequence alignment as a hypothesis testing problem, and study the power of tests based on different scoring functions for detecting the relationship between two sequences. For our studies, aligned segments were randomly inserted into the two sequences. The results from our simulations indicate that the log-likelihood ratio scoring function is the most powerful scoring function to detect segments of Q distribution using the scoring function s P,Q, as it has the highest power, TPR, and f-statistic, and the lowest FDR. However, we cannot mathematically rigorously prove that the log-likelihood ratio scoring function is optimal. In our simulation studies, we tried to choose a set of Q distributions as representative as possible. As the Q can be sampled in a continuous space with 5 degrees of freedom for DNA sequences and 399 degrees of freedom for amino acid sequences, we cannot search over all possibilities for Q. We chose representative Qs from the sampling space, compared the scoring functions derived from these Qs, and used those values to provide evidence to show that the log-likelihood ratio scoring function is most powerful. The field of sequence alignment lacks a proof of the claim we have conjectured. While for fixed length alignment the Neyman-Pearson Lemma holds, the distribution related to equation 2 is only true asymptotically. Thus we believe our result will only be proven as an asymptotic result. None the less it would be a significant advance for the general theory and practice of sequence alignment. Our study has several limitations. First, we assume that the two sequences are i.i.d in the null model. In many situations, Markovian models fit the sequences much better than the i.i.d model. Under the Markovian model, the log-likelihood scoring function will become more complex and can depend on adjacent pairs. We are confident that our results hold in the more general setting, as for example the asymptotic distribution for local alignment holds here. Second and more important, gaps should be considered in many alignment problems. Currently theoretical results are not available for local alignment on the length of gaps nor for aligned letter pairs for two random sequences. We conjecture that there is an asymptotic distribution for local alignment in this case as well, so long as n E PS(A, B) 5 0, where S(A, B) is the global alignment of the sequences A and B of length n. The condition states that the per letter score accumulation is negative. The asymptotic distribution even in the i.i.d. case for gaps will depend on the letter composition aligned to the gaps. However summing over the composition will give a gap length distribution. The lack of theoretical results makes the design of simulation studies difficult. We hypothesize that the optimal scoring function is still the log-likelihood scoring function. Further studies are

11 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 687 needed to prove or dispute this hypothesis. Such a result would be a significant extension of the results of Arratia et al. (988) and Dembo et al. (994). Similar results will hold for multiple local alignment. 5. APPENDIX A: Duality between scoring functions and log-likelihood ratio scores Claim. For any given P and Q distributions, the log-likelihood scoring function s P,Q defined in equation bys P, Q (a, b) ¼ log q ab p a p b satisfies the conditions: () max a, b2l s P, Q (a, b) 4 0 and (2) E P s P, Q (a, b) 5 0, unless q ab ¼ p a p b for all a, b 2L. Proof. If q ab ¼ p a p b for all a, b 2L, then s P,Q (a, b) ¼ 0 for all a, b 2L. Next assume that q ab 6¼ p a p b for some a, b. Then there must exist a, b 2L, such that q a b 4 p a p b. Thus, s P,Q(a*, b*) > 0. By Jensen s inequality, if X is a random variable and X is not a constant with probability, and g(x) is a strictly concave function, then E(g(X)) 5 g(ex). Applying this inequality with g(x) ¼ log(x), we have E P (s(a, b)) ¼ X p a p b s P, Q (a, b) a, b2l ¼ X p a p b log q ab p a, b2l a p b 5 log X q ab p a p b p a, b2l a p b ¼ log X q ab ¼ 0: B: The choices of P and Q distributions a, b2l In our study, we choose several P and Q distributions to provide evidence that the log-likelihood scoring function yields the highest power, TPR, the f-statistic, and the lowest FDR. It is important to choose such distributions so that they cover as many possibilities as possible. In this study, we choose P and Q distributions for DNA sequences. First, we consider equally likely distribution, i.e., p A ¼ p C ¼ p G ¼ p T ¼ 4. Second, we consider one-letter rich situation, e.g., A -rich. The background distribution pattern is set as p A ¼ 4 þ 3d, p C ¼ p G ¼ p T ¼ 4 d. If we let d be 20, then p A ¼ 2 5, p C ¼ p G ¼ p T ¼ 5. Third, we consider two-letter rich situation, e.g., G and C. p C ¼ p G ¼ 4 þ d 2, p A ¼ p T ¼ 4 d 2. When d 2 ¼ 2, p C ¼ p G ¼ 3, p A ¼ p T ¼ 6. In summary, we set three types of P s: equally likely, A rich, and G-C rich, and denote them as P, P 2, and P 3, respectively, as shown in Table 8. We choose the group of Q distributions for DNA sequences as follows. First, we let the probability of matches be larger than that for the mismatches. We choose Q as q ab ¼ 6 þ e ¼ q match, a ¼ b, 6 3 e ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e ¼ 6, q match ¼ 8, q mismatch ¼ 24. & Table 8. Background Distribution P for DNA Sequences Used in the Simulations P 4 P P 3 6 P A P C P G P T

12 688 MENG ET AL. Second, we consider the situation that the match of one specific letter, e.g., A, is preferred than the match for other letters. We consider AA pair rich Q-local alignment. So we choose Q as follows. 8 < 6 þ 3e 2, a=b=a, Q ab ¼ 6 þ e 2, a=b6=a, : 6 2 e 2 ¼ Q mismatch, a6=b, where a, b 2fA, C, G, Tg. When e 2 ¼ 6, q AA ¼ 4, q CC ¼ q GG ¼ q TT ¼ 8, q mismatch ¼ 32. Third, instead of letting AA pair to be enriched in the aligned region, we let another match, e.g., GG, be enriched. We set Q as follows. The reason for choosing Q this way is to see what happens if the enriched matches in the aligned part is different from the most abundant nucleotide in the background sequences. 8 < 6 þ 3e 3, a ¼ b ¼ G, q ab ¼ 6 þ e 3, a ¼ b 6¼ G, : 6 2 e 3 ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e 3 ¼ 6, q GG ¼ 4, q AA ¼ q CC ¼ q TT ¼ 8, q mismatch ¼ 32. Fourth, we set Q so that matches for two letters are enriched, e.g., CC and GG. Note that C and G are the enriched nucleotides for P 3 given above. 8 < 6 þ 2e 4, a ¼ b ¼ CorG, q ab ¼ 6 þ e 4, a ¼ b ¼ AorT, : 6 2 e 4 ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e 4 ¼ 6, q CC ¼ q GG ¼ 3 6, q AA ¼ q TT ¼ 8, q mismatch ¼ 32. Fifth, we let AA and TT be enriched in Q. Note that the enriched matches in the Q-local alignments are different from the enriched nucleotides in P 3. 8 < 6 þ 2e 5, a ¼ b ¼ AorT, q ab ¼ 6 þ e 5, a ¼ b ¼ CorG, : 6 2 e 5 ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e 5 ¼ 6, q AA ¼ q TT ¼ 3 6, q CC ¼ q GG ¼ 8, Q mismatch ¼ 32. In summary, we have five Q s Q, Q 2,, Q 5 as described above. For P as the uniform distribution P A ¼ P C ¼ P G ¼ P T ¼ 4, Table 9 shows the Q s we choose and the corresponding scores as well as the expectations of the scores. Choices of P s and Q s for amino acid sequences. We choose the commonly used BLOSUM45, BLOSUM62 and BLOSUM80 as the group of scoring functions, and the corresponding Q s(q, Q 2, Q 3 ) are derived from the three scoring functions through solving equation 2 of l, respectively. We choose two P-distributions. The first gives equal probability to all the amino acids (P ) and the other one is the observed amino acid frequencies in vertebrates (P 2 ) as shown in Table 0. Table 9. Q Distributions Used in the Simulations and the Corresponding Scores under the Equally Likely Background Distribution P, as well as the Expectation of the Scores Q (e ¼ 6 ) Q 2(e 2 ¼ 6 ) Q 3(e 3 ¼ 6 ) Q 4(e 4 ¼ 6 ) Q 5(e 5 ¼ 6 ) Q M: 8 M(AA): 4 M(GG): 4 M(GG, CC): 3 6 M(AA, TT): 3 6 M(CC, GG, TT): 8 M(AA, CC, TT): 8 M(AA, TT): 8 M(GG, CC): 8 N: 24 M: 0.69 N: 32 M(AA):.39 N: 32 M(GG):.39 N: 32 M(GG, CC):.0 N: 32 M(AA, TT):.0 s PQ M(CC, GG, TT): 0.69 M(AA, CC, TT): 0.69 M(AA, TT): 0.69 M(GG, CC): 0.69 N: 0.4 N: 0.69 N: 0.69 N: 0.69 N: 0.69 E P (s Q ) M, match; N, mismatch.

13 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 689 Table 0. Observed Amino Acid Frequencies in Vertebrates (P 2 ) A Alanine 7.4% R Arginine 4.2% N Asparagine 4.4% D Aspartic acid 5.9% C Cysteine 3.3% Q Glutamine 3.7% E Glutamic acid 5.8% G Glycine 7.4% H Histidine 2.9% I Isoleucine 3.8% L Leucine 7.6% K Lysine 7.2% M Methionine.8% F Phenylalanine 4.0% P Proline 5.0% S Serine 8.% T Threonine 6.2% W Tryptophan.3% Y Tyrosine 3.3% V Valine 6.8% C: The algorithm for simulation studies Algorithm Procedure flow Input: P,, P K, Q,, Q L, n and e Output: power a ¼ 0.0, power a ¼ 0.05, TPR, FDR, f for k ¼ ; k K; k þþdo for j ¼ ; j L; j þþdo for r ¼ ; r R ; r þþdo generate sequence A and sequence B with background P k ; local alignment between A and B using Q j ; record S Pk, Q j (A, B); end Rank S Pk, Q j (A, B); T a, j ¼ (ar ) th S Pk, Q j (A, B); end for i ¼ ; i L; i þþdo Q* ¼ Q i ; for r ¼ ; r R 2 ; r þþdo generate sequence A and sequence B with background P k ; plug in Q*-segment to generate new sequences A* and B*; for j ¼ ; j L; j þþdo local alignment between A* and B* using Q j ; record S Pk, Q j (A, B ); if S Pk, Q j (A, B ) T a, j then Flag r (P k, Q*, Q j ) ¼ ; else Flag r (P k, Q*, Q j ) ¼ 0; end calculate TPR r (P k, Q*, Q j ), FDR r (P k, Q*, Q j ), f r (P k, Q*, Q j ) end end for j ¼ ; j L; j þþdo

14 690 MENG ET AL. P R power Pk, Q, Q j ¼P R TPR Pk, Q, Q j ¼ P R FDR Pk, Q, Q j ¼ f Pk, Q, Q j ¼ end end end P R Flag r ¼ r(p k, Q, Q j ) R 2 ; TPR r ¼ r(p k, Q, Q j ) R 2 ; FDR r ¼ r(p k, Q, Q j ) R 2 ; r ¼ f r(p k, Q, Q j ) R 2 ; ACKNOWLEDGMENTS This work was supported by the NSFC (grant to L.M., X.Z.; grant to L.M.; grant to F.S., X.Z.; grant to F.S.) and the NIH (grant P50HG to F.S., M.S.W.; grant R2AG to F.S., M.S.W.). No competing financial interests exist. DISCLOSURE STATEMENT REFERENCES Adachi, J., and Hasegawa, M Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42, Altschul, S., Wootton, J., Zaslavsky, E., et al The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput. Biol. 6, e Altschul, S.F. 99. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 29, Altschul, S.F., Gish, W., Miller, W., et al Basic local alignment search tool. J. Mol. Biol. 25, Altschul, S.F., Madden, T.L., Schaffer, A.A., et al Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, Arratia, R., Morris, P., and Waterman, M Stochastic scrabble: large deviations for sequences with scores. J. Appl. Probab. 25, Cao, M.D., Dix, T.I., and Allison, L Computing substitution matrices for genomic comparative analysis. Proc. 3th Pacific-Asia Conf. Adv. Knowledge Discov. Data Mining Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, Dembo, A., Karlin, S., and Zeitouni, O Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Appl. Probab. 22, Do, C.B., Mahabhashyam, M.S., Brudno, M., et al Probcons: probabilistic consistency-based multiple sequence alignment. Genome Res. 5, Gonnet, G.H., Cohen, M.A., and Benner, S.A Exhaustive matching of the entire protein sequence database. Science 256, Henikoff, S., and Henikoff, J.G Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89, Henikoff, S., and Henikoff, J.G Performance evaluation of amino acid substitution matrices. Proteins 7, Holmes, I., and Durbin, R Dynamic programming alignment accuracy. J. Comput. Biol. 5, Jones, D.T., Taylor, W.R., and Thornton, J.M The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, Karlin, S., and Altschul, S Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87, Karlin, S., Dembo, A., and Kawabata, T Statistical composition of high-scoring segments from molecular sequences. Ann. Stat. 8,

15 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 69 Kent, W.J BLAT the BLAST-like alignment tool. Genome Res. 2, Langmead, B., Trapnell, C., Pop, M., et al Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 0, R25. Müler, T., Spang, R., and Vingron, M Estimating amino acid substitution models: a comparison of Dayhoff s estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol. 9, 8 3. Needleman, S.B., and Wunsch, C.D A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, Neyman, J., and Pearson, E.S On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. A 23, Overington, J., Donnelly, D., Johnson, M.S., et al Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci., Pearson, W.R Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 83, Polyanovsky, V., Roytberg, M.A., and Tumanyan, V.G Reconstruction of genuine pairwise sequence alignment. J. Comput. Biol. 5, Prlić, A., Domingues, F.S., and Sippl, M.J Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng. 3, Qian, B., and Goldstein, R.A Optimization of a new score function for the generation of accurate alignments. Proteins 48, Shi, J., Blundell, T.L., and Mizuguchi, K FUGUE: sequence-structure homology recognition using environmentspecific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 30, Smith, T.F., and Waterman, M.S. 98. Identification of common molecular subsequences. J. Mol. Biol. 47, Subramanian, A., Kaufmann, M., and Morgenstern, B DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol. Biol. 3, 6. Subramanian, A., Weyer-Menkhoff, J., Kaufmann, M., et al DIALIGN-T: an improved algorithm for segmentbased multiple sequence alignment. BMC Bioinform. 6, 66. Thompson, J.D., Higgins, D.G., and Gibson, T.J CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, Torda, A.E., Procter, J.B., and Huber, T Wurst: a protein threading server with a structural scoring function, sequence profiles and optimized substitution matrices. Nucleic Acids Res. 32, W532 W535. Veerassamy, S., Smith, A., and Tillier, E.R.M A transition probability model for amino acid substitutions from blocks. J. Comput. Biol. 0, Waterman, M., and Vingron, M Sequence comparison significance and Poisson approximation. Stat. Sci. 9, Waterman, M.S Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall, New York. Webb, B.M., Liu, J.S., and Lawrence, C.E BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res. 30, Zhu, J., Liu, J.S., and Lawrence, C.E Bayesian adaptive sequence alignment algorithms. Bioinformatics 4, Address correspondence to: Dr. Michael S. Waterman Molecular and Computational Biology University of Southern California 050 Childs Way, RRI 20 Los Angeles, CA msw@usc.edu

16

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

Optimization of a New Score Function for the Detection of Remote Homologs

Optimization of a New Score Function for the Detection of Remote Homologs PROTEINS: Structure, Function, and Genetics 41:498 503 (2000) Optimization of a New Score Function for the Detection of Remote Homologs Maricel Kann, 1 Bin Qian, 2 and Richard A. Goldstein 1,2 * 1 Department

More information

Sequence Database Search Techniques I: Blast and PatternHunter tools

Sequence Database Search Techniques I: Blast and PatternHunter tools Sequence Database Search Techniques I: Blast and PatternHunter tools Zhang Louxin National University of Singapore Outline. Database search 2. BLAST (and filtration technique) 3. PatternHunter (empowered

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Scoring Matrices. Shifra Ben-Dor Irit Orr

Scoring Matrices. Shifra Ben-Dor Irit Orr Scoring Matrices Shifra Ben-Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

Single alignment: Substitution Matrix. 16 march 2017

Single alignment: Substitution Matrix. 16 march 2017 Single alignment: Substitution Matrix 16 march 2017 BLOSUM Matrix BLOSUM Matrix [2] (Blocks Amino Acid Substitution Matrices ) It is based on the amino acids substitutions observed in ~2000 conserved block

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas Informal inductive proof of best alignment path onsider the last step in the best

More information

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013 Sequence Alignments Dynamic programming approaches, scoring, and significance Lucy Skrabanek ICB, WMC January 31, 213 Sequence alignment Compare two (or more) sequences to: Find regions of conservation

More information

Local Alignment Statistics

Local Alignment Statistics Local Alignment Statistics Stephen Altschul National Center for Biotechnology Information National Library of Medicine National Institutes of Health Bethesda, MD Central Issues in Biological Sequence Comparison

More information

A profile-based protein sequence alignment algorithm for a domain clustering database

A profile-based protein sequence alignment algorithm for a domain clustering database A profile-based protein sequence alignment algorithm for a domain clustering database Lin Xu,2 Fa Zhang and Zhiyong Liu 3, Key Laboratory of Computer System and architecture, the Institute of Computing

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Sequence Alignment Techniques and Their Uses

Sequence Alignment Techniques and Their Uses Sequence Alignment Techniques and Their Uses Sarah Fiorentino Since rapid sequencing technology and whole genomes sequencing, the amount of sequence information has grown exponentially. With all of this

More information

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas

Sequence comparison: Score matrices. Genome 559: Introduction to Statistical and Computational Genomics Prof. James H. Thomas Sequence comparison: Score matrices Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best alignment path onsider the last step in

More information

Sequence comparison: Score matrices

Sequence comparison: Score matrices Sequence comparison: Score matrices http://facultywashingtonedu/jht/gs559_2013/ Genome 559: Introduction to Statistical and omputational Genomics Prof James H Thomas FYI - informal inductive proof of best

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation

Edward Susko Department of Mathematics and Statistics, Dalhousie University. Introduction. Installation 1 dist est: Estimation of Rates-Across-Sites Distributions in Phylogenetic Subsititution Models Version 1.0 Edward Susko Department of Mathematics and Statistics, Dalhousie University Introduction The

More information

Pairwise sequence alignment

Pairwise sequence alignment Department of Evolutionary Biology Example Alignment between very similar human alpha- and beta globins: GSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKL G+ +VK+HGKKV A+++++AH+D++ +++++LS+LH KL GNPKVKAHGKKVLGAFSDGLAHLDNLKGTFATLSELHCDKL

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2013 Week3: Blast Algorithm, theory and practice Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and Systems Biology

More information

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9 Lecture 5 Alignment I. Introduction. For sequence data, the process of generating an alignment establishes positional homologies; that is, alignment provides the identification of homologous phylogenetic

More information

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types Exp 11- THEORY Sequence Alignment is a process of aligning two sequences to achieve maximum levels of identity between them. This help to derive functional, structural and evolutionary relationships between

More information

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM).

Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). 1 Bioinformatics: In-depth PROBABILITY & STATISTICS Spring Semester 2011 University of Zürich and ETH Zürich Lecture 4: Evolutionary models and substitution matrices (PAM and BLOSUM). Dr. Stefanie Muff

More information

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) Bioinformática Sequence Alignment Pairwise Sequence Alignment Universidade da Beira Interior (Thanks to Ana Teresa Freitas, IST for useful resources on this subject) 1 16/3/29 & 23/3/29 27/4/29 Outline

More information

BLAST: Target frequencies and information content Dannie Durand

BLAST: Target frequencies and information content Dannie Durand Computational Genomics and Molecular Biology, Fall 2016 1 BLAST: Target frequencies and information content Dannie Durand BLAST has two components: a fast heuristic for searching for similar sequences

More information

Quantifying sequence similarity

Quantifying sequence similarity Quantifying sequence similarity Bas E. Dutilh Systems Biology: Bioinformatic Data Analysis Utrecht University, February 16 th 2016 After this lecture, you can define homology, similarity, and identity

More information

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models

Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm. Alignment scoring schemes and theory: substitution matrices and gap models Lecture 2, 5/12/2001: Local alignment the Smith-Waterman algorithm Alignment scoring schemes and theory: substitution matrices and gap models 1 Local sequence alignments Local sequence alignments are necessary

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Protein Secondary Structure Prediction

Protein Secondary Structure Prediction part of Bioinformatik von RNA- und Proteinstrukturen Computational EvoDevo University Leipzig Leipzig, SS 2011 the goal is the prediction of the secondary structure conformation which is local each amino

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Protein sequence alignment with family-specific amino acid similarity matrices

Protein sequence alignment with family-specific amino acid similarity matrices TECHNICAL NOTE Open Access Protein sequence alignment with family-specific amino acid similarity matrices Igor B Kuznetsov Abstract Background: Alignment of amino acid sequences by means of dynamic programming

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55 Pairwise Alignment Guan-Shieng Huang shieng@ncnu.edu.tw Dept. of CSIE, NCNU Pairwise Alignment p.1/55 Approach 1. Problem definition 2. Computational method (algorithms) 3. Complexity and performance Pairwise

More information

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT 3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT.03.239 25.09.2012 SEQUENCE ANALYSIS IS IMPORTANT FOR... Prediction of function Gene finding the process of identifying the regions of genomic DNA that encode

More information

Substitution matrices

Substitution matrices Introduction to Bioinformatics Substitution matrices Jacques van Helden Jacques.van-Helden@univ-amu.fr Université d Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University

Sequence Alignment: Scoring Schemes. COMP 571 Luay Nakhleh, Rice University Sequence Alignment: Scoring Schemes COMP 571 Luay Nakhleh, Rice University Scoring Schemes Recall that an alignment score is aimed at providing a scale to measure the degree of similarity (or difference)

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of omputer Science San José State University San José, alifornia, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri Pairwise Sequence Alignment Homology

More information

Tools and Algorithms in Bioinformatics

Tools and Algorithms in Bioinformatics Tools and Algorithms in Bioinformatics GCBA815, Fall 2015 Week-4 BLAST Algorithm Continued Multiple Sequence Alignment Babu Guda, Ph.D. Department of Genetics, Cell Biology & Anatomy Bioinformatics and

More information

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, Grundlagen der Bioinformatik, SS 08, D. Huson, May 2, 2008 39 5 Blast This lecture is based on the following, which are all recommended reading: R. Merkl, S. Waack: Bioinformatik Interaktiv. Chapter 11.4-11.7

More information

Introduction to Bioinformatics

Introduction to Bioinformatics Introduction to Bioinformatics Jianlin Cheng, PhD Department of Computer Science Informatics Institute 2011 Topics Introduction Biological Sequence Alignment and Database Search Analysis of gene expression

More information

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009

8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009 8 Grundlagen der Bioinformatik, SS 09, D. Huson, April 28, 2009 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number

More information

Pairwise & Multiple sequence alignments

Pairwise & Multiple sequence alignments Pairwise & Multiple sequence alignments Urmila Kulkarni-Kale Bioinformatics Centre 411 007 urmila@bioinfo.ernet.in Basis for Sequence comparison Theory of evolution: gene sequences have evolved/derived

More information

Evaluation Measures of Multiple Sequence Alignments. Gaston H. Gonnet, *Chantal Korostensky and Steve Benner. Institute for Scientic Computing

Evaluation Measures of Multiple Sequence Alignments. Gaston H. Gonnet, *Chantal Korostensky and Steve Benner. Institute for Scientic Computing Evaluation Measures of Multiple Sequence Alignments Gaston H. Gonnet, *Chantal Korostensky and Steve Benner Institute for Scientic Computing ETH Zurich, 8092 Zuerich, Switzerland phone: ++41 1 632 74 79

More information

Biologically significant sequence alignments using Boltzmann probabilities

Biologically significant sequence alignments using Boltzmann probabilities Biologically significant sequence alignments using Boltzmann probabilities P. Clote Department of Biology, Boston College Gasson Hall 416, Chestnut Hill MA 02467 clote@bc.edu May 7, 2003 Abstract In this

More information

Probalign: Multiple sequence alignment using partition function posterior probabilities

Probalign: Multiple sequence alignment using partition function posterior probabilities Sequence Analysis Probalign: Multiple sequence alignment using partition function posterior probabilities Usman Roshan 1* and Dennis R. Livesay 2 1 Department of Computer Science, New Jersey Institute

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

Scoring Matrices. Shifra Ben Dor Irit Orr

Scoring Matrices. Shifra Ben Dor Irit Orr Scoring Matrices Shifra Ben Dor Irit Orr Scoring matrices Sequence alignment and database searching programs compare sequences to each other as a series of characters. All algorithms (programs) for comparison

More information

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences

Statistical Distributions of Optimal Global Alignment Scores of Random Protein Sequences BMC Bioinformatics This Provisional PDF corresponds to the article as it appeared upon acceptance. The fully-formatted PDF version will become available shortly after the date of publication, from the

More information

Similarity or Identity? When are molecules similar?

Similarity or Identity? When are molecules similar? Similarity or Identity? When are molecules similar? Mapping Identity A -> A T -> T G -> G C -> C or Leu -> Leu Pro -> Pro Arg -> Arg Phe -> Phe etc If we map similarity using identity, how similar are

More information

BLAST: Basic Local Alignment Search Tool

BLAST: Basic Local Alignment Search Tool .. CSC 448 Bioinformatics Algorithms Alexander Dekhtyar.. (Rapid) Local Sequence Alignment BLAST BLAST: Basic Local Alignment Search Tool BLAST is a family of rapid approximate local alignment algorithms[2].

More information

Large-Scale Genomic Surveys

Large-Scale Genomic Surveys Bioinformatics Subtopics Fold Recognition Secondary Structure Prediction Docking & Drug Design Protein Geometry Protein Flexibility Homology Modeling Sequence Alignment Structure Classification Gene Prediction

More information

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) Alignment principles and homology searching using (PSI-)BLAST Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU) http://ibivu.cs.vu.nl Bioinformatics Nothing in Biology makes sense except in

More information

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011

8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 8 Grundlagen der Bioinformatik, SoSe 11, D. Huson, April 18, 2011 2 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance and alignment 4. The number

More information

Lecture 5,6 Local sequence alignment

Lecture 5,6 Local sequence alignment Lecture 5,6 Local sequence alignment Chapter 6 in Jones and Pevzner Fall 2018 September 4,6, 2018 Evolution as a tool for biological insight Nothing in biology makes sense except in the light of evolution

More information

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences 140.638 where do sequences come from? DNA is not hard to extract (getting DNA from a

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment C E N T R E F O R I N T E G R A T I V E B I O I N F O R M A T I C S V U Introduction to bioinformatics 2007 Lecture 5 Pair-wise Sequence Alignment Bioinformatics Nothing in Biology makes sense except in

More information

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm

Introduction to sequence alignment. Local alignment the Smith-Waterman algorithm Lecture 2, 12/3/2003: Introduction to sequence alignment The Needleman-Wunsch algorithm for global sequence alignment: description and properties Local alignment the Smith-Waterman algorithm 1 Computational

More information

Sequence Analysis '17 -- lecture 7

Sequence Analysis '17 -- lecture 7 Sequence Analysis '17 -- lecture 7 Significance E-values How significant is that? Please give me a number for......how likely the data would not have been the result of chance,......as opposed to......a

More information

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018 CONCEPT OF SEQUENCE COMPARISON Natapol Pornputtapong 18 January 2018 SEQUENCE ANALYSIS - A ROSETTA STONE OF LIFE Sequence analysis is the process of subjecting a DNA, RNA or peptide sequence to any of

More information

Alignment & BLAST. By: Hadi Mozafari KUMS

Alignment & BLAST. By: Hadi Mozafari KUMS Alignment & BLAST By: Hadi Mozafari KUMS SIMILARITY - ALIGNMENT Comparison of primary DNA or protein sequences to other primary or secondary sequences Expecting that the function of the similar sequence

More information

Sequence analysis and comparison

Sequence analysis and comparison The aim with sequence identification: Sequence analysis and comparison Marjolein Thunnissen Lund September 2012 Is there any known protein sequence that is homologous to mine? Are there any other species

More information

Do Aligned Sequences Share the Same Fold?

Do Aligned Sequences Share the Same Fold? J. Mol. Biol. (1997) 273, 355±368 Do Aligned Sequences Share the Same Fold? Ruben A. Abagyan* and Serge Batalov The Skirball Institute of Biomolecular Medicine Biochemistry Department NYU Medical Center

More information

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético

Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Finding the Best Biological Pairwise Alignment Through Genetic Algorithm Determinando o Melhor Alinhamento Biológico Através do Algoritmo Genético Paulo Mologni 1, Ailton Akira Shinoda 2, Carlos Dias Maciel

More information

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Biochemistry 324 Bioinformatics. Pairwise sequence alignment Biochemistry 324 Bioinformatics Pairwise sequence alignment How do we compare genes/proteins? When we have sequenced a genome, we try and identify the function of unknown genes by finding a similar gene

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Inferring Phylogenies from Protein Sequences by. Parsimony, Distance, and Likelihood Methods. Joseph Felsenstein. Department of Genetics

Inferring Phylogenies from Protein Sequences by. Parsimony, Distance, and Likelihood Methods. Joseph Felsenstein. Department of Genetics Inferring Phylogenies from Protein Sequences by Parsimony, Distance, and Likelihood Methods Joseph Felsenstein Department of Genetics University of Washington Box 357360 Seattle, Washington 98195-7360

More information

A New Similarity Measure among Protein Sequences

A New Similarity Measure among Protein Sequences A New Similarity Measure among Protein Sequences Kuen-Pin Wu, Hsin-Nan Lin, Ting-Yi Sung and Wen-Lian Hsu * Institute of Information Science Academia Sinica, Taipei 115, Taiwan Abstract Protein sequence

More information

Chapter 7: Rapid alignment methods: FASTA and BLAST

Chapter 7: Rapid alignment methods: FASTA and BLAST Chapter 7: Rapid alignment methods: FASTA and BLAST The biological problem Search strategies FASTA BLAST Introduction to bioinformatics, Autumn 2007 117 BLAST: Basic Local Alignment Search Tool BLAST (Altschul

More information

Bioinformatics for Biologists

Bioinformatics for Biologists Bioinformatics for Biologists Sequence Analysis: Part I. Pairwise alignment and database searching Fran Lewitter, Ph.D. Head, Biocomputing Whitehead Institute Bioinformatics Definitions The use of computational

More information

Approximate P-Values for Local Sequence Alignments: Numerical Studies. JOHN D. STOREY and DAVID SIEGMUND ABSTRACT

Approximate P-Values for Local Sequence Alignments: Numerical Studies. JOHN D. STOREY and DAVID SIEGMUND ABSTRACT JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 5, 200 Mary Ann Liebert, Inc. Pp. 549 556 Approximate P-Values for Local Sequence Alignments: Numerical Studies JOHN D. STOREY and DAVID SIEGMUND ABSTRACT

More information

Pairwise sequence alignments

Pairwise sequence alignments Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October

More information

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University

BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University BIO 285/CSCI 285/MATH 285 Bioinformatics Programming Lecture 8 Pairwise Sequence Alignment 2 And Python Function Instructor: Lei Qian Fisk University Measures of Sequence Similarity Alignment with dot

More information

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment Sequence Analysis 17: lecture 5 Substitution matrices Multiple sequence alignment Substitution matrices Used to score aligned positions, usually of amino acids. Expressed as the log-likelihood ratio of

More information

Segment-based scores for pairwise and multiple sequence alignments

Segment-based scores for pairwise and multiple sequence alignments From: ISMB-98 Proceedings. Copyright 1998, AAAI (www.aaai.org). All rights reserved. Segment-based scores for pairwise and multiple sequence alignments Burkhard Morgenstern 1,*, William R. Atchley 2, Klaus

More information

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre Bioinformatics Scoring Matrices David Gilbert Bioinformatics Research Centre www.brc.dcs.gla.ac.uk Department of Computing Science, University of Glasgow Learning Objectives To explain the requirement

More information

bioinformatics 1 -- lecture 7

bioinformatics 1 -- lecture 7 bioinformatics 1 -- lecture 7 Probability and conditional probability Random sequences and significance (real sequences are not random) Erdos & Renyi: theoretical basis for the significance of an alignment

More information

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment

SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment SEPA: Approximate Non-Subjective Empirical p-value Estimation for Nucleotide Sequence Alignment Ofer Gill and Bud Mishra Courant Institute of Mathematical Sciences, New York University, 251 Mercer Street,

More information

Sequence Comparison. mouse human

Sequence Comparison. mouse human Sequence Comparison Sequence Comparison mouse human Why Compare Sequences? The first fact of biological sequence analysis In biomolecular sequences (DNA, RNA, or amino acid sequences), high sequence similarity

More information

Exercise 5. Sequence Profiles & BLAST

Exercise 5. Sequence Profiles & BLAST Exercise 5 Sequence Profiles & BLAST 1 Substitution Matrix (BLOSUM62) Likelihood to substitute one amino acid with another Figure taken from https://en.wikipedia.org/wiki/blosum 2 Substitution Matrix (BLOSUM62)

More information

Effects of Gap Open and Gap Extension Penalties

Effects of Gap Open and Gap Extension Penalties Brigham Young University BYU ScholarsArchive All Faculty Publications 200-10-01 Effects of Gap Open and Gap Extension Penalties Hyrum Carroll hyrumcarroll@gmail.com Mark J. Clement clement@cs.byu.edu See

More information

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter

Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Sepp Hochreiter Bioinformatics for Computer Scientists (Part 2 Sequence Alignment) Institute of Bioinformatics Johannes Kepler University, Linz, Austria Sequence Alignment 2. Sequence Alignment Sequence Alignment 2.1

More information

Lecture 15: Realities of Genome Assembly Protein Sequencing

Lecture 15: Realities of Genome Assembly Protein Sequencing Lecture 15: Realities of Genome Assembly Protein Sequencing Study Chapter 8.10-8.15 1 Euler s Theorems A graph is balanced if for every vertex the number of incoming edges equals to the number of outgoing

More information

Motivating the need for optimal sequence alignments...

Motivating the need for optimal sequence alignments... 1 Motivating the need for optimal sequence alignments... 2 3 Note that this actually combines two objectives of optimal sequence alignments: (i) use the score of the alignment o infer homology; (ii) use

More information

Lecture 14 - Cells. Astronomy Winter Lecture 14 Cells: The Building Blocks of Life

Lecture 14 - Cells. Astronomy Winter Lecture 14 Cells: The Building Blocks of Life Lecture 14 Cells: The Building Blocks of Life Astronomy 141 Winter 2012 This lecture describes Cells, the basic structural units of all life on Earth. Basic components of cells: carbohydrates, lipids,

More information

Advanced topics in bioinformatics

Advanced topics in bioinformatics Feinberg Graduate School of the Weizmann Institute of Science Advanced topics in bioinformatics Shmuel Pietrokovski & Eitan Rubin Spring 2003 Course WWW site: http://bioinformatics.weizmann.ac.il/courses/atib

More information

Whole Genome Alignments and Synteny Maps

Whole Genome Alignments and Synteny Maps Whole Genome Alignments and Synteny Maps IINTRODUCTION It was not until closely related organism genomes have been sequenced that people start to think about aligning genomes and chromosomes instead of

More information

Basic Local Alignment Search Tool

Basic Local Alignment Search Tool Basic Local Alignment Search Tool Alignments used to uncover homologies between sequences combined with phylogenetic studies o can determine orthologous and paralogous relationships Local Alignment uses

More information

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel )

Pairwise sequence alignments. Vassilios Ioannidis (From Volker Flegel ) Pairwise sequence alignments Vassilios Ioannidis (From Volker Flegel ) Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs Importance

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

CSE 549: Computational Biology. Substitution Matrices

CSE 549: Computational Biology. Substitution Matrices CSE 9: Computational Biology Substitution Matrices How should we score alignments So far, we ve looked at arbitrary schemes for scoring mutations. How can we assign scores in a more meaningful way? Are

More information

Generalized Affine Gap Costs for Protein Sequence Alignment

Generalized Affine Gap Costs for Protein Sequence Alignment PROTEINS: Structure, Function, and Genetics 32:88 96 (1998) Generalized Affine Gap Costs for Protein Sequence Alignment Stephen F. Altschul* National Center for Biotechnology Information, National Library

More information

Fundamentals of database searching

Fundamentals of database searching Fundamentals of database searching Aligning novel sequences with previously characterized genes or proteins provides important insights into their common attributes and evolutionary origins. The principles

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data

False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data False discovery rate and related concepts in multiple comparisons problems, with applications to microarray data Ståle Nygård Trial Lecture Dec 19, 2008 1 / 35 Lecture outline Motivation for not using

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION Supplementary information S1 (box). Supplementary Methods description. Prokaryotic Genome Database Archaeal and bacterial genome sequences were downloaded from the NCBI FTP site (ftp://ftp.ncbi.nlm.nih.gov/genomes/all/)

More information

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD Department of Computer Science University of Missouri 2008 Free for Academic

More information

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming

20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, Global and local alignment of two sequences using dynamic programming 20 Grundlagen der Bioinformatik, SS 08, D. Huson, May 27, 2008 4 Pairwise alignment We will discuss: 1. Strings 2. Dot matrix method for comparing sequences 3. Edit distance 4. Global and local alignment

More information