SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING METHODS /(X) ¼ I Q(X)

Size: px

Start display at page:

Download "SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING METHODS /(X) ¼ I Q(X)"

Barbra Richard
5 years ago
Views:

1 JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 5, 20 # Mary Ann Liebert, Inc. Pp DOI: 0.089/cmb Research Articles Sequence Alignment as Hypothesis Testing LU MENG, FENGZHU SUN, 2 XUEGONG ZHANG, and MICHAEL S. WATERMAN 2 ABSTRACT Sequence alignment depends on the scoring function that defines similarity between pairs of letters. For local alignment, the computational algorithm searches for the most similar segments in the sequences according to the scoring function. The choice of this scoring function is important for correctly detecting segments of interest. We formulate sequence alignment as a hypothesis testing problem, and conduct extensive simulation experiments to study the relationship between the scoring function and the distribution of aligned pairs within the aligned segment under this framework. We cut through the many ways to construct scoring functions and showed that any scoring function with negative expectation used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. The results indicate that the log-likelihood ratio scoring function is statistically most powerful and has the highest accuracy for detecting the segments of interest that are defined by the statistical distribution of aligned letter pairs. Key words: hypothesis testing, local alignment, power, scoring function, sequence alignment.. INTRODUCTION Sequence alignment is one of the most important problems in computational biology. Similar segments in gene or protein sequences often indicate evolutionary homology or functional relationships between the genes or proteins. Sequence alignment tasks are generally categorized into three different types: () global sequence alignment, which determines the best alignment of the sequences with their entire lengths by adjusting their relative positions and inserting gaps when necessary; (2) local sequence alignment, which determines segments in the sequences that are most similar with each other; and (3) semi-global or fit alignment, which searches for the occurrence of a short query sequence in a large sequence database (Waterman, 995). In any case, a scoring function needs to be defined to evaluate the similarity of the sequences and a computational algorithm is employed to search for the best alignment. The classical algorithms are based on dynamic programming: the Smith-Waterman algorithm (Smith and Waterman, 98) for local alignment and the Needleman-Wunsch algorithm (Needleman and Wunsch, 970) for global alignment. Many heuristic algorithms have been proposed to speed up the search procedure, such as BLAST (Altschul et al., 990), FASTA (Pearson, 990), CLUSTAL W (Thompson et al., 994), PSI-BLAST (Altschul et al., MOE Key Lab of Bioinformatics and Bioinformatics Division of TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China. 2 Molecular and Computational Biology, University of Southern California, Los Angeles, California; and TNLIST/ Department of Automation, Tsinghua University, Beijing, P.R. China. 677

2 678 MENG ET AL. 997), BLAT (Kent, 2002), BALSA (Zhu et al., 998; Webb et al., 2002), ProbCons (Do et al., 2005), DIALIGN-T(X) (Subramanian et al., 2005, 2008), and Bowtie (Langmead et al., 2009). No matter what algorithm is used, the choice of a scoring function is the key to producing good alignments. Several scoring functions have been introduced, most of which are implicitly log-odds matrices (Altschul, 99), i.e., loglikelihood ratio scoring functions. Dayhoff s PAM (Dayhoff et al., 978) and Henikoff s BLOSUM (Henikoff and Henikoff, 992) matrices are generally considered the standard in many applications. The PAM matrices were initially derived based on an explicit evolutionary model of closely related sequences and the observed mutations in these sequences. Dayhoff et al. (978) separated proteins into families, constructed phylogenetic trees for each family, and examined every branch of the resulting trees for substitutions. The score for two letters is defined by the ratio of probability of substitution between the two letters over the expected probability. Some scoring functions such as the JTT ( Jones et al., 992) and the GCB (Gonnet et al., 992) matrices were derived by extrapolation from closely related sequences based on PAM evolutionary model to increase accuracy of homology searches. The VTML matrix (Müler et al., 2002) was obtained by maximum likelihood estimation of Dayhoff s parameters. The BLOSUM matrices were derived based on known multiple aligned blocks of sequences with blocks being the aligned regions without gaps. The score between two amino acids was defined as 2 times log-ratio of the probability that the two amino acids are aligned in the blocks over the corresponding expected probability assuming the sequences are independent. The PMB matrix (Veerassamy et al., 2003) is based on the blocks which BLOSUM used, but added evolutionary distances to form an evolutionary model. Based on the fact that proteins have complicated three-dimensional (3-D) structures, some scoring functions make use of the structure information, such as the STR (Overington et al., 992) and the STROMA (Qian and Goldstein, 2002) matrices. Some scoring functions like SDM (Prlić et al., 2000) were derived from protein pairs of similar structure instead of sequence similarity. The scoring functions used in Fugue (Shi et al., 200) and Wurst (Torda et al., 2004) are based on sequence-structure homology. Bayesian methods have also been employed for constructing scoring functions for multiple sequence alignment, such as BILD (Altschul et al., 200). Many scoring functions are designed for specific applications. Adachi and Hasegawa (996) established a model of amino acid substitution matrix for mitochondrial DNA encoded protein sequences, estimating the score matrix by the maximum likelihood method from mtdna data. Cao et al. (2009) developed scoring functions for DNA sequences based on information theory in the expectation maximization framework. Among these scoring functions, determining which is most powerful to detect the truly related segments for a given sequence study is an important problem. Usually, the scoring function is empirically selected based on some assumptions about the sequences to be compared. Altschul (99) argued that PAM20, among the PAM matrices, is probably most appropriate if only one matrix is used, based on information content of the score matrix measured in relative entropy. Henikoff and Henikoff (993) evaluated the performance of some commonly used substitution matrices, and found that log-likelihood ratio based scoring functions derived directly from multiple alignment data are better for detecting distant relationships than matrices based on PAM evolutionary model and the STR matrix achieved similar performance to BLOSUM62. The performance was evaluated through the efficiency for detecting true amino acid sequences belonging to particular protein families. Several investigators evaluated different scoring functions by comparing the alignments derived from the computer algorithm with the alignments generated by simulations through fidelity, confidence, and overall correctness (Polyanovsky et al., 2008; Holmes and Durbin, 998). In our study, we also compare different scoring functions using these quantities, however with terms that are more commonly used in the field of classification studies (see Section 2. for details). These investigators studied global alignments. In our study, we consider local alignment and general scoring functions without gaps. In this article, we consider sequence alignment as a statistical hypothesis testing problem and define scores using the log-likelihood ratio statistic based on the segments we intend to find. We focus on the relationship between the scoring function and the best aligned segment in the scenario of local sequence alignment. According to the Neyman-Pearson lemma (Neyman and Pearson, 933), the likelihood-ratio statistic is most powerful to distinguish a given distribution of optimal alignment from the background distribution for fixed global alignments. The test statistic provides a form of a scoring function which can be used in the sequence alignment. We conjecture that this log-likelihood ratio scoring function is statistically the most powerful one to detect the best aligned segment. Under the assumption that the compared sequences are independent and identically distributed (i.i.d.) letters sampled from some background distributions, we can treat the sequence alignment problem as a hypothesis testing problem. The null hypothesis is that the sequences are

3 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 679 independent, and the alternative hypothesis is that they are related due to some shared segments which have a given distribution of letter pairs. We choose the score of the best aligned segment, i.e., the highest score during local alignment, as the test statistic. We use the power of the statistic for the hypothesis test as a measurement of the performance of scoring functions. The higher the power, the better the scoring function is. We also take a classification perspective for detecting the aligned segments and use true positive rate (TPR), false discovery rate (FDR), and the f-statistic (a weighted average of TPR and FDR) as alternative criteria for evaluating the scoring functions. We show that the log-likelihood ratio scoring function is most powerful to detect aligned segments following the distribution derived from the scoring function. It applies to both DNA sequences and amino acid sequences. The aim of this article is to cut through the many ways to construct scoring functions and show that any scoring function used in local alignment corresponds to a hypothesis test between the background distribution of sequence letters and a statistical distribution of letter pairs determined by the scoring function. 2.. Theoretical motivation 2. METHODS In this section, we present the basis for the analysis we perform. The first result that motivated our work was the Neyman-Pearson Lemma (Neyman and Pearson, 933). This remarkable result, which has an elegant proof, is central to statistical theory and practice. The setting is hypothesis testing and we present the most elementary form of the Neyman-Pearson Lemma. The data are X, X 2,..., X i.i.d. (independent and identically distributed) from model 0 with distribution P (the null hypothesis H 0 ) or from model with distribution Q (the alternate hypothesis H ). Do the data X ¼ X, X 2,..., X come from model 0 or? A test function f satisfies f(x) in {0, } where we say H 0 is rejected if f(x) ¼. The size of the test or level of significance is P(f(X) ¼ ). Our ideal test function is one which has small size P(f(X) ¼ ) ¼ a and the largest possible power b ¼ Q(f(X) ¼ ). That is, we want a test statistic that has a small probability of rejecting a true null hypothesis, but the largest possible power or probability of rejecting H 0 when H is true. The Neyman-Pearson Lemma states that the most powerful test statistic is /(X) ¼ I Q(X) P(X) t, when P(/(X) ¼ ) ¼ a, where I( ) ¼ when the argument is true and 0 otherwise. This is one reason for the widespread use of likelihood ratio statistics. In sequence alignment, we are interested in pairs of aligned letters from finite alphabet L such as a b.an alignment A with length of n of letters from sequences of random letters A ¼ A A 2...A and B ¼ B B 2...B is represented as A A 2 A B B 2 B The null hypothesis is that all the 2n letters are i.i.d. P with P(A ¼ a) ¼ p a, and we call such an alignment P-distributed. The alternate hypothesis Q is a distribution over aligned pairs a b with Q A ¼ a B b ¼ qab.in contrast P A B ¼ a b ¼ pa p b. Thus, we are testing the statistical distribution of letter pairs in the alignment: P-distributed alignments versus Q-distributed alignments. Following the Neyman-Pearson Lemma, to test the hypothesis that the alignment distribution is we should use the likelihood ratio H 0 : P vs H : Q Y i¼ q Ai B i p Ai p Bi : For convenience, we will use the logarithm of this statistic to form an equivalent test.

4 680 MENG ET AL. / ¼ I X i ¼! q Ai B log i t : p Ai p Bi Thus, we have derived the log-likelihood scoring of alignments using alphabet scoring function q ab s(a, b) ¼ s P, Q (a, b) ¼ log, () p a p b which for sequence alignment goes back at least to Dayhoff et al. (978) and was more recently employed by Henikoff and Henikoff (992) and others. This article will explore the implications of this approach to scoring and its connection to hypothesis testing. If we consider P as the background distribution and Q as the alternate distribution for the alignment, we should use the log-likelihood ratio scoring function s ¼ s P,Q as the scoring function to best distinguish Q from P. However it is less evident what should be done for local alignment. We conjecture this scoring function is most powerful to detect Q distributed local alignments aligned fragments with distribution Q. Now if there were one given, fixed-length alignment, our previous discussion would be the conclusion of the matter. Instead there are, for two random sequences of length n, O(n 3 ) possible local alignments (we exclude indels where this number is much larger), and they are dependent in a subtle way. Is there any reason to be optimistic that log-likelihood scoring function is best for detecting local alignments of distribution Q from the background P? The mathematical result described next gives some hope for this. The following result first appeared in Arratia et al. (988) and was stated more generally in Karlin and Altschul (990) with a form that was proven in Dembo et al. (994). We give a version fitting our setup and do not completely repeat the notation we have defined above. Let A ¼ A A 2 A n and B ¼ B B 2 B n be i.i.d. random sequences with background distribution P. Assume s(a,b) is a scoring function that satisfies the conditions (i) max a, b2l s(a, b) 4 0 and (ii) E P s(a, B) 5 0, where E P s(a, B) ¼ X p a p b s(a, b): Let r > 0 be the largest real root of a, b2l f (k) ¼ E(k s(a, B) ) ¼ 0: (2) Then the proportion of letter a from sequence A aligned with letter b from sequence B in the optimal alignment segment converges to q ab ¼ p a p b r s(a, b) as sequence length n tends to infinity. We now take the asymptotic distribution of the theorem and solve for s(a, b). s(a, b) ¼ log =r q ab p a p b : As positive multiples (cs(a,b) versus s(a,b) for any c > 0, for example) do not affect the results of local alignment, the numerical value of r > 0 is irrelevant. Therefore for any scoring function satisfying the hypotheses of the theorem, there is an asymptotic log-likelihood scoring function. It is easy to show (see Appendix A) that for any log-likelihood scoring function, the conditions of the theorem are satisfied so long as P and Q are not identical, that is for some (a,b), q ab = p a p b. Thus, we have a duality between scoring and likelihood ratio statistics. The theorem assures us that in the i.i.d. case even with the complexity of O(n 3 ) competing local alignments, with a given scoring function, a local alignment algorithm searches for Q distributed alignments. From this point of view, sequence alignment is hypothesis testing where H 0 : The sequences A A 2 A n and B B 2 B m are from P i.i.d. letters. H : The sequences A A 2 A n and B B 2 B m are mixture of P i.i.d. letters and a Q distributed local alignment at an unknown location.

5 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 68 Because under either hypothesis the alignment algorithm is rewarding Q distributed local alignments, how do we determine signal from background? The answer is that this is not possible until the signal is significantly larger than the background. Fortunately, there is a well-studied basis for statistical significance in local alignments; the most famous is used in BLAST (Altschul et al., 990) and is closely related to the theorem presented above, in addition to there being rigorous Poisson approximation methods which are equivalent (Waterman and Vingron, 994). For our purposes, it will suffice to note that the growth of the alignment length of an optimal alignment, via an Erdös-Renyi law (Arratia et al., 988), for two sequences of length n is k ¼ log (nm) H(Q, P) ; where H(Q, P) ¼ X p a p b log (p a p b =q ab ): a, b2l Let S P,Q (A, B), the maximum local alignment score under the scoring function s P,Q defined by P and Q, be the test statistic. For a given size a, we choose a threshold t a, so that We define the power of the alignment test statistic as P(S P, Q (A; B) t a ) ¼ a: (3) power ¼ Q(S P, Q (A, B) t a ): (4) A scoring function s P,Q yielding the highest power is preferred in local sequence alignment. A natural way to evaluate the scoring function s P,Q is to see if the local aligned segment identified by the algorithm can find the signal of interest. A signal can be locally aligned segments. In our simulations, the signal is inserted at random positions of the two sequences. We treat the aligned position pairs of the signal as actual positives. However, the actual negatives are less easy to define because the other bases are not aligned. We denote the inserted signal by p* and let k* be the length of the signal p*. Similarly, the predicted positives are the aligned position pairs in the identified local aligned segment, which we refer to as p 0. Let the length of identified alignment be k 0. Table shows the relationship among the terms. The predicted negatives are difficult to define though. We use similar notation as in standard classification problems, and use TP, FP, and FN to represent true positive, false positive, and false negative, respectively. TPR (also referred to as sensitivity) is the fraction of true positives among the actual positives, i.e., TPR ¼ TP k ¼ jp0 \ p j k : The positive predictive value (PPV) or precision is defined by and FDR is defined as PPV ¼ TP k 0 ¼ jp0 \ p j k 0, FDR ¼ PPV: Table. TP, FP, and FN by Comparing Algorithmic Alignment p 0 with the Signal p* Signal p* Positive Negative Predicted alignment p 0 Positive True positive ¼jp 0 \ p*j False positive ¼ k jp 0 \ p*j Negative False negative ¼ k* jp 0 \ p*j True negative

6 682 MENG ET AL. Scoring functions yielding high TPR and high PPV (and low FDR) are preferred. Another commonly used measure to evaluate the performance of a classification problem is the f-statistic defined as f ¼ 2TP k þ k 0 ¼ 2 jp0 \ p j k þ k 0 : Note that the f-statistic is a weighted sum of TPR and PPV. In this study, we use TPR, FDR, and the f-statistic to evaluate the scoring function from the classification point of view Simulation studies We carry out extensive simulations to show that when the scoring function used for sequence comparison is the log-likelihood ratio score defined in equation, the test statistic has the highest power, TPR, PPV, and the f-statistic. To achieve this objective, we carry out simulations as follows. First, we choose a set of P distributions as the background distribution of letters. Second, we define a set of Q-distributions which define how letter pairs align with each other in the simulated signal region. For DNA sequences, we choose three P s and five Q s. The three P distributions are: uniform, A rich, and GC rich. The five Q distributions are: all matches have equal probability which is higher than the probability for mismatches, AA pair rich, AA pair poor, GG & CC rich, and GG & CC poor. For amino acid sequences, we choose two P s and three commonly used score matrices: BLOSUM45, BLOSUM62, and BLOSUM80. The Q s (Q, Q 2, Q 3 ) corresponding to the three score matrices are derived by solving equation 2, and the corresponding Q-distribution is given by q ab ¼ p a p b r s(a, b). The two P s are: equal probability for the 20 amino acids and the observed amino acid frequencies in vertebrates. For details about these choices, see Appendix B. Third, for a given size a, we calculate a threshold t a (P, Q), as in equation 3, when a scoring function, s P,Q, defined by P and Q in equation, is used to align the two sequences as follows:. Generate i.i.d. random sequences A and B of length n with background distribution P. 2. Do local alignment of sequence A and sequence B with scoring function s P,Q using the Smith- Waterman algorithm (Smith and Waterman, 98). 3. Repeat steps -2 for R ¼ 0, 000 times and rank the resulting local sequence alignment scores in ascending order. Approximate the value of t a (P, Q) by the upper a percentile of the local alignment scores. Fourth, we approximate the power of testing the hypotheses H 0 versus H when a Q*-distributed alignment is inserted in the two random sequences as follows. The Q* distribution is referred to as the target distribution. We simultaneously calculate the approximate values of TPR, FDR, and the f-statitics with the procedure. The objective of this study is to identify an optimal scoring function to detect the relationship between sequences related through Q*-local alignment. The simulation steps are as follows:. Generate i.i.d. random sequences A and B of length n with background distribution P. 2. For a specific target distribution Q* from the group of Q s, generate a length k* aligned pairs with Q* distribution with k log (n 2 ) ¼ ( þ e) H(Q, P 2 ), (5) where e is a factor making the length of Q* segment somewhat larger than the expected length under the null model. To generate a Q* segment, we create a potential local alignment by independently drawing k* letter pairs from the Q* distribution. We refer to the generated aligned segment as a Q*-local alignment. 3. Let the generated Q*-local alignment be p ¼ p, where p p and p 2 are sequences of length k*. 2 Replace part of sequences A and B at random positions with p and p 2, respectively as shown in Figure. Define the resulting sequences as A* and B*. 4. Do local sequence alignment between sequence A* and sequence B* with scoring function s P,Q using the Smith-Waterman algorithm.

7 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 683 FIG.. Insert Q*-local alignment into the sequences A and B. We repeat the above four steps for R 2 ¼,000 times, and the power of detecting the relationship between the two sequences with the Q*-local alignment inserted using score s P,Q is approximated by the fraction of times that the resulting local alignment score is at least t a (P, Q). From the classification point of view, we are interested in the expectation of TPR, FDR and the f-statistic. Let TP r, k r and k 0 r be the estimated corresponding values of TP, k* and k 0 in the r-th experiments, r ¼, 2,, R 2. Then TPR r, FDR r and f r, r ¼, 2,, R 2 are estimated by TPR r ¼ TP r k r, FDR r ¼ TP r kr 0, f r ¼ 2TP r kr þ : k0 r The expectation of TPR, FDR and f-statistic can be approximated by ^E(TPR) ¼ R 2 X R 2 r¼ ^E(FDR) ¼ R 2 X R 2 ^E(f ) ¼ R 2 X R 2 r¼ TP r kr, r¼ TP r kr 0, 2TP r kr þ : k0 r 3. RESULTS In the simulations, we let the size a be 0.0 and 0.05 and e in equation 5 be The length n of sequences is 0,000. The simulation results are shown in Tables 2 7. In each table, the column direction represents a condition when one target Q*-local alignment is inserted, and the row direction represents the results using the log-likelihood scoring function derived from one P and one Q. From Table 2, by comparing the power among different Qs under the same P and Q*, it can be seen that the test has the largest power when Q equals to Q*. In other words, the highest power appears diagonally. For example, the power of the tests using loglikelihood ratio scoring functions corresponding to Q to Q 5 when P ¼ P 2 and Q* ¼ Q 3 are 0.6, 0.5, 0.7, 0.68, and 0.55, respectively, for test size a ¼ 0.0. The largest power among the five tests is 0.7 when Q ¼ Q 3. The other tables can be viewed similarly. Tables 2 4 are for DNA sequences. When Q ¼ Q*, we obtain the highest power, TPR, the f-statistic, and the lowest FDR. That is, the scoring function derived from Q* is the most powerful scoring function to detect Q*-local alignment. Tables 5 7 are for amino acid sequences. Similar conclusions as for DNA sequences are obtained. The power of the test based on scoring function derived from Q reach the highest when Q ¼ Q*, no matter whether a ¼ 0.0 or a ¼ It can also be seen from Table 5 that the power of the test based on s P,Q decreases as the distance between Q and Q* increases. For example, when Q* ¼ Q corresponding to BLOSSOM45 and P ¼ P 2, the power of the tests based on BLOSUM45, BLOSSOM62, and BLOSSOM80 is 0.83, 0.75, and 0.56, respectively. Table 6 gives the TPR and FDR for the tests using different scoring functions. When the target distribution is Q 3 corresponding to BLOSUM80, the TPR of the test using scoring functions s P, Q, s P, Q2, and s P, Q3 is close to 80%. On the other hand, Tables 6 and 7 show that the FDR is lowest and the f-statistic is the highest when the scoring function s P, Q is used.

8 684 MENG ET AL. Table 2. Power of the Tests Based on Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted Target Q*, a ¼ 0.0 Target Q*, a ¼ 0.05 Scoring function Q Q 2 Q 3 Q 4 Q 5 Q Q 2 Q 3 Q 4 Q 5 s PQ s PQ s PQ s PQ s PQ s P2Q s P2Q s P2Q s P2Q s P2Q s P3Q s P3Q s P3Q s P3Q s P3Q Test size a ¼ 0.0 or 0.05 (DNA sequences). 4. DISCUSSION Sequence alignments have been widely used to compare nucleotide and amino acid sequences. For a given scoring function, the local alignment score between two sequences is first obtained through a dynamic programming algorithm or a method such as BLAST (Altschul et al., 990), and a p-value or E-value can be calculated. Log-likelihood ratio scoring functions based on known aligned sequences were derived for sequence comparisons by (Dayhoff et al., 978) and (Henikoff and Henikoff, 992). Previous studies showed the superiority of the log-likelihood ratio scoring function by evaluating whether it can successfully identify genes within the same family (Henikoff and Henikoff, 993). It has also been argued that all reasonable substitution scoring functions are implicitly log-odds scoring functions (Karlin and Altschul, 990; Karlin Table 3. True Positive Rate (TPR) and False Discovery Rate (FDR) Using Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (DNA Sequences) Target Q*, TPR Target Q*, FDR Scoring function Q Q 2 Q 3 Q 4 Q 5 Q Q 2 Q 3 Q 4 Q 5 s PQ s PQ s PQ s PQ s PQ s P2Q s P2Q s P2Q s P2Q s P2Q s P3Q s P3Q s P3Q s P3Q s P3Q

9 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 685 Table 4. f-statistic Using Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (DNA Sequences) Target distribution Q* Scoring function Q Q 2 Q 3 Q 4 Q 5 s PQ s PQ s PQ s PQ s PQ s P2Q s P2Q s P2Q s P2Q s P2Q s P3Q s P3Q s P3Q s P3Q s P3Q Table 5. Power of Tests Based on Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted Target Q*, a ¼ 0.0 Target Q*, a ¼ 0.05 Scoring function Q Q 2 Q 3 Q Q 2 Q 3 s PQ s PQ s PQ s P2Q s P2Q s P2Q Test size a ¼ 0.0 or 0.05 (amino acid sequences). The target distributions Q, Q 2, and Q 3 correspond to BLOSUM45, BLOSUM62, and BLOSUM80, respectively. Table 6. True Positive Rate (TPR) and False Discovery Rate (FDR) of the Tests Based on Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (Amino Acid Sequences) Target Q*, TPR Target Q*, FDR Scoring function Q Q 2 Q 3 Q Q 2 Q 3 s PQ s PQ s PQ s P2Q s P2Q s P2Q The target distributions Q, Q 2, and Q 3 correspond to BLOSUM45, BLOSUM62, and BLOSUM80, respectively.

10 686 MENG ET AL. Table 7. f-statistic of the Tests Based Different Scoring Functions when Different Target Q*-Local Alignments Are Inserted (Amino Acid Sequences) Target Q* Scoring function Q Q 2 Q 3 s PQ s PQ s PQ s P2Q s P2Q s P2Q The target distributions Q, Q 2, and Q 3 correspond to BLOSUM45, BLOSUM62, and BLOSUM80, respectively. et al., 990; Altschul, 99), i.e., log-likelihood ratio scoring function. For a given scoring function s(, ), it has been shown that the probability that a is aligned to b in the best aligned segment is p a p b r s(a, b) with r being the largest root of the equation 2. For a given distribution Q for the aligned segment, it is possible to define a scoring function by the log-likelihood ratio between the Q distribution and the P distribution. Thus, scoring functions and target Q-distributions are coupled. Suppose that two sequences are related through a target distribution Q* in an aligned segment. Intuitively, the scoring function defined by the log-likelihood ratio between Q* and P distributions should be used. However, to the best of our knowledge, no studies have been carried out to prove or dispute this claim. In this article, we regard sequence alignment as a hypothesis testing problem, and study the power of tests based on different scoring functions for detecting the relationship between two sequences. For our studies, aligned segments were randomly inserted into the two sequences. The results from our simulations indicate that the log-likelihood ratio scoring function is the most powerful scoring function to detect segments of Q distribution using the scoring function s P,Q, as it has the highest power, TPR, and f-statistic, and the lowest FDR. However, we cannot mathematically rigorously prove that the log-likelihood ratio scoring function is optimal. In our simulation studies, we tried to choose a set of Q distributions as representative as possible. As the Q can be sampled in a continuous space with 5 degrees of freedom for DNA sequences and 399 degrees of freedom for amino acid sequences, we cannot search over all possibilities for Q. We chose representative Qs from the sampling space, compared the scoring functions derived from these Qs, and used those values to provide evidence to show that the log-likelihood ratio scoring function is most powerful. The field of sequence alignment lacks a proof of the claim we have conjectured. While for fixed length alignment the Neyman-Pearson Lemma holds, the distribution related to equation 2 is only true asymptotically. Thus we believe our result will only be proven as an asymptotic result. None the less it would be a significant advance for the general theory and practice of sequence alignment. Our study has several limitations. First, we assume that the two sequences are i.i.d in the null model. In many situations, Markovian models fit the sequences much better than the i.i.d model. Under the Markovian model, the log-likelihood scoring function will become more complex and can depend on adjacent pairs. We are confident that our results hold in the more general setting, as for example the asymptotic distribution for local alignment holds here. Second and more important, gaps should be considered in many alignment problems. Currently theoretical results are not available for local alignment on the length of gaps nor for aligned letter pairs for two random sequences. We conjecture that there is an asymptotic distribution for local alignment in this case as well, so long as n E PS(A, B) 5 0, where S(A, B) is the global alignment of the sequences A and B of length n. The condition states that the per letter score accumulation is negative. The asymptotic distribution even in the i.i.d. case for gaps will depend on the letter composition aligned to the gaps. However summing over the composition will give a gap length distribution. The lack of theoretical results makes the design of simulation studies difficult. We hypothesize that the optimal scoring function is still the log-likelihood scoring function. Further studies are

11 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 687 needed to prove or dispute this hypothesis. Such a result would be a significant extension of the results of Arratia et al. (988) and Dembo et al. (994). Similar results will hold for multiple local alignment. 5. APPENDIX A: Duality between scoring functions and log-likelihood ratio scores Claim. For any given P and Q distributions, the log-likelihood scoring function s P,Q defined in equation bys P, Q (a, b) ¼ log q ab p a p b satisfies the conditions: () max a, b2l s P, Q (a, b) 4 0 and (2) E P s P, Q (a, b) 5 0, unless q ab ¼ p a p b for all a, b 2L. Proof. If q ab ¼ p a p b for all a, b 2L, then s P,Q (a, b) ¼ 0 for all a, b 2L. Next assume that q ab 6¼ p a p b for some a, b. Then there must exist a, b 2L, such that q a b 4 p a p b. Thus, s P,Q(a*, b*) > 0. By Jensen s inequality, if X is a random variable and X is not a constant with probability, and g(x) is a strictly concave function, then E(g(X)) 5 g(ex). Applying this inequality with g(x) ¼ log(x), we have E P (s(a, b)) ¼ X p a p b s P, Q (a, b) a, b2l ¼ X p a p b log q ab p a, b2l a p b 5 log X q ab p a p b p a, b2l a p b ¼ log X q ab ¼ 0: B: The choices of P and Q distributions a, b2l In our study, we choose several P and Q distributions to provide evidence that the log-likelihood scoring function yields the highest power, TPR, the f-statistic, and the lowest FDR. It is important to choose such distributions so that they cover as many possibilities as possible. In this study, we choose P and Q distributions for DNA sequences. First, we consider equally likely distribution, i.e., p A ¼ p C ¼ p G ¼ p T ¼ 4. Second, we consider one-letter rich situation, e.g., A -rich. The background distribution pattern is set as p A ¼ 4 þ 3d, p C ¼ p G ¼ p T ¼ 4 d. If we let d be 20, then p A ¼ 2 5, p C ¼ p G ¼ p T ¼ 5. Third, we consider two-letter rich situation, e.g., G and C. p C ¼ p G ¼ 4 þ d 2, p A ¼ p T ¼ 4 d 2. When d 2 ¼ 2, p C ¼ p G ¼ 3, p A ¼ p T ¼ 6. In summary, we set three types of P s: equally likely, A rich, and G-C rich, and denote them as P, P 2, and P 3, respectively, as shown in Table 8. We choose the group of Q distributions for DNA sequences as follows. First, we let the probability of matches be larger than that for the mismatches. We choose Q as q ab ¼ 6 þ e ¼ q match, a ¼ b, 6 3 e ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e ¼ 6, q match ¼ 8, q mismatch ¼ 24. & Table 8. Background Distribution P for DNA Sequences Used in the Simulations P 4 P P 3 6 P A P C P G P T

12 688 MENG ET AL. Second, we consider the situation that the match of one specific letter, e.g., A, is preferred than the match for other letters. We consider AA pair rich Q-local alignment. So we choose Q as follows. 8 < 6 þ 3e 2, a=b=a, Q ab ¼ 6 þ e 2, a=b6=a, : 6 2 e 2 ¼ Q mismatch, a6=b, where a, b 2fA, C, G, Tg. When e 2 ¼ 6, q AA ¼ 4, q CC ¼ q GG ¼ q TT ¼ 8, q mismatch ¼ 32. Third, instead of letting AA pair to be enriched in the aligned region, we let another match, e.g., GG, be enriched. We set Q as follows. The reason for choosing Q this way is to see what happens if the enriched matches in the aligned part is different from the most abundant nucleotide in the background sequences. 8 < 6 þ 3e 3, a ¼ b ¼ G, q ab ¼ 6 þ e 3, a ¼ b 6¼ G, : 6 2 e 3 ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e 3 ¼ 6, q GG ¼ 4, q AA ¼ q CC ¼ q TT ¼ 8, q mismatch ¼ 32. Fourth, we set Q so that matches for two letters are enriched, e.g., CC and GG. Note that C and G are the enriched nucleotides for P 3 given above. 8 < 6 þ 2e 4, a ¼ b ¼ CorG, q ab ¼ 6 þ e 4, a ¼ b ¼ AorT, : 6 2 e 4 ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e 4 ¼ 6, q CC ¼ q GG ¼ 3 6, q AA ¼ q TT ¼ 8, q mismatch ¼ 32. Fifth, we let AA and TT be enriched in Q. Note that the enriched matches in the Q-local alignments are different from the enriched nucleotides in P 3. 8 < 6 þ 2e 5, a ¼ b ¼ AorT, q ab ¼ 6 þ e 5, a ¼ b ¼ CorG, : 6 2 e 5 ¼ q mismatch, a 6¼ b, where a, b 2fA, C, G, Tg. When e 5 ¼ 6, q AA ¼ q TT ¼ 3 6, q CC ¼ q GG ¼ 8, Q mismatch ¼ 32. In summary, we have five Q s Q, Q 2,, Q 5 as described above. For P as the uniform distribution P A ¼ P C ¼ P G ¼ P T ¼ 4, Table 9 shows the Q s we choose and the corresponding scores as well as the expectations of the scores. Choices of P s and Q s for amino acid sequences. We choose the commonly used BLOSUM45, BLOSUM62 and BLOSUM80 as the group of scoring functions, and the corresponding Q s(q, Q 2, Q 3 ) are derived from the three scoring functions through solving equation 2 of l, respectively. We choose two P-distributions. The first gives equal probability to all the amino acids (P ) and the other one is the observed amino acid frequencies in vertebrates (P 2 ) as shown in Table 0. Table 9. Q Distributions Used in the Simulations and the Corresponding Scores under the Equally Likely Background Distribution P, as well as the Expectation of the Scores Q (e ¼ 6 ) Q 2(e 2 ¼ 6 ) Q 3(e 3 ¼ 6 ) Q 4(e 4 ¼ 6 ) Q 5(e 5 ¼ 6 ) Q M: 8 M(AA): 4 M(GG): 4 M(GG, CC): 3 6 M(AA, TT): 3 6 M(CC, GG, TT): 8 M(AA, CC, TT): 8 M(AA, TT): 8 M(GG, CC): 8 N: 24 M: 0.69 N: 32 M(AA):.39 N: 32 M(GG):.39 N: 32 M(GG, CC):.0 N: 32 M(AA, TT):.0 s PQ M(CC, GG, TT): 0.69 M(AA, CC, TT): 0.69 M(AA, TT): 0.69 M(GG, CC): 0.69 N: 0.4 N: 0.69 N: 0.69 N: 0.69 N: 0.69 E P (s Q ) M, match; N, mismatch.

13 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 689 Table 0. Observed Amino Acid Frequencies in Vertebrates (P 2 ) A Alanine 7.4% R Arginine 4.2% N Asparagine 4.4% D Aspartic acid 5.9% C Cysteine 3.3% Q Glutamine 3.7% E Glutamic acid 5.8% G Glycine 7.4% H Histidine 2.9% I Isoleucine 3.8% L Leucine 7.6% K Lysine 7.2% M Methionine.8% F Phenylalanine 4.0% P Proline 5.0% S Serine 8.% T Threonine 6.2% W Tryptophan.3% Y Tyrosine 3.3% V Valine 6.8% C: The algorithm for simulation studies Algorithm Procedure flow Input: P,, P K, Q,, Q L, n and e Output: power a ¼ 0.0, power a ¼ 0.05, TPR, FDR, f for k ¼ ; k K; k þþdo for j ¼ ; j L; j þþdo for r ¼ ; r R ; r þþdo generate sequence A and sequence B with background P k ; local alignment between A and B using Q j ; record S Pk, Q j (A, B); end Rank S Pk, Q j (A, B); T a, j ¼ (ar ) th S Pk, Q j (A, B); end for i ¼ ; i L; i þþdo Q* ¼ Q i ; for r ¼ ; r R 2 ; r þþdo generate sequence A and sequence B with background P k ; plug in Q*-segment to generate new sequences A* and B*; for j ¼ ; j L; j þþdo local alignment between A* and B* using Q j ; record S Pk, Q j (A, B ); if S Pk, Q j (A, B ) T a, j then Flag r (P k, Q*, Q j ) ¼ ; else Flag r (P k, Q*, Q j ) ¼ 0; end calculate TPR r (P k, Q*, Q j ), FDR r (P k, Q*, Q j ), f r (P k, Q*, Q j ) end end for j ¼ ; j L; j þþdo

14 690 MENG ET AL. P R power Pk, Q, Q j ¼P R TPR Pk, Q, Q j ¼ P R FDR Pk, Q, Q j ¼ f Pk, Q, Q j ¼ end end end P R Flag r ¼ r(p k, Q, Q j ) R 2 ; TPR r ¼ r(p k, Q, Q j ) R 2 ; FDR r ¼ r(p k, Q, Q j ) R 2 ; r ¼ f r(p k, Q, Q j ) R 2 ; ACKNOWLEDGMENTS This work was supported by the NSFC (grant to L.M., X.Z.; grant to L.M.; grant to F.S., X.Z.; grant to F.S.) and the NIH (grant P50HG to F.S., M.S.W.; grant R2AG to F.S., M.S.W.). No competing financial interests exist. DISCLOSURE STATEMENT REFERENCES Adachi, J., and Hasegawa, M Model of amino acid substitution in proteins encoded by mitochondrial DNA. J. Mol. Evol. 42, Altschul, S., Wootton, J., Zaslavsky, E., et al The construction and use of log-odds substitution scores for multiple sequence alignment. PLoS Comput. Biol. 6, e Altschul, S.F. 99. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol. 29, Altschul, S.F., Gish, W., Miller, W., et al Basic local alignment search tool. J. Mol. Biol. 25, Altschul, S.F., Madden, T.L., Schaffer, A.A., et al Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 25, Arratia, R., Morris, P., and Waterman, M Stochastic scrabble: large deviations for sequences with scores. J. Appl. Probab. 25, Cao, M.D., Dix, T.I., and Allison, L Computing substitution matrices for genomic comparative analysis. Proc. 3th Pacific-Asia Conf. Adv. Knowledge Discov. Data Mining Dayhoff, M.O., Schwartz, R.M., and Orcutt, B.C A model of evolutionary change in proteins. Atlas Protein Sequence Struct. 5, Dembo, A., Karlin, S., and Zeitouni, O Limit distribution of maximal non-aligned two-sequence segmental score. Ann. Appl. Probab. 22, Do, C.B., Mahabhashyam, M.S., Brudno, M., et al Probcons: probabilistic consistency-based multiple sequence alignment. Genome Res. 5, Gonnet, G.H., Cohen, M.A., and Benner, S.A Exhaustive matching of the entire protein sequence database. Science 256, Henikoff, S., and Henikoff, J.G Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. U.S.A. 89, Henikoff, S., and Henikoff, J.G Performance evaluation of amino acid substitution matrices. Proteins 7, Holmes, I., and Durbin, R Dynamic programming alignment accuracy. J. Comput. Biol. 5, Jones, D.T., Taylor, W.R., and Thornton, J.M The rapid generation of mutation data matrices from protein sequences. Comput. Appl. Biosci. 8, Karlin, S., and Altschul, S Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc. Natl. Acad. Sci. U.S.A. 87, Karlin, S., Dembo, A., and Kawabata, T Statistical composition of high-scoring segments from molecular sequences. Ann. Stat. 8,

15 SEQUENCE ALIGNMENT AS HYPOTHESIS TESTING 69 Kent, W.J BLAT the BLAST-like alignment tool. Genome Res. 2, Langmead, B., Trapnell, C., Pop, M., et al Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol. 0, R25. Müler, T., Spang, R., and Vingron, M Estimating amino acid substitution models: a comparison of Dayhoff s estimator, the resolvent approach and a maximum likelihood method. Mol. Biol. Evol. 9, 8 3. Needleman, S.B., and Wunsch, C.D A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, Neyman, J., and Pearson, E.S On the problem of the most efficient tests of statistical hypotheses. Phil. Trans. R. Soc. A 23, Overington, J., Donnelly, D., Johnson, M.S., et al Environment-specific amino acid substitution tables: tertiary templates and prediction of protein folds. Protein Sci., Pearson, W.R Rapid and sensitive sequence comparison with FASTP and FASTA. Methods Enzymol. 83, Polyanovsky, V., Roytberg, M.A., and Tumanyan, V.G Reconstruction of genuine pairwise sequence alignment. J. Comput. Biol. 5, Prlić, A., Domingues, F.S., and Sippl, M.J Structure-derived substitution matrices for alignment of distantly related sequences. Protein Eng. 3, Qian, B., and Goldstein, R.A Optimization of a new score function for the generation of accurate alignments. Proteins 48, Shi, J., Blundell, T.L., and Mizuguchi, K FUGUE: sequence-structure homology recognition using environmentspecific substitution tables and structure-dependent gap penalties. J. Mol. Biol. 30, Smith, T.F., and Waterman, M.S. 98. Identification of common molecular subsequences. J. Mol. Biol. 47, Subramanian, A., Kaufmann, M., and Morgenstern, B DIALIGN-TX: greedy and progressive approaches for segment-based multiple sequence alignment. Algorithms Mol. Biol. 3, 6. Subramanian, A., Weyer-Menkhoff, J., Kaufmann, M., et al DIALIGN-T: an improved algorithm for segmentbased multiple sequence alignment. BMC Bioinform. 6, 66. Thompson, J.D., Higgins, D.G., and Gibson, T.J CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 22, Torda, A.E., Procter, J.B., and Huber, T Wurst: a protein threading server with a structural scoring function, sequence profiles and optimized substitution matrices. Nucleic Acids Res. 32, W532 W535. Veerassamy, S., Smith, A., and Tillier, E.R.M A transition probability model for amino acid substitutions from blocks. J. Comput. Biol. 0, Waterman, M., and Vingron, M Sequence comparison significance and Poisson approximation. Stat. Sci. 9, Waterman, M.S Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman & Hall, New York. Webb, B.M., Liu, J.S., and Lawrence, C.E BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Res. 30, Zhu, J., Liu, J.S., and Lawrence, C.E Bayesian adaptive sequence alignment algorithms. Bioinformatics 4, Address correspondence to: Dr. Michael S. Waterman Molecular and Computational Biology University of Southern California 050 Childs Way, RRI 20 Los Angeles, CA msw@usc.edu

In-Depth Assessment of Local Sequence Alignment

2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.