Approximate P-Values for Local Sequence Alignments: Numerical Studies. JOHN D. STOREY and DAVID SIEGMUND ABSTRACT

Size: px

Start display at page:

Download "Approximate P-Values for Local Sequence Alignments: Numerical Studies. JOHN D. STOREY and DAVID SIEGMUND ABSTRACT"

Clifton Richard
5 years ago
Views:

1 JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 5, 200 Mary Ann Liebert, Inc. Pp Approximate P-Values for Local Sequence Alignments: Numerical Studies JOHN D. STOREY and DAVID SIEGMUND ABSTRACT Siegmund and Yair (2000) have given an approximate p-value when two independent, identically distributed sequences from a nite alphabet are optimally aligned based on a scoring system that rewards similarities according to a general scoring matrix and penalizes gaps (insertions and deletions). The approximation involves an in nite sequence of dif cultto-compute parameters. In this paper, it is shown by numerical studies that these reduce to essentially two numerically distinct parameters, which can be computed as one-dimensional numerical integrals. For an arbitrary scoring matrix and af ne gap penalty, this modi ed approximation is easily evaluated. Comparison with published numerical results show that it is reasonably accurate. Key words: local alignment, af ne gap penalty, p-value, Marov renewal theory.. INTRODUCTION Comparison of a new gene (DNA sequence) or protein (amino acid sequence) with the existing sequences in a database can be an important rst step in learning the structure and/or function of the new sequence. Local pairwise alignment of sequences plays an important role in such database searches. The quality of an alignment is determined by the total similarity score of letters at aligned positions and the number of gaps (insertions/deletions) in each sequence. (These concepts will be de ned precisely in what follows.) Smith and Waterman (98) and Altschul et al. (990) have developed rapid and effective computational algorithms for local pairwise comparison of DNA or amino acid sequences. In addition, one would lie to evaluate the statistical signi cance of sequences showing a particular level of similarity (e.g., Arratia et al., 990; Dembo et al., 994; Altschul and Gish, 996). Approximate p-values when gaps are not allowed have been given by a number of authors, e.g., Arratia et al. (990), for a special scoring function and more generally by Dembo et al. (994). The form of the approximation in the ungapped case has been conjectured to be valid also in the gapped case (Waterman and Vingron, 994) and has been empirically t to simulated and to real data to obtain values for parameters in the approximating formula (cf. Waterman and Vingron, 994; Altschul and Gish, 996). While these approximations seem to be reasonably accurate, development of each approximation requires substantial statistical analysis and computation that go beyond the basic data of the problem: the scoring matrix and the gap costs. More recent results in this direction have been obtained by Altschul et al. (200). Department of Statistics, Stanford University, Stanford, CA

2 550 STOREY AND SIEGMUND For af ne gap costs, Mott and Tribe (999) obtain a useful approximation by an interesting heuristic argument. Assuming the gap open penalty is suf ciently large, Siegmund and Yair (2000) provide a somewhat different approximation, which depends on in nitely many parameters. They also conjecture that for numerical purposes a further approximation depending on two of these parameters would suf ce. The purpose of this paper is to show that a) this conjecture is correct, b) the modi ed approximation is reasonably accurate, and c) it can be rapidly computed directly from basic data of the problem. In addition to better theoretical understanding of how the p-value depends on the parameters of the problem, this approximation requires effectively zero computing time, so it could be implemented on line for a scoring matrix of the user s choice. Section 2 gives our notation and states the formal approximation of Siegmund and Yair (2000), which depends on the parameters 0; ; : Section 3 explains the probabilistic meaning of these parameters. Section 4 contains simulations based on the representations of Section 3, which show for several common scoring matrices that the r for r are essentially constant. Using the two parameters 0 and 3 D lim r r, one obtains a much simpler approximation, which is shown numerically to be in reasonable agreement with the approximation of Atlschul and Gish (996) based on their empirically estimated parameters. Convenient integral representations for 0;3 are given in an appendix. 2. NOTATION AND APPROXIMATION Consider two nite sequences x and y from a nite alphabet. Thus, x D x x 2 x m and y D y y 2 y n with x i ;y j 2 A. We assume throughout that x ;:::;x m are independently distributed with P 0 fx i D g D ¹ for all i; and similarly y ;:::;y n are independently distributed with P 0 fy j D g Dº for all j. Moreover, the x s and y s are independent. A candidate alignment, z D f.i t ;j t / : t g, for some i < i 2 < < i m and j <j 2 < <j n, speci es that x it and y jt are aligned for all t D ; ;: The other x s with subscripts between i and i and the other y s with subscripts between j and j are said to be unaligned. Note that there may be other letters, at the beginning or end of the two sequences, that are neither aligned nor formally designated as unaligned. We assume that either i tc D i t C orj tc D j t C for all t<; i.e., there can be unaligned letters in only one sequence at a time. With each candidate alignment z we associate a score S z D S z.x; y/. Aligned letters x i and y j are scored according to a similarity matrix K.x i ;y j /: We assume a penalty of ± for each unaligned letter. The total number of unaligned letters is l D.i i C C j j C /. A gap is the interval of unaligned letters that begins with a value t C such that i t D j t and either i tc >i t C orj tc >j t C and ends with the next aligned pair after.x it ;y jt /. Each gap is assessed a cost in addition to the cost ± of each unaligned letter. Frequently one refers to as the gap open and ± as the gap extension cost. The score of the candidate alignment z is S z D S z.x; y/ D 6tD K.x i t ;y jt / j ±l, where j is the number of gaps and l is the total number of unaligned letters, or equivalently the total length of all gaps. Note that there are no costs assessed for letters that are neither aligned nor unaligned. Within the collection Z of all candidate alignments, one can identify the best alignment the one with the highest score. The p-value of the best score under the null assumption that the sequences x and y are independent random samples from the given alphabet is P 0.max z2z S z b/; () where b is the observed value of the score for the best alignment and P 0 is the null probability described above. We assume that E 0 K.x;y/ < 0 and P 0 fk.x;y/ > 0g > 0: (2) Let Ã.µ/ D log E 0 exp[µk.x;y/]. Then exp[µk. ; / Ã.µ/]¹ º de nes an exponential family of probabilities indexed by µ, which for µ D 0 reduces to the original probability Q 0. ; / D ¹ º. It follows from (2) by convexity that there is a unique value µ > 0 for which Ã.µ / D 0: Let ¹ 0 D Ã 0.µ />0:

3 LOCAL SEQUENCE ALIGNMENTS 55 Let Q. ; / D exp[µ K. ; /]Q 0. ; / and note that ¹ 0 D E K.x;y/: Also let Q i;j be de ned to be the product probability that gives x the marginal distribution it has under Q i and y the marginal distribution it has under Q j. We assume that E K.x;y/ E i;j K.x;y/ > 0 (3) for all i; j. This condition is easily checed and excludes the case that K. ; / is a sum of a function of and a function of. We shall require the technical assumptions that min.m; n/=b!and that for some 0 <²<, mnexpf.µ b/ ² g is bounded. The second assumption puts the probability of interest into the domain of large deviations. To simplify the presentation of the main approximation, we assume that K.x;y/ is a nonarithmetic random variable and discuss the more complicated arithmetic case below. The main result of Siegmund and Yair (2000) is as follows. Theorem. Assume conditions (2) and (3) hold and that µ D log.µ b/ C C (4) for some constant C. There exist parameters 0; ; in (0,] such that as b!, P 0 max S z b z2z X X mne µ b.µ ¹ 0 / 20 [.2be µ =¹ 0 / j =j!] jd0 ldj e µ ±l X i il j C l jc ; (5) where the innermost summation extends over the l j terms having i C Ci l jc D j and i C 2i 2 C C.l j C /i l jc D l: An upper bound for the indicated sums is =[exp.µ ±/ ]: Remars. (i) A principal consequence of the numerical studies reported below is that the constants r for r are essentially all equal, say, to a single parameter 3. Hence the right hand side of (5) can reasonably be approximated by the much simpler expression mn.µ ¹ 0 / 20 exp. µ bf 2.µ ¹ 0 / 3e µ =[e µ ± ]g/: (6) (ii) In applications, the random variables K.x i ;y j / will be arithmetic of span h, typically h D : In this case it does not seem possible to give a precise asymptotic expression for the desired probability, which can be shown to lie asymptotically in an interval bounded below by hµ =.e hµ / and above by. e hµ /=.hµ / times the quantitity on the right hand side of (5). (iii) As explained by Siegmund and Yair (2000) the condition (4) can be viewed as a diagnostic for the accuracy of the approximation (5). If we suppose that b and are given and (4) de nes the value of C, then one should be careful in applying (5) when C is small, say, C< : THE CONSTANTS l r, r 0,, 2, In the rest of this paper we assume that K.x i ;y j / is arithmetic with span h D : The constant 0 appears in the approximation of Dembo et al. (994), where gaps are not permitted. It has the same probabilistic meaning as the other r, but it is de ned relative to a random wal process rather than a Marov chain and consequently is easily evaluated numerically. It has been thoroughly discussed in the context of sequential analysis (e.g., Woodroofe, 982; Siegmund, 985). To be more precise and prepare for the more complex case of r.r ) given below, let P denote the probability under which.x ;y /;.x 2 ;y 2 /; are independent and identically distributed with the distribution Q de ned following display (2). The log lielihood ratio of the rst n pairs.x i ;y i / under P

4 552 STOREY AND SIEGMUND relative to P 0 is `n D µ S n, where S n D P n K.x i;y i /: It follows from the arguments of Siegmund and Yair (2000) that 0 D lim N which after summation by parts and a change of measure equals N E 0 [exp.max n N `n/]; (7) lim N. e µ / X j E [expf µ.s.j/ j/gi.j/ N]; (8) where.j/ D minfn : S n j g for j (and.0/ is similarly de ned with a strict inequality). Since ¹ 0 D E K.x ;y />0, by the strong law of large numbers P f.j/ N g converges to for j<. ²/N¹ 0 and to 0 for j>.c²/n¹ 0 for arbitrary 0 <²<: It follows from the renewal theorem that º 0 D lim E [expf µ.s.j/ j/g] (9) j exists. From (5) (7) and elementary analysis one obtains 0 D ¹ 0. e µ /º 0 : (0) In addition, º 0 can be expressed in terms of the distribution of the ladder variables..0/; S.0/ / (e.g., Siegmund, 985, Chapter 8) and then can be evaluated numerically as a one-dimensional inverse Fourier transform (cf. Woodroofe [982] for nonarithmetic variables and Tu and Siegmund [999] for the arithmetic case). For completeness the result of Tu and Siegmund (999) is stated in the appendix. Except for the numerical evaluation of º 0, the preceding argument applies with minor modi cations to the r for r : Consider two sequences x and y of length N, for some integers N!. Given r and, <N r, consider the alignment which matches the rst x s with the rst y s and the x s between C and N r with the y s between Cr C andn. Let P.r/ denote the probability measure that maes the aligned pairs x i ;y j independent and identically distributed with joint distribution Q and leaves the distribution of the other x i and y j unchanged. Let `.r/ denote the log lielihood ratio of P.r/ relative to P.r/,so`.r/ 0 D µ [ P K.x i;y i / C P N r C K.x i;y /]and`.r/ icr D µ P 2 [K.x i;y i / K.x i ;y icr /]. Siegmund and Yair (2000) obtain the representations (cf. (7)) r D lim N µ N E.r/ 0 exp.max `.r/ N / D lim N µ N E.r/ exp.max f`.r/ N g/ : () Let P.r/ denote the extension of P.r/ N r to the in nitely long sequence fx i;y i ; i<g: Note that D `.r/ D µ [K.x ;y / K.x ;y Cr /] is a function of x, y and y Cr only. Under P.r/ the are identically distributed; and.r/ is independent of any.r/ l for l such that.l mod r/ 6D. mod r/. For each 0 i<r, the process f.r/ : N r;. mod r/ D ig is a rst order Marov chain. Thus `.r/ D P 2.r/ i is an additive functional of an rth order Marov chain. By () and the argument leading from (7) to (0), we see that.r/.r/ r D ¹ r. e µ /º r ; (2) where ¹ r D E.r/ [K.x 2 ;y 2 / K.x 2 ;y 2Cr ] is constant in r, and the limit de ning º r (exactly as in (9)) is guaranteed to exist by the renewal theorem for additive functionals of a Marov chain applied to the process of ladder variables, e.g., Athreya et al. (978). Althoughwe obtain a representationstructurally identicalto (0), numerical computationsare much more complicated than for a random wal. Asmussen (989) has used a representation of the limit de ning º r in terms of ladder variables to evaluate such a parameter for a similar process. The problem is also discussed by Karlin and Dembo (992), although they do not appear to have developed a numerical algorithm. In the

5 LOCAL SEQUENCE ALIGNMENTS 553 following section we use the representation in () for simulating º r : In this regard, note that simulation requires only the generation of independent, identically distributed, discrete, random variables, while an analytic approach must deal with the Marovian dependence described above. From the probabilistic meaning of the º r, it seems natural to conjecture that they are effectively constant in r ; and since.r/ 2 ; ;.r/ rc are independent and identically distributed, with a distribution not depending on r, the limit of º r as r!can be evaluated in terms of a random wal (albeit a different random wal from that entering into º 0 above). 4. NUMERICAL RESULTS Our goal in this section is two-fold: (i) to show by simulation that it is reasonable to regard the constants r for r as a single constant, so that the complicated approximations given in Theorem can be replaced by the much simpler expresson (6), which is very easy to evaluate numerically via the algorithm given in the appendix; and (ii) to compare the results of our approximation to selected results in the literature reported by Altschul and Gish (996). We use the amino acid frequencies reported by Robinson and Robinson (99) and the scoring matrices BLOSUM62, BLOSUM50 (Henioff and Henioff (992), and PAM250 (e.g., Dayhoff, 978). Tables and 2 give estimated values of º r for the BLOSUM62 and PAM250 substitution matrices, respectively. The estimates are based on 50,000 repetitions of a Monte Carlo experiment and are given for various values of r. Since º r is represented as a limit as j! and large values of j require proportionately long Monte Carlo runs, the estimates are also tabled for different values of j. (The process increases at rate ¹, so a rough estimate, which ignores edge effects, of the mean value of `.r/.j/ is j=¹. Table 3 gives values of ¹.) Standard deviations of the estimates are about The estimates are essentially constant in both r and j. Very similar results have been obtained for the BLOSUM50 substitution matrix and hence are omitted. We conclude from these simulations that it is reasonable to use the simpli ed approximation (6) instead of the much more complicated (5). Table 3 gives the numerical values of different parameters that are required for implementation of the approximation (6). In particular, instead of º r for r, we propose to use º D lim r º r, which involves independent, identically distributed, random variables and hence can be computed by the method given in the appendix. Then 3 D ¹. e µ /º. For ease in maing numerical comparisons with published values, we continue to use the parameters in the form given above. We also give parameters relevant to the scale that Altschul and Gish (996) call nats, which arises from multiplying K given in an arbitrary scale by µ. An increase of one unit in the threshold a D µ b results in a change of the probability () by approximately the factor e. In this scale, the mean values ¹ r become the Kullbac Leibler information numbers I r D µ ¹ r.r D 0; /. In Table 3, the values given for º 0 and º have been obtained by the method given in the appendix, which requires a one-dimensional numerical integration. It is easily automated and very fast to evaluate. To the extent that the values of º r do change with r, the value given for º in Table 3 should be a better approximation for large r, when the increments of the process `.r/ are independent over long stretches. Although the value of º is close to the values given in Table, it is outside the range of Table. Simulated Values of º r for BLOSUM62 Matrix a r j a Estimates are based on 50,000 repetitions of a Monte Carlo experiment, for various values of j.

6 554 STOREY AND SIEGMUND Table 2. Simulated Values of º r for PAM250 Matrix a r j a Estimates are based on 50,000 repetitions of a Monte Carlo experiment, for various values of j. Table 3. Parameters of Selected Substitution Matrices BLOSUM62 BLOSUM50 PAM250 µ º ¹ I º ¹ I Table 4. p-values a K ± b p 0 6 BLOSUM BLOSUM PAM a Upper entry is our approximation; lower entry from Altschul and Gish (996); m D n D 500.

7 LOCAL SEQUENCE ALIGNMENTS 555 reasonable sampling variability. The values of º for the BLOSUM50 and PAM250 matrices are within the range of reasonable sampling variability of the Monte Carlo estimates for large r. To verify that these values are at least reasonable, one can compare them to the case of a random wal having normally distributed increments, for which there is the simple approximation º ¼ exp[ ½.2I/ =2 ]; where ½ D 0:583 (cf. Siegmund, 985). This gives º 0 ¼ 0:625 and º ¼ 0:503 for the BLOSUM62 matrix and similar results for the other two cases. For implementation of the approximation (6), we also mae edge corrections to m and n, by replacing m with m 0 D m b=¹ 0 and by a similar correction to n. This correction is justi ed by the observation that an (ungapped) interval that achieves a score of b is roughly of length b=¹ 0. Similar edge corrections are used by Altschul and Gish (996) and by Mott and Tribe (999). Table 4 gives p-values for m D n D 500 and selected values of ; ±,andb. The rst entry given is the approximation (6); the second uses the extreme value approximation suggested by Waterman and Vingron (994) and by Altschul and Gish (996) with the empirically determined parameters given in Altschul and Gish (996). Our approximation is typically about 2 5 times as large as the Altschul Gish approximation. Very similar results hold for other values of m D n in the range [300,000]. The practical advantage of our approximation is that it requires only the almost trivial evaluation of µ and related parameters and computation of two one dimensional numerical integrals. Hence it can be easily evaluated for a scoring matrix chosen by the user without restriction to the few cases for which the empirical research necessary to estimate the parameters of the Altschul Gish approximation has been carried out. ACKNOWLEDGMENTS This research was supported in part by National Science Foundation Grant DMS and by an NSF Graduate Research Fellowship. 5. APPENDIX Theorem 2. Let x ;x 2 ; ;x n be independent and identically distributed arithmetic random variables with positive mean ¹, nite positive variance ¾ 2, and span h. LetÁ.t/ be the characteristic function of x and.t/ D log. Á.t// D P nd [Á n.t/=n]. LetS n D P n id x i. Then for arbitrary >0, 8 t ± ¹ ² 9 X µ n E.e SC n / C log.¹=h/ D Z 2¼ >< t e h it C log h h. eit / >= dt 2¼ 0 h >: e h it C e it : >; nd Let S. / denote the left hand side of (). Siegmund (985, p. 75) shows that º 0 and º are of the form hexp[ S.µ /]=[ e µ h ]: The values given in Table 2 were obtained from this expression. (3) REFERENCES Altschul, S.F., Bundschuh, R., Olsen, R., and Hwa, T The estimation of statistical parameters for local alignment score distributions. Nucl. Acids Res. 29, Altschul, S.F., and Gish, W Local alignment statistics. Methods in Enzymology 266, Altschul, S.F., Gish, W., Miller, W., Myers, E.W., and Lipman, D.J Basic local alignment search tool. J. Mol. Biol. 25, Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J Gapped BLAST and PSI-BLAST: A new generation of protein database search programs. Nucl. Acids Res. 25,

8 556 STOREY AND SIEGMUND Arratia, R., Goldstein, L., and Gordon L Two moments suf ce for Poisson approximation: The Chen Stein method. Ann. Probab. 7, Arratia, R., Gordon, L., and Waterman, M.S The Erdös-Rényi Law in distribution for coin tossing and sequence matching. Ann. Statist. 8, Asmussen, S Ris theory in a Marovian environment. Scand. Actuarial J., Athreya, K.B., McDonald, D., and Ney, P Limit theorems for semi-marov processes and renewal theory for Marov chains. Ann. Probab. 6, Dayhoff, M.O Atlas of Protein Sequence and Structure, vol. 5, suppl. 3, National Biomedical Research Foundation, Washington, DC. Dembo, A., Karlin, S., and Zeitouni, O Limit distribution of maximal non-aligned two-sequence segmental score, Ann. Probab. 22, Henioff, S., and Henioff, J.G Amino acid substitution matrices from protein blocs. Proc. Nat. Acad. Sci. USA 89, Karlin, S., and Dembo, A Limit distributions of maximal segmental score among Marov-dependent partial sums. Adv. Appl. Probab. 24, Mott, R., and Tribe, R Approximate statistics of gapped alignments. J. Comp. Biol. 6, 9 2. Pearson, W.R Comparison of methods for searching protein databases. Protein Sci. 4, Robinson, A.B., and Robinson, L.R. 99. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Nat. Acad. Sci. USA 88, Siegmund, D Sequential Analysis: Tests and Con dence Intervals. Springer-Verlag, New Yor. Siegmund, D., and Yair, B Approximate p-values for local sequence alignments. Ann. Statist. 28, Smith, T.F., and Waterman, M.S. 98. Identi cation of common molecular subsequences.j. Mol. Biol. 47, Tu, I.-P., and Siegmund, D The maximum of a function of a Marov chain and application to linage analysis. Adv. Appl. Probab. 3, Waterman, M Introduction to Computational Biology: Maps, Sequences and Genomes. Chapman and Hall, London. Waterman, M., and Vingron, M Sequence comparison and Poisson approximation. Statistical Science 9, Woodroofe, M Nonlinear Renewal Theory in Sequential Analysis. SIAM, Philadelphia. Yair, B., and Polla, M A new representation for a renewal-theoretic constant appearing in asymptotic approximations of large deviations. Ann. Appl. Probab. 8, Address correspondence to: David Siegmund Stanford University Department of Statistics Sequoia Hall 40 Stanford, CA, dos@stat.stanford.edu

In-Depth Assessment of Local Sequence Alignment

2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.