
JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 16, Number 12, 2009. © Mary Ann Liebert, Inc.

Research Articles

Alignment-Free Sequence Comparison (I): Statistics and Power

GESINE REINERT,1 DAVID CHEW,2 FENGZHU SUN,3,4 and MICHAEL S. WATERMAN3,4

1 Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom. 2 Department of Statistics and Applied Probability, National University of Singapore, Singapore. 3 Molecular and Computational Biology Program, University of Southern California, Los Angeles, California. 4 MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China.

ABSTRACT

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the $D_2$ statistic, relies on the comparison of the k-tuple content of both sequences. Although it has been known for some years that the $D_2$ statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the word count statistic, which we call $D_2^S$ and $D_2^*$. For $D_2^S$, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed when the sequence lengths tend to infinity, and is not dominated by the noise in the individual sequences. The second statistic, $D_2^*$, outperforms $D_2^S$ in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of $D_2^*$, we cannot provide a closed form for power calculations.

Key words: alignment-free, normal approximation, normal distribution, sequence alignment, word count statistics.

1. INTRODUCTION

Comparison of the similarities between two segments of biological sequences using k-tuples (also called k-grams or k-words) arises from the need for rapid sequence comparison. Such methods are often employed in cDNA sequence comparisons. Today, next-generation sequencing methods are producing unprecedented volumes of sequence data. Therefore, we expect that the use of k-tuples will play an increasingly important role for molecular sequence and genome comparisons in the current era. This article will explore in some detail the $D_2$ statistic upon which one of these methods is based, along with other, substantially superior statistics.

One of the most widely used statistics for sequence comparison based on k-tuples is the so-called $D_2$ statistic, which is based on the joint k-tuple content of the two sequences. If two sequences are closely related, we would expect the k-tuple content of both sequences to be very similar.

More formally, suppose that two sequences, $A = A_1 A_2 \cdots A_n$ and $B = B_1 B_2 \cdots B_m$, say, are composed of letters that are drawn from a finite alphabet $\mathcal{A}$ of size $d$. For $a \in \mathcal{A}$, let $p_a$ denote the probability of letter $a$. For $w = (w_1, \ldots, w_k) \in \mathcal{A}^k$, let

$X_w = \sum_{i=1}^{\bar n} \mathbf{1}(A_i = w_1, \ldots, A_{i+k-1} = w_k)$

count the number of occurrences of $w$ in $A$, and similarly, $Y_w$ counts the number of occurrences of $w$ in $B$. Here, $\bar n = n - k + 1$; similarly, we put for later use $\bar m = m - k + 1$. Then $D_2$ is defined by

$D_2 = \sum_{w \in \mathcal{A}^k} X_w Y_w.$

The null model is typically chosen to be such that the letters are independent and identically distributed (i.i.d.), and that the two sequences are independent. Using this model, Lippert et al. (2002) derived a Poisson approximation, a compound Poisson approximation, and a normal approximation for $D_2$; the normal approximation is only valid under the assumption that not all letters of the alphabet are equally likely. In the case that all letters are equally likely, the $D_2$ statistic looks asymptotically like a sum of products of independent normal variables. Lippert et al. (2002) also found that the $D_2$ statistic is dominated by background noise in the nonuniform case. In the work of Kantorovitz et al. (2007a), it was shown that in the regime that all letters are equally likely, the standardized statistic

$D_2^z = \frac{D_2 - E(D_2)}{\mathrm{sd}(D_2)}$

is asymptotically normally distributed when first the sequence length and then the word length tend to infinity, while the alphabet size stays fixed. In clustering some biologically related sequences, Kantorovitz et al. (2007b) found that $D_2^z$ outperforms $D_2$. The heuristic argument is that the background models for the two sequences may be different, and the statistic should hence be normalized to account for the different background distributions of the sequences. Yet in the nonuniform case the issue of the variability being dominated by the noise in the single sequences remains.

In this article, we propose a new statistic, $D_2^S$, which is a self-standardized version of $D_2$. In general, Shepp (1964) observed that, if $X$ and $Y$ are independent mean-zero normals, $X$ with variance $\sigma_X^2$ and $Y$ with variance $\sigma_Y^2$, then $XY/\sqrt{X^2 + Y^2}$ is again normal, with variance $\sigma_X^2 \sigma_Y^2 / (\sigma_X + \sigma_Y)^2$. For $w = w_1 \cdots w_k$, $p_w = \prod_{i=1}^k p_{w_i}$ is the probability of occurrence of $w$, and the centered count variables are denoted by

$\tilde X_w = X_w - \bar n p_w \quad \text{and} \quad \tilde Y_w = Y_w - \bar m p_w.$

We introduce the new count statistic

$D_2^S = \sum_{w \in \mathcal{A}^k} \frac{\tilde X_w \tilde Y_w}{\sqrt{\tilde X_w^2 + \tilde Y_w^2}}. \quad (1)$

Here we set $0/0 = 0$. The superscript $S$ stands for Shepp, and also for self-standardized. We shall see that, under reasonable assumptions, $D_2^S$ will be approximately normally distributed. In practice we shall usually have to replace $p_a$, the (unobserved) letter probabilities, by $\hat p_a$, the relative count of letter $a$ in the concatenation of the two sequences, based on the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution. We then estimate the probability of occurrence of $w = w_1 \cdots w_k$ by $\hat p_w = \prod_{i=1}^k \hat p_{w_i}$. In our simulations, we always estimate the letter probabilities, even when we assume that all letters are equally likely. We also study the following version of the word count statistic:

$D_2^* = \sum_{w \in \mathcal{A}^k} \frac{\tilde X_w \tilde Y_w}{\sqrt{\bar n \bar m}\, \hat p_w}, \quad (2)$

which in our simulations outperforms not only $D_2$ but also $D_2^S$ in terms of power for detecting the relatedness between the two sequences.
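As a concrete illustration of the definitions above, the following is a minimal sketch of how $D_2$, $D_2^S$, and $D_2^*$ might be computed for two DNA sequences, with the letter probabilities estimated from the concatenation of the two sequences. It assumes lower-case strings over {a, c, g, t}; the function names and the dictionary-based word counting are our own illustrative choices, not taken from the authors' code.

```python
from itertools import product
from math import sqrt

DNA = "acgt"

def word_counts(seq, k):
    """Count occurrences of every k-word over the n - k + 1 positions of seq."""
    counts = {"".join(w): 0 for w in product(DNA, repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

def letter_probs(seq_a, seq_b):
    """Estimate letter probabilities from the concatenation of both sequences."""
    joint = seq_a + seq_b
    return {a: joint.count(a) / len(joint) for a in DNA}

def d2_statistics(seq_a, seq_b, k):
    """Return (D2, D2S, D2*) for two sequences over the alphabet {a, c, g, t}."""
    X = word_counts(seq_a, k)
    Y = word_counts(seq_b, k)
    p = letter_probs(seq_a, seq_b)
    n_bar = len(seq_a) - k + 1
    m_bar = len(seq_b) - k + 1
    d2 = d2s = d2star = 0.0
    for w in X:
        pw = 1.0
        for ch in w:
            pw *= p[ch]
        xt = X[w] - n_bar * pw          # centered count, Equation (1)
        yt = Y[w] - m_bar * pw
        d2 += X[w] * Y[w]               # D2
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:                   # convention 0/0 = 0
            d2s += xt * yt / denom      # D2S, Equation (1)
        if pw > 0:
            d2star += xt * yt / (sqrt(n_bar * m_bar) * pw)   # D2*, Equation (2)
    return d2, d2s, d2star
```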
This statistic comes about by considering $\sum_{w} \tilde X_w \tilde Y_w \big/ \sqrt{\widehat{\mathrm{Var}}(X_w)\, \widehat{\mathrm{Var}}(Y_w)}$; but as the variance is costly to compute, it is replaced by the estimated mean of the word occurrence across the two sequences when the probability of the word pattern is small. This approach can be justified by considering a Poisson approximation for the individual word counts.

We justify in Section 2 that $D_2^*$ can be viewed as a sum of products of independent normal variables, and we suggest how to simulate from its asymptotic distribution, for which we do not have a closed-form expression.

To explain the problem with $D_2$, write

$D_2 = \sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w + \bar n \sum_{w \in \mathcal{A}^k} p_w Y_w + \bar m \sum_{w \in \mathcal{A}^k} p_w X_w - \bar m \bar n \sum_{w \in \mathcal{A}^k} p_w^2. \quad (3)$

Approximately, if $n$ and $m$ are large, under the null model $\tilde X_w$ should follow a mean-zero normal distribution with variance of order $n$ with respect to sequence length. The distribution of $\sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w$ should be approximately, for large $n$ and $m$, the distribution of a sum of products of pairs of independent mean-zero normal variables, with a variance of order $O(nm)$. If all letters are equally likely, then $p_w = d^{-k}$ for all words $w$, and hence $\sum_{w \in \mathcal{A}^k} p_w X_w = d^{-k} \sum_{w \in \mathcal{A}^k} X_w = \bar n d^{-k}$, giving that $D_2 = \sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w + \bar m \bar n d^{-k}$. The variability in $D_2$ is then the same as the variability in $\sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w$, and indeed as in $D_2^*$. When not all letters are equally likely, as the variance of $X_w$ is of order $n$ and the variance of $Y_w$ is of order $m$, the variance of $\bar n \sum_{w \in \mathcal{A}^k} p_w Y_w$ is of order $O(n^2 m)$, and similarly the variance of $\bar m \sum_{w \in \mathcal{A}^k} p_w X_w$ is of order $O(n m^2)$. Hence, the variability in $D_2$ is dominated by the variability in $\sum_{w \in \mathcal{A}^k} p_w X_w$ and $\sum_{w \in \mathcal{A}^k} p_w Y_w$. Thus, in this case, the variability in $D_2$ is dominated by the terms that reflect the noise in the single sequences only.

The asymptotic normality of $D_2$ for both nucleic acid and amino acid sequences has been studied empirically by Forêt et al. (2009). However, no power study was undertaken; our argument shows that in the nonuniform case the asymptotic normality of $D_2$ stems only from the asymptotic normality of the underlying word counts in the respective sequences.

Even in the regime that all letters are equally likely, if we leave just the last word $\mathbf{d} = (d, \ldots, d)$ out, forming the statistic $\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} X_w Y_w$, then $\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} X_w = \bar n - X_{\mathbf{d}}$, which is not constant. So even if we just leave one word out of the whole set of possible words, if the sequence lengths are large and all other parameters are fixed, then the variability of this modified statistic will be dominated by the variability in the single sequences. Hence, the $D_2$ statistic is, in general, not useful for assessing whether the two underlying sequences are related.

This article is structured as follows. In Section 2, we discuss the distributions of $D_2^S$ and $D_2^*$ under the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution, and we present simulation results for testing the normality of $D_2$, $D_2^S$, and $D_2^*$. In Section 3, we study the power of the statistics $D_2$, $D_2^S$, $D_2^*$, and $D_2^z$ under two alternative scenarios. The first scenario is that the two sequences contain a common motif, whereas the second scenario is a pattern transfer model: we pick a word in the first sequence and use it to replace a word in the second sequence. Our results illustrate not only the poor performance of $D_2$, but also the encouraging performances of $D_2^S$ and of $D_2^*$.
Section 4 illustrates how the asymptotic normality of $D_2^S$ gives a fast method for assessing statistical significance, as only its standard deviation has to be approximated and not the empirical distribution itself. There is a caveat: if the distribution on the alphabet is very close to uniform, and if some other conditions are satisfied which relate to having a large number of summands of products of pairs of independent normals, then $D_2$ will behave like a sum of products of normally distributed variables, similar to the uniform case; when the deviation from uniform increases, the asymptotic normality for $D_2$ holds. This phase transition is explored in Section 5. We summarize our results in Section 6, and we briefly indicate generalizations to Markov chain models as well as to multiple sequence comparisons. The proofs for Section 2 are presented in the Supplementary Material (see online Supplementary Material). The code for simulating from the distributions is available at www-rcf.usc.edu/~fsun/programs/d2/d2-all.html.

2. THE DISTRIBUTIONS OF $D_2^S$ AND $D_2^*$ UNDER THE NULL MODEL

Here the null model is that the letters are i.i.d. and the two sequences are independent. We assume, as in Huang (2002), that $k \ll \min(n, m)$. For $D_2$, Forêt et al. (2009) studied the empirical distribution via simulations, and they found that a gamma distribution outperforms the normal distribution in general. For longer sequences they showed that the normal approximation itself would be appropriate.

2.1. $D_2^S$ and asymptotic normality

First, we focus on the word counts in a single sequence. Let $\tilde{\mathbf{X}}^{-\mathbf{d}} = (\tilde X_w,\ w \in \mathcal{A}^k,\ w \neq \mathbf{d})$ be the vector of centered word counts with the last word $\mathbf{d} = (d, \ldots, d)$ left out; note that

$\tilde X_{\mathbf{d}} = -\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} \tilde X_w \quad (4)$

can be recovered from the set $\tilde{\mathbf{X}}^{-\mathbf{d}}$. Huang (2002) showed a multivariate normal approximation for the word count vector in a single sequence. The limiting covariance matrix $C$ needs some notation; see Section 12.1 in Waterman (1995), with results derived by Lundstrom (1990). For $w = w_1 w_2 \cdots w_k \in \mathcal{A}^k$ we define, for $j = 1, \ldots, k-1$,

$p_w(j) = p(w_{k-j+1})\, p(w_{k-j+2}) \cdots p(w_k),$

which is the probability that $w$ occurs in the overlapping position, given that $w_1 w_2 \cdots w_{k-j}$ has occurred. For words $u = u_1 u_2 \cdots u_k$ and $v = v_1 v_2 \cdots v_k$, the overlap indicator is defined as

$\xi_{u,v}(j) = \mathbf{1}(u_{j+1} = v_1, \ldots, u_k = v_{k-j}).$

This overlap indicator equals 1 if the last $k - j$ letters of $u$ overlap exactly the first $k - j$ letters of $v$. Then the approximating covariance matrix $C$ is given by

$C_{u,v} = p(u) \sum_{j=1}^{k-1} \xi_{u,v}(j)\, p_v(j) + p(v) \sum_{j=1}^{k-1} \xi_{v,u}(j)\, p_u(j) - (2k-1)\, p(u)\, p(v) + p(u)\, \mathbf{1}(u = v). \quad (5)$

A similar normal approximation is valid for $\tilde{\mathbf{Y}}^{-\mathbf{d}} = (\tilde Y_w,\ w \in \mathcal{A}^k,\ w \neq \mathbf{d})$. As we assume that $A$ and $B$ have the same letter probability distribution, both have the same limiting covariance matrix $C$. Thus, we obtain the following approximation for $D_2^S$ (for the proof and the precise bounds, see the online Supplementary Material). We use the abbreviation $\mathrm{MVN}(\mu, C)$ to denote a multivariate normal distribution with mean vector $\mu$ and covariance matrix $C$. Also, a function $h : \mathbb{R} \to \mathbb{R}$ is called Lipschitz with Lipschitz constant 1 if, for all real $x$ and $y$, $|h(x) - h(y)| \le |x - y|$.

Theorem 2.1. Assume $m \le n$ and $k^5 \le n^2$. Let $\mathbf{Z}_1 = (Z_{1,1}, \ldots, Z_{1, d^k - 1}) \sim \mathrm{MVN}(0, C)$ and $\mathbf{Z}_2 = (Z_{2,1}, \ldots, Z_{2, d^k - 1}) \sim \mathrm{MVN}(0, C)$ be two independent $(d^k - 1)$-dimensional normal vectors. In analogy to Equation (4), put, for $i = 1, 2$,

$Z_i(\mathbf{d}) = -\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} Z_i(w).$

Let

$D_{\lim} = \sum_{w \in \mathcal{A}^k} \frac{Z_1(w)\, Z_2(w)}{\sqrt{Z_1(w)^2 + Z_2(w)^2}}.$

Then $D_{\lim}$ is mean-zero normally distributed, and, for any function $h$ which is bounded and Lipschitz with Lipschitz constant 1, as $n \to \infty$ with $d = d(n)$ and $k = k(d, n)$,

$\big| E[h(D_2^S)] - E[h(D_{\lim})] \big| = O\!\left(k^4 d^{5k}\, n^{-1/2}\right).$

The bound in Theorem 2.1 may not be optimal; indeed, it is based on a multivariate normal approximation for word counts, Corollary 6.1 (see the online Supplementary Material), which is of order $k^2 d^k (n^{-1/2} + m^{-1/2})$. The purpose of the bound is to illustrate the trade-off between alphabet size, word length, and sequence length. If $d$, the alphabet size, is very large, then even moderately long words will be rare unless the sequence is very long. Because of the complicated dependence, we were not able to give a closed-form expression for the variance of $D_{\lim}$. Theorem 2.1, however, justifies using a z-test for the null model, based on the statistic $D_2^S$, using the estimated standard deviation.

2.2. $D_2^*$ and the product of independent normals

The statistic $D_2^*$ given in Equation (2) is motivated by estimating the standardized counts

$X_w' = \frac{\tilde X_w}{\sqrt{\mathrm{Var}(\tilde X_w)}} \quad \text{and} \quad Y_w' = \frac{\tilde Y_w}{\sqrt{\mathrm{Var}(\tilde Y_w)}},$

approximating $\mathrm{Var}(\tilde X_w) = \bar n p_w (1 - p_w)$ by the mean $\bar n p_w$, with the argument that $1 - p_w$ will be close to 1 when $k$ is reasonably large and the word $w$ is relatively rare. From Corollary 6.1 (see the online Supplementary Material), we obtain a multivariate normal approximation for the standardized count vectors $\mathbf{X}' = (X_w';\ w \in \mathcal{A}^k \setminus \{\mathbf{d}\})$ and $\mathbf{Y}' = (Y_w';\ w \in \mathcal{A}^k \setminus \{\mathbf{d}\})$. Although the covariances within the vectors will not disappear, for each $w$, $X_w'$ and $Y_w'$ are independent and would be approximated by independent univariate standard normal variables. From Stuart and Ord (1987), we know the distribution of the product of two independent standard normal variables [see also Springer and Thompson (1966)].

Lemma 2.1. Let $X$ and $Y$ be two independent standard normal random variables. Then the product $W = XY$ has probability density

$f(w) = \frac{1}{\pi} K_0(|w|), \quad (6)$

where $K_0(x) = \int_0^\infty \cos(xt)\, (1 + t^2)^{-1/2}\, dt$ denotes the modified Bessel function of the third kind.

Thus, the distribution of each summand $X_w' Y_w'$ will approximately have density (6). The covariance structure will result in an approximation with a complicated distribution, which can, however, easily be assessed by simulation: draw many normal vectors with covariance matrix $C$ given in Equation (5), standardize, and take products.

2.3. The case that all letters are equally likely

In the case that all letters are equally likely, both Lippert et al. (2002) and Kantorovitz (2007) observed that $D_2$ will not follow a normal distribution. Kantorovitz et al. (2007a) showed that $D_2^z = \frac{D_2 - E(D_2)}{\sqrt{\mathrm{Var}(D_2)}}$ is asymptotically normal, however, when first the sequence length and then the word length tend to infinity, while the alphabet size stays fixed. However, when the word length is fixed, $D_2$, $D_2^z$, and $D_2^*$ may not tend to normal as the sequence length tends to infinity. Note that in the case that all letters are equally likely, $D_2$, $D_2^z$, and $D_2^*$ all agree up to constants.

2.4. Simulations

To illustrate the quality of the normal approximation, we generate a pair of independent random sequences of length $n$ under the null model, with i.i.d. letters.

Throughout we restrict ourselves to the alphabet $\mathcal{A} = \{a, c, g, t\}$. We consider two types of distributions on the letters: the uniform distribution ($p_a = p_c = p_g = p_t = \frac{1}{4}$) and a gc-rich, nonuniform distribution ($p_a = p_t = \frac{1}{6}$, $p_c = p_g = \frac{1}{3}$); the latter distribution is the same as that used in Lippert et al. (2002) and Forêt et al. (2006) to study $D_2$. Similarly to Lippert et al. (2002) and Forêt et al. (2006), for each $n = 2^j \times 10^2$, where $j = 0, 1, \ldots, 8$, and for each $k = 2, \ldots, 10$, we compute the scores for each pair of sequences for the various statistics, where $k$ is the word size for the count statistics. Forêt et al. (2006) found $k = 7$ to be the optimal tuple length for $n = 800$, 1600, and 3200; optimal in the sense that for this choice of $k$, the $D_2$ statistic will be closest to normal. All results are based on a sample size of 10,000; we use the same simulated sequences for all three scores, $D_2$, $D_2^*$, and $D_2^S$. As $D_2^z$ differs from $D_2$ only by an additive and a multiplicative constant, we do not include $D_2^z$ in these simulations.

We then use the Lilliefors test (Lilliefors, 1967) to assess whether the distributions are close to normal. The Lilliefors test is a modification of the Kolmogorov-Smirnov goodness-of-fit test, in which the sample mean and standard deviation are used as the mean and standard deviation of the theoretical limiting normal distribution. In contrast to the Kolmogorov-Smirnov test, statistical significance is based on the Lilliefors distribution; see also Forêt et al. (2009) for a discussion of why not to use an unmodified Kolmogorov-Smirnov test when the standard deviation is estimated. A p-value of less than 0.05 indicates that we would reject the null model at the 5% significance level. Under the null model, in 100 tests we would expect about five tests to result in a p-value of less than 5%. Precision is up to four decimal places; for easier readability, a value of 0.0000 is recorded simply as 0.

We will first discuss the nonuniform case, where asymptotic normality has been shown to hold for all three statistics $D_2$, $D_2^S$, and $D_2^*$ when first the sequence length and then the word length tend to infinity. For short words, only $D_2^S$ has been shown to be approximately normal when the sequence length tends to infinity. The regime of interest here is that words are not too rare; for long words, say $k > 2 \log_{1/p_2} n$ with $p_2 = \sum_{a \in \mathcal{A}} p_a^2$, a compound Poisson approximation is more appropriate (Lippert et al., 2002).

The nonuniform case. In the nonuniform case, for all three statistics, the larger the sequence length and the smaller $k$, the closer the distribution is to normality; the performance is rather different, though. Recall that the sequence length is $2^j \times 100$; for easier readability, we denote the $2^j \times 100$ column in the table just by the value of $j$. Table 1 summarizes the p-values of the Lilliefors tests in the nonuniform case for $D_2$, $D_2^*$, and $D_2^S$. For $D_2$, Table 1 shows that even for $k = 1$ we would reject the hypothesis of normality at the 5% level as long as the sequence length is not at least 3200 bp ($j = 5$). For $k \ge 2$, the required sequence length would be around 25,600 bp ($j = 8$). Table 1 also shows that the statistic $D_2^*$ would reject the hypothesis of normality not only for large $k$, but also for small $k$ with large sequence length. This nonmonotonic behavior of $D_2^*$ indicates that, to declare statistical significance, the statistic should not be compared with a normal distribution.
In contrast, Table 1 shows that $D_2^S$ is reasonably close to normal even for a sequence of length 200 bp when $k \le 4$; for $k = 8$, a sequence of length 1600 bp would already look reasonably normal. Moreover, the statistic stays close to normal with increasing sequence length and with increasing word length, and it thus displays the monotonicity which makes the statistic safe to apply. We repeated the simulations for $D_2$ using the Kolmogorov-Smirnov test with the known mean and variance, based on Kantorovitz et al. (2007b), instead of the Lilliefors test. Although the Kolmogorov-Smirnov test gave slightly larger p-values, thus indicating a slightly better fit to a normal distribution, the qualitative behavior remained (data not shown).

The uniform case. In the uniform case, our theoretical results predict that the limiting distribution of $D_2$ would only look normal when the sequence length is large and $k$ is large also, or at least in a moderate range. In contrast, $D_2^S$ would still be asymptotically normal even for small $k$ when the sequence is long. Table 2 confirms this predicted behavior. Note that from Equation (3) we can see that, in the uniform case, both $D_2^*$ and $D_2^z$ are the same as $D_2$ up to a multiplicative constant and an additive constant. Table 2 shows that the statistics $D_2$, $D_2^*$, and $D_2^z$ do not monotonically approach the normal distribution. In contrast, we find that $D_2^S$ is close to normal even for sequences of length 100 bp when $k \le 3$, and it gets closer to the normal distribution with both increasing sequence length and decreasing $k$.
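The null-model simulation just described can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: it reuses `d2_statistics` from the earlier sketch, the replicate counts are reduced for speed, and it reports a plain Kolmogorov-Smirnov statistic against a normal with estimated parameters rather than p-values calibrated with the Lilliefors distribution.

```python
import random
import numpy as np
from scipy import stats
# reuses d2_statistics from the earlier sketch

def random_seq(n, probs):
    """i.i.d. sequence of length n; probs maps letters to probabilities."""
    letters, weights = zip(*probs.items())
    return "".join(random.choices(letters, weights=weights, k=n))

def null_scores(n, k, probs, reps=1000):
    """Simulate independent pairs of sequences and collect the three scores."""
    out = {"D2": [], "D2S": [], "D2*": []}
    for _ in range(reps):
        d2, d2s, d2star = d2_statistics(random_seq(n, probs), random_seq(n, probs), k)
        out["D2"].append(d2)
        out["D2S"].append(d2s)
        out["D2*"].append(d2star)
    return out

gc_rich = {"a": 1/6, "t": 1/6, "c": 1/3, "g": 1/3}
scores = null_scores(n=400, k=4, probs=gc_rich)
for name, vals in scores.items():
    mu, sd = np.mean(vals), np.std(vals, ddof=1)
    # KS distance to a normal with estimated mean and sd; calibrated Lilliefors
    # p-values would instead need the Lilliefors distribution (e.g.,
    # statsmodels.stats.diagnostic.lilliefors).
    result = stats.kstest(vals, "norm", args=(mu, sd))
    print(name, round(result.statistic, 4), round(result.pvalue, 4))
```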

Table 1. Lilliefors tests in the nonuniform case: p-values for $D_2$, $D_2^*$, and $D_2^S$, by word size $k$ and sequence length $n = 2^j \times 100$, $j = 0, 1, \ldots, 8$.

3. POWER STUDIES

In Section 2, we studied the distributions of $D_2$, $D_2^*$, and $D_2^S$ under the null model that the two sequences are i.i.d. with the same distribution. In this section, we will study the power of detecting the relationship between the two sequences under two alternative models for their relationship. Note that, as a result of estimating the mean, the term $X_w - \bar n \hat p_w$ vanishes in the case $k = 1$ for $D_2^S$ and $D_2^*$. So we chose $k \ge 2$ for a fair comparison of our statistics.

First, we generate a pair of independent random sequences of length $n$ under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet $\mathcal{A} = \{a, c, g, t\}$. We consider the same two types of distributions on the letters as earlier: the uniform distribution ($p_a = p_c = p_g = p_t = \frac{1}{4}$) and a gc-rich, nonuniform distribution ($p_a = p_t = \frac{1}{6}$, $p_c = p_g = \frac{1}{3}$). For each $n = 2^j \times 10^2$, where $j = 0, 1, \ldots, 8$, and for each $k = 2, \ldots, 10$, we compute the scores for each pair of sequences for the various statistics, where $k$ is the word size for the count statistics. All results are based on a sample size of 10,000.

The first alternative model renders the two sequences dependent through a common motif ($W$) which is randomly distributed across the two sequences. The second alternative model is inspired by horizontal gene transfer: we randomly choose a certain number of fragments in the first sequence and then replace the corresponding fragments (position-wise) in the second sequence by the letters in the first sequence. Again, as a consequence, the two sequences are no longer independent.

Table 2. Lilliefors tests in the uniform case: p-values for $D_2$ and $D_2^S$, by word size $k$ and sequence length $n = 2^j \times 100$, $j = 0, 1, \ldots, 8$.

In more detail, the two models are chosen as follows:

The common motif model: A motif of length $L = 5$ is chosen, say $W = agcca$. Next, Bernoulli random variables $Z_1, Z_2, \ldots$, with $P(Z_i = 1) = \gamma$, are generated for $i = 1, 2, \ldots, n - L + 1$. If $Z_i = 1$, we insert the word $W$ in place of $A_i A_{i+1} \cdots A_{i+L-1}$ in sequence 1. We avoid overlap by moving on to $Z_{i+L}$ whenever $Z_i = 1$. We repeat the process for sequence 2. The scores of the various statistics are then computed using the newly generated pair of sequences.

The pattern transfer model: We first choose $L = 5$ as the length of the segment to be transferred from sequence 1 to sequence 2. Again, Bernoulli random variables $Z_1, Z_2, \ldots$, with $P(Z_i = 1) = \gamma$, are generated for $i = 1, 2, \ldots, n - L + 1$. When $Z_i = 1$, we pick the $L$-word $A_i A_{i+1} \cdots A_{i+L-1}$ in sequence 1 and replace $B_i B_{i+1} \cdots B_{i+L-1}$ in sequence 2 with it. Again, we disallow overlaps. For this model, we compute the scores of all the statistics using sequence 1 and the new sequence 2.

The procedure described above is repeated 10,000 times, and the statistics are calculated to yield the empirical distributions of the various statistics for each triplet $(k, n, \gamma)$. As $\gamma$ values for the Bernoulli variables we chose $\gamma = 0.001$, 0.005, 0.01, 0.05, and 0.1. For each statistic, we set a type I error level of $\alpha = 0.05$. Using the empirical distribution $S$ of the statistic under the null model, we find $s$ so that $P(S \ge s) = \alpha$. For a given $\gamma$ value, the power of the statistic is then estimated by the proportion of times the score under the alternative model exceeds $s$.

We now consider the power curves of $D_2$, $D_2^S$, and $D_2^*$ for both models, as well as a comparison between these statistics. For alternative model 1, Figure 1 shows that for $k = 2$, the power of $D_2$ is even smaller than 0.05, the type I error. Further, $k = 6$ has the best power. Figure 2 shows that $k = 4$ has the greatest power for $D_2^S$ under the first alternative model. For $D_2^*$, Figure 3 shows that $k = 5$ has the greatest power under the first alternative model, which corresponds to the length of the common motif which we assume relates the two sequences.
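The two alternative models can be sketched as follows. This is an illustrative rendering, not the authors' implementation: the function names are ours, Python's `random` module stands in for the Bernoulli draws, and the commented lines at the end only indicate how the functions would be combined.

```python
import random

MOTIF = "agcca"   # the common motif W of length L = 5

def plant_common_motif(seq, motif, gamma):
    """Common motif model: scan positions left to right; with probability gamma,
    overwrite the next L letters with the motif and skip ahead to avoid overlap."""
    L = len(motif)
    s = list(seq)
    i = 0
    while i <= len(s) - L:
        if random.random() < gamma:   # Bernoulli(gamma) variable Z_i
            s[i:i + L] = motif
            i += L                    # move on to Z_{i+L}: no overlapping insertions
        else:
            i += 1
    return "".join(s)

def pattern_transfer(seq_a, seq_b, L, gamma):
    """Pattern transfer model: with probability gamma, copy the L-word starting at
    position i of sequence 1 over the corresponding positions of sequence 2."""
    b = list(seq_b)
    i = 0
    while i <= min(len(seq_a), len(b)) - L:
        if random.random() < gamma:
            b[i:i + L] = seq_a[i:i + L]
            i += L                    # again, no overlaps
        else:
            i += 1
    return "".join(b)

# Alternative model 1: plant the motif independently in both sequences.
#   a1 = plant_common_motif(a, MOTIF, gamma=0.01)
#   b1 = plant_common_motif(b, MOTIF, gamma=0.01)
# Alternative model 2: keep sequence 1 and modify sequence 2.
#   b2 = pattern_transfer(a, b, L=5, gamma=0.01)
```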

FIG. 1. Alternative model 1: power curves (power versus sequence length) for $D_2$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.01$. For $k = 2$, the power of $D_2$ is smaller than 0.05, the type I error (indicated by the horizontal dashed line).

Turning to a comparison of the power of our various statistics, we find that $D_2^*$ has greater power than $D_2^S$ for each $k = 2, 4, 5, 6, 10$ (results not shown). We note that although in the uniform case $D_2$, $D_2^*$, and $D_2^z$ coincide up to multiplicative and additive constants, Figure 4 shows slight differences between $D_2^z$ and $D_2 = D_2^*$. These differences stem from using the estimated parameters instead of the true model parameters in the test statistic $D_2^z$.

FIG. 2. Alternative model 1: power curves (power versus sequence length) for $D_2^S$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.01$. Note: $k = 4$ has the greatest power.

FIG. 3. Alternative model 1: power curves (power versus sequence length) for $D_2^*$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.01$.

Figure 5 shows a typical scenario for alternative model 1, where both $D_2^S$ and $D_2^*$ have greater power than $D_2$ for given $k$ and $\gamma$, and the power increases as the length, $n$, of the sequences increases. Even for a small $\gamma$, we are able to notice the difference in the power of the various statistics. We also note that here $D_2^z$ has higher power than $D_2$.

For alternative model 2, the picture changes. Figure 6 shows that the power of $D_2$ is poor for $k = 2, 4, 5, 6$; but when increasing the parameter $k$ to 10, far beyond the length of the tuple which we transfer, the power increases dramatically.

FIG. 4. Alternative model 1: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the uniform distribution; $\gamma = 0.01$, $k = 5$. Note: for the uniform case, $D_2$ and $D_2^*$ differ by only a constant.

FIG. 5. Alternative model 1: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the gc-rich model; $k = 5$, $\gamma = 0.01$. Note: $D_2^S$, $D_2^z$, and $D_2^*$ all have greater power than $D_2$ for the given $k$ and $\gamma$.

In contrast, Figure 7 shows that for $D_2^S$ the power is moderate for all values of $k$ in the plot, and it does not show a marked increase with sequence length. Using $k = 10$ instead of $k = 6$ seems to decrease the power slightly. For $D_2^*$, Figure 8 shows that the power increases with $k$, and increasing sequence length slightly improves the power. For $k = 10$, the power approaches 1 for long sequences.

FIG. 6. Alternative model 2: power curves (power versus sequence length) for $D_2$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.05$. This graph suggests that the power increases with $k$.

FIG. 7. Alternative model 2: power curves (power versus sequence length) for $D_2^S$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.05$.

For alternative model 2, Figures 6-8 suggest that, under the gc-rich, nonuniform distribution, for $D_2$ and $D_2^S$ the greater the $k$ value, the greater the power, even if this comes with a higher computational cost. We note that for fixed $k$, $D_2^*$ has greater power than $D_2$. Moreover, $D_2^S$ has smaller power than $D_2$ for $k = 10$ and long sequences. Also, we need a larger $\gamma$ value to see the differentiation of the power between the various $k$ values for alternative model 2. This is due to the fact that in the first alternative model, a particular motif has a large contribution to the statistics; in the second model, however, the segment transferred from sequence 1 might be similar to the corresponding segment it replaces in sequence 2, and hence a greater $\gamma$ value is required before the sequences show similarity.

FIG. 8. Alternative model 2: power curves (power versus sequence length) for $D_2^*$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.05$.

Under alternative model 2, we find that for $k \le 9$ and $\gamma \le 0.05$, $D_2$ and $D_2^z$ actually show a decrease in power as $n$ increases, in certain intervals; this is illustrated in Figure 9. For $k = 10$, $D_2$ has higher power than $D_2^S$, but lower power than $D_2^*$; the higher power than $D_2^S$ comes at a great computational cost (results not shown).

FIG. 9. Alternative model 2: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the gc-rich distribution when $k = 5$, $\gamma = 0.05$. Note: for $k = 5$, $D_2$ has the least power, and its power actually decreases as $n$ increases.

FIG. 10. Alternative model 2: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the uniform distribution when $k = 5$, $\gamma = 0.05$.

Our findings suggest that $D_2$ is not desirable as a statistic for sequence comparison. We conjecture that this is due to the fact that $D_2$ is dominated by the normal components of the individual sequences and so is actually measuring the sum of the departures of each sequence from the background (Lippert et al., 2002), rather than the (dis)similarity between the two sequences. As $n$ increases, $D_2$ loses its detecting power even as the two sequences become more similar. As an aside, under the uniform distribution, in alternative model 2, all three statistics behave similarly for $k = 5$, as expected; see Figure 10.

4. USING $D_2^S$ TO TEST FOR SIMILARITY

Although in our simulations $D_2^*$ is more powerful than $D_2^S$, the statistic $D_2^S$ is still considerably more powerful than $D_2$. For tests that would result in small p-values, as required for multiple testing for example, simulating the empirical distribution of the test statistic under the null hypothesis can be time consuming. In contrast to $D_2^*$, the limiting distribution of $D_2^S$ is normal with mean zero, and hence testing is straightforward; only the standard deviation needs to be estimated.

To illustrate the procedure, for fixed $k$ and $n$, we generate 10,000 pairs of sequences in the gc-rich or uniform case and compute the $D_2^S$ score for each pair. The standard deviation of $D_2^S$ is then estimated from these empirical scores. Again, for fixed $k$ and $n$, we generate 2000 pairs of sequences under the null model of no relationship between the two sequences, in both the gc-rich and the uniform case, and we compute the $D_2^S$ score. Assuming asymptotic normality, we use a z-test to test the null hypothesis of no relationship, assuming mean zero and the estimated standard deviation. Then we generate 2000 pairs of sequences from alternative model 1, with motif insertion probabilities $\gamma = 0.001$, 0.005, 0.01, 0.05, 0.1. The $D_2^S$ statistic is computed, and we carry out a z-test based on the asymptotic normality of $D_2^S$. We repeat the procedure for the pattern transfer model, alternative model 2. We choose $k = 4, 5, 6$, because we know from our power simulations that $D_2^S$ works best when the motif length is around 5. We compare to the results that we obtain using the empirical distribution of $D_2^S$ instead, where the empirical distribution function is based on 10,000 samples. In addition, we use the empirical distribution of $D_2^S$ based on 100,000 samples.

Tables 3 and 4 show the estimated type 1 and type 2 error rates in the gc-rich and uniform cases; recall that the type 1 error is the probability of rejecting the null hypothesis although it is true. The type 2 error, the probability of accepting the null hypothesis although it is false, is estimated under alternative model 1 and under alternative model 2, with motif insertion probability and pattern transfer probability $\gamma$ taking on the values 0.005, 0.01, 0.05, 0.1. For each $n$ and $k$, the first row gives the estimates from the z-test (abbreviated as z), and the second row gives the estimates from the empirical distribution function (abbreviated as e). Except for the puzzling case when $n = 3200$ with $k = 6$, the results are remarkably similar, and there is no clear advantage to using the empirical distribution function when it is based on a relatively small number of samples. The general observation is that the normal approximation for $D_2^S$ gives a fast method for assessing statistical significance.
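As a sketch, the z-test described in this section might be carried out as follows, reusing `random_seq` and `d2_statistics` from the earlier sketches; the one-sided rejection rule and all function names are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm
# reuses random_seq and d2_statistics from the earlier sketches

def estimate_null_sd(n, k, probs, reps=10_000):
    """Estimate sd(D2S) under the null model of two independent i.i.d. sequences."""
    vals = [d2_statistics(random_seq(n, probs), random_seq(n, probs), k)[1]
            for _ in range(reps)]
    return float(np.std(vals, ddof=1))

def d2s_z_test(seq_a, seq_b, k, null_sd):
    """z-test of 'no relationship': compare D2S with a mean-zero normal whose
    standard deviation was estimated under the null model."""
    d2s = d2_statistics(seq_a, seq_b, k)[1]
    z = d2s / null_sd
    return z, norm.sf(z)   # one-sided: large positive D2S indicates relatedness

# gc_rich = {"a": 1/6, "t": 1/6, "c": 1/3, "g": 1/3}
# sd = estimate_null_sd(n=1600, k=5, probs=gc_rich, reps=2000)
# z, pval = d2s_z_test(seq1, seq2, k=5, null_sd=sd)
```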
5. PHASE TRANSITION

In this section, we explore the effect of small deviations from the uniform distribution for the $D_2$ statistic only. We restrict attention to the alphabet $\mathcal{A} = \{a, c, g, t\}$ and word size $k = 1$; both sequences are of the same length $n$. Again $p_a$ denotes the probability of the letter $a$. Then the standardized counts

$Z^X_a = \frac{X_a - n p_a}{\sqrt{n p_a (1 - p_a)}} \quad \text{and} \quad Z^Y_a = \frac{Y_a - n p_a}{\sqrt{n p_a (1 - p_a)}}$

both tend to standard normal variables when $n$ tends to infinity. With this notation, and noting that $E(D_2) = n^2 \sum_{a \in \mathcal{A}} p_a^2$, we obtain

$\frac{D_2 - E(D_2)}{n} = \sum_{a \in \mathcal{A}} p_a (1 - p_a)\, Z^X_a Z^Y_a + \sqrt{n} \sum_{a \in \mathcal{A}} p_a \sqrt{p_a (1 - p_a)}\, \big(Z^X_a + Z^Y_a\big). \quad (7)$

Table 3. The estimated type 1 and type 2 error rates when applying the z-test using the estimated variance, based on 2000 samples, for various sequence lengths and $k = 4, 5, 6$. M1/M2 refers to alternative model 1/2; g1/g2/g3/g4 refers to the cases $\gamma = 0.005$, 0.01, 0.05, 0.1, respectively, where $\gamma$ is the parameter of the Bernoulli random variables; so M1g3 means alternative model 1 with $\gamma = 0.05$. For each $n$ and $k$, the first row gives the estimates from the z-test (abbreviated as z), the second row gives the estimates from the empirical distribution function (abbreviated as e), both based on 10,000 samples; the third row, abbreviated as t, gives the estimates from the empirical distribution function based on 100,000 samples.

When the distribution on the alphabet is uniform, $(p_a, p_c, p_g, p_t) = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})$, the second term in Equation (7) vanishes,

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} Z^X_a Z^Y_a,$

and so it is asymptotically nonnormal (in fact, a sum of products of standard normal variables). In the situation where $(p_a, p_c, p_g, p_t) \neq (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})$ and the letter probabilities do not depend on $n$, the second term in Equation (7) dominates the first term; as

$\frac{D_2 - E(D_2)}{n} \approx \sqrt{n} \sum_{a \in \mathcal{A}} p_a \sqrt{p_a (1 - p_a)}\, \big(Z^X_a + Z^Y_a\big),$

the scaled limit is a normal distribution.

Table 4. The estimated type 1 and type 2 error rates when applying the z-test using the estimated variance, based on 2000 samples, but for the uniform case.

Next we assume that $(p_a(n), p_g(n), p_c(n), p_t(n))$ changes with $n$ in such a way that there exists a function $f(n) \to 0$ and constants $C_l$, $l = a, g, c, t$, satisfying $C_a + C_g + C_c + C_t = 0$, such that

$\lim_{n \to \infty} f^{-1}(n)\, \big(p_l(n) - 1/4\big) = C_l$

for each letter $l \in \mathcal{A}$. Then

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} \Big(1 + \tfrac{8}{3} C_a f(n) - \tfrac{16}{3} C_a^2 f^2(n) + c(n)\Big) Z^X_a Z^Y_a + \frac{\sqrt{3n}}{4} f(n) \Big( \sum_{a \in \mathcal{A}} C_a \big(Z^X_a + Z^Y_a\big) + \Theta(f(n)) \Big), \quad (8)$

where $c(n) \to 0$ as $n \to \infty$ and $\Theta(f(n))$ indicates a term that has the same order as $f(n)$.

Table 5. The ratio $R_n$ of the coefficient of the first term over the coefficient of the second term in Equation (8), for different values of $n$ and $\varepsilon$.

Let $f(n) = 1/n^{0.5 + \varepsilon}$. When $\varepsilon < 0$, the second term in (8) will dominate and $\frac{D_2 - E(D_2)}{n}$ will tend to a normal distribution. When $\varepsilon > 0$, the first term dominates and $\frac{D_2 - E(D_2)}{n}$ will tend to be nonnormal. Thus, we expect a phase transition from normal to nonnormal as $\varepsilon$ changes from negative to positive. Intuitively, the ratio $R_n$ of the coefficient of the first term over the coefficient of the second term in Equation (8) can be thought of as a ratio of dominance;

$R_n = \frac{3/16}{\frac{\sqrt{3n}}{4}\, f(n)} = \frac{\sqrt{3}}{4}\, n^{\varepsilon}.$

Table 5 shows the decrease of $R_n$ for increasing $n$ (row labels) and decreasing $\varepsilon$ (column labels).

To run simulations in the vicinity of this phase transition, we consider two types of probability vectors for the alphabet $\mathcal{A}$ and $f(n) = 1/n^{0.5 + \varepsilon}$. The type I probability vector is chosen as $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4},\ \tfrac{1}{4}\big)$, giving $(C_a, C_c, C_g, C_t) = (1, -1, 0, 0)$. In the second scenario, type II, the probability vector perturbs all components, $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n)\big)$, so that $(C_a, C_c, C_g, C_t) = (1, -1, 1, -1)$.

Under the type I model, we can show that the variance of $\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a$ is approximately $16/3$. Thus, there exists $Z_n \to N(0, 1)$ as $n$ tends to infinity such that

$\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a \approx \frac{4}{\sqrt{3}}\, Z_n \quad \text{(type I)}.$

Equation (8) can then be rewritten as

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} \Big(1 + \tfrac{8}{3} C_a f(n) - \tfrac{16}{3} C_a^2 f^2(n) + c(n)\Big) Z^X_a Z^Y_a + \sqrt{n}\, f(n)\, \big(Z_n + \Theta(f(n))\big). \quad (9)$

Under the type II model, the variance of $\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a$ is approximately $32/3$, and

$\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a \approx 4 \sqrt{\tfrac{2}{3}}\, Z_n \quad \text{(type II)}.$

Hence, under the type II model, Equation (8) can be rewritten as

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} \Big(1 + \tfrac{8}{3} C_a f(n) - \tfrac{16}{3} C_a^2 f^2(n) + c(n)\Big) Z^X_a Z^Y_a + \sqrt{2n}\, f(n)\, \big(Z_n + \Theta(f(n))\big). \quad (10)$

The ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is $\sqrt{2}$-fold larger than that for Equation (10). Therefore, we expect that normality appears for relatively small absolute deviations from uniformity for the type II vector.

For given $n$ and $\varepsilon$, we generate sequences of length $n$ using both types of distribution vectors. The $D_2$ scores for word size $k = 1$ are then tabulated. We use a Kolmogorov-Smirnov test to test the hypothesis that $D_2$ is normally distributed, and the corresponding p-value is obtained.
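The near-uniform simulation just described might be sketched as follows, reusing `random_seq` from the earlier sketches. It is illustrative only: the sample mean and standard deviation are used in place of the theoretical mean and variance employed in the paper, the replicate count is reduced, and the $\varepsilon$ grid in the commented loop is hypothetical.

```python
import numpy as np
from scipy import stats
# reuses random_seq from the earlier sketches

def near_uniform_probs(n, eps, kind="I"):
    """Type I perturbs two letters by +/- f(n); type II perturbs all four."""
    f = n ** -(0.5 + eps)
    if kind == "I":
        return {"a": 0.25 + f, "c": 0.25 - f, "g": 0.25, "t": 0.25}
    return {"a": 0.25 + f, "c": 0.25 - f, "g": 0.25 + f, "t": 0.25 - f}

def d2_k1(seq_a, seq_b):
    """D2 for word size k = 1: sum over letters of the products of letter counts."""
    return sum(seq_a.count(x) * seq_b.count(x) for x in "acgt")

def d2_normality_pvalue(n, eps, kind, reps=2000):
    """KS test of normality for D2 at k = 1 under a near-uniform letter distribution
    (sample mean and sd are used here instead of the theoretical ones)."""
    p = near_uniform_probs(n, eps, kind)
    vals = [d2_k1(random_seq(n, p), random_seq(n, p)) for _ in range(reps)]
    mu, sd = np.mean(vals), np.std(vals, ddof=1)
    return stats.kstest(vals, "norm", args=(mu, sd)).pvalue

# for eps in (-0.25, -0.10, -0.05, 0.0, 0.05):     # illustrative grid
#     print(eps, d2_normality_pvalue(n=3200, eps=eps, kind="II"))
```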

Here again we use the theoretical mean and variance for the test. Table 6 gives the p-values for different values of $n$ and $\varepsilon$ under the two models; again we report only the first four digits.

Table 6. The p-values of the Kolmogorov-Smirnov test for testing the normality of $D_2$ for letter distributions which are close to uniform, with $f(n) = 1/n^{0.5 + \varepsilon}$; type I: $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4},\ \tfrac{1}{4}\big)$, type II: $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n)\big)$, for different values of $n$ and $\varepsilon$.

Table 6 indicates that, under the type II model, the distribution of $D_2$ is not significantly different from normality for $\varepsilon \le 0.05$ at moderate sequence lengths, while it is significantly different from normality for $\varepsilon = 0.05$ at the larger sequence lengths. Under the type I model, on the other hand, the distribution of $D_2$ is significantly different from normality for $\varepsilon = 0.05$ even at moderate sequence lengths. The simulation results are consistent with our intuition. As shown in Table 5, the ratio $R_n$ of the coefficient of the first term over that of the second term in Equation (8) is less than 0.1 when $\varepsilon \le -0.10$ for the larger sequence lengths considered; this can explain why normality of $D_2$ appears for $\varepsilon \le -0.10$ under both the type I and type II models. Further, as the ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is $\sqrt{2}$-fold larger than that for Equation (10), normality of $D_2$ begins to appear already for $\varepsilon < 0.05$ under the type II model.

6. DISCUSSION

The typically used $D_2$ statistic asymptotically ignores the joint word occurrences in the two sequences unless all letters are (almost) equally likely; in the latter scenario, a phase transition occurs. Hence, the $D_2$ statistic is neither robust nor informative under the normal regime. The main advantage of $D_2$ is that it is easy to compute.

The proposed $D_2^S$ statistic is also easy to compute, but it can be compared with a normal distribution to assess significance, and it performs well in a power study. The $D_2^*$ statistic is more powerful than $D_2^S$ in our simulation study and is also easy to compute, but its asymptotic distribution does not have a convenient form; instead, it would best be assessed using simulations, which are time consuming, as the tail of the distribution would need to be estimated. Our recommendation is to discard $D_2$, to use $D_2^S$ when computing time is limited, and ideally to use $D_2^*$ for sequence comparison based on k-tuple content.

Our results allow for a number of generalizations. The normal approximation for the word counts in each individual sequence does not assume that the underlying letter distribution is the same as in the other sequence. Hence, the normal approximation for $D_2^S$ also holds when the sequences do not follow the same underlying distribution on the letters. Huang (2002) gave a related normal approximation for one sequence in the more general situation that the sequence is generated by a homogeneous Markov chain. Kantorovitz et al. (2007b) already successfully adapted $D_2^z$ to the Markov case. Also for $D_2^S$, the generalization of our results to that setting should be straightforward; the error bounds would need to be adjusted.

Burden et al. (2008) generalized $D_2^z$ to allow for mismatches, on the four-letter alphabet $\{a, c, g, t\}$ under the Bernoulli model with $p_a = p_t$ and $p_c = p_g$; they called the set of all words which differ by at most $m$ letters from a word $w$ of length $k$ the $m$-neighborhood of $w$. The generalized statistic then counts the number of all $m$-neighborhood matches of all $k$-words between the two sequences. With our normal approximation for all word counts, $D_2^S$ could be generalized similarly to allow for a certain number of mismatches. The quality of the normal approximation will depend on the number $m$ of permitted mismatches.

We also indicate that more than two sequences could be compared in a similar fashion. Quine (1994) stated the result that if $X_1, \ldots, X_n$ are independent normal random variables with zero means and variances $\sigma_1^2, \ldots, \sigma_n^2$, then

$\frac{X_1 X_2 \cdots X_n}{\sqrt{\sum X_{i_1}^2 X_{i_2}^2 \cdots X_{i_{n-1}}^2}} \sim N\!\left(0,\ \frac{\sigma_1^2 \sigma_2^2 \cdots \sigma_n^2}{\big(\sum \sigma_{i_1} \sigma_{i_2} \cdots \sigma_{i_{n-1}}\big)^2}\right),$

where both sums are over all integers $1 \le i_1 < i_2 < \cdots < i_{n-1} \le n$ (Melnykov and Chen, 2007). This suggests an extension of the $D_2^S$ statistic for multiple sequence comparison, by taking the products of the individual word counts and standardizing as earlier; then a normal approximation is still valid. Similarly, we could extend $D_2^*$ as the sum, over all words, of the product of more than two standardized word counts. Springer and Thompson (1966) gave a formula for the density of the product of independent standard normals. Again, the covariance structure of the word counts within one sequence would make it advisable to assess the limiting distribution via simulation.

ACKNOWLEDGMENTS

G.R. was supported in part by EPSRC grant no. GR/R52183/01, and by BBSRC and EPSRC through OCISB. D.C. was supported by an Overseas Postdoctoral Fellowship from the National University of Singapore. F.S. was supported by NIH grant no. P50 HG and R21AG. M.S.W. was supported by NIH grant no. P50 HG and R21AG.

DISCLOSURE STATEMENT

No competing financial interests exist.

REFERENCES

Burden, C.J., Kantorovitz, M.R., and Wilson, S.R. 2008. Approximate word matches between two random sequences. Ann. Appl. Probab. 18.
Forêt, S., Kantorovitz, M., and Burden, C. 2006. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformat. 7(Suppl 5), S21.


Relating Graph to Matlab

Relating Graph to Matlab There are two related course documents on the web Probability and Statistics Review -should be read by people without statistics background and it is helpful as a review for those with prior statistics

More information

Bivariate Uniqueness in the Logistic Recursive Distributional Equation

Bivariate Uniqueness in the Logistic Recursive Distributional Equation Bivariate Uniqueness in the Logistic Recursive Distributional Equation Antar Bandyopadhyay Technical Report # 629 University of California Department of Statistics 367 Evans Hall # 3860 Berkeley CA 94720-3860

More information

Supporting Information

Supporting Information Supporting Information Weghorn and Lässig 10.1073/pnas.1210887110 SI Text Null Distributions of Nucleosome Affinity and of Regulatory Site Content. Our inference of selection is based on a comparison of

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Introduction to Statistics

Introduction to Statistics MTH4106 Introduction to Statistics Notes 15 Spring 2013 Testing hypotheses about the mean Earlier, we saw how to test hypotheses about a proportion, using properties of the Binomial distribution It is

More information

GRE Quantitative Reasoning Practice Questions

GRE Quantitative Reasoning Practice Questions GRE Quantitative Reasoning Practice Questions y O x 7. The figure above shows the graph of the function f in the xy-plane. What is the value of f (f( ))? A B C 0 D E Explanation Note that to find f (f(

More information

2.1 Elementary probability; random sampling

2.1 Elementary probability; random sampling Chapter 2 Probability Theory Chapter 2 outlines the probability theory necessary to understand this text. It is meant as a refresher for students who need review and as a reference for concepts and theorems

More information

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST TIAN ZHENG, SHAW-HWA LO DEPARTMENT OF STATISTICS, COLUMBIA UNIVERSITY Abstract. In

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

A central limit theorem for an omnibus embedding of random dot product graphs

A central limit theorem for an omnibus embedding of random dot product graphs A central limit theorem for an omnibus embedding of random dot product graphs Keith Levin 1 with Avanti Athreya 2, Minh Tang 2, Vince Lyzinski 3 and Carey E. Priebe 2 1 University of Michigan, 2 Johns

More information

Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations

Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations D. R. Wilkins Academic Year 1996-7 1 Number Systems and Matrix Algebra Integers The whole numbers 0, ±1, ±2, ±3, ±4,...

More information

A nonparametric test for path dependence in discrete panel data

A nonparametric test for path dependence in discrete panel data A nonparametric test for path dependence in discrete panel data Maximilian Kasy Department of Economics, University of California - Los Angeles, 8283 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095,

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

Figure 10.1: Recording when the event E occurs

Figure 10.1: Recording when the event E occurs 10 Poisson Processes Let T R be an interval. A family of random variables {X(t) ; t T} is called a continuous time stochastic process. We often consider T = [0, 1] and T = [0, ). As X(t) is a random variable

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Practical Applications and Properties of the Exponentially. Modified Gaussian (EMG) Distribution. A Thesis. Submitted to the Faculty

Practical Applications and Properties of the Exponentially. Modified Gaussian (EMG) Distribution. A Thesis. Submitted to the Faculty Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution A Thesis Submitted to the Faculty of Drexel University by Scott Haney in partial fulfillment of the requirements

More information

Short cycles in random regular graphs

Short cycles in random regular graphs Short cycles in random regular graphs Brendan D. McKay Department of Computer Science Australian National University Canberra ACT 0200, Australia bdm@cs.anu.ed.au Nicholas C. Wormald and Beata Wysocka

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Appendix A Numerical Tables

Appendix A Numerical Tables Appendix A Numerical Tables A.1 The Gaussian Distribution The Gaussian distribution (Eq. 2.31)isdefinedas 1 f (x)= 2πσ 2 e (x μ) 2 2σ 2. (A.1) The maximum value is obtained at x = μ, and the value of (x

More information

Stochastic processes and Markov chains (part II)

Stochastic processes and Markov chains (part II) Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Amsterdam, The

More information

NAG Library Chapter Introduction. G08 Nonparametric Statistics

NAG Library Chapter Introduction. G08 Nonparametric Statistics NAG Library Chapter Introduction G08 Nonparametric Statistics Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 2.1 Parametric and Nonparametric Hypothesis Testing... 2 2.2 Types

More information

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Daisuke Ikeda Department of Informatics, Kyushu University 744 Moto-oka, Fukuoka 819-0395, Japan. daisuke@inf.kyushu-u.ac.jp

More information

CONSTRAINED PERCOLATION ON Z 2

CONSTRAINED PERCOLATION ON Z 2 CONSTRAINED PERCOLATION ON Z 2 ZHONGYANG LI Abstract. We study a constrained percolation process on Z 2, and prove the almost sure nonexistence of infinite clusters and contours for a large class of probability

More information

#A69 INTEGERS 13 (2013) OPTIMAL PRIMITIVE SETS WITH RESTRICTED PRIMES

#A69 INTEGERS 13 (2013) OPTIMAL PRIMITIVE SETS WITH RESTRICTED PRIMES #A69 INTEGERS 3 (203) OPTIMAL PRIMITIVE SETS WITH RESTRICTED PRIMES William D. Banks Department of Mathematics, University of Missouri, Columbia, Missouri bankswd@missouri.edu Greg Martin Department of

More information

The number of distributions used in this book is small, basically the binomial and Poisson distributions, and some variations on them.

The number of distributions used in this book is small, basically the binomial and Poisson distributions, and some variations on them. Chapter 2 Statistics In the present chapter, I will briefly review some statistical distributions that are used often in this book. I will also discuss some statistical techniques that are important in

More information

Group, Rings, and Fields Rahul Pandharipande. I. Sets Let S be a set. The Cartesian product S S is the set of ordered pairs of elements of S,

Group, Rings, and Fields Rahul Pandharipande. I. Sets Let S be a set. The Cartesian product S S is the set of ordered pairs of elements of S, Group, Rings, and Fields Rahul Pandharipande I. Sets Let S be a set. The Cartesian product S S is the set of ordered pairs of elements of S, A binary operation φ is a function, S S = {(x, y) x, y S}. φ

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Polymers in a slabs and slits with attracting walls

Polymers in a slabs and slits with attracting walls Polymers in a slabs and slits with attracting walls Aleks Richard Martin Enzo Orlandini Thomas Prellberg Andrew Rechnitzer Buks van Rensburg Stu Whittington The Universities of Melbourne, Toronto and Padua

More information

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2 EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME Xavier Mestre, Pascal Vallet 2 Centre Tecnològic de Telecomunicacions de Catalunya, Castelldefels, Barcelona (Spain) 2 Institut

More information

THE SIMPLE URN PROCESS AND THE STOCHASTIC APPROXIMATION OF ITS BEHAVIOR

THE SIMPLE URN PROCESS AND THE STOCHASTIC APPROXIMATION OF ITS BEHAVIOR THE SIMPLE URN PROCESS AND THE STOCHASTIC APPROXIMATION OF ITS BEHAVIOR MICHAEL KANE As a final project for STAT 637 (Deterministic and Stochastic Optimization) the simple urn model is studied, with special

More information

Chapter 5. Means and Variances

Chapter 5. Means and Variances 1 Chapter 5 Means and Variances Our discussion of probability has taken us from a simple classical view of counting successes relative to total outcomes and has brought us to the idea of a probability

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Introduction to Metalogic

Introduction to Metalogic Philosophy 135 Spring 2008 Tony Martin Introduction to Metalogic 1 The semantics of sentential logic. The language L of sentential logic. Symbols of L: Remarks: (i) sentence letters p 0, p 1, p 2,... (ii)

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

The Minesweeper game: Percolation and Complexity

The Minesweeper game: Percolation and Complexity The Minesweeper game: Percolation and Complexity Elchanan Mossel Hebrew University of Jerusalem and Microsoft Research March 15, 2002 Abstract We study a model motivated by the minesweeper game In this

More information

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321 Lecture 11: Introduction to Markov Chains Copyright G. Caire (Sample Lectures) 321 Discrete-time random processes A sequence of RVs indexed by a variable n 2 {0, 1, 2,...} forms a discretetime random process

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

2 DISCRETE-TIME MARKOV CHAINS

2 DISCRETE-TIME MARKOV CHAINS 1 2 DISCRETE-TIME MARKOV CHAINS 21 FUNDAMENTAL DEFINITIONS AND PROPERTIES From now on we will consider processes with a countable or finite state space S {0, 1, 2, } Definition 1 A discrete-time discrete-state

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part I) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Jumping Sequences. Steve Butler Department of Mathematics University of California, Los Angeles Los Angeles, CA

Jumping Sequences. Steve Butler Department of Mathematics University of California, Los Angeles Los Angeles, CA 1 2 3 47 6 23 11 Journal of Integer Sequences, Vol. 11 (2008), Article 08.4.5 Jumping Sequences Steve Butler Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095 butler@math.ucla.edu

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Testing for Regime Switching in Singaporean Business Cycles

Testing for Regime Switching in Singaporean Business Cycles Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

The Behavior of Multivariate Maxima of Moving Maxima Processes

The Behavior of Multivariate Maxima of Moving Maxima Processes The Behavior of Multivariate Maxima of Moving Maxima Processes Zhengjun Zhang Department of Mathematics Washington University Saint Louis, MO 6313-4899 USA Richard L. Smith Department of Statistics University

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape UG3 Computability and Intractability (2009-2010): Note 4 4. Bells and whistles. In defining a formal model of computation we inevitably make a number of essentially arbitrary design decisions. These decisions

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

Strongly chordal and chordal bipartite graphs are sandwich monotone

Strongly chordal and chordal bipartite graphs are sandwich monotone Strongly chordal and chordal bipartite graphs are sandwich monotone Pinar Heggernes Federico Mancini Charis Papadopoulos R. Sritharan Abstract A graph class is sandwich monotone if, for every pair of its

More information