On the length of the longest exact position. match in a random sequence

Size: px

Start display at page:

Download "On the length of the longest exact position. match in a random sequence"

Neil Holmes
5 years ago
Views:

1 On the length of the longest exact position match in a random sequence G. Reinert and Michael S. Waterman Abstract A mixed Poisson approximation and a Poisson approximation for the length of the longest exact match of a random sequence across another sequence are provided, where the match is required to start at position 1 in the first sequence. This problem arises when looking for suitable anchors in whole genome alignments. AMS 2000 subject classifications. 62E17, 92D20. Key words and phrases: Poisson approximation, mixed Poisson approximation, length of longest match, Chen-Stein method Corresponding author. Department of Statistics, University of Oxford, 1 South Parks Road, Oxford OX1 3TG, UK; reinert@stats.ox.ac.uk; phone ; fax GR was supported in part by EPSRC grant no. GR/R52183/01. Molecular and Computational Biology, University of Southern California, 835 W 37th Street, SHS 172, Los Angeles, California , USA; msw@usc.edu; phone ; fax MSW was supported by NIH grant no. P50 HG

2 1 Introduction When aligning whole genomes, often a seed-and-extend technique is used. Starting from exact or near-exact matches, reliable ones among these matches are selected as anchors, and then the remaining stretches are filled in using local and global alignment. See Lippert et al. [1] for a discussion of genome alignment methods using anchors. To select a match that is both sensitive and specific, [1] introduce a score based on the length, R n, of the longest exact match of a random sequence across another sequence, where shifts are not allowed. For R n and the associated scores, [1] find that their approach based on a mixed Poisson approximation, although valid, is computationally not feasible if the distribution of the random letters making up the random sequences is not uniform, as the mixing takes place over too many terms; the authors resort to a Monte Carlo method. Here we provide a Poisson approximation for the number of matches of fixed length, along with bounds provided by the Chen-Stein method, and we obtain an approximate expression for the cumulative distribution function of R n that is easy to compute. The bound on the error in the approximation turns out to be small, thus making our suggestion a useful approach. The set-up for our problem is as follows. Let A = A 1 A 2... A n and B = B 1 B 2... B n be two independent sequences with i.i.d. letters from a finite alphabet A with d elements. Let π(a) be the probability that a random letter takes on the letter a, and let π = max a A π(a) be the maximum of these probabilities. The letter distribution is not necessarily uniform. We put R n = max {A k = B j+k, k = 1,..., m, for some 0 j n m}; m thus R n denotes the length of the longest exact match of a random sequence across another 2

3 sequence, where shifts are not allowed. Note that if the match in sequence A was not required to start at position 1, the problem would reduce to the distribution of the well understood H n = max {A i+k = B j+k, k = 1,..., m, for some 0 i, j n m}, m see Waterman [3]. Our problem differs from the study of H n by requiring an exact match beginning at a fixed position in the first sequence. To reveal the Poisson-type structure in the problem, we use a standard duality argument as follows. If R n < m then there are no matches of length m (or longer) in the sequence. Ignoring end effects, this means that there are no occurrences of A 1... A m in B. Let W m denote the number of (clumps of) matches of length m (or longer) in the sequence, so that P (R n < m) P (W m = 0). In Section 2 we shall give a mixed Poisson approximation for P (W m = 0). Section 3 derives the Poisson approximation for P (W m = 0) and applies it to obtain an approximation, with bound, for P (R n < m). Finally, in Section 4 we illustrate that the approximation for P (R n < m) is indeed easily computatble. 2 A mixed Poisson approximation For Poisson and mixed Poisson approximation it is useful to think in terms of clumps of occurrences, see Reinert et al. [2], because de-clumping disentangles the dependence arising from self-overlap of words. We say that a clump of a word = m starts at position i in B if there is an occurrence of at position i, and there is no (overlapping) occurrence of at positions i m + 1,..., i 1. 3

4 Thus when ignoring end effects the study of R n is equivalent to the study of W m = n i=1 1(a clump of A 1... A m starts at position i in B), where we abbreviate n = n m + 1. End effects only arise from the possibility that, when embedded in an infinite sequence, the sequence B = B 1 B 2... B n starts within a clump in the infinite sequence. Assume that B = B 1 B 0 B 1 B n B n+1 is an infinite sequence for now, so that we can ignore end effects. Then we have R n < m W m = 0. If m is large enough, then a fixed word of length m will rarely occur at a given position i in the random sequence B. When using clumps in order to account for the strong dependence between neighbouring occurrences in the case that has a large amount of self-overlap, it is plausible and indeed established that the number of clumps of in B is approximately Poisson distributed, Proposition 1 below. For any fixed, we let n W m () = 1(a clump of starts at position i in B). i=1 In what follows we shall always assume that = w 1 w m A m, so that m µ() = π(w i ) i=1 is the probability of a random word of length m equals. If there is a p such that w i = w i+p, i = 1,..., m p, then p is called a period of. A period is a principal period if it is not a strict multiple of the minimal period. An occurrence of starting at postion i is a clump if and only if for none of the periods p of, the truncated word (p) = w 1 w p 4

5 starts at position i p. It is easy to see that it suffices to consider all principal periods. The probability that a clump of starts at a given position in the sequence is then given by µ(w) = µ(w) µ(w (p) w), p P () where (p) = w 1 w p w 1 w m is the concatenated word, and P () is the set of principal periods of. In particular, EW = λ() := n µ(). To describe the distance between the distributions of non-negative integer valued random variables X and Y we use the total variation distance, defined by d T V (X, Y ) = sup P (X B) P (Y B). B {0,1,...} It will be convenient to abbreviate, for r = 1, 2, 3,..., π r = a A(π(a)) r, the probability that r random letters match. Corollary in [2] together with the independence of the letters immediately gives the following proposition. Proposition 1 Let Z() P o( λ()) be Poisson distributed with mean λ(). Then d TV ( L(Wm ()), P o( λ())) ) (n m + 1) µ(){(6m 5) µ() + 2(m 1)µ()}. Proposition 1 only counts the number of occurrences of a fixed word, whereas in our problem, the first m letters of the sequence A, namely A 1... A m, constitute a random word. Thus we need to condition on the words that A 1... A m take on, and using the rule of total probability, we obtain a mixed Poisson approximation. 5

6 Theorem 1 With the above notation, P (W m = 0) µ()p (P o( λ()) = 0) Rem 1 (8m 7) nπ m 3. Remark 1 Recalling that π = max a A π(a), we note that λ() n(π ) m. If we consider the regime that n(π ) m is approximately constant, with m fixed, then (π ) m = O(n 1 ), and using the bound π 3 (π ) 2 a A π(a) = (π ) 2, we obtain that Rem 1 = O(n 1 ), thus indicating that the bound in Theorem 1 is of useful order. Proof of Theorem 1. Writing out the different sequences that A 1 A 2... A m can take on, we have P (W m = 0) = µ()p (W m () = 0) = µ()p (P o( λ()) = 0) + µ()ɛ 1 (), where, by Proposition 1, µ()ɛ 1 n µ() µ(){(6m 5) µ() + 2(m 1)µ()} =: Rem 1. For Rem 1 we use that µ() µ() to bound Rem 1 (8m 7) n (µ()) 3. Now, if A 1 A m, B 1 B m, and C 1 C m are three independent random words, then (µ()) 3 = P (A 1 A m = )P (B 1 B m = )P (C 1 C m = ) = P (A 1 A m = B 1 B m = C 1 C m ) = (π 3 ) m using that the letters are independent, so that Rem 1 (8m 7) n(π 3 ) m, as claimed. 6

7 3 Poisson approximation to the mixed Poisson approximation Although Theorem 1 is valid, the probability µ()p (P o( λ()) = 0) is difficult to evaluate, the sum growing exponentially with alphabet size. As much of the computational difficulty lies in accounting for the different periods in all words A m, our idea is to approximate P (P o( λ()) = 0) by the simpler expression P (P o(λ()) = 0), where λ() := nµ(). Thus we ignore the period correction in the Poisson parameter. While this may much distort the limiting distribution for words with a large amount of self-overlap, there are not too many such words in A m ; indeed we provide a bound on the error in this approximation in the next theorem. Recall that π = max a A π(a). Theorem 2 For A m, let Z() have Poisson distribution with mean λ(), and let Z() have Poisson distribution with mean λ(). Abbreviate f = π2 2 π 3. Then where µ()p ( Z() = 0) µ()p (Z() = 0) = Rem 2, Rem 2 (1 e n(π ) m ) { (π 2 ) m f m+ m f + m 2 (π ) m }. (1) Here, x denotes the integer part of x. Remark 2 In the regime considered in Remark 1, that n(π ) m is approximately constant it follows that Rem 2 = O(n 1 ), indicating that the bound in Theorem 2 is usable. 7

8 Proof of Theorem 2. Firstly we wish to bound P ( Z() = 0) P (Z() = 0) = e λ() e λ(). By series expansion it is easy to see that for any 0 ν λ <, λ ( e λ ν 1 ) (λ ν) ( e λ 1 ) and direct manipulation yields e ν e λ (λ µ) 1 e λ. λ Applying this bound with λ = λ() and ν = λ(), we have for each fixed that Thus P ( Z() = 0) P (Z() = 0) 1 e λ() λ() n p P () µ(w (p) w). µ()p ( Z() = 0) µ()p (Z() = 0) = Rem 2, where Rem 2 µ() 1 e λ() λ() n p P () µ(w (p) w) m 1 (1 e n(π ) m ) 1(p P ())µ(w (p) w). (2) p=1 In the last step we used the uniform bound µ() (π ) m for all. To bound the sum p P () µ(w (p) w) we consider the cases that p m 2 and p m separately. 8

9 For p m + 1 we note that 2p + 1 m, and writing out the period yields 2 1(p P ())µ(w (p) w) = 1(p P ())P (A 1... A m+p = w (p) w) = 1(p P ())P (A 1... A m+p = w (p) w; A 1 = A p+1 = A 2p+1,..., A m p = A m = A m+p ; A m p+1 = A m+1,..., A p = A 2p ) P (A l = A l+p = A l+2p, l = 1,..., m p; A l = A l+p, l = m p + 1,..., p). But this probability can be expressed by the probability π 3 that three random letters match, and the probability π 2 that two random letters match. We have m p equations forcing the matching of three random letters each, and 2p m equations forcing the matching of two random letters each. As the letters are independent, the probabilities are easy to calculate; Thus with f = π2 2 π 3, which is less or equal to 1, P (A l = A l+p = A l+2p, l = 1,..., m p; A l = A l+p, l = m p + 1,..., p) ( ) ( ) = π m p 3 π 2p m π3 m π 2 p 2 = 2. π 2 π 3 m 1 1(p P ())µ(w (p) w) p= m 2 +1 m 1 p= m 2 +1 π m p 3 π 2p m 2 = (π 2 ) m f m+ m (3) 1 f For p m 2 we note that if a word has period p m 2, then the letters w p+1,..., w m are uniquely determined. Therefore any word can possess at most one prinicpal period p m 2. 9

10 Again spelling out the periodicity, we obtain that m 2 1(p P ())µ(w (p) w) p=1 = m 2 p=1 ( P A i = A i+p =... = A i+( m+p )p, i = 1,..., m(modp); p A j = A j+p =... = A j+( m p )p, j = m(modp) + 1,..., p ) m 2 ( ) p m(mod p) ( ) m(mod p) π m+p p π m+p. (4) p +1 p=1 Expression (4) can be bounded further by using that π r π π r 1 (π ) r 1, giving m 2 p=1 ( π m+p p ) p m(mod p) ( ) m(mod p) π m+p p +1 = m 2 p=1 m 2 p=1 (π ) (π m(mod p) m+p p (π ) m m p (π p m+p p ) p ) p (π ) m m. (5) 2 Summarizing we obtain from (3) and (5) that m 1 p=1 1(p P ())µ(w (p) w) (π 2 ) m f m+ m f + m 2 (π ) m. (6) Substituting in (2) gives the stated result. Remark 3 As π = max a A π(a), the bound µ(w (p) w) (π ) m+p is immediate. Using that any word of length m that has a principal period p m is completely determined by its 2 first p letters, instead of using (4), a quick and dirty bound is m 2 1(p P ())µ(w (p) w) p=1 m 2 p=1 d p (π ) m+p = (π ) m+1 d (π d) m 2 1. π d 1 As a A π a = 1, we have that π d 1. However, if the letter distribution is close to uniform, and if m is relatively large, then the above bound will be small. 10

11 Now we apply our results to the original problem, the cumulative distribution function of R n, the length of the longest exact position match. Corollary 1 For A m, as in Theorem 2 let Z() have Poisson distribution with mean λ(). Then P (R n < m) P (Z() = 0) Rem 3, A m where Rem 3 = Rem 1 + ( 1 + (m 1)(π ) m (1 e n(π ) m ) 1) Rem 2, with Rem 1 given in Theorem 1 and Rem 2 given in Theorem 2. Remark 4 In the regime that n(π ) m is approximately constant, we have already seen in Remark 1 and in Remark 2 that Rem 1 = O(n 1 ) and Rem 2 = O(n 1 ), and so also Rem 3 = O(n 1 ), providing a useful bound. Proof of Corollary 1. In view of Theorem 1 and Theorem 2, all that is required is to bound the end effects, resulting from B having been idealized as just a part of an infinite sequence when it came to counting clumps. To bound the end effects, note that (see, e.g. [2], Equation (6.4.10) ) P {1(R n > m) 1(W m = 0)} (m 1) µ()(µ() µ()) (m 1)(π ) m 1(p P ())µ( (p) ). p We now use (6), giving that P {1(R n > m) 1(W m = 0)} (m 1)(π ) m { (π 2 ) m 1 f m 2 +1 m 11 f 1 + (π ) m m 2 }.

12 Applying Theorem 1 and Theorem 2 finishes the proof. Remark 5 Lippert et al. [1] introduce as Z-score Z i,n = max {A i+k = A j+k, k = 0,..., m 1; 1 i j n}. m This is similar to R n but allows self-overlap. [1] show that the probability P { L i=1 1(Z i,n k)} that the scores Z i,n exceed k consecutively across L positions can be expressed by probabilities involving only R n, so Corollary 1 can be applied to approximate the distribution of the scores. 4 Numerical illustration A counting argument shows that µ()p (Z() = 0) = µ()e λ() is not as difficult to evaluate as µ()p ( Z() = 0), as follows. Let n a () denote the number of times that letter a A appears in. Then, as the letters are independent, µ() = (π(a)) na() a A and hence we obtain the multinomial expression µ()p (Z() = 0) (7) ( ) { } { m = π(a) na exp n } π(a) na. (n (na,a A):na {0,1,...,m}, a, a A) a A a A a A na=m While there does not appear to exist a simplifying expression in general, we note that (7) is a polynomial problem in m; indeed we only need to evaluate O(m d 1 ) summands instead of O(d m ) summands. As we consider d typically much smaller than m, this is a considerable reduction in complexity. 12

13 In particular, if A = {A, C, G, T } and if π A = π T, π C = π G, as may be reasonable to assume when considering both a DNA sequence and its reverse-complement, then denoting the base-pair probabilities by p = 2π A = 1 2π C, the expression (7) simplifies to a binomial expectation, µ()p (Z() = 0) = m k=0 ( ) m p k (1 p) m k exp { n2 m p k (1 p) m k}, k where k stands for the sum n A + n T. Example. Suppose as in [1] that n = , the estimated length of the human genome, NCBI build 28 and build 34, with alphabet A = {A, C, G, T } of size d = 4, and basecomposition in a non-repeat region estimated as p A = p T = 0.29 and p C = p G = 0.21, so that π = 0.29 and p = 2p A = Then, truncating after the first four digits, π 2 = , π2 2 = , π 3 = , f = , and we can calculate the mean of λ(a 1 A 2 A m ) using that Eλ(A 1 A 2 A m ) = n µ() 2 = n2 m m k=0 ( ) m p 2k (1 p) 2(m k). k Table 1 gives the expected Poisson parameter for m = 15,..., 22. m Eλ Table 1: Expected Poisson parameter Thus for m = 15 we would expect W 15 = 0, and hence R n < 15, with low probability, whereas for m = 22 we would expect W 22 = 0 with high probability, hence R n < 22 with high probability. 13

14 Table 2 below gives a summary of the estimated probability ρ(m) = Am P (Z() = 0) for P (R n m) 1 P (W m = 0) obtained in Corollary 1, for m = 15, 16,..., 22, along with the Monte-Carlo estimates ˆρ(m) from [1]; we note that Table 8 in [1] indeed gives estimates for P (R n m) instead of P (R n < m) as written ibid. We add our bound from Corollary 1 along with the estimated standard deviation terms contributing to our bound; recall that Rem 2 is given in (1). V arˆρ(m) from [1], and the separate remainder m ρ(m) ˆρ(m) bound V arˆρ(m) Rem 1 Rem e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e e-13 Table 2: Estimated probabilities, bounds, and remainder terms Our approximated probabilities are similar to the Monte-Carlo estimates in [1]. However, whereas [1] can only conclude that, say, an approximate 95% confidence interval for the true probability P (R n m) is given by ˆρ ± 1.96 V arˆρ(m), we indeed proved that the true probability will lie within ρ(m) ± bound, which is a shorter interval for all values of m considered in this example. 14

15 Also we see that both remainder terms Rem 1 and Rem 2 contribute in similar magnitude to the bound Rem 3, indicating that the bound on the error made in replacing the mixed Poisson approximation by the Poisson approximation is not much larger than the bound on the error made by the mixed Poisson approximation in the first place. References [1] Lippert, R.A., Zhao, X., Florea, L., Mobarry, C., and Istrail, S. (2004). Finding Anchors for Genomic Sequence Comparison. In Proceedings of the 8th Annual International Conference on Research in Computational Biology (RECOMB 04), ACM Press, Also in J. Comp. Biol. 12, (2005). [2] Reinert, G., Schbath, S., and Waterman, M.S. (2005). Statistics on words with applications to biological sequences. In Lothaire: Applied Combinatorics on Words J. Berstel and D. Perrin, eds., Cambridge University Press, [3] Waterman, M.S. (1995). Introduction to Computational Biology. Chapman and Hall. 15

The Distribution of Short Word Match Counts Between Markovian Sequences

The Distribution of Short Word Match Counts Between Markovian Sequences Conrad J. Burden 1, Paul Leopardi 1 and Sylvain Forêt 2 1 Mathematical Sciences Institute, Australian National University, Canberra,