
JOURNAL OF COMPUTATIONAL BIOLOGY, Volume 16, Number 12, 2009. © Mary Ann Liebert, Inc.

Research Articles

Alignment-Free Sequence Comparison (I): Statistics and Power

GESINE REINERT,1 DAVID CHEW,2 FENGZHU SUN,3,4 and MICHAEL S. WATERMAN3,4

1 Department of Statistics, University of Oxford, Oxford OX1 3TG, United Kingdom. 2 Department of Statistics and Applied Probability, National University of Singapore, Singapore. 3 Molecular and Computational Biology Program, University of Southern California, Los Angeles, California. 4 MOE Key Laboratory of Bioinformatics and Bioinformatics Division, TNLIST/Department of Automation, Tsinghua University, Beijing, P.R. China.

ABSTRACT

Large-scale comparison of the similarities between two biological sequences is a major issue in computational biology; a fast method, the $D_2$ statistic, relies on the comparison of the k-tuple content of both sequences. Although it has been known for some years that the $D_2$ statistic is not suitable for this task, as it tends to be dominated by single-sequence noise, to date no suitable adjustments have been proposed. In this article, we suggest two new variants of the word count statistic, which we call $D_2^S$ and $D_2^*$. For $D_2^S$, which is a self-standardized statistic, we show that the statistic is asymptotically normally distributed when the sequence lengths tend to infinity, and is not dominated by the noise in the individual sequences. The second statistic, $D_2^*$, outperforms $D_2^S$ in terms of power for detecting the relatedness between the two sequences in our examples; but although it is straightforward to simulate from the asymptotic distribution of $D_2^*$, we cannot provide a closed form for power calculations.

Key words: alignment-free, normal approximation, normal distribution, sequence alignment, word count statistics.

1. INTRODUCTION

Comparison of the similarities between two segments of biological sequences using k-tuples (also called k-grams or k-words) arises from the need for rapid sequence comparison. Such methods are often employed in cDNA sequence comparisons. Today, next-generation sequencing methods are producing unprecedented volumes of sequence data. Therefore, we expect that the use of k-tuples will play an increasingly important role for molecular sequence and genome comparisons in the current era. This article will explore in some detail the $D_2$ statistic upon which one of these methods is based, along with other, substantially superior statistics.

One of the most widely used statistics for sequence comparison based on k-tuples is the so-called $D_2$ statistic, which is based on the joint k-tuple content of the two sequences. If two sequences are closely related, we would expect the k-tuple content of both sequences to be very similar.

More formally, suppose that two sequences, $A = A_1 A_2 \cdots A_n$ and $B = B_1 B_2 \cdots B_m$, say, are composed of letters that are drawn from a finite alphabet $\mathcal{A}$ of size $d$. For $a \in \mathcal{A}$, let $p_a$ denote the probability of letter $a$. For $w = (w_1, \ldots, w_k) \in \mathcal{A}^k$, let

$X_w = \sum_{i=1}^{\bar n} \mathbf{1}(A_i = w_1, \ldots, A_{i+k-1} = w_k)$

count the number of occurrences of $w$ in $A$, and similarly, $Y_w$ counts the number of occurrences of $w$ in $B$. Here, $\bar n = n - k + 1$; similarly, we put for later use $\bar m = m - k + 1$. Then $D_2$ is defined by

$D_2 = \sum_{w \in \mathcal{A}^k} X_w Y_w.$

The null model is typically chosen to be such that the letters are independent and identically distributed (i.i.d.), and that the two sequences are independent. Using this model, Lippert et al. (2002) derived a Poisson approximation, a compound Poisson approximation, and a normal approximation for $D_2$; the normal approximation is only valid under the assumption that not all letters of the alphabet are equally likely. In the case that all letters are equally likely, the $D_2$ statistic looks asymptotically like a sum of products of independent normal variables. Lippert et al. (2002) also found that the $D_2$ statistic is dominated by background noise in the nonuniform case. In the work of Kantorovitz et al. (2007a), it was shown that in the regime that all letters are equally likely, the standardized statistic

$D_2^z = \frac{D_2 - E(D_2)}{\mathrm{sd}(D_2)}$

is asymptotically normally distributed when first the sequence length and then the word length tend to infinity, while the alphabet size stays fixed. In clustering some biologically related sequences, Kantorovitz et al. (2007b) found that $D_2^z$ outperforms $D_2$. The heuristic argument is that the background models for the two sequences may be different, and the statistic should hence be normalized to account for the different background distributions of the sequences. Yet in the nonuniform case the issue of the variability being dominated by the noise in the single sequences remains.

In this article, we propose a new statistic, $D_2^S$, which is a self-standardized version of $D_2$. In general, Shepp (1964) observed that, if $X$ and $Y$ are independent mean-zero normals, $X$ with variance $\sigma_X^2$ and $Y$ with variance $\sigma_Y^2$, then $XY/\sqrt{X^2 + Y^2}$ is again normal, with variance $\sigma_X^2 \sigma_Y^2 / (\sigma_X + \sigma_Y)^2$. For $w = w_1 \cdots w_k$, $p_w = \prod_{i=1}^k p_{w_i}$ is the probability of occurrence of $w$, and the centered count variables are denoted by

$\tilde X_w = X_w - \bar n p_w \quad \text{and} \quad \tilde Y_w = Y_w - \bar m p_w.$

We introduce the new count statistic

$D_2^S = \sum_{w \in \mathcal{A}^k} \frac{\tilde X_w \tilde Y_w}{\sqrt{\tilde X_w^2 + \tilde Y_w^2}}. \quad (1)$

Here we set $0/0 = 0$. The superscript $S$ stands for Shepp, and also for self-standardized. We shall see that, under reasonable assumptions, $D_2^S$ will be approximately normally distributed. In practice we shall usually have to replace $p_a$, the (unobserved) letter probabilities, by $\hat p_a$, the relative count of letter $a$ in the concatenation of the two sequences, based on the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution. We then estimate the probability of occurrence of $w = w_1 \cdots w_k$ by $\hat p_w = \prod_{i=1}^k \hat p_{w_i}$. In our simulations, we always estimate the letter probabilities, even when we assume that all letters are equally likely. We also study the following version of the word count statistic:

$D_2^* = \sum_{w \in \mathcal{A}^k} \frac{\tilde X_w \tilde Y_w}{\sqrt{\bar n \bar m}\, \hat p_w}, \quad (2)$

which in our simulations outperforms not only $D_2$ but also $D_2^S$ in terms of power for detecting the relatedness between the two sequences.
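As a concrete illustration of the definitions above, the following is a minimal sketch of how $D_2$, $D_2^S$, and $D_2^*$ might be computed for two DNA sequences, with the letter probabilities estimated from the concatenation of the two sequences. It assumes lower-case strings over {a, c, g, t}; the function names and the dictionary-based word counting are our own illustrative choices, not taken from the authors' code.

```python
from itertools import product
from math import sqrt

DNA = "acgt"

def word_counts(seq, k):
    """Count occurrences of every k-word over the n - k + 1 positions of seq."""
    counts = {"".join(w): 0 for w in product(DNA, repeat=k)}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    return counts

def letter_probs(seq_a, seq_b):
    """Estimate letter probabilities from the concatenation of both sequences."""
    joint = seq_a + seq_b
    return {a: joint.count(a) / len(joint) for a in DNA}

def d2_statistics(seq_a, seq_b, k):
    """Return (D2, D2S, D2*) for two sequences over the alphabet {a, c, g, t}."""
    X = word_counts(seq_a, k)
    Y = word_counts(seq_b, k)
    p = letter_probs(seq_a, seq_b)
    n_bar = len(seq_a) - k + 1
    m_bar = len(seq_b) - k + 1
    d2 = d2s = d2star = 0.0
    for w in X:
        pw = 1.0
        for ch in w:
            pw *= p[ch]
        xt = X[w] - n_bar * pw          # centered count, Equation (1)
        yt = Y[w] - m_bar * pw
        d2 += X[w] * Y[w]               # D2
        denom = sqrt(xt * xt + yt * yt)
        if denom > 0:                   # convention 0/0 = 0
            d2s += xt * yt / denom      # D2S, Equation (1)
        if pw > 0:
            d2star += xt * yt / (sqrt(n_bar * m_bar) * pw)   # D2*, Equation (2)
    return d2, d2s, d2star
```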
This statistic comes about by considering $\sum_{w} \tilde X_w \tilde Y_w \big/ \sqrt{\widehat{\mathrm{Var}}(X_w)\, \widehat{\mathrm{Var}}(Y_w)}$; but as the variance is costly to compute, it is replaced by the estimated mean of the word occurrence across the two sequences when the probability of the word pattern is small. This approach can be justified by considering a Poisson approximation for the individual word counts.

We justify in Section 2 that $D_2^*$ can be viewed as a sum of products of independent normal variables, and we suggest how to simulate from its asymptotic distribution, for which we do not have a closed-form expression.

To explain the problem with $D_2$, write

$D_2 = \sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w + \bar n \sum_{w \in \mathcal{A}^k} p_w Y_w + \bar m \sum_{w \in \mathcal{A}^k} p_w X_w - \bar m \bar n \sum_{w \in \mathcal{A}^k} p_w^2. \quad (3)$

Approximately, if $n$ and $m$ are large, under the null model $\tilde X_w$ should follow a mean-zero normal distribution with variance of order $n$ with respect to sequence length. The distribution of $\sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w$ should be approximately, for large $n$ and $m$, the distribution of a sum of products of pairs of independent mean-zero normal variables, with a variance of order $O(nm)$. If all letters are equally likely, then $p_w = d^{-k}$ for all words $w$, and hence $\sum_{w \in \mathcal{A}^k} p_w X_w = d^{-k} \sum_{w \in \mathcal{A}^k} X_w = \bar n d^{-k}$, giving that $D_2 = \sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w + \bar m \bar n d^{-k}$. The variability in $D_2$ is then the same as the variability in $\sum_{w \in \mathcal{A}^k} \tilde X_w \tilde Y_w$, and indeed as in $D_2^*$. When not all letters are equally likely, as the variance of $X_w$ is of order $n$ and the variance of $Y_w$ is of order $m$, the variance of $\bar n \sum_{w \in \mathcal{A}^k} p_w Y_w$ is of order $O(n^2 m)$, and similarly the variance of $\bar m \sum_{w \in \mathcal{A}^k} p_w X_w$ is of order $O(n m^2)$. Hence, the variability in $D_2$ is dominated by the variability in $\sum_{w \in \mathcal{A}^k} p_w X_w$ and $\sum_{w \in \mathcal{A}^k} p_w Y_w$. Thus, in this case, the variability in $D_2$ is dominated by the terms that reflect the noise in the single sequences only.

The asymptotic normality of $D_2$ for both nucleic acid and amino acid sequences has been studied empirically by Forêt et al. (2009). However, no power study was undertaken; our argument shows that in the nonuniform case the asymptotic normality of $D_2$ stems only from the asymptotic normality of the underlying word counts in the respective sequences.

Even in the regime that all letters are equally likely, if we leave just the last word $\mathbf{d} = (d, \ldots, d)$ out, forming the statistic $\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} X_w Y_w$, then $\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} X_w = \bar n - X_{\mathbf{d}}$, which is not constant. So even if we just leave one word out of the whole set of possible words, if the sequence lengths are large and all other parameters are fixed, then the variability of this modified statistic will be dominated by the variability in the single sequences. Hence, the $D_2$ statistic is, in general, not useful for assessing whether the two underlying sequences are related.

This article is structured as follows. In Section 2, we discuss the distributions of $D_2^S$ and $D_2^*$ under the null hypothesis that the two sequences are independent and both are generated by i.i.d. letters from the same distribution, and we present simulation results for testing the normality of $D_2$, $D_2^S$, and $D_2^*$. In Section 3, we study the power of the statistics $D_2$, $D_2^S$, $D_2^*$, and $D_2^z$ under two alternative scenarios. The first scenario is that the two sequences contain a common motif, whereas the second scenario is a pattern transfer model: we pick a word in the first sequence and use it to replace a word in the second sequence. Our results illustrate not only the poor performance of $D_2$, but also the encouraging performances of $D_2^S$ and of $D_2^*$.
Section 4 illustrates how the asymptotic normality of $D_2^S$ gives a fast method for assessing statistical significance, as only its standard deviation has to be approximated and not the empirical distribution itself. There is a caveat: if the distribution on the alphabet is very close to uniform, and if some other conditions are satisfied which relate to having a large number of summands of products of pairs of independent normals, then $D_2$ will behave like a sum of products of normally distributed variables, similar to the uniform case; when the deviation from uniform increases, the asymptotic normality for $D_2$ holds. This phase transition is explored in Section 5. We summarize our results in Section 6, and we briefly indicate generalizations to Markov chain models as well as to multiple sequence comparisons. The proofs for Section 2 are presented in the Supplementary Material (see online Supplementary Material). The code for simulating from the distributions is available at www-rcf.usc.edu/~fsun/programs/d2/d2-all.html.

2. THE DISTRIBUTIONS OF $D_2^S$ AND $D_2^*$ UNDER THE NULL MODEL

Here the null model is that the letters are i.i.d. and the two sequences are independent. We assume, as in Huang (2002), that $k \ll \min(n, m)$. For $D_2$, Forêt et al. (2009) studied the empirical distribution via simulations, and they found that a gamma distribution outperforms the normal distribution in general. For longer sequences they showed that the normal approximation itself would be appropriate.

2.1. $D_2^S$ and asymptotic normality

First, we focus on the word counts in a single sequence. Let $\tilde{\mathbf{X}}^{-\mathbf{d}} = (\tilde X_w,\ w \in \mathcal{A}^k,\ w \neq \mathbf{d})$ be the vector of centered word counts with the last word $\mathbf{d} = (d, \ldots, d)$ left out; note that

$\tilde X_{\mathbf{d}} = -\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} \tilde X_w \quad (4)$

can be recovered from the set $\tilde{\mathbf{X}}^{-\mathbf{d}}$. Huang (2002) showed a multivariate normal approximation for the word count vector in a single sequence. The limiting covariance matrix $C$ needs some notation; see Section 12.1 in Waterman (1995), with results derived by Lundstrom (1990). For $w = w_1 w_2 \cdots w_k \in \mathcal{A}^k$ we define, for $j = 1, \ldots, k-1$,

$p_w(j) = p(w_{k-j+1})\, p(w_{k-j+2}) \cdots p(w_k),$

which is the probability that $w$ occurs in the overlapping position, given that $w_1 w_2 \cdots w_{k-j}$ has occurred. For words $u = u_1 u_2 \cdots u_k$ and $v = v_1 v_2 \cdots v_k$, the overlap indicator is defined as

$\xi_{u,v}(j) = \mathbf{1}(u_{j+1} = v_1, \ldots, u_k = v_{k-j}).$

This overlap indicator equals 1 if the last $k - j$ letters of $u$ overlap exactly the first $k - j$ letters of $v$. Then the approximating covariance matrix $C$ is given by

$C_{u,v} = p(u) \sum_{j=1}^{k-1} \xi_{u,v}(j)\, p_v(j) + p(v) \sum_{j=1}^{k-1} \xi_{v,u}(j)\, p_u(j) - (2k-1)\, p(u)\, p(v) + p(u)\, \mathbf{1}(u = v). \quad (5)$

A similar normal approximation is valid for $\tilde{\mathbf{Y}}^{-\mathbf{d}} = (\tilde Y_w,\ w \in \mathcal{A}^k,\ w \neq \mathbf{d})$. As we assume that $A$ and $B$ have the same letter probability distribution, both have the same limiting covariance matrix $C$. Thus, we obtain the following approximation for $D_2^S$ (for the proof and the precise bounds, see the online Supplementary Material). We use the abbreviation $\mathrm{MVN}(\mu, C)$ to denote a multivariate normal distribution with mean vector $\mu$ and covariance matrix $C$. Also, a function $h : \mathbb{R} \to \mathbb{R}$ is called Lipschitz with Lipschitz constant 1 if, for all real $x$ and $y$, $|h(x) - h(y)| \le |x - y|$.

Theorem 2.1. Assume $m \le n$ and $k^5 \le n^2$. Let $\mathbf{Z}_1 = (Z_{1,1}, \ldots, Z_{1, d^k - 1}) \sim \mathrm{MVN}(0, C)$ and $\mathbf{Z}_2 = (Z_{2,1}, \ldots, Z_{2, d^k - 1}) \sim \mathrm{MVN}(0, C)$ be two independent $(d^k - 1)$-dimensional normal vectors. In analogy to Equation (4), put, for $i = 1, 2$,

$Z_i(\mathbf{d}) = -\sum_{w \in \mathcal{A}^k,\, w \neq \mathbf{d}} Z_i(w).$

Let

$D_{\lim} = \sum_{w \in \mathcal{A}^k} \frac{Z_1(w)\, Z_2(w)}{\sqrt{Z_1(w)^2 + Z_2(w)^2}}.$

Then $D_{\lim}$ is mean-zero normally distributed, and, for any function $h$ which is bounded and Lipschitz with Lipschitz constant 1, as $n \to \infty$ with $d = d(n)$ and $k = k(d, n)$,

$\big| E[h(D_2^S)] - E[h(D_{\lim})] \big| = O\!\left(k^4 d^{5k}\, n^{-1/2}\right).$

The bound in Theorem 2.1 may not be optimal; indeed, it is based on a multivariate normal approximation for word counts, Corollary 6.1 (see the online Supplementary Material), which is of order $k^2 d^k (n^{-1/2} + m^{-1/2})$. The purpose of the bound is to illustrate the trade-off between alphabet size, word length, and sequence length. If $d$, the alphabet size, is very large, then even moderately long words will be rare unless the sequence is very long. Because of the complicated dependence, we were not able to give a closed-form expression for the variance of $D_{\lim}$. Theorem 2.1, however, justifies using a z-test for the null model, based on the statistic $D_2^S$, using the estimated standard deviation.

2.2. $D_2^*$ and the product of independent normals

The statistic $D_2^*$ given in Equation (2) is motivated by estimating the standardized counts

$X_w' = \frac{\tilde X_w}{\sqrt{\mathrm{Var}(\tilde X_w)}} \quad \text{and} \quad Y_w' = \frac{\tilde Y_w}{\sqrt{\mathrm{Var}(\tilde Y_w)}},$

approximating $\mathrm{Var}(\tilde X_w) = \bar n p_w (1 - p_w)$ by the mean $\bar n p_w$, with the argument that $1 - p_w$ will be close to 1 when $k$ is reasonably large and the word $w$ is relatively rare. From Corollary 6.1 (see the online Supplementary Material), we obtain a multivariate normal approximation for the standardized count vectors $\mathbf{X}' = (X_w';\ w \in \mathcal{A}^k \setminus \{\mathbf{d}\})$ and $\mathbf{Y}' = (Y_w';\ w \in \mathcal{A}^k \setminus \{\mathbf{d}\})$. Although the covariances within the vectors will not disappear, for each $w$, $X_w'$ and $Y_w'$ are independent and would be approximated by independent univariate standard normal variables. From Stuart and Ord (1987), we know the distribution of the product of two independent standard normal variables [see also Springer and Thompson (1966)].

Lemma 2.1. Let $X$ and $Y$ be two independent standard normal random variables. Then the product $W = XY$ has probability density

$f(w) = \frac{1}{\pi} K_0(|w|), \quad (6)$

where $K_0(x) = \int_0^\infty \cos(xt)\, (1 + t^2)^{-1/2}\, dt$ denotes the modified Bessel function of the third kind.

Thus, the distribution of each summand $X_w' Y_w'$ will approximately have density (6). The covariance structure will result in an approximation with a complicated distribution, which can, however, easily be assessed by simulation: draw many normal vectors with covariance matrix $C$ given in Equation (5), standardize, and take products.

2.3. The case that all letters are equally likely

In the case that all letters are equally likely, both Lippert et al. (2002) and Kantorovitz (2007) observed that $D_2$ will not follow a normal distribution. Kantorovitz et al. (2007a) showed that $D_2^z = \frac{D_2 - E(D_2)}{\sqrt{\mathrm{Var}(D_2)}}$ is asymptotically normal, however, when first the sequence length and then the word length tend to infinity, while the alphabet size stays fixed. However, when the word length is fixed, $D_2$, $D_2^z$, and $D_2^*$ may not tend to normal as the sequence length tends to infinity. Note that in the case that all letters are equally likely, $D_2$, $D_2^z$, and $D_2^*$ all agree up to constants.

2.4. Simulations

To illustrate the quality of the normal approximation, we generate a pair of independent random sequences of length $n$ under the null model, with i.i.d. letters.

Throughout we restrict ourselves to the alphabet $\mathcal{A} = \{a, c, g, t\}$. We consider two types of distributions on the letters: the uniform distribution ($p_a = p_c = p_g = p_t = \frac{1}{4}$) and a gc-rich, nonuniform distribution ($p_a = p_t = \frac{1}{6}$, $p_c = p_g = \frac{1}{3}$); the latter distribution is the same as that used in Lippert et al. (2002) and Forêt et al. (2006) to study $D_2$. Similarly to Lippert et al. (2002) and Forêt et al. (2006), for each $n = 2^j \times 10^2$, where $j = 0, 1, \ldots, 8$, and for each $k = 2, \ldots, 10$, we compute the scores for each pair of sequences for the various statistics, where $k$ is the word size for the count statistics. Forêt et al. (2006) found $k = 7$ to be the optimal tuple length for $n = 800$, 1600, and 3200; optimal in the sense that for this choice of $k$, the $D_2$ statistic will be closest to normal. All results are based on a sample size of 10,000; we use the same simulated sequences for all three scores, $D_2$, $D_2^*$, and $D_2^S$. As $D_2^z$ differs from $D_2$ only by an additive and a multiplicative constant, we do not include $D_2^z$ in these simulations.

We then use the Lilliefors test (Lilliefors, 1967) to assess whether the distributions are close to normal. The Lilliefors test is a modification of the Kolmogorov-Smirnov goodness-of-fit test, in which the sample mean and standard deviation are used as the mean and standard deviation of the theoretical limiting normal distribution. In contrast to the Kolmogorov-Smirnov test, statistical significance is based on the Lilliefors distribution; see also Forêt et al. (2009) for a discussion of why not to use an unmodified Kolmogorov-Smirnov test when the standard deviation is estimated. A p-value of less than 0.05 indicates that we would reject the null model at the 5% significance level. Under the null model, in 100 tests we would expect about five tests to result in a p-value of less than 5%. Precision is up to four decimal places; for easier readability, a value of 0.0000 is recorded simply as 0.

We will first discuss the nonuniform case, where asymptotic normality has been shown to hold for all three statistics $D_2$, $D_2^S$, and $D_2^*$ when first the sequence length and then the word length tend to infinity. For short words, only $D_2^S$ has been shown to be approximately normal when the sequence length tends to infinity. The regime of interest here is that words are not too rare; for long words, say $k > 2 \log_{1/p_2} n$ with $p_2 = \sum_{a \in \mathcal{A}} p_a^2$, a compound Poisson approximation is more appropriate (Lippert et al., 2002).

The nonuniform case. In the nonuniform case, for all three statistics, the larger the sequence length and the smaller $k$, the closer the distribution is to normality; the performance is rather different, though. Recall that the sequence length is $2^j \times 100$; for easier readability, we denote the $2^j \times 100$ column in the table just by the value of $j$. Table 1 summarizes the p-values of the Lilliefors tests in the nonuniform case for $D_2$, $D_2^*$, and $D_2^S$. For $D_2$, Table 1 shows that even for $k = 1$ we would reject the hypothesis of normality at the 5% level as long as the sequence length is not at least 3200 bp ($j = 5$). For $k \ge 2$, the required sequence length would be around 25,600 bp ($j = 8$). Table 1 also shows that the statistic $D_2^*$ would reject the hypothesis of normality not only for large $k$, but also for small $k$ with large sequence length. This nonmonotonic behavior of $D_2^*$ indicates that, to declare statistical significance, the statistic should not be compared with a normal distribution.
In contrast, Table 1 shows that $D_2^S$ is reasonably close to normal even for a sequence of length 200 bp when $k \le 4$; for $k = 8$, a sequence of length 1600 bp would already look reasonably normal. Moreover, the statistic stays close to normal with increasing sequence length and with increasing word length, and it thus displays the monotonicity which makes the statistic safe to apply. We repeated the simulations for $D_2$ using the Kolmogorov-Smirnov test with the known mean and variance, based on Kantorovitz et al. (2007b), instead of the Lilliefors test. Although the Kolmogorov-Smirnov test gave slightly larger p-values, thus indicating a slightly better fit to a normal distribution, the qualitative behavior remained (data not shown).

The uniform case. In the uniform case, our theoretical results predict that the limiting distribution of $D_2$ would only look normal when the sequence length is large and $k$ is large also, or at least in a moderate range. In contrast, $D_2^S$ would still be asymptotically normal even for small $k$ when the sequence is long. Table 2 confirms this predicted behavior. Note that from Equation (3) we can see that, in the uniform case, both $D_2^*$ and $D_2^z$ are the same as $D_2$ up to a multiplicative constant and an additive constant. Table 2 shows that the statistics $D_2$, $D_2^*$, and $D_2^z$ do not monotonically approach the normal distribution. In contrast, we find that $D_2^S$ is close to normal even for sequences of length 100 bp when $k \le 3$, and it gets closer to the normal distribution with both increasing sequence length and decreasing $k$.
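The null-model simulation just described can be sketched as follows. This is an illustrative reimplementation rather than the authors' code: it reuses `d2_statistics` from the earlier sketch, the replicate counts are reduced for speed, and it reports a plain Kolmogorov-Smirnov statistic against a normal with estimated parameters rather than p-values calibrated with the Lilliefors distribution.

```python
import random
import numpy as np
from scipy import stats
# reuses d2_statistics from the earlier sketch

def random_seq(n, probs):
    """i.i.d. sequence of length n; probs maps letters to probabilities."""
    letters, weights = zip(*probs.items())
    return "".join(random.choices(letters, weights=weights, k=n))

def null_scores(n, k, probs, reps=1000):
    """Simulate independent pairs of sequences and collect the three scores."""
    out = {"D2": [], "D2S": [], "D2*": []}
    for _ in range(reps):
        d2, d2s, d2star = d2_statistics(random_seq(n, probs), random_seq(n, probs), k)
        out["D2"].append(d2)
        out["D2S"].append(d2s)
        out["D2*"].append(d2star)
    return out

gc_rich = {"a": 1/6, "t": 1/6, "c": 1/3, "g": 1/3}
scores = null_scores(n=400, k=4, probs=gc_rich)
for name, vals in scores.items():
    mu, sd = np.mean(vals), np.std(vals, ddof=1)
    # KS distance to a normal with estimated mean and sd; calibrated Lilliefors
    # p-values would instead need the Lilliefors distribution (e.g.,
    # statsmodels.stats.diagnostic.lilliefors).
    result = stats.kstest(vals, "norm", args=(mu, sd))
    print(name, round(result.statistic, 4), round(result.pvalue, 4))
```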

Table 1. Lilliefors tests in the nonuniform case: p-values for $D_2$, $D_2^*$, and $D_2^S$, by word size $k$ and sequence length $n = 2^j \times 100$, $j = 0, 1, \ldots, 8$.

3. POWER STUDIES

In Section 2, we studied the distributions of $D_2$, $D_2^*$, and $D_2^S$ under the null model that the two sequences are i.i.d. with the same distribution. In this section, we will study the power of detecting the relationship between the two sequences under two alternative models for their relationship. Note that, as a result of estimating the mean, the term $X_w - \bar n \hat p_w$ vanishes in the case $k = 1$ for $D_2^S$ and $D_2^*$. So we chose $k \ge 2$ for a fair comparison of our statistics.

First, we generate a pair of independent random sequences of length $n$ under the null model, with i.i.d. letters. Throughout we restrict ourselves to the alphabet $\mathcal{A} = \{a, c, g, t\}$. We consider the same two types of distributions on the letters as earlier: the uniform distribution ($p_a = p_c = p_g = p_t = \frac{1}{4}$) and a gc-rich, nonuniform distribution ($p_a = p_t = \frac{1}{6}$, $p_c = p_g = \frac{1}{3}$). For each $n = 2^j \times 10^2$, where $j = 0, 1, \ldots, 8$, and for each $k = 2, \ldots, 10$, we compute the scores for each pair of sequences for the various statistics, where $k$ is the word size for the count statistics. All results are based on a sample size of 10,000.

The first alternative model renders the two sequences dependent through a common motif ($W$) which is randomly distributed across the two sequences. The second alternative model is inspired by horizontal gene transfer: we randomly choose a certain number of fragments in the first sequence and then replace the corresponding fragments (position-wise) in the second sequence by the letters in the first sequence. Again, as a consequence, the two sequences are no longer independent.

Table 2. Lilliefors tests in the uniform case: p-values for $D_2$ and $D_2^S$, by word size $k$ and sequence length $n = 2^j \times 100$, $j = 0, 1, \ldots, 8$.

In more detail, the two models are chosen as follows:

The common motif model: A motif of length $L = 5$ is chosen, say $W = agcca$. Next, Bernoulli random variables $Z_1, Z_2, \ldots$, with $P(Z_i = 1) = \gamma$, are generated for $i = 1, 2, \ldots, n - L + 1$. If $Z_i = 1$, we insert the word $W$ in place of $A_i A_{i+1} \cdots A_{i+L-1}$ in sequence 1. We avoid overlap by moving on to $Z_{i+L}$ whenever $Z_i = 1$. We repeat the process for sequence 2. The scores of the various statistics are then computed using the newly generated pair of sequences.

The pattern transfer model: We first choose $L = 5$ as the length of the segment to be transferred from sequence 1 to sequence 2. Again, Bernoulli random variables $Z_1, Z_2, \ldots$, with $P(Z_i = 1) = \gamma$, are generated for $i = 1, 2, \ldots, n - L + 1$. When $Z_i = 1$, we pick the $L$-word $A_i A_{i+1} \cdots A_{i+L-1}$ in sequence 1 and replace $B_i B_{i+1} \cdots B_{i+L-1}$ in sequence 2 with it. Again, we disallow overlaps. For this model, we compute the scores of all the statistics using sequence 1 and the new sequence 2.

The procedure described above is repeated 10,000 times, and the statistics are calculated to yield the empirical distributions of the various statistics for each triplet $(k, n, \gamma)$. As $\gamma$ values for the Bernoulli variables we chose $\gamma = 0.001$, 0.005, 0.01, 0.05, and 0.1. For each statistic, we set a type I error level of $\alpha = 0.05$. Using the empirical distribution $S$ of the statistic under the null model, we find $s$ so that $P(S \ge s) = \alpha$. For a given $\gamma$ value, the power of the statistic is then estimated by the proportion of times the score under the alternative model exceeds $s$.

We now consider the power curves of $D_2$, $D_2^S$, and $D_2^*$ for both models, as well as a comparison between these statistics. For alternative model 1, Figure 1 shows that for $k = 2$, the power of $D_2$ is even smaller than 0.05, the type I error. Further, $k = 6$ has the best power. Figure 2 shows that $k = 4$ has the greatest power for $D_2^S$ under the first alternative model. For $D_2^*$, Figure 3 shows that $k = 5$ has the greatest power under the first alternative model, which corresponds to the length of the common motif which we assume relates the two sequences.
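The two alternative models can be sketched as follows. This is an illustrative rendering, not the authors' implementation: the function names are ours, Python's `random` module stands in for the Bernoulli draws, and the commented lines at the end only indicate how the functions would be combined.

```python
import random

MOTIF = "agcca"   # the common motif W of length L = 5

def plant_common_motif(seq, motif, gamma):
    """Common motif model: scan positions left to right; with probability gamma,
    overwrite the next L letters with the motif and skip ahead to avoid overlap."""
    L = len(motif)
    s = list(seq)
    i = 0
    while i <= len(s) - L:
        if random.random() < gamma:   # Bernoulli(gamma) variable Z_i
            s[i:i + L] = motif
            i += L                    # move on to Z_{i+L}: no overlapping insertions
        else:
            i += 1
    return "".join(s)

def pattern_transfer(seq_a, seq_b, L, gamma):
    """Pattern transfer model: with probability gamma, copy the L-word starting at
    position i of sequence 1 over the corresponding positions of sequence 2."""
    b = list(seq_b)
    i = 0
    while i <= min(len(seq_a), len(b)) - L:
        if random.random() < gamma:
            b[i:i + L] = seq_a[i:i + L]
            i += L                    # again, no overlaps
        else:
            i += 1
    return "".join(b)

# Alternative model 1: plant the motif independently in both sequences.
#   a1 = plant_common_motif(a, MOTIF, gamma=0.01)
#   b1 = plant_common_motif(b, MOTIF, gamma=0.01)
# Alternative model 2: keep sequence 1 and modify sequence 2.
#   b2 = pattern_transfer(a, b, L=5, gamma=0.01)
```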

FIG. 1. Alternative model 1: power curves (power versus sequence length) for $D_2$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.01$. For $k = 2$, the power of $D_2$ is smaller than 0.05, the type I error (indicated by the horizontal dashed line).

Turning to a comparison of the power of our various statistics, we find that $D_2^*$ has greater power than $D_2^S$ for each $k = 2, 4, 5, 6, 10$ (results not shown). We note that although in the uniform case $D_2$, $D_2^*$, and $D_2^z$ coincide up to multiplicative and additive constants, Figure 4 shows slight differences between $D_2^z$ and $D_2 = D_2^*$. These differences stem from using the estimated parameters instead of the true model parameters in the test statistic $D_2^z$.

FIG. 2. Alternative model 1: power curves (power versus sequence length) for $D_2^S$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.01$. Note: $k = 4$ has the greatest power.

FIG. 3. Alternative model 1: power curves (power versus sequence length) for $D_2^*$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.01$.

Figure 5 shows a typical scenario for alternative model 1, where both $D_2^S$ and $D_2^*$ have greater power than $D_2$ for given $k$ and $\gamma$, and the power increases as the length, $n$, of the sequences increases. Even for a small $\gamma$, we are able to notice the difference in the power of the various statistics. We also note that here $D_2^z$ has higher power than $D_2$.

For alternative model 2, the picture changes. Figure 6 shows that the power of $D_2$ is poor for $k = 2, 4, 5, 6$; but when increasing the parameter $k$ to 10, far beyond the length of the tuple which we transfer, the power increases dramatically.

FIG. 4. Alternative model 1: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the uniform distribution; $\gamma = 0.01$, $k = 5$. Note: for the uniform case, $D_2$ and $D_2^*$ differ by only a constant.

FIG. 5. Alternative model 1: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the gc-rich model; $k = 5$, $\gamma = 0.01$. Note: $D_2^S$, $D_2^z$, and $D_2^*$ all have greater power than $D_2$ for the given $k$ and $\gamma$.

In contrast, Figure 7 shows that for $D_2^S$ the power is moderate for all values of $k$ in the plot, and it does not show a marked increase with sequence length. Using $k = 10$ instead of $k = 6$ seems to decrease the power slightly. For $D_2^*$, Figure 8 shows that the power increases with $k$, and increasing sequence length slightly improves the power. For $k = 10$, the power approaches 1 for long sequences.

FIG. 6. Alternative model 2: power curves (power versus sequence length) for $D_2$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.05$. This graph suggests that the power increases with $k$.

FIG. 7. Alternative model 2: power curves (power versus sequence length) for $D_2^S$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.05$.

For alternative model 2, Figures 6-8 suggest that, under the gc-rich, nonuniform distribution, for $D_2$ and $D_2^S$ the greater the $k$ value, the greater the power, even if this comes with a higher computational cost. We note that for fixed $k$, $D_2^*$ has greater power than $D_2$. Moreover, $D_2^S$ has smaller power than $D_2$ for $k = 10$ and long sequences. Also, we need a larger $\gamma$ value to see the differentiation of the power between the various $k$ values for alternative model 2. This is due to the fact that in the first alternative model, a particular motif has a large contribution to the statistics; in the second model, however, the segment transferred from sequence 1 might be similar to the corresponding segment it replaces in sequence 2, and hence a greater $\gamma$ value is required before the sequences show similarity.

FIG. 8. Alternative model 2: power curves (power versus sequence length) for $D_2^*$ with $k = 2, 4, 5, 6, 10$ under the gc-rich distribution; $\gamma = 0.05$.

Under alternative model 2, we find that for $k \le 9$ and $\gamma \le 0.05$, $D_2$ and $D_2^z$ actually show a decrease in power as $n$ increases, in certain intervals; this is illustrated in Figure 9. For $k = 10$, $D_2$ has higher power than $D_2^S$, but lower power than $D_2^*$; the higher power than $D_2^S$ comes at a great computational cost (results not shown).

FIG. 9. Alternative model 2: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the gc-rich distribution when $k = 5$, $\gamma = 0.05$. Note: for $k = 5$, $D_2$ has the least power, and its power actually decreases as $n$ increases.

FIG. 10. Alternative model 2: power curves for $D_2$, $D_2^z$, $D_2^*$, and $D_2^S$ under the uniform distribution when $k = 5$, $\gamma = 0.05$.

Our findings suggest that $D_2$ is not desirable as a statistic for sequence comparison. We conjecture that this is due to the fact that $D_2$ is dominated by the normal components of the individual sequences and so is actually measuring the sum of the departures of each sequence from the background (Lippert et al., 2002), rather than the (dis)similarity between the two sequences. As $n$ increases, $D_2$ loses its detecting power even as the two sequences become more similar. As an aside, under the uniform distribution, in alternative model 2, all three statistics behave similarly for $k = 5$, as expected; see Figure 10.

4. USING $D_2^S$ TO TEST FOR SIMILARITY

Although in our simulations $D_2^*$ is more powerful than $D_2^S$, the statistic $D_2^S$ is still considerably more powerful than $D_2$. For tests that would result in small p-values, as required for multiple testing for example, simulating the empirical distribution of the test statistic under the null hypothesis can be time consuming. In contrast to $D_2^*$, the limiting distribution of $D_2^S$ is normal with mean zero, and hence testing is straightforward; only the standard deviation needs to be estimated.

To illustrate the procedure, for fixed $k$ and $n$, we generate 10,000 pairs of sequences in the gc-rich or uniform case and compute the $D_2^S$ score for each pair. The standard deviation of $D_2^S$ is then estimated from these empirical scores. Again, for fixed $k$ and $n$, we generate 2000 pairs of sequences under the null model of no relationship between the two sequences, in both the gc-rich and the uniform case, and we compute the $D_2^S$ score. Assuming asymptotic normality, we use a z-test to test the null hypothesis of no relationship, assuming mean zero and the estimated standard deviation. Then we generate 2000 pairs of sequences from alternative model 1, with motif insertion probabilities $\gamma = 0.001$, 0.005, 0.01, 0.05, 0.1. The $D_2^S$ statistic is computed, and we carry out a z-test based on the asymptotic normality of $D_2^S$. We repeat the procedure for the pattern transfer model, alternative model 2. We choose $k = 4, 5, 6$, because we know from our power simulations that $D_2^S$ works best when the motif length is around 5. We compare to the results that we obtain using the empirical distribution of $D_2^S$ instead, where the empirical distribution function is based on 10,000 samples. In addition, we use the empirical distribution of $D_2^S$ based on 100,000 samples.

Tables 3 and 4 show the estimated type 1 and type 2 error rates in the gc-rich and uniform cases; recall that the type 1 error is the probability of rejecting the null hypothesis although it is true. The type 2 error, the probability of accepting the null hypothesis although it is false, is estimated under alternative model 1 and under alternative model 2, with motif insertion probability and pattern transfer probability $\gamma$ taking on the values 0.005, 0.01, 0.05, 0.1. For each $n$ and $k$, the first row gives the estimates from the z-test (abbreviated as z), and the second row gives the estimates from the empirical distribution function (abbreviated as e). Except for the puzzling case when $n = 3200$ with $k = 6$, the results are remarkably similar, and there is no clear advantage to using the empirical distribution function when it is based on a relatively small number of samples. The general observation is that the normal approximation for $D_2^S$ gives a fast method for assessing statistical significance.
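As a sketch, the z-test described in this section might be carried out as follows, reusing `random_seq` and `d2_statistics` from the earlier sketches; the one-sided rejection rule and all function names are our own illustrative choices.

```python
import numpy as np
from scipy.stats import norm
# reuses random_seq and d2_statistics from the earlier sketches

def estimate_null_sd(n, k, probs, reps=10_000):
    """Estimate sd(D2S) under the null model of two independent i.i.d. sequences."""
    vals = [d2_statistics(random_seq(n, probs), random_seq(n, probs), k)[1]
            for _ in range(reps)]
    return float(np.std(vals, ddof=1))

def d2s_z_test(seq_a, seq_b, k, null_sd):
    """z-test of 'no relationship': compare D2S with a mean-zero normal whose
    standard deviation was estimated under the null model."""
    d2s = d2_statistics(seq_a, seq_b, k)[1]
    z = d2s / null_sd
    return z, norm.sf(z)   # one-sided: large positive D2S indicates relatedness

# gc_rich = {"a": 1/6, "t": 1/6, "c": 1/3, "g": 1/3}
# sd = estimate_null_sd(n=1600, k=5, probs=gc_rich, reps=2000)
# z, pval = d2s_z_test(seq1, seq2, k=5, null_sd=sd)
```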
5. PHASE TRANSITION

In this section, we explore the effect of small deviations from the uniform distribution for the $D_2$ statistic only. We restrict attention to the alphabet $\mathcal{A} = \{a, c, g, t\}$ and word size $k = 1$; both sequences are of the same length $n$. Again $p_a$ denotes the probability of the letter $a$. Then the standardized counts

$Z^X_a = \frac{X_a - n p_a}{\sqrt{n p_a (1 - p_a)}} \quad \text{and} \quad Z^Y_a = \frac{Y_a - n p_a}{\sqrt{n p_a (1 - p_a)}}$

both tend to standard normal variables when $n$ tends to infinity. With this notation, and noting that $E(D_2) = n^2 \sum_{a \in \mathcal{A}} p_a^2$, we obtain

$\frac{D_2 - E(D_2)}{n} = \sum_{a \in \mathcal{A}} p_a (1 - p_a)\, Z^X_a Z^Y_a + \sqrt{n} \sum_{a \in \mathcal{A}} p_a \sqrt{p_a (1 - p_a)}\, \big(Z^X_a + Z^Y_a\big). \quad (7)$

Table 3. The estimated type 1 and type 2 error rates when applying the z-test using the estimated variance, based on 2000 samples, for various sequence lengths and $k = 4, 5, 6$. M1/M2 refers to alternative model 1/2; g1/g2/g3/g4 refers to the cases $\gamma = 0.005$, 0.01, 0.05, 0.1, respectively, where $\gamma$ is the parameter of the Bernoulli random variables; so M1g3 means alternative model 1 with $\gamma = 0.05$. For each $n$ and $k$, the first row gives the estimates from the z-test (abbreviated as z), the second row gives the estimates from the empirical distribution function (abbreviated as e), both based on 10,000 samples; the third row, abbreviated as t, gives the estimates from the empirical distribution function based on 100,000 samples.

When the distribution on the alphabet is uniform, $(p_a, p_c, p_g, p_t) = (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})$, the second term in Equation (7) vanishes,

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} Z^X_a Z^Y_a,$

and so it is asymptotically nonnormal (in fact, a sum of products of standard normal variables). In the situation where $(p_a, p_c, p_g, p_t) \neq (\frac{1}{4}, \frac{1}{4}, \frac{1}{4}, \frac{1}{4})$ and the letter probabilities do not depend on $n$, the second term in Equation (7) dominates the first term; as

$\frac{D_2 - E(D_2)}{n} \approx \sqrt{n} \sum_{a \in \mathcal{A}} p_a \sqrt{p_a (1 - p_a)}\, \big(Z^X_a + Z^Y_a\big),$

the scaled limit is a normal distribution.

Table 4. The estimated type 1 and type 2 error rates when applying the z-test using the estimated variance, based on 2000 samples, but for the uniform case.

Next we assume that $(p_a(n), p_g(n), p_c(n), p_t(n))$ changes with $n$ in such a way that there exists a function $f(n) \to 0$ and constants $C_l$, $l = a, g, c, t$, satisfying $C_a + C_g + C_c + C_t = 0$, such that

$\lim_{n \to \infty} f^{-1}(n)\, \big(p_l(n) - 1/4\big) = C_l$

for each letter $l \in \mathcal{A}$. Then

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} \Big(1 + \tfrac{8}{3} C_a f(n) - \tfrac{16}{3} C_a^2 f^2(n) + c(n)\Big) Z^X_a Z^Y_a + \frac{\sqrt{3n}}{4} f(n) \Big( \sum_{a \in \mathcal{A}} C_a \big(Z^X_a + Z^Y_a\big) + \Theta(f(n)) \Big), \quad (8)$

where $c(n) \to 0$ as $n \to \infty$ and $\Theta(f(n))$ indicates a term that has the same order as $f(n)$.

Table 5. The ratio $R_n$ of the coefficient of the first term over the coefficient of the second term in Equation (8), for different values of $n$ and $\varepsilon$.

Let $f(n) = 1/n^{0.5 + \varepsilon}$. When $\varepsilon < 0$, the second term in (8) will dominate and $\frac{D_2 - E(D_2)}{n}$ will tend to a normal distribution. When $\varepsilon > 0$, the first term dominates and $\frac{D_2 - E(D_2)}{n}$ will tend to be nonnormal. Thus, we expect a phase transition from normal to nonnormal as $\varepsilon$ changes from negative to positive. Intuitively, the ratio $R_n$ of the coefficient of the first term over the coefficient of the second term in Equation (8) can be thought of as a ratio of dominance;

$R_n = \frac{3/16}{\frac{\sqrt{3n}}{4}\, f(n)} = \frac{\sqrt{3}}{4}\, n^{\varepsilon}.$

Table 5 shows the decrease of $R_n$ for increasing $n$ (row labels) and decreasing $\varepsilon$ (column labels).

To run simulations in the vicinity of this phase transition, we consider two types of probability vectors for the alphabet $\mathcal{A}$ and $f(n) = 1/n^{0.5 + \varepsilon}$. The type I probability vector is chosen as $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4},\ \tfrac{1}{4}\big)$, giving $(C_a, C_c, C_g, C_t) = (1, -1, 0, 0)$. In the second scenario, type II, the probability vector perturbs all components, $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n)\big)$, so that $(C_a, C_c, C_g, C_t) = (1, -1, 1, -1)$.

Under the type I model, we can show that the variance of $\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a$ is approximately $16/3$. Thus, there exists $Z_n \to N(0, 1)$ as $n$ tends to infinity such that

$\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a \approx \frac{4}{\sqrt{3}}\, Z_n \quad \text{(type I)}.$

Equation (8) can then be rewritten as

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} \Big(1 + \tfrac{8}{3} C_a f(n) - \tfrac{16}{3} C_a^2 f^2(n) + c(n)\Big) Z^X_a Z^Y_a + \sqrt{n}\, f(n)\, \big(Z_n + \Theta(f(n))\big). \quad (9)$

Under the type II model, the variance of $\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a$ is approximately $32/3$, and

$\sum_{a \in \mathcal{A}} C_a Z^X_a + \sum_{a \in \mathcal{A}} C_a Z^Y_a \approx 4 \sqrt{\tfrac{2}{3}}\, Z_n \quad \text{(type II)}.$

Hence, under the type II model, Equation (8) can be rewritten as

$\frac{D_2 - E(D_2)}{n} = \frac{3}{16} \sum_{a \in \mathcal{A}} \Big(1 + \tfrac{8}{3} C_a f(n) - \tfrac{16}{3} C_a^2 f^2(n) + c(n)\Big) Z^X_a Z^Y_a + \sqrt{2n}\, f(n)\, \big(Z_n + \Theta(f(n))\big). \quad (10)$

The ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is $\sqrt{2}$-fold larger than that for Equation (10). Therefore, we expect that normality appears for relatively small absolute deviations from uniformity for the type II vector.

For given $n$ and $\varepsilon$, we generate sequences of length $n$ using both types of distribution vectors. The $D_2$ scores for word size $k = 1$ are then tabulated. We use a Kolmogorov-Smirnov test to test the hypothesis that $D_2$ is normally distributed, and the corresponding p-value is obtained.
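The near-uniform simulation just described might be sketched as follows, reusing `random_seq` from the earlier sketches. It is illustrative only: the sample mean and standard deviation are used in place of the theoretical mean and variance employed in the paper, the replicate count is reduced, and the $\varepsilon$ grid in the commented loop is hypothetical.

```python
import numpy as np
from scipy import stats
# reuses random_seq from the earlier sketches

def near_uniform_probs(n, eps, kind="I"):
    """Type I perturbs two letters by +/- f(n); type II perturbs all four."""
    f = n ** -(0.5 + eps)
    if kind == "I":
        return {"a": 0.25 + f, "c": 0.25 - f, "g": 0.25, "t": 0.25}
    return {"a": 0.25 + f, "c": 0.25 - f, "g": 0.25 + f, "t": 0.25 - f}

def d2_k1(seq_a, seq_b):
    """D2 for word size k = 1: sum over letters of the products of letter counts."""
    return sum(seq_a.count(x) * seq_b.count(x) for x in "acgt")

def d2_normality_pvalue(n, eps, kind, reps=2000):
    """KS test of normality for D2 at k = 1 under a near-uniform letter distribution
    (sample mean and sd are used here instead of the theoretical ones)."""
    p = near_uniform_probs(n, eps, kind)
    vals = [d2_k1(random_seq(n, p), random_seq(n, p)) for _ in range(reps)]
    mu, sd = np.mean(vals), np.std(vals, ddof=1)
    return stats.kstest(vals, "norm", args=(mu, sd)).pvalue

# for eps in (-0.25, -0.10, -0.05, 0.0, 0.05):     # illustrative grid
#     print(eps, d2_normality_pvalue(n=3200, eps=eps, kind="II"))
```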

Here again we use the theoretical mean and variance for the test. Table 6 gives the p-values for different values of $n$ and $\varepsilon$ under the two models; again we report only the first four digits.

Table 6. The p-values of the Kolmogorov-Smirnov test for testing the normality of $D_2$ for letter distributions which are close to uniform, with $f(n) = 1/n^{0.5 + \varepsilon}$; type I: $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4},\ \tfrac{1}{4}\big)$, type II: $\big(\tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n),\ \tfrac{1}{4} + f(n),\ \tfrac{1}{4} - f(n)\big)$, for different values of $n$ and $\varepsilon$.

Table 6 indicates that, under the type II model, the distribution of $D_2$ is not significantly different from normality for $\varepsilon \le 0.05$ at moderate sequence lengths, while it is significantly different from normality for $\varepsilon = 0.05$ at the larger sequence lengths. Under the type I model, on the other hand, the distribution of $D_2$ is significantly different from normality for $\varepsilon = 0.05$ even at moderate sequence lengths. The simulation results are consistent with our intuition. As shown in Table 5, the ratio $R_n$ of the coefficient of the first term over that of the second term in Equation (8) is less than 0.1 when $\varepsilon \le -0.10$ for the larger sequence lengths considered; this can explain why normality of $D_2$ appears for $\varepsilon \le -0.10$ under both the type I and type II models. Further, as the ratio of the coefficient of the first term over the coefficient of the second term in Equation (9) is $\sqrt{2}$-fold larger than that for Equation (10), normality of $D_2$ begins to appear already for $\varepsilon < 0.05$ under the type II model.

6. DISCUSSION

The typically used $D_2$ statistic asymptotically ignores the joint word occurrences in the two sequences unless all letters are (almost) equally likely; in the latter scenario, a phase transition occurs. Hence, the $D_2$ statistic is neither robust nor informative under the normal regime. The main advantage of $D_2$ is that it is easy to compute.

The proposed $D_2^S$ statistic is also easy to compute, but it can be compared with a normal distribution to assess significance, and it performs well in a power study. The $D_2^*$ statistic is more powerful than $D_2^S$ in our simulation study and is also easy to compute, but its asymptotic distribution does not have a convenient form; instead, it would best be assessed using simulations, which are time consuming, as the tail of the distribution would need to be estimated. Our recommendation is to discard $D_2$, to use $D_2^S$ when computing time is limited, and ideally to use $D_2^*$ for sequence comparison based on k-tuple content.

Our results allow for a number of generalizations. The normal approximation for the word counts in each individual sequence does not assume that the underlying letter distribution is the same as in the other sequence. Hence, the normal approximation for $D_2^S$ also holds when the sequences do not follow the same underlying distribution on the letters. Huang (2002) gave a related normal approximation for one sequence in the more general situation that the sequence is generated by a homogeneous Markov chain. Kantorovitz et al. (2007b) already successfully adapted $D_2^z$ to the Markov case. Also for $D_2^S$, the generalization of our results to that setting should be straightforward; the error bounds would need to be adjusted.

Burden et al. (2008) generalized $D_2^z$ to allow for mismatches, on the four-letter alphabet $\{a, c, g, t\}$ under the Bernoulli model with $p_a = p_t$ and $p_c = p_g$; they called the set of all words which differ by at most $m$ letters from a word $w$ of length $k$ the $m$-neighborhood of $w$. The generalized statistic then counts the number of all $m$-neighborhood matches of all $k$-words between the two sequences. With our normal approximation for all word counts, $D_2^S$ could be generalized similarly to allow for a certain number of mismatches. The quality of the normal approximation will depend on the number $m$ of permitted mismatches.

We also indicate that more than two sequences could be compared in a similar fashion. Quine (1994) stated the result that if $X_1, \ldots, X_n$ are independent normal random variables with zero means and variances $\sigma_1^2, \ldots, \sigma_n^2$, then

$\frac{X_1 X_2 \cdots X_n}{\sqrt{\sum X_{i_1}^2 X_{i_2}^2 \cdots X_{i_{n-1}}^2}} \sim N\!\left(0,\ \frac{\sigma_1^2 \sigma_2^2 \cdots \sigma_n^2}{\big(\sum \sigma_{i_1} \sigma_{i_2} \cdots \sigma_{i_{n-1}}\big)^2}\right),$

where both sums are over all integers $1 \le i_1 < i_2 < \cdots < i_{n-1} \le n$ (Melnykov and Chen, 2007). This suggests an extension of the $D_2^S$ statistic for multiple sequence comparison, by taking the products of the individual word counts and standardizing as earlier; then a normal approximation is still valid. Similarly, we could extend $D_2^*$ as the sum, over all words, of the product of more than two standardized word counts. Springer and Thompson (1966) gave a formula for the density of the product of independent standard normals. Again, the covariance structure of the word counts within one sequence would make it advisable to assess the limiting distribution via simulation.

ACKNOWLEDGMENTS

G.R. was supported in part by EPSRC grant no. GR/R52183/01, and by BBSRC and EPSRC through OCISB. D.C. was supported by an Overseas Postdoctoral Fellowship from the National University of Singapore. F.S. was supported by NIH grant no. P50 HG and R21AG. M.S.W. was supported by NIH grant no. P50 HG and R21AG.

DISCLOSURE STATEMENT

No competing financial interests exist.

REFERENCES

Burden, C.J., Kantorovitz, M.R., and Wilson, S.R. 2008. Approximate word matches between two random sequences. Ann. Appl. Probab. 18.
Forêt, S., Kantorovitz, M., and Burden, C. 2006. Asymptotic behaviour and optimal word size for exact and approximate word matches between random sequences. BMC Bioinformat. 7(Suppl 5), S21.


Relating Graph to Matlab

Relating Graph to Matlab There are two related course documents on the web Probability and Statistics Review -should be read by people without statistics background and it is helpful as a review for those with prior statistics

More information

Bivariate Uniqueness in the Logistic Recursive Distributional Equation

Bivariate Uniqueness in the Logistic Recursive Distributional Equation Bivariate Uniqueness in the Logistic Recursive Distributional Equation Antar Bandyopadhyay Technical Report # 629 University of California Department of Statistics 367 Evans Hall # 3860 Berkeley CA 94720-3860

More information

Supporting Information

Supporting Information Supporting Information Weghorn and Lässig 10.1073/pnas.1210887110 SI Text Null Distributions of Nucleosome Affinity and of Regulatory Site Content. Our inference of selection is based on a comparison of

More information

Sequence analysis and Genomics

Sequence analysis and Genomics Sequence analysis and Genomics October 12 th November 23 rd 2 PM 5 PM Prof. Peter Stadler Dr. Katja Nowick Katja: group leader TFome and Transcriptome Evolution Bioinformatics group Paul-Flechsig-Institute

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

Review. DS GA 1002 Statistical and Mathematical Models. Carlos Fernandez-Granda

Review. DS GA 1002 Statistical and Mathematical Models.   Carlos Fernandez-Granda Review DS GA 1002 Statistical and Mathematical Models http://www.cims.nyu.edu/~cfgranda/pages/dsga1002_fall16 Carlos Fernandez-Granda Probability and statistics Probability: Framework for dealing with

More information

Introduction to Statistics

Introduction to Statistics MTH4106 Introduction to Statistics Notes 15 Spring 2013 Testing hypotheses about the mean Earlier, we saw how to test hypotheses about a proportion, using properties of the Binomial distribution It is

More information

GRE Quantitative Reasoning Practice Questions

GRE Quantitative Reasoning Practice Questions GRE Quantitative Reasoning Practice Questions y O x 7. The figure above shows the graph of the function f in the xy-plane. What is the value of f (f( ))? A B C 0 D E Explanation Note that to find f (f(

More information

2.1 Elementary probability; random sampling

2.1 Elementary probability; random sampling Chapter 2 Probability Theory Chapter 2 outlines the probability theory necessary to understand this text. It is meant as a refresher for students who need review and as a reference for concepts and theorems

More information

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST

EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST EVALUATING THE REPEATABILITY OF TWO STUDIES OF A LARGE NUMBER OF OBJECTS: MODIFIED KENDALL RANK-ORDER ASSOCIATION TEST TIAN ZHENG, SHAW-HWA LO DEPARTMENT OF STATISTICS, COLUMBIA UNIVERSITY Abstract. In

More information

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n

Recall that in order to prove Theorem 8.8, we argued that under certain regularity conditions, the following facts are true under H 0 : 1 n Chapter 9 Hypothesis Testing 9.1 Wald, Rao, and Likelihood Ratio Tests Suppose we wish to test H 0 : θ = θ 0 against H 1 : θ θ 0. The likelihood-based results of Chapter 8 give rise to several possible

More information

A central limit theorem for an omnibus embedding of random dot product graphs

A central limit theorem for an omnibus embedding of random dot product graphs A central limit theorem for an omnibus embedding of random dot product graphs Keith Levin 1 with Avanti Athreya 2, Minh Tang 2, Vince Lyzinski 3 and Carey E. Priebe 2 1 University of Michigan, 2 Johns

More information

Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations

Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations Mathematics Course 111: Algebra I Part I: Algebraic Structures, Sets and Permutations D. R. Wilkins Academic Year 1996-7 1 Number Systems and Matrix Algebra Integers The whole numbers 0, ±1, ±2, ±3, ±4,...

More information

A nonparametric test for path dependence in discrete panel data

A nonparametric test for path dependence in discrete panel data A nonparametric test for path dependence in discrete panel data Maximilian Kasy Department of Economics, University of California - Los Angeles, 8283 Bunche Hall, Mail Stop: 147703, Los Angeles, CA 90095,

More information

Multimedia Communications. Mathematical Preliminaries for Lossless Compression

Multimedia Communications. Mathematical Preliminaries for Lossless Compression Multimedia Communications Mathematical Preliminaries for Lossless Compression What we will see in this chapter Definition of information and entropy Modeling a data source Definition of coding and when

More information

Figure 10.1: Recording when the event E occurs

Figure 10.1: Recording when the event E occurs 10 Poisson Processes Let T R be an interval. A family of random variables {X(t) ; t T} is called a continuous time stochastic process. We often consider T = [0, 1] and T = [0, ). As X(t) is a random variable

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

1 Motivation for Instrumental Variable (IV) Regression

1 Motivation for Instrumental Variable (IV) Regression ECON 370: IV & 2SLS 1 Instrumental Variables Estimation and Two Stage Least Squares Econometric Methods, ECON 370 Let s get back to the thiking in terms of cross sectional (or pooled cross sectional) data

More information

Practical Applications and Properties of the Exponentially. Modified Gaussian (EMG) Distribution. A Thesis. Submitted to the Faculty

Practical Applications and Properties of the Exponentially. Modified Gaussian (EMG) Distribution. A Thesis. Submitted to the Faculty Practical Applications and Properties of the Exponentially Modified Gaussian (EMG) Distribution A Thesis Submitted to the Faculty of Drexel University by Scott Haney in partial fulfillment of the requirements

More information

Short cycles in random regular graphs

Short cycles in random regular graphs Short cycles in random regular graphs Brendan D. McKay Department of Computer Science Australian National University Canberra ACT 0200, Australia bdm@cs.anu.ed.au Nicholas C. Wormald and Beata Wysocka

More information

September Math Course: First Order Derivative

September Math Course: First Order Derivative September Math Course: First Order Derivative Arina Nikandrova Functions Function y = f (x), where x is either be a scalar or a vector of several variables (x,..., x n ), can be thought of as a rule which

More information

Appendix A Numerical Tables

Appendix A Numerical Tables Appendix A Numerical Tables A.1 The Gaussian Distribution The Gaussian distribution (Eq. 2.31)isdefinedas 1 f (x)= 2πσ 2 e (x μ) 2 2σ 2. (A.1) The maximum value is obtained at x = μ, and the value of (x

More information

Stochastic processes and Markov chains (part II)

Stochastic processes and Markov chains (part II) Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University Amsterdam, The

More information

NAG Library Chapter Introduction. G08 Nonparametric Statistics

NAG Library Chapter Introduction. G08 Nonparametric Statistics NAG Library Chapter Introduction G08 Nonparametric Statistics Contents 1 Scope of the Chapter.... 2 2 Background to the Problems... 2 2.1 Parametric and Nonparametric Hypothesis Testing... 2 2.2 Types

More information

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences

Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences Daisuke Ikeda Department of Informatics, Kyushu University 744 Moto-oka, Fukuoka 819-0395, Japan. daisuke@inf.kyushu-u.ac.jp

More information

CONSTRAINED PERCOLATION ON Z 2

CONSTRAINED PERCOLATION ON Z 2 CONSTRAINED PERCOLATION ON Z 2 ZHONGYANG LI Abstract. We study a constrained percolation process on Z 2, and prove the almost sure nonexistence of infinite clusters and contours for a large class of probability

More information

#A69 INTEGERS 13 (2013) OPTIMAL PRIMITIVE SETS WITH RESTRICTED PRIMES

#A69 INTEGERS 13 (2013) OPTIMAL PRIMITIVE SETS WITH RESTRICTED PRIMES #A69 INTEGERS 3 (203) OPTIMAL PRIMITIVE SETS WITH RESTRICTED PRIMES William D. Banks Department of Mathematics, University of Missouri, Columbia, Missouri bankswd@missouri.edu Greg Martin Department of

More information

The number of distributions used in this book is small, basically the binomial and Poisson distributions, and some variations on them.

The number of distributions used in this book is small, basically the binomial and Poisson distributions, and some variations on them. Chapter 2 Statistics In the present chapter, I will briefly review some statistical distributions that are used often in this book. I will also discuss some statistical techniques that are important in

More information

Group, Rings, and Fields Rahul Pandharipande. I. Sets Let S be a set. The Cartesian product S S is the set of ordered pairs of elements of S,

Group, Rings, and Fields Rahul Pandharipande. I. Sets Let S be a set. The Cartesian product S S is the set of ordered pairs of elements of S, Group, Rings, and Fields Rahul Pandharipande I. Sets Let S be a set. The Cartesian product S S is the set of ordered pairs of elements of S, A binary operation φ is a function, S S = {(x, y) x, y S}. φ

More information

STAT 461/561- Assignments, Year 2015

STAT 461/561- Assignments, Year 2015 STAT 461/561- Assignments, Year 2015 This is the second set of assignment problems. When you hand in any problem, include the problem itself and its number. pdf are welcome. If so, use large fonts and

More information

INTRODUCTION TO PATTERN RECOGNITION

INTRODUCTION TO PATTERN RECOGNITION INTRODUCTION TO PATTERN RECOGNITION INSTRUCTOR: WEI DING 1 Pattern Recognition Automatic discovery of regularities in data through the use of computer algorithms With the use of these regularities to take

More information

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions M. Vidyasagar Cecil & Ida Green Chair The University of Texas at Dallas Email: M.Vidyasagar@utdallas.edu

More information

Polymers in a slabs and slits with attracting walls

Polymers in a slabs and slits with attracting walls Polymers in a slabs and slits with attracting walls Aleks Richard Martin Enzo Orlandini Thomas Prellberg Andrew Rechnitzer Buks van Rensburg Stu Whittington The Universities of Melbourne, Toronto and Padua

More information

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2

EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME. Xavier Mestre 1, Pascal Vallet 2 EXTENDED GLRT DETECTORS OF CORRELATION AND SPHERICITY: THE UNDERSAMPLED REGIME Xavier Mestre, Pascal Vallet 2 Centre Tecnològic de Telecomunicacions de Catalunya, Castelldefels, Barcelona (Spain) 2 Institut

More information

THE SIMPLE URN PROCESS AND THE STOCHASTIC APPROXIMATION OF ITS BEHAVIOR

THE SIMPLE URN PROCESS AND THE STOCHASTIC APPROXIMATION OF ITS BEHAVIOR THE SIMPLE URN PROCESS AND THE STOCHASTIC APPROXIMATION OF ITS BEHAVIOR MICHAEL KANE As a final project for STAT 637 (Deterministic and Stochastic Optimization) the simple urn model is studied, with special

More information

Chapter 5. Means and Variances

Chapter 5. Means and Variances 1 Chapter 5 Means and Variances Our discussion of probability has taken us from a simple classical view of counting successes relative to total outcomes and has brought us to the idea of a probability

More information

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A. 1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n

More information

Introduction to Metalogic

Introduction to Metalogic Philosophy 135 Spring 2008 Tony Martin Introduction to Metalogic 1 The semantics of sentential logic. The language L of sentential logic. Symbols of L: Remarks: (i) sentence letters p 0, p 1, p 2,... (ii)

More information

The main results about probability measures are the following two facts:

The main results about probability measures are the following two facts: Chapter 2 Probability measures The main results about probability measures are the following two facts: Theorem 2.1 (extension). If P is a (continuous) probability measure on a field F 0 then it has a

More information

The Minesweeper game: Percolation and Complexity

The Minesweeper game: Percolation and Complexity The Minesweeper game: Percolation and Complexity Elchanan Mossel Hebrew University of Jerusalem and Microsoft Research March 15, 2002 Abstract We study a model motivated by the minesweeper game In this

More information

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321

Lecture 11: Introduction to Markov Chains. Copyright G. Caire (Sample Lectures) 321 Lecture 11: Introduction to Markov Chains Copyright G. Caire (Sample Lectures) 321 Discrete-time random processes A sequence of RVs indexed by a variable n 2 {0, 1, 2,...} forms a discretetime random process

More information

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki.

Protein Bioinformatics. Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet sandberg.cmb.ki. Protein Bioinformatics Rickard Sandberg Dept. of Cell and Molecular Biology Karolinska Institutet rickard.sandberg@ki.se sandberg.cmb.ki.se Outline Protein features motifs patterns profiles signals 2 Protein

More information

2 DISCRETE-TIME MARKOV CHAINS

2 DISCRETE-TIME MARKOV CHAINS 1 2 DISCRETE-TIME MARKOV CHAINS 21 FUNDAMENTAL DEFINITIONS AND PROPERTIES From now on we will consider processes with a countable or finite state space S {0, 1, 2, } Definition 1 A discrete-time discrete-state

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part I) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

Computational statistics

Computational statistics Computational statistics Markov Chain Monte Carlo methods Thierry Denœux March 2017 Thierry Denœux Computational statistics March 2017 1 / 71 Contents of this chapter When a target density f can be evaluated

More information

Jumping Sequences. Steve Butler Department of Mathematics University of California, Los Angeles Los Angeles, CA

Jumping Sequences. Steve Butler Department of Mathematics University of California, Los Angeles Los Angeles, CA 1 2 3 47 6 23 11 Journal of Integer Sequences, Vol. 11 (2008), Article 08.4.5 Jumping Sequences Steve Butler Department of Mathematics University of California, Los Angeles Los Angeles, CA 90095 butler@math.ucla.edu

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

Testing for Regime Switching in Singaporean Business Cycles

Testing for Regime Switching in Singaporean Business Cycles Testing for Regime Switching in Singaporean Business Cycles Robert Breunig School of Economics Faculty of Economics and Commerce Australian National University and Alison Stegman Research School of Pacific

More information

Bioinformatics and BLAST

Bioinformatics and BLAST Bioinformatics and BLAST Overview Recap of last time Similarity discussion Algorithms: Needleman-Wunsch Smith-Waterman BLAST Implementation issues and current research Recap from Last Time Genome consists

More information

The Behavior of Multivariate Maxima of Moving Maxima Processes

The Behavior of Multivariate Maxima of Moving Maxima Processes The Behavior of Multivariate Maxima of Moving Maxima Processes Zhengjun Zhang Department of Mathematics Washington University Saint Louis, MO 6313-4899 USA Richard L. Smith Department of Statistics University

More information

1 Lyapunov theory of stability

1 Lyapunov theory of stability M.Kawski, APM 581 Diff Equns Intro to Lyapunov theory. November 15, 29 1 1 Lyapunov theory of stability Introduction. Lyapunov s second (or direct) method provides tools for studying (asymptotic) stability

More information

Recall the Basics of Hypothesis Testing

Recall the Basics of Hypothesis Testing Recall the Basics of Hypothesis Testing The level of significance α, (size of test) is defined as the probability of X falling in w (rejecting H 0 ) when H 0 is true: P(X w H 0 ) = α. H 0 TRUE H 1 TRUE

More information

Introduction to Probability

Introduction to Probability LECTURE NOTES Course 6.041-6.431 M.I.T. FALL 2000 Introduction to Probability Dimitri P. Bertsekas and John N. Tsitsiklis Professors of Electrical Engineering and Computer Science Massachusetts Institute

More information

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape

The tape of M. Figure 3: Simulation of a Turing machine with doubly infinite tape UG3 Computability and Intractability (2009-2010): Note 4 4. Bells and whistles. In defining a formal model of computation we inevitably make a number of essentially arbitrary design decisions. These decisions

More information

4 Sums of Independent Random Variables

4 Sums of Independent Random Variables 4 Sums of Independent Random Variables Standing Assumptions: Assume throughout this section that (,F,P) is a fixed probability space and that X 1, X 2, X 3,... are independent real-valued random variables

More information

Strongly chordal and chordal bipartite graphs are sandwich monotone

Strongly chordal and chordal bipartite graphs are sandwich monotone Strongly chordal and chordal bipartite graphs are sandwich monotone Pinar Heggernes Federico Mancini Charis Papadopoulos R. Sritharan Abstract A graph class is sandwich monotone if, for every pair of its

More information