Mining Infrequent Patterns of Two Frequent Substrings from a Single Set of Biological Sequences


Daisuke Ikeda
Department of Informatics, Kyushu University
744 Moto-oka, Fukuoka, Japan. daisuke@inf.kyushu-u.ac.jp

Abstract

This paper considers mining infrequent patterns from biological sequences. Two typical approaches to finding infrequent patterns are model-driven and data-driven, and each has advantages and disadvantages. As a mixed approach, FPCS (Finding Peculiar Composite Strings) was proposed in the literature, where two substrings x and y are determined by the given data and their concatenation xy is evaluated in a model-driven manner. Although its effectiveness has already been shown, it requires a background set of sequences in addition to the target set. In this paper, we propose another approach for infrequent patterns which, given a single set of sequences, finds string patterns composed of two substrings that are frequent in the set. The proposed approach is therefore simpler than FPCS. Using biological features, such as RNA, of popular bacterial DNA sequences, the effectiveness of the proposed approach is evaluated. For B. subtilis and C. perfringens, the proposed approach finds RNA regions as well as FPCS does, while it fails to do so for E. coli and S. enterica because FPCS is more finely granular than the proposed approach.

Keywords: Under-represented patterns, Infrequent patterns, Text mining, Bioinformatics

1. Introduction

With plenty of biological sequences available, it is becoming much more important to develop mining algorithms for such sequences. Among such algorithms, those for frequent or infrequent patterns, called over-represented or under-represented patterns, respectively, have attracted attention in bioinformatics [2]. Compared to frequent patterns, it is more difficult for mining algorithms to find infrequent ones in biological sequences because of data sparseness.
That is, there exist a huge number of infrequent subsequences due to the sparseness, and thus it is critical to select useful patterns out of the huge number of infrequent candidate patterns.

We can view existing methods as two basic approaches to infrequent pattern mining. One is model-driven: a probabilistic model is assumed to define what is normal, and a candidate pattern is under- or over-represented if its frequency is far from the expected frequency estimated from the model. The other is data-driven: given two sets of sequences, an algorithm outputs a pattern if it appears frequently in one of the sets while it rarely does in the other.

A typical approach of the former type is the z-score [2]. The score for a pattern $w$ is defined as

$$z(w) = \frac{f(w) - E(w)}{N(w)},$$

where $f(w)$ is the frequency of $w$ in a given set of sequences, $E(w)$ is the expected value of that frequency under an assumed probabilistic model, and $N(w)$ is a normalization factor for $w$. As the probabilistic model, the Bernoulli model is assumed in [3], [4] and the Markov model is considered in [5], [6]. However, a simple model cannot describe the given sequences well while a complicated one requires huge computational resources, and thus it is difficult to decide on an appropriate model in advance.

A typical approach of the latter type is the contrast or distinguishing pattern. In this case, a background set is assumed, in addition to a target set, to define what is normal [7], [8]. However, this approach is basically for frequent patterns: it eliminates useless candidate patterns that are frequent in both sets.

In [9], the algorithm called FPCS (Finding Peculiar Compositions) was proposed, where, given a target set $T$ and a background set $B$ of sequences, a pattern $w$ is extracted in the form $w = xy$ if each of $x$ and $y$ is more frequent in $B$ than in $T$ and, conversely, $w = xy$ is more frequent in $T$ than in $B$.
More precisely, given two parameters $\theta_T, \theta_B$ ($\geq 1$), a candidate pattern $w = xy$ is extracted if $P(x \mid B) > \theta_B P(x \mid T)$, $P(y \mid B) > \theta_B P(y \mid T)$, and $P(xy \mid T) > \theta_T P(xy \mid B)$, where $P(w \mid S)$ denotes the empirical probability of $w$ in $S$. This means that we estimate the probability of $xy$ by $\hat{P}(xy) = P(x \mid B) P(y \mid B)$ and, if the observed probability $P(xy \mid T)$ is larger than the estimated probability $\hat{P}(xy)$, we say that $xy$ is quite unusual in $T$. In this framework, the estimation of probabilities is done as in the z-score with a probabilistic model, but the unit of words, such as $x$ and $y$, is defined automatically using the given background set of data. In this sense, we can say that FPCS is a mixture of the model-driven and data-driven approaches.
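The three FPCS inequalities above can be checked mechanically once the counts are known. The following is a minimal sketch, not the authors' implementation: the function names and the normalizers `total_T` and `total_B` are assumptions made for illustration (the paper only says that the empirical probability is a relative frequency), and the counts are taken from the first row of Table 1.

```python
def empirical_probability(count, total):
    """Relative frequency of a pattern; `total` is an assumed normalizer."""
    if total <= 0:
        raise ValueError("total must be positive")
    return count / total


def is_peculiar_composition(f_xy, f_x, f_y, total_T, total_B,
                            theta_T=2.0, theta_B=2.0):
    """Check the three FPCS inequalities for a candidate w = xy.

    Each of f_xy, f_x, f_y is a pair (count in T, count in B),
    mirroring the column layout of Table 1.
    """
    (fxy_T, fxy_B), (fx_T, fx_B), (fy_T, fy_B) = f_xy, f_x, f_y
    px_T = empirical_probability(fx_T, total_T)
    px_B = empirical_probability(fx_B, total_B)
    py_T = empirical_probability(fy_T, total_T)
    py_B = empirical_probability(fy_B, total_B)
    pxy_T = empirical_probability(fxy_T, total_T)
    pxy_B = empirical_probability(fxy_B, total_B)
    return (px_B > theta_B * px_T          # x is more frequent in B
            and py_B > theta_B * py_T      # y is more frequent in B
            and pxy_T > theta_T * pxy_B)   # but xy is more frequent in T


# Counts for (AACGCTGG, CGGCGTG) from Table 1; the equal totals below are
# hypothetical placeholders, not values from the paper.
row = ((9, 1), (92, 389), (399, 1217))
print(is_peculiar_composition(*row, total_T=1_000_000, total_B=1_000_000))
```

With equal normalizers the test reduces to comparing raw counts; different choices of `total_T` and `total_B` can change the outcome for borderline candidates.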

In [9], it is shown that, given bacterial sequences as the target and background sets, many of the peculiar compositions found are exceptional by a z-score criterion, and some are not. This implies that we can find peculiar compositions which cannot be found by the z-score. In [10], it is shown that many peculiar compositions are found in biological features, such as rRNA and transposase, using DNA sequences of 7 popular bacteria, such as Escherichia coli K-12 (E. coli) and Bacillus subtilis (B. subtilis).

Although the effectiveness of FPCS has been shown, it requires a background set of sequences in addition to the target set. Of course, it seems natural in bioinformatics to compare the target sequences with some other sequences. However, an infrequent pattern mining algorithm is much more useful when it can be applied to a single set of sequences.

In this paper, we propose another approach for under-represented patterns, inspired by FPCS. The proposed method requires a single set of sequences and outputs infrequent patterns of the form $w = xy$, where $x$ and $y$ are frequent in the input set and $P(xy)$ is much larger than its estimated value $P(x)P(y)$. Using biological features, such as RNA, of popular bacterial DNA sequences, the effectiveness of the proposed method is evaluated.

2. Finding Peculiar Compositions

Following [9], [10], we briefly explain the peculiar composition discovery problem and its significance for DNA sequences of popular bacteria.

2.1 Problem Definition

Let $\Sigma$ be an alphabet; an element of $\Sigma$ is called a letter. In the case of nucleotide sequences, $\Sigma = \{A, C, G, T\}$. The set of all finite sequences of one or more letters is denoted by $\Sigma^+$, and an element of $\Sigma^+$ is called a string. The length of a string $x$, denoted by $|x|$, is the number of letters of $x$. Consider a string $x = a_1 \cdots a_n$ ($a_i \in \Sigma$). A letter $a_i$ in $x$ is denoted by $x[i]$, and $a_i a_{i+1} \cdots a_j$ ($i < j$) is called a substring (see footnote 1).
(Footnote 1: In this paper, we do not consider the empty string, that is the case i = j.)

For two strings $x, y \in \Sigma^+$, the concatenation of $x$ and $y$ is denoted by $xy$. We call $xy$ the composition of $x$ and $y$. Conversely, a pair of strings $(x, y)$ is called a division of $w$ if $w = xy$. There exist $O(|w|)$ divisions of $w$. For instance, if $x$ = AAC and $y$ = GC then $xy$ = AACGC, and (A, ACGC), (AA, CGC), ..., (AACG, C) are divisions of AACGC.

Let $x, y \in \Sigma^+$. An occurrence of $x$ in $y$ is an integer $i$ such that $x[j] = y[i + j]$ for all $1 \leq j \leq |x|$. The frequency of $x$ in $y$ is the number of occurrences of $x$ in $y$. We extend this notion to a set $D$ of strings, instead of a single string $y$, as follows: $f_D(x)$ denotes the sum of the frequencies of $x$ in all strings in $D$. Since the frequency is affected by the absolute size of $D$, we introduce the empirical probability of $x$ in $D$ as the relative frequency $P(x \mid D) = f_D(x) / \#D$, where $\#D$ is the sum of frequencies.

We define the set of positions of $x$ in $y$ as $\mathrm{Pos}_y(x) = \{\, i + j \mid x[j] = y[i + j],\ 1 \leq j \leq |x| \,\}$, where $i$ ranges over the occurrences of $x$ in $y$. For example, $\mathrm{Pos}_{babbab}(ab) = \{2, 3, 5, 6\}$. In other words, $\mathrm{Pos}_y(x)$ is the set of all positions of $y$ covered by $x$. It is naturally extended to a set $D$ of substrings of $y$ by $\mathrm{Pos}_y(D) = \bigcup_{x \in D} \mathrm{Pos}_y(x)$. Note that a position is counted only once even if two substrings $x$ and $x'$ share it, since $\mathrm{Pos}_y(D)$ is defined as a set.

The peculiar composition discovery problem is defined as follows.

Definition 1: The peculiar composition discovery problem is, given two sets $T$ and $B$ of strings and threshold values $\theta_T > 1$, $\theta_B > 1$ and $\eta \geq 2$, to find all peculiar compositions of the form $xy$ such that $P(x \mid B) > \theta_B P(x \mid T)$, $P(y \mid B) > \theta_B P(y \mid T)$, $P(xy \mid T) > \theta_T P(xy \mid B)$, and $f_T(xy) \geq \eta$.

From the first two conditions, the frequencies of both $x$ and $y$ are much larger in $B$ than in $T$. Therefore, we would expect the composition $xy$ to appear more frequently in $B$ than in $T$. From the third condition, however, a peculiar composition $xy$ that is found appears more frequently in $T$ than in $B$.
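The notions of occurrence, frequency, covered positions and division defined above can be sketched directly in code. This is an illustrative sketch only; the function names are hypothetical and not taken from the paper. Occurrences are returned as the offsets $i$ of the definition, so that covered positions $i + j$ are 1-based, matching the example $\mathrm{Pos}_{babbab}(ab) = \{2, 3, 5, 6\}$.

```python
def occurrences(x, y):
    """Offsets i with x[j] == y[i + j] for 1 <= j <= |x| (1-based letters).

    Overlapping occurrences are all reported.
    """
    return [i for i in range(len(y) - len(x) + 1) if y.startswith(x, i)]


def frequency_in_set(x, D):
    """f_D(x): sum of the frequencies of x over all strings in D."""
    return sum(len(occurrences(x, s)) for s in D)


def positions(x, y):
    """Pos_y(x): 1-based positions of y covered by some occurrence of x."""
    return {i + j for i in occurrences(x, y) for j in range(1, len(x) + 1)}


def positions_of_set(D, y):
    """Pos_y(D): union of Pos_y(x) over x in D; shared positions count once."""
    covered = set()
    for x in D:
        covered |= positions(x, y)
    return covered


def divisions(w):
    """All pairs (x, y) of non-empty strings with w == x + y."""
    return [(w[:k], w[k:]) for k in range(1, len(w))]


print(sorted(positions("ab", "babbab")))   # [2, 3, 5, 6]
print(divisions("AACGC"))
```

A string of length $|w|$ has $|w| - 1$ divisions, which is the $O(|w|)$ bound stated in the text.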
In this sense, $xy$ is exceptional.

2.2 Peculiar Compositions in Biological Sequences

Fig. 1 shows the peculiar compositions found, on a genetic map of the whole target sequence of B. subtilis from 1 bp at the top left to 4,214,630 bp at the bottom right, where E. coli is used as the background sequence. A map contains two tracks. The upper track is for biological features, where a feature is displayed above (resp. below) the track line if it is on the normal strand (resp. its complement). The lower track is for the peculiar compositions found; they are drawn on both strands because, if a peculiar composition is found on the normal strand, then its corresponding composition is also found at the corresponding position of the complement, and vice versa.

Fig. 1 includes, as biological features, rRNA colored in blue, tRNA in light blue, and other RNA-related features in navy blue, where we say that a feature is RNA-related if its feature key includes "RNA" as a substring. A gene or CDS whose function, product, or note record contains "transposon" or "phage" is colored in green or yellow, respectively. We exclude other genes and CDSs from the map because it is known that genes prevail in bacterial DNA sequences.

Fig. 1: A genetic map of the whole DNA sequence of B. subtilis with two tracks, where rRNA, tRNA, other RNAs, transposon, and phage are colored in blue, light blue, navy blue, green and yellow, respectively, on the upper track, and the peculiar compositions found are colored in red on the lower track.

We see that the peculiar compositions found with $\theta_T = 2.0$, $\theta_B = 2.0$ and $\eta = 3$ appear densely at biological features, especially RNAs. It is known that RNAs are well preserved among species, and that transposons and phages are external, and thus we can say that the peculiar compositions found are useful.

Fig. 2 shows three enlarged maps of B. subtilis from 1 bp to around 1 Mbp, where the value of $\eta$ is changed. From these maps, we see that the peculiar compositions found appear densely at biological features, even if we use a larger value for the parameter. In [10], it is also shown that patterns extracted by the z-score, and contrast patterns appearing infrequently in the target sequence, do not match biological features well.

The peculiar compositions in Table 1 are found in B. subtilis, where E. coli is used as the background set, $\theta_T = \theta_B = 2.0$ and $\eta = 10$. They are found in rRNA (rrnO-16S).

3. Peculiar Compositions of Frequent Substrings

In this section, we introduce the peculiar composition of frequent substrings discovery problem. First of all, we assume a single set of sequences. In the peculiar composition discovery problem, we used $T$ and $B$ to define frequent $x$ and $y$. Now, however, we do not have $B$, and thus we define frequent $x$ and $y$ by a minimum support threshold. Once we obtain frequent $x$ and $y$, all we have to do is to find $xy$ whose probability is much larger than its estimated value, $P(x)P(y)$.

The peculiar composition of frequent substrings discovery problem is defined as follows.
Definition 2: The peculiar composition of frequent substrings discovery problem is, given a set $T$ of strings and threshold values $\theta > 1$, $\mathit{minsup}_f \geq 2$ and $\mathit{minsup}_{xy} \geq 2$, to find all peculiar compositions of frequent substrings of the form $xy$ such that $f_T(x) \geq \mathit{minsup}_f$, $f_T(y) \geq \mathit{minsup}_f$, $P(xy \mid T) > \theta P(x \mid T) P(y \mid T)$, and $f_T(xy) \geq \mathit{minsup}_{xy}$.

From the first two conditions, both $x$ and $y$ must be frequent, and from the third, the probability of $xy$ must be larger than its expected value $P(x \mid T) P(y \mid T)$. The last condition is the minimum support of the patterns found; $\mathit{minsup}_{xy}$ plays the role of $\eta$ in Definition 1.

4. Experiments

In this section, after describing the data sets and how we evaluate, we show both qualitative and quantitative evaluations of the proposed method.

4.1 Setting

The data sets used in our experiments are whole DNA sequences of four bacteria, E. coli, B. subtilis, Clostridium perfringens (C. perfringens for short), and Salmonella enterica (S. enterica for short), whose statistics are shown in

Table 2.

Fig. 2: Three maps for $\eta$ = 3, 5 and 10 from left to right, where the target set, the background set, and the other parameters $\theta_T$, $\theta_B$ are fixed to B. subtilis, E. coli, 2 and 2, respectively.

Table 1: Some peculiar compositions found in B. subtilis, where E. coli is used as the background set.

| $(x, y)$ | $(f_T(xy), f_B(xy))$ | $(f_T(x), f_B(x))$ | $(f_T(y), f_B(y))$ |
| (AACGCTGG, CGGCGTG) | (9, 1) | (92, 389) | (399, 1217) |
| (ACGCTG, GCGGCGT) | (9, 3) | (1532, 4008) | (584, 1838) |
| (CGCTGGCG, GCGTG) | (9, 2) | (104, 895) | (4009, 11412) |
| (CGCTG, GCGGCGT) | (10, 6) | (6718, 17434) | (584, 1838) |

Fig. 3: A gene map of B. subtilis from 1 bp to around 20 Kbp. (Intervals labeled in the figure: (9807, 11362), (11461, 11538), (11549, 11625), (11706, 14634), (14689, 14808).)

We have chosen E. coli and B. subtilis since they are typical model bacteria with different properties: the former is Gram-negative while the latter is Gram-positive. We have chosen C. perfringens (resp. S. enterica) since it is Gram-positive (resp. Gram-negative).

As qualitative evaluation, we use genetic maps as shown above, and as quantitative evaluation we calculate a popular evaluation value used in information retrieval, the F-measure, which is defined as

$$F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R},$$

where $P$ and $R$ denote precision and recall, respectively. The F-measure typically means $F_1$, which weights precision and recall equally, but we choose $\beta = 1/4$. Our goal is not to find these features but to show that the peculiar compositions found match biological features. Although the patterns found seem to fully cover RNAs (see Fig. 1), they are sparse in the enlarged map of Fig. 3, where only about 20 Kbp are shown. From the viewpoint of our goal, we do not need high recall values, since we are trying to find useful, infrequent patterns and we cannot expect infrequent patterns to cover all occurrences of some features. Thus, we choose $\beta = 1/4$, which weighs precision four times as much as recall.
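To make the setting concrete, the following brute-force sketch enumerates compositions satisfying Definition 2 on a toy input and evaluates the $F_\beta$ formula above. It is a sketch under stated assumptions, not the authors' implementation: all function names and the length cap `max_len` are hypothetical, and the normalizer of the empirical probability is assumed to be the number of windows of the same length as the pattern, which the paper does not specify.

```python
from collections import Counter


def substring_counts(seqs, max_len):
    """Count every substring of length 1..max_len (overlaps included)."""
    counts = Counter()
    for s in seqs:
        for i in range(len(s)):
            for k in range(1, min(max_len, len(s) - i) + 1):
                counts[s[i:i + k]] += 1
    return counts


def windows(seqs, k):
    """Number of length-k windows in seqs (assumed probability normalizer)."""
    return sum(max(len(s) - k + 1, 0) for s in seqs)


def mine_compositions(seqs, theta, minsup_f, minsup_xy, max_len):
    """Return all (x, y) meeting the four conditions of Definition 2."""
    counts = substring_counts(seqs, max_len)

    def prob(w):
        return counts[w] / windows(seqs, len(w))

    found = set()
    for w, f_w in counts.items():
        if len(w) < 2 or f_w < minsup_xy:      # support of xy
            continue
        for k in range(1, len(w)):             # every division of w
            x, y = w[:k], w[k:]
            if (counts[x] >= minsup_f and counts[y] >= minsup_f
                    and prob(w) > theta * prob(x) * prob(y)):
                found.add((x, y))
    return found


def f_beta(precision, recall, beta=0.25):
    """F_beta = (1 + beta^2) P R / (beta^2 P + R); 0.0 if undefined."""
    denom = beta ** 2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + beta ** 2) * precision * recall / denom


# Toy input, not data from the paper.
print(mine_compositions(["ACACACACAC"], theta=2.0,
                        minsup_f=5, minsup_xy=4, max_len=2))
print(f_beta(0.8, 0.2))
```

On the toy string, AC occurs 5 times in 9 windows while A and C each have probability 0.5, so only the division (A, C) exceeds $\theta P(x) P(y) = 0.5$. Enumerating all substrings is quadratic in the length cap; a suffix-based index would be the usual choice for whole genomes.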
4.2 Genetic Maps

As qualitative evaluation, we show genetic maps like Fig. 1. First, we show a map obtained from B. subtilis with parameter values $\theta$ = [value missing], $\mathit{minsup}_f = 1000$, and $\mathit{minsup}_{xy} = 20$ (see Fig. 4). We see that dense regions of peculiar compositions of frequent substrings match the blue regions, that is, rRNA and tRNA.

Next, we show a map obtained from E. coli with parameter values $\theta$ = [value missing], $\mathit{minsup}_f = 3000$, and $\mathit{minsup}_{xy} = 20$ (see Fig. 5). In this case, we cannot find dense regions even when we change the parameter values. For comparison with FPCS, we show a genetic map of E. coli, where B. subtilis is used as the background set, $\theta_T = \theta_B = 2$, and $\eta = 5$ (Fig. 6). Unlike in Fig. 5, we can clearly see dense regions, and most of these regions correspond to some biological features.

Fig. 7 shows genetic maps of C. perfringens and S. enterica. In the map of C. perfringens, we see dense regions at rRNA and tRNA, while in the map of S. enterica we cannot see such regions at the designated features, although dense regions do exist.

4.3 F-measures

In this section, we quantitatively evaluate the proposed method by the F-measure. Table 3 shows the results, where the feature column shows the features we consider to be correct. For E. coli, we use RNA and transposon as target features

Table 2: List of DNA sequences used in experiments.

| Name | Accession # | GC% | Length (bp) | Gram-pos/neg |
| E. coli | NC_ | | ,639,675 | - |
| B. subtilis | NC_ | | 4,214,630 | + |
| C. perfringens | NC_ | | ,256,683 | + |
| S. enterica | NC_ | | ,791,958 | - |

Fig. 4: A genetic map of the whole DNA sequence of B. subtilis, where $\theta$ = [value missing], $\mathit{minsup}_f = 1000$, and $\mathit{minsup}_{xy} = 20$.

Fig. 5: A genetic map of the whole DNA sequence of E. coli, where $\theta$ = [value missing], $\mathit{minsup}_f = 3000$, and $\mathit{minsup}_{xy} = 20$.

and, for B. subtilis, we use only RNA.

Fig. 6: A genetic map of the whole DNA sequence of E. coli, where B. subtilis is used as the background set, $\theta_T = \theta_B = 2.0$ and $\eta = 5$.

Table 3: Precision, recall, and $F_{1/4}$ values for E. coli and B. subtilis, where the features in the feature column are assumed to be correct. Columns: NC#, feature, $\theta$, $\mathit{minsup}_f$, $\mathit{minsup}_{xy}$, precision, recall, $F_{1/4}$. Four rows list the features RNA, Transposon and three rows list RNA [numeric entries missing].

First of all, the F-measure values for B. subtilis are much better than those for E. coli, as we have seen from the genetic maps. Next, we compare these results with those of FPCS [10]. The F-measures obtained by FPCS are much larger than those of the proposed method. For FPCS, the $F_{1/4}$, precision, and recall values are reported for RNA and transposon in the case of E. coli and, when B. subtilis is given, for RNA only [numeric values missing].

5. Conclusion

We have proposed the peculiar composition of frequent substrings discovery problem, which requires only a single set of sequences, and evaluated the approach both quantitatively and qualitatively. Since the proposed method requires only a single set, it is simpler than FPCS. However, the F-measure values of the proposed method are much smaller than those of FPCS.

The reason for this may lie in how $x$ and $y$ are defined to be frequent for final output patterns of the form $xy$. In FPCS, $x$ and $y$ are defined independently using two ratios, $\theta_T$ and $\theta_B$, and thus their frequencies can be quite different. In fact, we see quite different frequencies of $x$ and $y$ in Table 1. In the proposed method, on the other hand, being frequent is defined absolutely by one parameter value, $\mathit{minsup}_f$. Therefore, the frequencies of $x$ and $y$ must be similar. Thus, FPCS is more finely granular than the proposed approach.

From the maps in Figs. 4, 5, and 7, it seems that peculiar compositions of frequent substrings appear at RNA for Gram-positive bacteria (B. subtilis and C. perfringens) while they do not for Gram-negative bacteria (E. coli and S. enterica).
It is important to validate this hypothesis with more experiments.

References

[1] L. Parida, Pattern Discovery in Bioinformatics: Theory & Algorithms. Chapman & Hall/CRC, July 2007.

Fig. 7: Genetic maps of the whole DNA sequences of C. perfringens (upper side) and S. enterica (lower side), where $\theta$ = [value missing], $\mathit{minsup}_f = 3000$, and $\mathit{minsup}_{xy} = 10$ are used for C. perfringens and $\theta$ = [garbled in source: "mytheta57"], $\mathit{minsup}_f = 2000$, and $\mathit{minsup}_{xy} = 10$ for S. enterica.

[2] A. Apostolico, M. E. Bock, S. Lonardi, and X. Xu, "Efficient Detection of Unusual Words," J. of Comput. Biol., vol. 7, no. 1/2, Jan.
[3] M.-Y. Leung, G. M. Marsh, and T. P. Speed, "Over- and Underrepresentation of Short DNA Words in Herpesvirus Genomes," J. of Comput. Biol., vol. 3, no. 3.
[4] S. Schbath, "An Efficient Statistic to Detect Over- and Underrepresented Words in DNA Sequences," J. of Comput. Biol., vol. 4, no. 2.
[5] T. Marschall and S. Rahmann, "Efficient Exact Motif Discovery," Bioinformatics, vol. 25, no. 12, pp. i356–i364.
[6] D. Ikeda, "Characteristic Sets of Strings Common to Semi-structured Documents," in Proceedings of the Second International Conference on Discovery Science, ser. Lecture Notes in Artificial Intelligence 1721, December 1999.
[7] X. Ji, J. Bailey, and G. Dong, "Mining Minimal Distinguishing Subsequence Patterns with Gap Constraints," Knowledge and Information Systems, vol. 11, no. 3.
[8] D. Ikeda and E. Suzuki, "Mining Peculiar Compositions of Frequent Substrings from Sparse Text Data Using Background Texts," in Proc. of ECML PKDD, September 2009.
[9] D. Ikeda, O. Maruyama, and S. Kuhara, "Infrequent, Unexpected, and Contrast Pattern Discovery from Bacterial Genomes by Genome-wide Comparative Analysis," in Proceedings of the 4th International Conference on Bioinformatics Models, Methods and Algorithms, February 2013.
