The 29th Workshop on Combinatorial Mathematics and Computation Theory

On the Repeating Group Finding Problem

Bo-Ren Kung, Wen-Hsien Chen, R.C.T. Lee
Graduate Institute of Information Technology and Management
Takming University of Science and Technology
last3@gmail.com, wchen@m.ntu.edu.tw, rctlee@ncnu.edu.tw

Abstract

In this paper, we investigate the repeating group detection problems. We define a special kind of repeating group, namely the maximal repeating group. Based upon this definition, we define a special string, called a complete repeating group string. Using the dynamic programming method, we can find all repeating groups of a string and determine whether a string is a complete repeating group string.

1 Introduction

String processing is an important and interesting task in computer science. It is also useful in DNA sequence associated research. In this task, there are some common and classic problems: the local sequence alignment problem [1][2], the global sequence alignment problem [3], the multiple sequence alignment problem [4][5][6], the exact pattern matching problem [7][8][9][10], the approximate pattern matching problem [11][12][13], the finding all maximal palindromes problem [14], the finding all tandem repeats problem [14][15][16][17], the finding all tandem arrays problem [18], etc. There has been quite a lot of research done on these problems.

In this paper, we propose two new and interesting problems for string processing. We shall first define some terminology. A string is a sequence of characters. We shall use $T = t_1 t_2 \cdots t_n$ to denote a string and $T(i, j)$ to denote $t_i t_{i+1} \cdots t_j$. In a string, if two substrings at different positions are equal, that is, $T(i, j) = T(k, l)$, we say that $T(i, j)$ and $T(k, l)$ form a repeating group.

We further define a term, maximal repeating group. A maximal repeating group of a string is a repeating group which is not contained in any other repeating group. For instance, consider $T$ = cabcdabce. Then abc is a maximal repeating group, while neither ab nor bc is a maximal repeating group because they are both contained in abc.
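As a minimal illustration of the two definitions above, the following Python sketch enumerates repeated substrings by brute force and keeps the maximal ones. It is not the algorithm of this paper. The function names are hypothetical, and the sketch assumes that "contained in" means "is a proper substring of another repeated substring", which is one possible reading of the definition.

```python
def repeated_substrings(t):
    """Return every substring of t that occurs at two or more positions."""
    seen, repeated = set(), set()
    n = len(t)
    for i in range(n):
        for j in range(i + 1, n + 1):
            s = t[i:j]
            if s in seen:
                # s was already produced by a different (i, j) pair
                repeated.add(s)
            else:
                seen.add(s)
    return repeated


def maximal_repeating_groups(t):
    """Keep the repeated substrings that are not a proper substring of another
    repeated substring (an assumed reading of 'maximal')."""
    rep = repeated_substrings(t)
    return {s for s in rep if not any(s != r and s in r for r in rep)}


print(maximal_repeating_groups("cabcdabce"))  # {'abc'}
```

The brute-force enumeration takes time polynomial in the length of the string and is meant only to make the definitions concrete.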
If $T = A_1 A_2 \cdots A_m$ such that for every $A_i$ there is an $A_j$ such that $A_i$ and $A_j$ form a maximal repeating group, and no two $A_i$'s correspond to one $A_j$, then $T$ is called a complete repeating group string. Suppose the string is $T$ = accaabbccaacbcab. In this case $T = A_1 A_2 A_3 A_4 A_5 A_6 A_7 A_8$, where $A_1$ = ac, $A_2$ = ca, $A_3$ = ab, $A_4$ = bc, $A_5$ = ca, $A_6$ = ac, $A_7$ = bc and $A_8$ = ab. In other words, we can see that $T = A_1 A_2 A_3 A_4 A_2 A_1 A_4 A_3$. Note that all repeating groups in a complete repeating group string must be maximal and non-overlapping.

In this paper, we discuss two problems:

Problem 1: Given a string $T = t_1 t_2 \cdots t_n$, find all repeating groups in $T$.

Problem 2: Given a string $T = t_1 t_2 \cdots t_n$, decompose $T$ into a complete repeating group string if possible.

2 Problem 1

In this section, we propose an algorithm to solve Problem 1. The algorithm is based on the dynamic programming approach. In Problem 1, we are given a string $T = t_1 t_2 \cdots t_n$ and we have to find the repeating groups of $T$. We now compare $T(1, i)$ with $T(1, j)$. Without loss of generality, we assume that $i < j$. Let $M(i, j)$ denote the length of
the longest common suffix of $T(1, i)$ and $T(1, j)$. The table containing all $M(i, j)$'s will be called the M table of $T$. For example, let $T$ = gababagaba. Then we can see that $M(3, 5) = 2$ because the longest common suffix of $T(1, 5)$ = gabab and $T(1, 3)$ = gab is ab. On the other hand, $M(4, 7) = 0$ as there is no common suffix between $T(1, 7)$ = gababag and $T(1, 4)$ = gaba. To create the M table, we use the following recursive formula, taking $M(0, j) = 0$ as the boundary value:

$$M(i, j) = \begin{cases} M(i-1, j-1) + 1 & \text{if } t_i = t_j \\ 0 & \text{if } t_i \neq t_j \end{cases} \qquad \text{(Formula 1)}$$

The following dynamic programming table, Table 2-1, gives all $M(i, j)$'s of $T$ = gababagaba. The entry in row $j$ and column $i$ is $M(i, j)$.

          1  2  3  4  5  6  7  8  9  10
          g  a  b  a  b  a  g  a  b  a
     1 g
     2 a  0
     3 b  0  0
     4 a  0  1  0
     5 b  0  0  2  0
     6 a  0  1  0  3  0
     7 g  1  0  0  0  0  0
     8 a  0  2  0  1  0  1  0
     9 b  0  0  3  0  2  0  0  0
    10 a  0  1  0  4  0  3  0  1  0

Table 2-1 The M table of T = gababagaba.

From the above table, we can see that $M(4, 6) = 3$, which denotes that $T(1, 4)$ = gaba and $T(1, 6)$ = gababa have the longest common suffix aba with length 3. This means that we have found a repeating group $(T(2, 4), T(4, 6))$ (= aba). Since $M(4, 10) = 4$, we have found another repeating group $(T(1, 4), T(7, 10))$ (= gaba). By examining all of the $M(i, j)$'s, we will be able to find all of the repeating groups.

3 The Detection of Overlapping Repeating Groups and the Modification of the M Table

We first define overlapping as follows: a substring $T_1 = t_i t_{i+1} \cdots t_j$ overlaps with a substring $T_2 = t_k t_{k+1} \cdots t_l$ if a suffix of $T_1$ is a prefix of $T_2$, that is, if the two substrings share positions of $T$. For example, let $T$ = ababa, $T_1 = T(1, 3)$ = aba and $T_2 = T(3, 5)$ = aba. We can easily see that the suffix $T(3, 3)$ of $T_1$ is the prefix of $T_2$. So we can call $T_1$ and $T_2$ an overlapping repeating group. For Problem 2, we do not allow overlapping repeating groups. Suppose $T$ = abababa. We may first get the M table as follows:

         1  2  3  4  5  6  7
         a  b  a  b  a  b  a
    1 a
    2 b  0
    3 a  1  0
    4 b  0  2  0
    5 a  1  0  3  0
    6 b  0  2  0  4  0
    7 a  1  0  3  0  5  0

Table 3-1 The M table of T = abababa.

From the above table, we can see that $M(5, 7) = 5$, which indicates that $T(5-5+1, 5) = T(1, 5)$ = ababa and $T(7-5+1, 7) = T(3, 7)$ = ababa is a repeating group. We can also see that $T(3, 7)$ overlaps $T(1, 5)$ with a common substring $T(3, 5)$ = aba.
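The recurrence of Formula 1 and the overlap check illustrated with Table 3-1 can be sketched in Python as follows. This is a minimal sketch rather than a reference implementation. The names `m_table` and `repeating_groups` are hypothetical, indices are 1-based to match the text, and only the longest group ending at each pair of positions $(i, j)$ is reported.

```python
def m_table(t):
    """Build M with 1-based indices: M[i][j] (i < j) is the length of the
    longest common suffix of t[0:i] and t[0:j]. Row and column 0 stay 0."""
    n = len(t)
    M = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(1, j):
            if t[i - 1] == t[j - 1]:
                # Formula 1: extend the common suffix ending at (i-1, j-1)
                M[i][j] = M[i - 1][j - 1] + 1
    return M


def repeating_groups(t):
    """Return (first, second, overlapping) triples, one per non-zero M entry.
    first and second are 1-based inclusive (start, end) ranges."""
    M = m_table(t)
    groups = []
    for j in range(2, len(t) + 1):
        for i in range(1, j):
            k = M[i][j]
            if k > 0:
                first, second = (i - k + 1, i), (j - k + 1, j)
                # the ranges share a position when the second starts
                # no later than the first ends
                groups.append((first, second, second[0] <= first[1]))
    return groups


M = m_table("gababagaba")
print(M[3][5], M[4][7], M[4][6], M[4][10])  # 2 0 3 4
```

Filling the table takes $O(n^2)$ time and space for a string of length $n$, since each entry is computed from one earlier entry.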
Thus we shall not count $T(1, 5)$ = ababa and $T(3, 7)$ = ababa as a repeating group for Problem 2. In general, if $M(i, j) = k$, then $T(i-k+1, i) = T(j-k+1, j)$, or equivalently, $T(i-k+1, i)$
and $T(j-k+1, j)$ is a repeating group. If $k > j - i$, this repeating group is overlapping as shown in Fig 1.

Fig 1 An Overlapping Repeating Group.

If $M(i, j) = k$ and $k > j - i$, we know that $T(i-k+1, i) = T(j-k+1, j)$. Thus $T(i-k+1, i)$ and $T(j-k+1, j)$ is an overlapping repeating group. But, as shown in Fig 2, the shorter substrings $T(2i-j+1, i)$ and $T(i+1, j)$, each of length $j - i$, still form a non-overlapping repeating group.

Fig 2 Non-overlapping Repeating Group of an Overlapping Repeating Group.

In other words, we can use the following formula to find the overlapping repeating groups and modify them if possible:

$$M'(i, j) = \begin{cases} j - i & \text{if } M(i, j) = k > j - i \\ M(i, j) & \text{otherwise} \end{cases} \qquad \text{(Formula 2)}$$

According to the above formula, we can transform Table 3-1 to Table 4-1.

         1  2  3  4  5  6  7
         a  b  a  b  a  b  a
    1 a
    2 b  0
    3 a  1  0
    4 b  0  2  0
    5 a  1  0  2  0
    6 b  0  2  0  2  0
    7 a  1  0  3  0  2  0

Table 4-1 The M' table of T = abababa.

5 Problem 2

We are now ready to solve Problem 2. We first illustrate the general approach of our algorithm. Given a string $T$, we first find the longest suffix $Y$ of $T$ which is equal to a substring $X$ of $T$ which does not overlap with $Y$, as shown in Fig 3.

Fig 3 X of T which does not overlap with Y.

If there is no such non-overlapping repeating group, report failure; otherwise let $T' = T - X - Y$, that is, $T$ with $X$ and $Y$ removed, as shown in Fig 4, and start the above process again.

Fig 4 Non-overlapping repeating group.

Let us give an example here to illustrate our approach. Suppose we are given the following string:

        1  2  3  4  5  6  7  8  9  10 11 12
    T = a  b  c  d  a  c  c  d  a  c  a  b

Using the algorithm given in the previous sections, we may immediately find the longest suffix ab, namely $T(11, 12)$, which is equal to a substring in $T$, not overlapping with it, which is $T(1, 2)$. We record this first pair $T(1, 2) = T(11, 12)$ = ab. We may now mark $T(1, 2)$ and $T(11, 12)$ and consider the remaining unmarked string as
follows (* denotes a marked position):

        1  2  3  4  5  6  7  8  9  10 11 12
    T = a  b  c  d  a  c  c  d  a  c  a  b
        *  *                          *  *

We find that $T(7, 10) = T(3, 6)$ = cdac. Thus we conclude that $T$ is a complete repeating group string.

Previously, we showed that we could use the dynamic programming method to create an M table. We also showed that we can use Formula 2 to modify the table. Suppose that we have the following string:

    T = a b a b c a b c a b

Then the M table looks like the following:

          1  2  3  4  5  6  7  8  9  10
          a  b  a  b  c  a  b  c  a  b
     1 a
     2 b  0
     3 a  1  0
     4 b  0  2  0
     5 c  0  0  0  0
     6 a  1  0  1  0  0
     7 b  0  2  0  2  0  0
     8 c  0  0  0  0  3  0  0
     9 a  1  0  1  0  0  4  0  0
    10 b  0  2  0  2  0  0  5  0  0

Table 5-1 The M table of T = ababcabcab.

We use Formula 2 to modify the above table as follows:

          1  2  3  4  5  6  7  8  9  10
          a  b  a  b  c  a  b  c  a  b
     1 a
     2 b  0
     3 a  1  0
     4 b  0  2  0
     5 c  0  0  0  0
     6 a  1  0  1  0  0
     7 b  0  2  0  2  0  0
     8 c  0  0  0  0  3  0  0
     9 a  1  0  1  0  0  3  0  0
    10 b  0  2  0  2  0  0  3  0  0

Table 5-2 The M' table of T = ababcabcab.

We now examine the last row of the above table and note that the largest element in the row is $M'(7, 10) = 3$. This means that we have found the longest suffix $T(8, 10)$ which is equal to another substring in $T$, namely $T(5, 7)$. This is our first non-overlapping repeating group found. We mark these two substrings so that the table looks like the following, where * denotes a row of a marked position:

          1  2  3  4  5  6  7  8  9  10
          a  b  a  b  c  a  b  c  a  b
     1 a
     2 b  0
     3 a  1  0
     4 b  0  2  0
    *5 c  0  0  0  0
    *6 a  1  0  1  0  0
    *7 b  0  2  0  2  0  0
    *8 c  0  0  0  0  3  0  0
    *9 a  1  0  1  0  0  3  0  0
   *10 b  0  2  0  2  0  0  3  0  0

Table 5-3 Marking the M' table of T = ababcabcab.

For the remaining matrix, the last row is row 4. We find that the largest element of this row is $M'(2, 4)$. This means that we have found
another non-overlapping repeating group, namely, $T(3, 4)$ and $T(1, 2)$. After marking the two substrings again, we find that the matrix becomes empty. We report that our string decomposes as follows and is a complete repeating group string:

    T = ab ab cab cab

6 Conclusion

In this paper, we discuss two string processing problems. We show that the dynamic programming technique can be effectively used to solve these problems. In the future we would like to explore the following problem. Note that the decomposition of a string into a complete repeating group string is not unique. It would be good if we could find a solution such that the number of maximal repeating groups is minimized. We suspect that this problem may be NP-complete. If it is not, we hope to find a polynomial algorithm to solve the problem.

References

[1] Smith, T. F. and Waterman, M. S. Identification of Common Molecular Subsequences. 1981, pp. 195-197.
[2] Webb, B. M., Liu, J. S. and Lawrence, C. E. BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Research, Vol. 30, No. 5, pp. 1268-1277.
[3] Huang, X. On global sequence alignment. Bioinformatics, Vol. 10, No. 3, 1994, pp. 227-235.
[4] Carrillo, H. and Lipman, D. The Multiple Sequence Alignment Problem in Biology. SIAM Journal on Applied Mathematics, Vol. 48, No. 5, 1988, pp. 1073-1082.
[5] Chan, S. C., Wong, A. K. C. and Chiu, D. K. Y. A Survey of Multiple Sequence Comparison Methods. Bulletin of Mathematical Biology, Vol. 54, No. 4, 1992, pp. 563-598.
[6] Lipman, D. J., Altschul, S. F. and Kececioglu, J. D. A Tool for Multiple Sequence Alignment. Proc. Nat. Acad. Sci., Vol. 86, 1989, pp. 4412-4415.
[7] Aho, A. V. and Corasick, M. J. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, Vol. 18, 1975, pp. 333-340.
[8] Boyer, R. S. and Moore, J. S. A Fast String Searching Algorithm. Communications of the ACM, Vol. 20, 1977, pp. 762-772.
[9] Fischer, M. J. and Paterson, M. S. String-Matching and Other Products. SIAM-AMS Proceedings, Vol. 7, 1974, pp. 113-125.
[10] Knuth, D. E., Morris, J. H. and Pratt, V. R.
Fast Pattern Matching in Strings. SIAM Journal on Computing, Vol. 6, 1977, pp. 323-350.
[11] Landau, G. M. and Vishkin, U. Efficient String Matching with k Mismatches. Theoretical Computer Science, Vol. 43, 1986, pp. 239-249.
[12] Galil, Z. and Giancarlo, R. Improved String Matching with k Mismatches. SIGACT News, Vol. 17, No. 4, 1986, pp. 52-54.
[13] Ukkonen, E. Algorithms for Approximate String Matching. Information and Control, Vol. 64, 1985, pp. 100-118.
[14] Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[15] Benson, G. Tandem Repeats Finder: a Program to Analyze DNA Sequences. Oxford
[16] Buchner, M. and Janjarasjitt, S. Detection and Visualization of Tandem Repeats in DNA
Sequences. IEEE Transactions on Signal Processing, Vol. 51, 2003, pp. 2280-2287.
[17] Wexler, Y., Yakhini, Z., Kashi, Y. and Geiger, D. Finding Approximate Tandem Repeats in Genomic Sequences. Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, 2004, pp. 223-232.
[18] Stoye, J. and Gusfield, D. Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science, 2002.
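Illustrative sketch of the Section 5 procedure. The following Python code is one possible reading of the decomposition procedure, not the authors' implementation. The names `m_prime_table` and `decompose` are hypothetical. The sketch assumes that a chosen substring may not include an already marked position, so a candidate length is shortened until both substrings are unmarked. It does not verify that the groups found are maximal.

```python
def m_prime_table(t):
    """Build M by Formula 1, then cap each entry at j - i (Formula 2)."""
    n = len(t)
    M = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(1, n + 1):
        for i in range(1, j):
            if t[i - 1] == t[j - 1]:
                M[i][j] = M[i - 1][j - 1] + 1
    # the cap is applied only after M is complete, because Formula 1
    # needs the uncapped values
    return [[min(M[i][j], j - i) if i < j else 0 for j in range(n + 1)]
            for i in range(n + 1)]


def decompose(t):
    """Greedy sketch: repeatedly pair the longest unmarked suffix Y with an
    equal, non-overlapping, unmarked substring X. Returns a list of (X, Y)
    pairs as 1-based inclusive ranges, or None on failure."""
    n = len(t)
    Mp = m_prime_table(t)
    free = [False] + [True] * n          # free[p]: position p is unmarked
    pairs = []
    while any(free):
        j = max(p for p in range(1, n + 1) if free[p])   # last unmarked row
        best_k, best_i = 0, 0
        for i in range(1, j):
            k = Mp[i][j]
            # assumption: shorten k until both ranges avoid marked positions
            while k > 0 and not (all(free[i - k + 1:i + 1])
                                 and all(free[j - k + 1:j + 1])):
                k -= 1
            if k > best_k:
                best_k, best_i = k, i
        if best_k == 0:
            return None                  # no non-overlapping partner exists
        x, y = (best_i - best_k + 1, best_i), (j - best_k + 1, j)
        for p in list(range(x[0], x[1] + 1)) + list(range(y[0], y[1] + 1)):
            free[p] = False              # mark X and Y
        pairs.append((x, y))
    return pairs


print(decompose("ababcabcab"))  # [((5, 7), (8, 10)), ((1, 2), (3, 4))]
```

On the two worked examples of Section 5 the sketch returns the same pairs as the text. A greedy choice of the longest suffix is not guaranteed to succeed whenever some decomposition exists, so a `None` result should be read as failure of this procedure only.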