On the Repeating Group Finding Problem

The 9th Workshop on Combinatorial Mathematics and Computation Theory

Bo-Ren Kung, Wen-Hsien Chen, R.C.T Lee
Graduate Institute of Information Technology and Management
Takming University of Science and Technology
last3@gmail.com, wchen@m.ntu.edu.tw, rctlee@ncnu.edu.tw

Abstract

In this paper, we investigate repeating group detection problems. We define a special kind of repeating group, namely the maximal repeating group. Based upon this definition, we define a special string, called a complete repeating group string. Using dynamic programming, we can find all repeating groups of a string and determine whether a string is a complete repeating group string.

1 Introduction

String processing is an important and interesting task in computer science. It is also useful in DNA sequence research. In this area there are several common and classic problems: the local sequence alignment problem [1][2], the global sequence alignment problem [3], the multiple sequence alignment problem [4][5][6], the exact pattern matching problem [7][8][9][10], the approximate pattern matching problem [11][12][13], the problem of finding all maximal palindromes [14], the problem of finding all tandem repeats [14][15][16][17], the problem of finding all tandem arrays [18], etc. A considerable amount of research has been done on these problems. In this paper, we propose two new and interesting problems for string processing.

We first define some terminology. A string is a sequence of characters, and we use $T = t_1 t_2 \cdots t_n$ to denote a string. We use $T(i, j)$ to denote the substring $t_i t_{i+1} \cdots t_j$. If two substrings at different positions of a string are equal, that is, $T(i, j) = T(p, q)$ with $i \neq p$, we say that $T(i, j)$ and $T(p, q)$ form a repeating group.

We further define the term maximal repeating group. A maximal repeating group of a string is a repeating group which is not contained in any other repeating group. For instance, consider $T = cabcdabce$. Then $abc$ is a maximal repeating group, while neither $ab$ nor $bc$ is a maximal repeating group because both are contained in $abc$.
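The two definitions above can be illustrated with a small brute-force sketch in Python. It is not the method of this paper (the dynamic programming approach comes later); the function names are hypothetical, and "contained in" is read here as plain substring containment, which is an assumption about the intended meaning.

```python
def repeated_substrings(text: str) -> set[str]:
    """Return every substring that occurs at least twice in text.

    Occurrences are allowed to overlap here; non-overlap is only
    required later, for Problem 2.
    """
    seen: set[str] = set()
    repeated: set[str] = set()
    n = len(text)
    for start in range(n):
        for end in range(start + 1, n + 1):
            piece = text[start:end]
            if piece in seen:
                repeated.add(piece)   # second (or later) occurrence
            else:
                seen.add(piece)       # first occurrence
    return repeated


def maximal_repeated_substrings(text: str) -> set[str]:
    """Keep only repeated substrings not contained in a longer repeated one.

    Assumption: containment is tested on the substrings themselves,
    not on their positions in text.
    """
    repeated = repeated_substrings(text)
    return {
        s for s in repeated
        if not any(s != other and s in other for other in repeated)
    }


if __name__ == "__main__":
    print(sorted(repeated_substrings("cabcdabce")))
    print(maximal_repeated_substrings("cabcdabce"))
```

For the example string cabcdabce, the repeated substrings are a, b, c, ab, bc and abc, and only abc survives the maximality filter, in agreement with the text.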
If $T = A_1 A_2 \cdots A_m$ such that for every $A_i$ there is an $A_j$ such that $A_i$ and $A_j$ form a maximal repeating group, and no two $A_i$'s correspond to one $A_j$, then $T$ is called a complete repeating group string. Suppose the string is $T = accaabbccaacbcab$. In this case $T = A_1 A_2 A_3 A_4 A_5 A_6 A_7 A_8$, where $A_1 = ac$, $A_2 = ca$, $A_3 = ab$, $A_4 = bc$, $A_5 = ca$, $A_6 = ac$, $A_7 = bc$ and $A_8 = ab$. In other words, $T = A_1 A_2 A_3 A_4 A_2 A_1 A_4 A_3$. Note that all repeating groups in a complete repeating group string must be maximal and non-overlapping.

In this paper, we discuss two problems:

Problem 1: Given a string $T = t_1 t_2 \cdots t_n$, find all repeating groups in $T$.

Problem 2: Given a string $T = t_1 t_2 \cdots t_n$, decompose $T$ into a complete repeating group string if possible.

2 Problem 1

In this section, we propose an algorithm to solve Problem 1. The algorithm is based on the dynamic programming approach. In Problem 1, we are given a string $T = t_1 t_2 \cdots t_n$ and we have to find the repeating groups of $T$. We compare $T(1, i)$ with $T(1, j)$. Without loss of generality, we assume that $i < j$. Let $M(i, j)$ denote the length of the longest common suffix of $T(1, i)$ and $T(1, j)$. The table containing all $M(i, j)$'s will be called the M table of $T$.

For example, let $T = gababagaba$. Then $M(3, 5) = 2$ because the longest common suffix of $T(1, 5) = gabab$ and $T(1, 3) = gab$ is $ab$. On the other hand, $M(4, 7) = 0$ as there is no common suffix between $T(1, 7) = gababag$ and $T(1, 4) = gaba$.

To create the M table, we use the following recursive formula, with $M(0, j) = 0$:

$$M(i, j) = \begin{cases} M(i-1, j-1) + 1 & \text{if } t_i = t_j \\ 0 & \text{if } t_i \neq t_j \end{cases}$$

Formula 1

The following dynamic programming table, Table 2-1, gives all $M(i, j)$'s of $T = gababagaba$. Rows are indexed by $j$ and columns by $i$.

        i:   1  2  3  4  5  6  7  8  9 10
             g  a  b  a  b  a  g  a  b  a
 j= 1  g
 j= 2  a     0
 j= 3  b     0  0
 j= 4  a     0  1  0
 j= 5  b     0  0  2  0
 j= 6  a     0  1  0  3  0
 j= 7  g     1  0  0  0  0  0
 j= 8  a     0  2  0  1  0  1  0
 j= 9  b     0  0  3  0  2  0  0  0
 j=10  a     0  1  0  4  0  3  0  1  0

Table 2-1 The M table of T = gababagaba

From the above table, we can see that $M(4, 6) = 3$, which means that $T(1, 4) = gaba$ and $T(1, 6) = gababa$ have the longest common suffix $aba$ with length 3. This means that we have found a repeating group $(T(2, 4), T(4, 6))$ $(= aba)$. Since $M(4, 10) = 4$, we have found another repeating group $(T(1, 4), T(7, 10))$ $(= gaba)$. By examining all of the $M(i, j)$'s, we will be able to find all of the repeating groups.

3 The Detection of Overlapping Repeating Groups and the Modification of the M Table

We first define overlapping as follows: a substring $T_1$ overlaps with a substring $T_2$ if a suffix of $T_1$ is a prefix of $T_2$, that is, if the two occurrences share positions of $T$. For example, let $T = ababa$, $T_1 = T(1, 3) = aba$ and $T_2 = T(3, 5) = aba$. We can easily see that the suffix $t_3$ of $T_1$ is the prefix of $T_2$. So we call $T_1$ and $T_2$ an overlapping repeating group. For Problem 2, we do not allow overlapping repeating groups.

Suppose $T = abababa$. We may first get the M table as follows:

        i:   1  2  3  4  5  6  7
             a  b  a  b  a  b  a
 j=1  a
 j=2  b      0
 j=3  a      1  0
 j=4  b      0  2  0
 j=5  a      1  0  3  0
 j=6  b      0  2  0  4  0
 j=7  a      1  0  3  0  5  0

Table 3-1 The M table of T = abababa

From the above table, we can see that $M(5, 7) = 5$, which indicates that $T(5-5+1, 5) = T(1, 5) = ababa$ and $T(7-5+1, 7) = T(3, 7) = ababa$ form a repeating group. We can also see that $T(3, 7)$ overlaps $T(1, 5)$ with the common substring $T(3, 5) = aba$.
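The M tables above can be reproduced with a short Python sketch of Formula 1. The function names and the 1-based table layout with a zero border are choices made for this illustration, not part of the paper.

```python
def build_m_table(text: str) -> list[list[int]]:
    """Fill M(i, j) for 1 <= i < j <= n following Formula 1.

    The table is 1-based: m[i][j] is M(i, j). Row 0 and column 0
    stay 0 and act as the boundary M(0, j) = 0.
    """
    n = len(text)
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(2, n + 1):
        for i in range(1, j):
            if text[i - 1] == text[j - 1]:      # t_i = t_j
                m[i][j] = m[i - 1][j - 1] + 1
            # otherwise m[i][j] stays 0
    return m


def repeating_groups(text: str) -> list[tuple[int, int, int, int, str]]:
    """List (i-k+1, i, j-k+1, j, substring) for every M(i, j) = k > 0.

    Only the longest common suffix for each pair (i, j) is reported;
    overlapping occurrences are not filtered out at this stage.
    """
    m = build_m_table(text)
    n = len(text)
    groups = []
    for j in range(2, n + 1):
        for i in range(1, j):
            k = m[i][j]
            if k > 0:
                groups.append((i - k + 1, i, j - k + 1, j, text[i - k:i]))
    return groups


if __name__ == "__main__":
    m = build_m_table("gababagaba")
    print(m[3][5], m[4][7], m[4][6], m[4][10])
    for group in repeating_groups("gababagaba"):
        print(group)
```

Both loops visit every pair $i < j$ once, so the sketch uses $O(n^2)$ time and space for a string of length $n$.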
Because the two occurrences overlap, we shall not count $T(1, 5) = ababa$ and $T(3, 7) = ababa$ as a repeating group for Problem 2.

In general, if $M(i, j) = k$, then $T(i-k+1, i) = T(j-k+1, j)$, or equivalently, $T(i-k+1, i)$ and $T(j-k+1, j)$ form a repeating group. If $k > j - i$, this repeating group is overlapping, as shown in Fig. 1.

[Fig. 1 An Overlapping Repeating Group]

If $M(i, j) = k$ and $k > j - i$, we know that $T(i-k+1, i) = T(j-k+1, j)$, so $T(i-k+1, i)$ and $T(j-k+1, j)$ form an overlapping repeating group. But, as shown in Fig. 2, the shorter substrings $T(2i-j+1, i)$ and $T(i+1, j)$, each of length $j - i$, still form a non-overlapping repeating group.

[Fig. 2 Non-overlapping Repeating Group of an Overlapping Repeating Group]

In other words, we can use the following formula to find the overlapping repeating groups and modify them if possible:

$$M'(i, j) = \begin{cases} j - i & \text{if } M(i, j) > j - i \\ M(i, j) & \text{otherwise} \end{cases}$$

Formula 2

According to the above formula, we can transform Table 3-1 into Table 4-1.

        i:   1  2  3  4  5  6  7
             a  b  a  b  a  b  a
 j=1  a
 j=2  b      0
 j=3  a      1  0
 j=4  b      0  2  0
 j=5  a      1  0  2  0
 j=6  b      0  2  0  2  0
 j=7  a      1  0  3  0  2  0

Table 4-1 The M' table of T = abababa

5 Problem 2

We are now ready to solve Problem 2. We first illustrate the general approach of our algorithm. Given a string $T$, we first find the longest suffix $Y$ of $T$ which is equal to a substring $X$ of $T$ that does not overlap with $Y$, as shown in Fig. 3.

[Fig. 3 X of T which does not overlap with Y]

If there is no such non-overlapping repeating group, report failure; otherwise let $T' = T - X - Y$, as shown in Fig. 4, and start the above process again on $T'$.

[Fig. 4 Non-overlapping repeating group]

Let us give an example to illustrate our approach. Suppose we are given the following string:

 position:  1  2  3  4  5  6  7  8  9 10 11 12
 T       =  a  b  c  d  a  c  c  d  a  c  a  b

Using the algorithm given in the previous sections, we immediately find the longest suffix $ab$, namely $T(11, 12)$, which is equal to a substring of $T$ not overlapping with it, namely $T(1, 2)$. We record this first pair $T(1, 2) = T(11, 12) = ab$. We now mark $T(1, 2)$ and $T(11, 12)$ and consider the remaining unmarked string, with the marked substrings shown in brackets:

 position:  1  2  3  4  5  6  7  8  9 10 11 12
 T       = [a  b] c  d  a  c  c  d  a  c [a  b]

We find that $T(7, 10) = T(3, 6) = cdac$. Thus we conclude that $T$ is a complete repeating group string.

Previously, we showed that we can use dynamic programming to create an M table. We also showed that we can use Formula 2 to modify the table. Suppose that we have the following string:

 T = a b a b c a b c a b

Then the M table looks like the following:

        i:   1  2  3  4  5  6  7  8  9 10
             a  b  a  b  c  a  b  c  a  b
 j= 1  a
 j= 2  b     0
 j= 3  a     1  0
 j= 4  b     0  2  0
 j= 5  c     0  0  0  0
 j= 6  a     1  0  1  0  0
 j= 7  b     0  2  0  2  0  0
 j= 8  c     0  0  0  0  3  0  0
 j= 9  a     1  0  1  0  0  4  0  0
 j=10  b     0  2  0  2  0  0  5  0  0

Table 5-1 The M table of T = ababcabcab

We use Formula 2 to modify the above table as follows:

        i:   1  2  3  4  5  6  7  8  9 10
             a  b  a  b  c  a  b  c  a  b
 j= 1  a
 j= 2  b     0
 j= 3  a     1  0
 j= 4  b     0  2  0
 j= 5  c     0  0  0  0
 j= 6  a     1  0  1  0  0
 j= 7  b     0  2  0  2  0  0
 j= 8  c     0  0  0  0  3  0  0
 j= 9  a     1  0  1  0  0  3  0  0
 j=10  b     0  2  0  2  0  0  3  0  0

Table 5-2 The M' table of T = ababcabcab

We now examine the last row of the above table and note that the largest element in the row is $M'(7, 10) = 3$. This means that we have found the longest suffix $T(8, 10)$ which is equal to another substring of $T$, namely $T(5, 7)$. This is the first non-overlapping repeating group found. We mark these two substrings, so that rows and columns 5 through 10 of the table are marked:

[Table 5-3 Marking the M' table of T = ababcabcab: the entries are those of Table 5-2, with rows and columns 5 through 10 marked]

For the remaining matrix, the last row is row 4. The largest element of this row is $M'(2, 4) = 2$. This means that we have found another non-overlapping repeating group, namely $T(3, 4)$ and $T(1, 2)$. After marking these two substrings as well, the matrix becomes empty. We report that the string, decomposed as $T(1, 2)\,T(3, 4)\,T(5, 7)\,T(8, 10) = ab\;ab\;cab\;cab$, is a complete repeating group string.

6 Conclusion

In this paper, we discussed two string processing problems. We showed that the dynamic programming technique can be used effectively to solve them. In the future we would like to explore the following problem. Note that the decomposition of a string into a complete repeating group string is not unique. It would be good if we could find a solution in which the number of maximal repeating groups is minimized. We suspect that this problem may be NP-complete. If it is not, we hope to find a polynomial algorithm to solve it.

References

[1] Smith, T. F. and Waterman, M. S. Identification of Common Molecular Subsequences. 1981, pp. 195-197.
[2] Webb, B. M., Liu, J. S. and Lawrence, C. E. BALSA: Bayesian algorithm for local sequence alignment. Nucleic Acids Research, Vol. 30, No. 5, pp. 1268-1277.
[3] Huang, X. On global sequence alignment. Bioinformatics, Vol. 10, No. 3, 1994, pp. 227-235.
[4] Carrillo, H. and Lipman, D. The Multiple Sequence Alignment Problem in Biology. SIAM Journal on Applied Mathematics, Vol. 48, No. 5, 1988, pp. 1073-1082.
[5] Chan, S. C., Wong, A. K. C. and Chiu, D. K. Y. A Survey of Multiple Sequence Comparison Methods. Bulletin of Mathematical Biology, Vol. 54, No. 4, 1992, pp. 563-598.
[6] Lipman, D. J., Altschul, S. F. and Kececioglu, J. D. A Tool for Multiple Sequence Alignment. Proc. Natl. Acad. Sci., Vol. 86, 1989, pp. 4412-4415.
[7] Aho, A. V. and Corasick, M. J. Efficient String Matching: An Aid to Bibliographic Search. Communications of the ACM, Vol. 18, 1975, pp. 333-340.
[8] Boyer, R. S. and Moore, J. S. A Fast String Searching Algorithm. Communications of the ACM, Vol. 20, 1977, pp. 762-772.
[9] Fischer, M. J. and Paterson, M. S. String-Matching and Other Products. SIAM-AMS Proceedings, Vol. 7, 1974, pp. 113-125.
[10] Knuth, D. E., Morris, J. H. and Pratt, V. R. Fast Pattern Matching in Strings. SIAM Journal on Computing, Vol. 6, 1977, pp. 323-350.
[11] Landau, G. M. and Vishkin, U. Efficient String Matching with k Mismatches. Theoretical Computer Science, Vol. 43, 1986, pp. 239-249.
[12] Galil, Z. and Giancarlo, R. Improved String Matching with k Mismatches. SIGACT News, Vol. 17, No. 4, 1986, pp. 52-54.
[13] Ukkonen, E. Algorithms for Approximate String Matching. Information and Control, Vol. 64, 1985, pp. 100-118.
[14] Gusfield, D. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[15] Benson, G. Tandem Repeats Finder: a Program to Analyze DNA Sequences. Oxford
[16] Buchner, M. and Janjarasjitt, S. Detection and Visualization of Tandem Repeats in DNA Sequences. IEEE Transactions on Signal Processing, Vol. 51, 2003, pp. 2280-2287.
[17] Wexler, Y., Yakhini, Z., Kashi, Y. and Geiger, D. Finding Approximate Tandem Repeats in Genomic Sequences. Proceedings of the Eighth Annual International Conference on Computational Molecular Biology, 2004, pp. 223-232.
[18] Stoye, J. and Gusfield, D. Simple and flexible detection of contiguous repeats using a suffix tree. Theoretical Computer Science, 2002.
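As a supplement to Sections 3 and 5, the overlap correction of Formula 2 and the marking procedure for Problem 2 can be sketched in Python as follows. This is one possible reading of the procedure, not a definitive implementation: the function names are hypothetical, ties between equally long candidates are broken by the smallest $i$, and a candidate is shortened so that both occurrences stay inside unmarked positions. These last two points are assumptions, since the text does not specify them. The sketch also does not verify that each reported group is maximal.

```python
from typing import Optional


def build_m_table(text: str) -> list[list[int]]:
    """Formula 1: m[i][j] = M(i, j), 1-based, with a zero border."""
    n = len(text)
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for j in range(2, n + 1):
        for i in range(1, j):
            if text[i - 1] == text[j - 1]:
                m[i][j] = m[i - 1][j - 1] + 1
    return m


def remove_overlaps(m: list[list[int]]) -> list[list[int]]:
    """Formula 2: M'(i, j) = j - i if M(i, j) > j - i, else M(i, j)."""
    n = len(m) - 1
    return [
        [min(m[i][j], j - i) if j > i else 0 for j in range(n + 1)]
        for i in range(n + 1)
    ]


def free_run(marked: list[bool], p: int) -> int:
    """Number of consecutive unmarked positions ending at position p."""
    length = 0
    while p >= 1 and not marked[p]:
        length += 1
        p -= 1
    return length


def decompose(text: str) -> Optional[list[tuple[tuple[int, int], tuple[int, int]]]]:
    """Greedy marking procedure; returns pairs ((x1, x2), (y1, y2)) or None."""
    n = len(text)
    mp = remove_overlaps(build_m_table(text))
    marked = [False] * (n + 1)          # index 0 is unused
    pairs = []
    remaining = n
    while remaining > 0:
        # j is the last unmarked position: the end of the current suffix Y.
        j = max(p for p in range(1, n + 1) if not marked[p])
        best_k, best_i = 0, 0
        for i in range(1, j):
            if marked[i]:
                continue
            # Assumption: shorten the match so X and Y avoid marked positions.
            k = min(mp[i][j], free_run(marked, i), free_run(marked, j))
            if k > best_k:
                best_k, best_i = k, i
        if best_k == 0:
            return None                 # no non-overlapping partner: failure
        for p in range(best_i - best_k + 1, best_i + 1):
            marked[p] = True            # mark X
        for p in range(j - best_k + 1, j + 1):
            marked[p] = True            # mark Y
        pairs.append(((best_i - best_k + 1, best_i), (j - best_k + 1, j)))
        remaining -= 2 * best_k
    return pairs


if __name__ == "__main__":
    print(decompose("abcdaccdacab"))
    print(decompose("ababcabcab"))
    print(decompose("abc"))
```

On the two example strings of Section 5, the sketch marks the same pairs as the text: $T(1,2)$ with $T(11,12)$ and then $T(3,6)$ with $T(7,10)$ for abcdaccdacab, and $T(5,7)$ with $T(8,10)$ and then $T(1,2)$ with $T(3,4)$ for ababcabcab.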