On the k-closest Substring and k-consensus Pattern Problems


Yishan Jiao(1), Jingyi Xu(1), and Ming Li(2)

(1) Bioinformatics Lab, Institute of Computing Technology, Chinese Academy of Sciences, 6#, South Road, Kexueyuan, Zhongguancun, Beijing, P.R. China, {jys,xjy}@ict.ac.cn, WWW home page:
(2) University of Waterloo, mli@uwaterloo.ca

Abstract. Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, we study the following two problems.

k-closest Substring problem: find k center strings c_1, c_2, ..., c_k of length L minimizing d such that for each s_j ∈ S, there is a length-L substring t_j (closest substring) of s_j with min_{1≤i≤k} d(c_i, t_j) ≤ d. We give a PTAS for this problem, for k = O(1).

k-consensus Pattern problem: find k median strings c_1, c_2, ..., c_k of length L and a substring t_j (consensus pattern) of length L from each s_j minimizing the total cost w = Σ_{j=1}^{n} min_{1≤i≤k} d(c_i, t_j). We give a PTAS for this problem, for k = O(1).

Our results improve recent results of [10] and [16], both of which depended on the random linear transformation technique in [16]. For the general k case, we give an alternative and direct proof of the NP-hardness of (2-ε)-approximation of the Hamming radius k-clustering problem, a special case of the k-closest Substring problem restricted to L = m.

Keywords: k-center problems, closest string and substrings, consensus pattern, polynomial time approximation scheme.

1 Introduction

While the original departure point of this study was separating repeats in our DNA sequence assembly project, we quickly realized that the problems we abstracted relate to problems widely studied in different areas, from geometric clustering [4], [8], [3], [16] to DNA multiple motif finding [12], [7], [10]. In sequence assembly, the greatest challenge is to deal with repeats when the shortest common superstring GREEDY algorithm [1] is applied.
Given a collection of approximate repeats, if it is possible to separate them into their original groups, the quality of the sequence assembly algorithm will be improved, at least for classes of repeats that are sufficiently different.

Many classic computational problems, such as clustering and common string, find applications in a great variety of contexts related to molecular biology: finding conserved regions in unaligned sequences, genetic drug target identification, and classifying protein sequences. See [11], [12], [7] for comprehensive overviews of such applications.

Throughout the article, we use a fixed finite alphabet Σ. Let s and t be finite strings over Σ. Let d(s, t) denote the Hamming distance between s and t, that is, the number of positions where s and t differ. |s| is the length of s, and s[i] is the i-th character of s; thus s = s[1]s[2]...s[|s|]. In this article, we consider the following two problems:

k-closest Substring problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find k center strings c_1, c_2, ..., c_k of length L minimizing d such that for every string s_j ∈ S, there is a length-L substring t_j (closest substring) of s_j with min_{1≤i≤k} d(c_i, t_j) ≤ d. We call the solution ({c_1, c_2, ..., c_k}, d) a k-clustering of S, and call the number d the maximum cluster radius of the k-clustering.

k-consensus Pattern problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find k median strings c_1, c_2, ..., c_k of length L and a substring t_j (consensus pattern) of length L from each s_j minimizing the total cost w = Σ_{1≤j≤n} min_{1≤i≤k} d(c_i, t_j).

Some special-case versions of the above two problems have been studied by many authors, including [6], [7], [10], [11], [12], [14], [13], [16]. Most related work focused on either the k = 1 case or the L = m case. However, since protein biological functions are generally more related to local regions of a protein sequence than to the whole sequence, both the k-closest Substring problem and the k-consensus Pattern problem may find more applications in computational biology than their special-case versions.
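For concreteness, both objectives can be evaluated by brute force once a candidate set of centers is fixed. The following minimal Python sketch (ours, not part of the paper; the function names are our own) makes the two cost functions explicit:

```python
def hamming(s: str, t: str) -> int:
    """d(s, t): number of positions where s and t differ (|s| == |t|)."""
    return sum(a != b for a, b in zip(s, t))

def substrings(s: str, L: int):
    """All length-L substrings of s."""
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def max_cluster_radius(S, centers, L):
    """d in the k-closest Substring objective: each s_j contributes the
    distance from its best length-L substring to its nearest center;
    we take the maximum over the input strings."""
    return max(min(hamming(c, t) for c in centers for t in substrings(s, L))
               for s in S)

def total_cost(S, medians, L):
    """w in the k-consensus Pattern objective: the same per-string
    minimum, summed instead of maximized."""
    return sum(min(hamming(c, t) for c in medians for t in substrings(s, L))
               for s in S)
```

Both problems then ask for the centers (medians) minimizing these quantities, which is what makes them hard; the evaluation itself is polynomial.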
In this article, we extend the random sampling strategy in [14] and [13] to give a deterministic PTAS for the O(1)-Closest Substring problem. We also give a deterministic PTAS for the O(1)-Consensus Pattern problem. Using a novel construction, we give a direct and neater proof of the NP-hardness of (2-ε)-approximation of the Hamming radius k-clustering problem than the one in [7], which relied on embedding an improved construction from [3] into the Hamming metric.

2 Related Work

2.1 The Closest Substring Problem

The following problem was studied in [11], [6], [12], [13]:

Closest String: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, find a center string s of length m minimizing d such that for every string s_i ∈ S, d(s, s_i) ≤ d.

Two groups of authors, [11] and [6], independently studied the Closest String problem and gave ratio-4/3 approximation algorithms. Finally, Li, Ma and Wang [12] gave a PTAS for the Closest String problem.

Moreover, the authors of [14] and [13] generalized the above result and gave a PTAS for the following Closest Substring problem:

Closest Substring: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find a center string s of length L minimizing d such that for each s_i ∈ S there is a length-L substring t_i (closest substring) of s_i with d(s, t_i) ≤ d.

The k-closest Substring problem degenerates into the Closest Substring problem when k = 1.

2.2 The Hamming Radius k-clustering Problem

The following problem was studied by L. Gasieniec et al. [7], [10]:

Hamming radius k-clustering problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, find k center strings c_1, c_2, ..., c_k of length m minimizing d such that for every string s_j ∈ S, min_{1≤i≤k} d(c_i, s_j) ≤ d.

The Hamming radius k-clustering problem is abbreviated to HRC. The k-closest Substring problem degenerates into the Hamming radius k-clustering problem when L = m.

The authors of [7] gave a PTAS for the k = O(1) case when the maximum cluster radius d is small (d = O(log(n + m))). Recently, J. Jansson [10] proposed a randomized PTAS for the k = O(1) case. According to Jansson [10]: "We combine the randomized PTAS of Ostrovsky and Rabani [16] for the Hamming p-median clustering problem with the PTAS of Li, Ma, and Wang [13] for HRC restricted to p = 1 to obtain a randomized PTAS for HRC restricted to p = O(1) that has a high success probability." In addition, Jansson wondered whether a deterministic PTAS can be constructed for the Hamming radius O(1)-clustering problem, and whether a PTAS can be constructed for the O(1)-Closest Substring problem. It seems that the random linear transformation technique in [16] cannot be easily adapted to solve the O(1)-Closest Substring problem.
Claim 5 in [16], which plays a key role there, requires that different solutions of the O(1)-Closest Substring problem have the same n closest substrings (those which are nearest, among all length-L substrings of each string, to the center string to which that string is assigned). Clearly, for the O(1)-Closest Substring problem such a requirement is not guaranteed. Moreover, Lemma 1 in [16] is not strong enough to handle all possible cases. Therefore, since the requirement mentioned above is not guaranteed, some possible bad cases cannot be excluded by using the triangle inequality as in [16]. In contrast to their method, we adopt the random sampling strategy of [14] and [13]. The key idea is to design a quasi-distance measure h such that, for any cluster center c in the optimal solution and any substring t of some string in S, h approximates the Hamming distance between t and c very well. Using this measure h, we give a PTAS for the O(1)-Closest Substring problem. The random sampling strategy has found many successful applications in various contexts, such as nearest neighbor search [9] and finding local similarities between DNA sequences [2].

As for the general k case, L. Gasieniec et al. [7] showed that it is impossible to approximate HRC within any constant factor less than 2 unless P = NP. They improved a construction in [3] for the planar counterpart of HRC under the L1 metric and then embedded the construction into the Hamming metric to prove the inapproximability result for HRC.

The geometric counterpart of HRC has a relatively longer research history than HRC itself. In the geometric counterpart of HRC, the set of strings is replaced with a set of points in m-dimensional space; we call this problem the Geometric k-center problem. When m = 1, the problem is trivial. For the m ≥ 2 case, it was shown to be NP-complete [4]. In 1988, the (2-ε)-inapproximability result for the planar version of the Geometric k-center problem under the L1 metric was shown [3]. Since a ratio-2 approximation algorithm in [8] assumed nothing beyond the triangle inequality, these bounds are tight. [3] adopted a key idea from [5]: embedding an instance of the vertex cover problem for planar graphs of degree at most 3 in the plane so that each edge e becomes a path p_e with some odd number of edges, at least 3. The midpoints of these edges then form an instance of the planar k-center problem. We adopt a similar idea to the one in [5] and give a novel construction purely in the Hamming metric, which we use to prove the inapproximability result for the Hamming radius k-clustering problem.

2.3 The Hamming p-median Clustering Problem

The following problem was studied by Ostrovsky and Rabani [16]:

Hamming p-median clustering problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, find p median strings c_1, c_2, ..., c_p of length m minimizing the total cost w = Σ_{j=1}^{n} min_{1≤i≤p} d(c_i, s_j).

The k-consensus Pattern problem degenerates into the Hamming p-median clustering problem when L = m. Ostrovsky and Rabani [16] gave a randomized PTAS for the problem.
2.4 The Consensus Pattern Problem

The following problem was studied in [12]:

Consensus Pattern: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find a median string c of length L and a substring t_j (consensus pattern) of length L from each s_j minimizing the total cost w = Σ_{1≤j≤n} d(c, t_j).

The k-consensus Pattern problem degenerates into the Consensus Pattern problem when k = 1. [12] gave a PTAS for the Consensus Pattern problem. We extend it to give a PTAS for the O(1)-Consensus Pattern problem.
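A fact used repeatedly by the algorithms in this line of work is that, for a fixed set of equal-length strings, the column-wise majority string minimizes the summed Hamming distance. This suggests a simple exact (exponential-time) solver for the k = 1 Consensus Pattern on tiny inputs. The sketch below is ours and is only meant to make the objective concrete:

```python
from collections import Counter
from itertools import product

def substrings(s, L):
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def majority_string(strings):
    """Column-wise majority: minimizes the total Hamming distance
    to a fixed set of equal-length strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def consensus_pattern_bruteforce(S, L):
    """Exact Consensus Pattern (k = 1): try every combination of one
    length-L substring per input string; exponential in |S|."""
    best = None
    for choice in product(*(substrings(s, L) for s in S)):
        c = majority_string(choice)  # best median for this choice
        w = sum(sum(a != b for a, b in zip(c, t)) for t in choice)
        if best is None or w < best[1]:
            best = (c, w)
    return best
```

The hardness of the problem lies precisely in the exponential number of substring choices; the PTAS of [12] avoids this by sampling a constant number of substrings.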

3 NP-Hardness of Approximating Hamming Radius k-clustering

First let us outline the underlying ideas of our proof of Theorem 1. Given any instance G of the vertex cover problem, with m edges, we construct an instance S of the Hamming radius k-clustering problem which has a k-clustering with the maximum cluster radius not exceeding 2 if and only if G has a vertex cover with k - m vertices. Such a construction is the key to our proof. By our construction, any two strings in S are at distance either at least 8 or at most 4 from each other. Thus finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution. So if there were a polynomial-time algorithm for the Hamming radius k-clustering problem within an approximation factor less than 2, we could use it to compute the exact vertex cover number of any instance G, and a contradiction could be deduced.

Theorem 1. If k is not fixed, the Hamming radius k-clustering problem cannot be approximated within any constant factor less than two unless P = NP.

Proof. We reduce the vertex cover problem to the problem of approximating the Hamming radius k-clustering problem within any constant factor less than two. Let an instance G = (V, E) of the vertex cover problem be given, with V containing n vertices v_1, v_2, ..., v_n and E containing m edges e_1, e_2, ..., e_m. The reduction constructs an instance S of the Hamming radius k-clustering problem such that k - m vertices in V can cover E if and only if there is a k-clustering of S with the maximum cluster radius not exceeding 2.

For each e_i = v_{i1} v_{i2} ∈ E, we denote l(i) = min(i1, i2) and r(i) = max(i1, i2). We introduce a set V' = {u_ij | 1 ≤ i ≤ m, 1 ≤ j ≤ 5} of 5m new vertices. We encode each v_i ∈ V by a length-6n binary string s_i of the form 0^{6(i-1)+1} 1^5 0^{6(n-i)}, and encode each u_ij ∈ V' by a length-6n binary string t_ij of the form 0^{6(l(i)-1)+j} 1^{6-j} 0^{6(r(i)-l(i)-1)} 1^{j+1} 0^{6(n-r(i))+5-j}. See Fig. 1 for an illustration of this encoding schema.
So we have an instance S of the Hamming radius k-clustering problem with S = ∪_{i=1}^{m} {t_i1, t_i3, t_i5}. We denote c(v_i) = s_i for each v_i ∈ V and c(u_ij) = t_ij for each u_ij ∈ V'. We denote V_S = ∪_{i=1}^{m} {u_i1, u_i3, u_i5}. We define a graph G' = (V ∪ V', E') with E' = ∪_{i=1}^{m} {v_{l(i)}u_i1, u_i1u_i2, u_i2u_i3, u_i3u_i4, u_i4u_i5, u_i5v_{r(i)}}.

Lemma 1. Given any two adjacent vertices x, y in G', we have d(c(x), c(y)) = 2.

Proof. It is easy to check that there are only three possible cases and that in each case the conclusion holds.

Case 1. There is some 1 ≤ i ≤ m such that x = v_{l(i)} and y = u_i1.

[Fig. 1. Illustration of the encoding schema: (a) s_i (1 ≤ i ≤ n); (b) t_ij (1 ≤ i ≤ m, 1 ≤ j ≤ 5).]

Case 2. There is some 1 ≤ i ≤ m such that x = u_i5 and y = v_{r(i)}.

Case 3. There are some 1 ≤ i ≤ m and 1 ≤ j ≤ 4 such that x = u_ij and y = u_{i,j+1}.

Lemma 2. Given any two different vertices x, y in V_S such that some z ∈ V ∪ V' is adjacent to both x and y in G', we have d(c(x), c(y)) = 4.

Proof. It is easy to check that there are only five possible cases and that in each case the conclusion holds.

Case 1. There are some 1 ≤ i, j ≤ m with l(i) = l(j) such that x = u_i1 and y = u_j1.

Case 2. There are some 1 ≤ i, j ≤ m with l(i) = r(j) such that x = u_i1 and y = u_j5.

Case 3. There are some 1 ≤ i, j ≤ m with r(i) = l(j) such that x = u_i5 and y = u_j1.

Case 4. There are some 1 ≤ i, j ≤ m with r(i) = r(j) such that x = u_i5 and y = u_j5.

Case 5. There are some 1 ≤ i ≤ m and j ∈ {1, 3} such that x = u_ij and y = u_{i,j+2}.

Lemma 3. Given any two vertices x, y in V_S, if no vertex in V ∪ V' is adjacent to both x and y in G', we have d(c(x), c(y)) ≥ 8.

Proof. Let x = u_ij, y = u_{i'j'}, with 1 ≤ i, i' ≤ m and j, j' ∈ {1, 3, 5}. We consider three cases:

Case 1. i = i'. Clearly, in this case, we have d(c(x), c(y)) = 8.

Case 2. l(i), r(i), l(i') and r(i') are all different from each other. Clearly, in this case, we have d(c(x), c(y)) = 16.

Case 3. Otherwise, exactly one of l(i) = l(i'), r(i) = r(i'), l(i) = r(i') and r(i) = l(i') holds. Without loss of generality, we assume that l(i) = l(i'). We denote a = l(i) = l(i'), b = r(i), c = r(i'), O = {6a-6 < l ≤ 6a | c(x)[l] = 1},

P = {6a-6 < l ≤ 6a | c(y)[l] = 1}, Q = {6b-6 < l ≤ 6b | c(x)[l] = 1} and R = {6c-6 < l ≤ 6c | c(y)[l] = 1}. Thus, we have the following inequality:

d(c(x), c(y)) ≥ |Q| + |R| + |O Δ P| = |Q| + |R| + ||Q| - |R|| = 2 max(|Q|, |R|) = 2 max(j, j') + 2.   (1)

Clearly, in this case, max(j, j') ≥ 3. Therefore, by Formula (1), we have d(c(x), c(y)) ≥ 8.

Lemma 4. Given k ≤ 2m, k - m vertices in V can cover E if and only if there is a k-clustering of S with the maximum cluster radius equal to 2.

Proof. If part: Consider the set T = {t_i3 | 1 ≤ i ≤ m} of strings. No cluster can contain two or more strings in T, since otherwise the radius of that cluster would be at least 4. Therefore, there must be exactly k - m clusters each of which contains no string in T. For each edge e_i ∈ E, either t_i1 or t_i5 must lie in one of those k - m clusters. On the other hand, for any one (say, the l-th, 1 ≤ l ≤ k - m) of those k - m clusters, there must be some index 1 ≤ j ≤ n such that for each t_i1 (t_i5) in the cluster, l(i) = j (r(i) = j). We denote f(l) = j. Clearly, the k - m vertices v_{f(1)}, v_{f(2)}, ..., v_{f(k-m)} can cover E.

Only if part: Since k ≤ 2m < 3m = |S|, and any two different strings s, t ∈ S satisfy d(s, t) ≥ 4, for any k-clustering of S the maximum cluster radius is at least 2. Assuming that k - m vertices v_{i1}, v_{i2}, ..., v_{i(k-m)} can cover E, we take the k - m strings s_{i1}, s_{i2}, ..., s_{i(k-m)} as the first k - m center strings. The last m center strings are chosen as follows: since for each edge e_i ∈ E either s_{l(i)} or s_{r(i)} is among the first k - m center strings, we take t_i4 as the (k - m + i)-th center string if s_{l(i)} is among the first k - m center strings, and t_i2 otherwise. Clearly, this gives a k-clustering of S with the maximum cluster radius equal to 2.

Now we can conclude our proof. First note that, by Lemmas 1, 2 and 3, when k ≤ 2m and there is a k-clustering of S with the maximum cluster radius equal to 2, finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution.
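The distance claims of Lemmas 1-3 can be checked mechanically. The following sketch (ours, not part of the paper) implements our reading of the encoding, s_i = 0^{6(i-1)+1} 1^5 0^{6(n-i)} and t_ij = 0^{6(l(i)-1)+j} 1^{6-j} 0^{6(r(i)-l(i)-1)} 1^{j+1} 0^{6(n-r(i))+5-j}, for a single edge e_1 = v_1 v_3 of a 3-vertex graph, and verifies the three lemmas on the resulting path:

```python
def s_code(i, n):
    """Codeword for vertex v_i: 0^{6(i-1)+1} 1^5 0^{6(n-i)}, length 6n."""
    return "0" * (6 * (i - 1) + 1) + "1" * 5 + "0" * (6 * (n - i))

def t_code(i, j, l, r, n):
    """Codeword for path vertex u_{ij} of edge e_i = v_l v_r (l < r)."""
    return ("0" * (6 * (l - 1) + j) + "1" * (6 - j)
            + "0" * (6 * (r - l - 1)) + "1" * (j + 1)
            + "0" * (6 * (n - r) + 5 - j))

def d(x, y):
    """Hamming distance between equal-length binary strings."""
    return sum(a != b for a, b in zip(x, y))

# One edge e_1 = v_1 v_3 in a 3-vertex graph: l(1) = 1, r(1) = 3.
n, l, r = 3, 1, 3
path = [s_code(l, n)] + [t_code(1, j, l, r, n) for j in range(1, 6)] + [s_code(r, n)]
assert all(len(x) == 6 * n for x in path)
# Lemma 1: consecutive codewords on the path v_l, u_11, ..., u_15, v_r
# are at Hamming distance exactly 2.
assert all(d(path[a], path[a + 1]) == 2 for a in range(len(path) - 1))
# Lemma 2 (Case 5): u_11 and u_13 share the neighbor u_12; distance 4.
assert d(t_code(1, 1, l, r, n), t_code(1, 3, l, r, n)) == 4
# Lemma 3 (Case 1): u_11 and u_15 have no common neighbor; distance 8.
assert d(t_code(1, 1, l, r, n), t_code(1, 5, l, r, n)) == 8
```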
Clearly, there is a critical 1 ≤ k_c ≤ 2m such that there is a k_c-clustering of S with the maximum cluster radius equal to 2, while for any (k_c - 1)-clustering of S the maximum cluster radius is at least 4. If there were a polynomial-time algorithm for the Hamming radius k-clustering problem within an approximation factor less than 2, we could apply it to the instance S for each k in decreasing order, from 2m down to 1. Then, once we find a k such that the maximum cluster radius of the clustering obtained by the algorithm is at least 4, we are sure that k_c = k + 1. Indeed, when k ≥ k_c, the

algorithm can of course find a k-clustering of S with the maximum cluster radius equal to 2 (its approximation factor being less than 2), and when k = k_c - 1, for any k-clustering of S the maximum cluster radius is at least 4. Therefore, by Lemma 4, the vertex cover number of the graph G is k_c - m. That is, we could compute the exact vertex cover number of any instance G of the vertex cover problem in polynomial time. However, the vertex cover problem is NP-hard, a contradiction. This completes our reduction.

4 A Deterministic PTAS for the O(1)-Closest Substring Problem

In this section, we study the k-closest Substring problem for k = O(1). Both the algorithm and the proof are based on the k = 2 case, which we call the 2-Closest Substring problem. We make use of some results for the Closest String problem and extend a random sampling strategy from [14] and [13] to give a deterministic PTAS for the 2-Closest Substring problem. Finally, we give an informal statement explaining how and why it can be easily extended to the general k = O(1) case. Since the Hamming radius O(1)-clustering problem is just a special case of the O(1)-Closest Substring problem, the same algorithm also gives a deterministic PTAS for the Hamming radius O(1)-clustering problem, improving on the randomized PTAS of [10].

4.1 Some Definitions

Let s, t be strings of length m. A multiset P = {j_1, j_2, ..., j_k} such that 1 ≤ j_1 ≤ j_2 ≤ ... ≤ j_k ≤ m is called a position set. By s|_P we denote the string s[j_1]s[j_2]...s[j_k]. We also write d_P(s, t) to mean d(s|_P, t|_P).

Let c, o be strings of length L, Q ⊆ {1, 2, ..., L}, P = {1, 2, ..., L} \ Q, and R ⊆ P. We denote

h(o, c, Q, R) = d_Q(o, c) + (|P| / |R|) d_R(o, c),
f(s, c, Q, R) = min over all length-L substrings t of s of h(t, c, Q, R),
g(s, c) = min over all length-L substrings t of s of d(t, c).

The function h is the quasi-distance measure we adopt. The functions f and g are two new distance measures we introduce; we refer to them as distance f and distance g from now on.
Let S = {s_1, s_2, ..., s_n} be an instance of the 2-Closest Substring problem, where each s_i is of length m (m ≥ L). Let (c_A, c_B, A, B) be a solution in which the set S has been partitioned into two clusters A and B, and c_A, c_B are the center strings of clusters A and B respectively. We denote by d(c_A, c_B, A, B) the minimal d satisfying: for all s ∈ A, g(s, c_A) ≤ d, and for all s ∈ B, g(s, c_B) ≤ d.

Let s_{i1}, s_{i2}, ..., s_{ir} be r strings (allowing repeats) in S. Let Q_{i1,i2,...,ir} be the set of positions where s_{i1}, s_{i2}, ..., s_{ir} agree, and P_{i1,i2,...,ir} = {1, 2, ..., m} \ Q_{i1,i2,...,ir}.
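In Python, the measures d_P, h, f and g can be sketched as follows (this is our own illustration; positions are 1-indexed as in the text, and |P| is passed explicitly since in the algorithm R is a random multiset drawn from P, so (|P|/|R|)·d_R is a scaled sample estimate of d_P):

```python
def hamming_on(positions, s, t):
    """d_P(s, t): Hamming distance restricted to a (multi)set of
    1-indexed positions."""
    return sum(s[p - 1] != t[p - 1] for p in positions)

def h(o, c, Q, R, P_size):
    """Quasi-distance: exact on Q, plus a scaled sample estimate on R
    of the distance on P = {1..L} \\ Q."""
    return hamming_on(Q, o, c) + (P_size / len(R)) * hamming_on(R, o, c)

def f(s, c, Q, R, P_size, L):
    """min over all length-L substrings t of s of h(t, c, Q, R)."""
    return min(h(s[i:i + L], c, Q, R, P_size)
               for i in range(len(s) - L + 1))

def g(s, c, L):
    """min over all length-L substrings t of s of d(t, c)."""
    return min(sum(a != b for a, b in zip(s[i:i + L], c))
               for i in range(len(s) - L + 1))
```

When R = P, h reduces exactly to the Hamming distance; the point of the analysis below is that a small random R already makes h a good estimate, with high probability.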

4.2 Several Useful Lemmas

Lemma 5. (Chernoff Bound) Let X_1, X_2, ..., X_n be n independent random 0-1 variables, where X_i takes 1 with probability p_i, 0 < p_i < 1. Let X = Σ_{i=1}^{n} X_i and µ = E[X]. Then for any 0 < ε ≤ 1,
(1) Pr(X > µ + εn) < exp(-(1/3) n ε^2),
(2) Pr(X < µ - εn) ≤ exp(-(1/2) n ε^2).

Let S = {s_1, s_2, ..., s_n} be an instance of the Closest String problem, where each s_i is of length m. Let s and d_opt be the center string and the radius in the optimal solution. Let ρ_0 = max_{1≤i,j≤n} d(s_i, s_j) / d_opt.

Lemma 6. For any constant r, 2 ≤ r < n, if ρ_0 > 1 + 1/(2r-1), then there are indices 1 ≤ i_1, i_2, ..., i_r ≤ n such that for any 1 ≤ l ≤ n,

d_{Q_{i1,i2,...,ir}}(s_{i1}, s_l) - d_{Q_{i1,i2,...,ir}}(s, s_l) ≤ (1/(2r-1)) d_opt.

Lemma 7. |P_{i1,i2,...,ir}| ≤ r d_opt.

Lemma 5 is Lemma 1.2 in [13]. Lemma 6 is Lemma 2.1 in [13]. Lemma 7 is Claim 2.4 in [13]. The following lemma plays a key role in the proof of our main result.

Lemma 8. For any constant r, 2 ≤ r < n, there are r strings s_{i1}, s_{i2}, ..., s_{ir} (allowing repeats) in S and a center string c' such that for any 1 ≤ l ≤ n, d(c', s_l) ≤ (1 + 1/(2r-1)) d_opt, where c'|_{Q_{i1,i2,...,ir}} = s_{i1}|_{Q_{i1,i2,...,ir}} and c'|_{P_{i1,i2,...,ir}} = s|_{P_{i1,i2,...,ir}}.

Proof. If ρ_0 ≤ 1 + 1/(2r-1), we can choose s_1, taken r times, as the r strings s_{i1}, s_{i2}, ..., s_{ir}. Otherwise, by Lemma 6, such r strings s_{i1}, s_{i2}, ..., s_{ir} do exist.

4.3 A PTAS for the 2-Closest Substring Problem

How to choose the n closest substrings and how to partition the n strings into two sets accordingly are the only two obstacles on the way to the solution. We use the random sampling strategy of [14] and [13] to handle both obstacles; this is a further application of that strategy. Now let us outline the underlying ideas. Let S = {s_1, s_2, ..., s_n} be an instance of the 2-Closest Substring problem, where each s_i is of length m.
Suppose that in the optimal solution, t_1, t_2, ..., t_n are the n closest substrings (those which are nearest, among all length-L substrings of each string, to the center string to which that string is assigned); based on the optimal partition, they form two instances of the Closest String problem. Thus, if we could obtain the same partition, together with the same choice of substrings, as in the optimal solution, we could solve the Closest String problem on those two instances separately. Unfortunately, we do not know the exact partition and choice in the optimal solution. However, by virtue of the quasi-distance measure h, which approximates the Hamming distance d very well, we can get a partition, together with a choice, that is not too bad compared with the optimal one. That is, even

if some string is partitioned into the wrong cluster, its distance g from the wrong center will not exceed (1 + ε) times its distance g from the right center in the optimal solution, with ε as small as we desire. The detailed algorithm (Algorithm 2-Closest Substring) is given in Fig. 2. We prove Theorem 2 in the rest of the section.

Algorithm 2-Closest Substring
Input: n strings s_1, s_2, ..., s_n ∈ Σ^m, integer L.
Output: two center strings c''_1, c''_2.

1. for each r length-L substrings t_{i1}, t_{i2}, ..., t_{ir} (allowing repeats, but if t_{ip} and t_{iq} are both chosen from the same s_i then t_{ip} = t_{iq}) of the n input strings do
   (1) Q_1 = {1 ≤ k ≤ L | t_{i1}[k] = t_{i2}[k] = ... = t_{ir}[k]}, P_1 = {1, 2, ..., L} \ Q_1.
   (2) Let R_1 be a multiset containing (4/ε^2) log(mn) uniformly random positions from P_1.
   (3) for each r length-L substrings t_{j1}, t_{j2}, ..., t_{jr} (allowing repeats, but if t_{jp} and t_{jq} are both chosen from the same s_j then t_{jp} = t_{jq}) of the n input strings do
       (a) Q_2 = {1 ≤ k ≤ L | t_{j1}[k] = t_{j2}[k] = ... = t_{jr}[k]}, P_2 = {1, 2, ..., L} \ Q_2.
       (b) Let R_2 be a multiset containing (4/ε^2) log(mn) uniformly random positions from P_2.
       (c) for each string y_1 of length |R_1| and each string y_2 of length |R_2| do
           (i) Let ĉ_1 be a string such that ĉ_1|_{R_1} = y_1 and ĉ_1|_{Q_1} = t_{i1}|_{Q_1}, and ĉ_2 a string such that ĉ_2|_{R_2} = y_2 and ĉ_2|_{Q_2} = t_{j1}|_{Q_2}.
           (ii) for l from 1 to n do
                Assign s_l to set C_1 if f(s_l, ĉ_1, Q_1, R_1) ≤ f(s_l, ĉ_2, Q_2, R_2), and to set C_2 otherwise. We denote ĉ(l) = ĉ_1, Q(l) = Q_1, R(l) = R_1 if s_l ∈ C_1, and ĉ(l) = ĉ_2, Q(l) = Q_2, R(l) = R_2 otherwise. Let t_l be a length-L substring (if several such substrings exist, choose one arbitrarily) of s_l such that h(t_l, ĉ(l), Q(l), R(l)) = f(s_l, ĉ(l), Q(l), R(l)).
           (iii) Using the method in [13], solve the optimization problem defined by Formula (9) approximately, obtaining a solution (x_1, x_2) within error max(ε|P_1|, ε|P_2|).
           (iv) Let c''_1 be the string such that c''_1|_{P_1} = x_1 and c''_1|_{Q_1} = t_{i1}|_{Q_1}, and c''_2 the string such that c''_2|_{P_2} = x_2 and c''_2|_{Q_2} = t_{j1}|_{Q_2}.
           (v) Let d = max_{1≤l≤n} min(g(s_l, c''_1), g(s_l, c''_2)).
2. Output the pair (c''_1, c''_2) with minimum d over all iterations of step 1(3)(c).

Fig. 2. The PTAS for the 2-Closest Substring problem

Theorem 2. Algorithm 2-Closest Substring is a PTAS for the 2-Closest Substring problem.

Proof. Let ε be any small positive number and r ≥ 2 any fixed integer. Let S = {s_1, s_2, ..., s_n} be an instance of the 2-Closest Substring problem,

where each s_i is of length m. Let T = {u | u is a length-L substring of some s ∈ S}. Suppose that in the optimal solution, t_1, t_2, ..., t_n are the n closest substrings, C_1, C_2 are the two clusters into which S is partitioned, and c_1, c_2 are the length-L center strings of the two clusters. We denote T_1 = {t_j | s_j ∈ C_1}, T_2 = {t_j | s_j ∈ C_2} and d_opt = max_{1≤j≤n} min(d(t_j, c_1), d(t_j, c_2)).

Both T_1 and T_2 form instances of the Closest String problem, so by trying all possibilities, we can assume that t_{i1}, t_{i2}, ..., t_{ir} and t_{j1}, t_{j2}, ..., t_{jr} are the r strings that satisfy Lemma 8 with respect to T_1 and T_2 respectively. Let Q_1 (Q_2) be the set of positions where t_{i1}, t_{i2}, ..., t_{ir} (t_{j1}, t_{j2}, ..., t_{jr}) agree, and P_1 = {1, 2, ..., L} \ Q_1 (P_2 = {1, 2, ..., L} \ Q_2). Let c'_1 be the string such that c'_1|_{P_1} = c_1|_{P_1} and c'_1|_{Q_1} = t_{i1}|_{Q_1}, and c'_2 the string such that c'_2|_{P_2} = c_2|_{P_2} and c'_2|_{Q_2} = t_{j1}|_{Q_2}. By Lemma 8, the solution (c'_1, c'_2, T_1, T_2) is a good approximation of the optimal solution (c_1, c_2, T_1, T_2), that is,

d(c'_1, c'_2, T_1, T_2) ≤ (1 + 1/(2r-1)) d_opt.   (2)

As for P_1 (P_2), where we know nothing about c'_1 (c'_2), we randomly pick (4/ε^2) log(mn) positions from P_1 (P_2). Suppose that the multiset of these random positions is R_1 (R_2). By trying all possibilities, we can assume that we get c'_1|_{Q_1 ∪ R_1} and c'_2|_{Q_2 ∪ R_2} at some point. We then place each s_j ∈ S into set C'_1 if f(s_j, c'_1, Q_1, R_1) ≤ f(s_j, c'_2, Q_2, R_2), and into set C'_2 otherwise. For each 1 ≤ j ≤ n, we denote c'(j) = c'_1, Q'(j) = Q_1, R'(j) = R_1, P'(j) = P_1 if s_j ∈ C'_1, and c'(j) = c'_2, Q'(j) = Q_2, R'(j) = R_2, P'(j) = P_2 otherwise. For each 1 ≤ j ≤ n, let t'_j be a length-L substring (if several such substrings exist, we choose one arbitrarily) of s_j such that h(t'_j, c'(j), Q'(j), R'(j)) = f(s_j, c'(j), Q'(j), R'(j)). We denote T'_1 = {t'_j | s_j ∈ C'_1} and T'_2 = {t'_j | s_j ∈ C'_2}. We can prove the following Lemma 9.

Lemma 9.
With high probability, for each u_j ∈ T,

|d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1)| ≤ ε|P_1| and |d(u_j, c'_2) - h(u_j, c'_2, Q_2, R_2)| ≤ ε|P_2|.   (3)

Proof. Let λ = |P_1| / |R_1|. It is easy to see that d_{R_1}(u_j, c'_1) is the sum of |R_1| independent random 0-1 variables, Σ_{i=1}^{|R_1|} X_i, where X_i = 1 indicates a mismatch between c'_1 and u_j at the i-th position in R_1. Let µ = E[d_{R_1}(u_j, c'_1)]. Obviously, µ = d_{P_1}(u_j, c'_1) / λ. Therefore, by Lemma 5 (2), we have the following inequality:

Pr(d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1) ≥ ε|P_1|)
= Pr(d_{R_1}(u_j, c'_1) ≤ (d(u_j, c'_1) - d_{Q_1}(u_j, c'_1))/λ - ε|R_1|)
= Pr(d_{R_1}(u_j, c'_1) ≤ d_{P_1}(u_j, c'_1)/λ - ε|R_1|)
= Pr(d_{R_1}(u_j, c'_1) ≤ µ - ε|R_1|)
≤ exp(-(1/2) ε^2 |R_1|) ≤ (mn)^{-2},   (4)

where the last inequality is due to the setting |R_1| = (4/ε^2) log(mn) in step 1(2) of the algorithm. Similarly, using Lemma 5 (1), we have

Pr(d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1) ≤ -ε|P_1|) ≤ (mn)^{-4/3}.   (5)

Combining Formula (4) with (5), we have that for any u_j ∈ T,

Pr(|d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1)| ≥ ε|P_1|) ≤ 2(mn)^{-4/3}.   (6)

Similarly, we have that for any u_j ∈ T,

Pr(|d(u_j, c'_2) - h(u_j, c'_2, Q_2, R_2)| ≥ ε|P_2|) ≤ 2(mn)^{-4/3}.   (7)

Summing up over all u_j ∈ T, we have that with probability at least 1 - 4(mn)^{-1/3}, Formula (3) holds.

Now we can conclude our proof. For each 1 ≤ j ≤ n, we denote c(j) = c'_1, Q(j) = Q_1, R(j) = R_1, P(j) = P_1 if t_j ∈ T_1, and c(j) = c'_2, Q(j) = Q_2, R(j) = R_2, P(j) = P_2 otherwise. Therefore, we have that Formula (8) holds with probability at least 1 - 4(mn)^{-1/3}:

d(c'(j), t'_j) ≤ h(t'_j, c'(j), Q'(j), R'(j)) + ε|P'(j)|
            ≤ h(t_j, c(j), Q(j), R(j)) + ε|P'(j)|
            ≤ d(c(j), t_j) + ε|P(j)| + ε|P'(j)|
            ≤ d(c'_1, c'_2, T_1, T_2) + ε|P(j)| + ε|P'(j)|
            ≤ (1 + 1/(2r-1)) d_opt + ε|P(j)| + ε|P'(j)|
            ≤ (1 + 1/(2r-1) + 2εr) d_opt,  for 1 ≤ j ≤ n,   (8)

where the first and third inequalities are by Formula (3), the second is due to the definition of the partition, the fourth is by the definition of d(c'_1, c'_2, T_1, T_2), the fifth is by Formula (2), and the last is due to Lemma 7.

By the definition of c'_1 and c'_2, the following optimization problem has a solution (x_1, x_2) = (c'_1|_{P_1}, c'_2|_{P_2}) such that d ≤ (1 + 1/(2r-1) + 2εr) d_opt:

min d;
d(x_1, t'_j|_{P_1}) ≤ d - d_{Q_1}(t_{i1}, t'_j), for all t'_j ∈ T'_1; |x_1| = |P_1|;   (9)
d(x_2, t'_j|_{P_2}) ≤ d - d_{Q_2}(t_{j1}, t'_j), for all t'_j ∈ T'_2; |x_2| = |P_2|.

We can solve this optimization problem within error εr d_opt by applying the method for the Closest String problem [13] to the sets T'_1 and T'_2 respectively. Let (x'_1, x'_2) be the solution so obtained. Then by Formula (9) we have

d(x'_1, t'_j|_{P_1}) ≤ (1 + 1/(2r-1) + 2εr) d_opt - d_{Q_1}(t_{i1}, t'_j) + ε|P_1|, for all t'_j ∈ T'_1;
d(x'_2, t'_j|_{P_2}) ≤ (1 + 1/(2r-1) + 2εr) d_opt - d_{Q_2}(t_{j1}, t'_j) + ε|P_2|, for all t'_j ∈ T'_2.   (10)

Let (c''_1, c''_2) be defined as in step 1(3)(c)(iv); then by Formula (10) we have

d(c''_1, t'_j) = d(x'_1, t'_j|_{P_1}) + d_{Q_1}(t_{i1}, t'_j) ≤ (1 + 1/(2r-1) + 2εr) d_opt + ε|P_1| ≤ (1 + 1/(2r-1) + 3εr) d_opt, for all t'_j ∈ T'_1;
d(c''_2, t'_j) = d(x'_2, t'_j|_{P_2}) + d_{Q_2}(t_{j1}, t'_j) ≤ (1 + 1/(2r-1) + 2εr) d_opt + ε|P_2| ≤ (1 + 1/(2r-1) + 3εr) d_opt, for all t'_j ∈ T'_2.   (11)

Thus we get a good approximate solution (c''_1, c''_2, T'_1, T'_2) with d(c''_1, c''_2, T'_1, T'_2) ≤ (1 + 1/(2r-1) + 3εr) d_opt. It is easy to see that the algorithm runs in polynomial time for any fixed positive r and ε. For any δ > 0, by properly setting r and ε such that 1/(2r-1) + 3εr ≤ δ, the algorithm outputs in polynomial time, with high probability, a solution (c''_1, c''_2) such that min(g(s_j, c''_1), g(s_j, c''_2)) ≤ (1 + δ) d_opt for each 1 ≤ j ≤ n. The algorithm can be derandomized by standard methods [15].

4.4 The General k = O(1) Case

Theorem 2 can be trivially extended to the k = O(1) case.

Theorem 3. There is a PTAS for the O(1)-Closest Substring problem.

We only give an informal explanation of how such an extension can be done. Unlike Jansson's randomized PTAS in [10] for the Hamming radius O(1)-clustering problem, which made use of the apex of a tournament to get a good assignment of strings to clusters, our extension is trivial. (Due to space limitations, the explanation is omitted and given in the supplementary material.)

5 The O(1)-Consensus Pattern Problem

It is relatively straightforward to extend the techniques in [12] for the Consensus Pattern problem to give a PTAS for the O(1)-Consensus Pattern problem. Since the algorithm for the general k = O(1) case is tedious, we only give an algorithm for the k = 2 case. The detailed algorithm (Algorithm 2-Consensus Pattern) is described in Fig. 3.

Theorem 4. Algorithm 2-Consensus Pattern is a PTAS for the 2-Consensus Pattern problem.

Again, since the proof is tedious and does not involve new ideas, we only give an informal explanation of how such an extension can be done. (Due to space limitations, the explanation is omitted and given in the supplementary material.)

Algorithm 2-Consensus Pattern
Input: n strings s_1, s_2, ..., s_n ∈ Σ^m, integer L.
Output: two median strings c'_1, c'_2.

1. for each r length-L substrings t_{i1}, t_{i2}, ..., t_{ir} (allowing repeats, but if t_{ip} and t_{iq} are both chosen from the same s_i then t_{ip} = t_{iq}) of the n input strings do
   (1) Let c_1 be the column-wise majority string of t_{i1}, t_{i2}, ..., t_{ir}.
   (2) for each r length-L substrings t_{j1}, t_{j2}, ..., t_{jr} (allowing repeats, but if t_{jp} and t_{jq} are both chosen from the same s_j then t_{jp} = t_{jq}) of the n input strings do
       (a) Let c_2 be the column-wise majority string of t_{j1}, t_{j2}, ..., t_{jr}.
       (b) for j from 1 to n do
           Assign s_j to set C_1 if g(s_j, c_1) ≤ g(s_j, c_2), and to set C_2 otherwise. We denote c(j) = c_1 if s_j ∈ C_1 and c(j) = c_2 otherwise. Let t_j be a length-L substring (if several such substrings exist, choose one arbitrarily) of s_j such that d(t_j, c(j)) = g(s_j, c(j)).
       (c) We denote T_1 = {t_j | s_j ∈ C_1} and T_2 = {t_j | s_j ∈ C_2}. Let c'_1 be the column-wise majority string of all strings in T_1, and c'_2 the column-wise majority string of all strings in T_2.
       (d) Let w = Σ_{1≤j≤n} min(g(s_j, c'_1), g(s_j, c'_2)).
2. Output the pair (c'_1, c'_2) with minimum w over all iterations of step 1(2).

Fig. 3. The PTAS for the 2-Consensus Pattern problem

We have presented a PTAS for the O(1)-Consensus Pattern problem. Since the min-sum Hamming median clustering problem (i.e., the Hamming p-median clustering problem) discussed in [16] is just a special case of the O(1)-Consensus Pattern problem, we also obtain a deterministic PTAS for that problem.
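A direct (exponential-time, but faithful in spirit) transcription of Algorithm 2-Consensus Pattern for tiny inputs might look as follows. This sketch is ours: for simplicity it enumerates r-tuples from the pooled substrings of all input strings without enforcing the "equal if chosen from the same string" restriction stated in the figure, which only shrinks the search space in the original.

```python
from collections import Counter
from itertools import product

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def substrings(s, L):
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def majority(strings):
    """Column-wise majority string of equal-length strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def g(s, c, L):
    """Distance g: best length-L substring of s against center c."""
    return min(hamming(t, c) for t in substrings(s, L))

def two_consensus_pattern(S, L, r=2):
    """Try all r-tuples as seeds for the two trial medians, partition the
    strings by distance g, then take column-wise majorities of the chosen
    consensus patterns; return (w, c'_1, c'_2) with minimum total cost."""
    pool = [t for s in S for t in substrings(s, L)]
    best = None
    for seed1 in product(pool, repeat=r):
        c1 = majority(seed1)
        for seed2 in product(pool, repeat=r):
            c2 = majority(seed2)
            T1, T2 = [], []
            for s in S:
                # assign s to the nearer trial median, keep its best substring
                t1 = min(substrings(s, L), key=lambda t: hamming(t, c1))
                t2 = min(substrings(s, L), key=lambda t: hamming(t, c2))
                if hamming(t1, c1) <= hamming(t2, c2):
                    T1.append(t1)
                else:
                    T2.append(t2)
            f1 = majority(T1) if T1 else c1
            f2 = majority(T2) if T2 else c2
            w = sum(min(g(s, f1, L), g(s, f2, L)) for s in S)
            if best is None or w < best[0]:
                best = (w, f1, f2)
    return best
```

On an instance whose strings split into two groups sharing the motifs "ab" and "cd", such as ["xaby", "abzz", "ccdd", "zcdz"] with L = 2, the sketch recovers those two medians at cost 0.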
Acknowledgements

We would like to thank Dongbo Bu, Hao Lin, Bin Ma, Jingfen Zhang, and Zefeng Zhang of the Sequence Assembly project at the Institute of Computing Technology, Chinese Academy of Sciences, for various discussions.

References

1. A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis: Linear approximation of shortest superstrings. Journal of the ACM 41(4) (1994)

2. J. Buhler: Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 17 (2001)
3. T. Feder and D. H. Greene: Optimal algorithms for approximate clustering. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (1988)
4. R. J. Fowler, M. S. Paterson, and S. L. Tanimoto: Optimal packing and covering in the plane are NP-complete. Information Processing Letters 12(3) (1981)
5. M. R. Garey and D. S. Johnson: The rectilinear Steiner tree problem is NP-complete. SIAM J. Appl. Math. 32 (1977)
6. L. Gasieniec, J. Jansson, and A. Lingas: Efficient approximation algorithms for the Hamming center problem. Proc. 10th ACM-SIAM Symp. on Discrete Algorithms (1999) S905–S906
7. L. Gasieniec, J. Jansson, and A. Lingas: Approximation algorithms for Hamming clustering problems. Proc. 11th Symp. on Combinatorial Pattern Matching (2000)
8. T. F. Gonzalez: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38 (1985)
9. P. Indyk and R. Motwani: Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 30th Annual ACM Symp. on Theory of Computing (1998)
10. J. Jansson: Consensus Algorithms for Trees and Strings. Doctoral dissertation (2003)
11. K. Lanctot, M. Li, B. Ma, L. Wang, and L. Zhang: Distinguishing string selection problems. Proc. 10th ACM-SIAM Symp. on Discrete Algorithms (1999)
12. M. Li, B. Ma, and L. Wang: Finding similar regions in many strings. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (1999)
13. M. Li, B. Ma, and L. Wang: On the closest string and substring problems. Journal of the ACM 49(2) (2002)
14. B. Ma: A polynomial time approximation scheme for the closest substring problem. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching (2000)
15. R. Motwani and P. Raghavan: Randomized Algorithms. Cambridge University Press (1995)
16. R. Ostrovsky and Y. Rabani: Polynomial-time approximation schemes for geometric min-sum median clustering. Journal of the ACM 49(2) (2002)


More information

Finding Large Induced Subgraphs

Finding Large Induced Subgraphs Finding Large Induced Subgraphs Illinois Institute of Technology www.math.iit.edu/ kaul kaul@math.iit.edu Joint work with S. Kapoor, M. Pelsmajer (IIT) Largest Induced Subgraph Given a graph G, find the

More information

THE METHOD OF CONDITIONAL PROBABILITIES: DERANDOMIZING THE PROBABILISTIC METHOD

THE METHOD OF CONDITIONAL PROBABILITIES: DERANDOMIZING THE PROBABILISTIC METHOD THE METHOD OF CONDITIONAL PROBABILITIES: DERANDOMIZING THE PROBABILISTIC METHOD JAMES ZHOU Abstract. We describe the probabilistic method as a nonconstructive way of proving the existence of combinatorial

More information

Differential approximation results for the Steiner tree problem

Differential approximation results for the Steiner tree problem Differential approximation results for the Steiner tree problem Marc Demange, Jérôme Monnot, Vangelis Paschos To cite this version: Marc Demange, Jérôme Monnot, Vangelis Paschos. Differential approximation

More information

The minimum G c cut problem

The minimum G c cut problem The minimum G c cut problem Abstract In this paper we define and study the G c -cut problem. Given a complete undirected graph G = (V ; E) with V = n, edge weighted by w(v i, v j ) 0 and an undirected

More information