On the k-closest Substring and k-consensus Pattern Problems


Yishan Jiao(1), Jingyi Xu(1), and Ming Li(2)

(1) Bioinformatics Lab, Institute of Computing Technology, Chinese Academy of Sciences, 6#, South Road, Kexueyuan, Zhongguancun, Beijing, P.R. China, {jys,xjy}@ict.ac.cn, WWW home page:
(2) University of Waterloo, mli@uwaterloo.ca

Abstract. Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, we study the following two problems.

k-closest Substring problem: find k center strings c_1, c_2, ..., c_k of length L minimizing d such that for each s_j ∈ S, there is a length-L substring t_j (closest substring) of s_j with min_{1≤i≤k} d(c_i, t_j) ≤ d. We give a PTAS for this problem, for k = O(1).

k-consensus Pattern problem: find k median strings c_1, c_2, ..., c_k of length L and a substring t_j (consensus pattern) of length L from each s_j minimizing the total cost w = Σ_{j=1}^{n} min_{1≤i≤k} d(c_i, t_j). We give a PTAS for this problem, for k = O(1).

Our results improve recent results of [10] and [16], both of which depended on the random linear transformation technique in [16]. For the general k case, we give an alternative and direct proof of the NP-hardness of (2-ε)-approximation of the Hamming radius k-clustering problem, a special case of the k-closest Substring problem restricted to L = m.

Keywords: k-center problems, closest string and substrings, consensus pattern, polynomial time approximation scheme.

1 Introduction

While the original departure point of this study was separating repeats in our DNA sequence assembly project, we quickly realized that the problems we abstracted relate to problems widely studied in different areas, from geometric clustering [4], [8], [3], [16] to DNA multiple motif finding [12], [7], [10]. In sequence assembly, the greatest challenge is to deal with repeats when the shortest common superstring GREEDY algorithm [1] is applied.
Given a collection of approximate repeats, if it is possible to separate them into their original groups, the quality of the sequence assembly algorithm will be improved, at least for classes of repeats that are sufficiently different.

Many classic computational problems, such as clustering and common string, find applications in a great variety of contexts related to molecular biology: finding conserved regions in unaligned sequences, genetic drug target identification, and classifying protein sequences. See [11], [12], [7] for comprehensive overviews of such applications.

Throughout the article, we use a fixed finite alphabet Σ. Let s and t be finite strings over Σ. Let d(s, t) denote the Hamming distance between s and t, that is, the number of positions where s and t differ. |s| is the length of s, and s[i] is the i-th character of s; thus s = s[1]s[2]...s[|s|]. In this article, we consider the following two problems:

k-closest Substring problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find k center strings c_1, c_2, ..., c_k of length L minimizing d such that for every string s_j ∈ S, there is a length-L substring t_j (closest substring) of s_j with min_{1≤i≤k} d(c_i, t_j) ≤ d. We call the solution ({c_1, c_2, ..., c_k}, d) a k-clustering of S, and call the number d the maximum cluster radius of the k-clustering.

k-consensus Pattern problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find k median strings c_1, c_2, ..., c_k of length L and a substring t_j (consensus pattern) of length L from each s_j minimizing the total cost w = Σ_{1≤j≤n} min_{1≤i≤k} d(c_i, t_j).

Some special-case versions of the above two problems have been studied by many authors, including [6], [7], [10], [11], [12], [14], [13], [16]. Most related work focused on either the k = 1 case or the L = m case. However, since protein biological functions are generally more related to local regions of a protein sequence than to the whole sequence, both the k-closest Substring problem and the k-consensus Pattern problem may find more applications in computational biology than their special-case versions.
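For concreteness, both objectives can be evaluated by brute force once a candidate set of centers is fixed. The following minimal Python sketch (ours, not part of the paper; the function names are our own) makes the two cost functions explicit:

```python
def hamming(s: str, t: str) -> int:
    """d(s, t): number of positions where s and t differ (|s| == |t|)."""
    return sum(a != b for a, b in zip(s, t))

def substrings(s: str, L: int):
    """All length-L substrings of s."""
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def max_cluster_radius(S, centers, L):
    """d in the k-closest Substring objective: each s_j contributes the
    distance from its best length-L substring to its nearest center;
    we take the maximum over the input strings."""
    return max(min(hamming(c, t) for c in centers for t in substrings(s, L))
               for s in S)

def total_cost(S, medians, L):
    """w in the k-consensus Pattern objective: the same per-string
    minimum, summed instead of maximized."""
    return sum(min(hamming(c, t) for c in medians for t in substrings(s, L))
               for s in S)
```

Both problems then ask for the centers (medians) minimizing these quantities, which is what makes them hard; the evaluation itself is polynomial.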
In this article, we extend the random sampling strategy in [14] and [13] to give a deterministic PTAS for the O(1)-Closest Substring problem. We also give a deterministic PTAS for the O(1)-Consensus Pattern problem. Using a novel construction, we give a direct and neater proof of the NP-hardness of (2-ε)-approximation of the Hamming radius k-clustering problem than the one in [7], which relied on embedding an improved construction from [3] into the Hamming metric.

2 Related Work

2.1 The Closest Substring Problem

The following problem was studied in [11], [6], [12], [13]:

Closest String: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, find a center string s of length m minimizing d such that for every string s_i ∈ S, d(s, s_i) ≤ d.

Two groups of authors, [11] and [6], independently studied the Closest String problem and gave ratio-4/3 approximation algorithms. Finally, Li, Ma and Wang [12] gave a PTAS for the Closest String problem.

Moreover, the authors of [14] and [13] generalized the above result and gave a PTAS for the following Closest Substring problem:

Closest Substring: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find a center string s of length L minimizing d such that for each s_i ∈ S there is a length-L substring t_i (closest substring) of s_i with d(s, t_i) ≤ d.

The k-closest Substring problem degenerates into the Closest Substring problem when k = 1.

2.2 The Hamming Radius k-clustering Problem

The following problem was studied by L. Gasieniec et al. [7], [10]:

Hamming radius k-clustering problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, find k center strings c_1, c_2, ..., c_k of length m minimizing d such that for every string s_j ∈ S, min_{1≤i≤k} d(c_i, s_j) ≤ d.

The Hamming radius k-clustering problem is abbreviated to HRC. The k-closest Substring problem degenerates into the Hamming radius k-clustering problem when L = m.

The authors of [7] gave a PTAS for the k = O(1) case when the maximum cluster radius d is small (d = O(log(n + m))). Recently, J. Jansson [10] proposed a randomized PTAS for the k = O(1) case. According to Jansson [10]: "We combine the randomized PTAS of Ostrovsky and Rabani [16] for the Hamming p-median clustering problem with the PTAS of Li, Ma, and Wang [13] for HRC restricted to p = 1 to obtain a randomized PTAS for HRC restricted to p = O(1) that has a high success probability." In addition, Jansson wondered whether a deterministic PTAS can be constructed for the Hamming radius O(1)-clustering problem, and whether a PTAS can be constructed for the O(1)-Closest Substring problem. It seems that the random linear transformation technique in [16] cannot be easily adapted to solve the O(1)-Closest Substring problem.
Claim 5 in [16], which plays a key role there, requires that different solutions of the O(1)-Closest Substring problem have the same n closest substrings (those which are nearest, among all length-L substrings of each string, to the center string to which that string is assigned). Clearly, for the O(1)-Closest Substring problem such a requirement is not guaranteed. Moreover, Lemma 1 in [16] is not strong enough to handle all possible cases. Therefore, since the requirement mentioned above is not guaranteed, some possible bad cases cannot be excluded by using the triangle inequality as in [16]. In contrast to their method, we adopt the random sampling strategy of [14] and [13]. The key idea is to design a quasi-distance measure h such that, for any cluster center c in the optimal solution and any substring t of some string in S, h approximates the Hamming distance between t and c very well. Using this measure h, we give a PTAS for the O(1)-Closest Substring problem. The random sampling strategy has found many successful applications in various contexts, such as nearest neighbor search [9] and finding local similarities between DNA sequences [2].

As for the general k case, L. Gasieniec et al. [7] showed that it is impossible to approximate HRC within any constant factor less than 2 unless P = NP. They improved a construction in [3] for the planar counterpart of HRC under the L1 metric and then embedded the construction into the Hamming metric to prove the inapproximability result for HRC.

The geometric counterpart of HRC has a relatively longer research history than HRC itself. In the geometric counterpart of HRC, the set of strings is replaced with a set of points in m-dimensional space; we call this problem the Geometric k-center problem. When m = 1, the problem is trivial. For the m ≥ 2 case, it was shown to be NP-complete [4]. In 1988, the (2-ε)-inapproximability result for the planar version of the Geometric k-center problem under the L1 metric was shown [3]. Since a ratio-2 approximation algorithm in [8] assumed nothing beyond the triangle inequality, these bounds are tight. [3] adopted a key idea from [5]: embedding an instance of the vertex cover problem for planar graphs of degree at most 3 in the plane so that each edge e becomes a path p_e with some odd number of edges, at least 3. The midpoints of these edges then form an instance of the planar k-center problem. We adopt a similar idea to the one in [5] and give a novel construction purely in the Hamming metric, which we use to prove the inapproximability result for the Hamming radius k-clustering problem.

2.3 The Hamming p-median Clustering Problem

The following problem was studied by Ostrovsky and Rabani [16]:

Hamming p-median clustering problem: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, find p median strings c_1, c_2, ..., c_p of length m minimizing the total cost w = Σ_{j=1}^{n} min_{1≤i≤p} d(c_i, s_j).

The k-consensus Pattern problem degenerates into the Hamming p-median clustering problem when L = m. Ostrovsky and Rabani [16] gave a randomized PTAS for the problem.
2.4 The Consensus Pattern Problem

The following problem was studied in [12]:

Consensus Pattern: Given a set S = {s_1, s_2, ..., s_n} of strings, each of length m, and an integer L, find a median string c of length L and a substring t_j (consensus pattern) of length L from each s_j minimizing the total cost w = Σ_{1≤j≤n} d(c, t_j).

The k-consensus Pattern problem degenerates into the Consensus Pattern problem when k = 1. [12] gave a PTAS for the Consensus Pattern problem. We extend it to give a PTAS for the O(1)-Consensus Pattern problem.
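A fact used repeatedly by the algorithms in this line of work is that, for a fixed set of equal-length strings, the column-wise majority string minimizes the summed Hamming distance. This suggests a simple exact (exponential-time) solver for the k = 1 Consensus Pattern on tiny inputs. The sketch below is ours and is only meant to make the objective concrete:

```python
from collections import Counter
from itertools import product

def substrings(s, L):
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def majority_string(strings):
    """Column-wise majority: minimizes the total Hamming distance
    to a fixed set of equal-length strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def consensus_pattern_bruteforce(S, L):
    """Exact Consensus Pattern (k = 1): try every combination of one
    length-L substring per input string; exponential in |S|."""
    best = None
    for choice in product(*(substrings(s, L) for s in S)):
        c = majority_string(choice)  # best median for this choice
        w = sum(sum(a != b for a, b in zip(c, t)) for t in choice)
        if best is None or w < best[1]:
            best = (c, w)
    return best
```

The hardness of the problem lies precisely in the exponential number of substring choices; the PTAS of [12] avoids this by sampling a constant number of substrings.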

3 NP-Hardness of Approximating Hamming Radius k-clustering

First let us outline the underlying ideas of our proof of Theorem 1. Given any instance G of the vertex cover problem, with m edges, we construct an instance S of the Hamming radius k-clustering problem which has a k-clustering with the maximum cluster radius not exceeding 2 if and only if G has a vertex cover with k - m vertices. Such a construction is the key to our proof. By our construction, any two strings in S are at distance either at least 8 or at most 4 from each other. Thus finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution. So if there were a polynomial-time algorithm for the Hamming radius k-clustering problem within an approximation factor less than 2, we could use it to compute the exact vertex cover number of any instance G, and a contradiction could be deduced.

Theorem 1. If k is not fixed, the Hamming radius k-clustering problem cannot be approximated within any constant factor less than two unless P = NP.

Proof. We reduce the vertex cover problem to the problem of approximating the Hamming radius k-clustering problem within any constant factor less than two. Let an instance G = (V, E) of the vertex cover problem be given, with V containing n vertices v_1, v_2, ..., v_n and E containing m edges e_1, e_2, ..., e_m. The reduction constructs an instance S of the Hamming radius k-clustering problem such that k - m vertices in V can cover E if and only if there is a k-clustering of S with the maximum cluster radius not exceeding 2.

For each e_i = v_{i1} v_{i2} ∈ E, we denote l(i) = min(i1, i2) and r(i) = max(i1, i2). We introduce a set V' = {u_ij | 1 ≤ i ≤ m, 1 ≤ j ≤ 5} of 5m new vertices. We encode each v_i ∈ V by a length-6n binary string s_i of the form 0^{6(i-1)+1} 1^5 0^{6(n-i)}, and encode each u_ij ∈ V' by a length-6n binary string t_ij of the form 0^{6(l(i)-1)+j} 1^{6-j} 0^{6(r(i)-l(i)-1)} 1^{j+1} 0^{6(n-r(i))+5-j}. See Fig. 1 for an illustration of this encoding schema.
So we have an instance S of the Hamming radius k-clustering problem with S = ∪_{i=1}^{m} {t_i1, t_i3, t_i5}. We denote c(v_i) = s_i for each v_i ∈ V and c(u_ij) = t_ij for each u_ij ∈ V'. We denote V_S = ∪_{i=1}^{m} {u_i1, u_i3, u_i5}. We define a graph G' = (V ∪ V', E') with E' = ∪_{i=1}^{m} {v_{l(i)}u_i1, u_i1u_i2, u_i2u_i3, u_i3u_i4, u_i4u_i5, u_i5v_{r(i)}}.

Lemma 1. Given any two adjacent vertices x, y in G', we have d(c(x), c(y)) = 2.

Proof. It is easy to check that there are only three possible cases and that in each case the conclusion holds.

Case 1. There is some 1 ≤ i ≤ m such that x = v_{l(i)} and y = u_i1.

[Fig. 1. Illustration of the encoding schema: (a) s_i (1 ≤ i ≤ n); (b) t_ij (1 ≤ i ≤ m, 1 ≤ j ≤ 5).]

Case 2. There is some 1 ≤ i ≤ m such that x = u_i5 and y = v_{r(i)}.

Case 3. There are some 1 ≤ i ≤ m and 1 ≤ j ≤ 4 such that x = u_ij and y = u_{i,j+1}.

Lemma 2. Given any two different vertices x, y in V_S such that some z ∈ V ∪ V' is adjacent to both x and y in G', we have d(c(x), c(y)) = 4.

Proof. It is easy to check that there are only five possible cases and that in each case the conclusion holds.

Case 1. There are some 1 ≤ i, j ≤ m with l(i) = l(j) such that x = u_i1 and y = u_j1.

Case 2. There are some 1 ≤ i, j ≤ m with l(i) = r(j) such that x = u_i1 and y = u_j5.

Case 3. There are some 1 ≤ i, j ≤ m with r(i) = l(j) such that x = u_i5 and y = u_j1.

Case 4. There are some 1 ≤ i, j ≤ m with r(i) = r(j) such that x = u_i5 and y = u_j5.

Case 5. There are some 1 ≤ i ≤ m and j ∈ {1, 3} such that x = u_ij and y = u_{i,j+2}.

Lemma 3. Given any two vertices x, y in V_S, if no vertex in V ∪ V' is adjacent to both x and y in G', we have d(c(x), c(y)) ≥ 8.

Proof. Let x = u_ij, y = u_{i'j'}, with 1 ≤ i, i' ≤ m and j, j' ∈ {1, 3, 5}. We consider three cases:

Case 1. i = i'. Clearly, in this case, we have d(c(x), c(y)) = 8.

Case 2. l(i), r(i), l(i') and r(i') are all different from each other. Clearly, in this case, we have d(c(x), c(y)) = 16.

Case 3. Otherwise, exactly one of l(i) = l(i'), r(i) = r(i'), l(i) = r(i') and r(i) = l(i') holds. Without loss of generality, we assume that l(i) = l(i'). We denote a = l(i) = l(i'), b = r(i), c = r(i'), O = {6a-6 < l ≤ 6a | c(x)[l] = 1},

P = {6a-6 < l ≤ 6a | c(y)[l] = 1}, Q = {6b-6 < l ≤ 6b | c(x)[l] = 1} and R = {6c-6 < l ≤ 6c | c(y)[l] = 1}. Thus, we have the following inequality:

d(c(x), c(y)) ≥ |Q| + |R| + |O Δ P| = |Q| + |R| + ||Q| - |R|| = 2 max(|Q|, |R|) = 2 max(j, j') + 2.   (1)

Clearly, in this case, max(j, j') ≥ 3. Therefore, by Formula (1), we have d(c(x), c(y)) ≥ 8.

Lemma 4. Given k ≤ 2m, k - m vertices in V can cover E if and only if there is a k-clustering of S with the maximum cluster radius equal to 2.

Proof. If part: Consider the set T = {t_i3 | 1 ≤ i ≤ m} of strings. No cluster can contain two or more strings in T, since otherwise the radius of that cluster would be at least 4. Therefore, there must be exactly k - m clusters each of which contains no string in T. For each edge e_i ∈ E, either t_i1 or t_i5 must lie in one of those k - m clusters. On the other hand, for any one (say, the l-th, 1 ≤ l ≤ k - m) of those k - m clusters, there must be some index 1 ≤ j ≤ n such that for each t_i1 (t_i5) in the cluster, l(i) = j (r(i) = j). We denote f(l) = j. Clearly, the k - m vertices v_{f(1)}, v_{f(2)}, ..., v_{f(k-m)} can cover E.

Only if part: Since k ≤ 2m < 3m = |S|, and any two different strings s, t ∈ S satisfy d(s, t) ≥ 4, for any k-clustering of S the maximum cluster radius is at least 2. Assuming that k - m vertices v_{i1}, v_{i2}, ..., v_{i(k-m)} can cover E, we take the k - m strings s_{i1}, s_{i2}, ..., s_{i(k-m)} as the first k - m center strings. The last m center strings are chosen as follows: since for each edge e_i ∈ E either s_{l(i)} or s_{r(i)} is among the first k - m center strings, we take t_i4 as the (k - m + i)-th center string if s_{l(i)} is among the first k - m center strings, and t_i2 otherwise. Clearly, this gives a k-clustering of S with the maximum cluster radius equal to 2.

Now we can conclude our proof. First note that, by Lemmas 1, 2 and 3, when k ≤ 2m and there is a k-clustering of S with the maximum cluster radius equal to 2, finding an approximate solution within an approximation factor less than 2 is no easier than finding an exact solution.
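The distance claims of Lemmas 1-3 can be checked mechanically. The following sketch (ours, not part of the paper) implements our reading of the encoding, s_i = 0^{6(i-1)+1} 1^5 0^{6(n-i)} and t_ij = 0^{6(l(i)-1)+j} 1^{6-j} 0^{6(r(i)-l(i)-1)} 1^{j+1} 0^{6(n-r(i))+5-j}, for a single edge e_1 = v_1 v_3 of a 3-vertex graph, and verifies the three lemmas on the resulting path:

```python
def s_code(i, n):
    """Codeword for vertex v_i: 0^{6(i-1)+1} 1^5 0^{6(n-i)}, length 6n."""
    return "0" * (6 * (i - 1) + 1) + "1" * 5 + "0" * (6 * (n - i))

def t_code(i, j, l, r, n):
    """Codeword for path vertex u_{ij} of edge e_i = v_l v_r (l < r)."""
    return ("0" * (6 * (l - 1) + j) + "1" * (6 - j)
            + "0" * (6 * (r - l - 1)) + "1" * (j + 1)
            + "0" * (6 * (n - r) + 5 - j))

def d(x, y):
    """Hamming distance between equal-length binary strings."""
    return sum(a != b for a, b in zip(x, y))

# One edge e_1 = v_1 v_3 in a 3-vertex graph: l(1) = 1, r(1) = 3.
n, l, r = 3, 1, 3
path = [s_code(l, n)] + [t_code(1, j, l, r, n) for j in range(1, 6)] + [s_code(r, n)]
assert all(len(x) == 6 * n for x in path)
# Lemma 1: consecutive codewords on the path v_l, u_11, ..., u_15, v_r
# are at Hamming distance exactly 2.
assert all(d(path[a], path[a + 1]) == 2 for a in range(len(path) - 1))
# Lemma 2 (Case 5): u_11 and u_13 share the neighbor u_12; distance 4.
assert d(t_code(1, 1, l, r, n), t_code(1, 3, l, r, n)) == 4
# Lemma 3 (Case 1): u_11 and u_15 have no common neighbor; distance 8.
assert d(t_code(1, 1, l, r, n), t_code(1, 5, l, r, n)) == 8
```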
Clearly, there is a critical 1 ≤ k_c ≤ 2m such that there is a k_c-clustering of S with the maximum cluster radius equal to 2, while for any (k_c - 1)-clustering of S the maximum cluster radius is at least 4. If there were a polynomial-time algorithm for the Hamming radius k-clustering problem within an approximation factor less than 2, we could apply it to the instance S for each k in decreasing order, from 2m down to 1. Then, once we find a k such that the maximum cluster radius of the clustering obtained by the algorithm is at least 4, we are sure that k_c = k + 1. Indeed, when k ≥ k_c, the

algorithm can of course find a k-clustering of S with the maximum cluster radius equal to 2 (its approximation factor being less than 2), and when k = k_c - 1, for any k-clustering of S the maximum cluster radius is at least 4. Therefore, by Lemma 4, the vertex cover number of the graph G is k_c - m. That is, we could compute the exact vertex cover number of any instance G of the vertex cover problem in polynomial time. However, the vertex cover problem is NP-hard, a contradiction. This completes our reduction.

4 A Deterministic PTAS for the O(1)-Closest Substring Problem

In this section, we study the k-closest Substring problem for k = O(1). Both the algorithm and the proof are based on the k = 2 case, which we call the 2-Closest Substring problem. We make use of some results for the Closest String problem and extend a random sampling strategy from [14] and [13] to give a deterministic PTAS for the 2-Closest Substring problem. Finally, we give an informal statement explaining how and why it can be easily extended to the general k = O(1) case. Since the Hamming radius O(1)-clustering problem is just a special case of the O(1)-Closest Substring problem, the same algorithm also gives a deterministic PTAS for the Hamming radius O(1)-clustering problem, improving on the randomized PTAS of [10].

4.1 Some Definitions

Let s, t be strings of length m. A multiset P = {j_1, j_2, ..., j_k} such that 1 ≤ j_1 ≤ j_2 ≤ ... ≤ j_k ≤ m is called a position set. By s|_P we denote the string s[j_1]s[j_2]...s[j_k]. We also write d_P(s, t) to mean d(s|_P, t|_P).

Let c, o be strings of length L, Q ⊆ {1, 2, ..., L}, P = {1, 2, ..., L} \ Q, and R ⊆ P. We denote

h(o, c, Q, R) = d_Q(o, c) + (|P| / |R|) d_R(o, c),
f(s, c, Q, R) = min over all length-L substrings t of s of h(t, c, Q, R),
g(s, c) = min over all length-L substrings t of s of d(t, c).

The function h is the quasi-distance measure we adopt. The functions f and g are two new distance measures we introduce; we refer to them as distance f and distance g from now on.
Let S = {s_1, s_2, ..., s_n} be an instance of the 2-Closest Substring problem, where each s_i is of length m (m ≥ L). Let (c_A, c_B, A, B) be a solution in which the set S has been partitioned into two clusters A and B, and c_A, c_B are the center strings of clusters A and B respectively. We denote by d(c_A, c_B, A, B) the minimal d satisfying: for all s ∈ A, g(s, c_A) ≤ d, and for all s ∈ B, g(s, c_B) ≤ d.

Let s_{i1}, s_{i2}, ..., s_{ir} be r strings (allowing repeats) in S. Let Q_{i1,i2,...,ir} be the set of positions where s_{i1}, s_{i2}, ..., s_{ir} agree, and P_{i1,i2,...,ir} = {1, 2, ..., m} \ Q_{i1,i2,...,ir}.
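In Python, the measures d_P, h, f and g can be sketched as follows (this is our own illustration; positions are 1-indexed as in the text, and |P| is passed explicitly since in the algorithm R is a random multiset drawn from P, so (|P|/|R|)·d_R is a scaled sample estimate of d_P):

```python
def hamming_on(positions, s, t):
    """d_P(s, t): Hamming distance restricted to a (multi)set of
    1-indexed positions."""
    return sum(s[p - 1] != t[p - 1] for p in positions)

def h(o, c, Q, R, P_size):
    """Quasi-distance: exact on Q, plus a scaled sample estimate on R
    of the distance on P = {1..L} \\ Q."""
    return hamming_on(Q, o, c) + (P_size / len(R)) * hamming_on(R, o, c)

def f(s, c, Q, R, P_size, L):
    """min over all length-L substrings t of s of h(t, c, Q, R)."""
    return min(h(s[i:i + L], c, Q, R, P_size)
               for i in range(len(s) - L + 1))

def g(s, c, L):
    """min over all length-L substrings t of s of d(t, c)."""
    return min(sum(a != b for a, b in zip(s[i:i + L], c))
               for i in range(len(s) - L + 1))
```

When R = P, h reduces exactly to the Hamming distance; the point of the analysis below is that a small random R already makes h a good estimate, with high probability.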

4.2 Several Useful Lemmas

Lemma 5. (Chernoff Bound) Let X_1, X_2, ..., X_n be n independent random 0-1 variables, where X_i takes 1 with probability p_i, 0 < p_i < 1. Let X = Σ_{i=1}^{n} X_i and µ = E[X]. Then for any 0 < ε ≤ 1,
(1) Pr(X > µ + εn) < exp(-(1/3) n ε^2),
(2) Pr(X < µ - εn) ≤ exp(-(1/2) n ε^2).

Let S = {s_1, s_2, ..., s_n} be an instance of the Closest String problem, where each s_i is of length m. Let s and d_opt be the center string and the radius in the optimal solution. Let ρ_0 = max_{1≤i,j≤n} d(s_i, s_j) / d_opt.

Lemma 6. For any constant r, 2 ≤ r < n, if ρ_0 > 1 + 1/(2r-1), then there are indices 1 ≤ i_1, i_2, ..., i_r ≤ n such that for any 1 ≤ l ≤ n,

d_{Q_{i1,i2,...,ir}}(s_{i1}, s_l) - d_{Q_{i1,i2,...,ir}}(s, s_l) ≤ (1/(2r-1)) d_opt.

Lemma 7. |P_{i1,i2,...,ir}| ≤ r d_opt.

Lemma 5 is Lemma 1.2 in [13]. Lemma 6 is Lemma 2.1 in [13]. Lemma 7 is Claim 2.4 in [13]. The following lemma plays a key role in the proof of our main result.

Lemma 8. For any constant r, 2 ≤ r < n, there are r strings s_{i1}, s_{i2}, ..., s_{ir} (allowing repeats) in S and a center string c' such that for any 1 ≤ l ≤ n, d(c', s_l) ≤ (1 + 1/(2r-1)) d_opt, where c'|_{Q_{i1,i2,...,ir}} = s_{i1}|_{Q_{i1,i2,...,ir}} and c'|_{P_{i1,i2,...,ir}} = s|_{P_{i1,i2,...,ir}}.

Proof. If ρ_0 ≤ 1 + 1/(2r-1), we can choose s_1, taken r times, as the r strings s_{i1}, s_{i2}, ..., s_{ir}. Otherwise, by Lemma 6, such r strings s_{i1}, s_{i2}, ..., s_{ir} do exist.

4.3 A PTAS for the 2-Closest Substring Problem

How to choose the n closest substrings and how to partition the n strings into two sets accordingly are the only two obstacles on the way to the solution. We use the random sampling strategy of [14] and [13] to handle both obstacles; this is a further application of that strategy. Now let us outline the underlying ideas. Let S = {s_1, s_2, ..., s_n} be an instance of the 2-Closest Substring problem, where each s_i is of length m.
Suppose that in the optimal solution, t_1, t_2, ..., t_n are the n closest substrings (those which are nearest, among all length-L substrings of each string, to the center string to which that string is assigned); based on the optimal partition, they form two instances of the Closest String problem. Thus, if we could obtain the same partition, together with the same choice of substrings, as in the optimal solution, we could solve the Closest String problem on those two instances separately. Unfortunately, we do not know the exact partition and choice in the optimal solution. However, by virtue of the quasi-distance measure h, which approximates the Hamming distance d very well, we can get a partition, together with a choice, that is not too bad compared with the optimal one. That is, even

if some string is partitioned into the wrong cluster, its distance g from the wrong center will not exceed (1 + ε) times its distance g from the right center in the optimal solution, with ε as small as we desire. The detailed algorithm (Algorithm 2-Closest Substring) is given in Fig. 2. We prove Theorem 2 in the rest of the section.

Algorithm 2-Closest Substring
Input: n strings s_1, s_2, ..., s_n ∈ Σ^m, integer L.
Output: two center strings c''_1, c''_2.

1. for each r length-L substrings t_{i1}, t_{i2}, ..., t_{ir} (allowing repeats, but if t_{ip} and t_{iq} are both chosen from the same s_i then t_{ip} = t_{iq}) of the n input strings do
   (1) Q_1 = {1 ≤ k ≤ L | t_{i1}[k] = t_{i2}[k] = ... = t_{ir}[k]}, P_1 = {1, 2, ..., L} \ Q_1.
   (2) Let R_1 be a multiset containing (4/ε^2) log(mn) uniformly random positions from P_1.
   (3) for each r length-L substrings t_{j1}, t_{j2}, ..., t_{jr} (allowing repeats, but if t_{jp} and t_{jq} are both chosen from the same s_j then t_{jp} = t_{jq}) of the n input strings do
       (a) Q_2 = {1 ≤ k ≤ L | t_{j1}[k] = t_{j2}[k] = ... = t_{jr}[k]}, P_2 = {1, 2, ..., L} \ Q_2.
       (b) Let R_2 be a multiset containing (4/ε^2) log(mn) uniformly random positions from P_2.
       (c) for each string y_1 of length |R_1| and each string y_2 of length |R_2| do
           (i) Let ĉ_1 be a string such that ĉ_1|_{R_1} = y_1 and ĉ_1|_{Q_1} = t_{i1}|_{Q_1}, and ĉ_2 a string such that ĉ_2|_{R_2} = y_2 and ĉ_2|_{Q_2} = t_{j1}|_{Q_2}.
           (ii) for l from 1 to n do
                Assign s_l to set C_1 if f(s_l, ĉ_1, Q_1, R_1) ≤ f(s_l, ĉ_2, Q_2, R_2), and to set C_2 otherwise. We denote ĉ(l) = ĉ_1, Q(l) = Q_1, R(l) = R_1 if s_l ∈ C_1, and ĉ(l) = ĉ_2, Q(l) = Q_2, R(l) = R_2 otherwise. Let t_l be a length-L substring (if several such substrings exist, choose one arbitrarily) of s_l such that h(t_l, ĉ(l), Q(l), R(l)) = f(s_l, ĉ(l), Q(l), R(l)).
           (iii) Using the method in [13], solve the optimization problem defined by Formula (9) approximately, obtaining a solution (x_1, x_2) within error max(ε|P_1|, ε|P_2|).
           (iv) Let c''_1 be the string such that c''_1|_{P_1} = x_1 and c''_1|_{Q_1} = t_{i1}|_{Q_1}, and c''_2 the string such that c''_2|_{P_2} = x_2 and c''_2|_{Q_2} = t_{j1}|_{Q_2}.
           (v) Let d = max_{1≤l≤n} min(g(s_l, c''_1), g(s_l, c''_2)).
2. Output the pair (c''_1, c''_2) with minimum d over all iterations of step 1(3)(c).

Fig. 2. The PTAS for the 2-Closest Substring problem

Theorem 2. Algorithm 2-Closest Substring is a PTAS for the 2-Closest Substring problem.

Proof. Let ε be any small positive number and r ≥ 2 any fixed integer. Let S = {s_1, s_2, ..., s_n} be an instance of the 2-Closest Substring problem,

where each s_i is of length m. Let T = {u | u is a length-L substring of some s ∈ S}. Suppose that in the optimal solution, t_1, t_2, ..., t_n are the n closest substrings, C_1, C_2 are the two clusters into which S is partitioned, and c_1, c_2 are the length-L center strings of the two clusters. We denote T_1 = {t_j | s_j ∈ C_1}, T_2 = {t_j | s_j ∈ C_2} and d_opt = max_{1≤j≤n} min(d(t_j, c_1), d(t_j, c_2)).

Both T_1 and T_2 form instances of the Closest String problem, so by trying all possibilities, we can assume that t_{i1}, t_{i2}, ..., t_{ir} and t_{j1}, t_{j2}, ..., t_{jr} are the r strings that satisfy Lemma 8 with respect to T_1 and T_2 respectively. Let Q_1 (Q_2) be the set of positions where t_{i1}, t_{i2}, ..., t_{ir} (t_{j1}, t_{j2}, ..., t_{jr}) agree, and P_1 = {1, 2, ..., L} \ Q_1 (P_2 = {1, 2, ..., L} \ Q_2). Let c'_1 be the string such that c'_1|_{P_1} = c_1|_{P_1} and c'_1|_{Q_1} = t_{i1}|_{Q_1}, and c'_2 the string such that c'_2|_{P_2} = c_2|_{P_2} and c'_2|_{Q_2} = t_{j1}|_{Q_2}. By Lemma 8, the solution (c'_1, c'_2, T_1, T_2) is a good approximation of the optimal solution (c_1, c_2, T_1, T_2), that is,

d(c'_1, c'_2, T_1, T_2) ≤ (1 + 1/(2r-1)) d_opt.   (2)

As for P_1 (P_2), where we know nothing about c'_1 (c'_2), we randomly pick (4/ε^2) log(mn) positions from P_1 (P_2). Suppose that the multiset of these random positions is R_1 (R_2). By trying all possibilities, we can assume that we get c'_1|_{Q_1 ∪ R_1} and c'_2|_{Q_2 ∪ R_2} at some point. We then place each s_j ∈ S into set C'_1 if f(s_j, c'_1, Q_1, R_1) ≤ f(s_j, c'_2, Q_2, R_2), and into set C'_2 otherwise. For each 1 ≤ j ≤ n, we denote c'(j) = c'_1, Q'(j) = Q_1, R'(j) = R_1, P'(j) = P_1 if s_j ∈ C'_1, and c'(j) = c'_2, Q'(j) = Q_2, R'(j) = R_2, P'(j) = P_2 otherwise. For each 1 ≤ j ≤ n, let t'_j be a length-L substring (if several such substrings exist, we choose one arbitrarily) of s_j such that h(t'_j, c'(j), Q'(j), R'(j)) = f(s_j, c'(j), Q'(j), R'(j)). We denote T'_1 = {t'_j | s_j ∈ C'_1} and T'_2 = {t'_j | s_j ∈ C'_2}. We can prove the following Lemma 9.

Lemma 9.
With high probability, for each u_j ∈ T,

|d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1)| ≤ ε|P_1| and |d(u_j, c'_2) - h(u_j, c'_2, Q_2, R_2)| ≤ ε|P_2|.   (3)

Proof. Let λ = |P_1| / |R_1|. It is easy to see that d_{R_1}(u_j, c'_1) is the sum of |R_1| independent random 0-1 variables, Σ_{i=1}^{|R_1|} X_i, where X_i = 1 indicates a mismatch between c'_1 and u_j at the i-th position in R_1. Let µ = E[d_{R_1}(u_j, c'_1)]. Obviously, µ = d_{P_1}(u_j, c'_1) / λ. Therefore, by Lemma 5 (2), we have the following inequality:

Pr(d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1) ≥ ε|P_1|)
= Pr(d_{R_1}(u_j, c'_1) ≤ (d(u_j, c'_1) - d_{Q_1}(u_j, c'_1))/λ - ε|R_1|)
= Pr(d_{R_1}(u_j, c'_1) ≤ d_{P_1}(u_j, c'_1)/λ - ε|R_1|)
= Pr(d_{R_1}(u_j, c'_1) ≤ µ - ε|R_1|)
≤ exp(-(1/2) ε^2 |R_1|) ≤ (mn)^{-2},   (4)

where the last inequality is due to the setting |R_1| = (4/ε^2) log(mn) in step 1(2) of the algorithm. Similarly, using Lemma 5 (1), we have

Pr(d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1) ≤ -ε|P_1|) ≤ (mn)^{-4/3}.   (5)

Combining Formula (4) with (5), we have that for any u_j ∈ T,

Pr(|d(u_j, c'_1) - h(u_j, c'_1, Q_1, R_1)| ≥ ε|P_1|) ≤ 2(mn)^{-4/3}.   (6)

Similarly, we have that for any u_j ∈ T,

Pr(|d(u_j, c'_2) - h(u_j, c'_2, Q_2, R_2)| ≥ ε|P_2|) ≤ 2(mn)^{-4/3}.   (7)

Summing up over all u_j ∈ T, we have that with probability at least 1 - 4(mn)^{-1/3}, Formula (3) holds.

Now we can conclude our proof. For each 1 ≤ j ≤ n, we denote c(j) = c'_1, Q(j) = Q_1, R(j) = R_1, P(j) = P_1 if t_j ∈ T_1, and c(j) = c'_2, Q(j) = Q_2, R(j) = R_2, P(j) = P_2 otherwise. Therefore, we have that Formula (8) holds with probability at least 1 - 4(mn)^{-1/3}:

d(c'(j), t'_j) ≤ h(t'_j, c'(j), Q'(j), R'(j)) + ε|P'(j)|
            ≤ h(t_j, c(j), Q(j), R(j)) + ε|P'(j)|
            ≤ d(c(j), t_j) + ε|P(j)| + ε|P'(j)|
            ≤ d(c'_1, c'_2, T_1, T_2) + ε|P(j)| + ε|P'(j)|
            ≤ (1 + 1/(2r-1)) d_opt + ε|P(j)| + ε|P'(j)|
            ≤ (1 + 1/(2r-1) + 2εr) d_opt,  for 1 ≤ j ≤ n,   (8)

where the first and third inequalities are by Formula (3), the second is due to the definition of the partition, the fourth is by the definition of d(c'_1, c'_2, T_1, T_2), the fifth is by Formula (2), and the last is due to Lemma 7.

By the definition of c'_1 and c'_2, the following optimization problem has a solution (x_1, x_2) = (c'_1|_{P_1}, c'_2|_{P_2}) such that d ≤ (1 + 1/(2r-1) + 2εr) d_opt:

min d;
d(x_1, t'_j|_{P_1}) ≤ d - d_{Q_1}(t_{i1}, t'_j), for all t'_j ∈ T'_1; |x_1| = |P_1|;   (9)
d(x_2, t'_j|_{P_2}) ≤ d - d_{Q_2}(t_{j1}, t'_j), for all t'_j ∈ T'_2; |x_2| = |P_2|.

We can solve this optimization problem within error εr d_opt by applying the method for the Closest String problem [13] to the sets T'_1 and T'_2 respectively. Let (x'_1, x'_2) be the solution so obtained. Then by Formula (9) we have

d(x'_1, t'_j|_{P_1}) ≤ (1 + 1/(2r-1) + 2εr) d_opt - d_{Q_1}(t_{i1}, t'_j) + ε|P_1|, for all t'_j ∈ T'_1;
d(x'_2, t'_j|_{P_2}) ≤ (1 + 1/(2r-1) + 2εr) d_opt - d_{Q_2}(t_{j1}, t'_j) + ε|P_2|, for all t'_j ∈ T'_2.   (10)

Let (c''_1, c''_2) be defined as in step 1(3)(c)(iv); then by Formula (10) we have

d(c''_1, t'_j) = d(x'_1, t'_j|_{P_1}) + d_{Q_1}(t_{i1}, t'_j) ≤ (1 + 1/(2r-1) + 2εr) d_opt + ε|P_1| ≤ (1 + 1/(2r-1) + 3εr) d_opt, for all t'_j ∈ T'_1;
d(c''_2, t'_j) = d(x'_2, t'_j|_{P_2}) + d_{Q_2}(t_{j1}, t'_j) ≤ (1 + 1/(2r-1) + 2εr) d_opt + ε|P_2| ≤ (1 + 1/(2r-1) + 3εr) d_opt, for all t'_j ∈ T'_2.   (11)

Thus we get a good approximate solution (c''_1, c''_2, T'_1, T'_2) with d(c''_1, c''_2, T'_1, T'_2) ≤ (1 + 1/(2r-1) + 3εr) d_opt. It is easy to see that the algorithm runs in polynomial time for any fixed positive r and ε. For any δ > 0, by properly setting r and ε such that 1/(2r-1) + 3εr ≤ δ, the algorithm outputs in polynomial time, with high probability, a solution (c''_1, c''_2) such that min(g(s_j, c''_1), g(s_j, c''_2)) ≤ (1 + δ) d_opt for each 1 ≤ j ≤ n. The algorithm can be derandomized by standard methods [15].

4.4 The General k = O(1) Case

Theorem 2 can be trivially extended to the k = O(1) case.

Theorem 3. There is a PTAS for the O(1)-Closest Substring problem.

We only give an informal explanation of how such an extension can be done. Unlike Jansson's randomized PTAS in [10] for the Hamming radius O(1)-clustering problem, which made use of the apex of a tournament to get a good assignment of strings to clusters, our extension is trivial. (Due to space limitations, the explanation is omitted and given in the supplementary material.)

5 The O(1)-Consensus Pattern Problem

It is relatively straightforward to extend the techniques in [12] for the Consensus Pattern problem to give a PTAS for the O(1)-Consensus Pattern problem. Since the algorithm for the general k = O(1) case is tedious, we only give an algorithm for the k = 2 case. The detailed algorithm (Algorithm 2-Consensus Pattern) is described in Fig. 3.

Theorem 4. Algorithm 2-Consensus Pattern is a PTAS for the 2-Consensus Pattern problem.

Again, since the proof is tedious and does not involve new ideas, we only give an informal explanation of how such an extension can be done. (Due to space limitations, the explanation is omitted and given in the supplementary material.)

Algorithm 2-Consensus Pattern
Input: n strings s_1, s_2, ..., s_n ∈ Σ^m, integer L.
Output: two median strings c'_1, c'_2.

1. for each r length-L substrings t_{i1}, t_{i2}, ..., t_{ir} (allowing repeats, but if t_{ip} and t_{iq} are both chosen from the same s_i then t_{ip} = t_{iq}) of the n input strings do
   (1) Let c_1 be the column-wise majority string of t_{i1}, t_{i2}, ..., t_{ir}.
   (2) for each r length-L substrings t_{j1}, t_{j2}, ..., t_{jr} (allowing repeats, but if t_{jp} and t_{jq} are both chosen from the same s_j then t_{jp} = t_{jq}) of the n input strings do
       (a) Let c_2 be the column-wise majority string of t_{j1}, t_{j2}, ..., t_{jr}.
       (b) for j from 1 to n do
           Assign s_j to set C_1 if g(s_j, c_1) ≤ g(s_j, c_2), and to set C_2 otherwise. We denote c(j) = c_1 if s_j ∈ C_1 and c(j) = c_2 otherwise. Let t_j be a length-L substring (if several such substrings exist, choose one arbitrarily) of s_j such that d(t_j, c(j)) = g(s_j, c(j)).
       (c) We denote T_1 = {t_j | s_j ∈ C_1} and T_2 = {t_j | s_j ∈ C_2}. Let c'_1 be the column-wise majority string of all strings in T_1, and c'_2 the column-wise majority string of all strings in T_2.
       (d) Let w = Σ_{1≤j≤n} min(g(s_j, c'_1), g(s_j, c'_2)).
2. Output the pair (c'_1, c'_2) with minimum w over all iterations of step 1(2).

Fig. 3. The PTAS for the 2-Consensus Pattern problem

We have presented a PTAS for the O(1)-Consensus Pattern problem. Since the min-sum Hamming median clustering problem (i.e., the Hamming p-median clustering problem) discussed in [16] is just a special case of the O(1)-Consensus Pattern problem, we also obtain a deterministic PTAS for that problem.
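A direct (exponential-time, but faithful in spirit) transcription of Algorithm 2-Consensus Pattern for tiny inputs might look as follows. This sketch is ours: for simplicity it enumerates r-tuples from the pooled substrings of all input strings without enforcing the "equal if chosen from the same string" restriction stated in the figure, which only shrinks the search space in the original.

```python
from collections import Counter
from itertools import product

def hamming(s, t):
    return sum(a != b for a, b in zip(s, t))

def substrings(s, L):
    return [s[i:i + L] for i in range(len(s) - L + 1)]

def majority(strings):
    """Column-wise majority string of equal-length strings."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*strings))

def g(s, c, L):
    """Distance g: best length-L substring of s against center c."""
    return min(hamming(t, c) for t in substrings(s, L))

def two_consensus_pattern(S, L, r=2):
    """Try all r-tuples as seeds for the two trial medians, partition the
    strings by distance g, then take column-wise majorities of the chosen
    consensus patterns; return (w, c'_1, c'_2) with minimum total cost."""
    pool = [t for s in S for t in substrings(s, L)]
    best = None
    for seed1 in product(pool, repeat=r):
        c1 = majority(seed1)
        for seed2 in product(pool, repeat=r):
            c2 = majority(seed2)
            T1, T2 = [], []
            for s in S:
                # assign s to the nearer trial median, keep its best substring
                t1 = min(substrings(s, L), key=lambda t: hamming(t, c1))
                t2 = min(substrings(s, L), key=lambda t: hamming(t, c2))
                if hamming(t1, c1) <= hamming(t2, c2):
                    T1.append(t1)
                else:
                    T2.append(t2)
            f1 = majority(T1) if T1 else c1
            f2 = majority(T2) if T2 else c2
            w = sum(min(g(s, f1, L), g(s, f2, L)) for s in S)
            if best is None or w < best[0]:
                best = (w, f1, f2)
    return best
```

On an instance whose strings split into two groups sharing the motifs "ab" and "cd", such as ["xaby", "abzz", "ccdd", "zcdz"] with L = 2, the sketch recovers those two medians at cost 0.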
Acknowledgements

We would like to thank Dongbo Bu, Hao Lin, Bin Ma, Jingfen Zhang, and Zefeng Zhang of the Sequence Assembly project at the Institute of Computing Technology, Chinese Academy of Sciences, for various discussions.

References

1. A. Blum, T. Jiang, M. Li, J. Tromp, and M. Yannakakis: Linear approximation of shortest superstrings. Journal of the ACM 41(4) (1994)

2. J. Buhler: Efficient large-scale sequence comparison by locality-sensitive hashing. Bioinformatics 17 (2001)
3. T. Feder and D. H. Greene: Optimal algorithms for approximate clustering. In Proceedings of the 20th Annual ACM Symposium on Theory of Computing (1988)
4. R. J. Fowler, M. S. Paterson, and S. L. Tanimoto: Optimal packing and covering in the plane are NP-complete. Information Processing Letters 12(3) (1981)
5. M. R. Garey and D. S. Johnson: The rectilinear Steiner tree problem is NP-complete. SIAM J. Appl. Math. 32 (1977)
6. L. Gasieniec, J. Jansson, and A. Lingas: Efficient approximation algorithms for the Hamming center problem. Proc. 10th ACM-SIAM Symp. on Discrete Algorithms (1999) S905–S906
7. L. Gasieniec, J. Jansson, and A. Lingas: Approximation algorithms for Hamming clustering problems. Proc. 11th Symp. on Combinatorial Pattern Matching (2000)
8. T. F. Gonzalez: Clustering to minimize the maximum intercluster distance. Theoretical Computer Science 38 (1985)
9. P. Indyk and R. Motwani: Approximate nearest neighbors: towards removing the curse of dimensionality. Proc. 30th Annual ACM Symp. on Theory of Computing (1998)
10. J. Jansson: Consensus Algorithms for Trees and Strings. Doctoral dissertation (2003)
11. K. Lanctot, M. Li, B. Ma, L. Wang, and L. Zhang: Distinguishing string selection problems. Proc. 10th ACM-SIAM Symp. on Discrete Algorithms (1999)
12. M. Li, B. Ma, and L. Wang: Finding similar regions in many strings. In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (1999)
13. M. Li, B. Ma, and L. Wang: On the closest string and substring problems. Journal of the ACM 49(2) (2002)
14. B. Ma: A polynomial time approximation scheme for the closest substring problem. In Proc. 11th Annual Symposium on Combinatorial Pattern Matching (2000)
15. R. Motwani and P. Raghavan: Randomized Algorithms. Cambridge University Press (1995)
16. R. Ostrovsky and Y. Rabani: Polynomial-time approximation schemes for geometric min-sum median clustering. Journal of the ACM 49(2) (2002)


More information

Finding Large Induced Subgraphs

Finding Large Induced Subgraphs Finding Large Induced Subgraphs Illinois Institute of Technology www.math.iit.edu/ kaul kaul@math.iit.edu Joint work with S. Kapoor, M. Pelsmajer (IIT) Largest Induced Subgraph Given a graph G, find the

More information

THE METHOD OF CONDITIONAL PROBABILITIES: DERANDOMIZING THE PROBABILISTIC METHOD

THE METHOD OF CONDITIONAL PROBABILITIES: DERANDOMIZING THE PROBABILISTIC METHOD THE METHOD OF CONDITIONAL PROBABILITIES: DERANDOMIZING THE PROBABILISTIC METHOD JAMES ZHOU Abstract. We describe the probabilistic method as a nonconstructive way of proving the existence of combinatorial

More information

Differential approximation results for the Steiner tree problem

Differential approximation results for the Steiner tree problem Differential approximation results for the Steiner tree problem Marc Demange, Jérôme Monnot, Vangelis Paschos To cite this version: Marc Demange, Jérôme Monnot, Vangelis Paschos. Differential approximation

More information

The minimum G c cut problem

The minimum G c cut problem The minimum G c cut problem Abstract In this paper we define and study the G c -cut problem. Given a complete undirected graph G = (V ; E) with V = n, edge weighted by w(v i, v j ) 0 and an undirected

More information