Finding recurrent sources in sequences

Size: px
Start display at page:

Download "Finding recurrent sources in sequences"

Transcription

1 Finding recurrent sources in sequences Aristides Gionis Deartment of Comuter Science Stanford University Stanford, CA, 94305, USA Heikki Mannila HIIT Basic Research Unit Deartment of Comuter Science University of Helsinki P.O. Box 26, Teollisuuskatu 23 FIN Helsinki, Finland ABSTRACT Many genomic sequences and, more generally, (multivariate) time series dislay tremendous variability. However, often it is reasonable to assume that the sequence is actually generated by or assembled from a small number of sources, each of which might contribute several segments to the sequence. That is, there are h hidden sources such that the sequence can be written as a concatenation of k > h ieces, each of which stems from one of the h sources. We dene this (k; h)- segmentation roblem and show that it is NP-hard in the general case. We give aroximation algorithms achieving aroximation ratios of 3 for the L 1 error measure and 5 for the L 2 error measure, and generalize the results to higher dimensions. We give emirical results on real (chromosome 22) and articial data showing that the methods work well in ractice. 1. INTRODUCTION Many genomic sequences and, more generally, (multivariate) time series dislay tremendous variability. In many cases it is interesting to nd out whether the sequence can be viewed as consisting of segments coming from a small number of dierent sources. That is, we want to examine whether there are h hidden sources such that the sequence T can be written as a concatenation of k > h ieces each of which can be viewed as coming from one of the h sources. For genomic analyses, the segments might be viral or microbial inserts, stemming from a small number of ossible sources. For instance, Azad et al [2] try to identify a coarse{ grained descrition of the given DNA string in terms of a smaller set of distinct domain labels. The biological hyothesis behind this is that a mosaic organization of DNA Suorted by a Microsoft Research Fellowshi. 
This work was done while the author was visiting the HIIT Basic Research Unit, Deartment of Comuter Science, University of Helsinki, Finland. Permission to make digital or hard coies of all or art of this work for ersonal or classroom use is granted without fee rovided that coies are not made or distributed for rofit or commercial advantage and that coies bear this notice and the full citation on the first age. To coy otherwise, to reublish, to ost on servers or to redistribute to lists, requires rior secific ermission and/or a fee. RECOMB 03, Aril 10 13, 2003, Berlin, Germany. Coyright 2003 ACM /03/ $5.00. sequences could have originated from the insertion of fragments of one genome (the arasite) inside another (the host). For time series tye of data, the segments would reect the generating rocess having h dierent states, with arbitrary (but relatively rare) transitions between the states. The segmentation roblem for genome sequences and time series has been discussed widely; see, e.g., [10, 3, 7, 9, 5, 16, 14, 15, 11, 4, 2]. For a wide variety of score functions, dynamic rogramming can be used to obtain the best segmentation of an n-element sequence into k ieces in time O(n 2 k); this idea goes back at least to Bellman [3] in The roblem with this aroach is that the descritions for each segment are indeendent. That is, we get as many sources as there are segments. In many alcations it makes more sense to assume that the number of sources is (much) smaller than the number of segments. A comlex dynamic rocess can have a small set of underlying states, and the number of state transitions is far larger than the number of states. Similarly, a genome can have an evolutionary structure with only a few dierent sources, each source contributing several segments. Our roblem is thus to nd a good way of segmenting an n-element sequence into k segments, each of which comes from one of h dierent sources. We call this the (k; h)- segmentation roblem. 
The normal dynamic rogramming aroaches to sequence segmentation ignore the constraint of having less sources than segments and solve the (k; k)- segmentation roblem. When h < k, the (k; h)-segmentation roblem cannot be solved using that method: the restriction to h dierent sources means that decisions made early in the sequence have an imact later, invalidating the otimality condition needed for dynamic rogramming. The (k; h)-segmentation roblem turns out to be algorithmically quite interesting. Formally, the (k; h)-segmentation roblem is dened as follows. Suose we are given a sequence T = (t 1; : : : ; t n), where t i 2. A segmentation M of T is dened by k + 1 segment boundaries 1 = b 1 < b 2 < < b k < b k+1 = n + 1, yielding segments S 1; : : : ; S k where S i = (t bi ; : : : ; t bi +1?1). Denote the ossible sources (or states of the rocess) by?. Given a source 2? and a segment S i, denote by P (S i j ) the likelihood that the sequence S i comes from source. The roblem is to nd h sources 1; : : : ; h, a decomosition of T into k segments S 1; : : : ; S k, and for each segment S i an assignment of a source ji 2 f 1; : : : ; hg such that the total robability Q k P (Si j j i) is maximized.

2 The above formulation is in terms of an arbitrary likelihood function. We mostly consider the case of sequence of oints from R d and the roblem of nding (k; k)-segmentations for them. We next discuss how the two formulations are related. If the sequence consists of n oints in R d, we can take? to be R d, and say that the loglikelihood of an observation t i stemming from source (a oint in R d ) is roortional to?jjt i?jj 2. Then, given an interval [a; b], the loglikelihood of the subsequence S[a; b] is roortional to? P b i=a jjti? jj2. Thus the best source for the interval is the one that maximizes the robability P (S[a; b] j ji ), or equivalently the P b minimizer of the sum of squares error i=a jjti? jj2. The latter turns out to be the mean of the data oints in the interval S[a; b]. Denote by V ar(s i) the variance of segment S i, and by js ij the length (number of oints) of S i. Then the (k; k)- segmentation roblem is equivalent to nding the k segments P k such that the sum js ijv ar(s i) is minimized; the best source for each segment is simly the mean of the segment, and the error er oint is the variance of the segment. Instead of the L 2-metric, we can of course consider arbitrary L -metrics. For each segment, the (k; k)-segmentation roblem can use the source that minimizes the error, whereas in the (k; h)-segmentation roblem with h < k several segments have to use the same source. While the (k; k)-segmentation roblem has been considered often, to the best of our knowledge the roblem of (k; h)-segmentation has not been studied extensively. In 1989 Churchill [6] stated a related roblem for the urose of artitioning genomic sequences into segments. In his formulation he uses hidden Markov states to model dierent comositional roerties within each DNA segment, and his solution uses the Viterbi algorithm to determine the most robable sequence of states. 
More recently, Azad et al [2] formulate a similar roblem to (k; h)-segmentation but their solution is based on greedy \slit and merge" and it does not rovide any theoretical guarantee. An alternative way to view the (k; h)-segmentation roblem is as a clustering roblem with added constraints: the task is to cluster the oints into h clusters, with the restriction that along the dimension of the sequence the cluster may change at most k? 1 times. In this aer we study the (k; h)-segmentation roblem and aly it to several domains. First, we show that the (k; h)-segmentation for sequences of oints from R d is NPhard for d > 1, both for the L 1 and L 2 metrics. For the case d = 1, it is not known whether the roblem is NP-hard or not. Nevertheless, for d = 1 we give a 3-aroximation algorithms for the L 1 error metric and a 5-aroximation algorithm for the L 2 error metric. For general d, we get a 3 + aroximation ratio for L 1 and + 2 for L 2, where is the best aroximation factor for the k-means clustering roblem. These aroximation algorithms are simle and easy to imlement, while the analysis is slightly comlex. The methods are based on using instances of dynamic rogramming for the olynomially solvable cases of the roblem. The algorithms can be alied to any a olynomial-time comutable likelihood function P (S j ), as long as we can, given a segment S from the sequence, nd the source maximizing P (S j ); the details of the aroximation guarantees deend on the exact form of the likelihood function. We give emirical data both for articial and for real sequences. The articial data is generated from a rocess with a small number of hidden states; even for 0-1 sequences, our algorithms roduce very good aroximations of the hidden states. A tyical emirical aroximation error is about 4%. It seems to us that one of the main alications of the (k; h)-segmentation roblem is in genomic sequences. As an examle, we study the distribution of w-letter words (e.g. 
w = 2; 3) in blocks of length 500 kb in the human genome. That is, given the genome sequence, we look at each 500 kb block in it, and comute for each block the of each of the w-letter words over A, C, G, T. These word frequencies then form our sequence of dimension 4 w (16 or 64), and we search for recurrent states in this sequence. We aly the Bayesian information criterion (BIC, see [17, 10]) to search for the values of k and h yielding the best segmentation. In addition, we aly our segmentation algorithms on the chromosome sequence that describes the densities of SNPs and genes in each block. We also examine combinations of both of the above densities giving to each an equal weight. The results show that the methods rovide useful segmentations that can be alied in the search for large scale structure in the genome (see Section 5.2). The rest of this aer is organized as follows. In Section 2 we dene the (k; h)-segmentation roblem formally, give some basic observations, and show that the roblem is NP-hard in general. In Section 3 we give three algorithms for the roblem and describe the aroximability results; the roofs are in Section 4. Exerimental results are described in Section 5, and Section 6 is a short conclusion. 2. PRELIMINARIES As noted in the introduction, our algorithms can be alied to nding otimal (k; h)-segmentations for any likelihood function satisfying the following two mild conditions: First, for every set of oints coming from a single source we should be able to comute the otimal source that maximizes the likelihood function. Second, the total likelihood for a segmentation of the whole sequence should be an associative combination function (e.g. sum or roduct) of the individual likelihoods on each segment. For simlicity of notation, we consider in the sequel the case of multidimensional real-valued sequences with the L distances. Let T = (t 1; : : : ; t n) be a sequence with n data oints from R d. 
The L distance between any two data oints t i and t j of the sequence is dened as d (t i; t j) = ( d k=1 jt ik? t jkj ) 1 where (t i1; : : : ; t id) and (t j1; : : : ; t jd) are the coordinates of the two d-dimensional oints t i and t j. The most commonly used distances are the L 1 (Manhattan), the L 2 (Euclidean), and the L1 (L-innity). In the one dimension all of the above distances are simly jt i? t jj. In the rest of the aer when the subscrit is ommited the Euclidean distance is imlied, i.e. d(t i; t j) = d 2(t i; t j), but most of our algorithms can be easily modied so that other distance functions can also be used. A k-segmentation M of a sequence T is dened by k + 1 indices fb 1; b 2; : : : ; b k; b k+1g that divides T into k segments, i.e., with 1 = b 1 < b 2 < : : : < b k < b k+1 = n + 1. We write M 2 Segm k (T ) to denote that M is a k-segmentation of T. A k-segmentation M is also described by the sequence of

3 segments to which it artitions the original sequence, that is, M = (S 1; : : : ; S k), where S i = (t bi ; : : : ; t bi+1? 1) for all i = 1; : : : ; k. We write S 2 M for a segment in segmentation M, and t 2 S for a oint in segment S. Given the original sequence T, our goal is to nd a segmentation M such that each of the segments S 2 M is homogeneous with resect to an error measure. Two natural choices for error measures within a segment S are the 1-error measure: E 1(S) = min and the 2-error measure: x2r d d(t; x) s E 2(S) = min d(t; x) 2 x2r d Minimizing these error measures can also be obtained by reresenting the segment S with a single reresentative t S, chosen so that it minimizes the error measure. In other words, E 1(S) = P d(t; t S) and E 2 2 (S) = P d(t; t S) 2. It is easy to see that for E 1(S), the reresentative t S will be the median of the oints in S, while for E 2(S) it will be the mean of the oints in S. We write E () when we don't want to distinguish between E 1() and E 2(). The traditional sequence segmentation roblem can now be formalized as follows. Problem 1 ((k; k)-segmentation ). For a sequence T, a k-segmentation M of T, and = 1; 2, we dene the -error of a segmentation M to be E (T; M) = E(S)! 1 = d(t; t S)! 1 The task is to nd the segmentation M that minimizes the above error measure. The minimum error is denoted by Seg (T; k), i.e. Seg (T; k) = min M2Segm k (T ) E(T; M) This roblem can, of course, be solved in olynomial time using dynamic rogramming as shown in 1961 by Bellman [3]. As noted in the introduction, the drawback in this formulation is that the segments are comletely indeendent: we have no way of referring solutions that would use only a few tyes of segments. That is, in (k; k)-segmentation, the reresentative t S can take a distinct value for each segment S. We would like to restrict the segment reresentatives to use at most h dierent values. 
Given a set of L = fl 1; : : : ; l hg of h values and a segmentation M, the minimal error for segment S 2 M is the l i that minimizes the error measure, i.e., E (SjL) = min l2l d(t; l) : The minimizer of the error measure in the set L for a segment S is denoted by l S. The roblem of nding recurrent states in sequences is formalized as follows. Problem 2 ((k; h)-segmentation ). Given a sequence T, a k-segmentation M of T, a set L of h values, and = 1; 2, the -error of the segmentation M with resect to L is! 1! 1 E (T; M; L) = E(SjL) = d(t; l S) The task is to nd the segmentation M and the set of values L that minimizes E (T; M; L). The minimum error achieved is denoted by Seg (T; k; h), i.e. Seg (T; k; h) = min E(T; M; L) M2Segm k (T ); L:jLj=h The (k; k)-segmentation roblem is a secial case of (k; h)- segmentation, where the value of h is left unconstrained. This means the following. Observation 1. For a sequence T and any numbers k and h we have Seg (T; k) Seg (T; k; h). Also notice that for xed h the value of Seg (T; k; h) decreases as k increases. Since the largest value of k is n, we get the following Observation 2. For a sequence T of size n and any k and h we have Seg (T; n; h) Seg (T; k; h). The case of k = n, in Observation 2, can be viewed as ignoring the constraint that some consecutive oints in the sequence should be reresented by the same value. Thus, the (n; h)-segmentation roblem is the well-studied clustering roblem. For = 1 it is the Euclidean k-median roblem, and for = 2 it is k-means roblem of nding the k oints such that the sum of distances to the closest oint is minimized. 1 Both of these roblems can be solved in olynomial time for 1-dimensional data [13], and both are NP{hard for dimensions d 2 [12]. Lemma 1. For = 1; 2, and dimension d 2, the roblem (k; h)-segmentation is NP{hard. Proof. 
For = 1 and = 2, the k-median and k- means roblems, resectively, are secial cases of (k; h)- segmentation with k = n ALGORITHMS In this section we describe three algorithms for solving the (k; h)-segmentation roblem. The algorithms are based on solving the olynomial sub-cases and combining the results in dierent ways. 3.1 Segments2Levels algorithm The rst of our algorithms for the (k; h)-segmentation roblem is called Segments2Levels. The algorithm is as follows. First solve the (k; k)-segmentation roblem to obtain a k-segmentation M. Then solve the (n; h)-segmentation roblem to obtain a set L of h level values. Then assign each segment S 2 M to the level in L which is closest to t S. We show in Section 4 that this simle algorithm rovides a 3-aroximation to the (k; h)-segmentation roblem for = 1 and for = 2, for dimension 1. For higher dimensions, the aroximation guarantees are 3 + for = 1, and Note that \k-means" is often used for one aroximate algorithm for this roblem.

4 for = 2, where is the best aroximation factor for the k-means roblem. The algorithm runs in time O(n 2 (k + h)), as the running time of the dynamic rogramming method is quadratic in the number of oints and linear in the number of segments. 3.2 ClusterSegments algorithm The next algorithm, called ClusterSegments, yields aroximation rations 5 and 5, for the 1-error and 2-error measures, resectively. The rst ste of the algorithm is the same with the rst ste of Segments2Levels; it solves (k; k)-segmentation and decides about the segments of the solution. Then, each segment S in the otimal segmentation of (k; k)-segmentation is reresented with the single value t S and weight jsj. A set of h levels L is roduced by clustering those k weighted oints into h clusters. The nal solution is roduced by assigning each segment S to the level l S 2 L which is the closest to t S among all levels in L. The running time of the method is again O(n 2 (k + h)). 3.3 Iterative algorithm Our third algorithm is insired by the EM algorithm and it is based on the following observations. 1. If we know the set L of levels that are to be used, then we can nd the k otimal segments. This follows from simle alication of dynamic rogramming. This ste corresonds to the E-ste of the EM algorithm. 2. Given an existing segmentation we can readjust the values of the levels and ossibly imrove the value of the solution. This is done as the M-ste of the EM algorithm: for each level we obtain a ossibly better value by comuting the mean or the median (deending on the error that we are otimizing for) of all the oints in the segments that are assigned to this level. The Iterative algorithm uses these observations. It starts from an initial segmentation obtained by, e.g., either of the revious algorithms. Then, given the current levels, the otimal segmentation for those levels is found using the above observation. After that, a new set of h levels is comuted by using the second observation. 
These two stes are iterated until the error does not decrease any more. We denote by Iterative(SL) the version that starts with the outut of Segments2Levels, and by Iterative(CS) the version that starts with the outut of ClusterSegments. The Iterative algorithm roduces at least as good aroximations as the revious two methods. The running time of the algorithm is O(In 2 (k + h)), where I is the number of iterations. In most of our exeriments the initial solutions rovided by Segments2Levels and ClusterSegments were very good, so the number of iterations remained below 3 or APPROIMATION GUARANTEES In this section we rove the aroximation guarantees for the Segments2Levels and ClusterSegments algorithms. For the Iterative algorithm we only know that it never is worse that the other two methods. The emirical results in the next section show that in ractice the algorithms work far better than what the worst-case bounds suggest. 4.1 Analysis of the Segments2Levels algorithm Theorem 1. Consider a 1-dimensional sequence T and integers k and h. Let Ot 1 = Seg 1(T; k; h) be the value of the otimal segmentation for the 1-error measure. Denote by SL 1(T; k; h) be the value of the solution rovided by the algorithm Segments2Levels. Then SL 1(T; k; h) 3Ot 1. Proof. Recall from the descrition of the algorithm in Section 3.1, that there are two initial stes. The rst ste is to solve the (n; h)-segmentation 1 roblem. Since we are in the 1-dimensional case, this can be solved in olynomial time using dynamic rogramming. Thus we obtain a set of levels L for which we have Seg 1(T; n; h) Ot 1. The second ste is to solve the (k; k)-segmentation 1, again by dynamic rogramming. So, we obtain a k-segmentation M, for which Seg 1(T; k) Ot 1. Now consider a data oint t of the sequence. Let S be the segment in M which oint t belongs to, and let l S 2 L be the level which segment S is assigned to by Segments2Levels. 
Also dene l t 2 L to be the value which oint t is assigned to by the the solution of (n; h)-segmentation 1. The error that oint t contributes to the total error is d(t; l S) and by triangle inequality d(t; l S) d(t; t S) + d(t S; l S) where, as before, t S is the median of all oints in segment S. However, since l S is the closest to t S, among all levels in L, we have d(t S; l S) d(t S; l t) and with a second alication of the triangle inequality d(t S; l t) d(t S; t) + d(t; l t) Combining the revious inequalities gives d(t; l S) 2d(t; t S) + d(t; l t) and summing over all oints in T we obtain the desired result: SL 1(T; k; h) = d(t; l S) 2 d(t; t S) + d(t; l t) 2 Seg 1(T; k) + Seg 1(T; n; h) 3 Ot 1: Theorem 2. Let T be a 1-dimensional sequence and let k and h be integers. Let Ot 2 = Seg 2(T; k; h) be the value of the otimal segmentation for (k; h)-segmentation 2 roblem for the = 2 error measure. Let SL 2(T; k; h) be the value of the solution rovided by the algorithm Segments2Levels. Then SL 2(T; k; h) 3 Ot 2. Proof. The roof is very similar to the one for the 1- error measure. Again in the 1-dimensional case the roblem (n; h)-segmentation 2 can be solved in olynomial time, so Seg 2(T; n; h) Ot 2. Also by solving (k; k)-segmentation 2, we obtain segmentation M for which Seg 2(T; k) Ot 2. 2

5 Using the same notation as in Theorem 1 we have that the error that a single oint t contributes to the square of the total error is d(t; l S) 2 (d(t; t S) + d(t S; l S)) 2 Combining with d(t S; l S) d(t S; t) + d(t; l t) we get d(t; l S) 2 4d(t; t S) 2 + d(t; l t) 2 + 4d(t; t S)d(t; l t) and summing over all oints in T we obtain SL 2(T; k; h) 2 = d(t; l S) 2 4 d(t; t S) 2 + d(t; l t) d(t; t S) 2 + d(t; l t) 2 +4 s s d(t; t S) 2 d(t; l t) 2 4 Seg 2(T; k) 2 + Seg 2(T; n; h) 2 +4 Seg 2(T; k) Seg 2(T; n; h) = 9 Ot 2 2 d(t; t S)d(t; l t) (by Cauchy-Schwarz) which, by taking square roots, rovides the claimed aroximation ratio Higher dimensions The only lace that the 1-dimensionality is used in Theorems 1 and 2 is in solving exactly the clustering roblem (n; h)-segmentation, and claiming that Seg (T; n; h) Ot. In higher dimensions, as we already mentioned, those clustering roblems become NP{hard. However, we can aroximate the solutions by olynomial algorithms. The k- median roblem (i.e., (n; h)-segmentation 1) admits a PTAS [1], so we can nd a set of levels L for which Seg 1(T; n; h) (1 + )Ot 1. Plugging this in the roof of Theorem 1 we obtain an aroximation ratio of 3 + for the 1-error measure. Similarly, k-means can be aroximated in olynomial time. Calling the best aroximation ratio for k-means, and using that in the roof of Theorem 2 we obtain an aroximation ratio of + 2 for the 2-error measure. The current is 9 + by Kanungo et al [8], yielding a factor of 11 + to our roblem. The clustering algorithms (for (n; h)- segmentation) used to obtain these aroximability results for higher dimensions might not be very ecient in ractice. 4.2 Analysis of the ClusterSegments algorithm For analyzing the erformance of the ClusterSegments algorithm we need the following lemmas. The rst is a weaker version of the triangle inequality in the case of the square of distances between oints. Lemma 2 (Double triangle inequality). 
For oints x, y, and z we have d(x; y) 2 2 d(x; z) d(z; y) 2 Proof. The lemma can be shown with a simle alication of the triangle inequality, i.e., d(x; y) 2 (d(x; z) + d(z; y)) 2 = d(x; z) 2 + d(z; y) d(x; z)d(z; y) 2 d(x; z) d(z; y) 2 since 2 d(x; z)d(z;y) d(x; z) 2 + d(z; y) 2. 2 The next lemma is a bias-variance result, which we state and rove for comleteness. Lemma 3. Let ft 1; : : : ; t mg be a set of m oints and let t be their mean, i.e. the minimizer of the 2-error measure. For any oint x it is true that m d(t; x) 2 m d(t i; x) 2 = m d(t; x) 2 + m d(t i; t) 2 Proof. We will only show the equality art of the lemma, since this imlies the inequality, as well. For simlicity we will assume that the oints are one-dimensional, i.e. d(x; y) = jx? yj. The extension to the multidimensional case can be done by alying the equality in each coordinate searately and summing P in all coordinates. By using m the denition of the mean ti = m t, we have m(t? x) 2 + m (t i? t) 2 = mt 2 + mx 2? 2mxt + = mx 2? 2mxt + m m t 2 i m t 2 i + mt 2? 2mt = (t i? x) 2 2 Now we show that the solution found by ClusterSegments is within a factor of 5 of the otimal solution for the 2-error measure. Using a similar line of arguments, one can show that for the 1-error the aroximation ratio of ClusterSegments is 5. Since we have already shown a ratio of 3 for that case using the Segments2Levels algorithm, we omit the analysis. Theorem 3. For a sequence T, numbers k and h, and the 2-error measure, let Ot 2 = Seg 2(T; k; h) be the value of the otimal segmentation for (k; h)-segmentation 2. If CS 2(T; k; h) is the value of the solution rovided by the algorithm ClusterSegments, then CS 2(T; k; h) 5 Ot 2. Proof. By solving (k; k)-segmentation 2 we obtain a segmentation M for which Seg 2(T; k) Ot 2. Let L be the set of h oints found by clustering the k weighted oints t S, for S 2 M. Let l S be the cluster center oint which segment S is assigned to. 
Now, the error contributed by a secic segment S to CS 2(T; k; h) 2 is P d(t; ls)2. Summing over all segments, and alying Lemma 3 we obtain CS 2(T; k; h) 2 = = = d(t; l S) 2 jsj d(t S; l S) 2 + d(t; t S) 2! jsj d(t S; l S) 2 + Seg 2(T; k) 2 jsj d(t S; l S) 2 + Ot 2 2 m t i

6 P The next ste is to bound the term jsj d(t S; l S) 2. Consider the set of otimal level oints M for the original roblem. For each segment S, let S be the oint in M which is the closest to t S. For the 1-dimensional case, the clustering of the k weighted oints can be done otimally, by dynamic rogramming. So by claiming otimality for L we get jsj d(t S; l S) 2 jsj d(t S; S) 2 Assume that the otimal segmentation artitions a segment S of M into r subsegments S j, each of them having size s j and mean t j. Also assume that the otimal solution assigns each of those segments at a level j 2 M, for j = 1; : : : ; r. Since S is the closest to t S, we have jsj d(t S; S) r s j d(t S; j) 2 r r s j d(t S; t j) 2 + j d(t; t S) r s j d(t j; j) 2! (by Lemma 2) r = 2 d(t; t S) d(t; t) 2 j d(t; j) 2 (by Lemma 3) where, by t we denote the level to which the oint t is assigned by the otimal solution. Summing P over all segments in M, it is clear that the term jsj d(t S; S) 2 is bounded by 4 Ot 2 2. Combining the above inequalities we have CS 2(T; k; h) 2 5 Ot 2 2, which yields the desired bound. 2 Similarly with the Segments2Levels algorithm, the roof can be adated for higher dimensions giving ratios of 5 + for = 1, and for = 2. As before, is the best aroximation factor for the k-means roblem. Two oints should be discussed here regarding the extension of the algorithms in higher dimensions. First, one should note that ClusterSegments achieves a better aroximation ratio in the one dimension where = 1, while Segments2Levels is better in higher dimensions, and in general as long as 1:86. The second observation is that the clustering subroutine used by the ClusterSegments algorithm takes as inut only k oints, instead of n. This means that for small values of k, say small constants, one could ossibly aord to use a brute force algorithm in order to nd the otimal clustering. 
This would require time exonential in k (still constant however), but the aroximation ratio remains 5 indeendently of the dimension. 5. EPERIMENTAL EVALUATION 5.1 Exeriments on synthetic data To test the erformance of the methods we generated articial data sets that dislay the recurrent state behavior that is of interest to us. The data sets were generated as follows. First we created a Markov model with 5 states with transition robabilities between the states chosen randomly from in the interval 1 [0; 500 ], imlying that the average length of stay in a given state is about 250 stes. For the 0-1 data, each state i had a robability i of emitting a 1. We simulated the state transitions and emissions for n=5k, 10K, 15K and 20K stes. Given each oint in the sequence, we know what the generating robability of a 1 has been at that oint. When alying the algorithms for (k; h)-segmentation described in the revious sections, we obtain an estimate of the generated robabilities 2. We then comuted the average error between the true value of and the estimate roduced by the algorithms. The other articial data set was generated by having a real-valued hidden source x at each state. When the generating rocess was at that state, the generated oint was chosen randomly from the normal distribution with mean x and variance 2, with varying from 0.02 to 0.4. The results for these two sets of articial data are shown in Figures 1 and 2, resectively. One can see that the error is tyically very low, below 4% for the best algorithms, ClusterSegments and Iterative(CS). The Segments2Levels and the Iterative(SL) algorithms erform worse. This is only to be exected, as for 0-1 data the clustering algorithm rovides only two sources. The accuracy of ClusterSegments and Iterative(CS) solution imroves nicely when the size of the sequence grows. For normally distributed data, the error remains very small even for large variance. 
The number of states used does not have to be exactly the correct one, as seen in Figure 3. Note that one could also reconstruct the transition robabilities for sequences that are long enough. We obtained similar results for the case of 10 underlying states. Error Error on 0 1 data, 5 states Segments 2 Levels Iterative 1 Cluster Segments Iterative Timeseries length x 10 4 Figure 1: Average error of the estimated level from the generating level for articial 0-1 data as a function of the length of the inut sequence. 2 Note that given a segment S = (t 1; : : : ; t m) containing m 1 1s and m 0 Q = m? m 1 0s, the that maximizes the Bernoulli likelihood i t i (1? )1?t i is m1=m, i.e., the that minimizes the L 2 distance.

7 0.045 Error on Normal data, 5 states 0.12 Error on Normal data, 5 states Segments 2 Levels Iterative 1 Cluster Segments Iterative variance = 0.02 variance = 0.1 variance = Error Error Variance number of states (h) Figure 2: Average error of the estimated level from the generating level for normally distributed data as a function of the variance in the generated data. 5.2 Exeriments on genome data We used the sliding window technique to collect statistics from the human genome sequence, in articular from chromosome 22. Our window size was 500 kb, and the sliding ste was set to 50 kb. The known area of the chromosome is about 35 Mb, so our windowing arameters yield sequences of about 700 oints. Within each window we comute the of each of the w-letter words over A, C, G, T, for w = 2 and 3. Therefore, we obtain sequences of dimensions 16 and 64. The same windowing technique was used on data indicating the ositions of SNPs and genes on chromosome 22. For each window we comuted the of SNPs, the of genes, and the combined (2-d oints). For the sequences we described above, we used our method to search for segmentations with recurrent states. We alied the Bayesian information criterion (see [17, 10]) to search for the values of k and h yielding the best segmentation. Our results are shown in Figure 4. The results indicate that the (k; h)-segmentation methods roduce intuitively aealing results; the BIC-otimal number of segments is reasonably small and the number of sources is clearly smaller. The results can be viewed as showing that, with resect to the statistics used, there are segments in the chromosome that show the same characteristics. The statistics used for segmenting the sequences could, of course, be chosen dierently; the resented results are just examles. We are, however, exloring the relationshi of the obtained segments with, e.g., the human-mouse synteny mas. 6. 
6. CONCLUSIONS

We have introduced the (k, h)-segmentation problem, where the task is to find a segmentation of a sequence into k segments stemming from h different sources. The problem is NP-hard in the general case. We gave simple approximation algorithms for the problem, proved bounds on the approximation quality, and gave empirical results on real and artificial data.

Figure 3: Average error of the estimated level from the generating level for normally distributed data as a function of the number h of states used by the algorithm.

Several open problems remain. On the theoretical side the foremost is finding out whether the (k, h)-segmentation problem is NP-hard for 1-dimensional data. The analysis of the approximation quality seems fairly tight, but it would be interesting to know whether the results can be improved. In this paper we gave only simple examples of applications of the method. For genomic analyses the main task is to select good characteristics over which the segmentation problem is then solved. The ones we used, gene and SNP density and the distribution of short words, are just examples: basically any characteristic of small blocks of sequences can be used. An interesting issue is measuring the correlation between two segmentations.

7. REFERENCES

[1] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-medians and related problems. In ACM Symposium on Theory of Computing, pages 106-113.
[2] R. K. Azad, J. S. Rao, W. Li, and R. Ramaswamy. Simplifying the mosaic description of DNA sequences. Physical Review E, 66.
[3] R. Bellman. On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6).
[4] K. Bennett. Determination of the number of zones in a biostratigraphical sequence. New Phytol., 132:155-170.
[5] A. Cantoni. Optimal curve fitting with piecewise linear functions. IEEE Transactions on Computers, C-20(1):59-67.
[6] G. A. Churchill. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51:79-94, 1989.

Figure 4: BIC-optimal (k, h)-segmentations for sequences of frequencies of genes, SNPs, combined genes and SNPs, 2-words, and 3-words. Each sequence has about 700 points; the dimensions are 1, 1, 2, 16, and 64, respectively. The numbers labeling each segment indicate the recurrent states. (Panel labels: Gene, k=22, h=5; SNP, k=12; Gene/SNP, k=25; 2-word, k=26; 3-word, k=24. Horizontal axis: position on chromosome 22.)

[7] J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmaki, and H. T. Toivonen. Time series segmentation for context recognition in mobile devices. In The 2001 IEEE International Conference on Data Mining (ICDM'01), pages 203-210.
[8] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry, pages 10-18.
[9] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In IEEE International Conference on Data Mining, pages 289-296.
[10] W. Li. DNA segmentation as a model selection process. In RECOMB 2001, pages 204-210.
[11] J. Liu and C. Lawrence. Bayesian inference on biopolymer models. Bioinformatics, 15(1):38-52.
[12] N. Megiddo and K. J. Supowit. On the complexity of some common geometric location problems. SIAM Journal on Computing, 13(1):182-196.
[13] N. Megiddo, E. Zemel, and S. L. Hakimi. The maximum coverage location problem. SIAM Journal on Algebraic and Discrete Methods, 4:253-261.
[14] A. Pavlicek, J. Paces, O. Clay, and G. Bernardi. A compact view of isochores in the draft human genome sequence. FEBS Letters, 511:165-169.
[15] V. Ramensky, V. Makeev, M. Roytberg, and V. Tumanyan. DNA segmentation through the Bayesian approach. Journal of Computational Biology, 7(1/2):215-231.
[16] M. Salmenkivi, J. Kere, and H. Mannila. Genome segmentation using piecewise constant intensity models and reversible jump MCMC. In European Conference on Computational Biology (ECCB2003). To appear.
[17] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 7(2):461-464.


More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

B8.1 Martingales Through Measure Theory. Concept of independence

B8.1 Martingales Through Measure Theory. Concept of independence B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Fig. 4. Example of Predicted Workers. Fig. 3. A Framework for Tackling the MQA Problem.

Fig. 4. Example of Predicted Workers. Fig. 3. A Framework for Tackling the MQA Problem. 217 IEEE 33rd International Conference on Data Engineering Prediction-Based Task Assignment in Satial Crowdsourcing Peng Cheng #, Xiang Lian, Lei Chen #, Cyrus Shahabi # Hong Kong University of Science

More information

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning TNN-2009-P-1186.R2 1 Uncorrelated Multilinear Princial Comonent Analysis for Unsuervised Multilinear Subsace Learning Haiing Lu, K. N. Plataniotis and A. N. Venetsanooulos The Edward S. Rogers Sr. Deartment

More information

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1)

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1) CERTAIN CLASSES OF FINITE SUMS THAT INVOLVE GENERALIZED FIBONACCI AND LUCAS NUMBERS The beautiful identity R.S. Melham Deartment of Mathematical Sciences, University of Technology, Sydney PO Box 23, Broadway,

More information

On the Chvatál-Complexity of Knapsack Problems

On the Chvatál-Complexity of Knapsack Problems R u t c o r Research R e o r t On the Chvatál-Comlexity of Knasack Problems Gergely Kovács a Béla Vizvári b RRR 5-08, October 008 RUTCOR Rutgers Center for Oerations Research Rutgers University 640 Bartholomew

More information

Improvement on the Decay of Crossing Numbers

Improvement on the Decay of Crossing Numbers Grahs and Combinatorics 2013) 29:365 371 DOI 10.1007/s00373-012-1137-3 ORIGINAL PAPER Imrovement on the Decay of Crossing Numbers Jakub Černý Jan Kynčl Géza Tóth Received: 24 Aril 2007 / Revised: 1 November

More information

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material Robustness of classifiers to uniform l and Gaussian noise Sulementary material Jean-Yves Franceschi Ecole Normale Suérieure de Lyon LIP UMR 5668 Omar Fawzi Ecole Normale Suérieure de Lyon LIP UMR 5668

More information

Combinatorics of topmost discs of multi-peg Tower of Hanoi problem

Combinatorics of topmost discs of multi-peg Tower of Hanoi problem Combinatorics of tomost discs of multi-eg Tower of Hanoi roblem Sandi Klavžar Deartment of Mathematics, PEF, Unversity of Maribor Koroška cesta 160, 000 Maribor, Slovenia Uroš Milutinović Deartment of

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

The inverse Goldbach problem

The inverse Goldbach problem 1 The inverse Goldbach roblem by Christian Elsholtz Submission Setember 7, 2000 (this version includes galley corrections). Aeared in Mathematika 2001. Abstract We imrove the uer and lower bounds of the

More information

A SIMPLE PLASTICITY MODEL FOR PREDICTING TRANSVERSE COMPOSITE RESPONSE AND FAILURE

A SIMPLE PLASTICITY MODEL FOR PREDICTING TRANSVERSE COMPOSITE RESPONSE AND FAILURE THE 19 TH INTERNATIONAL CONFERENCE ON COMPOSITE MATERIALS A SIMPLE PLASTICITY MODEL FOR PREDICTING TRANSVERSE COMPOSITE RESPONSE AND FAILURE K.W. Gan*, M.R. Wisnom, S.R. Hallett, G. Allegri Advanced Comosites

More information

Optimism, Delay and (In)Efficiency in a Stochastic Model of Bargaining

Optimism, Delay and (In)Efficiency in a Stochastic Model of Bargaining Otimism, Delay and In)Efficiency in a Stochastic Model of Bargaining Juan Ortner Boston University Setember 10, 2012 Abstract I study a bilateral bargaining game in which the size of the surlus follows

More information

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Location of solutions for quasi-linear elliptic equations with general gradient dependence

Location of solutions for quasi-linear elliptic equations with general gradient dependence Electronic Journal of Qualitative Theory of Differential Equations 217, No. 87, 1 1; htts://doi.org/1.14232/ejqtde.217.1.87 www.math.u-szeged.hu/ejqtde/ Location of solutions for quasi-linear ellitic equations

More information

Genetic Algorithms, Selection Schemes, and the Varying Eects of Noise. IlliGAL Report No November Department of General Engineering

Genetic Algorithms, Selection Schemes, and the Varying Eects of Noise. IlliGAL Report No November Department of General Engineering Genetic Algorithms, Selection Schemes, and the Varying Eects of Noise Brad L. Miller Det. of Comuter Science University of Illinois at Urbana-Chamaign David E. Goldberg Det. of General Engineering University

More information

2x2x2 Heckscher-Ohlin-Samuelson (H-O-S) model with factor substitution

2x2x2 Heckscher-Ohlin-Samuelson (H-O-S) model with factor substitution 2x2x2 Heckscher-Ohlin-amuelson (H-O- model with factor substitution The HAT ALGEBRA of the Heckscher-Ohlin model with factor substitution o far we were dealing with the easiest ossible version of the H-O-

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

LEIBNIZ SEMINORMS IN PROBABILITY SPACES

LEIBNIZ SEMINORMS IN PROBABILITY SPACES LEIBNIZ SEMINORMS IN PROBABILITY SPACES ÁDÁM BESENYEI AND ZOLTÁN LÉKA Abstract. In this aer we study the (strong) Leibniz roerty of centered moments of bounded random variables. We shall answer a question

More information