Finding recurrent sources in sequences

Size: px
Start display at page:

Download "Finding recurrent sources in sequences"

Transcription

1 Finding recurrent sources in sequences Aristides Gionis Deartment of Comuter Science Stanford University Stanford, CA, 94305, USA Heikki Mannila HIIT Basic Research Unit Deartment of Comuter Science University of Helsinki P.O. Box 26, Teollisuuskatu 23 FIN Helsinki, Finland ABSTRACT Many genomic sequences and, more generally, (multivariate) time series dislay tremendous variability. However, often it is reasonable to assume that the sequence is actually generated by or assembled from a small number of sources, each of which might contribute several segments to the sequence. That is, there are h hidden sources such that the sequence can be written as a concatenation of k > h ieces, each of which stems from one of the h sources. We dene this (k; h)- segmentation roblem and show that it is NP-hard in the general case. We give aroximation algorithms achieving aroximation ratios of 3 for the L 1 error measure and 5 for the L 2 error measure, and generalize the results to higher dimensions. We give emirical results on real (chromosome 22) and articial data showing that the methods work well in ractice. 1. INTRODUCTION Many genomic sequences and, more generally, (multivariate) time series dislay tremendous variability. In many cases it is interesting to nd out whether the sequence can be viewed as consisting of segments coming from a small number of dierent sources. That is, we want to examine whether there are h hidden sources such that the sequence T can be written as a concatenation of k > h ieces each of which can be viewed as coming from one of the h sources. For genomic analyses, the segments might be viral or microbial inserts, stemming from a small number of ossible sources. For instance, Azad et al [2] try to identify a coarse{ grained descrition of the given DNA string in terms of a smaller set of distinct domain labels. The biological hyothesis behind this is that a mosaic organization of DNA Suorted by a Microsoft Research Fellowshi. 
This work was done while the author was visiting the HIIT Basic Research Unit, Deartment of Comuter Science, University of Helsinki, Finland. Permission to make digital or hard coies of all or art of this work for ersonal or classroom use is granted without fee rovided that coies are not made or distributed for rofit or commercial advantage and that coies bear this notice and the full citation on the first age. To coy otherwise, to reublish, to ost on servers or to redistribute to lists, requires rior secific ermission and/or a fee. RECOMB 03, Aril 10 13, 2003, Berlin, Germany. Coyright 2003 ACM /03/ $5.00. sequences could have originated from the insertion of fragments of one genome (the arasite) inside another (the host). For time series tye of data, the segments would reect the generating rocess having h dierent states, with arbitrary (but relatively rare) transitions between the states. The segmentation roblem for genome sequences and time series has been discussed widely; see, e.g., [10, 3, 7, 9, 5, 16, 14, 15, 11, 4, 2]. For a wide variety of score functions, dynamic rogramming can be used to obtain the best segmentation of an n-element sequence into k ieces in time O(n 2 k); this idea goes back at least to Bellman [3] in The roblem with this aroach is that the descritions for each segment are indeendent. That is, we get as many sources as there are segments. In many alcations it makes more sense to assume that the number of sources is (much) smaller than the number of segments. A comlex dynamic rocess can have a small set of underlying states, and the number of state transitions is far larger than the number of states. Similarly, a genome can have an evolutionary structure with only a few dierent sources, each source contributing several segments. Our roblem is thus to nd a good way of segmenting an n-element sequence into k segments, each of which comes from one of h dierent sources. We call this the (k; h)- segmentation roblem. 
The normal dynamic rogramming aroaches to sequence segmentation ignore the constraint of having less sources than segments and solve the (k; k)- segmentation roblem. When h < k, the (k; h)-segmentation roblem cannot be solved using that method: the restriction to h dierent sources means that decisions made early in the sequence have an imact later, invalidating the otimality condition needed for dynamic rogramming. The (k; h)-segmentation roblem turns out to be algorithmically quite interesting. Formally, the (k; h)-segmentation roblem is dened as follows. Suose we are given a sequence T = (t 1; : : : ; t n), where t i 2. A segmentation M of T is dened by k + 1 segment boundaries 1 = b 1 < b 2 < < b k < b k+1 = n + 1, yielding segments S 1; : : : ; S k where S i = (t bi ; : : : ; t bi +1?1). Denote the ossible sources (or states of the rocess) by?. Given a source 2? and a segment S i, denote by P (S i j ) the likelihood that the sequence S i comes from source. The roblem is to nd h sources 1; : : : ; h, a decomosition of T into k segments S 1; : : : ; S k, and for each segment S i an assignment of a source ji 2 f 1; : : : ; hg such that the total robability Q k P (Si j j i) is maximized.

2 The above formulation is in terms of an arbitrary likelihood function. We mostly consider the case of sequence of oints from R d and the roblem of nding (k; k)-segmentations for them. We next discuss how the two formulations are related. If the sequence consists of n oints in R d, we can take? to be R d, and say that the loglikelihood of an observation t i stemming from source (a oint in R d ) is roortional to?jjt i?jj 2. Then, given an interval [a; b], the loglikelihood of the subsequence S[a; b] is roortional to? P b i=a jjti? jj2. Thus the best source for the interval is the one that maximizes the robability P (S[a; b] j ji ), or equivalently the P b minimizer of the sum of squares error i=a jjti? jj2. The latter turns out to be the mean of the data oints in the interval S[a; b]. Denote by V ar(s i) the variance of segment S i, and by js ij the length (number of oints) of S i. Then the (k; k)- segmentation roblem is equivalent to nding the k segments P k such that the sum js ijv ar(s i) is minimized; the best source for each segment is simly the mean of the segment, and the error er oint is the variance of the segment. Instead of the L 2-metric, we can of course consider arbitrary L -metrics. For each segment, the (k; k)-segmentation roblem can use the source that minimizes the error, whereas in the (k; h)-segmentation roblem with h < k several segments have to use the same source. While the (k; k)-segmentation roblem has been considered often, to the best of our knowledge the roblem of (k; h)-segmentation has not been studied extensively. In 1989 Churchill [6] stated a related roblem for the urose of artitioning genomic sequences into segments. In his formulation he uses hidden Markov states to model dierent comositional roerties within each DNA segment, and his solution uses the Viterbi algorithm to determine the most robable sequence of states. 
More recently, Azad et al [2] formulate a similar roblem to (k; h)-segmentation but their solution is based on greedy \slit and merge" and it does not rovide any theoretical guarantee. An alternative way to view the (k; h)-segmentation roblem is as a clustering roblem with added constraints: the task is to cluster the oints into h clusters, with the restriction that along the dimension of the sequence the cluster may change at most k? 1 times. In this aer we study the (k; h)-segmentation roblem and aly it to several domains. First, we show that the (k; h)-segmentation for sequences of oints from R d is NPhard for d > 1, both for the L 1 and L 2 metrics. For the case d = 1, it is not known whether the roblem is NP-hard or not. Nevertheless, for d = 1 we give a 3-aroximation algorithms for the L 1 error metric and a 5-aroximation algorithm for the L 2 error metric. For general d, we get a 3 + aroximation ratio for L 1 and + 2 for L 2, where is the best aroximation factor for the k-means clustering roblem. These aroximation algorithms are simle and easy to imlement, while the analysis is slightly comlex. The methods are based on using instances of dynamic rogramming for the olynomially solvable cases of the roblem. The algorithms can be alied to any a olynomial-time comutable likelihood function P (S j ), as long as we can, given a segment S from the sequence, nd the source maximizing P (S j ); the details of the aroximation guarantees deend on the exact form of the likelihood function. We give emirical data both for articial and for real sequences. The articial data is generated from a rocess with a small number of hidden states; even for 0-1 sequences, our algorithms roduce very good aroximations of the hidden states. A tyical emirical aroximation error is about 4%. It seems to us that one of the main alications of the (k; h)-segmentation roblem is in genomic sequences. As an examle, we study the distribution of w-letter words (e.g. 
w = 2; 3) in blocks of length 500 kb in the human genome. That is, given the genome sequence, we look at each 500 kb block in it, and comute for each block the of each of the w-letter words over A, C, G, T. These word frequencies then form our sequence of dimension 4 w (16 or 64), and we search for recurrent states in this sequence. We aly the Bayesian information criterion (BIC, see [17, 10]) to search for the values of k and h yielding the best segmentation. In addition, we aly our segmentation algorithms on the chromosome sequence that describes the densities of SNPs and genes in each block. We also examine combinations of both of the above densities giving to each an equal weight. The results show that the methods rovide useful segmentations that can be alied in the search for large scale structure in the genome (see Section 5.2). The rest of this aer is organized as follows. In Section 2 we dene the (k; h)-segmentation roblem formally, give some basic observations, and show that the roblem is NP-hard in general. In Section 3 we give three algorithms for the roblem and describe the aroximability results; the roofs are in Section 4. Exerimental results are described in Section 5, and Section 6 is a short conclusion. 2. PRELIMINARIES As noted in the introduction, our algorithms can be alied to nding otimal (k; h)-segmentations for any likelihood function satisfying the following two mild conditions: First, for every set of oints coming from a single source we should be able to comute the otimal source that maximizes the likelihood function. Second, the total likelihood for a segmentation of the whole sequence should be an associative combination function (e.g. sum or roduct) of the individual likelihoods on each segment. For simlicity of notation, we consider in the sequel the case of multidimensional real-valued sequences with the L distances. Let T = (t 1; : : : ; t n) be a sequence with n data oints from R d. 
The L distance between any two data oints t i and t j of the sequence is dened as d (t i; t j) = ( d k=1 jt ik? t jkj ) 1 where (t i1; : : : ; t id) and (t j1; : : : ; t jd) are the coordinates of the two d-dimensional oints t i and t j. The most commonly used distances are the L 1 (Manhattan), the L 2 (Euclidean), and the L1 (L-innity). In the one dimension all of the above distances are simly jt i? t jj. In the rest of the aer when the subscrit is ommited the Euclidean distance is imlied, i.e. d(t i; t j) = d 2(t i; t j), but most of our algorithms can be easily modied so that other distance functions can also be used. A k-segmentation M of a sequence T is dened by k + 1 indices fb 1; b 2; : : : ; b k; b k+1g that divides T into k segments, i.e., with 1 = b 1 < b 2 < : : : < b k < b k+1 = n + 1. We write M 2 Segm k (T ) to denote that M is a k-segmentation of T. A k-segmentation M is also described by the sequence of

3 segments to which it artitions the original sequence, that is, M = (S 1; : : : ; S k), where S i = (t bi ; : : : ; t bi+1? 1) for all i = 1; : : : ; k. We write S 2 M for a segment in segmentation M, and t 2 S for a oint in segment S. Given the original sequence T, our goal is to nd a segmentation M such that each of the segments S 2 M is homogeneous with resect to an error measure. Two natural choices for error measures within a segment S are the 1-error measure: E 1(S) = min and the 2-error measure: x2r d d(t; x) s E 2(S) = min d(t; x) 2 x2r d Minimizing these error measures can also be obtained by reresenting the segment S with a single reresentative t S, chosen so that it minimizes the error measure. In other words, E 1(S) = P d(t; t S) and E 2 2 (S) = P d(t; t S) 2. It is easy to see that for E 1(S), the reresentative t S will be the median of the oints in S, while for E 2(S) it will be the mean of the oints in S. We write E () when we don't want to distinguish between E 1() and E 2(). The traditional sequence segmentation roblem can now be formalized as follows. Problem 1 ((k; k)-segmentation ). For a sequence T, a k-segmentation M of T, and = 1; 2, we dene the -error of a segmentation M to be E (T; M) = E(S)! 1 = d(t; t S)! 1 The task is to nd the segmentation M that minimizes the above error measure. The minimum error is denoted by Seg (T; k), i.e. Seg (T; k) = min M2Segm k (T ) E(T; M) This roblem can, of course, be solved in olynomial time using dynamic rogramming as shown in 1961 by Bellman [3]. As noted in the introduction, the drawback in this formulation is that the segments are comletely indeendent: we have no way of referring solutions that would use only a few tyes of segments. That is, in (k; k)-segmentation, the reresentative t S can take a distinct value for each segment S. We would like to restrict the segment reresentatives to use at most h dierent values. 
Given a set of L = fl 1; : : : ; l hg of h values and a segmentation M, the minimal error for segment S 2 M is the l i that minimizes the error measure, i.e., E (SjL) = min l2l d(t; l) : The minimizer of the error measure in the set L for a segment S is denoted by l S. The roblem of nding recurrent states in sequences is formalized as follows. Problem 2 ((k; h)-segmentation ). Given a sequence T, a k-segmentation M of T, a set L of h values, and = 1; 2, the -error of the segmentation M with resect to L is! 1! 1 E (T; M; L) = E(SjL) = d(t; l S) The task is to nd the segmentation M and the set of values L that minimizes E (T; M; L). The minimum error achieved is denoted by Seg (T; k; h), i.e. Seg (T; k; h) = min E(T; M; L) M2Segm k (T ); L:jLj=h The (k; k)-segmentation roblem is a secial case of (k; h)- segmentation, where the value of h is left unconstrained. This means the following. Observation 1. For a sequence T and any numbers k and h we have Seg (T; k) Seg (T; k; h). Also notice that for xed h the value of Seg (T; k; h) decreases as k increases. Since the largest value of k is n, we get the following Observation 2. For a sequence T of size n and any k and h we have Seg (T; n; h) Seg (T; k; h). The case of k = n, in Observation 2, can be viewed as ignoring the constraint that some consecutive oints in the sequence should be reresented by the same value. Thus, the (n; h)-segmentation roblem is the well-studied clustering roblem. For = 1 it is the Euclidean k-median roblem, and for = 2 it is k-means roblem of nding the k oints such that the sum of distances to the closest oint is minimized. 1 Both of these roblems can be solved in olynomial time for 1-dimensional data [13], and both are NP{hard for dimensions d 2 [12]. Lemma 1. For = 1; 2, and dimension d 2, the roblem (k; h)-segmentation is NP{hard. Proof. 
For = 1 and = 2, the k-median and k- means roblems, resectively, are secial cases of (k; h)- segmentation with k = n ALGORITHMS In this section we describe three algorithms for solving the (k; h)-segmentation roblem. The algorithms are based on solving the olynomial sub-cases and combining the results in dierent ways. 3.1 Segments2Levels algorithm The rst of our algorithms for the (k; h)-segmentation roblem is called Segments2Levels. The algorithm is as follows. First solve the (k; k)-segmentation roblem to obtain a k-segmentation M. Then solve the (n; h)-segmentation roblem to obtain a set L of h level values. Then assign each segment S 2 M to the level in L which is closest to t S. We show in Section 4 that this simle algorithm rovides a 3-aroximation to the (k; h)-segmentation roblem for = 1 and for = 2, for dimension 1. For higher dimensions, the aroximation guarantees are 3 + for = 1, and Note that \k-means" is often used for one aroximate algorithm for this roblem.

4 for = 2, where is the best aroximation factor for the k-means roblem. The algorithm runs in time O(n 2 (k + h)), as the running time of the dynamic rogramming method is quadratic in the number of oints and linear in the number of segments. 3.2 ClusterSegments algorithm The next algorithm, called ClusterSegments, yields aroximation rations 5 and 5, for the 1-error and 2-error measures, resectively. The rst ste of the algorithm is the same with the rst ste of Segments2Levels; it solves (k; k)-segmentation and decides about the segments of the solution. Then, each segment S in the otimal segmentation of (k; k)-segmentation is reresented with the single value t S and weight jsj. A set of h levels L is roduced by clustering those k weighted oints into h clusters. The nal solution is roduced by assigning each segment S to the level l S 2 L which is the closest to t S among all levels in L. The running time of the method is again O(n 2 (k + h)). 3.3 Iterative algorithm Our third algorithm is insired by the EM algorithm and it is based on the following observations. 1. If we know the set L of levels that are to be used, then we can nd the k otimal segments. This follows from simle alication of dynamic rogramming. This ste corresonds to the E-ste of the EM algorithm. 2. Given an existing segmentation we can readjust the values of the levels and ossibly imrove the value of the solution. This is done as the M-ste of the EM algorithm: for each level we obtain a ossibly better value by comuting the mean or the median (deending on the error that we are otimizing for) of all the oints in the segments that are assigned to this level. The Iterative algorithm uses these observations. It starts from an initial segmentation obtained by, e.g., either of the revious algorithms. Then, given the current levels, the otimal segmentation for those levels is found using the above observation. After that, a new set of h levels is comuted by using the second observation. 
These two stes are iterated until the error does not decrease any more. We denote by Iterative(SL) the version that starts with the outut of Segments2Levels, and by Iterative(CS) the version that starts with the outut of ClusterSegments. The Iterative algorithm roduces at least as good aroximations as the revious two methods. The running time of the algorithm is O(In 2 (k + h)), where I is the number of iterations. In most of our exeriments the initial solutions rovided by Segments2Levels and ClusterSegments were very good, so the number of iterations remained below 3 or APPROIMATION GUARANTEES In this section we rove the aroximation guarantees for the Segments2Levels and ClusterSegments algorithms. For the Iterative algorithm we only know that it never is worse that the other two methods. The emirical results in the next section show that in ractice the algorithms work far better than what the worst-case bounds suggest. 4.1 Analysis of the Segments2Levels algorithm Theorem 1. Consider a 1-dimensional sequence T and integers k and h. Let Ot 1 = Seg 1(T; k; h) be the value of the otimal segmentation for the 1-error measure. Denote by SL 1(T; k; h) be the value of the solution rovided by the algorithm Segments2Levels. Then SL 1(T; k; h) 3Ot 1. Proof. Recall from the descrition of the algorithm in Section 3.1, that there are two initial stes. The rst ste is to solve the (n; h)-segmentation 1 roblem. Since we are in the 1-dimensional case, this can be solved in olynomial time using dynamic rogramming. Thus we obtain a set of levels L for which we have Seg 1(T; n; h) Ot 1. The second ste is to solve the (k; k)-segmentation 1, again by dynamic rogramming. So, we obtain a k-segmentation M, for which Seg 1(T; k) Ot 1. Now consider a data oint t of the sequence. Let S be the segment in M which oint t belongs to, and let l S 2 L be the level which segment S is assigned to by Segments2Levels. 
Also dene l t 2 L to be the value which oint t is assigned to by the the solution of (n; h)-segmentation 1. The error that oint t contributes to the total error is d(t; l S) and by triangle inequality d(t; l S) d(t; t S) + d(t S; l S) where, as before, t S is the median of all oints in segment S. However, since l S is the closest to t S, among all levels in L, we have d(t S; l S) d(t S; l t) and with a second alication of the triangle inequality d(t S; l t) d(t S; t) + d(t; l t) Combining the revious inequalities gives d(t; l S) 2d(t; t S) + d(t; l t) and summing over all oints in T we obtain the desired result: SL 1(T; k; h) = d(t; l S) 2 d(t; t S) + d(t; l t) 2 Seg 1(T; k) + Seg 1(T; n; h) 3 Ot 1: Theorem 2. Let T be a 1-dimensional sequence and let k and h be integers. Let Ot 2 = Seg 2(T; k; h) be the value of the otimal segmentation for (k; h)-segmentation 2 roblem for the = 2 error measure. Let SL 2(T; k; h) be the value of the solution rovided by the algorithm Segments2Levels. Then SL 2(T; k; h) 3 Ot 2. Proof. The roof is very similar to the one for the 1- error measure. Again in the 1-dimensional case the roblem (n; h)-segmentation 2 can be solved in olynomial time, so Seg 2(T; n; h) Ot 2. Also by solving (k; k)-segmentation 2, we obtain segmentation M for which Seg 2(T; k) Ot 2. 2

5 Using the same notation as in Theorem 1 we have that the error that a single oint t contributes to the square of the total error is d(t; l S) 2 (d(t; t S) + d(t S; l S)) 2 Combining with d(t S; l S) d(t S; t) + d(t; l t) we get d(t; l S) 2 4d(t; t S) 2 + d(t; l t) 2 + 4d(t; t S)d(t; l t) and summing over all oints in T we obtain SL 2(T; k; h) 2 = d(t; l S) 2 4 d(t; t S) 2 + d(t; l t) d(t; t S) 2 + d(t; l t) 2 +4 s s d(t; t S) 2 d(t; l t) 2 4 Seg 2(T; k) 2 + Seg 2(T; n; h) 2 +4 Seg 2(T; k) Seg 2(T; n; h) = 9 Ot 2 2 d(t; t S)d(t; l t) (by Cauchy-Schwarz) which, by taking square roots, rovides the claimed aroximation ratio Higher dimensions The only lace that the 1-dimensionality is used in Theorems 1 and 2 is in solving exactly the clustering roblem (n; h)-segmentation, and claiming that Seg (T; n; h) Ot. In higher dimensions, as we already mentioned, those clustering roblems become NP{hard. However, we can aroximate the solutions by olynomial algorithms. The k- median roblem (i.e., (n; h)-segmentation 1) admits a PTAS [1], so we can nd a set of levels L for which Seg 1(T; n; h) (1 + )Ot 1. Plugging this in the roof of Theorem 1 we obtain an aroximation ratio of 3 + for the 1-error measure. Similarly, k-means can be aroximated in olynomial time. Calling the best aroximation ratio for k-means, and using that in the roof of Theorem 2 we obtain an aroximation ratio of + 2 for the 2-error measure. The current is 9 + by Kanungo et al [8], yielding a factor of 11 + to our roblem. The clustering algorithms (for (n; h)- segmentation) used to obtain these aroximability results for higher dimensions might not be very ecient in ractice. 4.2 Analysis of the ClusterSegments algorithm For analyzing the erformance of the ClusterSegments algorithm we need the following lemmas. The rst is a weaker version of the triangle inequality in the case of the square of distances between oints. Lemma 2 (Double triangle inequality). 
For oints x, y, and z we have d(x; y) 2 2 d(x; z) d(z; y) 2 Proof. The lemma can be shown with a simle alication of the triangle inequality, i.e., d(x; y) 2 (d(x; z) + d(z; y)) 2 = d(x; z) 2 + d(z; y) d(x; z)d(z; y) 2 d(x; z) d(z; y) 2 since 2 d(x; z)d(z;y) d(x; z) 2 + d(z; y) 2. 2 The next lemma is a bias-variance result, which we state and rove for comleteness. Lemma 3. Let ft 1; : : : ; t mg be a set of m oints and let t be their mean, i.e. the minimizer of the 2-error measure. For any oint x it is true that m d(t; x) 2 m d(t i; x) 2 = m d(t; x) 2 + m d(t i; t) 2 Proof. We will only show the equality art of the lemma, since this imlies the inequality, as well. For simlicity we will assume that the oints are one-dimensional, i.e. d(x; y) = jx? yj. The extension to the multidimensional case can be done by alying the equality in each coordinate searately and summing P in all coordinates. By using m the denition of the mean ti = m t, we have m(t? x) 2 + m (t i? t) 2 = mt 2 + mx 2? 2mxt + = mx 2? 2mxt + m m t 2 i m t 2 i + mt 2? 2mt = (t i? x) 2 2 Now we show that the solution found by ClusterSegments is within a factor of 5 of the otimal solution for the 2-error measure. Using a similar line of arguments, one can show that for the 1-error the aroximation ratio of ClusterSegments is 5. Since we have already shown a ratio of 3 for that case using the Segments2Levels algorithm, we omit the analysis. Theorem 3. For a sequence T, numbers k and h, and the 2-error measure, let Ot 2 = Seg 2(T; k; h) be the value of the otimal segmentation for (k; h)-segmentation 2. If CS 2(T; k; h) is the value of the solution rovided by the algorithm ClusterSegments, then CS 2(T; k; h) 5 Ot 2. Proof. By solving (k; k)-segmentation 2 we obtain a segmentation M for which Seg 2(T; k) Ot 2. Let L be the set of h oints found by clustering the k weighted oints t S, for S 2 M. Let l S be the cluster center oint which segment S is assigned to. 
Now, the error contributed by a secic segment S to CS 2(T; k; h) 2 is P d(t; ls)2. Summing over all segments, and alying Lemma 3 we obtain CS 2(T; k; h) 2 = = = d(t; l S) 2 jsj d(t S; l S) 2 + d(t; t S) 2! jsj d(t S; l S) 2 + Seg 2(T; k) 2 jsj d(t S; l S) 2 + Ot 2 2 m t i

6 P The next ste is to bound the term jsj d(t S; l S) 2. Consider the set of otimal level oints M for the original roblem. For each segment S, let S be the oint in M which is the closest to t S. For the 1-dimensional case, the clustering of the k weighted oints can be done otimally, by dynamic rogramming. So by claiming otimality for L we get jsj d(t S; l S) 2 jsj d(t S; S) 2 Assume that the otimal segmentation artitions a segment S of M into r subsegments S j, each of them having size s j and mean t j. Also assume that the otimal solution assigns each of those segments at a level j 2 M, for j = 1; : : : ; r. Since S is the closest to t S, we have jsj d(t S; S) r s j d(t S; j) 2 r r s j d(t S; t j) 2 + j d(t; t S) r s j d(t j; j) 2! (by Lemma 2) r = 2 d(t; t S) d(t; t) 2 j d(t; j) 2 (by Lemma 3) where, by t we denote the level to which the oint t is assigned by the otimal solution. Summing P over all segments in M, it is clear that the term jsj d(t S; S) 2 is bounded by 4 Ot 2 2. Combining the above inequalities we have CS 2(T; k; h) 2 5 Ot 2 2, which yields the desired bound. 2 Similarly with the Segments2Levels algorithm, the roof can be adated for higher dimensions giving ratios of 5 + for = 1, and for = 2. As before, is the best aroximation factor for the k-means roblem. Two oints should be discussed here regarding the extension of the algorithms in higher dimensions. First, one should note that ClusterSegments achieves a better aroximation ratio in the one dimension where = 1, while Segments2Levels is better in higher dimensions, and in general as long as 1:86. The second observation is that the clustering subroutine used by the ClusterSegments algorithm takes as inut only k oints, instead of n. This means that for small values of k, say small constants, one could ossibly aord to use a brute force algorithm in order to nd the otimal clustering. 
This would require time exonential in k (still constant however), but the aroximation ratio remains 5 indeendently of the dimension. 5. EPERIMENTAL EVALUATION 5.1 Exeriments on synthetic data To test the erformance of the methods we generated articial data sets that dislay the recurrent state behavior that is of interest to us. The data sets were generated as follows. First we created a Markov model with 5 states with transition robabilities between the states chosen randomly from in the interval 1 [0; 500 ], imlying that the average length of stay in a given state is about 250 stes. For the 0-1 data, each state i had a robability i of emitting a 1. We simulated the state transitions and emissions for n=5k, 10K, 15K and 20K stes. Given each oint in the sequence, we know what the generating robability of a 1 has been at that oint. When alying the algorithms for (k; h)-segmentation described in the revious sections, we obtain an estimate of the generated robabilities 2. We then comuted the average error between the true value of and the estimate roduced by the algorithms. The other articial data set was generated by having a real-valued hidden source x at each state. When the generating rocess was at that state, the generated oint was chosen randomly from the normal distribution with mean x and variance 2, with varying from 0.02 to 0.4. The results for these two sets of articial data are shown in Figures 1 and 2, resectively. One can see that the error is tyically very low, below 4% for the best algorithms, ClusterSegments and Iterative(CS). The Segments2Levels and the Iterative(SL) algorithms erform worse. This is only to be exected, as for 0-1 data the clustering algorithm rovides only two sources. The accuracy of ClusterSegments and Iterative(CS) solution imroves nicely when the size of the sequence grows. For normally distributed data, the error remains very small even for large variance. 
The number of states used does not have to be exactly the correct one, as seen in Figure 3. Note that one could also reconstruct the transition robabilities for sequences that are long enough. We obtained similar results for the case of 10 underlying states. Error Error on 0 1 data, 5 states Segments 2 Levels Iterative 1 Cluster Segments Iterative Timeseries length x 10 4 Figure 1: Average error of the estimated level from the generating level for articial 0-1 data as a function of the length of the inut sequence. 2 Note that given a segment S = (t 1; : : : ; t m) containing m 1 1s and m 0 Q = m? m 1 0s, the that maximizes the Bernoulli likelihood i t i (1? )1?t i is m1=m, i.e., the that minimizes the L 2 distance.

7 0.045 Error on Normal data, 5 states 0.12 Error on Normal data, 5 states Segments 2 Levels Iterative 1 Cluster Segments Iterative variance = 0.02 variance = 0.1 variance = Error Error Variance number of states (h) Figure 2: Average error of the estimated level from the generating level for normally distributed data as a function of the variance in the generated data. 5.2 Exeriments on genome data We used the sliding window technique to collect statistics from the human genome sequence, in articular from chromosome 22. Our window size was 500 kb, and the sliding ste was set to 50 kb. The known area of the chromosome is about 35 Mb, so our windowing arameters yield sequences of about 700 oints. Within each window we comute the of each of the w-letter words over A, C, G, T, for w = 2 and 3. Therefore, we obtain sequences of dimensions 16 and 64. The same windowing technique was used on data indicating the ositions of SNPs and genes on chromosome 22. For each window we comuted the of SNPs, the of genes, and the combined (2-d oints). For the sequences we described above, we used our method to search for segmentations with recurrent states. We alied the Bayesian information criterion (see [17, 10]) to search for the values of k and h yielding the best segmentation. Our results are shown in Figure 4. The results indicate that the (k; h)-segmentation methods roduce intuitively aealing results; the BIC-otimal number of segments is reasonably small and the number of sources is clearly smaller. The results can be viewed as showing that, with resect to the statistics used, there are segments in the chromosome that show the same characteristics. The statistics used for segmenting the sequences could, of course, be chosen dierently; the resented results are just examles. We are, however, exloring the relationshi of the obtained segments with, e.g., the human-mouse synteny mas. 6. 
6. CONCLUSIONS

We have introduced the (k, h)-segmentation problem, where the task is to find a segmentation of a sequence into k segments stemming from h different sources. The problem is NP-hard in the general case. We gave simple approximation algorithms for the problem, proved bounds on the approximation quality, and gave empirical results on real and artificial data.

Figure 3: Average error of the estimated level from the generating level for normally distributed data as a function of the number h of states used by the algorithm.

Several open problems remain. On the theoretical side the foremost is finding out whether the (k, h)-segmentation problem is NP-hard for 1-dimensional data. The analysis of the approximation quality seems fairly tight, but it would be interesting to know whether the results can be improved. In this paper we gave only simple examples of applications of the method. For genomic analyses the main task is to select good characteristics over which the segmentation problem is then solved. The ones we used, gene and SNP density and the distribution of short words, are just examples: basically any characteristic of small blocks of sequences can be used. An interesting issue is measuring the correlation between two segmentations.

7. REFERENCES

[1] S. Arora, P. Raghavan, and S. Rao. Approximation schemes for Euclidean k-medians and related problems. In ACM Symposium on Theory of Computing, pages 106-113.
[2] R. K. Azad, J. S. Rao, W. Li, and R. Ramaswamy. Simplifying the mosaic description of DNA sequences. Physical Review E, 66.
[3] R. Bellman. On the approximation of curves by line segments using dynamic programming. Commun. ACM, 4(6).
[4] K. Bennett. Determination of the number of zones in a biostratigraphical sequence. New Phytol., 132:155-170.
[5] A. Cantoni. Optimal curve fitting with piecewise linear functions. IEEE Transactions on Computers, C-20(1):59-67.
[6] G. A. Churchill. Stochastic models for heterogeneous DNA sequences. Bulletin of Mathematical Biology, 51:79-94, 1989.

Figure 4: BIC-optimal (k, h)-segmentations for sequences of frequencies of genes, SNPs, combined genes and SNPs, 2-words, and 3-words. Each sequence has about 700 points; the dimensions are 1, 1, 2, 16, and 64, respectively. The numbers labeling each segment indicate the recurrent states. (Panel labels: Gene, k=22, h=5; SNP, k=12; Gene/SNP, k=25; 2-word, k=26; 3-word, k=24. Horizontal axis: position on chromosome 22.)

[7] J. Himberg, K. Korpiaho, H. Mannila, J. Tikanmaki, and H. T. Toivonen. Time series segmentation for context recognition in mobile devices. In The 2001 IEEE International Conference on Data Mining (ICDM'01), pages 203-210.
[8] T. Kanungo, D. Mount, N. Netanyahu, C. Piatko, R. Silverman, and A. Wu. A local search approximation algorithm for k-means clustering. In Proceedings of the 18th Annual Symposium on Computational Geometry, pages 10-18.
[9] E. Keogh, S. Chu, D. Hart, and M. Pazzani. An online algorithm for segmenting time series. In IEEE International Conference on Data Mining, pages 289-296.
[10] W. Li. DNA segmentation as a model selection process. In RECOMB 2001, pages 204-210.
[11] J. Liu and C. Lawrence. Bayesian inference on biopolymer models. Bioinformatics, 15(1):38-52.
[12] N. Megiddo and K. J. Supowit. On the complexity of some common geometric location problems. SIAM Journal on Computing, 13(1):182-196.
[13] N. Megiddo, E. Zemel, and S. L. Hakimi. The maximum coverage location problem. SIAM Journal on Algebraic and Discrete Methods, 4:253-261.
[14] A. Pavlicek, J. Paces, O. Clay, and G. Bernardi. A compact view of isochores in the draft human genome sequence. FEBS Letters, 511:165-169.
[15] V. Ramensky, V. Makeev, M. Roytberg, and V. Tumanyan. DNA segmentation through the Bayesian approach. Journal of Computational Biology, 7(1/2):215-231.
[16] M. Salmenkivi, J. Kere, and H. Mannila. Genome segmentation using piecewise constant intensity models and reversible jump MCMC. In European Conference on Computational Biology (ECCB2003). To appear.
[17] G. Schwarz. Estimating the dimension of a model. The Annals of Statistics, 7(2):461-464.


More information

COMMUNICATION BETWEEN SHAREHOLDERS 1

COMMUNICATION BETWEEN SHAREHOLDERS 1 COMMUNICATION BTWN SHARHOLDRS 1 A B. O A : A D Lemma B.1. U to µ Z r 2 σ2 Z + σ2 X 2r ω 2 an additive constant that does not deend on a or θ, the agents ayoffs can be written as: 2r rθa ω2 + θ µ Y rcov

More information

B8.1 Martingales Through Measure Theory. Concept of independence

B8.1 Martingales Through Measure Theory. Concept of independence B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Fig. 4. Example of Predicted Workers. Fig. 3. A Framework for Tackling the MQA Problem.

Fig. 4. Example of Predicted Workers. Fig. 3. A Framework for Tackling the MQA Problem. 217 IEEE 33rd International Conference on Data Engineering Prediction-Based Task Assignment in Satial Crowdsourcing Peng Cheng #, Xiang Lian, Lei Chen #, Cyrus Shahabi # Hong Kong University of Science

More information

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning

Uncorrelated Multilinear Principal Component Analysis for Unsupervised Multilinear Subspace Learning TNN-2009-P-1186.R2 1 Uncorrelated Multilinear Princial Comonent Analysis for Unsuervised Multilinear Subsace Learning Haiing Lu, K. N. Plataniotis and A. N. Venetsanooulos The Edward S. Rogers Sr. Deartment

More information

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1)

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1) CERTAIN CLASSES OF FINITE SUMS THAT INVOLVE GENERALIZED FIBONACCI AND LUCAS NUMBERS The beautiful identity R.S. Melham Deartment of Mathematical Sciences, University of Technology, Sydney PO Box 23, Broadway,

More information

On the Chvatál-Complexity of Knapsack Problems

On the Chvatál-Complexity of Knapsack Problems R u t c o r Research R e o r t On the Chvatál-Comlexity of Knasack Problems Gergely Kovács a Béla Vizvári b RRR 5-08, October 008 RUTCOR Rutgers Center for Oerations Research Rutgers University 640 Bartholomew

More information

Improvement on the Decay of Crossing Numbers

Improvement on the Decay of Crossing Numbers Grahs and Combinatorics 2013) 29:365 371 DOI 10.1007/s00373-012-1137-3 ORIGINAL PAPER Imrovement on the Decay of Crossing Numbers Jakub Černý Jan Kynčl Géza Tóth Received: 24 Aril 2007 / Revised: 1 November

More information

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material

Robustness of classifiers to uniform l p and Gaussian noise Supplementary material Robustness of classifiers to uniform l and Gaussian noise Sulementary material Jean-Yves Franceschi Ecole Normale Suérieure de Lyon LIP UMR 5668 Omar Fawzi Ecole Normale Suérieure de Lyon LIP UMR 5668

More information

Combinatorics of topmost discs of multi-peg Tower of Hanoi problem

Combinatorics of topmost discs of multi-peg Tower of Hanoi problem Combinatorics of tomost discs of multi-eg Tower of Hanoi roblem Sandi Klavžar Deartment of Mathematics, PEF, Unversity of Maribor Koroška cesta 160, 000 Maribor, Slovenia Uroš Milutinović Deartment of

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

The inverse Goldbach problem

The inverse Goldbach problem 1 The inverse Goldbach roblem by Christian Elsholtz Submission Setember 7, 2000 (this version includes galley corrections). Aeared in Mathematika 2001. Abstract We imrove the uer and lower bounds of the

More information

A SIMPLE PLASTICITY MODEL FOR PREDICTING TRANSVERSE COMPOSITE RESPONSE AND FAILURE

A SIMPLE PLASTICITY MODEL FOR PREDICTING TRANSVERSE COMPOSITE RESPONSE AND FAILURE THE 19 TH INTERNATIONAL CONFERENCE ON COMPOSITE MATERIALS A SIMPLE PLASTICITY MODEL FOR PREDICTING TRANSVERSE COMPOSITE RESPONSE AND FAILURE K.W. Gan*, M.R. Wisnom, S.R. Hallett, G. Allegri Advanced Comosites

More information

Optimism, Delay and (In)Efficiency in a Stochastic Model of Bargaining

Optimism, Delay and (In)Efficiency in a Stochastic Model of Bargaining Otimism, Delay and In)Efficiency in a Stochastic Model of Bargaining Juan Ortner Boston University Setember 10, 2012 Abstract I study a bilateral bargaining game in which the size of the surlus follows

More information

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms

More information

A Social Welfare Optimal Sequential Allocation Procedure

A Social Welfare Optimal Sequential Allocation Procedure A Social Welfare Otimal Sequential Allocation Procedure Thomas Kalinowsi Universität Rostoc, Germany Nina Narodytsa and Toby Walsh NICTA and UNSW, Australia May 2, 201 Abstract We consider a simle sequential

More information

Location of solutions for quasi-linear elliptic equations with general gradient dependence

Location of solutions for quasi-linear elliptic equations with general gradient dependence Electronic Journal of Qualitative Theory of Differential Equations 217, No. 87, 1 1; htts://doi.org/1.14232/ejqtde.217.1.87 www.math.u-szeged.hu/ejqtde/ Location of solutions for quasi-linear ellitic equations

More information

Genetic Algorithms, Selection Schemes, and the Varying Eects of Noise. IlliGAL Report No November Department of General Engineering

Genetic Algorithms, Selection Schemes, and the Varying Eects of Noise. IlliGAL Report No November Department of General Engineering Genetic Algorithms, Selection Schemes, and the Varying Eects of Noise Brad L. Miller Det. of Comuter Science University of Illinois at Urbana-Chamaign David E. Goldberg Det. of General Engineering University

More information

2x2x2 Heckscher-Ohlin-Samuelson (H-O-S) model with factor substitution

2x2x2 Heckscher-Ohlin-Samuelson (H-O-S) model with factor substitution 2x2x2 Heckscher-Ohlin-amuelson (H-O- model with factor substitution The HAT ALGEBRA of the Heckscher-Ohlin model with factor substitution o far we were dealing with the easiest ossible version of the H-O-

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

LEIBNIZ SEMINORMS IN PROBABILITY SPACES

LEIBNIZ SEMINORMS IN PROBABILITY SPACES LEIBNIZ SEMINORMS IN PROBABILITY SPACES ÁDÁM BESENYEI AND ZOLTÁN LÉKA Abstract. In this aer we study the (strong) Leibniz roerty of centered moments of bounded random variables. We shall answer a question

More information