Journal of Computational Biology. Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments

Size: px
Start display at page:

Download "Journal of Computational Biology. Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments"

Transcription

1 : Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments Journal: Manuscript ID: Manuscript Type: Date Submitted by the Author: Complete List of Authors: Keyword: draft Original Paper n/a Chen, Zhixiang; University of Texas-Pan American, Department of Computer Science Fu, Bin; University of Texas-Pan American, Computer Science Schweller, Robert; University of Texas-Pan American, Computer Science Yang, Boting; University of Regina, Computer Science Zhao, Zhiyu; University of New Orleans, Computer Science Zhu, Binhai; Montana State University, Computer Science haplotypes, algorithms, probability, alignment, SEQUENCE ANALYSIS Abstract: In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem - the incompleteness and inconsistency occurred in the DNA sequencing process to generate the input haplotype fragments and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration.

2 Page of Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments Zhixiang Chen Bin Fu Robert Schweller Boting Yang Zhiyu Zhao Binhai Zhu Abstract In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem - the incompleteness and inconsistency occurred in the DNA sequencing process to generate the input haplotype fragments and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration. Keywords: Singular haplotype reconstruction, SNP fragments, probabilistic modeling and analysis, linear time probabilistic algorithm, inconsistency and incompleteness errors.. Introduction Abstractly, a genome can be considered a string over the alphabet of nucleotides {A, G, C, T }. It is accepted that the genomes between any two humans are over % identical (Terwilliger and Weiss, ; Hoehe et al., 000). The remaining sites which exhibit substantial variation (in at least % of the population) are called Single Nucleotide Polymorphisms (SNPs). The values of a set of SNPs on a particular chromosome copy define a haplotype. While a haplotype is a string over the four nucleotide bases, typical SNP sites only vary Corresponding author. Department of Computer Science, University of Texas-Pan American, Edinburg, TX, USA. s: {chen, binfu, schwellerr}@cs.panam.edu, Phone: ()-0, Fax: ()-0. Department of Computer Science, University of Regina, Saskatchewan, SS 0A, Canada. boting@cs.uregina.ca, Phone:(0)-, Fax: (0)-. Department of Computer Science, University of New Orleans, New Orleans, LA 0, USA. zzha@cs.uno.edu, Phone: (0)0-, Fax: (0)0-. Department of Computer Science, Montana State University, Bozeman, MT. USA. bhz@cs.montana.edu, Phone: (0)-, Fax: (0)-.

3 Page of between two values. Therefore, we can logically represent a haplotype as a binary string over the alphabet {A, B}. The haplotype of an individual chromosome can be thought of as a genetic finger print specifying the bulk of the genetic variability among distinct members of the same species. Determining haplotypes is thus a key step in the analysis of genetic variation. In recent years, the haplotyping problem has been extensively studied, see, f.g., (G. Lancia, 00; Zhao et al., 00; Wang et al., 00; Zhang et al., 00; Bafna et al., 00; Rizzi et al., 00; Wang et al., 00; Lippert et al., 00; Cilibrasi et al., 00; Lancia et al., 00; G. Lancia, 00; Alessandro Panconesi, 00; Clark, 0; Gusfield, 000, 00; Xie and Wang, 00; Li et al., 00; Genovese et al., 00). There are several version of the haplotyping problem. Of most interest is the haplotyping of diploid organisms, such as humans, which have two chromosomes, and thus two haplotypes. A common method to obtain an individual s two haplotypes is to first obtain a collection of sequencing data from the individual s chromosomes. This data consists of a collection of fragments, each partially specifying the sequence of one of the two base chromosomes. This data typically contains two types of errors. The data is incomplete in that each fragment will fail to assign a base to some collection of positions, and inconsistent in that some positions may be sequenced with an incorrect value. Further, in practice it is not known which chromosome is being sequenced by which fragments. The general haplotype reconstruction problem is then to take this input set of sequenced fragments and derive the two haplotypes which yielded the given fragments. This problem, which we consider, is the singular haplotype reconstruction problem and, like other versions of the problem, has also been extensively studied, see, f.g., (Cilibrasi et al., 00; Wang et al., 00; Lippert et al., 00; Bafna et al., 00). More concretely, we assume we are given a collection of n length m strings over the alphabet {A, B, } denoting the sequence of each fragment, where denotes a hole in which no value is measured at that position for that fragment. From this basic setup, a number of problems have been considered for various different optimization functions. In particular, one can consider removing certain fragments from the input data such that the remaining set of fragments can be divided into two separate sets, each set consisting of fragments that do not contain any conflicts (a site in which two strings contain different, non-hole values). Such a removal implicity derives two haplotypes corresponding to the two groups. This problem is referred to as the minimum fragment removal problem (MFR). A second problem involves removing or discounting a subset of the SNP sites to create two consistent sets of fragments. This is referred to as the minimum SNP removal problem (MSR). Finally, a third approach is to flip or correct a number of sites for various fragments. This is referred to as the minimum error correction problem (MEC). While much work has been done on these problems, the incompleteness and inconsistency of data fragments makes most versions of these haplotyping problems NP-hard or even hard to approximate (f.g., (Cili-

4 Page of brasi et al., 00; Lippert et al., 00; Bafna et al., 00)), and many elegant and powerful methods such as those in (Li et al., 00) cannot be used to deal with incompleteness and inconsistency at the same time. Other methods have proposed heuristics (Genovese et al., 00; Alessandro Panconesi, 00), but do not provide provably good results. In this paper, we develop a probabilistic approach to overcome some of the difficulties caused by the incompleteness and inconsistency occurred in the input fragments. In our model, we assume the input data fragments are attained from two unknown haplotype strings. Some number of the fragments are derived from the first haplotype, some from the second haplotype. Each fragment is assumed to be generated according to two error parameters, α and α, representing inconsistency errors and incompleteness errors respectively. These parameters represent the percentage chance that any given position will be sequenced incorrectly from the base haplotype, or will be omitted as a hole, respectively. Further, we assume a third parameter β denoting a minimum difference percentage between the two base haplotype strings. With respect to these three parameters, we develop algorithms that output the base haplotypes with probability as a function of α, α, and β. In particular, we design three algorithms in our probabilistic model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. The first algorithm assumes prior knowledge of the α parameter. The second algorithm does not require prior knowledge of any parameters at a modest cost to run time. The third does not require prior knowledge of parameters and attains similar analytical bounds as algorithm, yet exhibits a substantial empirical improvement in accuracy. For all algorithms, we present experimental results that conform with the respective theoretically efficient performance. The software of our algorithms is available for public access and for real-time on-line demonstration. Paper Layout In Section we formally define the problem and the probabilistic model we use. In Section we develop some technical lemmas. In Section we provide a haplotype reconstruction algorithm that utilizes prior knowledge of an error parameter. In Section we present an algorithm that requires no prior knowledge of model parameters. In Section we provide a third algorithm with provably good accuracy that also performs particularly well in practice. In Section we detail experimental results for our algorithms.

5 Page of A Probabilistic Model Assume that we have two haplotypes H, H, denoted as H = a a a m and H = b b b m. Let Γ = {S, S,..., S n } be a set of n fragments obtained from the DNA sequencing process with respect to the two haplotypes H and H. In this case, each S i = c c c m is either a fragment of H or H. Because we lose the information concerning the DNA strand to which a fragment belongs, we do not know whether S i is a fragment of H or H. Suppose that S i is a fragment of H. Because of reading errors or corruptions that may occur during the sequencing process, there is a small chance that either c j - but c j a j, or c j = -, for j m, where the symbol - denotes a hole or missing value. For the former, the information of the fragment S i at the j-th SNP site is inconsistent, and we use α to denote the rate of this type of inconsistency error. For the latter, the information of S i at the j-th SNP is incomplete, and we use α to denote the rate of this type of incompleteness error. It is known (f.g., (Wang et al., 00; Lippert et al., 00; Bafna et al., 00)) that α and α are in practice between % to %. Also, it is realistically reasonable to believe that the dissimilarity, denoted by β, between the two haplotypes H and H is big. Often, β is measured using the Hamming distance between H and H divided by the length m of H and H, and is assumed to be large, say, β 0.. It is also often assumed that roughly half of the fragments in Γ are from each of the two haplotypes H and H. In the experimental studies of algorithmic solutions to the singular haplotype reconstruction problem, we often need to generate synthetic data to evaluate the performance and accuracy of a given algorithm. One common practice (f.g., (Wang et al., 00; Lippert et al., 00; Bafna et al., 00)) is as follows: First, choose two haplotypes H and H such that the dissimilarity between H and H is at least β. Second, make n i copies of H i, i =,. Third, for each copy H = a a a m of H i, for each i =,,..., m, with probability α, flip a i to a i so that they are inconsistent. Also, independently, a i has probability α to be a hole -. A synthetic data set is then generated by setting parameters m, n, n, β, α and α. Usually, n is roughly the same as n, and β 0., α [0.0, 0.0], and α [0., 0.]. Motivated by the above reality of the sequencing process and the common practice in experimental algorithm studies, we will present a probabilistic model for the singular haplotype reconstruction problem. But first we need to introduce some necessary notations and definitions. Let Σ = {A, B} and Σ = {A, B, -}. For a set C, C denotes the number of elements in C. For a fragment (or a sequence) S = a a a m Σ m, S[i] denotes the character a i, and S[i, j] denotes the substring a i a j for i j m. S denotes the length m of S. When no confusion arises, we alternatively use the terms fragment and sequence. Let G = g g g m Σ m be a fixed sequence of m characters. For any sequence S = a a m Σ m, S

6 Page of is called a F α,α (m, G) sequence if for each a i, with probability at most α, a i is not equal to g i and a i -; and with probability at most α, a i = -. For a sequence S, define holes(s) to be the number of holes in the sequence S. If A is a subset of {,, m} and S is a sequence of length m, holes A (S) is the number of i A such that S[i] is a hole. For two sequences S = a a m and S = b b m of the same length m, for any A {,, m}, define diff(s, S ) = {i {,,, m} a i - and b i - and a i b i } m diff A (S, S ) = {i A a i - and b i - and a i b i }. A For a set of sequences Γ = {S, S,, S k } of length m, define vote(γ) to be the sequence H of the same length m such that H[i] is the most frequent character among S [i], S [i],, S k [i] for i =,,, m. We often use an n m matrix M to represent a list of n fragments from Σ m matrix. For i n, let M[i] represent the i-th row of M, i.e., M[i] is a fragment in Σ m. We now define our probabilistic model: and call M an SNP fragment The Probabilistic Singular Haplotype Reconstruction Problem: Let β, α and α be small positive constants. Let G, G Σ m be two haplotypes with diff(g, G ) β. For any given n m matrix M of SNP fragments such that n i rows of M are F α,α (m, G i ) sequences, i =,, n + n = n, reconstruct the two haplotypes G and G, which are unknown to the users, from M as accurately as possible with high probability. We call β (resp., α, α ) dissimilarity rate (resp., inconsistency error rate, incompleteness error rate).. Technical Lemmas For probabilistic analysis we need the following two Chernoff bounds (Motwani and Raghavan, 000, see), which can be derived from the work in (Li et al., 00). Lemma. (Li et al., 00) Let X,, X n be n independent random 0, variables, where X i takes with probability at most p. Let X = n i= X i. Then for any ɛ > 0, Pr(X > pn + ɛn) < e nɛ. Lemma. (Li et al., 00) Let X,, X n be n independent random 0, variables, where X i takes with probability at least p. Let X = n i= X i. Then for any ɛ > 0, Pr(X < pn ɛn) < e nɛ. We shall prove several technical lemmas for algorithm analysis in the next three sections.

7 Page of Lemma. Let S be a F α,α (m, G) sequence. Then, for any 0 < ɛ, with probability at most e ɛ m, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m. Proof: Let X k, k =, m, be random variables such that X k = if S[k] G i [k] and S[k], or 0 otherwise. By the definition of the F α,α (m, G) sequences, X k are independent and P r(x k = ) α. So, by Lemma, with probability at most e ɛ m, X + + X m > (α + ɛ)m Thus, we have diff(g, S) > α + ɛ with probability at most e ɛ m. Similarly, with probability at most e ɛ m, holes(s) > (α + ɛ)m. Lemma. Assume that A is a fixed subset of {,,, m}. Let S be a F α,α (m, G) sequence. Then, for any 0 < ɛ, with probability at most e ɛ A, diff A (G i, S) > α + ɛ or holes A (S) > (α + ɛ) A. Proof: Let S be the subsequence consisting of all the characters S[i], i A, with the same order as in S. Similarly, let G be the subsequence consisting of all the characters G[i], i A, with the same order as in G. It is easy to see that diff A (S, G i ) = diff(s, G ). The lemma follows from a similar proof for Lemma. Lemma. Let N i be a set of n i many F α,α (m, G i ) sequences, i =,. Let β and ɛ be two positive constants such that α +α +ɛ <, and diff(g, G ) β. Then, with probability at most (n +n )e ɛ βm, diff(s i, S j ) β( α α ɛ) for some S i N i and some S j N j with i j. Proof: For each G i, let A i be the set of indexes {k {,,, m} G i [k] G j [k]}, where i j. Since diff(g i, G j ) β and G i = G j = m, we have A i βm. For any F α,α (m, G i ) sequence S, by Lemma, with probability at most e ɛ A i e ɛ βm, diff Ai (S, G i ) > α +ɛ or holes Ai (S) > (α +ɛ) A i. Hence, with probability at most n i e ɛ βm, diff Ai (S, G i ) > α +ɛ or holes Ai (S) > (α +ɛ) A i, for some S N i. Therefore, with probability at most (n + n )e ɛ βm, we have diff Ai (S, G i ) > α + ɛ or holes Ai (S) > (α + ɛ) A i,j, for some S N i, for some i = or. In other words, with probability at least (n + n )e ɛ βm, we have diff Ai (S, G i ) α + ɛ and holes Ai (S) (α + ɛ) A i,j, for all S N i and for i = and. For any F α,α (m, G i ) sequence S i, i =,, if diff Ai (S i, G i ) α + ɛ and holes Ai (S i ) (α + ɛ) A i, then diff(s, S ) diff Ai (S, S ) β( α α ɛ). Thus, with probability at least (n +n )e ɛ βm, we have diff(s, S ) β( α α ɛ), for every S N and every S N. In words, with probability at most (n +n )e ɛ βm, we have diff(s, S ) < β( α α ɛ), for some S N and some S N. Lemma. Let α, α and ɛ be three small positive constants that satisfy 0 < α + α ɛ <. Assume that N = {S,, S n } is a set of F α,α (m, G) sequences. Let H = vote(n). Then, with probability at most m(e ɛ n ), G H.

8 Page of Proof: Given any j m, for any i n, let X i be random variables such that X i = if S i [j] G[j] and S i [j], or 0 otherwise. By the definition of the F α,α (m, G) sequences, X i are independent and P r(x i = ) α. So, by Lemma, with probability at most e ɛ n, X + +X n < (α ɛ)n. That is, with probability at most e ɛ n, there are fewer than (α ɛ)n characters S i [j] such that S i [j] G[j] and S i [j] -. Similarly, with probability at most e ɛ n, there are fewer than (α ɛ)n characters S i [j] such that S i [j] = -. Thus, with probability at most me ɛ n, there are fewer than (α + α ɛ)n characters S i [j] such that S i [j] G[j] for some j m. This implies that, with probability at least me ɛ n, there are more than ( α α + ɛ)n characters S i [j] such that S i [j] = G[j] for any j m. Since 0 < α +α ɛ < by the assumption of the theorem, we have (α +ɛ)n < ( α α +ɛ)n. This further implies that with probability at least me ɛ n, vote(n)[j] = G[j] for any j m, i.e., vote(n) = G.. When the Inconsistency Error Parameter Is Known Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy (α + ɛ) < β and 0 < α + α ɛ <. Let G, G Σ m be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. There exists an O(nm) time algorithm that can find two haplotypes H and H with probability at least ne ɛ m me ɛ n me ɛ n such that either H = G and H = G, or H = G and H = G. Proof: The algorithm, denoted as SHR-One, is described as follows. Algorithm SHR-One Input: M, an n m matrix of SNP fragments. Parameters α and ɛ. Output: Two haplotypes H and H. Set Γ = Γ =. Randomly select a fragment r = M[j] for some j n. For every fragment r from M do If (diff(r, r ) (α + ɛ)) then put r into Γ Let Γ = M Γ. Let H = vote(γ ) and H = vote(γ ). return H and H.

9 Page of End of Algorithm Claim. With probability at most ne ɛ m, we have either diff(f, G ) > α + ɛ for some F α,α (m, G ) sequence f in M, or diff(g, G ) > α + ɛ for some F α,α (m, G ) sequence g in M. By Lemma, for any fragment f = M[k] such that f is a F α,α (m, G ) sequence, with probability at most e ɛ m we have diff(f, G ) > α + ɛ. Since there are n many F α,α (m, G ) sequences in M, with probability at most n e ɛ m, we have diff(f, G ) > α +ɛ for some F α,α (m, G ) sequence f in M. Similarly, with probability at most n e ɛ m, we have diff(g, G ) > α + ɛ for some F α,α (m, G ) sequence g in M. Combining the above completes the proof for Claim. Claim. Let M i be the set of all the F α,α (m, G i ) sequences in M, i =,. With probability at least ne ɛ m, Γ and Γ is a permutation of M and M. By the assumption of the theorem, the fragment r of M is either a F α,α (m, G ) sequence or a F α,α (m, G ) sequence. We assume that the former is true. By Claim, with probability at least ne ɛ m, we have diff(f, G ) α + ɛ for all F α,α (m, G ) sequences f in M, and diff(g, G ) α + ɛ for all F α,α (m, G ) sequences g in M. Hence, for any fragment r in M, if r is a F α,α (m, G ) sequence, then with probability at least ne ɛ m, we have diff(r, r ) diff(r, G ) + diff(r, G ) (α + ɛ). This means that, with probability at least ne ɛ m, all F α,α (m, G ) sequences in M will be included in Γ. Now, consider the case that r is a F α,α (m, G ) sequence in M. Since diff(g, G ) diff(g, r) + diff(r, G ) diff(g, r) + diff(r, r ) + diff(r, G ), we have diff(r, r ) diff(g, G ) diff(g, r) diff(g, r ). By the given condition of diff(g, G ) β and (α + ɛ) < β, with probability at least ne ɛ m, we have diff(r, r ) β diff(g, r) diff(g, r ) β (α + ɛ) > (α + ɛ), i.e., r will not be added to Γ. Therefore, with probability at least ne ɛ m, Γ = M and Γ = M Γ = M. Similarly, if r is a F α,α (m, G ) sequence, with probability at least ne ɛ m, Γ = M and Γ = M Γ = M. This completes the proof of Claim. Suppose that Γ and Γ is a permutation of M and M. Say, without loss of generality, Γ = M and Γ = M. By Lemma, with probability at most me ɛ n + me ɛ n, vote(γ ) G or vote(γ ) G. Hence, by Claim, with probability at most ne ɛ m + me ɛ n + me ɛ n, vote(γ ) G or vote(γ ) G. Concerning the computational time of the algorithm, we need to compute the difference between the selected fragment r and each of the remaining n fragments in the matrix M. Finding the difference between r and r takes O(m) steps. So, the total computational time is O(nm), which is linear in the size of the input matrix M.

10 Page of When Parameters Are Not Known In this section, we consider the case that the parameters α, α and β are unknown. However, we assume the existence of those parameters for the input matrix M of SNP fragments. We will show that in this case we can still reconstruct the two unknown haplotypes from M with high probability. Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy α + α + ɛ <, 0 < α +α ɛ <, and β( α α ɛ) > (α +ɛ). Let G, G Σ m be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. Then, there exists an O(umn) time algorithm that can find two haplotypes H and H with probability at least ( γ) u ne ɛ βm me ɛ n me ɛ n such that H, H is a permutation of G, G, where γ = Proof: nn n(n ) The algorithm, denoted as SHR-Two, is described as follows. Algorithm SHR-Two Input: M, an n m matrix M of SNP fragments. u, a parameter to control the loop. Output: two haplotypes H and H. Let d min = and M =. For (k = to u) do { //the k-loop } Let M = M = and d = d = 0. Randomly select two fragments r = M[i ], r = M[i ] from M For every fragment r from M do { } and u is an integer parameter. If (diff(r i, r ) = min{diff(r, r ), diff(r, r )} for i = or, then put r into M i. Let d i = max{diff(r i, r ) r M i } for i =,. Let d = max{d, d }. If (d < d min ) then let M = {M, M } and d min = d. return H = vote(m ) and H = vote(m ). End of Algorithm Claim. With probability at most ( γ) u, r, r is not a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in all of the k-loop iterations. For randomly selected fragments r and r, with probability γ, r, r is a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in M. When the k-loop is repeated u times, with probability at most

11 Page 0 of ( γ) u, r, r is not a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in all of the u loop iterations. Thus, Claim is true. Let N i be the set of the n i fragments in M that are F α,α (m, G i ) sequences, i =,. Claim. With probability at most ne ɛ βm, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m for some S from N i for some i = or ; or diff(s, S ) β( α α ɛ) for some S N and some S N. By Lemma, for every fragment S from N i, with probability at most e ɛ m, diff(gi, S) > α + ɛ or S has more than (α + ɛ)m holes. Thus, with probability at most ne ɛ m, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m for some S from N i for some i = or. By Lemma, with probability at most ne ɛ βm, diff(s, S ) β( α α ɛ) for some S N and some S N. The above analysis completes the proof for Claim. Claim. Let H = vote(m ) and H = vote(m ) be the two haplotypes returned by the algorithm. With probability at most ( γ) u + ne ɛ βm, M, M is not a permutation of N, N. We assume that () diff(s, S ) > β( α α ɛ) for every S from N and every S from N ; and () diff(g i, S) α + ɛ and holes(s) (α + ɛ)m for all S N i for i =,. We consider possible choices of the two random fragments r and r in the following. At any iteration of the k-loop, if r N and r N, then by () we have diff(r, r ) diff(r, G ) + diff(r, G ) (α +ɛ) for any r N ; and diff(r, r ) diff(r, G )+diff(r, G ) (α +ɛ) for any r N. By () and the given condition of the theorem, we have, diff(r, r ) > β( α α ɛ) > (α + ɛ) for any r N ; and diff(r, r ) > β( α α ɛ) > (α + ɛ) for any r N. This implies that at this loop iteration we have M = N, M = N and d (α + ɛ). Similarly, if at this iteration r N and r N, then M = N, M = N and d (α + ɛ). If r, r N at some iteration of the k-loop, then for any r N, either r M or r M. In either case, by () of our assumption and the given condition of the theorem, we have d β( α α ɛ) > (α + ɛ) at this iteration. Similarly, if r, r N at some iteration of the k-loop, then we also have d > (α + ɛ) at this iteration. It follows from the above analysis that, under the assumption of () and (), once we have r N and r N or r N and r N at some iteration of the k-loop, then M, M is a permutation of N, N at the end of this iteration. Furthermore, if M and M are replaced by M and M after this iteration, then M, M must also be a permutation of N, N. By Claims and, with probability at most ( γ) u + ne ɛ βm, the assumption of () and () is not true, or r N and r N (or r N and r N ) is not true at all the iterations of the k-loop. Hence, with probability at most ( γ) u + ne ɛ βm, the final list of M and M returned by the algorithm is not a permutation of N, N, so the claim is proved. 0

12 Page of For M and M returned by the algorithm, we assume without loss of generality M i = N i, i =,. By Lemma and the given condition of the theorem, with probability at most me ɛ n + me ɛ n, we have H = vote(m ) G or H = vote(m ) G. Thus, by Claim, with probability at most ( γ) u + ne ɛ βm + me ɛ n, we have H G or H G. It is easy to see that the time complexity of the algorithm is O(umn), which is linear in the size of M for a constant u.. Tuning the Dissimilarity Measure In this section, we consider a different dissimilarity measure in algorithm SHR-TWO to improve its ability to tolerate errors. We use the sum of the differences between r i and every fragment r M i, i =,, to measure the dissimilarity of the fragments in M i with r i. The new algorithm SHR-Three is given in the following. We will present experimental results in Section to show that algorithm SHR-Three is more reliable and robust in dealing with possible outliers in the data sets. Algorithm SHR-Three Input: M, an n m matrix of SNP fragments. u, a parameter to control the loop. Output: two haplotypes H and H. } Let d min = and M =. For (k = to u) do { //the k-loop Let M = M = and d = d = 0. Randomly select two fragments r = M[i ], r = M[i ] from M For every fragment r from M do { } If (diff(r i, r ) = min{diff(r, r ), diff(r, r )} for i = or, then put r into M i. Let d i = r M i diff(r i, r ) for i =,. Let d = max{d, d }. If (d < d min ) then let M = {M, M } and d min = d. return H = vote(m ) and H = vote(m ). End of Algorithm Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy α + α + ɛ <,

13 Page of < α + α ɛ <, η > G, G Σ m (α +ɛ) β( α α ɛ) min(n,n) with η = n, and β( α α ɛ) > (α + ɛ). Let be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. Then, there exists an O(umn) time algorithm that can find two haplotypes H and H with probability at least ( γ) u ne ɛ βm me ɛ n me ɛ n such that H, H is a permutation of G, G, where γ = nn n(n ) and u is an integer parameter. Proof: Let N i be the set of the n i many F α,α (m, G i ) sequences in M, for i =,. We first notice that both Claims and in the proof of Theorem hold here following the same analysis. However, we need to prove the following claim with different analysis: Claim. Let H = vote(m ) and H = vote(m ) be the two haplotypes returned by the algorithm. With probability at most ( γ) u + ne ɛ βm, M, M is not a permutation of N, N. We assume that () diff(s, S ) > β( α α ɛ) for every S from N and every S from N ; and () diff(g i, S) α + ɛ and holes(s) (α + ɛ)m for each S from N i (i =, ). We shall consider the two cases: Case. At some iteration of the k-loop, both r and r are selected from the same N i for i = or. For each r N j, j i, by assumption () we have both diff(r i, r ) > β( α α ɛ), i =,. Notice that r must be either in M or M. So, at least half of the fragments in N j will be either in M or M. Therefore, at this iteration, we have d n jβ( α α ɛ) ηnβ( α α ɛ) = ηβ( α α ɛ)n > (α + ɛ)n. Case. At some iteration of the k-loop, r and r are selected from different N i for i = and. Without loss of generality, r i N i, i =,. For each r N, by assumption () we have diff(r, r ) (α + ɛ); by assumption () and the given condition of the theorem we have diff(r, r ) > β( α α ɛ) > (α +ɛ). Similarly, for each r N, diff(r, r ) (α + ɛ), and diff(r, r ) > (α + ɛ). Therefore, at this iteration, we have M = N and M = N, and d (α + ɛ)n + (α + ɛ)n = (α + ɛ)n. The two cases implies that under the assumption of () and (), if at any iteration of the k-loop, we have r N and r N, or r N and r N, then the final list of the two sets M, M is a permutation of N, N. Hence, Claim follows from Claims and in the proof of Theorem, which are true here as we mentioned earlier. Now, we assume that the final list of the two sets M, M is a permutation of N, N. By Lemma, with probability at most m(e δ n ) + m(e δ n ), H = vote(m ), H = vote(m ) is not a permutation of G, G. This, together with Claim, completes the probabilistic claim of the theorem. It is easy to see that the time complexity of the algorithm is O(umn), which is linear in the size of M.

14 Page of Experimental Results We design a MATLAB program to test both the accuracy and the speed of algorithm SHR-Three. Due to the difficulty of getting real data from the public domain (Alessandro Panconesi, 00), our experiment data is created following the common practice in literature such as (Wang et al., 00; Alessandro Panconesi, 00). A random matrix of SNP fragments is created as follows: () Haplotype is generated at random with length m (m {0, 00, 0}). () Haplotype is generated by copying all the bits from haplotype and flipping each bit with probability β (β {0., 0., 0.}). This simulates the dissimilarity rate β between two haplotypes. () Each haplotype is copied n times so that the matrix has m columns and n(n {0, 0, 0}) rows. () Set each bit in the matrix to - with probability α (α {0., 0., 0.}). This simulates the incompleteness error rate α in the matrix. () Flip each non-empty bit with probability α (α {0.0, 0.0,..., 0.}). This is the simulation of the inconsistency error rate of α. Tables to show the performance of algorithm SHR-Three with different parameter settings in accordance with those in (Alessandro Panconesi, 00). The typical parameters used in (Alessandro Panconesi, 00) are m = 00, n = 0, β = 0., α = 0. and 0.0 α 0.0. These are considered to be close to the real situations. In our tables, the results are the average time and the reconstruction rate of the 000 executions of algorithm SHR-Three. A new random matrix is used for each execution. The reconstruction rate is defined as the ratio of the total number of correctly reconstructed bits to the total number of bits in two haplotypes. The computing environment is a PC machine with a typical configuration of.ghz AMD Turion X CPUs and GB memory.. Concluding Remarks The software of our algorithms is available for public access and for real-time on-line demonstration at We thank Liqiang Wang for implementing the programs in Java and setting up this web site. We have developed three linear time probabilistic algorithms for the singular haplotype reconstruction problem with provable high success probability. Our experimental results conform with the theoretical efficiency and accuracy of the algorithms. It should be pointed out that our work can be extended to reconstruct multiple haplotypes from a set of fragments. Our approach also opens the door to develop probabilistic methods for other variants of the

15 Page of haplotyping problem involving both inconsistency and incompleteness errors. References Alessandro Panconesi, M. S. (00). Fast Hare: A fast heuristic for single individual snp haplotype reconstruction. In Algorithms in Bioinformatics, th International Workshop, WABI 00, Lecture Notes in Computer Science 0, pages. Bafna, V., Istrail, S., Lancia, G., and Rizzi, R. (00). Polynomial and apx-hard cases of the individual haplotyping problem. Theoretical Computer Science, :0. Cilibrasi, R., van Iersel, L., Kelk, S., and Tromp, J. (00). On the complexity of several haplotyping problems. In Algorithms in Bioinformatics, th International Workshop, WABI 00, Lecture Notes in Computer Science, volume, pages. Clark, A. (0). Inference of haplotypes from pcr-amplified samples of diploid populations. Molecular Biology Evolution, :. G. Lancia, R. R. (00). A polynomial solution to a special case of parsimony haplotyping problem. Operations Research Letters, (0):. Genovese, L. M., Geraci, F., and Pellegrini, M. (00). A fast and accurate heuristic for the single individual SNP haplotyping problem with many gaps, high reading error rate and low coverage. Algorithms in Bioinformatics, : 0. Gusfield, D. (000). A practical algorithm for optimal inference of haplotype from diploid populations. In The Eighth International Conference on Intelligence Systems for Molecular Biology, pages. Gusfield, D. (00). Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In the Sixth Annual International Conference on Computational Biology, pages. Hoehe, M., Kopke, K., Wendel, B., Rohde, K., Flachmeier, C., Kidd, K., Berrettini, W., and Church, G. (000). Sequence variability and candidate gene analysis in complex disease: association of opioid receptor gene variation with substance dependence. Human Molecular Genetics, (): 0. Lancia, G., Pinotti, M. C., and Rizzi, R. (00). Haplotyping polulations by purs parsimongy: complexity and algorithms. INFORMS Journal on computing, :. Li, L., Kim, J. H., and Waterman, M. S. (00). Haplotype reconstruction from SNP alignment. Journal of Computational Biology, ( ):0.

16 Page of Li, M., Ma, B., and Wang, L. (00). On the closest string and substring problems. Journal of the ACM, ():. Lippert, R., Schwartz, R., Lancia, G., and Istrail, S. (00). Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in bioinformatics, :. Motwani, R. and Raghavan, P. (000). Randomized Algorithms. Cambridge. Rizzi, R., Bafna, V., Istrail, S., and Lancia, G. (00). Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In Algorithms in Bioinformatics: Second International Workshop, WABI 00, Rome, Italy, September -, pages. Terwilliger, J. and Weiss, K. (). Linkage disquilibrium mapping of complex disease: fantasy and reality? Current Opinion in Biotechnology, ():. Wang, R.-S., Wu, L.-Y., Li, Z.-P., and Zhang, X.-S. (00). Hyplotype reconstruction from SNP fragments by minimum error correction. BioInformatics, (0):. Wang, R.-S., Wu, L.-Y., Zhang, X.-S., and Chen, L. (00). A markov chain model for haplotype assembly from SNP fragments. Genome Informatics, ():. Xie, M. and Wang, J. (00). An improved (and practical) parameterized algorithm for the individual haplotyping problem MFR with mate-pairs. Algorithmica (online). Zhang, X.-S., Wang, R.-S., Wu, L.-Y., and Zhang, W. (00). Minimum conflict individual haplotyping from SNP fragments and related genotype. Evolutionary Bioinformatics Online, : 0. Zhao, Y., Wu, L., Zhang, J., and R. Wang, X. Z. (00). Haplotype assembly from aligned weighted SNP fragments. Computational Biology and Chemistry, :.

17 Page of α (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0, β = 0% and α = 0%

18 Page of n = 0 n = 0 α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, β = 0%, α = 0% β = 0% β = 0% α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0 and α = 0% α = 0% α = 0% α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0 and β = 0%

On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem

On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem Paola Bonizzoni, Riccardo Dondi, Gunnar W. Klau, Yuri Pirola, Nadia Pisanti and Simone Zaccaria DISCo, computer

More information

SAT in Bioinformatics: Making the Case with Haplotype Inference

SAT in Bioinformatics: Making the Case with Haplotype Inference SAT in Bioinformatics: Making the Case with Haplotype Inference Inês Lynce 1 and João Marques-Silva 2 1 IST/INESC-ID, Technical University of Lisbon, Portugal ines@sat.inesc-id.pt 2 School of Electronics

More information

Haplotype Inference Constrained by Plausible Haplotype Data

Haplotype Inference Constrained by Plausible Haplotype Data Haplotype Inference Constrained by Plausible Haplotype Data Michael R. Fellows 1, Tzvika Hartman 2, Danny Hermelin 3, Gad M. Landau 3,4, Frances Rosamond 1, and Liat Rozenberg 3 1 The University of Newcastle,

More information

Allen Holder - Trinity University

Allen Holder - Trinity University Haplotyping - Trinity University Population Problems - joint with Courtney Davis, University of Utah Single Individuals - joint with John Louie, Carrol College, and Lena Sherbakov, Williams University

More information

A Markov Chain Model for Haplotype Assembly from SNP Fragments

A Markov Chain Model for Haplotype Assembly from SNP Fragments 162 Genome Informatics 17(2): 162{171 (2006) A Markov Chain Model for Haplotype Assembly from SNP Fragments Rui-Sheng Wang 1;2 Ling-Yun Wu 3 wangrsh@amss.ac.cn wlyun@amt.ac.cn Xiang-Sun Zhang 3* Luonan

More information

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model Jens Gramm Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Germany. Tzvika Hartman Dept. of Computer Science,

More information

The Minimum k-colored Subgraph Problem in Haplotyping and DNA Primer Selection

The Minimum k-colored Subgraph Problem in Haplotyping and DNA Primer Selection The Minimum k-colored Subgraph Problem in Haplotyping and DNA Primer Selection M.T. Hajiaghayi K. Jain K. Konwar L.C. Lau I.I. Măndoiu A. Russell A. Shvartsman V.V. Vazirani Abstract In this paper we consider

More information

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs

More information

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely

Let S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 1, 2001 Mary Ann Liebert, Inc. Pp. 69 78 Perfect Phylogenetic Networks with Recombination LUSHENG WANG, 1 KAIZHONG ZHANG, 2 and LOUXIN ZHANG 3 ABSTRACT

More information

Consensus Optimizing Both Distance Sum and Radius

Consensus Optimizing Both Distance Sum and Radius Consensus Optimizing Both Distance Sum and Radius Amihood Amir 1, Gad M. Landau 2, Joong Chae Na 3, Heejin Park 4, Kunsoo Park 5, and Jeong Seop Sim 6 1 Bar-Ilan University, 52900 Ramat-Gan, Israel 2 University

More information

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.!

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.! Integer Programming in Computational Biology D. Gusfield University of California, Davis Presented December 12, 2016. There are many important phylogeny problems that depart from simple tree models: Missing

More information

Mathematical Approaches to the Pure Parsimony Problem

Mathematical Approaches to the Pure Parsimony Problem Mathematical Approaches to the Pure Parsimony Problem P. Blain a,, A. Holder b,, J. Silva c, and C. Vinzant d, July 29, 2005 Abstract Given the genetic information of a population, the Pure Parsimony problem

More information

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence

A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence Danlin Cai, Daxin Zhu, Lei Wang, and Xiaodong Wang Abstract This paper presents a linear space algorithm for finding

More information

An Overview of Combinatorial Methods for Haplotype Inference

An Overview of Combinatorial Methods for Haplotype Inference An Overview of Combinatorial Methods for Haplotype Inference Dan Gusfield 1 Department of Computer Science, University of California, Davis Davis, CA. 95616 Abstract A current high-priority phase of human

More information

Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm

Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm Zhixiang Chen (chen@cs.panam.edu) Department of Computer Science, University of Texas-Pan American, 1201 West University

More information

Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem

Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem Romeo Rizzi 1,, Vineet Bafna 2, Sorin Istrail 2, and Giuseppe Lancia 3 1 Math. Dept., Università

More information

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model Jens Gramm 1, Tzvika Hartman 2, Till Nierhoff 3, Roded Sharan 4, and Till Tantau 5 1 Wilhelm-Schickard-Institut für Informatik,

More information

Genetic Algorithms: Basic Principles and Applications

Genetic Algorithms: Basic Principles and Applications Genetic Algorithms: Basic Principles and Applications C. A. MURTHY MACHINE INTELLIGENCE UNIT INDIAN STATISTICAL INSTITUTE 203, B.T.ROAD KOLKATA-700108 e-mail: murthy@isical.ac.in Genetic algorithms (GAs)

More information

Haplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing

Haplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing INFORMS Journal on Computing Vol. 22, No. 2, Spring 2010, pp. 195 209 issn 1091-9856 eissn 1526-5528 10 2202 0195 informs doi 10.1287/ijoc.1090.0333 2010 INFORMS A Class Representative Model for Pure Parsimony

More information

A Class Representative Model for Pure Parsimony Haplotyping

A Class Representative Model for Pure Parsimony Haplotyping A Class Representative Model for Pure Parsimony Haplotyping Daniele Catanzaro, Alessandra Godi, and Martine Labbé June 5, 2008 Abstract Haplotyping estimation from aligned Single Nucleotide Polymorphism

More information

Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions

Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Matthew J. Streeter Computer Science Department and Center for the Neural Basis of Cognition Carnegie Mellon University

More information

SNPs Problems, Complexity, and Algorithms

SNPs Problems, Complexity, and Algorithms SNPs Problems, Complexity, and lgorithms Giuseppe Lancia 1,2, Vineet afna 1, Sorin Istrail 1, Ross Lippert 1, and Russell Schwartz 1 1 Celera Genomics, Rockville MD, US, {Giuseppe.Lancia,Vineet.afna,Sorin.Istrail,Ross.Lippert,

More information

Average Case Complexity: Levin s Theory

Average Case Complexity: Levin s Theory Chapter 15 Average Case Complexity: Levin s Theory 1 needs more work Our study of complexity - NP-completeness, #P-completeness etc. thus far only concerned worst-case complexity. However, algorithms designers

More information

This is a survey designed for mathematical programming people who do not know molecular biology and

This is a survey designed for mathematical programming people who do not know molecular biology and INFORMS Journal on Computing Vol. 16, No. 3, Summer 2004, pp. 211 231 issn 0899-1499 eissn 1526-5528 04 1603 0211 informs doi 10.1287/ijoc.1040.0073 2004 INFORMS Opportunities for Combinatorial Optimization

More information

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: 15:30-16:30/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office hours:??

More information

Minimization of Boolean Expressions Using Matrix Algebra

Minimization of Boolean Expressions Using Matrix Algebra Minimization of Boolean Expressions Using Matrix Algebra Holger Schwender Collaborative Research Center SFB 475 University of Dortmund holger.schwender@udo.edu Abstract The more variables a logic expression

More information

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value

More information

Implementing Approximate Regularities

Implementing Approximate Regularities Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National

More information

Computational statistics

Computational statistics Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f

More information

EVOLUTIONARY DISTANCES

EVOLUTIONARY DISTANCES EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:

More information

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109

CONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109 CONTENTS ix Preface xv Acknowledgments xxi Editors and contributors xxiv A computational micro primer xxvi P A R T I Genomes 1 1 Identifying the genetic basis of disease 3 Vineet Bafna 2 Pattern identification

More information

On the Monotonicity of the String Correction Factor for Words with Mismatches

On the Monotonicity of the String Correction Factor for Words with Mismatches On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.

More information

SINGLE nucleotide polymorphisms (SNPs) are differences

SINGLE nucleotide polymorphisms (SNPs) are differences IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 3, JULY-SEPTEMBER 2006 303 Islands of Tractability for Parsimony Haplotyping Roded Sharan, Bjarni V. Halldórsson, and Sorin

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

Computing Bi-Clusters for Microarray Analysis

Computing Bi-Clusters for Microarray Analysis December 21, 2006 General Bi-clustering Problem Similarity of Rows (1-5) Cheng and Churchs model General Bi-clustering Problem Input: a n m matrix A. Output: a sub-matrix A P,Q of A such that the rows

More information

Analysis and Design of Algorithms Dynamic Programming

Analysis and Design of Algorithms Dynamic Programming Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................

More information

Efficient Haplotype Inference with Boolean Satisfiability

Efficient Haplotype Inference with Boolean Satisfiability Efficient Haplotype Inference with Boolean Satisfiability Joao Marques-Silva 1 and Ines Lynce 2 1 School of Electronics and Computer Science University of Southampton 2 INESC-ID/IST Technical University

More information

Lecture 15: A Brief Look at PCP

Lecture 15: A Brief Look at PCP IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Basic Course on Computational Complexity Lecture 15: A Brief Look at PCP David Mix Barrington and Alexis Maciel August 4, 2000 1. Overview

More information

Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser

Algorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser Algorithms and Data Structures Complexity of Algorithms Ulf Leser Content of this Lecture Efficiency of Algorithms Machine Model Complexity Examples Multiplication of two binary numbers (unit cost?) Exact

More information

Lecture 2 Sep 5, 2017

Lecture 2 Sep 5, 2017 CS 388R: Randomized Algorithms Fall 2017 Lecture 2 Sep 5, 2017 Prof. Eric Price Scribe: V. Orestis Papadigenopoulos and Patrick Rall NOTE: THESE NOTES HAVE NOT BEEN EDITED OR CHECKED FOR CORRECTNESS 1

More information

ILP-Based Reduced Variable Neighborhood Search for Large-Scale Minimum Common String Partition

ILP-Based Reduced Variable Neighborhood Search for Large-Scale Minimum Common String Partition Available online at www.sciencedirect.com Electronic Notes in Discrete Mathematics 66 (2018) 15 22 www.elsevier.com/locate/endm ILP-Based Reduced Variable Neighborhood Search for Large-Scale Minimum Common

More information

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees

NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana

More information

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata

An Evolution Strategy for the Induction of Fuzzy Finite-state Automata Journal of Mathematics and Statistics 2 (2): 386-390, 2006 ISSN 1549-3644 Science Publications, 2006 An Evolution Strategy for the Induction of Fuzzy Finite-state Automata 1,2 Mozhiwen and 1 Wanmin 1 College

More information

Notes 3: Stochastic channels and noisy coding theorem bound. 1 Model of information communication and noisy channel

Notes 3: Stochastic channels and noisy coding theorem bound. 1 Model of information communication and noisy channel Introduction to Coding Theory CMU: Spring 2010 Notes 3: Stochastic channels and noisy coding theorem bound January 2010 Lecturer: Venkatesan Guruswami Scribe: Venkatesan Guruswami We now turn to the basic

More information

CS6999 Probabilistic Methods in Integer Programming Randomized Rounding Andrew D. Smith April 2003

CS6999 Probabilistic Methods in Integer Programming Randomized Rounding Andrew D. Smith April 2003 CS6999 Probabilistic Methods in Integer Programming Randomized Rounding April 2003 Overview 2 Background Randomized Rounding Handling Feasibility Derandomization Advanced Techniques Integer Programming

More information

More Approximation Algorithms

More Approximation Algorithms CS 473: Algorithms, Spring 2018 More Approximation Algorithms Lecture 25 April 26, 2018 Most slides are courtesy Prof. Chekuri Ruta (UIUC) CS473 1 Spring 2018 1 / 28 Formal definition of approximation

More information

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa

CS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa CS:4330 Theory of Computation Spring 2018 Regular Languages Finite Automata and Regular Expressions Haniel Barbosa Readings for this lecture Chapter 1 of [Sipser 1996], 3rd edition. Sections 1.1 and 1.3.

More information

A GENETIC ALGORITHM FOR FINITE STATE AUTOMATA

A GENETIC ALGORITHM FOR FINITE STATE AUTOMATA A GENETIC ALGORITHM FOR FINITE STATE AUTOMATA Aviral Takkar Computer Engineering Department, Delhi Technological University( Formerly Delhi College of Engineering), Shahbad Daulatpur, Main Bawana Road,

More information

STUDY of genetic variations is a critical task to disease

STUDY of genetic variations is a critical task to disease Haplotype Assembly Using Manifold Optimization and Error Correction Mechanism Mohamad Mahdi Mohades, Sina Majidian, and Mohammad Hossein Kahaei arxiv:9.36v [math.oc] Jan 29 Abstract Recent matrix completion

More information

Do Read Errors Matter for Genome Assembly?

Do Read Errors Matter for Genome Assembly? Do Read Errors Matter for Genome Assembly? Ilan Shomorony C Berkeley ilan.shomorony@berkeley.edu Thomas Courtade C Berkeley courtade@berkeley.edu David Tse Stanford niversity dntse@stanford.edu arxiv:50.0694v

More information

Phylogenetic Networks, Trees, and Clusters

Phylogenetic Networks, Trees, and Clusters Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University

More information

Perfect Phylogenetic Networks with Recombination Λ

Perfect Phylogenetic Networks with Recombination Λ Perfect Phylogenetic Networks with Recombination Λ Lusheng Wang Dept. of Computer Sci. City Univ. of Hong Kong 83 Tat Chee Avenue Hong Kong lwang@cs.cityu.edu.hk Kaizhong Zhang Dept. of Computer Sci. Univ.

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Sequence Alignment (chapter 6)

Sequence Alignment (chapter 6) Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:

More information

A Survey of Computational Methods for Determining Haplotypes

A Survey of Computational Methods for Determining Haplotypes A Survey of Computational Methods for Determining Haplotypes Bjarni V. Halldórsson, Vineet Bafna, Nathan Edwards, Ross Lippert, Shibu Yooseph, and Sorin Istrail Informatics Research, Celera Genomics/Applied

More information

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME

MATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:

More information

Integer Programming for Phylogenetic Network Problems

Integer Programming for Phylogenetic Network Problems Integer Programming for Phylogenetic Network Problems D. Gusfield University of California, Davis Presented at the National University of Singapore, July 27, 2015.! There are many important phylogeny problems

More information

Optimal XOR based (2,n)-Visual Cryptography Schemes

Optimal XOR based (2,n)-Visual Cryptography Schemes Optimal XOR based (2,n)-Visual Cryptography Schemes Feng Liu and ChuanKun Wu State Key Laboratory Of Information Security, Institute of Software Chinese Academy of Sciences, Beijing 0090, China Email:

More information

Better Approximations for the Minimum Common Integer Partition Problem

Better Approximations for the Minimum Common Integer Partition Problem Better Approximations for the Minimum Common Integer Partition Problem David P. Woodruff 1 MIT dpwood@mit.edu 2 Tsinghua University Abstract. In the k-minimum Common Integer Partition Problem, abbreviated

More information

Patterns of Simple Gene Assembly in Ciliates

Patterns of Simple Gene Assembly in Ciliates Patterns of Simple Gene Assembly in Ciliates Tero Harju Department of Mathematics, University of Turku Turku 20014 Finland harju@utu.fi Ion Petre Academy of Finland and Department of Information Technologies

More information

Lecture 4: September 19

Lecture 4: September 19 CSCI1810: Computational Molecular Biology Fall 2017 Lecture 4: September 19 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes

More information

Using Markov Chains To Model Human Migration in a Network Equilibrium Framework

Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Jie Pan Department of Mathematics and Computer Science Saint Joseph s University Philadelphia, PA 19131 Anna Nagurney School

More information

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.

More information

The Pure Parsimony Problem

The Pure Parsimony Problem Haplotyping and Minimum Diversity Graphs Courtney Davis - University of Utah - Trinity University Some Genetics Mother Paired Gene Representation Physical Trait ABABBA AAABBB Physical Trait ABA AAA Mother

More information

Scaffold Filling Under the Breakpoint Distance

Scaffold Filling Under the Breakpoint Distance Scaffold Filling Under the Breakpoint Distance Haitao Jiang 1,2, Chunfang Zheng 3, David Sankoff 4, and Binhai Zhu 1 1 Department of Computer Science, Montana State University, Bozeman, MT 59717-3880,

More information

Repeat resolution. This exposition is based on the following sources, which are all recommended reading:

Repeat resolution. This exposition is based on the following sources, which are all recommended reading: Repeat resolution This exposition is based on the following sources, which are all recommended reading: 1. Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions,

More information

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de

More information

Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem

Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem Lan Liu 1, Xi Chen 3, Jing Xiao 3, and Tao Jiang 1,2 1 Department of Computer Science and Engineering, University

More information

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough

Search. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough Search Search is a key component of intelligent problem solving Search can be used to Find a desired goal if time allows Get closer to the goal if time is not enough section 11 page 1 The size of the search

More information

Kleene Algebras and Algebraic Path Problems

Kleene Algebras and Algebraic Path Problems Kleene Algebras and Algebraic Path Problems Davis Foote May 8, 015 1 Regular Languages 1.1 Deterministic Finite Automata A deterministic finite automaton (DFA) is a model of computation that can simulate

More information

Lecture 3: Error Correcting Codes

Lecture 3: Error Correcting Codes CS 880: Pseudorandomness and Derandomization 1/30/2013 Lecture 3: Error Correcting Codes Instructors: Holger Dell and Dieter van Melkebeek Scribe: Xi Wu In this lecture we review some background on error

More information

Reductions for One-Way Functions

Reductions for One-Way Functions Reductions for One-Way Functions by Mark Liu A thesis submitted in partial fulfillment of the requirements for degree of Bachelor of Science (Honors Computer Science) from The University of Michigan 2013

More information

On Two Class-Constrained Versions of the Multiple Knapsack Problem

On Two Class-Constrained Versions of the Multiple Knapsack Problem On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic

More information

25 : Graphical induced structured input/output models

25 : Graphical induced structured input/output models 10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large

More information

Entropy Rate of Stochastic Processes

Entropy Rate of Stochastic Processes Entropy Rate of Stochastic Processes Timo Mulder tmamulder@gmail.com Jorn Peters jornpeters@gmail.com February 8, 205 The entropy rate of independent and identically distributed events can on average be

More information

Structure-Based Comparison of Biomolecules

Structure-Based Comparison of Biomolecules Structure-Based Comparison of Biomolecules Benedikt Christoph Wolters Seminar Bioinformatics Algorithms RWTH AACHEN 07/17/2015 Outline 1 Introduction and Motivation Protein Structure Hierarchy Protein

More information

arxiv: v1 [cs.cc] 15 Nov 2016

arxiv: v1 [cs.cc] 15 Nov 2016 Diploid Alignment is NP-hard Romeo Rizzi 1, Massimo Cairo 1, Veli Mäkinen 2, and Daniel Valenzuela 2 1 Department of Computer Science, University of Verona, Italy 2 Helsinki Institute for Information echnology,

More information

A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem

A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem Dmitry Korkin This work introduces a new parallel algorithm for computing a multiple longest common subsequence

More information

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8

The E-M Algorithm in Genetics. Biostatistics 666 Lecture 8 The E-M Algorithm in Genetics Biostatistics 666 Lecture 8 Maximum Likelihood Estimation of Allele Frequencies Find parameter estimates which make observed data most likely General approach, as long as

More information

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns.

The genome encodes biology as patterns or motifs. We search the genome for biologically important patterns. Curriculum, fourth lecture: Niels Richard Hansen November 30, 2011 NRH: Handout pages 1-8 (NRH: Sections 2.1-2.5) Keywords: binomial distribution, dice games, discrete probability distributions, geometric

More information

Stochastic processes and

Stochastic processes and Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University

More information

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST

More information

Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison

Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison University of New Orleans ScholarWorks@UNO University of New Orleans Theses and Dissertations Dissertations and Theses 8-7-2008 Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome

More information

Péter Ligeti Combinatorics on words and its applications

Péter Ligeti Combinatorics on words and its applications Eötvös Loránd University Institute of Mathematics Péter Ligeti Combinatorics on words and its applications theses of the doctoral thesis Doctoral school: Mathematics Director: Professor Miklós Laczkovich

More information

Learning ancestral genetic processes using nonparametric Bayesian models

Learning ancestral genetic processes using nonparametric Bayesian models Learning ancestral genetic processes using nonparametric Bayesian models Kyung-Ah Sohn October 31, 2011 Committee Members: Eric P. Xing, Chair Zoubin Ghahramani Russell Schwartz Kathryn Roeder Matthew

More information

6.842 Randomness and Computation Lecture 5

6.842 Randomness and Computation Lecture 5 6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its

More information

Marginal Screening and Post-Selection Inference

Marginal Screening and Post-Selection Inference Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2

More information

Outline. Complexity Theory. Example. Sketch of a log-space TM for palindromes. Log-space computations. Example VU , SS 2018

Outline. Complexity Theory. Example. Sketch of a log-space TM for palindromes. Log-space computations. Example VU , SS 2018 Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 3. Logarithmic Space Reinhard Pichler Institute of Logic and Computation DBAI Group TU Wien 3. Logarithmic Space 3.1 Computational

More information

A Genetic Algorithm Approach for Doing Misuse Detection in Audit Trail Files

A Genetic Algorithm Approach for Doing Misuse Detection in Audit Trail Files A Genetic Algorithm Approach for Doing Misuse Detection in Audit Trail Files Pedro A. Diaz-Gomez and Dean F. Hougen Robotics, Evolution, Adaptation, and Learning Laboratory (REAL Lab) School of Computer

More information

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification

On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and

More information

CS243, Logic and Computation Nondeterministic finite automata

CS243, Logic and Computation Nondeterministic finite automata CS243, Prof. Alvarez NONDETERMINISTIC FINITE AUTOMATA (NFA) Prof. Sergio A. Alvarez http://www.cs.bc.edu/ alvarez/ Maloney Hall, room 569 alvarez@cs.bc.edu Computer Science Department voice: (67) 552-4333

More information

EECS730: Introduction to Bioinformatics

EECS730: Introduction to Bioinformatics EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search

More information

The Incomplete Perfect Phylogeny Haplotype Problem

The Incomplete Perfect Phylogeny Haplotype Problem The Incomplete Perfect Phylogeny Haplotype Prolem Gad Kimmel 1 and Ron Shamir 1 School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. Email: kgad@tau.ac.il, rshamir@tau.ac.il. Astract.

More information

Reducing storage requirements for biological sequence comparison

Reducing storage requirements for biological sequence comparison Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,

More information

CS6901: review of Theory of Computation and Algorithms

CS6901: review of Theory of Computation and Algorithms CS6901: review of Theory of Computation and Algorithms Any mechanically (automatically) discretely computation of problem solving contains at least three components: - problem description - computational

More information

Turing Machines, diagonalization, the halting problem, reducibility

Turing Machines, diagonalization, the halting problem, reducibility Notes on Computer Theory Last updated: September, 015 Turing Machines, diagonalization, the halting problem, reducibility 1 Turing Machines A Turing machine is a state machine, similar to the ones we have

More information

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from

More information

1 Randomized complexity

1 Randomized complexity 80240233: Complexity of Computation Lecture 6 ITCS, Tsinghua Univesity, Fall 2007 23 October 2007 Instructor: Elad Verbin Notes by: Zhang Zhiqiang and Yu Wei 1 Randomized complexity So far our notion of

More information

Pr[X = s Y = t] = Pr[X = s] Pr[Y = t]

Pr[X = s Y = t] = Pr[X = s] Pr[Y = t] Homework 4 By: John Steinberger Problem 1. Recall that a real n n matrix A is positive semidefinite if A is symmetric and x T Ax 0 for all x R n. Assume A is a real n n matrix. Show TFAE 1 : (a) A is positive

More information