Journal of Computational Biology. Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments
|
|
- Marilynn Sims
- 5 years ago
- Views:
Transcription
1 : Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments Journal: Manuscript ID: Manuscript Type: Date Submitted by the Author: Complete List of Authors: Keyword: draft Original Paper n/a Chen, Zhixiang; University of Texas-Pan American, Department of Computer Science Fu, Bin; University of Texas-Pan American, Computer Science Schweller, Robert; University of Texas-Pan American, Computer Science Yang, Boting; University of Regina, Computer Science Zhao, Zhiyu; University of New Orleans, Computer Science Zhu, Binhai; Montana State University, Computer Science haplotypes, algorithms, probability, alignment, SEQUENCE ANALYSIS Abstract: In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem - the incompleteness and inconsistency occurred in the DNA sequencing process to generate the input haplotype fragments and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration.
2 Page of Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments Zhixiang Chen Bin Fu Robert Schweller Boting Yang Zhiyu Zhao Binhai Zhu Abstract In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem - the incompleteness and inconsistency occurred in the DNA sequencing process to generate the input haplotype fragments and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration. Keywords: Singular haplotype reconstruction, SNP fragments, probabilistic modeling and analysis, linear time probabilistic algorithm, inconsistency and incompleteness errors.. Introduction Abstractly, a genome can be considered a string over the alphabet of nucleotides {A, G, C, T }. It is accepted that the genomes between any two humans are over % identical (Terwilliger and Weiss, ; Hoehe et al., 000). The remaining sites which exhibit substantial variation (in at least % of the population) are called Single Nucleotide Polymorphisms (SNPs). The values of a set of SNPs on a particular chromosome copy define a haplotype. While a haplotype is a string over the four nucleotide bases, typical SNP sites only vary Corresponding author. Department of Computer Science, University of Texas-Pan American, Edinburg, TX, USA. s: {chen, binfu, schwellerr}@cs.panam.edu, Phone: ()-0, Fax: ()-0. Department of Computer Science, University of Regina, Saskatchewan, SS 0A, Canada. boting@cs.uregina.ca, Phone:(0)-, Fax: (0)-. Department of Computer Science, University of New Orleans, New Orleans, LA 0, USA. zzha@cs.uno.edu, Phone: (0)0-, Fax: (0)0-. Department of Computer Science, Montana State University, Bozeman, MT. USA. bhz@cs.montana.edu, Phone: (0)-, Fax: (0)-.
3 Page of between two values. Therefore, we can logically represent a haplotype as a binary string over the alphabet {A, B}. The haplotype of an individual chromosome can be thought of as a genetic finger print specifying the bulk of the genetic variability among distinct members of the same species. Determining haplotypes is thus a key step in the analysis of genetic variation. In recent years, the haplotyping problem has been extensively studied, see, f.g., (G. Lancia, 00; Zhao et al., 00; Wang et al., 00; Zhang et al., 00; Bafna et al., 00; Rizzi et al., 00; Wang et al., 00; Lippert et al., 00; Cilibrasi et al., 00; Lancia et al., 00; G. Lancia, 00; Alessandro Panconesi, 00; Clark, 0; Gusfield, 000, 00; Xie and Wang, 00; Li et al., 00; Genovese et al., 00). There are several version of the haplotyping problem. Of most interest is the haplotyping of diploid organisms, such as humans, which have two chromosomes, and thus two haplotypes. A common method to obtain an individual s two haplotypes is to first obtain a collection of sequencing data from the individual s chromosomes. This data consists of a collection of fragments, each partially specifying the sequence of one of the two base chromosomes. This data typically contains two types of errors. The data is incomplete in that each fragment will fail to assign a base to some collection of positions, and inconsistent in that some positions may be sequenced with an incorrect value. Further, in practice it is not known which chromosome is being sequenced by which fragments. The general haplotype reconstruction problem is then to take this input set of sequenced fragments and derive the two haplotypes which yielded the given fragments. This problem, which we consider, is the singular haplotype reconstruction problem and, like other versions of the problem, has also been extensively studied, see, f.g., (Cilibrasi et al., 00; Wang et al., 00; Lippert et al., 00; Bafna et al., 00). More concretely, we assume we are given a collection of n length m strings over the alphabet {A, B, } denoting the sequence of each fragment, where denotes a hole in which no value is measured at that position for that fragment. From this basic setup, a number of problems have been considered for various different optimization functions. In particular, one can consider removing certain fragments from the input data such that the remaining set of fragments can be divided into two separate sets, each set consisting of fragments that do not contain any conflicts (a site in which two strings contain different, non-hole values). Such a removal implicity derives two haplotypes corresponding to the two groups. This problem is referred to as the minimum fragment removal problem (MFR). A second problem involves removing or discounting a subset of the SNP sites to create two consistent sets of fragments. This is referred to as the minimum SNP removal problem (MSR). Finally, a third approach is to flip or correct a number of sites for various fragments. This is referred to as the minimum error correction problem (MEC). While much work has been done on these problems, the incompleteness and inconsistency of data fragments makes most versions of these haplotyping problems NP-hard or even hard to approximate (f.g., (Cili-
4 Page of brasi et al., 00; Lippert et al., 00; Bafna et al., 00)), and many elegant and powerful methods such as those in (Li et al., 00) cannot be used to deal with incompleteness and inconsistency at the same time. Other methods have proposed heuristics (Genovese et al., 00; Alessandro Panconesi, 00), but do not provide provably good results. In this paper, we develop a probabilistic approach to overcome some of the difficulties caused by the incompleteness and inconsistency occurred in the input fragments. In our model, we assume the input data fragments are attained from two unknown haplotype strings. Some number of the fragments are derived from the first haplotype, some from the second haplotype. Each fragment is assumed to be generated according to two error parameters, α and α, representing inconsistency errors and incompleteness errors respectively. These parameters represent the percentage chance that any given position will be sequenced incorrectly from the base haplotype, or will be omitted as a hole, respectively. Further, we assume a third parameter β denoting a minimum difference percentage between the two base haplotype strings. With respect to these three parameters, we develop algorithms that output the base haplotypes with probability as a function of α, α, and β. In particular, we design three algorithms in our probabilistic model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. The first algorithm assumes prior knowledge of the α parameter. The second algorithm does not require prior knowledge of any parameters at a modest cost to run time. The third does not require prior knowledge of parameters and attains similar analytical bounds as algorithm, yet exhibits a substantial empirical improvement in accuracy. For all algorithms, we present experimental results that conform with the respective theoretically efficient performance. The software of our algorithms is available for public access and for real-time on-line demonstration. Paper Layout In Section we formally define the problem and the probabilistic model we use. In Section we develop some technical lemmas. In Section we provide a haplotype reconstruction algorithm that utilizes prior knowledge of an error parameter. In Section we present an algorithm that requires no prior knowledge of model parameters. In Section we provide a third algorithm with provably good accuracy that also performs particularly well in practice. In Section we detail experimental results for our algorithms.
5 Page of A Probabilistic Model Assume that we have two haplotypes H, H, denoted as H = a a a m and H = b b b m. Let Γ = {S, S,..., S n } be a set of n fragments obtained from the DNA sequencing process with respect to the two haplotypes H and H. In this case, each S i = c c c m is either a fragment of H or H. Because we lose the information concerning the DNA strand to which a fragment belongs, we do not know whether S i is a fragment of H or H. Suppose that S i is a fragment of H. Because of reading errors or corruptions that may occur during the sequencing process, there is a small chance that either c j - but c j a j, or c j = -, for j m, where the symbol - denotes a hole or missing value. For the former, the information of the fragment S i at the j-th SNP site is inconsistent, and we use α to denote the rate of this type of inconsistency error. For the latter, the information of S i at the j-th SNP is incomplete, and we use α to denote the rate of this type of incompleteness error. It is known (f.g., (Wang et al., 00; Lippert et al., 00; Bafna et al., 00)) that α and α are in practice between % to %. Also, it is realistically reasonable to believe that the dissimilarity, denoted by β, between the two haplotypes H and H is big. Often, β is measured using the Hamming distance between H and H divided by the length m of H and H, and is assumed to be large, say, β 0.. It is also often assumed that roughly half of the fragments in Γ are from each of the two haplotypes H and H. In the experimental studies of algorithmic solutions to the singular haplotype reconstruction problem, we often need to generate synthetic data to evaluate the performance and accuracy of a given algorithm. One common practice (f.g., (Wang et al., 00; Lippert et al., 00; Bafna et al., 00)) is as follows: First, choose two haplotypes H and H such that the dissimilarity between H and H is at least β. Second, make n i copies of H i, i =,. Third, for each copy H = a a a m of H i, for each i =,,..., m, with probability α, flip a i to a i so that they are inconsistent. Also, independently, a i has probability α to be a hole -. A synthetic data set is then generated by setting parameters m, n, n, β, α and α. Usually, n is roughly the same as n, and β 0., α [0.0, 0.0], and α [0., 0.]. Motivated by the above reality of the sequencing process and the common practice in experimental algorithm studies, we will present a probabilistic model for the singular haplotype reconstruction problem. But first we need to introduce some necessary notations and definitions. Let Σ = {A, B} and Σ = {A, B, -}. For a set C, C denotes the number of elements in C. For a fragment (or a sequence) S = a a a m Σ m, S[i] denotes the character a i, and S[i, j] denotes the substring a i a j for i j m. S denotes the length m of S. When no confusion arises, we alternatively use the terms fragment and sequence. Let G = g g g m Σ m be a fixed sequence of m characters. For any sequence S = a a m Σ m, S
6 Page of is called a F α,α (m, G) sequence if for each a i, with probability at most α, a i is not equal to g i and a i -; and with probability at most α, a i = -. For a sequence S, define holes(s) to be the number of holes in the sequence S. If A is a subset of {,, m} and S is a sequence of length m, holes A (S) is the number of i A such that S[i] is a hole. For two sequences S = a a m and S = b b m of the same length m, for any A {,, m}, define diff(s, S ) = {i {,,, m} a i - and b i - and a i b i } m diff A (S, S ) = {i A a i - and b i - and a i b i }. A For a set of sequences Γ = {S, S,, S k } of length m, define vote(γ) to be the sequence H of the same length m such that H[i] is the most frequent character among S [i], S [i],, S k [i] for i =,,, m. We often use an n m matrix M to represent a list of n fragments from Σ m matrix. For i n, let M[i] represent the i-th row of M, i.e., M[i] is a fragment in Σ m. We now define our probabilistic model: and call M an SNP fragment The Probabilistic Singular Haplotype Reconstruction Problem: Let β, α and α be small positive constants. Let G, G Σ m be two haplotypes with diff(g, G ) β. For any given n m matrix M of SNP fragments such that n i rows of M are F α,α (m, G i ) sequences, i =,, n + n = n, reconstruct the two haplotypes G and G, which are unknown to the users, from M as accurately as possible with high probability. We call β (resp., α, α ) dissimilarity rate (resp., inconsistency error rate, incompleteness error rate).. Technical Lemmas For probabilistic analysis we need the following two Chernoff bounds (Motwani and Raghavan, 000, see), which can be derived from the work in (Li et al., 00). Lemma. (Li et al., 00) Let X,, X n be n independent random 0, variables, where X i takes with probability at most p. Let X = n i= X i. Then for any ɛ > 0, Pr(X > pn + ɛn) < e nɛ. Lemma. (Li et al., 00) Let X,, X n be n independent random 0, variables, where X i takes with probability at least p. Let X = n i= X i. Then for any ɛ > 0, Pr(X < pn ɛn) < e nɛ. We shall prove several technical lemmas for algorithm analysis in the next three sections.
7 Page of Lemma. Let S be a F α,α (m, G) sequence. Then, for any 0 < ɛ, with probability at most e ɛ m, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m. Proof: Let X k, k =, m, be random variables such that X k = if S[k] G i [k] and S[k], or 0 otherwise. By the definition of the F α,α (m, G) sequences, X k are independent and P r(x k = ) α. So, by Lemma, with probability at most e ɛ m, X + + X m > (α + ɛ)m Thus, we have diff(g, S) > α + ɛ with probability at most e ɛ m. Similarly, with probability at most e ɛ m, holes(s) > (α + ɛ)m. Lemma. Assume that A is a fixed subset of {,,, m}. Let S be a F α,α (m, G) sequence. Then, for any 0 < ɛ, with probability at most e ɛ A, diff A (G i, S) > α + ɛ or holes A (S) > (α + ɛ) A. Proof: Let S be the subsequence consisting of all the characters S[i], i A, with the same order as in S. Similarly, let G be the subsequence consisting of all the characters G[i], i A, with the same order as in G. It is easy to see that diff A (S, G i ) = diff(s, G ). The lemma follows from a similar proof for Lemma. Lemma. Let N i be a set of n i many F α,α (m, G i ) sequences, i =,. Let β and ɛ be two positive constants such that α +α +ɛ <, and diff(g, G ) β. Then, with probability at most (n +n )e ɛ βm, diff(s i, S j ) β( α α ɛ) for some S i N i and some S j N j with i j. Proof: For each G i, let A i be the set of indexes {k {,,, m} G i [k] G j [k]}, where i j. Since diff(g i, G j ) β and G i = G j = m, we have A i βm. For any F α,α (m, G i ) sequence S, by Lemma, with probability at most e ɛ A i e ɛ βm, diff Ai (S, G i ) > α +ɛ or holes Ai (S) > (α +ɛ) A i. Hence, with probability at most n i e ɛ βm, diff Ai (S, G i ) > α +ɛ or holes Ai (S) > (α +ɛ) A i, for some S N i. Therefore, with probability at most (n + n )e ɛ βm, we have diff Ai (S, G i ) > α + ɛ or holes Ai (S) > (α + ɛ) A i,j, for some S N i, for some i = or. In other words, with probability at least (n + n )e ɛ βm, we have diff Ai (S, G i ) α + ɛ and holes Ai (S) (α + ɛ) A i,j, for all S N i and for i = and. For any F α,α (m, G i ) sequence S i, i =,, if diff Ai (S i, G i ) α + ɛ and holes Ai (S i ) (α + ɛ) A i, then diff(s, S ) diff Ai (S, S ) β( α α ɛ). Thus, with probability at least (n +n )e ɛ βm, we have diff(s, S ) β( α α ɛ), for every S N and every S N. In words, with probability at most (n +n )e ɛ βm, we have diff(s, S ) < β( α α ɛ), for some S N and some S N. Lemma. Let α, α and ɛ be three small positive constants that satisfy 0 < α + α ɛ <. Assume that N = {S,, S n } is a set of F α,α (m, G) sequences. Let H = vote(n). Then, with probability at most m(e ɛ n ), G H.
8 Page of Proof: Given any j m, for any i n, let X i be random variables such that X i = if S i [j] G[j] and S i [j], or 0 otherwise. By the definition of the F α,α (m, G) sequences, X i are independent and P r(x i = ) α. So, by Lemma, with probability at most e ɛ n, X + +X n < (α ɛ)n. That is, with probability at most e ɛ n, there are fewer than (α ɛ)n characters S i [j] such that S i [j] G[j] and S i [j] -. Similarly, with probability at most e ɛ n, there are fewer than (α ɛ)n characters S i [j] such that S i [j] = -. Thus, with probability at most me ɛ n, there are fewer than (α + α ɛ)n characters S i [j] such that S i [j] G[j] for some j m. This implies that, with probability at least me ɛ n, there are more than ( α α + ɛ)n characters S i [j] such that S i [j] = G[j] for any j m. Since 0 < α +α ɛ < by the assumption of the theorem, we have (α +ɛ)n < ( α α +ɛ)n. This further implies that with probability at least me ɛ n, vote(n)[j] = G[j] for any j m, i.e., vote(n) = G.. When the Inconsistency Error Parameter Is Known Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy (α + ɛ) < β and 0 < α + α ɛ <. Let G, G Σ m be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. There exists an O(nm) time algorithm that can find two haplotypes H and H with probability at least ne ɛ m me ɛ n me ɛ n such that either H = G and H = G, or H = G and H = G. Proof: The algorithm, denoted as SHR-One, is described as follows. Algorithm SHR-One Input: M, an n m matrix of SNP fragments. Parameters α and ɛ. Output: Two haplotypes H and H. Set Γ = Γ =. Randomly select a fragment r = M[j] for some j n. For every fragment r from M do If (diff(r, r ) (α + ɛ)) then put r into Γ Let Γ = M Γ. Let H = vote(γ ) and H = vote(γ ). return H and H.
9 Page of End of Algorithm Claim. With probability at most ne ɛ m, we have either diff(f, G ) > α + ɛ for some F α,α (m, G ) sequence f in M, or diff(g, G ) > α + ɛ for some F α,α (m, G ) sequence g in M. By Lemma, for any fragment f = M[k] such that f is a F α,α (m, G ) sequence, with probability at most e ɛ m we have diff(f, G ) > α + ɛ. Since there are n many F α,α (m, G ) sequences in M, with probability at most n e ɛ m, we have diff(f, G ) > α +ɛ for some F α,α (m, G ) sequence f in M. Similarly, with probability at most n e ɛ m, we have diff(g, G ) > α + ɛ for some F α,α (m, G ) sequence g in M. Combining the above completes the proof for Claim. Claim. Let M i be the set of all the F α,α (m, G i ) sequences in M, i =,. With probability at least ne ɛ m, Γ and Γ is a permutation of M and M. By the assumption of the theorem, the fragment r of M is either a F α,α (m, G ) sequence or a F α,α (m, G ) sequence. We assume that the former is true. By Claim, with probability at least ne ɛ m, we have diff(f, G ) α + ɛ for all F α,α (m, G ) sequences f in M, and diff(g, G ) α + ɛ for all F α,α (m, G ) sequences g in M. Hence, for any fragment r in M, if r is a F α,α (m, G ) sequence, then with probability at least ne ɛ m, we have diff(r, r ) diff(r, G ) + diff(r, G ) (α + ɛ). This means that, with probability at least ne ɛ m, all F α,α (m, G ) sequences in M will be included in Γ. Now, consider the case that r is a F α,α (m, G ) sequence in M. Since diff(g, G ) diff(g, r) + diff(r, G ) diff(g, r) + diff(r, r ) + diff(r, G ), we have diff(r, r ) diff(g, G ) diff(g, r) diff(g, r ). By the given condition of diff(g, G ) β and (α + ɛ) < β, with probability at least ne ɛ m, we have diff(r, r ) β diff(g, r) diff(g, r ) β (α + ɛ) > (α + ɛ), i.e., r will not be added to Γ. Therefore, with probability at least ne ɛ m, Γ = M and Γ = M Γ = M. Similarly, if r is a F α,α (m, G ) sequence, with probability at least ne ɛ m, Γ = M and Γ = M Γ = M. This completes the proof of Claim. Suppose that Γ and Γ is a permutation of M and M. Say, without loss of generality, Γ = M and Γ = M. By Lemma, with probability at most me ɛ n + me ɛ n, vote(γ ) G or vote(γ ) G. Hence, by Claim, with probability at most ne ɛ m + me ɛ n + me ɛ n, vote(γ ) G or vote(γ ) G. Concerning the computational time of the algorithm, we need to compute the difference between the selected fragment r and each of the remaining n fragments in the matrix M. Finding the difference between r and r takes O(m) steps. So, the total computational time is O(nm), which is linear in the size of the input matrix M.
10 Page of When Parameters Are Not Known In this section, we consider the case that the parameters α, α and β are unknown. However, we assume the existence of those parameters for the input matrix M of SNP fragments. We will show that in this case we can still reconstruct the two unknown haplotypes from M with high probability. Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy α + α + ɛ <, 0 < α +α ɛ <, and β( α α ɛ) > (α +ɛ). Let G, G Σ m be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. Then, there exists an O(umn) time algorithm that can find two haplotypes H and H with probability at least ( γ) u ne ɛ βm me ɛ n me ɛ n such that H, H is a permutation of G, G, where γ = Proof: nn n(n ) The algorithm, denoted as SHR-Two, is described as follows. Algorithm SHR-Two Input: M, an n m matrix M of SNP fragments. u, a parameter to control the loop. Output: two haplotypes H and H. Let d min = and M =. For (k = to u) do { //the k-loop } Let M = M = and d = d = 0. Randomly select two fragments r = M[i ], r = M[i ] from M For every fragment r from M do { } and u is an integer parameter. If (diff(r i, r ) = min{diff(r, r ), diff(r, r )} for i = or, then put r into M i. Let d i = max{diff(r i, r ) r M i } for i =,. Let d = max{d, d }. If (d < d min ) then let M = {M, M } and d min = d. return H = vote(m ) and H = vote(m ). End of Algorithm Claim. With probability at most ( γ) u, r, r is not a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in all of the k-loop iterations. For randomly selected fragments r and r, with probability γ, r, r is a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in M. When the k-loop is repeated u times, with probability at most
11 Page 0 of ( γ) u, r, r is not a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in all of the u loop iterations. Thus, Claim is true. Let N i be the set of the n i fragments in M that are F α,α (m, G i ) sequences, i =,. Claim. With probability at most ne ɛ βm, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m for some S from N i for some i = or ; or diff(s, S ) β( α α ɛ) for some S N and some S N. By Lemma, for every fragment S from N i, with probability at most e ɛ m, diff(gi, S) > α + ɛ or S has more than (α + ɛ)m holes. Thus, with probability at most ne ɛ m, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m for some S from N i for some i = or. By Lemma, with probability at most ne ɛ βm, diff(s, S ) β( α α ɛ) for some S N and some S N. The above analysis completes the proof for Claim. Claim. Let H = vote(m ) and H = vote(m ) be the two haplotypes returned by the algorithm. With probability at most ( γ) u + ne ɛ βm, M, M is not a permutation of N, N. We assume that () diff(s, S ) > β( α α ɛ) for every S from N and every S from N ; and () diff(g i, S) α + ɛ and holes(s) (α + ɛ)m for all S N i for i =,. We consider possible choices of the two random fragments r and r in the following. At any iteration of the k-loop, if r N and r N, then by () we have diff(r, r ) diff(r, G ) + diff(r, G ) (α +ɛ) for any r N ; and diff(r, r ) diff(r, G )+diff(r, G ) (α +ɛ) for any r N. By () and the given condition of the theorem, we have, diff(r, r ) > β( α α ɛ) > (α + ɛ) for any r N ; and diff(r, r ) > β( α α ɛ) > (α + ɛ) for any r N. This implies that at this loop iteration we have M = N, M = N and d (α + ɛ). Similarly, if at this iteration r N and r N, then M = N, M = N and d (α + ɛ). If r, r N at some iteration of the k-loop, then for any r N, either r M or r M. In either case, by () of our assumption and the given condition of the theorem, we have d β( α α ɛ) > (α + ɛ) at this iteration. Similarly, if r, r N at some iteration of the k-loop, then we also have d > (α + ɛ) at this iteration. It follows from the above analysis that, under the assumption of () and (), once we have r N and r N or r N and r N at some iteration of the k-loop, then M, M is a permutation of N, N at the end of this iteration. Furthermore, if M and M are replaced by M and M after this iteration, then M, M must also be a permutation of N, N. By Claims and, with probability at most ( γ) u + ne ɛ βm, the assumption of () and () is not true, or r N and r N (or r N and r N ) is not true at all the iterations of the k-loop. Hence, with probability at most ( γ) u + ne ɛ βm, the final list of M and M returned by the algorithm is not a permutation of N, N, so the claim is proved. 0
12 Page of For M and M returned by the algorithm, we assume without loss of generality M i = N i, i =,. By Lemma and the given condition of the theorem, with probability at most me ɛ n + me ɛ n, we have H = vote(m ) G or H = vote(m ) G. Thus, by Claim, with probability at most ( γ) u + ne ɛ βm + me ɛ n, we have H G or H G. It is easy to see that the time complexity of the algorithm is O(umn), which is linear in the size of M for a constant u.. Tuning the Dissimilarity Measure In this section, we consider a different dissimilarity measure in algorithm SHR-TWO to improve its ability to tolerate errors. We use the sum of the differences between r i and every fragment r M i, i =,, to measure the dissimilarity of the fragments in M i with r i. The new algorithm SHR-Three is given in the following. We will present experimental results in Section to show that algorithm SHR-Three is more reliable and robust in dealing with possible outliers in the data sets. Algorithm SHR-Three Input: M, an n m matrix of SNP fragments. u, a parameter to control the loop. Output: two haplotypes H and H. } Let d min = and M =. For (k = to u) do { //the k-loop Let M = M = and d = d = 0. Randomly select two fragments r = M[i ], r = M[i ] from M For every fragment r from M do { } If (diff(r i, r ) = min{diff(r, r ), diff(r, r )} for i = or, then put r into M i. Let d i = r M i diff(r i, r ) for i =,. Let d = max{d, d }. If (d < d min ) then let M = {M, M } and d min = d. return H = vote(m ) and H = vote(m ). End of Algorithm Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy α + α + ɛ <,
13 Page of < α + α ɛ <, η > G, G Σ m (α +ɛ) β( α α ɛ) min(n,n) with η = n, and β( α α ɛ) > (α + ɛ). Let be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. Then, there exists an O(umn) time algorithm that can find two haplotypes H and H with probability at least ( γ) u ne ɛ βm me ɛ n me ɛ n such that H, H is a permutation of G, G, where γ = nn n(n ) and u is an integer parameter. Proof: Let N i be the set of the n i many F α,α (m, G i ) sequences in M, for i =,. We first notice that both Claims and in the proof of Theorem hold here following the same analysis. However, we need to prove the following claim with different analysis: Claim. Let H = vote(m ) and H = vote(m ) be the two haplotypes returned by the algorithm. With probability at most ( γ) u + ne ɛ βm, M, M is not a permutation of N, N. We assume that () diff(s, S ) > β( α α ɛ) for every S from N and every S from N ; and () diff(g i, S) α + ɛ and holes(s) (α + ɛ)m for each S from N i (i =, ). We shall consider the two cases: Case. At some iteration of the k-loop, both r and r are selected from the same N i for i = or. For each r N j, j i, by assumption () we have both diff(r i, r ) > β( α α ɛ), i =,. Notice that r must be either in M or M. So, at least half of the fragments in N j will be either in M or M. Therefore, at this iteration, we have d n jβ( α α ɛ) ηnβ( α α ɛ) = ηβ( α α ɛ)n > (α + ɛ)n. Case. At some iteration of the k-loop, r and r are selected from different N i for i = and. Without loss of generality, r i N i, i =,. For each r N, by assumption () we have diff(r, r ) (α + ɛ); by assumption () and the given condition of the theorem we have diff(r, r ) > β( α α ɛ) > (α +ɛ). Similarly, for each r N, diff(r, r ) (α + ɛ), and diff(r, r ) > (α + ɛ). Therefore, at this iteration, we have M = N and M = N, and d (α + ɛ)n + (α + ɛ)n = (α + ɛ)n. The two cases implies that under the assumption of () and (), if at any iteration of the k-loop, we have r N and r N, or r N and r N, then the final list of the two sets M, M is a permutation of N, N. Hence, Claim follows from Claims and in the proof of Theorem, which are true here as we mentioned earlier. Now, we assume that the final list of the two sets M, M is a permutation of N, N. By Lemma, with probability at most m(e δ n ) + m(e δ n ), H = vote(m ), H = vote(m ) is not a permutation of G, G. This, together with Claim, completes the probabilistic claim of the theorem. It is easy to see that the time complexity of the algorithm is O(umn), which is linear in the size of M.
14 Page of Experimental Results We design a MATLAB program to test both the accuracy and the speed of algorithm SHR-Three. Due to the difficulty of getting real data from the public domain (Alessandro Panconesi, 00), our experiment data is created following the common practice in literature such as (Wang et al., 00; Alessandro Panconesi, 00). A random matrix of SNP fragments is created as follows: () Haplotype is generated at random with length m (m {0, 00, 0}). () Haplotype is generated by copying all the bits from haplotype and flipping each bit with probability β (β {0., 0., 0.}). This simulates the dissimilarity rate β between two haplotypes. () Each haplotype is copied n times so that the matrix has m columns and n(n {0, 0, 0}) rows. () Set each bit in the matrix to - with probability α (α {0., 0., 0.}). This simulates the incompleteness error rate α in the matrix. () Flip each non-empty bit with probability α (α {0.0, 0.0,..., 0.}). This is the simulation of the inconsistency error rate of α. Tables to show the performance of algorithm SHR-Three with different parameter settings in accordance with those in (Alessandro Panconesi, 00). The typical parameters used in (Alessandro Panconesi, 00) are m = 00, n = 0, β = 0., α = 0. and 0.0 α 0.0. These are considered to be close to the real situations. In our tables, the results are the average time and the reconstruction rate of the 000 executions of algorithm SHR-Three. A new random matrix is used for each execution. The reconstruction rate is defined as the ratio of the total number of correctly reconstructed bits to the total number of bits in two haplotypes. The computing environment is a PC machine with a typical configuration of.ghz AMD Turion X CPUs and GB memory.. Concluding Remarks The software of our algorithms is available for public access and for real-time on-line demonstration at We thank Liqiang Wang for implementing the programs in Java and setting up this web site. We have developed three linear time probabilistic algorithms for the singular haplotype reconstruction problem with provable high success probability. Our experimental results conform with the theoretical efficiency and accuracy of the algorithms. It should be pointed out that our work can be extended to reconstruct multiple haplotypes from a set of fragments. Our approach also opens the door to develop probabilistic methods for other variants of the
15 Page of haplotyping problem involving both inconsistency and incompleteness errors. References Alessandro Panconesi, M. S. (00). Fast Hare: A fast heuristic for single individual snp haplotype reconstruction. In Algorithms in Bioinformatics, th International Workshop, WABI 00, Lecture Notes in Computer Science 0, pages. Bafna, V., Istrail, S., Lancia, G., and Rizzi, R. (00). Polynomial and apx-hard cases of the individual haplotyping problem. Theoretical Computer Science, :0. Cilibrasi, R., van Iersel, L., Kelk, S., and Tromp, J. (00). On the complexity of several haplotyping problems. In Algorithms in Bioinformatics, th International Workshop, WABI 00, Lecture Notes in Computer Science, volume, pages. Clark, A. (0). Inference of haplotypes from pcr-amplified samples of diploid populations. Molecular Biology Evolution, :. G. Lancia, R. R. (00). A polynomial solution to a special case of parsimony haplotyping problem. Operations Research Letters, (0):. Genovese, L. M., Geraci, F., and Pellegrini, M. (00). A fast and accurate heuristic for the single individual SNP haplotyping problem with many gaps, high reading error rate and low coverage. Algorithms in Bioinformatics, : 0. Gusfield, D. (000). A practical algorithm for optimal inference of haplotype from diploid populations. In The Eighth International Conference on Intelligence Systems for Molecular Biology, pages. Gusfield, D. (00). Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In the Sixth Annual International Conference on Computational Biology, pages. Hoehe, M., Kopke, K., Wendel, B., Rohde, K., Flachmeier, C., Kidd, K., Berrettini, W., and Church, G. (000). Sequence variability and candidate gene analysis in complex disease: association of opioid receptor gene variation with substance dependence. Human Molecular Genetics, (): 0. Lancia, G., Pinotti, M. C., and Rizzi, R. (00). Haplotyping polulations by purs parsimongy: complexity and algorithms. INFORMS Journal on computing, :. Li, L., Kim, J. H., and Waterman, M. S. (00). Haplotype reconstruction from SNP alignment. Journal of Computational Biology, ( ):0.
16 Page of Li, M., Ma, B., and Wang, L. (00). On the closest string and substring problems. Journal of the ACM, ():. Lippert, R., Schwartz, R., Lancia, G., and Istrail, S. (00). Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in bioinformatics, :. Motwani, R. and Raghavan, P. (000). Randomized Algorithms. Cambridge. Rizzi, R., Bafna, V., Istrail, S., and Lancia, G. (00). Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In Algorithms in Bioinformatics: Second International Workshop, WABI 00, Rome, Italy, September -, pages. Terwilliger, J. and Weiss, K. (). Linkage disquilibrium mapping of complex disease: fantasy and reality? Current Opinion in Biotechnology, ():. Wang, R.-S., Wu, L.-Y., Li, Z.-P., and Zhang, X.-S. (00). Hyplotype reconstruction from SNP fragments by minimum error correction. BioInformatics, (0):. Wang, R.-S., Wu, L.-Y., Zhang, X.-S., and Chen, L. (00). A markov chain model for haplotype assembly from SNP fragments. Genome Informatics, ():. Xie, M. and Wang, J. (00). An improved (and practical) parameterized algorithm for the individual haplotyping problem MFR with mate-pairs. Algorithmica (online). Zhang, X.-S., Wang, R.-S., Wu, L.-Y., and Zhang, W. (00). Minimum conflict individual haplotyping from SNP fragments and related genotype. Evolutionary Bioinformatics Online, : 0. Zhao, Y., Wu, L., Zhang, J., and R. Wang, X. Z. (00). Haplotype assembly from aligned weighted SNP fragments. Computational Biology and Chemistry, :.
17 Page of α (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0, β = 0% and α = 0%
18 Page of n = 0 n = 0 α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, β = 0%, α = 0% β = 0% β = 0% α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0 and α = 0% α = 0% α = 0% α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0 and β = 0%
On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem
On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem Paola Bonizzoni, Riccardo Dondi, Gunnar W. Klau, Yuri Pirola, Nadia Pisanti and Simone Zaccaria DISCo, computer
More informationSAT in Bioinformatics: Making the Case with Haplotype Inference
SAT in Bioinformatics: Making the Case with Haplotype Inference Inês Lynce 1 and João Marques-Silva 2 1 IST/INESC-ID, Technical University of Lisbon, Portugal ines@sat.inesc-id.pt 2 School of Electronics
More informationHaplotype Inference Constrained by Plausible Haplotype Data
Haplotype Inference Constrained by Plausible Haplotype Data Michael R. Fellows 1, Tzvika Hartman 2, Danny Hermelin 3, Gad M. Landau 3,4, Frances Rosamond 1, and Liat Rozenberg 3 1 The University of Newcastle,
More informationAllen Holder - Trinity University
Haplotyping - Trinity University Population Problems - joint with Courtney Davis, University of Utah Single Individuals - joint with John Louie, Carrol College, and Lena Sherbakov, Williams University
More informationA Markov Chain Model for Haplotype Assembly from SNP Fragments
162 Genome Informatics 17(2): 162{171 (2006) A Markov Chain Model for Haplotype Assembly from SNP Fragments Rui-Sheng Wang 1;2 Ling-Yun Wu 3 wangrsh@amss.ac.cn wlyun@amt.ac.cn Xiang-Sun Zhang 3* Luonan
More informationHaplotyping as Perfect Phylogeny: A direct approach
Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable
More informationOn the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model
On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model Jens Gramm Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Germany. Tzvika Hartman Dept. of Computer Science,
More informationThe Minimum k-colored Subgraph Problem in Haplotyping and DNA Primer Selection
The Minimum k-colored Subgraph Problem in Haplotyping and DNA Primer Selection M.T. Hajiaghayi K. Jain K. Konwar L.C. Lau I.I. Măndoiu A. Russell A. Shvartsman V.V. Vazirani Abstract In this paper we consider
More informationHumans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase
Humans have two copies of each chromosome Inherited from mother and father. Genotyping technologies do not maintain the phase Genotyping technologies do not maintain the phase Recall that proximal SNPs
More informationLet S be a set of n species. A phylogeny is a rooted tree with n leaves, each of which is uniquely
JOURNAL OF COMPUTATIONAL BIOLOGY Volume 8, Number 1, 2001 Mary Ann Liebert, Inc. Pp. 69 78 Perfect Phylogenetic Networks with Recombination LUSHENG WANG, 1 KAIZHONG ZHANG, 2 and LOUXIN ZHANG 3 ABSTRACT
More informationConsensus Optimizing Both Distance Sum and Radius
Consensus Optimizing Both Distance Sum and Radius Amihood Amir 1, Gad M. Landau 2, Joong Chae Na 3, Heejin Park 4, Kunsoo Park 5, and Jeong Seop Sim 6 1 Bar-Ilan University, 52900 Ramat-Gan, Israel 2 University
More informationInteger Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.!
Integer Programming in Computational Biology D. Gusfield University of California, Davis Presented December 12, 2016. There are many important phylogeny problems that depart from simple tree models: Missing
More informationMathematical Approaches to the Pure Parsimony Problem
Mathematical Approaches to the Pure Parsimony Problem P. Blain a,, A. Holder b,, J. Silva c, and C. Vinzant d, July 29, 2005 Abstract Given the genetic information of a population, the Pure Parsimony problem
More informationA Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence
A Simple Linear Space Algorithm for Computing a Longest Common Increasing Subsequence Danlin Cai, Daxin Zhu, Lei Wang, and Xiaodong Wang Abstract This paper presents a linear space algorithm for finding
More informationAn Overview of Combinatorial Methods for Haplotype Inference
An Overview of Combinatorial Methods for Haplotype Inference Dan Gusfield 1 Department of Computer Science, University of California, Davis Davis, CA. 95616 Abstract A current high-priority phase of human
More informationSome Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm
Some Formal Analysis of Rocchio s Similarity-Based Relevance Feedback Algorithm Zhixiang Chen (chen@cs.panam.edu) Department of Computer Science, University of Texas-Pan American, 1201 West University
More informationPractical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem
Practical Algorithms and Fixed-Parameter Tractability for the Single Individual SNP Haplotyping Problem Romeo Rizzi 1,, Vineet Bafna 2, Sorin Istrail 2, and Giuseppe Lancia 3 1 Math. Dept., Università
More informationOn the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model
On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model Jens Gramm 1, Tzvika Hartman 2, Till Nierhoff 3, Roded Sharan 4, and Till Tantau 5 1 Wilhelm-Schickard-Institut für Informatik,
More informationGenetic Algorithms: Basic Principles and Applications
Genetic Algorithms: Basic Principles and Applications C. A. MURTHY MACHINE INTELLIGENCE UNIT INDIAN STATISTICAL INSTITUTE 203, B.T.ROAD KOLKATA-700108 e-mail: murthy@isical.ac.in Genetic algorithms (GAs)
More informationHaplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing
INFORMS Journal on Computing Vol. 22, No. 2, Spring 2010, pp. 195 209 issn 1091-9856 eissn 1526-5528 10 2202 0195 informs doi 10.1287/ijoc.1090.0333 2010 INFORMS A Class Representative Model for Pure Parsimony
More informationA Class Representative Model for Pure Parsimony Haplotyping
A Class Representative Model for Pure Parsimony Haplotyping Daniele Catanzaro, Alessandra Godi, and Martine Labbé June 5, 2008 Abstract Haplotyping estimation from aligned Single Nucleotide Polymorphism
More informationUpper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions
Upper Bounds on the Time and Space Complexity of Optimizing Additively Separable Functions Matthew J. Streeter Computer Science Department and Center for the Neural Basis of Cognition Carnegie Mellon University
More informationSNPs Problems, Complexity, and Algorithms
SNPs Problems, Complexity, and lgorithms Giuseppe Lancia 1,2, Vineet afna 1, Sorin Istrail 1, Ross Lippert 1, and Russell Schwartz 1 1 Celera Genomics, Rockville MD, US, {Giuseppe.Lancia,Vineet.afna,Sorin.Istrail,Ross.Lippert,
More informationAverage Case Complexity: Levin s Theory
Chapter 15 Average Case Complexity: Levin s Theory 1 needs more work Our study of complexity - NP-completeness, #P-completeness etc. thus far only concerned worst-case complexity. However, algorithms designers
More informationThis is a survey designed for mathematical programming people who do not know molecular biology and
INFORMS Journal on Computing Vol. 16, No. 3, Summer 2004, pp. 211 231 issn 0899-1499 eissn 1526-5528 04 1603 0211 informs doi 10.1287/ijoc.1040.0073 2004 INFORMS Opportunities for Combinatorial Optimization
More informationAlgorithms in Computational Biology (236522) spring 2008 Lecture #1
Algorithms in Computational Biology (236522) spring 2008 Lecture #1 Lecturer: Shlomo Moran, Taub 639, tel 4363 Office hours: 15:30-16:30/by appointment TA: Ilan Gronau, Taub 700, tel 4894 Office hours:??
More informationMinimization of Boolean Expressions Using Matrix Algebra
Minimization of Boolean Expressions Using Matrix Algebra Holger Schwender Collaborative Research Center SFB 475 University of Dortmund holger.schwender@udo.edu Abstract The more variables a logic expression
More informationBioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment
Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment Substitution score matrices, PAM, BLOSUM Needleman-Wunsch algorithm (Global) Smith-Waterman algorithm (Local) BLAST (local, heuristic) E-value
More informationImplementing Approximate Regularities
Implementing Approximate Regularities Manolis Christodoulakis Costas S. Iliopoulos Department of Computer Science King s College London Kunsoo Park School of Computer Science and Engineering, Seoul National
More informationComputational statistics
Computational statistics Combinatorial optimization Thierry Denœux February 2017 Thierry Denœux Computational statistics February 2017 1 / 37 Combinatorial optimization Assume we seek the maximum of f
More informationEVOLUTIONARY DISTANCES
EVOLUTIONARY DISTANCES FROM STRINGS TO TREES Luca Bortolussi 1 1 Dipartimento di Matematica ed Informatica Università degli studi di Trieste luca@dmi.units.it Trieste, 14 th November 2007 OUTLINE 1 STRINGS:
More informationCONTENTS. P A R T I Genomes 1. P A R T II Gene Transcription and Regulation 109
CONTENTS ix Preface xv Acknowledgments xxi Editors and contributors xxiv A computational micro primer xxvi P A R T I Genomes 1 1 Identifying the genetic basis of disease 3 Vineet Bafna 2 Pattern identification
More informationOn the Monotonicity of the String Correction Factor for Words with Mismatches
On the Monotonicity of the String Correction Factor for Words with Mismatches (extended abstract) Alberto Apostolico Georgia Tech & Univ. of Padova Cinzia Pizzi Univ. of Padova & Univ. of Helsinki Abstract.
More informationSINGLE nucleotide polymorphisms (SNPs) are differences
IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 3, JULY-SEPTEMBER 2006 303 Islands of Tractability for Parsimony Haplotyping Roded Sharan, Bjarni V. Halldórsson, and Sorin
More informationA Phylogenetic Network Construction due to Constrained Recombination
A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer
More informationComputing Bi-Clusters for Microarray Analysis
December 21, 2006 General Bi-clustering Problem Similarity of Rows (1-5) Cheng and Churchs model General Bi-clustering Problem Input: a n m matrix A. Output: a sub-matrix A P,Q of A such that the rows
More informationAnalysis and Design of Algorithms Dynamic Programming
Analysis and Design of Algorithms Dynamic Programming Lecture Notes by Dr. Wang, Rui Fall 2008 Department of Computer Science Ocean University of China November 6, 2009 Introduction 2 Introduction..................................................................
More informationEfficient Haplotype Inference with Boolean Satisfiability
Efficient Haplotype Inference with Boolean Satisfiability Joao Marques-Silva 1 and Ines Lynce 2 1 School of Electronics and Computer Science University of Southampton 2 INESC-ID/IST Technical University
More informationLecture 15: A Brief Look at PCP
IAS/PCMI Summer Session 2000 Clay Mathematics Undergraduate Program Basic Course on Computational Complexity Lecture 15: A Brief Look at PCP David Mix Barrington and Alexis Maciel August 4, 2000 1. Overview
More informationAlgorithms and Data S tructures Structures Complexity Complexit of Algorithms Ulf Leser
Algorithms and Data Structures Complexity of Algorithms Ulf Leser Content of this Lecture Efficiency of Algorithms Machine Model Complexity Examples Multiplication of two binary numbers (unit cost?) Exact
More informationLecture 2 Sep 5, 2017
CS 388R: Randomized Algorithms Fall 2017 Lecture 2 Sep 5, 2017 Prof. Eric Price Scribe: V. Orestis Papadigenopoulos and Patrick Rall NOTE: THESE NOTES HAVE NOT BEEN EDITED OR CHECKED FOR CORRECTNESS 1
More informationILP-Based Reduced Variable Neighborhood Search for Large-Scale Minimum Common String Partition
Available online at www.sciencedirect.com Electronic Notes in Discrete Mathematics 66 (2018) 15 22 www.elsevier.com/locate/endm ILP-Based Reduced Variable Neighborhood Search for Large-Scale Minimum Common
More informationNJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees
NJMerge: A generic technique for scaling phylogeny estimation methods and its application to species trees Erin Molloy and Tandy Warnow {emolloy2, warnow}@illinois.edu University of Illinois at Urbana
More informationAn Evolution Strategy for the Induction of Fuzzy Finite-state Automata
Journal of Mathematics and Statistics 2 (2): 386-390, 2006 ISSN 1549-3644 Science Publications, 2006 An Evolution Strategy for the Induction of Fuzzy Finite-state Automata 1,2 Mozhiwen and 1 Wanmin 1 College
More informationNotes 3: Stochastic channels and noisy coding theorem bound. 1 Model of information communication and noisy channel
Introduction to Coding Theory CMU: Spring 2010 Notes 3: Stochastic channels and noisy coding theorem bound January 2010 Lecturer: Venkatesan Guruswami Scribe: Venkatesan Guruswami We now turn to the basic
More informationCS6999 Probabilistic Methods in Integer Programming Randomized Rounding Andrew D. Smith April 2003
CS6999 Probabilistic Methods in Integer Programming Randomized Rounding April 2003 Overview 2 Background Randomized Rounding Handling Feasibility Derandomization Advanced Techniques Integer Programming
More informationMore Approximation Algorithms
CS 473: Algorithms, Spring 2018 More Approximation Algorithms Lecture 25 April 26, 2018 Most slides are courtesy Prof. Chekuri Ruta (UIUC) CS473 1 Spring 2018 1 / 28 Formal definition of approximation
More informationCS:4330 Theory of Computation Spring Regular Languages. Finite Automata and Regular Expressions. Haniel Barbosa
CS:4330 Theory of Computation Spring 2018 Regular Languages Finite Automata and Regular Expressions Haniel Barbosa Readings for this lecture Chapter 1 of [Sipser 1996], 3rd edition. Sections 1.1 and 1.3.
More informationA GENETIC ALGORITHM FOR FINITE STATE AUTOMATA
A GENETIC ALGORITHM FOR FINITE STATE AUTOMATA Aviral Takkar Computer Engineering Department, Delhi Technological University( Formerly Delhi College of Engineering), Shahbad Daulatpur, Main Bawana Road,
More informationSTUDY of genetic variations is a critical task to disease
Haplotype Assembly Using Manifold Optimization and Error Correction Mechanism Mohamad Mahdi Mohades, Sina Majidian, and Mohammad Hossein Kahaei arxiv:9.36v [math.oc] Jan 29 Abstract Recent matrix completion
More informationDo Read Errors Matter for Genome Assembly?
Do Read Errors Matter for Genome Assembly? Ilan Shomorony C Berkeley ilan.shomorony@berkeley.edu Thomas Courtade C Berkeley courtade@berkeley.edu David Tse Stanford niversity dntse@stanford.edu arxiv:50.0694v
More informationPhylogenetic Networks, Trees, and Clusters
Phylogenetic Networks, Trees, and Clusters Luay Nakhleh 1 and Li-San Wang 2 1 Department of Computer Science Rice University Houston, TX 77005, USA nakhleh@cs.rice.edu 2 Department of Biology University
More informationPerfect Phylogenetic Networks with Recombination Λ
Perfect Phylogenetic Networks with Recombination Λ Lusheng Wang Dept. of Computer Sci. City Univ. of Hong Kong 83 Tat Chee Avenue Hong Kong lwang@cs.cityu.edu.hk Kaizhong Zhang Dept. of Computer Sci. Univ.
More informationAlgorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment
Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot
More informationSequence Alignment (chapter 6)
Sequence lignment (chapter 6) he biological problem lobal alignment Local alignment Multiple alignment Introduction to bioinformatics, utumn 6 Background: comparative genomics Basic question in biology:
More informationA Survey of Computational Methods for Determining Haplotypes
A Survey of Computational Methods for Determining Haplotypes Bjarni V. Halldórsson, Vineet Bafna, Nathan Edwards, Ross Lippert, Shibu Yooseph, and Sorin Istrail Informatics Research, Celera Genomics/Applied
More informationMATHEMATICAL MODELS - Vol. III - Mathematical Modeling and the Human Genome - Hilary S. Booth MATHEMATICAL MODELING AND THE HUMAN GENOME
MATHEMATICAL MODELING AND THE HUMAN GENOME Hilary S. Booth Australian National University, Australia Keywords: Human genome, DNA, bioinformatics, sequence analysis, evolution. Contents 1. Introduction:
More informationInteger Programming for Phylogenetic Network Problems
Integer Programming for Phylogenetic Network Problems D. Gusfield University of California, Davis Presented at the National University of Singapore, July 27, 2015.! There are many important phylogeny problems
More informationOptimal XOR based (2,n)-Visual Cryptography Schemes
Optimal XOR based (2,n)-Visual Cryptography Schemes Feng Liu and ChuanKun Wu State Key Laboratory Of Information Security, Institute of Software Chinese Academy of Sciences, Beijing 0090, China Email:
More informationBetter Approximations for the Minimum Common Integer Partition Problem
Better Approximations for the Minimum Common Integer Partition Problem David P. Woodruff 1 MIT dpwood@mit.edu 2 Tsinghua University Abstract. In the k-minimum Common Integer Partition Problem, abbreviated
More informationPatterns of Simple Gene Assembly in Ciliates
Patterns of Simple Gene Assembly in Ciliates Tero Harju Department of Mathematics, University of Turku Turku 20014 Finland harju@utu.fi Ion Petre Academy of Finland and Department of Information Technologies
More informationLecture 4: September 19
CSCI1810: Computational Molecular Biology Fall 2017 Lecture 4: September 19 Lecturer: Sorin Istrail Scribe: Cyrus Cousins Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These notes
More informationUsing Markov Chains To Model Human Migration in a Network Equilibrium Framework
Using Markov Chains To Model Human Migration in a Network Equilibrium Framework Jie Pan Department of Mathematics and Computer Science Saint Joseph s University Philadelphia, PA 19131 Anna Nagurney School
More informationCS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding
CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.
More informationThe Pure Parsimony Problem
Haplotyping and Minimum Diversity Graphs Courtney Davis - University of Utah - Trinity University Some Genetics Mother Paired Gene Representation Physical Trait ABABBA AAABBB Physical Trait ABA AAA Mother
More informationScaffold Filling Under the Breakpoint Distance
Scaffold Filling Under the Breakpoint Distance Haitao Jiang 1,2, Chunfang Zheng 3, David Sankoff 4, and Binhai Zhu 1 1 Department of Computer Science, Montana State University, Bozeman, MT 59717-3880,
More informationRepeat resolution. This exposition is based on the following sources, which are all recommended reading:
Repeat resolution This exposition is based on the following sources, which are all recommended reading: 1. Separation of nearly identical repeats in shotgun assemblies using defined nucleotide positions,
More informationFinding Consensus Strings With Small Length Difference Between Input and Solution Strings
Finding Consensus Strings With Small Length Difference Between Input and Solution Strings Markus L. Schmid Trier University, Fachbereich IV Abteilung Informatikwissenschaften, D-54286 Trier, Germany, MSchmid@uni-trier.de
More informationComplexity and Approximation of the Minimum Recombination Haplotype Configuration Problem
Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem Lan Liu 1, Xi Chen 3, Jing Xiao 3, and Tao Jiang 1,2 1 Department of Computer Science and Engineering, University
More informationSearch. Search is a key component of intelligent problem solving. Get closer to the goal if time is not enough
Search Search is a key component of intelligent problem solving Search can be used to Find a desired goal if time allows Get closer to the goal if time is not enough section 11 page 1 The size of the search
More informationKleene Algebras and Algebraic Path Problems
Kleene Algebras and Algebraic Path Problems Davis Foote May 8, 015 1 Regular Languages 1.1 Deterministic Finite Automata A deterministic finite automaton (DFA) is a model of computation that can simulate
More informationLecture 3: Error Correcting Codes
CS 880: Pseudorandomness and Derandomization 1/30/2013 Lecture 3: Error Correcting Codes Instructors: Holger Dell and Dieter van Melkebeek Scribe: Xi Wu In this lecture we review some background on error
More informationReductions for One-Way Functions
Reductions for One-Way Functions by Mark Liu A thesis submitted in partial fulfillment of the requirements for degree of Bachelor of Science (Honors Computer Science) from The University of Michigan 2013
More informationOn Two Class-Constrained Versions of the Multiple Knapsack Problem
On Two Class-Constrained Versions of the Multiple Knapsack Problem Hadas Shachnai Tami Tamir Department of Computer Science The Technion, Haifa 32000, Israel Abstract We study two variants of the classic
More information25 : Graphical induced structured input/output models
10-708: Probabilistic Graphical Models 10-708, Spring 2016 25 : Graphical induced structured input/output models Lecturer: Eric P. Xing Scribes: Raied Aljadaany, Shi Zong, Chenchen Zhu Disclaimer: A large
More informationEntropy Rate of Stochastic Processes
Entropy Rate of Stochastic Processes Timo Mulder tmamulder@gmail.com Jorn Peters jornpeters@gmail.com February 8, 205 The entropy rate of independent and identically distributed events can on average be
More informationStructure-Based Comparison of Biomolecules
Structure-Based Comparison of Biomolecules Benedikt Christoph Wolters Seminar Bioinformatics Algorithms RWTH AACHEN 07/17/2015 Outline 1 Introduction and Motivation Protein Structure Hierarchy Protein
More informationarxiv: v1 [cs.cc] 15 Nov 2016
Diploid Alignment is NP-hard Romeo Rizzi 1, Massimo Cairo 1, Veli Mäkinen 2, and Daniel Valenzuela 2 1 Department of Computer Science, University of Verona, Italy 2 Helsinki Institute for Information echnology,
More informationA New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem
A New Dominant Point-Based Parallel Algorithm for Multiple Longest Common Subsequence Problem Dmitry Korkin This work introduces a new parallel algorithm for computing a multiple longest common subsequence
More informationThe E-M Algorithm in Genetics. Biostatistics 666 Lecture 8
The E-M Algorithm in Genetics Biostatistics 666 Lecture 8 Maximum Likelihood Estimation of Allele Frequencies Find parameter estimates which make observed data most likely General approach, as long as
More informationThe genome encodes biology as patterns or motifs. We search the genome for biologically important patterns.
Curriculum, fourth lecture: Niels Richard Hansen November 30, 2011 NRH: Handout pages 1-8 (NRH: Sections 2.1-2.5) Keywords: binomial distribution, dice games, discrete probability distributions, geometric
More informationStochastic processes and
Stochastic processes and Markov chains (part II) Wessel van Wieringen w.n.van.wieringen@vu.nl wieringen@vu nl Department of Epidemiology and Biostatistics, VUmc & Department of Mathematics, VU University
More informationCISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)
CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I) Contents Alignment algorithms Needleman-Wunsch (global alignment) Smith-Waterman (local alignment) Heuristic algorithms FASTA BLAST
More informationRobust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome Sequence Comparison
University of New Orleans ScholarWorks@UNO University of New Orleans Theses and Dissertations Dissertations and Theses 8-7-2008 Robust and Efficient Algorithms for Protein 3-D Structure Alignment and Genome
More informationPéter Ligeti Combinatorics on words and its applications
Eötvös Loránd University Institute of Mathematics Péter Ligeti Combinatorics on words and its applications theses of the doctoral thesis Doctoral school: Mathematics Director: Professor Miklós Laczkovich
More informationLearning ancestral genetic processes using nonparametric Bayesian models
Learning ancestral genetic processes using nonparametric Bayesian models Kyung-Ah Sohn October 31, 2011 Committee Members: Eric P. Xing, Chair Zoubin Ghahramani Russell Schwartz Kathryn Roeder Matthew
More information6.842 Randomness and Computation Lecture 5
6.842 Randomness and Computation 2012-02-22 Lecture 5 Lecturer: Ronitt Rubinfeld Scribe: Michael Forbes 1 Overview Today we will define the notion of a pairwise independent hash function, and discuss its
More informationMarginal Screening and Post-Selection Inference
Marginal Screening and Post-Selection Inference Ian McKeague August 13, 2017 Ian McKeague (Columbia University) Marginal Screening August 13, 2017 1 / 29 Outline 1 Background on Marginal Screening 2 2
More informationOutline. Complexity Theory. Example. Sketch of a log-space TM for palindromes. Log-space computations. Example VU , SS 2018
Complexity Theory Complexity Theory Outline Complexity Theory VU 181.142, SS 2018 3. Logarithmic Space Reinhard Pichler Institute of Logic and Computation DBAI Group TU Wien 3. Logarithmic Space 3.1 Computational
More informationA Genetic Algorithm Approach for Doing Misuse Detection in Audit Trail Files
A Genetic Algorithm Approach for Doing Misuse Detection in Audit Trail Files Pedro A. Diaz-Gomez and Dean F. Hougen Robotics, Evolution, Adaptation, and Learning Laboratory (REAL Lab) School of Computer
More informationOn the Problem of Error Propagation in Classifier Chains for Multi-Label Classification
On the Problem of Error Propagation in Classifier Chains for Multi-Label Classification Robin Senge, Juan José del Coz and Eyke Hüllermeier Draft version of a paper to appear in: L. Schmidt-Thieme and
More informationCS243, Logic and Computation Nondeterministic finite automata
CS243, Prof. Alvarez NONDETERMINISTIC FINITE AUTOMATA (NFA) Prof. Sergio A. Alvarez http://www.cs.bc.edu/ alvarez/ Maloney Hall, room 569 alvarez@cs.bc.edu Computer Science Department voice: (67) 552-4333
More informationEECS730: Introduction to Bioinformatics
EECS730: Introduction to Bioinformatics Lecture 05: Index-based alignment algorithms Slides adapted from Dr. Shaojie Zhang (University of Central Florida) Real applications of alignment Database search
More informationThe Incomplete Perfect Phylogeny Haplotype Problem
The Incomplete Perfect Phylogeny Haplotype Prolem Gad Kimmel 1 and Ron Shamir 1 School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. Email: kgad@tau.ac.il, rshamir@tau.ac.il. Astract.
More informationReducing storage requirements for biological sequence comparison
Bioinformatics Advance Access published July 15, 2004 Bioinfor matics Oxford University Press 2004; all rights reserved. Reducing storage requirements for biological sequence comparison Michael Roberts,
More informationCS6901: review of Theory of Computation and Algorithms
CS6901: review of Theory of Computation and Algorithms Any mechanically (automatically) discretely computation of problem solving contains at least three components: - problem description - computational
More informationTuring Machines, diagonalization, the halting problem, reducibility
Notes on Computer Theory Last updated: September, 015 Turing Machines, diagonalization, the halting problem, reducibility 1 Turing Machines A Turing machine is a state machine, similar to the ones we have
More informationLecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)
Bioinformatics II Probability and Statistics Universität Zürich and ETH Zürich Spring Semester 2009 Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM) Dr Fraser Daly adapted from
More information1 Randomized complexity
80240233: Complexity of Computation Lecture 6 ITCS, Tsinghua Univesity, Fall 2007 23 October 2007 Instructor: Elad Verbin Notes by: Zhang Zhiqiang and Yu Wei 1 Randomized complexity So far our notion of
More informationPr[X = s Y = t] = Pr[X = s] Pr[Y = t]
Homework 4 By: John Steinberger Problem 1. Recall that a real n n matrix A is positive semidefinite if A is symmetric and x T Ax 0 for all x R n. Assume A is a real n n matrix. Show TFAE 1 : (a) A is positive
More information