Journal of Computational Biology. Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments

Size: px

Start display at page:

Download "Journal of Computational Biology. Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments"

Marilynn Sims
5 years ago
Views:

1 : Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments Journal: Manuscript ID: Manuscript Type: Date Submitted by the Author: Complete List of Authors: Keyword: draft Original Paper n/a Chen, Zhixiang; University of Texas-Pan American, Department of Computer Science Fu, Bin; University of Texas-Pan American, Computer Science Schweller, Robert; University of Texas-Pan American, Computer Science Yang, Boting; University of Regina, Computer Science Zhao, Zhiyu; University of New Orleans, Computer Science Zhu, Binhai; Montana State University, Computer Science haplotypes, algorithms, probability, alignment, SEQUENCE ANALYSIS Abstract: In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem - the incompleteness and inconsistency occurred in the DNA sequencing process to generate the input haplotype fragments and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration.

2 Page of Linear Time Probabilistic Algorithms for the Singular Haplotype Reconstruction Problem from SNP Fragments Zhixiang Chen Bin Fu Robert Schweller Boting Yang Zhiyu Zhao Binhai Zhu Abstract In this paper, we develop a probabilistic model to approach two realistic scenarios regarding the singular haplotype reconstruction problem - the incompleteness and inconsistency occurred in the DNA sequencing process to generate the input haplotype fragments and the common practice used to generate synthetic data in experimental algorithm studies. We design three algorithms in the model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. We also present experimental results that conform with the theoretical efficient performance of those algorithms. The software of our algorithms is available for public access and for real-time on-line demonstration. Keywords: Singular haplotype reconstruction, SNP fragments, probabilistic modeling and analysis, linear time probabilistic algorithm, inconsistency and incompleteness errors.. Introduction Abstractly, a genome can be considered a string over the alphabet of nucleotides {A, G, C, T }. It is accepted that the genomes between any two humans are over % identical (Terwilliger and Weiss, ; Hoehe et al., 000). The remaining sites which exhibit substantial variation (in at least % of the population) are called Single Nucleotide Polymorphisms (SNPs). The values of a set of SNPs on a particular chromosome copy define a haplotype. While a haplotype is a string over the four nucleotide bases, typical SNP sites only vary Corresponding author. Department of Computer Science, University of Texas-Pan American, Edinburg, TX, USA. s: {chen, binfu, schwellerr}@cs.panam.edu, Phone: ()-0, Fax: ()-0. Department of Computer Science, University of Regina, Saskatchewan, SS 0A, Canada. boting@cs.uregina.ca, Phone:(0)-, Fax: (0)-. Department of Computer Science, University of New Orleans, New Orleans, LA 0, USA. zzha@cs.uno.edu, Phone: (0)0-, Fax: (0)0-. Department of Computer Science, Montana State University, Bozeman, MT. USA. bhz@cs.montana.edu, Phone: (0)-, Fax: (0)-.

3 Page of between two values. Therefore, we can logically represent a haplotype as a binary string over the alphabet {A, B}. The haplotype of an individual chromosome can be thought of as a genetic finger print specifying the bulk of the genetic variability among distinct members of the same species. Determining haplotypes is thus a key step in the analysis of genetic variation. In recent years, the haplotyping problem has been extensively studied, see, f.g., (G. Lancia, 00; Zhao et al., 00; Wang et al., 00; Zhang et al., 00; Bafna et al., 00; Rizzi et al., 00; Wang et al., 00; Lippert et al., 00; Cilibrasi et al., 00; Lancia et al., 00; G. Lancia, 00; Alessandro Panconesi, 00; Clark, 0; Gusfield, 000, 00; Xie and Wang, 00; Li et al., 00; Genovese et al., 00). There are several version of the haplotyping problem. Of most interest is the haplotyping of diploid organisms, such as humans, which have two chromosomes, and thus two haplotypes. A common method to obtain an individual s two haplotypes is to first obtain a collection of sequencing data from the individual s chromosomes. This data consists of a collection of fragments, each partially specifying the sequence of one of the two base chromosomes. This data typically contains two types of errors. The data is incomplete in that each fragment will fail to assign a base to some collection of positions, and inconsistent in that some positions may be sequenced with an incorrect value. Further, in practice it is not known which chromosome is being sequenced by which fragments. The general haplotype reconstruction problem is then to take this input set of sequenced fragments and derive the two haplotypes which yielded the given fragments. This problem, which we consider, is the singular haplotype reconstruction problem and, like other versions of the problem, has also been extensively studied, see, f.g., (Cilibrasi et al., 00; Wang et al., 00; Lippert et al., 00; Bafna et al., 00). More concretely, we assume we are given a collection of n length m strings over the alphabet {A, B, } denoting the sequence of each fragment, where denotes a hole in which no value is measured at that position for that fragment. From this basic setup, a number of problems have been considered for various different optimization functions. In particular, one can consider removing certain fragments from the input data such that the remaining set of fragments can be divided into two separate sets, each set consisting of fragments that do not contain any conflicts (a site in which two strings contain different, non-hole values). Such a removal implicity derives two haplotypes corresponding to the two groups. This problem is referred to as the minimum fragment removal problem (MFR). A second problem involves removing or discounting a subset of the SNP sites to create two consistent sets of fragments. This is referred to as the minimum SNP removal problem (MSR). Finally, a third approach is to flip or correct a number of sites for various fragments. This is referred to as the minimum error correction problem (MEC). While much work has been done on these problems, the incompleteness and inconsistency of data fragments makes most versions of these haplotyping problems NP-hard or even hard to approximate (f.g., (Cili-

4 Page of brasi et al., 00; Lippert et al., 00; Bafna et al., 00)), and many elegant and powerful methods such as those in (Li et al., 00) cannot be used to deal with incompleteness and inconsistency at the same time. Other methods have proposed heuristics (Genovese et al., 00; Alessandro Panconesi, 00), but do not provide provably good results. In this paper, we develop a probabilistic approach to overcome some of the difficulties caused by the incompleteness and inconsistency occurred in the input fragments. In our model, we assume the input data fragments are attained from two unknown haplotype strings. Some number of the fragments are derived from the first haplotype, some from the second haplotype. Each fragment is assumed to be generated according to two error parameters, α and α, representing inconsistency errors and incompleteness errors respectively. These parameters represent the percentage chance that any given position will be sequenced incorrectly from the base haplotype, or will be omitted as a hole, respectively. Further, we assume a third parameter β denoting a minimum difference percentage between the two base haplotype strings. With respect to these three parameters, we develop algorithms that output the base haplotypes with probability as a function of α, α, and β. In particular, we design three algorithms in our probabilistic model that can reconstruct the two unknown haplotypes from the given matrix of haplotype fragments with provable high probability and in time linear in the size of the input matrix. The first algorithm assumes prior knowledge of the α parameter. The second algorithm does not require prior knowledge of any parameters at a modest cost to run time. The third does not require prior knowledge of parameters and attains similar analytical bounds as algorithm, yet exhibits a substantial empirical improvement in accuracy. For all algorithms, we present experimental results that conform with the respective theoretically efficient performance. The software of our algorithms is available for public access and for real-time on-line demonstration. Paper Layout In Section we formally define the problem and the probabilistic model we use. In Section we develop some technical lemmas. In Section we provide a haplotype reconstruction algorithm that utilizes prior knowledge of an error parameter. In Section we present an algorithm that requires no prior knowledge of model parameters. In Section we provide a third algorithm with provably good accuracy that also performs particularly well in practice. In Section we detail experimental results for our algorithms.

5 Page of A Probabilistic Model Assume that we have two haplotypes H, H, denoted as H = a a a m and H = b b b m. Let Γ = {S, S,..., S n } be a set of n fragments obtained from the DNA sequencing process with respect to the two haplotypes H and H. In this case, each S i = c c c m is either a fragment of H or H. Because we lose the information concerning the DNA strand to which a fragment belongs, we do not know whether S i is a fragment of H or H. Suppose that S i is a fragment of H. Because of reading errors or corruptions that may occur during the sequencing process, there is a small chance that either c j - but c j a j, or c j = -, for j m, where the symbol - denotes a hole or missing value. For the former, the information of the fragment S i at the j-th SNP site is inconsistent, and we use α to denote the rate of this type of inconsistency error. For the latter, the information of S i at the j-th SNP is incomplete, and we use α to denote the rate of this type of incompleteness error. It is known (f.g., (Wang et al., 00; Lippert et al., 00; Bafna et al., 00)) that α and α are in practice between % to %. Also, it is realistically reasonable to believe that the dissimilarity, denoted by β, between the two haplotypes H and H is big. Often, β is measured using the Hamming distance between H and H divided by the length m of H and H, and is assumed to be large, say, β 0.. It is also often assumed that roughly half of the fragments in Γ are from each of the two haplotypes H and H. In the experimental studies of algorithmic solutions to the singular haplotype reconstruction problem, we often need to generate synthetic data to evaluate the performance and accuracy of a given algorithm. One common practice (f.g., (Wang et al., 00; Lippert et al., 00; Bafna et al., 00)) is as follows: First, choose two haplotypes H and H such that the dissimilarity between H and H is at least β. Second, make n i copies of H i, i =,. Third, for each copy H = a a a m of H i, for each i =,,..., m, with probability α, flip a i to a i so that they are inconsistent. Also, independently, a i has probability α to be a hole -. A synthetic data set is then generated by setting parameters m, n, n, β, α and α. Usually, n is roughly the same as n, and β 0., α [0.0, 0.0], and α [0., 0.]. Motivated by the above reality of the sequencing process and the common practice in experimental algorithm studies, we will present a probabilistic model for the singular haplotype reconstruction problem. But first we need to introduce some necessary notations and definitions. Let Σ = {A, B} and Σ = {A, B, -}. For a set C, C denotes the number of elements in C. For a fragment (or a sequence) S = a a a m Σ m, S[i] denotes the character a i, and S[i, j] denotes the substring a i a j for i j m. S denotes the length m of S. When no confusion arises, we alternatively use the terms fragment and sequence. Let G = g g g m Σ m be a fixed sequence of m characters. For any sequence S = a a m Σ m, S

6 Page of is called a F α,α (m, G) sequence if for each a i, with probability at most α, a i is not equal to g i and a i -; and with probability at most α, a i = -. For a sequence S, define holes(s) to be the number of holes in the sequence S. If A is a subset of {,, m} and S is a sequence of length m, holes A (S) is the number of i A such that S[i] is a hole. For two sequences S = a a m and S = b b m of the same length m, for any A {,, m}, define diff(s, S ) = {i {,,, m} a i - and b i - and a i b i } m diff A (S, S ) = {i A a i - and b i - and a i b i }. A For a set of sequences Γ = {S, S,, S k } of length m, define vote(γ) to be the sequence H of the same length m such that H[i] is the most frequent character among S [i], S [i],, S k [i] for i =,,, m. We often use an n m matrix M to represent a list of n fragments from Σ m matrix. For i n, let M[i] represent the i-th row of M, i.e., M[i] is a fragment in Σ m. We now define our probabilistic model: and call M an SNP fragment The Probabilistic Singular Haplotype Reconstruction Problem: Let β, α and α be small positive constants. Let G, G Σ m be two haplotypes with diff(g, G ) β. For any given n m matrix M of SNP fragments such that n i rows of M are F α,α (m, G i ) sequences, i =,, n + n = n, reconstruct the two haplotypes G and G, which are unknown to the users, from M as accurately as possible with high probability. We call β (resp., α, α ) dissimilarity rate (resp., inconsistency error rate, incompleteness error rate).. Technical Lemmas For probabilistic analysis we need the following two Chernoff bounds (Motwani and Raghavan, 000, see), which can be derived from the work in (Li et al., 00). Lemma. (Li et al., 00) Let X,, X n be n independent random 0, variables, where X i takes with probability at most p. Let X = n i= X i. Then for any ɛ > 0, Pr(X > pn + ɛn) < e nɛ. Lemma. (Li et al., 00) Let X,, X n be n independent random 0, variables, where X i takes with probability at least p. Let X = n i= X i. Then for any ɛ > 0, Pr(X < pn ɛn) < e nɛ. We shall prove several technical lemmas for algorithm analysis in the next three sections.

7 Page of Lemma. Let S be a F α,α (m, G) sequence. Then, for any 0 < ɛ, with probability at most e ɛ m, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m. Proof: Let X k, k =, m, be random variables such that X k = if S[k] G i [k] and S[k], or 0 otherwise. By the definition of the F α,α (m, G) sequences, X k are independent and P r(x k = ) α. So, by Lemma, with probability at most e ɛ m, X + + X m > (α + ɛ)m Thus, we have diff(g, S) > α + ɛ with probability at most e ɛ m. Similarly, with probability at most e ɛ m, holes(s) > (α + ɛ)m. Lemma. Assume that A is a fixed subset of {,,, m}. Let S be a F α,α (m, G) sequence. Then, for any 0 < ɛ, with probability at most e ɛ A, diff A (G i, S) > α + ɛ or holes A (S) > (α + ɛ) A. Proof: Let S be the subsequence consisting of all the characters S[i], i A, with the same order as in S. Similarly, let G be the subsequence consisting of all the characters G[i], i A, with the same order as in G. It is easy to see that diff A (S, G i ) = diff(s, G ). The lemma follows from a similar proof for Lemma. Lemma. Let N i be a set of n i many F α,α (m, G i ) sequences, i =,. Let β and ɛ be two positive constants such that α +α +ɛ <, and diff(g, G ) β. Then, with probability at most (n +n )e ɛ βm, diff(s i, S j ) β( α α ɛ) for some S i N i and some S j N j with i j. Proof: For each G i, let A i be the set of indexes {k {,,, m} G i [k] G j [k]}, where i j. Since diff(g i, G j ) β and G i = G j = m, we have A i βm. For any F α,α (m, G i ) sequence S, by Lemma, with probability at most e ɛ A i e ɛ βm, diff Ai (S, G i ) > α +ɛ or holes Ai (S) > (α +ɛ) A i. Hence, with probability at most n i e ɛ βm, diff Ai (S, G i ) > α +ɛ or holes Ai (S) > (α +ɛ) A i, for some S N i. Therefore, with probability at most (n + n )e ɛ βm, we have diff Ai (S, G i ) > α + ɛ or holes Ai (S) > (α + ɛ) A i,j, for some S N i, for some i = or. In other words, with probability at least (n + n )e ɛ βm, we have diff Ai (S, G i ) α + ɛ and holes Ai (S) (α + ɛ) A i,j, for all S N i and for i = and. For any F α,α (m, G i ) sequence S i, i =,, if diff Ai (S i, G i ) α + ɛ and holes Ai (S i ) (α + ɛ) A i, then diff(s, S ) diff Ai (S, S ) β( α α ɛ). Thus, with probability at least (n +n )e ɛ βm, we have diff(s, S ) β( α α ɛ), for every S N and every S N. In words, with probability at most (n +n )e ɛ βm, we have diff(s, S ) < β( α α ɛ), for some S N and some S N. Lemma. Let α, α and ɛ be three small positive constants that satisfy 0 < α + α ɛ <. Assume that N = {S,, S n } is a set of F α,α (m, G) sequences. Let H = vote(n). Then, with probability at most m(e ɛ n ), G H.

8 Page of Proof: Given any j m, for any i n, let X i be random variables such that X i = if S i [j] G[j] and S i [j], or 0 otherwise. By the definition of the F α,α (m, G) sequences, X i are independent and P r(x i = ) α. So, by Lemma, with probability at most e ɛ n, X + +X n < (α ɛ)n. That is, with probability at most e ɛ n, there are fewer than (α ɛ)n characters S i [j] such that S i [j] G[j] and S i [j] -. Similarly, with probability at most e ɛ n, there are fewer than (α ɛ)n characters S i [j] such that S i [j] = -. Thus, with probability at most me ɛ n, there are fewer than (α + α ɛ)n characters S i [j] such that S i [j] G[j] for some j m. This implies that, with probability at least me ɛ n, there are more than ( α α + ɛ)n characters S i [j] such that S i [j] = G[j] for any j m. Since 0 < α +α ɛ < by the assumption of the theorem, we have (α +ɛ)n < ( α α +ɛ)n. This further implies that with probability at least me ɛ n, vote(n)[j] = G[j] for any j m, i.e., vote(n) = G.. When the Inconsistency Error Parameter Is Known Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy (α + ɛ) < β and 0 < α + α ɛ <. Let G, G Σ m be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. There exists an O(nm) time algorithm that can find two haplotypes H and H with probability at least ne ɛ m me ɛ n me ɛ n such that either H = G and H = G, or H = G and H = G. Proof: The algorithm, denoted as SHR-One, is described as follows. Algorithm SHR-One Input: M, an n m matrix of SNP fragments. Parameters α and ɛ. Output: Two haplotypes H and H. Set Γ = Γ =. Randomly select a fragment r = M[j] for some j n. For every fragment r from M do If (diff(r, r ) (α + ɛ)) then put r into Γ Let Γ = M Γ. Let H = vote(γ ) and H = vote(γ ). return H and H.

9 Page of End of Algorithm Claim. With probability at most ne ɛ m, we have either diff(f, G ) > α + ɛ for some F α,α (m, G ) sequence f in M, or diff(g, G ) > α + ɛ for some F α,α (m, G ) sequence g in M. By Lemma, for any fragment f = M[k] such that f is a F α,α (m, G ) sequence, with probability at most e ɛ m we have diff(f, G ) > α + ɛ. Since there are n many F α,α (m, G ) sequences in M, with probability at most n e ɛ m, we have diff(f, G ) > α +ɛ for some F α,α (m, G ) sequence f in M. Similarly, with probability at most n e ɛ m, we have diff(g, G ) > α + ɛ for some F α,α (m, G ) sequence g in M. Combining the above completes the proof for Claim. Claim. Let M i be the set of all the F α,α (m, G i ) sequences in M, i =,. With probability at least ne ɛ m, Γ and Γ is a permutation of M and M. By the assumption of the theorem, the fragment r of M is either a F α,α (m, G ) sequence or a F α,α (m, G ) sequence. We assume that the former is true. By Claim, with probability at least ne ɛ m, we have diff(f, G ) α + ɛ for all F α,α (m, G ) sequences f in M, and diff(g, G ) α + ɛ for all F α,α (m, G ) sequences g in M. Hence, for any fragment r in M, if r is a F α,α (m, G ) sequence, then with probability at least ne ɛ m, we have diff(r, r ) diff(r, G ) + diff(r, G ) (α + ɛ). This means that, with probability at least ne ɛ m, all F α,α (m, G ) sequences in M will be included in Γ. Now, consider the case that r is a F α,α (m, G ) sequence in M. Since diff(g, G ) diff(g, r) + diff(r, G ) diff(g, r) + diff(r, r ) + diff(r, G ), we have diff(r, r ) diff(g, G ) diff(g, r) diff(g, r ). By the given condition of diff(g, G ) β and (α + ɛ) < β, with probability at least ne ɛ m, we have diff(r, r ) β diff(g, r) diff(g, r ) β (α + ɛ) > (α + ɛ), i.e., r will not be added to Γ. Therefore, with probability at least ne ɛ m, Γ = M and Γ = M Γ = M. Similarly, if r is a F α,α (m, G ) sequence, with probability at least ne ɛ m, Γ = M and Γ = M Γ = M. This completes the proof of Claim. Suppose that Γ and Γ is a permutation of M and M. Say, without loss of generality, Γ = M and Γ = M. By Lemma, with probability at most me ɛ n + me ɛ n, vote(γ ) G or vote(γ ) G. Hence, by Claim, with probability at most ne ɛ m + me ɛ n + me ɛ n, vote(γ ) G or vote(γ ) G. Concerning the computational time of the algorithm, we need to compute the difference between the selected fragment r and each of the remaining n fragments in the matrix M. Finding the difference between r and r takes O(m) steps. So, the total computational time is O(nm), which is linear in the size of the input matrix M.

10 Page of When Parameters Are Not Known In this section, we consider the case that the parameters α, α and β are unknown. However, we assume the existence of those parameters for the input matrix M of SNP fragments. We will show that in this case we can still reconstruct the two unknown haplotypes from M with high probability. Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy α + α + ɛ <, 0 < α +α ɛ <, and β( α α ɛ) > (α +ɛ). Let G, G Σ m be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. Then, there exists an O(umn) time algorithm that can find two haplotypes H and H with probability at least ( γ) u ne ɛ βm me ɛ n me ɛ n such that H, H is a permutation of G, G, where γ = Proof: nn n(n ) The algorithm, denoted as SHR-Two, is described as follows. Algorithm SHR-Two Input: M, an n m matrix M of SNP fragments. u, a parameter to control the loop. Output: two haplotypes H and H. Let d min = and M =. For (k = to u) do { //the k-loop } Let M = M = and d = d = 0. Randomly select two fragments r = M[i ], r = M[i ] from M For every fragment r from M do { } and u is an integer parameter. If (diff(r i, r ) = min{diff(r, r ), diff(r, r )} for i = or, then put r into M i. Let d i = max{diff(r i, r ) r M i } for i =,. Let d = max{d, d }. If (d < d min ) then let M = {M, M } and d min = d. return H = vote(m ) and H = vote(m ). End of Algorithm Claim. With probability at most ( γ) u, r, r is not a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in all of the k-loop iterations. For randomly selected fragments r and r, with probability γ, r, r is a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in M. When the k-loop is repeated u times, with probability at most

11 Page 0 of ( γ) u, r, r is not a permutation of a F α,β (m, G ) sequence and a F α,β (m, G ) sequence in all of the u loop iterations. Thus, Claim is true. Let N i be the set of the n i fragments in M that are F α,α (m, G i ) sequences, i =,. Claim. With probability at most ne ɛ βm, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m for some S from N i for some i = or ; or diff(s, S ) β( α α ɛ) for some S N and some S N. By Lemma, for every fragment S from N i, with probability at most e ɛ m, diff(gi, S) > α + ɛ or S has more than (α + ɛ)m holes. Thus, with probability at most ne ɛ m, diff(g i, S) > α + ɛ or holes(s) > (α + ɛ)m for some S from N i for some i = or. By Lemma, with probability at most ne ɛ βm, diff(s, S ) β( α α ɛ) for some S N and some S N. The above analysis completes the proof for Claim. Claim. Let H = vote(m ) and H = vote(m ) be the two haplotypes returned by the algorithm. With probability at most ( γ) u + ne ɛ βm, M, M is not a permutation of N, N. We assume that () diff(s, S ) > β( α α ɛ) for every S from N and every S from N ; and () diff(g i, S) α + ɛ and holes(s) (α + ɛ)m for all S N i for i =,. We consider possible choices of the two random fragments r and r in the following. At any iteration of the k-loop, if r N and r N, then by () we have diff(r, r ) diff(r, G ) + diff(r, G ) (α +ɛ) for any r N ; and diff(r, r ) diff(r, G )+diff(r, G ) (α +ɛ) for any r N. By () and the given condition of the theorem, we have, diff(r, r ) > β( α α ɛ) > (α + ɛ) for any r N ; and diff(r, r ) > β( α α ɛ) > (α + ɛ) for any r N. This implies that at this loop iteration we have M = N, M = N and d (α + ɛ). Similarly, if at this iteration r N and r N, then M = N, M = N and d (α + ɛ). If r, r N at some iteration of the k-loop, then for any r N, either r M or r M. In either case, by () of our assumption and the given condition of the theorem, we have d β( α α ɛ) > (α + ɛ) at this iteration. Similarly, if r, r N at some iteration of the k-loop, then we also have d > (α + ɛ) at this iteration. It follows from the above analysis that, under the assumption of () and (), once we have r N and r N or r N and r N at some iteration of the k-loop, then M, M is a permutation of N, N at the end of this iteration. Furthermore, if M and M are replaced by M and M after this iteration, then M, M must also be a permutation of N, N. By Claims and, with probability at most ( γ) u + ne ɛ βm, the assumption of () and () is not true, or r N and r N (or r N and r N ) is not true at all the iterations of the k-loop. Hence, with probability at most ( γ) u + ne ɛ βm, the final list of M and M returned by the algorithm is not a permutation of N, N, so the claim is proved. 0

12 Page of For M and M returned by the algorithm, we assume without loss of generality M i = N i, i =,. By Lemma and the given condition of the theorem, with probability at most me ɛ n + me ɛ n, we have H = vote(m ) G or H = vote(m ) G. Thus, by Claim, with probability at most ( γ) u + ne ɛ βm + me ɛ n, we have H G or H G. It is easy to see that the time complexity of the algorithm is O(umn), which is linear in the size of M for a constant u.. Tuning the Dissimilarity Measure In this section, we consider a different dissimilarity measure in algorithm SHR-TWO to improve its ability to tolerate errors. We use the sum of the differences between r i and every fragment r M i, i =,, to measure the dissimilarity of the fragments in M i with r i. The new algorithm SHR-Three is given in the following. We will present experimental results in Section to show that algorithm SHR-Three is more reliable and robust in dealing with possible outliers in the data sets. Algorithm SHR-Three Input: M, an n m matrix of SNP fragments. u, a parameter to control the loop. Output: two haplotypes H and H. } Let d min = and M =. For (k = to u) do { //the k-loop Let M = M = and d = d = 0. Randomly select two fragments r = M[i ], r = M[i ] from M For every fragment r from M do { } If (diff(r i, r ) = min{diff(r, r ), diff(r, r )} for i = or, then put r into M i. Let d i = r M i diff(r i, r ) for i =,. Let d = max{d, d }. If (d < d min ) then let M = {M, M } and d min = d. return H = vote(m ) and H = vote(m ). End of Algorithm Theorem. Assume that α, α, β, and ɛ > 0 are small positive constants that satisfy α + α + ɛ <,

13 Page of < α + α ɛ <, η > G, G Σ m (α +ɛ) β( α α ɛ) min(n,n) with η = n, and β( α α ɛ) > (α + ɛ). Let be the two unknown haplotypes such that diff(g, G ) β. Let M be any given n m matrix of SNP fragments such that M has n i fragments that are F α,α (m, G i ) sequences, i =,, and n + n = n. Then, there exists an O(umn) time algorithm that can find two haplotypes H and H with probability at least ( γ) u ne ɛ βm me ɛ n me ɛ n such that H, H is a permutation of G, G, where γ = nn n(n ) and u is an integer parameter. Proof: Let N i be the set of the n i many F α,α (m, G i ) sequences in M, for i =,. We first notice that both Claims and in the proof of Theorem hold here following the same analysis. However, we need to prove the following claim with different analysis: Claim. Let H = vote(m ) and H = vote(m ) be the two haplotypes returned by the algorithm. With probability at most ( γ) u + ne ɛ βm, M, M is not a permutation of N, N. We assume that () diff(s, S ) > β( α α ɛ) for every S from N and every S from N ; and () diff(g i, S) α + ɛ and holes(s) (α + ɛ)m for each S from N i (i =, ). We shall consider the two cases: Case. At some iteration of the k-loop, both r and r are selected from the same N i for i = or. For each r N j, j i, by assumption () we have both diff(r i, r ) > β( α α ɛ), i =,. Notice that r must be either in M or M. So, at least half of the fragments in N j will be either in M or M. Therefore, at this iteration, we have d n jβ( α α ɛ) ηnβ( α α ɛ) = ηβ( α α ɛ)n > (α + ɛ)n. Case. At some iteration of the k-loop, r and r are selected from different N i for i = and. Without loss of generality, r i N i, i =,. For each r N, by assumption () we have diff(r, r ) (α + ɛ); by assumption () and the given condition of the theorem we have diff(r, r ) > β( α α ɛ) > (α +ɛ). Similarly, for each r N, diff(r, r ) (α + ɛ), and diff(r, r ) > (α + ɛ). Therefore, at this iteration, we have M = N and M = N, and d (α + ɛ)n + (α + ɛ)n = (α + ɛ)n. The two cases implies that under the assumption of () and (), if at any iteration of the k-loop, we have r N and r N, or r N and r N, then the final list of the two sets M, M is a permutation of N, N. Hence, Claim follows from Claims and in the proof of Theorem, which are true here as we mentioned earlier. Now, we assume that the final list of the two sets M, M is a permutation of N, N. By Lemma, with probability at most m(e δ n ) + m(e δ n ), H = vote(m ), H = vote(m ) is not a permutation of G, G. This, together with Claim, completes the probabilistic claim of the theorem. It is easy to see that the time complexity of the algorithm is O(umn), which is linear in the size of M.

14 Page of Experimental Results We design a MATLAB program to test both the accuracy and the speed of algorithm SHR-Three. Due to the difficulty of getting real data from the public domain (Alessandro Panconesi, 00), our experiment data is created following the common practice in literature such as (Wang et al., 00; Alessandro Panconesi, 00). A random matrix of SNP fragments is created as follows: () Haplotype is generated at random with length m (m {0, 00, 0}). () Haplotype is generated by copying all the bits from haplotype and flipping each bit with probability β (β {0., 0., 0.}). This simulates the dissimilarity rate β between two haplotypes. () Each haplotype is copied n times so that the matrix has m columns and n(n {0, 0, 0}) rows. () Set each bit in the matrix to - with probability α (α {0., 0., 0.}). This simulates the incompleteness error rate α in the matrix. () Flip each non-empty bit with probability α (α {0.0, 0.0,..., 0.}). This is the simulation of the inconsistency error rate of α. Tables to show the performance of algorithm SHR-Three with different parameter settings in accordance with those in (Alessandro Panconesi, 00). The typical parameters used in (Alessandro Panconesi, 00) are m = 00, n = 0, β = 0., α = 0. and 0.0 α 0.0. These are considered to be close to the real situations. In our tables, the results are the average time and the reconstruction rate of the 000 executions of algorithm SHR-Three. A new random matrix is used for each execution. The reconstruction rate is defined as the ratio of the total number of correctly reconstructed bits to the total number of bits in two haplotypes. The computing environment is a PC machine with a typical configuration of.ghz AMD Turion X CPUs and GB memory.. Concluding Remarks The software of our algorithms is available for public access and for real-time on-line demonstration at We thank Liqiang Wang for implementing the programs in Java and setting up this web site. We have developed three linear time probabilistic algorithms for the singular haplotype reconstruction problem with provable high success probability. Our experimental results conform with the theoretical efficiency and accuracy of the algorithms. It should be pointed out that our work can be extended to reconstruct multiple haplotypes from a set of fragments. Our approach also opens the door to develop probabilistic methods for other variants of the

15 Page of haplotyping problem involving both inconsistency and incompleteness errors. References Alessandro Panconesi, M. S. (00). Fast Hare: A fast heuristic for single individual snp haplotype reconstruction. In Algorithms in Bioinformatics, th International Workshop, WABI 00, Lecture Notes in Computer Science 0, pages. Bafna, V., Istrail, S., Lancia, G., and Rizzi, R. (00). Polynomial and apx-hard cases of the individual haplotyping problem. Theoretical Computer Science, :0. Cilibrasi, R., van Iersel, L., Kelk, S., and Tromp, J. (00). On the complexity of several haplotyping problems. In Algorithms in Bioinformatics, th International Workshop, WABI 00, Lecture Notes in Computer Science, volume, pages. Clark, A. (0). Inference of haplotypes from pcr-amplified samples of diploid populations. Molecular Biology Evolution, :. G. Lancia, R. R. (00). A polynomial solution to a special case of parsimony haplotyping problem. Operations Research Letters, (0):. Genovese, L. M., Geraci, F., and Pellegrini, M. (00). A fast and accurate heuristic for the single individual SNP haplotyping problem with many gaps, high reading error rate and low coverage. Algorithms in Bioinformatics, : 0. Gusfield, D. (000). A practical algorithm for optimal inference of haplotype from diploid populations. In The Eighth International Conference on Intelligence Systems for Molecular Biology, pages. Gusfield, D. (00). Haplotyping as perfect phylogeny: Conceptual framework and efficient solutions. In the Sixth Annual International Conference on Computational Biology, pages. Hoehe, M., Kopke, K., Wendel, B., Rohde, K., Flachmeier, C., Kidd, K., Berrettini, W., and Church, G. (000). Sequence variability and candidate gene analysis in complex disease: association of opioid receptor gene variation with substance dependence. Human Molecular Genetics, (): 0. Lancia, G., Pinotti, M. C., and Rizzi, R. (00). Haplotyping polulations by purs parsimongy: complexity and algorithms. INFORMS Journal on computing, :. Li, L., Kim, J. H., and Waterman, M. S. (00). Haplotype reconstruction from SNP alignment. Journal of Computational Biology, ( ):0.

16 Page of Li, M., Ma, B., and Wang, L. (00). On the closest string and substring problems. Journal of the ACM, ():. Lippert, R., Schwartz, R., Lancia, G., and Istrail, S. (00). Algorithmic strategies for the single nucleotide polymorphism haplotype assembly problem. Briefings in bioinformatics, :. Motwani, R. and Raghavan, P. (000). Randomized Algorithms. Cambridge. Rizzi, R., Bafna, V., Istrail, S., and Lancia, G. (00). Practical algorithms and fixed-parameter tractability for the single individual SNP haplotyping problem. In Algorithms in Bioinformatics: Second International Workshop, WABI 00, Rome, Italy, September -, pages. Terwilliger, J. and Weiss, K. (). Linkage disquilibrium mapping of complex disease: fantasy and reality? Current Opinion in Biotechnology, ():. Wang, R.-S., Wu, L.-Y., Li, Z.-P., and Zhang, X.-S. (00). Hyplotype reconstruction from SNP fragments by minimum error correction. BioInformatics, (0):. Wang, R.-S., Wu, L.-Y., Zhang, X.-S., and Chen, L. (00). A markov chain model for haplotype assembly from SNP fragments. Genome Informatics, ():. Xie, M. and Wang, J. (00). An improved (and practical) parameterized algorithm for the individual haplotyping problem MFR with mate-pairs. Algorithmica (online). Zhang, X.-S., Wang, R.-S., Wu, L.-Y., and Zhang, W. (00). Minimum conflict individual haplotyping from SNP fragments and related genotype. Evolutionary Bioinformatics Online, : 0. Zhao, Y., Wu, L., Zhang, J., and R. Wang, X. Z. (00). Haplotype assembly from aligned weighted SNP fragments. Computational Biology and Chemistry, :.

17 Page of α (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0, β = 0% and α = 0%

18 Page of n = 0 n = 0 α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, β = 0%, α = 0% β = 0% β = 0% α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0 and α = 0% α = 0% α = 0% α (%) Time (ms) Reconstruction Rate (%) Time (ms) Reconstruction Rate (%) Table : Results for m = 00, n = 0 and β = 0%

On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem

On the Fixed Parameter Tractability and Approximability of the Minimum Error Correction problem Paola Bonizzoni, Riccardo Dondi, Gunnar W. Klau, Yuri Pirola, Nadia Pisanti and Simone Zaccaria DISCo, computer