The Incomplete Perfect Phylogeny Haplotype Problem


Gad Kimmel and Ron Shamir
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. kgad@tau.ac.il, rshamir@tau.ac.il

Abstract. The problem of resolving genotypes into haplotypes, under the perfect phylogeny model, has been under intensive study recently. All studies so far handled missing data entries in a heuristic manner. We prove that the perfect phylogeny haplotype problem is NP-complete when some of the data entries are missing, even when the phylogeny is rooted. We define a biologically motivated probabilistic model for genotype generation and for the way missing data occur. Under this model, we provide an algorithm that runs in expected polynomial time. In tests on simulated data, our algorithm quickly resolves the genotypes under high rates of missing entries.

Keywords: haplotype, haplotype block, genotype, SNP, algorithm, complexity, genotype phasing, haplotype resolution, perfect phylogeny.

1 Introduction

A current challenge in human genome research is to learn about DNA differences among individuals. This knowledge will hopefully lead to finding the genetic causes of complex and multi-factorial diseases. The distinct single-base sites along the DNA sequence, which show variability in their nucleic acid content across the population, are called single nucleotide polymorphisms (SNPs). Millions of SNPs have already been detected [19, 22], out of an estimated total of 10 million common SNPs [8].

In diploid organisms (e.g., humans) there are two nearly identical copies of each chromosome. Most techniques for determining SNPs provide a pair of readings, one from each copy, but cannot distinguish from which of the two chromosomes each reading came [14]. The goal of phasing (or resolving) is to infer that missing information. The original conflated data from both chromosomes are called the genotype of the individual, which is represented by a set of two nucleotide readings for each site. The two separated sequences corresponding to the two chromosomes of an individual are called his/her haplotypes. If the two bases in a site are identical (resp. different), the site is called homozygote (resp. heterozygote). For recent reviews on biological and computational aspects of haplotype analysis see [11, 13].

Resolving the genotypes is a central problem in haplotyping. It is believed that more accurate association studies can be performed once the genotypes are resolved [14, 4]. In the absence of additional information, each genotype can be resolved in $2^{h-1}$ different ways, where $h$ is the number of heterozygote sites in the genotype. To find the correct way, resolution is done simultaneously on all the available genotypes, and according to a model.

A pioneering approach to haplotype resolution was Clark's parsimony-based algorithm [3]. A likelihood-based EM algorithm [6, 15] gave better results. Stephens et al. [21] and Niu et al. [16] proposed MCMC-based methods which gave promising results. All of those methods assumed that the genotype data correspond to a single block with no recombination events. Hence, for multi-block data the block structure must be determined separately.

Recently, a new combinatorial formulation of the phasing problem was suggested by Gusfield [10]. According to this model, phasing must be done so that the resulting haplotypes define a perfect phylogeny tree. This model is based on the biological assumption that there are regions along the chromosome where recombination occurred infrequently, and on the infinite site model [10]. Gusfield showed how to solve the problem efficiently, and improved algorithms were subsequently developed by Bafna et al. [2] and Eskin et al. [5]. Eskin et al. [5] showed good resolving results with small error rates on real genotypes. They also reported that their algorithm was faster and more accurate in practical settings than previous methods such as [21].

In real genotype data (e.g., [17, 7, 4]) some of the data entries are often missing, due to technical causes. Current phasing algorithms that are based on perfect phylogeny require complete genotypes. This situation raises the following algorithmic problem: complete the missing entries in the genotypes and then resolve the data, such that the resulting haplotypes define a perfect phylogeny tree. We call this problem incomplete perfect phylogeny haplotype (IPPH). It was posed by Halldórsson et al. [11]. In order to deal with such incomplete data, Eskin et al. [5] used a heuristic to complete the missing entries, and showed very good results. However, finding an algorithm for optimally

handling missing data entries should allow more accurate resolution. In this paper we address the IPPH problem.

A special case of IPPH was studied in phylogeny by Pe'er et al. [18]. In the incomplete directed perfect phylogeny problem, the input is an $n \times m$ species-characters matrix, the characters are binary and directed (i.e., a species can only gain characters), and some of the characters are missing. The question is whether one can complete the missing states in a way admitting a perfect phylogeny. Pe'er et al. provided a near-optimal $\tilde{O}(nm)$-time algorithm for the problem. (We use the $\tilde{O}$ notation to suppress polylogarithmic factors in presenting complexity bounds; see footnote 1.) This problem is a special case of IPPH in which all the sites in all genotypes are homozygote and the root is known.

The IPPH problem can be stated in two variants: rooted (or directed) and unrooted (or general). In the rooted version, the root haplotype is given as part of the input. The unrooted version is a more direct formulation of the practice in biology, since in phasing the root of the haplotypes is not given. However, we argue that the more restricted rooted version is of practical importance: though theoretically finding the root might take exponential time, in practice it can often be found efficiently by finding one genotype which is complete and homozygote in all sites. Once such a haplotype is found, it can be used as a root for the construction of a perfect phylogeny tree. This haplotype need not be the real evolutionary root of the tree. The procedure is correct, since each one of the haplotypes can be used as a root in the perfect phylogeny tree, as was shown by Gusfield [9]. As we shall demonstrate in Section 5, on simulated and real biological data, virtually always at least one such genotype exists. If there is no such genotype, one can seek a genotype with few undetermined sites and enumerate the values in these sites. In the rare cases where this too is not feasible, one can physically separate the two chromosomes of a single individual and sequence one haplotype, as was done in [17]. This procedure is considerably more expensive than standard genotyping techniques, but it will be performed only for one individual, so the price is small. Thus, both variants of IPPH are biologically important.

In this paper, we show that both the rooted and the unrooted versions of the problem are NP-complete. The hardness of unrooted IPPH also follows immediately from the hardness of determining the compatibility of unrooted partial binary characters (incomplete haplotype matrix) [20]. This was observed first by R. Sharan (private communication). However, this result does not imply the hardness of the rooted version. In fact, our proof for rooted IPPH is quite involved.

To cope with the theoretical hardness of IPPH, we invoke a probabilistic approach. We define a stochastic model for generating the haplotypes and for the way missing entries occur in them. The model assumptions are mild and seem to apply to biological data. In addition, we assume that the number of sites $m$ grows much more slowly than the number of genotypes $n$. Specifically, we assume that $m = o(n^{0.25})$. As $m$ is bounded by the block size, which in practice is not more than a modest constant (10-30), this condition also holds in practice. We design an algorithm which always finds the correct solution, and under the assumptions above takes an expected time of $\tilde{O}(m^2 n)$.

Footnote 1. The $\tilde{O}$ notation is defined as: $\tilde{O}(g(n)) := \{ f(n) \mid \exists n_0 > 0, c > 0, d > 0, \forall n \geq n_0 : 0 \leq f(n) \leq c [\log n]^d g(n) \}$.
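The root-finding shortcut described above (look for one genotype that is complete and homozygote in all sites) can be sketched in a few lines. This is only an illustration: the function name `find_root_candidate` is hypothetical, and the encoding (0 and 1 for homozygote sites, 2 for a heterozygote site, "?" for a missing entry) is assumed to follow the genotype matrix convention introduced in Section 2.

```python
from typing import List, Optional, Sequence


def find_root_candidate(genotypes: Sequence[Sequence[object]]) -> Optional[List[int]]:
    """Return the first genotype that is complete and homozygote at every site.

    Assumed encoding: 0/1 = homozygote, 2 = heterozygote, "?" = missing.
    Such a genotype resolves into two identical haplotypes, so the row itself
    can serve as a root haplotype. Returns None if no such row exists.
    """
    for row in genotypes:
        # Only 0 and 1 are allowed: a 2 or a "?" disqualifies the row.
        if all(x in (0, 1) for x in row):
            return list(row)
    return None


# Small illustrative matrix (not taken from the paper).
M = [
    [0, 2, 1],    # heterozygote at site 2
    [1, "?", 0],  # missing entry
    [1, 0, 0],    # complete and homozygote: usable as a root
]
root = find_root_candidate(M)
```

The scan is linear in the size of the matrix. When it returns `None`, the text above suggests falling back to a genotype with few undetermined sites and enumerating their values.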

To test our algorithm, we applied it to simulated data under biologically realistic values of the parameters, and calculated an upper bound $\Gamma$ on the main factor in the running time. $\Gamma$ may be exponential, but under the model assumptions it was shown to be polynomial in expectation. $\Gamma m$ gives a bound on the number of times the polynomial algorithm of [18] would be invoked to complete the calculation. On data with 200 genotypes and 30 sites, we show that on average $\Gamma < 4000$ even when only two haplotypes are present and the rate of missing entries is 50%. For a more realistic case of five haplotypes and 20% missing entries, $E[\Gamma] < 100$. Hence, the algorithm requires modest time even far beyond the range of its provable performance.

The paper is organized as follows: Section 2 presents definitions and preliminaries. Section 3 shows the hardness result. Section 4 presents the algorithm and the probabilistic analysis. Section 5 summarizes our experimental results. Due to lack of space most of the proofs are deferred to an appendix.

2 Preliminaries

In this section we provide basic definitions, lemmas and observations that are needed for our analysis. Given $n$ genotypes, the haplotype inference problem is to find $n$ pairs of haplotype vectors that could have generated the genotype vectors. Formally, the input can be represented by an $n \times m$ genotype matrix $M$, with $M[i, j] \in \{0, 1, 2\}$. The $i$-th row $M[i, \cdot]$ describes the $i$-th genotype. The $j$-th column describes the alleles in the $j$-th location: 0 or 1 for the two homozygote alleles, and 2 for a polymorphic (heterozygote) site.

A $2n \times m$ binary matrix $M'$ is an expansion of the genotype matrix $M$ if each row $M[i, \cdot]$ expands to two rows denoted by $M'[i, \cdot]$ and $M'[i', \cdot]$, with $i' = n + i$, satisfying the following for every $i$: if $M[i, j] \in \{0, 1\}$, then $M[i, j] = M'[i, j] = M'[i', j]$; if $M[i, j] = 2$, then $M'[i, j] \neq M'[i', j]$. $M'$ is also called a haplotype matrix corresponding to $M$.

Definition 1. Perfect Phylogeny Tree for a Matrix
A perfect phylogeny for a $k \times m$ haplotype matrix $M'$ is a tree $T$ with a root $r$, exactly $k$ leaves and integer edge labels, and a binary vector $(l_v(1), \ldots, l_v(m))$ for each node $v$, that obeys the following properties:
1. Each of the rows is the label of exactly one leaf of $T$.
2. Each of the columns labels exactly one edge of $T$.
3. Every edge of $T$ is labelled by one column.
4. For any node $v$, $l_v(i) \neq l_r(i)$ if and only if $i$ labels an edge on the unique path from the root to $v$.
Hence, given the root label, the root-node paths provide a compact representation of all node labels. An equivalent definition appeared in [2], with the difference that here we replace edges with multiple labels by paths with a single label per edge.

Definition 2. The Perfect Phylogeny Haplotype (PPH) Problem [10]
Given a matrix $M$, find an expansion $M'$ of $M$ which admits a perfect phylogeny.
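To make the two notions above concrete, the following minimal sketch checks (i) whether a binary matrix is an expansion of a genotype matrix, and (ii) whether a haplotype matrix admits a perfect phylogeny rooted at the all-zero haplotype. The second check is an assumption on our side rather than part of the definitions: it uses the standard pairwise characterization for the all-zero root (the sets of rows carrying a 1 in any two columns must be disjoint or nested), which matches the pairwise conditions of Lemma 1. The function names are hypothetical.

```python
from itertools import combinations


def is_expansion(genotypes, haplotypes):
    """Check that `haplotypes` (2n rows) is an expansion of `genotypes` (n rows).

    Genotype row i expands to haplotype rows i and n + i (0-based).
    """
    n = len(genotypes)
    if len(haplotypes) != 2 * n:
        return False
    for i, g in enumerate(genotypes):
        h1, h2 = haplotypes[i], haplotypes[n + i]
        if not (len(g) == len(h1) == len(h2)):
            return False
        for gj, x, y in zip(g, h1, h2):
            if gj in (0, 1):
                if not (x == y == gj):   # homozygote: both copies equal M[i, j]
                    return False
            elif gj == 2:
                if {x, y} != {0, 1}:     # heterozygote: the two copies differ
                    return False
            else:
                return False             # only complete genotypes are handled
    return True


def admits_zero_rooted_perfect_phylogeny(haplotypes):
    """Pairwise test, assuming the root is the all-zero haplotype."""
    m = len(haplotypes[0]) if haplotypes else 0
    ones = [{i for i, row in enumerate(haplotypes) if row[j] == 1}
            for j in range(m)]
    for a, b in combinations(range(m), 2):
        common = ones[a] & ones[b]
        # Overlapping but not nested: rows (1,1), (1,0) and (0,1) all occur.
        if common and common != ones[a] and common != ones[b]:
            return False
    return True


G = [[2, 2], [0, 1]]                        # two genotypes, two sites
H = [[0, 0], [0, 1], [1, 1], [0, 1]]        # rows 0 and 2 expand G[0]
ok_expansion = is_expansion(G, H)
ok_phylogeny = admits_zero_rooted_perfect_phylogeny(H)
```

For a root other than the all-zero vector, the matrix entries would first have to be relabelled relative to that root.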

Here we define a generalization of PPH that allows missing data entries. The input to our problem is an incomplete genotype matrix, i.e., a matrix $M$ with $M[i, j] \in \{0, 1, 2, ?\}$, where ? indicates a missing data entry. The process of replacing each ? by 0, 1 or 2 is called completing the matrix $M$.

Problem 1. Incomplete Perfect Phylogeny Haplotype (IPPH)
Given an incomplete genotype matrix $M$, can one complete $M$, so that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny?

Definition 3. Perfect Phylogeny Forest
Let $M'$ be a haplotype matrix, and let $P = (V_P, E_P)$ be a perfect phylogeny tree corresponding to $M'$. The perfect phylogeny forest of $P$ is a directed forest $F = (V_F, E_F)$ whose vertices are the edges of $P$, and for $u, v \in V_F$, $u$ is a parent of $v$ in $F$ if and only if the edge corresponding to $u$ in $P$ is a parent of the edge corresponding to $v$ in $P$.

From the above definition, the vertices of the perfect phylogeny forest correspond to the columns of $M'$, and reflect the order of mutations in the phylogeny tree. Clearly, each perfect phylogeny tree can be converted into a perfect phylogeny forest and vice versa. Thus, $M'$ admits a perfect phylogeny tree iff $M'$ admits a perfect phylogeny forest. For a column $j \in \{1, 2, \ldots, m\}$ of $M'$, we denote by $u_j$ its corresponding vertex in the perfect phylogeny forest. For a perfect phylogeny forest $F$, we say that two vertices are in parenthood relation if one is an ancestor of the other. Otherwise, we say that they are in brotherhood relation. Note that brothers can either be in different connected components, or be in the same component and have the root on the path connecting them.

The following special case of IPPH will be a main subject of our investigation.

Problem 2. Incomplete Perfect Phylogeny Haplotype, rooted version (IPPH-ROOTED)
Given an incomplete genotype matrix $M$ and a haplotype $r$, can one complete $M$, such that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny with $r$ as a root?

In this problem, w.l.o.g., we assume that the root haplotype is $r_0 = (0, \ldots, 0)$ (cf. [9]). The following lemma explains the connection between $F$ and $M'$, assuming that the root is $r_0$.

Lemma 1. ([2], [5]) Let $M'$ be a $2n \times m$ haplotype matrix. Then $F = (V_F, E_F)$ is its perfect phylogeny forest with the root haplotype $r_0$ iff for all $u_a, u_b \in V_F$ and for all $i \in \{1, \ldots, 2n\}$:
1. If $u_a$ is an ancestor of $u_b$, then $M'[i, a] = 1$ or $M'[i, b] = 0$.
2. If $u_a$ and $u_b$ are in brotherhood relation, then $M'[i, a] = 0$ or $M'[i, b] = 0$.

In the rest of this section, we provide our own definitions, building on those introduced above, and prove several lemmas which will be needed for our analysis.

Definition 4. Constrained Mixed Graph
A constrained mixed graph is a triplet $G_c = (V, E, X)$, where $G = (V, E)$ is a graph and $X = \{X_1, X_2, \ldots, X_p\}$, where for each $i$: $X_i \subseteq V$. The sets $X_i$ are called XOR relations. $G$ has four types of edges: undirected, dashed undirected, directed and dashed directed.

Definition 5. Parenthood Connected Components
Two vertices $u$ and $v$ in a constrained mixed graph are in the same parenthood connected component if there exists a path between $u$ and $v$ that consists only of undirected or directed edges (a parenthood relation). Note that the direction of an edge is not important in this definition.

Definition 6. Constrained Mixed Completion Graph
For a constrained mixed graph $G_c = (V, E, X)$, we define its constrained mixed completion graph $G' = (V, E')$ to be a complete graph (with a single edge for each pair $u, v \in V$), where $E'$ contains two types of edges: directed and dashed undirected. Each edge of $G'$ is labelled with $L : E' \to \{0, 1\}$, where a directed edge is labelled with 0, and a dashed undirected edge is labelled with 1. $G'$ maintains all of the following properties:
1. All edges of $G'$ maintain the following properties:
   (a) If $e : (u, v) \in E$ is an undirected edge, then the corresponding $e' : (u, v) \in E'$ must be a directed edge from $u$ to $v$ or from $v$ to $u$.
   (b) Directed edges and dashed undirected edges in $G$ preserve their type in $G'$.
   (c) If $e : (u, v) \in E$ is a dashed directed edge from $u$ to $v$, then the corresponding $e' : (u, v) \in E'$ must be a dashed undirected edge or a directed edge from $u$ to $v$.
2. There exists a spanning directed forest $F = (V, E_F \subseteq E')$, such that:
   (a) If node $u \in V$ is an ancestor of $v \in V$ in $F$, then there is a directed edge from $u$ to $v$ in $G'$.
   (b) If node $u \in V$ is not an ancestor of $v \in V$ and $v$ is not an ancestor of $u$ in $F$, then there is a dashed undirected edge between $u$ and $v$ in $G'$.
3. For each XOR relation $X_i$ and for every three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i$, the following holds (see footnote 2): $L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c}) \oplus L(x_{i,a}, x_{i,c}) = 0$.

Problem 3. Constrained Mixed Graph Spanning (CMGS) problem
The input to the CMGS problem is a constrained mixed graph $G$. The output is a constrained mixed completion graph of $G$, if one exists. An example of the CMGS problem is presented in Figure 2. The decision version of the CMGS problem is to decide whether there exists a constrained mixed completion graph $G'$ for $G$.

An important quality of the constrained mixed completion graph is that it can be viewed as a directed spanning forest $F$ with additional edges between nodes, according to the relation of those nodes in the forest: a dashed undirected edge for a brotherhood relation, or a directed edge for a parenthood relation.

The following notations are adopted from Eskin et al. [5]: $c(M, x)$ is defined as the set of rows of $M$ containing the value $x$ at column $c$. Let $c, c'$ be columns and $x, y$ be elements of $\{0, 1\}$. The pair $c, c'$ induces $(x, y)$ in $M$ if $(c(M, x) \cap c'(M, y)) \cup (c(M, x) \cap c'(M, 2)) \cup (c(M, 2) \cap c'(M, y)) \neq \emptyset$. Let $R(M, c, c')$ be the set of pairs $(x, y)$ such that $(c, c')$ induces $(x, y)$ in $M$. Note that $R(M, c, c')$ does not contain pairs with ?, but only 0 and 1.

Let $c, c'$ be two columns such that $c(M, 2) \cap c'(M, 2) \neq \emptyset$. Let $M'$ be an expansion of $M$, after completing the missing entries, which admits a perfect phylogeny. We say that $M'$ resolves the pair of columns $(c, c')$ unequally if $\{(0, 1), (1, 0)\} \subseteq R(M', c, c')$ and equally if $(1, 1) \in R(M', c, c')$. According to Lemma 1, $M'$ must resolve the pair $(c, c')$ either equally or unequally, and cannot resolve the pair in both ways.

Footnote 2. The operator $\oplus$ denotes the Boolean XOR operator.

For an incomplete genotype matrix $M$, we build a constrained mixed graph $G_c(M)$, where each column in $M$ has a corresponding vertex in $G_c$. The edges represent the

possible relations of the columns in the perfect phylogeny forest, and are determined according to Lemma 1. For each two vertices $u_a, u_b$:
(1) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 1), (1, 0)\}$, then $u_a$ is an ancestor of $u_b$ in $F$. The edge $(u_a, u_b)$ is determined to be a directed edge from $u_a$ to $u_b$.
(2) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 1)\}$, then $u_a, u_b$ are in parenthood relation in $F$, but it is unknown which of the vertices is the ancestor. The edge $(u_a, u_b)$ is determined to be an undirected edge.
(3) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 0), (0, 1)\}$, then $u_a, u_b$ are in brotherhood relation in $F$. The edge $(u_a, u_b)$ is determined to be a dashed undirected edge.
(4) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 0)\}$, then either $u_a$ is an ancestor of $u_b$ in $F$, or $u_a, u_b$ are in brotherhood relation in $F$. The edge $(u_a, u_b)$ is determined to be a dashed directed edge from $u_a$ to $u_b$.
(5) If $R(M, a, b) \setminus \{(0, 0)\} = \emptyset$, then the relation of $u_a, u_b$ in $F$ is unknown. In that case $(u_a, u_b) \notin E$.

In addition, for each set of columns $a_1, \ldots, a_t$, if there exists a row $i$ such that $M[i, a_1] = \cdots = M[i, a_t] = 2$, then the corresponding vertices $u_{a_1}, \ldots, u_{a_t}$ belong to a common XOR relation. Each pair of vertices of $G_c$ is labelled with $L : (u_a, u_b) \to \{0, 1, ?\}$, where an un-dashed (directed or undirected) edge, i.e., a parenthood relation, is labelled with 0; a dashed undirected edge, i.e., a brotherhood relation, is labelled with 1; and all other cases, i.e., an unknown relation, are labelled with ?. The last set is called the unlabelled pairs. Note that if for two vertices $u_a, u_b$ none of the above five possibilities applies (up to exchanging the roles of $a$ and $b$), then according to Lemma 1, $M$ does not define a perfect phylogeny forest.

Definition 7. Preliminary Label Completion
A preliminary label completion of $G_c(M)$ is an assignment of labels to the unlabelled pairs of vertices, according to the following algorithm: while possible, iteratively find three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i$ such that $L(x_{i,a}, x_{i,b})$ and $L(x_{i,b}, x_{i,c})$ are set and $L(x_{i,a}, x_{i,c})$ is not, and assign $L(x_{i,a}, x_{i,c}) = L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c})$.

Define $U_{G_c}$ to be the set $\{(u_a, u_b) : L(u_a, u_b) = ?\}$, i.e., the set of pairs of vertices with an unknown label after preliminary label completion was performed.

Definition 8. Secondary Label Completion
A secondary label completion of $G_c(M)$ is an assignment of a label from $\{0, 1\}$ to every $(u_a, u_b) \in U_{G_c}$, such that for each XOR relation $X_i$ and for every three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i$: $L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c}) \oplus L(x_{i,a}, x_{i,c}) = 0$.

After secondary label completion, we can perform label resolution of the incomplete genotype matrix, which is defined as follows.

Definition 9. Label Resolution of an Incomplete Genotype Matrix
A $2n \times m$ incomplete binary matrix $\widetilde{M}$ is an expansion of the incomplete genotype matrix $M$ if each row $M[i, \cdot]$ expands to two rows denoted by $\widetilde{M}[i, \cdot]$ and $\widetilde{M}[i', \cdot]$, with $i' = n + i$, satisfying the following for every $i$: if $M[i, j] \in \{0, 1, ?\}$, then $M[i, j] = \widetilde{M}[i, j] = \widetilde{M}[i', j]$; if $M[i, j] = 2$, then either $\widetilde{M}[i, j] = 0, \widetilde{M}[i', j] = 1$ or $\widetilde{M}[i, j] = 1, \widetilde{M}[i', j] = 0$. $\widetilde{M}$ is also called an incomplete haplotype matrix corresponding to $M$. A label resolution of a genotype matrix $M$ is an expansion of $M$ to an incomplete haplotype matrix $\widetilde{M}$ according to the label function $L$: for each two columns $a, b$ for which there exists $i$ such that $M[i, a] = M[i, b] = 2$, if $L(u_a, u_b) = 0$ resolve $(a, b)$ equally, and if $L(u_a, u_b) = 1$ resolve $(a, b)$ unequally.
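The construction of $G_c(M)$ and the preliminary label completion of Definition 7 can be illustrated with a short sketch. It computes $R(M, a, b)$, applies the five cases above (plus their mirror images for the reversed direction) to obtain the initial labels, and then propagates labels inside each XOR relation until nothing changes. All function names are hypothetical, columns are 0-based, and the naive fixed-point loop is our own simplification rather than the procedure of [2].

```python
from itertools import combinations, permutations

UNKNOWN = "?"


def induced_pairs(M, a, b):
    """R(M, a, b): the pairs (x, y) in {0,1}^2 induced by columns a and b."""
    R = set()
    for row in M:
        x, y = row[a], row[b]
        if x in (0, 1) and y in (0, 1):
            R.add((x, y))
        elif x in (0, 1) and y == 2:
            R.update({(x, 0), (x, 1)})
        elif x == 2 and y in (0, 1):
            R.update({(0, y), (1, y)})
        # rows with (2, 2) or with a "?" induce nothing
    return R


def classify_pair(M, a, b):
    """Return (edge type, initial label) for the vertex pair (u_a, u_b)."""
    S = frozenset(induced_pairs(M, a, b) - {(0, 0)})
    cases = {
        frozenset({(1, 1), (1, 0)}): ("directed a->b", 0),
        frozenset({(1, 1), (0, 1)}): ("directed b->a", 0),
        frozenset({(1, 1)}): ("undirected", 0),
        frozenset({(1, 0), (0, 1)}): ("dashed undirected", 1),
        frozenset({(1, 0)}): ("dashed directed a->b", UNKNOWN),
        frozenset({(0, 1)}): ("dashed directed b->a", UNKNOWN),
        frozenset(): ("no edge", UNKNOWN),
    }
    # Any other set violates Lemma 1: no perfect phylogeny forest exists.
    return cases.get(S, ("conflict", None))


def preliminary_label_completion(M):
    """Return (labels, unlabelled pairs) after XOR propagation."""
    m = len(M[0])
    L = {p: classify_pair(M, *p)[1] for p in combinations(range(m), 2)}
    # One XOR relation per row: the columns that are heterozygote in it.
    xor_sets = [[j for j in range(m) if row[j] == 2] for row in M]

    def key(u, v):
        return (min(u, v), max(u, v))

    changed = True
    while changed:
        changed = False
        for X in xor_sets:
            for a, b, c in permutations(X, 3):
                ab, bc, ac = key(a, b), key(b, c), key(a, c)
                if L[ab] in (0, 1) and L[bc] in (0, 1) and L[ac] == UNKNOWN:
                    L[ac] = L[ab] ^ L[bc]
                    changed = True
    return L, {p for p, v in L.items() if v == UNKNOWN}


M = [
    [1, 1, 0],
    [1, 0, "?"],
    ["?", 0, 1],
    [2, 2, 2],   # puts columns 0, 1, 2 into one XOR relation
]
labels, unlabelled = preliminary_label_completion(M)
```

In this toy matrix the pair (0, 1) starts with label 0 and the pair (1, 2) with label 1, so the initially unknown pair (0, 2) receives $0 \oplus 1 = 1$ and no pair remains unlabelled.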

Label resolution of an incomplete genotype matrix can be done by the same algorithm proposed by Bafna et al. [2] (algorithm E2M). Observe that any submatrix $M[i, (a, b)]$, where $M[i, a]$ and $M[i, b]$ are not both equal to 2, has a unique expansion in any incomplete haplotype matrix. Hence, for such a submatrix, the resolution is not influenced by the label function.

The algorithm described in Definition 7 was suggested by [2] as part of their algorithm for phasing a complete genotype matrix. Interestingly, they proved that once preliminary label completion is performed, for any possible (legal) secondary label completion of $U_{G_c}$, a label resolution of the genotype matrix results in a haplotype matrix which admits a perfect phylogeny. This is true for a complete genotype matrix (with no missing entries), but not for the incomplete case: not every secondary label completion, followed by its corresponding label resolution of the incomplete genotype matrix, results in an incomplete haplotype matrix that can be completed to a complete haplotype matrix which admits a perfect phylogeny. The following lemma, which is proven in the Appendix in Subsection 7.4, describes a weaker connection between secondary label completion and the solution of IPPH.

Lemma 2. If an incomplete genotype matrix $M$ can be completed so that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny, then there exists some secondary label completion of $U_{G_c}$ such that a label resolution of the incomplete genotype matrix $M$ gives an incomplete haplotype matrix that can be completed to $M'$.

3 The Hardness Result

In this section we show that IPPH and IPPH-ROOTED are NP-complete. Trivially, both versions belong to NP. To prove NP-hardness, we will show the following polynomial reductions: 3-SAT $\propto$ CMGS $\propto$ IPPH-ROOTED $\propto$ IPPH, which implies that both the rooted and unrooted versions are NP-complete.

The reduction IPPH-ROOTED $\propto$ IPPH is as follows: given an instance $(M, r)$ of IPPH-ROOTED, we simply add the genotype row $r$ to $M$. The resulting matrix $M'$ is the input to IPPH. In a solution to the latter, there will be a leaf labelled with $r$, and thus it solves the former problem. Conversely, if $M$ has a solution with root $r$, then it is also a solution for $M'$. The exact same idea was used by Bafna et al. [2], except that here it is applied to an incomplete genotype matrix.

The following two theorems, which are proven in the Appendix in Subsections 7.2 and 7.3, imply the hardness of IPPH:

Theorem 1. CMGS $\propto$ IPPH-ROOTED.

Theorem 2. 3-SAT $\propto$ CMGS.

4 An Algorithmic Solution for IPPH

In spite of the results of Section 3, we provide an algorithmic approach to IPPH. We restrict the problem by applying some biological insights, in which the data is assumed to

be generated by a stochastic model. We provide an algorithm that takes expected polynomial time for both the rooted and the unrooted versions of IPPH.

Pe'er et al. [18] suggested an $\tilde{O}(mn)$-time algorithm for solving the rooted version of perfect phylogeny with missing data. Let the input incomplete haplotype matrix be $\widetilde{M}$, with $\widetilde{M}[i, j] \in \{0, 1, ?\}$, and let the root be $r$. We denote by $IDP(\widetilde{M}, r)$ the completed matrix obtained by running this algorithm on $\widetilde{M}$. We also use $IDP(\widetilde{M})$ to denote $IDP(\widetilde{M}, r_0)$. We use $h(\cdot, \cdot)$ to denote the Hamming distance between two binary vectors. The following notation is used in the description of the algorithm: $\Sigma_0(M[\cdot, j])$ and $\Sigma_1(M[\cdot, j])$ are the numbers of 0s and 1s in the $j$-th column, respectively.

Consider first the case where the root is known ($r_0$). Given an instance of an incomplete matrix $M$, we build a constrained mixed graph, as described in Section 2. We then perform preliminary label completion (Definition 7). According to Lemma 2, if $M$ can be completed so that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny, then there exists some secondary label completion of $U_{G_c}$ for which a label resolution of the incomplete genotype matrix $M$ gives an incomplete haplotype matrix that can be completed to $M'$. Thus, the computational challenge is to find such a secondary label completion. Suppose we were able to guess the correct secondary label completion by an oracle. In that case, let $\widetilde{M}$ be the incomplete haplotype matrix resulting from performing label resolution (Definition 9) accordingly. A completion of $\widetilde{M}$ can be done in polynomial time by $IDP(\widetilde{M})$. Hence, the bottleneck step is finding a secondary label completion. Due to the hardness result in Section 3, a polynomial time oracle for finding the correct secondary label completion does not exist, unless P=NP. However, when assuming additional properties of the genotype data, this can be performed by an algorithm with polynomial expected time.

We now describe those assumptions, together with the biological motivation of each:

1. Each entry value in the original genotype matrix is replaced by ? with probability $p$, independently of the other values. This assumption makes sense, as missing data entries are caused by technical problems in the biological experiment. The same value $p$ may be used for all entries. One may claim that there are situations in which each allele (SNP) has a different probability for a missing entry, due to distinct difficulties in sequencing along different regions in the human genome. In that case, we denote by $p_i$ the probability for a missing entry in the $i$-th allele and set $p := \max_i \{p_i\}$.

2. We assume that each haplotype $h_i$, which is a node in a perfect phylogeny tree, is chosen to be in a genotype with probability $\alpha_i$, independently. This assumption is also made as part of the Hardy-Weinberg equilibrium model [12]. An additional assumption is that those probabilities do not depend on $n$ or $m$. The logic behind the last assumption is that the probabilities of the haplotypes in the population depend neither on the number of sampled genotypes ($n$) nor on the size of a block ($m$).

3. We assume that the number of columns ($m$) and the number of rows ($n$) obey a rule stating that when $n$ becomes larger, $m$ is not considerably increased. Specifically, we use $m = o(n^{0.25})$. This assumption applies in all biological constellations: in future experiments, the number of genotypes is expected to be larger, while $m$ is not expected to grow substantially, since $m$ is the size of a region

in the chromosome where the number of recombination events in the sampled population is small. A constant value of $m$ is thus also plausible, but for our analysis a much weaker assumption than that is required.

Pro-IPPH(M):
1. Let $G_c(M) = (V, E, X)$ be the constrained mixed graph of $M$.
2. Perform preliminary label completion of $G_c(M)$.
3. Let $\hat{r}$ be a vector such that $\hat{r}_j = 0$ if $\Sigma_0(M[\cdot, j]) > \Sigma_1(M[\cdot, j])$ and $\hat{r}_j = 1$ otherwise.
4. For $i = 0, \ldots, \binom{m}{2}$:
     For each possible root $r \in \{0, 1\}^m$ such that $h(r, \hat{r}) = i$ do:
       Relabel the matrix entries according to $r$, so that $r_0$ is the new root.
       For each possible secondary label completion of $U_{G_c}$ such that $|\{(u_a, u_b) : (u_a, u_b) \in U_{G_c} \wedge L((u_a, u_b)) = 0\}| = i$ do:
         Perform label resolution of $M$ to $\widetilde{M}$.
         If $IDP(\widetilde{M})$ is compatible then output $IDP(\widetilde{M})$ and halt.
5. Output: no solution.

Fig. 1. An algorithm for IPPH.

The above algorithm was designed to solve IPPH under the assumptions above. Informally, algorithm Pro-IPPH(M) ignores the missing data entries in order to decide the relation between each two columns in the matrix. As we shall prove, if the relation cannot be concluded deterministically from the matrix, then with high probability a correct relation is obtained just by guessing. The following theorem is proven in Subsection 7.5 in the Appendix.

Theorem 3. Under the assumptions of the model, algorithm Pro-IPPH(M) solves IPPH correctly within expected time of $\tilde{O}(m^2 n)$.

5 Experimental Results

In order to assess our algorithm, we applied it to simulated data. The simulations used parameters which were adopted from several large scale biological studies [4, 17, 7]. By Theorem 3 the algorithm always outputs a correct solution. Although we proved that under our model assumptions the expected running time is $\tilde{O}(m^2 n)$, we wanted to estimate the actual running time under realistic biological parameters and beyond the range of the model assumptions. Specifically, we wanted to calculate the expected number of different phylogenetic tree solutions for a given data set. The proof of Theorem 3 implies that $\Gamma = 2^{|U_{G_c}|}$ is an upper bound on the number of different phylogeny solutions, and the dominant factor in the complexity of the algorithm.

In each different experiment, we randomly generated $N = 10^5$ perfect phylogeny trees. We used the following procedure to generate a perfect phylogeny tree of haplotypes: we start with a binary root vector with $m = 30$ sites. Initially, no site is marked. In each step, we randomly pick a tree node and an unmarked site, add a new child haplotype to that node in which only the state of that site is changed, and mark the site.
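The tree-generation procedure just described can be sketched as follows. This is an illustrative reading of the text, not the authors' simulation code: we assume that nodes and unmarked sites are picked uniformly at random, that the process continues until all $m$ sites are marked, and that the root defaults to the all-zero vector. The function name `random_perfect_phylogeny` is hypothetical.

```python
import random


def random_perfect_phylogeny(m, rng, root=None):
    """Generate the haplotypes (tree nodes) of a random perfect phylogeny.

    Starts from a binary root vector with m sites. Each step picks a random
    existing node and a random unmarked site, and adds a child haplotype that
    differs from its parent only at that site. Every site mutates exactly once.
    Returns the list of node haplotypes, root first.
    """
    root = [0] * m if root is None else list(root)
    nodes = [root]
    sites = list(range(m))
    rng.shuffle(sites)          # equivalent to drawing an unmarked site each step
    for site in sites:
        parent = rng.choice(nodes)
        child = list(parent)
        child[site] ^= 1        # flip only the state of the chosen site
        nodes.append(child)
    return nodes


rng = random.Random(0)          # fixed seed only for reproducibility
haplotypes = random_perfect_phylogeny(30, rng)
```

With $m = 30$ this yields 31 distinct haplotypes, from which a subset can then be sampled to build genotypes.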

For each tree, we randomly chose $k$ haplotypes for reconstructing the genotypes, where $k = 2, 3, \ldots, 9$. We assigned probabilities, denoted by $\alpha_1, \alpha_2, \ldots, \alpha_k$, to the $k$ chosen haplotypes, such that $\sum_{i=1}^{k} \alpha_i = 1$ and $\forall i : \alpha_i$ [...]. For each tree, different probabilities were assigned. Next, we generated 200 genotypes according to the chosen haplotypes and their assigned probabilities. Missing data entries were introduced into the genotypes as follows: each site in the genotype data was turned into a missing entry independently with probability $p$. Recalling that in real data $p$ is on the order of 0.1, we checked a wider range: $p = 0, 0.05, \ldots, 0.5$. Thus, for each sampled tree $T_j$, $j = 1, 2, \ldots, N$, we sampled one incomplete genotype matrix $M_j$ of size $200 \times 30$.

We ran our algorithm on each $M_j$. We denote $U_{G_c(M_j)}$ by $U_j$. We stopped at $i = 0$, after performing steps 1-3 of the algorithm, in order to calculate $2^{|U_j|}$. As was shown in Section 4, if the secondary label completion is known, it is possible to output the solution to IPPH in $\tilde{O}(m^2 n)$ time. Hence, completion of the algorithm, for each $M_j$, should take less than $2^{|U_j|} \cdot \tilde{O}(m^2 n)$ time. The dominating factor in the running time is the random variable $2^{|U_j|}$, whose expectation is approximated by:

$E[\Gamma] = E[2^{|U_j|}] \approx \frac{1}{N} \sum_{j=1}^{N} 2^{|U_j|}$.

The results are presented in Figure 6 in the Appendix (Subsection 7.6). Generally, $E[\Gamma]$ is below [...]. When the missing data rate is below 20%, $E[\Gamma]$ is smaller than 100. Another observation is that the larger the number of haplotypes chosen from the phylogeny tree, the smaller the value of $E[\Gamma]$. Notably, in all cases we assessed a correct root: either by finding at least one haplotype which is homozygote with no missing entries in all sites, or by using the majority rule described in the algorithm.

To demonstrate that in real biological data the root can practically be found in linear time, we chose the genotype data of Daly et al. [4]. This data set consists of 103 SNPs and 129 genotypes. We checked all possible blocks. In all the blocks whose size was smaller than 65 SNPs, there could always be found at least one genotype which is homozygote in all alleles, without any missing entry. This genotype can be resolved in only one possible way, and hence this known haplotype can be used as a root. Since the size of a block is almost always smaller than 30, this simple method can be used for finding a root in biological data.

6 Concluding Remarks

We investigated the incomplete perfect phylogeny haplotype problem: phasing of genotypes into haplotypes, under the perfect phylogeny model, where some of the data are missing. We proved that the problem, both in its rooted and unrooted versions, is NP-complete. We also provided a practical expected polynomial-time algorithm for a biologically motivated restriction of the problem. We applied our algorithm to simulated data, and concluded that the running time and the number of distinct phylogeny solutions are relatively small under common biological conditions and parameters, even when the missing data rate is 50%.

A more accurate treatment for phasing of genotypes with missing entries can now be obtained. In addition, due to the small number of phylogenetic solutions observed in simulations, incorporation of other statistical and combinatorial models with our algorithm is feasible.

Acknowledgments

This research was supported by a grant from the Israel Science Foundation (grant 309/02). We thank Roded Sharan for fruitful discussions.

References

1. N. Alon and J. H. Spencer. The Probabilistic Method. John Wiley and Sons, Inc.
2. V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: A direct approach. Technical Report UCDavis CSE.
3. A. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7(2):111-22.
4. M. J. Daly et al. High-resolution haplotype structure in the human genome. Nature Genetics, 29(2).
5. E. Eskin, E. Halperin, and R. M. Karp. Large scale reconstruction of haplotypes from genotype data. In Proceedings of The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB).
6. L. Excoffier and M. Slatkin. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution, 12(5):912-7.
7. S. B. Gabriel et al. The structure of haplotype blocks in the human genome. Science, 296.
8. L. Kruglyak and D. A. Nickerson. Variation is the spice of life. Nature Genetics, 27.
9. D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19-28.
10. D. Gusfield. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In Proceedings of The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB).
11. B. V. Halldorsson et al. Combinatorial problems arising in SNP. DMTCS 03 Conference.
12. G. H. Hardy. Mendelian proportions in a mixed population. Science, 18:49-50.
13. M. R. Hoehe. Haplotypes and the systematic analysis of genetic variation in genes and genomes. Pharmacogenomics, 4(5).
14. M. R. Hoehe et al. Sequence variability and candidate gene analysis in complex disease: association of µ opioid receptor gene variation with substance dependence. Human Molecular Genetics, 9.
15. J. Long et al. An EM algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics, 56(3).
16. T. Niu et al. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70(1):157-69.
17. N. Patil et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294.
18. I. Pe'er, R. Shamir, and R. Sharan. Incomplete directed perfect phylogeny. In Proceedings of the Combinatorial Pattern Matching Conference (CPM).
19. R. Sachidanandam et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 291.
20. M. A. Steel. The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9:91-116.
21. M. Stephens et al. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68(4):978-89.
22. C. Venter et al. The sequence of the human genome. Science, 291, 2001.

7 Appendix

7.1 Figures of Graphs

Fig. 2. Example of a CMGS problem. A: an instance of a graph for the CMGS problem, on the vertices a, b, c, d, e, with the XOR relations {b, c, e} and {a, b, d}. B: a possible solution for this instance. The edges of the forest are in bold.

Fig. 3. Clause base graph, on the vertices $c^j_0, c^j_1, \dots, c^j_5$.

7.2 Proof of Theorem 1

Proof. Given an instance of a constrained mixed graph $G_c = (V, E, X)$ for the CMGS problem, we build a matrix $M$ which, together with the root $r$ satisfying $\forall i: r[i] = 0$, serves as input for IPPH-ROOTED. $M$ has size $(2|E| + p) \times |V|$. For each $e \in E$ there are two corresponding rows, whose indices are denoted by $N_e^0$ and $N_e^1$, respectively. For each $X_i \in \{X_i\}_{1 \le i \le p}$ there is one corresponding row, whose index is denoted by $N_{X_i}$. For a column $i \in \{1, 2, \dots, |V|\}$, we denote its corresponding vertex in $G_c$ by $u_i$. The construction of $M$ is as follows:

Fig. 4. The graph structure for a variable positive and negative connector. The following sets of vertices maintain a XOR relation: $\{a_0, b_0, a_1, b_1\}$, $\{a_1, b_1, a_2, b_2\}$, $\{a_2, b_2, a_3, b_3\}$, $\{a_3, b_3, a_4, b_4\}$, $\{a_4, b_4, a_5, b_5\}$, $\{d_0, d_0, e_1, e_1\}$, $\{d_1, d_1, e_2, e_2\}$ and $\{d_2, d_2, e_3, e_3\}$.

Fig. 5. Completion of variable positive and negative connectors, for the 2 possible edges of a variable-edge.

1. For each $e : (u_a, u_b) \in E$, we add 2 rows $M[N_e^0, \cdot]$ and $M[N_e^1, \cdot]$, such that $\forall u_c \in V \setminus \{u_a, u_b\}$: $M[N_e^0, c] = M[N_e^1, c] = ?$, and the $2 \times 2$ block $\begin{pmatrix} M[N_e^0, a] & M[N_e^0, b] \\ M[N_e^1, a] & M[N_e^1, b] \end{pmatrix}$ is set as follows:
(a) If $e : (u_a, u_b)$ is an undirected edge, then the block is $\begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$.
(b) If $e : (u_a, u_b)$ is a dashed undirected edge, then the block is [...].
(c) If $e : (u_a, u_b)$ is a directed edge from $u_a$ to $u_b$, then the block is [...].
(d) If $e : (u_a, u_b)$ is a dashed directed edge from $u_a$ to $u_b$, then the first row of the block is $(0 \;\; 0)$ and the second row is [...].
2. For each $\{X_i\}_{1 \le i \le p}$, we add one row $M[N_{X_i}, \cdot]$, such that $\forall u_j \in X_i$: $M[N_{X_i}, j] = 2$ and $\forall u_k \in V \setminus X_i$: $M[N_{X_i}, k] = ?$.

Suppose first that IPPH-ROOTED$(M, r)$ = TRUE, i.e. $M$ has an expansion $M'$ such that $M'$ admits a perfect phylogeny tree with the all-zero vector $r$ as a root. Thus, $M'$ has a directed perfect phylogeny forest $F = (V_F, E_F)$. Let $\hat{F} = (V_F, \hat{E}_F)$ be a complete graph, where for each $u, v \in V_F$ we add a directed edge from $u$ to $v$ if $u$ is an ancestor of $v$ in $F$, a directed edge from $v$ to $u$ if $v$ is an ancestor of $u$ in $F$, or a dashed undirected edge otherwise (brotherhood relation). We claim that $\hat{F}$ is the constrained mixed completion graph of $G_c$. This is proven by checking that all three properties of $\hat{F}$ as a constrained mixed completion graph of $G_c$ hold.

Property 2 holds since, by the construction of $\hat{F}$ from $F$, $F$ is a rooted spanning forest of $\hat{F}$ as required.

In order to prove property 3 we use Lemma 2 in [2]: the rows $\{M[N_{X_i}, \cdot]\}_{1 \le i \le p}$ force that, for each of the XOR relations and for every three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i$ ($X_i \subseteq V_F$),

$$L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c}) \oplus L(x_{i,a}, x_{i,c}) = 0.$$

Last, property 1 holds, since for an edge $e \in E$ the values of $\{M[N_e^j, \cdot]\}_{j \in \{0,1\}}$ are determined in step 1 of the construction of $M$, according to the possible relation of $u$ and $v$ in $F$ suggested by the edge $(u, v)$ in the graph $G_c$. Since, by the assumption, $M$ has an expansion $M'$ such that $M'$ admits a perfect phylogeny forest $F$, then for each $u, v \in F$ the edge $\hat{e} : (u, v) \in \hat{E}_F$ must be determined according to $e : (u, v) \in E$ in $G_c$: if $e$ is an undirected edge then $\hat{e}$ must be a directed edge; if $e$ is a dashed undirected edge then $\hat{e}$ must be a dashed undirected edge; if $e$ is a directed edge from $u$ to $v$ then $\hat{e}$ must be a directed edge from $u$ to $v$; and if $e$ is a dashed directed edge from $u$ to $v$ then $\hat{e}$ must be a dashed undirected edge or a directed edge from $u$ to $v$. This establishes property 1. Thus, $\hat{F}$ is the constrained mixed completion graph of $G_c$, and CMGS$(G_c)$ = TRUE.
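The construction of $M$ in the proof of Theorem 1 can be sketched in Python. This is only an illustration under stated assumptions: the four row pairs in `PATTERNS` are chosen to be consistent with the relations that the proof requires of each edge type, and are not taken from the original matrices; vertices are assumed to be indexed from 0; and `build_matrix` and `allowed_relations` are hypothetical helper names.

```python
# Assumed row pairs (value in column a, value in column b) per edge type.
PATTERNS = {
    "undirected":      [(0, 0), (1, 1)],  # forces a parenthood relation
    "dashed":          [(0, 1), (1, 0)],  # forces brotherhood
    "directed":        [(1, 0), (1, 1)],  # forces u_a to be an ancestor of u_b
    "dashed_directed": [(0, 0), (1, 0)],  # forbids u_b as an ancestor of u_a
}


def build_matrix(n_vertices, edges, xor_sets):
    """edges: list of (a, b, kind); xor_sets: list of sets of vertex indices.
    Returns a matrix with 2*len(edges) + len(xor_sets) rows."""
    rows = []
    for a, b, kind in edges:
        for value_a, value_b in PATTERNS[kind]:
            row = ["?"] * n_vertices
            row[a], row[b] = value_a, value_b
            rows.append(row)
    for xor_set in xor_sets:
        rows.append([2 if j in xor_set else "?" for j in range(n_vertices)])
    return rows


def allowed_relations(pairs):
    """Relations between columns a and b that the given rows permit when the
    root is all-zero: 'a ancestor of b' forbids the row (0, 1),
    'b ancestor of a' forbids (1, 0), and brotherhood forbids (1, 1)."""
    relations = set()
    if (0, 1) not in pairs:
        relations.add("a_ancestor_of_b")
    if (1, 0) not in pairs:
        relations.add("b_ancestor_of_a")
    if (1, 1) not in pairs:
        relations.add("brothers")
    return relations
```

Under these assumed patterns, `allowed_relations` returns exactly the completions listed in the proof of property 1: either direction for an undirected edge, brotherhood only for a dashed undirected edge, a single direction for a directed edge, and brotherhood or that direction for a dashed directed edge.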

Conversely, suppose that CMGS$(G_c)$ = TRUE, i.e. there exists a constrained mixed completion graph $\hat{G}$ for $G_c$. According to the second property of $\hat{G}$, there exists a directed forest $F = (V, E_F)$ which spans $V$. Due to the third property of a constrained mixed completion graph, the completion of edges in $G_c$ does not violate the XOR relations. We create an expansion of $M$ as follows. Resolve the 2s of the genotypes in the rows $N_{X_i}$ according to $\hat{G}$: for two vertices $u_a, u_b \in X_i$ ($1 \le i \le p$) with $M[N_{X_i}, a] = M[N_{X_i}, b] = 2$, if there is an undirected dashed edge between $u_a$ and $u_b$, then resolve the submatrix $(M[N_{X_i}, a] \;\; M[N_{X_i}, b])$ unequally, and if there is a directed edge between $u_a$ and $u_b$, then resolve the submatrix equally. Since those edges are completed in $\hat{G}$ according to the XOR relations (see Definition 6, property 3), each of the 2s in these rows can be resolved accordingly.

We denote the resulting matrix by $M'$. Note that $M'[i, j] \in \{0, 1, ?\}$. We call the $\{0, 1\}$ components constants, and the $?$ components variables. We denote the set of column indices of constants in row $i$ by $C_i$, and the set of column indices of variables in this row by $V_i$. Complete the variable components of $M'$ to create the matrix $M''$ according to:

$$\forall j \in V_i: \quad M''[i, j] = \begin{cases} 1 & \text{if } \exists c \in C_i \text{ such that } M'[i, c] = 1 \text{ and } u_j \text{ is an ancestor of } u_c, \\ 0 & \text{otherwise.} \end{cases}$$

Now $M''[i, j] \in \{0, 1\}$. We claim that $M''$ (which is an expansion of $M$) admits a perfect phylogeny forest. Moreover, this forest is $F$. This will be proven by showing that each two columns in $M''$ do not contradict $F$, and thus, according to Lemma 1, $F$ is the perfect phylogeny forest of $M''$. Consider two vertices $u_a, u_b \in V$ and their corresponding columns $a, b$ in $M''$. For each row $i$, we examine the 3 possible cases for the submatrix $(M''[i, a] \;\; M''[i, b])$:

1. $u_a, u_b \in C_i$: The submatrix $(M''[i, a] \;\; M''[i, b])$ is determined according to the edge $(u_a, u_b) \in E$, which by the definition of $\hat{G}$ does not contradict $F$.

2. $u_a \in C_i$, $u_b \in V_i$ (w.l.o.g.):
First, suppose $M''[i, a] = 0$. If $M''[i, b]$ is determined to be 0, then there is no contradiction for any relation of $u_a$ and $u_b$ in $F$. Otherwise, if $M''[i, b]$ is determined to be 1, then there exists $c \in C_i$, $c \ne a$, such that $M''[i, c] = 1$ and $u_b$ is an ancestor of $u_c$. Suppose, on the contrary, that a forbidden submatrix occurred, i.e. $u_a$ is an ancestor of $u_b$. Since $u_b$ is an ancestor of $u_c$, $u_a$ must be an ancestor of $u_c$. However, according to the construction of $M$, it is not possible that $u_a$ is an ancestor of $u_c$, since $M''[i, a] = 0$, $M''[i, c] = 1$ and $a, c \in C_i$.

Second, suppose $M''[i, a] = 1$. If $M''[i, b]$ is determined to be 0, clearly $u_b$ is not an ancestor of $u_a$, so $(M''[i, a] \;\; M''[i, b])$ does not contradict $F$. Otherwise, if $M''[i, b]$ is determined to be 1, then there exists $c \in C_i$ such that $M''[i, c] = 1$ and $u_b$ is an ancestor of $u_c$. In case $c = a$, $u_a$ and $u_b$ cannot be in a brotherhood relation. In case $c \ne a$, $u_a$ and $u_c$ are in a parenthood relation, and since $u_b$ is an ancestor of $u_c$, $u_a$ and $u_b$ cannot be in a brotherhood relation. It follows that, in this case, $(M''[i, a] \;\; M''[i, b])$ does not contradict $F$.

3. $u_a, u_b \in V_i$:

First, suppose that $M''[i, a]$ and $M''[i, b]$ are both determined to be 0. Obviously, the submatrix does not contradict $F$.

Second, suppose w.l.o.g. that $M''[i, a]$ is determined to be 0 and $M''[i, b]$ is determined to be 1. There exists $c \in C_i$, $c \ne a$, such that $M''[i, c] = 1$ and $u_b$ is an ancestor of $u_c$. Suppose, on the contrary, that a forbidden submatrix occurred, i.e. $u_a$ is an ancestor of $u_b$. Since $u_b$ is an ancestor of $u_c$, $u_a$ must be an ancestor of $u_c$. However, in that case, $M''[i, a]$ should have been determined to be 1.

Third, suppose that $M''[i, a]$ and $M''[i, b]$ are both determined to be 1. There exist $c_a, c_b \in C_i$ such that $M''[i, c_a] = 1$, $M''[i, c_b] = 1$, $u_a$ is an ancestor of $u_{c_a}$ and $u_b$ is an ancestor of $u_{c_b}$. Clearly, $u_{c_a}$ and $u_{c_b}$ are in a parenthood relation, so w.l.o.g. suppose that $u_{c_a}$ is an ancestor of $u_{c_b}$. Thus, both $u_a$ and $u_b$ are ancestors of $u_{c_b}$, and it follows that $u_a$ and $u_b$ cannot be in a brotherhood relation. It follows that, in this case, $(M''[i, a] \;\; M''[i, b])$ does not contradict $F$.

7.3 Proof of Theorem 2

Proof. For a 3-SAT instance, we build a CMGS graph $G_c$. Denote the variables by $\{Y_i\}_{1 \le i \le t}$ and the clauses by $\{C_j\}_{1 \le j \le s}$. First we define four graph structures (figures of these structures are presented in the Appendix, in Subsection 7.1):

variable base graph: contains 2 vertices denoted by $x^i_0$ and $x^i_1$, without an edge. This graph is denoted by $Var_i$.

clause base graph (see Figure 3): contains 6 vertices denoted by $\{c^j_t\}_{0 \le t \le 5}$. There are directed edges from $c^j_1$ to $c^j_0$ and from $c^j_5$ to $c^j_4$; there are directed dashed edges from $c^j_4$ to $c^j_3$, from $c^j_3$ to $c^j_2$, and from $c^j_2$ to $c^j_1$; and finally, there is an undirected dashed edge between $c^j_0$ and $c^j_5$. This graph is denoted by $Cl_j$.

variable positive connector (see Figure 4): contains 12 vertices denoted by $\{a_t\}_{0 \le t \le 5}$ and $\{b_t\}_{0 \le t \le 5}$.
There are undirected edges between $a_1$ and $a_2$ and between $b_3$ and $b_4$; and there are undirected dashed edges between $a_0$ and $a_1$, $a_2$ and $a_3$, $a_3$ and $a_4$, $a_4$ and $a_5$, $b_0$ and $b_1$, $b_1$ and $b_2$, $b_2$ and $b_3$, $b_4$ and $b_5$, and between $a_4$ and $a_5$. The XOR relations are: $\{a_0, b_0, a_1, b_1\}$, $\{a_1, b_1, a_2, b_2\}$, $\{a_2, b_2, a_3, b_3\}$, $\{a_3, b_3, a_4, b_4\}$, and $\{a_4, b_4, a_5, b_5\}$. This graph is denoted by $Pos$.

variable negative connector (see Figure 4): contains 8 vertices denoted by $\{d_t\}_{0 \le t \le 3}$ and $\{e_t\}_{0 \le t \le 3}$. There are undirected edges between $d_1$ and $d_2$; and there are undirected dashed edges between $d_0$ and $d_1$, $d_2$ and $d_3$, $e_0$ and $e_1$, $e_1$ and $e_2$, and between $e_2$ and $e_3$. The XOR relations are: $\{d_0, d_0, e_1, e_1\}$, $\{d_1, d_1, e_2, e_2\}$ and $\{d_2, d_2, e_3, e_3\}$. This graph is denoted by $Neg$.

Note that there are two possible ways to complete the variable positive connector and the variable negative connector with undirected edges so as to satisfy the XOR relations. Both ways, for both types of connectors, are presented in Figure 5. A key to understanding the reduction is that in the positive connector the type of the edge $(a_0, b_0)$, dashed or non-dashed, is the same as the type of the edge $(a_5, b_5)$, while in the negative connector the type of the edge $(d_0, e_0)$ is the opposite of the type of the edge $(d_3, e_3)$.

The construction of $G_c$ is done as follows:

1. For each variable $\{Y_i\}_{1 \le i \le t}$ create a copy of the variable base graph: $Var_i$.
2. For each clause $\{C_j\}_{1 \le j \le s}$ create a copy of the clause base graph: $Cl_j$.
3. For all $1 \le j \le s$ and for all $1 \le k \le 3$ do:
(a) If $Y_i$ is the $k$-th literal in clause $C_j$, then create a copy of the variable positive connector with superscripts $i, j$; identify $a_0$ with $x^i_0$ and $b_0$ with $x^i_1$; identify $a_5$ with $c^j_k$ and $b_5$ with $c^j_{k+1}$.
(b) If $\neg Y_i$ is the $k$-th literal in clause $C_j$, then create a copy of the variable negative connector with superscripts $i, j$; identify $d_0$ with $x^i_0$ and $e_0$ with $x^i_1$; identify $d_3$ with $c^j_k$ and $e_3$ with $c^j_{k+1}$.

For convenience, we also call the undirected dashed edge a positive edge, and the directed and undirected (non-dashed) edge a negative edge.

Suppose first that 3-SAT$(\{C_j\}_{1 \le j \le s})$ = TRUE. There exists an assignment for $\{Y_i\}_{1 \le i \le t}$ such that in every clause at least one literal is TRUE. For each variable graph $\{Var_i\}_{1 \le i \le t}$ complete the edge according to the assignment: $\forall 1 \le i \le t$, $(x^i_0, x^i_1)$ is determined to be a positive edge if $Y_i$ = TRUE, or a negative edge otherwise. Now, resolve the XOR relations in all the variable connectors. In all the clause base graphs $\{Cl_j\}_{1 \le j \le s}$, at least one of the 3 edges $(c^j_1, c^j_2)$, $(c^j_2, c^j_3)$ and $(c^j_3, c^j_4)$ is a positive edge. It follows that in each clause base graph there is more than one parenthood connectivity component. In each of those components, no vertex is attached to another vertex of the same component by a dashed edge, and there is a directed edge between 2 vertices, from $c^j_a$ to $c^j_b$, only if $a = b + 1$. It follows that a directed tree can be built in each of the parenthood connectivity components of a clause base graph, under the constraints of $G_c$.
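The role of the clause base graph can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not part of the original proof: it models only the ancestry chain inside one clause base graph, it assumes that the $k$-th literal controls the edge between $c_k$ and $c_{k+1}$ and that this edge is positive exactly when the literal is TRUE, and the function names are hypothetical.

```python
def c5_is_ancestor_of_c0(literal_values):
    """literal_values: three booleans, one per literal of the clause."""
    # Directed edges fixed by the clause base graph: c1 -> c0 and c5 -> c4.
    children = {1: [0], 5: [4]}
    # Literal k (k = 1, 2, 3) controls the dashed directed edge c_{k+1} -> c_k.
    # A FALSE literal makes it a negative (directed) edge; a TRUE literal makes
    # it a positive (undirected dashed) edge, which adds no ancestry.
    for k, value in zip((1, 2, 3), literal_values):
        if not value:
            children.setdefault(k + 1, []).append(k)
    # Depth-first search from c5 along directed edges.
    stack, seen = [5], set()
    while stack:
        vertex = stack.pop()
        if vertex in seen:
            continue
        seen.add(vertex)
        stack.extend(children.get(vertex, []))
    return 0 in seen


def clause_gadget_consistent(literal_values):
    """The dashed undirected edge (c0, c5) demands brotherhood, so the gadget
    is consistent only if c5 is not an ancestor of c0."""
    return not c5_is_ancestor_of_c0(literal_values)
```

In this simplified model the gadget is inconsistent only when all three literals are FALSE, since only then do the directed edges form the chain $c_5 \to c_4 \to c_3 \to c_2 \to c_1 \to c_0$.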
In addition, either of the two possible completions of each variable connector (according to the XOR relations), for any assignment, provides parenthood connectivity components in the variable connectors such that, in each of those components, each two connected vertices are connected with an undirected edge (see Figure 5). For a variable positive connector, the parenthood connectivity components for a positive edge assignment are $\{a_0, b_1\}$, $\{b_0, a_1, a_2, b_2\}$, $\{b_3, a_3, b_4, a_5\}$, and $\{a_4, b_5\}$, and for a negative edge assignment the components are $\{a_0, b_0\}$, $\{a_1, b_1, a_2, b_3, a_4, b_4\}$, $\{a_3, b_2\}$, and $\{a_5, b_5\}$. For a variable negative connector, the parenthood connectivity components for a positive edge assignment are $\{d_0, e_1\}$, $\{e_0, d_1, d_2, e_2\}$, and $\{d_3, e_3\}$, and for a negative edge assignment the components are $\{d_0, e_0\}$, $\{d_1, e_1, d_2, e_3\}$, and $\{d_3, e_2\}$. Thus, in each variable connector, there exists an induced directed tree in each parenthood connectivity component, according to the constraints of $G_c$.

Note that subgraphs of two different variable connectors, $Con_1$ and $Con_2$, may be in the same parenthood connectivity component. This may happen only when two variable connectors are connected to a clause base graph, to the edges $(c^j_1, c^j_2)$ and $(c^j_3, c^j_4)$ respectively, and when $(c^j_2, c^j_3)$ is a directed (non-dashed) edge and $(c^j_1, c^j_2)$ and $(c^j_3, c^j_4)$ are undirected dashed edges. In this case, there is only one directed edge connecting $Con_1$ and $Con_2$, so directed trees $T_1$ and $T_2$ can be built on $Con_1$ and $Con_2$ respectively, and then $T_1$ and $T_2$ can be united into a spanning directed tree on $Con_1 \cup Con_2$.

It follows that the graph can be divided into $h$ parenthood connectivity components $\{R_i\}_{1 \le i \le h}$, where a directed spanning tree $T_i$ can be built in each of these components, under the constraints of $G_c$. Since each of the trees is in a different parenthood connectivity component, $\bigcup_{i=1}^{h} T_i$ is a directed forest spanning the vertices of $G_c$. The constrained mixed completion graph can now be obtained simply by completing the rest of the missing edges: within each parenthood connectivity component according to its spanning tree, and between the components by undirected dashed edges. It follows that CMGS$(G_c)$ = TRUE.

Now suppose that 3-SAT$(\{C_j\}_{1 \le j \le s})$ = FALSE. Then for each assignment for $\{Y_i\}_{1 \le i \le t}$, in at least one of the clauses all literals are FALSE. This implies that in any completion of $G_c$ there will always be one clause base graph $Cl_j$ such that all 3 edges $(c^j_1, c^j_2)$, $(c^j_2, c^j_3)$ and $(c^j_3, c^j_4)$ are negative directed edges. Thus $c^j_5$ must be an ancestor of $c^j_0$ in the forest. However, this contradicts the undirected dashed edge between $c^j_0$ and $c^j_5$, so a spanning forest that satisfies the constraints of $G_c$ does not exist. Thus, CMGS$(G_c)$ = FALSE.

7.4 Proof of Lemma 2

Proof. Suppose an incomplete genotype matrix $M$ can be completed, so that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny. Let $C$ be the set of columns of $M$.
After completing the missing data in $M$, and since $M'$ admits a perfect phylogeny, each two columns have to be resolved according to some label function $f_L$ of the pairs of vertices of $G_c(M)$, i.e. $\forall i, j \in C: f_L(u_i, u_j) \in \{0, 1\}$. This complete label function cannot contradict the XOR relations of $G_c(M)$ (for a proof, see [2]). Next, the preliminary label completion of $G_c(M)$, for the known pairs, must give exactly the same labels as $f_L$, as there is only one possible preliminary label completion. Then we can choose the following secondary label completion: $\forall (u_i, u_j) \in U_{G_c}: L(u_i, u_j) = f_L(u_i, u_j)$, which obviously gives a label function equivalent to $f_L$. Thus, using this secondary label completion of $M$, a label resolution of the incomplete genotype matrix $M$ gives an incomplete haplotype matrix that can be completed to $M'$.

7.5 Proof of Theorem 3

Proof. Correctness: According to Lemma 2, it is enough to find one correct secondary label completion of $U_{G_c}$, if the root is known. There are $2^m$ possibilities for the root, and $2^{|U_{G_c}|} \le 2^{\binom{m}{2}}$ possibilities for a secondary label completion of $U_{G_c}$. The starting point is when $i = 0$: the algorithm sets $U_{G_c}$ to an arbitrary labelling: $\forall (u_a, u_b)$
