The Incomplete Perfect Phylogeny Haplotype Problem


Gad Kimmel and Ron Shamir
School of Computer Science, Tel-Aviv University, Tel-Aviv 69978, Israel. kgad@tau.ac.il, rshamir@tau.ac.il.

Abstract. The problem of resolving genotypes into haplotypes, under the perfect phylogeny model, has been under intensive study recently. All studies so far handled missing data entries in a heuristic manner. We prove that the perfect phylogeny haplotype problem is NP-complete when some of the data entries are missing, even when the phylogeny is rooted. We define a biologically motivated probabilistic model for genotype generation and for the way missing data occur. Under this model, we provide an algorithm that takes expected polynomial time. In tests on simulated data, our algorithm quickly resolves the genotypes under high rates of missing entries.

Keywords: haplotype, haplotype block, genotype, SNP, algorithm, complexity, genotype phasing, haplotype resolution, perfect phylogeny.

1 Introduction

A current challenge in human genome research is to learn about DNA differences among individuals. This knowledge will hopefully lead to finding the genetic causes of complex and multi-factorial diseases. The distinct single-base sites along the DNA sequence that show variability in their nucleic acid content across the population are called single nucleotide polymorphisms (SNPs). Millions of SNPs have already been detected [19, 22], out of an estimated total of 10 million common SNPs [8].

In diploid organisms (e.g., humans) there are two nearly identical copies of each chromosome. Most techniques for determining SNPs provide a pair of readings, one from each copy, but cannot distinguish from which of the two chromosomes each reading came [14]. The goal of phasing (or resolving) is to infer that missing information. The original conflated data from both chromosomes are called the genotype of the individual, which is represented by a set of two nucleotide readings for each site. The two separated sequences corresponding to the two chromosomes of an individual are called his/her haplotypes. If the two bases in a site are identical (resp., different), the site is called homozygote (resp., heterozygote). For recent reviews on biological and computational aspects of haplotype analysis see [11, 13].

Resolving the genotypes is a central problem in haplotyping. It is believed that more accurate association studies can be performed once the genotypes are resolved [14, 4]. In the absence of additional information, each genotype can be resolved in $2^{h-1}$ different ways, where $h$ is the number of heterozygote sites in the genotype. To find the correct way, resolution is done simultaneously on all the available genotypes, and according to a model. A pioneering approach to haplotype resolution was Clark's parsimony-based algorithm [3]. A likelihood-based EM algorithm [6, 15] gave better results. Stephens et al. [21] and Niu et al. [16] proposed MCMC-based methods which gave promising results. All of those methods assumed that the genotype data correspond to a single block with no recombination events. Hence, for multi-block data the block structure must be determined separately.

Recently, a new combinatorial formulation of the phasing problem was suggested by Gusfield [10]. According to this model, phasing must be done so that the resulting haplotypes define a perfect phylogeny tree. This model is based on the biological assumption that there are regions along the chromosome where recombination occurred infrequently, and on the infinite site model [10]. Gusfield showed how to solve the problem efficiently, and improved algorithms were subsequently developed by Bafna et al. [2] and Eskin et al. [5]. Eskin et al. [5] showed good resolving results with small error rates on real genotypes. They also reported that their algorithm was faster and more accurate in practical settings than previous methods such as [21].

In real genotype data (e.g., [17, 7, 4]) some of the data entries are often missing, due to technical causes. Current phasing algorithms that are based on perfect phylogeny require complete genotypes. This situation raises the following algorithmic problem: Complete the missing entries in the genotypes and then resolve the data, such that the resulting haplotypes define a perfect phylogeny tree. We call this problem incomplete perfect phylogeny haplotype (IPPH). It was posed by Halldórsson et al. [11]. In order to deal with such incomplete data, Eskin et al. [5] used a heuristic to complete the missing entries, and showed very good results. However, finding an algorithm for optimally handling missing data entries should allow more accurate resolution. In this paper we address the IPPH problem.

A special case of IPPH was studied in phylogeny by Pe'er et al. [18]. In the incomplete directed perfect phylogeny problem, the input is an $n \times m$ species-characters matrix, the characters are binary and directed, i.e., a species can only gain characters, and some of the characters are missing. The question is whether one can complete the missing states in a way admitting a perfect phylogeny. Pe'er et al. provided a near optimal $\tilde{O}(nm)$ time algorithm for the problem. (We use $\tilde{O}$ notation to suppress polylogarithmic factors in presenting complexity bounds; see footnote 1.) This problem is a special case of IPPH in which all the sites in all genotypes are homozygote, and the root is known.

The IPPH problem can be stated in two variants: rooted (or directed) and unrooted (or general). In the rooted version, the root haplotype is given as part of the input. The unrooted version is a more direct formulation of the practice in biology, since in phasing, the root of the haplotypes is not given. However, we argue that the more restricted rooted version is of practical importance: Though theoretically finding the root might take exponential time, in practice it can often be found efficiently by finding one genotype which is complete and homozygote in all sites. Once such a haplotype is found, it can be used as a root for the construction of a perfect phylogeny tree. This haplotype need not be the real evolutionary root of the tree. This procedure is correct, since each one of the haplotypes can be used as a root in the perfect phylogeny tree, as was shown by Gusfield [9]. As we shall demonstrate in Section 5, on simulated and real biological data, virtually always at least one such genotype exists. If there is no such genotype, one can seek a genotype with few undetermined sites and enumerate the values in these sites. In the rare cases where this too is not feasible, one can physically separate the two chromosomes of a single individual and sequence one haplotype, as was done in [17]. This procedure is considerably more expensive than standard genotyping techniques, but it will be performed only for one individual, so the price is small. Thus, both variants of IPPH are biologically important.

In this paper, we show that both the rooted and the unrooted versions of the problem are NP-complete. The hardness of unrooted IPPH also follows immediately from the hardness of determining the compatibility of unrooted partial binary characters (incomplete haplotype matrix) [20]. This was observed first by R. Sharan (private communication). However, this result does not imply the hardness of the rooted version. In fact, our proof for rooted IPPH is quite involved.

To cope with the theoretical hardness of IPPH, we invoke a probabilistic approach. We define a stochastic model for generating the haplotypes and for the way missing entries occur in them. The model assumptions are mild and seem to apply to biological data. In addition, we assume that the number of sites $m$ grows much more slowly than the number of genotypes $n$. Specifically, we assume that $m = o(n^{0.25})$. As $m$ is bounded by the block size, which in practice is not more than a modest constant (10-30), this condition also holds in practice. We design an algorithm which always finds the correct solution, and under the assumptions above takes an expected time of $\tilde{O}(m^2 n)$.

Footnote 1. The definition of the $\tilde{O}$ notation is: $\tilde{O}(g(n)) := \{ f(n) \mid \exists n_0 > 0, c > 0, d > 0, \forall n \ge n_0 : 0 \le f(n) \le c [\log n]^d g(n) \}$.

To test our algorithm, we applied it to simulated data under biologically realistic values of the parameters, and calculated an upper bound $\Gamma$ on the main factor in the running time. $\Gamma$ may be exponential, but under the model assumptions the expected running time is shown to be polynomial. $\Gamma m$ gives a bound on the number of times the polynomial algorithm of [18] would be invoked to complete the calculation. On data with 200 genotypes and 30 sites, we show that on average $\Gamma < 4000$ even when only two haplotypes are present and the rate of missing entries is 50%. For a more realistic case of five haplotypes and 20% missing entries, $E[\Gamma] < 100$. Hence, the algorithm requires modest time even far beyond the range of its provable performance.

The paper is organized as follows: Section 2 presents definitions and preliminaries. Section 3 shows the hardness result. Section 4 presents the algorithm and the probabilistic analysis. Section 5 summarizes our experimental results. Due to lack of space most of the proofs are deferred to an appendix.

2 Preliminaries

In this section we provide basic definitions, lemmas and observations that are needed for our analysis. Given $n$ genotypes, the haplotype inference problem is to find $n$ pairs of haplotype vectors that could have generated the genotype vectors. Formally, the input can be represented by an $n \times m$ genotype matrix $M$, with $M[i, j] \in \{0, 1, 2\}$. The $i$-th row $M[i, \cdot]$ describes the $i$-th genotype. The $j$-th column describes the alleles in the $j$-th location: 0 or 1 for the two homozygote alleles, and 2 for a polymorphic (heterozygote) site.

A $2n \times m$ binary matrix $M'$ is an expansion of the genotype matrix $M$ if each row $M[i, \cdot]$ expands to two rows denoted by $M'[i, \cdot]$ and $M'[i', \cdot]$, with $i' = n + i$, satisfying the following: for every $i$, if $M[i, j] \in \{0, 1\}$, then $M[i, j] = M'[i, j] = M'[i', j]$; if $M[i, j] = 2$, then $M'[i, j] \ne M'[i', j]$. $M'$ is also called a haplotype matrix corresponding to $M$.

Definition 1. Perfect Phylogeny Tree for a Matrix
A perfect phylogeny for a $k \times m$ haplotype matrix $M'$ is a tree $T$ with a root $r$, exactly $k$ leaves and integer edge labels, and a binary vector $(l_v(1), \ldots, l_v(m))$ for each node $v$, that obeys the following properties:
1. Each of the rows is the label of exactly one leaf of $T$.
2. Each of the columns labels exactly one edge of $T$.
3. Every edge of $T$ is labelled by one column.
4. For any node $v$, $l_v(i) \ne l_r(i)$ if and only if $i$ labels an edge on the unique path from the root to $v$.

Hence, given the root label, the root-node paths provide a compact representation of all node labels. An equivalent definition appeared in [2], with the difference that here we replace edges with multiple labels by paths with a single label per edge.

Definition 2. The Perfect Phylogeny Haplotype (PPH) Problem [10]
Given a matrix $M$, find an expansion $M'$ of $M$ which admits a perfect phylogeny.
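To make the expansion definition concrete, the following is a minimal sketch in Python (a language chosen here only for illustration). The function names `is_expansion` and `count_resolutions` are hypothetical and do not come from the paper. The first function checks the two expansion conditions for a complete genotype matrix; the second evaluates the $2^{h-1}$ count quoted in the Introduction, with the added assumption that a genotype without heterozygote sites has exactly one resolution.

```python
def is_expansion(M, H):
    """Check whether the 2n x m binary matrix H is an expansion of the
    n x m genotype matrix M (entries 0, 1, 2), where row i of M expands
    to rows i and n + i of H (0-based indexing)."""
    n = len(M)
    if len(H) != 2 * n:
        return False
    for i, genotype in enumerate(M):
        first, second = H[i], H[n + i]
        if len(first) != len(genotype) or len(second) != len(genotype):
            return False
        for j, g in enumerate(genotype):
            if g in (0, 1):
                # Homozygote site: both haplotypes copy the genotype value.
                if first[j] != g or second[j] != g:
                    return False
            elif g == 2:
                # Heterozygote site: the two haplotypes must differ.
                if {first[j], second[j]} != {0, 1}:
                    return False
            else:
                return False  # only complete genotype matrices are handled
    return True


def count_resolutions(genotype):
    """Number of unordered haplotype pairs that resolve one genotype:
    2^(h-1) for h >= 1 heterozygote sites; assumed to be 1 when h = 0."""
    h = sum(1 for g in genotype if g == 2)
    return 1 if h == 0 else 2 ** (h - 1)


M = [[0, 2, 2],
     [1, 1, 0]]
H = [[0, 0, 1],   # first haplotype of genotype 0
     [1, 1, 0],   # first haplotype of genotype 1
     [0, 1, 0],   # second haplotype of genotype 0
     [1, 1, 0]]   # second haplotype of genotype 1
print(is_expansion(M, H), count_resolutions(M[0]))
```

In this example the genotype (0, 2, 2) has two heterozygote sites, so it admits two resolutions; `H` encodes one of them.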

Here we define a generalization of PPH that allows missing data entries. The input to our problem is an incomplete genotype matrix, i.e., a matrix $M$ with $M[i, j] \in \{0, 1, 2, ?\}$, where ? indicates a missing data entry. The process of replacing each ? by 0, 1 or 2 is called completing the matrix $M$.

Problem 1. Incomplete Perfect Phylogeny Haplotype (IPPH)
Given an incomplete genotype matrix $M$, can one complete $M$ so that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny?

Definition 3. Perfect Phylogeny Forest
Let $M'$ be a haplotype matrix, and let $P = (V_P, E_P)$ be a perfect phylogeny tree corresponding to $M'$. The perfect phylogeny forest of $P$ is a directed forest $F = (V_F, E_F)$ whose vertices are the edges of $P$, and for $u, v \in V_F$, $u$ is a parent of $v$ in $F$ if and only if the edge corresponding to $u$ in $P$ is a parent of the edge corresponding to $v$ in $P$.

From the above definition, the vertices of the perfect phylogeny forest correspond to the columns of $M'$, and reflect the order of mutations in the phylogeny tree. Clearly, each perfect phylogeny tree can be converted into a perfect phylogeny forest and vice versa. Thus, $M'$ admits a perfect phylogeny tree iff $M'$ admits a perfect phylogeny forest. For a column $j \in \{1, 2, \ldots, m\}$ of $M'$, we denote by $u_j$ its corresponding vertex in the perfect phylogeny forest. For a perfect phylogeny forest $F$, we say that two vertices are in parenthood relation if one is an ancestor of the other. Otherwise, we say that they are in brotherhood relation. Note that brothers can either be in different connected components, or be in the same component and have the root on the path connecting them.

The following special case of IPPH will be a main subject of our investigation.

Problem 2. Incomplete Perfect Phylogeny Haplotype, rooted version (IPPH-ROOTED)
Given an incomplete genotype matrix $M$ and a haplotype $r$, can one complete $M$ such that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny with $r$ as a root?

In this problem, w.l.o.g., we assume that the root haplotype is $r_0 = (0, \ldots, 0)$ (cf. [9]). The following lemma explains the connection between $F$ and $M'$, assuming that the root is $r_0$.

Lemma 1. ([2], [5]) Let $M'$ be a $2n \times m$ haplotype matrix. Then $F = (V_F, E_F)$ is its perfect phylogeny forest with the root haplotype $r_0$ iff for all $u_a, u_b \in V_F$ and for all $i \in \{1, \ldots, 2n\}$:
1. If $u_a$ is an ancestor of $u_b$, then $M'[i, a] = 1$ or $M'[i, b] = 0$.
2. If $u_a$ and $u_b$ are in brotherhood relation, then $M'[i, a] = 0$ or $M'[i, b] = 0$.

In the rest of this section, we provide our own definitions, building on those introduced above, and prove several lemmas which will be needed for our analysis.

Definition 4. Constrained Mixed Graph
A constrained mixed graph is a triplet $G_c = (V, E, X)$, where $G = (V, E)$ is a graph and $X = \{X_1, X_2, \ldots, X_p\}$, where for each $i$: $X_i \subseteq V$. The sets $X_i$ are called XOR relations. $G$ has four types of edges: undirected, dashed undirected, directed and dashed directed.
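The two conditions of Lemma 1 can be checked mechanically. The sketch below is illustrative only: it assumes the forest is given as a hypothetical `parent` list (one entry per column, `None` for a forest root) and that the root haplotype is $r_0 = (0, \ldots, 0)$. The names `ancestors` and `satisfies_lemma1` are ours, not the paper's.

```python
def ancestors(parent, v):
    """Return the set of proper ancestors of vertex v in the forest."""
    found = set()
    while parent[v] is not None:
        v = parent[v]
        found.add(v)
    return found


def satisfies_lemma1(H, parent):
    """Check both conditions of Lemma 1 for a binary haplotype matrix H
    (one row per haplotype) against a forest over the column indices."""
    m = len(parent)
    anc = [ancestors(parent, v) for v in range(m)]
    for a in range(m):
        for b in range(m):
            if a == b:
                continue
            for row in H:
                if a in anc[b]:
                    # Condition 1: u_a is an ancestor of u_b.
                    if not (row[a] == 1 or row[b] == 0):
                        return False
                elif b not in anc[a]:
                    # Condition 2: u_a and u_b are in brotherhood relation.
                    if not (row[a] == 0 or row[b] == 0):
                        return False
    return True


# Column 1 is a child of column 0; column 2 is a separate forest root.
parent = [None, 0, None]
H_good = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
print(satisfies_lemma1(H_good, parent))
```

A row such as (0, 1, 0) would violate condition 1 here, because the mutation of column 1 appears without the mutation of its ancestor column 0.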

Definition 5. Parenthood Connected Components
Two vertices $u$ and $v$ in a constrained mixed graph are in the same parenthood connected component if there exists a path between $u$ and $v$ that consists only of undirected or directed edges (a parenthood relation). Note that the direction of an edge is not important in this definition.

Definition 6. Constrained Mixed Completion Graph
For a constrained mixed graph $G_c = (V, E, X)$, we define its constrained mixed completion graph $G' = (V, E')$ to be a complete graph (with a single edge for each pair $u, v \in V$), where $E'$ contains two types of edges: directed and dashed undirected. Each edge of $G'$ is labelled with $L : E' \to \{0, 1\}$, where a directed edge is labelled with 0, and a dashed undirected edge is labelled with 1. $G'$ maintains all of the following properties:
1. All edges of $G'$ maintain the following properties:
   (a) If $e : (u, v) \in E$ is an undirected edge, then the corresponding $e' : (u, v) \in E'$ must be a directed edge from $u$ to $v$ or from $v$ to $u$.
   (b) Directed edges and dashed undirected edges in $G$ preserve their type in $G'$.
   (c) If $e : (u, v) \in E$ is a dashed directed edge from $u$ to $v$, then the corresponding $e' : (u, v) \in E'$ must be a dashed undirected edge or a directed edge from $u$ to $v$.
2. There exists a spanning directed forest $F = (V, E_F \subseteq E')$, such that:
   (a) If node $u \in V$ is an ancestor of $v \in V$ in $F$, then there is a directed edge from $u$ to $v$ in $G'$.
   (b) If node $u \in V$ is not an ancestor of $v \in V$ and $v$ is not an ancestor of $u$ in $F$, then there is a dashed undirected edge between $u$ and $v$ in $G'$.
3. For each XOR relation $X_i$ and for every three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i$: $L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c}) \oplus L(x_{i,a}, x_{i,c}) = 0$, where $\oplus$ denotes the Boolean XOR operator.

Problem 3. Constrained Mixed Graph Spanning (CMGS) problem
The input to the CMGS problem is a constrained mixed graph $G$. The output is a constrained mixed completion graph of $G$, if such exists.

An example of the CMGS problem is presented in Figure 2. The decision version of the CMGS problem is to decide whether there exists a constrained mixed completion graph $G'$ for $G$. An important quality of the constrained mixed completion graph is that it can be viewed as a directed spanning forest $F$, with additional edges between nodes according to the relation of those nodes in the forest: a dashed undirected edge for a brotherhood relation, or a directed edge for a parenthood relation.

The following notations are adopted from Eskin et al. [5]: $c(M, x)$ is defined as the set of rows of $M$ containing the value $x$ at column $c$. Let $c, c'$ be columns and $x, y$ be elements of $\{0, 1\}$. The pair $c, c'$ induces $(x, y)$ in $M$ if $(c(M, x) \cap c'(M, y)) \cup (c(M, x) \cap c'(M, 2)) \cup (c(M, 2) \cap c'(M, y)) \ne \emptyset$. Let $R(M, c, c')$ be the set of pairs $(x, y)$ such that $(c, c')$ induces $(x, y)$ in $M$. Note that $R(M, c, c')$ does not contain pairs with ?, but only 0 and 1. Let $c, c'$ be two columns such that $c(M, 2) \cap c'(M, 2) \ne \emptyset$. Let $M'$ be an expansion of $M$, after completing the missing entries, which admits a perfect phylogeny. We say that $M'$ resolves the pair of columns $(c, c')$ unequally if $\{(0, 1), (1, 0)\} \subseteq R(M', c, c')$ and equally if $(1, 1) \in R(M', c, c')$. According to Lemma 1, $M'$ must resolve the pair $(c, c')$ either equally or unequally, and cannot resolve the pair in both ways.

For an incomplete genotype matrix $M$, we build a constrained mixed graph $G_c(M)$, where each column in $M$ has a corresponding vertex in $G_c$. The edges represent the possible relations of the columns in the perfect phylogeny forest, and are determined according to Lemma 1. For each two vertices $u_a, u_b$:
(1) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 1), (1, 0)\}$, then $u_a$ is an ancestor of $u_b$ in $F$. The edge $(u_a, u_b)$ is determined to be a directed edge from $u_a$ to $u_b$.
(2) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 1)\}$, then $u_a, u_b$ are in parenthood relation in $F$, but it is unknown which of the vertices is the ancestor. The edge $(u_a, u_b)$ is determined to be an undirected edge.
(3) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 0), (0, 1)\}$, then $u_a, u_b$ are in brotherhood relation in $F$. The edge $(u_a, u_b)$ is determined to be a dashed undirected edge.
(4) If $R(M, a, b) \setminus \{(0, 0)\} = \{(1, 0)\}$, then either $u_a$ is an ancestor of $u_b$ in $F$, or $u_a, u_b$ are in brotherhood relation in $F$. The edge $(u_a, u_b)$ is determined to be a dashed directed edge from $u_a$ to $u_b$.
(5) If $R(M, a, b) \setminus \{(0, 0)\} = \emptyset$, then the relation of $u_a, u_b$ in $F$ is unknown. In that case $(u_a, u_b) \notin E$.

In addition, for each set of columns $a_1, \ldots, a_t$, if there exists a row $i$ such that $M[i, a_1] = \cdots = M[i, a_t] = 2$, then the corresponding vertices $u_{a_1}, \ldots, u_{a_t}$ belong to a common XOR relation. Each pair of vertices of $G_c$ is labelled with $L : (u_a, u_b) \to \{0, 1, ?\}$, where an un-dashed (directed or undirected) edge, i.e. a parenthood relation, is labelled with 0; a dashed undirected edge, i.e. a brotherhood relation, is labelled with 1; and all other cases, i.e. an unknown relation, are labelled with ?. The pairs in the last set are called unlabelled pairs. Note that if for two vertices $u_a, u_b$ none of the above five possibilities applies, then according to Lemma 1, $M$ does not define a perfect phylogeny forest.

Definition 7. Preliminary Label Completion
A preliminary label completion of $G_c(M)$ is an assignment of labels to the unlabelled pairs of vertices according to the following algorithm: while possible, iteratively find three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i$ such that $L(x_{i,a}, x_{i,b})$ and $L(x_{i,b}, x_{i,c})$ are set and $L(x_{i,a}, x_{i,c})$ is not, and assign $L(x_{i,a}, x_{i,c}) = L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c})$. Define $U_{G_c}$ to be the set $\{(u_a, u_b) : L(u_a, u_b) = ?\}$, i.e., the set of pairs of vertices with an unknown label after preliminary label completion was performed.

Definition 8. Secondary Label Completion
A secondary label completion of $G_c(M)$ is an assignment of a label in $\{0, 1\}$ to every $(u_a, u_b) \in U_{G_c}$, such that for each XOR relation $X_i$ and for every three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i$: $L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c}) \oplus L(x_{i,a}, x_{i,c}) = 0$.

After secondary label completion, we can perform label resolution of the incomplete genotype matrix, which is defined as follows.

Definition 9. Label Resolution of an Incomplete Genotype Matrix
A $2n \times m$ incomplete binary matrix $\tilde{M}$ is an expansion of the incomplete genotype matrix $M$ if each row $M[i, \cdot]$ expands to two rows denoted by $\tilde{M}[i, \cdot]$ and $\tilde{M}[i', \cdot]$, with $i' = n + i$, satisfying the following: for every $i$, if $M[i, j] \in \{0, 1, ?\}$, then $M[i, j] = \tilde{M}[i, j] = \tilde{M}[i', j]$; if $M[i, j] = 2$, then either $\tilde{M}[i, j] = 0, \tilde{M}[i', j] = 1$ or $\tilde{M}[i, j] = 1, \tilde{M}[i', j] = 0$. $\tilde{M}$ is also called an incomplete haplotype matrix corresponding to $M$. A label resolution of a genotype matrix $M$ is an expansion of $M$ to an incomplete haplotype matrix $\tilde{M}$ according to the label function $L$: for each two columns $a, b$ for which there exists $i$ such that $M[i, a] = M[i, b] = 2$, if $L(u_a, u_b) = 0$ resolve $(a, b)$ equally, and if $L(u_a, u_b) = 1$ resolve $(a, b)$ unequally.
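As an illustration of the construction above, the following sketch computes $R(M, a, b)$, applies edge rules (1)-(5), and runs the XOR propagation of Definition 7. It is a minimal reading of the definitions, not the authors' implementation: the function names, the string encoding of edge types, and the representation of labels as a dictionary keyed by `frozenset` pairs are all assumptions made for this example. The symmetric cases (an edge directed from $u_b$ to $u_a$) are spelled out explicitly.

```python
from itertools import permutations

MISSING = "?"


def induced_pairs(M, a, b):
    """R(M, a, b): the pairs (x, y) in {0,1}^2 induced by columns a and b.
    Rows with a missing entry, or with 2 in both columns, induce nothing."""
    pairs = set()
    for row in M:
        ga, gb = row[a], row[b]
        if MISSING in (ga, gb) or (ga == 2 and gb == 2):
            continue
        xs = (0, 1) if ga == 2 else (ga,)
        ys = (0, 1) if gb == 2 else (gb,)
        pairs.update((x, y) for x in xs for y in ys)
    return pairs


def classify_edge(M, a, b):
    """Apply rules (1)-(5) to the column pair (a, b)."""
    r = induced_pairs(M, a, b) - {(0, 0)}
    if r == {(1, 1), (1, 0)}:
        return "directed a->b"
    if r == {(1, 1), (0, 1)}:
        return "directed b->a"
    if r == {(1, 1)}:
        return "undirected"
    if r == {(1, 0), (0, 1)}:
        return "dashed undirected"
    if r == {(1, 0)}:
        return "dashed directed a->b"
    if r == {(0, 1)}:
        return "dashed directed b->a"
    if not r:
        return "none"
    return "conflict"  # no rule applies: no perfect phylogeny forest


def preliminary_label_completion(labels, xor_sets):
    """Definition 7: propagate L(a, c) = L(a, b) XOR L(b, c) inside each
    XOR relation until no new label can be set. `labels` maps
    frozenset({u, v}) to 0 or 1; absent pairs are unlabelled."""
    labels = dict(labels)
    changed = True
    while changed:
        changed = False
        for X in xor_sets:
            for a, b, c in permutations(X, 3):
                ab, bc, ac = frozenset((a, b)), frozenset((b, c)), frozenset((a, c))
                if ab in labels and bc in labels and ac not in labels:
                    labels[ac] = labels[ab] ^ labels[bc]
                    changed = True
    return labels


M = [[1, 1, "?"],
     [1, 0, 2],
     [0, 0, 0]]
print(classify_edge(M, 0, 1))
```

The pairs left without a label after `preliminary_label_completion` form the set $U_{G_c}$ of Definition 7.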

8 Lael resolution of an incomplete genotype can e done y the same algorithm proposed y Bafna et al. [2] (algorithm E2M). Oserve, that any sumatrix M[i, (a, )], where M[i, a] and M[i, ] are oth not equal 2, has a unique expansion in any incomplete haplotype matrix. Hence, for such sumatrix, the resolution is not influenced y the lael function. The algorithm descried in definition 7, was suggested y [2] as part of their algorithm for complete genotype matrix phasing. Interestingly, they proved that once preliminary lael completion is performed, for any possile (legal) secondary lael completion of U Gc, a lael resolution of the genotype matrix results in a haplotype matrix, which admits a perfect phylogeny. This is true for a complete genotype matrix (with no missing entries), ut not for the incomplete case: Not every secondary lael completion followed y its corresponding lael resolution of the incomplete genotype matrix, results with an incomplete haplotype matrix, that can e completed to a complete haplotype matrix, which admits a perfect phylogeny. The following lemma, which is proven in the Appendix in Susection 7.4, descries a weaker connection etween secondary lael completion and the solution of IPPH. Lemma 2. If an incomplete genotype matrix M can e completed, so that there exists an expansion M of M, which admits a perfect phylogeny, then there exists some secondary lael completion of U Gc, such that a lael resolution of the incomplete genotype matrix M gives an incomplete haplotype matrix, that can e completed to M. 3 The Hardness Result In this section we show that IPPH and IPPH-rooted are NP-complete. Trivially, oth versions elong to NP. To prove NP-hardness, we will show the following polynomial reductions: 3-SAT CMGS IPPH-ROOTED IPPH, which implies that oth the rooted and unrooted versions are NP-complete. The reduction IPPH-ROOTED IPPH is as follows: Given an instance (M, r) of IPPH-ROOTED, we simply add the genotype row r to M. 
The resulting matrix M is the input to IPPH. In a solution to the latter, there will e a leaf laelled with r, and thus it solves the former prolem. Conversely, if M has a solution with root r then it is also a solution for M. The exact same idea was used y Bafna et al. [2], only that here, it is applied to an inomplete genotype matrix. The following two theorems, which are oth proven in the Appendix in Susections 7.2 and 7.3, imply the hardness of IPPH: Theorem 1. CMGS IPPH-ROOTED Theorem 2. 3-SAT CMGS 4 An Algorithmic Solution for IPPH In spite the results of Section 3, we provide an algorithmic approach to IPPH. We restrict the prolem, y applying some iological insights, in which the data is assumed to

9 e generated y a stochastic model. We provide an algorithm that takes an expected polynomial time for oth the rooted and the unrooted versions of IPPH. Pe er et al. [18] et al. suggested a polynomial time algorithm of Õ(mn) time for solving the rooted version of perfect phylogeny with missing data. Let the input incomplete haplotype matrix e M, with M[i, j] {0, 1,?}, and the root e r. We denote y IDP ( M,r), the completion matrix M, i.e. after performing this algorithm on M. We also use IDP ( M) to denote IDP ( M,r 0 ). We use h(, ) to denote the hamming distance etween two inary vectors. The following notation is used in the description of the algorithm: Σ 0 (M[, j]) and Σ 1 (M[, j]) are the numers of 0s and 1s in the j th column, respectively. Consider that the root is known (r 0 ). Given an instance of an incomplete matrix M, we uild a constrained mixed graph, as descried in Section 2. We then perform preliminary lael completion (definition 7). According to Lemma 2, if M can e completed, so that there exists an expansion M of M, which admits a perfect phylogeny, then there exists some secondary lael completion of U Gc, where a lael resolution of the incomplete genotype matrix M gives an incomplete haplotype matrix, that can e completed to M. Thus, the computational challenge, is to find such secondary lael completion. Suppose, we were ale to guess the correct secondary lael completion y an oracle. In that case, let M e the resulted incomplete haplotype matrix, y performing lael resolution (definition 9) accordingly. A completion of M can e done in polynomial time y IDP( M). Hence, the ottleneck step is finding a secondary lael completion. Due to the hardness result in Section 3, a polynomial time oracle for finding the correct secondary lael completion does not exist, unless P=NP. However, when assuming additional properties of the genotype data, this can e performed y a polynomial expected time algorithm. 
We now descrie those assumptions, and for each, we descrie its iological motivation: 1. Each entry value in the original genotype matrix is replaced y? with proaility p, independently of the other values. This assumption makes sense as missing data entry are caused y technical prolems in the iological experiment. The same value p may e used for all entries. One may claim, that there are situations such that in each allele (SNP) there is a different proaility for a missing entry, due to distinct difficulties in sequencing along different regions in the human genome. In that case, we denote y p i the proaility for a missing entry in the i th allele and determine p to e: p max i { p i }. 2. We can assume that each haplotype h i, which is a node in a perfect phylogeny tree, is chosen to e in a genotype with proaility of α i, independently. This assumption is also made as part of the Hardy-Weinerg equilirium model [12]. An additional assumption is that those proailities do not depend on n or m. The logic ehind the last assumption is that proailities of the haplotypes in the population do not depend on the numer of sampled genotypes (n) nor on the size of a lock (m). 3. Assume that the numer of columns (m) and the numer of rows (n) maintain a rule, that states that when n ecomes larger, m is not consideraly increased. Specifically, we use m = o(n.25 ). The last assumption applies in all iological constellations: In future experiments, the numer of genotypes is expected to e larger, while m is not expected to grow sustantially, since m is the size of a region

10 in the chromosome where the numer of recomination events in the sampled population is small. A constant value of m is thus also plausile, ut for our analysis, a much weaker assumption than that is required. Pro-IPPH(M): 1. Let G c(m) = (V, E, X) e the constrained mixed graph of M. 2. Perform preliminary lael completion of G c(m). 3. Let r e a vector such that r j = 0 if Σ 0(M[, j]) > Σ 1(M[, j]) and 1 otherwise. 4. For i = 0 `m 2 For each possile root r {0, 1} m, such that h(r, r) = i do Relael the matrix entries according to r, so that r 0 is the new root. For each possile secondary lael completion of U Gc, such that {(u a, u ) : (u a, u ) U Gc L((u a, u )) = 0} = i do Perform lael resolution of M to M. f If IDP ( M) f is compatile then output IDP ( M) f and halt. 5. Output: no solution. Fig. 1. An algorithm for IPPH. The aove algorithm was designed to solve IPPH under the assumptions aove. Informally, algorithm Pro-IPPH(M) ignores the missing data entries in order to decide the relation etween each two columns in the matrix. As we shall prove, if unale to conclude deterministically from the matrix, with high proaility, a correct relation is otained just y guessing. The following theorem is proven in Susection 7.5 in the Appendix. Theorem 3. Under the assuptions of the model, algorithm Pro-IPPH(M) solves IPPH correctly within expected time of Õ(m2 n). 5 Experimental Results In order to assess our algorithm, we applied it on simulated data. The simulations used parameters which were adopted from several large scale iological studies [4, 17, 7]. By Theorem 3 the algorithm always outputs a correct solution. Although we proved that under our model assumptions the expected running time is Õ(m2 n), we wanted to estimate the actual running time, under realistic iological parameters and eyond the range of the model assumptions. Specifically, we wanted to calculate the expected numer of different phylogenic tree solutions for a given data set. 
The proof of Theorem 3 implies that Γ = 2 UGc is an upper ound on the numer of different phylogeny solutions, and the dominant factor in the complexity of the algorithm. In each different experiment, we randomly generated N = 10 5 perfect phylogeny trees. We used the following procedure to generate a perfect phylogeny tree of haplotypes: We start with a inary root vector with m = 30 sites. Initially, no site is marked. In each step, we randomly pick a tree node and an unmarked site, add a new child haplotype to that node in which only the state of that site is changed, and mark the site.
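The tree-generation procedure just described can be sketched as follows. Two points are assumptions of this sketch rather than statements of the paper: the root vector is drawn uniformly at random, and steps are repeated until every site is marked, which yields a tree with $m + 1$ haplotypes. The function name `random_perfect_phylogeny` and its parameters are hypothetical.

```python
import random


def random_perfect_phylogeny(m=30, rng=None):
    """Grow a random perfect phylogeny over m binary sites.

    Returns (haplotypes, parents): haplotypes[0] is the root, and
    parents[k] is the index of the parent of haplotypes[k] (None for the root).
    Assumption: the root is uniform at random and every site is used once."""
    rng = rng or random.Random()
    root = tuple(rng.randint(0, 1) for _ in range(m))
    haplotypes = [root]
    parents = [None]
    unmarked = list(range(m))
    rng.shuffle(unmarked)  # equivalent to picking a random unmarked site each step
    for site in unmarked:
        p = rng.randrange(len(haplotypes))  # pick a random existing tree node
        child = list(haplotypes[p])
        child[site] ^= 1                    # change only the state of this site
        haplotypes.append(tuple(child))
        parents.append(p)
    return haplotypes, parents


haps, parents = random_perfect_phylogeny(m=6, rng=random.Random(0))
for h, p in zip(haps, parents):
    print(h, p)
```

Because every site mutates on exactly one edge, any two nodes differ in the sites on the path between them, so all generated haplotypes are distinct.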

For each tree, we randomly chose $k$ haplotypes for reconstructing the genotypes, where $k = 2, 3, \ldots, 9$. We assigned probabilities, denoted by $\alpha_1, \alpha_2, \ldots, \alpha_k$, to the $k$ chosen haplotypes, such that $\sum_{i=1}^{k} \alpha_i = 1$ and $\forall i : \alpha_i$ [...]. For each tree, different probabilities were assigned. Next, we generated 200 genotypes according to the chosen haplotypes and their assigned probabilities. Missing data entries were introduced into the genotypes as follows: Each site in the genotype data was turned into a missing entry independently with probability $p$. Recalling that in real data $p$ is on the order of 0.1, we checked a wider range: $p = 0, 0.05, \ldots, 0.5$. Thus, for each sampled tree $T_j$, $j = 1, 2, \ldots, N$, we sampled one incomplete genotype matrix $M_j$ of size $200 \times 30$.

We ran our algorithm on each $M_j$. We denote $U_{G_c(M_j)}$ by $U_j$. We stopped when $i = 0$ to calculate $2^{|U_j|}$, after performing steps 1-3 of the algorithm. As was shown in Section 4, if the secondary label completion is known, it is possible to output the solution to IPPH in $\tilde{O}(m^2 n)$ time. Hence, completion of the algorithm, for each $M_j$, should take less than $2^{|U_j|} \cdot \tilde{O}(m^2 n)$ time. The dominating factor in the running time is the random variable $2^{|U_j|}$, whose expectation is approximated by:

$E[\Gamma] = E[2^{|U_j|}] \approx \frac{1}{N} \sum_{j=1}^{N} 2^{|U_j|}$.

The results are presented in Figure 6 in the Appendix (Subsection 7.6). Generally, $E[\Gamma]$ is below [...]. When the missing data rate is below 20%, $E[\Gamma]$ is smaller than 100. Another observation is that the larger the number of chosen haplotypes from the phylogeny tree, the smaller the value of $E[\Gamma]$. Notably, in all cases we assessed a correct root: either by finding at least one haplotype which is homozygote with no missing entries in all sites, or by using the majority rule described in the algorithm.

To demonstrate that in real biological data the root can practically be found in linear time, we chose the genotype data of Daly et al. [4]. This data set consists of 103 SNPs and 129 genotypes. We checked all possible blocks.
In all the blocks whose size was smaller than 65 SNPs, at least one genotype could always be found that is homozygote in all alleles, without any missing entry. This genotype can be resolved in only one possible way, and hence this known haplotype can be used as a root. Since the size of a block is almost always smaller than 30, this simple naive method can be used for finding a root in biological data.

6 Concluding Remarks

We investigated the incomplete perfect phylogeny haplotype problem: phasing of genotypes into haplotypes, under the perfect phylogeny model, where some of the data are missing. We proved that the problem, both in its rooted and unrooted version, is NP-complete. We also provided a practical expected polynomial-time algorithm for a biologically motivated restriction of the problem. We applied our algorithm to simulated data, and concluded that the running time and the number of distinct phylogeny solutions are relatively small under common biological conditions and parameters, even when the missing data rate is 50%. A more accurate treatment for phasing of genotypes with missing entries can now be obtained. In addition, due to the small number of phylogenic solutions that resulted in simulations, incorporation of other statistical and combinatorial models with our algorithm is feasible.

Acknowledgments

This research was supported by a grant from the Israel Science Foundation (grant 309/02). We thank Roded Sharan for fruitful discussions.

References

1. N. Alon and J. H. Spencer. The Probabilistic Method. John Wiley and Sons, Inc.
2. V. Bafna, D. Gusfield, G. Lancia, and S. Yooseph. Haplotyping as perfect phylogeny: A direct approach. Technical Report UCDavis CSE.
3. A. Clark. Inference of haplotypes from PCR-amplified samples of diploid populations. Molecular Biology and Evolution, 7(2):111-22.
4. M. J. Daly et al. High-resolution haplotype structure in the human genome. Nature Genetics, 29(2).
5. E. Eskin, E. Halperin, and R. M. Karp. Large scale reconstruction of haplotypes from genotype data. In Proceedings of The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB).
6. L. Excoffier and M. Slatkin. Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Molecular Biology and Evolution, 12(5):912-7.
7. S. B. Gabriel et al. The structure of haplotype blocks in the human genome. Science, 296.
8. L. Kruglyak and D. A. Nickerson. Variation is the spice of life. Nature Genetics, 27.
9. D. Gusfield. Efficient algorithms for inferring evolutionary trees. Networks, 21:19-28.
10. D. Gusfield. Haplotyping as perfect phylogeny: conceptual framework and efficient solutions. In Proceedings of The Seventh Annual International Conference on Research in Computational Molecular Biology (RECOMB).
11. B. V. Halldorsson et al. Combinatorial problems arising in SNP. DMTCS 03 Conference.
12. G. H. Hardy. Mendelian proportions in a mixed population. Science, 18:49-50.
13. M. R. Hoehe. Haplotypes and the systematic analysis of genetic variation in genes and genomes. Pharmacogenomics, 4(5).
14. M. R. Hoehe et al. Sequence variability and candidate gene analysis in complex disease: association of µ opioid receptor gene variation with substance dependence. Human Molecular Genetics, 9.
15. J. Long et al. An EM algorithm and testing strategy for multiple-locus haplotypes. American Journal of Human Genetics, 56(3).
16. T. Niu et al. Bayesian haplotype inference for multiple linked single-nucleotide polymorphisms. American Journal of Human Genetics, 70(1):157-69.
17. N. Patil et al. Blocks of limited haplotype diversity revealed by high-resolution scanning of human chromosome 21. Science, 294.
18. I. Pe'er, R. Shamir, and R. Sharan. Incomplete directed perfect phylogeny. In Proceedings of the Combinatorial Pattern Matching Conference (CPM).
19. R. Sachidanandam et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 291.
20. M. A. Steel. The complexity of reconstructing trees from qualitative characters and subtrees. Journal of Classification, 9:91-116.
21. M. Stephens et al. A new statistical method for haplotype reconstruction from population data. American Journal of Human Genetics, 68(4):978-89.
22. C. Venter et al. The sequence of the human genome. Science, 291, 2001.

7 Appendix

7.1 Figures of Graphs

Fig. 2. Example of a CMGS problem. A: an instance of a graph for the CMGS problem, with XOR relations {b, c, e} and {a, b, d}. B: a possible solution for this instance. The edges of the forest are in bold.

Fig. 3. Clause base graph.

7.2 Proof of Theorem 1

Proof. Given an instance of a constrained mixed graph $G_c = (V, E, X)$ for the CMGS problem, we build a matrix $M$ which, together with the root $r$ such that $\forall i : r[i] = 0$, serves as input for IPPH-ROOTED. $M$ is of size $(2|E| + p) \times |V|$. For each $e \in E$ there are two corresponding rows, whose indices are denoted by $N_e^0$ and $N_e^1$, respectively. For each $X_i \in \{X_i\}_{1 \le i \le p}$ there is one corresponding row, whose index is denoted by $N_{X_i}$. For a column $i \in \{1, 2, \ldots, |V|\}$, we denote its corresponding vertex in $G_c$ by $u_i$. The construction of $M$ is as follows:

1. For each $e : (u_a, u_b) \in E$, we add 2 rows $M[N_e^0, \cdot]$ and $M[N_e^1, \cdot]$, such that $\forall u_c \in V \setminus \{u_a, u_b\}$: $M[N_e^0, c] = M[N_e^1, c] = ?$, and the submatrix

$$\begin{pmatrix} M[N_e^0, a] & M[N_e^0, b] \\ M[N_e^1, a] & M[N_e^1, b] \end{pmatrix}$$

is set as follows:
(a) If $e : (u_a, u_b)$ is an undirected edge: $\begin{pmatrix} 0 & 0 \\ 1 & 1 \end{pmatrix}$.
(b) If $e : (u_a, u_b)$ is a dashed undirected edge: […].
(c) If $e : (u_a, u_b)$ is a directed edge from $u_a$ to $u_b$: […].
(d) If $e : (u_a, u_b)$ is a dashed directed edge from $u_a$ to $u_b$: first row $(0 \;\; 0)$, second row […].

2. For each $\{X_i\}_{1 \le i \le p}$, we add one row $M[N_{X_i}, \cdot]$, such that $\forall u_j \in X_i : M[N_{X_i}, j] = 2$ and $\forall u_k \in V \setminus X_i : M[N_{X_i}, k] = ?$.

Fig. 4. The graph structure for a variable positive and negative connector. The following sets of vertices maintain a XOR relation: $\{a_0, b_0, a_1, b_1\}$, $\{a_1, b_1, a_2, b_2\}$, $\{a_2, b_2, a_3, b_3\}$, $\{a_3, b_3, a_4, b_4\}$, $\{a_4, b_4, a_5, b_5\}$, $\{d_0, d_0, e_1, e_1\}$, $\{d_1, d_1, e_2, e_2\}$ and $\{d_2, d_2, e_3, e_3\}$.

Suppose that IPPH-ROOTED($M$, $r$) = TRUE, i.e. $M$ has an expansion to $M'$, such that $M'$ admits a perfect phylogeny tree with the all-zero vector $r$ as a root. Thus, $M'$ has a directed perfect phylogeny forest $F = (V_F, E_F)$. Let $\hat{F} = (V_F, \hat{E}_F)$ be a complete graph, where for each $u, v \in V_F$, we add a directed edge from $u$ to $v$ if $u$ is an ancestor of $v$ in $F$,

a directed edge from $v$ to $u$ if $v$ is an ancestor of $u$ in $F$, or a dashed undirected edge otherwise (brotherhood relation).

Fig. 5. Completion of variable positive and negative connectors, for the 2 possible edges of a variable-edge.

We claim that $\hat{F}$ is the constrained mixed completion graph of $G_c$. This is proven by checking that all three properties of $\hat{F}$ as a constrained mixed completion graph of $G_c$ hold. Property 2 holds since, by the construction of $\hat{F}$ from $F$, $F$ is a rooted spanning forest of $\hat{F}$ as required. In order to prove property 3 we use Lemma 2 in [2]: the rows $\{M[N_{X_i}, \cdot]\}_{1 \le i \le p}$ force that, for each of the XOR relations and for every three vertices $x_{i,a}, x_{i,b}, x_{i,c} \in X_i \subseteq V_F$:

$$L(x_{i,a}, x_{i,b}) \oplus L(x_{i,b}, x_{i,c}) \oplus L(x_{i,a}, x_{i,c}) = 0.$$

Last, property 1 holds, since for an edge $e \in E$ the values of $\{M[N_e^j, \cdot]\}_{j \in \{0,1\}}$ are determined in step 1 of the construction of $M$, according to the possible relations of $u$ and $v$ in $F$ suggested by the edge $(u, v)$ in $G_c$. Since, by the assumption, $M$ has an expansion to $M'$ such that $M'$ admits a perfect phylogeny forest $F$, then for each $u, v \in F$, the edge $\hat{e} : (u, v) \in \hat{E}_F$ must be determined according to $e : (u, v) \in E$ in $G_c$: if $e$ is an undirected edge then $\hat{e}$ must be a directed edge; if $e$ is a dashed undirected edge then $\hat{e}$ must be a dashed undirected edge; if $e$ is a directed edge from $u$ to $v$ then $\hat{e}$ must be a directed edge from $u$ to $v$; and if $e$ is a dashed directed edge from $u$ to $v$ then $\hat{e}$ must be a dashed undirected edge or a directed edge from $u$ to $v$. This establishes property 1. Thus, $\hat{F}$ is the constrained mixed completion graph of $G_c$, and CMGS($G_c$) = TRUE.
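To make the role of the two rows added per edge in step 1 more concrete, the following sketch enumerates which relations between $u_a$ and $u_b$ remain possible for a given pair of rows. It is an illustration under stated assumptions, not the paper's construction. It uses the usual reading of a rooted perfect phylogeny with an all-zero root: a row (0, 1) on columns (a, b) rules out "$u_a$ is an ancestor of $u_b$", a row (1, 0) rules out the reverse, and a row (1, 1) rules out brotherhood. The row pair for an undirected edge follows the construction as far as it can be read; the pairs for the other three edge types are hypothetical choices that reproduce the behaviour described in the proof, and all names are assumptions.

```python
from typing import FrozenSet, Iterable, Tuple

Row = Tuple[int, int]  # values of one row on columns (a, b)


def allowed_relations(rows: Iterable[Row]) -> FrozenSet[str]:
    """Relations of (u_a, u_b) compatible with the given 0/1 rows,
    assuming a rooted perfect phylogeny with an all-zero root."""
    present = set(rows)
    allowed = set()
    if (0, 1) not in present:   # every 1 in column b is also a 1 in column a
        allowed.add("a_ancestor_of_b")
    if (1, 0) not in present:   # every 1 in column a is also a 1 in column b
        allowed.add("b_ancestor_of_a")
    if (1, 1) not in present:   # the two columns never share a 1
        allowed.add("brothers")
    return frozenset(allowed)


# Row pairs per edge type; all but "undirected" are illustrative assumptions.
GADGETS = {
    "undirected":        [(0, 0), (1, 1)],
    "dashed_undirected": [(1, 0), (0, 1)],  # assumed
    "directed_a_to_b":   [(1, 1), (1, 0)],  # assumed
    "dashed_directed":   [(0, 0), (1, 0)],  # assumed
}

for edge_type, rows in GADGETS.items():
    print(edge_type, sorted(allowed_relations(rows)))
```

Under these assumptions, the undirected pair leaves only the two parenthood directions, the dashed undirected pair leaves only brotherhood, the directed pair leaves only "$u_a$ is an ancestor of $u_b$", and the dashed directed pair leaves either that direction or brotherhood, which matches the four cases listed in the proof.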

Suppose that CMGS($G_c$) = TRUE, i.e. there exists a constrained mixed completion graph $\hat{G}$ for $G_c$. According to the second property of $\hat{G}$, there exists a directed forest $F = (V, E_F)$ that spans $V$. Due to the third property of a constrained mixed completion graph, the completion of edges in $G_c$ does not violate the XOR relations. We create an expansion of $M$ into $M'$ as follows. Resolve the 2s of the genotypes in the rows $M[N_{X_i}, \cdot]$ according to $\hat{G}$: for two vertices $u_a, u_b \in X_i$, $1 \le i \le p$, in case $M[N_{X_i}, a] = M[N_{X_i}, b] = 2$, if there is an undirected dashed edge between $u_a, u_b \in V$, then resolve the submatrix $(M[N_{X_i}, a] \;\; M[N_{X_i}, b])$ unequally, and if there is a directed edge between $u_a, u_b \in V$, then resolve the submatrix equally. Since those edges are completed in $\hat{G}$ according to the XOR relations (see Definition 6, property 3), each of the 2s in these rows can be resolved accordingly. We denote the remaining matrix by $\tilde{M}$. Note that $\tilde{M}[i, j] \in \{0, 1, ?\}$. We call the $\{0, 1\}$ components constants, and the ? components variables. We denote the set of column indices of constants in row $i$ by $C_i$, and the set of column indices of variables in this row by $V_i$. Complete the variable components in the matrix $\tilde{M}$ to create the matrix $M'$ according to:

$$\forall j \in V_i : \quad M'[i, j] = \begin{cases} 1 & \text{if } \exists c \in C_i \text{ s.t. } \tilde{M}[i, c] = 1 \text{ and } u_j \text{ is an ancestor of } u_c, \\ 0 & \text{otherwise.} \end{cases}$$

Now, $M'[i, j] \in \{0, 1\}$. We claim that $M'$ (which is an expansion of $M$) admits a perfect phylogeny forest. Moreover, this forest is $F$. This will be proven by showing that no two columns in $M'$ contradict $F$, and thus, according to Lemma 1, $F$ is the perfect phylogeny forest of $M'$. Consider two vertices $u_a, u_b \in V$ and their corresponding columns $a, b$ in $M'$. For each row $i$, we examine the 3 possible cases for the submatrix $(M'[i, a] \;\; M'[i, b])$:

1. $u_a, u_b \in C_i$. The submatrix $(M'[i, a] \;\; M'[i, b])$ is determined according to the edge $(u_a, u_b) \in E$, which by the definition of $\hat{G}$ does not contradict $F$.

2. $u_a \in C_i$, $u_b \in V_i$ (w.l.o.g.).
First, suppose $M'[i, a] = 0$. If $M'[i, b]$ is determined to be 0, then there is no contradiction for any relation of $u_a$ and $u_b$ in $F$. Otherwise, if $M'[i, b]$ is determined to be 1, then there exists $c \in C_i$, $c \neq a$, such that $M'[i, c] = 1$ and $u_b$ is an ancestor of $u_c$. Suppose, on the contrary, that a forbidden submatrix occurred, i.e. $u_a$ is an ancestor of $u_b$. Since $u_b$ is an ancestor of $u_c$, $u_a$ must be an ancestor of $u_c$. However, according to the construction of $M$, it is not possible that $u_a$ is an ancestor of $u_c$, since $M'[i, a] = 0$, $M'[i, c] = 1$ and $a, c \in C_i$.

Second, suppose $M'[i, a] = 1$. If $M'[i, b]$ is determined to be 0, clearly $u_b$ is not an ancestor of $u_a$, so $(M'[i, a] \;\; M'[i, b])$ does not contradict $F$. Otherwise, if $M'[i, b]$ is determined to be 1, then there exists $c \in C_i$ such that $M'[i, c] = 1$ and $u_b$ is an ancestor of $u_c$. In case $c = a$, $u_a$ and $u_b$ cannot be in a brotherhood relation. In case $c \neq a$, $u_a$ and $u_c$ are in a parenthood relation, and since $u_b$ is an ancestor of $u_c$, $u_a$ and $u_b$ cannot be in a brotherhood relation. It follows that, in this case, $(M'[i, a] \;\; M'[i, b])$ does not contradict $F$.

3. $u_a, u_b \in V_i$.

First, suppose that $M'[i, a]$ and $M'[i, b]$ are both determined to be 0. Obviously, the submatrix does not contradict $F$.

Second, suppose w.l.o.g. that $M'[i, a]$ is determined to be 0 and $M'[i, b]$ is determined to be 1. There exists $c \in C_i$, $c \neq a$, such that $M'[i, c] = 1$ and $u_b$ is an ancestor of $u_c$. Suppose, on the contrary, that a forbidden submatrix occurred, i.e. $u_a$ is an ancestor of $u_b$. Since $u_b$ is an ancestor of $u_c$, $u_a$ must be an ancestor of $u_c$. However, in that case, $M'[i, a]$ should have been determined to be 1.

Third, suppose that $M'[i, a]$ and $M'[i, b]$ are both determined to be 1. There exist $c_a, c_b \in C_i$ such that $M'[i, c_a] = 1$, $M'[i, c_b] = 1$, $u_a$ is an ancestor of $u_{c_a}$ and $u_b$ is an ancestor of $u_{c_b}$. Clearly, $u_{c_a}$ and $u_{c_b}$ are in a parenthood relation, so w.l.o.g. suppose that $u_{c_a}$ is an ancestor of $u_{c_b}$. Thus, both $u_a$ and $u_b$ are ancestors of $u_{c_b}$, and it follows that $u_a$ and $u_b$ cannot be in a brotherhood relation. It follows that, in this case, $(M'[i, a] \;\; M'[i, b])$ does not contradict $F$.

7.3 Proof of Theorem 2

Proof. For a 3-SAT instance, we build a CMGS graph $G_c$. Denote the variables by $\{Y_i\}_{1 \le i \le t}$ and the clauses by $\{C_j\}_{1 \le j \le s}$. First we define four graph structures (figures of these structures are presented in the Appendix, in Subsection 7.1):

Variable base graph: contains 2 vertices denoted by $x_0^i$ and $x_1^i$, without an edge. This graph is denoted by $Var_i$.

Clause base graph (see Figure 3): contains 6 vertices denoted by $\{c_t^j\}_{0 \le t \le 5}$. There are directed edges from $c_1^j$ to $c_0^j$ and from $c_5^j$ to $c_4^j$; there are directed dashed edges from $c_4^j$ to $c_3^j$, from $c_3^j$ to $c_2^j$, and from $c_2^j$ to $c_1^j$; and finally, there is an undirected dashed edge between $c_0^j$ and $c_5^j$. This graph is denoted by $Cl_j$.

Variable positive connector (see Figure 4): contains 12 vertices denoted by $\{a_t\}_{0 \le t \le 5}$ and $\{b_t\}_{0 \le t \le 5}$.
There are undirected edges between $a_1$ and $a_2$ and between $b_3$ and $b_4$; and there are undirected dashed edges between $a_0$ and $a_1$, $a_2$ and $a_3$, $a_3$ and $a_4$, $a_4$ and $a_5$, $b_0$ and $b_1$, $b_1$ and $b_2$, $b_2$ and $b_3$, and between $b_4$ and $b_5$. The XOR relations are: $\{a_0, b_0, a_1, b_1\}$, $\{a_1, b_1, a_2, b_2\}$, $\{a_2, b_2, a_3, b_3\}$, $\{a_3, b_3, a_4, b_4\}$, and $\{a_4, b_4, a_5, b_5\}$. This graph is denoted by $Pos$.

Variable negative connector (see Figure 4): contains 8 vertices denoted by $\{d_t\}_{0 \le t \le 3}$ and $\{e_t\}_{0 \le t \le 3}$. There are undirected edges between $d_1$ and $d_2$; and there are undirected dashed edges between $d_0$ and $d_1$, $d_2$ and $d_3$, $e_0$ and $e_1$, $e_1$ and $e_2$, and between $e_2$ and $e_3$. The XOR relations are: $\{d_0, d_0, e_1, e_1\}$, $\{d_1, d_1, e_2, e_2\}$ and $\{d_2, d_2, e_3, e_3\}$. This graph is denoted by $Neg$.

Note that there are two possible ways to complete the variable positive connector and the variable negative connector with undirected edges so as to satisfy the XOR relations. Both ways, for both types of connectors, are presented in Figure 5. An important key to understanding the reduction is that in the positive connector, the type

of edge $(a_0, b_0)$ is the same (dashed or non-dashed) as the type of the edge $(a_5, b_5)$, while in the negative connector, the type of edge $(d_0, e_0)$ is the opposite (dashed versus non-dashed) of the type of the edge $(d_3, e_3)$.

The construction of $G_c$ is done as follows:
1. For each variable $\{Y_i\}_{1 \le i \le t}$ create a copy of the variable base graph: $Var_i$.
2. For each clause $\{C_j\}_{1 \le j \le s}$ create a copy of the clause base graph: $Cl_j$.
3. For all $1 \le j \le s$, for all $1 \le k \le 3$ do:
4. If $Y_i$ is the $k$-th literal in clause $C_j$ then do: create a copy of the variable positive connector with superscripts $i, j$; identify $a_0$ with $x_0^i$ and $b_0$ with $x_1^i$; identify $a_5$ with $c_k^j$ and $b_5$ with $c_{k+1}^j$.
5. If $\neg Y_i$ is the $k$-th literal in clause $C_j$ then do: create a copy of the variable negative connector with superscripts $i, j$; identify $d_0$ with $x_0^i$ and $e_0$ with $x_1^i$; identify $d_3$ with $c_k^j$ and $e_3$ with $c_{k+1}^j$.

For convenience, we also call the undirected dashed edge a positive edge, and the directed and undirected (non-dashed) edge a negative edge.

Suppose that 3-SAT($\{C_j\}_{1 \le j \le s}$) = TRUE. There exists an assignment for $\{Y_i\}_{1 \le i \le t}$ such that in every clause at least one literal is TRUE. For each variable graph $\{Var_i\}_{1 \le i \le t}$ complete the edge according to the assignment: $\forall 1 \le i \le t$, $(x_0^i, x_1^i)$ is determined to be a positive edge if $Y_i$ = TRUE, or a negative edge otherwise. Now, resolve the XOR relations in all the variable connectors. In all the clause base graphs $\{Cl_j\}_{1 \le j \le s}$, at least one of the 3 edges $(c_1^j, c_2^j)$, $(c_2^j, c_3^j)$ and $(c_3^j, c_4^j)$ is a positive edge. It follows that in each clause base graph there is more than one parenthood connectivity component. In each of those components, no vertex is attached to another vertex of the same component by a dashed edge, and there is a directed edge between 2 vertices, from $c_a^j$ to $c_b^j$, only if $a = b + 1$. It follows that a directed tree can be built in each of the parenthood connectivity components of a clause base graph, under the constraints of $G_c$.
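The behaviour of the clause base graph under the eight possible completions of its three literal edges can be checked with a short enumeration. The sketch below is illustrative only: it models just the directed chain from $c_5$ down to $c_0$ (a literal edge completed as a negative edge keeps the directed link from $c_{k+1}$ to $c_k$, a positive edge removes it) and tests whether $c_5$ becomes an ancestor of $c_0$, which would contradict the undirected dashed edge between them. The function name and the boolean encoding are assumptions, and the remaining forest conditions of CMGS are not modelled.

```python
from itertools import product
from typing import Tuple


def c5_is_ancestor_of_c0(literal_edges_negative: Tuple[bool, bool, bool]) -> bool:
    """literal_edges_negative[k-1] is True when edge (c_k, c_{k+1}) is
    completed as a negative (directed) edge from c_{k+1} to c_k."""
    # Fixed directed edges of the clause base graph: c1 -> c0 and c5 -> c4.
    children = {5: [4], 1: [0]}
    for k, negative in zip((1, 2, 3), literal_edges_negative):
        if negative:
            children.setdefault(k + 1, []).append(k)
    # Depth-first search along directed edges, starting from c5.
    stack, seen = [5], set()
    while stack:
        v = stack.pop()
        if v == 0:
            return True
        if v not in seen:
            seen.add(v)
            stack.extend(children.get(v, []))
    return False


conflicts = [edges for edges in product([False, True], repeat=3)
             if c5_is_ancestor_of_c0(edges)]
print(conflicts)  # [(True, True, True)]
```

In this simplified model, only the completion in which all three literal edges are negative makes $c_5$ an ancestor of $c_0$; any completion with at least one positive literal edge breaks the chain.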
In addition, either of the two possible completions of each of the variable connectors (according to the XOR relations), for any assignment, provides parenthood connectivity components in the variable connectors such that, in each of those components, every two connected vertices are connected by an undirected edge (see Figure 5). For a variable positive connector, the parenthood connectivity components for a positive edge assignment are $\{a_0, b_1\}$, $\{b_0, a_1, a_2, b_2\}$, $\{b_3, a_3, b_4, a_5\}$, and $\{a_4, b_5\}$, and for a negative edge assignment the components are $\{a_0, b_0\}$, $\{a_1, b_1, a_2, b_3, a_4, b_4\}$, $\{a_3, b_2\}$, and $\{a_5, b_5\}$. For a variable negative connector, the parenthood connectivity components for a positive edge assignment are $\{d_0, e_1\}$, $\{e_0, d_1, d_2, e_2\}$, and $\{d_3, e_3\}$, and for a negative edge assignment the components are $\{d_0, e_0\}$, $\{d_1, e_1, d_2, e_3\}$, and $\{d_3, e_2\}$. Thus, in each variable connector, there exists an induced directed tree in each parenthood connected component, according to the constraints of $G_c$. Note that it is possible that subgraphs of two

different variable connectors, $Con_1$ and $Con_2$, will be in the same parenthood connectivity component. This may happen only when two variable connectors are connected to a clause base graph, to edges $(c_1^j, c_2^j)$ and $(c_3^j, c_4^j)$ respectively, and when $(c_2^j, c_3^j)$ is a directed (non-dashed) edge and $(c_1^j, c_2^j)$ and $(c_3^j, c_4^j)$ are undirected dashed edges. In this case, there is only one directed edge connecting $Con_1$ and $Con_2$, so directed trees $T_1$ and $T_2$ can be built on $Con_1$ and $Con_2$ respectively, and then $T_1$ and $T_2$ can be united into a spanning directed tree on $Con_1 \cup Con_2$. It follows that the graph can be divided into $h$ parenthood connectivity components $\{R_i\}_{1 \le i \le h}$, where a directed spanning tree $T_i$ can be built in each of these components, under the constraints of $G_c$. Since each of the trees is in a different parenthood connectivity component, $\bigcup_{i=1}^{h} T_i$ is a directed forest spanning the vertices of $G_c$. The constrained mixed completion graph can now be obtained simply by completing the rest of the missing edges: within each parenthood connectivity component according to its spanning tree, and between the components by undirected dashed edges. It follows that CMGS($G_c$) = TRUE.

Suppose that 3-SAT($\{C_j\}_{1 \le j \le s}$) = FALSE. Then for each assignment for $\{Y_i\}_{1 \le i \le t}$, in at least one of the clauses all literals are assigned FALSE. This implies that in any completion of $G_c$ there will always be one clause base graph $Cl_j$ such that all 3 edges $(c_1^j, c_2^j)$, $(c_2^j, c_3^j)$ and $(c_3^j, c_4^j)$ are negative directed edges. Thus $c_5^j$ must be an ancestor of $c_0^j$ in the forest. However, this contradicts the undirected dashed edge between $c_0^j$ and $c_5^j$, so a spanning forest that satisfies the constraints of $G_c$ does not exist. Thus, CMGS($G_c$) = FALSE.

7.4 Proof of Lemma 2

Proof. Suppose an incomplete genotype matrix $M$ can be completed, so that there exists an expansion $M'$ of $M$ which admits a perfect phylogeny. Let $C$ be the set of columns of $M$.
After completing the missing data in $M$, and since $M'$ admits a perfect phylogeny, each two columns have to be resolved according to some label function $f_L$ of the pairs of vertices of $G_c(M)$, i.e. $\forall i, j \in C : f_L(u_i, u_j) \in \{0, 1\}$. This complete label function cannot contradict the XOR relations of $G_c(M)$ (for proof, see [2]). Next, the preliminary label completion of $G_c(M)$, for the known pairs, must give exactly the same labels as $f_L$, as there is only one possible preliminary label completion. Then, we can choose the following secondary label completion: $\forall (u_i, u_j) \in U_{G_c} : L(u_i, u_j) = f_L(u_i, u_j)$, which obviously gives a label function equivalent to $f_L$. Thus, using this secondary label completion of $M$, a label resolution of the incomplete genotype matrix $M$ gives an incomplete haplotype matrix that can be completed to $M'$.

7.5 Proof of Theorem 3

Proof. Correctness: According to Lemma 2, it is enough to find one correct secondary label completion of $U_{G_c}$, if the root is known. There are $2^m$ possibilities for the root, and $2^{|U_{G_c}|} \le 2^{\binom{m}{2}}$ possibilities for the secondary label completion of $U_{G_c}$. The starting point is when $i = 0$: the algorithm sets $U_{G_c}$ to an arbitrary labelling: $(u_a, u_b)$

Haplotyping as Perfect Phylogeny: A direct approach

Haplotyping as Perfect Phylogeny: A direct approach Haplotyping as Perfect Phylogeny: A direct approach Vineet Bafna Dan Gusfield Giuseppe Lancia Shibu Yooseph February 7, 2003 Abstract A full Haplotype Map of the human genome will prove extremely valuable

More information

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model Jens Gramm Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, Germany. Tzvika Hartman Dept. of Computer Science,

More information

Haplotype Inference Constrained by Plausible Haplotype Data

Haplotype Inference Constrained by Plausible Haplotype Data Haplotype Inference Constrained by Plausible Haplotype Data Michael R. Fellows 1, Tzvika Hartman 2, Danny Hermelin 3, Gad M. Landau 3,4, Frances Rosamond 1, and Liat Rozenberg 3 1 The University of Newcastle,

More information

Lecture 6 January 15, 2014

Lecture 6 January 15, 2014 Advanced Graph Algorithms Jan-Apr 2014 Lecture 6 January 15, 2014 Lecturer: Saket Sourah Scrie: Prafullkumar P Tale 1 Overview In the last lecture we defined simple tree decomposition and stated that for

More information

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model Jens Gramm 1, Tzvika Hartman 2, Till Nierhoff 3, Roded Sharan 4, and Till Tantau 5 1 Wilhelm-Schickard-Institut für Informatik,

More information

An Overview of Combinatorial Methods for Haplotype Inference

An Overview of Combinatorial Methods for Haplotype Inference An Overview of Combinatorial Methods for Haplotype Inference Dan Gusfield 1 Department of Computer Science, University of California, Davis Davis, CA. 95616 Abstract A current high-priority phase of human

More information

SAT in Bioinformatics: Making the Case with Haplotype Inference

SAT in Bioinformatics: Making the Case with Haplotype Inference SAT in Bioinformatics: Making the Case with Haplotype Inference Inês Lynce 1 and João Marques-Silva 2 1 IST/INESC-ID, Technical University of Lisbon, Portugal ines@sat.inesc-id.pt 2 School of Electronics

More information

Essential facts about NP-completeness:

Essential facts about NP-completeness: CMPSCI611: NP Completeness Lecture 17 Essential facts about NP-completeness: Any NP-complete problem can be solved by a simple, but exponentially slow algorithm. We don t have polynomial-time solutions

More information

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT

THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT COMMUNICATIONS IN INFORMATION AND SYSTEMS c 2009 International Press Vol. 9, No. 4, pp. 295-302, 2009 001 THE THREE-STATE PERFECT PHYLOGENY PROBLEM REDUCES TO 2-SAT DAN GUSFIELD AND YUFENG WU Abstract.

More information

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Polynomial Time Perfect Sampler for Discretized Dirichlet Distribution

MATHEMATICAL ENGINEERING TECHNICAL REPORTS. Polynomial Time Perfect Sampler for Discretized Dirichlet Distribution MATHEMATICAL ENGINEERING TECHNICAL REPORTS Polynomial Time Perfect Sampler for Discretized Dirichlet Distriution Tomomi MATSUI and Shuji KIJIMA METR 003 7 April 003 DEPARTMENT OF MATHEMATICAL INFORMATICS

More information

IN this paper we study a discrete optimization problem. Constrained Shortest Link-Disjoint Paths Selection: A Network Programming Based Approach

IN this paper we study a discrete optimization problem. Constrained Shortest Link-Disjoint Paths Selection: A Network Programming Based Approach Constrained Shortest Link-Disjoint Paths Selection: A Network Programming Based Approach Ying Xiao, Student Memer, IEEE, Krishnaiyan Thulasiraman, Fellow, IEEE, and Guoliang Xue, Senior Memer, IEEE Astract

More information

SINGLE nucleotide polymorphisms (SNPs) are differences

SINGLE nucleotide polymorphisms (SNPs) are differences IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS, VOL. 3, NO. 3, JULY-SEPTEMBER 2006 303 Islands of Tractability for Parsimony Haplotyping Roded Sharan, Bjarni V. Halldórsson, and Sorin

More information

Genetic Algorithms applied to Problems of Forbidden Configurations

Genetic Algorithms applied to Problems of Forbidden Configurations Genetic Algorithms applied to Prolems of Foridden Configurations R.P. Anstee Miguel Raggi Department of Mathematics University of British Columia Vancouver, B.C. Canada V6T Z2 anstee@math.uc.ca mraggi@gmail.com

More information

19 Tree Construction using Singular Value Decomposition

19 Tree Construction using Singular Value Decomposition 19 Tree Construction using Singular Value Decomposition Nicholas Eriksson We present a new, statistically consistent algorithm for phylogenetic tree construction that uses the algeraic theory of statistical

More information

Properties of normal phylogenetic networks

Properties of normal phylogenetic networks Properties of normal phylogenetic networks Stephen J. Willson Department of Mathematics Iowa State University Ames, IA 50011 USA swillson@iastate.edu August 13, 2009 Abstract. A phylogenetic network is

More information

Increasing the Span of Stars

Increasing the Span of Stars Increasing the Span of Stars Ning Chen Roee Engelberg C. Thach Nguyen Prasad Raghavendra Atri Rudra Gynanit Singh Department of Computer Science and Engineering, University of Washington, Seattle, WA.

More information

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.!

Integer Programming in Computational Biology. D. Gusfield University of California, Davis Presented December 12, 2016.! Integer Programming in Computational Biology D. Gusfield University of California, Davis Presented December 12, 2016. There are many important phylogeny problems that depart from simple tree models: Missing

More information

c 2004 Society for Industrial and Applied Mathematics

c 2004 Society for Industrial and Applied Mathematics SIAM J. COMPUT. Vol. 33, No. 3, pp. 590 607 c 2004 Society for Industrial and Applied Mathematics INCOMPLETE DIRECTED PERFECT PHYLOGENY ITSIK PE ER, TAL PUPKO, RON SHAMIR, AND RODED SHARAN Abstract. Perfect

More information

Mathematical Approaches to the Pure Parsimony Problem

Mathematical Approaches to the Pure Parsimony Problem Mathematical Approaches to the Pure Parsimony Problem P. Blain a,, A. Holder b,, J. Silva c, and C. Vinzant d, July 29, 2005 Abstract Given the genetic information of a population, the Pure Parsimony problem

More information

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables

Minimizing a convex separable exponential function subject to linear equality constraint and bounded variables Minimizing a convex separale exponential function suect to linear equality constraint and ounded variales Stefan M. Stefanov Department of Mathematics Neofit Rilski South-Western University 2700 Blagoevgrad

More information

Bounded Queries, Approximations and the Boolean Hierarchy

Bounded Queries, Approximations and the Boolean Hierarchy Bounded Queries, Approximations and the Boolean Hierarchy Richard Chang Department of Computer Science and Electrical Engineering University of Maryland Baltimore County March 23, 1999 Astract This paper

More information

A Phylogenetic Network Construction due to Constrained Recombination

A Phylogenetic Network Construction due to Constrained Recombination A Phylogenetic Network Construction due to Constrained Recombination Mohd. Abdul Hai Zahid Research Scholar Research Supervisors: Dr. R.C. Joshi Dr. Ankush Mittal Department of Electronics and Computer

More information

Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs

Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs Tamar Barzuza 1, Jacques S. Beckmann 2,3, Ron Shamir 4 and Itsik Pe er 5 1 Dept. of Computer Science and Applied Mathematics,

More information

Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs

Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs Tamar Barzuza 1, Jacques S. Beckmann 2,3, Ron Shamir 4, and Itsik Pe er 5 1 Dept. of Computer Science and Applied Mathematics,

More information

#A50 INTEGERS 14 (2014) ON RATS SEQUENCES IN GENERAL BASES

#A50 INTEGERS 14 (2014) ON RATS SEQUENCES IN GENERAL BASES #A50 INTEGERS 14 (014) ON RATS SEQUENCES IN GENERAL BASES Johann Thiel Dept. of Mathematics, New York City College of Technology, Brooklyn, New York jthiel@citytech.cuny.edu Received: 6/11/13, Revised:

More information

More on NP and Reductions

More on NP and Reductions Indian Institute of Information Technology Design and Manufacturing, Kancheepuram Chennai 600 127, India An Autonomous Institute under MHRD, Govt of India http://www.iiitdm.ac.in COM 501 Advanced Data

More information

Determinants of generalized binary band matrices

Determinants of generalized binary band matrices Determinants of generalized inary and matrices Dmitry Efimov arxiv:17005655v1 [mathra] 18 Fe 017 Department of Mathematics, Komi Science Centre UrD RAS, Syktyvkar, Russia Astract Under inary matrices we

More information

Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem

Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem Complexity and Approximation of the Minimum Recombination Haplotype Configuration Problem Lan Liu 1, Xi Chen 3, Jing Xiao 3, and Tao Jiang 1,2 1 Department of Computer Science and Engineering, University

More information

Integer Programming for Phylogenetic Network Problems

Integer Programming for Phylogenetic Network Problems Integer Programming for Phylogenetic Network Problems D. Gusfield University of California, Davis Presented at the National University of Singapore, July 27, 2015.! There are many important phylogeny problems

More information

Allen Holder - Trinity University

Allen Holder - Trinity University Haplotyping - Trinity University Population Problems - joint with Courtney Davis, University of Utah Single Individuals - joint with John Louie, Carrol College, and Lena Sherbakov, Williams University

More information

Haplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing
