The Pure Parsimony Problem

Similar documents
Allen Holder - Trinity University

Mathematical Approaches to the Pure Parsimony Problem

COSE212: Programming Languages. Lecture 1 Inductive Definitions (1)

COSE212: Programming Languages. Lecture 1 Inductive Definitions (1)

{a, b, c} {a, b} {a, c} {b, c} {a}

Properties of the Integers

a + b = b + a and a b = b a. (a + b) + c = a + (b + c) and (a b) c = a (b c). a (b + c) = a b + a c and (a + b) c = a c + b c.

Harvard CS121 and CSCI E-121 Lecture 2: Mathematical Preliminaries

Theory of Computer Science

CMPSCI611: The Matroid Theorem Lecture 5

arxiv: v2 [math.co] 7 Jan 2016

Haplotyping as Perfect Phylogeny: A direct approach

On the minimum neighborhood of independent sets in the n-cube

Well-Ordering Principle. Axiom: Every nonempty subset of Z + has a least element. That is, if S Z + and S, then S has a smallest element.

Reading 11 : Relations and Functions

The Probability of Winning a Series. Gregory Quenell

Even Cycles in Hypergraphs.

Phylogenetic Networks with Recombination

CSE 105 Homework 1 Due: Monday October 9, Instructions. should be on each page of the submission.

Complete Induction and the Well- Ordering Principle

ACO Comprehensive Exam March 17 and 18, Computability, Complexity and Algorithms

Approximation Algorithms for Asymmetric TSP by Decomposing Directed Regular Multigraphs

List of Theorems. Mat 416, Introduction to Graph Theory. Theorem 1 The numbers R(p, q) exist and for p, q 2,

CS 133 : Automata Theory and Computability

Semigroup presentations via boundaries in Cayley graphs 1

Chapter 7 Matchings and r-factors

We want to show P (n) is true for all integers

Efficient Approximation for Restricted Biclique Cover Problems

Chapter 4. Regular Expressions. 4.1 Some Definitions

6.046 Recitation 11 Handout

Theory of Computation

FINAL EXAM PRACTICE PROBLEMS CMSC 451 (Spring 2016)

Theory of Computation

4-coloring P 6 -free graphs with no induced 5-cycles

Indian Institute of Information Technology Design and Manufacturing, Kancheepuram Chennai , India. Instructor N.Sadagopan Scribe: P.

3. R = = on Z. R, S, A, T.

Chapter 9: Relations Relations

Models of Computation. by Costas Busch, LSU

10.4 The Kruskal Katona theorem

Set Theory. CSE 215, Foundations of Computer Science Stony Brook University

C1.1 Introduction. Theory of Computer Science. Theory of Computer Science. C1.1 Introduction. C1.2 Alphabets and Formal Languages. C1.

arxiv: v1 [cs.cc] 27 Feb 2011

CS60007 Algorithm Design and Analysis 2018 Assignment 1

1.4 Equivalence Relations and Partitions

GENERALIZED INDEPENDENCE IN GRAPHS HAVING CUT-VERTICES

Automata Theory Final Exam Solution 08:10-10:00 am Friday, June 13, 2008

Equivalence, Order, and Inductive Proof

MINORS OF GRAPHS OF LARGE PATH-WIDTH. A Dissertation Presented to The Academic Faculty. Thanh N. Dang

Greedy Trees, Caterpillars, and Wiener-Type Graph Invariants

Algorithms: Lecture 12. Chalmers University of Technology

arxiv: v3 [math.co] 20 Feb 2014

Graph Theorizing Peg Solitaire. D. Paul Hoilman East Tennessee State University

Part V. Matchings. Matching. 19 Augmenting Paths for Matchings. 18 Bipartite Matching via Flows

On k-rainbow independent domination in graphs

Finiteness conditions and index in semigroup theory

F. Roussel, I. Rusu. Université d Orléans, L.I.F.O., B.P. 6759, Orléans Cedex 2, France

Coloring Vertices and Edges of a Path by Nonempty Subsets of a Set

Haplotyping estimation from aligned single nucleotide polymorphism fragments has attracted increasing

EXACT DOUBLE DOMINATION IN GRAPHS

The Reduction of Graph Families Closed under Contraction

Combinatorial Optimization

On the Dynamic Chromatic Number of Graphs

Edge-coloring problems

K 4 -free graphs with no odd holes

Linear-Time Algorithms for Finding Tucker Submatrices and Lekkerkerker-Boland Subgraphs

In English, there are at least three different types of entities: letters, words, sentences.

On Approximating An Implicit Cover Problem in Biology

1 Matchings in Non-Bipartite Graphs

Observation 4.1 G has a proper separation of order 0 if and only if G is disconnected.

SAT in Bioinformatics: Making the Case with Haplotype Inference

Massachusetts Institute of Technology 6.042J/18.062J, Fall 02: Mathematics for Computer Science Professor Albert Meyer and Dr.

Strongly chordal and chordal bipartite graphs are sandwich monotone

arxiv: v1 [math.co] 28 Oct 2016

Observation 4.1 G has a proper separation of order 0 if and only if G is disconnected.

RECAP: Extremal problems Examples

k-blocks: a connectivity invariant for graphs

Aphylogenetic network is a generalization of a phylogenetic tree, allowing properties that are not tree-like.

Generating p-extremal graphs

Discrete Optimization 2010 Lecture 2 Matroids & Shortest Paths

Fall 2017 Qualifier Exam: OPTIMIZATION. September 18, 2017

THE MAXIMAL SUBGROUPS AND THE COMPLEXITY OF THE FLOW SEMIGROUP OF FINITE (DI)GRAPHS

AN IMPLICIT COVER PROBLEM IN WILD POPULATION STUDY

Realization Plans for Extensive Form Games without Perfect Recall

Estimating Recombination Rates. LRH selection test, and recombination

Generating Permutations and Combinations

Planar Graphs (1) Planar Graphs (3) Planar Graphs (2) Planar Graphs (4)

REALIZING TOURNAMENTS AS MODELS FOR K-MAJORITY VOTING

Markov properties for directed graphs

On the Complexity of SNP Block Partitioning Under the Perfect Phylogeny Model

Phylogenetic Networks, Trees, and Clusters

Advanced Topics in Discrete Math: Graph Theory Fall 2010

Theoretical Computer Science

GRAPH ALGORITHMS Week 7 (13 Nov - 18 Nov 2017)

Tree-width and planar minors

Solving the MWT. Recall the ILP for the MWT. We can obtain a solution to the MWT problem by solving the following ILP:

The Topology of Intersections of Coloring Complexes

Odd independent transversals are odd

α-acyclic Joins Jef Wijsen May 4, 2017

A Phylogenetic Network Construction due to Constrained Recombination

5 Flows and cuts in digraphs

Transcription:

Haplotyping and Minimum Diversity Graphs Courtney Davis - University of Utah - Trinity University

Some Genetics Mother Paired Gene Representation Physical Trait ABABBA AAABBB Physical Trait ABA AAA Mother s Donation ABABBA ABA Haplotype Genotype Representation AXABBX AXA Child ABABBA ABA BBAABA BBA Paired Gene Representation Paired Gene Representation Physical Trait BBAABA BBBAAB Father Physical Trait BBA BAA XBAXBA XBA Father s Donation BBAABA BBA Genotype Representation Haplotype Genotype Representation BBXAXX BXA SNP 2 SNP 7 SNP 2 SNP 7

Haplotyping Definition: The problem is to reconstruct the haplotypes donated by a previous population from the genotypes of the current population. Why: Tracing genetic markers from generation to generation is needed to gauge a population s susceptibility to disease and in the design of patient-specific drugs. Past Research: Investigations were started by Clark in 1990, and recent contributions were made by Lancia and Gusfield.

Clark s Rule 1. Start with an empty collection of haplotypes. 2. Choose a genotype. 3. Add as few haplotypes to the set as possible (you need to add either 1 or 2) so that the genotype can be formed from the collection of haplotypes. 4. Continue until all genotypes can be formed. This technique mimics what happens in nature. Notice that it can be interpreted as an attempt to find the smallest collection of haplotypes, but the process is dependent on the sequence of genotypes.

The Pure Parsimony Problem Parsimony Problems Haplotyping is a situation where simple explanations appear to be biologically relevant. So, finding small collections of haplotypes that can explain the genotypic information of the current population is important. Pure Parsimony Problem Finding a smallest collection of haplotypes that can reconstruct a set of genotypes is called the Pure Parsimony Problem.

For parent haplotypes h 1 and h 2 and offspring genotype g, we have the following at each SNP: g i = A if, and only if, h 1 i = h2 i = A. g i = B if, and only if, h 1 i = h2 i = B. g i = X if, and only if, either h 1 i = A and h2 i = B, or h1 i = B and h2 i = A. We say that h 1 h 2 = g provided that h 1, h 2, and g adhere to these rules. For example, let h 1 = AABAAB and h 2 = ABBABB. Then, h 1 h 2 = g = AXBAXB. It is easy to see that is a binary operation with the property that h i h j = h i h k implies h j = h k. Parental haplotypes that contribute genetic information to the same offspring s genotype are called mates. That is, if h 1 h 2 = g, we say that h 1 mates with h 2 to form g. Furthermore, we say that h 1 resolves g if h 1 h 2 = g for some h 2. This concept is extended to sets, and we say that H resolves G if for each g G, there is an h 1 and h 2 in H such that h 1 h 2 = g.

Diversity Graph A bipartite graph D = (H, G, E) is a diversity graph if G is nonempty, each genotype in G is resolved by some haplotype in H, and E has the property that if (h 1, g) E, then there exists an h 2 H such that (h 2, g) E and h 1 h 2 = g. Notice that the definition is biological.

Bipartite Graphs that are not Diversity Graphs A bipartite graph is a diversity graph if the nodes can be labeled to satisfy the definition. The definition requires that the degree of every node in the genotype set has even degree. There are graphs with each node having an even degree but that are not diversity graphs. As an example, K(2, 2) is not a diversity graph because it violates.

Some Definitions The set of all haplotypes of length n is denoted by H. The largest edge set between the collection of genotypes G and H is E. Any subgraph of (H, G, E) that is a diversity graph and has the property that the subset of H is as small as possible is a solution to the Pure Parsimony problem. These subgraphs are denoted by (H, G, E ). There are typically several optimal subgraphs, which makes solving an IP formulation of the problem difficult.

A Simple, but Useful Result Theorem If the elements of H are lexicographically ordered (where A < B), we have for 1 j 2 n that h j h (2n j+1) = XX...X. The proof is simple. An example for n = 3 is below. 0 B @ AAA AAB ABA ABB BAA BAB BBA BBB 1 0 C B A @ BBB BBA BAB BAA ABB ABA AAB AAA 1 0 = C B A @ XXX XXX XXX XXX XXX XXX XXX XXX 1. C A

Extending Bipartite Graphs to Diversity Graphs Let (V, W, E) be a bipartite graph. For each w, define May need to add a node if N(w) is ^ odd and V(w) is empty. V ^ F(w) T (w) = S w w [N(w) N(w )]. Add enough to remove conflicts in N(w). ^ V(w) W w Let ˆV (w) and ˆF (w) be vertex sets such that N(w) ˆF (w) = ˆV (w) = 2 T (w) N(w) + and ( 0, N(w) ˆV (w) is even 1, N(w) ˆV (w) is odd T(w)

Some Extension Bounds Lemma The bipartite graph (V, W, E) can be extended and labeled to become a diversity graph by adding no more than X w W ˆF (w) + (2 T (w) N(w) ) + nodes to V, provide that there are no isolated nodes. proof: This is a long constructive proof that uses the Lexicographic Theorem. Theorem Any bipartite graph (V, W, E) can be extended and labeled to become a diversity graph by adding no more than 2 3 4 X ˆF (w) + (2 T (w) N(w) ) + 5 + (M V M W ) + w W nodes to V, where M V and M W are the number of isolated nodes in V and W, respectively.

Locating Favorable Haplotypes Lemma Suppose that T (g) for some g G. Then, H contains an element of S g G T (g). proof: Let T (g) for some g G. Suppose that H does not contain an element of S g G T (g). Then, to resolve each g we must select two elements from N(g)\ S g G T (g), provided that g has at least one ambiguous SNP. If g contains no Xs, we select one element from N(g)\ S g G T (g). This implies that H = 2 G u, where u is the number of genotypes with no ambiguous SNPs. However, we know that T (g) is nonempty for some g, which means there exists g 1 and g 2 such that h 1 h 2 = g 1 and h 1 h 3 = g 2 for some h 1, h 2, and h 3. If we replace the four haplotypes that resolve g 1 and g 2 with h 1, h 2, and h 3, then we have resolved G with 2 G u 1 haplotypes, which contradicts the definition of H. Hence, H contains an element of S g G T (g).

Characterizing when H = 2 G Theorem Assume every g has one or more ambiguous SNPs. Then, H = 2 G if, and only if, the neighborhoods of the genotypes together with the set of isolated haplotypes partitions H. proof: ( ) Let all g have at least one ambiguous SNP, and let H I be the set of isolated haplotypes. Assume the neighborhoods of the genotypes and H I partition H. Then, there does not exist an h that resolves both g 1 and g 2, with g 1 g 2. Since all genotypes are ambiguous in some SNP, there is no g such that h h = g, for some h. So, two distinct haplotypes must mate to form every genotype. Since H is partitioned, H has exactly two distinct haplotypes from each genotype s neighborhood. Therefore, H = 2 G. ( ) Let H = 2 G, and suppose for the sake of obtaining a contradiction that T (g) for some g G. Then by Lemma 13, H

contains an element in S g G T (g). Let g1 and g 2 be such that h 1 h 2 = g 1 and h 1 h 3 = g 2, for some h 1, h 2, and h 3. Let G = G\{g 1, g 2 }, and let H = S g G N(g). Furthermore, let (H ) be such that (H ) = min{ H : H H, H resolves G }. Clearly (H ) 2 G. We know that we can resolve G by including h 1, h 2, and h 3 in (H ). Since all three haplotypes might not be required, we have that 2 G = H (H ) + 3. So, 2 G = H (H ) + 3 2 G + 3 = 2 ( G 2) + 3 = 2 G 1. Since this is a contradiction, we have that T (g) = for all g, and consequently, N(g i ) N(g j ) =, for all i j. The result follows since S H = g G N(g) H I.

Restricting Mating Structure We now constrain our optimization problem so that the maximum number of mates that any haplotype can have is m. A smallest haplotype set that resolves G with this restriction is denoted by H m, and we let φ(m) = H m. If m = 1, each haplotype can mate with at most one other haplotype. Biologically this means each parent can donate one of two haplotypes to a unique child, so this haplotype cannot be used to form another child. So, for m = 1 the neighborhoods of the genotypes in an optimal subgraph are disjoint, and the smallest number of haplotypes that can resolve G is φ(1) = 2 G u, where u is the number of genotypes with no ambiguous

SNPs.

Properties of φ(m) Calculating φ(m ) solves the Pure Parsimony problem and indicates the least amount of mating needed. At some threshold, increasing m does not change the cardinality of H m. Hence, for some m, φ(m) = φ(m + k) for every natural number k. Increasing the number of possible mates that any haplotype is allowed never causes an increase in H m. Thus, φ(m) φ(m + 1) for all m, and φ is non-increasing. The smallest m such that φ(m) = φ(m + k), for all k N, is denoted by m. So, if m m, we have that φ(m) = φ(m ). No haplotype can mate with more than G haplotypes, and hence, m G. If no haplotype reconciles more than one genotype, m = 1.

What if m is at its Upper Bound Theorem If m = G, we have that 8 < φ(m G, if h h = g for some h Hm ) =, : G + 1, otherwise. proof: Let m = G. Then, there exists h Hm such that h resolves every g. Since Hm resolves G, for each g i there is an h i Hm such that h h i = g i. We have two cases. Case 1: Suppose that h h G. Then, h mates with a unique h i H m \{h } to resolve each g i G. Hence, φ(m ) = H = G + 1. Case 2: Suppose h h G. In this case we have that h mates with G 1 haplotypes in h i H m \{h } to resolve the genotypes in G\{h h }. Hence, φ(m ) = H m = 1 + ( G 1) = G.

Calculating φ(2) Step 1: Set v = 0 and (H v, G v ) = (H, G). Step 2: Find the longest path in (H v, G v ), say P v. If no path exists, set P v =. Step 3: If P v =, stop. Step 4: Index v by 1. Step 5: Set (H v+1, G v+1 ) = (H v, G v )\P v. Step 6: Index v by 1. Step 7: Go to Step 2. This greedy algorithm iteratively removes the longest paths in a diversity graph.

The Greedy Algorithm Works Theorem The greedy algorithm finds an optimal subgraph of the acyclic diversity graph (H, G, E). Moreover, if v is the number of paths found by the algorithm, φ(2) = G + v. proof: The proof follows by induction on G.

An Example The paths through the genotypes g 1 = AXBBBB g 2 = XAXXBB g 3 = BXAXBX g 4 = BXXAXB g 5 = BBBXAB g 6 = BBXBBA must pass through these genotypes as indicated below. g 1 g2 g3 g4 g5 g6

Path Decompositions First Path s Genotype Progression Second Path s Genotype Progression (g 1, g 2, g 3, g 4, g 5 ) (g 6 ) (g 1, g 2, g 4, g 5 ) (g 3, g 6 ) (g 1, g 2, g 3, g 6 ) (g 4, g 5 ) (g 6, g 3, g 4, g 5 ) (g 1, g 2 ) The greedy algorithm finds the first solution in the Table, as the first path is as long as possible. None of the other paths have this property, and so the algorithm is not capable of finding these solutions.

Future Directions How fast does φ(m) grow? We see from Theorem that knowing m can solve the Pure Parsimony problem in some cases. Moreover, knowing m is beneficial in all cases as this removes many subgraphs from consideration. So, in an integer programming formulation of the Pure Parsimony problem, m provides a cut that may help reduce solution times. Finding bounds on m is an interesting area of future work. Randomized coloring algorithms have been efficient on many classes of graphs, and it may be that finding longest paths and cycles can be thought of as a coloring problem. If so, then these techniques could be used to approximate the greedy algorithm, with the hope being that substantial biological models could be addressed.

Thank you for your time, please ask questions