Allen Holder - Trinity University

Haplotyping - Trinity University
Population Problems - joint with Courtney Davis, University of Utah
Single Individuals - joint with John Louie, Carrol College, and Lena Sherbakov, Williams University

Some Genetics
[Figure: a mother and a father each donate one haplotype to a child. Panel labels: SNP 2, SNP 7.]

                        Six-SNP example      Three-SNP example
Mother, paired genes    ABABBA / AAABBB      ABA / AAA
Mother, genotype        AXABBX               AXA
Mother's donation       ABABBA               ABA
Father, paired genes    BBAABA / BBBAAB      BBA / BAA
Father, genotype        BBXAXX               BXA
Father's donation       BBAABA               BBA
Child, paired genes     ABABBA / BBAABA      ABA / BBA
Child, genotype         XBAXBA               XBA

Haplotyping a Population
Definition: The problem is to reconstruct the haplotypes donated by a previous population from the genotypes of the current population.
Why: Tracing genetic markers from generation to generation is needed to gauge a population's susceptibility to disease and to design patient-specific drugs.
Past Research: Investigations were started by Clark in 1990, and recent contributions were made by Istrail, Lancia, Gusfield, Pinotti, and Rizzi.

Clark's Rule
1. Start with an empty collection of haplotypes.
2. Choose a genotype.
3. Add as few haplotypes to the collection as possible (either 1 or 2 must be added) so that the genotype can be formed from the collection of haplotypes.
4. Continue until all genotypes can be formed.
This technique mimics what happens in nature. Notice that it can be interpreted as an attempt to find the smallest collection of haplotypes, but the result depends on the order in which the genotypes are chosen.
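The steps of Clark's Rule can be sketched in Python as follows. This is a minimal illustration, not the original implementation: the function names are hypothetical, haplotypes are strings over A and B, genotypes are strings over A, B, and X, and the choice made when two new haplotypes are needed (every X becomes A in one haplotype and B in its mate) is an arbitrary assumption.

```python
def compatible(h, g):
    """True if haplotype h agrees with genotype g at every unambiguous SNP."""
    return all(gi == "X" or gi == hi for hi, gi in zip(h, g))


def mate_of(h, g):
    """The unique haplotype that mates with h to form g (h must be compatible with g)."""
    flip = {"A": "B", "B": "A"}
    return "".join(flip[hi] if gi == "X" else hi for hi, gi in zip(h, g))


def clark_rule(genotypes):
    """Process genotypes in the given order, adding as few haplotypes as possible."""
    haps = []
    for g in genotypes:
        # Case 0: two haplotypes already in the collection form g.
        if any(compatible(h, g) and mate_of(h, g) in haps for h in haps):
            continue
        # Case 1: one existing haplotype is compatible with g, so add its mate.
        partner = next((h for h in haps if compatible(h, g)), None)
        if partner is not None:
            haps.append(mate_of(partner, g))
            continue
        # Case 2: nothing fits. Assumed tie-break: replace every X by A,
        # then add the mate (identical to h1 when g is unambiguous).
        h1 = g.replace("X", "A")
        haps.append(h1)
        h2 = mate_of(h1, g)
        if h2 != h1:
            haps.append(h2)
    return haps


if __name__ == "__main__":
    print(clark_rule(["AB", "BA", "XX"]))  # two haplotypes suffice in this order
    print(clark_rule(["XX", "AB", "BA"]))  # four haplotypes in this order
```

The two calls use the same three genotypes in different orders and return collections of different sizes, which illustrates the dependence on the sequence of genotypes.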

The Pure Parsimony Problem
Parsimony Problems: Haplotyping is a situation where simple explanations appear to be biologically relevant. So, finding small collections of haplotypes that can explain the genotypic information of the current population is important.
Pure Parsimony Problem: Finding a smallest collection of haplotypes that can reconstruct a set of genotypes is called the Pure Parsimony Problem. This problem is known to be APX-hard (Lancia, Pinotti, and Rizzi).

For parent haplotypes $h^1$ and $h^2$ and offspring genotype $g$, we have the following at each SNP $i$:
$g_i = A$ if and only if $h^1_i = h^2_i = A$.
$g_i = B$ if and only if $h^1_i = h^2_i = B$.
$g_i = X$ if and only if either $h^1_i = A$ and $h^2_i = B$, or $h^1_i = B$ and $h^2_i = A$.
We say that $h^1 \oplus h^2 = g$ provided that $h^1$, $h^2$, and $g$ adhere to these rules. For example, let $h^1 = AABAAB$ and $h^2 = ABBABB$. Then $h^1 \oplus h^2 = g = AXBAXB$. It is easy to see that $\oplus$ is a binary operation with the property that $h^i \oplus h^j = h^i \oplus h^k$ implies $h^j = h^k$.
Parental haplotypes that contribute genetic information to the same offspring's genotype are called mates. That is, if $h^1 \oplus h^2 = g$, we say that $h^1$ mates with $h^2$ to form $g$. Furthermore, we say that $h^1$ resolves $g$ if $h^1 \oplus h^2 = g$ for some $h^2$. This concept is extended to sets, and we say that $H$ resolves $G$ if for each $g \in G$ there are $h^1$ and $h^2$ in $H$ such that $h^1 \oplus h^2 = g$.
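The SNP-by-SNP rules above translate directly into code. The following is a small sketch with hypothetical function names (`mate`, `resolves`), assuming haplotypes are equal-length strings over A and B; a haplotype is allowed to mate with itself, which yields an unambiguous genotype.

```python
def mate(h1, h2):
    """Form the genotype of two haplotypes: equal letters are kept, unequal letters become X."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "X" for a, b in zip(h1, h2))


def resolves(H, G):
    """True if every genotype in G is formed by some pair of haplotypes in H."""
    formed = {mate(h1, h2) for h1 in H for h2 in H}
    return all(g in formed for g in G)


if __name__ == "__main__":
    print(mate("AABAAB", "ABBABB"))                    # the example from the text
    print(resolves(["AABAAB", "ABBABB"], ["AXBAXB"]))
```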

Diversity Graph
A bipartite graph $D = (H, G, E)$ is a diversity graph if $G$ is nonempty, each genotype in $G$ is resolved by some haplotype in $H$, and $E$ has the property that if $(h^1, g) \in E$, then there exists an $h^2 \in H$ such that $(h^2, g) \in E$ and $h^1 \oplus h^2 = g$. Notice that the definition is biological.

Bipartite Graphs that are not Diversity Graphs
A bipartite graph is a diversity graph if the nodes can be labeled to satisfy the definition. The definition requires that every node in the genotype set has even degree. There are graphs in which every node has even degree but that are not diversity graphs. As an example, K(2, 2) is not a diversity graph because it violates.

Some Definitions
The set of all haplotypes of length $n$ is denoted by $\mathcal{H}$. The largest edge set between the collection of genotypes $G$ and $\mathcal{H}$ is $\mathcal{E}$. Any subgraph of $(\mathcal{H}, G, \mathcal{E})$ that is a diversity graph and has the property that the subset of $\mathcal{H}$ is as small as possible is a solution to the Pure Parsimony problem. These subgraphs are denoted by $(H^*, G, E^*)$. There are typically several optimal subgraphs, which makes solving an IP formulation of the problem difficult.

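A Simple, but Useful Result
Theorem: If the elements of $\mathcal{H}$ are lexicographically ordered (where $A < B$), we have for $1 \le j \le 2^n$ that
$$h^j \oplus h^{(2^n - j + 1)} = XX \ldots X.$$
The proof is simple. An example for $n = 3$ is below.
$$
\begin{pmatrix} AAA \\ AAB \\ ABA \\ ABB \\ BAA \\ BAB \\ BBA \\ BBB \end{pmatrix}
\oplus
\begin{pmatrix} BBB \\ BBA \\ BAB \\ BAA \\ ABB \\ ABA \\ AAB \\ AAA \end{pmatrix}
=
\begin{pmatrix} XXX \\ XXX \\ XXX \\ XXX \\ XXX \\ XXX \\ XXX \\ XXX \end{pmatrix}
$$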
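The lexicographic pairing in this theorem can be checked numerically for small n. The sketch below uses hypothetical helper names and relies on `itertools.product`, which enumerates the strings over "AB" in lexicographic order with A before B; it pairs the j-th haplotype with the one in position 2^n - j + 1 and confirms that every pair mates to the all-X genotype.

```python
from itertools import product


def mate(h1, h2):
    """Equal letters are kept, unequal letters become X."""
    return "".join(a if a == b else "X" for a, b in zip(h1, h2))


def lexicographic_pairs_are_all_x(n):
    """Check that h^j mated with h^(2^n - j + 1) is XX...X for every j."""
    haps = ["".join(p) for p in product("AB", repeat=n)]  # lexicographic, A < B
    size = len(haps)
    # With 0-based indices, position j pairs with position size - 1 - j.
    return all(mate(haps[j], haps[size - 1 - j]) == "X" * n for j in range(size))


if __name__ == "__main__":
    for n in range(1, 7):
        print(n, lexicographic_pairs_are_all_x(n))
```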

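Extending Bipartite Graphs to Diversity Graphs
Let $(V, W, E)$ be a bipartite graph. For each $w \in W$, define
$$T(w) = \bigcup_{w' \neq w} \left[ N(w) \cap N(w') \right].$$
Let $\hat{V}(w)$ and $\hat{F}(w)$ be vertex sets such that
$$|\hat{V}(w)| = \left( 2|T(w)| - |N(w)| \right)^+ \quad \text{and} \quad |\hat{F}(w)| = \begin{cases} 0, & |N(w) \cup \hat{V}(w)| \text{ is even} \\ 1, & |N(w) \cup \hat{V}(w)| \text{ is odd.} \end{cases}$$
[Figure: a node $w$ with its neighborhood $N(w)$, the shared nodes $T(w)$, and the added nodes $\hat{V}(w)$ and $\hat{F}(w)$. Annotation on $\hat{V}(w)$: add enough to remove conflicts in $N(w)$. Annotation on $\hat{F}(w)$: may need to add a node if $|N(w)|$ is odd and $\hat{V}(w)$ is empty.]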

Some Extension Bounds
Lemma: The bipartite graph $(V, W, E)$ can be extended and labeled to become a diversity graph by adding no more than
$$\sum_{w \in W} \left[ |\hat{F}(w)| + \left( 2|T(w)| - |N(w)| \right)^+ \right]$$
nodes to $V$, provided that there are no isolated nodes.
Proof: This is a long constructive proof that uses the Lexicographic Theorem.
Theorem: Any bipartite graph $(V, W, E)$ can be extended and labeled to become a diversity graph by adding no more than
$$\left[ \sum_{w \in W} |\hat{F}(w)| + \left( 2|T(w)| - |N(w)| \right)^+ \right] + (M_V - M_W)^+$$
nodes to $V$, where $M_V$ and $M_W$ are the number of isolated nodes in $V$ and $W$, respectively.

Some Theory
Lemma: Suppose that $T(g) \neq \emptyset$ for some $g \in G$. Then $H^*$ contains an element of $\bigcup_{g \in G} T(g)$.
This lemma says that if some haplotypes can resolve multiple genotypes, then a smallest collection of haplotypes contains some of these haplotypes.
Theorem: Assume every $g$ has one or more ambiguous SNPs. Then $|H^*| = 2|G|$ if, and only if, the neighborhoods of the genotypes together with the set of isolated haplotypes partition $\mathcal{H}$.
This result characterizes the graphs for which the smallest collection of haplotypes is as large as possible.

Restricting Mating Structure
We now constrain our optimization problem so that the maximum number of mates that any haplotype can have is $m$. A smallest haplotype set that resolves $G$ with this restriction is denoted by $H^m$, and we let $\varphi(m) = |H^m|$.
If $m = 1$, each haplotype can mate with at most one other haplotype. Biologically this means each parent can donate one of two haplotypes to a unique child, so this haplotype cannot be used to form another child. So, for $m = 1$ the neighborhoods of the genotypes in an optimal subgraph are disjoint, and the smallest number of haplotypes that can resolve $G$ is $\varphi(1) = 2|G| - u$, where $u$ is the number of unambiguous genotypes.
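As a small worked instance of the m = 1 count, the sketch below evaluates twice the number of genotypes minus the number of unambiguous ones. The function name `phi_1` is hypothetical, and the genotypes are assumed to be distinct strings over A, B, and X, where a genotype with no X counts as unambiguous.

```python
def phi_1(genotypes):
    """Haplotypes needed when no haplotype may have more than one mate."""
    distinct = set(genotypes)                     # G is treated as a set
    u = sum(1 for g in distinct if "X" not in g)  # unambiguous genotypes
    return 2 * len(distinct) - u


if __name__ == "__main__":
    # Three genotypes, one of them unambiguous: 2 * 3 - 1 haplotypes.
    print(phi_1(["AXB", "AAB", "XXB"]))
```

Each ambiguous genotype contributes two distinct haplotypes, while an unambiguous genotype is formed by a single haplotype mating with itself, which is where the subtraction of u comes from.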

Properties of $\varphi(m)$
At some threshold, increasing $m$ does not change the cardinality of $H^m$. Hence, for some $m$, $\varphi(m) = \varphi(m + k)$ for every natural number $k$.
The smallest $m$ such that $\varphi(m) = \varphi(m + k)$ for all $k \in \mathbb{N}$ is denoted by $m^*$. So, if $m \ge m^*$, we have that $\varphi(m) = \varphi(m^*)$.
Calculating $\varphi(m^*)$ solves the Pure Parsimony problem and indicates the least amount of mating needed.
Increasing the number of mates that any haplotype is allowed never causes an increase in $|H^m|$. Thus, $\varphi(m) \ge \varphi(m + 1)$ for all $m$, and $\varphi$ is non-increasing.
No haplotype can mate with more than $|G|$ haplotypes, and hence $m^* \le |G|$.
If no haplotype reconciles more than one genotype, $m^* = 1$.

What if $m^*$ is at its Upper Bound
Theorem: If $m^* = |G|$, we have that
$$\varphi(m^*) = \begin{cases} |G|, & \text{if } h \oplus h = g \text{ for some } h \in H^{m^*} \\ |G| + 1, & \text{otherwise.} \end{cases}$$

Calculating $\varphi(2)$
Step 1: Set $v = 0$ and $(H^v, G^v) = (H, G)$.
Step 2: Find the longest path in $(H^v, G^v)$, say $P^v$. If no path exists, set $P^v = \emptyset$.
Step 3: If $P^v = \emptyset$, stop.
Step 4: Set $(H^{v+1}, G^{v+1}) = (H^v, G^v) \setminus P^v$.
Step 5: Index $v$ by 1.
Step 6: Go to Step 2.
This greedy algorithm iteratively removes the longest paths in a diversity graph.
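The loop described in these steps can be sketched as below. This is a simplified illustration under stated assumptions, not the authors' implementation: the graph is an undirected acyclic graph given as an adjacency dictionary, a path must contain at least one edge, the longest path is found by exhaustive depth-first search (adequate only for small examples), and the code treats the graph purely combinatorially, so it does not check that consecutive haplotypes on a path are mates. The node names are hypothetical.

```python
def longest_path(adj):
    """Longest simple path (as a list of nodes) in a small graph; [] if there is no edge."""
    best = []

    def dfs(node, path, seen):
        nonlocal best
        if len(path) > len(best):
            best = path[:]
        for nxt in sorted(adj[node]):
            if nxt not in seen:
                seen.add(nxt)
                path.append(nxt)
                dfs(nxt, path, seen)
                path.pop()
                seen.remove(nxt)

    for start in sorted(adj):
        dfs(start, [start], {start})
    return best if len(best) > 1 else []


def greedy_paths(adj):
    """Repeatedly remove a longest path; return the list of removed paths."""
    adj = {u: set(vs) for u, vs in adj.items()}  # work on a copy
    paths = []
    while True:
        p = longest_path(adj)
        if not p:
            return paths
        paths.append(p)
        for u in p:  # delete the path's nodes and their incident edges
            for w in adj.pop(u):
                if w in adj:
                    adj[w].discard(u)


if __name__ == "__main__":
    # Hypothetical acyclic graph: h1-g1-h2-g2-h3 and h4-g3-h5.
    graph = {
        "h1": {"g1"}, "g1": {"h1", "h2"}, "h2": {"g1", "g2"},
        "g2": {"h2", "h3"}, "h3": {"g2"},
        "h4": {"g3"}, "g3": {"h4", "h5"}, "h5": {"g3"},
    }
    for path in greedy_paths(graph):
        print(path)
```

In this toy graph the loop removes v = 2 paths covering three genotypes and five haplotypes, so the number of haplotypes equals the number of genotypes plus the number of paths.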

The Greedy Algorithm Works
Theorem: The greedy algorithm finds an optimal subgraph of the acyclic diversity graph $(H, G, E)$. Moreover, if $v$ is the number of paths found by the algorithm, $\varphi(2) = |G| + v$.
Proof: The proof follows by induction on $|G|$.

An Example
The paths through the genotypes
g_1 = AXBBBB
g_2 = XAXXBB
g_3 = BXAXBX
g_4 = BXXAXB
g_5 = BBBXAB
g_6 = BBXBBA
must pass through these genotypes as indicated below.
[Figure: a graph on the genotypes g_1, g_2, g_3, g_4, g_5, g_6 showing how paths can pass through them.]

Path Decompositions

First Path's Genotype Progression    Second Path's Genotype Progression
(g_1, g_2, g_3, g_4, g_5)            (g_6)
(g_1, g_2, g_4, g_5)                 (g_3, g_6)
(g_1, g_2, g_3, g_6)                 (g_4, g_5)
(g_6, g_3, g_4, g_5)                 (g_1, g_2)

The greedy algorithm finds the first solution in the table, as the first path is as long as possible. None of the other paths have this property, and so the algorithm is not capable of finding these solutions.

Future Directions
How quickly does $\varphi(m)$ decrease as $m$ grows?
We see from the Theorem that knowing $m^*$ can solve the Pure Parsimony problem in some cases. Moreover, knowing $m^*$ is beneficial in all cases, as this removes many subgraphs from consideration. So, in an integer programming formulation of the Pure Parsimony problem, $m^*$ provides a cut that may help reduce solution times. Finding bounds on $m^*$ is an interesting area of future work.
Randomized coloring algorithms have been efficient on many classes of graphs, and it may be that finding longest paths and cycles can be thought of as a coloring problem. If so, these techniques could be used to approximate the greedy algorithm, with the hope that substantial biological models could be addressed.

Haplotyping an Individual
The hope in the future is to (partially) construct an individual's unique genetic information to design patient-specific drugs, screening methodologies, and other therapies.
DNA sequencing machines are not capable of sequencing the entire DNA strand at once. Instead, the DNA strand is replicated and sequenced in smaller fragments (1,000-30,000 individual nucleotides). This process is called shotgun sequencing.

Sequencing machines make errors (analysis is based on probabilistic determination) and return information similar to that in the following matrix.

            SNP
Fragment    1  2  3  4  5  6  7
1           A  B  A  A  -  -  -
2           B  A  B  B  -  -  -
3           -  B  A  -  B  A  -
4           -  -  B  B  B  B  -
5           -  -  -  B  A  B  B
6           -  -  -  A  B  B  A
7           -  -  -  -  A  B  A
8           -  -  -  -  B  A  A

The A's and B's represent heterozygous or homozygous SNPs, and a dash indicates that the sequencer was not capable of deciding between an A and a B.
NOTE: Errors can occur at any position.

The problem is to find a 2-set partition of the fragments (rows) so that the fragments within each set are not in conflict, i.e., they agree in all spots (dashes cannot cause conflicts). Correcting or removing errors is required to form the two haplotypes. One way is to remove (the fewest) fragments or SNPs until we can construct the haplotypes. These problems are known to be polynomial with gapless data and APX-hard otherwise (Lancia et al.).

            SNP
Fragment    1  2  3  4  5  6  7
1           A  B  A  A  -  -  -
2           B  A  B  B  -  -  -
3           -  B  A  -  B  A  -
4           -  -  B  B  B  B  -
5           -  -  -  B  A  B  B
6           -  -  -  A  B  B  A
7           -  -  -  -  A  B  A
8           -  -  -  -  B  A  A

Instead of removing information, we address the problem of changing the fewest positions that allows the haplotypes to be constructed. The different haplotypes are indicated by red and blue. The green letters are changed to form the haplotypes.

Conflict Graphs
For a collection of haplotypes, $H$, the conflict graph, $CG(H)$, is the graph whose nodes are haplotypes and in which two haplotypes are connected by an edge if they are in conflict.
[Figure: the conflict graph for H = {AB-A, A--A, AAAA, BBA-}, shown together with $\hat{H}$ = {AB-A, B--A, AAAA, BBA-}.]
The Conflict Graph Minimum Letter Flip (CGMLF) problem is to find $\hat{H}$ such that $CG(\hat{H})$ is a bipartite graph and $z(H, \hat{H})$ is as small as possible, where $z$ is a measure of the distance from $H$ to $\hat{H}$.
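The connection between conflict graphs and the 2-set partition can be illustrated with a short sketch. It builds the conflict graph of a list of fragments and tries to 2-color it by breadth-first search; a valid coloring is exactly a partition into two conflict-free sets. The function names are hypothetical, and fragments are assumed to be equal-length strings over A, B, and "-" for an undetermined position.

```python
from collections import deque


def in_conflict(f1, f2):
    """Two fragments conflict if they differ at a SNP where both are determined."""
    return any(a != "-" and b != "-" and a != b for a, b in zip(f1, f2))


def conflict_graph(frags):
    """Adjacency sets of the conflict graph, with fragments indexed from 0."""
    adj = {i: set() for i in range(len(frags))}
    for i in range(len(frags)):
        for j in range(i + 1, len(frags)):
            if in_conflict(frags[i], frags[j]):
                adj[i].add(j)
                adj[j].add(i)
    return adj


def two_coloring(adj):
    """Return a dict node -> 0/1 if the graph is bipartite, otherwise None."""
    color = {}
    for start in adj:
        if start in color:
            continue
        color[start] = 0
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in color:
                    color[v] = 1 - color[u]
                    queue.append(v)
                elif color[v] == color[u]:
                    return None  # odd cycle: no conflict-free 2-set partition
    return color


if __name__ == "__main__":
    # The eight fragments from the sequencing matrix in the slides.
    fragments = ["ABAA---", "BABB---", "-BA-BA-", "--BBBB-",
                 "---BABB", "---ABBA", "----ABA", "----BAA"]
    print(two_coloring(conflict_graph(fragments)))
```

For the slides' matrix the coloring fails (fragments 3, 4, and 5 conflict pairwise), which is consistent with the statement that errors must be corrected or removed before the two haplotypes can be formed.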

The P-Median Problem
The p-median problem is to find $p$ nodes in a (directed) graph with edge distances $d_{(i,j)}$ such that the aggregate distance from any node to one of these medians is as small as possible.
[Figure: an undirected graph with the medians, and the edges associating them to nodes, highlighted in red. Each edge has a distance, and this choice of medians minimizes the distance along the red edges.]
The p-median problem is known to be NP-hard, but for a fixed $p$ it is polynomial. In particular, the 2-median problem is $O(n^2)$.

Distances
We let $d(h^i, h^j)$ be the number of SNPs where one is an A and the other is a B.
We let $l(h^i, h^j)$ be the nonsymmetrical sum of the number of SNPs where the supposedly certain information of an A or a B in $h^i$ disagrees with the symbol in $h^j$.

Example, with the per-SNP contributions listed under each haplotype pair:

            d                   l
h^i    =    A  B  -  A          A  B  -  A
h^j    =    B  -  -  A          B  -  -  A
            1  0  0  0          1  1  0  0
d(h^i, h^j) = 1                 l(h^i, h^j) = 2
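The two distances can be written as short functions. This is a sketch under assumptions: the names `d` and `l` follow the slide, fragments are equal-length strings over A, B, and "-" (the dash standing for an undetermined position), and the example strings AB-A and B--A are one reading of the slide's example that reproduces its values of 1 and 2.

```python
def d(hi, hj):
    """Number of SNPs where one fragment has an A and the other a B (symmetric)."""
    return sum(1 for a, b in zip(hi, hj) if {a, b} == {"A", "B"})


def l(hi, hj):
    """Number of SNPs where a determined letter in hi differs from the symbol in hj.

    A dash in hj counts as a disagreement with a determined letter in hi,
    while a dash in hi contributes nothing, so l is not symmetric.
    """
    return sum(1 for a, b in zip(hi, hj) if a in ("A", "B") and a != b)


if __name__ == "__main__":
    hi, hj = "AB-A", "B--A"
    print(d(hi, hj), l(hi, hj), l(hj, hi))
```

Swapping the arguments of `l` changes the result in this example, which shows why the graph built from `l` has to be directed.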

Problem Statements
MLF (medians in $H$): The 2-median problem on the complete graph $K_H = (H, H \times H)$, with edge distances defined by $d$. This problem is $O(m^3)$.
MLF (medians in $\{A, B, -\}^n$): The 2-median problem on the graph $(\{A, B, -\}^n, H \times \{A, B, -\}^n)$, with edge distances defined by $d$. (This is not a very interesting problem because one median is always....)
DMLF (medians in $H$): The 2-median problem on the directed complete graph $K_H = (H, H \times H)$, with edge distances defined by $l$. This problem is $O(m^3)$.
DMLF (medians in $\{A, B, -\}^n$): The 2-median problem on the directed graph $(\{A, B, -\}^n, H \times \{A, B, -\}^n)$, with edge distances defined by $l$. This problem is $O(3^{3n})$.
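For the variants whose medians are chosen among the fragments themselves, the 2-median problem can be solved by enumerating all pairs of candidate medians and summing, over every fragment, the distance to its nearer median, which gives the cubic running time quoted above. The sketch below is an illustration with hypothetical names (`two_median`, and a symmetric distance `d` as defined on the Distances slide); the fragments in the example are made up, and a directed distance could be passed instead of `d`.

```python
from itertools import combinations


def d(hi, hj):
    """Number of SNPs where one fragment has an A and the other a B."""
    return sum(1 for a, b in zip(hi, hj) if {a, b} == {"A", "B"})


def two_median(nodes, dist):
    """Brute-force 2-median: return (cost, (i, j)) for the best pair of median indices.

    Each node is charged dist(node, median) for the nearer of the two medians.
    """
    best_cost, best_pair = None, None
    for i, j in combinations(range(len(nodes)), 2):
        cost = sum(min(dist(h, nodes[i]), dist(h, nodes[j])) for h in nodes)
        if best_cost is None or cost < best_cost:
            best_cost, best_pair = cost, (i, j)
    return best_cost, best_pair


if __name__ == "__main__":
    fragments = ["AAB", "AAB", "BBA", "BBA", "ABA"]  # hypothetical data
    print(two_median(fragments, d))
```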

Theoretical Results
Theorem: We have that
MLF (medians in $\{A, B, -\}^n$) $\le$ MLF (medians in $H$) $\le$ DMLF (medians in $\{A, B, -\}^n$) $=$ CGMLF $\le$ DMLF (medians in $H$).
Hence, the exponential problem of finding the minimum number of flips is bounded by polynomial problems. The proof is not complete.
Theorem: If there is one SNP for each median haplotype that is sampled correctly in all fragments, then MLF = DMLF.
So, CGMLF is solvable in polynomial time under normal practice; i.e., a sampling machine would need to misclassify some fragment for every SNP for the problem to fail to be polynomial.

Thank you for your time, please ask questions