DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

Similar documents
Lecture 15. Saad Mneimneh

Bio nformatics. Lecture 16. Saad Mneimneh

Graph Algorithms in Bioinformatics

Fragment Assembly of DNA

Lecture 15: Realities of Genome Assembly Protein Sequencing

FRAGMENT ASSEMBLY OF DNA

Bio nformatics. Lecture 3. Saad Mneimneh

Physical Mapping. Restriction Mapping. Lecture 12. A physical map of a DNA tells the location of precisely defined sequences along the molecule.

Problem: Shortest Common Superstring. The Greedy Algorithm for Shortest Common Superstrings. Overlap graphs. Substring-freeness

Approximating Shortest Superstring Problem Using de Bruijn Graphs

CMPSCI 311: Introduction to Algorithms Second Midterm Exam

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

Pattern Matching (Exact Matching) Overview

Data Structures in Java

Modelling and Analysis in Bioinformatics. Lecture 1: Genomic k-mer Statistics

NP-Complete Reductions 3

10.4 The Kruskal Katona theorem

Practical Bioinformatics

Trees. A tree is a graph which is. (a) Connected and. (b) has no cycles (acyclic).

Greedy Conjecture for Strings of Length 4

Hierarchical Overlap Graph

8.3 Hamiltonian Paths and Circuits

Lecture 1 : Data Compression and Entropy

Algorithm Design and Analysis

Limitations of Algorithm Power

Lecture 4: NP and computational intractability

Towards More Effective Formulations of the Genome Assembly Problem

Regulatory Sequence Analysis. Sequence models (Bernoulli and Markov models)

1 More finite deterministic automata

10.3 Matroids and approximation

Lecture 3: String Matching

Three new strategies for exact string matching

Deterministic Finite Automata (DFAs)

What we have done so far

Preliminaries. Graphs. E : set of edges (arcs) (Undirected) Graph : (i, j) = (j, i) (edges) V = {1, 2, 3, 4, 5}, E = {(1, 3), (3, 2), (2, 4)}

NP-Completeness. Until now we have been designing algorithms for specific problems

Deterministic Finite Automata (DFAs)

1 Matchings in Non-Bipartite Graphs

NP-Complete problems

CMSC 451: Lecture 7 Greedy Algorithms for Scheduling Tuesday, Sep 19, 2017

Number-controlled spatial arrangement of gold nanoparticles with

CSE 549: Computational Biology. Computer Science for Biologists Biology

Approximation Algorithms for Asymmetric TSP by Decomposing Directed Regular Multigraphs

VIII. NP-completeness

Supplemental data. Pommerrenig et al. (2011). Plant Cell /tpc

CMPSCI611: The Matroid Theorem Lecture 5

CSE 431/531: Analysis of Algorithms. Dynamic Programming. Lecturer: Shi Li. Department of Computer Science and Engineering University at Buffalo

1. Introduction Recap

Lecture 18: More NP-Complete Problems

Easy Problems vs. Hard Problems. CSE 421 Introduction to Algorithms Winter Is P a good definition of efficient? The class P

University of Toronto Department of Electrical and Computer Engineering. Final Examination. ECE 345 Algorithms and Data Structures Fall 2016

from notes written mostly by Dr. Matt Stallmann: All Rights Reserved

PCPs and Inapproximability Gap-producing and Gap-Preserving Reductions. My T. Thai

9/8/09 Comp /Comp Fall

Dynamic Programming. Shuang Zhao. Microsoft Research Asia September 5, Dynamic Programming. Shuang Zhao. Outline. Introduction.

Lecture 4: September 19

Theoretical Computer Science

Crick s early Hypothesis Revisited

COS597D: Information Theory in Computer Science October 19, Lecture 10

Theoretical Computer Science

NP-Complete Problems. More reductions

Average Case Analysis of QuickSort and Insertion Tree Height using Incompressibility

BBM402-Lecture 11: The Class NP

Undecidable Problems. Z. Sawa (TU Ostrava) Introd. to Theoretical Computer Science May 12, / 65

High throughput near infrared screening discovers DNA-templated silver clusters with peak fluorescence beyond 950 nm

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

1 Some loose ends from last time

Homework Assignment 4 Solutions

NP-Complete Problems and Approximation Algorithms

Structure-Based Comparison of Biomolecules

Advanced topics in bioinformatics

OPTIMAL PARALLEL SUPERPRIMITIVITY TESTING FOR SQUARE ARRAYS

Lecture 2: Pairwise Alignment. CG Ron Shamir

Topics in Approximation Algorithms Solution for Homework 3

UC Berkeley CS 170: Efficient Algorithms and Intractable Problems Handout 22 Lecturer: David Wagner April 24, Notes 22 for CS 170

CSE 202 Homework 4 Matthias Springer, A

Intractable Problems Part Two

List of Theorems. Mat 416, Introduction to Graph Theory. Theorem 1 The numbers R(p, q) exist and for p, q 2,

Searching Sear ( Sub- (Sub )Strings Ulf Leser

arxiv: v1 [cs.cc] 4 Feb 2008

Analysis and Design of Algorithms Dynamic Programming

arxiv: v1 [cs.dm] 18 May 2016

Chapter 34: NP-Completeness

CSE 549: Computational Biology. Computer Science for Biologists Biology

GRAPH ALGORITHMS Week 3 (16-21 October 2017)

Automata & languages. A primer on the Theory of Computation. Laurent Vanbever. ETH Zürich (D-ITET) September,

SEQUENCE ALIGNMENT BACKGROUND: BIOINFORMATICS. Prokaryotes and Eukaryotes. DNA and RNA

CS 320, Fall Dr. Geri Georg, Instructor 320 NP 1

Lecture 7: The Satisfiability Problem

Avoiding Approximate Squares

Graph Theory and Optimization Computational Complexity (in brief)

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Lyndon Words and Short Superstrings

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

1 Computational Problems

A Polynomial Time Algorithm for Parsing with the Bounded Order Lambek Calculus

P, NP, NP-Complete. Ruth Anderson

String Matching with Variable Length Gaps

Complexity, P and NP

Algorithms Design & Analysis. Approximation Algorithm

Transcription:

Computat onal Biology Lecture 15 DNA sequencing Shortest common superstring SCS An elegant theoretical abstraction, but fundamentally flawed R. Karp Given a set of fragments F, Find the shortest string s that contains every f F as a substring This is NP-hard The SCS might not be what we really want Bad example (repeats) X X Shortest common superstring will give: X X

Solving SCS We are going to consider a Hamiltonian path approach to solving the SCS problem Overlap graph Consider the complete directed weighted graph G = (V, E), called the overlap graph V = F (each fragment is a vertex) (u,v) E with weight -t iff t is the length of the maximal suffix of u that is a prefix of v We allow self loops and zero weight edges TACGA a -2 c CTAAAG 0 weight edges not shown d GACA ACCC b a = TACGA b = ACCC c = CTAAAG d = GAGC

A path defines a superstring Every simple path P in the overlap graph involving a set of vertices (fragments) A defines a superstring s(p) for the set A. Therefore, a Hamiltonian path in the overlap graph defines a superstring for the set of fragments F. A Hamiltonian path must exist because the graph is complete (how many do we have?). TACGA a -2 c CTAAAG 0 weight edges not shown d GACA ACCC b a = TACGA b = ACCC c = CTAAAG d = GAGC s(p): P = adbc TACGA GACA ACCC CTAAAG ---------------- TACGACACCCTAAAG Does a superstring define a path? We have seen that every Hamiltonian path corresponds to a superstring. Is the converse true? No: A superstring can contain arbitrary characters that are not present in any fragments Does a shortest superstring correspond to a Hamiltonian path? Yes: if F is substring-free, i.e. no fragment in F is contained in another

G b 0 AGC a 0 0 0 0 c CT The shortest superstring is AGCT There is no Hamiltonian path P, such that s(p) = AGCT Subtring-free collection F Let F be a substring free set, then for every shortest superstring s, there is a Hamiltonian path P, such that s(p) = s. Proof: assume the fragments appear in s as follows (no gaps and no one can be contained in another) this must be the max overlap between a a and b b c d s -t 1 -t 2 0 t 1 t 2 Ham path: a b c d etc Non substring-free F If F is not substring-free, then we can remove all fragments from F that are substrings of other fragments We end up with a set F But any superstring of F is a superstring of F Therefore, we can use F

Length of string v.s. weight of path Let P be a Hamiltonian path. Let w(p) be the weight of P. Let F = Σ a F a Then s(p) = F + w(p) [proof is simple] Therefore, the shortest common superstring corresponds to the Hamiltonian path with minimum weight Proof Let P be a Hamiltonian path with minimum weight we need to show that s(p) is a shortest superstring Let s be a shortest superstring with s < s(p) Then there is a Hamiltonian path P such that s = s(p ) s(p ) = F + w(p ) < s(p) = F + w(p) Therefore, w(p ) < w(p), contradiction Hamiltonian path approach Finding a minimum weight Hamiltonian path is NP-hard (you can reduce HAMPATH to it) Unfortunately, there is no better approach to solve SCS, because SCS itself is NP-hard Let s consider a greedy algorithm for finding a Hamiltonian path

Greedy algorithm Greedy: start with an empty path repeatedly add the least weighted available edge until you get a Hamiltonian path Every time we add an edge (u,v), we need to check: (u,v) does not create a cycle with the previously added edges u has no previously added outgoing edge v has no previously added incoming edge Greedy algorithm sort edges by their weight: e 1, e 2, e E for all v V in(v) 0 out(v) 0 H φ i 1 while H < F 1 (u,v) e i if out(u) = 0 and in(v) = 0 then To build the graph: trivially O( F 2 ) (could be done optimally in O(n 2 + F ) using suffix trees) To run the algorithm: O(n 2 logn) if H e i does not contain a cycle [disjoint set data structure] H H e i out(u) 1 in(v) 1 i i + 1 ATGC -2 GCC 0-2 -3 TGCAT Greedy algorithm will choose: ATGC TGCAT GCC ATGCATGCC Optimal is: TGCAT ATGC GCC TGCATGCC

Sequncing By Hybridization SBH Use all possible probes of length l and obtain hybridization data with the DNA. If no errors, we have all substrings of length l. We would like to reconstruct the DNA from those substrings. We can formalize this as SCS and solve it as before. But we can simplify a little bit SBH and SCS SBH is a special case of the SCS problem where all fragments of F have the same length l. In the overlap graph, we will keep only the edges with weights equal to (l 1). By construction of these fragments, we know that there must be a Hamiltonian path in this modified overlap graph. All Hamiltonian paths now have the same weight = (n 1)(l 1) Thus we only need to find a Hamiltonian path (still NP-complete) l = 3 ATG TGG TGC GTG GGC GCA GCG CGT

Idea Instead of representing fragments as vertices, represent them as edges. Then, instead of looking for a Hamiltonian path (a path that goes through each vertex once), look for an Euler path (a path that goes through each edge once). Euler path can be found in linear time. Fragments as edges Construct a directed graph G = (V, E) V: (l 1) length fragments (these can be obtained from our set F by considering the first and last l 1 characters of each fragment) E: A directed edge (u,v) for each fragment in F that starts with u and ends with v l = 3 ATG TGG TGC GTG GGC GCA GCG CGT GT CG AT TG GC CA GG

Euler Cycle By construction of the fragments, we know that the graph will have all vertices balanced except possibly for two unbalanced vertices (each occurrence of an l fragment is shared by two l length fragments, except possibly for the first and last one) By adding an edge between two unbalanced vertices we can make the graph balanced Then we can find an Euler cycle in the graph (since it is balanced, there is one) l = 3 ATG TGG TGC GTG GGC GCA GCG CGT GT CG ATGCGTGGCA AT TG GC CA GG