Bioinformatics. Lecture 16. Saad Mneimneh


DNA sequencing
To sequence a DNA molecule is to obtain the string of bases that it contains. It is impossible to sequence the whole DNA molecule directly. We may, however, obtain a piece of a certain length cut at random and sequence it. This is called a fragment. By using cloning and cutting techniques we can obtain a large number of sequenced fragments. The goal is to reconstruct the DNA molecule from the overlaps among the fragments.

Ideal case
We know the length of the DNA (e.g. 10 bases).
There are no errors in sequencing the fragments.
Fragments: ACCGT, CGTGC, TTAC, TACCGT
Align the sequences, ignoring end gaps:
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
Find the consensus by majority voting:
    TTACCGTGC
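The majority-voting step can be sketched in Python as follows. This is a minimal illustration, not part of the lecture: the function name `consensus` and the convention that `-` marks a gap are assumptions, and the alignment itself is taken as given rather than computed.

```python
from collections import Counter

def consensus(rows):
    """Column-wise majority vote over equal-length aligned rows.

    '-' marks a gap. End gaps (padding before the first or after the last
    base of a row) do not vote; internal gaps vote like any other symbol.
    Columns won by a gap are dropped from the result.
    """
    # For each row, the first and last column that holds a real base.
    spans = []
    for row in rows:
        bases = [i for i, ch in enumerate(row) if ch != "-"]
        spans.append((bases[0], bases[-1]))

    out = []
    for col in range(len(rows[0])):
        votes = Counter(
            row[col] for row, (lo, hi) in zip(rows, spans) if lo <= col <= hi
        )
        if not votes:
            continue  # no fragment covers this column
        winner = votes.most_common(1)[0][0]
        if winner != "-":
            out.append(winner)
    return "".join(out)

ideal = ["--ACCGT--", "----CGTGC", "TTAC-----", "-TACCGT--"]
print(consensus(ideal))
```

Ties between symbols are broken arbitrarily here; a real assembler would weigh base quality as well.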

Insertion errors
Fragments: ACCGT, CAGTGC, TTAC, TACCGT
    --ACC-GT--
    ----CAGTGC
    TTAC------
    -TACC-GT--
    TTACC-GTGC   (consensus)
An A was inserted in the second fragment.
The gap in the consensus is discarded.
In this example the consensus is still correct because of majority voting.

Deletion error
Fragments: ACCGT, CGTGC, TTAC, TACCGT (sequenced as TACGT)
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TAC-GT--
    TTACCGTGC   (consensus)
The first C was deleted from the 4th fragment.
The consensus is still correct.

Chimeric fragment
Two disjoint fragments join to form one fragment that is not originally part of the DNA.
Fragments: ACCGT, CGTGC, TTAC, TACCGT, TTATGC
    --ACCGT--
    ----CGTGC
    TTAC-----
    -TACCGT--
    TTACCGTGC   (consensus)
    TTA---TGC   (chimeric fragment)

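Unknown orientation
Which strand does a particular fragment belong to?
Fragments: CACGT, ACGT, ACTACG, GTACT, ACTGA, CTGA
    CACGT
    -ACGT
    --CGTAGT        (reverse complement of ACTACG)
    -----AGTAC      (reverse complement of GTACT)
    --------ACTGA
    ---------CTGA
    CACGTAGTACTGA   (consensus)
With n fragments we have 2^n possible choices of orientation.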
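As a small illustration of the orientation problem, the sketch below computes reverse complements and enumerates all orientation choices. The helper names `reverse_complement` and `orientations` are hypothetical, and the alphabet is assumed to be exactly A, C, G, T.

```python
from itertools import product

# Assumption: fragments contain only the four upper-case bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(fragment):
    """Complement every base, then reverse the string."""
    return fragment.translate(COMPLEMENT)[::-1]

def orientations(fragments):
    """Return an iterator over all 2^n ways to keep or flip each fragment."""
    choices = [(f, reverse_complement(f)) for f in fragments]
    return product(*choices)

print(reverse_complement("ACTACG"))
print(sum(1 for _ in orientations(["CACGT", "ACGT", "ACTACG"])))
```

Enumerating all orientations is only feasible for a handful of fragments, which is why the models later in the lecture build orientation into the problem statement instead.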

Repeats
Repeats of the form X ... X ... X: the same fragments are consistent with two different layouts of the DNA.
    A X B X C X D
    A X C X B X D

Repeats
Repeats of the form X ... Y ... X ... Y: again two layouts are consistent with the same fragments.
    A X B Y C X D Y E
    A X D Y C X B Y E

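Inverted repeats
An inverted repeat is a region X that occurs again as its reverse complement.
Example: X = CGA, and its reverse complement is TCG.
[Figure: a DNA containing X and, further along, the inverted copy of X.]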

Lack of coverage
[Figure: two contigs separated by an uncovered area.]
When part of the DNA is not covered by any fragment, we have more than one contig.

Number of fragments
It is important to know how many fragments we need to generate in order to achieve a certain coverage.
Let T denote the length of the DNA. Assume all fragments have length l and that we can detect overlaps of at least t bases.
If we sample n fragments at random, what is the expected number of contigs?
    E[# contigs] ≈ n e^(-n(l - t)/T)
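The estimate above is easy to evaluate numerically. The sketch below is only an illustration: the function name `expected_contigs` and the sample values of n, l, t and T are made up for the example, not taken from the lecture.

```python
import math

def expected_contigs(n, l, t, T):
    """Expected number of contigs: n * exp(-n * (l - t) / T).

    n: number of fragments, l: fragment length,
    t: minimum detectable overlap, T: length of the DNA.
    """
    return n * math.exp(-n * (l - t) / T)

# Hypothetical numbers: a 100,000-base molecule, 500-base fragments,
# overlaps detectable from 50 bases.
for n in (200, 500, 1000, 2000):
    print(n, round(expected_contigs(n, l=500, t=50, T=100_000), 2))
```

The value first grows with n (more fragments, more isolated islands) and then falls as the fragments start to merge into fewer contigs.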

Alternative methods
Shortest common superstring (SCS): "An elegant theoretical abstraction, but fundamentally flawed" (R. Karp)
Generalized SCS: models errors and orientations
Multicontig: models errors, orientations, and coverage

SCS
Given a set of fragments F,
find the shortest string s that contains every f ∈ F as a substring.
This is NP-hard.
The SCS might not be what we really want.

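Bad example (repeats)
[Figure: a DNA with regions labeled X, and the string that the shortest common superstring will give for its fragments, also with regions labeled X.]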

Generalized SCS
Given a set of fragments F,
find the shortest string s that contains either f or f̄ (the reverse complement of f) as a substring, for every f ∈ F.
Now it models orientations.

Generalized SCS (cont.)
Given a set of fragments F, ε > 0, and a distance function d,
find the shortest string s that contains, for every f ∈ F, a substring x such that
    min[d(f, x), d(f̄, x)] ≤ ε|f|
where f̄ is the reverse complement of f.
Now it models both orientations and errors.

Multicontig
For a given set of fragments F, a contig is a multiple alignment containing either f or f̄ (the reverse complement of f) for every f ∈ F.
A contig has an ε-consensus iff each fragment f (or f̄) differs from its image in the consensus by at most ε|f|.
A contig is a t-contig if the smallest overlap that is not contained in any fragment is at least t (t is a measure of coverage).

Multicontig
Given a set of fragments F, ε > 0, and t > 0,
partition F into a minimum number of subsets such that each subset has a t-contig with an ε-consensus.
This is NP-hard.
This models errors, orientations, and coverage.

Solving SCS
We are going to consider a Hamiltonian path approach to solving the SCS problem.

Overlap graph
Consider the complete directed weighted graph G = (V, E), called the overlap graph.
V = F (each fragment is a vertex).
(u, v) ∈ E with weight -t iff t is the length of the maximal suffix of u that is a prefix of v.
We allow self loops and zero-weight edges.
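A direct, quadratic-time way to build this graph is sketched below. The names `overlap` and `overlap_graph` are hypothetical, and one detail is an assumption: for a self loop (u = v) only overlaps shorter than the whole string are counted. The sample fragments are the four used in the example that follows, with d = GACA as drawn in its figure.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v.

    Assumption: for u == v only proper overlaps (shorter than the
    whole string) are counted.
    """
    longest = min(len(u), len(v)) - (1 if u == v else 0)
    for t in range(longest, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def overlap_graph(fragments):
    """Complete directed graph as a dict: (u, v) -> weight -overlap(u, v)."""
    return {(u, v): -overlap(u, v) for u in fragments for v in fragments}

graph = overlap_graph(["TACGA", "ACCC", "CTAAAG", "GACA"])
for (u, v), w in graph.items():
    if w != 0:
        print(u, "->", v, w)
```

This naive version compares every pair of fragments character by character; it is meant to show the definition, not to be fast.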

Example
    a = TACGA   b = ACCC   c = CTAAAG   d = GACA
Edges of the overlap graph (0-weight edges not shown):
    a -> d   -2
    a -> b   -1
    b -> c   -1
    c -> d   -1
    d -> b   -1

A path defines a superstring
Every simple path P in the overlap graph involving a set of vertices (fragments) A defines a superstring s(P) for the set A.
Therefore, a Hamiltonian path in the overlap graph defines a superstring for the set of fragments F.
A Hamiltonian path must exist because the graph is complete (how many do we have?).

Example
    a = TACGA   b = ACCC   c = CTAAAG   d = GACA
(overlap graph as in the previous example; 0-weight edges not shown)
For the path P = a d b c, s(P) is obtained by overlapping consecutive fragments:
    TACGA
       GACA
          ACCC
             CTAAAG
    ---------------
    TACGACACCCTAAAG
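The construction of s(P) from a path can be sketched as below. The helper names `overlap` and `superstring` are hypothetical; the path is given as a list of fragments in the order they are visited.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v
    (proper overlaps only when u == v; an assumption of this sketch)."""
    longest = min(len(u), len(v)) - (1 if u == v else 0)
    for t in range(longest, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def superstring(path):
    """s(P): glue consecutive fragments, writing each overlap only once."""
    s = path[0]
    for u, v in zip(path, path[1:]):
        s += v[overlap(u, v):]
    return s

# P = a d b c from the example above
print(superstring(["TACGA", "GACA", "ACCC", "CTAAAG"]))
```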

Does a superstring define a path?
We have seen that every Hamiltonian path corresponds to a superstring. Is the converse true?
No: a superstring can contain arbitrary characters that are not present in any fragment.
Does a shortest superstring correspond to a Hamiltonian path?
Yes, if F is substring-free, i.e. no fragment in F is contained in another.

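Example
    a = AGC   b = G   c = CT
The only nonzero edge of the overlap graph is a -> c with weight -1; all other edges have weight 0.
The shortest superstring is AGCT.
There is no Hamiltonian path P such that s(P) = AGCT: b is a substring of a, so every Hamiltonian path spells G separately and gives a string of length at least 5 (e.g. GAGCT for P = b a c).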

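Substring-free collection F
Let F be a substring-free set. Then for every shortest superstring s, there is a Hamiltonian path P such that s(P) = s.
Proof: assume the fragments appear in s in the order a, b, c, d, ... (there are no gaps, and no fragment can be contained in another).
[Figure: fragments a, b, c, d laid out along s; consecutive fragments overlap by t_1, t_2, ...]
The overlap between a and b in s must be the maximal overlap between a and b, since otherwise s could be shortened. The same holds for every consecutive pair.
Hamiltonian path: a -> b -> c -> d -> ..., with edge weights -t_1, -t_2, ...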

Non substring-free F
If F is not substring-free, then we can remove from F all fragments that are substrings of other fragments.
We end up with a substring-free set F'.
But any superstring of F' is a superstring of F.
Therefore, we can use F'.
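The reduction to a substring-free set takes only a few lines. This is a minimal quadratic sketch; the function name `substring_free` is hypothetical, and duplicate fragments are collapsed to a single copy first.

```python
def substring_free(fragments):
    """Drop every fragment that occurs inside another, distinct fragment."""
    unique = list(dict.fromkeys(fragments))  # remove duplicates, keep order
    return [
        f for f in unique
        if not any(f != g and f in g for g in unique)
    ]

print(substring_free(["AGC", "G", "CT"]))
```

On the earlier example with a = AGC, b = G, c = CT this removes G, after which the Hamiltonian path a c spells the shortest superstring AGCT.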

Length of string vs. weight of path
Let P be a Hamiltonian path and let w(P) be the weight of P.
Let ||F|| = Σ_{a ∈ F} |a|.
Then |s(P)| = ||F|| + w(P) [the proof is simple: an edge of weight -t on P saves exactly t characters].
Therefore, the shortest common superstring corresponds to the Hamiltonian path with minimum weight.

Proof
Claim: if P is a Hamiltonian path with minimum weight, then s(P) is a shortest superstring (F substring-free).
Suppose not, and let s' be a shortest superstring with |s'| < |s(P)|.
Then there is a Hamiltonian path P' such that s' = s(P').
    |s(P')| = ||F|| + w(P') < |s(P)| = ||F|| + w(P)
Therefore, w(P') < w(P), a contradiction.
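For very small inputs the correspondence can be checked by brute force. The sketch below tries every Hamiltonian path (every ordering of the fragments), so it runs in factorial time and is only a sanity check; the names `path_weight` and `shortest_superstring` are hypothetical, and F is assumed to be substring-free.

```python
from itertools import permutations

def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    longest = min(len(u), len(v)) - (1 if u == v else 0)
    for t in range(longest, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def path_weight(path):
    """w(P): sum of the (negative) edge weights along the path."""
    return -sum(overlap(u, v) for u, v in zip(path, path[1:]))

def superstring(path):
    """s(P): glue consecutive fragments along the path."""
    s = path[0]
    for u, v in zip(path, path[1:]):
        s += v[overlap(u, v):]
    return s

def shortest_superstring(fragments):
    """Minimum-weight Hamiltonian path by exhaustive search (n! orderings)."""
    best = min(permutations(fragments), key=path_weight)
    return superstring(best)

F = ["ATGC", "GCC", "TGCAT"]
total = sum(len(f) for f in F)           # ||F||
for p in permutations(F):
    assert len(superstring(p)) == total + path_weight(p)
print(shortest_superstring(F))
```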

Hamiltonian path approach
Finding a minimum weight Hamiltonian path is NP-hard (you can reduce HAMPATH to it).
Unfortunately, we should not expect a fundamentally better exact approach to SCS, because SCS itself is NP-hard.
Let's consider a greedy algorithm for finding a Hamiltonian path.

Greedy algorithm
Greedy: start with an empty path and repeatedly add the least weighted available edge until you get a Hamiltonian path.
Every time we add an edge (u, v), we need to check that:
    (u, v) does not create a cycle with the previously added edges
    u has no previously added outgoing edge
    v has no previously added incoming edge

Greedy algorithm
    sort edges by their weight: e_1, e_2, ..., e_|E|
    for all v ∈ V
        in(v) ← 0
        out(v) ← 0
    H ← ∅
    i ← 1
    while |H| < |F| - 1
        (u, v) ← e_i
        if out(u) = 0 and in(v) = 0 then
            if H ∪ {e_i} does not contain a cycle [disjoint set data structure]
                H ← H ∪ {e_i}
                out(u) ← 1
                in(v) ← 1
        i ← i + 1
Running time (n = |F| is the number of fragments):
To build the graph: O(n ||F||) (could be done optimally in O(n^2 + ||F||) using suffix trees).
To run the algorithm: O(n^2 log n).

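Example
Fragments: ATGC, GCC, TGCAT
Nonzero edges of the overlap graph:
    ATGC -> TGCAT   -3
    ATGC -> GCC     -2
    TGCAT -> ATGC   -2
The greedy algorithm will choose ATGC -> TGCAT -> GCC, giving ATGCATGCC (length 9).
The optimal path is TGCAT -> ATGC -> GCC, giving TGCATGCC (length 8).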
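The greedy procedure can be sketched in Python as follows. This is one possible reading of the pseudocode, not the lecturer's code: the name `greedy_superstring` is hypothetical, self loops are skipped up front (they would be rejected by the cycle test anyway), fragments are assumed distinct and substring-free, and ties between equal-weight edges are broken by input order. Because of the tie-breaking, the path found on the example above may differ from the one on the slide while still having length 9.

```python
def overlap(u, v):
    """Length of the longest suffix of u that is also a prefix of v."""
    longest = min(len(u), len(v)) - (1 if u == v else 0)
    for t in range(longest, 0, -1):
        if u.endswith(v[:t]):
            return t
    return 0

def greedy_superstring(fragments):
    # All edges between distinct fragments, lightest (most overlap) first.
    edges = sorted(
        ((-overlap(u, v), u, v) for u in fragments for v in fragments if u != v),
        key=lambda e: e[0],
    )

    # Disjoint-set structure to detect cycles.
    parent = {f: f for f in fragments}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    succ = {}         # chosen outgoing edge of each fragment (out(u) = 1)
    has_pred = set()  # fragments with a chosen incoming edge (in(v) = 1)
    for _, u, v in edges:
        if len(succ) == len(fragments) - 1:
            break     # Hamiltonian path complete
        if u in succ or v in has_pred:
            continue
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge would close a cycle
        parent[ru] = rv
        succ[u] = v
        has_pred.add(v)

    # Walk the path from its unique start and spell the superstring.
    cur = next(f for f in fragments if f not in has_pred)
    s = cur
    while cur in succ:
        nxt = succ[cur]
        s += nxt[overlap(cur, nxt):]
        cur = nxt
    return s

result = greedy_superstring(["ATGC", "GCC", "TGCAT"])
print(result, len(result))
```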

Sequencing By Hybridization (SBH)
Use all possible probes of length l and obtain hybridization data with the DNA.
If there are no errors, we have all substrings of length l.
We would like to reconstruct the DNA from those substrings.
We can formalize this as SCS and solve it as before.
But we can simplify a little bit.

SBH and SCS
SBH is a special case of the SCS problem where all fragments of F have the same length l.
In the overlap graph, we keep only the edges with weight equal to -(l - 1).
By construction of these fragments, we know that there must be a Hamiltonian path in this modified overlap graph.
All Hamiltonian paths now have the same weight, -(n - 1)(l - 1), where n = |F|.
Thus we only need to find a Hamiltonian path (still NP-complete).

Example
l = 3
Fragments: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT

Idea
Instead of representing fragments as vertices, represent them as edges.
Then, instead of looking for a Hamiltonian path (a path that goes through each vertex once), look for an Euler path (a path that goes through each edge once).
An Euler path can be found in linear time.

Fragments as edges
Construct a directed graph G = (V, E).
V: the fragments of length l - 1 (these can be obtained from our set F by considering the first and last l - 1 characters of each fragment).
E: a directed edge (u, v) for each fragment in F that starts with u and ends with v.

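Example
l = 3
Fragments: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Vertices: AT, TG, GC, CA, GG, GT, CG
Edges:
    AT -> TG (ATG)    TG -> GG (TGG)    TG -> GC (TGC)    GT -> TG (GTG)
    GG -> GC (GGC)    GC -> CA (GCA)    GC -> CG (GCG)    CG -> GT (CGT)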
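Building this graph takes one pass over the fragments. A minimal sketch, with the hypothetical name `fragments_as_edges` and an adjacency-list dictionary as the assumed representation:

```python
from collections import defaultdict

def fragments_as_edges(fragments):
    """Map each (l-1)-length prefix to the (l-1)-length suffixes it points to.

    Every fragment of length l contributes one edge prefix -> suffix.
    Vertices without outgoing edges appear only as targets.
    """
    graph = defaultdict(list)
    for f in fragments:
        graph[f[:-1]].append(f[1:])
    return dict(graph)

graph = fragments_as_edges(["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"])
for u, targets in graph.items():
    print(u, "->", ", ".join(targets))
```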

Euler cycle
By construction of the fragments, we know that all vertices of the graph are balanced (in-degree equals out-degree) except possibly for two unbalanced vertices: each occurrence of a fragment of length l - 1 is shared by two fragments of length l, except possibly for the first and the last one.
By adding an edge between the two unbalanced vertices we can make the graph balanced.
Then we can find an Euler cycle in the graph (since it is balanced and, by construction, connected, there is one).

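Example
l = 3
Fragments: ATG, TGG, TGC, GTG, GGC, GCA, GCG, CGT
Vertices: AT, TG, GC, CA, GG, GT, CG
Euler path: AT -> TG -> GC -> CG -> GT -> TG -> GG -> GC -> CA
Reconstructed DNA: ATGCGTGGCA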
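An Euler path in this graph can be found with Hierholzer's algorithm, which is named here as a standard choice; the lecture does not say which algorithm it has in mind. The sketch below is a minimal version under stated assumptions: the name `reconstruct` is hypothetical, and instead of adding an extra edge to close the cycle it simply starts at the vertex with one more outgoing than incoming edge, which amounts to the same thing. Several Euler paths may exist, so the string returned is one valid reconstruction, not necessarily the original DNA.

```python
from collections import Counter, defaultdict

def reconstruct(fragments):
    """Spell a string whose l-length substrings are exactly the fragments."""
    graph = defaultdict(list)
    balance = Counter()  # out-degree minus in-degree of each vertex
    for f in fragments:
        u, v = f[:-1], f[1:]
        graph[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    # Start at the vertex with surplus out-degree, if there is one.
    start = next((v for v, b in balance.items() if b == 1), fragments[0][:-1])

    # Hierholzer: follow unused edges, backing up when stuck.
    stack, path = [start], []
    while stack:
        u = stack[-1]
        if graph[u]:
            stack.append(graph[u].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != len(fragments) + 1:
        raise ValueError("no Euler path uses every fragment")
    return path[0] + "".join(v[-1] for v in path[1:])

kmers = ["ATG", "TGG", "TGC", "GTG", "GGC", "GCA", "GCG", "CGT"]
dna = reconstruct(kmers)
print(dna)
```

Each edge is pushed and popped once, so the running time is linear in the number of fragments.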