Fragment Assembly of DNA

Similar documents
Bio nformatics. Lecture 16. Saad Mneimneh

FRAGMENT ASSEMBLY OF DNA

Lecture 15. Saad Mneimneh

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Sequence Assembly

A Gentle Introduction to (or Review of ) Fundamentals of Chemistry and Organic Chemistry

The Structure and Functions of Proteins

Bio nformatics. Lecture 3. Saad Mneimneh

Problem: Shortest Common Superstring. The Greedy Algorithm for Shortest Common Superstrings. Overlap graphs. Substring-freeness

CMPSCI 311: Introduction to Algorithms Second Midterm Exam

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

Gel Electrophoresis. 10/28/0310/21/2003 CAP/CGS 5991 Lecture 10Lecture 9 1

Repeat resolution. This exposition is based on the following sources, which are all recommended reading:

Approximating Shortest Superstring Problem Using de Bruijn Graphs

Chapter 34: NP-Completeness

Algorithms Design & Analysis. Approximation Algorithm

Analysis and Design of Algorithms Dynamic Programming

8.3 Hamiltonian Paths and Circuits

1. Introduction Recap

Information Theory of DNA Shotgun Sequencing

Finding Consensus Strings With Small Length Difference Between Input and Solution Strings

Mapping-free and Assembly-free Discovery of Inversion Breakpoints from Raw NGS Reads

Graph Algorithms in Bioinformatics

Sequence analysis and Genomics

4/19/11. NP and NP completeness. Decision Problems. Definition of P. Certifiers and Certificates: COMPOSITES

Computer Science 385 Analysis of Algorithms Siena College Spring Topic Notes: Limitations of Algorithms

Approximation Algorithms for Re-optimization

Algorithm Design and Analysis

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

K-center Hardness and Max-Coverage (Greedy)

P and NP. Inge Li Gørtz. Thank you to Kevin Wayne, Philip Bille and Paul Fischer for inspiration to slides

8.1 Polynomial-Time Reductions. Chapter 8. NP and Computational Intractability. Classify Problems

A GREEDY APPROXIMATION ALGORITHM FOR CONSTRUCTING SHORTEST COMMON SUPERSTRINGS *

4. How to prove a problem is NPC

Algorithms Exam TIN093 /DIT602

Effects of Gap Open and Gap Extension Penalties

CS 350 Algorithms and Complexity

4/22/12. NP and NP completeness. Efficient Certification. Decision Problems. Definition of P

NP-Completeness. Andreas Klappenecker. [based on slides by Prof. Welch]

Towards More Effective Formulations of the Genome Assembly Problem

Algorithms for Three Versions of the Shortest Common Superstring Problem

CS 350 Algorithms and Complexity

Tandem Mass Spectrometry: Generating function, alignment and assembly

Samson Zhou. Pattern Matching over Noisy Data Streams

F. Blanchet-Sadri, "Codes, Orderings, and Partial Words." Theoretical Computer Science, Vol. 329, 2004, pp DOI: /j.tcs

NP-Completeness. Sections 28.5, 28.6

STATC141 Spring 2005 The materials are from Pairwise Sequence Alignment by Robert Giegerich and David Wheeler

Streaming Periodicity with Mismatches. Funda Ergun, Elena Grigorescu, Erfan Sadeqi Azer, Samson Zhou

CS 301: Complexity of Algorithms (Term I 2008) Alex Tiskin Harald Räcke. Hamiltonian Cycle. 8.5 Sequencing Problems. Directed Hamiltonian Cycle

NP-Complete Problems and Approximation Algorithms

Algorithms for Bioinformatics

CS 320, Fall Dr. Geri Georg, Instructor 320 NP 1

8. INTRACTABILITY I. Lecture slides by Kevin Wayne Copyright 2005 Pearson-Addison Wesley. Last updated on 2/6/18 2:16 AM

Chapter 6. Dynamic Programming. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Shortest DNA cyclic cover in compressed space

Limitations of Algorithm Power

Algorithm Design and Analysis

Tractable & Intractable Problems

Algorithms: COMP3121/3821/9101/9801

CSE 549: Computational Biology. Computer Science for Biologists Biology

Genomes Comparision via de Bruijn graphs

Lecturer: Shuchi Chawla Topic: Inapproximability Date: 4/27/2007

Computational Complexity

NP-Hardness reductions

Hierarchical Overlap Graph

Algorithm Design and Analysis

Admin NP-COMPLETE PROBLEMS. Run-time analysis. Tractable vs. intractable problems 5/2/13. What is a tractable problem?

CSE 202 Dynamic Programming II

Lecture 2: Pairwise Alignment. CG Ron Shamir

Lecture 13, Fall 04/05

Data Structures and Algorithms (CSCI 340)

Lecture 14 - P v.s. NP 1

ECS122A Handout on NP-Completeness March 12, 2018

FINISHING MY FOSMID XAAA113 Andrew Nett

Introduction to Complexity Theory

Topics in Complexity Theory

CS/COE

A PARSIMONY APPROACH TO ANALYSIS OF HUMAN SEGMENTAL DUPLICATIONS

CSE 421 Introduction to Algorithms Final Exam Winter 2005

Complexity Theory VU , SS The Polynomial Hierarchy. Reinhard Pichler

Outline. Complexity Theory EXACT TSP. The Class DP. Definition. Problem EXACT TSP. Complexity of EXACT TSP. Proposition VU 181.

CS 580: Algorithm Design and Analysis

COP 4531 Complexity & Analysis of Data Structures & Algorithms

3/22/2018. CS 580: Algorithm Design and Analysis. Circuit Satisfiability. Recap. The "First" NP-Complete Problem. Example.

Designing and Testing a New DNA Fragment Assembler VEDA-2

10.3: Intractability. Overview. Exponential Growth. Properties of Algorithms. What is an algorithm? Turing machine.

A difficult problem. ! Given: A set of N cities and $M for gas. Problem: Does a traveling salesperson have enough $ for gas to visit all the cities?

Intractability. A difficult problem. Exponential Growth. A Reasonable Question about Algorithms !!!!!!!!!! Traveling salesperson problem (TSP)

Lecture 18: More NP-Complete Problems

Chapter 4. Greedy Algorithms. Slides by Kevin Wayne. Copyright 2005 Pearson-Addison Wesley. All rights reserved.

Design and Analysis of Algorithms

Polynomial-Time Reductions

4/12/2011. Chapter 8. NP and Computational Intractability. Directed Hamiltonian Cycle. Traveling Salesman Problem. Directed Hamiltonian Cycle

ECS 120 Lesson 24 The Class N P, N P-complete Problems

The Algorithmics Column

Intro to Theory of Computation

Aside: Golden Ratio. Golden Ratio: A universal law. Golden ratio φ = lim n = 1+ b n = a n 1. a n+1 = a n + b n, a n+b n a n

arxiv: v1 [cs.dm] 18 May 2016

Multivariate Algorithmics for NP-Hard String Problems

4/30/14. Chapter Sequencing Problems. NP and Computational Intractability. Hamiltonian Cycle

Transcription:

Wright State University CORE Scholar Computer Science and Engineering Faculty Publications Computer Science and Engineering 2003 Fragment Assembly of DNA Dan E. Krane Wright State University - Main Campus, dan.krane@wright.edu Michael L. Raymer Wright State University - Main Campus, michael.raymer@wright.edu Follow this and additional works at: http://corescholar.libraries.wright.edu/cse Part of the Computer Sciences Commons, and the Engineering Commons Repository Citation Krane, D. E., & Raymer, M. L. (2003). Fragment Assembly of DNA.. http://corescholar.libraries.wright.edu/cse/391 This Presentation is brought to you for free and open access by Wright State University s CORE Scholar. It has been accepted for inclusion in Computer Science and Engineering Faculty Publications by an authorized administrator of CORE Scholar. For more information, please contact corescholar@www.libraries.wright.edu.

BIO/CS 471 Algorithms for Bioinformatics Fragment Assembly of DNA

Limitations to sequencing You must have a primer of known sequence to initiate PCR Only about 1000nts can be sequenced in a single reaction The sequencing process is slow, so it is beneficial to do as much in parallel as possible Primer hopping Shotgun approach Fragment Assembly 2

Shotgun Sequencing Fragment Assembly 3

The Ideal Case Find maximal overlaps between fragments: ACCGT CGTGC TTAC TACCGT --ACCGT-- ----CGTGC TTAC----- -TACCGT TTACCGTGC Consensus sequence determined by vote Fragment Assembly 4

Quality Metrics The coverage at position i of the target or consensus sequence is the number of fragments that overlap that position Target: Two contigs No coverage Fragment Assembly 5

Quality Metrics Linkage the degree of overlap between fragments Target: Perfect coverage, poor average linkage poor minimum linkage Fragment Assembly 6

Base call errors Real World Complications Chimeric fragments, contamination (e.g. from the vector) --ACCGT-- ----CGTGC TTAC----- -TGCCGT TTACCGTGC Base Call Error --ACC-GT-- ----CAGTGC TTAC------ -TACC-GT TTACC-GTGC Insertion Error --ACCGT-- ----CGTGC TTAC----- -TAC-GT TTACCGTGC Deletion Error Fragment Assembly 7

Unknown Orientation A fragment can come from either strand CACGT ACGT ACTACG GTACT ACTGA CTGA CACGT -ACGT --CGTAGT -----AGTAC --------ACTGA ---------CTGA Fragment Assembly 8

Direct repeats Repeats A X B X C X D A X C X B X D Fragment Assembly 9

Direct repeats Repeats A X B Y C X D Y E A X D Y C X B Y E Fragment Assembly 10

Inverted repeats Repeats X X X X Fragment Assembly 11

Sequence Alignment Models Shortest common superstring Input: A collection, F, of strings (fragments) Output: A shortest possible string S such that for every f F, S is a superstring of f. Example: F = {ACT, CTA, AGT} S = ACTAGT Fragment Assembly 12

Problems with the SCS model x x x x Directionality of fragments must be known No consideration of coverage Some simple consideration of linkage No consideration of base call errors Fragment Assembly 13

Reconstruction Deals with errors and unknown orientation Definitions f is an approximate substring of S at error level ε when d s (f, S) ε f d s = substring edit distance: Reconstruction Input: A collection, F, of strings, and a tolerance level, ε Output: Shortest possible string, S, such that for every f F : min( d ( f, S ), d ( f, S ) ε f s Fragment Assembly 14 s Match = 0 Mismatch = 1 Gap = 1

Input: Output: Reconstruction Example F = {ATCAT, GTCG, CGAG, TACCA} ε = 0.25 ATGAT ------CGAC -CGAG ----TACCA ACGATACGAC ATCAT GTCG d s (CGAG, ACGATACGAC) = 1 = 0.25 4 So this output is OK for ε = 0.25 Fragment Assembly 15

Gaps in Reconstruction Reconstruction allows gaps in fragments: AT-GA----- ATCGATAGAC d s = 1 Fragment Assembly 16

Limitations of Reconstruction Models errors and unknown orientation Doesn t handle repeats Doesn t model coverage Only handles linkage in a very simple way Always produces a single contig Fragment Assembly 17

Contigs Sometimes you just can t put all of the fragments together into one contiguous sequence: No way to tell how much sequence is missing between them.? No way to tell the order of these two contigs. Fragment Assembly 18

Multicontig Definitions A layout, L, is a multiple alignment of the fragments Columns numbered from 1 to L Endpoints of a fragment: l(f) and r(f) An overlap is a link is no other fragment completely covers the overlap Link Not a link Fragment Assembly 19

More definitions Multicontig The size of a link is the number of overlapping positions ACGTATAGCATGA GTA CATGATCA ACGTATAG GATCA A link of size 5 The weakest link is the smallest link in the layout A t-contig has a weakest link of size t A collection, F, admits a t-contig if a t-contig can be constructed from the fragments in F Fragment Assembly 20

Input: F, and t Perfect Multicontig Output: a minimum number of collections, C i, such that every C i admits a t-contig Let F = {GTAC, TAATG, TGTAA} t = 3 --TAATG TGTAA-- GTAC t = 1 TGTAA----- --TAATG--- ------GTAC Fragment Assembly 21

Handling errors in Multicontig The image of a fragment is the portion of the consensus sequence, S, corresponding to the fragment in the layout S is an ε-consensus for a collection of fragments when the edit distance from each fragment, f, and its image is at most ε f TATAGCATCAT CGTC CATGATCA ACGGATAG GTCCA ACGTATAGCATGATCA An ε-consensus for ε = 0.4 Fragment Assembly 22

Definition of Multicontig Input: A collection, F, of strings, an integer t 0, and an error tolerance ε between 0 and 1 Output: A partition of F into the minimum number of collections C i such that every C i admits a t-contig with an ε-consensus Fragment Assembly 23

Let ε = 0.4, t = 3 Example of Multicontig TATAGCATCAT ACGTC CATGATCAG ACGGATAG GTCCAG ACGTATAGCATGATCAG Fragment Assembly 24

Algorithms Most of the algorithms to solve the fragment assembly problem are based on a graph model A graph, G, is a collection of edges, e, and vertices, v. Directed or undirected Weighted or unweighted We will discuss representations and other issues shortly A directed, unweighted graph Fragment Assembly 25

The Maximum Overlap Graph The text calls it an overlap multigraph Each directed edge, (u,v) is weighted with the length of the maximal overlap between a suffix of u and a prefix of v TACGA a 1 CTAAAG c 1 2 d 1 1 GACA b ACCC 0-weight edges omitted! Fragment Assembly 26

Paths and Layouts The path dbc leads to the alignment: GACA-------- ---ACCC----- ------CTAAAG CTAAAG c TACGA 1 a 2 1 d 1 1 GACA b ACCC Fragment Assembly 27

Superstrings Every path that covers every node is a superstring Zero weight edges result in alignments like: GACA-------- ----GCCC----- --------TTAAAG Higher weights produce more overlap, and thus shorter strings The shortest common superstring is the highest weight path that covers every node Fragment Assembly 28

Graph formulation of SCS Input: A weighted, directed graph Output: The highest-weight path that touches every node of the graph Does this problem sound familiar? Fragment Assembly 29

The Greedy Algorithm Algorithm greedy Sort edges in increasing weight order For each edge in this order If the edge does not form a cycle and the edge does not start or end at the same node as another edge in the set then add the edge to the current set End for End Algorithm Figure 4.16, page 125 Fragment Assembly 30

Greedy Example 7 2 2 4 5 3 2 6 1 Fragment Assembly 31

Greedy does not always find the best path GCC 2 0 ATGC 2 TGCAT 3 Fragment Assembly 32

Tools for Shotgun Sequencing Fragment Assembly 33

Common Difficulty Each of these problems is a method for modeling fragment assembly Each of these problems is provably intractable How? Fragment Assembly 34

Embedding problems Suppose I told you that I had found a clever way to model the TSP as a shortest common superstring problem Paths between cities are represented as fragments The shortest path is the shortest common superstring of the fragments If this is true, then there are only two possibilities: 1. This problem is just as intractable as TSP 2. TSP is actually a tractable problem! Fragment Assembly 35

NP-Complete Problems There is a collection of problems that computer scientists believe to be intractable TSP is one of them Each of them has been modeled as one or more of the other NP-complete problems If you solve one, you solve them all A problem, p, is NP-hard if you can model one of these NP-complete problems as an instance of p Fragment Assembly 36

NP-Completeness NP Subset sum 3-SAT TSP P Fragment Assembly 37

P = NP? NP Subset sum 3-SAT NP P Fragment Assembly 38