Towards More Effective Formulations of the Genome Assembly Problem

Similar documents
Multi-Assembly Problems for RNA Transcripts

BME 5742 Biosystems Modeling and Control

Genome Assembly. Sequencing Output. High Throughput Sequencing

Genomes Comparision via de Bruijn graphs

Complete all warm up questions Focus on operon functioning we will be creating operon models on Monday

Algorithms for Bioinformatics

GCD3033:Cell Biology. Transcription

Controlling Gene Expression

Chapters 12&13 Notes: DNA, RNA & Protein Synthesis

Introduction to de novo RNA-seq assembly

Assembly improvement: based on Ragout approach. student: Anna Lioznova scientific advisor: Son Pham

Annotation of Plant Genomes using RNA-seq. Matteo Pellegrini (UCLA) In collaboration with Sabeeha Merchant (UCLA)

Lecture 15. Saad Mneimneh

DNA sequencing. Bad example (repeats) Lecture 15. Shortest common superstring SCS. Given a set of fragments F,

Genome Assembly Results, Protocol & Demo

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

On the Sound Covering Cycle Problem in Paired de Bruijn Graphs

Biology. Biology. Slide 1 of 26. End Show. Copyright Pearson Prentice Hall

Biology 644: Bioinformatics

Chapter 15 Active Reading Guide Regulation of Gene Expression

RNA Processing: Eukaryotic mrnas

Bio nformatics. Lecture 16. Saad Mneimneh

From Gene to Protein

Proteomics. 2 nd semester, Department of Biotechnology and Bioinformatics Laboratory of Nano-Biotechnology and Artificial Bioengineering

Lecture 15: Realities of Genome Assembly Protein Sequencing

Genome Sequencing & Assembly Michael Schatz. May 2, 2013 Human Microbiome Consortium

O 3 O 4 O 5. q 3. q 4. Transition

Simulation of Gene Regulatory Networks

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Sequence Assembly

UNIT 6 PART 3 *REGULATION USING OPERONS* Hillis Textbook, CH 11

Multiple Choice Review- Eukaryotic Gene Expression

De novo assembly and genotyping of variants using colored de Bruijn graphs

New Contig Creation Algorithm for the de novo DNA Assembly Problem

Lecture 7: Simple genetic circuits I

Prokaryotic Regulation

Markov Models & DNA Sequence Evolution

Exhaustive search. CS 466 Saurabh Sinha

A Simple Protein Synthesis Model

Name: SBI 4U. Gene Expression Quiz. Overall Expectation:

Mathangi Thiagarajan Rice Genome Annotation Workshop May 23rd, 2007

Bio nformatics. Lecture 3. Saad Mneimneh

Newly made RNA is called primary transcript and is modified in three ways before leaving the nucleus:

Reducing storage requirements for biological sequence comparison

Graph Algorithms in Bioinformatics

Markov Chains and Hidden Markov Models. = stochastic, generative models

Topic 4 - #14 The Lactose Operon

1. In most cases, genes code for and it is that

Videos. Bozeman, transcription and translation: Crashcourse: Transcription and Translation -

Humans have two copies of each chromosome. Inherited from mother and father. Genotyping technologies do not maintain the phase

CSEP 590A Summer Tonight MLE. FYI, re HW #2: Hemoglobin History. Lecture 4 MLE, EM, RE, Expression. Maximum Likelihood Estimators

CSEP 590A Summer Lecture 4 MLE, EM, RE, Expression

Organization of Genes Differs in Prokaryotic and Eukaryotic DNA Chapter 10 p

Flow of Genetic Information

Do Read Errors Matter for Genome Assembly?

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Lecture 5: September Time Complexity Analysis of Local Alignment

L3.1: Circuits: Introduction to Transcription Networks. Cellular Design Principles Prof. Jenna Rickus

Introduction to Molecular and Cell Biology

Algorithms in Computational Biology (236522) spring 2008 Lecture #1

High-throughput sequencing: Alignment and related topic

Introduction to Bioinformatics

2012 Univ Aguilera Lecture. Introduction to Molecular and Cell Biology

Translation Part 2 of Protein Synthesis

Cellular Neuroanatomy I The Prototypical Neuron: Soma. Reading: BCP Chapter 2

Control of Gene Expression in Prokaryotes

Bio 119 Bacterial Genomics 6/26/10

Tandem Mass Spectrometry: Generating function, alignment and assembly

Protein Synthesis. Unit 6 Goal: Students will be able to describe the processes of transcription and translation.

Lecture 18 June 2 nd, Gene Expression Regulation Mutations

The Research Plan. Functional Genomics Research Stream. Transcription Factors. Tuning In Is A Good Idea

CHAPTER 3. Cell Structure and Genetic Control. Chapter 3 Outline

9/11/18. Molecular and Cellular Biology. 3. The Cell From Genes to Proteins. key processes

REVIEW SESSION. Wednesday, September 15 5:30 PM SHANTZ 242 E

Network motifs in the transcriptional regulation network (of Escherichia coli):

Computational Biology: Basics & Interesting Problems

UNIT 5. Protein Synthesis 11/22/16

Algorithms for Bioinformatics

Pattern Matching (Exact Matching) Overview

9/2/17. Molecular and Cellular Biology. 3. The Cell From Genes to Proteins. key processes

Today s Lecture: HMMs

Regulation of Gene Expression

Honors Biology Reading Guide Chapter 11

Предсказание и анализ промотерных последовательностей. Татьяна Татаринова

Pan-genomics: theory & practice

15.2 Prokaryotic Transcription *

Biology 112 Practice Midterm Questions

Basic Synthetic Biology circuits

Introduction to Hidden Markov Models for Gene Prediction ECE-S690

Boolean models of gene regulatory networks. Matthew Macauley Math 4500: Mathematical Modeling Clemson University Spring 2016

More Protein Synthesis and a Model for Protein Transcription Error Rates

GENE ACTIVITY Gene structure Transcription Transcript processing mrna transport mrna stability Translation Posttranslational modifications

Putting the Bayes update to sleep

Going Beyond SNPs with Next Genera5on Sequencing Technology Personalized Medicine: Understanding Your Own Genome Fall 2014

HMMs and biological sequence analysis

Approximating Shortest Superstring Problem Using de Bruijn Graphs

Evolutionary analysis of the well characterized endo16 promoter reveals substantial variation within functional sites

7.32/7.81J/8.591J: Systems Biology. Fall Exam #1

Regulation of Gene Expression

Searching Sear ( Sub- (Sub )Strings Ulf Leser

Unit Two: Molecular Genetics. 5.5 Control Mechanisms 5.7 Key Differences 5.8 Genes and Chromosomes

Transcription:

Towards More Effective Formulations of the Genome Assembly Problem Alexandru Tomescu Department of Computer Science University of Helsinki, Finland DACS June 26, 2015 1 / 25

2 / 25

CENTRAL DOGMA OF BIOLOGY protein-protein interaction DNA gene 1 intron exon TFBS gene 2 transcription pre-mrna alternative splicing enhancer silencer promoter... mature mrna transcripts translation proteins binding Image taken from Genome-Scale Algorithm Design, Cambridge University Press, 2015 3 / 25

SEQUENCING ATLAS De novo sequencing / Whole genome resequencing Bisulfite sequencing ChIP sequencing Targeted resequencing DNA methylation gene TFBS primer primer mature mrna transcripts RNA sequencing proteins binding Image taken from Genome-Scale Algorithm Design, Cambridge University Press, 2015 4 / 25

HIGH-THROUGHPUT SEQUENCING DNA ) amplification ) break apart ) size selection paired-end reads sequencing ) length ~450 100 ~250 100 5 / 25

HIGH-THROUGHPUT SEQUENCING DNA ) amplification ) break apart ) size selection paired-end reads sequencing ) length ~450 100 ~250 100 5 / 25

HIGH-THROUGHPUT SEQUENCING DNA ) amplification ) break apart ) size selection paired-end reads sequencing ) length ~450 100 ~250 100 We assume here that DNA is a single stranded, single chromosome 5 / 25

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End I LLUMINA H I S EQ X 6 / 25

GENOME ASSEMBLY PROBLEM INPUT: A collection of paired-end reads OUTPUT: The genome E.coli 4.6 10 6 Human 3.2 10 9 Spurce 25 10 9 7 / 25

GENOME ASSEMBLY PROBLEM INPUT: A collection of paired-end reads OUTPUT: The genome Initial formulations: E.coli 4.6 10 6 Human 3.2 10 9 Spurce 25 10 9 Shortest superstring problem (NP-hard) Build a graph with reads as nodes, and significant overlaps between reads as directed edges: Find a walk that passes through every node exactly once (NP-complete) Find a walk that passes through every node at least once 7 / 25

GENOME ASSEMBLY PROBLEM INPUT: A collection of paired-end reads OUTPUT: The genome Initial formulations: E.coli 4.6 10 6 Human 3.2 10 9 Spurce 25 10 9 Shortest superstring problem (NP-hard) Build a graph with reads as nodes, and significant overlaps between reads as directed edges: Find a walk that passes through every node exactly once (NP-complete) Find a walk that passes through every node at least once Unrealistic: Longer repeated regions are collapsed Genome coverage is not uniform We cannot choose between multiple solutions 7 / 25

PRACTICAL FORMULATIONS / PIPELINE 1. Contig assembly: assemble the reads into strings (contigs) that are guaranteed to occur in the genome contig: ACGTACG GTACGATA CTAATTCGA GATATCTA CTAGTACCC ACGTACGATATCTA 8 / 25

PRACTICAL FORMULATIONS / PIPELINE 1. Contig assembly: assemble the reads into strings (contigs) that are guaranteed to occur in the genome contig: ACGTACG GTACGATA CTAATTCGA GATATCTA CTAGTACCC ACGTACGATATCTA 2. Scaffolding: using paired-end reads, chain the contigs into scaffolds that are guaranteed to occur in the genome 8 / 25

PRACTICAL FORMULATIONS / PIPELINE (2) 3. Gap filling: fill the gaps in the scaffolds Tens of genome assembly programs available: ABySS, Velvet, Allpaths-LG, Bambus2, MSR-CA, SGA, Cortex, SOAPdenovo, Opera-LG, SPADES,... 9 / 25

DE BRUIJN GRAPHS DEFINITION Given a set R of strings, the de Bruijn graph of order k of R is the directed graph DB k (R) with node set: the set of k-mers of R edge set: the set of k + 1-mers of the strings of R Also edges occur in the strings of R! 10 / 25

DE BRUIJN GRAPHS DEFINITION Given a set R of strings, the de Bruijn graph of order k of R is the directed graph DB k (R) with node set: the set of k-mers of R edge set: the set of k + 1-mers of the strings of R Also edges occur in the strings of R! ATGCGTGGCA ATGCG CGTG AT GT TG CG GC TGGCA CA k = 2 GG 10 / 25

CONTIG ASSEMBLY joint work with Paul Medvedev 11 / 25

CONTIG ASSEMBLY No previous formal definition of contig Usually, contigs are maximal, unary paths (i.e., whose internal nodes have in-degree and out-degree 1, aka unitigs) v 0 v 1 v 2 v k v k+1 v t vt+1 12 / 25

CONTIG ASSEMBLY No previous formal definition of contig Usually, contigs are maximal, unary paths (i.e., whose internal nodes have in-degree and out-degree 1, aka unitigs) v 0 v 1 v 2 v k v k+1 v t vt+1 Given a dbg G: a genomic walk of G is a circular edge-covering walk of G a walk is safe if it is a sub-walk of all genomic walks of G 12 / 25

CONTIG ASSEMBLY No previous formal definition of contig Usually, contigs are maximal, unary paths (i.e., whose internal nodes have in-degree and out-degree 1, aka unitigs) v 0 v 1 v 2 v k v k+1 v t vt+1 Given a dbg G: a genomic walk of G is a circular edge-covering walk of G a walk is safe if it is a sub-walk of all genomic walks of G We now assume that the dbg admits a genomic walk (i.e., is strongly connected) and is not a single cycle. 12 / 25

CONTIG ASSEMBLY (2) We say that a contig assembly algorithm is sound: if every output walk is safe complete: if every safe walk is in the output 13 / 25

CONTIG ASSEMBLY (2) We say that a contig assembly algorithm is sound: if every output walk is safe complete: if every safe walk is in the output The unitig algorithm is: outputting all maximal unitigs sound not complete 13 / 25

CONTIG ASSEMBLY (2) We say that a contig assembly algorithm is sound: if every output walk is safe complete: if every safe walk is in the output The unitig algorithm is: outputting all maximal unitigs sound not complete Is there a sound and complete contig assembly algorithm? 13 / 25

NON-SWITCHING CONTIGS v 0 v 1 v 2 v k v k+1 v t vt+1 A path with all out-branching nodes before all in-branching nodes 14 / 25

NON-SWITCHING CONTIGS v 0 v 1 v 2 v k v k+1 v t vt+1 A path with all out-branching nodes before all in-branching nodes Related to transformation-based algorithms of Kingsford, Schatz, Pop 2010 Jackson 2009 Medvedev, Georgiou, Myers, Brudno 2007 14 / 25

NON-SWITCHING CONTIGS v 0 v 1 v 2 v k v k+1 v t vt+1 A path with all out-branching nodes before all in-branching nodes Related to transformation-based algorithms of Kingsford, Schatz, Pop 2010 Jackson 2009 Medvedev, Georgiou, Myers, Brudno 2007 THEOREM There is an O( G + output )-time algorithm to output all maximal non-switching contigs of G. THEOREM The non-switching contig assembly algorithm is sound but not complete. 14 / 25

OmniTIGS v 0 v i v j vt+1 e 0 e i 1 e i e j 1 e j et We say that a walk w = (v 0, e 0, v 1, e 1,..., v t, e t, v t+1 ) is an omnitig if for all 1 i j t, there is no proper v j -v i path with first edge different from e j, and last edge different from e i 1. 15 / 25

OmniTIGS v 0 v i v j vt+1 e 0 e i 1 e i e j 1 e j et We say that a walk w = (v 0, e 0, v 1, e 1,..., v t, e t, v t+1 ) is an omnitig if for all 1 i j t, there is no proper v j -v i path with first edge different from e j, and last edge different from e i 1. THEOREM A walk w is safe w is an omnitig. THEOREM There is a polynomial time algorithm for outputting all maximal omnitigs. COROLLARY The omnitig algorithm is sound and complete. 15 / 25

EXPERIMENTAL RESULTS Algorithm #strings E-size AVG length unitig 41,524 7,806 893 non-switching contigs 32,589 7,822 1,136 omnitigs 24,949 7,850 1,479 genome: circularized human chr21 (length 48 10 6 ) graph: dbg k (chr21) for k = 55 e-size: given a set of substrings of genome, their e-size is the average, over all genomic positions i, of the mean length of the strings spanning position i 16 / 25

High-throughput sequencing Genome assembly problem Contig assembly Gap filling End G AP FILLING joint work with Leena Salmela, Kristoffer Sahlin and Veli Ma kinen 17 / 25

PROBLEM FORMULATION Previous formulations (GapCloser 2012, GapFiller 2012): s t 18 / 25

PROBLEM FORMULATION Previous formulations (GapCloser 2012, GapFiller 2012): s t We formulate it as Exact Path Length problem. Given: G = dbg k (R), for some k s, t V(G), the two k-mers flanking the gap [d..d] an estimate on the gap length s For every x [d..d] find an s-t path spelling a string of length x (i.e. a path of length x k). t s t 18 / 25

DYNAMIC PROGRAMMING (DP) Usually, d < V(G). Can be solved by DP in time O(d E(G) ): for every node v store: { 1 if there exists an s-v path of length i, a(v, i) := 0 otherwise. initialize a(s, 0) = 1, and compute a(v, i) := s t a(u, i 1). u N (v) s t u a(u,i-1) v a(v,i) back-tracking in the DP matrix from a(t, x k) gives a path (if exists) 19 / 25

ENGINEERED IMPLEMENTATION k-mers flanking the gaps can have errors allow paths to start/end at up to t flanking k-mers we should not explore the entire graph! meet in the middle optimization DP matrix rows are sparse store only the non-zero entries in each row parallelization on scaffold level limit the memory usage of the DP matrix abandon the search on a gap if limit exceeded 20 / 25

EXPERIMENTAL RESULTS (Gap2Seq) Original 4 10 5 GapCloser 3 10 5 GapFiller-bowtie GapFiller-bwa 2 10 5 Gap2Seq 1 10 5 0 Remaining Gap Length Erroneous Length genome: Staphylococcus aureus (length 2.8 10 6 ) graph: dbg with k = 31, for R = a collection of real reads results are totals over all scaffolds from a benchmark dataset (ABySS, Allpaths-LG, Bambus2, CABOG, MSR-CA, SGA, SOAPdenovo, Velvet) 21 / 25

EXPERIMENTAL RESULTS (Gap2Seq) 120 100 80 60 40 20 0 Runtime (min) 0.8 0.6 0.4 0.2 0 Peak Memory (GB) GapCloser GapFiller-bowtie GapFiller-bwa Gap2Seq 22 / 25

CONCLUSIONS Potential uses for omnitigs: longer contigs = better starting point for scaffolding and gap filling more flanking information around loci of interest Directions for omnitigs: robustness to errors, coverage gaps, reverse complements? faster omnitig algorithm a sound and complete algorithm when the genomic walk is linear? Gap Filling: performance on human data is so and so (but all tools have problems) we need to improve the runtime and memory usage we need a way to choose between multiple solutions (contig assembly?) 23 / 25

MULŢUMESC 24 / 25