Genomes Comparison via de Bruijn graphs


Student: Ilya Minkin
Advisor: Son Pham
St. Petersburg Academic University
June 4, 2012

Synteny Blocks: Algorithmic challenge
- Suppose that we are given two genomes. The question is: how are they evolutionarily related to each other?
- To analyze rearrangements, we must first decompose the genomes into synteny blocks
- Synteny blocks are evolutionarily conserved segments of the genome:
  - These blocks cover most of the genome
  - They occur in both genomes, with possible variations

Academic Project
- Project: identify synteny blocks for duplicated genomes represented as sequences of nucleotides.
- None of the previous synteny block reconstruction software (DRIMM-Synteny (Pham and Pevzner, 2010) included) can solve this problem efficiently.
- DRIMM-Synteny can find synteny blocks for complicated genomes, but it requires the genome to be represented as a sequence of genes.

General Idea: de Bruijn Graph
- We are given an alphabet $\Sigma$ with $|\Sigma| = m$ and a string $S$ over it
- A substring $T$ of $S$ with $|T| = k$ is called a $k$-mer
- The de Bruijn graph is a multigraph $G_k = (V, E)$, where $V = \Sigma^{k-1}$ = {all possible $(k-1)$-mers}
- If a $k$-mer $T$ occurs in $S$, we add a directed edge $(T[1, k-1], T[2, k])$ to the graph
- Create the de Bruijn graph from the nucleotide sequence
- Conserved regions will yield non-branching paths
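The construction above can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation behind these slides: the function name `de_bruijn_edges` is hypothetical, positions are 0-based, and only the vertices that actually occur in the string are materialized (rather than all of $\Sigma^{k-1}$).

```python
def de_bruijn_edges(s, k):
    """List the edges of the de Bruijn multigraph of string s.

    Each k-mer s[i:i+k] contributes one directed edge from its
    (k-1)-prefix to its (k-1)-suffix; i is the 0-based position
    of the k-mer in s (hypothetical edge representation).
    """
    edges = []
    for i in range(len(s) - k + 1):
        kmer = s[i:i + k]
        edges.append((kmer[:-1], kmer[1:], i))
    return edges


# Usage: the 10-letter string below has 8 overlapping 3-mers.
for source, target, position in de_bruijn_edges("ACCTGTCAGT", 3):
    print(position, source, "->", target)
```

Keeping the edges in a list (rather than a set) preserves repeated k-mers, which is what makes the structure a multigraph.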

Challenges
- Variations in synteny blocks generate cycles, so we need to simplify the graph
- Double-strandedness: conserved regions may occur on both strands. Example:
  5' AACCGGTT 3'
  3' TTGGCCAA 5'
  Such blocks are reverse complementary to each other, so they yield no non-branching paths
- Spurious similarity
- Memory efficiency

Colored graph
- We use colored de Bruijn graphs [Iqbal et al., 2012] to handle double-strandedness
- Suppose that $S^+$ and $S^-$ are the positive and negative strands of the chromosome
- The colored de Bruijn graph is a multigraph $G_k = (V, E)$, where $V = \Sigma^{k-1}$
- For each $k$-mer $T^+$ in $S^+$, add the edge $(T^+[1, k-1], T^+[2, k])$ to $G_k$ and mark it blue
- For each $k$-mer $T^-$ in $S^-$, add the edge $(T^-[1, k-1], T^-[2, k])$ to $G_k$ and mark it red
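As a rough sketch of the two-color construction, the negative strand can be obtained as the reverse complement of the positive strand, and both strands can be fed through the same edge-building loop. The names `reverse_complement` and `colored_de_bruijn` are hypothetical, and the sketch assumes an uppercase A/C/G/T alphabet with no ambiguity codes.

```python
# Translation table for complementing uppercase nucleotides (assumption: A, C, G, T only).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(s):
    """Return the negative strand read 5' to 3'."""
    return s.translate(COMPLEMENT)[::-1]


def colored_de_bruijn(s_plus, k):
    """Return edges (source, target, color) built from both strands.

    k-mers of the positive strand give blue edges, k-mers of the
    negative strand give red edges.
    """
    s_minus = reverse_complement(s_plus)
    edges = []
    for color, strand in (("blue", s_plus), ("red", s_minus)):
        for i in range(len(strand) - k + 1):
            kmer = strand[i:i + k]
            edges.append((kmer[:-1], kmer[1:], color))
    return edges


# Usage: every k-mer of the positive strand has a red counterpart.
for edge in colored_de_bruijn("ACCTGTCAGT", 3):
    print(edge)
```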

Edge labeling
- Note that our graph is built from a string, not from a set of reads
- Each walk in the graph represents a string
- We are interested only in walks that represent substrings of the source string
- Assign to each edge $e$ the label $L(e)$ = position of the corresponding $k$-mer on the positive strand
- A walk $W = (v_1 e_1 v_2 e_2 \ldots)$ is considered valid iff:
  1. $e_i$ and $e_{i+1}$ are of the same color
  2. $|L(e_i) - L(e_{i+1})| = 1$
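The validity test can be sketched as a check over consecutive edge pairs. This sketch assumes that the second condition means the labels of consecutive edges differ by exactly 1 in absolute value (so that red walks, whose positive-strand positions decrease, also qualify); the function name `is_valid_walk` and the `(color, label)` edge representation are hypothetical.

```python
def is_valid_walk(walk_edges):
    """Check whether a walk spells a substring of the source string.

    walk_edges is the list of (color, label) pairs of the edges in
    walk order. The walk is valid when every pair of consecutive
    edges has the same color and labels that differ by exactly 1
    (assumed reading of the second condition).
    """
    return all(
        color_a == color_b and abs(label_a - label_b) == 1
        for (color_a, label_a), (color_b, label_b) in zip(walk_edges, walk_edges[1:])
    )


# Usage: a blue walk over positions 0, 1, 2 is valid; mixing colors is not.
print(is_valid_walk([("blue", 0), ("blue", 1), ("blue", 2)]))
print(is_valid_walk([("blue", 1), ("red", 2)]))
```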

Example
5' ACCTGTCAGT 3'
3' TGGACAGTCA 5'
Figure 1: Colored de Bruijn graph built from two strands (vertices: ac, cc, ca, tc, ct, ag, tg, gt, gg, ga; edge labels 0 to 7)

Graph simplification
- Bulges spoil long non-branching paths and indicate indels/mismatches
- A pair of walks $(W_1, W_2)$ is a bulge iff:
  1) The start and end vertices of $W_1$ and $W_2$ coincide
  2) $W_1$ and $W_2$ have exactly 2 common vertices
  3) There are no edges $u \in W_1$ and $v \in W_2$ such that $L(u) = L(v)$
  4) $|W_1| \le \delta$ and $|W_2| \le \delta$
Figure 2: A bulge between vertices U and V
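The four bulge conditions translate fairly directly into a predicate. The sketch below makes several assumptions that the slide does not state: a walk is represented by its vertex list plus its list of edge labels, the length of a walk is its number of edges, and the bound is inclusive (at most δ). The name `is_bulge` is hypothetical.

```python
def is_bulge(vertices_1, labels_1, vertices_2, labels_2, delta):
    """Check the four bulge conditions for a pair of walks.

    vertices_i: vertices of walk i in order; labels_i: labels L(e)
    of its edges in order. Walk length is taken to be the number of
    edges, and the bound is assumed to be inclusive.
    """
    # 1) start and end vertices coincide
    same_ends = vertices_1[0] == vertices_2[0] and vertices_1[-1] == vertices_2[-1]
    # 2) exactly two vertices are shared (the common start and end)
    two_common = len(set(vertices_1) & set(vertices_2)) == 2
    # 3) no pair of edges carries the same label
    disjoint_labels = not (set(labels_1) & set(labels_2))
    # 4) both walks are short enough
    short_enough = len(labels_1) <= delta and len(labels_2) <= delta
    return same_ends and two_common and disjoint_labels and short_enough


# Usage: two walks from AC to GT that share only their endpoints.
print(is_bulge(["AC", "CG", "GT"], [0, 1],
               ["AC", "CT", "TG", "GT"], [10, 11, 12], delta=5))
```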

General pipeline
- Build the de Bruijn graph from the genome
- Remove bulges (BFS-like algorithm)
  - Bulges are removed by replacing long branches with shorter ones
- Output non-branching paths
Figure 3: Bulge removal illustration
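The last pipeline step, outputting non-branching paths, can be illustrated with the textbook procedure for maximal non-branching paths. This is a generic sketch rather than the code used for these slides: it ignores edge colors and labels, skips isolated cycles, and the name `non_branching_paths` is hypothetical. Bulge removal itself is not shown.

```python
from collections import defaultdict


def non_branching_paths(edges):
    """Return maximal non-branching paths as lists of vertices.

    edges is a list of (source, target) pairs. A vertex is
    'simple' when it has exactly one incoming and one outgoing
    edge; paths start at non-simple vertices and are extended
    through simple ones. Isolated cycles are not reported.
    """
    successors = defaultdict(list)
    in_degree = defaultdict(int)
    for source, target in edges:
        successors[source].append(target)
        in_degree[target] += 1

    def is_simple(vertex):
        return in_degree[vertex] == 1 and len(successors[vertex]) == 1

    paths = []
    # Iterate over a snapshot, because lookups may add empty entries.
    for start in list(successors):
        if is_simple(start):
            continue
        for nxt in successors[start]:
            path = [start, nxt]
            while is_simple(nxt):
                nxt = successors[nxt][0]
                path.append(nxt)
            paths.append(path)
    return paths


# Usage: a chain A -> B -> C that forks into D and E at C.
print(non_branching_paths([("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]))
```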

Parameters selection
- How should we choose K and δ?
- Duplicated genes may share no long (K > 50) K-mers
- Big K (around 50): we find only a few synteny blocks
- Small K (around 10) and small δ (around 15): we find very short synteny blocks
- Small K (around 10) and big δ (around 200): the genome will be disrupted completely
- Solution: do the simplification in multiple stages

New pipeline
- General idea: align similar regions first, then glue them together into synteny blocks
- Start with small K and small δ to smooth duplicated regions and obtain long K-mers
- Rebuild and simplify the graph with higher K and δ
- Repeat this process several times
- The final step can use K of several hundred

Experiment
- We have attempted to identify duplications in Arabidopsis thaliana
- Arabidopsis is known to have a highly duplicated genome [Arabidopsis Genome Initiative]
- The size of the genome is 120 Mbp
- We used 4 stages with the following parameters:

Stage number   K     δ
1              15    150
2              50    500
3              100   1000
4              500   5000

Computation results
- We have found 4722 synteny blocks in Arabidopsis
- These blocks cover 28 % of the genome
- The minimum length of a block is 1000 bp
- The largest block found has length 95 000 bp
- We tried to verify blocks by aligning instances of the same block
- At least 87 % of blocks have 50 % of exact matches

Computation results
Figure 4: Matches percent vs. number of blocks plot

Computation results
Figure 5: Synteny blocks length distribution

Summary
- We have covered 28 % of the Arabidopsis genome with synteny blocks
- But we have missed some duplicated regions described in [Arabidopsis Genome Initiative]
- Most of the blocks are short (< 5000 bp)
- We must improve coverage and lengthen the blocks
- Near-term plans:
  - Improve performance
  - Examine other genomes
  - Optimize algorithms to handle larger genomes

Applications
- A multiple sequence alignment program
- Finding synteny blocks for complicated (mammalian) genomes; possible collaboration with Jian Ma
- A tool for genome vs. genome, assembly vs. assembly, and assembly vs. genome comparisons

References
1. Pevzner P, Tesler G (2003) Human and mouse genomic sequences reveal extensive breakpoint reuse in mammalian evolution.
2. Pham S, Pevzner P (2010) DRIMM-Synteny: Decomposing Genomes into Evolutionary Conserved Segments.
3. Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G (2012) De novo assembly and genotyping of variants using colored de Bruijn graphs.
4. Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana.

Thank you!