BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)


BMI/CS 776 Lecture #20: Alignment of whole genomes
Colin Dewey (with slides adapted from those by Mark Craven)
2007.03.29

Multiple whole genome alignment
- Input: set of whole genome sequences
  - genomes diverged from common ancestor via substitutions, indels, and rearrangements
- Output: set of multiple alignments
  - one multiple alignment per colinear region (region in the genomes that has not had internal rearrangements)

Genome Rearrangement
[Figure: rearrangement of an ancestor genome into two extant genomes. Labels: Ancestor, Extant Species 1, Extant Species 2, Duplications]

Genome Rearrangement Example: Mouse vs. Human X Chromosome
[Figure from: Pevzner and Tesler. PNAS, 2003]
- each colored block represents a colinear orthologous region of the two chromosomes
- the two panels show the two most parsimonious sets of rearrangements to map one chromosome to the other

Hierarchical alignment
- First determine homology map
  - Determines all homologous colinear segments in the input genomes
- Perform nucleotide-level alignment
  - Run alignment program that assumes colinearity on each set of homologous segments

Homology Map
[Figure labels: Species 1, Species 2]

Homologous Group   Genome 1 Segments         Genome 2 Segments
1                  chr1:3306600-3626073:+    chr4:7084404-7540496:+
2                  chr2:3626073-3645123:+    chr5:1727254-1819933:-
3                  chr2:3645123-3675603:+    chr2:7045783-7084404:+ chr2:7084405-7103943:+
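The segment strings in the homology map follow a chromosome:start-end:strand pattern. As a minimal sketch of how such a map could be held in memory, the Python below parses those strings into records. The names Segment, parse_segment, and homology_map are hypothetical, and only groups 1 and 2 from the slide are included.

```python
from typing import NamedTuple


class Segment(NamedTuple):
    """One genomic segment; the field names are illustrative assumptions."""
    chrom: str
    start: int
    end: int
    strand: str  # "+" or "-"


def parse_segment(text: str) -> Segment:
    """Parse a 'chrom:start-end:strand' string into a Segment."""
    chrom, coords, strand = text.split(":")
    start, end = coords.split("-")
    if strand not in ("+", "-"):
        raise ValueError(f"unexpected strand in {text!r}")
    return Segment(chrom, int(start), int(end), strand)


# Homologous group id -> segments per genome (groups 1 and 2 from the slide).
homology_map = {
    1: {"genome1": [parse_segment("chr1:3306600-3626073:+")],
        "genome2": [parse_segment("chr4:7084404-7540496:+")]},
    2: {"genome1": [parse_segment("chr2:3626073-3645123:+")],
        "genome2": [parse_segment("chr5:1727254-1819933:-")]},
}
```

Each group can then be handed to an aligner that assumes colinearity, reverse-complementing any segment whose strand is "-".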

Whole Genome Alignment
[Figure labels: Species 1, Species 2]
- A set of multiple alignments of homologous colinear segments

One-to-one alignments
- Most whole genome alignments are one-to-one
  - Any segment from a given genome is aligned to at most one segment from each other genome
- Such alignments are indicative of monotopoorthologous segments
  - No undirected duplications involving segments since ancestor
  - Segments may be sources of directed duplications, but not targets

Two multiple whole genome alignment methods
- Mauve
  - Simultaneous construction of orthology map and nucleotide-level alignment
  - multi-MUMs as anchors
- Mercator/MAVID
  - Mercator constructs orthology map using genomic landmarks
  - MAVID aligns colinear segment sets determined by orthology map

The Mauve Method
Given: k genomes X_1, ..., X_k
1. find multi-MUMs (MUMs present in 2 or more genomes)
2. calculate a guide tree based on multi-MUMs
3. find LCBs (sequences of multi-MUMs) to use as anchors
4. do recursive anchoring within and outside of LCBs
5. calculate a progressive alignment of each LCB using guide tree
* note: no LIS step!

2. Calculating the Guide Tree in Mauve
Mauve calculates the guide tree instead of taking it as an input
1. find multi-MUMs in sequences
2. calculate pairwise distances; the distance between two sequences is based on the fraction of sequence shared in multi-MUMs
3. run neighbor-joining to get guide tree
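The slide says only that the distance is based on the fraction of sequence shared in multi-MUMs. The sketch below makes that concrete under an assumed formula (one minus shared multi-MUM bases divided by the mean genome length), which is not necessarily the formula Mauve uses. The function name and input layout are hypothetical.

```python
from itertools import combinations


def mum_distances(genome_lengths, multi_mums):
    """Pairwise distances from shared multi-MUM coverage.

    genome_lengths: dict mapping genome name -> length in bases.
    multi_mums: list of (mum_length, set_of_genome_names) pairs.
    Assumed formula: 1 - shared_bases / mean(len_a, len_b).
    """
    dist = {}
    for a, b in combinations(sorted(genome_lengths), 2):
        # Bases covered by multi-MUMs that occur in both genomes.
        shared = sum(length for length, genomes in multi_mums
                     if a in genomes and b in genomes)
        mean_len = (genome_lengths[a] + genome_lengths[b]) / 2
        dist[(a, b)] = 1.0 - shared / mean_len
    return dist
```

The resulting dictionary is the kind of pairwise distance table a neighbor-joining routine would take as input.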

3. Selecting Anchors: Finding Local Colinear Blocks
repeat
    partition set of multi-MUMs, M, into colinear blocks
    find minimum-weight colinear block(s)
    remove minimum-weight block(s) if they're sufficiently small
until minimum-weight block is not small enough
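The repeat/until loop above can be sketched for the simplest case. This version assumes two genomes, ignores strand, treats each anchor as a unique (position in genome 1, position in genome 2, weight) tuple, and removes one block per iteration. The names colinear_blocks, greedy_lcbs, and min_weight are hypothetical, and this is a simplification rather than Mauve's implementation.

```python
def colinear_blocks(anchors):
    """Partition anchors into maximal runs that are adjacent in both genomes."""
    if not anchors:
        return []
    by_g1 = sorted(anchors, key=lambda a: a[0])
    # Rank of each anchor when sorted by its genome 2 position.
    rank_g2 = {a: r for r, a in enumerate(sorted(anchors, key=lambda a: a[1]))}
    blocks = [[by_g1[0]]]
    for prev, cur in zip(by_g1, by_g1[1:]):
        if rank_g2[cur] == rank_g2[prev] + 1:
            blocks[-1].append(cur)   # still colinear: extend the current block
        else:
            blocks.append([cur])     # order breaks: start a new block
    return blocks


def block_weight(block):
    return sum(a[2] for a in block)


def greedy_lcbs(anchors, min_weight):
    """Drop the lightest block until every remaining block weighs >= min_weight."""
    anchors = list(anchors)
    while anchors:
        blocks = colinear_blocks(anchors)
        lightest = min(blocks, key=block_weight)
        if block_weight(lightest) >= min_weight:
            return blocks
        removed = set(lightest)
        anchors = [a for a in anchors if a not in removed]
    return []
```

Removing a small spurious block can merge its neighbors into one larger block, which is why the partition is recomputed on every pass.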

4. and 5. Recursive Anchoring and Gapped Alignment
- recursive anchoring (finding finer multi-MUMs and LCBs) and standard alignment (CLUSTALW) are used to extend LCBs
  - between LCBs
  - within LCBs

Mauve Alignment of 9 Enterobacteria (Salmonella and E. coli)
[Figure: an alignment. Figure courtesy of Aaron Darling]

Mauve vs. MLAGAN: Accuracy on Simulated Genome Data
[Figure courtesy of Aaron Darling]

Mauve vs. LAGAN: Accuracy on Simulated Genome Data with Inversions
[Figure: panels Mauve and Shuffle-LAGAN. Figure courtesy of Aaron Darling]

Mercator: Multiple whole-genome orthology mapping
[Figure: Mercator pipeline. Labels: Genome Sequences, Assembled Genomes, Anchor Intervals, Pairwise Anchor Hits, Segment Identification, Rough Map, Breakpoint Identification, Map]
- Orthologous segment identification: graph-based method
- Comparative scaffolding: draft genomes can be scaffolded by other genomes
- Breakpoint identification: refine segment endpoints with a graphical model

Establishing Orthologous Segments Using Anchors
- Use annotations (coding exons) as anchors: RefSeq, Genscan, etc.
- All-vs-all pairwise comparison of anchors (BLAT in protein space)
- Construct graph with anchors as vertices and BLAT hits as edges (weighted by alignment score)
[Figure: anchors on chromosomes joined by edges with weights 10, 22, 40, 60. Labels: anchor, edge, chromosome]
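The graph construction described on this slide can be sketched as follows. The sketch assumes each anchor is identified by a hypothetical (genome, name) tuple and that hits arrive as (anchor, anchor, score) triples already parsed from the aligner output. The functions build_anchor_graph and best_hits are illustrative names, not part of Mercator.

```python
from collections import defaultdict


def build_anchor_graph(hits):
    """Build a weighted undirected graph from pairwise anchor hits.

    hits: iterable of (anchor_a, anchor_b, score), anchors being (genome, name).
    Same-genome hits are skipped so the graph stays k-partite (an assumption).
    """
    graph = defaultdict(dict)
    for a, b, score in hits:
        if a[0] == b[0]:
            continue
        # Keep only the highest-scoring hit between a pair of anchors.
        if score > graph[a].get(b, float("-inf")):
            graph[a][b] = score
            graph[b][a] = score
    return graph


def best_hits(graph, anchor):
    """Return, per other genome, the highest-scoring neighbor of `anchor`."""
    best = {}
    for neighbor, score in graph.get(anchor, {}).items():
        genome = neighbor[0]
        if genome not in best or score > best[genome][1]:
            best[genome] = (neighbor, score)
    return best
```

Best hits of this kind are the raw material for the best-hit anchor cliques used in the segment identification step.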

Rough Orthology Map
- k-partite graph with edge weights
- vertices = intervals, edges = sequence similarity

Greedy segment identification
For i = k to 2 do
    Identify repetitive anchors (depends on number of high-scoring edges incident to each anchor)
    Find best-hit anchor cliques of size i
    Join colinear cliques into runs
    Filter edges not consistent with significant runs

Clique Joining
[Figure labels: Species 1, Species 2, Species 3]

Mercator example
[Figure]

Mercator in action
[Figure]

Draft Genomes (Harder)
[Figure]

Assembling Draft Sequence
- Ignore breakpoints caused by contig ends
- Assemble draft contigs based on anchor ordering in other genomes

Assemble Draft with Draft
- Drafts provide some anchor order information that can be used to assemble other drafts
- Takifugu and Tetraodon (~20,000 scaffolds each) can use each other to assemble into ~1,500 pieces

Refining the Map: Finding Breakpoints
[Figure]

Optimizing Breakpoints
\max_{b} \sum_{a \in A(b)} \mathrm{score}(a)

Breakpoint Graph
[Figure: breakpoint graph with nodes labeled 1 through 12]

Breakpoint Undirected Graphical Model
- b: configuration of breakpoints
- \psi_{B_C}(b_C): probability of multiple alignment of clique B_C
- p(b) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_{B_C}(b_C)
[Figure: graphical model over the breakpoint graph nodes 1 through 12]

Breakpoint Graphical Model
[Formulas not in transcript; the slide annotations are:]
- Random variable indicating position of breakpoint in breakpoint segment i
- Score of best multiple alignment of prefixes/suffixes of breakpoint segments in clique, with segments broken at given positions
- Objective function that we wish to maximize
- Objective function that we wish to maximize over non-maximal 2-cliques

Breakpoint Finding Algorithm
- Construct breakpoint segment graph
- Weight edges with phylogenetic distances
- Find minimum spanning tree/forest
- Perform pairwise alignment for each edge in MST
- Use alignments to estimate the potentials \psi
- Perform MAP inference to find the maximizing b
- Running time: [formula not in transcript]
  - Length dominates
  - Instead only consider P positions in each segment
  - Iterate to narrow in on exact position
  - Running time: [formula not in transcript]
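The minimum spanning tree/forest step can be sketched with Kruskal's algorithm and a small union-find. Kruskal is one standard way to compute a spanning forest and is swapped in here for illustration; the slides do not say which algorithm is used. The function name and the (weight, node, node) edge layout are assumptions.

```python
def minimum_spanning_forest(nodes, edges):
    """Kruskal's algorithm; returns the chosen edges of a minimum spanning forest.

    nodes: iterable of hashable node ids (e.g. breakpoint segments).
    edges: list of (weight, u, v), weight being a phylogenetic distance.
    """
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    chosen = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:        # edge joins two components: keep it
            parent[root_u] = root_v
            chosen.append((weight, u, v))
    return chosen
```

When the graph is disconnected the result is a forest with one tree per component, and pairwise alignment is then needed only for the edges that were kept.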

Heuristics for MAP Inference of Breakpoints
- Minimum spanning forest with edges weighted by phylogenetic distance
- Pairwise alignment heuristics for large sequences
- Performance
  - Finds breakpoints exactly for small simulated sequences
  - Improves map segment definition for orthology maps
[Figure: breakpoint graph with nodes labeled 1 through 12]

Whole-Genome alignment summary
- Very difficult problem
  - No good model for whole-genome evolution
  - Thus, no good objective function
  - Very large input size
- Components of methods
  - Pattern matching to find seeds
    - Identify colinear regions
    - Speed up alignment
  - Alignment of long sequences
    - Use combination of sparse dynamic programming and standard Needleman-Wunsch
  - Progressive multiple alignment
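For reference, the standard Needleman-Wunsch component named in the summary can be sketched as a global alignment score with a linear gap penalty. The scoring values (match 1, mismatch -1, gap -2) are illustrative assumptions, and the function returns only the optimal score, not the alignment itself.

```python
def needleman_wunsch_score(x, y, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of x and y with a linear gap penalty."""
    # prev[j] is the best score aligning the processed prefix of x to y[:j].
    prev = [j * gap for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [i * gap]
        for j in range(1, len(y) + 1):
            diag = prev[j - 1] + (match if x[i - 1] == y[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]
```

The table has len(x) * len(y) cells, so the full dynamic program is usually restricted to the short stretches between anchors.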