I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

I519 Introduction to Bioinformatics, 2011 Genome Comparison Yuzhen Ye (yye@indiana.edu) School of Informatics & Computing, IUB

Whole genome comparison/alignment
  Build better phylogenies
  Identify polymorphism
  Detect gene-level events
  Compare different assemblies of a single genome

Whole genome comparison
  Aligning whole genomes is a fundamentally different problem than aligning short sequences.
  Need to consider the presence of large-scale evolutionary events:
    Gene duplication & loss
    Horizontal gene transfer
    Repetitive sequences (repeats)
    Gene rearrangement and inversion
  Pairwise and multiple genome comparison
  Multiple genome alignment provides a basis for research into comparative genomics and the study of evolutionary dynamics.

Genome evolution
  [Figure: Genome A altered by different evolutionary events: point substitution, translocation, inversion, inversion and translocation, insertion, repeat (duplication)]

Basic algorithms: use anchoring as a heuristic to speed alignment
  Assumption: highly similar subsequences can be found quickly and are likely to be part of the correct global alignment.
  These local alignments are used to anchor a global alignment (alignment anchors), reducing the number of possible global alignments considered during a subsequent O(n^2) dynamic programming step.
  Select a single collinear set of alignment anchors
  Many tools have been developed

Rearrangement free or not
  Free of rearrangement
    Assume the input sequences are free from significant rearrangements of sequence elements, selecting a single collinear set of alignment anchors
    Pairwise: MUMmer, GLASS, AVID, and WABA align pairs of long sequences
    Multiple alignment: MAVID, MLAGAN, and MGA
  Consider rearrangement
    Shuffle-LAGAN (2003, first genome comparison method described that explicitly deals with genome rearrangements)
    MultiPipMaker (2003)
    Mauve (2004, multiple)
    Enredo and Pecan (2008)
    GR-Aligner (2009, pairwise)

MUMmer method
  MUMmer combines suffix trees, the longest increasing subsequence (LIS), and Smith-Waterman (SW) alignment
  Maximal Unique Match (MUM) identification: identify the longest strings that occur exactly once in Genome 1 and have exactly one identical match in Genome 2
    Naïve method: O(N^2)
    Using a suffix tree: O(N)
  Ordered MUM selection: identify the longest set of MUMs that occur in the same order in both genomes (using a variation of the well-known algorithm for finding the LIS of a sequence of integers)
  Processing non-matched regions: classify non-matched regions as insertions, SNPs, or highly polymorphic regions
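To make the MUM definition concrete, here is a minimal brute-force sketch in Python. It is not MUMmer's suffix-tree implementation; the function names and the `min_len` threshold are assumptions made for illustration. It takes every match that cannot be extended to the left or right and keeps it only if the matched string occurs exactly once in each sequence.

```python
def count_occurrences(text, pattern):
    """Count (possibly overlapping) occurrences of pattern in text."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def naive_mums(genome1, genome2, min_len=2):
    """Return (start1, start2, length) for every MUM of at least min_len.

    Start positions are 0-based. Brute force, only suitable for toy inputs.
    """
    mums = []
    for i in range(len(genome1)):
        for j in range(len(genome2)):
            if genome1[i] != genome2[j]:
                continue
            # Skip matches that could be extended to the left (not maximal).
            if i > 0 and j > 0 and genome1[i - 1] == genome2[j - 1]:
                continue
            # Extend the match to the right as far as possible.
            k = 0
            while (i + k < len(genome1) and j + k < len(genome2)
                   and genome1[i + k] == genome2[j + k]):
                k += 1
            match = genome1[i:i + k]
            # Keep the maximal match only if it is unique in both genomes.
            if (k >= min_len
                    and count_occurrences(genome1, match) == 1
                    and count_occurrences(genome2, match) == 1):
                mums.append((i, j, k))
    return mums


print(naive_mums("ATCGTA", "ATCGAT"))  # [(0, 0, 4)]
```

For the two toy sequences ATCGTA and ATCGAT, the only MUM of length 2 or more is ATCG at the start of both sequences; AT is rejected because it occurs twice in the second sequence.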

Suffix tree
  A suffix tree is a data structure that allows one to find, extremely efficiently, all distinct subsequences (substrings) in a given sequence.
  Efficient (linear-time) algorithms to construct suffix trees were given by Weiner (1973) and McCreight (1976).
  For the task of comparing two DNA sequences, suffix trees allow one to quickly find all subsequences shared by the two inputs. The genome alignment is then built upon this information.

Suffix tree for finding MUMs
  [Figure: suffix tree for the sequence gaaccgacct]
  An internal node is a repeated sequence in the original string
  A leaf is a unique suffix
  Every unique matching sequence is represented by an internal node with exactly two child nodes, such that the child nodes are leaf nodes from different genomes

A toy example
  Suffixes of Genome 1, ATCGTA# (start positions 1-7):
    ATCGTA# 1, TCGTA# 2, CGTA# 3, GTA# 4, TA# 5, A# 6, # 7
  Suffixes of Genome 2, ATCGAT$ (start positions 8-14):
    ATCGAT$ 8, TCGAT$ 9, CGAT$ 10, GAT$ 11, AT$ 12, T$ 13, $ 14
  All suffixes in sorted order:
    # 7, $ 14, A# 6, AT$ 12, ATCGAT$ 8, ATCGTA# 1, CGAT$ 10, CGTA# 3, GAT$ 11, GTA# 4, T$ 13, TA# 5, TCGAT$ 9, TCGTA# 2
  [Figure: generalized suffix tree built from these suffixes, with leaves labeled by start position]
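The sorted suffix list of the toy example can be reproduced with a few lines of Python. This is a sketch under stated assumptions: the helper names are hypothetical, and plain sorting stands in for a real suffix tree construction. It also reports the longest prefix shared by two neighbouring suffixes that come from different genomes, which is the longest string common to both sequences.

```python
def labelled_suffixes(sequence, offset):
    """Return (suffix, 1-based start position) pairs, shifted by offset."""
    return [(sequence[i:], offset + i + 1) for i in range(len(sequence))]


def common_prefix_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


genome1, genome2 = "ATCGTA#", "ATCGAT$"
suffixes = sorted(labelled_suffixes(genome1, 0)
                  + labelled_suffixes(genome2, len(genome1)))
sorted_positions = [pos for _, pos in suffixes]

# Neighbouring suffixes from different genomes share the longest common strings.
longest_shared = ""
for (s1, p1), (s2, p2) in zip(suffixes, suffixes[1:]):
    from_different_genomes = (p1 <= len(genome1)) != (p2 <= len(genome1))
    if from_different_genomes:
        n = common_prefix_length(s1, s2)
        if n > len(longest_shared):
            longest_shared = s1[:n]

print(sorted_positions)
print(longest_shared)  # ATCG
```

The terminators # and $ sort before the letters in ASCII, which is why suffixes 7 and 14 come first, as in the slide's sorted list.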

Suffix tree & suffix array for string matching
  Preprocess text T, not pattern P
    O(m) preprocessing time (m: the length of the text)
    O(n+k) search time (n: the length of the pattern; k: the number of occurrences of P in T)
  Match pattern P against the tree, starting at the root, until
    Case 1: P is completely matched. Every leaf below this match point is the starting location of an occurrence of P in T
    Case 2: no match is possible. P does not occur in T

A toy example of string (pattern) matching
  T = xabxac
  suffixes = {xabxac, abxac, bxac, xac, ac, c}
  Pattern P1: xa
  Pattern P2: xb
  [Figure: suffix tree of T with leaves labeled 1-6 by suffix start position]
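The two search cases can be tried on this toy text with a small Python sketch. As a simplifying assumption it builds an uncompressed suffix trie from nested dictionaries (quadratic space, so not the linear-time suffix tree of the slides); the function names and the `"$start"` key are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text into a trie of nested dicts.

    The node where a suffix ends stores its 1-based start position
    under the key "$start".
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["$start"] = start + 1
    return root


def find_occurrences(trie, pattern):
    """Return the sorted 1-based start positions of pattern in the text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return []            # Case 2: no match is possible
        node = node[ch]
    # Case 1: pattern fully matched; collect every suffix start below this node.
    found, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "$start":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)


trie = build_suffix_trie("xabxac")
print(find_occurrences(trie, "xa"))  # [1, 4]
print(find_occurrences(trie, "xb"))  # []
```

P1 = xa is matched completely and the leaves below the match point give positions 1 and 4; P2 = xb fails after the first character, so it does not occur in T.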

Suffix array
  Suffix array: a sorted list of the suffixes of a given string; the start positions are listed in lexicographical (alphabetical) order of the suffixes
  Straightforward implementation: O(m^2 log m), reduced to O(m log m) (utilizing partial sorts); m: the length of the text
  A suffix array enables binary search for any substring, e.g. CAD: O(n log m), reduced to O(n + log m) when using LCP (longest common prefix) information; n: the length of the pattern
  A suffix array is more compact than a suffix tree
  Example for ABRACADABRA# (start position, suffix):
    11  #
    10  A#
     7  ABRA#
     0  ABRACADABRA#
     3  ACADABRA#
     5  ADABRA#
     8  BRA#
     1  BRACADABRA#
     4  CADABRA#
     6  DABRA#
     9  RA#
     2  RACADABRA#
  webglimpse.net/pubs/suffix.pdf
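A short Python sketch of the straightforward construction and of the binary search, using the ABRACADABRA# example. The function names are hypothetical, and the search is the plain O(n log m) version without LCP information.

```python
def build_suffix_array(text):
    """Straightforward construction: sort start positions by their suffixes."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_pattern(text, suffix_array, pattern):
    """Return the sorted 0-based start positions of pattern, via binary search."""
    n = len(pattern)

    def prefix(rank):
        # First n characters of the suffix with the given rank.
        start = suffix_array[rank]
        return text[start:start + n]

    # Lower bound: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


text = "ABRACADABRA#"
sa = build_suffix_array(text)
print(sa)                              # [11, 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2]
print(find_pattern(text, sa, "CAD"))   # [4]
print(find_pattern(text, sa, "ABRA"))  # [0, 7]
```

All suffixes that start with the pattern form one contiguous range of the suffix array, so two binary searches are enough to find every occurrence.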

Ordered MUM selection
  G1: 1 2 3 4 ...
  G2: A B C D ...
  MUMs: <1,A>, <2,C>, <3,B>, <4,D>
  Possible selections: <1,A>, <2,C>, <4,D> or <1,A>, <3,B>, <4,D>
  Then process non-matched regions (by a dynamic programming algorithm)
  See more at www.cs.rice.edu/~nakhleh/comp571/genomealignment.ppt

LIS algorithm
  The positions in genome B are given by the sequence 1, 3, 2, 4, 6, 7, 5
  An LIS (longest increasing subsequence) is: 1, 2, 4, 6, 7
  The LIS problem can be solved by a dynamic programming algorithm
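The dynamic programming idea can be sketched in Python as follows. This is the textbook O(n^2) formulation, not necessarily the LIS variant MUMmer uses, and the function name is an assumption. For ordered MUM selection, the input would be the genome B positions of the MUMs after sorting them by their genome A positions.

```python
def longest_increasing_subsequence(seq):
    """Return one longest strictly increasing subsequence of seq (O(n^2) DP)."""
    if not seq:
        return []
    n = len(seq)
    length = [1] * n   # length[i]: length of the best subsequence ending at i
    prev = [-1] * n    # prev[i]: index of the previous element in that subsequence
    for i in range(n):
        for j in range(i):
            if seq[j] < seq[i] and length[j] + 1 > length[i]:
                length[i] = length[j] + 1
                prev[i] = j
    # Trace back from the end of a longest subsequence.
    end = max(range(n), key=lambda i: length[i])
    result = []
    while end != -1:
        result.append(seq[end])
        end = prev[end]
    return result[::-1]


lis = longest_increasing_subsequence([1, 3, 2, 4, 6, 7, 5])
print(len(lis), lis)
```

The example sequence has more than one LIS of length 5 (1, 2, 4, 6, 7 and 1, 3, 4, 6, 7); which one is returned depends on how ties are broken during the traceback.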

Mauve
  Mauve is a system for efficiently constructing multiple genome alignments in the presence of large-scale evolutionary events
  Identifies conserved genomic regions, rearrangements and inversions in conserved regions, and the exact sequence breakpoints of such rearrangements across multiple genomes.
  Also performs traditional multiple alignment of conserved regions to identify nucleotide substitutions and indels, using the progressive dynamic programming approach of CLUSTAL W

Mauve's anchor selection algorithm
  Relaxes the anchor selection method: does not assume that the genomes under study are collinear
  Identifies and aligns regions of local collinearity called locally collinear blocks (LCBs)
  Each LCB is a homologous region of sequence shared by two or more of the genomes under study
  An LCB does not contain any rearrangements of homologous sequence

Mauve algorithm
  1. Find local alignments (multi-MUMs), using a seed-and-extend hashing method (time complexity O(G^2 n + G n log(G n)), where G is the number of genomes and n the average genome length).
  2. Use the multi-MUMs to calculate a phylogenetic guide tree.
  3. Select a subset of the multi-MUMs to use as anchors; these anchors are partitioned into collinear groups called LCBs, using a greedy breakpoint elimination algorithm.
  4. Perform recursive anchoring to identify additional alignment anchors within and outside each LCB.
  5. Perform a progressive alignment of each LCB using the guide tree.
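Step 3 can be illustrated with a heavily simplified Python sketch for two genomes only. This is not Mauve's implementation: the anchor tuple layout `(pos1, pos2, strand, length)`, the function names, and the choice of block weight (the sum of anchor lengths) are all assumptions made for illustration. Anchors are grouped into collinear blocks, and the lightest block is removed repeatedly until every remaining block reaches a minimum weight.

```python
def collinear_blocks(anchors):
    """Partition anchors into maximal collinear runs (pairwise case).

    Each anchor is a tuple (pos1, pos2, strand, length) with strand +1 or -1.
    Two anchors adjacent in genome 1 stay in the same block if they have the
    same strand and are also adjacent in genome 2 in the matching direction.
    """
    by_genome1 = sorted(anchors)
    by_genome2 = sorted(by_genome1, key=lambda a: a[1])
    rank2 = {anchor: rank for rank, anchor in enumerate(by_genome2)}
    blocks = []
    for anchor in by_genome1:
        if blocks:
            previous = blocks[-1][-1]
            step = rank2[anchor] - rank2[previous]
            if anchor[2] == previous[2] and step == anchor[2]:
                blocks[-1].append(anchor)
                continue
        blocks.append([anchor])   # a breakpoint: start a new block
    return blocks


def block_weight(block):
    """Weight of a block: total length of its anchors (an assumed definition)."""
    return sum(anchor[3] for anchor in block)


def greedy_breakpoint_elimination(anchors, min_weight):
    """Drop the lightest block until all remaining blocks weigh >= min_weight."""
    anchors = list(anchors)
    while anchors:
        blocks = collinear_blocks(anchors)
        lightest = min(blocks, key=block_weight)
        if block_weight(lightest) >= min_weight:
            return blocks
        dropped = set(lightest)
        anchors = [a for a in anchors if a not in dropped]
    return []


anchors = [(10, 10, 1, 50), (100, 100, 1, 60), (200, 500, 1, 5),
           (300, 200, 1, 40), (400, 300, 1, 70)]
print(len(collinear_blocks(anchors)))                     # 3
print(len(greedy_breakpoint_elimination(anchors, 20)))    # 1
```

In this made-up input, the short anchor at (200, 500) splits the other four anchors into two blocks. Removing it eliminates both breakpoints, and the remaining anchors merge into a single collinear block.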

Greedy breakpoint elimination in three genomes. Darling A C et al. Genome Res. 2004;14:1394-1403. ©2004 by Cold Spring Harbor Laboratory Press

An example of LCB identified among nine enterobacterial genomes. Darling A C et al. Genome Res. 2004;14:1394-1403

LCBs identified among concatenated chromosomes of the mouse, rat, and human genomes. Darling A C et al. Genome Res. 2004;14:1394-1403

Turnip vs cabbage: almost identical mtDNA gene sequences
  In the 1980s Jeffrey Palmer studied the evolution of plant organelles by comparing the mitochondrial genomes of cabbage and turnip (using physical mapping)
  99%-99.9% similarity between genes
  These nearly identical gene sequences differed in gene order
  This study helped pave the way to analyzing genome rearrangements in molecular evolution

Why we care about genome rearrangement
  Evolutionary and functional analysis
  Examples:
    Dynamics of Genome Rearrangement in Bacterial Populations, using comparison of eight Yersinia (pathogenic bacteria) genomes. PLoS Genet 4(7): e1000128, 2008
    Genome-wide DNA excision (Oxytricha trifallax destroys 95% of its germline genome during development, including the elimination of all transposon DNA, through an exaggerated process of genome rearrangement). Science, Vol. 324, no. 5929, pp. 935-938, 2009

Transforming cabbage into turnip

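Reversals and breakpoints
  Before the reversal: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
  After reversing the segment 4..8: 1, 2, 3, -8, -7, -6, -5, -4, 9, 10
  The reversal introduced two breakpoints (disruptions in order): between 3 and -8, and between -4 and 9.
  [Figure: the two gene orders drawn as paths through elements 1-10]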
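The reversal and the breakpoint count can be checked with a small Python sketch. The function names are hypothetical. A breakpoint is counted here wherever two neighbouring elements of the signed permutation, framed by 0 and n+1, are not consecutive (the right neighbour is not the left neighbour plus one).

```python
def apply_reversal(perm, i, j):
    """Reverse perm[i:j] (0-based, half-open) and flip the sign of each element."""
    return perm[:i] + [-x for x in reversed(perm[i:j])] + perm[j:]


def count_breakpoints(perm):
    """Count neighbouring pairs (a, b) with b != a + 1, after framing with 0 and n+1."""
    framed = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(framed, framed[1:]) if b - a != 1)


identity = list(range(1, 11))
reversed_perm = apply_reversal(identity, 3, 8)   # reverse the segment 4..8
print(reversed_perm)                     # [1, 2, 3, -8, -7, -6, -5, -4, 9, 10]
print(count_breakpoints(identity))       # 0
print(count_breakpoints(reversed_perm))  # 2
```

Inside the reversed segment the pairs such as (-8, -7) still differ by +1, so only the two segment boundaries count as breakpoints.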

Genome rearrangements
  [Figure: mouse (X chrom.) and human (X chrom.) diverging from an unknown ancestor ~75 million years ago]
  What are the similarity blocks and how to find them?
  What is the architecture of the ancestral genome?
  What is the evolutionary scenario for transforming one genome into the other?

Comparative genomic architectures: mouse vs human genome
  Humans and mice have similar genomes, but their genes are ordered differently
  ~245 rearrangements
    Reversals
    Fusions
    Fissions
    Translocations

History of Chromosome X. Rat Consortium, Nature, 2004

GRIMM
  Real genome architectures are represented by signed permutations
  Efficient algorithms to sort signed permutations have been developed
  The GRIMM web server computes the reversal distances between signed permutations: http://nbcr.sdsc.edu/grimm/mgr.cgi
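For very small signed permutations, the reversal distance can be found by exhaustive search. The Python sketch below is a brute-force breadth-first search, not the efficient sorting algorithm used by GRIMM; the function name is an assumption, and the running time grows exponentially with the number of elements and with the distance.

```python
from collections import deque


def reversal_distance(perm):
    """Minimum number of signed reversals that sort perm into 1, 2, ..., n.

    Brute-force breadth-first search; only practical for tiny inputs.
    """
    start = tuple(perm)
    target = tuple(range(1, len(perm) + 1))
    if start == target:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        current, dist = queue.popleft()
        n = len(current)
        # Try every reversal of a contiguous segment current[i:j].
        for i in range(n):
            for j in range(i + 1, n + 1):
                flipped = tuple(-x for x in reversed(current[i:j]))
                nxt = current[:i] + flipped + current[j:]
                if nxt == target:
                    return dist + 1
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
    return None  # not reached: every signed permutation can be sorted


print(reversal_distance([1, 2, 3, -8, -7, -6, -5, -4, 9, 10]))  # 1
print(reversal_distance([2, 1]))                                # 3
```

The first call recovers the single reversal from the reversals-and-breakpoints example. The second shows that signs matter: swapping two elements of a signed permutation takes three reversals, for example (2, 1) to (-1, -2) to (1, -2) to (1, 2).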