Multiple Whole Genome Alignment


Multiple Whole Genome Alignment BMI/CS 776 www.biostat.wisc.edu/bmi776/ Spring 2016 Anthony Gitter gitter@biostat.wisc.edu These slides, excluding third-party material, are licensed under CC BY-NC 4.0 by Mark Craven, Colin Dewey, and Anthony Gitter

Goals for Lecture
Key concepts
- the large-scale multiple-alignment task
- progressive alignment
- breakpoint identification
- undirected graphical models
- minimal spanning trees/forests

Multiple Whole Genome Alignment: Task Definition
Given
- A set of n > 2 genomes (or other large-scale sequences)
Do
- Identify all corresponding positions between all genomes, allowing for substitutions, insertions/deletions, and rearrangements

Progressive Alignment
- Given a guide tree relating n genomes
- Construct multiple alignment by performing n - 1 pairwise alignments
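A minimal sketch of the traversal behind progressive alignment, assuming a binary guide tree written as nested tuples. The names `progressive_align` and `toy_align` are hypothetical, and `toy_align` is only a placeholder for a real pairwise or profile aligner. The point is that a post-order walk performs one alignment per internal node, which for n leaves is n - 1 alignments.

```python
def progressive_align(tree, align_pair):
    """Post-order traversal of a binary guide tree.

    Returns (alignment, number_of_pairwise_alignments_performed).
    A leaf is a genome name; an internal node is a (left, right) tuple.
    """
    if isinstance(tree, str):
        return tree, 0
    left, right = tree
    left_aln, left_count = progressive_align(left, align_pair)
    right_aln, right_count = progressive_align(right, align_pair)
    # One pairwise alignment per internal node of the guide tree
    return align_pair(left_aln, right_aln), left_count + right_count + 1


def toy_align(a, b):
    # Placeholder (assumption): a real aligner would merge two alignments here
    return f"({a}+{b})"


# Hypothetical guide tree for the five species in the MLAGAN example
guide_tree = ((("human", "chimpanzee"), ("mouse", "rat")), "chicken")
alignment, n_pairwise = progressive_align(guide_tree, toy_align)
print(alignment, n_pairwise)
```

With five genomes the walk performs four alignments, matching the n - 1 count.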

Progressive Alignment: MLAGAN Example
[Figure: guide tree over human, chimpanzee, mouse, rat, and chicken]
- align pairs of sequences
- align multi-sequences (alignments)
- align multi-sequence with sequence

Progressive Alignment: MLAGAN Example
Suppose we're aligning the multi-sequence X/Y with Z.
1. anchors from X-Z and Y-Z become anchors for X/Y-Z
2. overlapping anchors are reweighted
3. LIS (longest increasing subsequence) algorithm is used to chain anchors
Figure from: Brudno et al. Genome Research, 2003
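The chaining step can be illustrated with a weighted variant of the increasing-subsequence idea. This is a sketch under stated assumptions, not MLAGAN's implementation: anchors are hypothetical `(pos_in_XY, pos_in_Z, weight)` triples, and a simple quadratic dynamic program stands in for the faster LIS algorithm.

```python
def chain_anchors(anchors):
    """Return the highest-weight chain of anchors that increases in both coordinates.

    anchors: list of (x, y, weight) tuples (hypothetical representation).
    """
    anchors = sorted(anchors)
    n = len(anchors)
    if n == 0:
        return []
    best = [a[2] for a in anchors]  # best chain weight ending at each anchor
    prev = [-1] * n                 # back-pointers for reconstructing the chain
    for j in range(n):
        for i in range(j):
            colinear = anchors[i][0] < anchors[j][0] and anchors[i][1] < anchors[j][1]
            if colinear and best[i] + anchors[j][2] > best[j]:
                best[j] = best[i] + anchors[j][2]
                prev[j] = i
    end = max(range(n), key=best.__getitem__)
    chain = []
    while end != -1:
        chain.append(anchors[end])
        end = prev[end]
    return chain[::-1]


anchors = [(10, 12, 5.0), (20, 8, 9.0), (30, 25, 4.0), (40, 40, 6.0), (50, 30, 2.0)]
chain = chain_anchors(anchors)
print(chain)
```

Anchors that would force the chain to go backwards in either sequence are left out; here the chain skips `(10, 12)` in favor of the heavier `(20, 8)`.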

Genome Rearrangements
[Figure: two panels, each showing an ancestor and an extant species. Inversion: ancestor a b c d e, extant species a d c b e. Translocation: ancestor x y a b c d e, extant species d e a b c x y.]
- Can occur within a chromosome or across chromosomes
- Can have combinations of these events
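As a toy illustration of these events, a chromosome can be modeled as a list of segment labels. The functions `invert` and `translocate` are hypothetical helpers; strand orientation and cross-chromosome moves are deliberately not modeled.

```python
def invert(segments, start, end):
    """Reverse the order of segments[start:end] (strand flips are not modeled)."""
    return segments[:start] + segments[start:end][::-1] + segments[end:]


def translocate(segments, start, end, dest):
    """Cut segments[start:end] and reinsert it before index dest of the remainder."""
    block = segments[start:end]
    rest = segments[:start] + segments[end:]
    return rest[:dest] + block + rest[dest:]


ancestor = list("abcde")
inverted = invert(ancestor, 1, 4)        # a d c b e
moved = translocate(ancestor, 3, 5, 0)   # d e a b c
print(" ".join(inverted))
print(" ".join(moved))
```

Combinations of events are just compositions of these functions, which is why segment order alone can become hard to interpret.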

Mercator
- Orthologous segment identification: graph-based method
- Breakpoint identification: refine segment endpoints with a graphical model

Establishing Anchors Representing Orthologous Segments
- Anchors can correspond to genes, exons, or MUMs
- E.g., may do all-vs-all pairwise comparison of genes
- Construct graph with anchors as vertices and high-similarity hits as edges (weighted by alignment score)
[Figure: anchors on chromosomes, joined by weighted edges]

Rough Orthology Map
- k-partite graph with edge weights
- vertices = anchors, edges = sequence similarity

Greedy Segment Identification
for i = k to 2 do
    identify repetitive anchors (depends on number of high-scoring edges incident to each anchor)
    find best-hit anchor cliques of size i
    join colinear cliques into segments
    filter edges not consistent with significant segments

Mercator Example
- Repetitive elements (black anchors) are identified; 3-cliques (red and blue anchors) are found
- Segments are formed by red and blue anchors; inconsistent edges are filtered
- 2-cliques are found and incorporated into segments

Refining the Map: Finding Breakpoints
- Breakpoints: the positions at which genomic rearrangements disrupt colinearity of segments
- Mercator finds breakpoints by using inference in an undirected graphical model

Undirected Graphical Models
An undirected graphical model represents a probability distribution over a set of variables using a factored representation:

$$p(\mathbf{b}) = \frac{1}{Z} \prod_{C \in \text{cliques}} \phi_C(\mathbf{b}_C)$$

[Figure: undirected graph over the variables $B_1, \dots, B_7$]

- $B_i$: random variable
- $\mathbf{b}$: assignment of values to all variables (breakpoint positions)
- $\mathbf{b}_C$: assignment of values to the subset of variables in $C$
- $\phi_C$: function (called a potential) representing the compatibility of a given set of values
- $Z$: normalization term

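Undirected Graphical Models

$$p(\mathbf{b}) = \frac{1}{Z} \prod_{C \in \text{cliques}} \phi_C(\mathbf{b}_C)$$

[Figure: undirected graph over the variables $B_1, \dots, B_7$]

For the given graph:

$$p(\mathbf{b}) = \frac{1}{Z}\, \phi_1(b_1, b_3, b_5)\, \phi_2(b_1, b_6, b_7)\, \phi_3(b_2, b_4, b_6)$$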
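To make the factorization concrete, the sketch below evaluates p(b) for the three cliques {B1, B3, B5}, {B1, B6, B7}, {B2, B4, B6} by brute force. The binary domains and the `potential` function are assumptions chosen only for illustration; in Mercator the variables range over breakpoint coordinates and the potentials come from alignment scores.

```python
from itertools import product

# Cliques of the example graph, as tuples of variable indices
cliques = [(1, 3, 5), (1, 6, 7), (2, 4, 6)]


def potential(values):
    # Hypothetical compatibility score: favor cliques whose variables agree
    return 2.0 if len(set(values)) == 1 else 1.0


def unnormalized(b):
    """Product of clique potentials for a full assignment b (dict: index -> value)."""
    score = 1.0
    for clique in cliques:
        score *= potential(tuple(b[i] for i in clique))
    return score


variables = range(1, 8)
# Enumerate all 2^7 assignments (assumption: each variable is binary)
assignments = [dict(zip(variables, vals)) for vals in product([0, 1], repeat=7)]
Z = sum(unnormalized(b) for b in assignments)  # normalization term


def p(b):
    return unnormalized(b) / Z


all_zero = {i: 0 for i in variables}
print(p(all_zero), sum(p(b) for b in assignments))
```

Enumerating every assignment to compute Z is only feasible for toy models, which is one reason exact inference becomes hard as the number of variables and their domains grow.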

The Breakpoint Graph
[Figure: breakpoint graph whose vertices are numbered breakpoint regions]
An edge between two regions means that some prefix of one region and some prefix of the other should be aligned.

Breakpoint Undirected Graphical Model
Mercator frames the task of finding breakpoints as an inference task in an undirected graphical model:

$$p(\mathbf{b}) = \frac{1}{Z} \prod_{C \in \text{cliques}} \phi_C(\mathbf{b}_C)$$

- $\mathbf{b}$: configuration of breakpoints
- $\phi_C(\mathbf{b}_C)$: potential function representing the score of a multiple alignment of the sequences in clique $C$ for the breakpoints in $\mathbf{b}$

[Figure: breakpoint graph]

Breakpoint Undirected Graphical Model
[Figure: breakpoint regions]
- The possible values for a variable indicate the possible coordinates for a breakpoint
- The potential for a clique is a function of the alignment score for the breakpoint regions split at the breakpoints $\mathbf{b}_C$

Breakpoint Undirected Graphical Model

$$p(\mathbf{b}) = \frac{1}{Z} \prod_{C \in \text{cliques}} \phi_C(\mathbf{b}_C)$$

- Inference task: find the most probable configuration $\mathbf{b}$ of breakpoints
- Not tractable in this case
  - graph has a high degree of connectivity
  - multiple alignment is difficult
- So Mercator uses several heuristics

Making Inference Tractable in Breakpoint Undirected Graphical Model

$$p(\mathbf{b}) = \frac{1}{Z} \prod_{C \in \text{cliques}} \phi_C(\mathbf{b}_C)$$

- Assign potentials, based on pairwise alignments, to edges only:

$$p(\mathbf{b}) = \frac{1}{Z} \prod_{(i,j) \in \text{edges}} \phi_{i,j}(b_i, b_j)$$

- Eliminate edges by finding a minimum spanning forest, where edges are weighted by phylogenetic distance

Minimal Spanning Forest
- Minimal spanning tree (MST): a minimal-weight tree that connects all vertices in a graph
- Minimal spanning forest: a set of MSTs, one for each connected component
[Figure: breakpoint graph reduced to a spanning forest]
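A minimal spanning forest can be found with Kruskal's algorithm and a union-find structure. The sketch below is a generic implementation, not Mercator's code; the edge weights stand in for phylogenetic distances and are made up for the example.

```python
def minimum_spanning_forest(num_vertices, edges):
    """Kruskal's algorithm. edges: list of (weight, u, v) with 0-based vertex ids."""
    parent = list(range(num_vertices))

    def find(v):
        # Find the component representative, compressing the path as we go
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for weight, u, v in sorted(edges):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:  # edge joins two different components: keep it
            parent[root_u] = root_v
            forest.append((weight, u, v))
    return forest


# Hypothetical graph with two connected components: {0, 1, 2} and {3, 4}
edges = [(0.1, 0, 1), (0.4, 1, 2), (0.3, 0, 2), (0.2, 3, 4)]
forest = minimum_spanning_forest(5, edges)
print(forest)
```

Because Kruskal's algorithm never assumes the graph is connected, it yields one tree per connected component without any special handling.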

Breakpoint Finding Algorithm
1. construct breakpoint segment graph
2. weight edges with phylogenetic distances
3. find minimum spanning forest (MSF)
4. perform pairwise alignment for each edge in MSF
5. use alignments to estimate $\phi_{i,j}(b_i, b_j)$
6. perform max-product inference (similar to Viterbi) to find maximizing $b_i$
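The final step can be sketched as max-product message passing on one tree of the forest. This is an illustrative implementation under assumptions: `max_product_tree`, the tiny domains, and the `preferred` table that defines the edge potentials are all hypothetical, whereas in Mercator the potentials are estimated from pairwise alignments.

```python
def max_product_tree(domains, tree_edges, potential, root):
    """Return the assignment maximizing the product of edge potentials on a tree."""
    neighbors = {v: [] for v in domains}
    for u, v in tree_edges:
        neighbors[u].append(v)
        neighbors[v].append(u)
    best_child_value = {}  # (child, parent_value) -> best value for the child

    def upward(node, parent):
        # Best score of node's subtree for each value of node
        subtree = {x: 1.0 for x in domains[node]}
        for child in neighbors[node]:
            if child != parent:
                msg = upward(child, node)
                for x in domains[node]:
                    subtree[x] *= msg[x]
        if parent is None:
            return subtree
        # Message to the parent: maximize over this node's value
        message = {}
        for xp in domains[parent]:
            best_x = max(domains[node],
                         key=lambda x: potential(parent, node, xp, x) * subtree[x])
            best_child_value[(node, xp)] = best_x
            message[xp] = potential(parent, node, xp, best_x) * subtree[best_x]
        return message

    root_scores = upward(root, None)
    assignment = {root: max(domains[root], key=root_scores.get)}
    stack = [(root, None)]
    while stack:  # trace back the maximizing values, as in Viterbi
        node, parent = stack.pop()
        for child in neighbors[node]:
            if child != parent:
                assignment[child] = best_child_value[(child, assignment[node])]
                stack.append((child, node))
    return assignment


# Hypothetical edge potentials: each edge prefers one pair of values
preferred = {(0, 1): (2, 1), (1, 2): (1, 0)}


def potential(u, v, xu, xv):
    if (u, v) not in preferred:
        u, v, xu, xv = v, u, xv, xu
    return 3.0 if (xu, xv) == preferred[(u, v)] else 1.0


domains = {0: [0, 1, 2], 1: [0, 1, 2], 2: [0, 1, 2]}
best = max_product_tree(domains, [(0, 1), (1, 2)], potential, root=0)
print(best)
```

Running the routine once per tree of the forest gives a maximizing value for every breakpoint variable, because trees in different components share no variables.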

Comments on Whole-Genome Alignment Methods
- Employ common strategy
  - find seed matches
  - identify (sequences of) matches to anchor alignment
  - fill in the rest with standard methods (e.g. DP)
- Vary in what they (implicitly) assume about
  - the distance of sequences being compared
  - the prevalence of rearrangements
- Involve a lot of heuristics
  - for efficiency
  - because we don't know enough to specify a precise objective function (e.g. how costs should be assigned to various rearrangements)
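The first step of the common strategy, finding seed matches, can be sketched with an exact k-mer index. This is a generic illustration with hypothetical names and toy sequences; real aligners typically use suffix structures or spaced seeds.

```python
from collections import defaultdict


def find_seeds(seq_a, seq_b, k):
    """Return (i, j) pairs where seq_a[i:i+k] == seq_b[j:j+k]."""
    index = defaultdict(list)
    for i in range(len(seq_a) - k + 1):
        index[seq_a[i:i + k]].append(i)  # index every k-mer of seq_a
    seeds = []
    for j in range(len(seq_b) - k + 1):
        for i in index.get(seq_b[j:j + k], []):
            seeds.append((i, j))
    return seeds


seeds = find_seeds("ACGTACGGA", "TTACGGAC", 5)
print(seeds)
```

The two seeds here lie on the same diagonal, so a chaining step would merge them into a single anchor before the gaps are filled by dynamic programming.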