A greedy, graph-based algorithm for the alignment of multiple homologous gene lists

Similar documents
Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

THEORY. Based on sequence Length According to the length of sequence being compared it is of following two types

Sara C. Madeira. Universidade da Beira Interior. (Thanks to Ana Teresa Freitas, IST for useful resources on this subject)

CONCEPT OF SEQUENCE COMPARISON. Natapol Pornputtapong 18 January 2018

InDel 3-5. InDel 8-9. InDel 3-5. InDel 8-9. InDel InDel 8-9

Single alignment: Substitution Matrix. 16 march 2017

Phylogenies Scores for Exhaustive Maximum Likelihood and Parsimony Scores Searches

Tools and Algorithms in Bioinformatics

Effects of Gap Open and Gap Extension Penalties

Pairwise & Multiple sequence alignments

CISC 889 Bioinformatics (Spring 2004) Sequence pairwise alignment (I)

Study and Implementation of Various Techniques Involved in DNA and Protein Sequence Analysis

5. MULTIPLE SEQUENCE ALIGNMENT BIOINFORMATICS COURSE MTAT

BMI/CS 776 Lecture #20 Alignment of whole genomes. Colin Dewey (with slides adapted from those by Mark Craven)

Introduction to Bioinformatics Online Course: IBT

Quantifying sequence similarity

Sequence analysis and Genomics

In-Depth Assessment of Local Sequence Alignment

SUPPLEMENTARY INFORMATION

Sequence Alignments. Dynamic programming approaches, scoring, and significance. Lucy Skrabanek ICB, WMC January 31, 2013

Statistical Machine Learning Methods for Bioinformatics II. Hidden Markov Model for Biological Sequences

Multiple sequence alignment

BINF6201/8201. Molecular phylogenetic methods

Multiple Sequence Alignment, Gunnar Klau, December 9, 2005, 17:

Pairwise sequence alignment

10-810: Advanced Algorithms and Models for Computational Biology. microrna and Whole Genome Comparison

3. SEQUENCE ANALYSIS BIOINFORMATICS COURSE MTAT

Introduction to Bioinformatics

"Nothing in biology makes sense except in the light of evolution Theodosius Dobzhansky

Tools and Algorithms in Bioinformatics

Module: Sequence Alignment Theory and Applications Session: Introduction to Searching and Sequence Alignment


Overview Multiple Sequence Alignment

Algorithms in Bioinformatics

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Probalign: Multiple sequence alignment using partition function posterior probabilities

Pairwise Alignment. Guan-Shieng Huang. Dept. of CSIE, NCNU. Pairwise Alignment p.1/55

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Lecture 4: Evolutionary Models and Substitution Matrices (PAM and BLOSUM)

IMPLEMENTING HIERARCHICAL CLUSTERING METHOD FOR MULTIPLE SEQUENCE ALIGNMENT AND PHYLOGENETIC TREE CONSTRUCTION

Sequence Alignment Techniques and Their Uses

Sequence Database Search Techniques I: Blast and PatternHunter tools

Bioinformatics (GLOBEX, Summer 2015) Pairwise sequence alignment

Supplemental Data. Perea-Resa et al. Plant Cell. (2012) /tpc

Evaluation Measures of Multiple Sequence Alignments. Gaston H. Gonnet, *Chantal Korostensky and Steve Benner. Institute for Scientic Computing

Bioinformatics. Scoring Matrices. David Gilbert Bioinformatics Research Centre

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Bioinformatics tools for phylogeny and visualization. Yanbin Yin

A faster algorithm for the single source shortest path problem with few distinct positive lengths

Bioinformatics and BLAST

Graph Alignment and Biological Networks

Ch. 9 Multiple Sequence Alignment (MSA)

Dr. Amira A. AL-Hosary

Computational Biology

Multiple Whole Genome Alignment

Research Proposal. Title: Multiple Sequence Alignment used to investigate the co-evolving positions in OxyR Protein family.

Grundlagen der Bioinformatik, SS 08, D. Huson, May 2,

Early History up to Schedule. Proteins DNA & RNA Schwann and Schleiden Cell Theory Charles Darwin publishes Origin of Species

Biochemistry 324 Bioinformatics. Pairwise sequence alignment

Sequence Analysis 17: lecture 5. Substitution matrices Multiple sequence alignment

Inferring phylogeny. Constructing phylogenetic trees. Tõnu Margus. Bioinformatics MTAT

C E N T R. Introduction to bioinformatics 2007 E B I O I N F O R M A T I C S V U F O R I N T. Lecture 5 G R A T I V. Pair-wise Sequence Alignment

Sequence Bioinformatics. Multiple Sequence Alignment Waqas Nasir

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Multiple Sequence Alignment: A Critical Comparison of Four Popular Programs

Multiple Genome Alignment by Clustering Pairwise Matches

Chapter 11 Multiple sequence alignment

08/21/2017 BLAST. Multiple Sequence Alignments: Clustal Omega

Introduction to Bioinformatics Introduction to Bioinformatics

Phylogenetic inference

Motivating the need for optimal sequence alignments...

Moreover, the circular logic

Tree Building Activity

A New Similarity Measure among Protein Sequences

Sequence Analysis, '18 -- lecture 9. Families and superfamilies. Sequence weights. Profiles. Logos. Building a representative model for a gene.

A (short) introduction to phylogenetics

Alignment principles and homology searching using (PSI-)BLAST. Jaap Heringa Centre for Integrative Bioinformatics VU (IBIVU)

Theoretical Computer Science

EECS730: Introduction to Bioinformatics

Amira A. AL-Hosary PhD of infectious diseases Department of Animal Medicine (Infectious Diseases) Faculty of Veterinary Medicine Assiut

Network alignment and querying

Evolutionary Tree Analysis. Overview

EVOLUTIONARY DISTANCES

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Copyright 2000 N. AYDIN. All rights reserved. 1

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

Genomics and bioinformatics summary. Finding genes -- computer searches

SUPPLEMENTARY INFORMATION

INFORMATION-THEORETIC BOUNDS OF EVOLUTIONARY PROCESSES MODELED AS A PROTEIN COMMUNICATION SYSTEM. Liuling Gong, Nidhal Bouaynaya and Dan Schonfeld

Phylogeny: building the tree of life

Outline. Evolution: Speciation and More Evidence. Key Concepts: Evolution is a FACT. 1. Key concepts 2. Speciation 3. More evidence 4.

First generation sequencing and pairwise alignment (High-tech, not high throughput) Analysis of Biological Sequences

Cladistics and Bioinformatics Questions 2013

Biology Tutorial. Aarti Balasubramani Anusha Bharadwaj Massa Shoura Stefan Giovan

Large-Scale Genomic Surveys

Lecture Notes: Markov chains

I519 Introduction to Bioinformatics, Genome Comparison. Yuzhen Ye School of Informatics & Computing, IUB

Practical considerations of working with sequencing data

BLAST. Varieties of BLAST

Phylogenetic Analysis. Han Liang, Ph.D. Assistant Professor of Bioinformatics and Computational Biology UT MD Anderson Cancer Center

Transcription:

A greedy, graph-based algorithm for the alignment of multiple homologous gene lists Jan Fostier, Sebastian Proost, Bart Dhoedt, Yvan Saeys, Piet Demeester, Yves Van de Peer, and Klaas Vandepoele Bioinformatics 27 (2011) 749 756. Speaker: Joseph Chuang-Chieh Lin Department of Computer Science and Information Engineering National Chung Cheng University. Taiwan March 30, 2011 Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 1 / 43

Self introduction Basic information Name: Joseph, Chuang-Chieh Lin Birth: 5 Dec. 1979 at Tainan City Married since 2007 & one daughter Hobbies: running, reading, & eating Homepage: http://www.cs.ccu.edu.tw/ lincc Email addresses: lincc@cs.ccu.edu.tw josephcclin@gmail.com Education background Bachelor of Science: Mathematics, National Cheng Kung University, 1998 2002 Master of Science: CSIE, National Chi Nan University 2002 2004, Thesis supervisor: Prof. Richard Chia-Tung Lee Ph.D. (dissertation in progress): CSIE, National Chung Cheng University 2004 2011, Dissertation supervisors: Prof. Maw-Shang Chang & Prof. Peter Rossmanith Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 2 / 43

Self introduction (contd.) Research interests Fixed-parameter algorithms Randomized algorithms (property testing) Graph theory & algorithms Computational biology (phylogenetic tree reconstruction) Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 3 / 43

Outline 1 Introduction Multiple sequence alignments Contribution of this paper 2 The basic procedure Conflicts & cycles Conflicts detection & resolution Faster heuristics for conflict resolution 3 Experimental results Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 4 / 43

Introduction Comparative genomics Comparative genomics: The study of the relationship of genome structure and function across different species. It exploits similarities & differences in the proteins, RNA, and regulatory regions (a DNA segment). An attempt to understand the evolutionary processes acting on genomes. Identification of homologous genomic regions is heavily relied. Using alignment tools. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 5 / 43

Introduction Multiple sequence alignments Sequence alignment S 1 : A R G C R C S 2 : A G A G C S 1 : A R G C R C S 2 : A G A G C Using dynamic programming (table-filling method) M(i,j) = M(i 1,j 1) + σ(s 1 (i),s 2 (j)), min M(i 1,j) + σ(s 1 (i), ), M(i,j 1) + σ(,s 2 (j)). σ(, ): score function (e.g., PAM 250, BLOSUM 62) O(l 2 ) time. l: the maximum sequence length. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 6 / 43

Introduction Multiple sequence alignments Multiple sequence alignment S 1 : A R G C R C S 2 : A G A G C S 3 : R R C R G S 1 : A R G C R C S 2 : A G A G C S 3 : R R C R G M(i, j, k) = 8 M(i 1, j 1, k 1) + σ(s 1(i), S 2(j), S 3(k)), M(i 1, j 1, k) + σ(s 1(i), S 2(j), ), >< M(i 1, j, k 1) + σ(s 1(i),, S 3(k)), min M(i, j 1, k 1) + σ(, S 2(j), S 3(k)), M(i 1, j, k) + σ(s 1(i),, ), M(i, j 1, k) + σ(, S >: 2(j), ), M(i, j, k 1) + σ(,, S 3(k)). O((2 3 1)l 3 ) time. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 7 / 43

Introduction Multiple sequence alignments A heuristic: progressive MSA Main steps of progressive MSA: Compute pairwise sequence similarities; Create a phylogenetic tree (or use a user-defined tree); Use the phylogenetic tree to carry out a MSA. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 8 / 43

Introduction Multiple sequence alignments S 1 : A T G C T C S 2 : A G A G C S 2 : A G A G C S 3 : G G G G S 1 : A T G C T C S 3 : G G G G S 1 : A T G C T C S 2 : A G A G C Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 9 / 43

Introduction Multiple sequence alignments S 1 : A T G C T C S 2 : A G A G C S 2 : A G A G C S 3 : G G G G S 1 : A T G C T C S 3 : G G G G S 1 : A T G C T C S 2 : A G A G C S 3 : G G G G Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 10 / 43

Introduction Multiple sequence alignments Well known MSA tools CLUSTAL (W) (Higgins & Sharp 1988; Thompson et al, 1994) T-COFFEE (Notredame et al., 1998) MUSCLE (Edgar, 2004) MAFFT (Katoh et al., 2002) Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 11 / 43

Introduction Contribution of this paper Contribution of this paper Focus on alignment of multiple, mutually homologous genomic segments (gene lists). Homologous: derived from a common ancestor. The homologous relationships between individual genes established as a preprocessing step. A graph-based approach to create accurate alignments of homologous genomic segments. Find a maximal number of homologous genes. The detection of additional homologous genomic segments can be facilitated. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 12 / 43

Introduction Contribution of this paper Some remarks The MSA of gene lists is much different from that at the nucleotide or amino acid level. Nucleotides & amino acids alphabet size: nucleotides (4) amino acids (20). Through evolution: mainly undergo character substitutions. Genes (or chromosomes) alphabet size: combinations of nucleotides... Through evolution: mainly undergo gene loss/insertion, inversion, etc. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 13 / 43

Introduction Contribution of this paper Contribution of this paper (contd.) The alignment algorithm: a subroutine of the software i-adhore. i-adhore 3.0 is available at http://bioinformatics.psb.ugent.be/software original i-adhore Simillion et al. 2004. progressively applies the Needleman-Wunsch (pnw) aligner once a gap, always a gap. i-adhore 2.0 Simillion et al. 2008. Using a greedy, graph-based aligner (GG-aligner). Avoiding once a gap, always a gap. Correctly aligned genes are not more. i-adhore 3.0 Fostier et al. 2011. New greedy, graph-based aligner (GG2-aligner). Using heuristics to resolve conflicts. Outperform pnw and GG aligners. The main dataset of genome: Arabidopsis thaliana. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 14 / 43

Introduction Contribution of this paper Arabidopsis thaliana Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 15 / 43

Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 16 / 43

The graph structure (cf. Lenhof et al. 1999) Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 17 / 43

The graph structure (contd.) G(V,E, w): N genomic segments. vertices in V: genes. n i,j : the j-th gene on the i-th gene-list. n i,ai : the active node which is the a i -th gene on the i-th gene-list. Consecutive genes on a segment are connected through a (directed) edge. Homologous genes located on different lists are connected through a link (undirected edge). Weight w: only attributed to links. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 18 / 43

The basic procedure The basic procedure Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 19 / 43

The basic procedure The basic procedure Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 20 / 43

The basic procedure The basic procedure Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 21 / 43

The basic procedure The basic procedure Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 22 / 43

The basic procedure The basic procedure Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 23 / 43

The basic procedure The basic procedure Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 24 / 43

Conflicts & cycles Conflicts alignment of a link: aligning two nodes connected by a link. a conflict: a set of links such that the alignment of some of them prohibits the alignment of others. It can be resolved by removing one or more links. Though certain homologous genes will not be placed in the same column finally. Goal: minimize the number of misaligned genes. It s imperative to carefully select the link(s) to be removed. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 25 / 43

Conflicts & cycles Blocking paths/cycles & direct paths P B = {L 2, L 3, L 4 }: an elementary blocking path from u to v. P D = {L 2, L 3 }: an elementary direct path between x and y. C C = {L 1, P B }: an elementary blocking cycle. conflicts blocking cycle. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 26 / 43

Conflicts & cycles Blocking paths/cycles & direct paths (contd.) P B = {L 2, L 3, L 4, L 5 }: a blocking path from u to v (NOT elementary). C C = {L 1, P B }: a blocking cycle (NOT elementary). Yet C C = {L 3, L 4 } is an elementary blocking cycle. Removing either L 3 or L 4 resolves the conflict! Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 27 / 43

Conflicts & cycles Minimal conflicts & elementary blocking cycles a conflict is minimal: the removal of ANY link from its corresponding blocking cycle resolves ALL conflicts among the remaining links. Proposition Minimal conflicts corresponds to elementary blocking cycles and vice versa. A non-elementary blocking cycle corresponds to one or more minimal conflicts. In what follows, we only consider elementary blocking paths/cycles and minimal conflicts. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 28 / 43

Conflicts detection & resolution Active conflicts Consider a situation that no alignable set S can be found among the N active nodes in G. Let G = (V,E ) be the subgraph that only contains the active links. Proposition If no alignable set can be found among the N active nodes in G during the basic alignment procedure, then one conflict among the active links. Furthermore, no alignable set can be found among the same active nodes in G. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 29 / 43

Conflicts detection & resolution Heuristic by maximum flow For a link L st (linking s and t), we want to assess which degree they are connected maximum flow f st. Restrictions: Only passing through element blocking/direct paths s t. Links: no direction & capacity equal to their weight w. Edges: keep their directions & unlimited capacities. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 30 / 43

Conflicts detection & resolution Heuristic by maximum flow f D st (resp., f C st ): the max flow of s t through directed elementary paths (resp., elementary blocking paths). f C st = f st f D st. Let S Lst = f D st f C st f C ts. The link L with the lowest score S L will be removed. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 31 / 43

Conflicts detection & resolution Heuristic by maximum flow f C st = 2 for link L 1. f C ts = 0. f D st = 1. S L1 = 1 2 = 1. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 32 / 43

Conflicts detection & resolution Heuristic by maximum flow f C st = 1 for link L 2. f C ts = 0. f D st = 1. S L2 = 1 1 = 0. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 33 / 43

Conflicts detection & resolution Heuristic by maximum flow f C st = 1 for link L 3. f C ts = 0. f D st = 1. S L3 = 1 1 = 0. Thus we remove the link L 1. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 34 / 43

Conflicts detection & resolution Heuristic by maximum flow Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 35 / 43

Faster heuristics for conflict resolution Faster heuristics The computation of the max flow (f ): O(Ef ) time (Ford & Fulkerson 1962). Seeking for upper bounds on f C st is easier. Find f C st,ub and f C ts,ub. Then we derive a lower bound on S Lst : S Lst,LB = f D st,lb max{f C st,ub, f C ts,ub}, where we set f D st,lb = w(l st ). Select the active link L with the lowest lower bound S L,LB. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 36 / 43

Faster heuristics for conflict resolution Faster heuristics (contd.) Select the longest link. Let L st be an active link with t = n j,k. Select the active link incident to n j,k for which k a j is maximal. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 37 / 43

Faster heuristics for conflict resolution Summary of the heuristics To resolve a conflict, select one of the following active link for removal: RA (RAndom) RC (Random Conflict) RAC (Random Active Conflict) LL (Longest Link) LLBS (Lowest Lower Bound Score) LS (Lowest Score) a random link a random link involved in at least one (active or nonactive) conflict a random link involved in at least one active conflict the link L, involved in at least one active conflict, incident to node n j,k for which (k a j ) is maximum. the link L, involved in at least one active conflict, with the lowest lower bound score S L,LB. the link L, involved in at least one active conflict, with the lowest score S L. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 38 / 43

Experimental results Datasets Dataset I: Arabidopsis thaliana genome (2000) It contains both strongly diverged and more closely related homologous chromosomal regions. Running i-adhore on the genome separately. 921 sets of homologous genome segments are produced. N: varies from 2 to 11. Dataset II: Genomes of Arabidopsis thaliana, Populus trichocarpa (Tuskan et al., 2006) & Vitis vinifera (by Jaillon et al., 2007). Running i-adhore on the genomes. 7821 sets of homologous genome segments are produced. N: varies from 2 to 15. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 39 / 43

Experimental results Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 40 / 43

Experimental results Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 41 / 43

Experimental results LLBS is used as the default heuristic for i-adhore. Experiments on an extremely large dataset: Dozens of eukaryotic speices (Hubbard et al., 2009). This method can handle N = 50 (enough for practical uses). Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 42 / 43

Thank you. Joseph C.-C. Lin (CSIE, CCU, Taiwan) 30 March 2011 43 / 43