Network alignment and querying


Network biology minicourse (part 4): Algorithmic challenges in genomics
Network alignment and querying
Roded Sharan, School of Computer Science, Tel Aviv University

Multiple Species PPI Data
Rapid growth in the number of species measured.

Distilling Modules
Problem: Data is partial and noisy.

Being Comparative
Paradigm: Evolutionary conservation implies functional significance.
Conservation: similarity in sequence and interaction topology.

Main challenges
Local network alignment: detect conserved subnetworks across two (or more) networks.
Global network alignment: find a 1-1 mapping between networks.
Network querying: given a query subnetwork in species A, find similar instances in the network of species B.

Local Pairwise Alignment

Problem definition
Given two networks (of two species), find pairs of subnetworks (one from each species) that are significantly similar.
Similarity is measured both on vertices (sequence similarity) and edges (topological similarity).
Under certain formulations the problem contains subgraph isomorphism as a special case and is therefore NP-hard.

Network Alignment
Alignment graph:
Nodes: pairs of sequence-similar proteins, one per species.
Edges: conserved interactions.
The alignment graph facilitates the search for conserved subnetworks.
First introduced by Ogata et al. 00 and Kelley et al. 03.
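The alignment graph construction can be sketched in a few lines of Python. This is a minimal illustration and not PathBLAST itself: the function name `build_alignment_graph`, the input layout (two edge lists plus a list of sequence-similar pairs) and the strict rule that an edge requires the interaction to be present in both species are assumptions made for the example.

```python
from itertools import combinations

def build_alignment_graph(edges_a, edges_b, similar_pairs):
    """Build an alignment graph for two species (illustrative sketch).

    Nodes are (a, b) pairs of sequence-similar proteins, one per species.
    Two nodes are joined when the interaction is conserved, i.e. present
    in both input networks.
    """
    ea = {frozenset(e) for e in edges_a}  # undirected edges of species A
    eb = {frozenset(e) for e in edges_b}  # undirected edges of species B
    nodes = sorted(set(similar_pairs))
    edges = set()
    for (a1, b1), (a2, b2) in combinations(nodes, 2):
        if a1 == a2 or b1 == b2:
            continue  # the two nodes must use distinct proteins in each species
        if frozenset((a1, a2)) in ea and frozenset((b1, b2)) in eb:
            edges.add(((a1, b1), (a2, b2)))
    return nodes, edges

nodes, edges = build_alignment_graph(
    edges_a=[("a1", "a2"), ("a2", "a3")],
    edges_b=[("b1", "b2")],
    similar_pairs=[("a1", "b1"), ("a2", "b2"), ("a3", "b3")],
)
print(edges)
```

In this toy input only the a1-a2 / b1-b2 interaction is conserved, so the alignment graph has a single edge. The quadratic loop over node pairs is fine for a sketch; a real implementation would iterate over network edges instead.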

PathBLAST (Kelley et al. 03)
H. pylori: ~1500 PPIs
Yeast: ~15000 PPIs

Protein sequence similarity (bacteria vs. yeast)
The best match of bcp is Dot5, which does not interact with the pathway's proteins.

Identifying conserved complexes
Generalize single-species scoring. Given two protein subsets, one in each species, with a many-to-many correspondence between them, we wish that:
1. Each subset induces a dense subgraph.
2. Matched protein pairs are sequence-similar.
$$L(C, C') = L(C)\, L(C') \prod_{\text{matched } (u,v)} \frac{\Pr(S_{u,v} \mid u, v \text{ homologs})}{\Pr(S_{u,v} \mid \text{random})}$$
Recall:
$$L(C) = \prod_{(u,v) \in E'} \frac{p}{p(u,v)} \prod_{(u,v) \notin E'} \frac{1-p}{1-p(u,v)}$$
Sharan et al., JCB 2005
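The likelihood ratio score can be evaluated in log space, which turns the products into sums. The sketch below is one way to do that under stated assumptions: `p_dense` plays the role of p (the edge probability under the complex model), `p_null(u, v)` plays the role of p(u, v) under the null model, and `sim_ratio(u, v)` stands for the ratio Pr(S | homologs) / Pr(S | random). All function and parameter names are hypothetical.

```python
import math
from itertools import combinations

def log_likelihood_ratio(vertices, edges, p_dense, p_null):
    """log L(C): sum over all vertex pairs of the complex-vs-null log ratio."""
    edge_set = {frozenset(e) for e in edges}
    total = 0.0
    for u, v in combinations(sorted(vertices), 2):
        q = p_null(u, v)  # probability of the edge under the null model
        if frozenset((u, v)) in edge_set:
            total += math.log(p_dense / q)
        else:
            total += math.log((1.0 - p_dense) / (1.0 - q))
    return total

def log_pair_score(c1, e1, c2, e2, matches, p_dense, p_null, sim_ratio):
    """log L(C, C'): two single-species terms plus the sequence-similarity term."""
    score = log_likelihood_ratio(c1, e1, p_dense, p_null)
    score += log_likelihood_ratio(c2, e2, p_dense, p_null)
    score += sum(math.log(sim_ratio(u, v)) for u, v in matches)
    return score

# A fully connected triangle scores 3 * log(0.8 / 0.1) under a constant null.
triangle = log_likelihood_ratio(
    ["x", "y", "z"], [("x", "y"), ("y", "z"), ("x", "z")],
    p_dense=0.8, p_null=lambda u, v: 0.1,
)
print(triangle)
```

A dense subset scores positively and a sparse one negatively, which is what makes the score usable for a greedy search over candidate complexes.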

Evolutionary-based Scoring

A Word on PPI Evolution
PPI networks are shaped by duplication and indel events.
Indel events arise from mutations that change the protein surface and are much more frequent.
Koyuturk et al., JCB 2006

Scoring (MaWish)
The score of two aligned protein subsets is based on the match, mismatch and duplication events they induce.
Each event is associated with a parameter (heuristically set) which determines its relative weight.
Match reward: m s(u,v) s(u',v')
Indel penalty: -n s(u,v) s(u',v')
Duplication reward/penalty: d (s(u,u') - s)

Score improvement
NetworkBLAST (Sharan et al. 05): probabilistic scoring model.
M_C, the protein complex model: dense subnetwork.
M_N, the background (null) model: random subnetwork.
Likelihood ratio score: Score(A) = P(A | M_C) / P(A | M_N), and Score(A, B) = Score(A) Score(B).
MaWish (Koyuturk et al. 05): evolution-based scoring model.
Match: interaction in both species.
Mismatch: interaction in one of the species and not the other.
The score assigned to every two pairs of homologous proteins is proportional to their sequence similarity level.
[Figure: example with nodes A, B, C in Network 1 and Network 2, annotated Score(A ∪ B) = Score(A ∪ C) and Score(A ∪ B) = 1 = Score(A ∪ C).]

Score improvement (cont.)
P(A, B | M_C)
[Figure: link loss and gain between the hypothetical PPI network of the closest common ancestor and Network 1 and Network 2.]
Hirsh et al., ECCB 2006

Local multiple alignment

3-way comparison?
Species           Proteins  Interactions
S. cerevisiae     4389      14319
C. elegans        2718      3926
D. melanogaster   7038      20720
Sharan et al., PNAS 2005

Generalizing Network Alignment
The alignment graph is extensible to multiple species.
Likelihood scoring is easily extensible, except for the sequence similarity terms, which require scoring a multiple sequence alignment.
Ignored till now: the need to balance edge and vertex terms.
Practical solution:
Set a sensible threshold for sequence similarity.
Nodes in the alignment graph are filtered accordingly.
Node terms are removed from the score.

71 conserved regions: 183 significant clusters and 240 significant paths.

Interaction Prediction
A pair of proteins is predicted to interact if:
1. Sequence-similar proteins interact in the other two species.
2. The proteins co-occur in the same conserved complex.

Species  Sensitivity (%)  Specificity (%)  P-value  Strategy
Yeast    50               77               1e-25    [1]
Worm     43               82               1e-13    [1]
Fly      23               84               5e-5     [1]
Yeast    9                99               1e-6     [2]+[1]
Worm     10               100              6e-4     [2]+[1]
Fly      0.4              100              0.5      [2]+[1]
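The two prediction rules can be written as a small predicate. This is only a sketch of the logic stated on the slide: the function name `predict_interaction`, the nested-dictionary layout of `homologs`, and the representation of networks as sets of `frozenset` edges are assumptions for illustration.

```python
def predict_interaction(u, v, homologs, other_networks, conserved_complexes=None):
    """Return True if proteins u and v are predicted to interact.

    homologs[p][species] is the set of proteins in `species` that are
    sequence-similar to p; other_networks[species] is a set of frozenset
    edges. Passing conserved_complexes switches on rule 2 as well.
    """
    # Rule 1: in every other species, some homolog of u interacts
    # with some homolog of v.
    for species, edges in other_networks.items():
        hu = homologs.get(u, {}).get(species, set())
        hv = homologs.get(v, {}).get(species, set())
        if not any(frozenset((a, b)) in edges for a in hu for b in hv):
            return False
    # Rule 2 (optional): u and v co-occur in one conserved complex.
    if conserved_complexes is not None:
        return any(u in c and v in c for c in conserved_complexes)
    return True

homologs = {
    "y1": {"worm": {"w1"}, "fly": {"f1"}},
    "y2": {"worm": {"w2"}, "fly": {"f2"}},
}
networks = {
    "worm": {frozenset(("w1", "w2"))},
    "fly": {frozenset(("f1", "f2"))},
}
print(predict_interaction("y1", "y2", homologs, networks))
```

Adding rule 2 can only remove predictions, which is consistent with the table above: strategy [2]+[1] trades sensitivity for specificity.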

Experimental Validation
65 predictions for yeast using strategies [1]+[2] were tested in the lab.
Success rate: 40-52%.
This outperforms the interolog approach (Matthews et al. 01, Yu et al. 04), which achieves 16-31%.

The Scalability Problem
Network alignment scales as n^k (in time and space) for n proteins and k species, hence it is practical only for k = 2, 3 (and takes several hours).
Progressive alignment is fast (Graemlin by Flannick et al., GR 2006) but does not perform as well.
Main idea: imitate the greedy search without explicitly constructing the alignment graph.
Kalaev et al., RECOMB 2008

Scaling Up Network Alignment
Maintain a linear representation.
Observe: a network alignment node is a vertical path.
Given a current seed, use dynamic programming to identify the vertical path that contributes most to the score.
Complexity reduces to O(m 2^k)!

Network querying

Problem definition
Given a query graph Q and a network G, find the subnetwork of G that is aligned with Q such that the alignment has maximal score.

Isomorphic Alignment
The subnetwork found in species B is isomorphic to the query Q from species A.
Match of sequence-similar proteins.

Homeomorphic Alignment
The subnetwork found in species B is homeomorphic to the query Q from species A.
Match of sequence-similar proteins and deletion/insertion of degree-2 nodes.

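Score of Alignment
$$\text{Score} = \sum_{i} h(q_i, v_i) + \delta_d \cdot (\#\text{Del}) + \delta_i \cdot (\#\text{Ins}) + \sum_{(v_i, v_j)} w(v_i, v_j)$$
First term: sequence similarity scores for matches.
Second and third terms: penalties for deletions and insertions.
Last term: interaction reliability scores along the matched subnetwork.
[Figure: query nodes q_1, ..., q_6 aligned to network vertices, annotated with h(q_1,v_1), ..., h(q_6,v_6), one deletion penalty and one insertion penalty.]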
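As a worked example, the alignment score can be computed directly from its four ingredients. The sketch assumes that the penalties δ_d and δ_i are passed as negative numbers and that `h` and `w` are supplied as plain functions; the name `alignment_score` and its argument layout are hypothetical.

```python
def alignment_score(matches, path_edges, n_del, n_ins, h, w, delta_d, delta_i):
    """Sum of match scores, indel penalties and edge reliabilities.

    matches    : list of (query_node, network_vertex) pairs
    path_edges : list of network edges used by the aligned subnetwork
    n_del/n_ins: number of deleted query nodes / inserted network vertices
    delta_d/delta_i: per-event penalties (expected to be <= 0)
    """
    similarity = sum(h(q, v) for q, v in matches)
    reliability = sum(w(a, b) for a, b in path_edges)
    return similarity + delta_d * n_del + delta_i * n_ins + reliability

score = alignment_score(
    matches=[("q1", "v1"), ("q2", "v2")],
    path_edges=[("v1", "x"), ("x", "v2")],  # x is an inserted vertex
    n_del=0,
    n_ins=1,
    h=lambda q, v: 3.0,
    w=lambda a, b: 0.5,
    delta_d=-2.0,
    delta_i=-1.0,
)
print(score)  # 3 + 3 - 1 + 0.5 + 0.5 = 6.0
```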

Complexity
The network querying problem is NP-complete by reduction from subgraph isomorphism (in contrast to sequence querying!).
The naïve algorithm has O(n^k) complexity, where n is the size of the PPI network and k is the size of the query.
This is intractable for realistic values of n and k: n ~ 5000, k ~ 10.
A reduction in complexity can be achieved by:
Constraining the network [Pinter et al., Bioinformatics 05]
Allowing vertex repetitions
Constraining the query (fixed-parameter algorithms)
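A quick calculation shows why the naïve bound is out of reach for the values quoted above. The throughput of 10^9 candidates per second is an assumption chosen only to make the order of magnitude concrete.

```python
def naive_candidates(n: int, k: int) -> int:
    """Number of ways to assign k query nodes to n network vertices (n^k)."""
    return n ** k

count = naive_candidates(5000, 10)
print(f"{count:.3e} candidate assignments")

# Assumed throughput: one billion candidates per second.
seconds = count / 1e9
years = seconds / (3600 * 24 * 365)
print(f"about {years:.1e} years at 1e9 candidates per second")
```

The count is roughly 10^37, so even a very fast enumeration is hopeless, which motivates the three relaxations listed above.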

PathBLAST
Reduction to finding paths in an alignment graph.
Repetitions are possible.
No general handling of insertions/deletions.
Kelley et al., PNAS 03

DP-Based Approach
Use dynamic programming (à la sequence alignment): W(i,j) is the maximal score of a partial alignment of query nodes {1, ..., i} that ends at vertex j of the network.
$$W(i,j) = \max_{m} \begin{cases} W(i-1, m) + h(i,j) + w(m,j), & (m,j) \in E \quad \text{(match)} \\ W(i, m) + w(m,j) + \delta_i, & (m,j) \in E \quad \text{(insertion)} \\ W(i-1, j) + \delta_d & \text{(deletion)} \end{cases}$$
Shlomi et al., BMC Bioinformatics 06; Yang & Sze, JCB 07
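A minimal sketch of this kind of dynamic program for a path query follows. It fills W(i, j) using a match step (arrive at j from a neighbor m), a deletion step (skip query node i) and an insertion step (pass through an extra network vertex). Two choices here are assumptions of the sketch rather than details from the slide: the table carries an extra index t that bounds the number of insertions by `max_ins`, which keeps the insertion case well ordered, and the alignment always starts by matching query node 1. As in the plain DP, vertices may repeat along the path.

```python
NEG_INF = float("-inf")

def query_path_dp(k, vertices, weights, h, delta_i, delta_d, max_ins=1):
    """Best score of aligning a path query with k nodes to the network.

    weights : dict {(a, b): reliability} of undirected network edges
    h(i, j) : similarity of query node i (1-based) and network vertex j
    delta_i, delta_d : insertion / deletion penalties (expected <= 0)
    """
    nbrs = {v: [] for v in vertices}
    for (a, b), wt in weights.items():
        nbrs[a].append((b, wt))
        nbrs[b].append((a, wt))

    # W[i][t][j]: query prefix 1..i, t insertions used, path ends at vertex j.
    W = [[{v: NEG_INF for v in vertices} for _ in range(max_ins + 1)]
         for _ in range(k + 1)]
    for j in vertices:
        W[1][0][j] = h(1, j)

    for i in range(1, k + 1):
        for t in range(max_ins + 1):
            for j in vertices:
                best = W[i][t][j]
                if i > 1:
                    for m, wt in nbrs[j]:  # match query node i to j
                        best = max(best, W[i - 1][t][m] + h(i, j) + wt)
                    best = max(best, W[i - 1][t][j] + delta_d)  # delete node i
                if t > 0:
                    for m, wt in nbrs[j]:  # insert network vertex j
                        best = max(best, W[i][t - 1][m] + wt + delta_i)
                W[i][t][j] = best

    return max(W[k][t][j] for t in range(max_ins + 1) for j in vertices)

weights = {("a", "b"): 1.0, ("b", "c"): 1.0}
target = {1: "a", 2: "b", 3: "c"}
h = lambda i, j: 2.0 if target[i] == j else -5.0
print(query_path_dp(3, ["a", "b", "c"], weights, h, delta_i=-1.0, delta_d=-1.0))
```

On the toy network the best alignment matches the three query nodes to a, b, c for a score of 2 + 2 + 2 from similarity plus 1 + 1 from edge reliability.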

Cross-Species Comparison of Signaling Pathways
However, DP may introduce protein repetitions along the path.
Yang & Sze, JCB 07

QPath
Randomly color the network graph, run the DP algorithm with the query to find a high-scoring subnetwork, and repeat N times.
Shlomi et al., BMC Bioinformatics 06
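The random-coloring loop can be illustrated with a small color-coding sketch. Each trial assigns every network vertex one of k colors and runs a DP that only extends a path to a vertex whose color is not yet used, so no protein can repeat. This is a simplified illustration and not the QPath implementation: it handles matches only (no insertions or deletions), and the names `best_colorful_path` and `qpath_color_coding` are hypothetical.

```python
import random

def best_colorful_path(k, vertices, weights, h, coloring):
    """Best-scoring path of k distinctly colored vertices for one coloring."""
    nbrs = {v: [] for v in vertices}
    for (a, b), wt in weights.items():
        nbrs[a].append((b, wt))
        nbrs[b].append((a, wt))

    # table[(v, used_colors)] = best score of a colorful path ending at v
    table = {(v, frozenset([coloring[v]])): h(1, v) for v in vertices}
    for i in range(2, k + 1):
        nxt = {}
        for (m, used), score in table.items():
            for j, wt in nbrs[m]:
                c = coloring[j]
                if c in used:
                    continue  # color already on the path: would allow a repeat
                key = (j, used | {c})
                cand = score + h(i, j) + wt
                if cand > nxt.get(key, float("-inf")):
                    nxt[key] = cand
        table = nxt
    return max(table.values(), default=float("-inf"))

def qpath_color_coding(k, vertices, weights, h, trials=50, seed=0):
    """Repeat the colorful-path DP over independent random colorings."""
    rng = random.Random(seed)
    best = float("-inf")
    for _ in range(trials):
        coloring = {v: rng.randrange(k) for v in vertices}
        best = max(best, best_colorful_path(k, vertices, weights, h, coloring))
    return best

weights = {("a", "b"): 1.0, ("b", "c"): 1.0}
target = {1: "a", 2: "b", 3: "c"}
h = lambda i, j: 2.0 if target[i] == j else -5.0
print(best_colorful_path(3, ["a", "b", "c"], weights, h, {"a": 0, "b": 1, "c": 2}))
```

A fixed path on k vertices receives k distinct colors with probability k!/k^k in a single trial, so the number of repetitions N is chosen to make the chance of missing the optimal path acceptably small.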

Ideas can be generalized to tree queries and beyond (QNet).
[Figure: a query with nodes q_1, ..., q_7 aligned to network vertices.]
Dost et al., RECOMB 07

Is topology needed?
Bruckner et al., RECOMB 2009

TORQUE: Topology-free querying
Input:
Graph G = (V, E)
Color set {1, 2, ..., k}
A coloring of network vertices
Output: a connected subgraph that is colorful, i.e. one in which each color appears exactly once.

Algorithmic idea
Every connected subgraph has a spanning tree.
Every colorful connected subgraph therefore has a colorful spanning tree.
Instead of looking for a colorful subgraph, look for a colorful tree.
Two implemented approaches:
Dynamic programming (color coding)
ILP
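The colorful-tree idea lends itself to a short dynamic program over color subsets. The sketch below only decides whether a colorful connected subgraph exists; it is an illustration of the principle and not TORQUE's implementation. It assumes each vertex carries exactly one color in 0..k-1, and the function name is hypothetical. For each vertex v it records the color sets S for which some tree rooted at v uses exactly the colors in S, and merges trees of adjacent vertices when their color sets are disjoint.

```python
def has_colorful_connected_subgraph(vertices, edges, coloring, k):
    """True if some connected subgraph contains each of the k colors once."""
    full = (1 << k) - 1  # bitmask with all k colors set
    nbrs = {v: set() for v in vertices}
    for a, b in edges:
        nbrs[a].add(b)
        nbrs[b].add(a)

    # trees[v]: bitmasks of color sets realizable by a colorful tree rooted at v
    trees = {v: {1 << coloring[v]} for v in vertices}
    changed = True
    while changed:
        changed = False
        for v in vertices:
            for u in nbrs[v]:
                for s1 in list(trees[v]):
                    for s2 in list(trees[u]):
                        # Disjoint colors imply disjoint vertex sets, so joining
                        # the two trees by the edge (u, v) gives a larger tree.
                        if s1 & s2 == 0 and (s1 | s2) not in trees[v]:
                            trees[v].add(s1 | s2)
                            changed = True
    return any(full in trees[v] for v in vertices)

print(has_colorful_connected_subgraph(
    ["a", "b", "c"], [("a", "b"), ("b", "c")], {"a": 0, "b": 1, "c": 2}, k=3))
```

The table has at most 2^k entries per vertex, so the cost is exponential only in the number of colors and not in the network size, which is the point of searching for trees instead of arbitrary subgraphs.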

Comparison with QNet
[Figure: bar charts of the number of matches for TORQUE and QNet in four panels (human complexes in yeast, fly complexes in human, yeast complexes in human, rat complexes in fly); each bar is split into quality match, not a quality match, and no match found.]

Summary & the road ahead