Network Alignment 858L

Terms & Questions
Interolog: an interaction between proteins A and B in Species 1 whose homologs in Species 2 also interact.
- Are there conserved pathways?
- What is the minimum set of pathways required for life?
- Can we compare networks to develop an evolutionary distance?

Aligning Networks: Combining Sequence and Network Topology
Let $G_1 = (V_1, E_1), G_2, \dots, G_k$ be graphs, where $G_i = (V_i, E_i)$ gives noisy experimental estimates of the interactions between proteins in organism $i$, for $i = 1, \dots, k$. We also have a function $\mathrm{sim}(u, v)$, defined for $u \in V_i$ and $v \in V_j$, that gives the sequence similarity between $u$ and $v$.

Conservation => Functional Importance
If a structure has withstood millions of years of the randomizing process of mutation, then it likely has an important function. Here, structure means DNA sequence, protein sequence, protein shape, or network topology.
So the appearance of similar topology in two widely separated organisms indicates a real, fundamental set of interactions. Also, by comparing graphs we can transfer knowledge about one organism to another.

Local alignment:
1. Which nodes are dissimilar [low $\mathrm{sim}(u,v)$] but have similar neighbors or neighborhoods (e.g. Bandyopadhyay et al.)? These are functional orthologs: proteins that play the same role but may look very different.
2. Which edges are real and important, e.g. form a conserved pathway in the cell?
Global alignment: Singh et al., 2007 propose:
Maximum common subgraph: Find the largest graph $H$ that is isomorphic to subgraphs of two given graphs $G_1$ and $G_2$.

Graph Isomorphism: Given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, each with $n$ nodes, decide whether there is a one-to-one and onto function $f : V_1 \to V_2$ such that $(u,v) \in E_1 \iff (f(u), f(v)) \in E_2$. Not known to be NP-hard.
Subgraph Isomorphism: Given graphs $G_1 = (V_1, E_1)$ and $G_2 = (V_2, E_2)$, where $G_1$ has $k$ nodes and $G_2$ has $n > k$ nodes, decide whether there is a one-to-one function $f : V_1 \to V_2$ such that $(u,v) \in E_1 \implies (f(u), f(v)) \in E_2$. NP-complete.
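To make the two definitions concrete, here is a minimal brute-force sketch. The function names and the edge-list representation are assumptions made for illustration. The subgraph check takes the edge-preserving reading (every edge of the small graph must map to an edge of the large one), and it tries every one-to-one map, so it is only usable on tiny graphs.

```python
from itertools import permutations

def is_subgraph_isomorphic(small_nodes, small_edges, big_nodes, big_edges):
    """True if some one-to-one map f sends every small edge to a big edge."""
    big = {frozenset(e) for e in big_edges}  # undirected edges
    small_nodes = list(small_nodes)
    # Each ordered selection of distinct big nodes is one candidate map f.
    for image in permutations(list(big_nodes), len(small_nodes)):
        f = dict(zip(small_nodes, image))
        if all(frozenset((f[u], f[v])) in big for u, v in small_edges):
            return True
    return False

def is_isomorphic(nodes1, edges1, nodes2, edges2):
    """True if a bijection f maps edges to edges and non-edges to non-edges."""
    e1 = {frozenset(e) for e in edges1}
    e2 = {frozenset(e) for e in edges2}
    if len(nodes1) != len(nodes2) or len(e1) != len(e2):
        return False
    # With equal node and edge counts, an edge-preserving one-to-one map
    # is onto the edge set of the second graph, so it is an isomorphism.
    return is_subgraph_isomorphic(nodes1, edges1, nodes2, edges2)

triangle = [("x", "y"), ("y", "z"), ("x", "z")]
path4 = [(1, 2), (2, 3), (3, 4)]
k4 = [(1, 2), (1, 3), (1, 4), (2, 3), (2, 4), (3, 4)]
print(is_subgraph_isomorphic("xyz", triangle, [1, 2, 3, 4], k4))     # True
print(is_subgraph_isomorphic("xyz", triangle, [1, 2, 3, 4], path4))  # False
```

The loop visits up to n!/(n-k)! candidate maps, which is why practical network aligners rely on heuristics instead.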

PathBLAST: (Kelley et al, 2003)

PathBLAST Alignment Graph
Nodes correspond to homologous pairs (A, a), where A is from one species and a is from the other.
Edges come in 3 types:
- Direct. The A-B and a-b interactions are both present.
- Gap. Edge A-B is present, and a & b are separated by 2 hops.
- Mismatch. A & B are separated by 2 hops, and so are a & b.
(Kelley et al, 2003)
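The three edge types can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the PathBLAST implementation: the function names are hypothetical, "separated by 2 hops" is read as "not directly interacting but sharing a neighbor", a gap is allowed in either species, and pairs that reuse the same protein are skipped.

```python
from itertools import combinations

def neighbors(edges):
    """Adjacency sets for an undirected interaction list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def separation(adj, x, y):
    """1 if x and y interact, 2 if they share a neighbor, else None."""
    if y in adj.get(x, set()):
        return 1
    if adj.get(x, set()) & adj.get(y, set()):
        return 2
    return None

def alignment_edges(homolog_pairs, edges1, edges2):
    """Label each pair of alignment nodes as direct, gap, or mismatch."""
    adj1, adj2 = neighbors(edges1), neighbors(edges2)
    typed = {}
    for (A, a), (B, b) in combinations(homolog_pairs, 2):
        if A == B or a == b:  # assumption: require distinct proteins
            continue
        d1, d2 = separation(adj1, A, B), separation(adj2, a, b)
        if d1 is None or d2 is None:
            continue  # too far apart in at least one species
        if d1 == 1 and d2 == 1:
            typed[((A, a), (B, b))] = "direct"
        elif d1 == 2 and d2 == 2:
            typed[((A, a), (B, b))] = "mismatch"
        else:
            typed[((A, a), (B, b))] = "gap"
    return typed

# Toy networks (hypothetical proteins).
species1 = [("A", "B"), ("B", "C"), ("C", "X"), ("X", "D")]
species2 = [("a", "b"), ("b", "y"), ("y", "c"), ("c", "z"), ("z", "d")]
pairs = [("A", "a"), ("B", "b"), ("C", "c"), ("D", "d")]
for edge, kind in alignment_edges(pairs, species1, species2).items():
    print(edge, kind)
```

On the toy input this labels (A,a)-(B,b) as direct, (B,b)-(C,c) as gap, and (C,c)-(D,d) as mismatch.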

PathBLAST Scoring Function
$p(v)$ = probability that the proteins in $v$ are really homologs: $p(v) = \Pr[\text{Homology} \mid E_v]$, with $p(E_v \mid H)$ estimated from the $E_v$ distributions in COG.
$q(e) = \prod_{i \in e} \Pr[i]$, the product over the interaction edges contained within the alignment edge $e$.
Pr[interaction]:
1    0.1
2    0.3
3    0.9
$$S(P) := \sum_{v \in P} \log \frac{p(v)}{p_{\text{random}}} + \sum_{e \in P} \log \frac{q(e)}{q_{\text{random}}}$$
$P$ is a path in the alignment graph. $p_{\text{random}}$ and $q_{\text{random}}$ are the average values of $p(v)$ and $q(e)$ in the graph. A sum over logs is a product over scores.
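A small sketch of the score, assuming the node probabilities p(v) and edge probabilities q(e) have already been computed and stored in dictionaries. The function names, the dictionary layout, and the numeric values are hypothetical.

```python
import math

def edge_q(interaction_probs):
    """q(e): product of Pr[i] over the interactions inside alignment edge e."""
    return math.prod(interaction_probs)

def path_score(path, p, q, p_random, q_random):
    """S(P): sum of log-odds over the nodes and edges of an alignment path."""
    node_term = sum(math.log(p[v] / p_random) for v in path)
    edge_term = sum(math.log(q[(u, v)] / q_random)
                    for u, v in zip(path, path[1:]))
    return node_term + edge_term

# Hypothetical two-node path joined by a gap edge that contains two
# interactions with probabilities 0.9 and 0.1.
p = {"v1": 0.9, "v2": 0.9}
q = {("v1", "v2"): edge_q([0.9, 0.1])}
print(path_score(["v1", "v2"], p, q, p_random=0.3, q_random=0.03))
```

Each term is positive when the node or edge is more probable than the graph average and negative otherwise, so a path scores above zero only if it beats the background overall.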

PathBLAST Search Procedure
If G is a directed acyclic graph (DAG), then it's easy to find a high-scoring path via dynamic programming. Let $S(v, L)$ be the score of the max-scoring path of length $L$ that ends at $v$:
$$S(v, L) = \max_{u \in \mathrm{pred}(v)} \left[ S(u, L-1) + \log \frac{p(v)}{p_{\text{random}}} + \log \frac{q(u \to v)}{q_{\text{random}}} \right]$$
Because the alignment graph is not a DAG, they randomly create a large number of DAGs as follows:
1. Randomly rank vertices.
2. Direct edges from low to high rank.
Run the dynamic program on the random DAGs and take the highest-scoring path.
A given path has a 2/L! chance of being preserved in one random DAG, so repeat 5L! times. The chance that it is missed in every repetition is then $(1 - 2/L!)^{5L!} \le e^{-10}$.
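The random-orientation search can be sketched as follows. This is a minimal illustration, not the published code: the function names are hypothetical, `node_score[v]` stands for log(p(v)/p_random), `edge_score` is keyed by unordered vertex pairs and stands for log(q/q_random), and "length L" is taken to mean a path with L vertices.

```python
import math
import random

def best_path_in_random_dag(nodes, edges, node_score, edge_score, L, rng):
    """One trial: rank vertices randomly, orient edges low -> high, run DP."""
    order = list(nodes)
    rng.shuffle(order)
    rank = {v: i for i, v in enumerate(order)}
    preds = {v: [] for v in nodes}
    for u, v in edges:
        if rank[u] > rank[v]:
            u, v = v, u
        preds[v].append(u)  # edge now points from lower to higher rank
    # S[v] = (score, path) of the best path ending at v in the current layer.
    S = {v: (node_score[v], (v,)) for v in nodes}
    for _ in range(L - 1):
        nxt = {}
        for v in nodes:
            cands = [(S[u][0] + node_score[v] + edge_score[frozenset((u, v))],
                      S[u][1] + (v,))
                     for u in preds[v] if u in S]
            if cands:
                nxt[v] = max(cands, key=lambda c: c[0])
        S = nxt
    return max(S.values(), key=lambda c: c[0], default=None)

def search(nodes, edges, node_score, edge_score, L, trials, seed=0):
    """Repeat the random-DAG trial and keep the best path seen."""
    rng = random.Random(seed)
    best = None
    for _ in range(trials):
        cand = best_path_in_random_dag(nodes, edges, node_score,
                                       edge_score, L, rng)
        if cand is not None and (best is None or cand[0] > best[0]):
            best = cand
    return best

# Toy alignment graph with made-up log-odds scores.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
node_score = {"a": 1.0, "b": 2.0, "c": 1.5, "d": -1.0}
edge_score = {frozenset(("a", "b")): 0.5, frozenset(("b", "c")): 1.0,
              frozenset(("c", "d")): 0.2, frozenset(("a", "c")): -0.5}
print(search(nodes, edges, node_score, edge_score, L=3,
             trials=5 * math.factorial(3)))
```

Because ranks strictly increase along every directed path, the dynamic program never revisits a vertex, which is what makes the per-trial search cheap.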

H. pylori & S. cerevisiae
- Find several (50) high-scoring paths.
- Then remove those edges & vertices and repeat.
- Overlay the identified paths.
Revealed 5 conserved pathways.
Contains proteins from both DNA polymerase and the proteasome => evidence that they interact. (Kelley et al, 2003)

Some Notes
Goal: use a well-studied organism (yeast) to learn about a less-studied organism (H. pylori).
There were only 7 directly shared edges between yeast & H. pylori (you would expect 2.5 shared edges).
- Gap & mismatch edges were essential!
Within conserved pathways, proteins often were not paired with the protein with the most similar sequence.
- 22% of the proteins in the previous figure did not pair with their best sequence match.
Single pathways in bacteria often correspond to multiple pathways in yeast. (Yeast is suspected of having undergone multiple whole-genome duplications.)

Yeast Paralogous Pathways
Proteins were not allowed to pair with themselves or their network neighbors.
MAP kinase signaling cascades were very common. All 120 yeast kinases share ~30% sequence similarity. Hence p(v) is similar for all pairs v of two kinases. The kinase matching procedure probably needs to be improved. (Kelley et al, 2003)

Searching
Local alignment can be used to search: align a small query network to the large network. (Kelley et al, 2003)

PathBLAST Summary
- Local graph alignment.
- Takes into account sequence similarity & topological patterns.
- Allows gaps and mismatches of length 1.
- Scoring function ~ probability of the path existing.
- Algorithm: fast, reasonable, but definitely a heuristic.
- Searching & local alignment are very related.