Integration of functional genomics data

Similar documents
Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

86 Part 4 SUMMARY INTRODUCTION

Networks & pathways. Hedi Peterson MTAT Bioinformatics

BMD645. Integration of Omics

Gene Ontology and overrepresentation analysis

Computational Genomics. Systems biology. Putting it together: Data integration using graphical models

Francisco M. Couto Mário J. Silva Pedro Coutinho

Bioinformatics. Transcriptome

Written Exam 15 December Course name: Introduction to Systems Biology Course no

Chemical Data Retrieval and Management

SABIO-RK Integration and Curation of Reaction Kinetics Data Ulrike Wittig

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Comparative Analysis of Nitrogen Assimilation Pathways in Pseudomonas using Hypergraphs

Computational methods for predicting protein-protein interactions

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Fuzzy Modeling of Metabolic Maps in MetNetDB *

Markov Random Field Models of Transient Interactions Between Protein Complexes in Yeast

Title 代謝経路からの酵素反応連続パターンの検出 京都大学化学研究所スーパーコンピュータシステム研究成果報告書 (2013), 2013:

ATLAS of Biochemistry

GOSAP: Gene Ontology Based Semantic Alignment of Biological Pathways

Metabolic networks: Activity detection and Inference

networks in molecular biology Wolfgang Huber

Systems II. Metabolic Cycles

Supplementary online material

Homology and Information Gathering and Domain Annotation for Proteins

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

A Protein Ontology from Large-scale Textmining?

Bioinformatics: Network Analysis

Context dependent visualization of protein function

2 GENE FUNCTIONAL SIMILARITY. 2.1 Semantic values of GO terms

DATA ACQUISITION FROM BIO-DATABASES AND BLAST. Natapol Pornputtapong 18 January 2018

A Database of human biological pathways

V19 Metabolic Networks - Overview

V14 extreme pathways

Cell and Molecular Biology

CS612 - Algorithms in Bioinformatics

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

Erzsébet Ravasz Advisor: Albert-László Barabási

Computational methods for the analysis of bacterial gene regulation Brouwer, Rutger Wubbe Willem

CSCE555 Bioinformatics. Protein Function Annotation

Genome-Scale Gene Function Prediction Using Multiple Sources of High-Throughput Data in Yeast Saccharomyces cerevisiae ABSTRACT

Predicting Protein Functions and Domain Interactions from Protein Interactions

Learning in Bayesian Networks

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

STRING: Protein association networks. Lars Juhl Jensen

Systems biology and biological networks

Bioinformatics: Network Analysis

PNmerger: a Cytoscape plugin to merge biological pathways and protein interaction networks

Outline. Terminologies and Ontologies. Communication and Computation. Communication. Outline. Terminologies and Vocabularies.

Title: HyBrow - A Prototype System for Computer-Aided Hypothesis Evaluation

Staphylococcus aureus subsp. aureus MRSA252 1

IMMERSIVE GRAPH-BASED VISUALIZATION AND EXPLORATION OF BIOLOGICAL DATA RELATIONSHIPS

Discovering molecular pathways from protein interaction and ge

Homology. and. Information Gathering and Domain Annotation for Proteins

METABOLIC PATHWAY PREDICTION/ALIGNMENT

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

Androgen-independent prostate cancer

Predicting rice (Oryza sativa) metabolism

Lecture 2. The Blast2GO annotation framework

Cross Discipline Analysis made possible with Data Pipelining. J.R. Tozer SciTegic

A Multiobjective GO based Approach to Protein Complex Detection

Research Article HomoKinase: A Curated Database of Human Protein Kinases

Merging and semantics of biochemical models

Improving Gene Functional Analysis in Ethylene-induced Leaf Abscission using GO and ProteInOn

Introduction to Bioinformatics

Chapter 7: Metabolic Networks

Comparative RNA-seq analysis of transcriptome dynamics during petal development in Rosa chinensis

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen

SUPPLEMENTARY INFORMATION

Lecture 3: A basic statistical concept

An introduction to SYSTEMS BIOLOGY

Cellular Function Prediction for Hypothetical Proteins Using High-Throughput Data

Metabolic modelling. Metabolic networks, reconstruction and analysis. Esa Pitkänen Computational Methods for Systems Biology 1 December 2009

Network Biology: Understanding the cell s functional organization. Albert-László Barabási Zoltán N. Oltvai

Evidence for dynamically organized modularity in the yeast protein-protein interaction network

New Computational Methods for Systems Biology

Biological Pathways Representation by Petri Nets and extension

Subsystem: TCA Cycle. List of Functional roles. Olga Vassieva 1 and Rick Stevens 2 1. FIG, 2 Argonne National Laboratory and University of Chicago

Supporting Information

Cluster Analysis of Gene Expression Microarray Data. BIOL 495S/ CS 490B/ MATH 490B/ STAT 490B Introduction to Bioinformatics April 8, 2002

Towards Detecting Protein Complexes from Protein Interaction Data

Hands-On Nine The PAX6 Gene and Protein

Computational Genomics. Reconstructing dynamic regulatory networks in multiple species

All Proteins Have a Basic Molecular Formula

Estimation of Identification Methods of Gene Clusters Using GO Term Annotations from a Hierarchical Cluster Tree

Phylogenetic Analysis of Molecular Interaction Networks 1

Self Similar (Scale Free, Power Law) Networks (I)

Genome-wide multilevel spatial interactome model of rice

BIOINFORMATICS. Mining literature for protein protein interactions. Edward M. Marcotte 1, 2, 3,, Ioannis Xenarios 1, and David Eisenberg 1

APPLICATION OF GRAPH BASED DATA MINING TO BIOLOGICAL NETWORKS CHANG HUN YOU. Presented to the Faculty of the Graduate School of

Biological networks CS449 BIOINFORMATICS

Analysis and visualization of protein-protein interactions. Olga Vitek Assistant Professor Statistics and Computer Science

GENE ONTOLOGY (GO) Wilver Martínez Martínez Giovanny Silva Rincón

Keywords: systems biology, microarrays, gene expression, clustering

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Functional Characterization and Topological Modularity of Molecular Interaction Networks

Introduction to Bioinformatics Online Course: IBT

METABOLISM. What is metabolism? Categories of metabolic reactions. Total of all chemical reactions occurring within the body

Types of biological networks. I. Intra-cellurar networks

Bioinformatics 2. Large scale organisation. Biological Networks. Non-biological networks

Transcription:

Integration of functional genomics data Laboratoire Bordelais de Recherche en Informatique (UMR) Centre de Bioinformatique de Bordeaux (Plateforme) Rennes Oct. 2006 1

Observations and motivations Genomics and functional genomics have expanded the focus of cellular biology from individual biomolecular entities towards relationships between those entities. Entities : genes, ORFs, proteins Relationships : interactions, complexes, pathways, networks This raises new types of questions and new requirements in terms of data integration. Rennes Oct. 2006 2

How to make sense out of new experimental data? 1. Purification of protein complexes. Is there a biological knowledge, or information, which significantly groups together the components of a complex? 2. Large scale expression profile analysis. Are there clusters of co-regulated genes that significantly correspond to known biological processes? Rennes Oct. 2006 3

Characteristics of the new questions The questions start with a set of biomolecular entities (query set). The answer should go further than collecting information attached to all members of the query set (what significantly groups members of the set?). Example : Analysis of a 13 proteins yeast complex 2 Valine, leucine and isoleucine biosynthesis (16) 3 Biodegradation of Xenobiotics (137) 1 Glycolysis / Gluconeogenesis (47) 2 Oxidative phosphorylation (70) 5 Non-enzymes (4312) KEGG Pathways Rennes Oct. 2006 4

Glutamate metabolism Rennes Oct. 2006 5

Difficulties related to biological information Heterogeneity : in terms of semantics (functional and structural information) and in terms of structures (numerical values, discrete attributes, natural language texts, ) Dissemination : annotated databases (UNIPROT, EMBL, KEGG, ), literature (MedLine, full text of articles), raw data sources (SMD, ArrayExpress, ). How to identify a biological criteria which significantly groups components of my query set? Rennes Oct. 2006 6

Proposed strategy Principles Use sets of genes, or gene s products, as a unified data structure Convert as much as possible of available biological knowledge into sets (known / target sets) Use a measure of similarity between sets in order to compare a query set with the target sets System Store all the target sets in a database Define a standard format to import new sets Develop a system that supports queries: comparison of one or several sets against the content of the database in order to fetch similarities Rennes Oct. 2006 7

Converting biological knowledge into sets Structural information (InterPro : 1 domain = 1 set) Functional classification (Kegg : 1 pathway = 1 set) Protein interactions (1 complex = 1 set) Cellular location (MIPS : 1 compartment = 1 set) Biliographical references (Pubmed : 1 article = 1 set) Expression data (GEO : 1 cluster = 1 set) Physico-chemical properties (a IP value range = 1 set) Genome structure (1 group of neighbors = 1 set) Rennes Oct. 2006 8

Principles of sets comparison Query set The probability of having at least the observed number of elements in common between sets provides with a similarity measure. Database of sets Target sets - Sets have to be taken from a define population (an organism). - Due to multiple comparisons, statistical correction is necessary (i.e. Bonferonni) in order to compute an Evalue. Rennes Oct. 2006 9

Organization of the sets Each set belongs to a criteria (i.e. physical proximity, a given expression data experiment, GO, etc ) For a given criteria, there are relationships between sets that can be described in a graph D A E B F Star graph (domains, biblio) Binary tree (Hierarchical clustering) Tree (Enzyme) G Directed acyclic Graph (Go, physical proximity) Rennes Oct. 2006 10

Single/Multiple query sets Query sets Database of sets collections Query set Rennes Oct. 2006 11

BlastSets system System up and running and publicly available at http://cbi.labri.fr/outils/blastsets/ Barriot, R., Poix, J., Groppi, A., Barre, A., Goffard, N., Sherman, D., Dutour, I. & de Daruvar, A. New strategy for the representation and the integration of biomolecular knowledge at a cellular scale. Nucleic Acids Res. Rennes Oct. 2006 12

Can BlastSets be usefull? ex: Large scale expression profile analysis. Are there clusters of co-regulated genes that significantly correspond to known biological processes? Rennes Oct. 2006 13

Interpretation of expression data Compute an automatic comparison of : sets obtained by hierarchical clustering of real expression data sets corresponding to metabolic pathways Compare BlastSets results (pathways that are found most significantly similar to a given node in the hierarchical tree) and published results (obtain by manual exploration of the hierarchical tree) Rennes Oct. 2006 14

Results Ferea et al. Glycolysis Tricarboxylic cycle Oxidative phosphorylation Metabolite transport BlastSets Ribosome Oxidative phosphorylation Citrate cycle (TCA cycle) ATP synthesis Galactose metabolism D-Arginine and D-ornithine metabolism Nucleotide sugars metabolism Decreasing similarity scores Rennes Oct. 2006 15

Results Ferea et al. Glycolysis Tricarboxylic cycle Oxidative phosphorylation Metabolite transport BlastSets Ribosome Oxidative phosphorylation Citrate cycle (TCA cycle) ATP synthesis Galactose metabolism D-Arginine and D-ornithine metabolism Nucleotide sugars metabolism Decreasing similarity scores Rennes Oct. 2006 16

BlastSets architecture Public databases download BlastSets server KEGG pathways SWISSPROT MIPS Stanford Microarray Database Gene Ontology flat files Internal XML format extract sets load in DB and index Apache Comparison module Sets DB PHP Web services index Internet protocol Web browser Java, Perl or Ruby program Rennes Oct. 2006 17

Knowledge representation : how to define sets? Simple for discrete attributes: Sub cellular compartments one compartment = one set Metabolic pathways one pathway = one set Multi-protein complexes one complex = one set Not simple otherwise how to choose the most appropriate clustering method? Rennes Oct. 2006 18

Comparing different representations Biological criterion 1 Biological criterion 2 Representation 1 Known correlation with biological criterion 1 Representation 2 Rennes Oct. 2006 19

Clustering expression profiles : Hierarchical clustering N-1 sets N genes Rennes Oct. 2006 20

Clustering expression profiles : Best neighbors N genes N x m sets m sets More groups => more information captured and more noise! Rennes Oct. 2006 21

Assessment procedure using protein complexes Hierarchical clustering expression profiles Neighbors clustering Complexes Hierarchical clustering Neighbors clustering Comp. 1 X X Comp. 2 Comp. 3 X X Comp. 4 X BlastSets... Total 2 3 Multiproteic complexes Rennes Oct. 2006 22

Results : nb. complexes similar to at least one expression cluster Hierarchic al clustering Spellman experiment 2 Gasch experiment 3 Neighborhood 60 Neighborhood 100 Hierarchic al clustering Neighborhood 60 Neighborhood 100 Number of sets 5629 56 300 78 820 5648 56 490 79 086 MIPS Complexes 1 (1059) Random complexes (1059) 48 51 14 56 89 20 0 0 0 0 0 0 Obtained using Bonferroni correction 1. MIPS Database Complex : http://mips.gsf.de/genre/proj/yeast/ 2. Spellman PT et al. 1998. "Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization". Mol Biol Cell 9(12) : 3273-97 3. Gasch AP et al. 2000. "Genomic expression programs in the response of yeast cells to environmental changes". Mol Biol Cell 11(12) : 4241-57 Rennes Oct. 2006 23

Project on pathways (collaboration with the KEGG) Assessment of various methods for representing metabolic pathways : One KEGG map = one set For each map : calculation of elementary modes each of which defines a set Rennes Oct. 2006 24

Conclusion / perspectives BlastSets implements the concept of neighborhoods (A. Danchin) in order to reveal potential relationships between heterogeneous information. The strategy requires optimization of knowledge representation. Some computational problems remain to be solved. Can the method be implemented as a service provides by the each data source? Rennes Oct. 2006 25

Partners ACI IMPBIO Centre de Bioinformatique Bordeaux (A. de Daruvar, A. Groppi, A. Barré) Laboratoire Bordelais de Recherche en Informatique (A. de Daruvar, I. Dutour, D. Sherman, R. Barriot, C. Gaugain) Laboratoire de Statistique Mathématique et Applications (J. Poix) Unité de Génétique des Génomes Bactériens, Institut Pasteur (A. Danchin) UMR INRA/UB2 Génomique Développement Pouvoir Pathogène (A. Blanchard) Other collaborations : INIST (A. Zasadzinski) KEGG (M. Kanehisa, J.M. Schwartz, J. Nacher) Rennes Oct. 2006 26