GROOLS: Reactive Graph Reasoning for Genome Annotation

Similar documents
CS 229 Project: A Machine Learning Framework for Biochemical Reaction Matching

ATLAS of Biochemistry

2 GENE FUNCTIONAL SIMILARITY. 2.1 Semantic values of GO terms

Gene Ontology and overrepresentation analysis

Integration of functional genomics data

Enumerating Precursor Sets of Target Metabolites in a Metabolic Network

Transitioning BioCyc to a Subscription Model

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Introduction to Bioinformatics

CISC 636 Computational Biology & Bioinformatics (Fall 2016)

Networks & pathways. Hedi Peterson MTAT Bioinformatics

Computational approaches for functional genomics

86 Part 4 SUMMARY INTRODUCTION

Computational Systems Biology

FUNCTION ANNOTATION PRELIMINARY RESULTS

Predicting rice (Oryza sativa) metabolism

Outline. Terminologies and Ontologies. Communication and Computation. Communication. Outline. Terminologies and Vocabularies.

Francisco M. Couto Mário J. Silva Pedro Coutinho

Biological Networks: Comparison, Conservation, and Evolution via Relative Description Length By: Tamir Tuller & Benny Chor

SABIO-RK Integration and Curation of Reaction Kinetics Data Ulrike Wittig

BMD645. Integration of Omics

AN EVIDENCE ONTOLOGY FOR USE IN PATHWAY/GENOME DATABASES

RGP finder: prediction of Genomic Islands

Knowledge representation DATA INFORMATION KNOWLEDGE WISDOM. Figure Relation ship between data, information knowledge and wisdom.

Gene Ontology. Shifra Ben-Dor. Weizmann Institute of Science

Types of biological networks. I. Intra-cellurar networks

Genome Annotation. Bioinformatics and Computational Biology. Genome sequencing Assembly. Gene prediction. Protein targeting.

RULE-BASED REASONING FOR SYSTEM DYNAMICS IN CELL SYSTEMS

Written Exam 15 December Course name: Introduction to Systems Biology Course no

FCModeler: Dynamic Graph Display and Fuzzy Modeling of Regulatory and Metabolic Maps

Biological Pathways Representation by Petri Nets and extension

Introduction to Bioinformatics

SPA for quantitative analysis: Lecture 6 Modelling Biological Processes

CSCE555 Bioinformatics. Protein Function Annotation

Goal 1: Develop knowledge and understanding of core content in biology

Grundlagen der Systembiologie und der Modellierung epigenetischer Prozesse

Chapter 3: Propositional Calculus: Deductive Systems. September 19, 2008

n logical not (negation) n logical or (disjunction) n logical and (conjunction) n logical exclusive or n logical implication (conditional)

METABOLIC PATHWAY PREDICTION/ALIGNMENT

Bioinformatics 2. Yeast two hybrid. Proteomics. Proteomics

Pathway Bioinformatics: Inference, Visualization, and Analysis. Peter D. Karp, Ph.D.

Proteomics. Yeast two hybrid. Proteomics - PAGE techniques. Data obtained. What is it?

International Journal of Scientific & Engineering Research, Volume 6, Issue 2, February ISSN

Using Atom Mapping Rules for an Improved Detection of Relevant Routes in Weighted Metabolic Networks. TORSTEN BLUM and OLIVER KOHLBACHER ABSTRACT

COMPARATIVE PATHWAY ANNOTATION WITH PROTEIN-DNA INTERACTION AND OPERON INFORMATION VIA GRAPH TREE DECOMPOSITION

An introduction to SYSTEMS BIOLOGY

The EcoCyc Database. January 25, de Nitrógeno, UNAM,Cuernavaca, A.P. 565-A, Morelos, 62100, Mexico;

Comparative Analysis of Nitrogen Assimilation Pathways in Pseudomonas using Hypergraphs

Bayesian Hierarchical Classification. Seminar on Predicting Structured Data Jukka Kohonen

Microbiome Metabolic Modeling with Pathway Tools

Discovering molecular pathways from protein interaction and ge

Gene Ontology and Functional Enrichment. Genome 559: Introduction to Statistical and Computational Genomics Elhanan Borenstein

GOSAP: Gene Ontology Based Semantic Alignment of Biological Pathways

Dynamic optimisation identifies optimal programs for pathway regulation in prokaryotes. - Supplementary Information -

CVO103: Programming Languages. Lecture 2 Inductive Definitions (2)

Learning in Bayesian Networks

Science Online Instructional Materials Correlation to the 2010 Biology Standards of Learning and Curriculum Framework

Metabolic modelling. Metabolic networks, reconstruction and analysis. Esa Pitkänen Computational Methods for Systems Biology 1 December 2009

Proteomics Systems Biology

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

Institute of Biotechnology, Shiraz University, Shiraz, Iran 2

UNIVERSITY OF YORK. BA, BSc, and MSc Degree Examinations Department : BIOLOGY. Title of Exam: Molecular microbiology

Homology Modeling. Roberto Lins EPFL - summer semester 2005

How much non-coding DNA do eukaryotes require?

Science Textbook and Instructional Materials Correlation to the 2010 Biology Standards of Learning and Curriculum Framework. Publisher Information

Heterogeneous Graph Mining for Biological Pattern Discovery in Metabolic Pathways

A Study of Correlations between the Definition and Application of the Gene Ontology

Lassen Community College Course Outline

Learning Goals of CS245 Logic and Computation

Unit # - Title Intro to Biology Unit 1 - Scientific Method Unit 2 - Chemistry

Computational methods for predicting protein-protein interactions

New Computational Methods for Systems Biology

APPENDIX II. Laboratory Techniques in AEM. Microbial Ecology & Metabolism Principles of Toxicology. Microbial Physiology and Genetics II

Boolean models of gene regulatory networks. Matthew Macauley Math 4500: Mathematical Modeling Clemson University Spring 2016

Title: HyBrow - A Prototype System for Computer-Aided Hypothesis Evaluation

MTopGO: a tool for module identification in PPI Networks

Introduction to Bioinformatics. Shifra Ben-Dor Irit Orr

Deduction by Daniel Bonevac. Chapter 3 Truth Trees

2012 Assessment Report. Biology Level 3

n Empty Set:, or { }, subset of all sets n Cardinality: V = {a, e, i, o, u}, so V = 5 n Subset: A B, all elements in A are in B

Metabolic pathway predictions for metabolomics: a molecular structure matching approach

Mapping and Filling Metabolic Pathway Holes

Comparative Network Analysis

PNmerger: a Cytoscape plugin to merge biological pathways and protein interaction networks

UniStra activities within the BigChem project:

C. Schedule Description: An introduction to biological principles, emphasizing molecular and cellular bases for the functions of the human body.

The bacterial interlocked process ONtology (BiPON): a systemic multi-scale unified representation of biological processes in prokaryotes

Supplementary methods

ANAXOMICS METHODOLOGIES - UNDERSTANDING

BIOINFORMATICS: An Introduction

Mapping and Filling Metabolic Pathways

13th International Conference on Relational and Algebraic Methods in Computer Science (RAMiCS 13)

The Changing Requirements for Informatics Systems During the Growth of a Collaborative Drug Discovery Service Company. Sally Rose BioFocus plc

objective functions...

Biological Networks. Gavin Conant 163B ASRC

SC55 Anatomy and Physiology Course #: SC-55 Grade Level: 10-12

STRING: Protein association networks. Lars Juhl Jensen

Gene Network Science Diagrammatic Cell Language and Visual Cell

Pairing Transitive Closure and Reduction to Efficiently Reason about Partially Ordered Events

Chemical Data Retrieval and Management

Transcription:

GROOLS: Reactive Graph Reasoning for Genome Annotation Jonathan Mercier 123 and David Vallenet 123 1 Direction des Sciences du Vivant, CEA, Institut de Génomique, Genoscope, LABGeM, Evry, France 2 CNRS-UMR8030, Evry, France 3 Université d Evry Val-d Essonne, Evry, France {jmercier,vallenet}@genoscope.cns.fr Abstract. GROOLS (Genomic Rule Oriented Object Logic System) is an expert system to help biologists in the evaluation of genome functional annotation through biological processes like metabolic pathways. Such reasoning is conducted using a Business Rule Management System (BRMS) working on a generic representation of biological knowledge that captures complex data and relationships. We use the Object-Oriented First Order Logic (OOFOL) extended to a four-valued logic of Belnap. Prior biological knowledge is organized in a directed acyclic graph and evaluated by applying reactive graph reasoning using observation facts. Two types of observations are considered: predictions from bioinformatics methods and assertions that correspond to experimental evidences in the studied organism. Once all facts are spread over the graph, a conclusion is made about the state of prior knowledge (e.g. confirmed presence, missing, unexpected absence). GROOLS implementation is based on the jboss DROOLS framework. Keywords: genome annotation, metabolic network reconstruction, business rules, object-oriented logic, four-valued logic 1 Introduction GROOLS (Genomic Rule Oriented Object Logic System) is an applied research project mixing biology, informatics and logic. It aims to standardize the use of biological results to annotate a genome. This expert system will help biologists (i.e. bio-annotators) to evaluate the quality of gene functional annotation through biological processes like metabolic pathways. Starting from a set of predicted genes, the bio-annotator tries to assign precise molecular functions to corresponding proteins by integrating various predictions from bioinformatics methods, which are mainly based on comparative sequence analysis with proteins having experimentally validated functions. This laborious task may lead to inconsistencies in the annotations due to the lack of experimental results and the difficulty to find a correct trade-off between sensitivity and specificity of the methods for functional inference.

2 Observations on the organism, like Biolog growth phenotype experiments [5], may help genome annotation notably for metabolic network reconstruction [3]. For example, if an organism is able to grow using a metabolite X then a metabolic pathway for compound X degradation is required in the organism as well as enzymes that catalyze the chemical reactions of the pathway. During the genome annotation process, the consistency and completeness of the predicted enzymatic functions could then be evaluated using this experimental results. Such reasoning can be conducted using a Business Rule Management System (BRMS) working on a generic representation of biological knowledge that captures complex data and relations like is-a and has-a. These prior knowledge objects are theories and are organized in a directed acyclic graph. In GROOLS, we use the Object-Oriented First Order Logic (OOFOL) [1] extended to a fourvalued logic of Belnap [4]. 2 Approach 2.1 Metabolic pathway representation Metabolic pathways are groups of chemical reactions taking part in a same biological process. As depicted in Figure 1, different pathway variants may occur in organisms to achieve a same global chemical transformation. These variants may have common or specific parts, called here reaction blocks. Fig. 1: A metabolic pathway with two variants to transform metabolite X to metabolite Y. These variants share two reaction blocks and are made of 4 reactions for VariantX1 and 3 reactions for VariantX2. In GROOLS, metabolic pathways with their variants made of reaction blocks will be represented as prior knowledge objects organized in a Directed Acyclic Graph (DAG) (Figure 2). Except for leaf nodes, nodes are flagged with (i) And

3 when a knowledge requires all its sub-knowledge to be present (ii) Or when a knowledge requires only one of its sub-knowledge to be present. Considering the example depicted in Figure 2, if the reactions 1, 3 and 4 are present (i.e. predicted by gene functional annotation) and the pathwayx is required (i.e. the organism degrades compound X) then VariantX1 can be considered as present with one missing reaction (reaction 2). Indeed, biological knowledge is often incomplete and such missing information is commonly called a pathway hole or an orphan enzyme : an enzymatic reaction that should occur in an organism but has no annotated gene to code the enzyme [8]. Fig. 2: Directed Acyclic Graph representation of a metabolic pathway. 2.2 Prior knowledge and observation model A bio-annotator uses at least three types of facts: (i) prediction facts are predictions from bioinformatics methods or human expertise made by integrating several method results (ii) assertion facts are experimental evidences in the studied organism (iii) prior knowledge gathers all metabolic pathways that were experimentally elucidated in at least one organism and represents the actual knowledge over years of cumulative empirical research in biology. Prediction and assertion facts will be considered as observations and will be used to make conclusions about the state of prior biological knowledge. As shown in Figure 3, we designed an object-oriented model with interfaces to ease the integration of heterogeneous objects from different external databases.

4 Fact getid() : String getname() : String getsource() : String getdate() : DateTime getpresence() : FourState Observation getpriorknowledgeid() : String getevidence() : Evidence PriorKnowledge getpartof() : PriorKnowledge[] getnodetype() : NodeType getconclusion() : Conclusion setconclusion( Conclusion conclusion ) : void setpresence( FourState presence ) : void 0..* 1 Assertion Prediction Fig. 3: Grools object-oriented model. Facts are Observations or Prior knowledge. The Prior knowledge is structured in a DAG. Two types of Observations are stored: bioinformatics Predictions or Assertions that are experimental observations. 2.3 Reasoning over structured data Prior knowledge will be evaluated through observations by applying reactive graph reasoning using the Object-Oriented First Order Logic (OOFOL) [1] extended to a four-valued logic of Belnap [4]. Predictions will directly impact leaf nodes of the DAG while assertions will generally impact root nodes. Prediction and assertions facts take four different values {present, absent, both, unknown} and {required, avoided, both, unknown} respectively, which correspond to {true, false, both, none} values of the four-valued logic. In a logic point of view, a class is a structure and an instance of a class (an object) is a theory. An object can contain other objects meaning that a theory is a graph of smaller theories. Each theory is defined by a coherent set of values. A class C is defined as a tuple L C, A C, I C. Where L is the vocabulary (class methods), A a set of axioms (attributes) and I a set of vocabulary to implement (I C L C ). The truth of a theory can be inferred using its sub-theories following a logic. This logic is designed to cope with various and contradictory information sources. A theory is true if all sub-theories are true. If all sub-theories are false then the theory is false. If some sub-theories are true and other false then the theory takes the value both to denote ambiguity. If any information matches a theory then the theory takes the state none (see Table 1). More exactly, a theory linked to sub-theories with a logical And (NodeType knowledge attribute) is evaluated using the following priority false > both > none > true. For a logical Or, the

5 priority is true > none > both > false. We are also evaluating an optimistic version of the logic where And priority is modified to true > false > both > none when sub-theories are linked to a single theory. This will help us to deal with incomplete predictions. Lastly, once all facts are spread over the DAG, a conclusion with 16 possible states is made on prior knowledge using assertion and prediction value combination (see Table 2). Table 1: A four-valued logic truth tables. T: true, F: false, B: both, N: none. F ( α) F ( α) T B N F F ( α) T B N F T F T T B N F T T T T T B B B B B F F B T B T B N N N N F N F N T T N N F T F F F F F F T B N F Assertion Table 2: Conclusion truth table. Prediction PRESENT ABSENT BOTH UNKNOWN REQUIRED Confirmed P. Unexpected A. Contradictory A. Missing AVOIDED Unexpected P. Confirmed A. Contradictory P. Absent BOTH Ambiguous P. Ambiguous A. Ambiguous C. Ambiguous UNKNOWN Unconfirmed P. Unconfirmed A. Unconfirmed C. Unknown Legend A. Absence P. Presence C. Contradiction 3 Conclusion and future work The GROOLS system is still under heavy development and evaluation. We hope that this tool will be useful for biologists to evaluate the overall coherence of individual predicted functions through the integration of additional information from biological processes like metabolic pathways. Through a collaborative project between INRIA and the Swiss Institute of Bioinformatics, a first prototype with similar deductive reasoning has been implemented in the HERBS system using Jess rule engine. GROOLS implementation is based on the jboss DROOLS framework which is an open rule-engine written in Java [7]. It natively supports an object-oriented language and uses the PHREAK algorithm to reason over structured data. It does a bridge between our Java application and the business logic. Genomic data with functional predictions and human expert annotations will be extracted from the Prokaryotic

6 Genome DataBase (PkGDB) of the MicroScope platform[9]. Metabolic pathways will be extracted from MetaCyc [2] and UniPathway [6] resources. References 1. Amir, E.: Object-oriented first-order logic. Linkiping Electronic Articles in Computer and Information Science (http://www.ida.liu.se/ext/etai) 4, 63 84 (1999), http://www.ep.liu.se/ej/etai/1999/008/ 2. Caspi, R., Altman, T., Billington, R., Dreher, K., Foerster, H., Fulcher, C.A., Holland, T.A., Keseler, I.M., Kothari, A., Kubo, A., et al.: The metacyc database of metabolic pathways and enzymes and the biocyc collection of pathway/genome databases. Nucleic Acids Res 42(D1), D459 D471 (2014) 3. Francke, C., Siezen, R.J., Teusink, B.: Reconstructing the metabolic network of a bacterium from its genome. Trends in microbiology 13(11), 550 558 (2005) 4. Jr, N.D.: A useful four-valued logic. In: Modern uses of multiple-valued logic, pp. 5 37. Springer (1977) 5. Mackie, A.M., Hassan, K.A., Paulsen, I.T., Tetu, S.G.: Biolog phenotype microarrays for phenotypic characterization of microbial cells. In: Environmental Microbiology, pp. 123 130. Springer (2014) 6. Morgat, A., Coissac, E., Coudert, E., Axelsen, K.B., Keller, G., Bairoch, A., Bridge, A., Bougueleret, L., Xenarios, I., Viari, A.: Unipathway: a resource for the exploration and annotation of metabolic pathways. Nucleic Acids Res. p. gkr1023 (2011) 7. Proctor, M., Neale, M., Lin, P., Frandsen, M.: Drools documentation. JBoss.org, Tech. Rep (2008) 8. Sorokina, M., Stam, M., Médigue, C., Lespinet, O., Vallenet, D.: Profiling the orphan enzymes. Biol. Direct 9(10) (2014) 9. Vallenet, D., Belda, E., Calteau, A., Cruveiller, S., Engelen, S., Lajus, A., Le Fevre, F., Longin, C., Mornico, D., Roche, D., Rouy, Z., Salvignol, G., Scarpelli, C., Thil Smith, A.A., Weiman, M., Medigue, C.: MicroScope an integrated microbial resource for the curation and comparative analysis of genomic and metabolic data. Nucleic Acids Res. 41(Database issue), D636 647 (2013)