Metrabase
The Metabolism and Transport Database
User manual v2015.02

Contents

1 Introduction
2 Metrabase content
2.1 Activities (interactions)
2.2 Proteins
2.3 Compounds
2.4 Data sources
3 The transporter substrate dataset
4 Usage

1 Introduction

Metrabase is a cheminformatics and bioinformatics resource that contains manually curated structural, physicochemical and biological data related to small molecule transport and metabolism. It offers structured and easily accessible data on interactions between proteins and chemical compounds, providing not only actions and measured activities, but also chemical structural information, tissue expression data and the negative action types that are essential in modelling activity.

By "easily accessible" we mean, first and foremost, easy to process computationally. Even when data is made available, many freely available resources lack an easy way to process it computationally (e.g. an online search and browse facility is offered, but no download, and permission barriers are imposed).

We aim to construct a comprehensive, thoroughly annotated and easy-to-use resource of high-quality small molecule metabolism and transport information. By covering the areas of biochemistry, pharmacology and toxicology in particular, we hope that diverse research communities will find Metrabase useful and valuable.

2 Metrabase content

Metrabase version 1.0 contains curated data related to human transport and metabolism of chemical compounds. Its primary content includes nearly 3500 small molecule substrates and modulators of transport proteins and, to a smaller extent, cytochrome P450 enzymes (CYPs).

Proteins                      Compounds   Interactions   References
20 transporters and 13 CYPs   3438        11649          1211
20 transporters               3307        11143          1177
13 CYPs                       212         506            36

The major focus of Metrabase v1.0 is on transport proteins: specifically, on their interactions with small molecules that were experimentally found to be (or not to be) substrates.


Figure: Metrabase 1.0 schema

2.1 Activities (interactions)

The key information held in the activities table of the database covers the interactions between proteins and chemical compounds, indicating the compound action type as either substrate, non-substrate, inducer, non-inducer, repressor, inhibitor, non-inhibitor, stimulator or binder (the action_type field).

Action types

Protein activity (transport or catalysis):
  substrate
  non-substrate
Compound activity (affecting protein activity/expression):
  inhibitor/repressor (negative modulators)
  stimulator/inducer (positive modulators)
  non-inhibitor/non-inducer (inactive compounds)

Action type was set to binder where it did not fall into any of these categories, but the molecule was found to bind to the protein.

Key fields:
  cmpd_id
  protein_id
  ref_id
  action_type
  species (in version 1.0, species = human for all records, so this field can be omitted)
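As an illustration of how the key fields might be used together, the sketch below selects all compounds recorded as negative modulators (inhibitor or repressor) of one protein. It is a minimal sketch only: an in-memory SQLite table stands in for the MySQL activities table, only the key fields listed above are modelled, and the identifiers and rows (cmpd_A, prot_1, ref_1 and so on) are invented placeholders, not Metrabase records.

```python
import sqlite3

# In-memory SQLite stand-in for the MySQL "activities" table. Only the key
# fields named in the manual are modelled; all rows are made-up placeholders.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activities ("
    "cmpd_id TEXT, protein_id TEXT, ref_id TEXT, action_type TEXT, species TEXT)"
)
rows = [
    ("cmpd_A", "prot_1", "ref_1", "inhibitor", "human"),
    ("cmpd_B", "prot_1", "ref_2", "repressor", "human"),
    ("cmpd_C", "prot_1", "ref_3", "substrate", "human"),
    ("cmpd_D", "prot_2", "ref_4", "inhibitor", "human"),
]
conn.executemany("INSERT INTO activities VALUES (?, ?, ?, ?, ?)", rows)


def compounds_with_actions(conn, protein_id, action_types):
    """Return the distinct cmpd_id values that have any of the given
    action types recorded against one protein, sorted by cmpd_id."""
    placeholders = ", ".join("?" for _ in action_types)
    sql = (
        "SELECT DISTINCT cmpd_id FROM activities "
        f"WHERE protein_id = ? AND action_type IN ({placeholders}) "
        "ORDER BY cmpd_id"
    )
    return [row[0] for row in conn.execute(sql, [protein_id, *action_types])]


# Negative modulators of the placeholder protein "prot_1".
negative_modulators = compounds_with_actions(
    conn, "prot_1", ["inhibitor", "repressor"]
)
print(negative_modulators)
```

The same SELECT statement should carry over to a local MySQL copy with little change, although the placeholder syntax depends on the MySQL driver used.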

Compounds were categorised as substrates or non-substrates according to the results presented in the publication providing the data point, and no further evaluation was carried out on our side. Care must be taken with the current status of the inhibition records: depending on the measurement threshold (e.g. percentage inhibition), some of the compounds annotated as inhibitors can be regarded as non-inhibitors and vice versa. A proper classification of compounds as either inhibitors or non-inhibitors is planned for subsequent releases of the database.

Other activities fields hold additional extracted data and annotations, such as assay descriptions, relevant experimental measurements, cell systems, compound concentrations and the substrates used in inhibition assays. These fields may be only partially completed in this release, partly because most of the reviews do not include assay information.

The published_label field contains the chemical names, abbreviations or designations used in publications to label compounds. This field has been completed for all records except those linked to the external datasets, and it can be used to identify compounds easily in their respective publications.

Activity types were mostly accepted as found in the publications and may therefore overlap. Consequently, selecting all the activity types relevant for one's search is recommended.

2.2 Proteins

The proteins contained in Metrabase are categorised as either transporters or enzymes (the protein_type field) and are provided with the HUGO Gene Nomenclature Committee (HGNC) approved symbols and names (www.genenames.org) as well as UniProt IDs. Protein sequences for the indicated isoforms were included from UniProt (www.uniprot.org). Other fields include additional information, such as Gene, RefSeq and Ensembl IDs and TC (Transporter Classification) or EC (Enzyme Commission) numbers.

Metrabase also contains information about protein expression levels across healthy human tissues. Part of this data is based on immunohistochemistry using tissue microarrays (gene, tissue, cell type, level, expression type and reliability) and comes from the normal_tissue.csv file of the Human Protein Atlas (HPA) v9.0 (www.proteinatlas.org). All other expression records contain data that was extracted from the literature. For these non-HPA records (i.e. where ref_id is not null), the level of expression (mRNA and/or protein level) is one of: expressed (if the level had not been specified), none, none-low, low, low-medium, medium, medium-high and high.
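The literature-derived expression levels listed above form a natural ordinal scale, which can be handy when filtering tissues by expression. The sketch below is only an illustration: the numeric ranks are an assumption (the manual does not define them), the helper name at_least is hypothetical, and the tissue records are invented examples.

```python
# Ordinal ranks for the literature-derived (non-HPA) expression levels.
# The ranks are an assumption made for this sketch. "expressed" carries no
# magnitude (the level was not specified), so it is kept off the scale.
LEVEL_RANK = {
    "none": 0,
    "none-low": 1,
    "low": 2,
    "low-medium": 3,
    "medium": 4,
    "medium-high": 5,
    "high": 6,
}


def at_least(level, threshold):
    """Return True if `level` is ranked at or above `threshold`.

    Returns None for "expressed", where the level is unknown.
    """
    if level == "expressed":
        return None
    return LEVEL_RANK[level] >= LEVEL_RANK[threshold]


# Invented example records: (tissue, level).
records = [("liver", "high"), ("kidney", "low-medium"), ("brain", "expressed")]
print([tissue for tissue, level in records if at_least(level, "medium")])
```

Records with the level "expressed" are neither accepted nor rejected by the threshold here; whether to keep them is a decision for the analysis at hand.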

2.3 Compounds

The total number of records in the compounds table is 3562, but the number of compounds with recorded interaction data (for transporters and enzymes together) is 3438. The remaining compounds are used in other tables, such as cmpd_variants, which holds stereoisomers, multi-component structures and different forms of a compound.

Molecular structures are available in MDL molfile format and as absolute (unique and isomeric) SMILES strings (in Kekulé form). They were mostly verified using the ChemSpider (www.chemspider.com) and/or SciFinder (www.cas.org/products/scifinder) databases. The standard InChI and InChIKey strings were generated using v1.04 of the InChI software (http://www.inchi-trust.org).

The great majority of the compounds are small organic molecules (containing just the following atoms: C, H, O, P, S, N, F, Cl, Br and I). All the other types (coordination complexes, inorganic compounds, metalloid-containing compounds, selenium-containing compounds and polymers) are listed in the compound_types table. This table also contains the DrugBank types of drugs (approved, experimental, illicit, investigational, nutraceutical and withdrawn) taken from DrugBank v3.0 (www.drugbank.ca). It can easily be extended by annotating compounds further, for example as natural products including their subtypes (e.g. natural product: terpene: sesquiterpene).

The properties table contains selected molecular properties that were calculated or predicted using ChemAxon's Calculator (cxcalc) v6.1.3 (www.chemaxon.com). Molecular mass was calculated for all structures. The remaining properties were calculated only for the small organic single-component structures: constitutional descriptors (atom and bond counts, hydrogen bond donor and acceptor counts, ring count and rotatable bond count), log P and log D. Experimental properties are not currently provided, i.e. properties.type = 'c' for all records (where 'c' stands for 'calculated'). The multi-component structures can easily be identified using the compounds.fragment_count field, and their single-component counterparts using the cmpd_variants table.

The synonyms table contains chemical names of Metrabase compounds (systematic, semi-systematic, common, trade names, abbreviations, codes). One of the synonyms was selected as the main name (the compounds.cmpd_name field) for each compound. Chemical names were obtained mostly from DrugBank (these might refer to compound variants as well) and SciFinder. The systematic (IUPAC) names were computer generated using ChemAxon's IUPAC Naming Plugin v6.1.3 (the compounds.iupac_name field).

The cmpd_ids table contains external compound IDs. Most of the compounds have ChemSpider IDs (CSIDs); a CAS Registry Number (CASRN) was provided only if no CSID had been found (CAS Registry Number is a Registered Trademark of the American Chemical Society). DrugBank IDs are also included where identified (especially for the approved drugs). The MBCD number is the compound identifier in Metrabase, e.g. mbcd0027084 (MBID for compounds).

Example compound record:

cmpd_id: mbcd0027084 (CSID:14034)
cmpd_name: Ethidium bromide
smiles: [Br-].CC[N+]1=C(C2=CC=CC=C2)C2=CC(N)=CC=C2C2=CC=C(N)C=C12
std_inchi: 1S/C21H19N3.BrH/c1-2-24-20-13-16(23)9-11-18(20)17-10-8-15(22)12-19(17)21(24)14-6-4-3-5-7-14;/h3-13,23H,2,22H2,1H3;1H
std_inchikey: ZMMJGEGLRURXTF-UHFFFAOYSA-N
iupac_name: 3,8-diamino-5-ethyl-6-phenylphenanthridin-5-ium bromide
formula_dot: C21H20N3.Br
fragment_count: 2
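The formula_dot value in the example record can be used to illustrate two of the ideas above: counting components and checking the element set of the small organic molecules. The sketch below is an illustration under stated assumptions, not how Metrabase derives its fields: it assumes that formula_dot separates components with "." and uses standard element symbols, and the function names are hypothetical. Note that the element check is necessary but not sufficient, since an inorganic compound may also consist only of these atoms; the compound_types table remains the authoritative annotation.

```python
import re

# Atoms allowed in "small organic molecules" according to the manual.
SMALL_ORGANIC_ELEMENTS = {"C", "H", "O", "P", "S", "N", "F", "Cl", "Br", "I"}

# An element symbol: one upper-case letter, optionally one lower-case letter.
ELEMENT = re.compile(r"[A-Z][a-z]?")


def fragment_count(formula_dot):
    """Count components, assuming they are separated by '.'."""
    return len(formula_dot.split("."))


def elements(formula_dot):
    """Return the set of element symbols found in the formula."""
    return set(ELEMENT.findall(formula_dot))


def only_small_organic_elements(formula_dot):
    """True if every element in the formula is in the allowed set."""
    return elements(formula_dot) <= SMALL_ORGANIC_ELEMENTS


formula = "C21H20N3.Br"  # formula_dot of the ethidium bromide record
print(fragment_count(formula))
print(sorted(elements(formula)))
print(only_small_organic_elements(formula))
```

For the ethidium bromide record, the component count obtained this way agrees with the stored fragment_count of 2.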

2.4 Data sources

The datasources table contains the sources of data in the database, including information about the software that was used to calculate molecular properties. The datasource_id and datasource_version fields indicate the source of all Metrabase records where applicable.

The refs table contains the publications' citation information (bibliographic fields) and links. Most of the publications (91%) are original peer-reviewed research articles and 7% are reviews; the aim remains to link all Metrabase records to primary literature sources. PubMed IDs are provided where available, as well as DOIs (if a DOI was not available, a URL is given instead in the doi_url field). To resolve a DOI, prepend http://dx.doi.org/ to it, e.g. http://dx.doi.org/10.1021/ac0354342.
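The DOI rule above can be written as a small helper. This is a minimal sketch that assumes the doi_url field holds either a bare DOI or a complete URL, as described; the function name reference_link is hypothetical.

```python
DOI_RESOLVER = "http://dx.doi.org/"


def reference_link(doi_url):
    """Return a resolvable link for a value of the refs.doi_url field.

    A value that already is a URL is returned unchanged; anything else is
    treated as a bare DOI and prefixed with the resolver address.
    """
    value = doi_url.strip()
    if value.lower().startswith(("http://", "https://")):
        return value
    return DOI_RESOLVER + value


print(reference_link("10.1021/ac0354342"))
```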

3 The transporter substrate dataset

We aim to provide a version of the transporter substrate dataset (MBTPsubDS) as a supplement to each Metrabase release. Each MBTPsubDS version contains interactions between small molecules and transporters, and includes all the unique substrate and non-substrate records obtained from Metrabase and processed to facilitate human transporter data analysis and predictive modelling (by "unique" we mean the unique (cmpd_id, protein_id, action_type) tuples).

MBTPsubDS1_0
  Based on Metrabase v1.0; all the interactions involving conflicting action types (where a compound was found to be both a substrate and a non-substrate of a single transporter) were excluded.

MBTPsubDS1_0a
  Some of the conflicting action types were resolved upon our evaluation of such records, and the corresponding compound-transporter pairs were added to MBTPsubDS1_0 where we thought we could consider the compound as either a substrate or a non-substrate.
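The two processing steps described for MBTPsubDS1_0, reducing records to unique (cmpd_id, protein_id, action_type) tuples and excluding compound-transporter pairs with conflicting action types, can be sketched as follows. This is an illustration of the stated rules only, not the authors' actual processing; the function name and all records are invented placeholders.

```python
from collections import defaultdict


def build_substrate_dataset(records):
    """Keep unique substrate/non-substrate tuples and drop every
    (cmpd_id, protein_id) pair that carries both action types."""
    # Step 1: unique (cmpd_id, protein_id, action_type) tuples.
    unique = {
        (cmpd_id, protein_id, action_type)
        for cmpd_id, protein_id, action_type in records
        if action_type in ("substrate", "non-substrate")
    }
    # Step 2: collect the action types seen for each compound-protein pair.
    actions_by_pair = defaultdict(set)
    for cmpd_id, protein_id, action_type in unique:
        actions_by_pair[(cmpd_id, protein_id)].add(action_type)
    # Step 3: keep only pairs with a single, non-conflicting action type.
    return sorted(
        t for t in unique if len(actions_by_pair[(t[0], t[1])]) == 1
    )


# Invented placeholder records: (cmpd_id, protein_id, action_type).
records = [
    ("cmpd_A", "prot_1", "substrate"),
    ("cmpd_A", "prot_1", "substrate"),      # duplicate from a second reference
    ("cmpd_B", "prot_1", "substrate"),
    ("cmpd_B", "prot_1", "non-substrate"),  # conflict: the pair is dropped
    ("cmpd_B", "prot_2", "non-substrate"),
    ("cmpd_C", "prot_1", "inhibitor"),      # not a substrate/non-substrate record
]
dataset = build_substrate_dataset(records)
for row in dataset:
    print(row)
```

Resolving conflicts, as was done for MBTPsubDS1_0a, requires evaluating the underlying records and cannot be reproduced by a filter of this kind.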


4 Usage

Web interface | Local MySQL database

Search by protein, Search by compound, Expression data, Protein list, Download

To load Metrabase from a dump file (metrabase1_0.sql), first create a database on your system and then load the dump file, for example:

    # tar -xzvf metrabase1_0.tar.gz
    # mysql -u username -p
    mysql> CREATE DATABASE metrabase;
    # mysql -u username -p metrabase < metrabase1_0.sql

MySQL Workbench can be used as an interface for MySQL.


Credits

Metrabase was developed by Lora Mak in collaboration with David Marcus, Andreas Bender and Robert C. Glen at the Centre for Molecular Informatics and Galina Yarova, Guus Duchateau and Werner Klaffke at Unilever, with the much appreciated help of the following (at the time) 2nd and 3rd year undergraduate students of the University of Cambridge: Claire Dickson, Joseph Dixon, Ivan Lam, Richard Lewis, Callum Picken, Claudia Pop, Heyao Shi, Emma Stirk, Yasmin Surani, Paddy Szeto, Nathaniel Wand, Julian Willis and Jing Xiangyi.

Metrabase's web interface was developed by Andrew Howlett at the Centre for Molecular Informatics. Andrew also designed the Metrabase logo.

Metrabase was realised and is being maintained in the Glen group.

Licensing

Metrabase is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License (http://creativecommons.org/licenses/by-sa/4.0/). However, for the integrated data, such as the TP-Search, ChEMBL and Human Protein Atlas records that are distributed as part of Metrabase, the user is referred to each external data source regarding its licensing. This means that the integrated data retains the licensing of the original data sources. The TP-Search and ChEMBL records may have been modified and augmented, while the Human Protein Atlas records were included unmodified.

Attribution

We hope you find our database and the associated datasets useful. If you use them, please cite us:

Mak L, Marcus D, Howlett A, Yarova G, Duchateau G, Klaffke W, Bender A, Glen RC: Metrabase: a cheminformatics and bioinformatics database for small molecule transporter data analysis and (Q)SAR modeling. Journal of Cheminformatics 2015, 7:31.

Metrabase v1.0, University of Cambridge, http://www-metrabase.ch.cam.ac.uk

Metrabase - http://www-metrabase.ch.cam.ac.uk
Contact: metrabase@ch.cam.ac.uk

The Centre for Molecular Informatics
Department of Chemistry, University of Cambridge
Lensfield Road, Cambridge, CB2 1EW, UK

This document was prepared by Dr Lora Mak and reviewed by Prof Robert C. Glen.

© 2015 Metrabase Development Team, University of Cambridge. All rights reserved.