PubChem data extraction and integration using Instant JChem
Oleg Ursu, Cristian Bologa, Tudor I. Oprea
Division of Biocomputing

PubChem - why not?
- Custom SQL queries
- Pipelining with custom in-house or commercial tools
- Structure search not complete
- Integration with in-house databases
- Speed of access/queries
PubChem structure search July 24, 2008 (and earlier)
DrugBank structure search tool July 24, 2008 (and earlier)
Integration with other databases
- Quickly identify HTS hit activity on other targets/assays
- Is there any relationship between my target and other targets in a cell-based assay?
- Profile compound activity on data other than PubChem assays
- WOMBAT records overlapping with MLSMR: ~9000
- WOMBAT unique compounds overlapping with MLSMR: ~1700

Small Molecules Repository
- Large subset of PubChem libraries tested in multiple assays
- The same supplier, multiple centers including NMMLSC
- Need to pipeline/automate post-HTS analysis for multiple targets
- Integration with WOMBAT in-house database

Building IJC database
- Download the MLSMR library from PubChem
- Download assay data and descriptions from PubChem FTP
- Extract and prepare data from assay files
- Design and create database tables, relationships and forms

Structures import
- PubChem limit for download: 250,000 structures
- 2 download batches
- Clean up using ChemAxon Standardizer with the following configuration

Database creation
- Import WOMBAT RDF file
- Assign PUBCHEM_SUBSTANCE_ID to structures in WOMBAT that are present in the MLSMR library
- Import MLSMR library, checking for duplicate structures already present in SUBSTANCES table
- Import PubChem assay data
- Import PubChem assay description data
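The duplicate check during the MLSMR import can be sketched roughly as below. This is only an illustration under stated assumptions: SQLite stands in for the actual database, duplicates are detected by exact match on a structure string (the real import presumably relies on chemical structure comparison), and the `structure` column, the function name, and the sample SIDs are hypothetical. Only the SUBSTANCES table and the PUBCHEM_SUBSTANCE_ID column come from the slides.

```python
import sqlite3

def import_substances(conn, records):
    """Insert (structure, sid) pairs. If the structure is already present,
    only assign the PubChem SID to the existing row instead of inserting."""
    added = matched = 0
    for structure, sid in records:
        row = conn.execute(
            "SELECT cd_id FROM substances WHERE structure = ?", (structure,)
        ).fetchone()
        if row is None:
            conn.execute(
                "INSERT INTO substances (structure, pubchem_substance_id) "
                "VALUES (?, ?)",
                (structure, sid),
            )
            added += 1
        else:
            conn.execute(
                "UPDATE substances SET pubchem_substance_id = ? WHERE cd_id = ?",
                (sid, row[0]),
            )
            matched += 1
    conn.commit()
    return added, matched

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE substances ("
    " cd_id INTEGER PRIMARY KEY,"
    " structure TEXT UNIQUE NOT NULL,"   # hypothetical column
    " pubchem_substance_id INTEGER)"
)
# Pretend one WOMBAT structure is already loaded (made-up example data).
conn.execute("INSERT INTO substances (structure) VALUES ('CCO')")
added, matched = import_substances(conn, [("CCO", 111), ("c1ccccc1", 222)])
print(added, matched)
```

With this layout a structure shared by WOMBAT and MLSMR ends up as a single row carrying the PubChem SID, which is what makes the later joins on PUBCHEM_SUBSTANCE_ID possible.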
PubChem structures import
Assays import
- CSV file: assay test data
- XML file: assay description data
- Shell script to process and extract assay data
- XMLStarlet Command Line XML Toolkit to query and extract assay description data (http://xmlstar.sourceforge.net/)
Examples:
    $ xml sel -t -m //PC-AssayDescription -v PC-AssayDescription_name 761.descr.xml
    > HTS to identify specific small molecule inhibitors of Ras and Ras-related GTPases specifically Cdc42 wildtype
    $ xml sel -t -m //PC-AssayTargetInfo -v PC-AssayTargetInfo_name -n 761.descr.xml
    > cell division cycle 42 (GTP binding protein, 25kDa) [Homo sapiens]

Entity relationships
- Link ACTIVITY table data with SUBSTANCES table on PUBCHEM_SUBSTANCE_ID
- Link WOMBAT.ACT.LIST/WOMBAT.MOL.KW with WOMBAT SUBSTANCES table on SMDL.ID
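The same two lookups can be done without XMLStarlet. The sketch below uses only the Python standard library and is not the shell script from the talk. The helper names and the inline sample document are hypothetical (a trimmed, made-up stand-in for a file such as 761.descr.xml). Tags are compared by local name so the code works whether or not the file declares a default XML namespace.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Drop a '{namespace}' prefix, if any, from an element tag."""
    return tag.rsplit("}", 1)[-1]

def child_texts(root, parent, child):
    """Text of every direct <child> of every <parent> element in the tree."""
    return [
        (c.text or "").strip()
        for p in root.iter()
        if local(p.tag) == parent
        for c in p
        if local(c.tag) == child
    ]

# Made-up, heavily trimmed sample; a real description file is much larger.
SAMPLE = """<PC-AssayDescription>
  <PC-AssayDescription_name>Example assay name</PC-AssayDescription_name>
  <PC-AssayDescription_target>
    <PC-AssayTargetInfo>
      <PC-AssayTargetInfo_name>example target</PC-AssayTargetInfo_name>
    </PC-AssayTargetInfo>
  </PC-AssayDescription_target>
</PC-AssayDescription>"""

root = ET.fromstring(SAMPLE)
names = child_texts(root, "PC-AssayDescription", "PC-AssayDescription_name")
targets = child_texts(root, "PC-AssayTargetInfo", "PC-AssayTargetInfo_name")
print(names, targets)
```

For a file on disk, `ET.parse(path).getroot()` would replace `ET.fromstring(SAMPLE)`.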
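One way to picture these links is as foreign keys. The sketch below is an assumption-laden illustration, not the actual IJC schema: SQLite stands in for the real database, the lowercase table names for the WOMBAT side and the `target_name` column are hypothetical, and only the linking columns (PUBCHEM_SUBSTANCE_ID, SMDL.ID) follow the slide.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE substances (
    cd_id INTEGER PRIMARY KEY,
    pubchem_substance_id INTEGER UNIQUE
);
CREATE TABLE activity (
    aid INTEGER NOT NULL,
    pubchem_substance_id INTEGER NOT NULL
        REFERENCES substances (pubchem_substance_id),
    activity_outcome INTEGER
);
CREATE TABLE wombat_substances (
    smdl_id INTEGER PRIMARY KEY
);
CREATE TABLE wombat_act_list (
    smdl_id INTEGER NOT NULL REFERENCES wombat_substances (smdl_id),
    target_name TEXT  -- hypothetical column
);
""")

# Made-up rows: an activity record must point at a known substance.
conn.execute("INSERT INTO substances VALUES (1, 1001)")
conn.execute("INSERT INTO activity VALUES (761, 1001, 2)")
try:
    conn.execute("INSERT INTO activity VALUES (761, 9999, 2)")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

Declaring the relationship in the schema means an assay record for a substance that was never imported is rejected at load time rather than silently dropped by a later join.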
PubChem (MLSMR) view
Wombat view
Analyzing HTS hits
- Cluster HTS hits, SAR relationships
- Profile HTS hits on PubChem assays and WOMBAT targets
- Virtual screening of commercial libraries using ROCS, FP, and docking

Integration with other tools
- GUI is nice, but doesn't play well with other tools
- jcsearch and jcman (ChemAxon command line tools) don't allow join select SQL statements
- Designed custom search application based on JChemSearch object and JChem API
Using JChemSearch object
Pipelining db search with other tools
- Select active compounds in GTPases screening assays
  SQL filter:
    select distinct cd_id from substances, activity
    where activity.pubchem_substance_id = substances.pubchem_substance_id
      and activity.aid in (757,758,759,760,761,764)
      and activity.activity_outcome = 2
- Pipelining search results to MCES based clustering tool
    $ db_search -s get_actives.sql -t substances pubchem_substance_id pubchem_ext_datasource_regid | mcs -i - -mt 0.4 -o gtp.actives.meas --out-type m -s 0.3
- Apply MESA Analytics grouping module to the resulting measures matrix
    $ Clustering gtp.actives.meas -T 0.28 637 > cluster.28.out
    $ ClusterOutput cluster.28.out gtp.actives.smiles 2 T N > cluster.28.out.smiles
- Generate Omega conformations needed for ROCS screening
    $ db_search -s get_actives.sql -t substances pubchem_substance_id | omega2 -in - -out gtp.actives.confs.oeb.gz -maxconfs 100
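The idea behind the pipeline, a SQL filter that selects rows whose structures and identifiers are then written to standard output for the next tool, can be sketched as follows. This is not the db_search application itself (which is built on the JChem API and is not shown in the slides). It is a stand-in using SQLite, with a hypothetical `structure` column, a hypothetical `search_lines` helper, and made-up rows. The SQL filter is the one from the slide.

```python
import sqlite3
import sys

ACTIVES_SQL = """
SELECT DISTINCT cd_id
FROM substances, activity
WHERE activity.pubchem_substance_id = substances.pubchem_substance_id
  AND activity.aid IN (757, 758, 759, 760, 761, 764)
  AND activity.activity_outcome = 2
"""

def search_lines(conn, filter_sql, fields):
    """Run filter_sql (which must return cd_id) and yield one line per hit:
    the structure followed by the requested columns of substances."""
    allowed = {"pubchem_substance_id", "pubchem_ext_datasource_regid"}
    if not set(fields) <= allowed:
        raise ValueError("unknown field requested")
    cols = ", ".join(["structure"] + list(fields))
    query = (
        f"SELECT {cols} FROM substances "
        f"WHERE cd_id IN ({filter_sql}) ORDER BY cd_id"
    )
    for row in conn.execute(query):
        yield " ".join(str(value) for value in row)

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE substances (cd_id INTEGER PRIMARY KEY, structure TEXT,
    pubchem_substance_id INTEGER, pubchem_ext_datasource_regid TEXT);
CREATE TABLE activity (aid INTEGER, pubchem_substance_id INTEGER,
    activity_outcome INTEGER);
-- Made-up example rows.
INSERT INTO substances VALUES (1, 'CCO', 1001, 'REG-A'), (2, 'CCN', 1002, 'REG-B');
INSERT INTO activity VALUES (761, 1001, 2), (757, 1001, 2),
                            (761, 1002, 1), (999, 1002, 2);
""")

lines = list(search_lines(
    conn, ACTIVES_SQL, ["pubchem_substance_id", "pubchem_ext_datasource_regid"]
))
sys.stdout.write("\n".join(lines) + "\n")
```

Because the output is plain text on standard output, it can be piped directly into a clustering or conformer generation tool that reads structures from standard input.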
Selected Cluster
Similar compounds in MLSMR - Active in other assays
WOMBAT compounds
BIRT reporting framework
- Use the list of SIDs to create a report on PubChem assay profiling

Cluster compounds profile in PubChem and WOMBAT

Source                           # of compounds   Tested   Active
Active in GTPases assays         4                69       6
Active in other PubChem assays   4                250      5
WOMBAT                           5                6        6
Total                            13               317      17

Target(s) name:
- Active in GTPases assays: Rac1 protein; GTP-binding protein (rab7); ras protein; Ras-related protein Rab-2A; cell division cycle 42 (GTP binding protein, 25kDa); Rac1 protein
- Active in other PubChem assays: qHTS Assay for Disrupters of an Hsp90 Co-Chaperone Interaction; Catalytic epsilon subunit of the translation initiation factor eIF2B, the guanine nucleotide exchange factor for eIF2; activity&; cytochrome P450, family 2, subfamily C, polypeptide 9; cytochrome P450, family 2, subfamily C, polypeptide 19; thyroid stimulating hormone receptor
- WOMBAT: DP, prostaglandin D2 receptor; EP1, prostaglandin E2 receptor, EP1 subtype; EP2, prostaglandin E2 receptor, EP2 subtype; EP3, prostaglandin E2 receptor, EP3 subtype; EP4, prostaglandin E2 receptor, EP4 subtype; cPLA2, cytosolic phospholipase A2, phospholipase A2 group IVA

ROCS screening
[Figure: dose-response curve, Plate1A02_000A-0214, Rac_act; y-axis RawMCF, x-axis Log Compound Conc [M]]
Fit parameters (Rac_act): BOTTOM 240.1; TOP 191.3; LOGEC50 -6.990; HILLSLOPE -0.9937; EC50 1.0237e-007
ROCS hit from ChemDiv library, dose response EC50 = 0.102 μM

Future plans
- Automatic synchronization with PubChem
- Integration with other databases: DrugBank, Protein Ligand Databases, EMBL-EBI, in-house assay data, etc.

Acknowledgments
- ChemAxon
- OpenEye
- Eclipse project
- Division of Biocomputing at UNM

Division of Biocomputing at UNM
Tudor Oprea, Cristian Bologa, Steve Mathias, Jerome Abear, Andrei Leitao, Ramona Curpan, Liliana Halip, Jeremy Yang, Niranjan Kumar, Oleg Ursu
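The reported EC50 follows from the fitted LOGEC50. The sketch below recomputes it and evaluates the curve at that concentration. It assumes the common four-parameter logistic form (the slide lists BOTTOM, TOP, LOGEC50 and HILLSLOPE but does not state the model equation), and the function name is hypothetical.

```python
def four_pl(log_conc, bottom, top, log_ec50, hill_slope):
    """Four-parameter logistic response at a given log10 concentration
    (assumed model form; not stated on the slide)."""
    return bottom + (top - bottom) / (
        1.0 + 10.0 ** ((log_ec50 - log_conc) * hill_slope)
    )

# Fit parameters as listed on the slide for Rac_act.
BOTTOM, TOP, LOG_EC50, HILL_SLOPE = 240.1, 191.3, -6.990, -0.9937

ec50_molar = 10.0 ** LOG_EC50          # convert log10(M) back to M
ec50_micromolar = ec50_molar * 1e6     # M -> uM
midpoint = four_pl(LOG_EC50, BOTTOM, TOP, LOG_EC50, HILL_SLOPE)

print(f"EC50 = {ec50_micromolar:.3f} uM, response at EC50 = {midpoint:.1f}")
```

Under this model the response at the EC50 is by construction halfway between BOTTOM and TOP, which gives a quick sanity check on a reported fit.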