
An Integrated Approach to in-silico Screening
Joseph L. Durant Jr., Douglas R. Henry, Maurizio Bronzetti, and David A. Evans
MDL Information Systems, Inc.
14600 Catalina St., San Leandro, CA 94577

Goals of in-silico screening
Lead discovery/design
Lead optimization
  Maximize potency
  Minimize toxicity
  Optimize ADME characteristics
Fail-early paradigm - try to predict clinical success/failure
Maximize use of prior information

Challenges of in-silico screening
Combinatorial explosion of chemical and biological data
Must ensure data currency, validity, cleanliness
Must accommodate multiple heterogeneous sources of data
Wide expanse of chemical space dictates a wide variety of tools to analyze it
Need an integrated chemical informatics framework to utilize the tools efficiently and conveniently

Chemical Informatics Framework
Databases, data marts, and data warehouses of chemical structure information - and the software to build and manage them (including in-house, commercial, and internet data sources)
Software tools to extract, normalize, enumerate, and compute properties from structures
Analysis applications to visualize, model, and predict activity from structure (structural data mining)

Chemical Informatics: Database
Chemical Warehousing - e.g., Reagent Selector, Compound Warehouse
Oracle 8i architecture
Star Schema - one or more large fact tables linked to several smaller dimension tables
Serviced by schedulers and database daemon processes
Scalable from project data mart (100K structures) to corporate warehouse (10M structures)
Multiple structure sources, including commercial databases (MDDR, ACD, Beilstein), in-house, and online data sources
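
As a rough illustration of the star-schema layout named on this slide, the sketch below builds a tiny in-memory example with Python's sqlite3 module. SQLite stands in for Oracle 8i purely for convenience, and the table names, column names, and rows (structure_fact, source_dim, supplier_dim) are hypothetical; none of them are taken from Reagent Selector or Compound Warehouse.

```python
import sqlite3

# Hypothetical miniature star schema: one fact table keyed to two
# small dimension tables. Names and rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE source_dim   (source_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE supplier_dim (supplier_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE structure_fact (
    structure_id INTEGER PRIMARY KEY,
    source_id    INTEGER REFERENCES source_dim(source_id),
    supplier_id  INTEGER REFERENCES supplier_dim(supplier_id),
    mol_weight   REAL
);
""")
conn.executemany("INSERT INTO source_dim VALUES (?, ?)",
                 [(1, "ACD"), (2, "in-house")])
conn.executemany("INSERT INTO supplier_dim VALUES (?, ?)",
                 [(1, "Supplier A"), (2, "Supplier B")])
conn.executemany("INSERT INTO structure_fact VALUES (?, ?, ?, ?)",
                 [(10, 1, 1, 180.2), (11, 1, 2, 310.4), (12, 2, 1, 225.0)])

# A typical star-schema query: filter on the fact table, then join out
# to a dimension table to label and group the rows.
rows = conn.execute("""
    SELECT s.name, COUNT(*)
    FROM structure_fact f
    JOIN source_dim s ON s.source_id = f.source_id
    WHERE f.mol_weight < 250
    GROUP BY s.name
    ORDER BY s.name
""").fetchall()
print(rows)
```

The point of the layout is that the large fact table carries only keys and measures, while descriptive attributes live in the small dimension tables and are joined in on demand.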

Chemical Informatics: Software Tools
Cheshire - MDL chemical scripting language
  Object-oriented, Java-like syntax - associative arrays for structures, queries, data; collections
  Perceive and set atom, bond properties, rings, fragments, sgroup data
  Supports mapping and structure-walking applications
  Exposes MDL's SSKeys; allows user-defined similarity searches

Cheshire: An Example
Purpose: normalize nitro group representation

function no2s() {
    var q1 = createmol("N+(=O)-O");
    var ic = 0;
    if ((c = map(q1)).count() > 0) {
        c.set(a_charge, 0);
        c.set(b_type, b_type_double);
        ic = ic + 1;
    }
    return (ic);
}

Topological Distances
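
Topological distance is usually taken to be the number of bonds on the shortest path between two atoms. A minimal sketch, assuming a molecule is supplied as a plain adjacency list; the `topological_distances` function and the example graph are illustrative and are not Cheshire calls.

```python
from collections import deque

def topological_distances(adjacency, start):
    """Breadth-first search: bond counts from `start` to every reachable atom."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        atom = queue.popleft()
        for neighbor in adjacency[atom]:
            if neighbor not in dist:
                dist[neighbor] = dist[atom] + 1
                queue.append(neighbor)
    return dist

# Heavy-atom graph of nitromethane: atom 0 is C, 1 is N, 2 and 3 are O.
nitromethane = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
distances = topological_distances(nitromethane, 0)
print(distances)
```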

Gasteiger Charges

Common Substructures Algorithm
[Figure: chemical structure diagram]
Find the stereocenters
Collect the stereocenters, atoms alpha to them, rings connected to them
Compare the new structure with those already collected.
  Is it unique? (No: add a hit to the list)
  Is it a substructure of an already existing structure (or vice versa)? (Yes: add a hit to the substructure's list, remove the "superstructure"; No: add it to the list)
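
The bookkeeping in the comparison step can be sketched as follows. This is one reading of the slide, not the authors' implementation: fragments are represented as frozensets of hypothetical labels, the subset test stands in for a real substructure match, and the choice to credit every matching substructure is an assumption.

```python
def collect_fragment(hits, fragment):
    """Update `hits` (fragment -> hit count) with one newly extracted fragment.

    Assumption: fragments are frozensets of labels, and the proper-subset
    test `<` stands in for a real substructure match.
    """
    if fragment in hits:                 # not unique: record another hit
        hits[fragment] += 1
        return
    subs = [e for e in hits if e < fragment]     # existing substructures of the new one
    supers = [e for e in hits if fragment < e]   # existing superstructures of the new one
    if subs:
        for e in subs:                   # credit the substructure(s), drop the new one
            hits[e] += 1
    elif supers:
        # The new fragment is the substructure: it inherits the hits of the
        # superstructures it replaces, plus one for itself.
        hits[fragment] = 1 + sum(hits.pop(e) for e in supers)
    else:
        hits[fragment] = 1               # unrelated: start a new entry

hits = {}
for labels in [{"C*", "O"}, {"C*", "O", "N"}, {"C*"}, {"C*"}, {"S"}]:
    collect_fragment(hits, frozenset(labels))
print(hits)
```

Under these assumptions the list converges on the smallest shared fragments, each carrying the count of structures that contain it.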

Structural Diversity

SSKeys
Structure-based keys
166 bits - subset of unique and identified features
960 bits - optimized for searching; contains many-to-one mappings
2275 bits - many-to-one mappings removed

Modal Fingerprints
Capture information about key bits set by a collection of molecules
  threshold
  weighting
Similarity searches
  N(q&t)/N(q) ("sub"-similarity)
  N(q&t)/N(q|t) (Tanimoto)
see also: Shemetulskis, et al., J. Chem. Inf. Comput. Sci. 1996, 36, 862-871.
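
The two similarity measures above, together with one simple way of thresholding a modal fingerprint, can be sketched in a few lines. Fingerprints are modeled here as Python sets of bit indices; the thresholding rule (keep a bit if it is set in at least a given fraction of the molecules) is an assumption and may differ from the scheme in the cited paper, and the example bit indices are made up.

```python
def modal_fingerprint(fingerprints, threshold=0.5):
    """Bits set in at least `threshold` of the fingerprints (sets of bit indices)."""
    counts = {}
    for fp in fingerprints:
        for bit in fp:
            counts[bit] = counts.get(bit, 0) + 1
    cutoff = threshold * len(fingerprints)
    return {bit for bit, n in counts.items() if n >= cutoff}

def sub_similarity(q, t):
    """N(q&t)/N(q): fraction of the query's bits also present in the target."""
    return len(q & t) / len(q) if q else 0.0

def tanimoto(q, t):
    """N(q&t)/N(q|t): shared bits over all bits set in either fingerprint."""
    union = q | t
    return len(q & t) / len(union) if union else 0.0

actives = [{1, 5, 9}, {1, 5, 12}, {1, 7, 9}, {1, 5, 9, 20}]
modal = modal_fingerprint(actives, threshold=0.75)
target = {1, 5, 9, 30, 31}
print(sorted(modal), sub_similarity(modal, target), tanimoto(modal, target))
```

Sub-similarity ignores extra bits in the target, so a large molecule containing the modal features scores 1.0, whereas Tanimoto penalizes the additional bits.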

Chemical Informatics: Analysis Application
Visualization
Unsupervised data mining - clustering
Supervised data mining - classification, decision trees, recursive partitioning
QSAR modeling - regression, neural networks, genetic algorithms
Pharmacophore searching and docking

Chemical Informatics: Clustering
Goal: rapid, simple UI, select cluster count
Possibilities: hierarchical, Jarvis-Patrick, monothetic, kmeans, binning
Choice: Sample-based kmeans
  Fast, scalable, controllable, use 166 keys
Published algorithm: L. Kaufman & P. J. Rousseeuw, Finding Groups in Data, Wiley, 1991 (win-www.uia.ac.be/u/statis/programs)
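
The general idea of sample-based clustering (fit cluster centers on a random sample, then assign every structure to its nearest center) can be sketched as below. This is a generic illustration and not the published algorithm cited on the slide; the function names, the squared Euclidean distance on 0/1 key vectors, and the toy 4-bit data are all assumptions.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(point, centers):
    """Index of the center closest to `point` (first one wins on ties)."""
    return min(range(len(centers)), key=lambda i: sq_dist(point, centers[i]))

def kmeans(points, k, rng, iterations=20):
    """Plain k-means (Lloyd iterations) started from k randomly chosen points."""
    centers = rng.sample(points, k)
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Recompute each center as the column-wise mean of its group;
        # keep the old center if a group happens to be empty.
        centers = [
            [sum(col) / len(g) for col in zip(*g)] if g else centers[i]
            for i, g in enumerate(groups)
        ]
    return centers

def sample_based_kmeans(points, k, sample_size, seed=0):
    """Fit centers on a random sample, then label every point."""
    rng = random.Random(seed)
    sample = rng.sample(points, min(sample_size, len(points)))
    centers = kmeans(sample, k, rng)
    return [nearest(p, centers) for p in points]

# Toy 4-bit "key" vectors standing in for 166-key fingerprints.
points = [[1, 1, 0, 0], [1, 1, 0, 0], [1, 0, 0, 0], [1, 1, 1, 0], [1, 1, 0, 1],
          [0, 0, 1, 1], [0, 0, 1, 1], [0, 1, 1, 1], [0, 0, 0, 1], [1, 0, 1, 1]]
labels = sample_based_kmeans(points, k=2, sample_size=6)
print(labels)
```

Only the sample takes part in the iterative step, so the cost of fitting stays fixed as the database grows; the full set is touched once, for the final assignment.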

Chemical Informatics: Classification
Goal: rapid, simple UI, interpretable results
Possibilities: C4.5, FIRM, CART, various Bayesian methods, etc.
Choices: C4.5/C4.5rules, FIRM
  Available, fast, simple I/O
Published algorithms:
  C4.5: J. R. Quinlan, C4.5, Morgan Kaufmann, 1992.
  FIRM: D. Hawkins, UMN Statistics Dept. (ftp://ftp.stat.umn.edu/pub/firm/ibmfirm.exe)
  Others: WEKA - I. H. Witten et al. (pure Java) (http://www.cs.waikato.ac.nz/ml/weka/)

Example: 5-HT3 Antagonists
Bind to ligand-gated ion-channel receptor for serotonin.
Therapeutically useful in treatment of emesis, anxiety, schizophrenia, drug dependence, and age-related memory loss.
Often show mixed agonist/antagonist properties, depending on site of action.
Example refs:
  Cappelli, et al., J. Med. Chem., 1998, 41, 728-741
  Rival, et al., J. Med. Chem., 1998, 41, 311-317

Approaches to Library Construction
Reagent Based
Product Based

Reagent Based Library Construction
Build around a known scaffold
[Figure: scaffold structure]
Y. Rival, et al., J. Med. Chem. 1998, 41, 311-317.

Arylpiperazine Interactions
[Figure: arylpiperazine structure with labeled regions: limited space (three regions), attractive short-range interactions, aromatic interactions]

One-Step Chemistry
[Figure: reaction scheme]

Diversity Around The Ring
ACD (Available Chemicals Directory)
70 imidoyl chlorides
26 with Mw < 250
Small molecular volume substitution at R2 & R3
Electron withdrawing & donating substitution at R1
[Figure: imidoyl chloride structure with substituent positions R1, R2, R3]

Selecting Reagents
Substructure search
Filter on M. Wt.
Cluster filtered reagents

Product Based Library Construction
Use modal fingerprints to characterize active compounds
Choose a backbone (quinoline) for the compounds
Search ACDSC for quinoline-containing molecules; refine the search with the modal fingerprint

Clustering Extracted Structures
Cluster analysis on substructure keys
Drilldown into cluster

Decision Tree Classification
Method:
  Select 200 5-HT3 antagonists from MDDR20001
  Select an equal number of most similar structures lacking reported 5-HT3 activity
  Use 166 User keys as descriptors
  Run a variety of classification programs, using 10-fold cross validation
Example results:
  Method        % correct   Number of rules
  C45           91          14
  C45Rules      68          2
  Ripper        85          6
  Naïve Bayes   76          --
  Regression    77          --
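
For readers unfamiliar with the validation step, 10-fold cross-validation can be sketched as below: the data are split into ten folds, and each fold is held out once while a model is trained on the other nine. The helper names and the toy one-key "stump" classifier are hypothetical stand-ins; the programs listed above are not reproduced here, and the toy data are constructed so that the single key is perfectly informative.

```python
import random

def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists; each item is held out exactly once."""
    order = list(range(n_items))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

def cross_validated_accuracy(features, labels, fit, predict, k=10):
    """Fraction of items predicted correctly while held out."""
    correct = 0
    for train, test in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, features[i]) == labels[i] for i in test)
    return correct / len(labels)

def fit_stump(X, y):
    """Toy rule on key 0: majority label among rows with the key unset / set."""
    votes = {0: [0, 0], 1: [0, 0]}
    for row, label in zip(X, y):
        votes[row[0]][label] += 1
    return {bit: int(v[1] > v[0]) for bit, v in votes.items()}

def predict_stump(model, row):
    return model[row[0]]

# Toy balanced set: 200 "actives" (label 1) and 200 "inactives" (label 0).
labels = [1] * 200 + [0] * 200
features = [[label] for label in labels]
accuracy = cross_validated_accuracy(features, labels, fit_stump, predict_stump)
print(accuracy)
```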

Classification Results
Training Set Classification Results:
               166 Keys             960 Keys
  Method       correct   # rules    correct   # rules
  C4.5         81%       80         84%       67
  C4.5Rules    68%       2          69%       12
  FIRM         73%       5          78%       7
402 structures (204 active, 198 inactive)
Default program parameters
10-fold cross-validation results

Library Evaluation
ACDSC - 1.4M structure screening database
Pick structures similar to 960-key modal f.p. and those similar to 2275-key modal f.p.
Test structures against decision tree models
Number of structures predicted to be active:
           Similarity basis
  Model    960 key m.f.p.   2275 key m.f.p.
  C4.5     343/2908         379/3067
  FIRM     193/2908         472/3067
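
Expressed as fractions, the counts above correspond to predicted-active rates of roughly 11.8% and 12.4% for C4.5 and 6.6% and 15.4% for FIRM. The short sketch below simply recomputes these from the slide's numbers; the dictionary layout and key labels are illustrative.

```python
# (model, similarity basis) -> (predicted active, structures tested), from the slide.
predicted = {
    ("C4.5", "960-key"): (343, 2908),
    ("C4.5", "2275-key"): (379, 3067),
    ("FIRM", "960-key"): (193, 2908),
    ("FIRM", "2275-key"): (472, 3067),
}

rates = {key: hits / total for key, (hits, total) in predicted.items()}
for (model, basis), rate in rates.items():
    print(f"{model:5s} {basis:9s} {rate:.1%}")
```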

Conclusions
166 and 960-key sets performed similarly; 960 keys were marginally better for training set predictions.
Decision tree performance correlated with rule-size. FIRM gave the best performance/rule-size ratio.
Very little overlap was observed between keys selected by the decision tree programs and those selected in the modal fingerprints - suggesting the approaches are complementary.
The 960-key set can be expanded to 2275 unrolled keys; similarity to these keys and the 960-key set yields different selections of structures, which behave differently in decision tree predictions.

Future Work
Run classifications using the 2275-key set.
Investigate other classifiers (kNN, Bayesian, CART, etc.)
Investigate relationship between similarity to modal fingerprint and level of biological activity.
Develop classification and fingerprint models for ADME properties.
Utilize query features in modal fingerprints...

Acknowledgments
Doug Henry, David Evans, and Maurizio Bronzetti
Richard Briggs, Burt Leland, and the rest of the Cheshire Team
Ali Ozkabak and the rest of the Reagent Selector team