Functional Group Fingerprints
CNS Chemistry, Wilmington, USA
James R. Arnold, Charles L. Lerman, William F. Michne, James R. Damewood
American Chemical Society National Meeting, August 2004, Philadelphia, PA
Functional Group Fingerprint Objectives
-Develop a 2D fingerprint searching method that uses medicinally relevant, custom-defined functional groups.
-Validate the method on a large dataset of biological targets and chemical classes.
-Examine approaches for enhancing the accuracy, and reducing the false positive rates, of 2D searches.
-Deploy the method company-wide.
-Rapidly expand SAR in Hit/Lead Identification.
-Subset the corporate collection for screening.
-Factor of 2 improvement in every 2D similarity search.
Many Target Classes Approached by Ligand-Based Methods
The Impetus for Developing Functional Group Fingerprints
Biochemical Classes of Drug Targets of Current Therapies (perhaps 70% of targets approached by ligand-based methods):
-Receptors, 45%
-Enzymes, 28%
-Unknown, 7%
-DNA, 2%
-Nuclear Receptors, 2%
-Hormones & Factors, 11%
-Ion Channels, 5%
Drews, J. Science 2000, 287, 1960-1964
Functional Groups are One Aspect of Medicinal Chemistry Reasoning in Ligand-Based Design
-Express one aspect of the knowledge of our most experienced people for wide use.
-Design is often partially driven by functional group features. These can be warheads, linkers, or substituents for interaction with receptors, and they influence many molecular properties.
-Functional Group Fingerprints capitalize on proven Medicinal Chemistry approaches, and two-dimensional searches are widely used.
Functional Group Fingerprint Based Classification and Similarity Searching
-Classification based on 400 medicinally relevant functional groups
-Classification translated into bit strings
Example: Imigran (1), GSK; a 1.07 billion dollar treatment for migraine in 2000.
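A minimal sketch of how a functional group classification could be translated into a bit string. The five-entry vocabulary, the example classification, and the helper name `to_bits` are illustrative assumptions; they are not the actual 400 definitions or the deployed code.

```python
# Hypothetical vocabulary: a tiny stand-in for the ~400 functional group definitions.
VOCABULARY = [
    "aromatic nitrogen",
    "secondary amine",
    "tertiary amine",
    "sulfonamide",
    "carboxylic acid",
]

def to_bits(groups_found, vocabulary=VOCABULARY):
    """Return a binary fingerprint string with '1' where a group is present."""
    present = set(groups_found)
    return "".join("1" if name in present else "0" for name in vocabulary)

# Illustrative classification of one molecule (assumed for this example).
example_groups = ["aromatic nitrogen", "tertiary amine", "sulfonamide"]
fingerprint = to_bits(example_groups)
print(fingerprint)  # 10110
```

Each position in the string corresponds to one functional group definition, so two molecules can be compared position by position.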
Functional Groups are Recognized Algorithmically Using SMARTS
The exclusions make the functional group definitions specific and make the entire set as orthogonal as possible.
Functional Groups are Defined to Minimize Overlap Between Definitions
The above imide is defined as an imide, not as two carbonyls, two amides and an amine.
Orthogonal functional group definitions allow specific functional groups to be related to activity.
Most Common Functional Group Classes in 2003 MDDR
Functional Group              Frequency
aromatic nitrogen                 9,609
3° amine, not α arom.             8,493
2° alcohol                        8,104
aryl halide                       7,896
2° amide                          7,805
acyclic ether, α arom.            7,262
carboxylic acid                   6,464
alkene                            6,044
carboxylic ester                  5,199
cyclic ether                      4,778
aromatic alcohol                  3,691
aromatic sulfur, thio.            3,485
2° amine, not α arom.             3,281
aromatic sulfur                   3,185
3° amide                          3,113
acyclic ether                     3,018
3° amine, α arom                  2,938
imidazole, fused, no              2,737
1° alcohol                        2,662
1° amine, α arom.                 2,488
1° amine, not α arom.             2,474
acetal ketal                      2,177
β-lactam, fused                   1,931
3° alcohol                        1,921
3° lactam                         1,814
aromatic -, no                    1,709
α,β-unsaturated acid              1,666
2° amine, α arom.                 1,634
lactone                           1,586
cyclic ether α 1 arom             1,552
cyclic thioether                  1,552
imidazole, fused, w/              1,533
ketone                            1,513
ketone, α arom.                   1,476
aromatic ketone                   1,385
aromatic w/                       1,326
aromatic oxygen                   1,283
acyclic thioether α arom          1,240
α,β-unsaturated ester             1,188
acyclic thioether                 1,166
1° amide                          1,146
2° lactam                         1,136
oxime ether                       1,135
trihalide                         1,135
nitrile                           1,122
sulfonamide,                      1,075
urethane,                         1,071
urea                                864
General categories are shown; actual functional group classifications are more specific.
Classification Quality: Coverage and Overlap of Functional Group Definitions
Coverage: all heteroatoms in the molecule are classified.
Overlap: a heteroatom in the molecule is classified in > 1 functional group.
[Figure: bar chart of % coverage and % overlap (0 to 100) for the CMC, MDDR and MedChem databases, with ideal coverage and ideal overlap marked. Database sizes: CMC = 8,545; MDDR = 135,342; MedChem = 145,158.]
Testing in medicinally relevant databases: roughly 90% coverage and 10% overlap.
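The two quality measures can be sketched as follows. The slide does not state whether the percentages are computed per heteroatom or per molecule; this sketch assumes a per-heteroatom count, and the function name and example assignments are hypothetical.

```python
def coverage_and_overlap(assignments):
    """Return (% coverage, % overlap) for a heteroatom-to-groups mapping.

    `assignments` maps each heteroatom index to the set of functional
    group names that claimed it (an assumed data layout).
    """
    total = len(assignments)
    if total == 0:
        return 0.0, 0.0
    # Covered: the heteroatom belongs to at least one functional group.
    covered = sum(1 for groups in assignments.values() if len(groups) >= 1)
    # Overlapping: the heteroatom belongs to more than one functional group.
    overlapping = sum(1 for groups in assignments.values() if len(groups) > 1)
    return 100.0 * covered / total, 100.0 * overlapping / total

# Illustrative molecule with five heteroatoms (made-up assignments).
example = {
    0: {"imide"},
    1: {"imide"},
    2: {"imide"},
    3: {"acyclic ether", "acetal ketal"},  # claimed twice: counts as overlap
    4: set(),                              # unclassified: reduces coverage
}
cov, ovl = coverage_and_overlap(example)
print(cov, ovl)  # 80.0 20.0
```

Under this reading, the ideal definition set gives 100% coverage and 0% overlap.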
Biological Validation: 538 Target Classes in MDDR
-Active compounds randomly divided into test and training sets.
-Each target class had > 10 actives, or it was not included.
-On average: 473 actives in 94 clusters* (Daylight) for each class.
-Compounds in MDDR: 4.5 functional groups is the median.
[Figure, first panel: "# Functional Groups MDDR (Cumulative)", % compounds (0 to 100) versus # functional groups (0 to 10). Second panel: "# Cpds and # Clust in Tgt. Classes", # clusters at Tanimoto 0.3 (0 to 3,000) versus # compounds (0 to 14,000), 537 target classes.]
* Clusters generated with Daylight fingerprints at Tanimoto = 0.3
Tanimoto Scores From Functional Groups
Tanimoto is based on the presence of a functional group (binary) or on counts (count).
Count Tanimoto (C. Lerman):
B1 = FG in mol 1
B2 = FG in mol 2
BC = FG common to mol 1 and mol 2
T dist = (B1 + B2 - 2 * BC) / (B1 + B2 - BC)
Distance matrix for the four example structures S1 to S4:
       1      2      3      4
1   ----   0.25   0.60   0.67
2          ----   0.67   0.50
3                 ----   0.20
4                        ----
One functional group difference = distance 0.2-0.25
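A minimal sketch of the distance formula above for both the binary and count variants. Treating BC in the count case as the sum of the smaller count for each shared group is an assumption, since the slide does not spell it out; the function name and example group lists are hypothetical.

```python
from collections import Counter

def tanimoto_distance(fg1, fg2, use_counts=True):
    """T dist = (B1 + B2 - 2*BC) / (B1 + B2 - BC) for two lists of group names."""
    if use_counts:
        c1, c2 = Counter(fg1), Counter(fg2)
    else:
        # Binary variant: each functional group is counted at most once.
        c1, c2 = Counter(set(fg1)), Counter(set(fg2))
    b1 = sum(c1.values())
    b2 = sum(c2.values())
    # Multiset intersection keeps the smaller count per group (assumed meaning of BC).
    bc = sum((c1 & c2).values())
    denominator = b1 + b2 - bc
    if denominator == 0:
        return 0.0  # two molecules with no functional groups
    return (b1 + b2 - 2 * bc) / denominator

mol_a = ["aryl halide", "ketone", "nitrile", "urea"]
mol_b = mol_a + ["alkene"]          # one extra functional group
print(tanimoto_distance(mol_a, mol_b))  # 0.2

mol_c = ["aryl halide", "ketone", "nitrile"]
print(tanimoto_distance(mol_c, mol_a))  # 0.25
```

The two printed values reproduce the rule of thumb on the slide: a single functional group difference gives 1/5 = 0.2 when the larger molecule has five groups and 1/4 = 0.25 when it has four.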
Average Percentage Actives Recovered
538 Target Classes in MDDR 2003
Actives in each target class randomly divided into test and training sets. Recovery of the test set using the training set is graphed.
[Figure: % actives retrieved (0 to 100) versus rank in MDDR (0 to 120,000) for the Binary, Counts, Daylight and Consensus methods, with random recovery shown for comparison.]
Recovery rates (% actives retrieved):
Method       Top 100   Top 500   Top 1,000   Top 5,000
Binary          25.7      49.6        59.4        75.8
Counts          31.4      54.3        63.1        78.1
Daylight        38.2      56.4        68.3        82.2
Consensus       37.7      65.0        74.5        87.9
> 60% of actives in the top 1% of the database. MDDR 2003: > 135,000 cpds.
Tanimoto Enrichment Rate Analysis
538 Target Classes in MDDR 2003
Actives in each target class randomly divided into test and training sets. Recovery of the test set using the training set is graphed.
Enrichment Rate Equation:
A = # actives at Tanimoto
B = # cpds total at Tanimoto
ADB = total actives in DBase
DB = total cpds in DBase
E = (A / B) / (ADB / DB)
Enrichments are normalized for the number of actives in the target class.
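The enrichment rate equation can be written as a short function. The input numbers below are made up purely to show the arithmetic and are not taken from MDDR; the function name is hypothetical.

```python
def enrichment(a, b, adb, db):
    """E = (A / B) / (ADB / DB).

    a:   actives retrieved at the Tanimoto cutoff
    b:   total compounds retrieved at the Tanimoto cutoff
    adb: total actives in the database
    db:  total compounds in the database
    """
    if b == 0 or adb == 0 or db == 0:
        raise ValueError("b, adb and db must be non-zero")
    return (a / b) / (adb / db)

# Hypothetical search: 25 actives among 100 hits, 125 actives in 8,000 compounds.
e = enrichment(a=25, b=100, adb=125, db=8000)
print(e)  # 16.0
```

Here the hit list is 25% active while the database as a whole is about 1.6% active, so the search enriches actives sixteen-fold over random selection.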
Example Biological Categories: MDDR 2003
Test = number of cpds with biology in test set. Act = number of cpds with biology retrieved. Tot = number of cpds total retrieved. ER = enhance ratio. D = Daylight, FG = FGroup, C = Consensus, all at 0.3.
Biology                      Test   D Act   D Tot    D ER  FG Act  FG Tot   FG ER   C Act   C Tot    C ER
Neurokinin NK2 Antagonist     147     132    3112    38.3     110    1488    66.7     106     469   203.9
Neurokinin NK3 Antagonist      25      23     220   566.0      17     517   178.0      16      81  1069.4
Protein Kinase C Inhibitor    225     199    1619    73.9     151    1056    86.0     145     503   173.4
HIV-1 Protease Inhibitor      457     411    2547    47.8     336    1575    63.2     327     899   107.7
5HT1B Agonist                  24      18     322   315.2      15     202   418.8      12     120   563.9
mglur1 Antagonist              20      13      95   926.0       8     298   181.7       7      47  1007.9
Thrombin Inhibitor            555     493    3571    33.7     417    1698    59.9     399    1004    96.9
Factor Xa Inhibitor           379     293    2307    45.4     238    1326    64.1     215     546   140.6
GABA-B Receptor Antagonis      21      15      33  2929.5       8      33  1562.4       7      16  2819.6
Adrenergic_beta_Blocker        89      70     494   215.5      76     618   187.0      67     228   446.9
Potassium_Channel_Blocke      132     110     644   175.1      88    1512    59.7      86     345   255.6
Sodium_Channel_Blocker         97      64     419   213.1      52    1775    40.9      49     220   310.8
ACE_Inhibitor                 266     232    3185    37.1     198    1169    86.2     182     642   144.2
Estrogen_Receptor_Modulat      59      46     466   226.4      41     953    98.7      37     227   373.9
Dopamine_D2_Agonist            80      71     434   276.8      53    1323    67.8      52     187   470.4
Dopamine_D2_Antagonist        244     187    1283    80.8     158    3147    27.9     146     546   148.3
Thymidylate_Synthetase_Inh    128     120     493   257.4     106     365   307.1     103     274   397.5
Dihydrofolate_Reductase_Inh    72      61     322   356.1      61     340   337.2      58     255   427.6
Renin_Inhibitor               599     575    3164    41.1     527    1766    67.4     516    1337    87.2
Trypsin_Inhibitor              51      40     485   218.9      32     133   638.5      29      93   827.5
Antiviral                    1635    1377   11080    10.3    1167    7345    13.2    1116    4121    22.4
Antiinflammatory             2363    1915   13221     8.3    1495   14285     6.0    1376    5603    14.1
Consensus Approach: Overlap of True Positives from FG Count and Daylight
The circles are drawn to scale.
Performance of the FG Count, Daylight and Consensus Approaches in Terms of True and False Positives
[Figure: bar chart of number of compounds (0 to 1,500) for each method at Tanimoto distances of 0.0, 0.1, 0.2 and 0.3. FG = FG Count, D = Daylight, C = Consensus.]
Number of true and false positives for the Functional Group Fingerprint counts, Daylight fingerprint and consensus (logical AND) approaches for the five hundred and thirty-eight biological target classes at Tanimoto distances of 0.0, 0.1, 0.2, and 0.3. The three methods are binned at the various Tanimoto distances, are reported in the order counts, Daylight, consensus, and are labeled FG, D and C, respectively.
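The consensus (logical AND) step and the true/false positive tally can be sketched with set operations. The compound identifiers and hit lists below are invented for illustration, and both function names are hypothetical.

```python
def consensus(fg_hits, daylight_hits):
    """Logical AND: keep only compounds retrieved by both methods."""
    return set(fg_hits) & set(daylight_hits)

def true_false_positives(hits, actives):
    """Return (true positives, false positives) for a hit list."""
    hits = set(hits)
    true_pos = len(hits & set(actives))
    return true_pos, len(hits) - true_pos

# Made-up example: three known actives and two hit lists at the same cutoff.
actives = {"c1", "c2", "c3"}
fg_hits = {"c1", "c2", "c4", "c5"}
daylight_hits = {"c1", "c2", "c3", "c6"}
consensus_hits = consensus(fg_hits, daylight_hits)

print(true_false_positives(fg_hits, actives))         # (2, 2)
print(true_false_positives(daylight_hits, actives))   # (3, 1)
print(true_false_positives(consensus_hits, actives))  # (2, 0)
```

In this toy case the consensus list gives up one true positive relative to Daylight alone but removes every false positive, which is the trade-off the figure examines.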
Functional Group Fingerprint Conclusions
-Developed a 2D fingerprint searching method that uses medicinally relevant, custom-defined functional groups.
-Validated the method on a large dataset of biological targets and chemical classes (538 target classes, 473 cpds, 90 clust per class).
-Factor of 2 gain in accuracy through reduction of false positives.
-Deploy the method company-wide.
-Rapidly expand SAR in Hit/Lead Identification.
-Subset the corporate collection for screening.
-Factor of 2 improvement in 2D searches.
Acknowledgement: Dave Cosgrove, AstraZeneca