Cheminformatics analysis and learning in a data pipelining environment


Molecular Diversity (2006) 10: 283-299
DOI: 10.1007/s11030-006-9041-5
© Springer 2006, Review

Moises Hassan(1,*), Robert D. Brown(1), Shikha Varma-O'Brien(2) & David Rogers(1)
(1) SciTegic, Inc., 10188 Telesis Court, Suite 100, San Diego, CA 92121, USA
(2) Accelrys, Inc., 10188 Telesis Court, Suite 100, San Diego, CA 92121, USA
(*) Author for correspondence, E-mail: mhassan@scitegic.com, Tel: +(858)-279-8800, Fax: +(858)-279-8804

Received 12 December 2005; Accepted 23 February 2006

Key words: Bayesian models, bioactivity prediction, data mining, data pipelining, maximal common substructure search, molecular fingerprints, molecular similarity, virtual screening

Summary

Workflow technology is being increasingly applied in discovery informatics to organize and analyze data. SciTegic's Pipeline Pilot is a chemically intelligent implementation of a workflow technology known as data pipelining. It allows scientists to construct and execute workflows using components that encapsulate many cheminformatics-based algorithms. In this paper we review SciTegic's methodology for molecular fingerprints, molecular similarity, molecular clustering, maximal common subgraph search and Bayesian learning. Case studies are described showing the application of these methods to the analysis of discovery data such as chemical series and high throughput screening results. The paper demonstrates that the methods are well suited to a wide variety of tasks such as building and applying predictive models of screening data, identifying molecules for lead optimization and the organization of molecules into families with structural commonality.

Abbreviations: MCSS, maximal common substructure search; ECFP, extended connectivity fingerprints; FCFP, functional class fingerprints; MDDR, MDL Drug Data Report; WDI, World Drug Index; CATS, chemically advanced template search; BKD, binary kernel discrimination; CDK2, cyclin-dependent kinase 2; DHFR, Escherichia coli dihydrofolate reductase

Introduction

Over the last decade the volume and complexity of data generated in drug discovery has increased massively. The analysis of the data, to turn it into useful information on which to base project decisions, is complicated by the scatter of this data across multiple data source locations and by the proliferation of point solutions which perform parts of a complete data analysis. More recently, workflow technologies have been introduced to streamline and automate the process of data retrieval, organization, analysis and reporting. One such workflow technology is data pipelining, in which individual components that perform specific data retrieval, calculation or analysis tasks are graphically wired together to form a protocol or workflow. In such protocols data automatically flows between tasks, allowing a complete data analysis to be performed. A variety of workflow tools exist to handle text and numeric data. However, to be useful for discovery, a workflow system must also understand data types such as molecule and sequence and must provide chemically- or biologically-intelligent components that can act on this data. In this paper we review some of the chemical intelligence within Pipeline Pilot, the leading commercial implementation of data pipelining for discovery [1]. Specifically, we discuss capabilities in molecular fingerprinting, similarity searching, clustering, maximal common subgraph searching and Bayesian learning.
We review both algorithmic details of these methods and case studies, conducted both by SciTegic and others, that demonstrate the application of these methods.

Methodology

Here we present details of the implementation of several commonly used cheminformatics tools in Pipeline Pilot, including molecular fingerprints, similarity calculations, clustering, maximal common subgraph search, and Bayesian model learning.

Molecular fingerprints

Molecular fingerprints [2] are representations of chemical structures originally designed to assist in chemical database searching, but later used for analysis tasks such as similarity searching [3], clustering [4] or recursive partitioning [5]. An interesting subclass of molecular fingerprints is circular substructure fingerprints; in this case, each feature is derived from a substructure centered on some atom and extending some number of bonds in all directions. A member of this class of fingerprints was first described for the DARC substructure search system [6]; many circular substructural variants have since been described [7-13].

Extended-connectivity fingerprints (ECFPs) are a new class of circular substructural fingerprint for molecular characterization. ECFPs are derived using a variant of the Morgan algorithm [14], which was originally proposed as a method for solving the molecular isomorphism problem (that is, identifying when two molecules, with different atom numberings, are the same). The generation of extended-connectivity fingerprints for a molecule begins with the assignment of an initial atom identifier for each heavy (non-hydrogen) atom in the molecule. In theory, any atom-typing rule could be used. In practice, we have found two rules the most useful: the Daylight atomic invariants [15] (the number of connections; the number of bonds to non-hydrogen atoms; the atomic number; the atomic mass; the atomic charge; and the number of attached hydrogens) or a functional-class rule (whether the atom is a hydrogen-bond acceptor; a hydrogen-bond donor; positively ionizable; negatively ionizable; aromatic; or a halogen). If the former is chosen, ECFPs result; if the latter, FCFPs. A number of iterations of the Morgan algorithm are then performed. At each iteration, each atom collects the codes of its neighboring atoms and hashes them with its own code, generating a new code. The collection of these codes is the fingerprint. (A detailed description of this process is published elsewhere [16].) ECFPs and FCFPs are powerful representations that will be used in learning, clustering, and similarity search, as described in later sections of this paper.
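
Pipeline Pilot's ECFP/FCFP generators are proprietary, but the open-source RDKit Morgan fingerprints follow the same iterative, neighborhood-hashing idea and can stand in for a quick illustration. A minimal sketch, in which the SMILES, radius and bit-vector size are arbitrary example inputs rather than anything prescribed in this paper:

```python
from rdkit import Chem
from rdkit.Chem import AllChem

mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")  # aspirin, example input

# ECFP-like: atom identifiers from connectivity invariants, 2 iterations ("ECFP_4"-like)
ecfp = AllChem.GetMorganFingerprint(mol, 2)                              # count-based, sparse
ecfp_bits = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048)    # folded bit vector

# FCFP-like: atom identifiers from pharmacophoric feature classes (donor, acceptor, ...)
fcfp = AllChem.GetMorganFingerprint(mol, 2, useFeatures=True)

print(len(ecfp.GetNonzeroElements()), "distinct ECFP-like features")
```
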
Molecular similarity

The basic principle of molecular similarity is the idea that molecules with similar structural and physicochemical properties are more likely to have similar biological activities. This principle underlies many drug-discovery applications such as database searches, virtual screening, library focusing, and prediction of ADME/Tox and other physicochemical properties [3, 17]. The Molecular Similarity component in Pipeline Pilot allows the calculation of the similarity between sets of reference and target molecules based on SciTegic's molecular fingerprints (ECFPs or FCFPs), other commonly used fingerprints such as the MDL public keys, or new, user-defined fingerprints. The computation has been optimized for speed and memory usage, allowing the efficient comparison of large reference and target sets. For example, for a given target molecule, the identification of the five most similar molecules from a set of 500,000 references can be done in under 4 min; and the calculation and retrieval of the five most similar reference molecules from a set of 100,000 for each one of the target molecules in a set of 10,000 can be done in about 10 min (times obtained on a 3.0 GHz Windows machine, including loading the molecules and calculating the fingerprints).

Similarity values can be calculated using several known coefficients such as Tanimoto, Dice, or Cosine. The following contributions are common to the definitions of these coefficients:

$$S_A = \sum_i x_{Ti}\,x_{Ri}, \qquad S_B = \sum_i x_{Ti}^2 - \sum_i x_{Ti}\,x_{Ri}, \qquad S_C = \sum_i x_{Ri}^2 - \sum_i x_{Ti}\,x_{Ri}$$

Here, $x_{Ti}$ and $x_{Ri}$ are the values of the ith descriptor in the target and reference, respectively. Note that these are generic definitions that work for value-based similarity (using Counts, the number of times that each fingerprint key is observed in the molecule) as well as binary bit-based fingerprints (presence or absence of each fingerprint key). When using bit-based calculations, only the values 1 and 0 are possible, and the definitions of the contributions above simplify to:

SA = number of bits defined in both the target and the reference
SB = number of bits defined in the target but not the reference
SC = number of bits defined in the reference but not the target

Using these definitions for the coefficient contributions, the Tanimoto, Dice, and Cosine similarity coefficients are defined as follows:

$$\mathrm{Tanimoto} = \frac{S_A}{S_A + S_B + S_C}, \qquad \mathrm{Dice} = \frac{2 S_A}{2 S_A + S_B + S_C}, \qquad \mathrm{Cos} = \frac{S_A}{\sqrt{(S_A + S_B)(S_A + S_C)}}$$

The component also allows users to define their own similarity coefficient, specified as a function of the SA, SB, and SC contributions.
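
As a worked illustration of these definitions, the minimal sketch below computes the bit-based SA, SB and SC contributions from two RDKit bit-vector fingerprints and evaluates the three coefficients; a user-defined coefficient would simply be another function of the same three numbers. This is an illustrative sketch, not the Molecular Similarity component itself, and the example SMILES and fingerprint settings are assumptions:

```python
from math import sqrt
from rdkit import Chem
from rdkit.Chem import AllChem

def contributions(fp_target, fp_reference):
    """Bit-based SA, SB, SC as defined above."""
    t, r = set(fp_target.GetOnBits()), set(fp_reference.GetOnBits())
    sa = len(t & r)        # bits set in both target and reference
    sb = len(t - r)        # bits set in the target only
    sc = len(r - t)        # bits set in the reference only
    return sa, sb, sc

def tanimoto(sa, sb, sc): return sa / (sa + sb + sc)
def dice(sa, sb, sc):     return 2 * sa / (2 * sa + sb + sc)
def cosine(sa, sb, sc):   return sa / sqrt((sa + sb) * (sa + sc))

fp = lambda smi: AllChem.GetMorganFingerprintAsBitVect(Chem.MolFromSmiles(smi), 2, nBits=2048)
sa, sb, sc = contributions(fp("c1ccccc1CCN"), fp("c1ccccc1CCO"))
print(tanimoto(sa, sb, sc), dice(sa, sb, sc), cosine(sa, sb, sc))
```
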

Tasks that can be easily accomplished in Pipeline Pilot using the Molecular Similarity component include:

- For each target molecule, find the nearest reference molecule at any similarity.
- For each target molecule, provide a list of reference molecules that are within a given similarity value, say, 0.7 or greater, or a list of some number of nearest reference molecules, say, the nearest 5.
- Provide the list of target molecules that are not within a given similarity value of any of the reference molecules.
- Rank the target molecules by similarity to a reference set using group data fusion metrics such as maximum or average similarity.

Data fusion calculations

Pipeline Pilot provides an ideal framework to implement and deploy complex protocols that require reading molecular data from a variety of sources, calculation of molecular properties, analysis of the results and visualization of the output. The implementation of a protocol to carry out similarity calculations based on group data fusion [18] (Figure 1) offers an excellent example. In this case, the target molecules are read from an SD file and passed to the Data Fusion Similarity subprotocol. This subprotocol, which can be constructed by the user with components available in Pipeline Pilot, has parameters that specify all the information needed to perform the calculation:

- Location of one or more SD files containing the reference molecules
- Number of random reference molecules to select that define the group
- Group metric to use: either Maximum Similarity or Average Similarity
- Number of most similar target molecules to return
- Fingerprint property to use in the similarity calculation (ECFPs, FCFPs, ...)
- Similarity coefficient to use (Tanimoto, Cosine, ...)

Figure 1. Pipeline Pilot protocol to find the most similar target molecules with respect to a set of reference molecules using group data fusion.

Looking at the data flow in the subprotocol, we see that it starts by reading the reference molecules from the specified SD files and then selecting the required number of reference molecules (10 in this case) by assigning a random number to each one and keeping the 10 with the largest random numbers. These 10 reference molecules are tagged as references and passed to the Molecular Similarity component along with all the target molecules (51,058 in this case). The Molecular Similarity component calculates the required fingerprint properties and the similarity values between each reference and target molecule and, for each target molecule, outputs the similarity values with respect to all the reference molecules. The next component, Calculate Group Similarity Metric, contains a script to calculate, for each target molecule, either the average or the maximum similarity with respect to the group of reference molecules. The last step in the subprotocol is to pass the target molecules with the calculated group metric to a filter component to keep only the top N molecules (5 in this case) with the maximum values for the group metric. The selected (most similar) molecules can then be visualized in an HTML table, or saved to a file or database.
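
The logic of this subprotocol can be sketched in a few lines of Python with RDKit. The SD file paths, Morgan fingerprints, group size and Top-N below are stand-ins for the protocol's parameters, not values taken from the paper:

```python
import random
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def group_fusion_rank(ref_sdf, target_sdf, group_size=10, top_n=5, metric=max):
    """Rank target molecules by their maximum (or average) similarity to a
    random group of reference molecules, mirroring the subprotocol above."""
    refs = [m for m in Chem.SDMolSupplier(ref_sdf) if m is not None]
    targets = [m for m in Chem.SDMolSupplier(target_sdf) if m is not None]

    group = random.sample(refs, group_size)            # random reference group
    fp = lambda m: AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048)
    group_fps = [fp(m) for m in group]

    scored = []
    for t in targets:
        sims = DataStructs.BulkTanimotoSimilarity(fp(t), group_fps)
        scored.append((metric(sims), t))                # metric=max, or e.g. lambda s: sum(s)/len(s)
    scored.sort(key=lambda x: x[0], reverse=True)
    return scored[:top_n]                               # the Top-N most similar targets
```
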

Clustering

Pipeline Pilot's clustering method was developed to rapidly cluster large data sets, particularly large sets of molecular data. It is a partitioning method [19, 20] in which the original data set is partitioned into ever-smaller regions that define the clusters. In a partitioning method, a number of representative objects are chosen from the data set. The corresponding clusters are found by assigning each remaining object to the representative object that is the closest. The representative objects are called the cluster centers, and the other objects are the cluster members. The distance function between the objects is a Euclidean distance (if numeric properties are used), Tanimoto distance (defined as 1 - Tanimoto similarity, if fingerprints are used), or a combination of the two (if both types of properties are used).

The method for selecting the cluster centers is a maximum dissimilarity method [21]. It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats itself until there are a sufficient number of cluster centers. The non-selected objects are then assigned to the nearest cluster center to determine the cluster membership. Optionally, the clusters can be refined by iteratively recomputing the cluster centers based on the cluster membership and then reassigning cluster members from one cluster to another if they turn out to be closer to the second cluster's center following the recalculation.

The Cluster Molecules component in Pipeline Pilot allows the user to specify the number of clusters to use to partition the data set, or to specify an average number of molecules per cluster, in which case the number of clusters to use is automatically calculated. Figure 2 shows plots of computational time vs. number of molecules for clustering molecules from the Asinex data set [22]. The times increase linearly, O(n), with the number of molecules when the total number of clusters is set to a fixed value, 50 clusters in this case. Times increase more rapidly with the number of molecules when the average number of molecules per cluster is set to a fixed value (50 molecules/cluster), as the total number of clusters increases quickly with the number of molecules.

Figure 2. Computation time vs. number of molecules for clustering with a fixed number of total clusters and with a fixed average number of molecules per cluster.
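
A compact sketch of this partitioning scheme (maximum-dissimilarity center selection followed by nearest-center assignment), using Tanimoto distance on Morgan fingerprints as a stand-in for the component's distance options; the optional refinement step is omitted and the first center is simply the first record:

```python
from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def cluster_molecules(mols, n_clusters):
    """Pick cluster centers by maximum dissimilarity, then assign every
    molecule to its nearest center (Tanimoto distance = 1 - similarity)."""
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=2048) for m in mols]
    dist = lambda i, j: 1.0 - DataStructs.TanimotoSimilarity(fps[i], fps[j])

    centers = [0]                                   # first center (the component picks one at random)
    while len(centers) < n_clusters:
        # next center: the record farthest from all current centers
        best = max((i for i in range(len(mols)) if i not in centers),
                   key=lambda i: min(dist(i, c) for c in centers))
        centers.append(best)

    # cluster membership: nearest center for every record
    membership = [min(centers, key=lambda c: dist(i, c)) for i in range(len(mols))]
    return centers, membership
```
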
Maximal common substructure search

Maximal common substructure search (MCSS) is the process of finding the largest structure that is a substructure of all the molecules in a given set. It is a well-known method, but it is computationally very intensive, having been shown to be in the set of NP-complete problems. This intensiveness has led to alternate methods for determining structural commonalities without performing a full MCSS [23, 24]. An additional complexity can be introduced by allowing some number of the molecules to be excluded from the requirement of matching. In this case the problem becomes to find the largest substructure that is a substructure of some percentage p of the molecules in a given set. This approximate maximal common substructure search (AMCSS) task is even more daunting, as we do not know a priori which subset of molecules will be the set containing the maximal substructure. However, this extension of MCSS can be very useful; real-life data sets often contain a variety of interesting structural motifs, none of which may be present in all the samples, but all of which may be present in many of the samples. A final requirement is that the method be applicable to very large data sets. If possible, the algorithm should scale linearly with the number of samples. Current MCSS methods are often limited to a few hundred compounds, and may take hours or days to run. Our goal was to develop a method that would work with thousands of samples, and process them in a few minutes.

MCSS was developed within the Pipeline Pilot system as a generator: it accepts a number of input molecules, processes the data, then outputs (generates) new molecular fragments that represent the discovered maximal substructure or substructures. In the most common use, a single molecule is output; this represents the largest substructure. The method is based on an extension of our extended-connectivity fingerprints (ECFPs).

In the case where the maximal substructure is exactly represented by some ECFP bit, we could rapidly identify this substructure by simply recording the number of times each ECFP bit is present in a library, removing those that fail to be present in enough samples, and keeping the bit whose substructure contains the largest number of bonds. The probability of the maximal substructure being exactly the bonds within some radius of some atom is small. However, we can extend the ECFP method to generate all possible connected substructures using bonds within a given radius of an atom. In this case, any maximal substructure would be represented by (at least) one of these new bits. Now we can find the maximal substructure by simply searching for the bit that occurs in sufficient samples and has the most bonds. By avoiding any comparison of one molecule to another, as is typical for MCSS methods, our method scales linearly with the number of molecules. This is critical if the method is to be used for thousands or millions of molecules.

An example of the use of maximal common substructure search is in the analysis of a set of hits from screening. MCS search can be used to identify common cores within the hits and thereby organize the hits into families, where a family is a set of molecules with a common core. In this way project teams can view hits in an organized fashion rather than in the arbitrary order in which they were discovered.

The MCSS component in Pipeline Pilot can be configured to carry out a variety of tasks. It can generate just the largest substructure (within a specified size range, defined by a minimum and a maximum number of bonds), or all the different substructures in the size range, or a diverse set of substructures present in a specified minimum percentage of the molecules. It can also be configured to take into account activity data, for example, to find maximal common substructures among the molecules within a specified activity range and report back the mean activity value and standard deviation of the molecules containing the substructures.

Figure 3 shows the largest common substructure with a minimum of 12 bonds found in at least 30% of the molecules in a set of 152 estrogen antagonists from the WDI database. The substructure exhibits a three-ring core with an ether linkage in one of the rings. Finding the substructure took 25 s on a 3.0 GHz Windows machine. When the MCSS search is expanded to include not only the largest subgraph but all the subgraphs of a minimum size present in at least 30% of the molecules, we obtain the structures shown in Figure 4, which displays nine such substructures. The time it took to obtain these subgraphs was essentially the same as the time it took to find only the largest subgraph, about 25 s.

Figure 3. Largest common substructure with a minimum of 12 bonds found in at least 30% of the molecules in a set of 152 estrogen antagonists from the WDI database. This substructure is present in 31.6% of the molecules.

Figure 4. Nine largest common substructures with a minimum of 12 bonds found in at least 30% of the molecules in a set of 152 estrogen antagonists from the WDI database.
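
Pipeline Pilot's implementation extends the ECFP bit generator itself. As a rough, brute-force stand-in for the same idea, the sketch below enumerates connected bond subgraphs with RDKit, canonicalizes each as SMILES, counts how many molecules contain it, and keeps the largest fragment present in at least the requested fraction of the set. It is exponential in the size range, so it is only workable for small fragments and small collections; the size limits and threshold are example values:

```python
from collections import defaultdict
from rdkit import Chem

def approximate_mcss(mols, min_bonds=6, max_bonds=12, min_fraction=0.3):
    """Largest connected substructure (as canonical SMILES) present in at
    least min_fraction of the molecules. Brute-force illustration only."""
    members = defaultdict(set)        # fragment SMILES -> indices of molecules containing it
    n_bonds_of = {}                   # fragment SMILES -> number of bonds in the fragment
    for idx, mol in enumerate(mols):
        for length in range(min_bonds, max_bonds + 1):
            for bond_path in Chem.FindAllSubgraphsOfLengthN(mol, length):
                smi = Chem.MolToSmiles(Chem.PathToSubmol(mol, list(bond_path)))
                members[smi].add(idx)
                n_bonds_of[smi] = length

    threshold = min_fraction * len(mols)
    qualifying = [smi for smi, idxs in members.items() if len(idxs) >= threshold]
    if not qualifying:
        return None
    best = max(qualifying, key=lambda smi: n_bonds_of[smi])   # most bonds wins
    return best, n_bonds_of[best], len(members[best])
```
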
Calculating maximal common subgraphs taking into account activity data is easily accomplished within the Pipeline Pilot framework. Figure 5 shows a protocol that finds diverse maximal common substructures in active molecules from the NCI AIDS data set that exhibit a large activity range. First we read 32,343 molecules from the NCI AIDS data set using an SD Reader component and add the EC50 activity data contained in a separate text file to each molecule, filtering out those molecules without EC50 data. Then we use a general manipulator component to calculate logEC50 for each molecule (logEC50 = log10(EC50)) and keep only those molecules with logEC50 < 6.0. The 168 molecules that pass these tests are input to the Generate MCSS component, which is configured to generate all the diverse maximal common subgraphs present in at least 10% of the molecules, with a minimum size of 8 bonds, and with a minimum range of 2.0 for the logEC50 property. A total of 28 such subgraphs were found. After cleaning up the structures by removing hydrogen atoms and calculating 2D coordinates, the parent molecules with the highlighted maximal common substructure atoms are displayed in an HTML viewer. The first two largest subgraphs found, highlighted in a parent molecule, are shown in Figure 6, together with statistical data for the logEC50 activity property, including mean, range, minimum and maximum values. The total time taken to run this protocol was 31 s on a 3.0 GHz Windows machine.

Figure 5. Pipeline Pilot protocol to find diverse maximal common substructures in active molecules from the NCI AIDS data set that exhibit a large activity range.

Figure 6. First two largest maximal substructures, highlighted in a parent molecule, found in active molecules from the NCI AIDS data set that exhibit a large activity range. Also shown in the figure are the logEC50 mean, range, minimum and maximum values for the parent molecules that exhibit each subgraph.

Bayesian models

The Bayesian analysis method available in Pipeline Pilot is a method for the binary categorization of molecular data. The scientist presents the data to the method, with some subset marked as "good"; the system builds a model which returns a number that can be used in ranking compounds from most to least likely as members of this good subset. Complete details of the underlying method are described elsewhere [25], but a short excerpt is offered below.

The learning process starts by generating a large set of Boolean (yes/no) features from the input descriptors, then collects the frequency of occurrence of each feature in the good subset and in all data samples. To apply the model to a particular sample, the features of the sample are generated, and a weight is calculated for each feature using a Laplacian-adjusted probability estimate. The weights are summed to provide a probability estimate, which is a relative predictor of the likelihood of that sample being from the good subset.

The Laplacian-corrected estimator is used to adjust the uncorrected probability estimate of a feature to account for the different sampling frequencies of different features. The derivation is as follows: assume that N samples are available for training, of which M are good (active). An estimate of the baseline probability of a randomly chosen sample being active, P(Active), is M/N. Next, assume we are given a feature F contained in B samples, and that A of those B samples are active. The uncorrected estimate of activity, P(Active|F), is A/B. Unfortunately, as the number of samples, B, becomes small, this estimator is unreliable; for example, if A = 1 and B = 1, P(Active|F) would be 1 (that is, certainly active), which seems overconfident for a feature we have seen only once.

Most likely, the estimator is poor because the feature is undersampled, and further sampling of that feature would improve the estimate. We can estimate the effect of further sampling if we assume the vast majority of features have no relationship with activity; that is, if, for most features F_i, we would expect P(Active|F_i) to be equal to our baseline probability P(Active). If we sampled the feature K additional times, we would expect P(Active)·K of those new samples to be active. This provides the information needed to estimate the corrective effect of K additional samples:

$$P_{\mathrm{corr}}(\mathrm{Active}\,|\,F) = \frac{A + P(\mathrm{Active})\,K}{B + K}$$

(For K = 1/P(Active), this is the Laplacian correction.) This correction stabilizes the estimator: as the number of samples, B, containing a feature approaches zero, the feature's probability contribution converges to P(Active), which would be the expected value for most features. The final step is to make the estimator a relative estimate by dividing through by P(Active), that is,

$$P_{\mathrm{final}}(\mathrm{Active}\,|\,F) = P_{\mathrm{corr}}(\mathrm{Active}\,|\,F)\,/\,P(\mathrm{Active})$$

This means that for most features, log P_final ≈ 0. For features more common in actives, log P_final > 0. For features less common in actives, log P_final < 0. The completed estimate for a particular sample is derived by adding together the log P_final values for all the features present in that sample.

A recent study by Hert et al. [26] provides a good overview of the different available Bayesian methods, which they call R1, R2, R3, R4 [27], and AVID [28]. They note that the method available in Pipeline Pilot is most closely related to the AVID method, but with an important difference: the log of the Laplacian-corrected probability score is taken before the scores of different features are combined. We evaluated the different methods using the same 11 activity classes in the MDL Drug Data Report (MDDR) that were described earlier by Hert et al. [29]. A leave-one-out cross-validation scheme was used to provide predictions for each molecule in each activity class. The result was that in all cases the SciTegic classifier outperformed that of Avidon et al. [28], confirming the importance of the log transformation before the combining of the weights.
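
The estimator above translates directly into a small classifier. The sketch below uses FCFP-like RDKit feature-Morgan identifiers as the Boolean features; it is an illustration of the published equations under assumed feature and radius choices, not SciTegic's implementation:

```python
import math
from collections import Counter
from rdkit.Chem import AllChem

def features(mol):
    # FCFP-like sparse features (functional-class Morgan identifiers, radius 3)
    return set(AllChem.GetMorganFingerprint(mol, 3, useFeatures=True).GetNonzeroElements())

class LaplacianBayes:
    def fit(self, mols, is_good):
        n, m = len(mols), sum(is_good)
        self.p_active = m / n                      # baseline P(Active) = M/N
        k = 1.0 / self.p_active                    # Laplacian correction, K = 1/P(Active)
        tot, act = Counter(), Counter()
        for mol, good in zip(mols, is_good):
            for f in features(mol):
                tot[f] += 1                        # B: samples containing feature F
                act[f] += int(good)                # A: active samples containing F
        # weight = log P_final(Active|F) = log( ((A + P(Active)K)/(B + K)) / P(Active) )
        self.weight = {f: math.log(((act[f] + self.p_active * k) / (tot[f] + k)) / self.p_active)
                       for f in tot}
        return self

    def score(self, mol):
        # features never seen in training contribute 0 (their corrected ratio is the baseline)
        return sum(self.weight.get(f, 0.0) for f in features(mol))
```
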

Applications

Similarity-based virtual screening

Hert et al. [29] compared the performance of several types of 2D fingerprints in virtual screening experiments in which molecules were ranked based on their calculated similarity to different sets of bioactive reference compounds. The 2D fingerprints included in the study were the 1052-bit Barnard Chemical Information (BCI) structural keys [30], the 2048-bit Daylight fingerprints [31], the 988-bit Unity fingerprints [32], the 2048-bit Avalon fingerprints (developed by Novartis; includes atoms, augmented atoms, atom triplets and connection paths), several variations of SciTegic's extended connectivity and functional class fingerprints (ECFP_2, ECFP_4, FCFP_2, FCFP_4), the Similog keys 2D pharmacophore fingerprints [33], and the Chemically Advanced Template Search (CATS) pharmacophore fingerprints [34]. The study used eleven different activity classes from the MDL Drug Data Report (MDDR) database [35].

Two different similarity searching procedures were utilized: data fusion using the maximum similarity scores, and a form of the binary kernel discrimination (BKD) machine learning technique. Data fusion involves using not a single structure but a group of bioactive compounds from the same activity class as reference structures. The similarity of each molecule in the database to each of the reference compounds is calculated and the molecules are ranked based on the maximum similarity score to any of the reference structures. This group fusion technique has been shown to be more efficient in retrieving actives than searches using a single reference compound [18]. Random groups of ten active reference compounds were used in the study.

Results of the virtual screening are illustrated in Figure 7, which shows the average recall rates (percentage of active compounds retrieved) over the 11 activity classes obtained from the top 1% of the ranked molecules using the different types of fingerprints. The figure shows that SciTegic's extended connectivity fingerprints, particularly the ECFP_4 fingerprints, perform significantly better than the other fingerprints. The ECFP_4 fingerprints result in average recall rates of about 43%, compared to only 34% for the average for all fingerprints combined.

Figure 7. Average recall rates (percentage of active compounds retrieved) over the 11 activity classes obtained from the top 1% of the ranked molecules using the different types of fingerprints. [Figure 1 from Hert et al. [29]; reproduced by permission of The Royal Society of Chemistry.]

The authors also investigated the diversity of the sets of active molecules retrieved in the top 1% using the different types of fingerprints. They defined the diversity of the sets of retrieved actives in terms of the number of different ring systems present in the set. The ring systems are defined in terms of the molecular frameworks described by Bemis and Murcko [36] and illustrated in Figure 8. The molecular frameworks consist of the rings present in the molecule plus any chains connecting them, with all atoms converted to carbon and all bonds converted to single bonds.

Figure 8. Example of a molecule and its corresponding molecular framework. Only the ring systems and the chains that link them are preserved. All heavy atoms are converted to carbon, and all bonds are converted to single bonds.
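
This framework definition maps directly onto RDKit's Murcko-scaffold utilities, which can be used to reproduce the diversity measure (the fraction of distinct frameworks among the retrieved actives). A small sketch, with function names of my own choosing:

```python
from rdkit import Chem
from rdkit.Chem.Scaffolds import MurckoScaffold

def molecular_framework(mol):
    """Bemis-Murcko framework: ring systems plus linking chains,
    with all atoms as carbon and all bonds as single bonds."""
    scaffold = MurckoScaffold.GetScaffoldForMol(mol)          # strip side chains
    generic = MurckoScaffold.MakeScaffoldGeneric(scaffold)    # carbon atoms, single bonds
    return Chem.MolToSmiles(generic)

def framework_diversity(retrieved_actives):
    frameworks = {molecular_framework(m) for m in retrieved_actives}
    return len(frameworks) / len(retrieved_actives)           # fraction of distinct frameworks
```
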

Figure 9 shows the percentage of molecular frameworks in the sets of actives retrieved in the top 1% of the ranked molecules, averaged over the 11 activity classes, for the different types of fingerprints. Again, the extended connectivity fingerprints tend to do better than the other fingerprint types in retrieving different molecular frameworks, which indicates that they are also suitable for scaffold-hopping applications.

Figure 9. Percentage of molecular frameworks in the sets of actives retrieved in the top 1% of the ranked molecules averaged over the 11 activity classes for the different types of fingerprints. [Figure 5 from Hert et al. [29]; reproduced by permission of The Royal Society of Chemistry.]

Bayesian learning based virtual screening

One major application of Bayesian learning is in the analysis of high-throughput screening data. HTS data has specific characteristics:

1. A large number of samples.
2. A very low occurrence of hits: hit rates below 1% are common.
3. A large amount of noise, both false positives and false negatives.
4. Multiple modes of action: it is typical that hits in a screen can come from many different classes and may have different modes of action.

In this section we describe a number of case studies that demonstrate how Bayesian learning can be applied to HTS data analysis and how it deals with each of the issues described above.

The NCI AIDS data set

The first case study uses the AIDS data set from the NCI [37]. After curation this produced a starting set of 32,343 molecules, of which 230 were marked as confirmed active (CA) and a further 450 as confirmed moderate (CM); the remainder were confirmed inactive (CI). An analysis by the chemists at the NCI suggested that the 230 hits do not form a single congeneric series; rather, they identified at least 7 chemical classes amongst those CA hits. A series of experiments were conducted to test the properties of the Bayesian learning method with FCFP fingerprints as applied to high throughput screening data. The results are summarized in Table 1.

Table 1. Results of the Bayesian model on the NCI AIDS HTS data set

                 80% of Actives(a)   ROC
  Experiment 1   3.5%                0.90 (Good)
  Experiment 2   14%                 0.89 (Good)
  Experiment 3   7%                  0.87 (Good)
  Experiment 4   15%                 0.88 (Good)

(a) Percentage of the hit list, sorted by Bayesian score, required to recover 80% of the actives.

Experiment 1: Predictive ability. The data set was split randomly into two equal parts giving a training set and a test set; a Bayesian model was built with the first subset and then applied to the prediction of the second subset. In this experiment the two classes were defined as CA being the good subset and CM + CI as the baseline. In all experiments in this section, the descriptors used were FCFP_6 fingerprints, AlogP, molecular weight, number of donors, acceptors and rotatable bonds. The entire experiment to prepare the test and training sets, build the model and predict the test set took less than 25 s on a 3.0 GHz Windows machine. This shows that Bayesian models are extremely fast to build. Unlike many other methods, the algorithm scales linearly with training set size, making it practical to process extremely large compound collections.

The following measures are used to assess the quality of the predictions in this and the subsequent experiments:

- Enrichment: the samples in the test set were sorted by their Bayesian score from highest to lowest. An enrichment plot shows the rank order of the samples on the X-axis plotted against the fraction of the actives recovered on the Y-axis. From this, the percentage of the data set needed to retrieve 80% of the actives is recorded. Figure 10 shows the enrichment plot for experiment 1.
- ROC score: a measure of the area under the curve of the true positive rate vs. the false positive rate. The ROC plot for experiment 1 is shown in Figure 11.

Figure 10. Enrichment plot obtained with the Bayesian model corresponding to experiment 1 for the NCI AIDS data set. From the plot it is seen that 80% of the actives would be retrieved in the top 3.5% of the list (571 compounds).

Figure 11. ROC plot obtained with the Bayesian model corresponding to experiment 1 for the NCI AIDS data set. The area under the curve gives the ROC score of 0.90.

The results for experiment 1 are shown in Table 1. Examination of these results and the plots in Figures 10 and 11 shows that a robust model was generated that was able to predict the test set with a high degree of accuracy. When the test set is sorted by the model score, 80% of the actives would be present in the top 3.5% of the list, i.e. in the top 571 of 16,000+ compounds. If the model were random, one would not expect to recover 80% of the actives without testing 80% of the samples. The ROC score of 0.9 shows that the model is of good quality. The results show that if the model had been available prior to screening the test set, only a small percentage of the test set would need to be screened to find most of the activity in the entire set. This, of course, translates to a cost saving in terms of time and material for the screening run. The results also show that the model is able to encompass the multiple activity types present within the training set.
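
Both measures are easy to reproduce from a list of model scores and activity labels. A sketch using numpy and scikit-learn, where the 80% recall threshold matches the "80% of Actives" column of Table 1:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def fraction_screened_for_recall(scores, is_active, recall=0.80):
    """Fraction of the score-ranked list that must be screened to recover
    the given fraction of the actives."""
    order = np.argsort(scores)[::-1]                 # highest Bayesian score first
    hits = np.cumsum(np.asarray(is_active)[order])   # running count of actives recovered
    needed = int(np.searchsorted(hits, recall * hits[-1])) + 1
    return needed / len(scores)

def roc_score(scores, is_active):
    return roc_auc_score(is_active, scores)          # area under the TPR vs. FPR curve
```
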

Experiment 2: Robustness to few good samples. A key issue when analyzing HTS data is the low hit rate that often results from such assays. In Experiment 1, a set of approximately 115 hits was used to build a model to predict the other 115 hits. When the hit rate in an assay is low, it is conceivable that there will not be this amount of positive data available for model building. In the second experiment the data was split so that only 5% was used as the training set for model building and the remaining 95% was placed into the test set. This time, the training set contained only 14 hits and these covered 6 of the 7 classes identified by the NCI chemists. The results in Table 1 show that the model was not as good as that from Experiment 1. However, it was still a very robust model that was able to identify 80% of the actives in the top 14% of the list (approximately 2,200 compounds of the 30,000+ test set).

Experiment 3: The effects of noise. It is well known that primary HTS data is noisy and can contain many false positives and false negatives. However, the NCI data has been well curated and is likely cleaner than typical HTS sets. To model the effects of noise, the data set was again split 50/50 into training and test sets. Before model building, 5% of the negatives (CIs) in the training set were reassigned to positive (CA). Thus, the model was built with 115 true positives, 800 false positives and approximately 15,000 true negatives. The results in Table 1 show that even with a noise level of 7:1 (false positives : true positives) a good model was found. With noise, 80% of the actives were found in the top 7% of the list, compared to the top 3.5% of the list without noise.

Experiment 4: Weakly active hits. In some screening experiments, the first round of screening may turn up only weakly active compounds. This experiment tests whether there is sufficient information in those compounds to lead to the identification of more strongly active ones. The data was first split 50/50, but then all the active (CA) compounds in the training set were moved to the test set. In this case the model was learned using the weakly active compounds (CM) as the actives. When applied to the test set, the ability of the model to identify the CA compounds was investigated. Without a single CA in the training set, the model was able to rank 80% of the CAs in the test set in the top 15% of the list.

Experiment 5: Recovery of false negatives. False positives in a screen are somewhat undesirable, but ultimately will be discovered and discounted on follow-up screening. False negatives are more problematic in most screening protocols since they will never be identified and represent hits that are gone for good. Experiment 5 tested the ability of the method to recover false negatives. In this experiment half of the 230 CAs were marked as CI. A model was then learned on the entire data set, and it was then used to rank all the negatives, including the true CI molecules and the 115 CIs that were really CAs. To recover false negatives one would wish that the false negatives would have the highest Bayesian scores and so would appear at the top of a list ranked by that score. In practice 85% of the false negatives were contained within the top 5% of the list. This demonstrates that a screening protocol in which follow-up screening is conducted on both the hits and also the top few percent of the negatives (as ranked by a Bayesian model of the hits) would allow many false negatives to be recovered.

Experiment 6: Iterative screening. Historically, high-throughput screening has been performed across an entire compound collection in a single run. More recently, iterative screening has been introduced in an effort to discover hits faster and save cost by screening only part of a collection. The procedure is to screen an initial small sample of a compound collection and then build a model of the results. This model is then used to rank the remainder of the collection and the next subset is selected and screened from the top ranked compounds. The model is then regenerated to include the new results. The process is iterated until either:

1. All compounds have been tested, or
2. The hit rate drops to a level at which no further screening is deemed necessary.
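
The iterative loop itself is easy to sketch. Below, assay() is a hypothetical callable standing in for the real screen, LaplacianBayes is the sketch given earlier in the Bayesian models section, and the batch size of 3,072 simply mirrors the 8 x 384-well plates used in Experiment 6:

```python
import random

def iterative_screen(mols, assay, batch_size=3072, max_rounds=10):
    """Screen a random initial batch, model the results, then repeatedly
    screen the top-ranked untested compounds and rebuild the model."""
    results = {}                                  # molecule index -> measured activity (bool)
    untested = list(range(len(mols)))
    random.shuffle(untested)
    batch = untested[:batch_size]                 # initial random subset

    for _ in range(max_rounds):
        for i in batch:                           # "screen" the current batch
            results[i] = assay(mols[i])
        untested = [i for i in untested if i not in results]
        if not untested:                          # stopping condition 1: everything tested
            break
        model = LaplacianBayes().fit([mols[i] for i in results],
                                     [results[i] for i in results])
        ranked = sorted(untested, key=lambda i: model.score(mols[i]), reverse=True)
        batch = ranked[:batch_size]               # next round: top-ranked compounds only
    return results
```
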

Experiment 6 mimics the iterative screening process by first selecting a subset of 3,072 compounds (8 plates of 384) from the NCI set at random; this represents about 10% of the data set. Its hit rate is calculated by examination of the CA/CI assignments and equals the hit rate in the data set as a whole. A model is then built, the next subset of the same size is chosen from the top ranked compounds, and the hit rate is recalculated across all the selected compounds. The model is then rebuilt and the process iterated. Figure 12 shows the results. At the first iteration 23 hits are found, equal to the hit rate of the whole data set (shown by the dashed line). However, after the second generation 174 of the 230 hits have already been recovered (solid line); 80% of the hits are recovered in 3 generations and 90% in 6 generations, at which point only just over half of the compound collection has been screened.

Figure 12. Iterative screening results for the NCI AIDS data set.

Screening of kinase inhibitors

Similar experiments were performed by Xia et al. [25] using a wide variety of screening data on kinase targets that had been collected at Amgen. They identified a set of 6,236 compounds that had been found to inhibit one or more of 39 protein kinases from two kinase families (tyrosine kinases and serine/threonine kinases) with an IC50 < 10 μM. A set of 193,417 compounds from the corporate collection was identified as a baseline and these sets were used to build Bayesian models. Splitting the data 50:50 into training and test sets produced a model that ordered 85% of the hits in the test set in the top 10% of the list sorted by model score. A further experiment to build the model from only 10% of the data and predict the other 90% showed almost no reduction in model quality.

One concern of the authors was that the model might only be predictive within the series of compounds contained within the historical corporate data and would not be able to identify kinase inhibitors that were markedly different from those used in learning. This is an important concern when one considers the ability of methods to lead-hop from an existing patent space into a new one. The authors identified 172 compounds from the recent literature that belonged to newly emerging protein kinase inhibitor classes. A similarity analysis showed that these 172 compounds were all significantly different from the 6,236 hits from their corporate collection. The 172 compounds were merged with 168,000 baseline molecules and the model learned earlier from 10% of the in-house kinase hits was applied to predict the 168,172-member data set. 70% of the 172 new compounds were found in the top 10% of the list and 85% in the top 20%, a discovery rate only slightly worse than that of the in-house compounds. This shows the ability of the methodology to identify hits from novel structural and kinase classes.

Analysis of CDK2 inhibitors

Cyclin-dependent kinases are cellular kinases that play a crucial role in the phases of the cell cycle. Many groups are studying cyclin-dependent kinase 2 (CDK2) inhibitors for their potential as anticancer therapeutics [38-48]. The study reported here is a ligand-based retrospective analysis of the classification of activity of ligands using a Bayesian statistical approach. It demonstrates how CDK2 inhibitor-like libraries can be generated for smart screening using a Bayesian model. This model also allows the identification of structural features that are associated with CDK2 inhibition activity.

Using data from a previously published work [49] we assembled a library containing a combination of sources of compounds: diverse compounds from general screening libraries (13,359 with 207 actives), commercial compounds selected by chemists (951 with 161 actives), and the rest synthesized and screened iteratively using 22 scaffold types (3,240 total with 14 actives). We combined the sources and divided the data equally and randomly into training and test sets. This comprised a total of 17,550 compounds. Compounds were classified as actives if the reported IC50 was lower than 25 μM or if the percent inhibition was greater than 50% at 10 μM. Figure 13 represents the 20 scaffolds present in the HTS data. The figure shows representative compounds for each scaffold type (specifically the lowest molecular weight in each scaffold class). Every compound belongs to a scaffold type as designated by medicinal chemists, as stated in the original reference [49].

Figure 13. Representative CDK2 inhibitor compounds in each scaffold class and number of active compounds containing each scaffold.

We developed a predictive Bayesian model using FCFP_6, AlogP, molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, and number of rotatable bonds. The Bayesian model was able to correctly classify over 80% of the 8,706 test set compounds (129 out of 173 actives and 6,898 out of 8,533 inactives). This is a good prediction, but a major strength of Bayesian modeling is its ability to rank compounds according to their probability of being active. This ranking is important when prioritizing compounds for screening or for further development. The enrichment plot shown in Figure 14 is a graphical representation of the quality of this ranking. The enrichment curve plots the number of active compounds recovered versus the proportion of the database screened. The diagonal line shows how many active compounds would be recovered when the database is screened randomly. This graph shows that by screening just 1% of the database, 17% of the actives are retrieved (17-fold enrichment). This demonstrates the quality of the Bayesian model generated for the prediction of CDK2 inhibition, in particular the ability of the model to accurately rank compounds according to activity.

Figure 14. Enrichment plot from the classification of compounds in the CDK2 test data set obtained with the Bayesian model.

It is also possible to use the Bayesian model to identify the top good and bad structural features present in the compounds in the training set. The top good features are those that appear mostly in active compounds, while bad features are those that appear mostly in inactive compounds. Figures 15 and 16 show the top good and bad features identified by the Bayesian model based on the FCFP_6 fingerprints. The presence or absence of good and bad features among the compounds belonging to the different scaffolds helps us to understand the model predictions. All 15 test set compounds belonging to scaffold 20 contain the top FCFP_6 features and are correctly predicted to be active.

Figure 15. Good FCFP_6 features identified in the CDK2 data set. The numbers below the structures indicate the number of times the feature was observed and, in parentheses, the number of times the feature was observed in active compounds.

Figure 16. Bad FCFP_6 features identified in the CDK2 data set. The numbers below the structures indicate the number of times the feature was observed and, in parentheses, the number of times the feature was observed in active compounds.
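
With the LaplacianBayes sketch from earlier, listing "good" and "bad" features amounts to sorting the per-feature weights (the log-normalized Laplacian-corrected probabilities); mapping the feature identifiers back to depicted substructures, as in Figures 15 and 16, would additionally require the fingerprint's bit-to-atom-environment bookkeeping. A hypothetical helper:

```python
def top_features(model, n=10):
    """Features most enriched in actives (good) and in inactives (bad),
    ranked by the model's log-probability weights."""
    ranked = sorted(model.weight.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n], list(reversed(ranked[-n:]))   # (good features, bad features)
```
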
As previously reported [49], scaffold 05 was designated as an active scaffold. The Bayesian model assigns high probability scores to 21 out of the 23 actives within the test set. This is because scaffold 05 contains one of the good features (the fourth substructure feature shown in Figure 15). This structure was seen 52 times in the training set, 13 times in active compounds, giving it a normalized probability score of +1.81. However, the same feature also occurs in the inactive molecules belonging to scaffold 05. As this scaffold itself is deemed to confer activity, 112 scaffold 05 compounds are incorrectly predicted to be active (false positives).

This study demonstrates that a predictive model can be built extremely quickly with thousands of compounds with diverse structures and scaffolds, and likely more than one mode of action. The resulting model enables us to correlate structural features with biological activities and correctly classifies active and inactive compounds. Additionally, ranking of the active compounds shows significant enrichment, with certain scaffolds conferring CDK2 inhibitory activity.

The McMaster DHFR screening set

Many methods for virtual screening have been developed, ranging from simple similarity searching through statistical methods such as Bayesian learning to complex docking. Direct comparison of the methods is always hard, however, since validation studies invariably use different data sets and analyze results in different ways. To address this issue, a group from McMaster University instigated a virtual screening competition that would allow different methods to be directly compared. The group provided a training set of 50,000 compounds that had been screened against Escherichia coli dihydrofolate reductase (DHFR), which contained 32 marked hits, including 12 competitive inhibitors. They also provided a 50,000 compound test set, and the task was to return the list to the organizers in rank order of predicted activity. The details and results of the competition were described in a special issue of the Journal of Biomolecular Screening [50].

Rogers et al. [51] applied Bayesian learning to the problem. In an initial validation within the training set using 10-fold cross-validation, a ROC score of 0.86 was obtained, suggesting that the method should be applicable to this data set. A model was then built on the entire training set and used to rank order the test set, and this ranked list was submitted. The results of the competition were judged by the number of actives from the test set within the top 2,500 compounds in the list. The Bayesian learning method scored 7 compounds, which seems somewhat low. However, it was later revealed that although there were 96 hits within the test set, none of those was a competitive inhibitor. Furthermore, of the remaining 32 competition entries, many groups scored only 0 or 1 hits in the list. The top group, who used a docking approach, identified 13 hits, and another group using a similar statistical learning approach identified 4 hits in the top 2,500. Further investigation by a number of groups revealed that the training and test sets were drawn from very different populations of molecule types and had little overlap, thus violating the statistical requirement that the training and test sets be drawn from the same population. While this means that the competition results must be treated somewhat circumspectly, it is interesting to note that the Bayesian statistical approaches, which took a matter of a few minutes to learn and score the entire set, competed very favorably with complex docking methods that require significant manual work to set up the active site, have runtimes of the order of seconds or minutes per ligand, and would not have been so affected by the statistical differences between the training and test sets.

Application of Bayesian learning with docking

Individual virtual screening methods need not be applied in isolation. For example, a fast Bayesian method can provide a useful filter to reduce a very large pool of potential ligands down to a smaller candidate subset for the application of a more rigorous and time-consuming method such as docking. A research group at Novartis has also shown that Bayesian learning may be used as a post-processing stage to docking and scoring to improve the rankings obtained from docking alone.
An ongoing challenge in docking is to calculate accurate energies and scores for docked conformations of ligands. Many different scoring methods have been proposed, with a wide range of theoretical backgrounds, but all suffer from biases that limit their accuracy. In initial work, Klon et al. [52] proposed a method in which a set of ligands was first docked into a protein active site. The energies of the docked conformations were then used to rank order the molecules. A Bayesian model was then learned using the top ranked molecules as good and the remainder as bad. Once built, the model was used to re-rank the compounds. Using training sets with known activities against a particular protein, enrichment curves were compared based on the dock scores alone and on the rank orders modified according to the Bayesian model. In systems in which the dock scores produced some enrichment, the Bayesian-modified ranking was found to significantly improve the enrichment. For example, for docking into PTP-1B using Dock, 45% of known actives were captured in the top 10% of the list based on the Dock score alone, while 68% were in the top 10% of the Bayesian-resorted list. For FlexX the figures were 72% and 91%, and for Glide, 84% and 89%. In these initial experiments the authors found cases in which the dock scores gave no enrichment, and found that in those cases Bayesian learning was unable to improve the situation. However, in a later publication [53] they showed that by employing a consensus scoring method following docking, and then learning on the consensus scores rather than the raw docking scores, Bayesian learning could improve the enrichment in the test cases for which learning on the original single docking score produced no enrichment.

Summary and conclusions

In this review we have discussed a wide-ranging set of chemical algorithms implemented in Pipeline Pilot. Through a variety of case studies we have demonstrated their effectiveness in analyzing the types of data available in drug discovery projects. By deploying these algorithms as components in a data pipelining environment, we enable scientists to have great flexibility in defining their own chemically intelligent workflows. These workflows allow the rapid implementation of complex tasks and the automation of their execution. Pipeline Pilot's chemistry and analysis components can also