Cheminformatics analysis and learning in a data pipelining environment


Molecular Diversity (2006) 10: DOI: /s © Springer 2006

Review

Cheminformatics analysis and learning in a data pipelining environment

Moises Hassan 1,*, Robert D. Brown 1, Shikha Varma-O'Brien 2 & David Rogers 1

1 SciTegic, Inc., Telesis Court, Suite 100, San Diego, CA 92121, USA; 2 Accelrys, Inc., Telesis Court, Suite 100, San Diego, CA 92121, USA (* Author for correspondence, mhassan@scitegic.com, Tel: +(858) , Fax: +(858) )

Received 12 December 2005; Accepted 23 February 2006

Key words: Bayesian models, bioactivity prediction, data mining, data pipelining, maximal common substructure search, molecular fingerprints, molecular similarity, virtual screening

Summary

Workflow technology is increasingly being applied in discovery informatics to organize and analyze data. SciTegic's Pipeline Pilot is a chemically intelligent implementation of a workflow technology known as data pipelining. It allows scientists to construct and execute workflows using components that encapsulate many cheminformatics-based algorithms. In this paper we review SciTegic's methodology for molecular fingerprints, molecular similarity, molecular clustering, maximal common subgraph search, and Bayesian learning. Case studies are described showing the application of these methods to the analysis of discovery data such as chemical series and high-throughput screening results. The paper demonstrates that the methods are well suited to a wide variety of tasks, such as building and applying predictive models of screening data, identifying molecules for lead optimization, and the organization of molecules into families with structural commonality.
Abbreviations: MCSS, maximal common substructure search; ECFP, extended-connectivity fingerprints; FCFP, functional-class fingerprints; MDDR, MDL Drug Data Report; WDI, World Drug Index; CATS, chemically advanced template search; BKD, binary kernel discrimination; CDK2, cyclin-dependent kinase 2; DHFR, Escherichia coli dihydrofolate reductase

Introduction

Over the last decade the volume and complexity of data generated in drug discovery have increased massively. The analysis of the data, to turn it into useful information on which to base project decisions, is complicated by the scatter of this data across multiple data sources and by the proliferation of point solutions that each perform only part of a complete data analysis. More recently, workflow technologies have been introduced to streamline and automate the process of data retrieval, organization, analysis, and reporting. One such workflow technology is data pipelining, in which individual components that perform specific data retrieval, calculation, or analysis tasks are graphically wired together to form a protocol or workflow. In such protocols data automatically flows between tasks, allowing a complete data analysis to be performed. A variety of workflow tools exist to handle text and numeric data. However, to be useful for discovery, a workflow system must also understand data types such as molecule and sequence and must provide chemically- or biologically-intelligent components that can act on this data. In this paper we review some of the chemical intelligence within Pipeline Pilot, the leading commercial implementation of data pipelining for discovery [1]. Specifically, we discuss capabilities in molecular fingerprinting, similarity searching, clustering, maximal common subgraph searching, and Bayesian learning. We review both algorithmic details of these methods and case studies, conducted both by SciTegic and others, that demonstrate the application of these methods.
Methodology

Here we present details of the implementation of several commonly used cheminformatics tools in Pipeline Pilot, including molecular fingerprints, similarity calculations, clustering, maximal common subgraph search, and Bayesian model learning.

Molecular fingerprints

Molecular fingerprints [2] are representations of chemical structures originally designed to assist in chemical database

searching, but later used for analysis tasks such as similarity searching [3], clustering [4], or recursive partitioning [5]. An interesting subclass of molecular fingerprints is circular substructure fingerprints; in this case, each feature is derived from a substructure centered on some atom and extending some number of bonds in all directions. A member of this class of fingerprints was first described for the DARC substructure search system [6]; many circular substructural variants have since been described [7 13]. Extended-connectivity fingerprints (ECFPs) are a new class of circular substructural fingerprint for molecular characterization. ECFPs are derived using a variant of the Morgan algorithm [14], which was originally proposed as a method for solving the molecular isomorphism problem (that is, identifying when two molecules with different atom numberings are the same). The generation of extended-connectivity fingerprints for a molecule begins with the assignment of an initial atom identifier for each heavy (non-hydrogen) atom in the molecule. In theory, any atom-typing rule could be used. In practice, we have found two rules most useful: the Daylight atomic invariants [15] (the number of connections; the number of bonds to non-hydrogen atoms; the atomic number; the atomic mass; the atomic charge; and the number of attached hydrogens) or a functional-class rule (whether the atom is a hydrogen-bond acceptor; a hydrogen-bond donor; is positively ionizable; is negatively ionizable; is aromatic; or is a halogen). If the former is chosen, ECFPs result; if the latter, FCFPs. A number of iterations of the Morgan algorithm are then performed. At each iteration, each atom collects the codes of its neighboring atoms and hashes them with its own code, generating a new code. The collection of these codes is the fingerprint. (A detailed description of this process is published elsewhere [16].)
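The iterative collect-and-hash step can be sketched on a toy molecule representation. The atom codes, the tuple hashing, and the 32-bit masking below are illustrative assumptions, not SciTegic's actual scheme; atoms carry an initial integer identifier and bonds are index pairs.

```python
# Minimal sketch of an ECFP-style circular fingerprint (hashing details assumed).

def circular_fingerprint(atom_codes, bonds, iterations=2):
    """Collect hashed codes from growing circular atom neighborhoods."""
    # Build an adjacency list from the bond list.
    neighbors = {i: [] for i in range(len(atom_codes))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)

    codes = list(atom_codes)
    fingerprint = set(codes)          # iteration 0: the initial identifiers
    for _ in range(iterations):
        new_codes = []
        for i, code in enumerate(codes):
            # Each atom hashes its own code with its (sorted) neighbor codes.
            env = (code,) + tuple(sorted(codes[j] for j in neighbors[i]))
            new_codes.append(hash(env) & 0xFFFFFFFF)
        codes = new_codes
        fingerprint.update(codes)
    return fingerprint

# Example: a three-atom chain with arbitrary initial atom codes.
fp = circular_fingerprint([6, 8, 6], [(0, 1), (1, 2)], iterations=2)
```

Note that the two symmetric terminal atoms receive identical codes at every iteration, so the fingerprint is a set of distinct environments rather than a per-atom list.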
ECFPs and FCFPs are powerful representations that will be used in learning, clustering, and similarity search, as described in later sections of this paper.

Molecular similarity

Molecular similarity is based on the idea that molecules with similar structural and physicochemical properties are more likely to have similar biological activities. This principle underlies many drug-discovery applications such as database searches, virtual screening, library focusing, and prediction of ADME/Tox and other physicochemical properties [3, 17]. The Molecular Similarity component in Pipeline Pilot allows the calculation of the similarity between sets of reference and target molecules based on SciTegic's molecular fingerprints (ECFPs or FCFPs), other commonly used fingerprints such as the MDL public keys, or new, user-defined fingerprints. The computation has been optimized for speed and memory usage, allowing the efficient comparison of large reference and target sets. For example, for a given target molecule, the identification of the five most similar molecules from a set of 500,000 references can be done in under 4 min; and the calculation and retrieval of the five most similar reference molecules from a set of 100,000 for each one of the target molecules in a set of 10,000 can be done in about 10 min (times obtained on a 3.0 GHz Windows machine, including loading the molecules and calculating the fingerprints). Similarity values can be calculated using several known coefficients such as Tanimoto, Dice, or Cosine. The following contributions are common to the definitions of these coefficients:

SA = Σ_i x_Ti · x_Ri
SB = Σ_i x_Ti² − Σ_i x_Ti · x_Ri
SC = Σ_i x_Ri² − Σ_i x_Ti · x_Ri

Here, x_Ti and x_Ri are the values of the ith descriptor in the target and reference, respectively.
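The count-based contributions above can be computed directly. In this sketch each molecule's fingerprint is assumed to be a sparse dict mapping fingerprint keys to occurrence counts; the key names are purely illustrative.

```python
# Sketch of the SA, SB, SC contributions for count-based (value) fingerprints.

def contributions(target, reference):
    """Return (SA, SB, SC) for two sparse {key: count} fingerprints."""
    keys = set(target) | set(reference)
    xt = {k: target.get(k, 0) for k in keys}     # x_Ti
    xr = {k: reference.get(k, 0) for k in keys}  # x_Ri
    sa = sum(xt[k] * xr[k] for k in keys)
    sb = sum(xt[k] ** 2 for k in keys) - sa
    sc = sum(xr[k] ** 2 for k in keys) - sa
    return sa, sb, sc

# Example with hypothetical keys; Tanimoto would be SA / (SA + SB + SC).
sa, sb, sc = contributions({"a": 1, "b": 2}, {"b": 1, "c": 3})
```

With 0/1 counts these definitions reduce to the bit-based simplification given in the text (bits in both, bits only in the target, bits only in the reference).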
Note that these are generic definitions that work for value-based similarity (using Counts, the number of times that each fingerprint key is observed in the molecule) as well as binary bit-based fingerprints (presence or absence of each fingerprint key). When using bit-based calculations, only the values 1 and 0 are possible, and the contributions above simplify to:

SA = number of bits defined in both the target and the reference
SB = number of bits defined in the target but not the reference
SC = number of bits defined in the reference but not the target

Using these definitions for the coefficient contributions, the Tanimoto, Dice, and Cosine similarity coefficients are defined as follows:

Tanimoto = SA / (SA + SB + SC)
Dice = 2·SA / (2·SA + SB + SC)
Cosine = SA / √((SA + SB)(SA + SC))

The component also allows users to define their own similarity coefficient, specified as a function of the SA, SB, and SC contributions. Tasks that can be easily accomplished in Pipeline Pilot using the Molecular Similarity component include:
- For each target molecule, find the nearest reference molecule at any similarity.
- For each target molecule, provide a list of reference molecules that are within a given similarity value, say, 0.7

or greater, or a list of some number of nearest reference molecules, say, the nearest 5.
- Provide the list of target molecules that are not within a given similarity value of any of the reference molecules.
- Rank the target molecules by similarity to a reference set using group data fusion metrics such as maximum or average similarity.

Data fusion calculations

Pipeline Pilot provides an ideal framework to implement and deploy complex protocols that require reading molecular data from a variety of sources, calculation of molecular properties, analysis of the results, and visualization of the output. The implementation of a protocol to carry out similarity calculations based on group data fusion [18] (Figure 1) offers an excellent example.

Figure 1. Pipeline Pilot protocol to find the most similar target molecules with respect to a set of reference molecules using group data fusion.

In this case, the target molecules are read from an SD file and passed to the Data Fusion Similarity subprotocol. This subprotocol, which can be constructed by the user with components available in Pipeline Pilot, has parameters that specify all the information needed to perform the calculation:
- Location of one or more SD files containing the reference molecules
- Number of random reference molecules to select that define the group
- Group metric to use: either Maximum Similarity or Average Similarity
- Number of most similar target molecules to return
- Fingerprint property to use in the similarity calculation (ECFPs, FCFPs, ...)
- Similarity coefficient to use (Tanimoto, Cosine, ...)

Looking at the data flow in the subprotocol, we see that it starts by reading the reference molecules from the specified SD files and then selecting the required number of reference molecules (10 in this case) by assigning a random number to each one and keeping the 10 with the largest random numbers.
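The group-data-fusion step can be sketched as follows, assuming fingerprints are plain Python bit sets: pick a random reference group, score each target by its maximum (or average) Tanimoto similarity to the group, and keep the top N.

```python
# Minimal sketch of group data fusion over bit-set fingerprints.
import random

def tanimoto(target_bits, reference_bits):
    sa = len(target_bits & reference_bits)   # bits in both
    sb = len(target_bits - reference_bits)   # bits only in the target
    sc = len(reference_bits - target_bits)   # bits only in the reference
    return sa / (sa + sb + sc)

def average(values):                         # alternative group metric
    return sum(values) / len(values)

def group_fusion_top_n(targets, references, group_size=10, top_n=5,
                       metric=max, seed=None):
    rng = random.Random(seed)
    # Select a random group of reference molecules.
    group = rng.sample(references, min(group_size, len(references)))
    # Score each target by the chosen group metric, then keep the top N.
    scored = [(metric([tanimoto(t, ref) for ref in group]), t)
              for t in targets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_n]
```

This mirrors the subprotocol's flow (random group selection, per-reference similarities, group metric, top-N filter), though the real component works on calculated fingerprint properties rather than raw sets.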
These 10 reference molecules are tagged as references and passed to the Molecular Similarity component along with all the target molecules (51,058 in this case). The Molecular Similarity component calculates the required fingerprint properties and the similarity values between each reference and target molecule and, for each target molecule, outputs the similarity values with respect to all the reference molecules. The next component, Calculate Group Similarity Metric, contains a script to calculate, for each target molecule, either the average or the maximum similarity with respect to the group of reference molecules. The last step in the subprotocol is to pass the target molecules with the calculated group metric to a filter component to keep only the top N molecules (5 in this case) with the maximum values for the group metric. The selected (most similar) molecules can then be visualized in an HTML table, or saved to a file or database.

Clustering

Pipeline Pilot's clustering method was developed to rapidly cluster large data sets, particularly large sets of molecular data. It is a partitioning method [19, 20] in which the original data set is partitioned into ever-smaller regions that define the clusters. In a partitioning method, a number of representative objects are chosen from the data set. The corresponding clusters are found by assigning each remaining object to the closest representative object. The representative objects are called the cluster centers, and the other objects are the cluster members. The distance function between the objects is a Euclidean distance (if numeric properties are used), a Tanimoto distance (defined as 1 − Tanimoto similarity, if fingerprints are used), or a combination of the two (if both types of properties are used). The method for selecting the cluster centers is a maximum dissimilarity method [21]. It begins by randomly choosing a data record as the first cluster center. The record maximally distant from the first point is selected as the next cluster center. The record maximally distant from both current points is selected after that. The process repeats until there are a sufficient number of cluster centers. The nonselected objects are then assigned to the nearest cluster center to determine the cluster membership. Optionally, the clusters can be refined by iteratively recomputing the cluster centers based on the cluster membership and then reassigning cluster members from one cluster to another if they turn out to be closer to the second cluster's center following the recalculation. The Cluster Molecules component in Pipeline Pilot allows the user to specify the number of clusters to use to partition the data set, or to specify an average number of molecules per cluster, in which case the number of clusters is calculated automatically.

Figure 2. Computation time vs. number of molecules for clustering with a fixed number of total clusters and with a fixed average number of molecules per cluster.

Figure 2 shows plots of computational time vs. number of molecules for clustering molecules from the Asinex data set [22]. The times increase linearly, O(n), with the number of molecules when the total number of clusters is set to a fixed value, 50 clusters in this case.
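The center selection and assignment described above can be sketched as follows. This is a minimal maximum-dissimilarity (MaxMin) sketch over a generic distance function; the real component chooses the first center at random and also supports Tanimoto and mixed distances.

```python
# Sketch of maximum-dissimilarity center selection plus nearest-center assignment.

def max_dissimilarity_cluster(points, n_clusters, distance):
    # First center: the first record here (the text picks one at random).
    centers = [0]
    # Each subsequent center maximizes its distance to the chosen centers.
    while len(centers) < n_clusters:
        best = max((i for i in range(len(points)) if i not in centers),
                   key=lambda i: min(distance(points[i], points[c])
                                     for c in centers))
        centers.append(best)
    # Assign every record to its nearest center to get the membership.
    membership = {i: min(centers,
                         key=lambda c: distance(points[i], points[c]))
                  for i in range(len(points))}
    return centers, membership

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

# Two well-separated pairs of points should yield one center per pair.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centers, members = max_dissimilarity_cluster(points, 2, euclidean)
```

The optional refinement step in the text would then recompute centers from the membership and reassign until stable.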
Times increase more rapidly with the number of molecules when the average number of molecules per cluster is set to a fixed value (50 molecules/cluster), as the total number of clusters increases quickly with the number of molecules.

Maximal common substructure search

Maximal common substructure search (MCSS) is the process of finding the largest structure that is a substructure of all the molecules in a given set. It is a well-known method, but it is computationally very intensive, having been shown to be in the set of NP-complete problems. This cost has led to alternate methods for determining structural commonalities without performing a full MCSS [23, 24]. An additional complexity can be introduced by allowing some number of the molecules to be excluded from the requirement of matching. In this case the problem becomes finding the largest substructure that is a substructure of some percentage p of the molecules in a given set. This approximate maximal common substructure search (AMCSS) task is even more daunting, as we do not know a priori which subset of molecules will be the set containing the maximal substructure. However, this extension of MCSS can be very useful; real-life data sets often contain a variety of interesting structural motifs, none of which may be present in all the samples, but all of which may be present in many of the samples. A final requirement is that the method be applicable to very large data sets. If possible, the algorithm should scale linearly with the number of samples. Current MCSS methods are often limited to a few hundred compounds, and may take hours or days to run. Our goal was to develop a method that would work with thousands of samples, and process them in a few minutes. MCSS was developed within the Pipeline Pilot system as a generator.
That is, it accepts a number of input molecules, processes the data, then outputs (generates) new molecular fragments that represent the discovered maximal substructure or substructures. In the most common use, a single molecule is output; this represents the largest substructure. The method is based on an extension of our extended-connectivity fingerprints (ECFPs). In the case where the

maximal substructure was exactly represented by some ECFP bit, then we could rapidly identify this substructure by simply recording the number of times each ECFP bit was present in a library, removing those that fail to be present in enough samples, and keeping the bit whose substructure contains the largest number of bonds. The probability of the maximal substructure being exactly the bonds within some radius of some atom is small. However, we can extend the ECFP method to generate all possible connected substructures using bonds within a given radius of an atom. In this case, any maximal substructure would be represented by (at least) one of these new bits. Now we can find the maximal substructure by simply searching for the bit that occurs in sufficient samples and has the most bonds. By avoiding any comparison of one molecule to another, as is typical for MCSS methods, our method scales linearly with the number of molecules. This is critical if the method is to be used for thousands or millions of molecules.

Figure 3. Largest common substructure with a minimum of 12 bonds found in at least 30% of the molecules in a set of 152 estrogen antagonists from the WDI database. This substructure is present in 31.6% of the molecules.

An example of the use of maximal common substructure search is in the analysis of a set of hits from screening. MCSS can be used to identify common cores within the hits and thereby organize the hits into families, where a family is a set of molecules with a common core. In this way project teams can view hits in an organized fashion rather than in the arbitrary order in which they were discovered. The MCSS component in Pipeline Pilot can be configured to carry out a variety of tasks.
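The counting idea behind this approach can be sketched in a few lines, assuming the substructure enumeration has already happened and each molecule has been reduced to a mapping from substructure keys (e.g. hashed ECFP-style identifiers) to bond counts. The key names below are purely illustrative.

```python
# Sketch of the linear-time (A)MCSS counting step: tally each substructure
# key over the library, keep keys frequent enough, return the one with the
# most bonds. No molecule-to-molecule comparison is ever performed.
from collections import Counter

def largest_common_key(molecule_features, p=0.3, min_bonds=1):
    """Return the substructure key with the most bonds that occurs in at
    least a fraction p of the molecules, or None if no key qualifies."""
    counts = Counter()
    bonds_of = {}
    for features in molecule_features:
        counts.update(features.keys())   # one count per molecule per key
        bonds_of.update(features)
    threshold = p * len(molecule_features)
    candidates = [k for k, c in counts.items()
                  if c >= threshold and bonds_of[k] >= min_bonds]
    return max(candidates, key=lambda k: bonds_of[k]) if candidates else None

# Hypothetical library: each molecule reduced to {substructure key: bonds}.
mols = [{"phenyl": 6, "amide": 2}, {"phenyl": 6}, {"amide": 2}]
```

Because the loop touches each molecule once, the cost grows linearly with library size, which is the property the text emphasizes.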
It can generate just the largest substructure (within a specified size range, defined by a minimum and a maximum number of bonds), or all the different substructures in the size range, or a diverse set of substructures present in a specified minimum percentage of the molecules. It can also be configured to take into account activity data, for example, to find maximal common substructures among the molecules within a specified activity range and report back the mean activity value and standard deviation of the molecules containing the substructures. Figure 3 shows the largest common substructure with a minimum of 12 bonds found in at least 30% of the molecules in a set of 152 estrogen antagonists from the WDI database. The substructure exhibits a three-ring core with an ether linkage in one of the rings. Finding the substructure took 25 s on a 3.0 GHz Windows machine. When the MCSS search is expanded to include not only the largest subgraph but all the subgraphs of a minimum size present in at least 30% of the molecules, we obtain the structures shown in Figure 4, which displays nine such substructures. The time taken to obtain these subgraphs was essentially the same as the time taken to find only the largest subgraph, about 25 s. Calculating maximal common subgraphs taking into account activity data is easily accomplished within the Pipeline Pilot framework. Figure 5 shows a protocol that finds diverse maximal common substructures in active molecules from the NCI AIDS data set that exhibit a large activity range. First we read molecules from the NCI AIDS data set using an SD Reader component and add the EC50 activity data contained in a separate text file to each molecule, filtering out those molecules without EC50 data. Then we use a general manipulator component to calculate logEC50 for each molecule (logEC50 = log10(EC50)) and keep only those molecules with logEC50 < 6.0.
The 168 molecules that pass these tests are input to the Generate MCSS component, which is configured to generate all the diverse maximal common subgraphs present in at least 10% of the molecules, with a minimum size of 8 bonds, and with a minimum range of 2.0 for the logEC50 property. A total of 28 such subgraphs were found. After cleaning up the structures by removing hydrogen atoms and calculating 2D coordinates, the parent molecules with the highlighted maximal common substructure atoms are displayed in an HTML viewer. The first two largest subgraphs found, highlighted in a parent molecule, are shown in Figure 6, together with statistical data for the logEC50 activity property, including mean, range, minimum, and maximum values. The total time taken to run this protocol was 31 s on a 3.0 GHz Windows machine.

Bayesian models

The Bayesian analysis method available in Pipeline Pilot is a method for the binary categorization of molecular data. The scientist presents the data to the method, with some subset marked as "good"; the system builds a model which returns a number that can be used in ranking compounds from most-to-least likely as members of this good subset. Complete details of the underlying method are described elsewhere [25], but a short excerpt is offered below. The learning process starts by generating a large set of Boolean (yes/no) features from the input descriptors, then

Figure 4. Nine largest common substructures with a minimum of 12 bonds found in at least 30% of the molecules in a set of 152 estrogen antagonists from the WDI database.

Figure 5. Pipeline Pilot protocol to find diverse maximal common substructures in active molecules from the NCI AIDS data set that exhibit a large activity range.

collects the frequency of occurrence of each feature in the good subset and in all data samples. To apply the model to a particular sample, the features of the sample are generated, and a weight is calculated for each feature using a Laplacian-adjusted probability estimate. The weights are summed to provide a probability estimate, which is a relative predictor of the likelihood of that sample being from the good subset. The Laplacian-corrected estimator is used to adjust the uncorrected probability estimate of a feature to account for the different sampling frequencies of different features. The derivation is as follows: assume that N samples are available for training, of which M are good (active). An estimate of the baseline probability of a randomly chosen sample being active, P(Active), is M/N. Next, assume we are given a feature F contained in B samples, and that A of those B samples are active. The uncorrected estimate of activity, P(Active|F), is A/B. Unfortunately, as the number of samples, B, becomes small, this estimator is unreliable; for example, if A = 1 and B = 1, P(Active|F) would be 1 (that is, certainly active), which seems overconfident for a feature we have only seen

Figure 6. First two largest maximal substructures, highlighted in a parent molecule, found in active molecules from the NCI AIDS data set that exhibit a large activity range. Also shown in the figure are the logEC50 mean, range, minimum and maximum values for the parent molecules that exhibit each subgraph.

once. Most likely, the estimator is poor because the feature is undersampled, and further sampling of that feature would improve the estimate. We can estimate the effect of further sampling if we assume the vast majority of features have no relationship with activity; that is, if, for most features F_i, we would expect P(Active|F_i) to be equal to our baseline probability P(Active). If we sampled the feature K additional times, we would expect P(Active) × K of those new samples to be active. This provides the information needed to estimate the corrective effect of K additional samples:

P_corr(Active|F) = (A + P(Active) × K) / (B + K)

(For K = 1/P(Active), this is the Laplacian correction.) This correction stabilizes the estimator: as the number of samples, B, containing a feature approaches zero, the feature's probability contribution converges to P(Active), which would be the expected value for most features. The final step is to make the estimator a relative estimate by dividing through by P(Active); that is,

P_final(Active|F) = P_corr(Active|F) / P(Active)

This means that for most features, log P_final ≈ 0. For features more common in actives, log P_final > 0. For features less common in actives, log P_final < 0. The completed estimate for a particular sample is derived by adding together the log P_final values for all the features present in that sample. A recent study by Hert et al. [26] provides a good overview of the different available Bayesian methods, which they call R1, R2, R3, R4 [27], and AVID [28].
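The derivation above can be turned into a small scoring sketch. The `stats` dictionary mapping each feature to its (A, B) counts is an assumed bookkeeping structure for illustration, not part of Pipeline Pilot.

```python
# Sketch of the Laplacian-corrected feature weight and per-sample score.
from math import log

def feature_weight(a, b, n_total, m_active):
    """log P_final for a feature seen in b samples, a of them active."""
    p_active = m_active / n_total                # baseline P(Active) = M/N
    k = 1.0 / p_active                           # Laplacian choice of K
    p_corr = (a + p_active * k) / (b + k)        # (A + P*K) / (B + K)
    return log(p_corr / p_active)                # relative log estimate

def score(sample_features, stats, n_total, m_active):
    """Sum log P_final over the features present in a sample."""
    return sum(feature_weight(*stats[f], n_total, m_active)
               for f in sample_features)
```

As the text predicts, an unseen feature (A = 0, B = 0) contributes a weight of exactly 0, a feature enriched in actives contributes a positive weight, and one depleted in actives a negative weight.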
They note that the method available in Pipeline Pilot is most closely related to the AVID method, but with an important difference: the log of the Laplacian-corrected probability score is taken before the scores of different features are combined. We evaluated the different methods using the same 11 activity classes in the MDL Drug Data Report (MDDR) that were described earlier by Hert et al. [29]. A leave-one-out cross-validation scheme was used to provide predictions for each molecule in each activity class. The result was that in all cases the SciTegic classifier outperformed that of Avidon et al. [28], confirming the importance of the log transformation before combining the weights.

Applications

Similarity-based virtual screening

Hert et al. [29] compared the performance of several types of 2D fingerprints in virtual screening experiments in which molecules were ranked based on their calculated similarity

Figure 7. Average recall rates (percentage of active compounds retrieved) over the 11 activity classes obtained from the top 1% of the ranked molecules using the different types of fingerprints. [Figure 1 from Hert et al. [29]. Reproduced by permission of The Royal Society of Chemistry.]

Figure 8. Example of a molecule and its corresponding molecular framework. Only the ring systems and the chains that link them are preserved. All heavy atoms are converted to carbon, and all bonds are converted to single bonds.

to different sets of bioactive reference compounds. The 2D fingerprints included in the study were the 1052-bit Barnard Chemical Information (BCI) structural keys [30], the 2048-bit Daylight fingerprints [31], the 988-bit Unity fingerprints [32], the 2048-bit Avalon fingerprints (developed by Novartis; these include atoms, augmented atoms, atom triplets, and connection paths), several variations of SciTegic's extended-connectivity and functional-class fingerprints (ECFP 2, ECFP 4, FCFP 2, FCFP 4), the Similog keys 2D pharmacophore fingerprints [33], and the Chemically Advanced Template Search (CATS) pharmacophore fingerprints [34]. The study used eleven different activity classes from the MDL Drug Data Report (MDDR) database [35]. Two different similarity searching procedures were utilized: data fusion using the maximum similarity scores, and a binary kernel discrimination (BKD) machine learning technique. Data fusion involves using not a single structure but a group of bioactive compounds from the same activity class as reference structures. The similarity of each molecule in the database to each of the reference compounds is calculated, and the molecules are ranked based on the maximum similarity score to any of the reference structures. This group fusion technique has been shown to be more efficient in retrieving actives than searches using a single reference compound [18]. Random groups of ten active reference compounds were used in the study.
Results of the virtual screening are illustrated in Figure 7, which shows the average recall rates (percentage of active compounds retrieved) over the 11 activity classes obtained from the top 1% of the ranked molecules using the different types of fingerprints. The figure shows that SciTegic's extended-connectivity fingerprints, particularly the ECFP 4 fingerprints, perform significantly better than the other fingerprints. The ECFP 4 fingerprints result in average recall rates of about 43%, compared to an average of only 34% over all fingerprints combined. The authors also investigated the diversity of the sets of active molecules retrieved in the top 1% using the different types of fingerprints. They defined the diversity of the sets of retrieved actives in terms of the number of different ring systems present in the set. The ring systems are defined in terms of the molecular frameworks described by Bemis and Murcko [36] and illustrated in Figure 8. The molecular frameworks consist of the rings present in the molecule plus any chains connecting them, with all atoms converted to carbon and all bonds converted to single bonds. Figure 9 shows the percentage of molecular frameworks in the sets of actives retrieved in the top 1% of the ranked molecules averaged over the 11 activity classes for the different types of fingerprints. Again,

Figure 9. Percentage of molecular frameworks in the sets of actives retrieved in the top 1% of the ranked molecules averaged over the 11 activity classes for the different types of fingerprints. [Figure 5 from Hert et al. [29]. Reproduced by permission of The Royal Society of Chemistry.]

the extended-connectivity fingerprints tend to do better than the other fingerprint types in retrieving different molecular frameworks, which indicates that they are also suitable for scaffold-hopping applications.

Bayesian learning based virtual screening

One major application of Bayesian learning is in the analysis of high-throughput screening data. HTS data has specific characteristics:
1. A large number of samples.
2. A very low occurrence of hits; hit rates below 1% are common.
3. A large amount of noise, both false positives and false negatives.
4. Multiple modes of action: it is typical that hits in a screen come from many different classes and may have different modes of action.

In this section we describe a number of case studies that demonstrate how Bayesian learning can be applied to HTS data analysis and how it deals with each of the issues described above.

The NCI AIDS data set

The first case study uses the AIDS data set from the NCI [37]. After curation this produced a starting set of 32,343 molecules, of which 230 were marked as confirmed active (CA) and a further 450 as confirmed moderate (CM); the remainder were confirmed inactive (CI). An analysis by the chemists at the NCI suggested that the 230 hits do not form a single congeneric series; rather, they identified at least 7 chemical classes amongst those CA hits.

Table 1. Results of the Bayesian model on the NCI AIDS HTS data set

               80% of Actives a   ROC
Experiment 1   3.5%               0.90 (Good)
Experiment 2   14%                0.89 (Good)
Experiment 3   7%                 0.87 (Good)
Experiment 4   15%                0.88 (Good)

a Percentage of the hit list, sorted by Bayesian score, required to recover 80% of the actives.
A series of experiments was conducted to test the properties of the Bayesian learning method with FCFP fingerprints as applied to high throughput screening data.

Experiment 1: Predictive ability. The data set was split randomly into two equal parts giving a training set and a test set; a Bayesian model was built with the first subset and then applied to the prediction of the second subset. In this experiment the two classes were defined as CA being the good subset and CM + CI as the baseline. In all experiments in this section, the descriptors used were FCFP 6 fingerprints, AlogP, molecular weight, and the numbers of donors, acceptors, and rotatable bonds. The entire experiment to prepare the test and training sets, build the model, and predict the test set took less than 25 s on a 3.0 GHz Windows machine. This shows that Bayesian models are extremely fast to build. Unlike many other methods, the algorithm scales linearly with training set size, making it practical to process extremely large compound collections. The following measures are used to assess the quality of the predictions in this and the subsequent experiments:

Enrichment: the samples in the test set were sorted by their Bayesian score from highest to lowest. An enrichment plot

Figure 10. Enrichment plot obtained with the Bayesian model corresponding to experiment 1 for the NCI AIDS data set. From the plot it is seen that 80% of the actives would be retrieved in the top 3.5% of the list (571 compounds).

shows the rank order of the samples on the X-axis plotted against the fraction of the actives recovered on the Y-axis. From this, the percentage of the data set needed to retrieve 80% of the actives is recorded. Figure 10 shows the enrichment plot for experiment 1.

ROC score: a measure of the area under the curve of true positive rate vs. false positive rate. The ROC plot for experiment 1 is shown in Figure 11.

The results for experiment 1 are shown in Table 1. Examination of these results and the plots in Figures 10 and 11 shows that a robust model was generated that was able to predict the test set with a high degree of accuracy. When the test set is sorted by the model score, 80% of the actives would be present in the top 3.5% of the list, i.e., in the top 571 of 16,000+ compounds. If the model were random, one would not expect to recover 80% of the actives without testing 80% of the samples. The ROC score of 0.90 shows that the model is of good quality. The results show that if the model had been available prior to screening the test set, only a small percentage of the test set would need to be screened to find most of the activity in the entire set. This, of course, translates to a cost saving in terms of time and material for the screening run. The results also show that the model is able to encompass the multiple activity types present within the training set.

Experiment 2: Robustness to few good samples. A key issue when analyzing HTS data is the low hit rate that often results from such assays. In Experiment 1, a set of approximately 115 hits was used to build a model to predict the other 115 hits. When the hit rate in an assay is low, it is conceivable that this amount of positive data will not be available for model building.
In the second experiment the data was split so that only 5% was used as the training set for model building and the remaining 95% was placed in the test set. This time the training set contained only 14 hits, and these covered 6 of the 7 classes identified by the NCI chemists. The results in Table 1 show that the model was not as good as that from Experiment 1. However, it was still a very robust model, able to identify 80% of the actives in the top 14% of the list (approximately 2200 compounds of the 30,000+ test set).

Experiment 3: The effects of noise. It is well known that primary HTS data is noisy and can contain many false positives and false negatives. However, the NCI data has been well curated and is likely cleaner than typical HTS sets. To model the effects of noise, the data set was again split 50/50

Figure 11. ROC plot obtained with the Bayesian model corresponding to experiment 1 for the NCI AIDS data set. The area under the curve gives the ROC score of 0.90.

into training and test sets. Before model building, 5% of the negatives (CIs) in the training set were reassigned to positive (CA). Thus, the model was built with 115 true positives, 800 false positives and the remaining compounds as true negatives. The results in Table 1 show that even with a noise level of 7:1 (false positives : true positives) a good model was found. With noise, 80% of the actives were found in the top 7% of the list, compared to the top 3.5% of the list without noise.

Experiment 4: Weakly active hits. In some screening experiments, the first round of screening may turn up only weakly active compounds. This experiment tests whether there is sufficient information in those compounds to lead to the identification of more strongly active ones. The data was first split 50/50, but then all the active (CA) compounds in the training set were moved to the test set. In this case the model was learned using the weakly active compounds (CM) as the actives. When applied to the test set, the ability of the model to identify the CA compounds was investigated. Without a single CA in the training set, the model was able to rank 80% of the CAs in the test set in the top 15% of the list.

Experiment 5: Recovery of false negatives. False positives in a screen are somewhat undesirable, but ultimately they will be discovered and discounted in follow-up screening. False negatives are more problematic in most screening protocols, since they will never be identified and represent hits that are gone for good. Experiment 5 tested the ability of the method to recover false negatives. In this experiment, half of the 230 CAs were marked as CI. A model was then learned on the entire data set, and it was then used to rank all the negatives, including the true CI molecules and the 115 CIs that were really CAs.
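The recovery protocol of Experiment 5 can be sketched schematically. This is an illustration only, not the original code; `train_and_score` is a hypothetical stand-in for any model builder (such as the Bayesian learner) that takes molecules plus labels and returns a scoring function:

```python
import random

def false_negative_recovery(molecules, is_active, train_and_score,
                            top_frac=0.05, seed=0):
    # Hide half of the true actives by relabeling them as negatives.
    rng = random.Random(seed)
    actives = [m for m in molecules if is_active(m)]
    hidden = set(rng.sample(actives, len(actives) // 2))
    noisy_label = {m: (is_active(m) and m not in hidden) for m in molecules}
    # Train on the noisy labels, then rank only the "negatives" by score.
    score = train_and_score(molecules, noisy_label)
    negatives = [m for m in molecules if not noisy_label[m]]
    negatives.sort(key=score, reverse=True)
    top = set(negatives[:max(1, int(top_frac * len(negatives)))])
    # Fraction of the hidden actives recovered near the top of the list.
    return len(hidden & top) / len(hidden)
```

A high return value means the hidden actives cluster at the top of the ranked negatives, exactly the behavior the paper reports (85% of false negatives in the top 5%).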
If false negatives are to be recovered, one would hope that they would have the highest Bayesian scores and so would appear at the top of a list ranked by that score. In practice, 85% of the false negatives were contained within the top 5% of the list. This demonstrates that a screening protocol in which follow-up screening is conducted on both the hits and the top few percent of the negatives (as ranked by a Bayesian model of the hits) would allow many false negatives to be recovered.

Experiment 6: Iterative screening. Historically, high-throughput screening has been performed across an entire compound collection in a single run. More recently, iterative screening has been introduced in an effort to discover hits faster and save cost by screening only part of a collection. The procedure is to screen an initial small sample of a compound collection and then build a model of the results. This model is then used to rank the remainder of the collection, and the next subset is selected and screened from the top-ranked compounds. The model is then regenerated to include the new results. The process is iterated until either:
1. All compounds have been tested, or
2. The hit rate drops to a level at which no further screening is deemed necessary.
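The iterative screening loop just described can be sketched as follows; `assay` and `build_model` are hypothetical stand-ins for the wet-lab screening step and the Bayesian model builder, which is retrained on all accumulated results before each new batch is chosen:

```python
import random

def iterative_screen(collection, assay, build_model, batch_size, seed=0):
    """Screen in batches: a random first batch, then model-ranked batches."""
    remaining = list(collection)
    random.Random(seed).shuffle(remaining)
    results = {}
    batch, remaining = remaining[:batch_size], remaining[batch_size:]
    while batch:
        for m in batch:                # "screen" the current batch
            results[m] = assay(m)
        if not remaining:
            break
        score = build_model(results)   # rebuild the model on all results
        remaining.sort(key=score, reverse=True)
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
    return results
```

In practice the loop would stop early once the per-batch hit rate falls below a threshold; this sketch runs to exhaustion for simplicity.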

Figure 12. Iterative screening results for the NCI AIDS data set.

Experiment 6 mimics the iterative screening process by first selecting a subset of 3072 compounds (8 plates of 384) from the NCI set at random; this represents about 10% of the data set. Its hit rate is calculated by examination of the CA/CI assignments and equals the hit rate in the data set as a whole. A model is then built, the next subset of the same size is chosen from the top-ranked compounds, and the hit rate is recalculated across all the selected compounds. The model is then rebuilt and the process iterated. Figure 12 shows the results. At the first iteration 23 hits are found, a rate equal to the hit rate of the whole data set (shown by the dashed line). However, after the second generation 174 of the 230 hits have already been recovered (solid line); 80% of the hits are recovered in 3 generations and 90% in 6 generations, at which point only just over half of the compound collection has been screened.

Screening of kinase inhibitors

Similar experiments were performed by Xia et al. [25] using a wide variety of screening data on kinase targets that had been collected at Amgen. They identified a set of 6236 compounds that had been found to inhibit one or more of 39 protein kinases from two kinase families (tyrosine kinases and serine/threonine kinases) with an IC50 < 10 μM. A set of 193,417 compounds from the corporate collection was identified as a baseline, and these sets were used to build Bayesian models. Splitting the data 50:50 into training and test sets produced a model that ordered 85% of the hits in the test set in the top 10% of the list sorted by model score. A further experiment to build the model from only 10% of the data and predict the other 90% showed almost no reduction in model quality.
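The "x% of actives in the top y% of the list" figures quoted throughout these case studies come from a simple enrichment calculation over the score-ranked list; a minimal sketch:

```python
def enrichment_cutoff(scores, labels, recall=0.8):
    """Fraction of the score-ranked list needed to recover `recall`
    of the actives (e.g. 0.035 means 80% of actives in the top 3.5%)."""
    ranked = sorted(zip(scores, labels), key=lambda t: t[0], reverse=True)
    target = recall * sum(labels)
    found = 0
    for i, (_, is_active) in enumerate(ranked, start=1):
        found += is_active
        if found >= target:
            return i / len(ranked)
    return 1.0
```

For a random ranking this fraction approaches the recall itself (80% of the list to find 80% of the actives), which is why values of a few percent indicate strong enrichment.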
One concern of the authors was that the model might only be predictive within the series of compounds contained within the historical corporate data and would not be able to identify kinase inhibitors that were markedly different from those used in learning. This is an important concern when one considers the ability of methods to lead-hop from an existing patent space into a new one. The authors identified 172 compounds from the recent literature belonging to newly emerging protein kinase inhibitor classes. A similarity analysis showed that these 172 compounds were all significantly different from the 6236 hits from their corporate collection. The 172 compounds were merged with 168,000 baseline molecules, and the model learned earlier from 10% of the in-house kinase hits was applied to predict the 168,172-member data set. 70% of the 172 new compounds were found in the top 10% of the list and 85% in the top 20%, a discovery rate only slightly worse than that of the in-house compounds. This shows the ability of the methodology to identify hits from novel structural and kinase classes.

Analysis of CDK2 inhibitors

Cyclin-dependent kinases are cellular kinases that play a crucial role in the phases of the cell cycle. Many groups are studying

Figure 13. Representative CDK2 inhibitor compounds in each scaffold class and the number of active compounds containing each scaffold.

cyclin-dependent kinase 2 (CDK2) inhibitors for their potential as anticancer therapeutics [38-48]. The study reported here is a ligand-based retrospective analysis of the classification of activity of ligands using a Bayesian statistical approach. It demonstrates how CDK2 inhibitor-like libraries can be generated for smart screening using a Bayesian model. The model also allows the identification of structural features that are associated with CDK2 inhibitory activity. Using data from previously published work [49], we assembled a library combining several sources of compounds: diverse compounds from general screening libraries (13,359 with 207 actives), commercial compounds selected by chemists (951 with 161 actives), and the rest synthesized and screened iteratively using 22 scaffold types (3,240 total with 14 actives). We combined the sources and divided the data equally and randomly into training and test sets, for a total of 17,550 compounds. Compounds were classified as active if the reported IC50 was lower than 25 μM or if the percent inhibition was greater than 50% at 10 μM. Figure 13 represents the 20 scaffolds present in the HTS data. The figure shows a representative compound for each scaffold type (specifically, the compound with the lowest molecular weight in each scaffold class). Every compound belongs to a scaffold type as designated by medicinal chemists in the original reference [49]. We developed a predictive Bayesian model using FCFP_6, AlogP, molecular weight, number of hydrogen bond donors, number of hydrogen bond acceptors, and number of rotatable bonds. The Bayesian model was able to correctly classify over 80% of the 8,706 test set compounds (129 out of 173 actives and 6,898 out of 8,533 inactives).
This is a good prediction, but a major strength of Bayesian modeling is its ability to rank compounds according to their probability of being active. This ranking is important when prioritizing compounds for screening or for further development. The enrichment plot shown in Figure 14 is a graphical representation of the quality of this ranking. The enrichment curve plots the number of active compounds recovered versus the proportion of the database screened. The diagonal line shows how many active compounds would be recovered if the database were screened randomly. This graph shows that by screening just 1% of the database, 17% of the actives are retrieved (17-fold enrichment). This demonstrates the quality of the Bayesian model generated for the prediction of CDK2 inhibition, in particular its ability to accurately rank compounds according to activity.

It is also possible to use the Bayesian model to identify the top good and bad structural features present in the compounds in the training set. The top good features are those that appear mostly in active compounds, while bad features are those that appear mostly in inactive compounds. Figures 15 and 16 show the top good and bad features identified by the Bayesian model based on the FCFP_6 fingerprints. The presence or absence of good and bad features among the compounds belonging to the different scaffolds helps us to understand the model predictions. All 15 test set compounds belonging to scaffold 20 contain the top FCFP_6 features and are correctly predicted to be active. As previously reported [49], scaffold 05 was designated as an active scaffold. The

Figure 14. Enrichment plot from the classification of compounds in the CDK2 test data set obtained with the Bayesian model.

Figure 15. Good FCFP_6 features identified in the CDK2 data set. The numbers below the structures indicate the number of times the feature was observed and, in parentheses, the number of times the feature was observed in active compounds.

Figure 16. Bad FCFP_6 features identified in the CDK2 data set. The numbers below the structures indicate the number of times the feature was observed and, in parentheses, the number of times the feature was observed in active compounds.

Bayesian model assigns high probability scores to 21 out of the 23 actives within the test set. This is because scaffold 05 contains one of the good features (the fourth substructure feature shown in Figure 15). This structure was seen 52 times in the training set, 13 times in active compounds, giving it a high normalized probability score. However, the same feature also occurs in the inactive molecules belonging to scaffold 05. As the scaffold itself is deemed to confer activity, 112 scaffold 05 compounds are incorrectly predicted to be active (false positives). This study demonstrates that a predictive model can be built extremely quickly with thousands of compounds with diverse structures and scaffolds, and likely more than one mode of action. The resulting model enables us to correlate

structural features with biological activities and correctly classifies active and inactive compounds. Additionally, ranking of the active compounds shows significant enrichment, with certain scaffolds conferring CDK2 inhibitory activity.

The McMaster DHFR screening set

Many methods for virtual screening have been developed, ranging from simple similarity searching through statistical methods such as Bayesian learning to complex docking. Direct comparison of the methods is always hard, however, since validation studies invariably use different data sets and analyze results in different ways. To address this issue, a group from McMaster University instigated a virtual screening competition that would allow different methods to be directly compared. The group provided a training set of 50,000 compounds that had been screened against Escherichia coli dihydrofolate reductase (DHFR), which contained 32 marked hits, including 12 competitive inhibitors. They also provided a 50,000-compound test set, and the task was to return the list to the organizers in rank order of predicted activity. The details and results of the competition were described in a special issue of the Journal of Biomolecular Screening [50]. Rogers et al. [51] applied Bayesian learning to the problem. In an initial validation within the training set using 10-fold cross-validation, a ROC score of 0.86 was obtained, suggesting that the method should be applicable to this data set. A model was then built from the entire training set and used to rank order the test set, and this ranked list was submitted. The results of the competition were judged by the number of actives within the top 2500 compounds in the list. The Bayesian learning method scored 7 compounds, which seems somewhat low. However, it was later revealed that although there were 96 hits within the test set, none of those was a competitive inhibitor.
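The ROC scores quoted in these studies measure the area under the curve of true positive rate vs. false positive rate, which is equivalent to the probability that a randomly chosen active is ranked above a randomly chosen inactive. A minimal sketch via the rank-sum (Mann-Whitney) identity, with ties ignored for brevity:

```python
def roc_auc(scores, labels):
    """ROC score from model scores and binary activity labels (1 = active)."""
    pairs = sorted(zip(scores, labels))      # rank all samples by score
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    # Sum of the (1-based) ranks of the actives in the sorted list.
    rank_sum = sum(rank for rank, (_, y) in enumerate(pairs, start=1) if y)
    # Mann-Whitney identity: AUC = (R_pos - n_pos(n_pos+1)/2) / (n_pos * n_neg)
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

A score of 1.0 means every active outranks every inactive, 0.5 corresponds to a random ranking, and values near 0.9, as in Table 1, indicate a good model.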
Furthermore, of the remaining 32 competition entries, many groups scored only 0 or 1 hits in the list. The top group, who used a docking approach, identified 13 hits, and another group using a similar statistical learning approach identified 4 hits in the top 2500. Further investigation by a number of groups revealed that the training and test sets were drawn from very different populations of molecule types and had little overlap, thus violating the statistical requirement that the training and test sets be drawn from the same population. While this means that the competition results must be treated somewhat circumspectly, it is interesting to note that the Bayesian statistical approaches, which took a matter of a few minutes to learn and score the entire set, competed very favorably with complex docking methods, which require significant manual work to set up the active site, have runtimes of the order of seconds or minutes per ligand, and would not have been so affected by the statistical differences between the training and test sets.

Application of Bayesian learning with docking

Individual virtual screening methods need not be applied in isolation. For example, a fast Bayesian method can provide a useful filter to reduce a very large pool of potential ligands down to a smaller candidate subset for the application of a more rigorous and time-consuming method such as docking. A research group at Novartis has also shown that Bayesian learning may be used as a post-processing stage to docking and scoring, improving the rankings obtained from docking alone. An ongoing challenge in docking is to calculate accurate energies and scores for docked conformations of ligands. Many different scoring methods have been proposed, with a wide range of theoretical backgrounds, but all suffer from biases that limit their accuracy. In initial work, Klon et al. [52] proposed a method in which a set of ligands was first docked into a protein active site.
The energies of the docked conformations were then used to rank order the molecules. A Bayesian model was then learned using the top-ranked molecules as good and the remainder as bad. Once built, the model was used to re-rank the compounds. Using training sets with known activities against a particular protein, enrichment curves based on the dock scores alone were compared with those in which the rank orders had been modified according to the Bayesian model. In systems in which the dock scores produced some enrichment, the Bayesian-modified ranking was found to significantly improve the enrichment. For example, for docking into PTP-1B using Dock, 45% of known actives were captured in the top 10% of the list based on the Dock score alone, while 68% were in the top 10% of the Bayesian-resorted list. For FlexX the figures were 72% and 91%, and for Glide, 84% and 89%. In these initial experiments the authors found cases in which the dock scores gave no enrichment, and in those cases Bayesian learning was unable to improve the situation. However, in a later publication [53] they showed that by employing a consensus scoring method following docking, and then learning on the consensus scores rather than the raw docking scores, Bayesian learning could improve the enrichment in the test cases for which learning on the original single docking score produced no enrichment.

Summary and conclusions

In this review we have discussed a wide-ranging set of chemical algorithms implemented in Pipeline Pilot. Through a variety of case studies we have demonstrated their effectiveness in analyzing the types of data available in drug discovery projects. By deploying these algorithms as components in a data pipelining environment, we give scientists great flexibility in defining their own chemically intelligent workflows. These workflows allow the rapid implementation of complex tasks and the automation of their execution.
Pipeline Pilot's chemistry and analysis components can also


1. Some examples of coping with Molecular informatics data legacy data (accuracy) Molecular Informatics Tools for Data Analysis and Discovery 1. Some examples of coping with Molecular informatics data legacy data (accuracy) 2. Database searching using a similarity approach fingerprints

More information

Research Article. Chemical compound classification based on improved Max-Min kernel

Research Article. Chemical compound classification based on improved Max-Min kernel Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(2):368-372 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Chemical compound classification based on improved

More information

De Novo molecular design with Deep Reinforcement Learning

De Novo molecular design with Deep Reinforcement Learning De Novo molecular design with Deep Reinforcement Learning @olexandr Olexandr Isayev, Ph.D. University of North Carolina at Chapel Hill olexandr@unc.edu http://olexandrisayev.com About me Ph.D. in Chemistry

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing

Analysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver

More information

PIOTR GOLKIEWICZ LIFE SCIENCES SOLUTIONS CONSULTANT CENTRAL-EASTERN EUROPE

PIOTR GOLKIEWICZ LIFE SCIENCES SOLUTIONS CONSULTANT CENTRAL-EASTERN EUROPE PIOTR GOLKIEWICZ LIFE SCIENCES SOLUTIONS CONSULTANT CENTRAL-EASTERN EUROPE 1 SERVING THE LIFE SCIENCES SPACE ADDRESSING KEY CHALLENGES ACROSS THE R&D VALUE CHAIN Characterize targets & analyze disease

More information

Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner

Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner Table of Contents Creating a Pharmacophore Query from a Reference Molecule & Scaffold Hopping in CSD-CrossMiner Introduction... 2 CSD-CrossMiner Terminology... 2 Overview of CSD-CrossMiner... 3 Features

More information

Clustering Ambiguity: An Overview

Clustering Ambiguity: An Overview Clustering Ambiguity: An Overview John D. MacCuish Norah E. MacCuish 3 rd Joint Sheffield Conference on Chemoinformatics April 23, 2004 Outline The Problem: Clustering Ambiguity and Chemoinformatics Preliminaries:

More information

DISCOVERING new drugs is an expensive and challenging

DISCOVERING new drugs is an expensive and challenging 1036 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 8, AUGUST 2005 Frequent Substructure-Based Approaches for Classifying Chemical Compounds Mukund Deshpande, Michihiro Kuramochi, Nikil

More information

CSD. CSD-Enterprise. Access the CSD and ALL CCDC application software

CSD. CSD-Enterprise. Access the CSD and ALL CCDC application software CSD CSD-Enterprise Access the CSD and ALL CCDC application software CSD-Enterprise brings it all: access to the Cambridge Structural Database (CSD), the world s comprehensive and up-to-date database of

More information

CSD. Unlock value from crystal structure information in the CSD

CSD. Unlock value from crystal structure information in the CSD CSD CSD-System Unlock value from crystal structure information in the CSD The Cambridge Structural Database (CSD) is the world s most comprehensive and up-todate knowledge base of crystal structure data,

More information

Functional Group Fingerprints CNS Chemistry Wilmington, USA

Functional Group Fingerprints CNS Chemistry Wilmington, USA Functional Group Fingerprints CS Chemistry Wilmington, USA James R. Arnold Charles L. Lerman William F. Michne James R. Damewood American Chemical Society ational Meeting August, 2004 Philadelphia, PA

More information

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics

Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics... 1 1.1 Chemoinformatics... 2 1.1.1 Open-Source Tools... 2 1.1.2 Introduction to Programming Languages... 3 1.2 Chemical Structure

More information

Universities of Leeds, Sheffield and York

Universities of Leeds, Sheffield and York promoting access to White Rose research papers Universities of Leeds, Sheffield and York http://eprints.whiterose.ac.uk/ This is an author produced version of a paper published in Statistical Analysis

More information

A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors

A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors Rajarshi Guha, Debojyoti Dutta, Ting Chen and David J. Wild School of Informatics Indiana University and Dept.

More information

COMPARISON OF SIMILARITY METHOD TO IMPROVE RETRIEVAL PERFORMANCE FOR CHEMICAL DATA

COMPARISON OF SIMILARITY METHOD TO IMPROVE RETRIEVAL PERFORMANCE FOR CHEMICAL DATA http://www.ftsm.ukm.my/apjitm Asia-Pacific Journal of Information Technology and Multimedia Jurnal Teknologi Maklumat dan Multimedia Asia-Pasifik Vol. 7 No. 1, June 2018: 91-98 e-issn: 2289-2192 COMPARISON

More information

Supplementary information

Supplementary information Electronic Supplementary Material (ESI) for MedChemComm. This journal is The Royal Society of Chemistry 2017 Supplementary information Identification of steroid-like natural products as potent antiplasmodial

More information

Using AutoDock for Virtual Screening

Using AutoDock for Virtual Screening Using AutoDock for Virtual Screening CUHK Croucher ASI Workshop 2011 Stefano Forli, PhD Prof. Arthur J. Olson, Ph.D Molecular Graphics Lab Screening and Virtual Screening The ultimate tool for identifying

More information

Introduction to Chemoinformatics and Drug Discovery

Introduction to Chemoinformatics and Drug Discovery Introduction to Chemoinformatics and Drug Discovery Irene Kouskoumvekaki Associate Professor February 15 th, 2013 The Chemical Space There are atoms and space. Everything else is opinion. Democritus (ca.

More information

The Case for Use Cases

The Case for Use Cases The Case for Use Cases The integration of internal and external chemical information is a vital and complex activity for the pharmaceutical industry. David Walsh, Grail Entropix Ltd Costs of Integrating

More information

Benchmarking of Multivariate Similarity Measures for High-Content Screening Fingerprints in Phenotypic Drug Discovery

Benchmarking of Multivariate Similarity Measures for High-Content Screening Fingerprints in Phenotypic Drug Discovery 501390JBXXXX10.1177/1087057113501390Journal of Biomolecular ScreeningReisen et al. research-article2013 Original Research Benchmarking of Multivariate Similarity Measures for High-Content Screening Fingerprints

More information

COMBINATORIAL CHEMISTRY IN A HISTORICAL PERSPECTIVE

COMBINATORIAL CHEMISTRY IN A HISTORICAL PERSPECTIVE NUE FEATURE T R A N S F O R M I N G C H A L L E N G E S I N T O M E D I C I N E Nuevolution Feature no. 1 October 2015 Technical Information COMBINATORIAL CHEMISTRY IN A HISTORICAL PERSPECTIVE A PROMISING

More information

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007 Computational Chemistry in Drug Design Xavier Fradera Barcelona, 17/4/2007 verview Introduction and background Drug Design Cycle Computational methods Chemoinformatics Ligand Based Methods Structure Based

More information

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Plan Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Exercise: Example and exercise with herg potassium channel: Use of

More information

Ultra High Throughput Screening using THINK on the Internet

Ultra High Throughput Screening using THINK on the Internet Ultra High Throughput Screening using THINK on the Internet Keith Davies Central Chemistry Laboratory, Oxford University Cathy Davies Treweren Consultants, UK Blue Sky Objectives Reduce Development Failures

More information

AMRI COMPOUND LIBRARY CONSORTIUM: A NOVEL WAY TO FILL YOUR DRUG PIPELINE

AMRI COMPOUND LIBRARY CONSORTIUM: A NOVEL WAY TO FILL YOUR DRUG PIPELINE AMRI COMPOUD LIBRARY COSORTIUM: A OVEL WAY TO FILL YOUR DRUG PIPELIE Muralikrishna Valluri, PhD & Douglas B. Kitchen, PhD Summary The creation of high-quality, innovative small molecule leads is a continual

More information

Virtual Screening: How Are We Doing?

Virtual Screening: How Are We Doing? Virtual Screening: How Are We Doing? Mark E. Snow, James Dunbar, Lakshmi Narasimhan, Jack A. Bikker, Dan Ortwine, Christopher Whitehead, Yiannis Kaznessis, Dave Moreland, Christine Humblet Pfizer Global

More information

Model Accuracy Measures

Model Accuracy Measures Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses

More information

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size

Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Feature selection and classifier performance in computer-aided diagnosis: The effect of finite sample size Berkman Sahiner, a) Heang-Ping Chan, Nicholas Petrick, Robert F. Wagner, b) and Lubomir Hadjiiski

More information

An Enhancement of Bayesian Inference Network for Ligand-Based Virtual Screening using Features Selection

An Enhancement of Bayesian Inference Network for Ligand-Based Virtual Screening using Features Selection American Journal of Applied Sciences 8 (4): 368-373, 2011 ISSN 1546-9239 2010 Science Publications An Enhancement of Bayesian Inference Network for Ligand-Based Virtual Screening using Features Selection

More information

LigandScout. Automated Structure-Based Pharmacophore Model Generation. Gerhard Wolber* and Thierry Langer

LigandScout. Automated Structure-Based Pharmacophore Model Generation. Gerhard Wolber* and Thierry Langer LigandScout Automated Structure-Based Pharmacophore Model Generation Gerhard Wolber* and Thierry Langer * E-Mail: wolber@inteligand.com Pharmacophores from LigandScout Pharmacophores & the Protein Data

More information

Reaxys Medicinal Chemistry Fact Sheet

Reaxys Medicinal Chemistry Fact Sheet R&D SOLUTIONS FOR PHARMA & LIFE SCIENCES Reaxys Medicinal Chemistry Fact Sheet Essential data for lead identification and optimization Reaxys Medicinal Chemistry empowers early discovery in drug development

More information

Kernel-based Machine Learning for Virtual Screening

Kernel-based Machine Learning for Virtual Screening Kernel-based Machine Learning for Virtual Screening Dipl.-Inf. Matthias Rupp Beilstein Endowed Chair for Chemoinformatics Johann Wolfgang Goethe-University Frankfurt am Main, Germany 2008-04-11, Helmholtz

More information

Using Self-Organizing maps to accelerate similarity search

Using Self-Organizing maps to accelerate similarity search YOU LOGO Using Self-Organizing maps to accelerate similarity search Fanny Bonachera, Gilles Marcou, Natalia Kireeva, Alexandre Varnek, Dragos Horvath Laboratoire d Infochimie, UM 7177. 1, rue Blaise Pascal,

More information

Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification

Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification Knowledge and Information Systems (20XX) Vol. X: 1 29 c 20XX Springer-Verlag London Ltd. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification Nikil Wale Department of Computer

More information

Development of Pharmacophore Model for Indeno[1,2-b]indoles as Human Protein Kinase CK2 Inhibitors and Database Mining

Development of Pharmacophore Model for Indeno[1,2-b]indoles as Human Protein Kinase CK2 Inhibitors and Database Mining Development of Pharmacophore Model for Indeno[1,2-b]indoles as Human Protein Kinase CK2 Inhibitors and Database Mining Samer Haidar 1, Zouhair Bouaziz 2, Christelle Marminon 2, Tiomo Laitinen 3, Anti Poso

More information

Interactive Feature Selection with

Interactive Feature Selection with Chapter 6 Interactive Feature Selection with TotalBoost g ν We saw in the experimental section that the generalization performance of the corrective and totally corrective boosting algorithms is comparable.

More information

Farewell, PipelinePilot Migrating the Exquiron cheminformatics platform to KNIME and the ChemAxon technology

Farewell, PipelinePilot Migrating the Exquiron cheminformatics platform to KNIME and the ChemAxon technology Farewell, PipelinePilot Migrating the Exquiron cheminformatics platform to KNIME and the ChemAxon technology Serge P. Parel, PhD ChemAxon User Group Meeting, Budapest 21 st May, 2014 Outline Exquiron Who

More information

Dr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre

Dr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre Dr. Sander B. Nabuurs Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre The road to new drugs. How to find new hits? High Throughput

More information

DivCalc: A Utility for Diversity Analysis and Compound Sampling

DivCalc: A Utility for Diversity Analysis and Compound Sampling Molecules 2002, 7, 657-661 molecules ISSN 1420-3049 http://www.mdpi.org DivCalc: A Utility for Diversity Analysis and Compound Sampling Rajeev Gangal* SciNova Informatics, 161 Madhumanjiri Apartments,

More information

Conformational Searching using MacroModel and ConfGen. John Shelley Schrödinger Fellow

Conformational Searching using MacroModel and ConfGen. John Shelley Schrödinger Fellow Conformational Searching using MacroModel and ConfGen John Shelley Schrödinger Fellow Overview Types of conformational searching applications MacroModel s conformation generation procedure General features

More information

Ignasi Belda, PhD CEO. HPC Advisory Council Spain Conference 2015

Ignasi Belda, PhD CEO. HPC Advisory Council Spain Conference 2015 Ignasi Belda, PhD CEO HPC Advisory Council Spain Conference 2015 Business lines Molecular Modeling Services We carry out computational chemistry projects using our selfdeveloped and third party technologies

More information

PROVIDING CHEMINFORMATICS SOLUTIONS TO SUPPORT DRUG DISCOVERY DECISIONS

PROVIDING CHEMINFORMATICS SOLUTIONS TO SUPPORT DRUG DISCOVERY DECISIONS 179 Molecular Informatics: Confronting Complexity, May 13 th - 16 th 2002, Bozen, Italy PROVIDING CHEMINFORMATICS SOLUTIONS TO SUPPORT DRUG DISCOVERY DECISIONS CARLETON R. SAGE, KEVIN R. HOLME, NIANISH

More information

A reliable computational workflow for the selection of optimal screening libraries

A reliable computational workflow for the selection of optimal screening libraries DOI 10.1186/s13321-015-0108-0 RESEARCH ARTICLE Open Access A reliable computational workflow for the selection of optimal screening libraries Yocheved Gilad 1, Katalin Nadassy 2 and Hanoch Senderowitz

More information

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value

Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value Rapid Application Development using InforSense Open Workflow and Daylight Technologies Deliver Discovery Value Anthony Arvanites Daylight User Group Meeting March 10, 2005 Outline 1. Company Introduction

More information

The shortest path to chemistry data and literature

The shortest path to chemistry data and literature R&D SOLUTIONS Reaxys Fact Sheet The shortest path to chemistry data and literature Designed to support the full range of chemistry research, including pharmaceutical development, environmental health &

More information

bcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012

bcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012 bcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012 Outline Machine Learning Cheminformatics Framework QSPR logp QSAR mglur 5 CYP

More information

MM-GBSA for Calculating Binding Affinity A rank-ordering study for the lead optimization of Fxa and COX-2 inhibitors

MM-GBSA for Calculating Binding Affinity A rank-ordering study for the lead optimization of Fxa and COX-2 inhibitors MM-GBSA for Calculating Binding Affinity A rank-ordering study for the lead optimization of Fxa and COX-2 inhibitors Thomas Steinbrecher Senior Application Scientist Typical Docking Workflow Databases

More information

Schrodinger ebootcamp #3, Summer EXPLORING METHODS FOR CONFORMER SEARCHING Jas Bhachoo, Senior Applications Scientist

Schrodinger ebootcamp #3, Summer EXPLORING METHODS FOR CONFORMER SEARCHING Jas Bhachoo, Senior Applications Scientist Schrodinger ebootcamp #3, Summer 2016 EXPLORING METHODS FOR CONFORMER SEARCHING Jas Bhachoo, Senior Applications Scientist Numerous applications Generating conformations MM Agenda http://www.schrodinger.com/macromodel

More information

Automatic Star-tracker Optimization Framework. Andrew Tennenbaum The State University of New York at Buffalo

Automatic Star-tracker Optimization Framework. Andrew Tennenbaum The State University of New York at Buffalo SSC17-VIII-6 Automatic Star-tracker Optimization Framework Andrew Tennenbaum The State University of New York at Buffalo aztennen@buffalo.edu Faculty Advisor: John Crassidis The State University of New

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Large Scale Evaluation of Chemical Structure Recognition 4 th Text Mining Symposium in Life Sciences October 10, Dr.

Large Scale Evaluation of Chemical Structure Recognition 4 th Text Mining Symposium in Life Sciences October 10, Dr. Large Scale Evaluation of Chemical Structure Recognition 4 th Text Mining Symposium in Life Sciences October 10, 2006 Dr. Overview Brief introduction Chemical Structure Recognition (chemocr) Manual conversion

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

The Conformation Search Problem

The Conformation Search Problem Jon Sutter Senior Manager Life Sciences R&D jms@accelrys.com Jiabo Li Senior Scientist Life Sciences R&D jli@accelrys.com CAESAR: Conformer Algorithm based on Energy Screening and Recursive Buildup The

More information

W vs. QCD Jet Tagging at the Large Hadron Collider

W vs. QCD Jet Tagging at the Large Hadron Collider W vs. QCD Jet Tagging at the Large Hadron Collider Bryan Anenberg: anenberg@stanford.edu; CS229 December 13, 2013 Problem Statement High energy collisions of protons at the Large Hadron Collider (LHC)

More information

Supporting Information

Supporting Information Supporting Information COMPUTATIONAL DISCOVERY AND EXPERIMENTAL VALIDATION OF INHIBITORS OF THE HUMAN INTESTINAL TRANSPORTER, OATP2B1 Natalia Khuri 1,2,#, Arik A. Zur 2,#, Matthias B. Wittwer 2, Lawrence

More information

Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods

Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods J. Chem. Inf. Model. 2010, 50, 979 991 979 Assessing Synthetic Accessibility of Chemical Compounds Using Machine Learning Methods Yevgeniy Podolyan, Michael A. Walters, and George Karypis*, Department

More information

Supporting Information

Supporting Information Supporting Information Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction Connor W. Coley a, Regina Barzilay b, William H. Green a, Tommi S. Jaakkola b, Klavs F. Jensen

More information

Using Bayesian Statistics to Predict Water Affinity and Behavior in Protein Binding Sites. J. Andrew Surface

Using Bayesian Statistics to Predict Water Affinity and Behavior in Protein Binding Sites. J. Andrew Surface Using Bayesian Statistics to Predict Water Affinity and Behavior in Protein Binding Sites Introduction J. Andrew Surface Hampden-Sydney College / Virginia Commonwealth University In the past several decades

More information

Drug Informatics for Chemical Genomics...

Drug Informatics for Chemical Genomics... Drug Informatics for Chemical Genomics... An Overview First Annual ChemGen IGERT Retreat Sept 2005 Drug Informatics for Chemical Genomics... p. Topics ChemGen Informatics The ChemMine Project Library Comparison

More information