TARGET-ORIENTED GENERIC FINGERPRINT-BASED MOLECULAR REPRESENTATION
|
|
- Kristina May
- 5 years ago
- Views:
Transcription
1 TARGET-ORIENTED GENERIC FINGERPRINT-BASED MOLECULAR REPRESENTATION Petr Skoda and David Hoksza Faculty of Mathematics and Physics, Charles University in Prague, Prague, Czech Republic ABSTRACT The screening of chemical libraries is an important step in the drug discovery process. The existing chemical libraries contain up to millions of compounds. As the screening at such scale is expensive, the virtual screening is often utilized. There exist several variants of virtual screening and ligand-based virtual screening is one of them. It utilizes the similarity of screened chemical compounds to known compounds. Besides the employed similarity measure, another aspect greatly influencing the performance of ligand-based virtual screening is the chosen chemical compound representation. In this paper, we introduce a fragment-based representation of chemical compounds. Our representation utilizes fragments to represent a compound where each fragment is represented by its physico-chemical descriptors. The representation is highly parametrizable, especially in the area of physico-chemical descriptors selection and application. In order to test the performance of our method, we utilized an existing framework for virtual screening benchmarking. The results show that our method is comparable to the best existing approaches and on some data sets it outperforms them. KEYWORDS Virtual screening, Molecular representation, Molecular fingerprints 1. INTRODUCTION The main method to identify new leads in the drug discovery process has traditionally been medium or high-throughput screening (HTS). In this experimental process, a large number of chemical compounds can be screened against a specific target to identify compounds which trigger a response in this target. Some of the HTS approaches can guarantee throughput up to about compounds per second [1] by using the combinatorial libraries. Obviously, the throughput in such cases is not an issue anymore. However, management of such large libraries can be difficult and economically unfeasible since every new compound brought into the screening process increases its price. The in-silico answer to the growing size of chemical databases is the so-called high-throughput virtual screening (HTVS). It allows fast screening of large libraries, which may contain up to tens of millions chemical compounds, without the need of physically own the compounds. An additional bonus which relates to the HTVS is the ability to screen even virtual libraries. I.e., one Natarajan Meghanathan et al. (Eds) : WiMONe, NCS, SPM, CSEIT pp , CS & IT-CSCP 2014 DOI : /csit
2 232 Computer Science & Information Technology (CS & IT) can easily predict bioactivity of compounds residing in not yet well explored parts of the chemical space [2]. While the limits of HTS are given by the technology, the HTVS, since it is a simulation of a real world approach, is limited by the available information about ligands [3]. The available knowledge then dictates how HTVS is utilized. Since unlike HTS, HTVS can suffer by both false positives and false negatives it is commonly used as a pre-step to the standard HTS in the earlystages in the drug discovery pipeline. The HTVS is used to prioritize large chemical libraries which narrows down the set of compounds to be forwarded to HTS. Usefulness of complementing HTS with HTVS has been supported by several studies [4], [5]. The virtual screening approaches can be classified as ligand-based virtual screening (LBVS) and structurebased virtual screening (SBVS) [6], [7]. The choice of which approach to utilize depends on information about the task at hand. If we know the three-dimensional structure of the biological target we can use SBVS methods [8], [9]. The SBVS is based on docking and includes two steps: positioning the ligand into the target active site (docking) and scoring the pose. However, this information is often not available in sufficient quality or it is not available at all. In such a case the ligand-based virtual screening method is the method of choice. 1.1 Ligand-based virtual screening In LBVS, only the information about known bioactive ligands (triggering response in the given biological target) is required. The LBVS is built around the concept that similar structures carry out similar functions more often than dissimilar ones. This assumption is based on the shape and physicochemical complementary of the ligand and target commonly called key-and-lock principle [10] or similar property principle [11]. Thus, given the known active (and possibly also inactive) compounds LBVS methods prioritize compounds that are more likely to have desired functionality/features, based on the similarity to the known active molecules. In the first step of LBVS, a computer-based representation is calculated for the known bioactive ligands as well as for all the molecules in a library to be searched for new bioactive compounds. In the second step, the representation of ligands can be aggregated and used as a query or individual representations are used directly for searching the library. As the last step, the library is sorted with respect to the similarity to the query ligand(s). It is assumed that the high-scoring compounds bind to the target with high probability due to the similarity principle. One can come up with various classification of LBVS approaches. For example, Taboureau at. al. [7] divide LBVS into five classes based on the utilized molecular features: alignment-based, descriptor-based, graph-based, shape-based and pharmacophore-based. While the methods might differ in the specifics of how to approach the identification of bioactive compounds, most of them employ a feature extraction step where the molecular descriptors are identified and encoded into some kind of representation. This is then used as a representation of the molecule in the virtual screening. Among the commonly used features are those which reflect structure or capture computed or experimentally measured physico-chemical properties. Currently, there exists a plethora of descriptors to be utilized in virtual screening [12], [13]. They differ not only in their semantics but also in the computational complexity. The excess of descriptors is the consequence of the fact that none of the descriptors can be generally declared as superior to the rest. The features discriminating active and inactive compounds simply depend on the specific target which varies in every screening campaign. It follows that it is vital to the success of a virtual screening campaign to capture such features which represent the molecules well in terms of their discriminative capability. This is the main motivation for our work. It is out of question that the correct choice of features greatly influences the outcome of a virtual
3 Computer Science & Information Technology (CS & IT) 233 screening campaign. However, we moreover believe that the choice of descriptors should be context-aware that is it should be dependent on the investigated target. Therefore, in this paper we propose a general framework which allows the user to parameterize the molecular representations. 1.2 Fingerprints A common type of descriptors are the 2D fingerprints (fingerprints) capturing the structure of given chemical compound in the form of a bitstring. Every structural feature is mapped to a position in the string. Such representation is suitable for large-scale virtual screening campaigns since it allows fast comparison of two molecules (bitstrings). Thus, the main idea behind the fingerprints is to encode the existence of a given (structural, pharmacophore, ) feature to a position in a bitstring. The features to be encoded commonly include molecular fragments which are continuous substructures of a given molecule. There are two main approaches to fragment extraction: path-based (Topological Torsions fingerprints), neighbourhood-based (Circular Fingerprints or Extended connectivity fingerprints). The Topological torsions fingerprints [14] (TT) use paths of length four (quaternions). The information about types, nonhydrogen connections and number of pi-electrons is used to calculate the index of given path. In Extended Connectivity Fingerprints (ECFPs) and Functional Connectivity Fingerprints (FCFPs) an atom is described in terms of its neighbouring atoms up to a certain radius. Hert [15] has shown that such descriptors can be effective in similarity searching applications. The extended connectivity of an atom is calculated using a modified version of the Morgan algorithm [16] where the atom code is combined with the codes of its neighbours to establish the final atom description. To map a fragment into a position in the bitstring representation, a mapping function needs to be utilized. The simplest solution is the dictionary-based approach where a predefined dictionary of fragments and their mapping into the bitstring is utilized. However, this allows to represent only a limited set of fragments in the bitstring. Another solution is to map every possible fragment into a constant-sized bitstring. However, since the size of a bitstring representation uses to be an order of magnitude smaller in comparison to the number of all possible fragments, typically a modulo function is applied. This allows to obtain a bitstring position for every possible fragment. On the other hand, two different fragments with different indexes can be mapped to the same position in a bitstring. This situation is called the collision. The fingerprints that utilize this approach form the family of hashed fingerprints. 2. METHOD OUTLINE In this work we introduce vector fingerprints (VectorFp), a new approach to the representation of chemical compounds and their comparison. As mentioned above, our goal is to provide a modular molecular representation for LBVS allowing to be parametrized based on the task at hand. The basis of VectorFp molecular representation form structural fragments. But unlike other descriptors, VectorFp allows the fragments to be labeled by user-defined physico-chemical properties. Moreover, the representation was designed with the emphasis on the ability to use it with existing similarity measures for bitstrings. Thus, VectorFp is designed as a generic representation that needs proper parametrization before it can be used.
4 234 Computer Science & Information Technology (CS & IT) 2.1. VectorFp structure In order to maintain compatibility with existing fingerprint methods we decided to choose the bitstring as the representation for VectorFp. The advantage is that we can utilize existing wellestablished similarity measures, LBVS processes, and benchmarking platforms in order to get comparison of our method to the other fingerprints. VectorFp (Figure 1) is basically an array (outer array), where each cell represents one (or more in case of a collision) fragment(s). Each cell of the main outer array contains another array (inner array). The purpose of the inner array is to store the selected descriptors of a respective fragment(s). As mentioned, these descriptors are physico-chemical properties of fragments that are converted into bitstring representation. Figure 1. Structure of VectorFp. 1 - outer array, 2 - cell with inner array, 3 - bits representing single descriptor Generally, physico-chemical descriptors can take various ranges of values being typically integer or float data types. The process of conversion of descriptors into a bitstring is secured by so called conversion methods. In our current implementation we use the same conversion method for all descriptors. It gets minimum and maximum value for a given descriptor and then uses binning which results in an integer value to be used as the descriptor value to be stored. The integer value is then encoded into a bit array using unary encoding. The binning is the formation of a set of disjoint intervals (bins) that represent the possible values. The bin index is finally encoded into a binary representation. In VectorFp we decided to use unary coding. The choice of unary coding instead of, e.g. classical binary coding, stems from the typical choice of similarity functions used when comparing bitstring molecular representations. The most commonly used similarity functions basically assess similarity to a pair of bit strings based on the number of common and differing bit positions. These measures assume that the bits are independent which holds when every bit corresponds to the existence or nonexistence of a molecular substructure. However, when the binary image of a substructure spans multiple bit positions (inner array) the positions are dependent. Using the binary coding with such similarity measure is then not valid. Let us consider a situation when the binary representation takes 4 bits. Then if the distance/similarity is based on the number of common bits bin 4 (0100) is from bin 1 (0001) in the same distance as, e.g., bin 2 (0010). Which should not hold since the bin indexes approximate quantitative characteristics. However, when using unary coding bin 1 gets the code 1000, bin 2 gets the code 1100 and bin 4 gets the code Then, using the same similarity measure, bin 2 is more similar to bin 1 than bin 4 as one would expected VectorFp generation 1. The VectorFp representation computation for a given molecule consists of five main steps: 2. Extract fragments from the molecule and compute indexes (positions in the bitstring representation) for those fragments. 3. For each fragment compute its physico-chemical descriptors. 4. Convert all fragments descriptors into bitstrings. 5. Create the fragment bitstring representation from its respective descriptor representations.
5 Computer Science & Information Technology (CS & IT) Combine the individual fragments representations (bitstrings) together and assemble the representation of the molecule. The representations of fragments are stored into cells determined by the index computed in step 1. As the VectorFp size is limited, the index computed in step 1 must be modified by application of the modulo operation (hashing). As a consequence a collision may occur. In order to solve collisions VectorFp utilizes the bitwise logical or to merge representations of multiple fragments together. The advantage of this method is simplicity and the fact that the results (fragment bitstring) are the same for different permutations of the same fragments. The drawback of selected approach is, that created fragment representation does not have to represent existing fragment. This can be problem during similarity comparison of two VectorFps. If both molecules (their VectorFps representations) have the same fragments in single cell, then everything is in order, but if one molecule has difference number of fragments in given cell then the other molecule, for example one and two, the comparison still compare fragment representation to fragment representation. In this case we compare existing fragment to some imaginary aggregated fragment. 3. PARAMETERIZATION From the description of VectorFp one can notice that there are places where the approach is not fully specified: fragment extraction, descriptor selection and conversion. The named areas and some more create space for parameterization of VectorFp. VectorFp can be seen as a generic representation or frame. The parameterization determines the efficiency of the final VectorFpbased molecular representation. 3.1 VectorFp size One of the parameters is the size of VectorFp. The size is determined by two variables: size of the inner array and the number of cells in the outer array. The final size of vectorfp representation is therefore size of inner array * size of outer array. So for example if we use 1024 cells for the outer array, then a 4 bit increase of the inner array size will result in 4096 bit increase of the resulting representation size. 3.2 Fragment extraction In the current implementation we utilize RDKit s [17] algorithm to extract the fragments from a molecule get their positions in the bitstring. The algorithm uses RDKit s Morgan Fingerprint which is based on the Morgan algorithm. Morgan Fingerprints use the following features to calculate a fragment s position in the bitstring: donor, acceptor, aromatic, halogen, basic, acidic. The RDKit provides the possibility to modify this feature list and thus change the fragment indexes. This can be also viewed as a possible parameterization of VectorFp. Another possibility is to use paths (like TT fingerprints) instead of neighbourhoods. 3.3 Fragment representation Each fragment is represented by an inner array (bitstring). The size of this array determines how many information can be stored about each fragment. By setting the size of the inner array to one we get the classical fingerprints. The selection of used descriptors, conversion method and number of bits in inner array is also part of the parameterization. The descriptor selection and conversions are in our opinion the most important parts of the parameterization and have a great influence on the performance of the method. The selected descriptors include, for example, the number of heavy atoms, logp, the presence of a fragment or, in an extreme case, other fingerprint
6 236 Computer Science & Information Technology (CS & IT) can be used as a fragment s descriptor and inserted into VectorFp. There is also the possibility to stress certain descriptor by multiplying its value. For example, let us have two different descriptors, we use them both in our parameterization but we replicate one of them. In this case, the replicated descriptor has more weight and can be seen as the main one. The second one (nonreplicated) descriptor can serve as a fine tuning mechanism. 4. EXPERIMENTS For experimental evaluation we used the recently published framework for benchmarking LBVS approaches by Riniker et al. [18]. The framework is written in Python [19] and uses RDKit [17] as the underlying chemical framework. It comes with a predefined set of fingerprints, similarity methods (Dice, Tanimoto, Cosine, Russel, Kulczynski, McConnaughey, Manhattan, RogotGoldberg) and quality measurement methods (Area Under Curve (AUC) of Receiver Operating Characteristic curve (ROC), Enrichment Factor (EF), Robust Initial Enhancement (RIE) [20], Boltzmann-Enhanced Discrimination of ROC (BEDROC) [21]). The framework simulates LBVS on pooled targets from three data sets representing 88 targets in total. The three data sets include Database of Useful Decoys (DUD) [22], ChEMBL [23] and Maximum Unbiased Validation (MUV) [24]. For each target a set of known actives and inactives (decoys) is available. As the framework aims to high reproducibility of experiments it also contains a predefined random selection of actives and decoys. Thanks to that, the simulation of LBVS is deterministic and can be easily reproduced by any researcher. However, one of the drawbacks of the framework is that it is designed to use the same method with the same parameterization for all the data sets. There is no learning phase per dataset. Such phase could be useful for benchmarking of methods including a learning phase [25]. The absence of learning phase influences performance of our method in a negative way as our method needs a proper parameterization that differs based on the task (dataset) at hand. Still, we decided to not modify the benchmarking platform and to use a single parameterization over all data sets as the determination of the right parameterization is not the goal of this article. The problem of correct parameterization and feature selection is a separate topic. Riniker et al. [18] recommend to use at least two different benchmarking methods, for example AUC and BEDROC as the AUC alone is considered to be insufficiency sensitive. On the other hand, the advantage of the AUC in comparison to some other methods is that it is non-parametric. Thus, it can be easily used to give a basic idea about the performance of tested method especially in a large scale evaluation. From this reason, we decided to show only AUC values in the following experimental evaluation Comparison to existing methods In this section, we presents the comparison of VectorFp with other fingerprints from selected benchmarking framework. We used VectorFP with the best found parameterization (aggregated over all targets). However, we emphasize that the VectorFp performance strongly depends on the selected parameterization (see section IV-B) and since the parameterization optimization is a hard (and separate) problem, there is still room for improvement. Moreover, in this comparison we use a single parameterization for all targets which is not the optimal and intended use of VectorFp, but we find it useful in order to get a rough comparison with the other existing methods. To denote the other fingerprints we use abbreviations from the original article [18] containing also the details about the remaining fingerprints. The best parameterization we obtained in our experiments in terms of average auc (average of auc over all data sets) was nhbdon_lipinski,nn. This parameterization utilizes two descriptors
7 Computer Science & Information Technology (CS & IT) 237 nhbdon_lipinski and nn, where each descriptor occupies 16 bits in the final representation. We denote this parameterization further in the text as vectorfp. As already stated, the VectorFp is designed as a generic representation that should be rather used with parameterization based on the given task. In order to demonstrate the potential of VectorFp, we defined virtual VectorFp (vvectorfp). To get the results for vvectorfp, we select the best tested parameterization for every dataset. Thus, vvectorfp can be understood as VectorFp with an oraculum that gives us the best encountered parameterization for given target. As for the source of descriptors for labelling the extracted fragments, we used the PaDEL [26] tool. PaDEL is capable of generating about 770 2D descriptors that can be easily utilized in VectorFp. To convert the descriptor values into the bitstring in the inner arrays of VectorFp we use the binning and unary coding. As the results show (Table 1.), vectorfp (with the nhbdon-lipinski,nn parametrization) is, in terms of auc, the best fingerprint for 8 out of the 88 data sets and it ends up on position on average. The best obtained average position is reached by the TT fingerprint.thus, although the single parameterization is used for multiple data sets, it is clearly comparable with the best existing approaches. On some data sets, our method is superior to all the other methods. The performance differs throughout all the data sets (Table 2.). Table 1. Aggregated performance statistics of vectorfp and vvectorfp with respect to other fingerprints name average AUC number of best results average position tt hashap rdk vectorfp laval avalon rdk ap hashtt rdk lfcfp fcfp ecfc fcfp fcfc lfcfp lecfp lecfp fcfp ecfp fcfc maccs ecfc ecfc fcfc ecfp ecfp
8 238 Computer Science & Information Technology (CS & IT) ecfc ecfp vvectorfp As can be seen most of the tested fingerprints perform reasonably well, in comparison to the others, on at least one dataset. Out of the three data sets, the MUV dataset shows up to be the hardest for vvectorfp as in 3 cases it performs strongly under average. However, the MUV dataset is the most difficult for every tested fingerprint. The goal of the MUV design is to generate sets with a spatially well distributed active and decoy molecules in a simple descriptor space. Moreover, another goal is to evenly distribute actives among the decoys which makes the MUV dataset difficult for virtual screening. The best tested parameterization for MUV shows up to be naaromatom16,eta_betap_s16,minhsnh2 with the average auc on MUV being compared to vectorfp having the average auc of As the VectorFp in fact utilizes one of the extended connectivity fingerprints (ecfp) as the underlying fingerprint, we were interested how it compares to the performance of other fingerprints from the same family. From this perspective our method performs well and outperforms most fingerprints from this family. Our method was in term of average auc over all the data sets outperformed by tt, hashtt and ap fingerprints. All those fingerprints are based on different fragments than used in current version of VectorFp. tt and hashtt use paths of length four while ap use atom pairs. This suggest that the change of fragment extraction process (underlying fingerprint) may improve the performance of VectorFp. Notice, that vvectorfp is also included in the comparison. As it is not based on a single parameterization, the values presented in Table 1. were computed without the vvectorfp, and at the end the vvectorfp was added. Thus vvectorfp results did not influence the positions of other approaches. The vvectorfp outperforms all other methods in all the presented evaluation criteria (average auc, number of best results). We believe that this demonstrates the potential of VectorFp if parameterized properly. We emphasize again that there may be a better parameterization as we tested only a very limited subset of all possible parameterizations Parameterization As a part of our experiments we systematically tested hundreds of different parameterizations focusing on various descriptors provided by PaDEL (see above). Although there are more ways of how to parameterize VectorFp, here we focused on descriptor selection only being the most result influencing part of the parameterization. To test how the amount of used descriptors per fragment influences the discriminative power of the molecular representation, we started with just one descriptor per fragment and then added more. In the preparation phase, we extracted all fragments for all chemical compounds in every data sets of the benchmarking platform. For each fragment we computed all descriptors available in PaDEL. These descriptors were the subject of a basic descriptor analysis before running the experiments themselves. The goal of the analysis was to remove descriptors which clearly did not have enough discriminative power to be used for screening.
9 Computer Science & Information Technology (CS & IT) 239 Table 2. Comparison of vvectorfp and vectorfp with other fingerprints. The colours show the relative performance of given fingerprint to others on given target (dataset). The grey cell represents the best result on given target while white represents the worst result. As the first step of the analysis we dropped all the descriptors that were constant which resulted in the elimination of 258 descriptors. In the next step we utilized variance to decide which descriptors have the potential being a useful discriminator. A descriptor taking only two values has a low chance to well discriminate thousands of compounds. As a prestep to variance analysis we had performed normalization on every descriptor. First, we had removed outliers from every descriptor (values outside the second and third quantile), then we normalized the data into the [0, 1] interval using the min-max normalization. After the normalization we computed variance (var norm ) for each descriptor. Many descriptors ended up with var norm = 0. For example in case of nacid descriptor, about 96.7% of fragments have zero value. This does not leave much space for other values, and basically divides all fragments into just few categories (3 in case of nacid). If we consider the second and third quantile only we get zero variance. This step eliminated 326 descriptors. Since we used these descriptors later in the experiments, we formed group from them
10 240 Computer Science & Information Technology (CS & IT) called constvarq (constant variance on quantiles). The remaining 185 descriptors were split into 4 groups of almost the same size based on the value of variance. The groups were called var_00_25, var_25_50, var_50_75 and var_75_ Single descriptor In the first step we evaluated the performance for selected descriptors from groups var_00_25, var_25_50, var_50_75 and var_75_100 and constvarq. The descriptors in constvarq group performed worst of all, as expected. This was caused by the fact that in many data sets the descriptors were constant and so had no discriminative power. However, despite the overall bad performance few exceptions emerged. For example, using the descriptor nacid (number of acidic groups) for target resulted in auc The tt, hashtt and ap scored 0.841, and respectively. This demonstrates that good performance can be reached even with a single simple descriptor. As for the target 20174, the best performance (0.9560) was obtained by vectorfp. Figure 2. auc performance for single descriptor parameterization among the variability groups. The horizontal lines represent the average auc for given group. The performance of all the descriptors shows Figure 2. where descriptors data points in the same group share same shape. The X-axis corresponds to the individual descriptors while the Y-axis shows the average AUC over all the targets for each of the descriptors. The horizontal line then represents the average of the descriptors performance for each of the group. We can clearly see that the descriptors in the constvarq show worse performance then descriptors in the other
11 Computer Science & Information Technology (CS & IT) 241 groups. In all the var groups we can identify several well performing descriptors. However, the group var_00_25 contains many descriptors showing very poor performance. It follows that a descriptor with low variability is likely to perform poorly. On the other hand, the descriptors with high variability (group var_75_100) also lead to worse performance than the descriptors with moderate variability (groups var_25_50 and var_50_75) Multiple descriptors In the next step, we first created pairs and then triplets of descriptors and used them to label the fragments. Thus, in the previous step each fragment was labelled by exactly one descriptor but in the second step pairs and triplets of descriptors were utilized. Our hope was that the performance would increase when using tuples in contrast to using single descriptors alone. Since we did not have sufficient computational resources to test every possible pair and triplet of descriptor we implemented a filter. The purpose of the filter is to filter out such tuples which are unlikely to lead to best results. The filter utilizes AUC (auc) of single descriptors and correlation (cor) between pairs of descriptors. Let n denote the number of descriptors that should be used in the parameterization (in our case n is 2 or 3). Let auc i denote the average AUC for i-th descriptor (out of n) and cor i,j the correlation between AUC values of i-th and j-th descriptor over data sets. Thus if two descriptors show similar AUCs over all data sets they have high correlation. In order for the tuple of descriptors to pass the filter, the following two conditions need to be satisfied: n i =0 auc > auc i Level cor < cor LevelMin < max 0 i, j n, i jcori, j LevelMax The filter is parameterized by the values auc Level, cor LevelMin and cor LevelMax. Using auc Level simply prefers tuples consisting of descriptors which behave well when used alone. The idea behind restricting the correlation is that bringing together correlated descriptors would not result in new information and thus probably would not increase the discriminative power of the resulting molecular representation. We tried several parameterizations of the filter (see Table 3.) to get a reasonable number of pairs/triplets for our experiments. The descriptors in 2B pairs are required to have cor between 0.47 and 0.6. The lower bound for cor secures that the pairs in 2A and 2B are different. As the trade of, the required auc needs to be slightly higher. As the 2B group was selected with less stress on cor it was expected that the paired descriptors would have more similar results over the data sets. The goal of different parameterizations of the filter was to test which combination of correlation and quality parameters leads to better results. The same holds for the groups of triplets. Table 3. Specification of tested filters group name auc Level cor LevelMin cor LevelMax group size 2A B A B
12 242 Computer Science & Information Technology (CS & IT) How different number of descriptors used to label the fragments influence the auc show Table. 3. and Figure 4. The Table 3. average auc for parameterizations using single descriptors (the var groups), pairs of descriptors (the 2A and 2B group) or triplets of descriptors (the 3A and 3B group) to label the fragments. As the results show the 2A-filtered pairs of descriptors perform on average significantly better those based on the 2B filter. The difference between 2A and 2B is much higher than in case of 3A and 3B. As Table 4. shows, the performance of triplets of descriptors is somewhere between the 2A and 2B based pairs. From the Figure 3 it seems that the performance of triplets of descriptors is more variable than in case of pairs. We believe that it is the consequence of the fact that there are more triplets of descriptors than there are pairs. Therefore, it is more difficult to identify the correct triplets. Thus the variance in the results of triplets of descriptor is simply due to the imperfection of the descriptor selection procedure. In case of both pairs and triplets of descriptors, the group with more restricted cor seems to provide better results, especially in terms of worst case performance. Table 4. Average reached auc for different numbers of descriptors per fragment group name average auc var_00_ var_25_ var_50_ var_75_ A B A B Figure 3. auc performance for two and three descriptors parameterizations among groups. The horizontal lines represent average of auc for given group.
13 Computer Science & Information Technology (CS & IT) CONCLUSION In this work we presented a generic molecular representation called VectorFp. The representation was tested using the recently published benchmarking platform for LBVS. Therefore, the results should be easily reproducible and results were easily comparable with other existing commonly used molecular representations. The main motivation for our work was to provide a molecular representation which could be parameterizable with specific descriptors suitable for given biological target. Even though we operated within the boundaries of the benchmarking framework by forcing us to fix the parameterization our method it still outperformed most of the existing methods. We also showed the potential of our method by creating a virtual representation vvectorfp where the best encountered parameterization for given target was used. This representation clearly outperformed all the existing approaches showing that potential strength of the method with correct parameterization. As a virtual representation demonstrates the potential of VectorFp if the right parameterization is used, it follows that the research on the parameterization will be the main direction of our future work on VectorFp. Moreover, we tested only up to three descriptors per parameterization while there are, beside the computer memory, virtually no restrictions of how many descriptors can be used. Finally, we also plan to investigate the possibility of stressing the importance of a single descriptor by its multiple application. ACKNOWLEDGEMENTS This work was supported by Grant Agency of of Charles University [project Nr ], by SVV and by the Czech Science Foundation grant P. REFERENCES [1] B. Battersby and M. Trau, (2002) Novel miniaturized systems in high throughput screening, Trends in Biotechnologi, vol. 20, pp [2] J. Besnard, G. F. Ruda, V. Setola, K. Abecassis, R. M. Rodriguiz, X. P. Huang, S. Norval, M. F. Sassano, A. I. Shin, L. A. Webster, F. R. Simeons, L. Stojanovski, A. Prat, N. G. Seidah, D. B. Constam, G. R. Bickerton, K. D. Read, W. C. Wetsel, I. H. Gilbert, B. L. Roth, and A. L. Hopkins, (Dec 2012) Automated design of ligands to polypharmacological profiles, Nature, vol. 492, no. 7428, pp [3] S. Ekins, J. Mestres, and B. Testa, (2007) In silico pharmacology for drug discovery: methods for virtual ligand screening and profiling, British Journal of Pharmacology, vol. 152, pp. 9 [4] R. S. Ferreira, A. Simeonov, A. Jadhav, O. Eidam, B. T. Mott, M. J. Keiser, J. H. McKerrow, D. J. Maloney, J. J. Irwin, and B. K. Shoichet, (Jul 2010) Complementarity between a docking and a highthroughput screen in discovering new cruzain inhibitors, J. Med. Chem., vol. 53, no. 13, pp [5] L. R. Vidler, P. Filippakopoulos, O. Fedorov, S. Picaud, S. Martin, M. Tomsett, H. Woodward, N. Brown, S. Knapp, and S. Hoelder, (Oct 2013 ) Discovery of novel small-molecule inhibitors of BRD4 using structurebased virtual screening, J. Med. Chem., vol. 56, no. 20, pp [6] K. Heikamp and J. Bajorath, (Jan 2013) The future of virtual compound screening, Chem Biol Drug Des, vol. 81, no. 1, pp [7] O. Taboureau, J. B. Baell, J. Fernandez-Recio, and B. O. Villoutreix, (Jan 2012) Established and emerging trends in computational drug discovery in the structural genomics era, Chem. Biol., vol. 19, no. 1, pp , [8] P. Ripphausen, B. Nisius, L. Peltason, and J. Bajorath, (Dec 2013) Quo vadis, virtual screening? a comprehensive survey of prospective applications, Journal of Medicinal Chemistry, vol. 53, no. 24, pp
14 244 Computer Science & Information Technology (CS & IT) [9] P. Ripphausen, D. Stumpfe, and J. Bajorath, (Apr 2012) Analysis of structure based virtual screening studies and characterization of identified active compounds, Future Med Chem, vol. 4, no. 5, pp [10] W. L. Jorgensen, (Nov 1991) Rusting of the lock and key model for protein-ligand binding, Science, vol. 254, no. 5034, pp [11] M. A. Johnson and G. M. Maggiora, (1990) Concepts and Applications of Molecular Similarity. Wiley-Interscience [12] L. Xue and J. Bajorath, (Oct 2000) Molecular descriptors in chemoinformatics, computational combinatorial chemistry, and virtual screening, Comb. Chem. High Throughput Screen., vol. 3, no. 5, pp [13] J.-L. Faulon and A. Bender, (Apr 2010) Handbook of Chemoinformatics Algorithms (Chapman & Hall/CRC Mathematical & Computational Biology), 1st ed. Chapman and Hall/CRC [14] R. Nilakantan, N. Bauman, J. S. Dixon, and R. Venkataraghavan, (1987) Topological torsion: a new molecular descriptor for sar applications. comparison with other descriptors, Journal of Chemical Information and Computer Sciences, vol. 27, no. 2, pp , [15] J. Hert, P. Willett, D. J. Wilton, P. Acklin, K. Azzaoui, E. Jacoby, and A. Schuffenhauer, (2004) Comparison of topological descriptors for similarity-based virtual screening using multiple bioactive reference structures, Organic and Biomolecular Chemistry, vol. 2, pp [16] H. L. Morgan, The generation of a unique machine description for chemical structures-a technique developed at chemical abstracts service. Journal of Chemical Documentation, vol. 5, no. 2, pp [17] (2014) Rdkit: Cheminformatics and machine learning softwares, [Online]. Available: [18] S. Riniker and G. Landrum, (2013) Open-source platform to benchmark fingerprints for ligand-based virtual screening, Journal of Cheminformatics, vol. 5, no. 1, p. 26, [19] (2014) Python, [Online]. Available: [20] R. P. Sheridan, S. B. Singh, E. M. Fluder, and S. K. Kearsley, (2001) Protocols for bridging the peptide to nonpeptide gap in topological similarity searches, Journal of Chemical Information and Computer Sciences, vol. 41, no. 5, pp [21] J.-F. Truchon and C. I. Bayly, (2007) Evaluating virtual screening methods: Good and bad metrics for the early recognition problem, Journal of Chemical Information and Modeling, vol. 47, no. 2, pp [22] N. Huang, B. K. Shoichet, and J. J. Irwin, (2006) Benchmarking sets for molecular docking, Journal of Medicinal Chemistry, vol. 49, no. 23, pp , [23] K. Heikamp and J. Bajorath, (2003) Large-scale similarity search profiling of chembl compound data sets, Journal of Chemical Information and Modeling, vol. 51, no. 8, pp [24] S. G. Rohrer and K. Baumann, (2009) Maximum unbiased validation (muv) data sets for virtual screening based on pubchem bioactivity data, Journal of Chemical Information and Modeling, vol. 49, no. 2, pp , [25] D. Hoksza and P. Škoda, (2014) 2d pharmacophore query generation, in Bioinformatics Research and Applications, ser. Lecture Notes in Computer Science, M. Basu, Y. Pan, and J. Wang, Eds. Springer International Publishing, vol. 8492, pp [26] C. W. Yap (2011) Padel-descriptor: An open source software to calculate molecular descriptors and fingerprints, Journal of Computational Chemistry, vol. 32, p AUTHORS Petr Skoda received his master degree in 2014 from Faculty of Mathematics and Physics at the Charles University in Prague. Since 2014 he is Ph.D. studentunder the supervision of David Hoksza. His main research interest is chemoinformatics. David Hoksza received his Ph.D. in 2010 from the Dept. of Software Engineering, Charles University inprague. Since 2011 he is an associated professor of software engineering in the department of Software Engineering at the CharlesUniversity in Prague. His research interests include structural bioinformatics, chemoinformatics, data engineering and similarity searching.
Introduction. OntoChem
Introduction ntochem Providing drug discovery knowledge & small molecules... Supporting the task of medicinal chemistry Allows selecting best possible small molecule starting point From target to leads
More informationSimilarity Search. Uwe Koch
Similarity Search Uwe Koch Similarity Search The similar property principle: strurally similar molecules tend to have similar properties. However, structure property discontinuities occur frequently. Relevance
More informationhas its own advantages and drawbacks, depending on the questions facing the drug discovery.
2013 First International Conference on Artificial Intelligence, Modelling & Simulation Comparison of Similarity Coefficients for Chemical Database Retrieval Mukhsin Syuib School of Information Technology
More informationUniversities of Leeds, Sheffield and York
promoting access to White Rose research papers Universities of Leeds, Sheffield and York http://eprints.whiterose.ac.uk/ This is an author produced version of a paper published in Organic & Biomolecular
More informationDrug Design 2. Oliver Kohlbacher. Winter 2009/ QSAR Part 4: Selected Chapters
Drug Design 2 Oliver Kohlbacher Winter 2009/2010 11. QSAR Part 4: Selected Chapters Abt. Simulation biologischer Systeme WSI/ZBIT, Eberhard-Karls-Universität Tübingen Overview GRIND GRid-INDependent Descriptors
More informationCOMPARISON OF SIMILARITY METHOD TO IMPROVE RETRIEVAL PERFORMANCE FOR CHEMICAL DATA
http://www.ftsm.ukm.my/apjitm Asia-Pacific Journal of Information Technology and Multimedia Jurnal Teknologi Maklumat dan Multimedia Asia-Pasifik Vol. 7 No. 1, June 2018: 91-98 e-issn: 2289-2192 COMPARISON
More informationVirtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME
Virtual Libraries and Virtual Screening in Drug Discovery Processes using KNIME Iván Solt Solutions for Cheminformatics Drug Discovery Strategies for known targets High-Throughput Screening (HTS) Cells
More informationMolecular Complexity Effects and Fingerprint-Based Similarity Search Strategies
Molecular Complexity Effects and Fingerprint-Based Similarity Search Strategies Dissertation zur Erlangung des Doktorgrades (Dr. rer. nat.) der Mathematisch-aturwissenschaftlichen Fakultät der Rheinischen
More informationAdam Filandr. Utilizing simulated annealing for molecular fingerprints optimization for virtual screening
BACHELOR THESIS Adam Filandr Utilizing simulated annealing for molecular fingerprints optimization for virtual screening Department of Software Engineering Supervisor of the bachelor thesis: Study programme:
More informationDrug Informatics for Chemical Genomics...
Drug Informatics for Chemical Genomics... An Overview First Annual ChemGen IGERT Retreat Sept 2005 Drug Informatics for Chemical Genomics... p. Topics ChemGen Informatics The ChemMine Project Library Comparison
More informationChemoinformatics and information management. Peter Willett, University of Sheffield, UK
Chemoinformatics and information management Peter Willett, University of Sheffield, UK verview What is chemoinformatics and why is it necessary Managing structural information Typical facilities in chemoinformatics
More informationA Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors
A Tiered Screen Protocol for the Discovery of Structurally Diverse HIV Integrase Inhibitors Rajarshi Guha, Debojyoti Dutta, Ting Chen and David J. Wild School of Informatics Indiana University and Dept.
More informationA Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery
AtomNet A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery Izhar Wallach, Michael Dzamba, Abraham Heifets Victor Storchan, Institute for Computational and
More informationNavigation in Chemical Space Towards Biological Activity. Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland
Navigation in Chemical Space Towards Biological Activity Peter Ertl Novartis Institutes for BioMedical Research Basel, Switzerland Data Explosion in Chemistry CAS 65 million molecules CCDC 600 000 structures
More informationComputational chemical biology to address non-traditional drug targets. John Karanicolas
Computational chemical biology to address non-traditional drug targets John Karanicolas Our computational toolbox Structure-based approaches Ligand-based approaches Detailed MD simulations 2D fingerprints
More informationBenchmarking of Multivariate Similarity Measures for High-Content Screening Fingerprints in Phenotypic Drug Discovery
501390JBXXXX10.1177/1087057113501390Journal of Biomolecular ScreeningReisen et al. research-article2013 Original Research Benchmarking of Multivariate Similarity Measures for High-Content Screening Fingerprints
More informationChemogenomic: Approaches to Rational Drug Design. Jonas Skjødt Møller
Chemogenomic: Approaches to Rational Drug Design Jonas Skjødt Møller Chemogenomic Chemistry Biology Chemical biology Medical chemistry Chemical genetics Chemoinformatics Bioinformatics Chemoproteomics
More informationSimilarity methods for ligandbased virtual screening
Similarity methods for ligandbased virtual screening Peter Willett, University of Sheffield Computers in Scientific Discovery 5, 22 nd July 2010 Overview Molecular similarity and its use in virtual screening
More informationCheminformatics analysis and learning in a data pipelining environment
Molecular Diversity (2006) 10: 283 299 DOI: 10.1007/s11030-006-9041-5 c Springer 2006 Review Cheminformatics analysis and learning in a data pipelining environment Moises Hassan 1,, Robert D. Brown 1,
More informationDe Novo molecular design with Deep Reinforcement Learning
De Novo molecular design with Deep Reinforcement Learning @olexandr Olexandr Isayev, Ph.D. University of North Carolina at Chapel Hill olexandr@unc.edu http://olexandrisayev.com About me Ph.D. in Chemistry
More informationVirtual affinity fingerprints in drug discovery: The Drug Profile Matching method
Ágnes Peragovics Virtual affinity fingerprints in drug discovery: The Drug Profile Matching method PhD Theses Supervisor: András Málnási-Csizmadia DSc. Associate Professor Structural Biochemistry Doctoral
More informationInteractive Feature Selection with
Chapter 6 Interactive Feature Selection with TotalBoost g ν We saw in the experimental section that the generalization performance of the corrective and totally corrective boosting algorithms is comparable.
More informationESPRESSO (Extremely Speedy PRE-Screening method with Segmented compounds) 1
Vol.2016-MPS-108 o.18 Vol.2016-BI-46 o.18 ESPRESS 1,4,a) 2,4 2,4 1,3 1,3,4 1,3,4 - ESPRESS (Extremely Speedy PRE-Screening method with Segmented cmpounds) 1 Glide HTVS ESPRESS 2,900 200 ESPRESS: An ultrafast
More informationUsing Phase for Pharmacophore Modelling. 5th European Life Science Bootcamp March, 2017
Using Phase for Pharmacophore Modelling 5th European Life Science Bootcamp March, 2017 Phase: Our Pharmacohore generation tool Significant improvements to Phase methods in 2016 New highly interactive interface
More informationCross Discipline Analysis made possible with Data Pipelining. J.R. Tozer SciTegic
Cross Discipline Analysis made possible with Data Pipelining J.R. Tozer SciTegic System Genesis Pipelining tool created to automate data processing in cheminformatics Modular system built with generic
More informationIn silico pharmacology for drug discovery
In silico pharmacology for drug discovery In silico drug design In silico methods can contribute to drug targets identification through application of bionformatics tools. Currently, the application of
More informationEvaluation of Molecular Similarity and Molecular Diversity Methods Using Biological Activity Data
2 Evaluation of Molecular Similarity and Molecular Diversity Methods Using Biological Activity Data Peter Willett Abstract This chapter reviews the techniques available for quantifying the effectiveness
More informationDr. Sander B. Nabuurs. Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre
Dr. Sander B. Nabuurs Computational Drug Discovery group Center for Molecular and Biomolecular Informatics Radboud University Medical Centre The road to new drugs. How to find new hits? High Throughput
More informationMachine Learning Concepts in Chemoinformatics
Machine Learning Concepts in Chemoinformatics Martin Vogt B-IT Life Science Informatics Rheinische Friedrich-Wilhelms-Universität Bonn BigChem Winter School 2017 25. October Data Mining in Chemoinformatics
More informationStudying the effect of noise on Laplacian-modified Bayesian Analysis and Tanimoto Similarity
Studying the effect of noise on Laplacian-modified Bayesian nalysis and Tanimoto Similarity David Rogers, Ph.D. SciTegic, Inc. (Division of ccelrys, Inc.) drogers@scitegic.com Description of: nalysis methods
More informationMachine learning for ligand-based virtual screening and chemogenomics!
Machine learning for ligand-based virtual screening and chemogenomics! Jean-Philippe Vert Institut Curie - INSERM U900 - Mines ParisTech In silico discovery of molecular probes and drug-like compounds:
More informationUniversities of Leeds, Sheffield and York
promoting access to White Rose research papers Universities of Leeds, Sheffield and York http://eprints.whiterose.ac.uk/ This is an author produced version of a paper published in Statistical Analysis
More informationUsing AutoDock for Virtual Screening
Using AutoDock for Virtual Screening CUHK Croucher ASI Workshop 2011 Stefano Forli, PhD Prof. Arthur J. Olson, Ph.D Molecular Graphics Lab Screening and Virtual Screening The ultimate tool for identifying
More informationDATA FUSION APPROACHES IN LIGAND-BASED VIRTUAL SCREENING: RECENT DEVELOPMENTS OVERVIEW
DATA FUSION APPROACHES IN LIGAND-BASED VIRTUAL SCREENING: RECENT DEVELOPMENTS OVERVIEW Mubarak Himmat 1, Naomie Salim 1, Ali Ahmed 1, 2, and Mohammed Mumtaz Al-Dabbagh 1 1 Faculty of Computing, University
More informationThe Schrödinger KNIME extensions
The Schrödinger KNIME extensions Computational Chemistry and Cheminformatics in a workflow environment Jean-Christophe Mozziconacci Volker Eyrich Topics What are the Schrödinger extensions? Workflow application
More informationAn Enhancement of Bayesian Inference Network for Ligand-Based Virtual Screening using Features Selection
American Journal of Applied Sciences 8 (4): 368-373, 2011 ISSN 1546-9239 2010 Science Publications An Enhancement of Bayesian Inference Network for Ligand-Based Virtual Screening using Features Selection
More informationIntroducing a Bioinformatics Similarity Search Solution
Introducing a Bioinformatics Similarity Search Solution 1 Page About the APU 3 The APU as a Driver of Similarity Search 3 Similarity Search in Bioinformatics 3 POC: GSI Joins Forces with the Weizmann Institute
More informationXia Ning,*, Huzefa Rangwala, and George Karypis
J. Chem. Inf. Model. XXXX, xxx, 000 A Multi-Assay-Based Structure-Activity Relationship Models: Improving Structure-Activity Relationship Models by Incorporating Activity Information from Related Targets
More informationUniversities of Leeds, Sheffield and York
promoting access to White Rose research papers Universities of Leeds, Sheffield and York http://eprints.whiterose.ac.uk/ This is an author produced version of a paper published in Quantitative structure
More informationIntroduction to Chemoinformatics and Drug Discovery
Introduction to Chemoinformatics and Drug Discovery Irene Kouskoumvekaki Associate Professor February 15 th, 2013 The Chemical Space There are atoms and space. Everything else is opinion. Democritus (ca.
More informationMSc Drug Design. Module Structure: (15 credits each) Lectures and Tutorials Assessment: 50% coursework, 50% unseen examination.
Module Structure: (15 credits each) Lectures and Assessment: 50% coursework, 50% unseen examination. Module Title Module 1: Bioinformatics and structural biology as applied to drug design MEDC0075 In the
More informationAnalysis of a Large Structure/Biological Activity. Data Set Using Recursive Partitioning and. Simulated Annealing
Analysis of a Large Structure/Biological Activity Data Set Using Recursive Partitioning and Simulated Annealing Student: Ke Zhang MBMA Committee: Dr. Charles E. Smith (Chair) Dr. Jacqueline M. Hughes-Oliver
More informationVirtual screening in drug discovery
Virtual screening in drug discovery Pavel Polishchuk Institute of Molecular and Translational Medicine Palacky University pavlo.polishchuk@upol.cz Drug development workflow Vistoli G., et al., Drug Discovery
More informationHow Diverse Are Diversity Assessment Methods? A Comparative Analysis and Benchmarking of Molecular Descriptor Space
pubs.acs.org/jcim How Diverse Are Diversity Assessment Methods? A Comparative Analysis and Benchmarking of Molecular Descriptor Space Alexios Koutsoukas,, Shardul Paricharak,,, Warren R. J. D. Galloway,
More informationResearch Article. Chemical compound classification based on improved Max-Min kernel
Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2014, 6(2):368-372 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 Chemical compound classification based on improved
More informationDOCKING TUTORIAL. A. The docking Workflow
2 nd Strasbourg Summer School on Chemoinformatics VVF Obernai, France, 20-24 June 2010 E. Kellenberger DOCKING TUTORIAL A. The docking Workflow 1. Ligand preparation It consists in the standardization
More informationHit Finding and Optimization Using BLAZE & FORGE
Hit Finding and Optimization Using BLAZE & FORGE Kevin Cusack,* Maria Argiriadi, Eric Breinlinger, Jeremy Edmunds, Michael Hoemann, Michael Friedman, Sami Osman, Raymond Huntley, Thomas Vargo AbbVie, Immunology
More informationThe PhilOEsophy. There are only two fundamental molecular descriptors
The PhilOEsophy There are only two fundamental molecular descriptors Where can we use shape? Virtual screening More effective than 2D Lead-hopping Shape analogues are not graph analogues Molecular alignment
More informationLarge scale classification of chemical reactions from patent data
Large scale classification of chemical reactions from patent data Gregory Landrum NIBR Informatics, Basel Novartis Institutes for BioMedical Research 10th International Conference on Chemical Structures/
More informationAn Integrated Approach to in-silico
An Integrated Approach to in-silico Screening Joseph L. Durant Jr., Douglas. R. Henry, Maurizio Bronzetti, and David. A. Evans MDL Information Systems, Inc. 14600 Catalina St., San Leandro, CA 94577 Goals
More informationCOMBINATORIAL CHEMISTRY IN A HISTORICAL PERSPECTIVE
NUE FEATURE T R A N S F O R M I N G C H A L L E N G E S I N T O M E D I C I N E Nuevolution Feature no. 1 October 2015 Technical Information COMBINATORIAL CHEMISTRY IN A HISTORICAL PERSPECTIVE A PROMISING
More informationPlan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.
Plan Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics. Exercise: Example and exercise with herg potassium channel: Use of
More informationPractical QSAR and Library Design: Advanced tools for research teams
DS QSAR and Library Design Webinar Practical QSAR and Library Design: Advanced tools for research teams Reservationless-Plus Dial-In Number (US): (866) 519-8942 Reservationless-Plus International Dial-In
More informationIn silico generation of novel, drug-like chemical matter using the LSTM deep neural network
In silico generation of novel, drug-like chemical matter using the LSTM deep neural network Peter Ertl Novartis Institutes for BioMedical Research, Basel, CH September 2018 Neural networks in cheminformatics
More informationA Framework For Genetic-Based Fusion Of Similarity Measures In Chemical Compound Retrieval
A Framework For Genetic-Based Fusion Of Similarity Measures In Chemical Compound Retrieval Naomie Salim Faculty of Computer Science and Information Systems Universiti Teknologi Malaysia naomie@fsksm.utm.my
More informationKNIME-based scoring functions in Muse 3.0. KNIME User Group Meeting 2013 Fabian Bös
KIME-based scoring functions in Muse 3.0 KIME User Group Meeting 2013 Fabian Bös Certara Mission: End-to-End Model-Based Drug Development Certara was formed by acquiring and integrating Tripos, Pharsight,
More informationAn Unbiased Method To Build Benchmarking Sets for Ligand-Based Virtual Screening and its Application To GPCRs
This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes. pubs.acs.org/jcim Downloaded
More informationStructural biology and drug design: An overview
Structural biology and drug design: An overview livier Taboureau Assitant professor Chemoinformatics group-cbs-dtu otab@cbs.dtu.dk Drug discovery Drug and drug design A drug is a key molecule involved
More informationMolecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems.
Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems. Roberto Todeschini Milano Chemometrics and QSAR Research Group - Dept. of
More informationarxiv: v1 [cs.ds] 3 Feb 2018
A Model for Learned Bloom Filters and Related Structures Michael Mitzenmacher 1 arxiv:1802.00884v1 [cs.ds] 3 Feb 2018 Abstract Recent work has suggested enhancing Bloom filters by using a pre-filter, based
More informationest Drive K20 GPUs! Experience The Acceleration Run Computational Chemistry Codes on Tesla K20 GPU today
est Drive K20 GPUs! Experience The Acceleration Run Computational Chemistry Codes on Tesla K20 GPU today Sign up for FREE GPU Test Drive on remotely hosted clusters www.nvidia.com/gputestd rive Shape Searching
More informationIgnasi Belda, PhD CEO. HPC Advisory Council Spain Conference 2015
Ignasi Belda, PhD CEO HPC Advisory Council Spain Conference 2015 Business lines Molecular Modeling Services We carry out computational chemistry projects using our selfdeveloped and third party technologies
More informationbcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012
bcl::cheminfo Suite Enables Machine Learning-Based Drug Discovery Using GPUs Edward W. Lowe, Jr. Nils Woetzel May 17, 2012 Outline Machine Learning Cheminformatics Framework QSPR logp QSAR mglur 5 CYP
More informationarxiv: v1 [cs.ds] 25 Jan 2016
A Novel Graph-based Approach for Determining Molecular Similarity Maritza Hernandez 1, Arman Zaribafiyan 1,2, Maliheh Aramon 1, and Mohammad Naghibi 3 1 1QB Information Technologies (1QBit), Vancouver,
More informationStructure-Activity Modeling - QSAR. Uwe Koch
Structure-Activity Modeling - QSAR Uwe Koch QSAR Assumption: QSAR attempts to quantify the relationship between activity and molecular strcucture by correlating descriptors with properties Biological activity
More informationNext Generation Computational Chemistry Tools to Predict Toxicity of CWAs
Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs William (Bill) Welsh welshwj@umdnj.edu Prospective Funding by DTRA/JSTO-CBD CBIS Conference 1 A State-wide, Regional and National
More informationDevelopment of a Structure Generator to Explore Target Areas on Chemical Space
Development of a Structure Generator to Explore Target Areas on Chemical Space Kimito Funatsu Department of Chemical System Engineering, This materials will be published on Molecular Informatics Drug Development
More informationQSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression
APPLICATION NOTE QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression GAINING EFFICIENCY IN QUANTITATIVE STRUCTURE ACTIVITY RELATIONSHIPS ErbB1 kinase is the cell-surface receptor
More informationLigandScout. Automated Structure-Based Pharmacophore Model Generation. Gerhard Wolber* and Thierry Langer
LigandScout Automated Structure-Based Pharmacophore Model Generation Gerhard Wolber* and Thierry Langer * E-Mail: wolber@inteligand.com Pharmacophores from LigandScout Pharmacophores & the Protein Data
More informationCOMBINATORIAL CHEMISTRY: CURRENT APPROACH
COMBINATORIAL CHEMISTRY: CURRENT APPROACH Dwivedi A. 1, Sitoke A. 2, Joshi V. 3, Akhtar A.K. 4* and Chaturvedi M. 1, NRI Institute of Pharmaceutical Sciences, Bhopal, M.P.-India 2, SRM College of Pharmacy,
More informationChemical library design
Chemical library design Pavel Polishchuk Institute of Molecular and Translational Medicine Palacky University pavlo.polishchuk@upol.cz Drug development workflow Vistoli G., et al., Drug Discovery Today,
More informationSolved and Unsolved Problems in Chemoinformatics
Solved and Unsolved Problems in Chemoinformatics Johann Gasteiger Computer-Chemie-Centrum University of Erlangen-Nürnberg D-91052 Erlangen, Germany Johann.Gasteiger@fau.de Overview objectives of lecture
More informationAMRI COMPOUND LIBRARY CONSORTIUM: A NOVEL WAY TO FILL YOUR DRUG PIPELINE
AMRI COMPOUD LIBRARY COSORTIUM: A OVEL WAY TO FILL YOUR DRUG PIPELIE Muralikrishna Valluri, PhD & Douglas B. Kitchen, PhD Summary The creation of high-quality, innovative small molecule leads is a continual
More informationFast similarity searching making the virtual real. Stephen Pickett, GSK
Fast similarity searching making the virtual real Stephen Pickett, GSK Introduction Introduction to similarity searching Use cases Why is speed so crucial? Why MadFast? Some performance stats Implementation
More informationBenchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid descriptor sets
van Westen et al. Journal of Cheminformatics 2013, 5:42 RESEARCH ARTICLE Open Access Benchmarking of protein descriptor sets in proteochemometric modeling (part 2): modeling performance of 13 amino acid
More informationComparison of Descriptor Spaces for Chemical Compound Retrieval and Classification
Knowledge and Information Systems (20XX) Vol. X: 1 29 c 20XX Springer-Verlag London Ltd. Comparison of Descriptor Spaces for Chemical Compound Retrieval and Classification Nikil Wale Department of Computer
More informationhsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference
CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science
More informationDesign and characterization of chemical space networks
Design and characterization of chemical space networks Martin Vogt B-IT Life Science Informatics Rheinische Friedrich-Wilhelms-University Bonn 16 August 2015 Network representations of chemical spaces
More informationElectrical and Computer Engineering Department University of Waterloo Canada
Predicting a Biological Response of Molecules from Their Chemical Properties Using Diverse and Optimized Ensembles of Stochastic Gradient Boosting Machine By Tarek Abdunabi and Otman Basir Electrical and
More informationContents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics
Contents 1 Open-Source Tools, Techniques, and Data in Chemoinformatics... 1 1.1 Chemoinformatics... 2 1.1.1 Open-Source Tools... 2 1.1.2 Introduction to Programming Languages... 3 1.2 Chemical Structure
More informationThis is a repository copy of Multiple search methods for similarity-based virtual screening: analysis of search overlap and precision.
This is a repository copy of Multiple search methods for similarity-based virtual screening: analysis of search overlap and precision. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/74399/
More informationReceptor Based Drug Design (1)
Induced Fit Model For more than 100 years, the behaviour of enzymes had been explained by the "lock-and-key" mechanism developed by pioneering German chemist Emil Fischer. Fischer thought that the chemicals
More informationProgress of Compound Library Design Using In-silico Approach for Collaborative Drug Discovery
21 th /June/2018@CUGM Progress of Compound Library Design Using In-silico Approach for Collaborative Drug Discovery Kaz Ikeda, Ph.D. Keio University Self Introduction Keio University, Tokyo, Japan (Established
More informationUsing Self-Organizing maps to accelerate similarity search
YOU LOGO Using Self-Organizing maps to accelerate similarity search Fanny Bonachera, Gilles Marcou, Natalia Kireeva, Alexandre Varnek, Dragos Horvath Laboratoire d Infochimie, UM 7177. 1, rue Blaise Pascal,
More informationDivCalc: A Utility for Diversity Analysis and Compound Sampling
Molecules 2002, 7, 657-661 molecules ISSN 1420-3049 http://www.mdpi.org DivCalc: A Utility for Diversity Analysis and Compound Sampling Rajeev Gangal* SciNova Informatics, 161 Madhumanjiri Apartments,
More informationStructural Bioinformatics (C3210) Molecular Docking
Structural Bioinformatics (C3210) Molecular Docking Molecular Recognition, Molecular Docking Molecular recognition is the ability of biomolecules to recognize other biomolecules and selectively interact
More informationPlan. Day 2: Exercise on MHC molecules.
Plan Day 1: What is Chemoinformatics and Drug Design? Methods and Algorithms used in Chemoinformatics including SVM. Cross validation and sequence encoding Example and exercise with herg potassium channel:
More informationKinome-wide Activity Models from Diverse High-Quality Datasets
Kinome-wide Activity Models from Diverse High-Quality Datasets Stephan C. Schürer*,1 and Steven M. Muskal 2 1 Department of Molecular and Cellular Pharmacology, Miller School of Medicine and Center for
More informationRelative Drug Likelihood: Going beyond Drug-Likeness
Relative Drug Likelihood: Going beyond Drug-Likeness ACS Fall National Meeting, August 23rd 2012 Matthew Segall, Iskander Yusof Optibrium, StarDrop, Auto-Modeller and Glowing Molecule are trademarks of
More informationIntegrated Cheminformatics to Guide Drug Discovery
Integrated Cheminformatics to Guide Drug Discovery Matthew Segall, Ed Champness, Peter Hunt, Tamsin Mansley CINF Drug Discovery Cheminformatics Approaches August 23 rd 2017 Optibrium, StarDrop, Auto-Modeller,
More informationTargeting protein-protein interactions: A hot topic in drug discovery
Michal Kamenicky; Maria Bräuer; Katrin Volk; Kamil Ödner; Christian Klein; Norbert Müller Targeting protein-protein interactions: A hot topic in drug discovery 104 Biomedizin Innovativ patientinnenfokussierte,
More informationModel Accuracy Measures
Model Accuracy Measures Master in Bioinformatics UPF 2017-2018 Eduardo Eyras Computational Genomics Pompeu Fabra University - ICREA Barcelona, Spain Variables What we can measure (attributes) Hypotheses
More informationJCICS Major Research Areas
JCICS Major Research Areas Chemical Information Text Searching Structure and Substructure Searching Databases Patents George W.A. Milne C571 Lecture Fall 2002 1 JCICS Major Research Areas Chemical Computation
More informationChemical Similarity Searching
J. Chem. Inf. Comput. Sci. 1998, 38, 983-996 983 Chemical Similarity Searching Peter Willett* Krebs Institute for Biomolecular Research and Department of Information Studies, University of Sheffield, Sheffield
More informationCondensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule
Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule Frank Hoonakker 1,3, Nicolas Lachiche 2, Alexandre Varnek 3, and Alain Wagner 3,4 1 Chemoinformatics laboratory,
More informationA short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie
A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie Computational Biology Program Memorial Sloan-Kettering Cancer Center http://cbio.mskcc.org/leslielab
More informationThis is a repository copy of Chemoinformatics techniques for data mining in files of two-dimensional and three-dimensional chemical molecules.
This is a repository copy of Chemoinformatics techniques for data mining in files of two-dimensional and three-dimensional chemical molecules. White Rose Research Online URL for this paper: http://eprints.whiterose.ac.uk/8425/
More informationData Mining in the Chemical Industry. Overview of presentation
Data Mining in the Chemical Industry Glenn J. Myatt, Ph.D. Partner, Myatt & Johnson, Inc. glenn.myatt@gmail.com verview of presentation verview of the chemical industry Example of the pharmaceutical industry
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationAlgorithms for Data Science
Algorithms for Data Science CSOR W4246 Eleni Drinea Computer Science Department Columbia University Tuesday, December 1, 2015 Outline 1 Recap Balls and bins 2 On randomized algorithms 3 Saving space: hashing-based
More information