
2013 First International Conference on Artificial Intelligence, Modelling & Simulation

Comparison of Similarity Coefficients for Chemical Database Retrieval

Mukhsin Syuib
School of Information Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia

Shereena M. Arif
School of Information Technology, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor, Malaysia

Nurul Malim
Pusat Pengajian Sains Komputer, Universiti Sains Malaysia, Pulau Pinang, Malaysia

Abstract: Similarity-based virtual screening is used in drug discovery to evaluate large numbers of chemical molecules rapidly with a computational model. Similarity searches use 2D or 3D fingerprints and a similarity coefficient to calculate the structural resemblance between each molecule in a chemical database and a target structure. The objective of this work is to determine the best coefficient to use in similarity searching. This paper describes an experiment that performs molecular similarity searching with different similarity coefficients, using 2D UNITY and ECFP4 fingerprints on five activity classes. We also highlight the different similarity values and the best results obtained with each similarity measure, which can depend on the type of fingerprint used. We conclude that every combination of measure has its own advantages, and that the nature of the molecular activity class can also play an important role in obtaining the best possible results.

Keywords: chemoinformatics, virtual screening, 2D fingerprints, similarity measure.

I. INTRODUCTION

Chemoinformatics is a new discipline that emerged from several older disciplines such as computational chemistry, computer chemistry and chemical information (Xu and Hagler, 2002). It involves the use of computer technology to process chemical data.
What differentiates chemical data processing from other data processing is the requirement to work with chemical structures. This requirement necessitated the introduction of special approaches to represent, store and retrieve structures in a computer system. According to Xu and Hagler (2002), chemoinformatics has two roles in drug discovery. First, it should be able to extract knowledge from large-scale raw high-throughput screening databases in less time. Second, it should be able to provide efficient computational tools to predict ADMET properties (i.e. a set of tests in drug discovery that determine whether a lead can be a potential drug for human consumption).

The searching of a chemical library in silico is called virtual screening. Virtual screening is a method for boosting the efficiency of lead-discovery programs in the pharmaceutical and agrochemical industries (Werner et al, 2003). Similarity searching is one of the virtual screening methods. It finds chemical structures that resemble a known bioactive molecule, such as a hit from a high-throughput screening experiment. This molecule, hereafter referred to as the target structure, is compared with each molecule in a database of 2D or 3D chemical structures by calculating a measure of the degree of structural resemblance between the target structure and the database structure.

Virtual screening is typically used to remove molecules that are not expected or desired from the library. Performing virtual screening at this early stage reduces the number of compounds that are investigated further, so the cost and time of drug discovery can be managed efficiently.

II.
SIMILARITY COEFFICIENTS

To measure the similarity between two molecular fingerprints X and Y, we first specify the following attributes:

a = number of bits where X is on and Y is off
b = number of bits where X is off and Y is on
c = number of bits where both X and Y are on
d = number of bits where both X and Y are off
n = total number of bits for a molecule

Example: for a pair of nine-bit fingerprints X and Y, we obtain a = 2, b = 4, c = 3, d = 0 and n = 9.

Once each attribute is determined, we can use a similarity coefficient to find the pairwise similarity values. Next, we filter out the inactive compounds among the top r ranked molecules, for a cut-off r that we specify. Every similarity coefficient has its own advantages and drawbacks, depending on the questions facing the drug discovery project.

III. BINARY SIMILARITY COEFFICIENTS

The descriptor of a molecule can be binary (1, 0), numeric or categorical. In chemoinformatics, these descriptor strings are called fingerprints. Binary descriptors are especially useful, as there are highly efficient computer algorithms that work with binary strings. Figure 1 shows one example of hashing in a binary descriptor, where one element can be represented by many bits and vice versa.

Figure 1. Binary fingerprint (Source: …)

A similarity coefficient is used to calculate the similarity between the reference and target fingerprints. Many similarity coefficients derived from the text retrieval field are also used in chemoinformatics. In this paper, we describe only seven coefficients that have been widely used in this field (Table I). Only three of these are used here: Tanimoto, Russell-Rao and Euclidean distance. According to Werner et al (2003), the Russell-Rao, Kulczynski and Forbes coefficients have been found to be effective for similarity searching in their laboratory, and they would appear to have a straightforward extension to continuous form. Thus, we use Russell-Rao in binary (dichotomous) form to test this hypothesis.

UNITY fingerprints provide a richer description than the classic fingerprint known as MACCS keys, which simply represents the absence or presence of a small library of functional groups. UNITY fingerprints incorporate a much broader range of features, including connected bond-path fragments up to seven bonds long. The ECFP series of fingerprints used in Pipeline Pilot, on the other hand, uses a different algorithm to code path lengths of four bonds (ECFP4), six bonds (ECFP6) or higher.

We use the commercial MDDR 2007 database subscribed from Accelrys Inc (available from …). ECFP4 fingerprints were generated with the Pipeline Pilot software, which is the authoring tool for the Accelrys Enterprise Platform, while UNITY fingerprints were generated using the description from Tripos Inc (available from …). These molecular information databases provide molecular structures, molecular weights, and other physical and chemical data (Zhang, 2007). MDDR 2007 contains 102,514 different molecules.

TABLE I. LIST OF SIMILARITY COEFFICIENTS FOR 2D FINGERPRINTS

No  Coefficient         Formula      Other name
1   Tanimoto            Binary form  Jaccard coefficient
2   Cosine              Binary form  Ochiai coefficient
3   Forbes              Binary form  None
4   Euclidean distance  Binary form  None
5   Dice                Binary form  Czekanowski coefficient or Sorensen coefficient
6   Russell-Rao         Binary form  None
7   Soergel distance    Binary form  None

IV. EXPERIMENT

This work applies the similarity measures described above to five activity classes. Table II shows the classes of molecules used in this virtual screening experiment.
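The bit counts defined in Section II, and the three coefficients used in this work, can be sketched in Python as follows. This is a minimal illustration and not the authors' C program. The formula column of Table I is not preserved in this copy, so the sketch assumes the commonly used binary forms in the a, b, c, d notation above: Tanimoto = c / (a + b + c), Russell-Rao = c / n, and an unnormalised Euclidean distance = sqrt(a + b). The bit strings X and Y are hypothetical; they are chosen only so that the counts match the example values a = 2, b = 4, c = 3, d = 0.

```python
from math import sqrt


def bit_counts(x, y):
    """Return (a, b, c, d) for two equal-length strings of '0' and '1'."""
    if len(x) != len(y):
        raise ValueError("fingerprints must have the same length")
    a = sum(p == "1" and q == "0" for p, q in zip(x, y))  # on in X only
    b = sum(p == "0" and q == "1" for p, q in zip(x, y))  # on in Y only
    c = sum(p == "1" and q == "1" for p, q in zip(x, y))  # on in both
    d = len(x) - a - b - c                                # off in both
    return a, b, c, d


def tanimoto(a, b, c, d):
    # Assumed binary form: shared on-bits over bits on in either fingerprint.
    denom = a + b + c
    return c / denom if denom else 0.0


def russell_rao(a, b, c, d):
    # Assumed binary form: shared on-bits over the fingerprint length n.
    n = a + b + c + d
    return c / n if n else 0.0


def euclidean_distance(a, b, c, d):
    # Assumed unnormalised form: square root of the number of differing bits.
    return sqrt(a + b)


# Hypothetical nine-bit fingerprints (not the strings from the paper).
X = "111110000"
Y = "111001111"
counts = bit_counts(X, Y)
print(counts, tanimoto(*counts), russell_rao(*counts), euclidean_distance(*counts))
```

Note that Euclidean distance is a dissimilarity: smaller values mean more similar molecules, so it has to be ranked in the opposite direction from Tanimoto and Russell-Rao, or converted to a similarity first.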

TABLE II. LIST OF MDDR ACTIVITY CLASSES USED

No  Activity class name             No of active molecules
1   5HT3 antagonists                …
2   Angiotensin II AT1 antagonists  …
3   Thrombin inhibitors             …
4   Substance P antagonists         …
5   5HT reuptake inhibitors         359

Figure 2. Steps of the experiment: similarity of an activity class against the whole database

Figure 2 illustrates the experiment flow, in which the 5HT activity class is used as the set of target structures against the whole database. The algorithm compares the binary fingerprint of each member of the activity class with the database structures by applying the similarity coefficient.

STEP 1:

TABLE III. THE PSEUDOCODE TO RUN THE MAIN PROGRAM

X is the number of members of a particular activity class.

For i := 1 to X
    Copy the element from the database and paste it into a temporary query array.
    Run the C program.
End For

The algorithm in Table III shows the procedure for running the C code from a shell script. First, the algorithm loops over the members of a particular activity class. For each member, the code searches the database for the same compound id, then copies the information (compound id and binary descriptor) into the temporary query file. Finally, it applies the main algorithm for similarity searching.

STEP 2:

TABLE IV. THE BASIC PROCEDURE FOR SIMILARITY SEARCHING

n is the total number of molecules in a particular activity class and N is the total number of molecules in the whole database.

For i := 1 to n
    For j := 1 to N
        Calculate the similarity coefficient, using i as the query and j as the reference database fingerprint.
    End For
    Read the next query fingerprint.
End For
Sort the results in descending order of similarity value.
Take the top 1% of the ranking.

Table IV shows the algorithm that calculates the similarity values for every molecule in an activity class against the whole database. Once all the values are obtained, the algorithm sorts the results in decreasing order. We then take the top 1% of the ranking and write it to a flat file.
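The search procedure of Table IV can be sketched for a single query as follows. This is a hedged illustration, not the authors' C implementation: the function names, the toy database and the choice to hold each fingerprint as a Python integer (bit i set means feature i is on) are all assumptions made for the example. The scoring function is taken to be a similarity, so larger values rank first. The sketch also keeps every database entry in the ranking, because the text does not say whether the query molecule itself is excluded.

```python
import math


def tanimoto(x, y):
    """Tanimoto similarity for fingerprints stored as integers."""
    common = bin(x & y).count("1")  # bits on in both fingerprints
    union = bin(x | y).count("1")   # bits on in at least one fingerprint
    return common / union if union else 0.0


def top_fraction(query, database, score=tanimoto, fraction=0.01):
    """Rank the database against one query and return the top-ranked ids.

    database maps compound id -> fingerprint. At least one id is returned.
    """
    ranked = sorted(
        database.items(),
        key=lambda item: score(query, item[1]),
        reverse=True,  # descending similarity, as in Table IV
    )
    keep = max(1, math.ceil(fraction * len(ranked)))
    return [compound_id for compound_id, _ in ranked[:keep]]


# Hypothetical four-molecule database with four-bit fingerprints.
database = {"m1": 0b1110, "m2": 0b0111, "m3": 0b0001, "m4": 0b1111}
query = 0b1110

print(top_fraction(query, database, fraction=0.5))
print(top_fraction(query, database))
```

Repeating the call for every active in a class gives one top 1% list per query, which corresponds to the flat files used in the next step.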
The next step is to compare the flat file with the activity class file.

STEP 3:

TABLE V. THE BASIC PROCEDURE TO DETERMINE TRUE POSITIVES

P is the list of files in the directory.

For i := 1 to P
    Open the file named in the directory.
    Check each compound_id from the target file against the activity class file.
    Count the compound ids that appear in both files.
End For

The algorithm in Table V shows how to match the compounds in each file in the directory with the activity class file. From that comparison, we can classify the true and false positives, and then calculate the mean number of true positives retrieved for every activity class.

V. RESULTS AND DISCUSSION

To carry out similarity searching, we calculate the similarity value between each active in a particular activity class and each molecule in the MDDR database. Next, we rank the results in decreasing order of similarity value. For each active, we take the compound ids in the top 1% and compute the mean number of true positives, which is the measure compared in the experiment. Below we discuss our findings.
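The matching step of Table V and the mean number of true positives described above can be sketched as follows. This is a minimal illustration under stated assumptions: the retrieved lists stand in for the per-query flat files, and the compound ids are hypothetical.

```python
from statistics import mean


def true_positives(retrieved_ids, active_ids):
    """Count the retrieved compound ids that belong to the activity class."""
    return len(set(retrieved_ids) & set(active_ids))


def mean_true_positives(retrieved_per_query, active_ids):
    """Average the true-positive count over all queries of one class."""
    return mean(true_positives(ids, active_ids) for ids in retrieved_per_query)


# Hypothetical activity class and two top-ranked lists (one per query).
active = {"a1", "a2", "a3"}
retrieved = [
    ["a1", "x9", "a2"],  # two actives retrieved
    ["x1", "a3", "x2"],  # one active retrieved
]

print(mean_true_positives(retrieved, active))
```

Any retrieved id that is not in the activity class counts as a false positive, so the two counts for a query sum to the length of its top-ranked list.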

A. UNITY Fingerprint

The UNITY fingerprint is a 993-bit binary string. From the results shown in Figure 3, we can see that the Tanimoto similarity coefficient is the best coefficient across the activity classes used as the sample. The exception is the Angiotensin class, where Tanimoto is the least effective, with a mean value of [value missing]. For the Angiotensin activity class, Euclidean distance gives the highest mean value, [value missing].

Figure 3. Mean number of true positives for each activity class with different similarity coefficients (UNITY)

Table VI shows the mean value for each class of molecule. Tanimoto gives the highest mean value in every class except Angiotensin, where Euclidean (Eu) gives the highest mean value and Tanimoto is rated last.

TABLE VI. MEAN FOR EVERY CLASS (UNITY)

Activity class  Tan  RR  Eu
5HT3            …    …   …
Angiotensin     …    …   …
Thrombin        …    …   …
5HT             …    …   …
Substance P     …    …   …

B. ECFP4 Fingerprint

The ECFP4 fingerprint is a 1024-bit binary string. Using this fingerprint, the Tanimoto similarity coefficient remains the best. This agrees with Todeschini et al (2012), who evaluated similarity coefficients using simulated and real data and found considerable merit in the well-established Jaccard-Tanimoto coefficient. However, for the Angiotensin activity class, Russell-Rao shows the most promising results.

Figure 4. Mean number of true positives for each activity class with different similarity coefficients (ECFP4)

Table VII gives a rather different result from Table VI. Tanimoto is still the best coefficient for the 5HT, Thrombin, 5HT3 and Substance P activity classes. For Angiotensin, the best coefficient is Russell-Rao (RR), and Euclidean is rated last.

VI. CONCLUSIONS

In conclusion, both the type of molecular fingerprint and the activity class affect the outcome of the molecular similarity calculation.
Nevertheless, in this experiment the Tanimoto similarity coefficient gave the best results for most activity classes, so we recommend it for virtual screening. For future work, we plan to experiment with different similarity coefficients and different chemical databases to investigate further the parameters that can affect similarity searching performance.

TABLE VII. MEAN FOR EVERY CLASS (ECFP4)

Activity class  Tan  RR  Eu
5HT3            …    …   …
Angiotensin     …    …   …
Thrombin        …    …   …
5HT             …    …   …
Substance P     …    …   …

ACKNOWLEDGEMENTS

We would like to thank Dr Nurul Malim for comments and feedback on the manuscript. This work is jointly supported by the UKM Grant GGPM and USM Short Term Grant 304/PKOMP/

REFERENCES

[1] Arif, S. M., Holliday, J. D., & Willett, P. (2013). Comparison of chemical similarity measures using different numbers of query structures. Journal of Information Science, 39(1).
[2] Faver, J. C., Ucisik, M. N., Yang, W., & Merz, K. M. (2013). Computer-aided drug design: Using numbers to your advantage. ACS Medicinal Chemistry Letters, 4(9).
[3] Green, D. V. S. (2008). Virtual screening of chemical libraries for drug discovery. Expert Opinion on Drug Discovery, 3(9).
[4] Lavecchia, A., & Di Giovanni, C. (2013). Virtual screening strategies in drug discovery: A critical review. Current Medicinal Chemistry, 20(23).
[5] Todeschini, R., Consonni, V., Xiang, H., Holliday, J., Buscema, M., & Willett, P. (2012). Similarity coefficients for binary chemoinformatics data: Overview and extended comparison using simulated and real data sets. Journal of Chemical Information and Modeling, 52(11).
[6] Willett, P. (2011). Similarity-based data mining in files of two-dimensional chemical structures using fingerprint measures of molecular resemblance. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 1(3).
[7] Willett, P., Barnard, J. M., & Downs, G. M. (1998). Chemical similarity searching. Journal of Chemical Information and Computer Sciences, 38(6).
[8] Xu, J., & Hagler, A. (2002). Chemoinformatics and drug discovery. Molecules, 7(8).
[9] Binary fingerprint figure. Available from: … [last …]
[10] ECFP4 fingerprint. Available from: … [last …]
[11] UNITY fingerprint. Available from: … [last …]
