In Search of Desirable Compounds

Adrijo Chakraborty, University of Georgia, Email: adrijoc@uga.edu
Abhyuday Mandal, University of Georgia, Email: amandal@stat.uga.edu
Kjell Johnson, Arbor Analytics, LLC, Email: Kjell@arboranalytics.com

Abstract

Drug discovery scientific teams evaluate a compound's performance across a number of endpoints, often including its ability to bind to target(s), as well as its absorption, distribution, metabolism, excretion, and safety characteristics. A popular way to prioritize compounds is through desirability scoring across the selected endpoints. In addition to desirability scores, drug discovery teams can measure or compute many other compound descriptors that are thought to be related to the endpoints of interest. Scientists would like to identify a small number of such descriptors which can be tweaked to improve the performance of a compound. To that end, one primary objective of these teams is to identify the most crucial descriptors, and relationships among the descriptors, that impact compound desirability. Finding these relationships can help teams understand how the predictors relate to compound desirability, and can possibly suggest molecular alterations that could improve it. In this work, we propose a novel method based on the desirability scoring framework for identifying important descriptors. We also explore ways to visualize high dimensional predictor spaces that can effectively inform teams' decisions in synthesizing new compounds.

Keywords: Importance frequency, Desirability Scores, Drug Discovery, Pharmaceutical Chemistry, Variable Prioritization.

1 Introduction

Due to a number of technological advances, drug discovery research has rapidly changed over the past few decades.

Traditionally, new molecular entities were painstakingly synthesized one at a time, and smaller collections of molecules were analyzed across a handful of in-vivo and in-vitro screens. Using this relatively small amount of information, teams would identify a few compounds with optimal characteristics and move towards development. The advent of combinatorial chemistry and automated methods for synthesizing compounds enabled companies to efficiently create hundreds to thousands of molecules, thus greatly expanding their compound libraries. Robotic technology and biological nanotechnology advancements also enabled companies to conduct in-vitro screening of large libraries across a host of biological targets for efficacy, pharmacokinetic and pharmacodynamic properties, and safety characteristics. Furthermore, the field of computational chemistry now has the ability to numerically describe compounds' molecular structure and physical characteristics. Hence, the number of compounds to evaluate, as well as the number of descriptors for those compounds, has greatly expanded, leaving discovery teams with the difficult chore of sifting through this information to identify only the most promising compounds to push forward.

Humans are not well equipped to compare and contrast objects across an array of information. Instead, it is much easier to prioritize objects across just one, or perhaps two, variables. Because of this inherent human nature, discovery teams have used desirability functions (Harrington, 1965) to generate one-number summaries when trying to optimize across multiple responses (Mandal et al., 2007). There are various choices of desirability functions, one of the most common being the weighted geometric mean. In a global sense, these scores rank compounds from best to worst, where the higher the score, the better the compound. Often prior knowledge can help to determine the desirability threshold above which teams seek compounds. For some projects, a candidate set of desirable molecules emerges. For other projects, however, the set of compounds that currently exist may not achieve high enough desirability scores to merit moving any forward. In cases like these, teams would like to better understand characteristics of the compounds that are related to the desirability score and synthesize new compounds that improve upon those characteristics.

Hence, the compound characteristics that are thought to be related to the desirability score are also evaluated. This can be looked upon as a general regression setup where the response Y is the compound desirability score and the predictor variables X_1, ..., X_p represent the compound characteristics. Predictive performance and parsimony are the two main concerns when a regression model is constructed. Removing irrelevant variables not only makes the model simpler and more interpretable but usually improves the prediction accuracy. Variable selection methods such as stepwise and best subsets are not as popular as they once were, since they are time consuming (especially as the number of predictors becomes large) and sometimes eliminate useful predictors. The Least Angle Regression (LAR) algorithm (Efron et al., 2004), motivated by the forward stagewise method, is computationally very efficient.

More recently, penalized least squares based variable selection methods have become popular. The LASSO (Tibshirani, 1996) shrinks all the least squares regression coefficients towards zero, and in some cases it sets some of the coefficients exactly to zero, which leads to a simpler model. With a suitable modification, the LAR algorithm computes LASSO solutions. The LASSO does not perform well when the variables are highly correlated; the Elastic Net (Zou and Hastie, 2005) is another method, which outperforms the LASSO in the presence of highly correlated predictors. The LASSO is based on an L1 penalty, while the Elastic Net is based on a linear combination of L1 and L2 penalties. Fan and Li (2001) introduced another penalty function, SCAD, whose estimators satisfy the oracle properties. Fan and Lv (2008) introduced the sure independence screening (SIS) method, which puts importance on the variables with high marginal correlation with the response. Although powerful, these methods do not solve the problem of efficiently identifying a handful of variables that can be tweaked in order to improve the efficacy of a compound. In this sense our problem is different from the standard variable selection problem, as we discuss later.

In this paper we propose an elegant method that can easily be used to identify important descriptors. In this problem, along with the value of the response for each compound, we have some additional information: the scientists also know whether the compound is desirable, moderately desirable or undesirable. Naturally, we use both the descriptive information and the desirability information to improve the performance of a compound. As mentioned before, the primary focus of the paper is to develop a simple method to prioritize variables, because the drug discovery team's objective is to identify a few characteristics of a compound to be modified in order to improve its desirability. The paper is organized as follows. In Section 2 we discuss the data set provided by a pharmaceutical company. Given this context, we propose our method in Section 3. In Section 4 we illustrate the method on the pharmaceutical data, and in Section 5 we perform a cross validation study to investigate its performance. We conclude the paper with some discussion in Section 6.

2 Pharmaceutical Chemistry Example

Early in the adjudication process, a pharmaceutical discovery team had obtained a desirability score for 10,640 compounds. In addition, the team possessed values of 28 molecular descriptors for all of the compounds. Each of these descriptors is chemically interpretable, and changes in each of them can be translated into practical alterations of the molecule. For proprietary reasons, specific details about the data used in this work cannot be disclosed; however, we will provide general descriptions of the predictors and the desirability scores. Based on the desirability scores, the compounds are classified into three groups: desirable, moderately desirable, and undesirable.

For this particular problem, a compound is desirable if its score is greater than 5 and is undesirable if its score is less than 3. The distribution of the scores for this set of compounds is presented in Figure 1.

[Figure 1: Distribution of Desirability Scores. Histogram of compound frequency against desirability score, with the undesirable and desirable regions marked.]

From Figure 1 we see that the distribution of the scores is more or less symmetric. In the data set, we have 2289 desirable compounds and 3323 undesirable compounds. One of the team's primary objectives was to understand the predictors in such a way that they could make alterations to undesirable compounds to make them moderately desirable or better.

3 Proposed Variable Prioritization Methodology

Let Y be the desirability score of a compound. We assume that a compound is considered desirable if Y > C_2, undesirable if Y < C_1, and moderately desirable otherwise. Here the cut-offs C_1 and C_2 depend on the nature of the compound library at hand and are chosen by the scientists.
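As a concrete illustration of the scoring and grouping just described, the sketch below computes a weighted geometric mean desirability score (the common choice mentioned in Section 1) and applies the three-group classification. The endpoints, weights, and per-endpoint desirabilities are hypothetical and the cut-offs are passed in explicitly; the actual scoring behind the proprietary data of Section 2 (which uses C_1 = 3 and C_2 = 5 on its own scale) is not disclosed, so this is a sketch, not the scoring used in the paper.

```python
import numpy as np

def overall_desirability(d, w):
    """Weighted geometric mean of per-endpoint desirabilities d_k with weights w_k."""
    d, w = np.asarray(d, dtype=float), np.asarray(w, dtype=float)
    return float(np.prod(d ** (w / w.sum())))

def classify(y, c1, c2):
    """Three-group classification used throughout the paper."""
    return "desirable" if y > c2 else ("undesirable" if y < c1 else "moderately desirable")

# Hypothetical compound: per-endpoint desirabilities (e.g., potency, solubility,
# clearance) on a 0-1 scale and their relative weights.
d = [0.9, 0.6, 0.8]
w = [2.0, 1.0, 1.0]
score = overall_desirability(d, w)   # lies in [0, 1] on this illustrative scale
print(round(score, 3), classify(score, c1=0.3, c2=0.5))
```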

In a data set of N compounds, suppose D are desirable, MD are moderately desirable, and UD are undesirable. Also assume that the data set has p continuous descriptors denoted by X_1, X_2, ..., X_p. Let \hat{Y} be the predicted desirability score of a compound based on these predictors, that is, \hat{Y} = \hat{f}(X_1, ..., X_p) for some unknown function f. Let \hat{D} be the number of compounds which are predicted to be desirable, that is,

\hat{D} = \sum_{i=1}^{N} I\{\hat{Y}_i > C_2\},

and similarly define \hat{MD} and \hat{UD}. Suppose we remove a predictor, say X_j, from the set of predictors and predict Y based on the rest. Let \hat{Y}^{(j)}_i be the predicted desirability score of the i-th compound after removal of the j-th predictor from the data set. Now, let \hat{D}^{(j)}_{new} denote the number of compounds which were originally either undesirable or moderately desirable but whose predicted value, with the exclusion of the j-th predictor, falls in the desirable range. In other words,

\hat{D}^{(j)}_{new} = \sum_{i=1}^{N} I\{\hat{Y}^{(j)}_i > C_2 \text{ and } Y_i < C_2\}.

Similarly, \hat{UD}^{(j)}_{new} denotes the number of compounds that move from the desirable group to the moderately desirable or undesirable group, and \hat{Ud}^{(j)}_{new} denotes the number of compounds which move from the moderately desirable group to the undesirable group, with the exclusion of X_j.

Let P_j be the proportion of compounds which are predicted to be desirable based on the set of predictors excluding X_j, among the compounds which are observed to be moderately desirable or undesirable, and let Q_j be the proportion of compounds which are predicted to be undesirable based on the set of predictors excluding X_j, among the compounds which are observed to be desirable. Similarly, let R_j be the proportion of compounds which are predicted to be undesirable based on the set of predictors excluding X_j, among the compounds which are observed to be moderately desirable. That is,

P_j = \hat{D}^{(j)}_{new} / (MD + UD),   Q_j = \hat{UD}^{(j)}_{new} / D,   R_j = \hat{Ud}^{(j)}_{new} / MD.

We can compute P, Q and R for each variable in the set of predictors and judge whether or not to use this variable to modify the molecule. Removal of the j-th variable can also be seen as setting X_j = 0, which is equivalent to tweaking a molecule such that the value of the j-th predictor becomes 0. Suppose that a variable X_j has a high value of P. This would indicate that the corresponding alteration of X_j would shift a considerably large proportion of compounds into the desirable range; hence, this result would suggest changing the molecule based on X_j. On the other hand, if Q_j and/or R_j is high, altering X_j in this way would cause a large proportion of compounds to move from a more desirable to a less desirable category, and hence we should not consider alterations of the predictors that would raise these Q and R scores.

While deciding which predictors should be considered for altering the molecule, we need to consider P, Q, and R together. Given these considerations, we propose the following variable prioritization method.

For each predictor X_j:
1. Remove X_j.
2. Predict Y based on X_1, X_2, ..., X_{j-1}, X_{j+1}, ..., X_p.
3. Compute P_j, Q_j, R_j.

Then select the important variables based on the P, Q and R plots.

Note that a variable with a higher P score may not be included in the set of important predictors; on the other hand, a variable with a higher Q or R score should not be excluded. We recommend plotting the P, Q and R scores for each variable to understand the relative scoring within the problem. This proposed visualization can help to pinpoint, from a vast set of variables, the important ones which might have either a beneficial or an adverse effect on the desirability score of a compound. Having identified a variable as important, we can then use that information as feedback into the scientific process; in this case we could use the information for future molecular synthesis.

Now, let X_Important be the set of important predictors. Since we suggest choosing the important variables from the P, Q and R plots, the choice of the members and the cardinality of X_Important is subjective. A practical approach is to rank the predictors according to their Q and R scores and choose a few of them based on their ranks. We illustrate the method by analyzing a data set and also perform a cross validation study in Section 5. There we rank the variables based on their P, Q and R values, choose the top three variables based on the Q and R scores to construct X^{QR}_{High}, and also construct X^{P}_{High}, which consists of the variables ranking among the top three based on their P scores. Finally, we construct X_Important from the variables which are in X^{QR}_{High} but not in X^{P}_{High}.

Scientists might also be interested in the relationship between the desirability score and the predictors, that is, in estimating the unknown function η in Y = η(X_1, ..., X_p) + ε, with ε ~ N(0, σ²), where Y is the desirability score and (X_1, ..., X_p) are the p predictors. We consider a second order polynomial for η, a common choice of response function in response surface methodology (Wu and Hamada, 2009). In order to reduce the number of parameters and avoid over-fitting, we consider the second order polynomial only for the important variables. Hence the model becomes

Y = g(X_Important) + h(X_Remaining) + ε,   (1)

where X_Important is the set of important variables chosen using our proposed variable prioritization methodology and X_Remaining denotes the others. Here we approximate g(·) by a second order polynomial which contains all the linear, quadratic and interaction terms of the important variables, and approximate h(·) by a linear combination of the other variables.
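A minimal sketch of these quantities, using ordinary least squares as the prediction function f (the choice used later in the paper) and comparing predicted groups against the observed ones; the array names X and y and the cut-off defaults are placeholders, and the sketch follows one reading of the definitions above (a compound leaves the desirable group when its predicted score drops below C_2):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def pqr_scores(X, y, c1=3.0, c2=5.0):
    """Compute P_j, Q_j, R_j for each predictor by refitting the prediction
    function with X_j removed and comparing predicted and observed groups."""
    n, p = X.shape
    desirable = y > c2
    undesirable = y < c1
    moderate = ~desirable & ~undesirable
    D, MD, UD = desirable.sum(), moderate.sum(), undesirable.sum()
    P, Q, R = np.zeros(p), np.zeros(p), np.zeros(p)
    for j in range(p):
        X_minus_j = np.delete(X, j, axis=1)
        y_hat = LinearRegression().fit(X_minus_j, y).predict(X_minus_j)
        P[j] = np.sum((y_hat > c2) & ~desirable) / (MD + UD)   # D_new^(j) / (MD + UD)
        Q[j] = np.sum((y_hat < c2) & desirable) / D            # UD_new^(j) / D
        R[j] = np.sum((y_hat < c1) & moderate) / MD            # Ud_new^(j) / MD
    return P, Q, R
```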
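A sketch of how Model (1) might be fit, assuming the predictors are held in a NumPy array with columns ordered X_1, ..., X_p; the helper name fit_model_1 and the use of scikit-learn are illustrative, not the authors' original implementation.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def fit_model_1(X, y, important):
    """Model (1): full second-order expansion (linear, quadratic and two-way
    interaction terms) of the columns in X_Important, plus linear terms for
    X_Remaining, estimated by ordinary least squares."""
    important = np.asarray(important)
    remaining = np.setdiff1d(np.arange(X.shape[1]), important)
    poly = PolynomialFeatures(degree=2, include_bias=False)
    design = np.hstack([poly.fit_transform(X[:, important]), X[:, remaining]])
    return LinearRegression().fit(design, y), poly, remaining

# With 8 important and 20 remaining descriptors this gives 44 + 20 columns
# plus an intercept, i.e. the 65 parameters reported in Section 4.
```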

The choice of the model is somewhat subjective in this scenario, and there is scope for model selection for this kind of data.

[Figure 2: PQR plots used to identify the important predictors. The P, Q and R scores are plotted against the variable index, with the variables having notably high or low scores labeled in each panel.]
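A minimal plotting sketch in the spirit of Figure 2, reusing the arrays returned by the pqr_scores sketch above; matplotlib is an assumption, and any plotting tool would do.

```python
import matplotlib.pyplot as plt

def pqr_plots(P, Q, R):
    """Plot P, Q and R scores against the variable index, as in Figure 2."""
    fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
    for ax, scores, name in zip(axes, (P, Q, R), ("P", "Q", "R")):
        ax.plot(range(1, len(scores) + 1), scores, "o")
        ax.set_xlabel("Variables")
        ax.set_ylabel(name)
        ax.set_title(f"{name} plot")
    fig.tight_layout()
    plt.show()
```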

4 Pharmaceutical Chemistry Example Revisited

In this section we apply the proposed methodology to the data set at hand. As mentioned in Section 2, a compound is considered desirable if its desirability score (Y) is greater than 5, moderately desirable if Y is between 3 and 5, and undesirable if Y is less than 3. In this data set we have 28 predictors, and we follow the steps described in Section 3. The P, Q and R values are illustrated in Figure 2.

In this example the P scores are comparatively higher for X_15, X_10, X_22, X_21 and X_23, and significantly lower for X_16. This means that if a compound from the undesirable or moderately desirable group is chosen and reconstructed with X_j = 0, j ∈ {10, 15, 21, 22, 23}, then the chance of the new compound falling in the desirable group is higher. We may consider removing X_10, X_15, X_21, X_22 and X_23 from the set of predictors, but before moving on we need to check the Q and R plots as well. From the Q plot we see that the variables X_1, X_10, X_11, X_15, X_19, X_23 and X_26 have higher Q scores. This implies that the chance of a desirable compound becoming undesirable is high if we tweak the compound so that X_j = 0, j ∈ {1, 10, 11, 15, 19, 23, 26}. Hence we should not set these to 0, and they should be considered members of X_Important. Similarly, the R plot indicates that there is little chance that a moderately desirable compound will become undesirable if the scientists tweak the compound so that X_16 = 0, whereas the chances are higher if they set X_11, X_5, X_8, X_25 and X_17 to 0.

While building a model as suggested in Equation (1), we need to choose the variables to keep in X_Important. As mentioned in Section 3, the choice of the members of X_Important is subjective. From Figure 2 we see that X_1, X_5, X_8, X_10, X_11, X_15, X_19, X_23, X_25 and X_26 have either higher Q or higher R scores. Among them, we exclude X_10 and X_15 since both have higher P scores. Hence X_Important consists of X_1, X_5, X_8, X_11, X_19, X_23, X_25 and X_26. Once the X_Important set is built, we fit Model (1) based on these variables; the adjusted R² of the fitted model is 0.8565. The fitted Model (1) is a second order model with 65 parameters. From a parsimony standpoint, we could fit a LASSO (Tibshirani, 1996) in order to get rid of irrelevant predictors, and if there are correlated variables in the predictor set, the Elastic Net (Zou and Hastie, 2005) could also be used.
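A sketch of this optional pruning step, reusing the Model (1) design matrix and response from the sketch in Section 3 (the names design and y come from that sketch). LassoCV and ElasticNetCV tune the penalty by K-fold cross-validation, which is an assumption here rather than the tuning used in the paper.

```python
import numpy as np
from sklearn.linear_model import LassoCV, ElasticNetCV

# `design` is the 64-column Model (1) matrix built earlier; LassoCV zeroes out
# irrelevant terms, and ElasticNetCV is the analogue when predictors are correlated.
lasso = LassoCV(cv=5).fit(design, y)
n_kept = int(np.sum(lasso.coef_ != 0))

enet = ElasticNetCV(cv=5, l1_ratio=[0.2, 0.5, 0.8, 0.95]).fit(design, y)
```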

[Figure 3: Importance Frequency and Second Order Frequency plots for all the variables. Panels (a)-(c) show the importance frequencies with respect to the P, Q and R scores, and panel (d) shows the second order frequencies. The variables with notably high frequencies are labeled: X_15 and X_22 in panel (a), X_11 in panel (b), X_5, X_11 and X_25 in panel (c), and X_1, X_2, X_3, X_9, X_15, X_16, X_18, X_21, X_22 and X_27 in panel (d).]

5 A Cross Validation Study and Model Comparison

In this section we evaluate the performance of our method through a cross validation study. The cross validation scheme is as follows.

1. We construct a training data set by randomly selecting 1000 observations from the entire data set, and we randomly select a test data set of size 500 from the remaining observations.
2. We apply the variable prioritization method proposed in Section 3 to the training data. First we rank the variables according to their P, Q and R scores. We choose the top three variables based on the Q and R scores and construct X^{QR}_{High}. We construct another set consisting of the top three variables based on the P scores, namely X^{P}_{High}. Then X_Important consists of the variables which belong to X^{QR}_{High} but not to X^{P}_{High}, that is, X_Important = X^{QR}_{High} \ X^{P}_{High}.
3. We fit the model (1) described in Section 3 using this X_Important.
4. We predict the desirability scores of the observations in the test data using the fitted model from Step 3, and we calculate R² based on the correlation between y_i^T and ŷ_i^T, the observed and predicted desirability scores of the i-th compound in the test data.
5. We repeat steps 1 through 4 100 times.

We define the importance frequency of a variable as the number of times it appears among the top three variables in terms of a given performance measure. As noted in the scheme above, for each training data set we calculate the P, Q and R scores of every variable; once these scores are available for all the training data sets, we compute the importance frequency of each variable with respect to the P, Q and R scores. In panels (a), (b) and (c) of Figure 3 we plot the P, Q and R importance frequencies and point out the variables which have considerably higher or lower importance frequencies. In Table 1 we show the importance frequency distribution of the variables with respect to the P and R scores. From panel (b) of Figure 3 we see that X_11 has a slightly higher frequency than the rest of the variables, which have almost the same frequencies; hence we do not tabulate the importance frequencies with respect to the Q scores.

Now, without prioritizing the variables, if we consider a second order response surface model for all the variables, we have

Y = η(X_1, ..., X_p) + ε = Zγ + ε,  where ε ~ N(0, σ²),   (2)

where Y (n × 1) is the vector of desirability scores, γ (v × 1) is the coefficient vector, Z = (X_1, X_2, ..., X_p, X_1^2, ..., X_p^2, X_1X_2, ..., X_{p-1}X_p) is the n × v design matrix, and v is the total number of linear and second order terms in the model.
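A sketch of Model (2), assuming the training and test samples from step 1 are held in arrays X_train, y_train, X_test, y_test (the names are placeholders) and using scikit-learn's PolynomialFeatures to build Z; with p = 28 this gives v = 28 + 28 + 378 = 434 columns.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Full second-order response surface in all p predictors (Model (2)).
poly2 = PolynomialFeatures(degree=2, include_bias=False)
Z_train = poly2.fit_transform(X_train)              # n x v design matrix Z
model2 = LinearRegression().fit(Z_train, y_train)   # ordinary least squares

# Test-data R^2 computed, as in step 4, from the correlation between observed
# and predicted desirability scores.
y_pred = model2.predict(poly2.transform(X_test))
r2_test = np.corrcoef(y_pred, y_test)[0, 1] ** 2
```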

Table 1: Importance frequencies and second order frequencies of the descriptors.

Frequency | Importance frequency (P) | Importance frequency (R) | Second order frequency, Model (2)+LASSO
< 20      | other variables          | other variables          | other variables
20-30     | None                     | X_25                     | X_23, X_27
30-50     | X_22                     | X_5                      | X_3
50-80     | X_15                     | X_11                     | X_1, X_18, X_21
> 80      | None                     | None                     | X_2, X_9, X_15, X_16, X_22

We estimate the parameters of Model (2) by the ordinary least squares method. In this kind of problem p is typically large, which forces v to be large as well. One might be tempted to use a well known variable selection technique in order to fit a parsimonious model, but we need to keep in mind that this would defeat our purpose of identifying a handful of predictors to be tweaked in order to make the compounds more desirable. However, for the sake of comparison, we choose the LASSO (Tibshirani, 1996) as our variable selection technique. Hence γ is estimated by minimizing ||Y - Zγ||² subject to Σ_j |γ_j| ≤ t, where t is the tuning parameter. The tuning parameter is chosen by optimizing a generalized cross-validation style statistic (Tibshirani, 1996). To fit Model (2) with the LASSO penalty, we use the LAR algorithm (Efron et al., 2004) with the LASSO modification. We apply this model to the same training data obtained in step 1 of the simulation scheme described above and check the predictive performance by applying the fitted model to the test data. We repeat this 100 times, using the same training and test data sets as before. Finally, we compute R² for each of the 100 simulations by calculating the correlation coefficient between the predicted and observed desirability scores of the test data.

We also count the number of times a variable is selected as a second order term in the LASSO model fitted to the training data. By a second order term we mean either a quadratic main effect (X_i^2) or an interaction effect (X_iX_j). We call this count the second order frequency. Table 1 and panel (d) of Figure 3 show the second order frequencies of each variable. One may argue that the importance frequencies and second order frequencies are not directly comparable, but they can be used to illustrate our point that the proposed PQR method identifies, more efficiently, a handful of predictors that can be tweaked to improve the performance of the compound. This happens because our Model (1) follows the effect heredity principle (Chipman, 1996), also called the marginality principle (McCullagh and Nelder, 1989).
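A sketch of the comparison fit and the second order frequency bookkeeping, continuing from the Model (2) sketch above. scikit-learn's LassoLarsIC (an information-criterion-tuned LARS-LASSO) stands in for the generalized cross-validation style tuning used in the paper, and the counting reflects one reading of the second order frequency definition.

```python
import numpy as np
from sklearn.linear_model import LassoLarsIC

# Fit Model (2) with the LASSO penalty via the LARS algorithm with the LASSO
# modification; BIC-based tuning is a stand-in for the GCV-style statistic.
fit = LassoLarsIC(criterion="bic").fit(Z_train, y_train)

# A column of Z is a second order term if its total polynomial degree is 2;
# poly2.powers_ (from the previous sketch) records each column's exponents.
second_order_cols = poly2.powers_.sum(axis=1) == 2
active = fit.coef_ != 0

# Variables entering this fit through an active quadratic or interaction term.
appears = np.zeros(poly2.powers_.shape[1], dtype=bool)
for col in np.where(active & second_order_cols)[0]:
    appears[poly2.powers_[col] > 0] = True

# In the full study this indicator would be accumulated over the 100
# replications to give each variable's second order frequency.
second_order_frequency = appears.astype(int)
```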

In Figure 4 we plot the R² values for Models (1) and (2), with and without the LASSO penalty. Model (1) outperforms Model (2) by a large margin. We can also see that the predictive performance of Model (2)+LASSO (i.e., the full second order model with the LASSO penalty) is only marginally better than that of the proposed model. This establishes the usefulness of the proposed method, which is very simple for the scientists to use.

[Figure 4: Comparison of the predictive performance (test-data R²) of Model (1), Model (2) and Model (2)+LASSO; the right panel is a closer view of the boxplots in the left panel. Training data size = 1000, test data size = 500.]

6 Summary and Concluding Remarks

The proposed variable prioritization method uses both qualitative information (i.e., whether the response falls in the desirable, moderately desirable or undesirable group) and the quantitative values of the response. Once the method is applied, it is easier for the scientists to visualize the importance of the predictors and arrive at a smaller set of predictors on which to conduct further research. We proposed a model based on the second order terms of the prioritized variables and studied its performance through a cross validation study.

In order to compute the P, Q and R scores, we need to choose a prediction function f, as mentioned in Section 3. In this paper we chose multiple linear regression to predict Y for this purpose. One could also use regression trees, for instance Random Forest (Breiman, 2001) regression; however, applying Random Forest could increase the computation time. As illustrated by Figures 3 and 4, our proposed method performs almost as well as sophisticated variable selection techniques like the LASSO, but is much simpler for the scientists to use, and it also identifies a smaller number of descriptors that the scientists would be interested in.

The focus of the paper is variable prioritization to identify a small number of predictors that can be modified to improve the performance of the compound. One can plot the P, Q and R scores and identify or rank the predictors based on their respective scores. Once the set of important predictors is built, one might then fit a model in order to explore the relationship between the predictors and the response variable. The proposed model, motivated by second order response surface modeling, fits the data reasonably well. There is scope for further research in model selection for this kind of data. Here we computed marginal P, Q and R scores, i.e., we removed a single variable from the predictor set at a time. Group P, Q and R scores could also be computed by removing a group of variables from the predictor set at a time.

References

1. Breiman, L. (2001). Random forests. Machine Learning 45, 5-32.
2. Chen, H.-W., Wong, W. K. and Xu, H. (2012). An Augmented Approach to the Desirability Function. Journal of Applied Statistics, in press.
3. Chipman, H. (1996). Bayesian variable selection with related predictors. Canadian Journal of Statistics 24, 17-36.
4. Efron, B., Hastie, T., Johnstone, I. and Tibshirani, R. (2004). Least angle regression (with discussion). Ann. Statist. 32, 407-499.
5. Fan, J. and Li, R. (2001). Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96, 1348-1360.
6. Fan, J. and Lv, J. (2008). Sure independence screening for ultrahigh dimensional feature space (with discussion). J. Roy. Statist. Soc. Ser. B 70, 849-911.
7. Fan, J. and Lv, J. (2010). A selective overview of variable selection in high dimensional feature space. Statistica Sinica 20, 101-148.
8. Harrington, E. C. (1965). The Desirability Function. Industrial Quality Control 4, 494-498.
9. Hastie, T., Tibshirani, R. and Friedman, J. (2005). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd edition). Springer-Verlag, New York.

10. McCullagh, P. and Nelder, J. A. (1989). Generalized Linear Models (2nd edition). Chapman and Hall, London.
11. Mandal, A., Johnson, K. and Wu, C. F. J. (2007). Identifying Promising Compounds in Drug Discovery: Genetic Algorithms and Some New Statistical Techniques. Journal of Chemical Information and Modeling 47, 981-988.
12. Tibshirani, R. (1996). Regression shrinkage and selection via the LASSO. J. Roy. Statist. Soc. Ser. B 58, 267-288.
13. Wu, C. F. J. and Hamada, M. (2000). Experiments: Planning, Analysis, and Parameter Design Optimization. New York: Wiley.
14. Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. J. Roy. Statist. Soc. Ser. B 67, 301-320.