A Review on Computational Methods in Developing Quantitative Structure-Activity Relationship (QSAR)

International Journal of Drug Design and Discovery, Volume 3, Issue 3, July-September 2012

Navdeep Singh Sethi
Department of Pharmacy (Pharmaceutical Chemistry), Doaba Group of Colleges, Kharar (Mohali), Punjab, India.

ABSTRACT: Virtual filtering and screening of combinatorial libraries have recently gained attention as methods complementing high-throughput screening and combinatorial chemistry. These chemoinformatic techniques rely heavily on quantitative structure-activity relationship (QSAR) analysis, a field with established methodology and a successful history. In this review, we discuss the computational methods for building QSAR models. The review starts with a general introduction to QSAR and its theories and identifies the general scheme of a QSAR model. It then focuses on the methodologies for constructing the three main components of a QSAR model, namely the methods for describing the molecular structure of compounds, for selecting informative descriptors and for predicting activity. The review presents both well-established methods and techniques more recently introduced into the QSAR domain.

KEYWORDS: QSAR; Free Wilson analysis; Hansch analysis; molecular descriptors (2D descriptors and 3D descriptors); feature selection; machine learning.

Introduction

If we can understand how a molecular structure brings about a particular effect in a biological system, we have a key to unlocking the relationship and using that information to our advantage. The formal development of relationships on this premise has proved to be the foundation for the development of predictive models. If we take a series of chemicals and attempt to form a quantitative relationship between the biological effect (i.e. the activity) and the chemistry (i.e.
the structure) of each of the chemicals, then we are able to form a quantitative structure-activity relationship, or QSAR. A less complex, qualitative understanding of the role of structure in governing effects, i.e. that a fragment or substructure could result in a certain activity, is often simply termed a structure-activity relationship, or SAR. Together, SARs and QSARs can be referred to as (Q)SARs and fall within a range of techniques known as in silico approaches. A (Q)SAR comprises three parts: the (activity) data to be modeled and hence predicted, the data with which to model, and a method to formulate the model.

The purposes of in silico studies include the following:
(a) To predict biological activity and physicochemical properties by rational means.
(b) To comprehend and rationalize the mechanism of action within a series of chemicals.

* For correspondence: Navdeep Singh Sethi, E-mail: navdeep827@yahoo.com

Underlying these aims, the reasons for wishing to develop these models include:
(a) Savings in the cost of product development (e.g. in the pharmaceutical, pesticide and personal products areas).
(b) Predictions could reduce the requirement for lengthy and expensive animal tests.
(c) Reduction, and even in some cases replacement, of animal tests, thus reducing animal use and the pain and discomfort caused to animals.
(d) Promotion of green and greener chemistry, increasing efficiency and eliminating waste by not following leads that are unlikely to be successful [1-3].

Quantitative structure-activity relationships (QSARs) are based on the assumption that the structure of a molecule (i.e. its geometric, steric and electronic properties) must contain the features responsible for its physical, chemical and biological properties, and on the ability to represent the chemical by one or more numerical descriptors.
Within congeneric series of compounds, QSARs correlate affinities of ligands to their binding sites, inhibition constants, rate constants and other biological activities either with certain structural features (Free Wilson analysis) or with atomic, group or molecular properties such as lipophilicity, polarizability, electronic and steric properties (Hansch analysis) [4,5]. Since the introduction of these approaches, QSAR equations have been used to describe thousands of biological activities within different series of drugs and drug candidates. Enzyme inhibition data in particular have been successfully correlated with physicochemical properties of the ligands. In certain cases, where X-ray structures of proteins became available, the

results of QSAR regression models could be interpreted with the additional information from the three-dimensional (3D) structures [6,7].

QSAR studies can reduce the costly failures of drug candidates in clinical trials by filtering the combinatorial libraries. Virtual filtering can eliminate compounds with predicted toxic or poor pharmacokinetic properties early in the pipeline [8,9]. It also allows for narrowing the library to drug-like or lead-like compounds and eliminating the frequent hitters, i.e. compounds that show unspecific activity in several assays and rarely result in leads. Including such considerations at an early stage results in multidimensional optimization, with high activity as an essential goal but not the only one.

Considering activity optimization, building target-specific structure-activity models based on identified hits can guide high-throughput screening (HTS) by rapidly screening the library for the most promising candidates. Such focused screening can reduce the number of experiments and allow for the use of more complex, low-throughput assays [12]. Feedback loops of high-throughput and virtual screening, resulting in a sequential screening approach, therefore allow for more rational progress towards high-quality lead compounds [13]. Later in the drug discovery pipeline, accurate QSAR models constructed on the basis of the lead series can assist in optimizing the lead [14].

The importance and difficulty of the tasks described above have inspired many chemoinformatics researchers to borrow from recent developments in various fields, including pattern recognition, molecular modeling, machine learning and artificial intelligence. This has resulted in a large family of conceptually different methods being used for creating QSARs. The purpose of this review is to guide the reader through the diversity of the techniques and algorithms for developing successful QSAR models.
Quantitative Structure-Activity Relationship (QSAR) Theories

All QSAR analyses are based on the assumption of linear additive contributions of the different structural properties or features of a compound to its biological activity, provided that there are no nonlinear dependences of transport or binding on certain physicochemical properties. This simple assumption is supported by dedicated investigations, for example the scoring function of the de novo drug design program LUDI (Eqn 1); in addition, the results of many Free Wilson and Hansch analyses support this concept [15,16].

\[ \Delta G_{binding} = \Delta G_{0} + \Delta G_{hb} + \Delta G_{ionic} + \Delta G_{lipo} + \Delta G_{rot} \qquad (1) \]

Overall loss of translational and rotational entropy, $\Delta G_{0}$ = +5.4 kJ mol$^{-1}$
Ideal neutral hydrogen bond, $\Delta G_{hb}$ = -4.7 kJ mol$^{-1}$
Ideal ionic interaction, $\Delta G_{ionic}$ = -8.3 kJ mol$^{-1}$
Lipophilic contact, $\Delta G_{lipo}$, expressed in J mol$^{-1}$ Å$^{-2}$
Entropy loss per rotatable bond of the ligand, $\Delta G_{rot}$, expressed in kJ mol$^{-1}$

Equation 1 correlates the free energy of binding, $\Delta G_{binding}$, with a constant term $\Delta G_{0}$ that describes the loss of overall translational and rotational degrees of freedom, and with $\Delta G_{hb}$, $\Delta G_{ionic}$ and $\Delta G_{lipo}$, which are structure-derived energy terms for neutral and charged hydrogen bond interactions and for hydrophobic interactions between the ligand and the protein; $\Delta G_{rot}$ describes the loss of internal rotational degrees of freedom of the ligand. Because of the extrathermodynamic relationship between the free energy $\Delta G$ and the equilibrium constant K (Eqn 2) or the rate constants k ($k_{on}$ = association rate constant, $k_{off}$ = dissociation rate constant of ligand-receptor complex formation), the logarithms of such values can be correlated with binding affinities.
\[ \Delta G = -2.303\,RT \log K = -2.303\,RT \log (k_{on}/k_{off}) \qquad (2) \]

Logarithms of the molar concentrations C that produce a certain biological effect can be correlated with molecular features or with physicochemical properties that are also free-energy-related equilibrium constants; normally the logarithms of inverse concentrations (log 1/C) are used to obtain larger values for the more active analogs.

Free Wilson Analysis

In 1964, Free and Wilson derived a mathematical model that describes the presence and absence of certain structural features, i.e. those groups that are chemically modified, by values of 1 and 0 and correlates the resulting structural matrix with biological activity values (Eqn 3).

\[ \log 1/C = \sum a_{i} + \mu \qquad (3) \]

The values $a_{i}$ in equation 3 are the biological activity group contributions of the substituents $X_{1}$, $X_{2}$, ..., $X_{i}$ in the different positions p of compound 1 (Figure 1), and $\mu$ is the biological activity value of the reference compound, most often the unsubstituted parent structure of a series [4,6].

Fig. 1. Schematic presentation of a molecule for Free Wilson analysis. A common skeleton bears substituents $X_{i}$ in different positions p; the presence or absence of these substituents is coded by the values 1 and 0, respectively.
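As an illustration of equation 3, the following minimal sketch builds the 0/1 structural matrix for a small hypothetical series and recovers the group contributions by least squares. The substituent pairs, the contribution values and the parent activity are made up for this example and are not taken from the data set discussed in the text; NumPy's `lstsq` is used here as one possible solver.

```python
import numpy as np

# Hypothetical series: each compound is a (meta, para) substituent pair; "H" is the parent.
compounds = [("H", "H"), ("F", "H"), ("Cl", "H"), ("H", "F"),
             ("H", "Cl"), ("F", "Cl"), ("Cl", "F"), ("Cl", "Cl")]

# Made-up group contributions a_i and parent activity mu, used only to generate data.
true_a = {("m", "F"): 0.1, ("m", "Cl"): 0.4, ("p", "F"): 0.3, ("p", "Cl"): 0.9}
true_mu = 7.0

features = sorted(true_a)  # one indicator column per (position, substituent)

def indicator_row(meta, para):
    """Code presence (1) or absence (0) of each substituent; last column is the constant for mu."""
    present = {("m", meta), ("p", para)}
    return [1.0 if f in present else 0.0 for f in features] + [1.0]

# Structural matrix and synthetic, noise-free activities log 1/C = sum(a_i) + mu
X = np.array([indicator_row(m, p) for m, p in compounds])
y = np.array([true_mu + true_a.get(("m", m), 0.0) + true_a.get(("p", p), 0.0)
              for m, p in compounds])

# Least-squares estimate of the group contributions and of mu
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
contributions = dict(zip(features, coef[:-1]))
mu = coef[-1]

for feature, value in contributions.items():
    print(feature, round(float(value), 3))
print("mu", round(float(mu), 3))
```

Because hydrogen is treated as the reference in every position, the constant column directly estimates the activity of the unsubstituted parent compound.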

Equation 4 describes the antiadrenergic activities of 22 different m-, p- and m,p-disubstituted analogs of the N,N-dimethyl-α-bromophenylamine 2 (Figure 2), where C is the concentration that causes a 50% reduction of the adrenergic effect of a certain epinephrine dose [4].

log 1/C = (±0.50) [m-F] (±0.29) [m-Cl] (±0.27) [m-Br] (±0.50) [m-I] (±0.27) [m-Me] (±0.30) [p-F] (±0.30) [p-Cl] (±0.30) [p-Br] (±0.50) [p-I] (±0.33) [p-Me] (±0.27) (4)
(n = 22; r = 0.969; s = 0.194; F = 16.99)

Fig. 2. N,N-dimethyl-α-bromophenylamines (X, Y = H, F, Cl, Br, I, Me).

Here n = number of compounds; r = correlation coefficient, a measure of the relative quality of a model; s = standard deviation, a measure of the absolute quality of a model; F = Fisher value, a measure of statistical significance; C = molar concentration that causes a certain biological effect.

Equation 4 illustrates the main advantage of Free Wilson analysis: only the biological activity values and the chemical structures of the compounds need to be known to derive a QSAR model. On the other hand, Free Wilson analysis has several shortcomings:
(a) At least two different positions of substituents must be chemically modified;
(b) Predictions can only be made for new combinations of substituents already included in the analysis;
(c) Single point determinations, i.e. the single occurrence of a certain structural feature in the whole data set, obscure the statistical results;
(d) Many degrees of freedom are wasted to describe every substituent.

Nevertheless, Free Wilson analysis is often used to see at a glance which physicochemical properties might be important for the biological activity.
In this data set, it can easily be concluded from equation 4 that biological activities increase with increasing lipophilicity (F to Cl, Br and I); that biological activities increase with electron-donor properties (methyl has larger group contributions than the equi-lipophilic Cl); and that meta-substituents have lower group contributions than para-substituents.

Hansch Analysis

Also in 1964, the linear free-energy-related Hansch model (sometimes called the extrathermodynamic approach) was published [4,7].

\[ \log 1/C = a (\log P)^{2} + b \log P + c\sigma + k \qquad (5) \]

P = n-octanol/water partition coefficient; σ = Hammett electronic parameter; a, b, c = regression coefficients; k = constant term.

Equation 5 was developed from the concept that the transport of a drug from the site of application to its site of action depends in a nonlinear manner on the lipophilicity of the drug, and that the binding affinity to its biological counterpart, such as an enzyme or a receptor, depends on the lipophilicity, the electronic properties and other free-energy-related properties. Equation 5 combines the description of both processes in one mathematical model. In addition to introducing a parabolic term for the nonlinear lipophilicity dependence and combining different physicochemical properties in one equation, Hansch and Fujita defined lipophilicity parameters π of a substituent X (Eqn 6) in the same manner as Hammett had defined the electronic parameter σ (Eqn 7) about 30 years earlier. The partition coefficient P in equation 6 is an equilibrium constant, similar to the dissociation or reaction constant K in equation 7. The absence of a reaction term ρ in equation 6 is explained by the fact that all π values refer to the n-octanol/water system.
\[ \pi_{X} = \log P_{RX} - \log P_{RH} \qquad (6) \]

\[ \rho\sigma = \log K_{RX} - \log K_{RH} \qquad (7) \]

With the help of these definitions it was possible to use tabulated values instead of measured values. For the data set described by equation 4, equations 8 (Figure 3) and 9 ($E_{s}^{meta}$ = steric parameter for meta-substituents) could be derived. All parameters that are relevant in a QSAR study are presented and discussed in Figure 3.

log 1/C = (±0.19) π (±0.34) σ (±0.17) $E_{s}^{meta}$ (±0.24) (9)
(n = 22; r = 0.959; s = 0.173; F = 69.24; $Q^{2}$ = 0.869; $S_{PRESS}$ = 0.222)
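The definitions in equations 5 and 6 can be sketched in a few lines. The example below computes a substituent π value from two log P values and fits the parabolic Hansch model by ordinary least squares. The log P values are illustrative, the activity data are synthetic and noise-free, and the function names `pi_substituent` and `fit_hansch` are hypothetical helpers introduced only for this sketch.

```python
import numpy as np

def pi_substituent(log_p_rx, log_p_rh):
    """Lipophilicity parameter of substituent X (Eqn 6): pi_X = log P(RX) - log P(RH)."""
    return log_p_rx - log_p_rh

def fit_hansch(log_p, sigma, log_inv_c):
    """Least-squares fit of Eqn 5: log 1/C = a (log P)^2 + b log P + c sigma + k."""
    log_p = np.asarray(log_p, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    # One column per term of the model; the column of ones carries the constant k.
    X = np.column_stack([log_p ** 2, log_p, sigma, np.ones_like(log_p)])
    coef, *_ = np.linalg.lstsq(X, np.asarray(log_inv_c, dtype=float), rcond=None)
    return coef  # a, b, c, k

# Illustrative log P values for a substituted and an unsubstituted parent compound
pi_x = pi_substituent(2.84, 2.13)

# Synthetic data generated from made-up coefficients (not from the paper)
log_p = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0])
sigma = np.array([0.0, 0.2, -0.1, 0.4, 0.1, -0.2, 0.3, 0.0])
a, b, c, k = -0.25, 1.5, -0.8, 3.0
log_inv_c = a * log_p ** 2 + b * log_p + c * sigma + k

a_fit, b_fit, c_fit, k_fit = fit_hansch(log_p, sigma, log_inv_c)
print(round(pi_x, 2), np.round([a_fit, b_fit, c_fit, k_fit], 3))
```

With real data the fit would be accompanied by the statistics quoted in the text (n, r, s, F), which this sketch leaves out.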

Fig. 3. Equation 8 describes a quantitative relationship between the antiadrenergic activities of compound 2.

Equations 8 and 9 demonstrate the superiority of Hansch analysis as compared with Free Wilson analysis. Only a few properties are needed to correlate the biological activities, and the model can be interpreted directly in physicochemical terms. The results of the Free Wilson analysis are confirmed in all details, but predictions can also be made for compounds with other substituents, for example for X = ethyl or CF3. On the other hand, predictions that are too far outside the range of the investigated parameters, such as for tert-Bu, -OH or -SO2NH2, will most probably fail because of the narrow chemical relationship among the investigated substituents and the very different nature of these chemical groups, in size or in their hydrogen bond donor and acceptor properties. For such predictions, much more heterogeneous substituents have to be included in the derivation of the QSAR model.

The fact that different models can be derived for the same data set frequently presents a dilemma in Hansch analysis. One can never be sure that a certain QSAR model is the correct one for the data set. On the other hand, different models correspond to different working hypotheses. Proposals for the synthesis of new analogs can then be made in subsequent steps, which allow discrimination between these models.

Free Wilson group contributions for every substituent can be derived from equations 8 and 9, which clearly indicates the close theoretical relationship between Free Wilson analysis and linear Hansch analysis. Correspondingly, both approaches can be used in one model, the so-called mixed approach (Eqn 10).

\[ \log 1/C = a (\log P)^{2} + b \log P + c\sigma + \dots + \sum a_{i} + k \qquad (10) \]

Equation 10 combines the advantages of Hansch and Free Wilson analysis and widens the applicability of both methods.
Physicochemical parameters describe parts of the molecules with broad structural variation, whereas indicator variables $a_{i}$ (Free Wilson type variables) encode the effect of structural variations that cannot be described otherwise [4].

General Scheme of a QSAR Study

The chemoinformatics methods used in building QSAR models can be divided into three groups: extracting descriptors from the molecular structure, choosing those that are informative in the context of the analyzed activity and, finally, using the values of the descriptors as independent variables to define a mapping that correlates them with the activity in question. The typical QSAR system realizes these phases as depicted in Figure 4.
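The three phases can be sketched as a small pipeline. This is only a schematic outline under simplifying assumptions: the descriptor matrix is random stand-in data rather than descriptors computed from structures, selection is a plain ranking by absolute Pearson correlation with the activity, and the mapping is an ordinary least-squares linear model. The function names are hypothetical.

```python
import numpy as np

def select_descriptors(X, y, k):
    """Phase 2: keep the k descriptors with the largest |Pearson r| to the activity."""
    r = np.array([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])])
    return sorted(np.argsort(r)[-k:].tolist())

def fit_linear(X, y):
    """Phase 3: least-squares linear mapping from descriptors to activity (with intercept)."""
    A = np.column_stack([X, np.ones(len(X))])
    w, *_ = np.linalg.lstsq(A, y, rcond=None)
    return w

def predict(w, X):
    """Apply the fitted linear mapping to new descriptor vectors."""
    return np.column_stack([X, np.ones(len(X))]) @ w

# Phase 1 stand-in: a random matrix of 200 "compounds" x 6 "descriptors"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
# Synthetic activity that depends only on descriptors 1 and 4
y = 2.0 * X[:, 1] - 1.0 * X[:, 4] + 5.0

cols = select_descriptors(X, y, k=2)
w = fit_linear(X[:, cols], y)
print(cols, np.round(w, 3))
```

In practice each phase would be replaced by one of the methods surveyed in the following sections, and the model would be evaluated on compounds not used for fitting.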

Fig. 4. Main stages of a QSAR study. The molecular structure is encoded using numerical descriptors. The set of descriptors is pruned to select the most informative ones. The activity is derived as a function of the selected descriptors.

Generation of Molecular Descriptors from Structure

Small-molecule compounds are defined by their structure, encoded as a set of atoms and the covalent bonds between them. However, the structure cannot be used directly for creating a structure-activity mapping, for reasons stemming from both chemistry and computer science.

First, the chemical structure does not usually contain in an explicit form the information that relates to activity. This information has to be extracted from the structure. Various rationally designed molecular descriptors accentuate different chemical properties implicit in the structure of the molecule. Such properties may correlate more directly with the activity; they range from physicochemical and quantum-chemical to geometrical and topological features.

The second, more technical reason that guides the use and development of molecular descriptors stems from the paradigm of feature space prevailing in statistical data analysis. Most methods employed to predict activity require as input numerical feature vectors of uniform length for all molecules. Chemical structures of compounds are diverse in size and nature and as such do not fit into this model directly. To circumvent this obstacle, molecular descriptors convert the structure into well-defined sets of numerical values.

Selection of Relevant Molecular Descriptors

Many applications are capable of generating hundreds or thousands of different molecular descriptors. Typically, only some of them are significantly correlated with the activity. Furthermore, many of the descriptors are intercorrelated.
This has negative effects on several aspects of QSAR analysis. Some statistical methods require that the number of compounds be significantly greater than the number of descriptors, so using large descriptor sets would require large datasets. Other methods, while capable of handling datasets with large descriptor-to-compound ratios, nonetheless suffer from loss of accuracy, especially for compounds unseen during the preparation of the model. A large number of descriptors also affects the interpretability of the final model. To tackle these problems, a wide range of methods for automatically narrowing the set of descriptors to the most informative ones is used in QSAR analysis.

Mapping the Descriptors to Activity

Once the relevant molecular descriptors are computed and selected, the final task of creating a function between their values and the analyzed activity can be carried out. The value quantifying the activity is expressed as a function of the values of the descriptors. The most accurate mapping function from some wide family of functions is usually fitted based on the information available in the training set, i.e. the compounds for which the activity is known. A wide range of mapping function families can be used, including linear and non-linear ones, and many methods can be employed for carrying out the training to obtain the optimal function.

Molecular Descriptors

Molecular descriptors map the structure of the compound into a set of numerical or binary values representing

various molecular properties that are deemed to be important for explaining activity. Two broad families of descriptors can be distinguished based on their dependence on information about the 3D orientation and conformation of the molecule.

2D QSAR Descriptors

The broad family of descriptors used in the 2D QSAR approach shares the common property of being independent of the 3D orientation of the compound. These descriptors range from simple measures of the entities constituting the molecule, through its topological and geometrical properties, to computed electrostatic and quantum-chemical descriptors or advanced fragment-counting methods.

Constitutional Descriptors

Constitutional descriptors capture properties of the molecule that are related to the elements constituting its structure. These descriptors are fast and easy to compute. Examples of constitutional descriptors include molecular weight, the total number of atoms in the molecule and the numbers of atoms of different identity. A number of properties relating to bonds are also used, including the total numbers of single, double, triple or aromatic bonds, as well as the number of aromatic rings.

Electrostatic and Quantum-Chemical Descriptors

Electrostatic descriptors capture information on the electronic nature of the molecule. These include descriptors containing information on atomic net and partial charges [17]. Descriptors for the highest negative and positive charges are also informative, as is molecular polarizability [18]. Partial negatively or positively charged solvent-accessible atomic surface areas have also been used as informative electrostatic descriptors for modeling intermolecular hydrogen bonding [19]. The energies of the highest occupied and lowest unoccupied molecular orbitals form useful quantum-chemical descriptors, as do derivative quantities such as absolute hardness [20,21].
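The constitutional descriptors described earlier in this section are simple enough to sketch directly. The example below derives molecular weight, total atom count and per-element atom counts from a molecular formula string. It is a minimal sketch: it assumes a flat formula without parentheses or charges, the atomic weight table is deliberately small and rounded, and bond-based counts are omitted because they require the full structure.

```python
import re

# Rounded standard atomic weights for a few common elements; extend as needed.
ATOMIC_WEIGHT = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999,
                 "F": 18.998, "S": 32.06, "Cl": 35.45, "Br": 79.904}

def constitutional_descriptors(formula):
    """Return molecular weight, total atom count and per-element counts for a flat formula."""
    counts = {}
    # Each match is an element symbol followed by an optional count, e.g. "C9" or "O".
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    weight = sum(ATOMIC_WEIGHT[symbol] * n for symbol, n in counts.items())
    return {"molecular_weight": weight,
            "n_atoms": sum(counts.values()),
            "counts": counts}

descriptors = constitutional_descriptors("C9H8O4")  # aspirin
print(descriptors)  # molecular weight of about 180.16, 21 atoms in total
```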
Topological Descriptors

Topological descriptors treat the structure of the compound as a graph, with atoms as vertices and covalent bonds as edges. Based on this approach, many indices quantifying molecular connectivity have been defined, starting with the Wiener index [22], which counts the total number of bonds in the shortest paths between all pairs of non-hydrogen atoms. Other topological descriptors include the Randić indices χ [23], defined as sums of geometric averages of edge degrees of atoms within paths of given lengths, Balaban's J index [24] and the Schultz index [25]. Information about valence electrons can be included in topological descriptors, e.g. in the Kier and Hall indices $\chi^{v}$ [26] or the Gálvez topological charge indices [27]. The former use geometric averages of valence connectivities along paths. The latter measure topological valences of atoms and the net charge transfer between pairs of atoms separated by a given number of bonds.

Descriptors combining connectivity information with other properties are also available, e.g. the BCUT descriptors, which take the form of eigenvalues of an atom connectivity matrix with atom charge, polarizability or H-bond potential values on the diagonal and additional terms off the diagonal. Similarly, the topological sub-structural molecular design (TOSS-MODE/TOPS-MODE) descriptors [31,32] rely on spectral moments of the bond adjacency matrix amended with information on, e.g., bond polarizability. The atom-type electrotopological (E-state) indices [33,34] use electronic and topological organization to define the intrinsic atom state and the perturbations of this state induced by other atoms. This information is gathered individually for a wide range of atom types to form a set of indices.

Geometrical Descriptors

Geometrical descriptors rely on the spatial arrangement of the atoms constituting the molecule. These descriptors include information on the molecular surface obtained from atomic van der Waals areas and their overlap [35]. Molecular volume may be obtained from atomic van der Waals volumes [36].
Principal moments of inertia and gravitational indices [37] also capture information on the spatial arrangement of the atoms in the molecule. Shadow areas, obtained by projection of the molecule onto its two principal axes, are also used [38]. Another geometrical descriptor is the total solvent-accessible surface area [39].

Fragment-Based Descriptors and Molecular Fingerprints

The family of descriptors relying on substructural motifs is often used, especially for rapid screening of very large databases. The BCI fingerprints [40] are derived as bits describing the presence or absence in the molecule of certain fragments, including atoms with their nearest neighborhoods, atom pairs and sequences, or ring-based fragments. A similar approach is present in the basic set of 166 MDL keys. However, other variants of the MDL keys are also available, including extended sets of keys or compact sets. The latter are the results of dedicated pruning strategies [41] or elimination methods, e.g. the fast random elimination of descriptors/substructure keys (FRED/SKEYS) [42].

The recently introduced Hologram QSAR (HQSAR) approach is based on counting the number of occurrences of certain sub-structural paths of functional groups. For each group, a cyclic redundancy code is calculated, which serves as a hashing function for partitioning the sub-structural motifs into the bins of a hash table. The numbers of elements in the bins form a hologram [43,44].
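The binning step of the hologram can be sketched as follows. This is a simplified illustration, not the HQSAR implementation: fragment enumeration is skipped and the fragments are given as hypothetical strings, Python's `zlib.crc32` stands in for the cyclic redundancy code, and the hologram length of 16 is an arbitrary choice for the example.

```python
import zlib

def molecular_hologram(fragments, length=16):
    """Count fragments per bin; the bin index is the CRC32 of the fragment string modulo length."""
    hologram = [0] * length
    for fragment in fragments:
        bin_index = zlib.crc32(fragment.encode("utf-8")) % length
        hologram[bin_index] += 1
    return hologram

# Hypothetical fragment strings; a real system would enumerate them from the structure.
fragments = ["C-C", "C-C", "C-O", "C=O", "C-C-O", "C-C=O"]
hologram = molecular_hologram(fragments)
print(hologram)
```

Identical fragments always fall into the same bin, so the hologram records occurrence counts, while different fragments may collide in one bin because the number of bins is finite.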

The Daylight fingerprints are a natural extension of the fragment-based descriptors, eliminating the reliance on a pre-defined list of sub-structural motifs. The fingerprint for each molecule is a string of bits. However, a structural motif in the molecule does not correspond to a single bit but leads, through a hashing function, to a pattern of bits that are added to the fingerprint with a logical OR operation. The bits in different patterns may overlap, owing to the large number of possible patterns and the finite length of the bit string. Thus, the fact that a bit or several bits are set in the fingerprint cannot be interpreted as proof of a pattern's presence. However, if one of the bits corresponding to a given pattern is not set, this guarantees that the pattern is not present in the molecule. This allows for rapid filtering of the molecules that do not possess certain structural motifs. The patterns are generated individually for each molecule and describe atoms with their neighborhoods and paths of up to 7 bonds. Approaches other than hashed fingerprints have also been proposed to circumvent the problem of a pre-defined sub-structure library, e.g. an algorithm for optimal discovery of frequent structural fragments relevant to a given activity [45].

3D QSAR Descriptors

The 3D QSAR methodology is much more computationally complex than the 2D QSAR approach. In general, it involves several steps to obtain numerical descriptors of the compound structure. First, the conformation of the compound has to be determined, either from experimental data or by molecular mechanics, and then refined by minimizing the energy [46,47]. Next, the conformers in the dataset have to be uniformly aligned in space. Finally, the space with the immersed conformer is probed computationally for various descriptors. Some methods independent of the compound alignment have also been developed.
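The screening property of the hashed fingerprints described earlier in this section (a set bit pattern does not prove presence, but an unset bit proves absence) can be illustrated with a small sketch. The hashing scheme here is an assumption made for the example: SHA-256 stands in for the fingerprinting hash, each pattern sets three bits in a 64-bit string, and patterns are given as hypothetical strings rather than enumerated from a structure.

```python
import hashlib

def pattern_bits(pattern, n_bits=64, bits_per_pattern=3):
    """Map one structural pattern to a small set of bit positions via hashing."""
    digest = hashlib.sha256(pattern.encode("utf-8")).digest()
    return {int.from_bytes(digest[2 * i:2 * i + 2], "big") % n_bits
            for i in range(bits_per_pattern)}

def fingerprint(patterns, n_bits=64):
    """OR the bit patterns of all structural patterns into one integer bit string."""
    fp = 0
    for pattern in patterns:
        for bit in pattern_bits(pattern, n_bits):
            fp |= 1 << bit
    return fp

def may_contain(fp, pattern, n_bits=64):
    """False means the pattern is certainly absent; True means it may be present."""
    return all((fp >> bit) & 1 for bit in pattern_bits(pattern, n_bits))

# Hypothetical path patterns of one molecule
molecule_patterns = ["C-C", "C-O", "C=O", "C-C-O"]
fp = fingerprint(molecule_patterns)

print(may_contain(fp, "C-O"))  # True: every pattern of the molecule passes the screen
print(may_contain(fingerprint([]), "C-N"))  # False: an empty fingerprint excludes any pattern
```

A molecule that passes this screen still has to be checked by an exact substructure search, since overlapping bit patterns can produce false positives.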
Alignment-Dependent 3D QSAR Descriptors

The group of methods that require molecule alignment prior to the calculation of descriptors is strongly dependent on information about the receptor for the modeled ligand. Where such data are available, the alignment can be guided by studying the receptor-ligand complexes. Otherwise, purely computational methods for superimposing the structures in space have to be used [48,49]. These methods rely, e.g., on atom-atom or substructure-substructure mapping.

Comparative Molecular Field Analysis

The Comparative Molecular Field Analysis (CoMFA) [50] uses electrostatic (Coulombic) and steric (van der Waals) energy fields defined by the inspected compound. The aligned molecule is placed in a 3D grid. At each point of the grid lattice a probe atom with unit charge is placed, and the potentials (Coulomb and Lennard-Jones) of the energy fields are computed. These then serve as descriptors in further analysis, typically using partial least squares regression. This analysis allows for identifying the structure regions positively and negatively related to the activity in question.

Comparative Molecular Similarity Indices Analysis

The Comparative Molecular Similarity Indices Analysis (CoMSIA) [51] is similar to CoMFA in that a probe atom is moved throughout the regular grid lattice in which the molecules are immersed. The similarity between the probe atom and the analyzed molecule is calculated. Compared to CoMFA, CoMSIA uses a different potential function, namely a Gaussian-type function. Steric, electrostatic and hydrophobic properties are then calculated; hence the probe atom has unit hydrophobicity as an additional property. The use of a Gaussian-type potential function instead of the Lennard-Jones and Coulombic functions allows for accurate information at grid points located within the molecule.
In CoMFA, unacceptably large values are obtained at these points owing to the nature of the potential functions, and arbitrary cut-offs have to be applied.

Alignment-Independent 3D QSAR Descriptors

Another group of 3D descriptors comprises those invariant to rotation and translation of the molecule in space. Thus, no superposition of compounds is required.

Comparative Molecular Moment Analysis

The Comparative Molecular Moment Analysis (CoMMA) [52] uses second-order moments of the mass distribution and the charge distribution. The moments relate to the center of mass and the center of dipole. The CoMMA descriptors include the principal moments of inertia and the magnitudes of the dipole moment and the principal quadrupole moment. Furthermore, descriptors relating charge to mass distributions are defined, i.e. the magnitudes of the projections of the dipole upon the principal moments of inertia and the displacement between the center of mass and the center of dipole.

Weighted Holistic Invariant Molecular Descriptors

The Weighted Holistic Invariant Molecular (WHIM) [53,54] and Molecular Surface WHIM [55] descriptors provide invariant information by employing principal component analysis (PCA) on the centered co-ordinates of the atoms constituting the molecule. This transforms the molecule into the space that captures the most variance. In this space, several statistics are calculated and serve as directional descriptors, including variance, proportions, symmetry and kurtosis. By combining the directional descriptors, non-directional descriptors are also defined. The contribution of each atom can be weighted by a chemical property, leading to different principal components capturing the variance within the given property. The atoms can be weighted by mass, van der Waals volume, atomic electronegativity, atomic polarizability,

electrotopological index of Kier and Hall and molecular electrostatic potential.

VolSurf

The VolSurf [56,57] approach is based on probing the grid around the molecule with specific probes, e.g. for hydrophobic interactions or for hydrogen bond acceptor or donor groups. The resulting lattice boxes are used to compute descriptors relying on the volumes or surfaces of 3D contours, defined by the same value of the probe-molecule interaction energy. By using various probes and cut-off values for the energy, different molecular properties can be quantified. These include, e.g., molecular volume and surface and hydrophobic and hydrophilic regions. Derivative quantities, e.g. molecular globularity or factors relating the surface of hydrophobic and hydrophilic regions to the surface of the whole molecule, can also be computed. In addition, various geometry-based descriptors are available, including energy minima distances and amphiphilic moments.

Grid-Independent Descriptors

The Grid-Independent Descriptors (GRIND) [58] have been devised to overcome the problems with interpretability common in alignment-independent descriptors. Similarly to VolSurf, the approach probes the grid with specific probes. The regions showing the most favorable energies of interaction are selected, provided that the distances between the regions are large. Next, the probe-based energies are encoded in a way that is independent of the molecule's arrangement. To this end, the distances between the nodes in the grid are discretized into a set of bins. For each distance bin, the nodes with the highest product of energies are stored, and the value of the product serves as the numerical descriptor. In addition, the stored information on the positions of the nodes can be used to track down the exact regions of the molecule relating to the given property.
To extend the molecular information captured by the descriptors, the product of node energies may include not only energies relating to the same probe but also energies from two different probe types.

2D Versus 3D QSAR Approach

It is generally assumed that 3D approaches are superior to 2D in drug design. Yet, studies show that such an assumption may not always hold. For example, the results of conventional CoMFA may often be non-reproducible due to the dependence of the output quality on the orientation of the rigidly aligned molecules on the user's terminal 59,60. Such alignment problems are typical in 3D approaches, and even though some solutions have been proposed, the unambiguous 3D alignment of structurally diverse molecules still remains a difficult task.

Moreover, the distinction between 2D and 3D QSAR approaches is not a crisp one, especially when alignment-independent descriptors are considered. This can be observed when comparing the BCUT with the WHIM descriptors. Both employ a similar algebraic method, i.e. solving an eigenvalue problem for a matrix describing the compound: the connectivity matrix in the case of BCUT descriptors and the covariance matrix of 3D coordinates in the case of WHIM.

There is also a deeper connection between 3D QSAR and one of the 2D methods, the topological approach. It stems from the fact that the geometry of a compound in many cases depends on its topology. An elegant example was provided by Estrada et al., who demonstrated that the dihedral angles of biphenyl as a function of the substituents attached to it can be predicted by topological indices 61. Along the same line, a supposedly typical 3D property, chirality, has been predicted using chiral topological indices 62, constructed by introducing an adequate weight into the topological matrix for the chiral carbons.

Automatic Selection of Relevant Molecular Descriptors

Automatic methods for selecting the best descriptors, or features, to be used in construction of the QSAR model fall into two categories 63.
In the wrapper approach, the quality of descriptor subsets is obtained from constructing and evaluating a series of QSAR models. In filtering, no model is built and features are evaluated using some other criteria.

Filtering Methods

These techniques are applied independently of the mapping method used. They are executed prior to the mapping to reduce the number of descriptors following some objective criteria, e.g. inter-descriptor correlation.

Correlation-Based Methods

Pearson's correlation coefficients may serve as a preliminary filter for discarding intercorrelated descriptors. This can be done e.g. by creating clusters of descriptors having correlation coefficients higher than a certain threshold and retaining only one, randomly chosen member of each cluster 64. Another procedure involves estimating the correlation between pairs of descriptors and, if it exceeds a threshold, randomly discarding one of the descriptors 65. The order in which pairs are evaluated may lead to significantly different results. One popular method is to first rank the descriptors by some criterion and then iteratively browse the set starting from pairs containing the highest-ranking features. One such ranking may be the correlation ranking, based on the correlation coefficient between activity and descriptors. However, correlation ranking is usually used in conjunction with principal component analysis 66,67. Methods using measures of correlation between activity and descriptors other than Pearson's have been used, notably the pair-correlation method.

Methods Based on Information Theory

The information content of a descriptor is defined in terms of the entropy of the descriptor treated as a random variable. Based on this notion, various measures relating the information shared between two descriptors, or between a descriptor and the activity, can be defined. An example of such a measure used in descriptor selection for QSAR is the mutual information. The mutual information, sometimes referred to as information gain, quantifies the reduction of uncertainty, or information content, of the activity variable obtained by knowing the descriptor values. It is used in QSAR to rank the descriptors 71,72. The application of information-theoretic criteria is straightforward when both the descriptors and activity values are categorical. In the case of continuous numerical variables, some discretization schemes have to be applied to approximate the variables. Thus, such criteria are usually used with binary descriptors.

Statistical Criteria

Fisher's ratio, i.e. the ratio of the between-class variance to the within-class variance, can be used to rank the descriptors 73. Next, the correlation between pairs of features is used to reduce the set of descriptors. Another method used in assessing the quality of a descriptor is based on the Kolmogorov-Smirnov statistic 74. As applied to descriptor selection in QSAR 75, it is a fast method that does not rely on knowledge of the underlying distribution and does not require the conversion of descriptors into categorical values. For two classes of activity to be predicted, the method measures the maximal absolute distance between the cumulative distribution functions of the descriptor for the individual activity classes.
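As an illustration of the Kolmogorov-Smirnov criterion, the following minimal sketch computes the maximal absolute distance between the empirical cumulative distribution functions of one descriptor in two activity classes and uses it to rank descriptors. The function names, the dictionary-of-columns input format and the toy descriptor names are assumptions made for this example; they do not come from the cited studies.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximal absolute distance between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # The empirical CDFs only change at observed values, so it is enough
    # to compare them at every distinct value seen in either sample.
    return max(
        abs(bisect_right(a, t) / len(a) - bisect_right(b, t) / len(b))
        for t in set(a) | set(b)
    )


def rank_descriptors_by_ks(descriptor_columns, is_active):
    """Rank descriptors by how well they separate active from inactive compounds.

    descriptor_columns: dict mapping descriptor name -> list of values (assumed format).
    is_active: list of booleans, one per compound.
    """
    scores = {}
    for name, values in descriptor_columns.items():
        active = [v for v, flag in zip(values, is_active) if flag]
        inactive = [v for v, flag in zip(values, is_active) if not flag]
        scores[name] = ks_statistic(active, inactive)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Toy data: "logP" separates the classes perfectly, "noise" does not.
toy_columns = {
    "logP": [1.0, 1.2, 0.9, 3.1, 2.8, 3.3],
    "noise": [0.5, 2.0, 1.1, 0.7, 1.9, 1.0],
}
toy_ranking = rank_descriptors_by_ks(toy_columns, [True, True, True, False, False, False])
```

No discretization of the descriptor and no distributional assumption is needed, which is the practical advantage the text attributes to this criterion. For real work, the same statistic is available as `scipy.stats.ks_2samp`.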
Wrapper Methods

These techniques operate in conjunction with a mapping algorithm 76. The choice of the best subset of descriptors is guided by the error of the mapping algorithm for a given subset, measured e.g. with cross-validation. A schematic illustration of wrapper methods is given in Figure 5. Iteratively, various configurations of selected and discarded descriptors are evaluated by creating a descriptors-to-activity mapping and assessing its prediction accuracy. The final descriptors are those yielding the highest accuracy for a given family of mapping functions.

Fig. 5. Generic scheme for wrapper descriptor selection methods.
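The wrapper idea can be sketched in a few lines: each candidate subset is scored by the cross-validated error of a model built on it. The sketch below is illustrative only. It assumes a 1-nearest-neighbour predictor as the mapping algorithm and leave-one-out cross-validation as the error estimate, and it enumerates subsets exhaustively, which is feasible only for very small descriptor sets. The function names and the toy data are hypothetical.

```python
from itertools import combinations
from math import dist


def loo_error(X, y, subset):
    """Leave-one-out mean squared error of a 1-nearest-neighbour predictor
    that sees only the descriptor columns listed in `subset`."""
    rows = [[row[j] for j in subset] for row in X]
    total = 0.0
    for i, query in enumerate(rows):
        # Predict compound i from its nearest neighbour among all other compounds.
        nearest = min(
            (k for k in range(len(rows)) if k != i),
            key=lambda k: dist(query, rows[k]),
        )
        total += (y[i] - y[nearest]) ** 2
    return total / len(rows)


def best_subset(X, y, size):
    """Evaluate every descriptor subset of the given size and return the one
    with the lowest cross-validated error (exhaustive wrapper search)."""
    n_descriptors = len(X[0])
    return min(
        combinations(range(n_descriptors), size),
        key=lambda subset: loo_error(X, y, subset),
    )


# Toy data: descriptor 0 tracks the activity, descriptor 1 is unrelated to it.
toy_X = [[0.0, 5.0], [1.0, 1.0], [2.0, 4.0], [3.0, 0.0], [4.0, 3.0], [5.0, 2.0]]
toy_y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
toy_choice = best_subset(toy_X, toy_y, size=1)
```

The stochastic and greedy methods discussed next differ only in how they propose subsets; the scoring step, a model built and validated per subset, stays the same.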

Genetic Algorithm

The Genetic Algorithm (GA) is an efficient method for function minimization. In the descriptor selection context, the prediction error of the model built upon a set of features is optimized 77,78. The genetic algorithm mimics natural evolution by modeling a dynamic population of solutions. The members of the population, referred to as chromosomes, encode the selected features. The encoding usually takes the form of bit strings, with bits corresponding to selected features set and the others cleared. Each chromosome leads to a model built using the encoded features. By using the training data, the error of the model is quantified and serves as a fitness function. During the course of evolution, the chromosomes are subjected to crossover and mutation. By allowing survival and reproduction of the fittest chromosomes, the algorithm effectively minimizes the error function in subsequent generations.

The success of GA depends on several factors. The parameters steering the crossover, mutation and survival of chromosomes should be carefully chosen to allow the population to explore the solution space and to prevent early convergence to a homogeneous population occupying a local minimum. The choice of initial population is also important in genetic feature selection. To address this issue, e.g. a method based on Shannon's entropy combined with graph analysis can be used 79. Genetic algorithms have been used in feature selection for QSAR with a range of mapping methods, e.g. Artificial Neural Networks 80,81, the k-Nearest Neighbor method 82 and Random Forest 65.

Simulated Annealing

Simulated Annealing (SA) is another stochastic method for function optimization employed in QSAR 65,83,84. As in the evolutionary approach, the function minimized represents the error of the model built using the subset of descriptors.
The SA algorithm operates iteratively by finding a new subset of descriptors by altering the current best one, e.g. by exchanging some percentage of the features. Next, SA evaluates the prediction error of the new subset and decides whether to adopt the new solution as the current optimal solution. This decision depends on whether the new solution leads to a lower error than the current one. If so, the new solution is used. Otherwise, the solution is not automatically discarded: with a given probability based on the Boltzmann distribution, the worse solution can replace the current, better one. Replacing the solution with a worse one allows the SA method to escape from local minima of the error function, i.e. solutions that cannot be made better without traversing through less-fitted feature subsets. The power of the SA method stems from altering the temperature term in the Boltzmann distribution. At an early stage, when the solution is not yet highly optimized and most prone to encountering local minima, the temperature is high. During the course of the algorithm, the temperature is lowered and acceptance of a worse solution becomes less likely. Thus, even if the obtained minimum is not global, it is nonetheless usually of high quality.

Sequential Feature Forward Selection

While genetic algorithm and simulated annealing rely on a guided random process of exploring the space of feature subsets, Forward Feature Selection 85 operates in a deterministic manner. It implements a greedy search throughout the feature subsets. As a first step, the single feature that leads to the best prediction is selected. Next, each remaining feature is individually added to the current subset and the errors of the resulting models are quantified. The feature that is the best in reducing the error is incorporated into the subset. Thus, in each step a single best feature is added, resulting in a sequence of nested subsets of features.
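The greedy forward search just described can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `forward_selection` and the callback `error_fn`, which stands for "build a model on this subset and return its estimated prediction error", are assumptions of the example, and the toy error function is invented purely to make the sketch runnable.

```python
def forward_selection(n_features, error_fn, n_select):
    """Greedy forward feature selection.

    n_features: total number of candidate descriptors (indexed 0..n_features-1).
    error_fn:   callable taking a list of feature indices and returning the
                prediction error of a model built on them (user-supplied).
    n_select:   number of features to select before stopping.
    Returns (selected, history), where history records each nested subset
    together with its error.
    """
    selected = []
    history = []
    remaining = set(range(n_features))
    while remaining and len(selected) < n_select:
        # Try adding each remaining feature on its own and keep the one
        # that yields the lowest error for the enlarged subset.
        best = min(sorted(remaining), key=lambda f: error_fn(selected + [f]))
        selected.append(best)
        remaining.remove(best)
        history.append((tuple(selected), error_fn(selected)))
    return selected, history


# Toy error function: each feature lowers the error by a fixed, independent amount.
toy_gain = {0: 1.0, 1: 4.0, 2: 2.5, 3: 0.0}


def toy_error(subset):
    return 10.0 - sum(toy_gain[f] for f in subset)


toy_selected, toy_history = forward_selection(4, toy_error, n_select=2)
```

Each entry of `history` extends the previous subset by one feature, which is the nested-subset property noted above. In a real study `error_fn` would wrap model training and cross-validation, so the cost is one model fit per candidate feature per step.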
The procedure stops when a specified number of features has been selected. More elaborate stopping conditions have also been proposed, e.g. based on incorporating an artificial random feature 86. When this feature is selected as the one that most improves the quality of the model, the procedure is stopped. The drawback of forward selection is that if several features collectively are good predictors but each alone is a poor predictor, none of the features may be chosen. Sequential forward feature selection has been used in several QSAR studies 64, 79, 87, 88.

Sequential Backward Feature Elimination

The Backward Feature Elimination 85 is another example of a greedy sequential method that yields nested subsets of features. In contrast to forward selection, the full set of features is used as a starting point. Next, in each step all subsets of features resulting from the removal of a single feature are analyzed for the prediction error. The feature whose removal leads to the model with the lowest error is removed from the current subset. The procedure stops when the given number of features has been dropped. Backward elimination is slower than forward selection, yet often leads to better results.

Recently, a significantly faster variant of backward elimination, the Recursive Feature Elimination 89 method, has been proposed for Support Vector Machines (SVM). In this method, the feature to be removed is chosen based on a single execution of the learning method using all features remaining in the given iteration. The SVM allows for ranking the features according to their contribution to the result. Thus, the least contributing feature can be dropped to form a new, narrowed subset of features. There is no need to train SVMs for each subset as in the original feature elimination method. Variants of the backward feature elimination method have been used in numerous QSAR studies.

Hybrid Methods

In addition to the purely filter-based or wrapper-based descriptor selection procedures, QSAR studies utilize the fusion of the two approaches. A rapid objective method is used as a preliminary filter to narrow the feature set; next, one of the more accurate but slower subjective methods is employed. As an example of such a combination of techniques, a correlation-based test significantly reducing the number of features, followed by genetic algorithm or simulated annealing, can be used 65. A similar procedure which uses greedy sequential feature forward selection is also in use 64.

The feature selection can also be implicit in some mapping methods. For example, the Decision Tree (see section 6.2.4) utilizes only a subset of features in the decision process if a single or only a few descriptors are tested at each node and the overall number of features exceeds the number of those used in the nodes. Similarly, ensembles of Decision Stumps (see section 6.2.6) also operate on a reduced number of descriptors if the number of members in the ensemble is smaller than the number of features.

Mapping the Molecular Structure to Activity

Given the selected descriptors, the final step in building the QSAR model is to derive the mapping between the activity and the values of the features. Simple, yet useful methods model the activity as a linear function of the descriptors. Other, non-linear methods extend this approach to more complex relations. Another important division of the mapping methods is based on the nature of the activity variable. In the case of predicting a continuous value, a regression problem is encountered. When only some categories or classes of the activity need to be predicted, e.g.
partitioning compounds into active and inactive, a classification problem occurs. In regression, the dependent variable is modeled as a function of the descriptors. In the classification framework, the resulting model is defined by a decision boundary separating the classes in the descriptor space. The approaches to QSAR mapping are depicted in Figure 6.

Linear Models

Linear models have been the basis of QSAR analysis since its beginning. They predict the activity as a linear function of molecular descriptors. In general, linear models are easily interpretable and sufficiently accurate for small datasets of similar compounds, especially when the descriptors are carefully selected for a given activity.

Multiple Linear Regression

Multiple Linear Regression (MLR) models the activity to be predicted as a linear function of all descriptors. Based on the examples from the training set, the coefficients of the function are estimated. These free parameters are chosen to minimize the sum of squared errors between the predicted and the actual activity. The main restriction of MLR analysis is the case of a large descriptors-to-compounds ratio or of multicollinear descriptors in general. This makes the problem ill-conditioned and the results unstable.

Multiple linear regression has been among the most widely used mapping methods in QSAR in the last decades. It has been employed in conjunction with genetic descriptor selection for modeling GABA A receptor binding, antimalarial activity, HIV-1 protease inhibition and glycogen phosphorylase inhibition 95, exhibiting lower cross-validation error than partial least squares, both using 4D QSAR fingerprints. MLR has been applied to models in predictive toxicology 96,97, Caco-2 permeability 98 and aqueous solubility 99. In prediction of physicochemical properties 88,93 and of COX-2 inhibition 100, MLR proved significantly worse than non-linear support vector machine, yet comparable or only slightly inferior to neural network.
However, in studies of log P 101 it proved worse than other models, including multi-layer perceptron and Decision Tree.

Fig. 6. Approaches to QSAR mapping: (a) linear regression with activity as a function of two descriptors d1 and d2; (b) binary classification with a linear decision boundary between classes of active (+) and inactive (-) compounds; (c) non-linear regression; (d) non-linear binary classification.
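To make the least-squares fit in MLR and its sensitivity to multicollinearity concrete, the sketch below estimates the coefficients from the normal equations using only the standard library. It is a teaching sketch under stated assumptions: the function names are hypothetical, the singularity check uses a crude absolute tolerance, and production work would normally rely on a numerical library with a more robust solver.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix [A | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:               # crude absolute tolerance (assumption)
            raise ValueError("singular system: collinear descriptors or too few compounds")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):                   # back substitution
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x


def fit_mlr(descriptors, activity):
    """Return [intercept, b1, b2, ...] minimizing the sum of squared errors."""
    X = [[1.0] + list(row) for row in descriptors]   # prepend the intercept column
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * y for row, y in zip(X, activity)) for i in range(p)]
    return solve_linear_system(XtX, Xty)             # normal equations: (X'X) b = X'y


def predict(coefs, row):
    """Predicted activity for one compound's descriptor values."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))


# Toy data generated from activity = 1 + 2*d1 - 3*d2 (hypothetical relationship).
toy_descriptors = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
toy_activity = [1, 3, -2, 0, 2]
toy_coefs = fit_mlr(toy_descriptors, toy_activity)
```

If one descriptor is an exact multiple of another, the matrix X'X becomes singular and `fit_mlr` raises `ValueError`. With nearly collinear descriptors the system can still be solved, but small changes in the data produce large changes in the coefficients, which is the instability referred to above.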

Partial Least Squares

Partial Least Squares (PLS) linear regression is a method suitable for overcoming the problem in MLR related to multicollinear or over-abundant descriptors. The technique assumes that, despite the large number of descriptors, the modeled process is governed by a relatively small number of latent independent variables. PLS tries to indirectly obtain knowledge of the latent variables by decomposing the input matrix of descriptors into two components, the scores and the loadings. The scores are orthogonal and, while capturing the descriptor information, also allow for good prediction of the activity.

The estimation of score vectors is done iteratively. The first one is derived using the first eigenvector of the combined activity-descriptor variance-covariance matrix. Next, the descriptor matrix is deflated by subtracting the information explained by the first score vector. The resulting matrix is used in the derivation of the second score vector, which, followed by consecutive deflation, closes the iteration loop. In each iteration step, the coefficient relating the score vector to the activity is also determined.

PLS has been used successfully with 3D QSAR and HQSAR, e.g. in a study of nicotinic acetylcholine receptor binding modeling 105 and estrogen receptor binding 106. It has also been used in a study involving several datasets, including blood-brain barrier permeability, toxicity, P-glycoprotein transport, multidrug resistance reversal and log D, showing results better than decision trees in most ensembles and SVM. PLS regression has been tested in prediction of COX-2 inhibition 82, but showed lower accuracy than neural network and decision tree. However, in a study of solubility prediction 107 PLS outperformed a neural network. Studies report PLS-based models in melting point and log P prediction 108.
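The iterative score extraction and deflation described for PLS can be sketched for the case of a single activity column (often called PLS1). In that case the leading eigenvector of the combined covariance matrix reduces to the normalized covariance vector between the descriptors and the activity. The code is a minimal NIPALS-style illustration with hypothetical function names and a simple stopping tolerance chosen for the example; it is not the implementation used in the cited studies.

```python
from math import sqrt


def fit_pls1(X, y, n_components):
    """Fit a single-response PLS model by iterative score extraction and deflation."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    E = [[row[j] - x_mean[j] for j in range(p)] for row in X]    # centered descriptors
    f = [v - y_mean for v in y]                                  # centered activity
    components = []
    for _ in range(n_components):
        # Weight vector: direction of maximal covariance between descriptors and activity.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(p)]
        norm = sqrt(sum(v * v for v in w))
        if norm < 1e-12:          # nothing left to explain (tolerance is an assumption)
            break
        w = [v / norm for v in w]
        t = [sum(E[i][j] * w[j] for j in range(p)) for i in range(n)]       # score vector
        tt = sum(v * v for v in t)
        load = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(p)]  # loadings
        c = sum(f[i] * t[i] for i in range(n)) / tt   # coefficient relating score to activity
        # Deflation: remove the information explained by this score vector.
        E = [[E[i][j] - t[i] * load[j] for j in range(p)] for i in range(n)]
        f = [f[i] - c * t[i] for i in range(n)]
        components.append((w, load, c))
    return x_mean, y_mean, components


def predict_pls1(model, row):
    """Predict the activity of one compound by replaying projection and deflation."""
    x_mean, y_mean, components = model
    e = [v - m for v, m in zip(row, x_mean)]
    y_hat = y_mean
    for w, load, c in components:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += c * t
        e = [a - t * b for a, b in zip(e, load)]
    return y_hat


# Toy data with two perfectly collinear descriptors, where MLR would fail.
toy_model = fit_pls1([[1, 2], [2, 4], [3, 6], [4, 8]], [2, 4, 6, 8], n_components=2)
```

In the toy data the second descriptor is exactly twice the first, so only one latent variable exists and the loop stops after a single component. This illustrates why PLS remains usable in the multicollinear setting that makes MLR ill-conditioned.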
Finally, PLS models for BBB permeability 109,110, mutagenicity, toxicity, tumor growth inhibition and anti-HIV activity 111 and aqueous solubility 108,109 have been created.

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) 112 is a classification method that creates a linear transformation of the original feature space into a space that maximizes the between-class separability and minimizes the within-class variance. The procedure operates by solving a generalized eigenvalue problem based on the between-class and within-class covariance matrices. Thus, the number of features has to be significantly smaller than the number of observations to avoid ill-conditioning of the eigenvalue problem. Principal component analysis may be applied prior to LDA to reduce the dimension of the input data and overcome this problem.

LDA has been used to create QSAR models e.g. for prediction of model validity for new compounds 113, where it fared better than PLS but worse than non-linear neural network. However, in BBB permeability prediction, LDA exhibited lower accuracy than a PLS-based method. In predicting antibacterial activity 114,115 it performed worse than neural network. LDA was also used to predict drug likeness 116, showing results slightly better than linear programming machine, a method similar to linear SVM. However, it yielded results worse than non-linear SVM and bagging ensembles. In ecotoxicity prediction 117, LDA performed better than other linear methods and k-NN but was inferior to decision tree.

Non-Linear Models

Non-linear models extend the structure-activity relationships to non-linear functions of the input descriptors. Such models may be more accurate, especially for large and diverse datasets. However, they are usually harder to interpret. Complex non-linear models may also fall prey to overfitting 118, i.e. low generalization to compounds unseen during training.
Bayes Classifier

The Bayes Classifier stems from the Bayes rule relating the posterior probability of a class to its overall probability, the probability of the observations and the likelihood of a class with respect to the observed variables. In the Bayes classifier, the class maximizing the posterior probability is chosen as the prediction result. However, in real problems the likelihoods are not known and have to be estimated. Yet, given a finite number of training examples, such estimation is not trivial. One method to approach this problem is to make an assumption of independence of the likelihoods of a class with respect to different descriptors. This leads to the Naïve Bayes Classifier (NBC). For typical datasets, the estimation of likelihoods with respect to single variables is feasible. The drawback of this method is that the independence assumption usually does not hold.

An extensive study using the Naïve Bayes Classifier in comparison with other methods was conducted 119 using numerous endpoints including COX-2, CDK-2, BBB, dopamine, log D, P-glycoprotein, toxicity and multidrug resistance reversal. In most cases NBC was inferior to other methods; however, it outperformed PLS for BBB and CDK-2, k-NN for P-glycoprotein and COX-2 and decision trees for BBB and P-glycoprotein. NBC has also been shown to be useful in modeling the inhibition of HIV-1 protease 120.

The k-Nearest Neighbor Method

The k-Nearest Neighbor (k-NN) 121 is a simple decision scheme that requires practically no training and is asymptotically optimal, i.e. with increase in training data
