A Review on Computational Methods in Developing Quantitative Structure-Activity Relationship (QSAR)

Navdeep Singh Sethi*
Department of Pharmacy (Pharmaceutical Chemistry), Doaba Group of Colleges, Kharar (Mohali), Punjab, India.
* For correspondence: Navdeep Singh Sethi, navdeep827@yahoo.com

International Journal of Drug Design and Discovery, Volume 3, Issue 3, July-September 2012, p. 815

ABSTRACT: Virtual filtering and screening of combinatorial libraries have recently gained attention as methods complementing high-throughput screening and combinatorial chemistry. These chemoinformatic techniques rely heavily on quantitative structure-activity relationship (QSAR) analysis, a field with established methodology and a successful history. In this review, we discuss the computational methods for building QSAR models. The review starts with a general introduction to QSAR and its theories and identifies the general scheme of a QSAR model. It then focuses on the methodologies for constructing the three main components of a QSAR model, namely the methods for describing the molecular structure of compounds, for selecting informative descriptors and for predicting activity. The review presents both well-established methods and techniques newly introduced into the QSAR domain.

KEYWORDS: QSAR; Free Wilson analysis; Hansch analysis; molecular descriptors (2D descriptors and 3D descriptors); feature selection; machine learning.

Introduction

If we can understand how a molecular structure brings about a particular effect in a biological system, we have a key to unlocking the relationship and using that information to our advantage. Formal development of these relationships on this premise has proved to be the foundation for the development of predictive models. If we take a series of chemicals and attempt to form a quantitative relationship between the biological effect (i.e. the activity) and the chemistry (i.e. the structure) of each of the chemicals, then we are able to form a quantitative structure-activity relationship or QSAR. A less complex, or qualitative, understanding of the role of structure in governing effects, i.e. that a fragment or substructure could result in a certain activity, is often simply termed a structure-activity relationship or SAR. Together SARs and QSARs can be referred to as (Q)SARs and fall within a range of techniques known as in silico approaches. A (Q)SAR comprises three parts: the (activity) data to be modeled and hence predicted, the data with which to model and a method to formulate the model.

The purpose of in silico studies includes the following:
(a) To predict biological activity and physicochemical properties by rational means.
(b) To comprehend and rationalize the mechanism of action within a series of chemicals.

Underlying these aims, the reasons for wishing to develop these models include:
(a) Savings in the cost of product development (e.g. in the pharmaceutical, pesticide and personal products areas).
(b) Predictions could reduce the requirement for lengthy and expensive animal tests.
(c) Reduction, and in some cases even replacement, of animal tests, thus reducing animal use and the associated pain and discomfort to animals.
(d) Promotion of green chemistry, increasing efficiency and eliminating waste by not following leads that are unlikely to be successful [1-3].

Quantitative structure-activity relationships (QSARs) are based on the assumption that the structure of a molecule (i.e. its geometric, steric and electronic properties) must contain the features responsible for its physical, chemical and biological properties, and on the ability to represent the chemical by one or more numerical descriptors.
Within congeneric series of compounds, QSARs correlate affinities of ligands to their binding sites, inhibition constants, rate constants and other biological activities either with certain structural features (Free Wilson analysis) or with atomic, group or molecular properties such as lipophilicity, polarizability, electronic and steric properties (Hansch analysis) [4,5]. Since these two approaches were introduced in 1964, QSAR equations have been used to describe thousands of biological activities within different series of drugs and drug candidates. Enzyme inhibition data in particular have been successfully correlated with physicochemical properties of the ligands. In certain cases, where X-ray structures of proteins became available, the results of QSAR regression models could be interpreted with the additional information from the three-dimensional (3D) structures [6,7].

QSAR studies can reduce the costly failures of drug candidates in clinical trials by filtering the combinatorial libraries. Virtual filtering can eliminate compounds with predicted toxic or poor pharmacokinetic properties early in the pipeline [8,9]. It also allows for narrowing the library to drug-like or lead-like compounds and eliminating the frequent hitters, i.e. compounds that show unspecific activity in several assays and rarely result in leads. Including such considerations at an early stage results in multidimensional optimization, with high activity as an essential but not the only goal.

Considering activity optimization, building target-specific structure-activity models based on identified hits can guide high-throughput screening (HTS) by rapidly screening the library for the most promising candidates. Such focused screening can reduce the number of experiments and allow for the use of more complex, low-throughput assays [12]. Feedback loops of high-throughput and virtual screening, resulting in a sequential screening approach, therefore allow for more rational progress towards high-quality lead compounds [13]. Later in the drug discovery pipeline, accurate QSAR models constructed on the basis of the lead series can assist in optimizing the lead [14].

The importance and difficulty of the tasks described above have inspired many chemoinformatics researchers to borrow from recent developments in various fields including pattern recognition, molecular modeling, machine learning and artificial intelligence. This has resulted in a large family of conceptually different methods being used for creating QSARs. The purpose of this review is to guide the reader through the diversity of the techniques and algorithms for developing a successful QSAR model.
Quantitative Structure-Activity Relationship (QSAR) Theories

All QSAR analyses are based on the assumption of linear additive contributions of the different structural properties or features of a compound to its biological activity, provided that there are no nonlinear dependences of transport or binding on certain physicochemical properties. This simple assumption is supported by dedicated investigations, for example the scoring function of the de novo drug design program LUDI (Eqn 1); in addition, the results of many Free Wilson and Hansch analyses support this concept [15,16].

$$\Delta G_{binding} = \Delta G_{0} + \Delta G_{hb} + \Delta G_{ionic} + \Delta G_{lipo} + \Delta G_{rot} \qquad (1)$$

Overall loss of translational and rotational entropy: $\Delta G_{0} = +5.4$ kJ mol$^{-1}$
Ideal neutral hydrogen bond: $\Delta G_{hb} = -4.7$ kJ mol$^{-1}$
Ideal ionic interaction: $\Delta G_{ionic} = -8.3$ kJ mol$^{-1}$
Lipophilic contact: $\Delta G_{lipo} = \ldots$ J mol$^{-1}$ Å$^{-2}$
Entropy loss per rotatable bond of the ligand: $\Delta G_{rot} = \ldots$ kJ mol$^{-1}$

Equation 1 correlates the free energy of binding, $\Delta G_{binding}$, with a constant term $\Delta G_{0}$ that describes the loss of overall translational and rotational degrees of freedom, and with $\Delta G_{hb}$, $\Delta G_{ionic}$ and $\Delta G_{lipo}$, which are structure-derived energy terms for neutral and charged hydrogen bond interactions and for hydrophobic interactions between the ligand and the protein; $\Delta G_{rot}$ describes the loss of internal rotational degrees of freedom of the ligand. Because of the extrathermodynamic relationship between the free energy $\Delta G$ and the equilibrium constant $K$ (Eqn 2) or the rate constants $k$ ($k_{on}$ = association rate constant, $k_{off}$ = dissociation rate constant of ligand-receptor complex formation), the logarithms of such values can be correlated with binding affinities.
$$\Delta G = -2.303\,RT \log K = -2.303\,RT \log\left(k_{on}/k_{off}\right) \qquad (2)$$

Logarithms of the molar concentrations $C$ that produce a certain biological effect can be correlated with molecular features or with physicochemical properties that are themselves free-energy-related equilibrium constants; normally the logarithms of inverse concentrations ($\log 1/C$) are used, so that larger values are obtained for the more active analogs.

Free Wilson Analysis

In 1964, Free and Wilson derived a mathematical model that describes the presence and absence of certain structural features, i.e. those groups that are chemically modified, by values of 1 and 0, and correlates the resulting structural matrix with biological activity values (Eqn 3).

$$\log(1/C) = \sum_i a_i + \mu \qquad (3)$$

The values $a_i$ in equation 3 are the biological activity group contributions of the substituents $X_1, X_2, \ldots, X_i$ in the different positions $p$ of compound 1 (Figure 1), and $\mu$ is the biological activity value of the reference compound, most often the unsubstituted parent structure of a series [4,6].

Fig. 1. Schematic presentation of a molecule for Free Wilson analysis. A common skeleton bears substituents $X_i$ in different positions $p$; the presence or absence of these substituents is coded by the values 1 and 0, respectively.
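As an illustration of equation 3, the following minimal Python sketch builds the 0/1 structural matrix for a small series and solves for the group contributions $a_i$ and the reference value $\mu$ by ordinary least squares. The compound list, the substituent labels and all activity values are invented for illustration only and are not the data behind any equation in this review; NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical series: each compound is the set of substituents it carries.
# The unsubstituted parent (reference compound) carries none.
compounds = [
    set(),
    {"m-Cl"},
    {"p-Cl"},
    {"m-Me"},
    {"p-Me"},
    {"m-Cl", "p-Cl"},
    {"m-Me", "p-Cl"},
    {"m-Cl", "p-Me"},
]
features = ["m-Cl", "m-Me", "p-Cl", "p-Me"]

# Invented group contributions, used only to generate consistent activities.
true_contrib = {"m-Cl": 0.2, "m-Me": 0.4, "p-Cl": 0.8, "p-Me": 1.2}
mu_true = 7.8
log_inv_c = np.array(
    [mu_true + sum(true_contrib[s] for s in c) for c in compounds]
)

def free_wilson_fit(compounds, features, activities):
    """Least-squares Free Wilson fit; returns (group contributions, mu)."""
    # Structural matrix: 1 if the substituent is present, 0 otherwise,
    # plus a column of ones that carries the reference value mu.
    X = np.array(
        [[1.0 if f in c else 0.0 for f in features] + [1.0] for c in compounds]
    )
    coef, *_ = np.linalg.lstsq(X, activities, rcond=None)
    return dict(zip(features, coef[:-1])), coef[-1]

contrib, mu = free_wilson_fit(compounds, features, log_inv_c)

# Prediction for a new combination of substituents already in the analysis.
pred_m_me_p_me = mu + contrib["m-Me"] + contrib["p-Me"]
```

Because the model contains one coefficient per substituent, a prediction is only possible for combinations of substituents that already occur in the training series; a substituent such as ethyl, absent from `features`, has no coefficient at all.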

Equation 4 describes the antiadrenergic activities of 22 different m-, p- and m,p-disubstituted analogs of the N,N-dimethyl-α-bromophenethylamines 2 (Figure 2), where $C$ is the concentration that causes a 50% reduction of the adrenergic effect of a certain epinephrine dose [4].

$$\log(1/C) = \ldots(\pm 0.50)\,[m\text{-F}]\; \ldots(\pm 0.29)\,[m\text{-Cl}]\; \ldots(\pm 0.27)\,[m\text{-Br}]\; \ldots(\pm 0.50)\,[m\text{-I}]\; \ldots(\pm 0.27)\,[m\text{-Me}]\; \ldots(\pm 0.30)\,[p\text{-F}]\; \ldots(\pm 0.30)\,[p\text{-Cl}]\; \ldots(\pm 0.30)\,[p\text{-Br}]\; \ldots(\pm 0.50)\,[p\text{-I}]\; \ldots(\pm 0.33)\,[p\text{-Me}]\; \ldots(\pm 0.27) \qquad (4)$$

($n = 22$; $r = 0.969$; $s = 0.194$; $F = 16.99$)

Fig. 2. N,N-dimethyl-α-bromophenethylamines (X, Y = H, F, Cl, Br, I, Me).

Here $n$ = number of compounds; $r$ = correlation coefficient, a measure of the relative quality of a model; $s$ = standard deviation, a measure of the absolute quality of a model; $F$ = Fisher value, a measure of statistical significance; $C$ = molar concentration that causes a certain biological effect.

Equation 4 illustrates the main advantage of Free Wilson analysis: only the biological activity values and the chemical structures of the compounds need to be known to derive a QSAR model. On the other hand, Free Wilson analysis has several shortcomings:
(a) At least two different positions of substitution must be chemically modified;
(b) Predictions can only be made for new combinations of substituents already included in the analysis;
(c) Single point determinations, i.e. the single occurrence of a certain structural feature in the whole data set, obscure the statistical results;
(d) Many degrees of freedom are wasted to describe every substituent.

Nevertheless, Free Wilson analysis is often used to see at a glance which physicochemical properties might be important for the biological activity.
In this data set, it can easily be concluded from equation 4 that:
(a) Biological activities increase with increasing lipophilicity (F to Cl, Br and I);
(b) Biological activities increase with electron donor properties (methyl has larger group contributions than the equi-lipophilic Cl);
(c) meta-Substituents have lower group contributions than para-substituents.

Hansch Analysis

Also in 1964, the linear free-energy-related Hansch model (sometimes called the extrathermodynamic approach) was published [4,7].

$$\log(1/C) = a(\log P)^2 + b \log P + c\,\sigma + k \qquad (5)$$

$P$ = n-octanol/water partition coefficient; $\sigma$ = Hammett electronic parameter; $a$, $b$, $c$ = regression coefficients; $k$ = constant term.

Equation 5 was developed from the concept that the transport of a drug from the site of application to its site of action depends in a nonlinear manner on the lipophilicity of the drug, and that the binding affinity to its biological counterpart, such as an enzyme or a receptor, depends on the lipophilicity, the electronic properties and other free-energy-related properties. Equation 5 combines the description of both processes in one mathematical model. In addition to introducing a parabolic term for the nonlinear lipophilicity dependence and combining different physicochemical properties in one equation, Hansch and Fujita defined lipophilicity parameters $\pi$ of a substituent X (Eqn 6) in the same manner as Hammett had defined the electronic parameter $\sigma$ (Eqn 7) about 30 years earlier. The partition coefficient $P$ in equation 6 is an equilibrium constant, similar to the dissociation or reaction constant $K$ in equation 7. Equation 6 contains no reaction term corresponding to $\rho$ in equation 7 because all $\pi$ values refer to the same n-octanol/water system.
$$\pi_X = \log P_{RX} - \log P_{RH} \qquad (6)$$

$$\rho\,\sigma = \log K_{RX} - \log K_{RH} \qquad (7)$$

With the help of these definitions it became possible to use tabulated values instead of measured values. For the data set described by equation 4, equations 8 (Figure 3) and 9 ($E_s^{meta}$ = steric parameter for meta-substituents) could be derived. All parameters that are relevant in a QSAR study are presented and discussed in Figure 3.

$$\log(1/C) = \ldots(\pm 0.19)\,\pi\; \ldots(\pm 0.34)\,\sigma\; \ldots(\pm 0.17)\,E_s^{meta}\; \ldots(\pm 0.24) \qquad (9)$$

($n = 22$; $r = 0.959$; $s = 0.173$; $F = 69.24$; $Q^2 = 0.869$; $s_{PRESS} = 0.222$)
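To make the fitting procedure behind such equations concrete, the sketch below fits a model of the form of equation 5 by multiple linear regression and computes leave-one-out statistics of the kind quoted with equation 9. All data are synthetic, and the definitions used here ($Q^2 = 1 - \mathrm{PRESS}/SS_{tot}$ and $s_{PRESS} = \sqrt{\mathrm{PRESS}/(n - k - 1)}$, with $k$ descriptor terms) are one common convention assumed for illustration, not necessarily the one used to obtain the published values.

```python
import numpy as np

def hansch_design(log_p, sigma):
    """Design matrix with columns (log P)^2, log P, sigma and a constant."""
    log_p = np.asarray(log_p, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    return np.column_stack([log_p**2, log_p, sigma, np.ones_like(log_p)])

def fit(X, y):
    """Ordinary least-squares regression coefficients."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def leave_one_out(X, y):
    """Leave-one-out cross-validation: returns (Q^2, s_PRESS)."""
    n, n_params = X.shape
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit(X[keep], y[keep])          # refit without compound i
        press += (y[i] - X[i] @ coef) ** 2    # squared prediction error for i
    ss_tot = np.sum((y - y.mean()) ** 2)
    q2 = 1.0 - press / ss_tot
    # n_params includes the constant, so n - n_params equals n - k - 1.
    s_press = np.sqrt(press / (n - n_params))
    return q2, s_press

# Synthetic data: parabolic lipophilicity dependence plus an electronic term.
rng = np.random.default_rng(0)
log_p = np.linspace(0.5, 4.5, 12)
sigma = rng.uniform(-0.3, 0.7, size=12)
y = -0.25 * log_p**2 + 1.4 * log_p - 0.9 * sigma + 6.0 + rng.normal(0, 0.05, 12)

X = hansch_design(log_p, sigma)
a, b, c, k = fit(X, y)
q2, s_press = leave_one_out(X, y)
```

A negative fitted `a` corresponds to the parabolic case in which activity passes through an optimum lipophilicity.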

Fig. 3. Equation 8 describes a quantitative relationship for the antiadrenergic activities of compound 2.

Equations 8 and 9 demonstrate the superiority of Hansch analysis over Free Wilson analysis. Only a few properties are needed to correlate the biological activities, and the model can be interpreted directly in physicochemical terms. The results of the Free Wilson analysis are confirmed in all details, but predictions can also be made for compounds with other substituents, for example for X = ethyl or CF$_3$. On the other hand, predictions that lie too far outside the range of the investigated parameters, such as for tert-Bu, -OH or -SO$_2$NH$_2$, will most probably fail because of the narrow chemical relationship among the investigated substituents and the very different nature of these chemical groups, in size or in their hydrogen bond donor and acceptor properties. For such predictions, much more heterogeneous substituents have to be included in the derivation of the QSAR model.

The fact that different models can be derived for the same data set frequently poses a dilemma in Hansch analysis. One can never be sure that a certain QSAR model is the correct one for the data set. On the other hand, different models correspond to different working hypotheses. Proposals for the synthesis of new analogs can then be made in subsequent steps that allow discrimination between these models.

Free Wilson group contributions for every substituent can be derived from equations 8 and 9, which clearly indicates the close theoretical relationship between Free Wilson analysis and linear Hansch analysis. Correspondingly, both approaches can be used in one model, the so-called mixed approach (Eqn 10).

$$\log(1/C) = a(\log P)^2 + b \log P + c\,\sigma + \ldots + \sum_i a_i + k \qquad (10)$$

Equation 10 combines the advantages of Hansch and Free Wilson analysis and widens the applicability of both methods.
Physicochemical parameters describe parts of the molecules with broad structural variation, whereas indicator variables $a_i$ (Free Wilson type variables) encode the effect of structural variations that cannot be described otherwise [4].

General Scheme of a QSAR Study

The chemoinformatics methods used in building QSAR models can be divided into three groups: extracting descriptors from the molecular structure, choosing those that are informative in the context of the analyzed activity and, finally, using the values of the descriptors as independent variables to define a mapping that correlates them with the activity in question. The typical QSAR system realizes these phases as depicted in Figure 4.
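The three phases can be sketched end to end in a few lines of Python. This is only a schematic under strong simplifying assumptions: the descriptor table is random synthetic data standing in for descriptors computed from structures, selection is a plain ranking by absolute correlation with the activity, and the mapping is a linear least-squares fit. The function names are hypothetical and NumPy is assumed.

```python
import numpy as np

def select_descriptors(X, y, n_keep):
    """Stage 2: indices of the n_keep descriptors with the largest |r| to y."""
    r = np.array(
        [abs(np.corrcoef(X[:, j], y)[0, 1]) for j in range(X.shape[1])]
    )
    return np.argsort(r)[::-1][:n_keep]

def fit_linear_map(X, y):
    """Stage 3: least-squares linear mapping from descriptors to activity."""
    A = np.column_stack([X, np.ones(len(X))])   # append a constant term
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(X, coef):
    """Apply the fitted mapping to a descriptor table."""
    return np.column_stack([X, np.ones(len(X))]) @ coef

# Stage 1 stand-in: 60 compounds described by 6 descriptors (synthetic).
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 6))
# Only descriptors 0 and 3 actually influence the synthetic activity.
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 5.0 + rng.normal(0, 0.1, 60)

chosen = select_descriptors(X, y, n_keep=2)
coef = fit_linear_map(X[:, chosen], y)
y_hat = predict(X[:, chosen], coef)
r_fit = np.corrcoef(y, y_hat)[0, 1]
```

In practice each stage would be replaced by one of the methods reviewed in the following sections, and the quality of the mapping would be judged on compounds not used for fitting.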

Fig. 4. Main stages of a QSAR study. The molecular structure is encoded using numerical descriptors. The set of descriptors is pruned to select the most informative ones. The activity is derived as a function of the selected descriptors.

Generation of Molecular Descriptors from Structure

Small-molecule compounds are defined by their structure, encoded as a set of atoms and the covalent bonds between them. However, the structure cannot be used directly for creating a structure-activity mapping, for reasons stemming from both chemistry and computer science.

First, the chemical structure does not usually contain, in explicit form, the information that relates to activity. This information has to be extracted from the structure. Various rationally designed molecular descriptors accentuate different chemical properties implicit in the structure of the molecule, and it is these properties that may correlate more directly with the activity. Such properties range from physicochemical and quantum-chemical to geometrical and topological features.

The second, more technical reason that guides the use and development of molecular descriptors stems from the paradigm of feature space prevailing in statistical data analysis. Most methods employed to predict the activity require as input numerical feature vectors of uniform length for all molecules. Chemical structures of compounds are diverse in size and nature and as such do not fit into this model directly. To circumvent this obstacle, molecular descriptors convert the structure into well-defined sets of numerical values.

Selection of Relevant Molecular Descriptors

Many applications are capable of generating hundreds or thousands of different molecular descriptors. Typically, only some of them are significantly correlated with the activity. Furthermore, many of the descriptors are intercorrelated.
This has negative effects on several aspects of QSAR analysis. Some statistical methods require that the number of compounds is significantly greater than the number of descriptors, so using large descriptor sets would require large datasets. Other methods, while capable of handling datasets with large descriptor-to-compound ratios, nonetheless suffer from loss of accuracy, especially for compounds unseen during the preparation of the model. A large number of descriptors also affects the interpretability of the final model. To tackle these problems, a wide range of methods for automated narrowing of the set of descriptors to the most informative ones is used in QSAR analysis.

Mapping the Descriptors to Activity

Once the relevant molecular descriptors are computed and selected, the final task of creating a function between their values and the analyzed activity can be carried out. The value quantifying the activity is expressed as a function of the values of the descriptors. The most accurate mapping function from some wide family of functions is usually fitted based on the information available in the training set, i.e. the compounds for which the activity is known. A wide range of mapping function families can be used, including linear and non-linear ones, and many methods for carrying out the training to obtain the optimal function can be employed.

Molecular Descriptors

Molecular descriptors map the structure of the compound into a set of numerical or binary values representing various molecular properties that are deemed to be important for explaining activity. Two broad families of descriptors can be distinguished, based on their dependence on information about the 3D orientation and conformation of the molecule.

2D QSAR Descriptors

The broad family of descriptors used in the 2D QSAR approach shares the common property of being independent of the 3D orientation of the compound. These descriptors range from simple measures of the entities constituting the molecule, through its topological and geometrical properties, to computed electrostatic and quantum-chemical descriptors or advanced fragment-counting methods.

Constitutional Descriptors

Constitutional descriptors capture properties of the molecule that are related to the elements constituting its structure. These descriptors are fast and easy to compute. Examples of constitutional descriptors include molecular weight, total number of atoms in the molecule and numbers of atoms of different identity. A number of properties relating to bonds are also used, including total numbers of single, double, triple or aromatic bonds as well as the number of aromatic rings.

Electrostatic and Quantum-Chemical Descriptors

Electrostatic descriptors capture information on the electronic nature of the molecule. These include descriptors containing information on atomic net and partial charges [17]. Descriptors for the highest negative and positive charges are also informative, as is molecular polarizability [18]. Partial negatively or positively charged solvent-accessible atomic surface areas have also been used as informative electrostatic descriptors for modeling intermolecular hydrogen bonding [19]. Energies of the highest occupied and lowest unoccupied molecular orbitals form useful quantum-chemical descriptors, as do derived quantities such as absolute hardness [20,21].
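As a worked example of such a derived quantity, absolute hardness $\eta$ is commonly defined from the ionization potential $I$ and the electron affinity $A$ and, under a Koopmans-type approximation ($I \approx -E_{HOMO}$, $A \approx -E_{LUMO}$), estimated from the frontier orbital energies. The formula and the approximation are stated here as assumptions for illustration, since the review does not spell them out, and the orbital energies in the numerical line are hypothetical.

$$\eta = \tfrac{1}{2}(I - A) \approx \tfrac{1}{2}\left(E_{LUMO} - E_{HOMO}\right)$$

$$E_{HOMO} = -9.0\ \text{eV},\quad E_{LUMO} = -1.0\ \text{eV} \;\Rightarrow\; \eta \approx \tfrac{1}{2}\left(-1.0 - (-9.0)\right)\ \text{eV} = 4.0\ \text{eV}$$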
Topological Descriptors

The topological descriptors treat the structure of the compound as a graph, with atoms as vertices and covalent bonds as edges. Based on this approach, many indices quantifying molecular connectivity have been defined, starting with the Wiener index [22], which counts the total number of bonds in the shortest paths between all pairs of non-hydrogen atoms. Other topological descriptors include the Randić indices $\chi$ [23], defined as sums of geometric averages of edge degrees of atoms within paths of given lengths, Balaban's J index [24] and the Schultz index [25].

Information about valence electrons can be included in topological descriptors, e.g. the Kier and Hall indices $\chi^v$ [26] or the Galvez topological charge indices [27]. The former use geometric averages of valence connectivities along paths. The latter measure topological valences of atoms and the net charge transfer between pairs of atoms separated by a given number of bonds.

Descriptors combining connectivity information with other properties are also available, e.g. BCUT descriptors, which take the form of eigenvalues of an atom connectivity matrix with atom charge, polarizability or H-bond potential values on the diagonal and additional terms off the diagonal. Similarly, the topological sub-structural molecular design (TOSS-MODE/TOPS-MODE) [31,32] relies on spectral moments of the bond adjacency matrix amended with information on, e.g., bond polarizability. The atom-type electrotopological state (E-state) indices [33,34] use electronic and topological organization to define the intrinsic atom state and the perturbations of this state induced by other atoms. This information is gathered individually for a wide range of atom types to form a set of indices.

Geometrical Descriptors

Geometrical descriptors rely on the spatial arrangement of the atoms constituting the molecule. These descriptors include information on the molecular surface obtained from atomic van der Waals areas and their overlap [35]. Molecular volume may be obtained from atomic van der Waals volumes [36].
Principal moments of inertia and gravitational indices [37] also capture information on the spatial arrangement of the atoms in the molecule. Shadow areas, obtained by projecting the molecule onto the planes defined by its principal axes, are also used [38]. Another geometrical descriptor is the total solvent-accessible surface area [39].

Fragment-Based Descriptors and Molecular Fingerprints

The family of descriptors relying on substructural motifs is often used, especially for rapid screening of very large databases. The BCI fingerprints [40] are derived as bits describing the presence or absence in the molecule of certain fragments, including atoms with their nearest neighborhoods, atom pairs and sequences, or ring-based fragments. A similar approach is present in the basic set of 166 MDL keys. However, other variants of the MDL keys are also available, including extended sets of keys or compact sets. The latter are the results of dedicated pruning strategies [41] or elimination methods, e.g. the fast random elimination of descriptors/substructure keys (FRED/SKEYS) [42].

The more recently introduced Hologram QSAR (HQSAR) approach is based on counting the number of occurrences of certain sub-structural paths of functional groups. For each group, a cyclic redundancy code is calculated, which serves as a hashing function for partitioning the sub-structural motifs into the bins of a hash table. The numbers of elements in the bins form a hologram [43,44].
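The binning step can be illustrated with a minimal Python sketch. It assumes the fragments have already been enumerated and are supplied as plain strings (fragment generation from the molecular graph is not shown), uses the standard-library `zlib.crc32` as the cyclic redundancy code, and picks a hologram length of 53 arbitrarily; none of these choices is taken from the HQSAR implementation itself.

```python
import zlib
from collections import Counter

def hologram(fragments, length=53):
    """Count fragment occurrences in `length` bins selected by a CRC32 hash."""
    bins = [0] * length
    for frag, count in Counter(fragments).items():
        # The CRC of the fragment string decides which bin it falls into;
        # different fragments may collide in the same bin.
        index = zlib.crc32(frag.encode("utf-8")) % length
        bins[index] += count
    return bins

# Hypothetical fragment strings for one molecule (with a repeated fragment).
frags = ["C-C", "C-C", "C-O", "C-C-O", "C=O", "C-C=O"]
h = hologram(frags)
```

Every molecule is mapped to a vector of the same length regardless of its size, which is what allows the hologram to be used directly as a descriptor vector.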

The Daylight fingerprints are a natural extension of the fragment-based descriptors that eliminates the reliance on a pre-defined list of sub-structural motifs. The fingerprint for each molecule is a string of bits. However, a structural motif in the molecule does not correspond to a single bit but leads, through a hashing function, to a pattern of bits that is added to the fingerprint with a logical OR operation. The bits in different patterns may overlap, owing to the large number of possible patterns and the finite length of the bit string. Thus, the fact that a bit or several bits are set in the fingerprint cannot be interpreted as proof of a pattern's presence. However, if one of the bits corresponding to a given pattern is not set, this guarantees that the pattern is not present in the molecule. This allows for rapid filtering of the molecules that do not possess certain structural motifs. The patterns are generated individually for each molecule and describe atoms with their neighborhoods and paths of up to 7 bonds. Approaches other than hashed fingerprints have also been proposed to circumvent the problem of a pre-defined sub-structure library, e.g. an algorithm for optimal discovery of frequent structural fragments relevant to a given activity [45].

3D QSAR Descriptors

The 3D QSAR methodology is much more computationally complex than the 2D QSAR approach. In general, it involves several steps to obtain numerical descriptors of the compound structure. First, the conformation of the compound has to be determined, either from experimental data or by molecular mechanics, and then refined by minimizing the energy [46,47]. Next, the conformers in the dataset have to be uniformly aligned in space. Finally, the space with the immersed conformer is probed computationally for various descriptors. Some methods independent of the compound alignment have also been developed.
Alignment-Dependent 3D QSAR Descriptors

The group of methods that require molecule alignment prior to the calculation of descriptors depends strongly on information about the receptor for the modeled ligand. Where such data are available, the alignment can be guided by studying the receptor-ligand complexes. Otherwise, purely computational methods for superimposing the structures in space have to be used [48,49]. These methods rely, e.g., on atom-atom or substructure-substructure mapping.

Comparative Molecular Field Analysis

The Comparative Molecular Field Analysis (CoMFA) [50] uses electrostatic (Coulombic) and steric (van der Waals) energy fields defined by the inspected compound. The aligned molecule is placed in a 3D grid. At each point of the grid lattice a probe atom with unit charge is placed, and the potentials (Coulomb and Lennard-Jones) of the energy fields are computed. These then serve as descriptors in further analysis, typically using partial least squares regression. This analysis allows for identifying structural regions positively and negatively related to the activity in question.

Comparative Molecular Similarity Indices Analysis

The Comparative Molecular Similarity Indices Analysis (CoMSIA) [51] is similar to CoMFA in that a probe atom is moved throughout the regular grid lattice in which the molecules are immersed. The similarity between the probe atom and the analyzed molecule is calculated. Compared to CoMFA, CoMSIA uses a different potential function, namely a Gaussian-type function. Steric, electrostatic and hydrophobic properties are then calculated; hence the probe atom has unit hydrophobicity as an additional property. The use of a Gaussian-type potential function instead of Lennard-Jones and Coulombic functions allows for accurate information at grid points located within the molecule.
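The difference between the two families of functions can be seen numerically in the following sketch, which evaluates a single probe-atom interaction at decreasing distances. The functional forms are the textbook 6-12 Lennard-Jones potential, the Coulomb potential and a Gaussian distance dependence; the parameter values (well depth, minimum-energy distance, charges, attenuation factor 0.3) are illustrative assumptions rather than the settings of any particular CoMFA or CoMSIA implementation.

```python
import numpy as np

def lennard_jones(r, epsilon=0.1, r_min=3.4):
    """6-12 steric potential (kcal/mol); diverges as r approaches 0."""
    ratio = r_min / r
    return epsilon * (ratio**12 - 2.0 * ratio**6)

def coulomb(r, q_probe=1.0, q_atom=0.3, k=332.06):
    """Coulomb potential in kcal/mol for charges in e and r in angstrom."""
    return k * q_probe * q_atom / r

def gaussian_index(r, w_probe=1.0, w_atom=0.3, alpha=0.3):
    """Gaussian-type similarity contribution; bounded for every distance."""
    return -w_probe * w_atom * np.exp(-alpha * r**2)

# Probe-atom distances in angstrom, from deep inside the atom to far away.
r = np.array([0.1, 0.5, 1.0, 2.0, 4.0])
lj = lennard_jones(r)
cb = coulomb(r)
gs = gaussian_index(r)
```

At the shortest distance the Lennard-Jones and Coulomb terms grow without bound, whereas the magnitude of the Gaussian term never exceeds the product of the two weights.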
In CoMFA, by contrast, unacceptably large values are obtained at grid points inside the molecule because of the nature of the Lennard-Jones and Coulomb potential functions, so arbitrary cut-offs have to be applied.

Alignment-Independent 3D QSAR Descriptors

Another group of 3D descriptors comprises those invariant to rotation and translation of the molecule in space. Thus, no superposition of compounds is required.

Comparative Molecular Moment Analysis

The Comparative Molecular Moment Analysis (CoMMA) [52] uses second-order moments of the mass distribution and the charge distribution. The moments relate to the center of mass and the center of dipole. The CoMMA descriptors include principal moments of inertia, magnitudes of the dipole moment and the principal quadrupole moment. Furthermore, descriptors relating charge to mass distribution are defined, i.e. magnitudes of projections of the dipole upon the principal moments of inertia and the displacement between the center of mass and the center of dipole.

Weighted Holistic Invariant Molecular Descriptors

The Weighted Holistic Invariant Molecular (WHIM) [53,54] and Molecular Surface WHIM [55] descriptors provide the invariant information by applying principal component analysis (PCA) to the centered co-ordinates of the atoms constituting the molecule. This transforms the molecule into the space that captures the most variance. In this space, several statistics are calculated and serve as directional descriptors, including variance, proportions, symmetry and kurtosis. By combining the directional descriptors, non-directional descriptors are also defined. The contribution of each atom can be weighted by a chemical property, leading to different principal components capturing variance within the given property. The atoms can be weighted by mass, van der Waals volume, atomic electronegativity, atomic polarizability, the electrotopological index of Kier and Hall and the molecular electrostatic potential.

VolSurf

The VolSurf [56,57] approach is based on probing the grid around the molecule with specific probes, e.g. for hydrophobic interactions or for hydrogen bond acceptor or donor groups. The resulting lattice boxes are used to compute descriptors relying on the volumes or surfaces of 3D contours defined by the same value of the probe-molecule interaction energy. By using various probes and cut-off values for the energy, different molecular properties can be quantified. These include, e.g., molecular volume and surface, and hydrophobic and hydrophilic regions. Derived quantities, e.g. molecular globularity or factors relating the surface of hydrophobic and hydrophilic regions to the surface of the whole molecule, can also be computed. In addition, various geometry-based descriptors are available, including energy minima distances and amphiphilic moments.

Grid-Independent Descriptors

The Grid-Independent Descriptors (GRIND) [58] have been devised to overcome the problems with interpretability that are common in alignment-independent descriptors. Similarly to VolSurf, the method probes the grid with specific probes. The regions showing the most favorable energies of interaction are selected, provided that the distances between the regions are large. Next, the probe-based energies are encoded in a way that is independent of the molecule's arrangement. To this end, the distances between the nodes in the grid are discretized into a set of bins. For each distance bin, the pair of nodes with the highest product of energies is stored, and the value of the product serves as the numerical descriptor. In addition, the stored information on the position of the nodes can be used to track down the exact regions of the molecule relating to the given property.
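The encoding step described above can be sketched as follows. The node coordinates and interaction energies are hypothetical values for a single probe, the bin width of 1 Å and the number of bins are arbitrary choices, and the function name is invented for this illustration; NumPy is assumed.

```python
import numpy as np
from itertools import combinations

def max_product_correlogram(coords, energies, bin_width=1.0, n_bins=10):
    """For each distance bin keep the largest energy product and its node pair."""
    values = np.zeros(n_bins)
    pairs = [None] * n_bins
    for i, j in combinations(range(len(coords)), 2):
        distance = np.linalg.norm(coords[i] - coords[j])
        b = int(distance // bin_width)      # index of the distance bin
        if b >= n_bins:
            continue                        # pair is farther apart than covered
        product = energies[i] * energies[j]
        if pairs[b] is None or product > values[b]:
            values[b] = product
            pairs[b] = (i, j)               # remember which nodes produced it
    return values, pairs

# Four hypothetical nodes with favorable (negative) interaction energies.
coords = np.array([[0.0, 0.0, 0.0],
                   [3.0, 0.0, 0.0],
                   [0.0, 4.0, 0.0],
                   [3.0, 4.0, 0.0]])
energies = np.array([-2.0, -1.0, -3.0, -0.5])

values, pairs = max_product_correlogram(coords, energies)

# Translating the whole set of nodes leaves the descriptors unchanged.
shifted, _ = max_product_correlogram(coords + np.array([1.5, -2.0, 0.25]), energies)
```

Because only inter-node distances enter the calculation, the resulting vector does not depend on how the molecule is positioned, while `pairs` retains the link back to specific regions of the molecule.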
To extend the molecular information captured by the descriptors, the product of node energies may include not only energies relating to the same probe but also energies from two different probe types.

2D Versus 3D QSAR Approach

It is generally assumed that 3D approaches are superior to 2D approaches in drug design. Yet studies show that this assumption may not always hold. For example, the results of conventional CoMFA may often be non-reproducible because the quality of the output depends on the orientation of the rigidly aligned molecules on the user's terminal [59,60]. Such alignment problems are typical of 3D approaches, and even though some solutions have been proposed, the unambiguous 3D alignment of structurally diverse molecules remains a difficult task.

Moreover, the distinction between 2D and 3D QSAR approaches is not a crisp one, especially when alignment-independent descriptors are considered. This can be observed when comparing the BCUT descriptors with the WHIM descriptors. Both employ a similar algebraic method, i.e. solving an eigenvalue problem for a matrix describing the compound: the connectivity matrix in the case of BCUT descriptors and the covariance matrix of 3D coordinates in the case of WHIM.

There is also a deeper connection between 3D QSAR and one of the 2D methods, the topological approach. It stems from the fact that the geometry of a compound in many cases depends on its topology. An elegant example was provided by Estrada et al., who demonstrated that the dihedral angles of biphenyl as a function of the substituents attached to it can be predicted by topological indices [61]. Along the same lines, a supposedly typical 3D property, chirality, has been predicted using chiral topological indices [62], constructed by introducing an adequate weight into the topological matrix for the chiral carbons.

Automatic Selection of Relevant Molecular Descriptors

Automatic methods for selecting the best descriptors, or features, to be used in the construction of a QSAR model fall into two categories [63].
In the wrapper approach, the quality of descriptor subsets is obtained by constructing and evaluating a series of QSAR models. In filtering, no model is built and features are evaluated using some other criteria.

Filtering Methods

These techniques are applied independently of the mapping method used. They are executed prior to the mapping to reduce the number of descriptors according to some objective criterion, e.g. inter-descriptor correlation.

Correlation-Based Methods

Pearson's correlation coefficients may serve as a preliminary filter for discarding intercorrelated descriptors. This can be done, e.g., by creating clusters of descriptors having correlation coefficients higher than a certain threshold and retaining only one, randomly chosen member of each cluster [64]. Another procedure involves estimating the correlation between a pair of descriptors and, if it exceeds a threshold, randomly discarding one of the descriptors [65]. The order in which pairs are evaluated may lead to significantly different results. One popular method is to first rank the descriptors by some criterion and then iteratively browse the set, starting from pairs containing the highest-ranking features. One such ranking is the correlation ranking, based on the correlation coefficient between activity and descriptors. However, correlation ranking is usually used in conjunction with principal component analysis [66,67]. Measures of correlation between activity and descriptors other than Pearson's have also been used, notably the pair-correlation method.

Methods Based on Information Theory

The information content of a descriptor is defined in terms of the entropy of the descriptor treated as a random variable. Based on this notion, various measures relating the information shared between two descriptors, or between a descriptor and the activity, can be defined. An example of such a measure used in descriptor selection for QSAR is the mutual information. The mutual information, sometimes referred to as information gain, quantifies the reduction of uncertainty, or information content, of the activity variable achieved by knowing the descriptor values. It is used in QSAR to rank the descriptors [71,72]. The application of information-theoretic criteria is straightforward when both the descriptors and the activity values are categorical. In the case of continuous numerical variables, some discretization scheme has to be applied to approximate the variables. Thus, such criteria are usually used with binary descriptors.

Statistical Criteria

Fisher's ratio, i.e. the ratio of the between-class variance to the within-class variance, can be used to rank the descriptors [73]. Next, the correlation between pairs of features is used to reduce the set of descriptors. Another method used in assessing the quality of a descriptor is based on the Kolmogorov-Smirnov statistic [74]. As applied to descriptor selection in QSAR [75], it is a fast method that does not rely on knowledge of the underlying distribution and does not require the conversion of descriptors into categorical values. For two classes of activity to be predicted, the method measures the maximal absolute distance between the cumulative distribution functions of the descriptor for the individual activity classes.
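As an illustration of the last criterion, the following minimal sketch ranks descriptors by the two-sample Kolmogorov-Smirnov statistic computed from empirical cumulative distribution functions. The function names and the toy data are hypothetical, and the quadratic-time evaluation of the distribution functions is chosen for readability rather than speed.

```python
def ks_statistic(sample_a, sample_b):
    """Maximal absolute distance between the two empirical CDFs."""
    points = sorted(set(sample_a) | set(sample_b))
    d_max = 0.0
    for x in points:
        cdf_a = sum(1 for v in sample_a if v <= x) / len(sample_a)
        cdf_b = sum(1 for v in sample_b if v <= x) / len(sample_b)
        d_max = max(d_max, abs(cdf_a - cdf_b))
    return d_max

def rank_descriptors_by_ks(X, y):
    """Rank descriptor columns of X by how well they separate two classes.

    X: list of rows of descriptor values; y: labels (1 active, 0 inactive).
    Returns (statistic, descriptor_index) pairs, best-separating first.
    """
    scores = []
    for j in range(len(X[0])):
        actives = [row[j] for row, label in zip(X, y) if label == 1]
        inactives = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append((ks_statistic(actives, inactives), j))
    return sorted(scores, reverse=True)

# Toy data (assumed): descriptor 0 separates the classes, descriptor 1 does not.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.9, 2.0], [1.0, 4.5], [1.1, 1.5]]
y = [0, 0, 0, 1, 1, 1]
ranking = rank_descriptors_by_ks(X, y)
```

No binning of the descriptor values is needed, which matches the remark above that the method works directly on continuous descriptors.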
Wrapper Methods

These techniques operate in conjunction with a mapping algorithm [76]. The choice of the best subset of descriptors is guided by the error of the mapping algorithm for a given subset, measured e.g. with cross-validation. A schematic illustration of wrapper methods is given in Figure 5. Iteratively, various configurations of selected and discarded descriptors are evaluated by creating a descriptors-to-activity mapping and assessing its prediction accuracy. The final descriptors are those yielding the highest accuracy for a given family of mapping functions.

Fig. 5. Generic scheme for wrapper descriptor selection methods.
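The generic wrapper scheme can be sketched in a few lines. The choices below are assumptions made only for illustration: a 1-nearest-neighbour classifier stands in for the mapping algorithm, leave-one-out cross-validation measures its error, and all subsets of a fixed size are enumerated exhaustively, which is feasible only for very small descriptor sets.

```python
from itertools import combinations

def loo_error_1nn(X, y, subset):
    """Leave-one-out misclassification rate of a 1-nearest-neighbour model
    restricted to the descriptor indices in `subset`."""
    errors = 0
    for i in range(len(X)):
        best_dist, best_label = None, None
        for k in range(len(X)):
            if k == i:
                continue  # the held-out compound may not vote for itself
            dist = sum((X[i][j] - X[k][j]) ** 2 for j in subset)
            if best_dist is None or dist < best_dist:
                best_dist, best_label = dist, y[k]
        errors += int(best_label != y[i])
    return errors / len(X)

def wrapper_select(X, y, subset_size):
    """Return the descriptor subset with the lowest cross-validated error."""
    candidates = combinations(range(len(X[0])), subset_size)
    return min(candidates, key=lambda subset: loo_error_1nn(X, y, subset))

# Toy data (assumed): only descriptor 0 carries information about the class.
X = [[0.0, 3.0, 1.0], [0.1, 0.0, 5.0], [0.2, 2.0, 9.0],
     [1.0, 3.1, 5.1], [1.1, 0.1, 0.9], [1.2, 2.1, 9.1]]
y = [0, 0, 0, 1, 1, 1]
best_subset = wrapper_select(X, y, subset_size=1)
```

The stochastic and greedy search strategies discussed in this section replace the exhaustive enumeration in `wrapper_select`; the evaluation of each candidate subset stays the same.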

Genetic Algorithm

The Genetic Algorithm (GA) is an efficient method for function minimization. In the descriptor selection context, the prediction error of the model built upon a set of features is optimized [77,78]. The genetic algorithm mimics natural evolution by modeling a dynamic population of solutions. The members of the population, referred to as chromosomes, encode the selected features. The encoding usually takes the form of bit strings, with the bits corresponding to selected features set and the others cleared. Each chromosome leads to a model built using the encoded features. Using the training data, the error of the model is quantified and serves as the fitness function. During the course of evolution, the chromosomes are subjected to crossover and mutation. By allowing survival and reproduction of the fittest chromosomes, the algorithm effectively minimizes the error function in subsequent generations.

The success of a GA depends on several factors. The parameters steering the crossover, mutation and survival of chromosomes should be carefully chosen to allow the population to explore the solution space and to prevent early convergence to a homogeneous population occupying a local minimum. The choice of the initial population is also important in genetic feature selection. To address this issue, e.g. a method based on Shannon's entropy combined with graph analysis can be used [79]. Genetic algorithms have been used in feature selection for QSAR with a range of mapping methods, e.g. Artificial Neural Networks [80,81], the k-Nearest Neighbor method [82] and Random Forest [65].

Simulated Annealing

Simulated Annealing (SA) is another stochastic method for function optimization employed in QSAR [65,83,84]. As in the evolutionary approach, the function minimized represents the error of the model built using the subset of descriptors.
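Before the details of simulated annealing, the genetic algorithm loop described above (bit-string chromosomes, survival of the fittest, crossover and mutation) can be made concrete with a minimal sketch. Everything specific here is an assumption: the parameter values, the truncation-style survival rule, the single elite copy, and above all `toy_error`, a stand-in for the cross-validated error of a real QSAR model.

```python
import random

def genetic_feature_selection(error_fn, n_features, pop_size=20, generations=30,
                              crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Minimize error_fn over bit strings; returns (best_chromosome, history)."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    history = []  # best error seen in each generation
    for _ in range(generations):
        ranked = sorted(population, key=error_fn)
        history.append(error_fn(ranked[0]))
        survivors = ranked[: pop_size // 2]   # the fitter half reproduces
        children = [ranked[0][:]]             # elitism: keep the current best
        while len(children) < pop_size:
            mother, father = rng.sample(survivors, 2)
            if rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = mother[:cut] + father[cut:]
            else:
                child = mother[:]
            # mutation: flip each bit with a small probability
            child = [bit ^ 1 if rng.random() < mutation_rate else bit
                     for bit in child]
            children.append(child)
        population = children
    best = min(population, key=error_fn)
    return best, history + [error_fn(best)]

# Hypothetical stand-in for a model error: distance to a known "informative" mask.
INFORMATIVE = [1, 0, 1, 1, 0, 0, 0, 1]

def toy_error(chromosome):
    return sum(a != b for a, b in zip(chromosome, INFORMATIVE))

best, history = genetic_feature_selection(toy_error, n_features=len(INFORMATIVE))
```

Keeping one unmodified copy of the best chromosome guarantees that the best error never increases between generations; the mutation rate is what keeps the population from becoming homogeneous too early.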
The SA algorithm operates iteratively by generating a new subset of descriptors through altering the current best one, e.g. by exchanging some percentage of the features. Next, SA evaluates the prediction error of the new subset and decides whether to adopt the new solution as the current optimal solution. This decision depends on whether the new solution leads to a lower error than the current one. If so, the new solution is used. Otherwise, however, the new solution is not automatically discarded. With a given probability based on the Boltzmann distribution, the worse solution can replace the current, better one. Replacing the solution with a worse one allows the SA method to escape from local minima of the error function, i.e. solutions that cannot be improved without traversing less-fit feature subsets. The power of the SA method stems from altering the temperature term in the Boltzmann distribution. At an early stage, when the solution is not yet highly optimized and is most prone to encountering local minima, the temperature is high. During the course of the algorithm, the temperature is lowered and the acceptance of a worse solution becomes less likely. Thus, even if the obtained minimum is not global, it is nonetheless usually of high quality.

Sequential Forward Feature Selection

While the genetic algorithm and simulated annealing rely on a guided random process of exploring the space of feature subsets, Forward Feature Selection [85] operates in a deterministic manner. It implements a greedy search through the feature subsets. As a first step, the single feature that leads to the best prediction is selected. Next, each remaining feature is individually added to the current subset and the errors of the resulting models are quantified. The feature that reduces the error the most is incorporated into the subset. Thus, in each step a single best feature is added, resulting in a sequence of nested subsets of features.
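The greedy search that produces these nested subsets can be sketched as follows. The function `forward_selection` and the additive `toy_error` are hypothetical; in practice the error would come from training and validating the chosen mapping method on each candidate subset.

```python
def forward_selection(error_fn, n_features, n_select):
    """Greedy forward search: add the single best feature at every step.

    error_fn(subset) returns the model error for a list of feature indices.
    Returns (selected, trace), where trace holds the nested subsets.
    """
    selected, trace = [], []
    remaining = set(range(n_features))
    while remaining and len(selected) < n_select:
        best_feature = min(sorted(remaining),
                           key=lambda f: error_fn(selected + [f]))
        selected.append(best_feature)
        remaining.remove(best_feature)
        trace.append(list(selected))
    return selected, trace

# Hypothetical error: each feature lowers the error by a fixed, independent gain.
GAIN = [0.1, 0.4, 0.05, 0.3]

def toy_error(subset):
    return 1.0 - sum(GAIN[f] for f in subset)

selected, trace = forward_selection(toy_error, n_features=4, n_select=2)
# selected == [1, 3]; trace == [[1], [1, 3]]
```

With n candidate features, the search evaluates at most n models per step, which is far fewer than the number of all possible subsets, at the price of never revisiting an earlier choice.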
The procedure stops when a specified number of features has been selected. More elaborate stopping conditions have also been proposed, e.g. based on incorporating an artificial random feature [86]. When this feature is about to be selected as the one that improves the quality of the model the most, the procedure is stopped. The drawback of forward selection is that if several features are collectively good predictors but each alone is a poor predictor, none of them may be chosen. Sequential forward feature selection has been used in several QSAR studies [64,79,87,88].

Sequential Backward Feature Elimination

Backward Feature Elimination [85] is another example of a greedy sequential method that yields nested subsets of features. In contrast to forward selection, the full set of features is used as the starting point. Next, in each step all subsets of features resulting from the removal of a single feature are analyzed for the prediction error. The feature whose removal leads to the model with the lowest error is removed from the current subset. The procedure stops when a given number of features has been dropped. Backward elimination is slower than forward selection, yet often leads to better results.

Recently a significantly faster variant of backward elimination, the Recursive Feature Elimination [89] method, has been proposed for Support Vector Machines (SVM). In this method, the feature to be removed is chosen based on a single execution of the learning method using all features remaining in the given iteration. The SVM allows the features to be ranked according to their contribution to the result. Thus, the least contributing feature can be dropped to form a new, narrowed subset of features. There is no need to train SVMs for each subset as in the original feature elimination method. Variants of the backward feature elimination method have been used in numerous QSAR studies.

Hybrid Methods

In addition to purely filter-based or wrapper-based descriptor selection procedures, QSAR studies utilize a fusion of the two approaches. A rapid objective method is used as a preliminary filter to narrow the feature set; next, one of the more accurate but slower subjective methods is employed. As an example of such a combination of techniques, a correlation-based test that significantly reduces the number of features can be followed by a genetic algorithm or simulated annealing [65]. A similar procedure that uses greedy sequential forward feature selection is also in use [64].

Feature selection can also be implicit in some mapping methods. For example, the Decision Tree (see section 6.2.4) utilizes only a subset of features in the decision process if a single or only a few descriptors are tested at each node and the overall number of features exceeds the number of those used in the nodes. Similarly, ensembles of Decision Stumps (see section 6.2.6) also operate on a reduced number of descriptors if the number of members in the ensemble is smaller than the number of features.

Mapping the Molecular Structure to Activity

Given the selected descriptors, the final step in building the QSAR model is to derive the mapping between the activity and the values of the features. Simple yet useful methods model the activity as a linear function of the descriptors. Other, non-linear methods extend this approach to more complex relations. Another important division of the mapping methods is based on the nature of the activity variable. In the case of predicting a continuous value, a regression problem is encountered. When only some categories or classes of the activity need to be predicted, e.g.
partitioning compounds into active and inactive, a classification problem occurs. In regression, the dependent variable is modeled as a function of the descriptors. In the classification framework, the resulting model is defined by a decision boundary separating the classes in the descriptor space. The approaches to QSAR mapping are depicted in Figure 6.

Linear Models

Linear models have been the basis of QSAR analysis since its beginning. They predict the activity as a linear function of molecular descriptors. In general, linear models are easily interpretable and sufficiently accurate for small datasets of similar compounds, especially when the descriptors are carefully selected for a given activity.

Multiple Linear Regression

Multiple Linear Regression (MLR) models the activity to be predicted as a linear function of all descriptors. Based on the examples from the training set, the coefficients of the function are estimated. These free parameters are chosen to minimize the sum of the squared errors between the predicted and the actual activity. The main restriction of MLR analysis is the case of a large descriptors-to-compounds ratio or, more generally, of multicollinear descriptors. This makes the problem ill-conditioned and the results unstable.

Multiple linear regression has been among the most widely used mapping methods in QSAR in recent decades. It has been employed in conjunction with genetic descriptor selection for modeling GABA-A receptor binding, antimalarial activity, HIV-1 protease inhibition and glycogen phosphorylase inhibition [95], exhibiting lower cross-validation error than partial least squares, with both methods using 4D QSAR fingerprints. MLR has been applied to models in predictive toxicology [96,97], Caco-2 permeability [98] and aqueous solubility [99]. In the prediction of physicochemical properties [88,93] and of COX-2 inhibition [100], MLR proved significantly worse than a non-linear support vector machine, yet comparable or only slightly inferior to a neural network.
However, in studies of log P [101] it proved worse than other models, including the multi-layer perceptron and the Decision Tree.

Fig. 6. Approaches to QSAR mapping: (a) linear regression with activity as a function of two descriptors d1 and d2; (b) binary classification with a linear decision boundary between classes of active (+) and inactive (-) compounds; (c) non-linear regression; (d) non-linear binary classification.
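The least-squares estimation behind MLR, and the way it breaks down for collinear descriptors, can be shown with a small self-contained sketch. It solves the normal equations by Gaussian elimination; the function names, the singularity tolerance of 1e-12 and the toy data are assumptions for illustration, and production code would normally rely on a numerical library instead.

```python
def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("singular system: descriptors may be collinear")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x

def fit_mlr(X, y):
    """Return [intercept, w1, ..., wp] minimizing the sum of squared errors."""
    rows = [[1.0] + list(r) for r in X]  # constant column for the intercept
    p = len(rows[0])
    XtX = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    Xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)

# Toy data (assumed): activity = 1 + 2*d1 - 3*d2 without noise.
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
y = [1, 3, -2, 0, 2]
coefficients = fit_mlr(X, y)  # close to [1.0, 2.0, -3.0]
```

If one descriptor is an exact multiple of another, for example `[[1, 2], [2, 4], [3, 6]]`, the normal equations become singular and `fit_mlr` raises `ValueError`; with nearly collinear descriptors the system is solvable but the coefficients become unstable, which is the ill-conditioning referred to above.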

Partial Least Squares

Partial Least Squares (PLS) linear regression is a method suitable for overcoming the problems in MLR related to multicollinear or over-abundant descriptors. The technique assumes that, despite the large number of descriptors, the modeled process is governed by a relatively small number of latent independent variables. PLS tries to obtain knowledge of the latent variables indirectly by decomposing the input matrix of descriptors into two components, the scores and the loadings. The scores are orthogonal and, while capturing the descriptor information, also allow for good prediction of the activity. The estimation of the score vectors is done iteratively. The first one is derived using the first eigenvector of the combined activity-descriptor variance-covariance matrix. Next, the descriptor matrix is deflated by subtracting the information explained by the first score vector. The resulting matrix is used in the derivation of the second score vector which, followed by a consecutive deflation, closes the iteration loop. In each iteration step, the coefficient relating the score vector to the activity is also determined.

PLS has been used successfully with 3D QSAR and HQSAR, e.g. in a study of nicotinic acetylcholine receptor binding modeling [105] and estrogen receptor binding [106]. It has also been used in a study involving several datasets, including blood-brain barrier permeability, toxicity, P-glycoprotein transport, multidrug resistance reversal and log D, showing results better than decision trees in most ensembles and SVM. PLS regression has been tested in the prediction of COX-2 inhibition [82], but showed lower accuracy than a neural network and a decision tree. However, in a study of solubility prediction [107] PLS outperformed a neural network. Studies report PLS-based models in melting point and log P prediction [108].
Finally, PLS models for BBB permeability [109,110], mutagenicity, toxicity, tumor growth inhibition and anti-HIV activity [111], and aqueous solubility [108,109] have been created.

Linear Discriminant Analysis

Linear Discriminant Analysis (LDA) [112] is a classification method that creates a linear transformation of the original feature space into a space that maximizes the between-class separability and minimizes the within-class variance. The procedure operates by solving a generalized eigenvalue problem based on the between-class and within-class covariance matrices. Thus, the number of features has to be significantly smaller than the number of observations to avoid ill-conditioning of the eigenvalue problem. Principal component analysis may be executed prior to applying LDA to reduce the dimension of the input data and overcome this problem.

LDA has been used to create QSAR models, e.g. for prediction of model validity for new compounds [113], where it fared better than PLS but worse than a non-linear neural network. However, in BBB permeability prediction, LDA exhibited lower accuracy than a PLS-based method. In predicting antibacterial activity [114,115] it performed worse than a neural network. LDA was also used to predict drug likeness [116], showing results slightly better than the linear programming machine, a method similar to linear SVM. However, it yielded results worse than non-linear SVM and bagging ensembles. In ecotoxicity prediction [117], LDA performed better than other linear methods and k-NN but was inferior to a decision tree.

Non-Linear Models

Non-linear models extend the structure-activity relationships to non-linear functions of the input descriptors. Such models may be more accurate, especially for large and diverse datasets. However, they are usually harder to interpret. Complex non-linear models may also fall prey to overfitting [118], i.e. low generalization to compounds unseen during training.
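Before moving on to the individual non-linear methods, the linear discriminant described in the previous subsection can be illustrated for its simplest case. For two classes, the generalized eigenvalue problem reduces to Fisher's direction, the inverse within-class scatter matrix applied to the difference of the class means. The sketch below is restricted to two descriptors so that the matrix inverse can be written in closed form; the function names, the midpoint threshold and the toy data are assumptions.

```python
def mean_vector(rows):
    n = len(rows)
    return [sum(r[j] for r in rows) / n for j in range(len(rows[0]))]

def scatter_matrix(rows, mean):
    """Sum of outer products of the deviations from the class mean."""
    p = len(mean)
    S = [[0.0] * p for _ in range(p)]
    for r in rows:
        d = [r[j] - mean[j] for j in range(p)]
        for a in range(p):
            for b in range(p):
                S[a][b] += d[a] * d[b]
    return S

def fisher_lda_2d(actives, inactives):
    """Two-class, two-descriptor LDA; returns (direction w, threshold)."""
    m1, m0 = mean_vector(actives), mean_vector(inactives)
    S1, S0 = scatter_matrix(actives, m1), scatter_matrix(inactives, m0)
    Sw = [[S1[a][b] + S0[a][b] for b in range(2)] for a in range(2)]
    det = Sw[0][0] * Sw[1][1] - Sw[0][1] * Sw[1][0]
    if abs(det) < 1e-12:
        raise ValueError("within-class scatter is singular; reduce the descriptors first")
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # w = Sw^{-1} (m1 - m0), using the closed-form inverse of a 2x2 matrix
    w = [(Sw[1][1] * diff[0] - Sw[0][1] * diff[1]) / det,
         (-Sw[1][0] * diff[0] + Sw[0][0] * diff[1]) / det]
    # assumed decision rule: project the midpoint between the class means
    threshold = sum(w[j] * (m1[j] + m0[j]) / 2 for j in range(2))
    return w, threshold

def predict_active(w, threshold, x):
    return sum(wj * xj for wj, xj in zip(w, x)) > threshold

# Toy data (assumed): two descriptors per compound.
actives = [[2.0, 2.0], [3.0, 2.5], [2.5, 3.5]]
inactives = [[0.0, 0.5], [1.0, 0.0], [0.5, 1.0]]
w, threshold = fisher_lda_2d(actives, inactives)
```

The explicit singularity check mirrors the ill-conditioning noted for LDA: when descriptors are collinear, or more numerous than the data can support, the within-class scatter matrix cannot be inverted reliably.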
Bayes Classifier

The Bayes Classifier stems from the Bayes rule, which relates the posterior probability of a class to its overall (prior) probability, the probability of the observations, and the likelihood of the class with respect to the observed variables. Under the Bayes rule, the class maximizing the posterior probability is chosen as the prediction result. However, in real problems the likelihoods are not known and have to be estimated. Yet, given a finite number of training examples, such estimation is not trivial. One way to approach this problem is to assume independence of the likelihoods of a class with respect to different descriptors. This leads to the Naïve Bayes Classifier (NBC). For typical datasets, the estimation of likelihoods with respect to single variables is feasible. The drawback of this method is that the independence assumption usually does not hold.

An extensive study using the Naïve Bayes Classifier in comparison with other methods was conducted [119] using numerous endpoints, including COX-2, CDK-2, BBB, dopamine, log D, P-glycoprotein, toxicity and multidrug resistance reversal. In most cases NBC was inferior to other methods; however, it outperformed PLS for BBB and CDK-2, k-NN for P-glycoprotein and COX-2, and decision trees for BBB and P-glycoprotein. NBC has also been shown to be useful in modeling the inhibition of HIV-1 protease [120].

The k-Nearest Neighbor Method

The k-Nearest Neighbor (k-NN) method [121] is a simple decision scheme that requires practically no training and is asymptotically optimal, i.e. with increase in training data


More information

Chemical Space. Space, Diversity, and Synthesis. Jeremy Henle, 4/23/2013

Chemical Space. Space, Diversity, and Synthesis. Jeremy Henle, 4/23/2013 Chemical Space Space, Diversity, and Synthesis Jeremy Henle, 4/23/2013 Computational Modeling Chemical Space As a diversity construct Outline Quantifying Diversity Diversity Oriented Synthesis Wolf and

More information

CHAPTER 2. Structure and Reactivity: Acids and Bases, Polar and Nonpolar Molecules

CHAPTER 2. Structure and Reactivity: Acids and Bases, Polar and Nonpolar Molecules CHAPTER 2 Structure and Reactivity: Acids and Bases, Polar and Nonpolar Molecules 2-1 Kinetics and Thermodynamics of Simple Chemical Processes Chemical thermodynamics: Is concerned with the extent that

More information

Statistical Pattern Recognition

Statistical Pattern Recognition Statistical Pattern Recognition Feature Extraction Hamid R. Rabiee Jafar Muhammadi, Alireza Ghasemi, Payam Siyari Spring 2014 http://ce.sharif.edu/courses/92-93/2/ce725-2/ Agenda Dimensionality Reduction

More information

PATTERN CLASSIFICATION

PATTERN CLASSIFICATION PATTERN CLASSIFICATION Second Edition Richard O. Duda Peter E. Hart David G. Stork A Wiley-lnterscience Publication JOHN WILEY & SONS, INC. New York Chichester Weinheim Brisbane Singapore Toronto CONTENTS

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 11 Project

More information

Lecture 9 Evolutionary Computation: Genetic algorithms

Lecture 9 Evolutionary Computation: Genetic algorithms Lecture 9 Evolutionary Computation: Genetic algorithms Introduction, or can evolution be intelligent? Simulation of natural evolution Genetic algorithms Case study: maintenance scheduling with genetic

More information

Coefficient Symbol Equation Limits

Coefficient Symbol Equation Limits 1 Coefficient Symbol Equation Limits Squared Correlation Coefficient R 2 or r 2 0 r 2 N 1 2 ( Yexp, i Ycalc, i ) 2 ( Yexp, i Y ) i= 1 2 Cross-Validated R 2 q 2 r 2 or Q 2 or q 2 N 2 ( Yexp, i Ypred, i

More information

Softwares for Molecular Docking. Lokesh P. Tripathi NCBS 17 December 2007

Softwares for Molecular Docking. Lokesh P. Tripathi NCBS 17 December 2007 Softwares for Molecular Docking Lokesh P. Tripathi NCBS 17 December 2007 Molecular Docking Attempt to predict structures of an intermolecular complex between two or more molecules Receptor-ligand (or drug)

More information

Data Mining in the Chemical Industry. Overview of presentation

Data Mining in the Chemical Industry. Overview of presentation Data Mining in the Chemical Industry Glenn J. Myatt, Ph.D. Partner, Myatt & Johnson, Inc. glenn.myatt@gmail.com verview of presentation verview of the chemical industry Example of the pharmaceutical industry

More information

Structural Bioinformatics (C3210) Molecular Docking

Structural Bioinformatics (C3210) Molecular Docking Structural Bioinformatics (C3210) Molecular Docking Molecular Recognition, Molecular Docking Molecular recognition is the ability of biomolecules to recognize other biomolecules and selectively interact

More information

Machine learning for ligand-based virtual screening and chemogenomics!

Machine learning for ligand-based virtual screening and chemogenomics! Machine learning for ligand-based virtual screening and chemogenomics! Jean-Philippe Vert Institut Curie - INSERM U900 - Mines ParisTech In silico discovery of molecular probes and drug-like compounds:

More information

Application integration: Providing coherent drug discovery solutions

Application integration: Providing coherent drug discovery solutions Application integration: Providing coherent drug discovery solutions Mitch Miller, Manish Sud, LION bioscience, American Chemical Society 22 August 2002 Overview 2 Introduction: exploring application integration

More information

Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases

Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases 241 Chapter 9 Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases 9.1 Introduction As we have seen in the preceding chapters, interpretability plays

More information

Artificial Intelligence (AI) Common AI Methods. Training. Signals to Perceptrons. Artificial Neural Networks (ANN) Artificial Intelligence

Artificial Intelligence (AI) Common AI Methods. Training. Signals to Perceptrons. Artificial Neural Networks (ANN) Artificial Intelligence Artificial Intelligence (AI) Artificial Intelligence AI is an attempt to reproduce intelligent reasoning using machines * * H. M. Cartwright, Applications of Artificial Intelligence in Chemistry, 1993,

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Aqueous solutions. Solubility of different compounds in water

Aqueous solutions. Solubility of different compounds in water Aqueous solutions Solubility of different compounds in water The dissolution of molecules into water (in any solvent actually) causes a volume change of the solution; the size of this volume change is

More information

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction

ECE 521. Lecture 11 (not on midterm material) 13 February K-means clustering, Dimensionality reduction ECE 521 Lecture 11 (not on midterm material) 13 February 2017 K-means clustering, Dimensionality reduction With thanks to Ruslan Salakhutdinov for an earlier version of the slides Overview K-means clustering

More information

Quantitative structure activity relationship and drug design: A Review

Quantitative structure activity relationship and drug design: A Review International Journal of Research in Biosciences Vol. 5 Issue 4, pp. (1-5), October 2016 Available online at http://www.ijrbs.in ISSN 2319-2844 Research Paper Quantitative structure activity relationship

More information

Induction of Decision Trees

Induction of Decision Trees Induction of Decision Trees Peter Waiganjo Wagacha This notes are for ICS320 Foundations of Learning and Adaptive Systems Institute of Computer Science University of Nairobi PO Box 30197, 00200 Nairobi.

More information

Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data

Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Application of a GA/Bayesian Filter-Wrapper Feature Selection Method to Classification of Clinical Depression from Speech Data Juan Torres 1, Ashraf Saad 2, Elliot Moore 1 1 School of Electrical and Computer

More information

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007 Computational Chemistry in Drug Design Xavier Fradera Barcelona, 17/4/2007 verview Introduction and background Drug Design Cycle Computational methods Chemoinformatics Ligand Based Methods Structure Based

More information

Xia Ning,*, Huzefa Rangwala, and George Karypis

Xia Ning,*, Huzefa Rangwala, and George Karypis J. Chem. Inf. Model. XXXX, xxx, 000 A Multi-Assay-Based Structure-Activity Relationship Models: Improving Structure-Activity Relationship Models by Incorporating Activity Information from Related Targets

More information

Interactive Feature Selection with

Interactive Feature Selection with Chapter 6 Interactive Feature Selection with TotalBoost g ν We saw in the experimental section that the generalization performance of the corrective and totally corrective boosting algorithms is comparable.

More information

DISCOVERING new drugs is an expensive and challenging

DISCOVERING new drugs is an expensive and challenging 1036 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 17, NO. 8, AUGUST 2005 Frequent Substructure-Based Approaches for Classifying Chemical Compounds Mukund Deshpande, Michihiro Kuramochi, Nikil

More information

Ultra High Throughput Screening using THINK on the Internet

Ultra High Throughput Screening using THINK on the Internet Ultra High Throughput Screening using THINK on the Internet Keith Davies Central Chemistry Laboratory, Oxford University Cathy Davies Treweren Consultants, UK Blue Sky Objectives Reduce Development Failures

More information

Mining Molecular Fragments: Finding Relevant Substructures of Molecules

Mining Molecular Fragments: Finding Relevant Substructures of Molecules Mining Molecular Fragments: Finding Relevant Substructures of Molecules Christian Borgelt, Michael R. Berthold Proc. IEEE International Conference on Data Mining, 2002. ICDM 2002. Lecturers: Carlo Cagli

More information

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression APPLICATION NOTE QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression GAINING EFFICIENCY IN QUANTITATIVE STRUCTURE ACTIVITY RELATIONSHIPS ErbB1 kinase is the cell-surface receptor

More information

Kd = koff/kon = [R][L]/[RL]

Kd = koff/kon = [R][L]/[RL] Taller de docking y cribado virtual: Uso de herramientas computacionales en el diseño de fármacos Docking program GLIDE El programa de docking GLIDE Sonsoles Martín-Santamaría Shrödinger is a scientific

More information

CS534 Machine Learning - Spring Final Exam

CS534 Machine Learning - Spring Final Exam CS534 Machine Learning - Spring 2013 Final Exam Name: You have 110 minutes. There are 6 questions (8 pages including cover page). If you get stuck on one question, move on to others and come back to the

More information

Practical QSAR and Library Design: Advanced tools for research teams

Practical QSAR and Library Design: Advanced tools for research teams DS QSAR and Library Design Webinar Practical QSAR and Library Design: Advanced tools for research teams Reservationless-Plus Dial-In Number (US): (866) 519-8942 Reservationless-Plus International Dial-In

More information

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function.

Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Bayesian learning: Machine learning comes from Bayesian decision theory in statistics. There we want to minimize the expected value of the loss function. Let y be the true label and y be the predicted

More information

CS145: INTRODUCTION TO DATA MINING

CS145: INTRODUCTION TO DATA MINING CS145: INTRODUCTION TO DATA MINING 4: Vector Data: Decision Tree Instructor: Yizhou Sun yzsun@cs.ucla.edu October 10, 2017 Methods to Learn Vector Data Set Data Sequence Data Text Data Classification Clustering

More information

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees

Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Comparison of Shannon, Renyi and Tsallis Entropy used in Decision Trees Tomasz Maszczyk and W lodzis law Duch Department of Informatics, Nicolaus Copernicus University Grudzi adzka 5, 87-100 Toruń, Poland

More information

Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule

Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule Condensed Graph of Reaction: considering a chemical reaction as one single pseudo molecule Frank Hoonakker 1,3, Nicolas Lachiche 2, Alexandre Varnek 3, and Alain Wagner 3,4 1 Chemoinformatics laboratory,

More information

CS 273 Prof. Serafim Batzoglou Prof. Jean-Claude Latombe Spring Lecture 12 : Energy maintenance (1) Lecturer: Prof. J.C.

CS 273 Prof. Serafim Batzoglou Prof. Jean-Claude Latombe Spring Lecture 12 : Energy maintenance (1) Lecturer: Prof. J.C. CS 273 Prof. Serafim Batzoglou Prof. Jean-Claude Latombe Spring 2006 Lecture 12 : Energy maintenance (1) Lecturer: Prof. J.C. Latombe Scribe: Neda Nategh How do you update the energy function during the

More information

Automated Assignment of Backbone NMR Data using Artificial Intelligence

Automated Assignment of Backbone NMR Data using Artificial Intelligence Automated Assignment of Backbone NMR Data using Artificial Intelligence John Emmons στ, Steven Johnson τ, Timothy Urness*, and Adina Kilpatrick* Department of Computer Science and Mathematics Department

More information

Retrieving hits through in silico screening and expert assessment M. N. Drwal a,b and R. Griffith a

Retrieving hits through in silico screening and expert assessment M. N. Drwal a,b and R. Griffith a Retrieving hits through in silico screening and expert assessment M.. Drwal a,b and R. Griffith a a: School of Medical Sciences/Pharmacology, USW, Sydney, Australia b: Charité Berlin, Germany Abstract:

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Chapter 8: Introduction to Evolutionary Computation

Chapter 8: Introduction to Evolutionary Computation Computational Intelligence: Second Edition Contents Some Theories about Evolution Evolution is an optimization process: the aim is to improve the ability of an organism to survive in dynamically changing

More information

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom verview D-QSAR Definition Examples Features counts Topological indices D fingerprints and fragment counts R-group descriptors ow good are D descriptors in practice? Summary Peter Gedeck ovartis Institutes

More information

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION

SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION SUPERVISED LEARNING: INTRODUCTION TO CLASSIFICATION 1 Outline Basic terminology Features Training and validation Model selection Error and loss measures Statistical comparison Evaluation measures 2 Terminology

More information

Artificial Neural Networks Examination, June 2005

Artificial Neural Networks Examination, June 2005 Artificial Neural Networks Examination, June 2005 Instructions There are SIXTY questions. (The pass mark is 30 out of 60). For each question, please select a maximum of ONE of the given answers (either

More information

Chemometrics. 1. Find an important subset of the original variables.

Chemometrics. 1. Find an important subset of the original variables. Chemistry 311 2003-01-13 1 Chemometrics Chemometrics: Mathematical, statistical, graphical or symbolic methods to improve the understanding of chemical information. or The science of relating measurements

More information

Distance Constraint Model; Donald J. Jacobs, University of North Carolina at Charlotte Page 1 of 11

Distance Constraint Model; Donald J. Jacobs, University of North Carolina at Charlotte Page 1 of 11 Distance Constraint Model; Donald J. Jacobs, University of North Carolina at Charlotte Page 1 of 11 Taking the advice of Lord Kelvin, the Father of Thermodynamics, I describe the protein molecule and other

More information

Cheminformatics platform for drug discovery application

Cheminformatics platform for drug discovery application EGI-InSPIRE Cheminformatics platform for drug discovery application Hsi-Kai, Wang Academic Sinica Grid Computing EGI User Forum, 13, April, 2011 1 Introduction to drug discovery Computing requirement of

More information

Medicinal Chemistry/ CHEM 458/658 Chapter 3- SAR and QSAR

Medicinal Chemistry/ CHEM 458/658 Chapter 3- SAR and QSAR Medicinal Chemistry/ CHEM 458/658 Chapter 3- SAR and QSAR Bela Torok Department of Chemistry University of Massachusetts Boston Boston, MA 1 Introduction Structure-Activity Relationship (SAR) - similar

More information

ECE662: Pattern Recognition and Decision Making Processes: HW TWO

ECE662: Pattern Recognition and Decision Making Processes: HW TWO ECE662: Pattern Recognition and Decision Making Processes: HW TWO Purdue University Department of Electrical and Computer Engineering West Lafayette, INDIANA, USA Abstract. In this report experiments are

More information

Protein-Ligand Docking

Protein-Ligand Docking Protein-Ligand Docking Matthias Rarey GMD - German National Research Center for Information Technology Institute for Algorithms and Scientific Computing (SCAI) 53754Sankt Augustin, Germany rarey@gmd.de

More information

Big Idea #5: The laws of thermodynamics describe the essential role of energy and explain and predict the direction of changes in matter.

Big Idea #5: The laws of thermodynamics describe the essential role of energy and explain and predict the direction of changes in matter. KUDs for Unit 6: Chemical Bonding Textbook Reading: Chapters 8 & 9 Big Idea #2: Chemical and physical properties of materials can be explained by the structure and the arrangement of atoms, ion, or molecules

More information

day month year documentname/initials 1

day month year documentname/initials 1 ECE471-571 Pattern Recognition Lecture 13 Decision Tree Hairong Qi, Gonzalez Family Professor Electrical Engineering and Computer Science University of Tennessee, Knoxville http://www.eecs.utk.edu/faculty/qi

More information

CSC 4510 Machine Learning

CSC 4510 Machine Learning 10: Gene(c Algorithms CSC 4510 Machine Learning Dr. Mary Angela Papalaskari Department of CompuBng Sciences Villanova University Course website: www.csc.villanova.edu/~map/4510/ Slides of this presenta(on

More information

Local Search & Optimization

Local Search & Optimization Local Search & Optimization CE417: Introduction to Artificial Intelligence Sharif University of Technology Spring 2017 Soleymani Artificial Intelligence: A Modern Approach, 3 rd Edition, Chapter 4 Outline

More information

Enduring Understandings & Essential Knowledge for AP Chemistry

Enduring Understandings & Essential Knowledge for AP Chemistry Enduring Understandings & Essential Knowledge for AP Chemistry Big Idea 1: The chemical elements are fundamental building materials of matter, and all matter can be understood in terms of arrangements

More information

Decision Tree Learning Lecture 2

Decision Tree Learning Lecture 2 Machine Learning Coms-4771 Decision Tree Learning Lecture 2 January 28, 2008 Two Types of Supervised Learning Problems (recap) Feature (input) space X, label (output) space Y. Unknown distribution D over

More information

A COMPARATIVE STUDY OF MACHINE-LEARNING-BASED SCORING FUNCTIONS IN PREDICTING PROTEIN-LIGAND BINDING AFFINITY. Hossam Mohamed Farg Ashtawy A THESIS

A COMPARATIVE STUDY OF MACHINE-LEARNING-BASED SCORING FUNCTIONS IN PREDICTING PROTEIN-LIGAND BINDING AFFINITY. Hossam Mohamed Farg Ashtawy A THESIS A COMPARATIVE STUDY OF MACHINE-LEARNING-BASED SCORING FUNCTIONS IN PREDICTING PROTEIN-LIGAND BINDING AFFINITY By Hossam Mohamed Farg Ashtawy A THESIS Submitted to Michigan State University in partial fulfillment

More information

Iterative Laplacian Score for Feature Selection

Iterative Laplacian Score for Feature Selection Iterative Laplacian Score for Feature Selection Linling Zhu, Linsong Miao, and Daoqiang Zhang College of Computer Science and echnology, Nanjing University of Aeronautics and Astronautics, Nanjing 2006,

More information

3.091 Introduction to Solid State Chemistry. Lecture Notes No. 9a BONDING AND SOLUTIONS

3.091 Introduction to Solid State Chemistry. Lecture Notes No. 9a BONDING AND SOLUTIONS 3.091 Introduction to Solid State Chemistry Lecture Notes No. 9a BONDING AND SOLUTIONS 1. INTRODUCTION Condensed phases, whether liquid or solid, may form solutions. Everyone is familiar with liquid solutions.

More information

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008

6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008 MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

More information

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren

Outline Introduction OLS Design of experiments Regression. Metamodeling. ME598/494 Lecture. Max Yi Ren 1 / 34 Metamodeling ME598/494 Lecture Max Yi Ren Department of Mechanical Engineering, Arizona State University March 1, 2015 2 / 34 1. preliminaries 1.1 motivation 1.2 ordinary least square 1.3 information

More information

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier

A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier A Modified Incremental Principal Component Analysis for On-line Learning of Feature Space and Classifier Seiichi Ozawa, Shaoning Pang, and Nikola Kasabov Graduate School of Science and Technology, Kobe

More information

Quantitative Structure Activity Relationships: An overview

Quantitative Structure Activity Relationships: An overview Quantitative Structure Activity Relationships: An overview Prachi Pradeep Oak Ridge Institute for Science and Education Research Participant National Center for Computational Toxicology U.S. Environmental

More information

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM

Related Concepts: Lecture 9 SEM, Statistical Modeling, AI, and Data Mining. I. Terminology of SEM Lecture 9 SEM, Statistical Modeling, AI, and Data Mining I. Terminology of SEM Related Concepts: Causal Modeling Path Analysis Structural Equation Modeling Latent variables (Factors measurable, but thru

More information

Artificial Neural Networks Examination, March 2004

Artificial Neural Networks Examination, March 2004 Artificial Neural Networks Examination, March 2004 Instructions There are SIXTY questions (worth up to 60 marks). The exam mark (maximum 60) will be added to the mark obtained in the laborations (maximum

More information