QSAR/QSPR modeling. Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships

1 Quantitative Structure-Activity Relationships / Quantitative Structure-Property Relationships: QSAR/QSPR modeling. Alexandre Varnek, Faculté de Chimie, ULP, Strasbourg, France

2 QSAR/QSPR models: Development, Validation, Application

3 Development of QSAR models
- Selection and curation of experimental data
- Preparation of training and test sets (optionally)
- Selection of an initial set of descriptors and their normalisation
- Variable selection
- Selection of a machine-learning method
- Validation of models: training/test set; cross-validation (internal, external)
- Application of the models
- Applicability domain of the models

4 Development of the QSAR models: experimental data, descriptors, mathematical techniques, statistical criteria

5 Preparation of training and test sets. Workflow: (1) splitting of an initial data set into training and test sets; (2) building of structure-property models on the training set; (3) selection of the best models according to statistical criteria; (4) prediction calculations on the test set using the best structure-property models.

6 Recommendations to prepare a test set (i) experimental methods for determination of activities in the training and test sets should be similar; (ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%; (iii) the balance between active and inactive compounds should be respected for uniform sampling of the data. References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37,

7 Selection of descriptors for QSAR model. QSAR models should be reduced to a set of descriptors which is as information-rich but as small as possible. Rules of thumb: good spread, 5-6 structure points per descriptor.
Objective selection (independent variable only):
- Statistical criteria of correlations
- Pairwise selection (forward or backward stepwise selection)
- Principal Component Analysis
- Partial Least Squares analysis
- Genetic Algorithm
Subjective selection:
- Descriptor selection based on mechanistic studies

8 Preprocessing strategy for the derivation of models for use in structure-activity relationships (QSARs) 1. identify a subset of columns (variables) with significant correlation to the response; 2. remove columns (variables) with small variance; 3. remove columns (variables) with no unique information; 4. identify a subset of variables on which to construct a model; 5. address the problem of chance correlation. D. C. Whitley, M. G. Ford, D. J. Livingstone J. Chem. Inf. Comput. Sci. 2000, 40,

9 Machine-Learning Methods

10 Fitting model parameters
$Y = F(a_i, X_i)$, where $X_i$ are the descriptors (independent variables) and $a_i$ are the fitted parameters.
The goal is to minimize the Residual Sum of Squares (RSS):
$$\mathrm{RSS} = \sum_{i=1}^{N} \left( y_{\mathrm{exp},i} - y_{\mathrm{calc},i} \right)^2$$

11 Multiple Linear Regression (one descriptor)

Activity   Descriptor
Y_1        X_1
Y_2        X_2
...        ...
Y_n        X_n

$$Y_i = a_0 + a_1 X_{i1}$$
[Figure: plot of Y against X.]

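12 Multiple Linear Regression: $y = ax + b$
Residual Sum of Squares (RSS):
$$\mathrm{RSS} = \sum_{i=1}^{N} \left( y_i - y_{\mathrm{calc},i} \right)^2$$
[Figure: fitted line with slope a and intercept b.]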
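
As a rough illustration of the fit on this slide, the sketch below finds the slope a and intercept b that minimize RSS for a single descriptor, using the standard closed-form least-squares solution. The function names and the toy numbers are made up for the example and are not from the lecture.

```python
def fit_line(x, y):
    """Return (a, b) minimizing RSS = sum((y_i - (a*x_i + b))**2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx            # slope
    b = mean_y - a * mean_x  # intercept
    return a, b


def rss(x, y, a, b):
    """Residual sum of squares for the line y = a*x + b."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))


# Hypothetical descriptor values and activities, for illustration only.
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.1, 3.9, 6.2, 7.8, 10.1]
a_fit, b_fit = fit_line(x_toy, y_toy)
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}, RSS = {rss(x_toy, y_toy, a_fit, b_fit):.3f}")
```

Any other choice of a and b gives a larger RSS on the same data, which is what "fitting" means here. With several descriptors the same criterion is minimized, but the solution requires solving a linear system instead of two closed-form expressions.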

13 Multiple Linear Regression (m descriptors)

Activity   Descr 1   Descr 2   ...   Descr m
Y_1        X_11      X_12      ...   X_1m
Y_2        X_21      X_22      ...   X_2m
...        ...       ...       ...   ...
Y_n        X_n1      X_n2      ...   X_nm

$$Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \dots + a_m X_{im}$$

14 kNN (k Nearest Neighbors). The activity Y of a compound is assessed by calculating a weighted mean of the activities Y_i of its k nearest neighbors in the chemical space. [Figure: training set in the Descriptor 1 / Descriptor 2 plane.] A. Tropsha, A. Golbraikh, 2003
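
A minimal sketch of the kNN prediction described above. The slide does not specify the weighting scheme or the distance metric, so inverse-distance weights and Euclidean distance are assumptions made for this example; the function name and data are hypothetical.

```python
import math


def knn_predict(query, train_X, train_y, k=3):
    """Predict an activity as the inverse-distance-weighted mean of the
    activities of the k nearest training compounds (Euclidean distance)."""
    # Keep the k (distance, activity) pairs closest to the query.
    nearest = sorted(
        (math.dist(query, x), y) for x, y in zip(train_X, train_y)
    )[:k]
    # A training compound at distance 0 determines the prediction directly.
    for d, y in nearest:
        if d == 0.0:
            return y
    weights = [1.0 / d for d, _ in nearest]
    return sum(w * y for w, (_, y) in zip(weights, nearest)) / sum(weights)


# Hypothetical two-descriptor training set.
train_X = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]
print(knn_predict((0.5, 0.0), train_X, train_y, k=2))
```

The query sits halfway between the first two training compounds, so with k = 2 both neighbors receive equal weight.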

15 Biological and Artificial Neuron

16 Multilayer Neural Network Neurons in the input layer correspond to descriptors, neurons in the output layer to properties being predicted, neurons in the hidden layer to nonlinear latent variables

17 QSAR/QSPR models: Development, Validation, Application

18 Validating the QSAR Equation. How well does the model predict the activity of known compounds? For a perfect model, all data points would reside on the diagonal of the predicted vs. actual plot, and all variance existing in the original data would be explained by the model. $r^2$ is the fraction of the total variation in the dependent variable that is explained by the regression equation.

19 Calculating $r^2$
$$r^2 = \frac{\text{Explained variance}}{\text{Original variance}}$$
Original variance = Explained variance (i.e., variance explained by the equation) + Unexplained variance (i.e., residual variance around the regression line).
[Figure: original variance compared with variance around the regression line.]

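20 Calculating $r^2$
Original variance: $\mathrm{TSS} = \sum_{i=1}^{N} \left( y_i - \langle y \rangle \right)^2$
Explained variance (improvement in predicting y over just using the mean of y): $\mathrm{ESS} = \sum_{i=1}^{N} \left( y_{\mathrm{calc},i} - \langle y \rangle \right)^2$
Variance around regression line: $\mathrm{RSS} = \sum_{i=1}^{N} \left( y_i - y_{\mathrm{calc},i} \right)^2$
$$r^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$$
Example: $r^2$ = 0.89. [Table: Compound Number, Log EC50, Calculated Log EC50, Residual.]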
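
The decomposition above can be checked numerically. The sketch below computes TSS, ESS and RSS from observed and calculated values; the numbers are hypothetical, and the calculated values were taken from a least-squares line with intercept, which is the case in which TSS = ESS + RSS holds exactly.

```python
def variance_decomposition(y_obs, y_calc):
    """Return (TSS, ESS, RSS) as defined on the slide."""
    mean_y = sum(y_obs) / len(y_obs)
    tss = sum((y - mean_y) ** 2 for y in y_obs)                # original variance
    ess = sum((yc - mean_y) ** 2 for yc in y_calc)             # explained variance
    rss = sum((y - yc) ** 2 for y, yc in zip(y_obs, y_calc))   # residual variance
    return tss, ess, rss


# Hypothetical observed values and the values calculated by the
# least-squares line y = 1.99*x + 0.05 for x = 1..5.
y_obs = [2.1, 3.9, 6.2, 7.8, 10.1]
y_calc = [1.99 * x + 0.05 for x in (1.0, 2.0, 3.0, 4.0, 5.0)]

tss, ess, rss = variance_decomposition(y_obs, y_calc)
r2 = ess / tss
print(f"TSS = {tss:.3f}, ESS = {ess:.3f}, RSS = {rss:.3f}, r2 = {r2:.4f}")
```

For calculated values that do not come from a least-squares fit with intercept, ESS/TSS and 1 - RSS/TSS can differ, so it matters which form is reported.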

21 F-test. Tests the assumption that a significant portion of the original variance has been explained by the model. In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N - k - 1); N = number of data points) differs significantly from 0. A ratio of 0 would imply ESS = 0, i.e., that the model did not explain any of the variance.

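22 F-distribution. As N decreases, the probability of getting large $r^2$ values purely by chance increases (adding parameters at fixed N has the same effect). The critical F-value depends on both k and N - k - 1 and grows as N decreases, so a larger F-value is required for the test to be significant on small data sets. [Table: F-distribution indexed by k and N.]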

23 Calculating F Values
$$F = \frac{\mathrm{ESS}/k}{\mathrm{RSS}/(N - k - 1)} = \frac{r^2 \,(N - k - 1)}{k \,(1 - r^2)}$$
Calculate F according to the above equation. Select a significance level (e.g., 0.05). Look up the F-value from an F-distribution derived for the correct number of N and k at the selected significance level. If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level.
Example: $r^2$ = 0.89, N = 7, k = 1, F = [...]. For an F-distribution with N = 7, k = 1, a value of [...] corresponds to a significance level of [...]. Thus, the equation is significant at this level. The probability that the correlation is fortuitous is < 0.03%.
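
As a worked check of the formula, the sketch below plugs the example values from this slide ($r^2$ = 0.89, N = 7, k = 1) into the $r^2$ form of the F equation. The comparison value 6.61 is the standard tabulated critical F for 1 and 5 degrees of freedom at the 0.05 level; it is supplied here as an assumption for the example and is not taken from the slide.

```python
def f_statistic(r2, n, k):
    """F = r2 * (N - k - 1) / (k * (1 - r2))."""
    return (r2 * (n - k - 1)) / (k * (1.0 - r2))


f_value = f_statistic(r2=0.89, n=7, k=1)

# Tabulated critical value F(0.05; k = 1, N - k - 1 = 5), assumed for this example.
F_CRITICAL_005 = 6.61
is_significant = f_value > F_CRITICAL_005

print(f"F = {f_value:.2f}, significant at 0.05: {is_significant}")
```

Because the calculated F is far above the tabulated value, the regression is significant at the 0.05 level under these inputs.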

24 Validation of Models: 5-fold external cross-validation procedure

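25 Cross Validation. A measure of the predictive ability of the model (as opposed to the measure of fit produced by $r^2$).
$$Q^2 = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{N} \left( y_i - \langle y \rangle \right)^2}; \qquad \mathrm{PRESS} = \sum_{i=1}^{N} \left( y_{\mathrm{pred},i} - y_i \right)^2$$
$$r^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i=1}^{N} \left( y_i - \langle y \rangle \right)^2}; \qquad \mathrm{RSS} = \sum_{i=1}^{N} \left( y_{\mathrm{calc},i} - y_i \right)^2$$
$r^2$ always increases (or at least never decreases) as more descriptors are added. $Q^2$ initially increases as more parameters are added but then starts to decrease, indicating data overfitting. Thus $Q^2$ is a better indicator of the model quality.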
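
The difference between $r^2$ and $Q^2$ can be sketched as follows. For brevity this example uses leave-one-out cross-validation and a one-descriptor linear model, which is an assumption of the sketch (the preceding slide mentions a 5-fold procedure); the helper names and data are hypothetical.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return a, my - a * mx


def r2_fit(x, y):
    """r2 = 1 - RSS/TSS, with the model fitted on all compounds."""
    a, b = fit_line(x, y)
    my = sum(y) / len(y)
    rss = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - rss / sum((yi - my) ** 2 for yi in y)


def q2_loo(x, y):
    """Q2 = 1 - PRESS/TSS, each compound predicted by a model built without it."""
    my = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (a * x[i] + b - y[i]) ** 2
    return 1.0 - press / sum((yi - my) ** 2 for yi in y)


# Hypothetical data, for illustration only.
x_toy = [1.0, 2.0, 3.0, 4.0, 5.0]
y_toy = [2.1, 3.9, 6.2, 7.8, 10.1]
r2_value = r2_fit(x_toy, y_toy)
q2_value = q2_loo(x_toy, y_toy)
print(f"r2 = {r2_value:.4f}, Q2 = {q2_value:.4f}")
```

For a least-squares model, each left-out residual is at least as large as the corresponding fitted residual, so $Q^2$ cannot exceed $r^2$ on the same data.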

26 Other Model Validation Parameters
1. s is the standard deviation about the regression line:
$$s = \sqrt{\frac{\sum \left( y_{\mathrm{obs}} - y_{\mathrm{calc}} \right)^2}{N - k - 1}}$$
This is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity. The smaller the value of s, the better the QSAR. N is the number of observations and k is the number of variables.
2. Scrambling of y.

27 Statistical tests for «chance correlations»
Scrambling: randomly mix the Y values (Y-scrambling), the X values (X-scrambling), or simultaneously the Y and X values (X,Y-scrambling).
Randomization: generate random numbers from Y min to Y max (Y-randomization), from X min to X max (X-randomization), or do this simultaneously for Y and X (X,Y-randomization).
Calculate statistical parameters of correlations and compare them with those obtained for the model.
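
A minimal sketch of Y-scrambling for a one-descriptor model: the activities are shuffled many times and the resulting $r^2$ values are compared with the $r^2$ of the unscrambled data. The number of trials, the seed, and the synthetic data are assumptions made for this example.

```python
import random


def r_squared(x, y):
    """Squared Pearson correlation, equal to r2 of a one-descriptor linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def y_scrambling(x, y, n_trials=100, seed=0):
    """Return the r2 values obtained after randomly permuting y n_trials times."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the structure-activity pairing
        results.append(r_squared(x, y_perm))
    return results


# Synthetic data: activity is a noisy linear function of one descriptor.
x_toy = [float(i) for i in range(20)]
y_toy = [2.0 * xi + 1.0 + (0.5 if i % 2 else -0.5) for i, xi in enumerate(x_toy)]

r2_real = r_squared(x_toy, y_toy)
r2_scrambled = y_scrambling(x_toy, y_toy)
print(f"model r2 = {r2_real:.3f}, best scrambled r2 = {max(r2_scrambled):.3f}")
```

If scrambled data reached statistics comparable to the real model, the original correlation would be suspect as a chance correlation.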

28 Scrambling. [Figure: property values Pro.1 ... Pro.n reassigned at random to structures Struc.1 ... Struc.n; plot of $q^2$ against the number of variables.] The lowest $q^2$ = 0.51 in the top 10 models. The highest $q^2$ = 0.14 for randomized datasets.

29 QSAR/QSPR models: Development, Validation, Application

30 QSPR models, test compound, prediction performance
Robustness of QSPR models:
- Descriptors type;
- Descriptors selection;
- Machine-learning methods;
- Validation of models.
Applicability domain of models: is a test compound similar to the training set compounds?

31 Applicability domain of QSAR models. The new compound will be predicted by the model only if $D_i \le \langle D_k \rangle + Z \cdot s_k$, with Z an empirical parameter (0.5 by default). [Figure: training set and test compounds in the Descriptor 1 / Descriptor 2 plane. A test compound inside the domain will be predicted; a test compound outside the domain will not be predicted.]
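
The slide does not define the symbols in the inequality, so the sketch below rests on an assumed reading: $D_i$ is the mean Euclidean distance from the test compound to its k nearest training compounds, and $\langle D_k \rangle$ and $s_k$ are the mean and standard deviation of the same quantity computed for each training compound against the rest of the training set. All names and data are hypothetical.

```python
import math
import statistics


def mean_knn_distance(point, others, k):
    """Mean Euclidean distance from point to its k nearest members of others."""
    dists = sorted(math.dist(point, o) for o in others)
    return sum(dists[:k]) / k


def domain_threshold(train_X, k=2, z=0.5):
    """<D_k> + Z * s_k, computed over the training set (assumed definition)."""
    d_train = [
        mean_knn_distance(x, train_X[:i] + train_X[i + 1:], k)
        for i, x in enumerate(train_X)
    ]
    return statistics.mean(d_train) + z * statistics.pstdev(d_train)


def in_domain(query, train_X, k=2, z=0.5):
    """True if the query compound satisfies D_i <= <D_k> + Z * s_k."""
    return mean_knn_distance(query, train_X, k) <= domain_threshold(train_X, k, z)


# Hypothetical training set: a 3 x 3 grid in a two-descriptor space.
train = [(float(i), float(j)) for i in range(3) for j in range(3)]
print(in_domain((1.0, 1.5), train))    # lies among the training compounds
print(in_domain((10.0, 10.0), train))  # far from every training compound
```

A larger Z widens the domain and lets more compounds be predicted, at the cost of extrapolating further from the training set.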

32 Applicability domain of QSAR models Range based methods Bounding Box (BB)
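
A bounding box is commonly taken as the per-descriptor minimum and maximum over the training set, with a compound counted as inside the domain when every one of its descriptor values falls within those ranges. The sketch below follows that common definition; the function names and data are illustrative.

```python
def bounding_box(train_X):
    """Per-descriptor (min, max) ranges over the training set."""
    return [(min(col), max(col)) for col in zip(*train_X)]


def in_bounding_box(query, box):
    """True if every descriptor value of the query lies within its range."""
    return all(lo <= value <= hi for value, (lo, hi) in zip(query, box))


# Hypothetical training compounds described by two descriptors.
train = [(0.2, 10.0), (0.8, 12.5), (0.5, 11.0), (0.9, 14.0)]
box = bounding_box(train)

print(box)
print(in_bounding_box((0.6, 13.0), box))  # inside both ranges
print(in_bounding_box((0.6, 20.0), box))  # second descriptor out of range
```

The check is cheap, but it treats the whole hyper-rectangle as covered even where it contains no training compounds.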

33 Should one use only one individual model or many models? ensemble modeling

34 Hunting season Single hunter

35 Hunting season Many hunters

36 Ensemble modelling. [Figure: Model 1, Model 2, Model 3, Model 4.]

37 Property (Y) predictions using best fit models

Compound     model 1   model 2   mean ± s
Compound 1   Y_11      Y_12      <Y_1> ± ΔY_1
Compound 2   Y_21      Y_22      <Y_2> ± ΔY_2
...
Compound m   Y_m1      Y_m2      <Y_m> ± ΔY_m

Grubbs statistics are used to exclude outliers.
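
The "mean ± s" column can be sketched as below: each compound's predictions from the individual models are averaged and their sample standard deviation is reported as the spread. The Grubbs outlier exclusion mentioned on the slide is deliberately left out of this sketch, and the prediction values are made up.

```python
import statistics


def consensus(predictions):
    """Return (mean, sample standard deviation) of one compound's predictions."""
    return statistics.mean(predictions), statistics.stdev(predictions)


# Hypothetical predictions: one row per compound, one value per model.
predictions_by_compound = {
    "Compound 1": [1.0, 2.0, 3.0],
    "Compound 2": [4.1, 4.0, 3.9],
}

for name, preds in predictions_by_compound.items():
    mean_y, delta_y = consensus(preds)
    print(f"{name}: {mean_y:.2f} ± {delta_y:.2f}")
```

A large spread signals that the individual models disagree for that compound, which is useful information alongside the consensus value.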

38 Calculation of Descriptors. DataSet -> ISIDA FRAGMENTOR -> the Pattern matrix. Example fragments: C-C-C-C-C-C, C=O, C-C-C-N-C-C, C-C-C-N, C-N-C-C*C, etc.

39 Inputs: pattern matrix and property values. Learning stage: building of QSAR models. Validation stage: QSAR models filtering -> selection of the most predictive ones.

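40 Example: linear QSPR model
$$\text{Property} = a_0 + \sum_{i=1}^{k} a_i D_i$$
$$\text{PROPERTY}_{\mathrm{calc}} = a_0 + a_1 N_{\text{C-C-C-N-C-C}} + a_2 N_{\text{C=O}} + a_3 N_{\text{C-N-C*C}} + \dots$$
Here the descriptors $D_i$ are the fragment counts N.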

41 Virtual screening with QSAR/QSPR models

42 Screening and hits selection. Database -> Virtual Screening (QSPR model) -> Hits -> Experimental Tests; useless compounds are discarded. [Figure: example structures from the database.]

43 Combinatorial Library Design

44 Generation of Virtual Combinatorial Libraries. [Figure: a Markush structure with substituents R1, R2, R3 and the set of structures enumerated from it for given R1, R2, R3.]

45 The types of variation in Markush structures:
1. Substituent variation (R1), e.g. R1 = Me, Et, Pr
2. Position variation (R2), e.g. R2 = NH2, Cl
3. Frequency variation, e.g. (CH2)n, n = 1-3
4. Homology variation (R3) (only for patent search), e.g. R3 = alkyl or heterocycle

46 IN SILICO design of new compounds

47 - Acquisition of Data; - Acquisition of Knowledge; - Exploitation of Knowledge «In silico» design of new compounds

48 ISIDA combinatorial module. The combinatorial module generates virtual libraries based on the Markush structures. [Figure: numbered workflow (steps 1-7) linking Database, ISIDA, Filtering, Similarity Search, QSAR models, Applicability domains, Assessment of properties, Hits selection, and Synthesis and experimental tests; generation speed given in molecules/second; Markush structure.]

49 COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Binding of $\mathrm{UO_2^{2+}}$ by monoamides. [Structure: monoamide with substituents R1, R2, R3; R = H, alkyl.] Distribution ratio: $D = [U]_{\text{organic phase}} / [U]_{\text{aqueous phase}}$. A. Varnek, D. Fourches, V. Solov'ev, O. Klimchuk, A. Ouadi, I. Billard, J. Solv. Extr. Ion Exch., 2007, 25, N 4

50 SOLVENT EXTRACTION OF METALS M 2 + An - M 1 + L

51 COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Extraction of $\mathrm{UO_2^{2+}}$ by monoamides. Reprocessing of the spent nuclear fuel: PUREX process, La Hague plant, France. TBP: tributyl phosphate

52 Goal: theoretical design of new uranyl binders more efficient than previously studied molecules 1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960) 2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr. Ion Exch, 17, 87 (1999)

53 Selected Hits: 21 cmpds DATABASE Virtual library: cmpds VIRTUAL SCREENING ISIDA DATA TREATMENT EXPERT SYSTEM Hits selection PREDICTOR

54 In silico design of uranyl binders with ISIDA

55 Experimental vs. predicted logD for new amides. [Figure: logD plotted against amide ID.]

56 Enrichment of the initial data set by new efficient extractants with logD > 0.9: 4 compounds (previously studied), 9 compounds (newly synthesized). [Figure: number of compounds against logD for newly synthesized and previously studied amides.]

57 Classification Models

58 Confusion Matrix. For N instances, K classes and a classifier, $N_{ij}$ is the number of instances of class i classified as j.

          Class1   Class2   ...   ClassK
Class1    N_11     N_12     ...   N_1K
Class2    N_21     N_22     ...   N_2K
...
ClassK    N_K1     N_K2     ...   N_KK

59 Classification Evaluation. Global measures of success: measures are estimated on all classes. Local measures of success: measures are estimated for each class.
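
A short sketch of a confusion matrix together with one global and one local measure. The slides do not name specific measures, so overall accuracy (global) and per-class recall (local) are chosen here as common examples; the class labels and predictions are hypothetical.

```python
def confusion_matrix(y_true, y_pred, classes):
    """matrix[i][j] = number of instances of class i classified as class j."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        matrix[index[t]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Global measure: fraction of all instances on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total


def recall_per_class(matrix):
    """Local measure: for each class i, N_ii divided by the row total."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(matrix)]


classes = ["active", "inactive"]
y_true = ["active", "active", "active", "inactive", "inactive", "inactive"]
y_pred = ["active", "active", "inactive", "inactive", "inactive", "active"]

cm = confusion_matrix(y_true, y_pred, classes)
print(cm, accuracy(cm), recall_per_class(cm))
```

On unbalanced data sets the global and local measures can diverge sharply, which is one reason to report both.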

60 "The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties." George S. Hammond, Norris Award Lecture, 1968


ECE521 Lectures 9 Fully Connected Neural Networks ECE521 Lectures 9 Fully Connected Neural Networks Outline Multi-class classification Learning multi-layer neural networks 2 Measuring distance in probability space We learnt that the squared L2 distance

More information

Mining Classification Knowledge

Mining Classification Knowledge Mining Classification Knowledge Remarks on NonSymbolic Methods JERZY STEFANOWSKI Institute of Computing Sciences, Poznań University of Technology COST Doctoral School, Troina 2008 Outline 1. Bayesian classification

More information

(e.g.training and prediction set, algorithm, ecc...). 2.9.Availability of another QMRF for exactly the same model: No other information available

(e.g.training and prediction set, algorithm, ecc...). 2.9.Availability of another QMRF for exactly the same model: No other information available QMRF identifier (JRC Inventory):To be entered by JRC QMRF Title: Insubria QSAR PaDEL-Descriptor model for prediction of NitroPAH mutagenicity. Printing Date:Jan 20, 2014 1.QSAR identifier 1.1.QSAR identifier

More information

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients

9/26/17. Ridge regression. What our model needs to do. Ridge Regression: L2 penalty. Ridge coefficients. Ridge coefficients What our model needs to do regression Usually, we are not just trying to explain observed data We want to uncover meaningful trends And predict future observations Our questions then are Is β" a good estimate

More information

Evaluation requires to define performance measures to be optimized

Evaluation requires to define performance measures to be optimized Evaluation Basic concepts Evaluation requires to define performance measures to be optimized Performance of learning algorithms cannot be evaluated on entire domain (generalization error) approximation

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Chemical library design

Chemical library design Chemical library design Pavel Polishchuk Institute of Molecular and Translational Medicine Palacky University pavlo.polishchuk@upol.cz Drug development workflow Vistoli G., et al., Drug Discovery Today,

More information

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom

Overview. Descriptors. Definition. Descriptors. Overview 2D-QSAR. Number Vector Function. Physicochemical property (log P) Atom verview D-QSAR Definition Examples Features counts Topological indices D fingerprints and fragment counts R-group descriptors ow good are D descriptors in practice? Summary Peter Gedeck ovartis Institutes

More information

Science and Technology. Solutions, Separation Techniques, and the PUREX Process for Reprocessing Nuclear Waste

Science and Technology. Solutions, Separation Techniques, and the PUREX Process for Reprocessing Nuclear Waste Science and Technology Solutions, Separation Techniques, and the PUREX Process for Reprocessing Nuclear Waste Spent Fuel Rods General Accounting Office Fission products that emit beta and gamma radiation

More information

FINAL: CS 6375 (Machine Learning) Fall 2014

FINAL: CS 6375 (Machine Learning) Fall 2014 FINAL: CS 6375 (Machine Learning) Fall 2014 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run out of room for

More information

Active Sonar Target Classification Using Classifier Ensembles

Active Sonar Target Classification Using Classifier Ensembles International Journal of Engineering Research and Technology. ISSN 0974-3154 Volume 11, Number 12 (2018), pp. 2125-2133 International Research Publication House http://www.irphouse.com Active Sonar Target

More information

SF2930 Regression Analysis

SF2930 Regression Analysis SF2930 Regression Analysis Alexandre Chotard Tree-based regression and classication 20 February 2017 1 / 30 Idag Overview Regression trees Pruning Bagging, random forests 2 / 30 Today Overview Regression

More information

Information Extraction from Chemical Images. Discovery Knowledge & Informatics April 24 th, Dr. Marc Zimmermann

Information Extraction from Chemical Images. Discovery Knowledge & Informatics April 24 th, Dr. Marc Zimmermann Information Extraction from Chemical Images Discovery Knowledge & Informatics April 24 th, 2006 Dr. Available Chemical Information Textbooks Reports Patents Databases Scientific journals and publications

More information

De Novo molecular design with Deep Reinforcement Learning

De Novo molecular design with Deep Reinforcement Learning De Novo molecular design with Deep Reinforcement Learning @olexandr Olexandr Isayev, Ph.D. University of North Carolina at Chapel Hill olexandr@unc.edu http://olexandrisayev.com About me Ph.D. in Chemistry

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 .0.0 5 5 1.0 7 5 X2 X2 7 1.5 1.0 0.5 3 1 2 Hierarchical clustering

More information

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons

Linear Regression with 1 Regressor. Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor 1. The regression equation 2. Estimating the equation 3. Assumptions required for

More information

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting

Estimating the accuracy of a hypothesis Setting. Assume a binary classification setting Estimating the accuracy of a hypothesis Setting Assume a binary classification setting Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y) Train a binary classifier

More information

Introduction to Machine Learning Midterm Exam Solutions

Introduction to Machine Learning Midterm Exam Solutions 10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,

More information

Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition

Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition J. Uglov, V. Schetinin, C. Maple Computing and Information System Department, University of Bedfordshire, Luton,

More information

Statistical Machine Learning from Data

Statistical Machine Learning from Data Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne

More information

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition

Data Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each

More information

Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility

Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility Dr. Wendy A. Warr http://www.warr.com Warr, W. A. A Short Review of Chemical Reaction Database Systems,

More information

Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares

Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares R Gutierrez-Osuna Computer Science Department, Wright State University, Dayton, OH 45435,

More information

ISyE 691 Data mining and analytics

ISyE 691 Data mining and analytics ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)

More information

Lecture 6. Notes on Linear Algebra. Perceptron

Lecture 6. Notes on Linear Algebra. Perceptron Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors

More information

Lecture 5: Clustering, Linear Regression

Lecture 5: Clustering, Linear Regression Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 Hierarchical clustering Most algorithms for hierarchical clustering

More information

Prediction in bioinformatics applications by conformal predictors

Prediction in bioinformatics applications by conformal predictors Prediction in bioinformatics applications by conformal predictors Alex Gammerman (joint work with Ilia Nuoretdinov and Paolo Toccaceli) Computer Learning Research Centre Royal Holloway, University of London

More information

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,

Linear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,

More information

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina

Direct Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:

More information

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees

Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Rafdord M. Neal and Jianguo Zhang Presented by Jiwen Li Feb 2, 2006 Outline Bayesian view of feature

More information

Using AutoDock for Virtual Screening

Using AutoDock for Virtual Screening Using AutoDock for Virtual Screening CUHK Croucher ASI Workshop 2011 Stefano Forli, PhD Prof. Arthur J. Olson, Ph.D Molecular Graphics Lab Screening and Virtual Screening The ultimate tool for identifying

More information

Computational statistics

Computational statistics Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information

Hierarchical Boosting and Filter Generation

Hierarchical Boosting and Filter Generation January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers

More information

Statistical Methods for Data Mining

Statistical Methods for Data Mining Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence

More information

Introduction to Chemoinformatics

Introduction to Chemoinformatics Introduction to Chemoinformatics Dr. Igor V. Tetko Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH) Institute of Bioinformatics & Systems Biology (HMGU) Kyiv, 10 August

More information

Introduction. OntoChem

Introduction. OntoChem Introduction ntochem Providing drug discovery knowledge & small molecules... Supporting the task of medicinal chemistry Allows selecting best possible small molecule starting point From target to leads

More information

Development of a Structure Generator to Explore Target Areas on Chemical Space

Development of a Structure Generator to Explore Target Areas on Chemical Space Development of a Structure Generator to Explore Target Areas on Chemical Space Kimito Funatsu Department of Chemical System Engineering, This materials will be published on Molecular Informatics Drug Development

More information