QSAR/QSPR modeling: Quantitative Structure-Activity Relationships and Quantitative Structure-Property Relationships
1 Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships QSAR/QSPR modeling Alexandre Varnek Faculté de Chimie, ULP, Strasbourg, FRANCE
2 QSAR/QSPR models Development Validation Application
3 Development of QSAR models: selection and curation of experimental data; preparation of training and test sets (optional); selection of an initial set of descriptors and their normalisation; variable selection; selection of a machine-learning method; validation of models on the training/test sets (cross-validation: internal, external); application of the models; applicability domain of the models.
4 Development of QSAR models: experimental data, descriptors, mathematical techniques, statistical criteria.
5 Preparation of training and test sets. The initial data set is split into a training set and a test set. Structure-property models are built on the training set, the best models are selected according to statistical criteria, and prediction calculations for the test set are then made using the best structure-property models.
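The splitting step can be sketched as below. This is only an illustration: the function name `split_train_test`, the random shuffling strategy and the 20% test fraction are assumptions, since the slide does not state how the split is made or which percentage is used.

```python
import random

def split_train_test(compounds, test_fraction=0.2, seed=0):
    """Randomly split a list of compounds into (training set, test set).

    test_fraction is an assumed value; the slide's percentage is not known.
    """
    rng = random.Random(seed)      # fixed seed so the split is reproducible
    shuffled = list(compounds)     # copy, so the caller's list is untouched
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

compounds = [f"cmpd_{i}" for i in range(10)]
train, test = split_train_test(compounds)
print(len(train), len(test))
```

In practice the split is often stratified by activity range so that the test set respects the recommendations on the next slide; the random split here is the simplest possible choice.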
6 Recommendations to prepare a test set (i) experimental methods for determination of activities in the training and test sets should be similar; (ii) the activity values should span several orders of magnitude, but should not exceed activity values in the training set by more than 10%; (iii) the balance between active and inactive compounds should be respected for uniform sampling of the data. References: Oprea, T. I.; Waller, C. L.; Marshall, G. R. J. Med. Chem. 1994, 37,
7 Selection of descriptors for a QSAR model. QSAR models should be reduced to a set of descriptors that is as information-rich but as small as possible. Rules of thumb: good spread, 5-6 structure points per descriptor. Objective selection (independent variables only): statistical criteria of correlations; pairwise selection (forward or backward stepwise selection); Principal Component Analysis; Partial Least Squares analysis; Genetic Algorithm. Subjective selection: descriptor selection based on mechanistic studies.
8 Preprocessing strategy for the derivation of models for use in structure-activity relationships (QSARs) 1. identify a subset of columns (variables) with significant correlation to the response; 2. remove columns (variables) with small variance; 3. remove columns (variables) with no unique information; 4. identify a subset of variables on which to construct a model; 5. address the problem of chance correlation. D. C. Whitley, M. G. Ford, D. J. Livingstone J. Chem. Inf. Comput. Sci. 2000, 40,
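Steps 2 and 3 of this strategy (dropping low-variance columns and columns carrying no unique information) might look like the sketch below. The thresholds `min_variance` and `max_abs_r`, the use of the Pearson correlation as the redundancy test, and the descriptor names are all assumptions made for illustration; they are not taken from the cited paper.

```python
from statistics import pvariance

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def filter_descriptors(columns, min_variance=1e-6, max_abs_r=0.95):
    """columns maps descriptor name -> list of values; returns the kept names."""
    kept = []
    for name, values in columns.items():
        if pvariance(values) < min_variance:
            continue  # step 2: (near-)constant column
        if any(abs(pearson(values, columns[k])) > max_abs_r for k in kept):
            continue  # step 3: redundant with a column already kept
        kept.append(name)
    return kept

columns = {
    "n_C": [1.0, 2.0, 3.0, 4.0],
    "n_C_x2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with n_C
    "const": [1.0, 1.0, 1.0, 1.0],   # zero variance
    "n_O": [0.0, 1.0, 0.0, 2.0],
}
kept = filter_descriptors(columns)
print(kept)
```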
9 Machine-Learning Methods
10 Fitting model parameters. $Y = F(a_i, X_i)$, where $X_i$ are the descriptors (independent variables) and $a_i$ are the fitted parameters. The goal is to minimize the residual sum of squares (RSS): $\mathrm{RSS} = \sum_{i=1}^{N} (y_{\mathrm{exp},i} - y_{\mathrm{calc},i})^2$
11 Multiple Linear Regression (one descriptor). Data: activity $Y_1, Y_2, \dots, Y_n$ against descriptor $X_1, X_2, \dots, X_n$. Model: $Y_i = a_0 + a_1 X_{i1}$
12 Multiple Linear Regression. $y = ax + b$, with slope a and intercept b. Residual sum of squares (RSS): $\mathrm{RSS} = \sum_{i=1}^{N} (y_i - y_{\mathrm{calc},i})^2$
13 Multiple Linear Regression (m descriptors).
Activity   Descr 1   Descr 2   ...   Descr m
Y_1        X_11      X_12      ...   X_1m
Y_2        X_21      X_22      ...   X_2m
...
Y_n        X_n1      X_n2      ...   X_nm
Model: $Y_i = a_0 + a_1 X_{i1} + a_2 X_{i2} + \dots + a_m X_{im}$
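For the one-descriptor case y = ax + b, minimizing RSS has a closed-form solution. The sketch below uses made-up data and illustrative function names (`fit_line`, `rss`) to show that the fitted a and b give a smaller RSS than a perturbed slope.

```python
def fit_line(x, y):
    """Least-squares slope a and intercept b for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    b = my - a * mx
    return a, b

def rss(x, y, a, b):
    """Residual sum of squares of the line a*x + b on the data."""
    return sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))

# Toy data (hypothetical descriptor values and activities).
x = [1.0, 2.0, 3.0, 4.0]
y = [3.1, 4.9, 7.2, 8.8]
a, b = fit_line(x, y)
print(a, b, rss(x, y, a, b))
```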
14 kNN (k Nearest Neighbors). The activity Y of a compound is assessed by calculating a weighted mean of the activities $Y_i$ of its k nearest neighbors in the chemical space. (Figure: training set plotted in the Descriptor 1 / Descriptor 2 plane.) A. Tropsha, A. Golbraikh, 2003
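A minimal sketch of such a prediction follows. The slide does not say how the neighbors are weighted or which metric defines "nearest"; inverse Euclidean distance weighting is assumed here, and `knn_predict` is an illustrative name.

```python
import math

def knn_predict(query, train_x, train_y, k=3, eps=1e-9):
    """Predict activity as the inverse-distance-weighted mean of the
    k nearest training compounds (Euclidean distance in descriptor space)."""
    neighbors = sorted(
        (math.dist(query, x), y) for x, y in zip(train_x, train_y)
    )[:k]
    weights = [1.0 / (d + eps) for d, _ in neighbors]  # eps avoids division by zero
    return sum(w * y for w, (_, y) in zip(weights, neighbors)) / sum(weights)

train_x = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
train_y = [1.0, 2.0, 3.0, 10.0]
y_pred = knn_predict((0.5, 0.5), train_x, train_y, k=3)
print(y_pred)
```

The query point is equidistant from its three nearest neighbors, so the weighted mean reduces to their plain average; the distant fourth compound does not contribute.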
15 Biological and Artificial Neuron
16 Multilayer Neural Network Neurons in the input layer correspond to descriptors, neurons in the output layer to properties being predicted, neurons in the hidden layer to nonlinear latent variables
17 QSAR/QSPR models Development Validation Application
18 Validating the QSAR equation. How well does the model predict the activity of known compounds? For a perfect model, all data points would reside on the diagonal of the predicted vs. actual plot, and all variance existing in the original data would be explained by the model. $r^2$ is the fraction of the total variation in the dependent variable that is explained by the regression equation.
19 Calculating $r^2$. $r^2 = \frac{\text{Explained variance}}{\text{Original variance}}$. Original variance = explained variance (i.e., variance explained by the equation) + unexplained variance (i.e., residual variance around the regression line).
20 Calculating $r^2$. Original variance: $\mathrm{TSS} = \sum_{i=1}^{N} (y_i - \langle y \rangle)^2$. Explained variance (improvement in predicting y over just using the mean of y): $\mathrm{ESS} = \sum_{i=1}^{N} (y_{\mathrm{calc},i} - \langle y \rangle)^2$. Variance around the regression line: $\mathrm{RSS} = \sum_{i=1}^{N} (y_i - y_{\mathrm{calc},i})^2$. Hence $r^2 = \frac{\mathrm{ESS}}{\mathrm{TSS}} = \frac{\mathrm{TSS} - \mathrm{RSS}}{\mathrm{TSS}} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}}$. Example (table of compound number, log EC50, calculated log EC50 and residual): $r^2 = 0.89$.
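The decomposition TSS = ESS + RSS can be checked numerically. The data below are made up (they are not the log EC50 table from the slide); `y_calc` holds the values of a least-squares line fitted to `y`, which is the condition under which the identity holds.

```python
def variance_decomposition(y, y_calc):
    """Return (TSS, ESS, RSS) for observed y and calculated y_calc."""
    mean_y = sum(y) / len(y)
    tss = sum((yi - mean_y) ** 2 for yi in y)
    ess = sum((yc - mean_y) ** 2 for yc in y_calc)
    rss = sum((yi - yc) ** 2 for yi, yc in zip(y, y_calc))
    return tss, ess, rss

y = [3.1, 4.9, 7.2, 8.8]
y_calc = [3.09, 5.03, 6.97, 8.91]  # least-squares line 1.94*x + 1.15 at x = 1..4
tss, ess, rss = variance_decomposition(y, y_calc)
r2 = 1 - rss / tss
print(tss, ess, rss, r2)
```

For predictions that do not come from a least-squares fit with an intercept, ESS/TSS and 1 - RSS/TSS generally differ, which is why the two forms should not be mixed on external test sets.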
21 F-test. Tests the assumption that a significant portion of the original variance has been explained by the model. In statistical terms, it tests whether the ratio between the explained variance (ESS/k; k = number of parameters) and the unexplained variance (RSS/(N-k-1); N = number of data points) differs significantly from 0. A ratio of 0 would imply ESS = 0, i.e., the model did not explain any of the variance.
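22 F-distribution. As the number of data points N decreases (or the number of parameters k grows towards N), the probability of getting large $r^2$ values purely by chance increases. Thus, for small N, a larger F-value is required for the test to be significant. (Figure: critical F-values tabulated against k and N.)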
23 Calculating F values. $F = \frac{\mathrm{ESS}/k}{\mathrm{RSS}/(N-k-1)} = \frac{r^2 (N-k-1)}{k (1 - r^2)}$. Calculate F according to this equation. Select a significance level (e.g., 0.05). Look up the F-value from an F-distribution derived for the correct N and k at the selected significance level. If the calculated F-value is larger than the listed F-value, then the regression equation is significant at this significance level. Example: $r^2 = 0.89$, N = 7, k = 1. For an F-distribution with N = 7 and k = 1, the resulting F-value is significant: the probability that the correlation is fortuitous is < 0.03%.
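The F formula is easy to evaluate directly. The sketch below plugs in the example's inputs (r² = 0.89, N = 7, k = 1); the function name `f_value` is illustrative, and the lookup of the critical value in an F-table is deliberately left out.

```python
def f_value(r2, n, k):
    """F statistic of a regression with k parameters fitted to n data points."""
    return r2 * (n - k - 1) / (k * (1.0 - r2))

f = f_value(0.89, n=7, k=1)
print(round(f, 2))
```

The result would then be compared with the tabulated F-value for k and N - k - 1 degrees of freedom at the chosen significance level.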
24 Validation of models: 5-fold external cross-validation procedure
25 Cross-validation. $Q^2$ is a measure of the predictive ability of the model (as opposed to the measure of fit produced by $r^2$). $Q^2 = 1 - \frac{\mathrm{PRESS}}{\sum_{i=1}^{N} (y_i - \langle y \rangle)^2}$, with $\mathrm{PRESS} = \sum_{i=1}^{N} (y_{\mathrm{pred},i} - y_i)^2$; $r^2 = 1 - \frac{\mathrm{RSS}}{\sum_{i=1}^{N} (y_i - \langle y \rangle)^2}$, with $\mathrm{RSS} = \sum_{i=1}^{N} (y_{\mathrm{calc},i} - y_i)^2$. $r^2$ never decreases as more descriptors are added. $Q^2$ initially increases as more parameters are added but then starts to decrease, indicating overfitting of the data. Thus $Q^2$ is a better indicator of model quality.
26 Other model validation parameters. 1. s is the standard deviation about the regression line: $s = \sqrt{\frac{\sum (y_{\mathrm{obs}} - y_{\mathrm{calc}})^2}{N - k - 1}}$, where N is the number of observations and k is the number of variables. It is a measure of how well the function derived by the QSAR analysis predicts the observed biological activity; the smaller the value of s, the better the QSAR. 2. Scrambling of y.
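A leave-one-out version of Q² for a one-descriptor linear model can be sketched as follows. Leave-one-out is an assumption (the preceding slide mentions a 5-fold procedure), the data are made up, and the helper names are illustrative.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for y = a*x + b."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    a = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    return a, my - a * mx

def r2_fit(x, y):
    """r^2 = 1 - RSS/TSS, with y_calc from a model fitted on all data."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    rss = sum((a * xi + b - yi) ** 2 for xi, yi in zip(x, y))
    return 1 - rss / sum((yi - mean_y) ** 2 for yi in y)

def q2_loo(x, y):
    """Q^2 = 1 - PRESS/TSS, each y_pred coming from a model that never saw point i."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (a * x[i] + b - y[i]) ** 2
    return 1 - press / sum((yi - mean_y) ** 2 for yi in y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1]
r2, q2 = r2_fit(x, y), q2_loo(x, y)
print(r2, q2)
```

Because each left-out point is predicted by a model that was not fitted to it, PRESS is at least as large as RSS for least-squares fits, so Q² does not exceed r².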
27 Statistical tests for «chance correlations». Scrambling: randomly mix the Y values (Y-scrambling), the X values (X-scrambling), or both Y and X values simultaneously (X,Y-scrambling). Randomization: generate random numbers from Y min to Y max (Y-randomization), from X min to X max (X-randomization), or do this simultaneously for Y and X (X,Y-randomization). Calculate the statistical parameters of the resulting correlations and compare them with those obtained for the model.
28 Scrambling: the property values (Pro.1, Pro.2, Pro.3, ..., Pro.n) are randomly reassigned among the structures (Struc.1, Struc.2, Struc.3, ..., Struc.n). (Figure: $q^2$ vs. number of variables.) The lowest $q^2$ = 0.51 in the top 10 models; the highest $q^2$ = 0.14 for randomized datasets.
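Y-scrambling can be sketched as below for a one-descriptor model, using the squared Pearson correlation as the statistical parameter. The number of trials, the seed, the data and the function names are assumptions for illustration only.

```python
import random

def r_squared(x, y):
    """Squared Pearson correlation, equal to r^2 of a one-descriptor linear fit."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

def y_scrambling(x, y, n_trials=20, seed=0):
    """r^2 values obtained after randomly permuting the Y values."""
    rng = random.Random(seed)
    results = []
    for _ in range(n_trials):
        y_perm = list(y)
        rng.shuffle(y_perm)  # break the structure-property link
        results.append(r_squared(x, y_perm))
    return results

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [1.2, 1.9, 3.1, 4.2, 4.8, 6.1, 7.0, 8.1]
r2_real = r_squared(x, y)
r2_scrambled = y_scrambling(x, y)
print(r2_real, max(r2_scrambled))
```

A model is considered free of chance correlation when its own statistic is clearly separated from the distribution obtained on the scrambled data.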
29 QSAR/QSPR models Development Validation Application
30 QSPR models: test compound -> prediction performance. Robustness of QSPR models: descriptor type; descriptor selection; machine-learning methods; validation of models. Applicability domain of models: is a test compound similar to the training set compounds?
31 Applicability domain of QSAR models. The new compound will be predicted by the model only if $D_i \le \langle D_k \rangle + Z s_k$, with Z an empirical parameter (0.5 by default). (Figure: training set in the Descriptor 1 / Descriptor 2 plane; a test compound inside the domain will be predicted, one outside the domain will not be predicted.)
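One way to read this criterion is sketched below. The slide does not define the symbols, so the interpretation is an assumption: D_i is taken as the mean Euclidean distance from the test compound to its k nearest training compounds, and ⟨D_k⟩ and s_k as the mean and standard deviation of the same quantity computed for each training compound against the rest of the training set. All function names are illustrative.

```python
import math
from statistics import mean, pstdev

def knn_mean_distance(point, others, k):
    """Mean Euclidean distance from point to its k nearest neighbours in others."""
    distances = sorted(math.dist(point, o) for o in others)
    return mean(distances[:k])

def domain_threshold(train, k=2, z=0.5):
    """<D_k> + Z * s_k, computed over the training set (assumed definition)."""
    d = [
        knn_mean_distance(p, train[:i] + train[i + 1:], k)
        for i, p in enumerate(train)
    ]
    return mean(d) + z * pstdev(d)

def in_domain(query, train, k=2, z=0.5):
    """True if the query compound would be predicted by the model."""
    return knn_mean_distance(query, train, k) <= domain_threshold(train, k, z)

train = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
print(in_domain((0.5, 0.5), train), in_domain((5.0, 5.0), train))
```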
32 Applicability domain of QSAR models Range based methods Bounding Box (BB)
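A bounding box is the simplest range-based domain: a compound is inside if every descriptor lies between the minimum and maximum seen in the training set. The sketch below uses illustrative names and made-up descriptor values.

```python
def bounding_box(train):
    """Per-descriptor (min, max) ranges over the training set."""
    return [(min(col), max(col)) for col in zip(*train)]

def inside_box(compound, box):
    """True if every descriptor value falls within its training range."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(compound, box))

train = [(0.2, 10.0), (0.8, 14.0), (0.5, 12.0)]
box = bounding_box(train)
print(box, inside_box((0.6, 11.0), box), inside_box((0.6, 20.0), box))
```

The box ignores correlations between descriptors and empty regions inside the ranges, which is the usual limitation of this method.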
33 Should one use only one individual model or many models? ensemble modeling
34 Hunting season Single hunter
35 Hunting season Many hunters
36 Ensemble modelling: Model 1, Model 2, Model 3, Model 4
37 Property (Y) predictions using best-fit models
Compound      model 1   model 2   mean ± s
Compound 1    Y_11      Y_12      <Y_1> ± ΔY_1
Compound 2    Y_21      Y_22      <Y_2> ± ΔY_2
...
Compound m    Y_m1      Y_m2      <Y_m> ± ΔY_m
Grubbs statistics are used to exclude the outliers.
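The consensus value for one compound can be sketched as below: a single-pass Grubbs test removes the most extreme individual prediction if its statistic exceeds a critical value, and the mean and standard deviation are taken over the rest. The critical value `g_crit` is an assumed input that would normally be read from a Grubbs table for the given number of models and significance level; the predictions are made up.

```python
from statistics import mean, stdev

def grubbs_filter(values, g_crit):
    """Drop the single most extreme value if its Grubbs statistic exceeds g_crit."""
    m, s = mean(values), stdev(values)
    if s == 0:
        return list(values)  # all predictions identical, nothing to exclude
    suspect = max(values, key=lambda v: abs(v - m))
    if abs(suspect - m) / s > g_crit:
        kept = list(values)
        kept.remove(suspect)
        return kept
    return list(values)

def consensus(values, g_crit):
    """Mean and standard deviation of the predictions after outlier removal."""
    kept = grubbs_filter(values, g_crit)
    return mean(kept), stdev(kept)

# Predictions of five hypothetical models for one compound.
predictions = [2.0, 2.1, 1.9, 2.05, 5.0]
# 1.715: assumed tabulated two-sided critical value for 5 values at the 5% level.
y_mean, y_s = consensus(predictions, g_crit=1.715)
print(y_mean, y_s)
```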
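38 Calculation of descriptors: ISIDA FRAGMENTOR converts the data set into the pattern matrix of fragment descriptors (e.g., C-C-C-C-C-C, C=O, C-C-C-N-C-C, C-C-C-N, C-N-C-C*C, etc.).
39 LEARNING STAGE: building of models from the PATTERN MATRIX and the PROPERTY VALUES -> QSAR models. VALIDATION STAGE: QSAR models filtering -> selection of the most predictive ones.
40 Example: linear QSPR model. $\mathrm{Property} = a_0 + \sum_{i=1}^{k} a_i D_i$, for instance $\mathrm{PROPERTY}_{\mathrm{calc}} = a_0 + a_1 N_{\text{C-C-C-N-C-C}} + a_2 N_{\text{C=O}} + a_3 N_{\text{C-N-C*C}} + \dots$, where N is the number of occurrences of the given fragment in the molecule.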
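Evaluating such a fragment-based linear model is a dot product between coefficients and fragment counts. In the sketch below the intercept, the coefficients and the counts are hypothetical numbers chosen for illustration; only the fragment labels come from the slide.

```python
def predict_property(a0, coefficients, fragment_counts):
    """Property = a0 + sum_i a_i * N_i over the fragments of the model."""
    return a0 + sum(
        a_i * fragment_counts.get(fragment, 0)
        for fragment, a_i in coefficients.items()
    )

# Hypothetical coefficients for the fragments named on the slide.
coefficients = {"C-C-C-N-C-C": 0.25, "C=O": -1.0, "C-N-C*C": 0.75}
# Hypothetical fragment counts for one molecule (absent fragments count as 0).
counts = {"C-C-C-N-C-C": 2, "C=O": 1}
value = predict_property(0.5, coefficients, counts)
print(value)
```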
41 Virtual screening with QSAR/QSPR models
42 Screening and hits selection: database -> virtual screening with a QSPR model -> hits (useless compounds are discarded) -> experimental tests.
43 Combinatorial Library Design
44 Generation of virtual combinatorial libraries. (Figure: a phosphorus-centred Markush structure with substituents R1, R2 and R3; each combination of allowed substituents gives one enumerated structure of the library.)
45 The types of variation in Markush structures: 1. Substituent variation (R1), e.g. R1 = Me, Et, Pr; 2. Position variation (R2), e.g. R2 = NH2, Cl; 3. Frequency variation, e.g. (CH2)n, n = 1-3; 4. Homology variation (R3) (only for patent search), e.g. R3 = alkyl or heterocycle.
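Enumerating a virtual library from substituent and frequency variations amounts to a Cartesian product over the allowed values. The sketch below works on plain text labels rather than real chemical structures; the template string and the function name are hypothetical, and position and homology variation are not handled.

```python
from itertools import product

def enumerate_markush(template, variations):
    """Fill the template with every combination of the allowed variations."""
    names = list(variations)
    library = []
    for combo in product(*(variations[name] for name in names)):
        library.append(template.format(**dict(zip(names, combo))))
    return library

variations = {
    "R1": ["Me", "Et", "Pr"],  # substituent variation
    "R2": ["NH2", "Cl"],       # second substituent set
    "n": [1, 2, 3],            # frequency variation of (CH2)n
}
library = enumerate_markush("{R1}-(CH2){n}-{R2}", variations)
print(len(library), library[0])
```

The library size is the product of the numbers of choices, which is why even small Markush structures quickly generate very large virtual libraries.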
46 IN SILICO design of new compounds
47 - Acquisition of Data; - Acquisition of Knowledge; - Exploitation of Knowledge «In silico» design of new compounds
48 ISIDA combinatorial module. The combinatorial module generates virtual libraries based on Markush structures (throughput quoted on the slide in molecules/second). Stages of the ISIDA design cycle shown on the slide: database; filtering (2); similarity search (3); QSAR models and applicability domains (4); assessment of properties (5); hits selection (6); synthesis and experimental tests (7).
49 COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Binding of $\mathrm{UO_2^{2+}}$ by monoamides R1-C(=O)-N(R2)(R3), R = H, alkyl. Distribution ratio: $D = [U]_{\text{organic phase}} / [U]_{\text{aqueous phase}}$. A. Varnek, D. Fourches, V. Solov'ev, O. Klimchuk, A. Ouadi, I. Billard, J. Solv. Extr. Ion Exch., 2007, 25, N 4
50 SOLVENT EXTRACTION OF METALS M 2 + An - M 1 + L
51 COMPUTER-AIDED DESIGN OF NEW METAL BINDERS: Extraction of $\mathrm{UO_2^{2+}}$ by monoamides. Reprocessing of spent nuclear fuel: PUREX process, La Hague plant (Usine de La Hague), France. TBP: tributyl phosphate
52 Goal: theoretical design of new uranyl binders more efficient than previously studied molecules 1. T. H. Siddall III, J. Phys. Chem., 64, 1863 (1960) 2. C. Rabbe, C. Sella, C. Madic, A. Godard, Solv. Extr. Ion Exch, 17, 87 (1999)
53 ISIDA workflow (components: DATABASE, DATA TREATMENT, EXPERT SYSTEM, PREDICTOR): virtual screening of the virtual library, followed by hits selection. Selected hits: 21 compounds.
54 In silico design of uranyl binders with ISIDA
55 Experimental vs. predicted logD for the new amides (figure; x-axis: amide ID)
56 Enrichment of the initial data set by new efficient extractants with logD > 0.9: 4 compounds (previously studied) vs. 9 compounds (newly synthesized). (Figure: number of compounds vs. logD for the previously studied and the newly synthesized amides.)
57 Classification Models
58 Confusion matrix. For N instances, K classes and a classifier, N_ij is the number of instances of class i classified as j.
          Class1   Class2   ...   ClassK
Class1    N_11     N_12     ...   N_1K
Class2    N_21     N_22     ...   N_2K
...
ClassK    N_K1     N_K2     ...   N_KK
59 Classification evaluation. Global measures of success: measures are estimated on all classes. Local measures of success: measures are estimated for each class.
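As an illustration of the two kinds of measures, the sketch below computes one global measure (overall accuracy) and two local measures (per-class recall and precision) from a confusion matrix N_ij with rows as true classes and columns as assigned classes. The slide does not name specific measures, so this choice and the example counts are assumptions.

```python
def accuracy(matrix):
    """Global measure: fraction of all instances on the diagonal."""
    total = sum(sum(row) for row in matrix)
    return sum(matrix[i][i] for i in range(len(matrix))) / total

def recall(matrix, i):
    """Local measure: fraction of true class i instances classified as i."""
    return matrix[i][i] / sum(matrix[i])

def precision(matrix, j):
    """Local measure: fraction of instances classified as j that are truly j."""
    return matrix[j][j] / sum(row[j] for row in matrix)

# Hypothetical two-class confusion matrix (rows: true class, columns: predicted).
matrix = [[50, 10],
          [5, 35]]
print(accuracy(matrix), recall(matrix, 0), precision(matrix, 0))
```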
60 The most fundamental and lasting objective of synthesis is not production of new compounds but production of properties George S. Hammond Norris Award Lecture, 1968
Linear Regression with 1 Regressor Introduction to Econometrics Spring 2012 Ken Simons Linear Regression with 1 Regressor 1. The regression equation 2. Estimating the equation 3. Assumptions required for
More informationEstimating the accuracy of a hypothesis Setting. Assume a binary classification setting
Estimating the accuracy of a hypothesis Setting Assume a binary classification setting Assume input/output pairs (x, y) are sampled from an unknown probability distribution D = p(x, y) Train a binary classifier
More informationIntroduction to Machine Learning Midterm Exam Solutions
10-701 Introduction to Machine Learning Midterm Exam Solutions Instructors: Eric Xing, Ziv Bar-Joseph 17 November, 2015 There are 11 questions, for a total of 100 points. This exam is open book, open notes,
More informationComparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition
Comparing Robustness of Pairwise and Multiclass Neural-Network Systems for Face Recognition J. Uglov, V. Schetinin, C. Maple Computing and Information System Department, University of Bedfordshire, Luton,
More informationStatistical Machine Learning from Data
Samy Bengio Statistical Machine Learning from Data 1 Statistical Machine Learning from Data Ensembles Samy Bengio IDIAP Research Institute, Martigny, Switzerland, and Ecole Polytechnique Fédérale de Lausanne
More informationData Mining Classification: Basic Concepts and Techniques. Lecture Notes for Chapter 3. Introduction to Data Mining, 2nd Edition
Data Mining Classification: Basic Concepts and Techniques Lecture Notes for Chapter 3 by Tan, Steinbach, Karpatne, Kumar 1 Classification: Definition Given a collection of records (training set ) Each
More informationChemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility
Chemical Reaction Databases Computer-Aided Synthesis Design Reaction Prediction Synthetic Feasibility Dr. Wendy A. Warr http://www.warr.com Warr, W. A. A Short Review of Chemical Reaction Database Systems,
More informationDrift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares
Drift Reduction For Metal-Oxide Sensor Arrays Using Canonical Correlation Regression And Partial Least Squares R Gutierrez-Osuna Computer Science Department, Wright State University, Dayton, OH 45435,
More informationISyE 691 Data mining and analytics
ISyE 691 Data mining and analytics Regression Instructor: Prof. Kaibo Liu Department of Industrial and Systems Engineering UW-Madison Email: kliu8@wisc.edu Office: Room 3017 (Mechanical Engineering Building)
More informationLecture 6. Notes on Linear Algebra. Perceptron
Lecture 6. Notes on Linear Algebra. Perceptron COMP90051 Statistical Machine Learning Semester 2, 2017 Lecturer: Andrey Kan Copyright: University of Melbourne This lecture Notes on linear algebra Vectors
More informationLecture 5: Clustering, Linear Regression
Lecture 5: Clustering, Linear Regression Reading: Chapter 10, Sections 3.1-3.2 STATS 202: Data mining and analysis October 4, 2017 1 / 22 Hierarchical clustering Most algorithms for hierarchical clustering
More informationPrediction in bioinformatics applications by conformal predictors
Prediction in bioinformatics applications by conformal predictors Alex Gammerman (joint work with Ilia Nuoretdinov and Paolo Toccaceli) Computer Learning Research Centre Royal Holloway, University of London
More informationLinear Regression. In this problem sheet, we consider the problem of linear regression with p predictors and one intercept,
Linear Regression In this problem sheet, we consider the problem of linear regression with p predictors and one intercept, y = Xβ + ɛ, where y t = (y 1,..., y n ) is the column vector of target values,
More informationDirect Learning: Linear Regression. Donglin Zeng, Department of Biostatistics, University of North Carolina
Direct Learning: Linear Regression Parametric learning We consider the core function in the prediction rule to be a parametric function. The most commonly used function is a linear function: squared loss:
More informationClassification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees
Classification for High Dimensional Problems Using Bayesian Neural Networks and Dirichlet Diffusion Trees Rafdord M. Neal and Jianguo Zhang Presented by Jiwen Li Feb 2, 2006 Outline Bayesian view of feature
More informationUsing AutoDock for Virtual Screening
Using AutoDock for Virtual Screening CUHK Croucher ASI Workshop 2011 Stefano Forli, PhD Prof. Arthur J. Olson, Ph.D Molecular Graphics Lab Screening and Virtual Screening The ultimate tool for identifying
More informationComputational statistics
Computational statistics Lecture 3: Neural networks Thierry Denœux 5 March, 2016 Neural networks A class of learning methods that was developed separately in different fields statistics and artificial
More informationMachine Learning for OR & FE
Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com
More informationHierarchical Boosting and Filter Generation
January 29, 2007 Plan Combining Classifiers Boosting Neural Network Structure of AdaBoost Image processing Hierarchical Boosting Hierarchical Structure Filters Combining Classifiers Combining Classifiers
More informationStatistical Methods for Data Mining
Statistical Methods for Data Mining Kuangnan Fang Xiamen University Email: xmufkn@xmu.edu.cn Linear regression Linear regression is a simple approach to supervised learning. It assumes that the dependence
More informationIntroduction to Chemoinformatics
Introduction to Chemoinformatics Dr. Igor V. Tetko Helmholtz Zentrum München - German Research Center for Environmental Health (GmbH) Institute of Bioinformatics & Systems Biology (HMGU) Kyiv, 10 August
More informationIntroduction. OntoChem
Introduction ntochem Providing drug discovery knowledge & small molecules... Supporting the task of medicinal chemistry Allows selecting best possible small molecule starting point From target to leads
More informationDevelopment of a Structure Generator to Explore Target Areas on Chemical Space
Development of a Structure Generator to Explore Target Areas on Chemical Space Kimito Funatsu Department of Chemical System Engineering, This materials will be published on Molecular Informatics Drug Development
More information