Assessing and Improving the Reliability of In Silico Property Predictions by Incorporating Inhouse. Pranas Japertas

Similar documents
ALOGPS is a free on-line program to predict lipophilicity and aqueous solubility of chemical compounds

Can we estimate the accuracy of ADMET predictions?

Encoding molecular structures as ranks of models: A new, secure way for sharing chemical data and development of ADME/T models. Igor V.

Plan. Lecture: What is Chemoinformatics and Drug Design? Description of Support Vector Machine (SVM) and its used in Chemoinformatics.

QSAR/QSPR modeling. Quantitative Structure-Activity Relationships Quantitative Structure-Property-Relationships

What is a property-based similarity?

90.14 g/mol x g/mol. Molecular formula: molecular formula 2 empirical formula 2 C OH C O H

J. R. Vanderveen, L. Patiny, C. B. Chalifoux, M. J. Jessop, and P. G. Jessop*

Learning Theory. Piyush Rai. CS5350/6350: Machine Learning. September 27, (CS5350/6350) Learning Theory September 27, / 14

Applications of multi-class machine

Structure-Activity Modeling - QSAR. Uwe Koch

Practical QSAR and Library Design: Advanced tools for research teams

1.QSAR identifier 1.1.QSAR identifier (title): Nonlinear QSAR model for acute oral toxicity of rat 1.2.Other related models:

Exploring the black box: structural and functional interpretation of QSAR models.

Translating Methods from Pharma to Flavours & Fragrances

Read Across, SARs and QSARs for Acute Inhalation Toxicity. Tiffany Bredfeldt Carla Kinslow Roberta Grant

Kinome-wide Activity Models from Diverse High-Quality Datasets

Validation of the GastroPlus TM Software Tool and Applications

1.3.Software coding the model: QSARModel Molcode Ltd., Turu 2, Tartu, 51014, Estonia

Chapter 11 Review Packet

Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases

II. Linear Models (pp.47-70)

SAMPL6 pka Challenge

OECD QSAR Toolbox v.3.3. Step-by-step example of how to build a userdefined

Introduction to Chemoinformatics and Drug Discovery

Chapter 17 Additional Aspects of Aqueous Equilibria (Part A)

QMRF# Title. number and title in JRC QSAR Model Data base 2.0 (new) number and title in JRC QSAR Model Data base 1.0

Estimation of Mars surface physical properties from hyperspectral images using the SIR method

Alternative Clustering, Multiview Clustering: What Can We Learn From Each Other?

In Silico Investigation of Off-Target Effects

Part (I): Solvent Mediated Aggregation of Carbonaceous Nanoparticles (NPs) Udayana Ranatunga

Introduction to Chemoinformatics

Gaussian Processes: We demand rigorously defined areas of uncertainty and doubt

INSTRUCTIONAL FOCUS DOCUMENT High School Courses Science/Chemistry

descriptors in QSAR of multi-functional molecules

Data Quality Issues That Can Impact Drug Discovery

Chromatographic Hydrophobicity Index (CHI) Application to Agrochemical Research. Eric Clarke & Beth Shirley, Chemistry Design Group, Jealott s Hill

7.1 Sampling Error The Need for Sampling Distributions

Acid-Base Chemistry & Organic Compounds. Chapter 2

Clustering Ambiguity: An Overview

PREDICTION OF SOIL ORGANIC CARBON CONTENT BY SPECTROSCOPY AT EUROPEAN SCALE USING A LOCAL PARTIAL LEAST SQUARE REGRESSION APPROACH

Predicting volume of distribution with decision tree-based regression methods using predicted tissue:plasma partition coefficients

Estimating Predictive Uncertainty for Ensemble Regression Models by Gamma Error Analysis

Supporting Information

Structural interpretation of QSAR models a universal approach

Reliability of inference (1 of 2 lectures)

Next Generation Computational Chemistry Tools to Predict Toxicity of CWAs

Linear Regression (continued)

Chapter 17 Additional Aspects of Aqueous Equilibria (Part A)

Screening and prioritisation of substances of concern: A regulators perspective within the JANUS project

Goldilocks I: Bronsted Acids and Bases p.1

Batch Mode Sparse Active Learning. Lixin Shi, Yuhang Zhao Tsinghua University

QSPR MODELLING FOR PREDICTING TOXICITY OF NANOMATERIALS

Machine learning for ligand-based virtual screening and chemogenomics!

OECD QSAR Toolbox v.4.1. Step-by-step example for building QSAR model

Advanced Medicinal Chemistry SLIDES B

Development of a Structure Generator to Explore Target Areas on Chemical Space

That s Hot: Predicting Daily Temperature for Different Locations

Computational Chemistry in Drug Design. Xavier Fradera Barcelona, 17/4/2007

1.QSAR identifier 1.1.QSAR identifier (title): QSARINS model for inhalation toxicity of polyfluorinated compounds

Acid - Base Studies. Goals : Safety : HCl, NaOH, HC 2 H 3 O 2 and NH 3 are corrosive. If. Waste : Solutions may be flushed down the sink.

(e.g.training and prediction set, algorithm, ecc...). 2.9.Availability of another QMRF for exactly the same model: No other information available

1.QSAR identifier 1.1.QSAR identifier (title): QSAR for female rat carcinogenicity (TD50) of nitro compounds 1.2.Other related models:

Prediction of Acute Toxicity of Emerging Contaminants on the Water Flea Daphnia magna by Ant Colony Optimization - Support Vector Machine QSTR models

An automatic report for the dataset : affairs

Chemometrics. 1. Find an important subset of the original variables.

molecules ISSN

Classification of Ordinal Data Using Neural Networks

Data Science for Engineers Department of Computer Science and Engineering Indian Institute of Technology, Madras

Xia Ning,*, Huzefa Rangwala, and George Karypis

OECD QSAR Toolbox v.3.0

Survey Calculations Single Surveys

CHEM1902/ N-4 November 2004

Estimation of Melting Points of Brominated and Chlorinated Organic Pollutants using QSAR Techniques. By: Marquita Watkins

Text Mining. Dr. Yanjun Li. Associate Professor. Department of Computer and Information Sciences Fordham University

Lecture 4 Capacity of Wireless Channels

Statistical Pattern Recognition

1. Create a scatterplot of this data. 2. Find the correlation coefficient.

Aijun An and Nick Cercone. Department of Computer Science, University of Waterloo. methods in a context of learning classication rules.

INSTRUCTIONAL FOCUS DOCUMENT HS/Integrated Physics and Chemistry (IPC)

Quantitative Structure-Activity Relationship (QSAR) computational-drug-design.html

A short introduction to supervised learning, with applications to cancer pathway analysis Dr. Christina Leslie

Essential Organic Chemistry. Chapter 9

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Strength-based and time-based supervised networks: Theory and comparisons. Denis Cousineau Vision, Integration, Cognition lab. Université de Montréal

Building innovative drug discovery alliances. Just in KNIME: Successful Process Driven Drug Discovery

Bayesian Inference Technique for Data mining for Yield Enhancement in Semiconductor Manufacturing Data

QSAR Modeling of ErbB1 Inhibitors Using Genetic Algorithm-Based Regression

2009 Optibrium Ltd. Optibrium, STarDrop, Auto-Modeler and Glowing Molecule are trademarks of Optibrium Ltd.

Goldilocks I: Bronsted Acids and Bases p.1

Chemistry. Student Number. Mark / 64. Final Examination Preliminary Course General Instructions. Total Marks 64

Combinatorial Heterogeneous Catalysis

Understanding Chemical Reactions through Computer Modeling. Tyler R. Josephson University of Delaware 4/21/16

OpenDiscovery: Automated Docking of Ligands to Proteins and Molecular Simulation

Chapter Learning Objectives. Regression Analysis. Correlation. Simple Linear Regression. Chapter 12. Simple Linear Regression

R E A D : E S S E N T I A L S C R U M : A P R A C T I C A L G U I D E T O T H E M O S T P O P U L A R A G I L E P R O C E S S. C H.

OECD QSAR Toolbox v.4.1

OECD QSAR Toolbox v.3.3

Chapter 4 Molecular Compounds 4.11 Naming Binary Molecular Compounds (No Metals!)

Low-Level Analysis of High- Density Oligonucleotide Microarray Data

Transcription:

Assessing and Improving the Reliability of In Silico Property Predictions by Incorporating Inhouse Data Pranas Japertas

Outline Why do in silico models fail? Knowing when in silico models fail assessing Model Applicability Domain (Reliability Index) Relating RI to accuracy of predictions Improving models with in-house data Case studies of model improvement with in-house data Connecting measurement and prediction

Why do in silico models fail?

Why do in silico models fail? Irrelevant descriptors? Statistical techniques not sophisticated enough? Limited diversity of the training set? Improper usage of statistical tools? Poor data quality of the training set?

Any model, no matter which descriptors or statistical methods were used in its development cannot be better than the data it is based on. Every empirical model works only in certain chemical space, where the compounds from the training set are located boundaries of Model Applicability Domain

Model Space (Applicability Domain) Compound is in the Model Space. Expected residual error should be comparable to the errors in the validation Compound is outside Model Space. Reliability of prediction cannot be estimated. Compound in the Training Set of the Model Model Space (Applicability Domain) New Compound, we are making prediction for

Model Space and Chemical Space of proprietary Compounds Compound in the Training Set of the Model Model Space (Applicability Domain) New Compound, we are making prediction for Chemical Space of in-house compounds

The very FIRST question in silico model should answer: Is a compound in the Model Applicability Domain (Model Space)? Can we trust this prediction? What is the predicted value for property X is only the second question.

Assessing Model Applicability Domain. Reliability Index (RI) as a measure of the quality of a prediction

Pathway to the Reliability Index Similarity Index Data-Model Consistency Index Reliability Index

The First Step Finding most similar compounds CH 3 CH 3 CH 3 H 3C Compound we are making prediction for Compounds in the training set The most similar compounds in the training set

Calculating Similarity Index SI = 0.89 n most similar compounds from the database: CH 3 CH 3 CH 3 H 3 C SI 1 =0.97 SI 2 =0.92 SI 3 =0.88 SI 4 =0.87 SI 5 =0.80 Similarity Index (SI) is calculated as weighted average pair wise similarity to n most similar compounds in the training set: n n i 1 i 1 i i= 1 i= 1 = = SI a SI / a 0.82

Similarity Index as initial criterion for the assessment of Model Applicability Domain Similarity Index obtains values in the range from 0 (nothing similar exists) to 1 (n completely similar compounds exist), making it easily understandable and usable Similarity Index is a simple but efficient criterion identifying compounds that DO NOT belong to Model Applicability Domain But having similar compounds does not necessary means that predictions will accurate and reliable Let s consider the following example with prediction of Acute Toxicity (LD50) values for the doxorubicin

Doxorubicin, Rat/IP Rat, Intraperitoneal administration: Estimated LD 50 = 13 mg/kg RI = 0.71 Five most similar compounds with experimental : values from the library: LD 50 = 10mg/kg SI = 1.00 LD 50 = 31 mg/kg SI = 1.00 LD 50 = 8.6 mg/kg SI = 0.93 LD 50 = 4.1 mg/kg SI = 0.92 LD 50 = 0.32 mg/kg SI = 0.83

Doxorubicin, Mouse/Oral Mouse, Oral administration: Estimated LD RI = 0.04 50 = 65 mg/kg Five most similar compounds with experimental : values from the library: LD 50 = 570 mg/kg SI = 1.00 LD 50 = 698 mg/kg SI = 1.00 LD 50 = 205 mg/kg SI = 0.98 LD 50 = 16 mg/kg SI = 0.98 LD 50 = 0.5 mg/kg SI = 0.93

Experimental LD50s for similar compounds Closer look at the experimental values of LD50s for similar compounds in the training set: : LD 50 = 570 mg/kg SI = 1.00 LD 50 = 698 mg/kg SI = 1.00 LD 50 = 205 mg/kg SI = 0.98 LD 50 = 16 mg/kg SI = 0.98 LD 50 = 0.5 mg/kg SI = 0.93 LD50 ranges from 0.5 mg/kg to 700 mg/kg!

Can we have an idea how well model will perform by looking at similar compounds?

How well will we predict? (1) CH 3 CH 3 CH 3 H 3C Compound we are making prediction for Compounds in the training set The most similar compounds in the training set

How well will we predict? (2) CH 3 1.97 1.75 CH 3 0.92 2.28 0.73 1.69 Observed H 3C CH 3 1.47 0.80 1.93 0.49 Predicted The most similar compounds in the training set Retrieve measured values Predict values using model Analyse model performance for those compounds

Model performance for similar compounds (1) Two compounds, with the same Similarity Index for both. Scatter plots illustrating model performance for similar compounds. Observed Observed Predicted Similar compounds of compound A Predicted Similar compounds of compound B Which prediction shall we trust more? For compound A or B?

Model performance for similar compounds (2) Observed Observed Predicted Compound A: Model predictions for similar compounds agree with experimental data Predicted Compound B: Model doesn t agree with experimental data Reasons: 1. Model doesn t work 1. Model doesn t work in general 2. Model doesn t work for this particular class of compounds 3. Similarity doesn t work selected compound are not similar 2. Data quality is bad 3. Both model and data are bad

Model-Data Consistency Index Considerations from the previous slides were reflected in the developed Model-Data Consistency Index, which is calculated looking at the predicted and measured values for the similar compounds: MDCI n n i ( 1 2 a SIi ( Δi Δ) / = i= 1 i= 1 e a i 1 ) / b Scaling to [0;1] range Summation of residual errors between observed and predicted values for similar compounds Δ i Δ SI i a,b - Difference between estimated and experimental value for the i th nearest neighbor - Average difference for the neighbors - Similarity to the i th nearest neighbor - Scaling parameters

Reliability Index Reliability Index is a product of Similarity and Model- Data Consistency indices: RI = SI MDCI RI will be low if SI is low (no similar compounds) OR MDCI (model doesn t agree with the experiment for the similar compounds) is low RI will be high only if we have similar compounds AND model performs well compared to measured values on those compounds

Demonstrating relationship between accuracy of the predictions and Reliability Index (Validation)

RMSE vs. RI (1) System: LogP (training set size: 10593 compounds) Results on the validation set: 10 8 y = 1.003x - 0.0168 R 2 = 0.9318 10 8 y = 1.0164x - 0.0439 R 2 = 0.9579 10 8 6 6 y = 1.0094x - 0.0329 R 2 6 = 0.9742 4 4 4 2 2 2 0-6 -4-2 0 2 4 6 8 10-2 0-6 -4-2 0 2 4 6 8 10-2 0-6 -4-2 0 2 4 6 8 10-2 -4-4 -4-6 -6-6 R 2 = 0.9318 RMSE = 0.503 N = 5296 R 2 = 0.9579 RMSE = 0.404 N = 3486 (compounds with estimated RI > 0.75) R 2 = 0.9742 RMSE = 0.313 N = 1220 (compounds with estimated RI > 0.85)

RMSE vs. RI (2) System: pk a (base) (training set size: 8335 compounds) Results on the validation set: 15.00 y = 0.9972x - 0.0265 R 2 = 0.9453 15.00 y = 1.0028x - 0.0512 R 2 = 0.9757 15.00 y = 0.9996x - 0.0157 R 2 = 0.9877 10.00 10.00 10.00 5.00 5.00 5.00 - -5.00-5.00 10.00 15.00 - -5.00-5.00 10.00 15.00 - -5.00-5.00 10.00 15.00-5.00-5.00-5.00 R 2 = 0.9453 RMSE = 0.723 N = 7950 (compounds with estimated RI > 0.5) R 2 = 0.9757 RMSE = 0.460 N = 5498 (compounds with estimated RI > 0.8) R 2 = 0.9877 RMSE = 0.302 N = 3010 (compounds with estimated RI > 0.9)

Improving models using in-house data

Expanding Model Applicability Domain In-house compounds with measured data, somehow being added to the model Compound in the Training Set of the Model Model Space (Applicability Domain) New Compound, we are making prediction for Chemical Space of in-house compounds

Algorithm for Trainable Models Prediction algorithm behind Trainable Models consists of two parts: PLS + Correction according to similar structures Linear, additive method Learns global trends and what s similar for particular property It s constant, doesn t change Adds non-linearity Makes correction to the original prediction from PLS by analysis of local environment It s changing, when new data is added trainable part of the algorithm

CH 3 CH 3 CH 3 H 3C Compound we are making prediction for Compounds in the training set The most similar compounds in the training set

CH 3 1.97 1.75 CH 3 0.92 2.28 0.73 1.69 Observed H 3C CH 3 1.47 0.80 1.93 0.49 Predicted The most similar compounds in the training set Retrieve measured values Predict values using model Do local modeling

PLS Illustration y Simple model relating property y and descriptor x PLS analysis would come up with linear model y = 0.5*x + 0.7 y = 0.5 x + 0.7 x New compound (x = -4.5), using equation y = 0.5 * x + 0.7, we get y = -1.55

PLS + Similarity Illustration y = -1.55 + Δ(mean) y = -1.55 + -0.78 y = -2.33 y y = 0.5 x + 0.7 Now we also explore local environment, analyzing how models performs for similar compounds with known measurements Let s look at the difference between Observed and Predicted values for those compounds: x Δ = -1.93 Δ = -1.42 Δ = -0.40 Δ = +0.63 On average: Δ(mean) = -0.78

Validation Studies of Trainable Models

External Validation at Syngenta, Solubility Model Validation case: Take existing model of aqueous solubility (based on publicly available data). Add portions of in-house measured data retraining the model. Check predictivity against validation set of in-house compounds after each addition of experimental data.

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

External Validation at Syngenta, Solubility Model System: Aqueous solubility (training set size varies) Results on the validation set: Whole test set Unreliable removed (RI>0,3) Moderate and high (RI>0,5) High (RI>0,7) Size of the training set No of cmpds MAE No of cmpds MAE No of cmpds MAE No of cmpds MAE Built-in 400 0.900 270 0.841 48 0.695 Built-in + 100 400 0.894 291 0.873 57 0.710 Built-in + 250 400 0.868 303 0.841 76 0.703 Built-in + 500 400 0.821 327 0.774 144 0.586 58 0.469 Built-in + 750 400 0.697 365 0.658 256 0.534 118 0.434 Built-in + 913 400 0.624 382 0.616 304 0.530 154 0.433 MAE ranges (0,8-0,9] (0,7-0,8] (0,6-0,7] (0,5-0,6] (0,4-0,5]

Any model, no matter which descriptors or statistical methods were used in its development cannot be better than the data it is based on. Every empirical model works only in certain chemical space, where the compounds from the training set are located boundaries of Model Applicability Domain Even within Model Applicability Domain model cannot get more accurate than the unexplainable variation in measured values.

Unexplainable variability in measured data 1. Measurements from the same laboratory p p x 2. Same protocol from several laboratories p x 3. Compilation of publicly available data x

Models incorporating in-house data Better accuracy is achieved because: Chemical space is much better represented, model learns structureproperty relationships for new chemical classes Several layers of data noise are removed as we use much more consistent data model learns your methodology, your protocol

Improving models using in-house data. How many compounds do you need to improve a model?

Validation case: Special built LogP model no beta-lactam antibiotics in the training set! Add beta-lactam antibiotics with measured LogP one after another to improve the model Check predictivity against validation set and selected compounds (beta-lactam antibiotics).

Phenoxymethylcephalosporin example O O HO O Measured LogP = 0.26 O NH N S O Calculated LogP = 1.46 Estimated RI = 0.26 H 3 C O Five most similar compounds from the library: SI = 0.49 SI = 0.17 SI = 0.12 SI = 0.11 SI = 0.10

Model training with beta-lactam antibiotics class Prediction results for Phenoxymethylcephalosporin: Predicted LogP RI Predicted - Observed Original prediction (no beta lactam antibiotics in the training set) 1.46 0.26 1.2 + S O NH O N S O OH S H 3 C N N N N 1.05 0.53 0.79 + O H 3 C OH N S O NH O S 0.62 0.61 0.36 + N N S S O OH N S O NH O S 0.32 0.71 0.06 H 3 C

Applications of the Reliability Index and Traiinable Models Connecting measurement and prediction

As it was shown in the previous slides, RI indeed reflects Model Applicability Domain and can be used as indicator of usability of predicted values RI can also be used in experiment planning and prioritization of the measurements most knowledge will be gained measuring compounds for which estimated RI is the lowest.

Scale of Reliability Index: 0 1 Low RI values, these compounds should be measured first and if possible added to model We expect that amount of compounds with estimated intermediate Reliability Index will decrease after properties of compounds with low RI will be measured and values added to the model for further improvement High RI values, class of compounds well represented in the training set, model already performs well on those compounds. Measurements would be redundant

Acknowledgements Pharma Algorithms Rytis Kubilius Andrius Sazonovas Remigijus Didžiapetris Kiril Lanevskij Syngenta Eric Clarke John Delaney Tom Sheldon (Industrial Placement Student, University of Bath) PhysChem Forum 5 Organizers

Thank You for your attention