Supporting Information

Convolutional Embedding of Attributed Molecular Graphs for Physical Property Prediction

Connor W. Coley a, Regina Barzilay b, William H. Green a, Tommi S. Jaakkola b, Klavs F. Jensen a

a Department of Chemical Engineering, Massachusetts Institute of Technology; 77 Massachusetts Avenue, Cambridge, MA
b Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology; 77 Massachusetts Avenue, Cambridge, MA
Corresponding author, kfjensen@mit.edu

S1. Full embedding algorithm

The model architecture for convolutional embedding, which converts a molecular graph to a learned fingerprint vector, is as follows.

1. Define molecular tensor and initialize network

a. For each molecule, define its molecular tensor as $M \in \mathbb{R}^{N_{\text{atoms}} \times N_{\text{atoms}} \times N_{\text{features}}}$, where $N_{\text{features}}$ is the combined number of atom and bond features and $N_{\text{atoms}}$ is the number of heavy atoms (i.e., not counting hydrogen), such that

$$M_{i,j,:} = \begin{cases} [a_i, 0, \ldots, 0] & i = j \\ [a_j, b_{ij}] & i, j \text{ connected} \\ 0 & \text{otherwise} \end{cases} \qquad \forall i, j \in 1, \ldots, N_{\text{atoms}}$$

where $a_i$ are the atom-level features describing atom $i$ and $b_{ij}$ are the bond-level features describing the bond between atoms $i$ and $j$.

b. Initialize the cumulative molecular fingerprint, $f$, of length $L$ with all zeros, $f = 0$.

c. Initialize the model weights and bias terms required at every stage of the convolution (i.e., depth or radius $r$), $r = 0, 1, \ldots, R$. Inner update weight matrices, $W_{\text{inner}}^{r}$, which are used to convert an atom's and its neighbors' representations into a new representation for that atom, are initialized to be close to the identity matrix. Outer update weight matrices, $W_{\text{outer}}^{r}$, which are used to convert an atom's representation to a longer atom fingerprint representation, are initialized to be close to zero. Small random uniform noise, $\epsilon$, on the order of 0.005, is added to the weights and bias terms.

2. Define initial attributes at radius zero, $r = 0$

Define the initial attribute matrix at $r = 0$, $A^{0}$, that describes the initial feature vector corresponding to each atom,

$$A^{0}_{i,:} = [a_i, 0, \ldots, 0], \qquad \forall i \in 1, \ldots, N_{\text{atoms}}$$

Or, equivalently, using the molecular tensor,

$$A^{0}_{i,:} = M_{i,i,:}, \qquad \forall i \in 1, \ldots, N_{\text{atoms}}$$

3. Generate atom-level fingerprints and add to cumulative molecular fingerprint

For each atom, calculate an atom-level fingerprint based on its current attributes by passing the current attribute vector through an output layer with weights $W_{\text{outer}}^{r}$, bias $b_{\text{outer}}^{r}$, and non-linear activation function $\sigma_{\text{outer}}$ (e.g., softmax). Add this fingerprint to the cumulative molecular fingerprint.

4. Update the attributes associated with each atom

Calculate new attributes for each atom based on its own attributes and those of its neighbors. This is done by passing those values through an internal update hidden layer using weights $W_{\text{inner}}^{r}$ and bias $b_{\text{inner}}^{r}$. Information about neighbor features for each atom $i$ is conveniently stored in the $i$th entry along the first dimension of $M$. Because the rest of $M$ is filled with zeros when two atoms are not connected, we can sum over the second dimension of $M$ to aggregate neighbors; because the atom features of $i$ are recorded along the diagonal, $M_{i,i,:}$, this summation will include the features for $i$ itself. Next, we sum self and neighbor contributions before passing summed attributes through an inner hidden layer,

$$A^{r+1}_{i,:} = \sigma_{\text{inner}}\left(\Big(\sum_{j} M^{r}_{i,j,:}\Big) W_{\text{inner}}^{r} + b_{\text{inner}}^{r}\right), \qquad \forall i \in 1, \ldots, N_{\text{atoms}}$$

By design, $W_{\text{inner}}^{r}$ has the same length along its first two dimensions and is initialized close to the identity matrix, so the updated attributes matrix is the same shape as the previous one, $A^{r+1} \in \mathbb{R}^{N_{\text{atoms}} \times N_{\text{features}}}$.

5. Update the molecular tensor

Update the molecular tensor with new atom attributes. The attributes of an atom will appear in the diagonal of the molecular tensor and in any off-diagonal entry corresponding to a bond. The final index along the third dimension of $M$ is a placeholder feature that describes whether a bond is present.

$$M^{r+1}_{i,j,:} = \begin{cases} A^{r+1}_{j,:} & i = j \text{ or } i, j \text{ connected} \\ M^{r}_{i,j,:} & \text{otherwise} \end{cases} \qquad \forall i \in 1, \ldots, N_{\text{atoms}},\ j \in 1, \ldots, N_{\text{atoms}}$$

6. Repeat until target radius has been reached

Repeat steps 3-5, updating the overall learned fingerprint to include fingerprint contributions at each radius, until the specified maximum radius, $R$, has been reached,

$$r \leftarrow r + 1$$

Learning step

For physical property prediction, the loss function used is the mean squared error; for toxicity prediction on each of the 12 targets, the loss function used is the binary cross-entropy. Parameters (weights, biases) in the embedding architecture are trained at the same time as parameters (weights, biases) in the neural network regression architecture using the Adam or Adadelta update procedures, which operate similarly to gradient descent with a dynamic learning rate. The use of Theano operations in the network implementation enables automatic calculation of analytical gradients, i.e., the derivative of the loss function with respect to all network parameters.

S2. Model hyperparameters
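To make the hyperparameters in this section concrete, the forward pass of the embedding in Section S1 (steps 1 to 6) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' Theano implementation: the function names, the toy molecule, and the symmetric noise range of plus or minus 0.005 are hypothetical; the weights are random and untrained; tanh and softmax are taken from the common settings of Table S1; and step 5 is read literally, so bond features enter only through the first neighbor sum.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def build_molecular_tensor(atom_feats, bonds):
    """Step 1a. atom_feats: (N, F_atom) array; bonds: {(i, j): bond feature vector}."""
    n_atoms, n_atom_feats = atom_feats.shape
    n_bond_feats = len(next(iter(bonds.values())))
    M = np.zeros((n_atoms, n_atoms, n_atom_feats + n_bond_feats))
    mask = np.eye(n_atoms, dtype=bool)               # True where i == j or i, j are bonded
    for i in range(n_atoms):
        M[i, i, :n_atom_feats] = atom_feats[i]       # diagonal: [a_i, 0, ..., 0]
    for (i, j), b in bonds.items():
        M[i, j] = np.concatenate([atom_feats[j], b])  # off-diagonal: [a_j, b_ij]
        M[j, i] = np.concatenate([atom_feats[i], b])  # symmetric entry: [a_i, b_ij]
        mask[i, j] = mask[j, i] = True
    return M, mask

def embed(M, mask, fp_length=512, depth=5, noise=0.005, seed=0):
    """Steps 1b to 6 with untrained, randomly initialized weights (assumption)."""
    rng = np.random.default_rng(seed)
    n_atoms, _, n_feats = M.shape
    fingerprint = np.zeros(fp_length)                # step 1b: all-zero fingerprint
    idx = np.arange(n_atoms)
    A = M[idx, idx, :]                               # step 2: A[i] = M[i, i, :]
    for r in range(depth + 1):
        # Step 1c: outer weights close to zero, one set per radius.
        W_outer = rng.uniform(-noise, noise, (n_feats, fp_length))
        b_outer = rng.uniform(-noise, noise, fp_length)
        # Step 3: atom-level fingerprints, summed into the molecular fingerprint.
        fingerprint += softmax(A @ W_outer + b_outer).sum(axis=0)
        if r == depth:
            break                                    # step 6: maximum radius reached
        # Step 1c: inner weights close to the identity matrix.
        W_inner = np.eye(n_feats) + rng.uniform(-noise, noise, (n_feats, n_feats))
        b_inner = rng.uniform(-noise, noise, n_feats)
        # Step 4: sum self and neighbor entries over the second dimension.
        A = np.tanh(M.sum(axis=1) @ W_inner + b_inner)
        # Step 5: write the new attributes of atom j into every (i, j) that is
        # on the diagonal or bonded; all other entries stay zero.
        M = np.where(mask[:, :, None], A[None, :, :], 0.0)
    return fingerprint

# Toy three-atom chain with hypothetical one-hot atom features and a single
# bond feature acting as the "bond present" flag.
atom_feats = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
bonds = {(0, 1): np.array([1.0]), (1, 2): np.array([1.0])}
M, mask = build_molecular_tensor(atom_feats, bonds)
fp = embed(M, mask, fp_length=16, depth=2)
print(fp.shape, float(fp.sum()))
```

Because each atom contributes one softmax vector (summing to one) at every radius from 0 to the maximum depth, the entries of the toy fingerprint sum to 3 atoms times 3 radii, which is a convenient sanity check on the loop structure.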

Table S1. Model hyperparameters. Common to all models are: inner update activation = tanh, fingerprint output activation = softmax, regression hidden layer activation = tanh, fingerprint length = 512, fingerprint depth = 5, hidden layer size = 50, optimizer = adam, batch size = 1, loss = mean squared error.

| Model | Learning rate | Notes |
|---|---|---|
| CNN-Ab-oct-representative | … e^(-epoch/30) | |
| CNN-De-aq-representative | … e^(-epoch/30) | Train on full 80% |
| CNN-Br-tm-representative | … e^(-epoch/30) | Train on full 80% |
| CNN-Ab-oct-all | … e^(-epoch/30) | Not CV; Train on full 100% |
| CNN-De-aq-all | … e^(-epoch/30) | Not CV; Train on full 100% |
| CNN-Br-tm-all | … e^(-epoch/30) | Not CV; Train on full 100% |
| CNN-Ab-oct-all-De-aq-representative | … e^(-epoch/30) | Train on full 80%; Initialized with CNN-Ab-oct-all |
| CNN-De-aq-all-Ab-oct-representative | … e^(-epoch/30) | Initialized with CNN-De-aq-all |
| CNN-Br-tm-all-Ab-oct-representative | … e^(-epoch/30) | Initialized with CNN-Br-tm-all |
| CNN-Ab-oct-De-aq-representative | … e^(-epoch/30) | Multitarget; Train on full 80% |
| Tox21-ST (a, b) | Adadelta default in Keras | 80% training, 20% internal validation, separate test dataset |
| Tox21-ECFP4 (a, b, e) | Adadelta default in Keras | 80% training, 20% internal validation, separate test dataset |
| Tox21-ECFP6 (a, b, e) | Adadelta default in Keras | 80% training, 20% internal validation, separate test dataset |

a Activation of final output node = sigmoid, optimizer = Adadelta, loss = binary crossentropy
b 100, 50
e Convolutional embedding replaced with fixed fingerprint representation

S3. Baseline SVM model performance

As an indication of how other QSAR/QSPR methods might perform on the same dataset, without pursuing a full optimization of those methods' hyperparameters, we use scikit-learn's SVM implementation to fit regressions for the Abraham, Delaney, and Bradley datasets. Default values of C (1.0) and epsilon (0.1) were used. For these tests, molecules are represented by the Morgan fingerprint of either radius 2 or radius 3 (as calculated by RDKit with features, mimicking ECFP4 and ECFP6 fingerprints), folded to length 512. This length matches the learned fingerprint length in our convolutional approach.
Three different kernel functions were used: linear, radial basis function (RBF), and the Tanimoto similarity score, defined for Boolean vector fingerprints $x$ and $y$ as the following:

$$T(x, y) = \frac{\sum_{i} (x_i \text{ and } y_i)}{\sum_{i} (x_i \text{ or } y_i)}$$
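As a small worked illustration of the Tanimoto score on Boolean fingerprints, the sketch below counts the bits set in both vectors and the bits set in either vector. The function names are hypothetical, and returning 0.0 when both fingerprints are all zeros is a convention assumed here, not something the text specifies.

```python
def tanimoto(x, y):
    """Tanimoto similarity of two equal-length Boolean fingerprints."""
    both = sum(1 for a, b in zip(x, y) if a and b)    # bits set in x AND y
    either = sum(1 for a, b in zip(x, y) if a or b)   # bits set in x OR y
    # Assumed convention: two all-zero fingerprints get similarity 0.0.
    return both / either if either else 0.0

def tanimoto_gram(X, Y):
    """Pairwise Tanimoto similarities between two lists of fingerprints."""
    return [[tanimoto(x, y) for y in Y] for x in X]

fps = [
    [1, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 0, 0, 1],
]
for row in tanimoto_gram(fps, fps):
    print(row)
```

A matrix of this kind is what an SVM implementation that accepts a precomputed kernel would consume in place of the raw fingerprints.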

Models were trained and tested using 5-fold CVs in triplicate, identical to the convolutional models. We acknowledge that this is by no means a definitive comparison to the full set of possible machine learning techniques, but it does provide a reference point for model performance using more traditional machine learning approaches. The results are shown below in Table S3. Performance is significantly worse than that of the convolutional models, although the Abraham SVM model using the ECFP4 fingerprint and Tanimoto kernel is better than the two worst-performing convolutional models.

Table S3. Performance summary for baseline model runs using Morgan fingerprints of radius 2 or 3, using different kernel functions, run in triplicate. Units are log10(mol/l) for Abraham and Delaney solubility models, and degrees Celsius for Bradley melting point models. The best results for each dataset are bolded.

Abraham = Abraham octanol solubility, log(M); Delaney = Delaney aqueous solubility, log(M); Bradley = Bradley melting point, deg C

| Dataset | Fingerprint | Kernel | MSE | MAE | SD |
|---|---|---|---|---|---|
| Abraham | Morgan (r=3) | Linear | ± | ± | ± |
| Abraham | Morgan (r=3) | RBF | ± | ± | ± |
| Abraham | Morgan (r=3) | Tanimoto | ± | ± | ± |
| Abraham | Morgan (r=2) | Linear | ± | ± | ± |
| Abraham | Morgan (r=2) | RBF | ± | ± | ± |
| Abraham | Morgan (r=2) | Tanimoto | ± | ± | ± |
| Delaney | Morgan (r=3) | Linear | ± | ± | ± |
| Delaney | Morgan (r=3) | RBF | ± | ± | ± |
| Delaney | Morgan (r=3) | Tanimoto | ± | ± | ± |
| Delaney | Morgan (r=2) | Linear | ± | ± | ± |
| Delaney | Morgan (r=2) | RBF | ± | ± | ± |
| Delaney | Morgan (r=2) | Tanimoto | ± | ± | ± |
| Bradley | Morgan (r=3) | Linear | ± | ± | ± |
| Bradley | Morgan (r=3) | RBF | ± | ± | ± |
| Bradley | Morgan (r=3) | Tanimoto | ± | ± | ± |
| Bradley | Morgan (r=2) | Linear | ± | ± | ± |
| Bradley | Morgan (r=2) | RBF | ± | ± | ± |
| Bradley | Morgan (r=2) | Tanimoto | ± | ± | ± |

S4. Code

Code and data used for model training/testing can be found at

S5. Hyperparameters

To examine the sensitivity to key hyperparameters (fingerprint depth during convolution, fingerprint length, and the number of hidden nodes in the regression layer), we assessed performance on the Abraham octanol solubility dataset using a fixed 80%/20% training/testing split while varying only one parameter at a time. The training schedule was arranged so that a small portion of the training data (~10%) was reserved for internal validation and early stopping. Results were noisy due to the limited number of samples in the Abraham dataset. Comparative boxplots are shown in Figures S4, S5, and S6 in terms of absolute error. Box bounds indicate the 25th and 75th percentiles, whiskers indicate the minimum and maximum, red lines indicate the median, and outliers are represented as red plus signs.

Deeper convolutions, longer fingerprints, and more hidden nodes all increase the overall flexibility of the model. This is beneficial as long as trends are being accurately captured and as long as regularization can prevent overfitting. With smaller datasets, the Abraham dataset in particular, excessive flexibility will cause overfitting and poor generalization. We find that a fingerprint depth of 5, a fingerprint length of 512, and a hidden layer of size 50 provide a reasonable amount of flexibility without sacrificing test set performance; hence these values were used for models corresponding to the settings in Table S1. It is important to note that these settings were obtained based on the results of a randomized 20% test set from the Abraham dataset and then applied to all other models for the purposes of discussion only; quantitative results are also shown using a full hyperparameter grid search as described in the main text.
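The one-parameter-at-a-time protocol of this section, and the five-number summary behind each boxplot, can be sketched as follows. The baseline values (depth 5, length 512, hidden size 50) come from the text; the candidate values, the dictionary keys, and the function names are hypothetical placeholders, since the values actually swept appear only in the figures.

```python
import statistics

# Baseline settings stated in the text.
BASELINE = {"depth": 5, "fp_length": 512, "hidden_size": 50}

# Hypothetical candidate values, for illustration only.
CANDIDATES = {
    "depth": [2, 3, 4, 5, 6],
    "fp_length": [64, 128, 256, 512, 1024],
    "hidden_size": [10, 25, 50, 100],
}

def one_at_a_time(baseline, candidates):
    """Yield (varied_name, config) pairs that change one setting from the baseline."""
    for name, values in candidates.items():
        for value in values:
            config = dict(baseline)   # copy so the baseline is left untouched
            config[name] = value
            yield name, config

def boxplot_summary(abs_errors):
    """Minimum, 25th percentile, median, 75th percentile, and maximum."""
    q1, median, q3 = statistics.quantiles(abs_errors, n=4, method="inclusive")
    return min(abs_errors), q1, median, q3, max(abs_errors)

configs = list(one_at_a_time(BASELINE, CANDIDATES))
print(len(configs), "configurations")
print(boxplot_summary([0.0, 1.0, 2.0, 3.0, 4.0]))
```

Each configuration would be trained on the fixed 80% split and its absolute test errors passed to the summary function; that training step is omitted here.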

Figure S4. Effect of changing the fingerprint depth, i.e. the number of iterations used in the convolutional process, on model performance using the Abraham octanol solubility dataset 245 as a model dataset.

Figure S5. Effect of changing the fingerprint length, i.e. the size of the embedded vector representation, on model performance using the Abraham octanol solubility dataset 245 as a model dataset.

Figure S6. Effect of changing the hidden layer size, i.e. the number of nodes in the hidden layer in the regression portion of the overall model architecture, on model performance using the Abraham octanol solubility dataset 245 as a model dataset.

Performance does, however, depend strongly on the learning schedule. As a simple comparison, consider a shallower learning model (a low, slow-decaying learning rate) which follows / and a steeper learning model (a high, fast-decaying learning rate) which follows /. Both use a patience of 10 epochs (i.e., if the validation loss does not improve for 10 epochs, stop training). The loss (MSE) during training is shown in Figure S7 and a comparison of residuals in Figure S8. Due to the very small size of the internal validation dataset, training is prone to early stopping, so superior performance is observed with the more aggressive learning schedule.

Figure S7. Comparison of mean squared error loss for the training and internal validation datasets during training for (left) a low, slow-decaying learning rate and (right) a high, fast-decaying learning rate.
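The patience rule described above, together with an exponentially decaying learning rate of the form suggested by the epoch/30 term in Table S1, can be sketched as follows. The initial rates and decay constants used in the comparison are not recoverable from the text, so `lr0` and `tau` below are hypothetical parameters, as are the function names and the synthetic validation losses.

```python
import math

def exp_decay_lr(epoch, lr0, tau):
    """Learning rate lr0 * exp(-epoch / tau); lr0 and tau are assumed values."""
    return lr0 * math.exp(-epoch / tau)

def early_stopping(val_losses, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve on its
    best value for `patience` consecutive epochs.
    """
    best_loss = math.inf
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch   # ran out of epochs without stopping

# Synthetic validation curve: improves for two epochs, then plateaus.
losses = [1.0, 0.9] + [0.95] * 20
print(early_stopping(losses, patience=10))
print(exp_decay_lr(30, lr0=0.01, tau=30))
```

With a small, noisy validation set, a plateau of this kind can trigger the rule long before the training loss has converged, which is the behavior the comparison in Figure S7 illustrates.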

Figure S8. Effect of changing the aggressiveness of the learning rate on model performance using the Abraham octanol solubility dataset 245 as a model dataset.
