Prediction of Organic Reaction Outcomes Using Machine Learning

Supporting Information (SI)

Connor W. Coley, Regina Barzilay, Tommi S. Jaakkola, William H. Green, Klavs F. Jensen

Approach

Section S1.1. Data

By design, our training method assumes that the recorded product in a reaction example is the primary outcome. This is certainly true for examples with greater than 50% yield, which is the case for most examples. However, because we do not have reliable yield information for many of the reactions in the database, this assumption cannot be guaranteed for all examples. In the absence of full speciation information (e.g., yields for side products, byproducts, and the intended product), this is a necessary assumption to make use of experimental data.

Section S1.2. Forward Enumeration

When databases of reaction examples were first being developed (e.g., Reaxys, CASREACT, SPRESI), researchers began developing algorithms to automatically extract the essence of the transformation and generalize it as a reaction template. The software developed by Law et al., Route Designer (ref 17), works by identifying a reaction core containing all atoms with a change in connectivity and then heuristically extending this core to contain the chemically relevant environment; multiple generalized rules are then merged into a final reaction rule.

InfoChem's ICSynth uses a similar approach, but defines multiple templates with varying degrees of specificity for each reaction example (ref 18). The algorithm to generate forward synthesis templates is as follows:

1. Load the reaction SMILES string into RDKit as atom-mapped molecule objects representing the reactants and products.

2. Based on the atom map number, identify which atoms have changed between the reactants and products by examining their (a) SMARTS symbol, (b) atomic number, (c) total number of hydrogens, (d) formal charge, (e) degree, (f) number of radical electrons, and (g) bond-neighbor pairs including the atomic number and atom map number. Any changed atoms are part of the reaction core and must be included in the final reaction template.

3. Generate SMARTS labels for each atom in the reaction core. The default in this case is to use the SMILES fragment, as SMILES fragments can be used as SMARTS patterns (although they only match that exact fragment). Adjustments are made to the SMILES fragment to make the number of hydrogens and formal charge explicit. For example, a quaternary carbon with no hydrogens may have the SMILES representation [C:1], but the SMARTS [C:1] will match any aliphatic carbon with any number of hydrogens. We instead replace this symbol with [C;H0;+0:1] to ensure that a matched carbon atom has the correct number of hydrogen atoms (zero).

4. Examine each atom in the reactant molecule(s) that is immediately adjacent to the reaction core. If that atom is atom-mapped, then it must appear in the products (i.e., it is not a leaving group). These atoms are fully generalized to the SMARTS wildcard that matches any heavy atom: *. The atom map number is preserved, so, for example, the SMARTS fragment for [CH:2] would be generalized to [*:2]. Any atom-map numbers from the reactants added in this step must also be added for the products.

If the neighbor atom is not atom-mapped, then it is part of a leaving group. These atoms are generalized based on their atomic number and aromaticity. Aliphatic carbons are all converted to the corresponding SMARTS fragment C, aromatic carbons are all converted to the SMARTS fragment c, and heteroatoms are converted to the SMARTS fragment #AN, where AN is the atomic number of that heteroatom. In this manner, all leaving groups beginning with the same atom type, e.g., an aliphatic carbon, will appear identical.

5. Generate an overall SMARTS string for the reaction. This is done using RDKit's MolFragmentToSmiles function with atom symbols replaced with custom SMARTS fragments as described above. All bonds are made explicit; for example, the pattern CC would be converted to C-C so that it cannot match an alkene C=C.

6. Canonicalize the template. This step has not been fully refined. RDKit's MolFragmentToSmiles calculates a canonical atom ordering (because SMILES strings are not unique), but this is done without considering the custom atom symbols we define. Additionally, the original atom-mapping number of the reaction example will affect the atom ordering. In one example, the reaction core might correspond to atom maps 4, 5, and 6, while a similar reaction with an identical reaction core could have atom maps 32, 34, and 33. We copy atom-mapped templates to unmapped templates and determine the canonical ordering in the absence of atom map numbers, after which we re-assign atom map numbers from left to right starting at one.

The actual SMARTS fragments corresponding to the top ten forward templates (including the five templates in Figure 2) are listed below in Table S1. Reaction templates follow a Zeta distribution in terms of popularity, as shown in Figure S1, where the plot of frequency against rank is linear on a log-log plot. The implication is that relatively few templates cover a decent breadth of chemistry, but much of chemistry can only be described by very unusual reaction templates. To achieve 100% coverage, all >100,000 templates would be required.
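Steps 3 and 4 reduce to simple string construction once the relevant atom properties are known. The following is a minimal sketch in plain Python, not the authors' implementation: the function names `core_atom_smarts` and `neighbor_smarts` and their arguments are hypothetical, and in practice the atom properties would be read from RDKit atom objects rather than passed in by hand.

```python
from typing import Optional


def core_atom_smarts(symbol: str, num_hs: int, charge: int, map_num: int) -> str:
    """Step 3 (sketch): label a reaction-core atom with an explicit
    hydrogen count and formal charge so it cannot match other environments."""
    sign = "+" if charge >= 0 else "-"
    return f"[{symbol};H{num_hs};{sign}{abs(charge)}:{map_num}]"


def neighbor_smarts(atomic_num: int, is_aromatic: bool,
                    map_num: Optional[int] = None) -> str:
    """Step 4 (sketch): generalize an atom adjacent to the reaction core."""
    if map_num is not None:
        # Mapped neighbors survive into the product: wildcard, map number kept.
        return f"[*:{map_num}]"
    # Unmapped neighbors belong to a leaving group.
    if atomic_num == 6:
        return "[c]" if is_aromatic else "[C]"
    return f"[#{atomic_num}]"


# Quaternary aliphatic carbon with atom map number 1
print(core_atom_smarts("C", 0, 0, 1))   # [C;H0;+0:1]
# Mapped neighbor, e.g. the atom written as [CH:2] in the example above
print(neighbor_smarts(6, False, 2))     # [*:2]
# Unmapped oxygen in a leaving group
print(neighbor_smarts(8, False))        # [#8]
```

Because every leaving-group atom collapses to one of a few symbols, two reaction examples that differ only in the identity of the leaving group beyond its first atom produce the same template string.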

Table S1: Top 10 most popular reaction templates extracted from 1.1 million USPTO reactions. Count refers to the number of reaction examples which produced that template.

Rank  Count  SMARTS
1     4076   [*:1]-[O;H0;+0:2]-[C]>>[*:1]-[OH;+0:2]
2     2074   [#8]-[N+;H0:1](=[#8])-[*:2]>>[*:2]-[NH2;+0:1]
3     5458   [*:1]-[C;H0;+0:2](=[O;H0;+0:3])-[O;H0;+0:4]-[C]>>[*:1]-[C;H0;+0:2](=[O;H0;+0:4])-[OH;+0:3]
4     4436   [*:1]-[N;H0;+0:2](-[*:3])-[C]>>[*:1]-[NH;+0:2]-[*:3]
5     865    [*:1]-[NH;+0:2]-[C]>>[*:1]-[NH2;+0:2]
6     69     [#8]-[C;H0;+0:1](-[*:2])=[*:3].[*:4]-[NH2;+0:5]>>[*:2]-[C;H0;+0:1](=[*:3])-[NH;+0:5]-[*:4]
7     07     [#8]=[C;H0;+0:1](-[*:2])-[OH;+0:3].[*:4]-[NH2;+0:5]>>[*:2]-[C;H0;+0:1](=[O;H0;+0:3])-[NH;+0:5]-[*:4]
8     9734   [#8]=[C;H0;+0:1](-[*:2])-[OH;+0:3].[*:4]-[NH2;+0:5]>>[*:4]-[NH;+0:5]-[C;H0;+0:1](-[*:2])=[O;H0;+0:3]
9     93     [#7]-[c;H0;+0:1](:[*:2]):[*:3].[*:4]-[NH2;+0:5]>>[*:4]-[NH;+0:5]-[c;H0;+0:1](:[*:2]):[*:3]
10    847    [#7]-[c;H0;+0:1](:[*:2]):[*:3].[*:4]-[NH2;+0:5]>>[*:2]:[c;H0;+0:1](:[*:3])-[NH;+0:5]-[*:4]

Figure S1: Normalized reaction template popularity versus rank (log-log axes: frequency of occurrence [-] against rank [-]).

The primary factor in the decision to truncate to the most popular templates (with at least 50 reaction examples) was the computational cost. Application of templates is computationally expensive due to the underlying subgraph isomorphism problem, despite RDKit's use of Boost graph objects and the VF2 algorithm. As shown in Figure S1, the marginal benefit of including additional templates decreases rapidly with template rank. The large number of additional templates required for a small gain in coverage carries a significant computational cost to the overall workflow. A timing test for an expanded template set is shown in Figure S2. An alternate means of generating candidate products that does not involve subgraph isomorphism would greatly accelerate the model. The number of reactant atoms per reaction in the 15,000-member dataset is shown in Figure S3.
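The trade-off between template count and coverage can be made concrete with a short calculation. The sketch below assumes each reaction example has already been assigned a template identifier; the helper names (`popular_templates`, `coverage_by_rank`, `templates_needed`) and the toy data are illustrative only and are not taken from the original workflow.

```python
from collections import Counter
from itertools import accumulate


def popular_templates(template_ids, min_count=50):
    """Templates with at least `min_count` examples, most popular first."""
    counts = Counter(template_ids)
    return [t for t, c in counts.most_common() if c >= min_count]


def coverage_by_rank(template_ids):
    """Fraction of examples covered by the top-k templates, for k = 1, 2, ..."""
    counts = sorted(Counter(template_ids).values(), reverse=True)
    total = sum(counts)
    return [running / total for running in accumulate(counts)]


def templates_needed(template_ids, target):
    """Smallest number of top-ranked templates reaching `target` coverage."""
    for rank, covered in enumerate(coverage_by_rank(template_ids), start=1):
        if covered >= target:
            return rank
    return None


# Toy data: one common template and a tail of rare ones.
ids = ["a"] * 5 + ["b"] * 3 + ["c"] + ["d"]
print(coverage_by_rank(ids))          # cumulative coverage for ranks 1 to 4
print(templates_needed(ids, 0.75))    # 2
print(templates_needed(ids, 1.0))     # 4
```

With a heavy-tailed popularity distribution, the last few percent of coverage require most of the templates, which is the behavior described above for the full template set.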

Figure S2: Timing test showing computational cost of matching forward reaction templates to reactants (CPU time [s] against number of reactant atoms [-]). Results are shown for 1,240 reactant sets and 62,946 templates; the total CPU time was 82 hours.

Figure S3: Histogram of the number of heavy (non-hydrogen) atoms in each of the 15,000 reaction examples (count [-] against number of reactant atoms per reaction [-]).

Section S1.3. Candidate Ranking Model

In our featurization, we include rapidly calculable structural and electronic features that primarily reflect the local chemical environment but can also reflect the surrounding molecular context. They are described below and tabulated in Table S2.

The Crippen contribution to logP is an estimate of an atom's contribution to a molecular partition coefficient. The partition coefficient is defined as the equilibrium concentration ratio of a neutral compound in 1-octanol as compared to water.

The Crippen contribution to Molar Refractivity (MR) is an estimate of an atom's contribution to a molecular MR value. A molecule's MR depends on its size, density, and index of refraction, and tends to correlate with its polarizability.

The Total Polar Surface Area (TPSA) contribution is an estimate of an atom's contribution to a molecule's TPSA. The TPSA is calculated by estimating a 3D conformer using force field models, replacing atom centers with overlapping hard spheres with radii proportional to their van der Waals radii, and summing over the exposed areas corresponding to polar atoms (e.g., oxygen, nitrogen).

The Labute Approximate Surface Area contribution (ref 19) is an estimate of an atom's contribution to the overall surface area. It is estimated using the same hard-sphere approach as the TPSA, but does not consider atomic polarity. This atom-level feature provides an indication of steric hindrance, or the degree to which an atom is exposed and available to react.

The EState Index (ref 20) is an electronic/topological atom-level descriptor that combines information about that atom's intrinsic EState value (calculated from the number of pi and lone-pair electrons), contributions from its immediate neighbors, and contributions from non-bonded atoms elsewhere in the molecule.

The Gasteiger partial charge (refs 21, 22) is the estimated partial charge on each atom in a molecule. Intrinsic electronegativities are used to initialize values in a molecule, but atom-level values are iteratively updated to allow for the propagation of electron donating/withdrawing effects to neighbors, across conjugated bonds, or across aromatic rings.

The Gasteiger hydrogen partial charge is a similar estimate of the total charge on the hydrogens (explicit or implicit) bonded to each heavy atom.

The atomic number is encoded as a one-hot vector consisting of all zeros except at the index where the atomic number matches the value in a set of choices. The atomic number choices are 5, 6, 7, 8, 9, 15, 16, 17, 35, 53, and other.

The number of neighbors (not including hydrogens) is similarly one-hot encoded from the choices 0, 1, 2, 3, 4, 5.

The number of hydrogen atoms attached to each heavy atom is similarly one-hot encoded from the choices 0, 1, 2, 3, 4.

The formal charge of an atom is represented numerically.

Whether or not the atom is in a ring is represented numerically as 1 or 0.

Whether or not the atom is in an aromatic ring is represented numerically as 1 or 0.

The bond order is represented as a one-hot vector from the list of choices single, aromatic, double, or triple.

Table S2: Atom- and bond-level features used to describe edits that occur at the reaction core.

Atom index  Label
0           Crippen contribution to logP
1           Crippen contribution to Molar Refractivity
2           Total Polar Surface Area contribution
3           Labute Approximate Surface Area contribution
4           EState Index
5           Gasteiger partial charge
6           Gasteiger hydrogen partial charge
7-17        Atomic number (one-hot)
18-23       Number of neighbors (one-hot)
24-28       Number of hydrogens (one-hot)
29          Formal charge
30          Is in ring
31          Is aromatic

Bond index  Label
0-3         Bond order (one-hot)

As described in the main text, loss or gain of a hydrogen is represented by the features of that reactant atom alone, $a_i \in \mathbb{R}^{32}$. Loss or gain of a bond is represented by a concatenation of reactant atom features and product bond features, $[a_i, b_{ij}, a_j] \in \mathbb{R}^{68}$. As an example of this type of representation, consider a simpler set of attributes where atoms are represented by the $\mathbb{R}^{6}$ vector (is_c, is_n, is_o, is_cl, num_hs, num_neighbors) and bonds are represented by the $\mathbb{R}^{4}$ vector (is_single, is_aromatic, is_double, is_triple). We look to an example reaction shown in Figure S4.

Figure S4: Example reaction with which to demonstrate the edit-based representation. (Reaction scheme with atoms labeled 1, 2, and 3: an amine N-H, labeled 1, reacts at a carbon, labeled 2, that bears a chlorine, labeled 3.)

For the reaction in Figure S4, there are three edits:

1. Atom 1 loses a hydrogen
2. A bond between atoms 2 and 3 is lost
3. A bond between atoms 1 and 2 is gained
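Before working through the example, the discrete part of the full atom feature vector (indices 7 to 31 of Table S2) can be sketched in a few lines. This is an illustration in plain Python under stated assumptions, not the authors' code: the names `one_hot` and `discrete_atom_features` are hypothetical, the atomic number choices are read as B, C, N, O, F, P, S, Cl, Br, and I plus a trailing "other" slot, and the seven continuous descriptors (indices 0 to 6) are left out because they require a cheminformatics toolkit.

```python
ATOMIC_NUM_CHOICES = [5, 6, 7, 8, 9, 15, 16, 17, 35, 53]  # plus one "other" slot
BOND_ORDER_CHOICES = ["single", "aromatic", "double", "triple"]


def one_hot(value, choices, allow_other=False):
    """All zeros except a one at the index of `value` in `choices`.
    With allow_other=True, unmatched values set a final extra slot."""
    choices = list(choices)
    vec = [0.0] * (len(choices) + (1 if allow_other else 0))
    if value in choices:
        vec[choices.index(value)] = 1.0
    elif allow_other:
        vec[-1] = 1.0
    else:
        raise ValueError(f"{value!r} is not one of {choices}")
    return vec


def discrete_atom_features(atomic_num, num_neighbors, num_hs,
                           formal_charge, in_ring, is_aromatic):
    """Indices 7-31 of Table S2: 11 + 6 + 5 one-hot slots and 3 scalars."""
    return (one_hot(atomic_num, ATOMIC_NUM_CHOICES, allow_other=True)
            + one_hot(num_neighbors, range(6))
            + one_hot(num_hs, range(5))
            + [float(formal_charge), float(in_ring), float(is_aromatic)])


# A secondary amine nitrogen: two heavy neighbors, one hydrogen, acyclic.
features = discrete_atom_features(7, 2, 1, 0, False, False)
print(len(features))                               # 25
print(len(one_hot("single", BOND_ORDER_CHOICES)))  # 4
```

Prepending the seven continuous descriptors gives the 32 atom features, and concatenating two atom vectors around one bond vector gives 32 + 4 + 32 = 68 features for a bond edit.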

We then construct vectors $a_1$, $a_2$, and $a_3$ describing the features of each atom involved in the reaction,

$$a_1 = \begin{bmatrix} 0 & 1 & 0 & 0 & 1 & 2 \end{bmatrix} \tag{1}$$

$$a_2 = \begin{bmatrix} 1 & 0 & 0 & 0 & 0 & 3 \end{bmatrix} \tag{2}$$

$$a_3 = \begin{bmatrix} 0 & 0 & 0 & 1 & 0 & 1 \end{bmatrix} \tag{3}$$

and likewise for each bond,

$$b_{12} = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} \tag{4}$$

$$b_{23} = \begin{bmatrix} 1 & 0 & 0 & 0 \end{bmatrix} \tag{5}$$

The representations are then used to generate a score, $s$, for this candidate. Features are passed through the neural networks tailored to each edit type, summed in their intermediate vector representations, and passed through one final network. For simplicity, we refer to these functions as $f_{\text{H lost}}$, $f_{\text{H gain}}$, $f_{\text{Bond lost}}$, $f_{\text{Bond gain}}$, and $f_{\text{Overall}}$. Hence,

$$s = f_{\text{Overall}}\left( f_{\text{H lost}}(a_1) + f_{\text{Bond lost}}([a_2, b_{23}, a_3]) + f_{\text{Bond gain}}([a_1, b_{12}, a_2]) \right) \tag{6}$$

While this representation does not look very meaningful to a human, it fully specifies the candidate reaction: an sp2 carbon is aminated by a secondary amine upon loss of chlorine. Once the model generates the score, $s$, that score can be compared against all other candidates in a softmax layer to produce a probability:

$$p = \frac{\exp(s)}{\sum_{j=1}^{N_{\text{candidates}}} \exp(s_j)} \tag{7}$$

The machine learning models are trained to solve something akin to a classification problem: given hundreds of possible classes (candidate reactions), predict the true class (recorded reaction), which is known to be one of those candidates. Even though each reaction example will have a different set of classes, this framework allows us to use the categorical crossentropy as the loss function (objective) during training. This objective function is defined as the negative logarithm of the probability assigned to the true class (or true outcome):

$$\min L = \min \left( -\log(p_{\text{true}}) \right) \tag{8}$$

Weights were optimized using an Adadelta optimizer (ref 23) with a fixed learning rate of 0.0 for 200 epochs with minibatches of 20 reaction examples. No regularization was used in the reported models. Cross-validation performance in the edit-based model was insensitive to the number of training epochs (i.e., was not prone to overfitting), so an inner validation set was not used for early stopping. The training time for the hybrid model (85 epochs) was selected as a compromise between the baseline model's tendency to overfit and the edit-based model's need to train for longer.

Fundamentally, the machine learning model is a candidate ranking model. Ranking models can be trained using categorical crossentropy to maximize the probability assigned to the top rank-1 outcome, but this is not particularly discriminative. A minor change in probability can change the rank but leaves the categorical crossentropy nearly the same. A more effective approach for training ranking models can be the use of a max-margin rank loss. Candidates that are similar to the target are encouraged to have higher scores, while candidates that are dissimilar to the target are encouraged to have lower scores. This approach necessitates development of a distance function between candidates to assess similarity. It is not easy to imagine what such a function would look like: traditional molecular similarity (e.g., between products) does not measure the similarity between candidate reactions. For example, consider a large molecule with an aldehyde group. This group could be reduced to the alcohol or oxidized to the carboxylic acid. Those two candidate products are similar in structure, but very dissimilar in the reactivity that would produce them.
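The softmax probability of Equation (7) and the crossentropy objective of Equation (8) can be illustrated numerically. The following is a minimal sketch in plain Python with made-up candidate scores; the function names are illustrative, and subtracting the maximum score before exponentiating is a standard numerical-stability step that is not stated in the text.

```python
import math


def softmax(scores):
    """Equation (7): convert candidate scores into probabilities."""
    shift = max(scores)  # leaves the result unchanged, avoids overflow
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(scores, true_index):
    """Equation (8): negative log probability of the recorded outcome."""
    return -math.log(softmax(scores)[true_index])


# Two candidates; the recorded product is candidate 0 in both cases.
ranked_first = cross_entropy([2.00, 1.99], true_index=0)
ranked_second = cross_entropy([1.99, 2.00], true_index=0)
print(ranked_first, ranked_second)
```

In this toy case the recorded product drops from rank 1 to rank 2, yet the loss changes by only 0.01, which is the weak rank sensitivity of the crossentropy objective noted above.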

Figure S5: Baseline model architecture for scoring candidate reactions. Note that the candidate representation is fundamentally different from the edit-based representation; it is the Morgan fingerprint of the candidate product.

Table S3: Parameters required in the neural network for the edit-based model.

Layer name          Input dimension  Output dimension  # Parameters
Hs_lost_hidden      32               200               6600
Hs_lost_hidden2     200              100               20100
Hs_lost_hidden3     100              50                5050
Hs_gain_hidden      32               200               6600
Hs_gain_hidden2     200              100               20100
Hs_gain_hidden3     100              50                5050
Bonds_lost_hidden   68               200               13800
Bonds_lost_hidden2  200              100               20100
Bonds_lost_hidden3  100              50                5050
Bonds_gain_hidden   68               200               13800
Bonds_gain_hidden2  200              100               20100
Bonds_gain_hidden3  100              50                5050
Summed_hidden       50               50                2550
Output              50               1                 51
Total                                                  144001

Table S4: Parameters required in the neural network for the baseline model.

Layer name  Input dimension  Output dimension  # Parameters
Hidden      1024             50                51250
Output      50               1                 51
Total                                          51301
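The parameter counts in Tables S3 and S4 follow from the layer dimensions if every layer is fully connected with a bias term. That is an assumption here, although it reproduces entries such as 32 x 200 + 200 = 6600. A short check in plain Python, with illustrative variable names:

```python
def dense_params(n_in, n_out):
    """Weights plus biases of one fully connected layer."""
    return n_in * n_out + n_out


# Edit-based model: one three-layer branch per edit type, then shared layers.
atom_branch = [(32, 200), (200, 100), (100, 50)]   # H lost / H gained
bond_branch = [(68, 200), (200, 100), (100, 50)]   # bond lost / bond gained
shared = [(50, 50), (50, 1)]                        # summed hidden, output

edit_total = (2 * sum(dense_params(i, o) for i, o in atom_branch)
              + 2 * sum(dense_params(i, o) for i, o in bond_branch)
              + sum(dense_params(i, o) for i, o in shared))

# Baseline model: 1024-bit fingerprint -> 50 hidden units -> 1 score.
baseline_total = dense_params(1024, 50) + dense_params(50, 1)

print(edit_total)      # 144001
print(baseline_total)  # 51301
```

The edit-based model therefore has roughly three times as many parameters as the baseline, most of them in the four edit-specific branches.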

Results

Figure S6: Histogram showing the number of product candidates for each of 15,000 reactant sets where the true product was found using the most popular 1,689 forward templates (count [-] against number of candidates per reaction [-]).

We examine model performance as a function of the number of candidates in Figure S11. The mean accuracy for examples within each bin shows a very weak trend (if any) with the number of candidates. Because all models are successful at discriminating likely products regardless of the size of the candidate set, this suggests that the scope of the model should be extendable by increasing the template set without a significant loss in performance.
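The binned analysis can be sketched as follows. The bin width of 100 and the final open-ended bin at 1000 are taken from the Figure S11 caption and axis; the function name `binned_accuracy`, the record format, and the toy records are hypothetical.

```python
from collections import defaultdict


def binned_accuracy(records, width=100, cap=1000):
    """Mean top-1 accuracy per candidate-count bin.

    `records` holds (number_of_candidates, was_prediction_correct) pairs.
    Bins are keyed by their lower edge; counts >= `cap` share one bin."""
    hits = defaultdict(list)
    for n_candidates, correct in records:
        lower_edge = min((n_candidates // width) * width, cap)
        hits[lower_edge].append(1.0 if correct else 0.0)
    return {edge: sum(v) / len(v) for edge, v in sorted(hits.items())}


toy = [(50, True), (99, False), (150, True), (1000, False), (2500, True)]
print(binned_accuracy(toy))   # {0: 0.5, 100: 1.0, 1000: 0.5}
```

A flat profile across bins would indicate that accuracy does not degrade as more candidates are enumerated, which is the observation made above.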

Figure S7: (a) Loss (categorical crossentropy) and (b) accuracy during training for the edit-based model, plotted against epoch for the training and validation sets. Data from all five CV folds are superimposed.

Figure S8: (a) Loss (categorical crossentropy) and (b) accuracy during training for the baseline model, plotted against epoch for the training and validation sets. Data from all five CV folds are superimposed.

Figure S9: (a) Loss (categorical crossentropy) and (b) accuracy during training for the hybrid model, plotted against epoch for the training and validation sets. Data from all five CV folds are superimposed.

Figure S10: Histogram of model prediction confidence on the test sets to accompany Figure 5 (panel title: Hybrid model).

Figure S11: Model accuracy as a function of the number of candidate products for each example, binned in intervals of 100 (mean accuracy [-] against number of candidates, binned [-], for the baseline, edit-based, and hybrid models).

Section S2.1. Similarity Comparison

Figure S12: Averaged predicted (a) rank and (b) probability of the recorded product as a function of the averaged top 100 similarity scores among other products in the dataset, binned in intervals of 0.05. Shaded intervals correspond to ± one sample standard deviation calculated for each bin.

We probe model generalizability by examining model performance as a function of recorded product similarity to other products in the USPTO dataset for the three models. The Sørensen-Dice coefficient calculated using Morgan circular fingerprints of radius 3 is used as implemented in RDKit to compare product similarity. The domain of applicability is assessed by examining the predicted rank and predicted probability of each reaction as a function of the product's distance to the rest of the data set. The average of the top 100 similarity coefficients appearing in the full dataset is used as a crude measure of this distance. The reactions with unusual recorded products (i.e., low average top-100 similarity scores) are expected to have larger prediction errors. These absolute errors are binned, averaged, and displayed in Figure S12; the sample standard deviation for each bin is used to calculate the upper and lower bounds for the shaded interval.

All models exhibit a positive correlation between performance and candidate similarity within the dataset. As discussed earlier, the baseline model is making predictions solely based on what products tend to look like, while the edit-based model is not able to see what the candidate products look like and is forced to examine only the reaction center. The existence of a trend for the edit-based model indicates that similar products tend to be formed through similar transformations. Along the same lines, if there are many similar products in the training dataset, then the model will have more information about reactions leading to those types of products (and reactants with similar functional groups). In the hybrid model, explicit inclusion of product fingerprints does not substantially change the trends observed in Figure S12 for the edit-based model.

Section S2.2. Input Feature Analysis

As stated in the main text, the ablation experiments determine how significantly model performance suffers when a trained model is evaluated while masking certain input features. They do not determine how the model might have coped had that feature been absent during training. However, because models are trained using the Adadelta optimizer, which is similar to gradient descent with a dynamic learning rate, models will learn to depend on features that provide the most immediately useful information in a greedy fashion.
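The distance measure used in the similarity comparison can be sketched in plain Python. This is a simplified illustration: fingerprints are represented here as sets of feature identifiers rather than the RDKit Morgan fingerprints used in the study, and the names `dice` and `mean_top_k` are hypothetical.

```python
import heapq


def dice(fp_a, fp_b):
    """Sørensen-Dice coefficient of two fingerprints stored as sets.
    Returns 0.0 when both sets are empty (a convention chosen here)."""
    if not fp_a and not fp_b:
        return 0.0
    return 2.0 * len(fp_a & fp_b) / (len(fp_a) + len(fp_b))


def mean_top_k(similarities, k=100):
    """Average of the k largest similarity scores (all of them if fewer)."""
    top = heapq.nlargest(k, similarities)
    return sum(top) / len(top)


# Toy fingerprints: a query product and three other products.
query = {1, 2, 3, 4}
others = [{1, 2, 3, 5}, {1, 9}, {7, 8}]
sims = [dice(query, fp) for fp in others]
print(sims)                    # [0.75, 0.333..., 0.0]
print(mean_top_k(sims, k=2))   # mean of the two most similar products
```

A low top-k average marks a product with few close analogs elsewhere in the dataset, which is where larger prediction errors are expected.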