SUPPORTING INFORMATION

Computation of Octanol-Water Partition Coefficients by Guiding an Additive Model with Knowledge

Tiejun Cheng, Yuan Zhao, Xun Li, Fu Lin, Yong Xu, Xinglong Zhang, Yan Li and Renxiao Wang*
State Key Laboratory of Bioorganic Chemistry, Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences, Shanghai, P. R. China

Luhua Lai
State Key Laboratory of Structural Chemistry of Stable and Unstable Species, College of Chemistry, Peking University, Beijing, P. R. China

Atom/group types and correction factors defined in XLOGP3

Atom/Group Types. A total of 83 basic atom types are implemented in XLOGP3 to classify carbon, nitrogen, oxygen, sulfur, phosphorus, and halogen atoms (Table S1). A given atom is classified by considering (i) its element type, (ii) its hybridization state, (iii) its accessibility to solvent, characterized by the number of hydrogen atoms attached to it, (iv) the nature of its directly neighboring atoms, (v) whether it is connected to a conjugated system with π electrons, and (vi) whether it is in a ring. Note that in both XLOGP2 and XLOGP3, atom types are defined as united atoms, which include hydrogen atoms implicitly. No additional atom types are therefore needed to classify hydrogen atoms. This classification scheme is augmented by four additional terminal groups, i.e. types 84-87 in Table S1. These groups are combinations of rather unusual types of atoms. It is technically more efficient to treat them as integrated pieces than to break them into individual atoms.

Table S1. Atom/Group Types and Correction Factors Defined in XLOGP3

| ID | Symbol | Relevant Compd. (a) | Total Occur. (b) | Contrib. (c) | Description (d) |
|----|--------|---------------------|------------------|--------------|-----------------|
|    | sp3 carbon | | | | |
| 1 | C.3.3h.lipo (e) | 642 | 719 | 0.7896 | lipophilic C*H3R |
| 2 | C.3.3h.X.pi | 1027 | 1111 | -0.0753 | C*H3X connected to a conjugated system |
| 3 | C.3.3h.X | 2090 | 3200 | 0.0402 | C*H3X |
| 4 | C.3.3h.pi | 1030 | 1320 | 0.5018 | C*H3R connected to a conjugated system |
| 5 | C.3.3h | 2168 | 4447 | 0.5240 | C*H3R |
| 6 | C.3.2h.lipo (e) | 559 | 1224 | 0.5201 | lipophilic C*H2R2 |
| 7 | C.3.2h.X.pi | 1898 | 2293 | -0.2441 | C*H2R(2-n)Xn (n>0) connected to a conjugated system |
| 8 | C.3.2h.X | 2592 | 4993 | -0.0821 | C*H2R(2-n)Xn (n>0) |
| 9 | C.3.2h.pi | 1005 | 1139 | 0.2718 | C*H2R2 connected to a conjugated system |
| 10 | C.3.2h | 1898 | 3994 | 0.3436 | C*H2R2 |
| 11 | C.3.h.X.pi | 1105 | 1603 | -0.3711 | C*HR(3-n)Xn (n>0) connected to a conjugated system |
| 12 | C.3.h.X | 1162 | 2077 | -0.1426 | C*HR(3-n)Xn (n>0) |
| 13 | C.3.h.pi | 249 | 291 | 0.0841 | C*HR3 connected to a conjugated system |
| 14 | C.3.h | 584 | 988 | 0.1485 | C*HR3 |
| 15 | C.3.X.pi | 596 | 634 | -0.5475 | C*R(4-n)Xn (n>0) connected to a conjugated system |
| 16 | C.3.X | 386 | 472 | -0.4447 | C*R(4-n)Xn (n>0) |
| 17 | C.3.pi | 182 | 185 | 0.0885 | C*R4 connected to a conjugated system |
| 18 | C.3 | 179 | 206 | 0.0596 | C*R4 |
|    | aromatic carbon | | | | |
| 19 | C.ar.h.X | 1555 | 2320 | -0.1039 | R C*(-H) X or X C*(-H) X |
| 20 | C.ar.h | 5962 | 26613 | 0.3157 | R C*(-H) R |
| 21 | C.ar.ar | 465 | 764 | 0.3158 | A C*( A) A |
| 22 | C.ar.(-X).X | 1321 | 1980 | -0.1003 | R C*(-X) X or X C*(-X) X |
| 23 | C.ar.(-X) | 5009 | 10899 | -0.0112 | R C*(-X) R |
| 24 | C.ar.X | 726 | 888 | -0.1874 | R C*(-R) X or X C*(-R) X |
| 25 | C.ar | 2702 | 3786 | 0.1911 | R C*(-R) R |
|    | sp2 carbon | | | | |
| 26 | C.2.2h | 133 | 150 | 0.5977 | H-C*(=A)-H |
| 27 | C.2.h.(=C).X | 587 | 830 | -0.0967 | H-C*(=C)-X |
| 28 | C.2.h.(=C).ring | 224 | 355 | 0.4004 | H-C*(=C)-R in a ring |
| 29 | C.2.h.(=C) | 339 | 409 | 0.3214 | H-C*(=C)-R |
| 30 | C.2.h.(=X) | 282 | 284 | -0.8756 | H-C*(=X)-A |
| 31 | C.2.(=C).X | 560 | 826 | -0.2069 | R-C*(=C)-X or X-C*(=C)-X |
| 32 | C.2.(=C).ring | 224 | 246 | -0.2084 | R-C*(=C)-R in a ring |
| 33 | C.2.(=C) | 30 | 33 | 0.4840 | R-C*(=C)-R |
| 34 | C.2.(=X).X | 4446 | 6552 | -0.8076 | R-C*(=X)-X or X-C*(=X)-X |
| 35 | C.2.(=X).ring | 413 | 539 | -0.5304 | R-C*(=X)-R in a ring |
| 36 | C.2.(=X) | 384 | 405 | -0.6093 | R-C*(=X)-R |
|    | sp carbon | | | | |
| 37 | C.1.== | 30 | 31 | -0.5879 | A=C*=A |
| 38 | C.1 | 40 | 88 | 0.1945 | A-C*#C*-A |
|    | sp3 nitrogen | | | | |
| 39 | N.am.2h | 750 | 806 | -0.6414 | A-N*H2 in an amide group |
| 40 | N.3.2h.pi | 645 | 833 | -0.3637 | A-N*H2 connected to a conjugated system |
| 41 | N.3.2h | 437 | 444 | -0.7445 | A-N*H2 |
| 42 | N.am.h | 2266 | 2873 | -0.3333 | A2-N*H in an amide group |
| 43 | N.3.h.pi | 425 | 474 | 0.2172 | A2-N*H connected to a conjugated system |
| 44 | N.3.h | 228 | 235 | -0.2610 | A2-N*H |
| 45 | N.am | 1106 | 1237 | -0.1551 | A3N* in an amide group |
| 46 | N.3.pi | 435 | 539 | 0.3776 | A3N* connected to a conjugated system |
| 47 | N.3 | 402 | 420 | 0.1799 | A3N* |
|    | aromatic nitrogen | | | | |
| 48 | N.ar.X2 | 106 | 139 | -0.2167 | X N* X in a 6-member ring |
| 49 | N.ar.X | 551 | 774 | -0.2974 | R N* X in a 6-member ring |
| 50 | N.ar | 1476 | 2205 | 0.0888 | R N* R in a 6-member ring |
| 51 | N.ar.h.X | 68 | 69 | 0.3675 | A N*(-H) X in a 5-member ring |
| 52 | N.ar.h | 253 | 256 | 0.2364 | R N*(-H) R in a 5-member ring |
| 53 | N.ar.X2.(2) | 94 | 94 | 1.1022 | X N* X in a 5-member ring |
| 54 | N.ar.X.(2) | 228 | 229 | 0.4854 | R N* X in a 5-member ring |
| 55 | N.ar.(2) | 392 | 395 | 0.3181 | R N* R in a 5-member ring |
|    | sp2 nitrogen | | | | |
| 56 | N.2.h | 49 | 54 | 0.6927 | H-N*=A |
| 57 | N.2.(=C).ring | 446 | 492 | 0.7974 | A-N*(=C) in a ring |
| 58 | N.2.(=C) | 329 | 335 | 0.9794 | A-N*(=C) |
| 59 | N.2.(=X) | 268 | 404 | 0.2698 | A-N*(=X) |
|    | sp3 oxygen | | | | |
| 60 | O.3.h.pi | 1258 | 1501 | -0.0381 | A-O*H connected to a conjugated system |
| 61 | O.3.h | 986 | 1531 | -0.4802 | A-O*H |
| 62 | O.ar | 290 | 296 | 0.5238 | Oxygen atom in an aromatic ring |
| 63 | O.3.pi | 2535 | 3566 | 0.2701 | A-O*-A connected to a conjugated system |
| 64 | O.3 | 596 | 847 | 0.0059 | A-O*-A |
|    | sp2 oxygen | | | | |
| 65 | O.2.(=C) | 4555 | 6809 | 0.7148 | -C=O* |
| 66 | O.2.(=X) | 1656 | 3415 | -0.5411 | -X=O* |
|    | sp3 sulfur | | | | |
| 67 | S.3.h | 14 | 14 | 0.4927 | A-S*H |
| 68 | S.ar | 187 | 188 | 1.1715 | Sulfur atom in an aromatic ring |
| 69 | S.3.X | 106 | 119 | 0.4125 | R-S*-X or X-S*-X |
| 70 | S.3 | 469 | 523 | 0.8300 | A-S*-A |
|    | sp2 sulfur | | | | |
| 71 | S.2.(=C) | 138 | 142 | 1.3544 | -C=S* |
| 72 | S.2.(=X) | 64 | 66 | 1.2218 | -X=S* |
|    | Sulfoxide | | | | |
| 73 | S.o | 39 | 39 | 0.0525 | A-S*(=O)-A |
|    | Sulfone | | | | |
| 74 | S.o2 | 546 | 620 | 0.5729 | A-S*(=O)2-A |
|    | Phosphorus | | | | |
| 75 | P.3 | 188 | 190 | -0.6694 | A3-P*(=A) |
|    | Fluorine | | | | |
| 76 | F.pi | 266 | 321 | 0.4401 | A-F* connected to a conjugated system |
| 77 | F | 421 | 1101 | 0.5360 | A-F* |
|    | Chlorine | | | | |
| 78 | Cl.pi | 1127 | 2096 | 0.9610 | A-Cl* connected to a conjugated system |
| 79 | Cl | 262 | 595 | 0.8036 | A-Cl* |
|    | Bromine | | | | |
| 80 | Br.pi | 254 | 308 | 1.0295 | A-Br* connected to a conjugated system |
| 81 | Br | 54 | 74 | 0.9664 | A-Br* |
|    | Iodine | | | | |
| 82 | I.pi | 101 | 128 | 0.7801 | A-I* connected to a conjugated system |
| 83 | I | 14 | 17 | 0.9071 | A-I* |
|    | Terminal groups | | | | |
| 84 | -C#N | 333 | 788 | 0.0337 | cyano group |
| 85 | -N=N=N | 23 | 48 | 0.5339 | diazo group |
| 86 | -NO2 | 821 | 904 | 1.2442 | nitro group |
| 87 | >[N+]-O- | 56 | 71 | -1.2147 | nitro oxide group |
|    | Correction factors | | | | |
| 88 | AA | 214 | 215 | -2.4431 | Amino acid in zwitterionic form |
| 89 | HB | 581 | 609 | 0.6123 | Internal hydrogen bond |

(a) Total number of compounds in the training set containing this atom type.
(b) Total occurrence of this atom type in the training set.
(c) Regression coefficient of this atom type.
(d) The following symbols are used here: (-) single bond, (=) double bond, (#) triple bond, ( ) aromatic bond, (R) group linked through a carbon atom, (X) group linked through a hetero-atom, (A) group linked through any atom. The asterisk indicates the relevant atom.
(e) This atom should be separated from any unsaturated carbon atom or hetero-atom by at least three single bonds.

Correction Factors. Two correction factors are defined in XLOGP3. The first correction factor accounts for internal hydrogen bonding, which makes a given molecule less hydrophilic than its chemical structure alone would suggest. In our study, an internal hydrogen bond is considered only if it meets the following requirements:

(1) The donor must be an sp3-hybridized oxygen atom or a nitrogen atom with at least one hydrogen atom, while the acceptor must be an sp2-hybridized oxygen atom or an sp3 oxygen atom in a hydroxyl group.
(2) The donor atom and the acceptor atom are separated by four consecutive covalent bonds. In other words, the formation of such a hydrogen bond should result in a six-membered ring in the given chemical structure (Figure S1).
(3) The hydrogen bond must be immobilized on either rings or conjugated unsaturated systems so that it has a rigid structure (Figure S1).

The last requirement needs further explanation. If a given molecule is conformationally flexible, some internal hydrogen bonds may form and dissociate dynamically. Detecting such hydrogen bonds requires extensive conformational sampling, which is impractical for a fast algorithm like XLOGP3 that relies solely on topological structures as input. In addition, such hydrogen bonds may be energetically less stable and consequently contribute less to logP. As a simplification, such hydrogen bonds are neglected in XLOGP3.

Figure S1. Examples of internal hydrogen bonds considered by XLOGP3

The second correction factor is applied to organic compounds containing an amino acid. Such a compound exists primarily in its zwitterionic form rather than its neutral form at neutral pH. Consequently, the octanol-water partition coefficient measured in such a scenario is, by definition, logD rather than logP, and the former is significantly lower than the latter (by about two units according to our results). A correction factor is thus necessary to compensate for this. In XLOGP3, a given molecule is identified as an amino acid if a carboxyl group and an sp3-hybridized aliphatic nitrogen atom are both present. The nitrogen atom, however, must not be connected directly to any electron-withdrawing group or conjugated system, so that it remains a strong Lewis base. This correction factor is of course a very crude treatment of ionizable compounds, and it is applicable only to amino acids. In the future, we may consider a more robust method for predicting pKa values in order to compute logP and logD values for a wider range of organic compounds.
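
As an illustration of how the regression coefficients and correction factors in Table S1 could enter an additive estimate, the following minimal Python sketch sums contribution times occurrence over a set of atom types. It is a sketch under stated assumptions, not the XLOGP3 implementation: the names `CONTRIBUTIONS` and `additive_logp` are hypothetical, only a small subset of Table S1 is copied in, the atom types for toluene and phenol are assigned by hand from the table descriptions rather than by an atom-typing algorithm, and any intercept or other term that the full model may include is ignored.

```python
# Subset of the contributions in Table S1 (symbol -> regression coefficient).
# The correction factors HB and AA are treated as countable terms as well.
CONTRIBUTIONS = {
    "C.3.3h.pi": 0.5018,   # CH3 connected to a conjugated system
    "C.ar.h": 0.3157,      # aromatic C-H
    "C.ar": 0.1911,        # aromatic C bearing a carbon substituent
    "C.ar.(-X)": -0.0112,  # aromatic C bearing a hetero-atom substituent
    "O.3.h.pi": -0.0381,   # hydroxyl O connected to a conjugated system
    "HB": 0.6123,          # internal hydrogen bond correction
    "AA": -2.4431,         # zwitterionic amino acid correction
}


def additive_logp(counts):
    """Sum contribution * occurrence over atom types and corrections.

    `counts` maps a Table S1 symbol to how often it occurs in the molecule.
    Assumption: a purely additive sum with no intercept term.
    """
    unknown = sorted(set(counts) - set(CONTRIBUTIONS))
    if unknown:
        raise KeyError(f"no contribution listed for: {unknown}")
    return sum(CONTRIBUTIONS[symbol] * n for symbol, n in counts.items())


# Hand-assigned atom types (an assumption, not XLOGP3 output).
toluene = {"C.3.3h.pi": 1, "C.ar.h": 5, "C.ar": 1}
phenol = {"C.ar.h": 5, "C.ar.(-X)": 1, "O.3.h.pi": 1}

if __name__ == "__main__":
    for name, counts in (("toluene", toluene), ("phenol", phenol)):
        print(f"{name}: {additive_logp(counts):.4f}")
```

Keeping the correction factors in the same lookup table as the atom types mirrors the layout of Table S1 (IDs 88 and 89), so a molecule with one qualifying internal hydrogen bond would simply carry `"HB": 1` in its counts.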