TetraBASE: A Side Chain-Independent Statistical Energy for Designing Realistically Packed Protein Backbones



This is an open access article published under an ACS AuthorChoice License, which permits copying and redistribution of the article or any adaptations for non-commercial purposes.

Cite This: pubs.acs.org/jcim

Huanyu Chu and Haiyan Liu*

School of Life Sciences, University of Science and Technology of China, Hefei, Anhui, China
Hefei National Laboratory for Physical Sciences at the Microscales, Hefei, Anhui, China
Collaborative Innovation Center of Chemistry for Life Sciences, Hefei, Anhui, China

ABSTRACT: Constructing backbone structures of high designability is a primary aspect of computational protein design. We report here a side chain-independent statistical energy that aims at realistic modeling of the through-space packing of polypeptide backbones. To mitigate the lack of explicit amino acid side chains, the model treats the packing between backbone sites as dependent on peptide local conformation. In addition, new variables suitable for statistical analysis, one for relative orientation and another for distance, have been introduced to represent the intersite geometry, based on the asymmetrical tetrahedral organization of distinct chemical groups surrounding the Cα atoms. The resulting tetrahedron-based backbone statistical energy (tetrabase) model has been used to optimize the tertiary organizations of secondary structure elements (SSEs) of designated types with Monte Carlo simulated annealing, starting from artificial initial configurations. The tetrabase minimum energy structures can reproduce SSE packing frequently observed in native proteins with atomic root-mean-square deviations of 1-2 Å. The model has also been tested by examining the stability of native SSE arrangements under tetrabase.
The results suggest that the tetrabase model can be used to effectively represent interbackbone packing when designing backbone structures without explicitly knowing side chain types.

INTRODUCTION

Recently, computational protein design has achieved remarkable successes in engineering de novo proteins with non-native backbones. To a large extent, these successes have relied on innovative ways to identify designable backbone structures that fulfill the respective design goals. So far, only a limited number of approaches have proven effective for defining designable backbones. They have mostly considered geometric constraints on backbone structures, such as sequence-independent rules about preferred lengths of secondary structure elements and the loops linking them [3,4], or parametric equations describing the relative geometries between helices or other structural units [5-8]. These constraints, usually in heuristic forms, were used to build blueprints, which were then used to guide the assembly of peptide fragments into complete backbones [12]. With this strategy, most secondary structure elements and local structural features in the resulting backbones are of ideal forms [3-6,9,10]. Apart from a few examples, it is still challenging to design de novo proteins that have the rich, nonideal structural features manifested by native proteins [11,13]. One type of approach to this end is to recombine substructures from different native protein structures [11]. Other than this, a general approach to designing peptide backbones of unrestricted diversity would be to consider direct structure modeling of varied trial sequences.
However, if complete side chains were modeled atomistically, the energy surface would be highly frustrated, making searches in the backbone conformational space computationally expensive. Because searching the sequence space is by itself already computationally very demanding [17], simultaneously searching both the sequence space and the backbone conformational space, with large backbone adjustments, may remain impractical even with the rapid growth of computer power, at least for larger proteins without internal symmetries.

An alternative route to de novo backbone design is to develop a general sequence-independent energy function that can be used to generate realistic backbone structures with the usual conformational sampling and optimization techniques. The main goal of such an energy function would be to make the decoupled search in the backbone conformational space and in the sequence space a better approximation than can be achieved with simple homopolypeptide or pseudoresidue models. Compared with a model that includes atomistic side chains, such a side chain-unspecialized energy function, like some of the coarse-grained side chain models [18], may also bring the benefits of a less-frustrated energy landscape, such as swift convergence of conformation optimization.

Received: November 26, 2017. Published: January 9.

Previous studies have examined whether sampling or optimization using side chain-independent general energy functions could yield meaningful free energy minima of polypeptide backbones. As an earlier example, Hoang et al. considered a simple physical model of inter-residue interactions including steric interaction, hydrogen bonding, and hydrophobicity [19]. Results of conformational sampling led them to propose that these interactions place geometrical and energetic constraints that presculpt the free energy landscape of protein structures, giving rise to a limited number of broad free energy minima to which the actual protein folds belong. Using generic simplified or all-atom models, Zhang et al. generated hydrogen-bonded, secondary structure-containing, compact structures of homopolypeptides [20]. They proposed that at certain resolutions there were one-to-one correspondences between conformations in the generated set and folds of known single-domain proteins. More recently, Cossio et al. explored the free energy landscape of the Val60 homopolypeptide with atomistic molecular dynamics simulations [21]. They reported that the sampled set of conformations contained observed protein folds of similar lengths, although the sampled and observed sets could not be considered equivalent. Kukic et al. also explored the conformational space of Val60 with molecular dynamics simulations, albeit with a coarse-grained model similar to that used by Hoang et al. [22]. They concluded that, just as with the fully atomistic homopolymer model, the free energy minima of this simplified CamTube model could recapitulate the diversity of protein folds observed in native protein structures.
Rather than aiming at producing realistic backbone structures amenable to sequence design, these previous studies have aimed at coarsely contouring the free energy landscape of polypeptides. Thus, they have focused on the global topologies or overall folds of the computationally constructed structures. Being similar to real protein backbones at TM-score [23,24] levels of only 0.4 to , most of these structures were probably not realistic enough to serve as target backbones for current sequence design algorithms.

Aiming at generating de novo scaffolds for protein design, MacDonald et al. developed an α-carbon potential energy function in which realistic modeling of backbone local conformations (namely, the backbone conformation of a few consecutive residues) was emphasized [25,26]. Decoy backbones sampled with this energy function were shown to reproduce realistic protein backbone local conformations. In loop modeling experiments, the model was shown to be competitive with methods that used known protein backbone fragments with similar sequences or used residue-specific φ,ψ maps to restrict the search space. Recently, this model was used to design de novo loops projecting from the scaffold core of synthetic beta-solenoid repeat proteins. In an experimentally solved structure containing a designed loop, the loop was found to closely match the designed structure [27]. These results suggest that a sequence-independent model could indeed be used to design atomistically realistic backbones. In the model of MacDonald et al., while the realistic modeling of peptide local conformations has been emphasized, the through-space packing between different parts of a polypeptide backbone was described only with a simple soft steric repulsive term together with a pseudo-hydrogen bond term. To maintain packing between secondary structure elements, extra heuristic restraints were applied.
In the current paper, we report a statistical energy for the realistic modeling of through-space packing of polypeptide backbones. Statistical energies are derived from known sequence and structural data of native proteins and their complexes. They are computable models of molecular interactions and have been widely used in protein structure modeling and design. Most commonly, statistical energies have been defined to depend on one or two geometric parameters, such as a single interatomic distance or one or two (torsional) angles [30,32,33,36]. Recently, for the purpose of designing sequences for given backbone structures, we have developed and experimentally verified a statistical energy function in which the effects of peptide local conformation, local structural environment, and inter-residue geometries on amino acid sequences are considered jointly [31,39]. In the current work, for the purpose of designing backbones without prespecified sequences, a statistical energy function characterizing the packing between two backbone sites has been defined to depend on peptide local conformations, to mitigate the lack of explicit sequence information. In addition, a new representation of the packing geometry between backbone sites is introduced, so that the resulting packing energy reflects not only distance but also orientation dependencies. This is achieved by using the asymmetrical tetrahedral configuration surrounding the α-carbon atoms to define reduced geometrical variables for statistical analysis. We refer to this model as tetrabase (standing for tetrahedron-backbone-arrangement statistical energy) and test it with two types of calculation. In the first type, tertiary organizations of SSEs of designated types and lengths are optimized with simulated annealing, starting from artificial initial configurations. The resulting minimum energy structures are compared with existing backbone structures in the Protein Data Bank (PDB) using structure alignment.
In the second type, the spatial arrangements of secondary structure elements (SSEs) in native protein structures are examined for stability under tetrabase. The results suggest that the tetrabase minima correspond to atomistically realistic backbone structures.

RESULTS

tetrabase Statistical Energy Describes Local Conformation, Orientation, and Distance-Dependent Packing. Each tetrabase term is intended to describe the through-space packing between two backbone sites that are separated from each other in sequence. In the current work, we focus on the packing between secondary structure elements (SSEs) in a backbone architecture that contains a set of $M$ SSEs without any linking loops. Here, an SSE refers to an α-helix or a β-strand, with different strands in a single β-sheet considered as separate SSEs. For SSE $m$, its secondary structure type is denoted $SS_m$ and its length $l_m$. The system's SSE composition is defined by the number of SSEs ($M$), the secondary structure types of the SSEs ($\{SS_m,\ m = 1, 2, \ldots, M\}$), and the lengths of the SSEs ($\{l_m,\ m = 1, 2, \ldots, M\}$). Each SSE is represented by all the backbone sites it contains. A backbone site $i$ is represented by the main chain non-hydrogen atoms in residue $i$, namely, the N, Cα, C, and O atoms. We collectively denote the coordinates of these atoms as $r_i$. The configuration of the system, $r^N$, is completely specified as

$$
r^N = \Big\{ \underbrace{\{r_1, r_2, \ldots, r_{l_1}\}}_{\text{element } 1},\ \underbrace{\{r_{l_1+1}, r_{l_1+2}, \ldots, r_{l_1+l_2}\}}_{\text{element } 2},\ \ldots,\ \underbrace{\{r_{l_1+\cdots+l_{m-1}+1}, \ldots, r_{l_1+\cdots+l_{m-1}+l_m}\}}_{\text{element } m},\ \ldots,\ \underbrace{\{r_{l_1+\cdots+l_{M-1}+1}, \ldots, r_{l_1+\cdots+l_{M-1}+l_M}\}}_{\text{last element}} \Big\}
$$

In the current work, each SSE has been treated as a rigid body whose internal structure has been (randomly) taken from a native protein structure. This allowed us to consider only inter-SSE packing as the target for optimization. For this purpose, the total tetrabase energy has been defined as the sum of pairwise inter-SSE contributions, which are in turn sums of pairwise terms between backbone sites, namely,

$$
E(r^N) = \sum_{m=1}^{M-1} \sum_{n=m+1}^{M} \sum_{i \in \mathrm{SSE}_m} \sum_{j \in \mathrm{SSE}_n} e^{SS_m SS_n}(r_i, r_j) \qquad (1)
$$

As explained, eq 1 does not contain terms that describe intra-SSE interactions.

Figure 1. Examples of the orientation categories and the corresponding distance-dependent energy curves. Backbone position pairs are contained in the following types of interacting secondary structure elements: (a) two antiparallel strands, (b) a helix and an antiparallel strand, and (c) two antiparallel helices. The positions are identified with integer numbers. The energy is in units of $\log_e e$. In (a), the tetrahedrons surrounding the central Cα positions of one strand and the interstrand hydrogen bonds are indicated. In this and in all other figures in this paper, graphs of molecular structures have been prepared using the PyMol program [40].

The individual terms $e^{SS_m SS_n}(r_i, r_j)$ in eq 1 have been defined to depend not only on the atomic coordinates of backbone sites $i$ and $j$ but also on their respective local conformation contexts (here, the SS type combination $SS_m$ and $SS_n$). Given the SS type combination, $e^{SS_m SS_n}(r_i, r_j)$ is further simplified to depend first on a discrete category of the relative orientation and then on a distance variable.
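The double sum over SSE pairs and site pairs in eq 1 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names (`total_energy`, `toy_pair_energy`, `make_site`), the dictionary layout of a backbone site, and the 6 Å contact cutoff are all assumptions made for the example, and the toy pair term is a placeholder standing in for the actual tetrabase term $e^{SS_m SS_n}(r_i, r_j)$.

```python
import math
from itertools import combinations


def total_energy(sses, ss_types, pair_energy):
    """Eq 1: sum a pair term over all site pairs lying in different SSEs.

    sses        -- list of SSEs, each a list of backbone sites
    ss_types    -- list of SS type labels ("H" or "E"), one per SSE
    pair_energy -- callable(ss_m, ss_n, site_i, site_j) -> float
    """
    total = 0.0
    # combinations() yields every SSE pair with m < n, so intra-SSE
    # site pairs are never visited, as stated for eq 1.
    for m, n in combinations(range(len(sses)), 2):
        for site_i in sses[m]:
            for site_j in sses[n]:
                total += pair_energy(ss_types[m], ss_types[n], site_i, site_j)
    return total


def toy_pair_energy(ss_m, ss_n, site_i, site_j, cutoff=6.0):
    """Placeholder pair term (NOT the tetrabase term): -1 per CA-CA contact."""
    return -1.0 if math.dist(site_i["CA"], site_j["CA"]) < cutoff else 0.0


def make_site(x, y, z):
    """Hypothetical site record; a real site would also hold N, C, and O."""
    return {"CA": (x, y, z)}


helix = [make_site(0.0, 0.0, 0.0), make_site(1.5, 0.0, 0.0)]
strand = [make_site(0.0, 5.0, 0.0), make_site(20.0, 0.0, 0.0)]
print(total_energy([helix, strand], ["H", "E"], toy_pair_energy))
```

Passing the pair term in as a callable keeps the summation structure of eq 1 separate from the definition of the individual terms, which depend on the SS type combination, the orientation category, and the distance variable.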
The definition of the orientation category has been inspired by the asymmetric tetrahedral arrangement of atoms or functional groups of different chemical nature surrounding the Cα atoms. More specifically, the category is defined by the closest pair of vertex atoms combined with the furthest pair of vertex atoms between the two tetrahedrons. Given the orientation category, the distance variable is chosen to be the distance between the closest pair of vertex atoms ($r_{\min}$). The distributions of the orientation category and of the distance variable have been estimated from backbone site pairs in a set of training proteins. Finally, $e^{SS_m SS_n}(r_i, r_j)$ for any relative geometry of two backbone sites is derived from these probabilities and distributions. More details of this process are given in Methods.

Figure 1 shows the orientation categories of some backbone position pairs in contacting SSEs of varied types and relative arrangements. Clearly, the orientation category for two backbone sites contains information about the overall relative arrangement between the SSEs that contain the two sites. In addition, it also contains information about the location of one site in one SSE relative to the other SSE. Because of this, different SS type combinations as well as different orientation categories lead to different dependences of the statistical energy on the distance variable. In Figure 1, some of the distance-dependent energy curves are shown. As expected, these energy curves vary considerably with the SS type combination and the orientation category. Thus, correlations between local structure types (SS types), relative orientations, and distances have been considered in the tetrabase energy form, making the total energy sensitive to variations in both the overall spatial arrangements and the detailed packing between SSEs. By definition, the statistical energy is associated with negative logarithms of the respective distributions in native structures; thus, inter-SSE packing modes preferred in native protein structures should be associated with lower energies.

There is no special treatment of hydrogen bonds in the tetrabase packing term. The quality of hydrogen bond geometries in the tetrabase energy-minimized backbone structures can therefore serve as an indicator of the ability of the energy to recapitulate backbone packing in atomic detail.

Figure 2. Distributions of hydrogen bond geometries in the tetrabase energy-minimized β-sheets and in native backbones. In total, 896 hydrogen bonds in the tetrabase energy-minimized structures for the SSE compositions $H^{16}(E^{7})_3$ and $(H^{16})_2(E^{7})_4$ (red circles) and 8960 hydrogen bonds randomly extracted from native proteins (black diamonds) are shown.

In Figure 2,
inter-β strand hydrogen bond geometries in tetrabase energy-minimized β-sheets (obtained with Monte Carlo simulated annealing started from artificially defined structures, see below) are compared with those in native backbone structures.

Table 1. Compositions and Directions of Secondary Structure Elements (SSEs) of Different Architectures
a Letters H and E indicate secondary structure types (H for helix and E for strand). Superscripts indicate lengths of SSEs. Subscripts indicate numbers of SSEs of given types and lengths.
b For each architecture, types and approximate directions of the SSEs are graphically represented. Each triangle represents a β-strand. Each circle represents an α-helix. Upward triangles or dots in circles indicate the outward direction. Downward triangles or crosses in circles indicate the inward direction.

Minima of tetrabase Energy Reproduce Native Backbone Packing at Atomic Resolutions. For several test systems, minima on their tetrabase energy surfaces defined in eq 1 have been searched with a Monte Carlo (MC) simulated annealing protocol as described in Methods. The SSE composition of each test system, namely, the number of SSEs and the SS type and length of each SSE, has been predefined and is listed in Table 1. For a given SSE composition, different coarse or general architectures can be defined based on the approximate directions of the SSEs relative to each other. For example, for the $(H^{16})_3$ composition (see footnotes of Table 1 for the meaning of the notation), there can be two general architectures, one with two helices running in approximately the same direction and the remaining helix running in approximately the opposite direction, and the other with all three helices running in approximately the same direction. Similarly, six different architectures can be considered for the SSE composition $H^{16}(E^{7})_3$.
There are far more possible architectures for the composition $(H^{16})_2(E^{7})_4$, and only three of them have been considered here as examples. For convenience, each architecture is given a name in Table 1, and the corresponding arrangements of SSEs are represented graphically in the same table. These SSE compositions and their associated architectures have been selected to cover different combinations of SS types as well as different relative orientations between SSEs. They comprise a reasonable set for testing the ability of tetrabase to describe different types of inter-SSE packing.

For each architecture in Table 1, 50 independent MC simulated annealing runs have been carried out, all started from artificially constructed structures (see Methods). The lowest energy configuration found in each simulation was extracted. The configuration of the lowest energy among all 50 simulations has been taken as the configuration representing a minimum. To examine the convergence of the simulated annealing simulations, the lowest energy configurations found in the individual MC runs are compared with the corresponding representative configurations. Figure 3 shows that for most architectures, either the resulting root-mean-square deviations (RMSDs) of Cα atom positions are small or the associated energies are much higher than the overall lowest energy. This indicates acceptable convergence of the MC simulated annealing protocol. One exception is the h3_1 architecture (Table 1), for which two structurally dissimilar representative configurations (mutual RMSD > 2.5 Å) have been found with similarly low energies (Figure 3). For this architecture, both configurations have been considered for further analysis.

To examine whether the representative configurations found above correspond to designable backbone structures, they were separately used as queries to search the PDB for similar structures.
This has been carried out with the program Phyrestorm [41], which can rapidly and comprehensively compare a given protein structure to the entire PDB through structural alignments of backbones. The algorithm of Phyrestorm requires a specific ordering of the SSEs in the primary sequence. This ordering is not relevant in our minimized architectures. Thus, before a search, the SSEs are arbitrarily reordered to generate a query backbone. After searching with the query, nonredundant top hits with TM-scores above 0.6 were retained. If the accumulated number of retained hits was less than five, a new permutation of the SSE order was considered for another search. Otherwise, the search was stopped, and the remaining SSE orders were not considered further. During this process, hits returned by Phyrestorm have been manually filtered to eliminate redundancy and to exclude PDB entries that correspond to NMR or cryo-EM-determined models.

In Figures 4-6, we list the PDB IDs and chain IDs of the top five hits for each representative configuration, together with the TM-scores, the number of aligned residues, and the RMSD of aligned Cα positions. Structures of the representative configurations aligned with the corresponding best hits are also shown. For all the representative configurations, five or more nonredundant PDB entries can be found that match the respective queries with TM-scores above the retention cutoff. The TM-scores associated with the best hits are often above 0.7. These results suggest that tetrabase minima can recapitulate preferred inter-SSE packing with atomic accuracy. Because of the small sizes of the tested SSE architectures, the tetrabase minimum configurations are usually aligned to only a small part of a known protein structure. On the other hand, the alignments are able to cover % of backbone sites in the tetrabase minimum configurations.

Figure 3. Energy variations of the lowest energy configurations in individual MC runs. The tetrabase energy values are relative to the lowest energy configuration from all MC runs of an architecture. The red plus signs correspond to configurations similar to the overall lowest energy one (RMSD < 2.5 Å). The black x signs correspond to the remaining configurations.
In addition, the overall structures of the top hits for a given minimum configuration vary greatly. Such recurrent presence of the tetrabase minimum configurations in proteins of different overall folds supports the idea that more favorable (lower) tetrabase packing energies may be associated with higher designability.

Figure 4. Comparisons between the tetrabase energy-minimized backbone configurations and backbone structures of native proteins. Panels (a) to (c) correspond to different representative configurations obtained for the SSE composition $(H^{16})_3$. For each representative configuration, the left part shows the architecture name and the PDB IDs and chain IDs of the top five hits obtained by using Phyrestorm, together with the TM-score, the number of aligned residues, and the RMSD of aligned Cα positions for each hit. The middle part shows the structure of the tetrabase energy-minimized configuration aligned with the best hit (gray) in ribbon form. The right part shows a more detailed stereoview of the aligned backbone structures, with the tetrabase energy-minimized configuration in orange and the native backbone in light purple.
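The hit-collection procedure described above (try SSE orderings one at a time, retain hits with TM-scores above 0.6, stop once five hits have accumulated) can be summarized as a short loop. The sketch below is an illustration under stated assumptions, not the authors' script: `search` is a hypothetical callable standing in for one Phyrestorm query followed by the manual redundancy and NMR/cryo-EM filtering, and `collect_hits` and the hit labels are invented for the example. Only the 0.6 cutoff and the five-hit stopping rule come from the text.

```python
from itertools import permutations


def collect_hits(n_sses, search, tm_cutoff=0.6, min_hits=5):
    """Accumulate retained hits over SSE orderings until min_hits is reached.

    n_sses -- number of SSEs in the minimized architecture
    search -- hypothetical callable(order) -> iterable of (hit_id, tm_score),
              already filtered for redundancy and experimental method
    """
    retained = {}
    for order in permutations(range(n_sses)):
        for hit_id, tm_score in search(order):
            if tm_score > tm_cutoff:
                # Keep the best score if a hit recurs under another ordering.
                retained[hit_id] = max(tm_score, retained.get(hit_id, 0.0))
        if len(retained) >= min_hits:
            break  # enough hits: remaining SSE orders are not considered
    # Highest TM-score first.
    return sorted(retained.items(), key=lambda item: item[1], reverse=True)


def fake_search(order):
    """Toy stand-in for a structure search; the hit labels are made up."""
    table = {
        (0, 1, 2): [("hitA", 0.72), ("hitB", 0.55)],
        (0, 2, 1): [("hitC", 0.65), ("hitA", 0.61)],
    }
    return table.get(order, [])


print(collect_hits(3, fake_search, min_hits=2))
```

With the toy table, the first ordering yields only one hit above the cutoff, so a second ordering is tried; the loop then stops and the remaining four orderings are never searched.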

Journal of Chemical Information and Modeling

Figure 5. Same as Figure 4, but for the SSE composition $H^{16}(E^{7})_3$. Panels (a) to (f) correspond to different representative configurations obtained for the SSE composition $H^{16}(E^{7})_3$.

Figure 6. Same as Figure 4, but for the SSE composition $(H^{16})_2(E^{7})_4$. Panels (a) to (c) correspond to different representative configurations obtained for the SSE composition $(H^{16})_2(E^{7})_4$.

The amino acid sequences of the structurally aligned parts of the respective natural proteins are shown in Figure 7. They exhibit significant variations despite the highly similar backbone arrangements. This observation supports the main hypothesis underlying our approach, namely, that polypeptide backbone conformational minima can be reconstructed at 1-2 Å resolution with models in which sequence specialization is not explicitly considered. This result is consistent with previous analyses of the non-one-to-one correspondence between protein structures and sequences, such as the analysis carried out by Rackovsky [42].

Figure 7. Amino acid sequences of the structurally aligned parts of natural proteins. These proteins have been found as top hits in searches with the respective tetrabase minimum configurations as queries. The architecture names, the number of residues contained in each architecture, and the PDB IDs of the natural proteins are given. For the architecture h3_1, two representative configurations have been obtained from MC minimization.

Stability of Native SSE Arrangements Can Be Maintained under tetrabase. The native arrangements of SSEs in several natural proteins have been examined for their stability under the tetrabase inter-SSE packing interaction. Again, the native structures have been chosen to cover different SSE compositions and relative arrangements. MC simulated annealing runs were carried out using initial SSE arrangements extracted from the corresponding native structures. In these initial arrangements, loops and side chains from the original PDB structures were simply removed, leaving only backbone segments that correspond to regular SSEs. From an initial structure, 5.5 cycles of MC simulated annealing were executed. In these cycles, the effective temperature for MC was changed periodically between an upper and a lower bound. The total tetrabase energies and the RMSDs from the starting native configurations were monitored. The monitored values are shown in Figure 8. In addition, the lowest energy configurations encountered in the MC simulated annealing runs are compared with the native configurations in Figure 9.
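The annealing scheme mentioned above (Metropolis MC with an effective temperature cycled periodically between an upper and a lower bound for 5.5 cycles) can be illustrated with a toy example. Everything specific here is an assumption made for illustration: the cosine waveform, starting at the upper bound so that 5.5 cycles end at the lower bound, the temperature bounds, and the one-dimensional toy energy and move. The actual schedule and the rigid-body SSE moves are those defined in Methods, and the names `temperature` and `anneal` are hypothetical.

```python
import math
import random


def temperature(step, n_steps, t_low, t_high, n_cycles=5.5):
    """Assumed cosine schedule: starts at t_high, ends at t_low after 5.5 cycles."""
    phase = 2.0 * math.pi * n_cycles * step / n_steps
    return t_low + 0.5 * (t_high - t_low) * (1.0 + math.cos(phase))


def anneal(x0, energy, propose, n_steps, t_low, t_high, rng):
    """Metropolis MC under a cycled temperature; returns the lowest-energy state seen."""
    x, e = x0, energy(x0)
    best_x, best_e = x, e
    for step in range(n_steps):
        t = temperature(step, n_steps, t_low, t_high)
        x_new = propose(x, rng)
        e_new = energy(x_new)
        # Metropolis criterion: always accept downhill moves, accept uphill
        # moves with probability exp(-dE / T).
        if e_new <= e or rng.random() < math.exp(-(e_new - e) / t):
            x, e = x_new, e_new
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


def toy_energy(x):
    """Toy 1-D energy standing in for the tetrabase energy of eq 1."""
    return (x - 3.0) ** 2


def toy_move(x, rng):
    """Toy move standing in for a rigid-body SSE translation or rotation."""
    return x + rng.uniform(-0.5, 0.5)


best_x, best_e = anneal(0.0, toy_energy, toy_move, n_steps=2000,
                        t_low=0.05, t_high=2.0, rng=random.Random(0))
print(best_x, best_e)
```

Tracking the lowest-energy state separately from the current state mirrors the analysis in the text, where the lowest energy configurations encountered during the runs are the ones compared with the native arrangements.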


Figure 8. Total tetrabase energies and the RMSDs during MC simulated annealing simulations. The simulations have been started from native SSE arrangements. The energy values are relative to the respective lowest energies encountered in the simulations. The RMSDs are from the initial configurations. The native SSE arrangements considered include the following: (a) the α helices extracted from PDB 1A36, (b) the α helices extracted from PDB 1VCT, (c) the β strands extracted from PDB 1B33, (d) the β strands extracted from PDB 1TRE, (e) the α helices and β strands extracted from PDB 1EW4, (f) the α helices and β strands extracted from PDB 1OBB, (g) the α helices and β strands extracted from PDB 1CY5, and (h) the α helices and β strands extracted from PDB 1J24. For panels (a) to (f), the y-axis is labeled on the left side of the panel. For 1CY5 and 1J24, results of both unrestrained (black) and RMSD-restrained (red) MC simulated annealing runs are shown. For these two systems, the RMSDs shown in panels (g) and (h) are labeled on the right-hand y-axis.

As shown in Figure 8a-f, for six of the eight examined natural proteins, the native SSE arrangements have been stably maintained during MC simulated annealing; the RMSD values of the low energy configurations ranged from 1 to 2 Å. Even though the configurations could drift further away from the native configurations during the high temperature excitation phases of the annealing cycles, they returned closer to the native configuration in subsequent low temperature relaxation phases. This suggests that stable minima on the tetrabase energy surface exist close to the native configurations. To compare these minima with other possible minima of the same SSE compositions, MC simulated annealing runs started from artificially constructed initial structures (see Methods) were carried out on systems with the same SSE compositions as the examined natural proteins 1A36, 1VCT, 1B33, 1TRE, 1EW4, and 1OBB.
Except for the 1EW4 composition, the lowest energy configurations in 10 independent MC runs of a composition included one or more configurations that are closely similar (RMSD < 2.5 Å) to the configuration optimized from the respective native SSE arrangement. In addition, the lowest energies obtained from the simulations started from the native arrangements fell within the ranges covered by the lowest energies obtained from the simulations started from artificially constructed structures (Figure 10).

For the two examined natural proteins 1CY5 and 1J24, low energy configurations visited during the MC simulated annealing were further away from the corresponding native configurations. For 1CY5, the RMSD was around 4 Å (Figure 8g). For 1J24, the RMSD was around 2.5 Å (Figure 8h). Despite this, closer inspection of the lowest energy configurations suggested that these configurations may still represent well-packed SSEs. In fact, in these relaxed configurations, SSEs, especially helices, tend to be packed more tightly with each other than in the respective native structures (Figure 9), which may explain their lower tetrabase energies. Restrained MC simulated annealing runs on these two native SSE arrangements, in which RMSDs from the initial native configurations were restrained to be within 2 Å, were carried out. As expected, the restraints led to higher energies (Figure 8g and h). The energy increases, however, are comparable to the energy variations shown in Figure 10, once the larger number of residues in the SSEs of 1CY5 and 1J24 compared with the proteins shown in Figure 10 is taken into consideration.

Figure 9. Lowest energy configurations encountered in MC compared with the respective native SSE arrangements. The native configurations are shown in yellow, and the lowest energy configurations encountered in the MC simulated annealing runs are shown in green-cyan. All configurations are shown in cross-eyed stereo. The corresponding native proteins are (a) PDB 1A36, with an SSE composition of $H^{32}H^{33}$; (b) PDB 1VCT, with an SSE composition of $H^{25}H^{28}H^{33}$; (c) PDB 1B33, with an SSE composition of $(E^{7})_3$; (d) PDB 1TRE, with an SSE composition of $(E^{5})_2E^{4}$; (e) PDB 1EW4, with an SSE composition of $H^{21}(E^{5})_2E^{7}E^{6}E^{8}E^{4}H^{13}$; (f) PDB 1OBB, with an SSE composition of $H^{13}H^{18}(E^{5})_2$; (g) PDB 1CY5, with an SSE composition of $H^{17}H^{11}H^{9}H^{13}H^{14}H^{7}$; and (h) PDB 1J24, with an SSE composition of $H^{9}H^{12}H^{14}H^{18}(E^{5})_2E^{2}(E^{7})_2E^{4}$.

Results in Figures 8-10 suggest that the SSE arrangements in native protein backbones are likely to be stable under tetrabase, being atomistically similar to minima on the tetrabase energy surface. On the other hand, given the statistical nature of tetrabase, it is understandable that SSE packing in some natural proteins may deviate more from the tetrabase minima. Even after taking this latter point into consideration, a reasonably low tetrabase energy can probably still serve as a useful criterion for defining acceptable packing geometries in a backbone design protocol.

DISCUSSION

For de novo protein design, it is desirable to identify designable backbone structures with the usual computational sampling/optimization schemes without the need to prespecify a sequence. For this purpose, one needs a general sequence-independent energy function. The tetrabase energy introduced here is a new form of statistical energy to describe the through-space packing between polypeptide backbone sites. Side chain types are not explicitly considered.
Instead, the interaction has been made to explicitly depend on the local conformation (here, the secondary structure type). In addition, the packing energy depends not only on distances but also on relative orientations. The tetrahedron representation of the relative orientation has been designed as a statistically easy to estimate and chemically sensible way to capture the anisotropic nature of the coarse-grained packing interactions. Thus, the resulting total tetrabase energy contains correlations between local peptide conformation, relative orientation, and distance, making it sensitive to variations in spatial arrangements between SSEs. In packing models depending on simple intersite distance, such correlations would have been averaged out.

Figure 10. Energy variations of MC-minimized SSE arrangements with the same SSE compositions as natural proteins. The MC runs have been started either from the native SSE arrangements (circles) or from artificially constructed SSE arrangements (triangles and plus signs; results of 10 independent MC runs for each SSE composition are given). The energy values are relative to the averaged minimum energies of the 10 MC runs. Configurations similar to the respective native SSE arrangements (RMSD < 2.5 Å) are shown as plus signs, and the remaining configurations are shown as triangles.

Previously, we have reported the ABACUS (a backbone-based amino acid usage survey) model for sequence design with given backbones (31,43). Using fixed native backbones as design targets, sequences have been designed successfully using ABACUS for different fold classes. The RMSDs between the actual structures and the respective design targets are around 2 Å (31,39). Thus, it may be reasonable to aim at constructing designable backbones with about 2 Å RMSD accuracy for subsequent sequence design. In previous studies, polypeptide backbones have also been generated de novo, with packing mostly modeled using simple steric interactions. The top TM-scores between the generated structures and actual protein structures were usually [...]. Although in some studies a few generated structures with higher TM-scores ( 0.6) have been reported (20,21), such structures were selected a posteriori from a large number of generated structures using the TM-scores as criteria and not with an energetic criterion. The level of resemblance between the generated backbones and natural ones in these previous studies, although it may be recognized as statistically significant and as reflecting good correspondence between the overall topologies of generated and natural proteins, may not be high enough for the generated backbones to be used as viable targets for sequence design.
On the other hand, the structures generated by minimizing the tetrabase energy have much higher TM-scores with natural proteins (Figures 4–6). The small RMSDs of 1–2 Å suggest that backbones optimized using tetrabase may be realistic enough to be used as input for sequence design programs such as ABACUS (31). A number of previous theoretical studies with side chain-free modeling have already suggested that the interplay between several side chain type-independent factors, including backbone geometry, backbone hydrogen bonding, and (de)solvation, has strong presculpting effects on the free energy landscape of polypeptides. Here, we observe that even at atomistic resolution, the free energy minima of a statistical potential omitting residue types can still correspond to realistic backbone structures. This suggests that these backbone-associated factors can shape the free energy landscape of proteins at a resolution higher than crude or coarse contouring. This point may have implications not only for protein design but also for protein structure, function, and evolution in general, which may be worth exploring in future studies.

To further verify the contribution of the tetrabase packing term to reaching this accuracy, we have also performed control calculations in which the inter-SSE arrangements have been optimized with the tetrabase packing terms replaced by simple steric plus attractive interactions. In these calculations, the interaction between a pair of Cα atoms is considered to be repulsive (associated with a positive energy of 10) if their distance is less than 3.5 Å and attractive (assigned a negative energy of 0.5) if their distance is between 3.5 and 8 Å. The top TM-scores between configurations generated in these control calculations and native backbones are all below 0.6 (results not shown).
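The control interaction described above is simple enough to write down directly. The following is a minimal sketch, not the authors' code: the function names are hypothetical, the treatment of the exact boundaries (3.5 Å and 8 Å) and the zero energy beyond 8 Å are assumptions, and the energies are in the same arbitrary units as in the text.

```python
import numpy as np

def control_pair_energy(d):
    """Step potential for one Calpha-Calpha distance d (in angstroms)."""
    if d < 3.5:
        return 10.0   # steric repulsion, value taken from the text
    if d <= 8.0:
        return -0.5   # generic attraction, value taken from the text
    return 0.0        # assumption: no interaction beyond 8 A

def control_total_energy(ca_a, ca_b):
    """Sum the step potential over all Calpha pairs between two SSEs.

    ca_a and ca_b are (n, 3) and (m, 3) arrays of Calpha coordinates.
    """
    diff = ca_a[:, None, :] - ca_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return float(sum(control_pair_energy(d) for d in dist.ravel()))
```

Because this potential depends only on the inter-Cα distance, it carries no information on relative orientation or secondary structure type, which is the contrast with tetrabase that the control is meant to expose.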
Even though the tetrabase statistical energy does not explicitly depend on side chain types, it may still carry implicit sequence dependences that have been encoded in the backbone structure, including the local backbone conformations and the relative orientations between backbone sites. A side chain-indiscriminate energy function that simply depends on inter-Cα distances may not encode such implicit dependences. Our results support the view that general approaches to constructing realistic backbones at atomistic resolution without explicit differentiation of side chain types are possible.

In the current work, the local conformation types considered included only regular secondary structure types. This allowed us to search the SSE packing space effectively, so that the effectiveness of tetrabase could be examined by comparing results with SSE packing in native backbones. In our ongoing work, tetrabase is being extended to cover backbone sites in loops by using a discrete local structure alphabet (for example, the ProteinBlock model (44,45)) to represent local conformational states of loops as well as regular SSEs. In addition, to reach a complete model that will allow side chain-unspecialized backbone design with atomistic authenticity and a full range of flexibility, the resulting tetrabase statistical energy that describes through-space packing between sequentially separated backbone sites, either in SSEs or in loops, is being integrated with statistical energy models developed to describe realistic peptide local conformations.

METHODS

tetrabase Packing Energy. The tetrabase packing term between two sites (generally indexed with numbers 1 and 2, respectively), $e^{SS_1SS_2}(r_1, r_2)$, can be determined from the conditional probability distribution $p(r_1, r_2 \mid SS_1, SS_2)$, namely,

$$e^{SS_1SS_2}(r_1, r_2) \propto -\ln p(r_1, r_2 \mid SS_1, SS_2) \tag{2}$$

As previously stated, $r_1$ and $r_2$ refer to coordinates of main chain heavy atoms, and $SS_1$ and $SS_2$ refer to secondary structure types.

The distribution $p(r_1, r_2 \mid SS_1, SS_2)$ in eq 2 needs to be estimated using known protein structures. If we ignore structure variations within each backbone site, it is a joint distribution of at least six dimensions. To overcome the problem brought about by the multidimensionality, we lump together the dimensions that define relative orientation and represent the relative orientation as discrete categories. To define the orientation category $O^{1,2}$ from the atomic coordinates $r_1$ and $r_2$, each of the two backbone sites 1 and 2 is considered as a tetrahedron with the respective N, C, Cβ, and Hα atoms as its vertices (the coordinates of Cβ and Hα are determined using standard covalent internal coordinates). The orientation category is defined by the closest pair of vertices combined with the furthest pair of vertices between the two tetrahedrons. For example, if the closest pair of vertices are atom N of site 1 and atom C of site 2 (i.e., $r_{\min} = r_{NC}$) and the furthest pair are the Cβ atoms of both sites (i.e., $r_{\max} = r_{C^{\beta}C^{\beta}}$), $O^{1,2}$ would be assigned to category $O_{NC,\,C^{\beta}C^{\beta}}$. Similarly, another interbackbone site orientation with $r_{\min} = r_{C^{\beta}C^{\beta}}$ and $r_{\max} = r_{NC}$ would be assigned to category $O_{C^{\beta}C^{\beta},\,NC}$, and so on. With this categorization scheme, possible relative orientations of two backbone sites of the same (different) secondary structure type(s) are divided into 78 (144) distinct categories.
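The categorization above can be illustrated with a short sketch. It is not the authors' implementation: the vertex labels, function names, and array layout are hypothetical. The counting function additionally assumes that the closest and furthest vertex pairs never share a vertex on either site, and that for two sites of the same secondary structure type a category and its site-swapped counterpart are identical; these two assumptions are inferred here only because they reproduce the 144 and 78 categories quoted in the text.

```python
import itertools
import numpy as np

VERTICES = ("N", "C", "CB", "HA")  # hypothetical labels for the tetrahedron vertices

def orientation_category(tet1, tet2):
    """Classify two sites given 4x3 arrays of vertex coordinates (order as VERTICES).

    Returns (closest pair, furthest pair, r_min), where each pair names
    one vertex of site 1 followed by one vertex of site 2.
    """
    d = np.linalg.norm(tet1[:, None, :] - tet2[None, :, :], axis=-1)  # 4x4 distances
    i_min, j_min = np.unravel_index(np.argmin(d), d.shape)
    i_max, j_max = np.unravel_index(np.argmax(d), d.shape)
    closest = (VERTICES[i_min], VERTICES[j_min])
    furthest = (VERTICES[i_max], VERTICES[j_max])
    return closest, furthest, float(d[i_min, j_min])

def count_categories(same_ss_type):
    """Enumerate categories under the assumptions stated in the lead-in."""
    cats = set()
    for i, j, k, l in itertools.product(range(4), repeat=4):
        if i == k or j == l:
            continue  # assumption: closest and furthest pairs share no vertex
        cat = ((i, j), (k, l))
        if same_ss_type:
            # assumption: swapping sites 1 and 2 gives the same category
            cat = min(cat, ((j, i), (l, k)))
        cats.add(cat)
    return len(cats)
```

Under these assumptions, `count_categories(False)` gives 144 and `count_categories(True)` gives 78, matching the numbers in the text.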

Within each orientation category, we choose $r_{\min}$, the distance between the closest vertices, as the variable to further describe the packing geometry. Formally, this treatment is equivalent to approximating the overall distribution, up to a normalization constant, with the following form:

$$p(r_1, r_2 \mid SS_1, SS_2) \approx \tilde{p}(r_1, r_2 \mid SS_1, SS_2) = P(O^{1,2} \mid SS_1, SS_2)\,\frac{\rho(r^{1,2}_{\min} \mid O^{1,2}, SS_1, SS_2)}{\rho^{\mathrm{ref}}_{\min}(r^{1,2}_{\min})} \tag{3}$$

In eq 3, $\rho^{\mathrm{ref}}_{\min}$ is a reference distribution of $r_{\min}$ for two uniformly distributed, noninteracting tetrahedrons. It is needed because of the nonunitary nature of the transform from the complete coordinates $r_1$ and $r_2$ to the reduced representation $O^{1,2}$ and $r_{\min}$. In the current work, the probabilities of different orientation categories $P(O^{1,2} \mid SS_1, SS_2)$ and the distributions of $r_{\min}$ associated with different orientation categories $\rho(r^{1,2}_{\min} \mid O^{1,2}, SS_1, SS_2)$ have been estimated from backbone site pairs in training proteins whose Cα–Cα distances are within a cutoff of $r_{\mathrm{cut}}$ = 13 Å. The conditional distributions for $r_{\min}$ have been estimated by dividing $r_{\min}$ between 0 and 11.5 Å into bins 0.25 Å in width and then counting the frequency of occurrence in each bin. The reference distribution $\rho^{\mathrm{ref}}_{\min}$ has been estimated with the same resolution of bins, albeit with configurations computationally sampled for two uniformly distributed tetrahedrons within the same α-carbon distance cutoff. The set of training proteins is the same as that used in Xiong et al., comprising 12,465 nonredundant peptide chains (pairwise sequence identity between any two chains below 50%) with structures determined at resolutions of 2.5 Å or better by X-ray crystallography (31,46). The secondary structure types have been assigned with the STRIDE program (47). Finally, we define

$$e^{SS_1SS_2}(r_1, r_2) = -\ln \tilde{p}(r_1, r_2 \mid SS_1, SS_2) + e_0^{SS_1SS_2} \tag{4}$$

In eq 4, the reference energy $e_0^{SS_1SS_2}$ has been introduced to control the relative strength of interactions between different secondary structure types. In the current work, the value of $e_0^{SS_1SS_2}$ has been chosen so that the averages of $e^{SS_1SS_2}(r_1, r_2)$ over the different orientation categories at the last bin of $r_{\min}$ (centered around [...] Å) are zero. The energies as defined in eq 4 are stored as separate tables of numerical values at bin centers of $r_{\min}$. Energies at $r_{\min}$ values off the bin centers are obtained through linear interpolation.

Monte Carlo Simulated Annealing. Monte Carlo (MC) simulated annealing has been employed to minimize the total tetrabase energy for systems of given SSE compositions. In MC, two types of configurational moves have been considered. The first type is rigid body moves of a subset of SSEs relative to the remaining SSEs. The chosen SSEs are randomly translated and rotated by uniformly distributed amounts with maximum step sizes of 3 Å and 0.5, respectively. In the second type of MC move, one SSE in the current configuration is replaced with another SSE of the same type and length randomly taken from the training protein structures. New atomic coordinates of the substituting SSE are obtained by minimum RMSD fitting to the substituted SSE. In MC runs started from the native SSE arrangements, only the first type of MC move has been considered.

As different SSEs are not connected to each other, they may diffuse away from each other simply because of the effects of translational entropy. To avoid this, the SSEs are confined within a finite simulation region, which is simply a cubic box that is large enough to accommodate free rearrangements of the contained SSEs. MC moves are rejected if they lead any of the SSEs, either in part or as a whole, to move out of the simulation region.
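The binned energy tables and their linear interpolation, as described for the tetrabase packing energy, can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: the function names are hypothetical, the small pseudocount that guards against empty bins is an addition not mentioned in the text, and the negative logarithm follows the usual statistical-energy convention. The bin layout (0 to 11.5 Å in 0.25 Å steps) is taken from the text and implies 46 bins, with the last bin center at 11.375 Å by simple arithmetic.

```python
import numpy as np

BIN_WIDTH = 0.25   # angstroms, from the text
R_MAX = 11.5       # angstroms, from the text
EDGES = np.arange(0.0, R_MAX + BIN_WIDTH / 2, BIN_WIDTH)  # 47 edges -> 46 bins
CENTERS = 0.5 * (EDGES[:-1] + EDGES[1:])

def energy_table(r_min_obs, r_min_ref, p_orient, e0=0.0, pseudo=1e-6):
    """Energy at each bin center for one orientation category.

    r_min_obs: r_min values observed in training structures for this category
    r_min_ref: r_min values sampled for noninteracting tetrahedrons
    p_orient:  probability of this orientation category
    e0:        reference energy offset
    pseudo:    pseudocount avoiding log(0) in empty bins (assumption)
    """
    rho, _ = np.histogram(r_min_obs, bins=EDGES, density=True)
    rho_ref, _ = np.histogram(r_min_ref, bins=EDGES, density=True)
    ratio = (rho + pseudo) / (rho_ref + pseudo)
    return -np.log(p_orient * ratio) + e0

def energy_at(r_min, table):
    """Linearly interpolate a table between bin centers.

    np.interp holds the end values constant outside the first and last centers.
    """
    return float(np.interp(r_min, CENTERS, table))
```

Storing one such table per orientation category and per pair of secondary structure types keeps each histogram one-dimensional, which is what makes the estimation statistically tractable.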
At each MC step, an attempted move or structure change as described above is either accepted or rejected based on the energy change ΔE, the decision being made according to the Metropolis criterion, namely,

$$P_{\mathrm{accept}}(\Delta E) = \min\left(1, e^{-\beta \Delta E}\right) \tag{5}$$

where 1/β is an effective temperature. Step-dependent effective temperatures were used to search the configurational space for energy minima. In simulations started from the artificially constructed structures, the effective temperature takes the form

$$\beta^{-1}(k) = 2.5\cos\left(\frac{2\pi k}{100{,}000}\right) + 2.5$$

where k is the step number. A maximum of 50,000 steps is carried out, during which 1/β drops from its highest value of 5 at the start to its lowest value of 0. During the course of a run, if the lowest energy visited has not changed in the previous 5000 steps, the MC run is terminated. In simulations started from the native SSE arrangements, the effective temperature was changed according to

$$\beta^{-1}(k) = 2.5\cos\left(\frac{2\pi k}{10{,}000}\right) + 2.5$$

A total of 55,000 MC steps were executed, during which the effective temperature periodically oscillated between 5 and 0 for five full cycles and then dropped from 5 to 0 in the final half cycle. We did not try to fine-tune the parameters used in the Monte Carlo approach, although it may be possible to improve its efficiency by tuning parameters such as the step sizes and the temperature varying schemes. With the current scheme, the overall acceptance ratio is around 25%. In the high temperature phase (the first 10,000 steps of a simulated annealing cycle), the ratio is usually above 30%, while in the low temperature phase (the last 10,000 steps of a simulated annealing cycle) the ratio is usually lower than 20%.

Construct Artificial Initial Configurations. Two different schemes have been used to construct artificial arrangements of the SSEs as initial configurations for subsequent MC simulated annealing.
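The acceptance rule and temperature schedules above can be sketched as follows. The function names are hypothetical, and the cosine periods (100,000 steps for runs from artificial configurations, 10,000 steps for runs from native arrangements) are inferred from the behavior stated in the text, namely a drop from 5 to 0 over 50,000 steps in the first case and five and a half cycles over 55,000 steps in the second. Treating zero temperature as strict downhill acceptance is an assumption made to avoid dividing by zero.

```python
import math
import random

def effective_temperature(k, period):
    """Cosine schedule 1/beta(k), oscillating between 5 and 0."""
    return 2.5 * math.cos(2.0 * math.pi * k / period) + 2.5

def metropolis_accept(delta_e, temperature, rng=random):
    """Metropolis criterion: accept with probability min(1, exp(-delta_e / T))."""
    if delta_e <= 0.0:
        return True    # downhill or neutral moves are always accepted
    if temperature <= 0.0:
        return False   # assumption: at T = 0 only non-uphill moves pass
    return rng.random() < math.exp(-delta_e / temperature)

def anneal_temperatures(n_steps, period):
    """Yield (step, temperature) pairs for one annealing run."""
    for k in range(n_steps + 1):
        yield k, effective_temperature(k, period)
```

For example, `anneal_temperatures(50_000, 100_000)` walks the temperature monotonically from 5 down to 0, while `anneal_temperatures(55_000, 10_000)` reheats the system five times before the final quench.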
The first scheme is to randomly choose SSEs of designated types and lengths from the training backbone structures and then insert them at random positions with random orientations into the simulation box. This scheme worked well for SSE compositions containing only α-helices, with similar sets of lowest energy configurations being found in MC simulations starting from different initial structures. However, for SSE compositions containing β-strands, initial structures constructed with this scheme often did not lead to the formation of complete β-sheets in subsequent MC simulated annealing. Instead, the system frequently got trapped in configurations in which individual strands were packed against α-helices without forming β-sheets. This is purely a sampling problem, as such trapped configurations are always of significantly higher energies than configurations with well-formed β-sheets. The problem probably arose because energy wells corresponding to well-formed β-sheets, despite being relatively deep, are entropically highly disfavored, spanning only narrow regions of the configurational space. When the system is outside of the narrow energy wells, the short-ranged favorable inter-β-strand packing terms are not effective in guiding the MC moves, making the moves essentially random. If the initial configuration of the system is far from the β-sheet forming wells, it becomes highly likely that the simulation gets trapped in broader, entropically more favored metastable energy wells before the more stable but narrower minima can be reached.

To overcome this problem and to increase the efficiency of locating minima associated with β-sheets, a second scheme was introduced to construct initial structures. This scheme employs the idea of Taylor et al., in which SSEs are organized on lattice points in mutually parallel or antiparallel orientations (48,49). For the SSE architectures considered in this work, at most two layers of lattice points in a 2-D plane were used as the (end) positions of SSEs. Both layers run horizontally, one of them containing lattice points for β-strands and the other containing points for α-helices. Within the β-strand layer, the horizontal distance between two neighboring lattice points or β-strands was set to 5 Å. Within the α-helix layer, this distance was set to 11 Å. The vertical distance between the two layers was set to 8 Å. The first lattice point of the α-helix layer was vertically aligned to the second lattice point of the β-strand layer, fixing the relative horizontal shift between the two layers. Each SSE of a given type and length was first randomly chosen from the training backbone structures. It was then rotated so that its end-to-end (from N to C) direction was perpendicular to the lattice plane, pointing in one of two opposite directions (one defined as positive and the other as negative). The remaining rotational degree of freedom, which is the rotation around the end-to-end axis, was left random.
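The two-layer lattice described above can be generated in a few lines. This sketch is illustrative only: the function name and coordinate convention (x horizontal, y vertical, strand layer at y = 0, helix layer at y = +8 Å) are assumptions, since the text does not say which layer lies above the other.

```python
STRAND_SPACING = 5.0    # angstroms between neighboring strand lattice points
HELIX_SPACING = 11.0    # angstroms between neighboring helix lattice points
LAYER_SEPARATION = 8.0  # angstroms between the two layers

def lattice_points(n_strands, n_helices):
    """Return (strand_points, helix_points) as lists of (x, y) positions."""
    strand_points = [(STRAND_SPACING * i, 0.0) for i in range(n_strands)]
    # The first helix point sits directly over the second strand point.
    helix_x0 = STRAND_SPACING * 1
    helix_points = [
        (helix_x0 + HELIX_SPACING * j, LAYER_SEPARATION)  # +y is an assumption
        for j in range(n_helices)
    ]
    return strand_points, helix_points
```

Each point would then serve as the position of the N-terminal or C-terminal Cα atom of one SSE, with the SSE axis perpendicular to this plane.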
SSEs pointing in the positive direction were translated to have their N-terminal Cα atoms at the corresponding lattice points, while SSEs pointing in the negative direction were translated to have their C-terminal Cα atoms at the corresponding lattice points.

Besides using initial configurations close to the intended β-sheet forming states, a coarse restraining energy was applied to restrict the β-strands from drifting too far away from the intended states during MC simulated annealing, especially in the higher temperature phase. This restraining function has the form

$$E^{\text{strand pair}}_{A,B} = \varepsilon\,(n_{A \to B} + n_{B \to A}) \tag{6}$$

where A and B each refer to one of two strands intended to become immediate neighbors in a β-sheet, ε is a positive constant, $n_{A \to B}$ is the number of backbone sites in strand A whose closest backbone site in strand B is within 7.5 Å (measured by the Cα–Cα distance), and $n_{B \to A}$ is similarly defined for strand B. Equation 6 has been chosen to restrict the intrastrand conformational space so that it can be more efficiently searched during the Monte Carlo simulation. As the restraint can be easily fulfilled by two strands spatially approaching each other, leading to zero restraining energies and forces, it does not affect the finer details of the relative arrangements between neighboring strands. The positive constant ε only needs to be large enough that the system cannot move out of the restricted search region once it has moved into it. Here, we have used a sufficiently large value of 10 for ε without trying to readjust it.

We note that both the initial lattice-based arrangements of the β-strands and the strand-pair restraining energy (eq 6) have been introduced to increase the efficiency of the MC simulated annealing by preventing it from being trapped in metastable configuration states. Neither of these treatments should have affected the atomistic details of interstrand packing in the finally identified stable states.
Given the definition of our model, these details should have been determined solely by the tetrabase energy function.

AUTHOR INFORMATION

Corresponding Author
* hyliu@ustc.edu.cn.

ORCID
Huanyu Chu:

Author Contributions
The manuscript was written through contributions of all authors. All authors have given approval to the final version of the manuscript.

Funding
This work has been supported by the National Natural Science Foundation of China (Grants and ).

Notes
The authors declare no competing financial interest.

ACKNOWLEDGMENTS

This work has been supported by the National Natural Science Foundation of China (Grants and ).

REFERENCES

(1) Dahiyat, B. I.; Mayo, S. L. De Novo Protein Design: Fully Automated Sequence Selection. Science 1997, 278,
(2) Kuhlman, B.; Dantas, G.; Ireton, G. C.; Varani, G.; Stoddard, B. L.; Baker, D. Design of a Novel Globular Protein Fold with Atomic-Level Accuracy. Science 2003, 302,
(3) Koga, N.; Tatsumi-Koga, R.; Liu, G.; Xiao, R.; Acton, T. B.; Montelione, G. T.; Baker, D. Principles for Designing Ideal Protein Structures. Nature 2012, 491,
(4) Lin, Y. R.; Koga, N.; Tatsumi-Koga, R.; Liu, G.; Clouser, A. F.; Montelione, G. T.; Baker, D. Control Over Overall Shape and Size in de novo Designed Proteins. Proc. Natl. Acad. Sci. U. S. A. 2015, 112, E
(5) Grigoryan, G.; Degrado, W. F. Probing Designability via a Generalized Model of Helical Bundle Geometry. J. Mol. Biol. 2011, 405,
(6) Huang, P. S.; Oberdorfer, G.; Xu, C.; Pei, X. Y.; Nannenga, B. L.; Rogers, J. M.; DiMaio, F.; Gonen, T.; Luisi, B.; Baker, D. High Thermodynamic Stability of Parametrically Designed Helical Bundles. Science 2014, 346,
(7) Thomson, A. R.; Wood, C. W.; Burton, A. J.; Bartlett, G. J.; Sessions, R. B.; Brady, R. L.; Woolfson, D. N. Computational Design of Water-Soluble Alpha-Helical Barrels. Science 2014, 346,
(8) Marcos, E.; Basanta, B.; Chidyausiku, T. M.; Tang, Y.; Oberdorfer, G.; Liu, G.; Swapna, G. V.; Guan, R.; Silva, D. A.; Dou, J.; Pereira, J. H.; Xiao, R.; Sankaran, B.; Zwart, P. H.; Montelione, G. T.; Baker, D. Principles for Designing Proteins with Cavities formed by Curved Beta Sheets. Science 2017, 355,
(9) Brunette, T. J.; Parmeggiani, F.; Huang, P. S.; Bhabha, G.; Ekiert, D. C.; Tsutakawa, S. E.; Hura, G. L.; Tainer, J. A.; Baker, D. Exploring the Repeat Protein Universe through Computational Protein Design. Nature 2015, 528,
(10) Park, K.; Shen, B. W.; Parmeggiani, F.; Huang, P. S.; Stoddard, B. L.; Baker, D. Control of Repeat-Protein Curvature by Computational Protein Design. Nat. Struct. Mol. Biol. 2015, 22,
(11) Jacobs, T. M.; Williams, B.; Williams, T.; Xu, X.; Eletsky, A.; Federizon, J. F.; Szyperski, T.; Kuhlman, B. Design of Structurally Distinct Proteins using Strategies Inspired by Evolution. Science 2016, 352,
(12) Huang, P. S.; Boyken, S. E.; Baker, D. The Coming of Age of de novo Protein Design. Nature 2016, 537,
