Using A Neural Network and Spatial Clustering to Predict the Location of Active Sites in Enzymes


doi: /s (03)
J. Mol. Biol. (2003) 330,

Alex Gutteridge, Gail J. Bartlett and Janet M. Thornton*

EBI, Wellcome Trust Genome Campus, EMBL Outstation Hinxton, Cambridgeshire CB10 1SD, UK

*Corresponding author

Structural genomics projects aim to provide a sharp increase in the number of structures of functionally unannotated, and largely unstudied, proteins. Algorithms and tools capable of deriving information about the nature and location of functional sites within a structure are therefore increasingly useful. Here, a neural network is trained to identify the catalytic residues found in enzymes, based on an analysis of the structure and sequence. The neural network output and spatial clustering of the highly scoring residues are then used to predict the location of the active site. A comparison of the performance of differently trained neural networks is presented that shows how information from sequence and structure combines to improve the prediction accuracy of the network. Spatial clustering of the network results provides a reliable way of finding likely active sites. In over 69% of the test cases the active site is correctly predicted, and a further 25% are partially correctly predicted. The failures are generally due to the poor quality of the automatically generated sequence alignments. We also present predictions identifying the active site and potential functional residues in five recently solved enzyme structures that were not used in developing the method. The method correctly identifies the putative active site in each case. In most cases the likely functional residues are identified correctly, as well as some potentially novel functional groups.

© 2003 Elsevier Science Ltd. All rights reserved

Keywords: bioinformatics; structural genomics; functional prediction; neural network; active sites

Present addresses: A. Gutteridge, Birkbeck College, University of London, Malet St, Bloomsbury, London WC1E 7HX, UK; G. J. Bartlett, Department of Biochemistry and Molecular Biology, University College London, Gower St, London WC1E 6BT, UK.

Abbreviations used: SGI, structural genomics initiatives; HMM, hidden Markov model; TIM, triose phosphate isomerase; RGS, regulator of G-protein signalling; ET, evolutionary trace; RSA, relative solvent accessibility; MCC, Matthews correlation coefficient; DOPS, diversity of position score; FEM, factors essential for methicillin resistance; PDB, Protein Data Bank.

E-mail address of the corresponding author: thornton@ebi.ac.uk

Introduction

The huge increase in the rate of DNA sequencing, and the use of gene prediction technologies such as Genscan 1,2 and Genewise, 3 have flooded protein databases with new sequence data. The various structural genomics initiatives (SGI) now aim to produce a similar increase in the amount of structural information. 3 One of the most important tasks in biology today is to use these data to provide functional annotation that leads to biologically useful knowledge. 4

One type of information that structural data can provide is the location and nature of the functional regions of a protein, such as protein-protein interaction sites and ligand binding pockets. Knowing the location of the functional sites within a protein allows the study of targeted mutants, structure-based drug design, and functional annotation of the protein by comparison with other characterised proteins.

Most novel genes are functionally annotated using sequence analysis to find similar genes of known function, typically by running one of the various flavours of BLAST 5 or profile and HMM approaches such as Pfam.
6 Several studies 7–9 have pointed out the problems associated with homology-based functional annotation and indicate a cut-off of sequence identity as high as 40%, below which it is dangerous to transfer anything but the broadest functional annotation. It is certain that some sequences in the public databases are incorrectly annotated due to the difficulty of transferring function based purely on sequence similarity.

As more data come through from structural genomics it is likely that a similar approach will be taken to annotate proteins using structural homologues. The idea that proteins with similar structures perform similar functions has been examined closely. Generally, although it is true that structure is more conserved than sequence at great evolutionary distance, the transfer of function based on structural similarity is no more reliable than annotation based on sequence similarity. The problem is the limited number of unique folds found in nature, which has been estimated to be as low as Since the number of functions performed by proteins far exceeds this number, it follows that one fold must be capable of many functions. The triose phosphate isomerase (TIM) barrel fold, for instance, is associated with 61 different EC 14 numbers, covering five of the six top level EC classifications. 15

Methods to locate and characterise the functional sites of a protein could provide data for functional annotation in ways not based on homology, as well as providing information for mutagenesis and drug design studies. Traditional molecular biology techniques for finding functional sites, such as mutagenesis, 16 pH dependence 17 and chemical labelling, 18 are generally time consuming, and rely on some prior knowledge of the function of the protein to allow it to be assayed. In silico methods for finding and annotating functional sites would clearly be of great help in annotating novel protein structures from structural genomics. Several different strategies have already been developed; however, none has been used to perform an analysis across the whole structure database.
Pattern matching approaches such as TESS, 19 FFF 20 and SPASM 21 aim to locate functional sites and annotate structures by finding small three-dimensional motifs within the structure. The disadvantage of this is that suitable motifs have to be derived (usually from the literature, though data-mining techniques have been used for automatic extraction of motifs 22 ) and truly novel structures may not match any known motif. Recent studies 23 have also presented methods for finding similarities between cavities on the protein surface, which could be used to annotate structures once the functionally important cavities are identified.

Techniques for finding functional sites de novo, such as evolutionary trace (ET) and other similar methods, generally focus on searching for three-dimensional clusters of conserved residues. ET studies have made genuine, experimentally confirmed predictions for the location of functional sites in G-proteins 30 and regulator of G-protein signalling (RGS) proteins, 31 demonstrating the potential of this type of technique.

Some proteins, including those targeted by structural genomics, have no sequence homologues and so conservation-based approaches do not work. Methods that use only structural information to locate functional residues have been developed to provide functional annotation for these proteins. 32,33 These techniques rely on identifying residues with unusual electrostatic and ionisation properties, and have shown a correlation between these residues and functional sites within the protein.

Here, we describe a new method for de novo prediction of functional sites, specific for the active sites of enzymes. Instead of searching for clusters of conserved residues, a neural network is used to score the residues of a protein structure by the likelihood that they are catalytic. By searching for clusters of high-ranking residues the algorithm determines the most likely active site.
The neural network is trained using a dataset of proteins for which the catalytic residues have been confidently located by experiment. Structural parameters such as the solvent accessibility, type of secondary structure, depth, and the cleft that the residue lies in, as well as the conservation score and residue type, are used as inputs for the neural network.

Results

Analysis of parameters

A detailed analysis of the parameters is provided by Bartlett et al. 34 A brief summary is presented here.

Conservation was the most powerful parameter for discriminating catalytic and non-catalytic residues. For some proteins, however, the search failed to find sufficiently diverse homologues to generate meaningful conservation scores. It is hoped that in these cases the predictive power of the other parameters will be enough to allow reliable predictions to still be made.

Catalytic residues show a tendency to be buried within the structure and so have a lower relative solvent accessibility (RSA) than other residues, particularly non-catalytic polar residues, the majority of which lie exposed on the surface of the protein. Despite this tendency to be buried, catalytic residues are often found lining a large cleft. This tendency is particularly marked for the largest cleft, and is significant for the second and third largest clefts. For clefts smaller than this (fourth to ninth largest) the difference is not particularly significant.

There is a slight tendency for catalytic residues to prefer coil regions over helix or sheet regions; this could be due to the extra conformational flexibility this gives them (allowing the active site to change conformation on ligand binding).

The hydrophobic and small residue groups were found to be very rarely catalytic, presumably because they do not contain the chemical groups required for most catalytic tasks. The obvious exceptions to this are when the backbone amide and carbonyl groups perform catalytic functions. It was found that glycine is the residue most often used in this case.

Depth

Depth values were calculated for the non-catalytic residues in the data set, and the distribution (Figure 1) shows that almost 40% of residues lie on, or near, the surface of the protein, with depths less than 1 Å. These residues are almost completely exposed to the surface with only a few of their atoms not solvent accessible. The proportion of the total represented by each 1 Å division then decreases steadily, apart from a small peak in the 4–5 Å division. Presumably this second peak is due to invaginations on the protein surface, which alter the distribution from the smooth decrease one would expect given a perfectly spherical protein. The very deepest residues in this data set lie at ~13 Å.

Catalytic residues show a different distribution, with only 17% lying in the outer 1 Å; the majority occupy the next, partially buried layer between 2 Å and 4 Å. This allows the catalytic residues to have some solvent accessibility (in order to interact with the substrates) whilst remaining mostly buried (to allow themselves to be correctly orientated by other residues). The catalytic residues rarely have depths greater than 5 Å.

Figure 1. Distribution of residue depths for non-catalytic residues and catalytic residues.

An example: quinolate phosphoribosyltransferase

As an example of the neural network output, the scores along the 286 amino acid sequence of quinolate phosphoribosyltransferase (1QPR) are shown in Figure 2. Most residues score very low (a large majority score less than 0.01), and around 20 residues score over 0.5. The four known catalytic residues (Arg105, Lys140, Glu201 and Asp222) all score highly, though several other residues score as high or higher.
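To make the idea of ranking residues by network score concrete, the following is a minimal sketch of sorting per-residue scores and looking up where the known catalytic residues fall. The residue labels and score values are hypothetical placeholders chosen for illustration; they are not the 1QPR values, and the function names are assumptions rather than anything defined in the paper.

```python
from typing import Dict, List, Tuple


def rank_residues(scores: Dict[str, float]) -> List[Tuple[str, float]]:
    """Return (residue, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def ranks_of(residues: List[str], ranked: List[Tuple[str, float]]) -> Dict[str, int]:
    """Map each requested residue to its 1-based rank in the sorted list."""
    position = {name: i + 1 for i, (name, _) in enumerate(ranked)}
    return {name: position[name] for name in residues}


# Hypothetical scores for illustration only (not taken from the paper).
toy_scores = {"res_A": 0.92, "res_B": 0.003, "res_C": 0.71,
              "res_D": 0.008, "res_E": 0.88}
known_catalytic = ["res_C", "res_E"]  # hypothetical "true" catalytic residues

ranked = rank_residues(toy_scores)
print(ranks_of(known_catalytic, ranked))
```

In this toy case one non-catalytic residue outranks both catalytic ones, which mirrors the observation that true catalytic residues score highly without necessarily occupying the very top ranks.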
There is some grouping of the high-scoring residues in the sequence, particularly around residue 140, but most high scores are isolated spikes. When the scores are mapped on to the 1QPR structure (Figure 3) the high-scoring areas, although widely separated in the sequence, are brought together and cluster into two areas corresponding to the two active sites of the quinolate phosphoribosyltransferase homodimer.

Figure 2. The distribution of neural network scores along the sequence of 1QPR. The true catalytic residues are highlighted.

Figure 3. (a) Distribution of neural network scores in the 1QPR structure. Residues are coloured by network score (red = high, blue = low). (b) The structure of the 1QPR homodimer, coloured by chain, with the known catalytic residues drawn in thick lines. All structure diagrams are prepared using PyMol. 58

Training the network

The training process is tracked by measuring the Matthews correlation coefficient (MCC) after each epoch; Figures 4 and 5 show how the MCC varies as training progresses. The variation in performance is quite considerable, with the final MCC varying between 0.25 and 0.35, reflecting the natural variation within the data set. Figure 5 shows the MCC varying with each epoch averaged over all ten runs. The network reaches its best MCC after only 30 epochs or so, levelling off at an average MCC of around There is no evidence of over-fitting in the results, as the MCC does not fall significantly once it has plateaued.

Network weights

The relative strengths of the weights that the network converges to are shown in Figure 6. Conservation and diversity of position score (DOPS) are both highly weighted. As expected, the network also looks for buried residues, as RSA is given a negative weighting. The cleft categories show that lying in a cleft, and the size of that cleft, are important factors in the network score, though not as important as conservation or RSA. Depth is not weighted strongly in either direction, and is not important in making a prediction. The difference for the secondary structure parameters is also small.
Residue type has a very large variation, with histidine, cysteine and the charged residues (aspartate, glutamate, lysine and arginine) all scoring highly, whilst the hydrophobic residues score low.

The high DOPS weighting is interesting, as DOPS is the same for all residues within a protein chain. Its only effect is to raise the scores of all residues in chains with high DOPS and lower the scores of all residues in chains with low DOPS. The network has learnt that when DOPS is low it is better, in terms of the overall error rate, to make no catalytic predictions at all, rather than predict everything to be catalytic. Since the clustering algorithm uses residues based on their rank rather than absolute scores, this makes no difference in the later stages.

Figure 4. Training the neural network. Each line represents one of the ten cross-validation runs.
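The Matthews correlation coefficient used to track training, and the per-epoch average over cross-validation runs plotted in Figure 5, can be sketched as follows. This is a minimal illustration using the standard MCC definition from confusion-matrix counts; the function names are assumptions, and returning 0 when the denominator vanishes is a common convention rather than something stated in the paper.

```python
import math
from typing import List, Sequence


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention used here when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denom


def mean_per_epoch(runs: Sequence[Sequence[float]]) -> List[float]:
    """Average the MCC at each epoch over several cross-validation runs."""
    n_epochs = min(len(run) for run in runs)  # truncate to the shortest run
    return [sum(run[e] for run in runs) / len(runs) for e in range(n_epochs)]
```

Because catalytic residues are a small minority of all residues, a measure such as MCC that accounts for all four cells of the confusion matrix is more informative than plain accuracy, which would be high even for a network that never predicts a catalytic residue.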

Figure 5. MCC averaged over all ten cross-validation runs.

Clustering

In the network scoring we consider each residue as independent of the others; however, catalytic residues are likely to cluster together in the structure. Ranking and clustering the residues allows us to use this information to improve the predictions and locate the active site.

For each structure a list of possible catalytic residues is generated by ranking the residues by network score. The clustering algorithm finds distinct clusters of these residues and generates a sphere that forms the predicted active site. In total, 1158 clusters are generated from the test set, an average of 7.2 per protein. The multimeric nature of most of the proteins means that the average number of known active sites is 2.6 per protein.

The distribution of sphere sizes for the known sites and all the predicted sites is shown in Figure 7. Figure 8 shows the sizes for the known sites and the top scoring predicted sites only. Most predicted clusters are small and contain two or three members with a radius of 3–4 Å; in contrast, the top scoring predictions in each structure are generally large and lie at the upper end of the allowed size range (15 Å). The known sites generate spheres with sizes between 6 Å and 12 Å, though a significant number have a single catalytic residue and so have radii of 3 Å. A few outliers have spheres larger than 20 Å in radius. These cases all represent structures where the catalytic cluster is thought to come together upon substrate binding, so the cluster appears very large in the unbound form.

Figure 6. The relative strengths of the weights placed on the various parameters. Categorical parameters such as residue type are grouped, with the lowest weight set at 0.

Figure 7. Size distribution of all the predicted sites compared to the known sites.

Figure 8. Size distribution of the top scoring predicted sites compared to the known sites.

Comparing the predicted sites to the known sites

To test whether a prediction is correct, the overlap between the predicted site and the closest known active site is calculated. A correct prediction occurs when the overlap is greater than 50% of the volume of the known active site, a partially correct prediction occurs when there is some overlap but less than 50%, and a failure occurs when there is no overlap between the known and predicted spheres.

For each protein in the test set, the prediction with the highest total network score was selected and compared to the known sites. The results are shown in Figure 9: 62% of the proteins have the active site correctly identified, and a further 22% are partially correct. When we consider the overlap for all the sites predicted for each protein we find the results improve: 69% of the proteins have the active site correctly identified and 25% have a partially correct prediction.

The increase of only 7% when all predictions are considered shows that the highest scoring cluster is very often the true active site. Eleven cases were found where the top prediction was not correct, but one of the other predictions was. In six of these cases the correct cluster was the second or third highest scoring prediction, in four cases it was the fourth or fifth highest scoring, and in the final case it was the seventh highest scoring.

Figure 9. Pie chart showing the per protein accuracy when only the top prediction is considered for each protein, and when all predictions are considered.

When each of the 1158 predicted clusters is considered individually, as opposed to by each protein, 25% are found to be correct and 41% are partially correct. The high number of partial hits is presumably due to the tendency of the network to find residues lying near the active site, but which are not close enough to the true catalytic residues to score as correct. It is also possible that many of the partially correct and incorrect clusters represent secondary functional sites such as ligand binding or protein-protein interaction sites. These clusters are biologically interesting, but are considered incorrect when searching solely for active sites.

Significance of results

To calculate the significance of these results, we estimate the probability (P_R) of achieving this level of prediction by random chance. A similar method to that used by Aloy et al. 26 is applied. To a reasonable approximation a correct hit occurs when the centre of the smaller of the two spheres lies within the volume of the larger. Since the known catalytic site is usually smaller than the largest predicted site, and assuming the prediction has an equal probability of being anywhere within the volume of the protein, P_R is the ratio of the volume of the predicted sphere to the volume of the protein. Since most of the proteins are multimeric, this ratio is then multiplied by the number of active sites (any one of which could have overlapped with the predicted site).

The volume of each protein is estimated by drawing a sphere around all the Cβ atoms of the structure, giving an average of 510,000 Å³. Since most catalytic residues lie in the outer 5 Å of the protein, we also consider the predictions restricted to only a third of this volume. The average volume of all the predicted spheres is 2632 Å³, and the average volume of the top scoring predictions is 5783 Å³. There are 7.2 predicted sites and 2.6 known sites per protein on average. A summary of the observed and expected rates of correct predictions for the three different analyses is shown in Table 1.
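The chance-level estimate described above, and the z statistic of equation (1) used to compare observed and expected rates, can be sketched as follows. This is an illustration only: it plugs in the data-set averages quoted in the text rather than per-protein volumes, so it need not reproduce the entries of Table 1, and the function and constant names are assumptions.

```python
import math


def expected_hit_probability(sphere_volume: float, protein_volume: float,
                             n_known_sites: float,
                             accessible_fraction: float = 1.0) -> float:
    """Chance (P_R) that one randomly placed predicted sphere hits a known site."""
    p = n_known_sites * sphere_volume / (protein_volume * accessible_fraction)
    return min(p, 1.0)  # a probability cannot exceed 1


def z_score(p_observed: float, p_random: float, n: int) -> float:
    """Normal-approximation z statistic for an observed versus expected proportion."""
    return (p_observed - p_random) / math.sqrt(p_random * (1.0 - p_random) / n)


# Averages quoted in the text (volumes in cubic angstroms).
PROTEIN_VOLUME = 510_000.0
ALL_SITES_VOLUME = 2632.0
TOP_SITE_VOLUME = 5783.0
KNOWN_SITES = 2.6

for label, volume in [("all sites", ALL_SITES_VOLUME), ("top site", TOP_SITE_VOLUME)]:
    whole = expected_hit_probability(volume, PROTEIN_VOLUME, KNOWN_SITES)
    outer = expected_hit_probability(volume, PROTEIN_VOLUME, KNOWN_SITES, 1.0 / 3.0)
    print(f"{label}: P_R = {whole:.3f} (whole volume), {outer:.3f} (one third of volume)")
```

Restricting the accessible volume to one third makes a random hit three times more likely, so it gives the more conservative baseline against which the observed success rates are judged.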
We estimate the significance of the differences using equation (1), 26 which follows a normal distribution with mean 0 and standard deviation 1. All the results are significant to more than

$$ z = \frac{P_O - P_R}{\sqrt{\dfrac{P_R (1 - P_R)}{n}}} \qquad (1) $$

Comparison of the performance of different networks

The neural network and clustering process incorporates a variety of different types of information: evolutionary information encoded in the conservation scores, residue propensities, structural information in the parameters and detailed structural information included in the clustering stage. To understand how these different types of information contribute to the overall performance, networks have been trained using different subsets of the parameters.

Two additional networks have been developed: first, a network trained solely using sequence parameters (conservation, DOPS and residue propensities), and second, a network trained using structural parameters but excluding conservation and DOPS scores. Residue propensity is included in the structural information, as the sequence of a protein would always be known given a structure. The relative performance of the different networks is shown in Figure 10 and in detail in Table 2. The performance of each network in finding the location of the active site is shown in Figure 11 and Table 3.

The performance of the technique described by Aloy et al., 26 which uses conservation, residue propensity and clustering, is also shown in Table 3. That study uses the same sphere-based method as shown here to assess the accuracy of the predictions, making comparison easy, although the functional residues were based on SITE records in PDB files, which are not as well defined as the catalytic residues used here. Aloy et al. analysed 106 proteins and found that 20 of them could not generate sufficiently diverse alignments to give good predictions.
Since we have included proteins with low DOPS here, we count these 20 proteins as incorrect predictions when calculating performance. Once this is taken into account we find that of the 106 proteins, 68 are correctly predicted (64%), 13 are partially correct (12%) and 25 are incorrect (24%), when all predictions are considered. This level of prediction is almost identical to that of the sequence trained network.

Table 1. Observed and expected frequencies of correct results for the three analyses
Columns: Per site (n = 1158) (%); Per protein (n = 159) (%); Top site (n = 159) (%)
Rows: Expected (P_R); Expected (P_R) (1/3 Vol); Observed (P_O)

Figure 10. Comparison of the MCC achieved by the three different networks in predicting catalytic residues, before and after structural clustering is applied.

Figure 11. Comparison of the site prediction accuracy for the three different networks. Results are presented considering all predicted sites and the top scoring site only.

Table 2. Comparison of the performance of the three different neural networks in predicting catalytic residues
Columns: Data used; Before clustering (MCC, Q Predicted, Q Observed); After clustering (MCC, Q Predicted, Q Observed)
Rows: Structure; Sequence; Sequence + structure

Table 3. Comparison of the performance of the three different neural networks in locating active sites
Columns: Data used; Top sites only (%) (Correct, Partial, Incorrect); All sites (%) (Correct, Partial, Incorrect)
Rows: Structure; Sequence; Sequence + structure; Aloy et al.

Predictions

The neural network was run on several recently published enzyme structures, which were not included in the original data or subsequent analysis, to gauge the usefulness of the method in annotating structures.

Figure 12. (a) The front face of the SET domain showing the large, high-scoring surface patch and the AdoHcy binding cleft. Residues His410, Asp450, Asn409 and Tyr451 form the L-shaped patch in the centre; Arg406 and Tyr357 lie in the pocket to the right. (b) The catalytic site of I-TevI; network scores are generated without conservation data. (c) The catalytic site of I-TevI, showing the improved prediction once conservation data is included. (d) L-Arabinanase; the central red patch is made of His37 and Asp38. Asp158 and Glu221 both lie close by in the same pocket. (e) FemA; the binding cleft is the long green patch in the centre of the figure. The most likely catalytic site lies at the far left end of the cleft. (f) The RlmB dimer. The subunits are stacked on top of each other running left to right. The two active sites lie in the high-scoring regions in the interface between the two subunits.

SET domain histone lysine methyltransferases

Several recent papers have presented the first structures of histone lysine methyltransferases (HMTases) containing SET domains. SET domains are responsible for the methylation of specific lysine residues in histone proteins, leading to changes in chromatin regulation and gene expression. SET domains share no homology with other structurally characterised methyltransferases, and so the structure, and the functional information that the structure contains, is of significant importance.

The structure of yeast protein Clr4 (PDB code 1MVX) was used for the prediction of the functional sites. The PSI-BLAST search required 7 iterations to converge (using an E-value cut-off of ) and found ~150 homologues, producing a very diverse alignment. Thirty-one residues score over the ranking cut-off and clustering reveals one large cluster, containing 19 of these residues.
The output of the neural network mapped to the surface of the structure is shown in Figure 12. The dominant cluster forms the large L-shaped patch in the centre of the structure, comprising residues His410, Asp450, Asn409 and Tyr451; the other residues in the cluster extend either side of the L-shaped patch and into the structure.

Mutations to His410, Cys412, Arg320, Glu446 and Arg406 have been shown to inactivate the enzyme, 41,42 though it is suggested that Arg320 is most likely to be of structural, rather than catalytic, importance. The structure with an AdoHcy cofactor bound is known for homologous SET domains. This reveals that Tyr451, Asn409 and His410 make contacts to the cofactor, and Tyr357 is proposed as a possible catalytic proton source. A high resolution crystal structure of the human SET7/9 domain has been recently published. 40 This study suggests catalytic roles for the residues equivalent to Tyr451, Tyr419, Tyr357 and the main-chain carbonyl oxygens of Asp403 and Phe408.

Of these functional residues, the large predicted cluster contains His410, Glu446, Arg406, Tyr357, Tyr451 and Asp403. The neural network identifies the correct active site and many of the known functional residues.

Intron endonuclease I-TevI

The structure of the intron endonuclease I-TevI from bacteriophage T4 has recently been published. 43 Intron endonucleases catalyse a break in double stranded DNA that facilitates the insertion of introns and inteins. I-TevI contains separate catalytic and DNA binding domains; the structure of the catalytic domain is analysed here (PDB code: 1LN0). Using the default PSI-BLAST parameters the sequence of 1LN0 picks up no homologues.

Despite this the network still makes predictions based purely on the structure and the residue propensities. The network identifies three residues (His31, His40, Ser42) forming the highest scoring cluster.

A putative active site is proposed based on conservation and mutagenesis data. 44 The site is located in the same cleft identified by the network. Glu75 binds a divalent cation and is likely to be the principal functional residue. Other functional residues suggested by the authors include Tyr17, Arg27, His31 and His40.

The 1LN0 structure has Arg27 mutated to alanine, as active I-TevI cannot be produced by Escherichia coli. Replacing Ala27 by arginine in the sequence presented to PSI-BLAST, and reducing the E-value cut-off to 10 25, allows the network to improve the prediction. Twelve residues now form the largest cluster, including Tyr17, Arg27, His31 and His40. Glu75 still remains outside the predicted cluster, however.

This example demonstrates how the network can cope with structures occupying a sparsely populated region of sequence space. The prediction made only on the basis of residue propensities and structural data correctly identifies the active site and several functional residues. Once the mutated structure is corrected and conservation scores are added the network makes improved predictions, correctly identifying the active site and many of the principal residues, though it still fails to predict the crucial Glu75. The problems of mutated structures and limited sequence homologues highlight some of the difficulties that would be encountered in a PDB-wide analysis.

α-L-Arabinanase

The structure of Cellvibrio japonicus arabinanase has been solved recently, 45 revealing a novel five-bladed β-propeller fold (PDB code: 1GYD). Arabinanase hydrolyses the arabinan polymers found in plant cell walls.
The PSI-BLAST search converges after only four iterations, finding only 11 homologues; however, the alignment is quite diverse and useful conservation scores are obtained. The highest scoring cluster lies centred around the high-scoring pair of residues His37 and Asp38. The other residues in the cluster are Ser86, Ser112, His92, Trp94, Gln316, Asp158, Thr58, His291, Tyr308, Ser52 and Thr53.

The authors of the paper used analogy with other enzymes, 46 conservation, and mutagenesis to identify Asp38 and Glu221 as the likely catalytic groups. A third carboxylate, Asp158, is suggested to be involved in pKa modulation or positioning of the Glu221 side-chain.

The neural network correctly identifies the three acidic residues as catalytic (all are highly ranked); however, the clustering algorithm does not link Glu221 into the cluster containing Asp38 and Asp158 (even though Glu221 is the highest scoring residue in the protein). Altering the clustering parameters to join residues separated by less than 5 Å (rather than the default 4 Å) allows Glu221 to join the main cluster.

FemA

FemA is a Staphylococcus aureus protein identified as a member of the Fem (factors essential for methicillin resistance) family, a series of antibiotic resistance genes. 47,48 FemA is responsible for the addition of glycines to peptidoglycan molecules in the bacterial cell wall. The structure is the first example of this important family 49 (PDB code: 1LRZ).

PSI-BLAST converges after four iterations, finding 40 homologues, and generates a diverse alignment. The network scores mapped to the structure are shown in Figure 12. The high-scoring residues line the large cleft that runs the length of the protein. The clustering algorithm suggests a seven residue cluster comprising the high-scoring residues His106 and His29, and five other lower scoring residues. This cluster lies at the very end of the cleft.
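The sensitivity of the clustering to its distance cut-off, as seen in the arabinanase example where moving from 4 Å to 5 Å merges a residue into the main cluster, can be illustrated with a small sketch. It assumes single-linkage clustering on one representative coordinate per residue and a simple centroid-based sphere; the clustering actually used by the authors is described in their Methods and may differ. The coordinates and names below are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]


def cluster_by_cutoff(coords: Dict[str, Point], cutoff: float) -> List[List[str]]:
    """Single-linkage clustering: residues closer than `cutoff` share a cluster."""
    names = list(coords)
    unvisited = set(names)
    clusters = []
    for seed in names:
        if seed not in unvisited:
            continue
        unvisited.remove(seed)
        cluster, frontier = [seed], [seed]
        while frontier:
            current = frontier.pop()
            # Collect neighbours first, then remove them, so the set is not
            # modified while it is being iterated.
            near = [n for n in unvisited
                    if math.dist(coords[current], coords[n]) < cutoff]
            for n in near:
                unvisited.remove(n)
                cluster.append(n)
                frontier.append(n)
        clusters.append(sorted(cluster))
    return clusters


def enclosing_sphere(points: List[Point]) -> Tuple[Point, float]:
    """Centroid and distance to the farthest point (not a minimal bounding sphere)."""
    centre = tuple(sum(axis) / len(points) for axis in zip(*points))
    radius = max(math.dist(centre, p) for p in points)
    return centre, radius


# Hypothetical coordinates in angstroms, for illustration only.
toy = {"r1": (0.0, 0.0, 0.0), "r2": (3.5, 0.0, 0.0),
       "r3": (8.0, 0.0, 0.0), "r4": (20.0, 0.0, 0.0)}
print(cluster_by_cutoff(toy, 4.0))
print(cluster_by_cutoff(toy, 5.0))
```

In the toy data, r3 sits 4.5 Å from r2, so it is left out at a 4 Å cut-off and joins the main cluster at 5 Å, which is the same kind of behaviour reported for Glu221.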
Another five residue cluster lies approximately halfway along the cleft comprising Lys383, Phe382, Ser342, Ser314 and Thr332. The crystal structure does not have any ligand bound, and no mutagenesis data is available to pinpoint the actual catalytic residues. The cleft is the only structure large enough to accommodate the peptidoglycan substrate and hence is the most likely binding site, though a conformational change on substrate binding cannot be ruled out. The network suggests several residues as potential catalytic groups and further experimentation is required to confirm which, if any, of these residues form the catalytic centre. RlmB 23 S rrna Methyltransferase RlmB is an Escherichia coli protein representing the novel Ado-Met dependent methyltransferase class, SPOUT. RlmB is responsible for the methylation of a specific guanosine group in the 23 S rrna component of the ribosome. 50 The crystal structure of the enzyme has recently been solved 51 (PDB code: 1GZ0). PSI-BLAST converges after iteration five, having found 100 homologues and generates a very diverse alignment. RlmB forms a homodimer in solution and the high-scoring residues cluster into two almost identical sites in the dimer interface region. Each site contains residues from both chains A and F. The highest scoring residue is Arg114 which is involved in a salt-bridge with Glu198 from the opposite chain. Surrounding this pair are His9, Asp117, Glu147, Ser148 and Gly144 from the same chain as Arg114 and Ser224, Leu225, Asn226 and Ser228 from the same chain as Glu198. A secondary cluster comprised of Asp105, His107 and Asn108 lies 4.3 Å from this main cluster. The authors propose a putative active site based on conservation of three previously identified motifs, found in most methyltransferases. 52,53 Motif 1 covers residues Asn108 to Arg114, motif II

covers Glu198 and motif III covers Ser224, Leu225 and Asn226. They also report that mutagenesis of the equivalent residue to Glu198 in a homologue abolishes methyltransferase activity. Glu198 and Ser224 are suggested as possible catalytic bases. His9 is implicated in RNA binding; however, several other putative RNA binding residues are not identified strongly by the network. The network has correctly identified the putative catalytic centre, though again the clustering has split the site, leaving part in a small secondary cluster.

Discussion

One of the original aims of the project, to predict catalytic residues from structures, has proven to be an extremely difficult task given the narrow definition of catalytic used here. The MCC of 0.28 (or 0.32 if clustering is used) is too low to realistically use the simple predictions from the neural network in identifying catalytic residues directly. The main problem is the high number of false positives: 56% of catalytic residues are identified correctly, but only one in seven catalytic predictions is correct. Visual inspection of the results shows that many of the false positives are other functional residues lying in the active site, such as substrate binding and metal binding residues. These residues have very similar properties to the catalytic residues: they are conserved, have low solvent accessibility and lie in clefts. They also lie extremely close to the true catalytic residues and do not form a distinct or separate spatial cluster. A system looking to identify any functional residues at the active site may well consider these false positives to be true positives; however, given the definition used here they are errors. As well as the problem of these false positives, there is the inherent difficulty of picking the handful of catalytic residues from hundreds in the protein. The ratio of catalytic to non-catalytic residues is around one in one hundred across the entire data set.
Given these difficulties the low success rate is understandable and not as disappointing as it first appears. The network weights and the performance of the sequence-only neural network show that evolutionary information, encoded in conservation scores, is very important in making a prediction. This network reflects the performance that one could expect to achieve when predicting catalytic residues purely from sequence data. We see from the Q_Observed and Q_Predicted values in Table 2 that 50% of catalytic residues are found by this network, but only one in eight of the predictions is correct.

Structural genomics projects aim to provide some level of structural information for the majority of protein sequences. Some of these proteins will not have any known sequence homologues and the structure will be the only information available. The neural network trained without conservation scores reflects the performance one could expect to achieve when analysing these proteins. The network alone performs poorly; however, the structural information can also be used to cluster the predictions in these proteins. When this form of structural information is incorporated the overall performance rises almost to the level of the sequence network, and 57% of the catalytic residues are correctly predicted, though the true positives are still only one in ten of the catalytic predictions.

For the majority of structural genomics targets there are some sequence homologues, and in these cases both types of information can be incorporated. The network trained using sequence and structure outperforms both the other networks, with an MCC of 0.28 rising to 0.32 when clustering is used (Table 2). 68% of catalytic residues are correctly predicted and one in six of the catalytic predictions is correct. Although predicting the catalytic residues is difficult, predicting the location of the active site can be done with significant levels of success (Table 3).
When only structural information is used the clustering algorithm is still able to identify the catalytic cluster correctly in 62% of proteins and partially correctly in a further 31%. This suggests that even for structural genomics targets where no conservation data are available, it will still be possible to make significant predictions about the location of the active site. The neural network trained using sequence data identifies 63.5% of sites when all predictions are considered. This level of performance is similar to the technique described by Aloy et al.,²⁶ which also uses conservation, residue propensities and clustering. It should be noted that Aloy et al. compared their predictions to the SITE records of PDB files, which are less rigorously defined than the catalytic clusters used here, and generally comprise a larger number of residues. The performance of the neural networks used here is therefore likely to be underestimated relative to that of Aloy et al.

As with the neural network output, when structure and sequence are combined the performance exceeds that of sequence or structure alone. In this case 69% of sites are correct considering all predictions and 62% considering only the top prediction. A further 25% of sites are partially correctly predicted when all predictions are considered and 22% when only the top prediction is considered. The method fails to make a useful prediction in only 6% of cases when all the predictions are examined.

One of the justifications for the large investment made in structural genomics is that it will allow identification of functional sites and residues in cases where it is not possible from sequence. The results we have shown here indicate that structure alone can be used to identify catalytic residues and active sites in enzymes; however, evolutionary

history encoded in the form of conservation scores is an extremely rich source of information for making these types of predictions and should be incorporated at every opportunity. The improvement in performance when structure and sequence are used shows that structural information, other than that used for clustering, should be incorporated into de novo prediction techniques such as evolutionary trace.

Why did the failures fail?

When considering the top scoring sites in each protein we find that 16% of the proteins failed to find any overlap between the predicted spheres and the known catalytic cluster. It is important to understand why these failures occurred in order to improve the algorithm and assess whether there are specific types of enzyme on which the algorithm performs consistently badly.

Poor alignments

The alignments automatically generated by PSI-BLAST are the most likely point of failure. The optimal E-value cut-off for each family varies depending on its size and diversity. The single E-value cut-off used represents the best compromise, but still generates poor alignments for some families. To test whether poor alignments are the major source of error, the difference between the conservation of the catalytic residues and the conservation of all residues was calculated and averaged for each group of results (correct, partial and incorrect); the results are shown in Figure 13. The different groups clearly show a variation in the distinction between conservation of catalytic and non-catalytic residues. In the correctly predicted group the difference is more than 0.3; this falls to 0.25 for the partially correct group, and the incorrect group has a smaller average difference still. Clearly, given the importance of conservation scores in making predictions, a lack of differentiation between the conservation of catalytic and non-catalytic residues will reduce the overall accuracy.
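The conservation comparison described above can be sketched in a few lines. This is a minimal illustration only: the function names, the per-protein averaging order and the toy scores are assumptions, and the original analysis may have pooled residues differently.

```python
from statistics import mean


def conservation_gap(conservation, catalytic_idx):
    """Mean conservation of the catalytic residues minus the mean over all residues."""
    catalytic = [conservation[i] for i in catalytic_idx]
    return mean(catalytic) - mean(conservation)


def mean_gap_by_group(proteins):
    """Average the per-protein gap within each outcome group.

    `proteins` holds (group, conservation_scores, catalytic_indices) tuples,
    where group is "correct", "partial" or "incorrect".
    """
    gaps = {}
    for group, conservation, catalytic_idx in proteins:
        gaps.setdefault(group, []).append(conservation_gap(conservation, catalytic_idx))
    return {group: mean(values) for group, values in gaps.items()}


# Invented toy scores, for illustration only (not data from the study).
toy_proteins = [
    ("correct", [0.9, 0.4, 0.5, 0.6], [0]),
    ("correct", [0.3, 0.8, 0.5, 0.4], [1]),
    ("incorrect", [0.5, 0.6, 0.7, 0.6], [0]),
]
group_gaps = mean_gap_by_group(toy_proteins)
for group, gap in group_gaps.items():
    print(f"{group}: {gap:+.2f}")
```

In this toy input the "correct" group has a positive gap (catalytic residues stand out from the background) while the "incorrect" group has a negative one, which is the kind of contrast the comparison is designed to expose.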
This trend implies that unusual conservation scores are responsible for a large part of the failure rate. The low difference in conservation scores in the failure group could be explained if these proteins all had low DOPS. The DOPS for each protein chain were averaged for each category and are also shown in Figure 13. There is a correlation between DOPS and the success of a prediction; however, looking at the scores themselves shows that, although some chains have very low DOPS, most are just as high as the average correctly predicted protein. If low DOPS were responsible for all of the failures then one would not expect the average conservation of the catalytic residues to vary across the three groups. However, a clear trend of increasing catalytic conservation in the correct predictions is detected and shown in Figure 13.

How then to explain these anomalous conservation scores? The assumption must be that these enzymes are part of a larger family of proteins, which have different catalytic activities. Catalytic residues conserved within a sub-family would therefore vary between members of the family and not necessarily be conserved. Several examples of this can be seen in the failed structures. Calpain (1DKV), for instance, contains an EF-hand domain, which is even found in non-enzymes. This means the catalytic residues of Calpain are not conserved in many of the homologues that a PSI-BLAST search returns, whilst other residues involved in forming the EF-hand are conserved. This pattern of conservation is the inverse of what the network is expecting, and so it fails to correctly predict the catalytic residues.

Clustering errors

Of the 26 structures that failed to find the active site when only the top site was considered, ten also failed when all sites were considered. In these ten cases the error occurs prior to clustering, generally with poor alignments from PSI-BLAST.
Of the remaining 16, 11 generated a lower scoring correct cluster and five generated a lower scoring partially correct cluster. These 16 cases are failures of the clustering algorithm to find the right cluster, presumably because the signal from the true active site was weak compared to other sites in the protein. If each structure is analysed by hand, the fault is generally obvious. The single-linkage algorithm is prone to forming long aspherical clusters, since two separate clusters can be joined even if only a single residue joins them. In several failures the true active site is a relatively compact cluster with a few high-scoring residues, whilst the top scoring prediction is a large cluster which out-scores the others by its size even if no single residue scores highly. Another problem is that the algorithm tends to select clusters buried in the protein, since these contain more residues than surface clusters; a human can easily spot that these are not suitable active sites.

Figure 13. The difference in DOPS and conservation between catalytic and non-catalytic residues in the three groups of results.

Further work

Analysing the structures for high-scoring surface patches, as well as simple clusters, might help in identifying the location of active sites, particularly if the top scoring cluster is deeply buried and hence unsuitable as a catalytic centre. Patch analysis has been used to identify other surface features, such as protein-protein interaction sites and ligand binding pockets. The predicted clusters can also be used to automatically generate three-dimensional templates for analysis by one of the pattern searching algorithms, such as TESS and SPASM. Designing templates by hand is a time-consuming job, and automated methods, such as this and the method recently described by Oldfield,²² could be used to quickly generate starting templates suitable for manual refinement. The basic methodology of neural network scoring of residues and spatial clustering could be used to find other types of functional sites, such as non-obligate protein-protein interfaces or protein-DNA interaction sites. Many of the secondary clusters found by this network may be more confidently predicted by networks trained on these other functional classes.
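The chaining behaviour of single-linkage clustering noted under "Clustering errors" can be illustrated with a short sketch. It assumes the 4 Å any-atom criterion and the centroid-based sphere given under "Ranking and clustering" in Materials and Methods; the function names, the union-find implementation and the toy coordinates are illustrative assumptions, not the authors' code.

```python
import math
from itertools import combinations


def min_atom_distance(atoms_a, atoms_b):
    """Smallest distance between any atom of one residue and any atom of another."""
    return min(math.dist(a, b) for a in atoms_a for b in atoms_b)


def single_linkage(residues, cutoff=4.0):
    """Join residues whose closest atoms are less than `cutoff` apart.

    `residues` maps a residue name to a list of (x, y, z) atom coordinates.
    Single-residue clusters are discarded.
    """
    parent = {name: name for name in residues}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in combinations(residues, 2):
        if min_atom_distance(residues[a], residues[b]) < cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for name in residues:
        clusters.setdefault(find(name), set()).add(name)
    return [members for members in clusters.values() if len(members) > 1]


def bounding_sphere(coords):
    """Centroid of the coordinates and the radius that encloses all of them."""
    centre = tuple(sum(c[i] for c in coords) / len(coords) for i in range(3))
    radius = max(math.dist(centre, c) for c in coords)
    return centre, radius


# Invented one-atom "residues" spaced 3.5 Å apart along x, plus one distant residue.
toy_residues = {
    "A": [(0.0, 0.0, 0.0)],
    "B": [(3.5, 0.0, 0.0)],
    "C": [(7.0, 0.0, 0.0)],
    "D": [(10.5, 0.0, 0.0)],
    "E": [(30.0, 0.0, 0.0)],
}
clusters = single_linkage(toy_residues)
centre, radius = bounding_sphere([toy_residues[name][0] for name in sorted(clusters[0])])
print(clusters, centre, radius)
```

Residues A and D end up in the same cluster although they are 10.5 Å apart, because each neighbouring pair lies within the cut-off. This is how a few bridging residues can produce a long, aspherical cluster.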
A novel protein structure could be presented to each network in turn and different types of functional sites identified at each stage.

Materials and Methods

Protein test set

The protein test set and the compilation of the data are described in detail in a recent paper by Bartlett et al.³⁴ The original test set contains some proteins with homologous non-catalytic domains; for this study these redundant structures have been removed. The final test set contains 159 proteins from the PDB,⁵⁴ containing no homologous pairs and covering all six top-level enzyme classification (EC)¹⁴ numbers. This data set contains approximately 55,000 non-catalytic residues and 550 catalytic residues available for training the network.

Compilation of data

The catalytic residues were defined using the following rules:
(1) Direct involvement in the catalytic mechanism (e.g. as a nucleophile).
(2) Exerting an effect on another residue or water molecule, which is directly involved in the catalytic mechanism, which aids catalysis (e.g. by electrostatic or acid-base action).
(3) Stabilisation of a proposed transition-state intermediate.
(4) Exerting an effect on a substrate or cofactor which aids catalysis, e.g. by polarising a bond which is to be broken. Includes steric and electrostatic effects.

Note that residues that bind substrate, cofactor or metal ions are not included, unless they also perform one of the functions listed above. Many studies have used the SITE records defined in PDB files as the basis for defining functional residues and sites. Unfortunately SITE records are not a homogeneous data set, and there are no fixed rules on what may or may not be included in a SITE entry. Only 13 of the 159 PDB files in our data set contain SITE records, less than 10%. These 13 structures contain 50 catalytic residues, as defined above, and 94 SITE residues. The overlap between these two groups contains 36 residues.
We find therefore that in our data set 28% of catalytic residues are not found in the SITE records and only 38% of SITE residues are catalytic.

The following parameters were derived for each residue (catalytic and non-catalytic) in all 159 proteins:
• Conservation. The sequence of each chain in the protein was used to initiate a PSI-BLAST search of the NCBI non-redundant database (NRDB) with a fixed E-value cut-off for inclusion in the next iteration. Each PSI-BLAST search was run to convergence or a maximum of 20 iterations. The final multiple alignment generated by PSI-BLAST was then scored for conservation and DOPS as described by Valdar et al.³⁶
• Relative Solvent Accessibility (RSA). NACCESS³⁷ was used with standard parameters to calculate the RSA of each residue.
• Secondary structure. DSSP⁵⁵ was used to extract the secondary structure for each residue. The DSSP classification was simplified to three categories: helix, sheet or coil/other.
• Cleft. Surfnet⁵⁶ was used to define in which, if any, cleft the residue lay. If a residue lay in two or more clefts only the largest was recorded.
• Depth. The depth of a residue within the protein structure is defined as the average minimum distance between each of its atoms and the closest solvent accessible atom in the structure. NACCESS was used to define solvent accessibility.

Encoding and generation of data sets

Conservation, as calculated above, is already encoded as a suitably scaled factor between 0 and 1 (0 for no conservation and 1 for perfect conservation) and so is passed

to the network as is. The RSA is a percentage and is scaled to between 0 and 1 before presentation to the network. Depth is scaled so that the deepest residue in each structure is scored 1 and surface residues 0. The other parameters (residue type, secondary structure and cleft) are categorical in nature, and are encoded using 1-of-C encoding. Amino acid type is encoded as an array of 20 inputs where one input is set to 1 and the rest to 0. Secondary structure is encoded by three input parameters. Cleft size is divided into four categories: no cleft, largest cleft, second or third largest cleft and fourth to ninth largest cleft. An example encoding is shown in Figure 14 for a serine residue with conservation 0.7, DOPS score 0.9, depth 0.3, RSA 15%, in a coil region and lying in the largest cleft.

Figure 14. Example of the neural network input encoding.

Training the neural network

The neural network software used is FFNN,⁵⁷ a feed-forward neural network trained using a scaled conjugate gradients algorithm. A single-layer architecture is used in all cases. In order to accurately measure the performance of the network it is trained using a ten-fold cross-validation experiment. The dataset is divided into ten equal subgroups, and then in each training run nine of the groups are used for training, whilst the network is tested on the single remaining group. The network is run ten times using a different subgroup as the test group each time. Here the dataset was divided by structure rather than residue, so each subgroup contains the data for approximately 16 structures. The ratio of catalytic to non-catalytic residues is approximately 1:60 in the training set. Presenting the data in this ratio causes the net to predict every residue as non-catalytic. The best balanced training set was found to have a ratio of 1:6.
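The structure-wise ten-fold split and the 1:6 balancing of the training data might be sketched as follows. This is an illustration under stated assumptions, not the FFNN workflow itself: the function names, the use of a fixed random seed and the toy data are all hypothetical, and only the training rows are balanced here.

```python
import random


def structure_folds(structure_ids, k=10, seed=0):
    """Shuffle the structure identifiers and deal them into k near-equal folds."""
    ids = sorted(structure_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]


def balance(rows, ratio=6, seed=0):
    """Keep every catalytic row and a random sample of non-catalytic rows (1:ratio)."""
    positives = [row for row in rows if row[1] == 1]
    negatives = [row for row in rows if row[1] == 0]
    keep = min(len(negatives), ratio * len(positives))
    return positives + random.Random(seed).sample(negatives, keep)


def cross_validation_splits(data, k=10, ratio=6, seed=0):
    """Yield (balanced_training_rows, untouched_test_rows) for each fold.

    `data` maps a structure identifier to its list of (features, label) rows,
    so all residues of one structure stay on the same side of the split.
    """
    folds = structure_folds(data, k, seed)
    for i, test_ids in enumerate(folds):
        train = [row for j, fold in enumerate(folds) if j != i
                 for sid in fold for row in data[sid]]
        test = [row for sid in test_ids for row in data[sid]]
        yield balance(train, ratio, seed), test


# Invented toy data: 20 structures, each with 1 catalytic and 60 non-catalytic rows.
toy = {f"s{i:02d}": [((i, r), 1 if r == 0 else 0) for r in range(61)] for i in range(20)}
splits = list(cross_validation_splits(toy))
train_rows, test_rows = splits[0]
print(len(splits), len(train_rows), len(test_rows))
```

Splitting by structure keeps residues from one protein out of both the training and test groups at once, so the test score is not inflated by residues that share a structural context with the training data.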
Each training group is balanced by discarding a random selection of the non-catalytic residues prior to training. Training was for 100 epochs; in every case the network converged to a stable error level before training was terminated. The number of training epochs was not optimised, and in particular the performance of the test set was not used to optimise the stopping point in any way.

Measuring performance

In order to judge the neural network learning process, a suitable measure of performance is required. Total error (percentage of incorrect predictions) is not sufficient due to the highly unbalanced nature of the dataset. All of the statistics are derived from the following quantities:
p = Number of correctly classified catalytic residues.
n = Number of correctly classified non-catalytic residues.
o = Number of non-catalytic residues incorrectly predicted to be catalytic (over-predictions).
u = Number of catalytic residues incorrectly predicted to be non-catalytic (under-predictions).
t = Total residues (p + n + o + u).

The overall percentage of correct predictions (Q_Total), the complement of the total error, is given by equation (2):

$$Q_{\mathrm{Total}} = \frac{p + n}{t} \times 100 \qquad (2)$$

To complement this, two other measures of performance are used: Q_Predicted measures the percentage of catalytic predictions that are correct and Q_Observed measures the percentage of catalytic residues that are correctly predicted. The formulae for these two parameters are shown in equations (3) and (4):

$$Q_{\mathrm{Predicted}} = \frac{p}{p + o} \times 100 \qquad (3)$$

$$Q_{\mathrm{Observed}} = \frac{p}{p + u} \times 100 \qquad (4)$$

A measure of performance that takes both these factors into account is the MCC.
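As a minimal sketch, the counts and percentage measures defined above, together with the MCC in its standard Matthews correlation coefficient form, could be computed as follows. The function name `performance`, the convention of returning 0 when a denominator is zero, and the example counts are illustrative assumptions rather than part of the original method.

```python
import math


def performance(p, n, o, u):
    """Return Q_Total, Q_Predicted, Q_Observed (percentages) and the MCC.

    p, n: correctly classified catalytic / non-catalytic residues
    o, u: over-predictions (false positives) / under-predictions (false negatives)
    """
    t = p + n + o + u
    q_total = 100.0 * (p + n) / t
    # Assumed convention: report 0 when a denominator is zero.
    q_predicted = 100.0 * p / (p + o) if (p + o) else 0.0
    q_observed = 100.0 * p / (p + u) if (p + u) else 0.0
    denominator = math.sqrt((p + o) * (p + u) * (n + o) * (n + u))
    mcc = (p * n - o * u) / denominator if denominator else 0.0
    return {"Q_Total": q_total, "Q_Predicted": q_predicted,
            "Q_Observed": q_observed, "MCC": mcc}


# Invented counts, for illustration only (not results from the study).
example = performance(p=50, n=9600, o=300, u=50)
# Predicting every residue as non-catalytic scores well on Q_Total but has MCC 0.
all_negative = performance(p=0, n=9900, o=0, u=100)
print(example)
print(all_negative)
```

The second call shows why the total percentage correct is a poor guide on unbalanced data: a predictor that never calls a residue catalytic is right 99% of the time in this toy case, yet finds no catalytic residues and has an MCC of zero.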
The formula for calculating MCC is shown in equation (5):

$$\mathrm{MCC} = \frac{pn - ou}{\sqrt{(p + o)(p + u)(n + o)(n + u)}} \qquad (5)$$

Ranking and clustering

The residues in each structure are ranked by network score, and all residues scoring above a cut-off value are used in the clustering algorithm. A pair of residues is clustered together if any of their atoms lie within 4 Å of each other. Each cluster is then defined as a sphere with its centre at the geometric centroid of all the Cβ atoms of the component residues (Cα for glycine) and a radius such that all the Cβ atoms lie within the sphere. The first ranking cut-off was set at 35% of the highest scoring residue. If any sphere in a structure had a radius greater than 15 Å, the clustering was repeated, increasing the ranking cut-off by 1% until no sphere was greater than 15 Å in radius. Single-residue clusters were discarded at this stage.

The definition of the known sites is the same. Spheres were defined for each active site with centres at the centroid of the Cβ atoms and radii such that all the Cβ atoms are within the sphere. For proteins with a single catalytic residue the radius was set to 3 Å.

Acknowledgements

Thanks to Dr Adrian Shepherd at UCL and Dr Craig Porter at the EBI for useful discussions on neural networks, clustering algorithms and other topics. Thanks to the Medical Research Council for financial support. G.J.B. was supported by a BBSRC CASE studentship in association with Roche Products Ltd.

References

1. Burge, C. & Karlin, S. (1997). Prediction of complete

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods

Protein structure. Protein structure. Amino acid residue. Cell communication channel. Bioinformatics Methods Cell communication channel Bioinformatics Methods Iosif Vaisman Email: ivaisman@gmu.edu SEQUENCE STRUCTURE DNA Sequence Protein Sequence Protein Structure Protein structure ATGAAATTTGGAAACTTCCTTCTCACTTATCAGCCACCT...

More information

Packing of Secondary Structures

Packing of Secondary Structures 7.88 Lecture Notes - 4 7.24/7.88J/5.48J The Protein Folding and Human Disease Professor Gossard Retrieving, Viewing Protein Structures from the Protein Data Base Helix helix packing Packing of Secondary

More information

Structure to Function. Molecular Bioinformatics, X3, 2006

Structure to Function. Molecular Bioinformatics, X3, 2006 Structure to Function Molecular Bioinformatics, X3, 2006 Structural GeNOMICS Structural Genomics project aims at determination of 3D structures of all proteins: - organize known proteins into families

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

Viewing and Analyzing Proteins, Ligands and their Complexes 2

Viewing and Analyzing Proteins, Ligands and their Complexes 2 2 Viewing and Analyzing Proteins, Ligands and their Complexes 2 Overview Viewing the accessible surface Analyzing the properties of proteins containing thousands of atoms is best accomplished by representing

More information

Advanced Certificate in Principles in Protein Structure. You will be given a start time with your exam instructions

Advanced Certificate in Principles in Protein Structure. You will be given a start time with your exam instructions BIRKBECK COLLEGE (University of London) Advanced Certificate in Principles in Protein Structure MSc Structural Molecular Biology Date: Thursday, 1st September 2011 Time: 3 hours You will be given a start

More information

Properties of amino acids in proteins

Properties of amino acids in proteins Properties of amino acids in proteins one of the primary roles of DNA (but not the only one!) is to code for proteins A typical bacterium builds thousands types of proteins, all from ~20 amino acids repeated

More information

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha Outline Goal is to predict secondary structure of a protein from its sequence Artificial Neural Network used for this

More information

Detection of Protein Binding Sites II

Detection of Protein Binding Sites II Detection of Protein Binding Sites II Goal: Given a protein structure, predict where a ligand might bind Thomas Funkhouser Princeton University CS597A, Fall 2007 1hld Geometric, chemical, evolutionary

More information

Protein Structure. W. M. Grogan, Ph.D. OBJECTIVES

Protein Structure. W. M. Grogan, Ph.D. OBJECTIVES Protein Structure W. M. Grogan, Ph.D. OBJECTIVES 1. Describe the structure and characteristic properties of typical proteins. 2. List and describe the four levels of structure found in proteins. 3. Relate

More information

BIRKBECK COLLEGE (University of London)

BIRKBECK COLLEGE (University of London) BIRKBECK COLLEGE (University of London) SCHOOL OF BIOLOGICAL SCIENCES M.Sc. EXAMINATION FOR INTERNAL STUDENTS ON: Postgraduate Certificate in Principles of Protein Structure MSc Structural Molecular Biology

More information

Translation. A ribosome, mrna, and trna.

Translation. A ribosome, mrna, and trna. Translation The basic processes of translation are conserved among prokaryotes and eukaryotes. Prokaryotic Translation A ribosome, mrna, and trna. In the initiation of translation in prokaryotes, the Shine-Dalgarno

More information

CHAPTER 29 HW: AMINO ACIDS + PROTEINS

CHAPTER 29 HW: AMINO ACIDS + PROTEINS CAPTER 29 W: AMI ACIDS + PRTEIS For all problems, consult the table of 20 Amino Acids provided in lecture if an amino acid structure is needed; these will be given on exams. Use natural amino acids (L)

More information

Supplemental Materials for. Structural Diversity of Protein Segments Follows a Power-law Distribution

Supplemental Materials for. Structural Diversity of Protein Segments Follows a Power-law Distribution Supplemental Materials for Structural Diversity of Protein Segments Follows a Power-law Distribution Yoshito SAWADA and Shinya HONDA* National Institute of Advanced Industrial Science and Technology (AIST),

More information

Physiochemical Properties of Residues

Physiochemical Properties of Residues Physiochemical Properties of Residues Various Sources C N Cα R Slide 1 Conformational Propensities Conformational Propensity is the frequency in which a residue adopts a given conformation (in a polypeptide)

More information

Read more about Pauling and more scientists at: Profiles in Science, The National Library of Medicine, profiles.nlm.nih.gov

Read more about Pauling and more scientists at: Profiles in Science, The National Library of Medicine, profiles.nlm.nih.gov 2018 Biochemistry 110 California Institute of Technology Lecture 2: Principles of Protein Structure Linus Pauling (1901-1994) began his studies at Caltech in 1922 and was directed by Arthur Amos oyes to

More information

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure

Secondary Structure. Bioch/BIMS 503 Lecture 2. Structure and Function of Proteins. Further Reading. Φ, Ψ angles alone determine protein structure Bioch/BIMS 503 Lecture 2 Structure and Function of Proteins August 28, 2008 Robert Nakamoto rkn3c@virginia.edu 2-0279 Secondary Structure Φ Ψ angles determine protein structure Φ Ψ angles are restricted

More information

Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell

Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell Using Higher Calculus to Study Biologically Important Molecules Julie C. Mitchell Mathematics and Biochemistry University of Wisconsin - Madison 0 There Are Many Kinds Of Proteins The word protein comes

More information

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS

PROTEIN SECONDARY STRUCTURE PREDICTION: AN APPLICATION OF CHOU-FASMAN ALGORITHM IN A HYPOTHETICAL PROTEIN OF SARS VIRUS Int. J. LifeSc. Bt & Pharm. Res. 2012 Kaladhar, 2012 Research Paper ISSN 2250-3137 www.ijlbpr.com Vol.1, Issue. 1, January 2012 2012 IJLBPR. All Rights Reserved PROTEIN SECONDARY STRUCTURE PREDICTION:

More information

Table 1. Crystallographic data collection, phasing and refinement statistics. Native Hg soaked Mn soaked 1 Mn soaked 2

Table 1. Crystallographic data collection, phasing and refinement statistics. Native Hg soaked Mn soaked 1 Mn soaked 2 Table 1. Crystallographic data collection, phasing and refinement statistics Native Hg soaked Mn soaked 1 Mn soaked 2 Data collection Space group P2 1 2 1 2 1 P2 1 2 1 2 1 P2 1 2 1 2 1 P2 1 2 1 2 1 Cell

More information

DOCKING TUTORIAL. A. The docking Workflow

DOCKING TUTORIAL. A. The docking Workflow 2 nd Strasbourg Summer School on Chemoinformatics VVF Obernai, France, 20-24 June 2010 E. Kellenberger DOCKING TUTORIAL A. The docking Workflow 1. Ligand preparation It consists in the standardization

More information

Section Week 3. Junaid Malek, M.D.

Section Week 3. Junaid Malek, M.D. Section Week 3 Junaid Malek, M.D. Biological Polymers DA 4 monomers (building blocks), limited structure (double-helix) RA 4 monomers, greater flexibility, multiple structures Proteins 20 Amino Acids,

More information

Detailed description of overall and active site architecture of PPDC- 3dThDP, PPDC-2HE3dThDP, PPDC-3dThDP-PPA and PPDC- 3dThDP-POVA

Detailed description of overall and active site architecture of PPDC- 3dThDP, PPDC-2HE3dThDP, PPDC-3dThDP-PPA and PPDC- 3dThDP-POVA Online Supplemental Results Detailed description of overall and active site architecture of PPDC- 3dThDP, PPDC-2HE3dThDP, PPDC-3dThDP-PPA and PPDC- 3dThDP-POVA Structure solution and overall architecture

More information

Structure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase Cwc27

Structure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase Cwc27 Acta Cryst. (2014). D70, doi:10.1107/s1399004714021695 Supporting information Volume 70 (2014) Supporting information for article: Structure and evolution of the spliceosomal peptidyl-prolyl cistrans isomerase

More information

UNIT TWELVE. a, I _,o "' I I I. I I.P. l'o. H-c-c. I ~o I ~ I / H HI oh H...- I II I II 'oh. HO\HO~ I "-oh

UNIT TWELVE. a, I _,o ' I I I. I I.P. l'o. H-c-c. I ~o I ~ I / H HI oh H...- I II I II 'oh. HO\HO~ I -oh UNT TWELVE PROTENS : PEPTDE BONDNG AND POLYPEPTDES 12 CONCEPTS Many proteins are important in biological structure-for example, the keratin of hair, collagen of skin and leather, and fibroin of silk. Other

More information
