proteins CASP Progress Report Progress from CASP6 to CASP7 Andriy Kryshtafovych, 1 Krzysztof Fidelis, 1 and John Moult 2 *

Size: px
Start display at page:

Download "proteins CASP Progress Report Progress from CASP6 to CASP7 Andriy Kryshtafovych, 1 Krzysztof Fidelis, 1 and John Moult 2 *"

Transcription

1 proteins STRUCTURE O FUNCTION O BIOINFORMATICS CASP Progress Report Progress from CASP6 to CASP7 Andriy Kryshtafovych, 1 Krzysztof Fidelis, 1 and John Moult 2 * 1 Genome Center, University of California, Davis, California Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland ABSTRACT This article describes the general quality of models of three dimensional structure submitted to CASP7 and analyzes progress since the previous experiment, primarily using measures that were used in earlier analyses. Overall improvement in model accuracy compared to CASP6 is modest, but there are two developments of note: server performance has moved closer to that of humans, and there has been a significant improvement in the fraction of targets for which the best model is superior to that obtainable using knowledge of a single best template structure. Proteins 2007; 69(Suppl 8): VC 2007 Wiley-Liss, Inc. Key words: protein structure prediction; community wide experiment; CASP. INTRODUCTION This article examines progress in structure modeling between the CASP6 experiment 1 held in 2004 and CASP7, held in 2006, in the context of the full record of the CASP experiments, carried out every 2 years since Details of the CASP7 experiment can be found in the introduction to this special issue of Proteins 2 and the articles by the three assessors focus more on the state of the art now. 3 5 The analysis methods have mostly been introduced in the earlier articles. 6,7 In addition, we have introduced a more thorough analysis of the relationship between the best models submitted for each target and those that a naïve modeler might produce. We include only results in the core areas of CASP, concerned with producing three dimensional models of complete proteins or protein domains. As before, we analyze two aspects of progress how the quality of the very best models is improving, and to a lesser extent, how the quality of models produced in the field as a whole is advancing. Best performance is evaluated by comparing the most accurate models of targets of comparable difficulty in different CASPs. Progress in the field as a whole is evaluated by comparing the average accuracy of the six best models for a target with the average accuracy of models in other CASPs for targets of similar difficulty. GENERAL CONSIDERATIONS Relative target difficulty As in the earlier progress assessments, we use a two-dimensional scale to estimate difficulty, incorporating the similarity of the protein sequence to that of a protein with known structure, and the similarity of the structure of the target protein to potential templates (see Methods). Some other significant factors that affect modeling difficulty are not considered, particularly the number and phylogenetic distribution of related sequences and the number and structural distribution of available templates. The set of related sequences will influence whether or not an evolutionary relationship can be detected, and also the quality of the alignment that can be generated. As discussed later, The authors state no conflict of interest. Grant sponsor: NIH; Grant number: LM *Correspondence to: John Moult, Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD moult@umbi.umd.edu Received 14 May 2007; Revised 21 July 2007; Accepted 23 July 2007 Published online 5 October 2007 in Wiley InterScience ( DOI: /prot PROTEINS VC 2007 WILEY-LISS, INC.

2 Progress from CASP6 to CASP7 Figure 1 Distribution of target difficulty. For each CASP, the difficulty of producing an accurate model is shown as function of the fraction of each target that can be superimposed on a known structure (horizontal axis) and the sequence identity between target and template for the superimposed portion (vertical axis). In all CASPs, targets span a wide range of difficulty. additional templates may provide models for regions of structure not present in the single best one. Omitting these factors adds some noise to the relationship between model quality and our difficulty scale. Domains Many target structures consist of two or more structural domains. Since domains within the same structure may present modeling problems of different difficulty, assessment in CASPs 4 7 has treated each identifiable domain as a separate target. Assessors have the advantage of deriving domains from the experimental structure, whereas predictors only have the sequence. In any case, domain definitions are nearly always subjective. For most of the analysis, we subdivide template based modeling (TBM) targets into domains only if these divisions are likely identifiable by a predictor, and require different modeling approaches (i.e., belong to different difficulty categories), or the domains are sequentially related to different templates. There are seven such targets in CASP7. For evaluation of nontemplate based models all domains identified by the assessors are treated as separate targets. TARGET DIFFICULTY ANALYSIS Figure 1 shows the distribution of target difficulty for all CASPs, as a function of structure and sequence similarity between the experimental structure of each target and the corresponding best available template. 8 Targets span a wide range of structure and sequence similarity in all the CASPs (see Ref. 9 for a more detailed discussion). The distribution for CASP7 is similar, with the notable feature that there are fewer targets at a low level of structure superposability. As in previous CASPs, we derive a one dimensional scale of target difficulty, combining the structure superposability and sequence identity (see Methods). As discussed previously, 9 this difficulty scale has obvious deficiencies. However, so far neither ourselves nor others have suggested a more appropriate alternative. In CASP7, targets were divided on the basis of difficulty a little differently than in previous experiments. Targets with a high degree of sequence and structural similarity to an available template were considered as producing potentially high accuracy models. Targets were assigned to this category on the basis of two criteria. First, to ensure that there was a good template in the DOI /prot PROTEINS 195

3 A. Kryshtafovych et al. PDB at the time predictions were made, structural superpositions had to identify at least one template with an LGA-S score 10 of greater than 80. Second, to ensure that it was possible to construct a good model, it was required that at least one model must give a GDT-TS score (see Methods and next section) of greater than 80. These high accuracy targets and all other targets that could be built with the aid of a domain level template were considered as TBM targets, and remainder, those for which there was no available template, were considered to be Free modeling (FM) targets. (These changes are discussed more fully elsewhere in this issue 2 ). As with the previous divisions, these map approximately to the difficulty scale, with high accuracy targets at the easy end, the rest of the template based targets taking up the bulk of the range, and template free targets at the most difficult end. As before, though, the mapping boundaries are not completely sharp. OVERALL MODEL QUALITY As before, the GDT_TS measure 11 is the key standard for the overall quality. GDT_TS is based on the fraction of residues of a target for which the Ca atoms are within 1, 2, 4, and 8 Å of the corresponding atoms in a model (see Methods for a precise definition). For relatively accurate comparative models (in the High Accuracy regime), almost all residues will likely fall under the 8 Å cutoff, and many will be under 4 Å, so that the 1 and 2 Å thresholds capture most of the variations in model quality. In the most difficult template FM regime, on the other hand, few residues fall under the 1 and 2 Å thresholds, and the larger thresholds capture most of the variation between models. For the bulk of the template based models, all four thresholds will often play a significant role. Template free models are often very approximate and measuring model quality is more difficult. In this CASP, the assessor carried out a careful analysis of the relationship between GDT_TS ranking and that obtained by visual inspection. As in previous CASPs these sometimes differ, but we have not considered these variations in the analysis. For High Accuracy regime models, accuracy improvements are inherently small scale, since templates and targets tend to be quite similar. For this reason, we also considered the finer-grained measure GDT_HA, where the thresholds are 0.5, 1, 2, and 4 Å. Generally, the conclusions are not affected by which measure is used (results not shown), so only GDT_TS results are presented here. Figure 2(A) shows the GDT_TS scores for the best model on each target as a function of target difficulty, for all CASPs, with each point an average over the target itself and the two targets to the left and to the right (if available) on the difficulty scale. Quadratic splines have been fitted through the data for each CASP. Note that throughout the article, best is selected from all submitted models. Predictors may submit up to five models per target, and are asked to rank expected quality. Assessment of template-based models usually included only model 1. At the high accuracy (left) side of the plot, models based on 50% or higher sequence identity have GDT_TS scores around 90 or higher, close to the limit imposed by experimental accuracy. As target difficulty increases, best scores fall off rapidly and steadily. From one point of view, only the models based on very high sequence identity are close to perfect by this criterion. From another, it is very encouraging to note the steady improvement in the trend lines in successive CASPs. In the mid-range of difficulty, scores have about doubled, from around 30 in CASP1 to 60 in CASPs 6 and 7. Between any particular pair of CASPs, improvement has been different in magnitude and in region. Between 6 and 7, the only significant change has been at the hard end of the difficulty scale, where two targets with high quality best models (283 and 350) pull up the curve. Figure 2(B) shows the smoothed average GDT_TS values over the six best models from different groups, rather than the single best. There is a similar improvement at the difficult end of TBM region. ALIGNMENT ACCURACY For template-based models correct alignment of the target sequence onto available template structures is a critical and often demanding step. As in previous analyses, we measure alignment accuracy (AL0) by counting the number of correctly aligned residues in the LGA 5 Å superposition of the modeled and experimental structures of a target. A model residue is considered to be correctly aligned if the Ca atom falls within 3.8 Å of the corresponding atom in the experimental structure, and there is no other experimental structure Ca atom nearer. Figure 3 shows the smoothed alignment accuracy for the best models of each target in all the CASPs. Features here are similar to those in the GDT_TS plot (2A), with alignment accuracy near 100% for the easiest targets, but falling steadily with target difficulty. The spline fits show a steady improvement in alignment accuracy over the CASPs, this time with the gain at the difficult end of TBM, in a manner very similar to that seen with GDT_TS. FEATURES OF MODELS NOT AVAILABLE FROM A SINGLE BEST TEMPLATE An important criterion in establishing the effectiveness of TBM methods is whether or not a model contains features in addition to those that could be simply copied from the best possible template. That is, there is added value over and above the information in the template, and the model is therefore superior to that which an intelligent but naïve modeler might produce. To evaluate that, we con- 196 PROTEINS DOI /prot

4 Progress from CASP6 to CASP7 Figure 2 GDT_TS scores of submitted models for targets in all CASPs, as a function of target difficulty. Each point represents one target. Data are smoothed by averaging over sets of five consecutive targets. Part A shows the scores for the best models on each target, and part B, the average score over the top six models from different groups. Trend lines show a clear though sometimes modest improvement from each successive CASP to the next, for both the best models and the best sets of models. In CASP7, there is apparent progress only at the more difficult end of the target spectrum, where two targets for which outstandingly good models were made (283 and 350) influence the average. DOI /prot PROTEINS 197

5 A. Kryshtafovych et al. Figure 3 Percentage of residues correctly aligned for the best model of each target in all CASPs, smoothed by averaging over sets of five adjacent targets. Trend lines here follow those in the equivalent GDT_TS plot [Fig. 2(A)] indicating that for many targets, alignment accuracy, together with the fraction of residues that can be aligned to a single template, dominate model quality. In CASP7, the same apparent improvement for the more difficult targets is present. structed a set of models, two for each target (one for each measure considered GDT_TS and AL0), utilizing knowledge of the structure of the best template. It should be noted that the conclusions reached depend to significant extent on how these best template models are built. Here, we begin by finding an optimal templatetarget alignment as follows: We first find all target Ca atoms that are within 3.8 Å of any template Ca atom in the 5 Å LGA sequence independent superposition. Then, we use a dynamic programming procedure that determines the longest alignment between the two structures using these preselected atoms, in such a way that no atom is taken twice and all the atoms in the alignment are in the order of the sequence. The maximum alignability (Smith Waterman alignment score SWALI) is then the fraction of aligned Ca atoms in the target. Then we assign the coordinates of template s backbone residues to the aligned target residues. It is not always the case that the template with the best structural coverage of the target gets the highest score (GDT_TS or AL0) when superimposed onto the target. To ensure that templates yielding the highest score possible were used, for each target, we built template models based on each of the 10 templates with the best structural coverage, and picked the template models with the highest GDT_TS and AL0 score for the subsequent analysis (in many cases, the same model was selected and used for both analyses). These models were then evaluated exactly like the submitted ones. Figure 4(A) shows the difference between the GDT_TS score for the best submitted model and the GDT_TS highest scored template model, for all targets in CASPs 5, 6, and 7. Values greater than zero indicate that overall the submitted model is more accurate than the one obtained from the best template. Spline fits show that the average improvement of best models over template has increased for the template-based modeling targets in successive CASPs. The spline lines are at positive values for the easier targets, dipping below zero for the hardest ones. There are seven targets where the best model scores more than 10% higher in GDT_TS than the best template one. We also investigated the extent of value added features in the best models compared to single template models built using several other methods. All analyses revealed similar trends (results not shown). One of these methods used the Modeller 12 package in a restricted way, starting from the correct structural alignment to the best single 198 PROTEINS DOI /prot

6 Progress from CASP6 to CASP7 Figure 4 A. Difference in GDT_TS score between the best submitted model for each target, and a naïve model based on knowledge of the best single template. Values greater than zero indicate added value in the best model. For most of the target range, on average, best models do have added value. There are a few cases where this is substantial in excess of 10% improvement in GDT_TS. In the difficult target range, a small number of very poorly modeled targets tend to pull down the trend line, but most models still are more accurate than a naïve model. B. Percentage difference in correctly aligned residues for the best submitted model for each target and for the corresponding naïve model. Similar trends as in Fig. 5A are seen, except that the trend lines are only above zero for the 1/3 or so easiest targets. C. Fraction of targets for which the best submitted models are superior to the best template models in CASPs 5, 6, and 7. For both GDT_TS and AL0, the fraction has increased substantially from CASP5 to CASP6 and from CASP6 to CASP7. In CASP7, almost 80% of the TBM targets have a higher GDT_TS score than a naïve template model. DOI /prot PROTEINS 199

7 A. Kryshtafovych et al. template. The resulting models were better than the simple single template models by an average of around 2% in GDT_TS for the template-based modeling targets. But even with this improvement, the majority of these reference models are still less accurate than the corresponding best models. It should be emphasized that we constrained Modeller to use only one template, and compared the results of a single method with the best models produced by any method this is in no way a test of Modeller performance. Figure 4(B) shows the difference between the fraction of residues that are correctly aligned in the best template model and the fraction correctly aligned in the best submitted model. By this measure too, a large fraction of best models are better than the template and that fraction has been improving over successive CASPs. The falloff in alignment improvement over template with target difficulty is more pronounced than in the GDT_TS plot 4A, but never-the-less, there are some impressive improvements even for difficult targets, including one with 20% more residues aligned in the best model compared with the best template model. Note that Figure 4 compares the alignment accuracy of two models in terms of AL0 one submitted, the other based on knowledge of the template. The alignment evaluation algorithm used in CASP (see Methods) assigns a residue as correctly aligned if it is within 3.8 Å of the corresponding residue in the target structure in the sequence independent LGA superposition and no other experimental structure residue is closer, a demanding requirement. Figure 4(C) summarizes the comparison of the best submitted models with the best template ones, for CASPs 5, 6, and 7. The fraction of all best submitted models with a higher GDT_TS value than the corresponding best single template one has increased from 53% in CASP5, through 59% in CASP6% to 67% in CASP7 (a). If only the TBM targets are considered, the numbers are even more impressive: 66, 71, and 79% for CASPs 5, 6, and 7, respectively (b). A similar picture is apparent for alignment accuracy, with increases for each CASP, and CASP7 with 50% of all best models and 60% of best models for the template-based modeling targets more accurately aligned than the best template models (c and d). Model features not present in the best template may have been derived in three ways: by the identification of features that are present in other available template structures, by refining aligned regions away from a template structure towards the experimental one, or by the use of template FM methods for features not present in the template. Examples of all three processes can be found in the CASP7 results, and are illustrated below using an error difference function along the sequence. For each residue i : De i ¼ e i ðmodelþ e i ðtemplateþ; where e i (model) is the Ca error in the model for residue i, and e i (template) is the distance between the target Ca atom of residue i and the correctly aligned Ca atom from the template. These inter Ca distances are taken from the LGA sequence independent superposition of target and template. Negative De values thus represent regions of a model that are more accurate than could be obtained by simply correctly aligning the single best template. Target 315 is an example of a target for which a superior model was constructed with the use of multiple templates. Figure 5(A) shows the error difference function along the sequence for the best model. Two template/target differences are shown, for the best template, and the next best template. It is apparent that although the best template provides the most accurate basis for a model over-all, there are three regions where it is less accurate than the second choice. In these regions the best model successfully incorporates those parts of the second template. An example of an improvement using FM followed by refinement is provided by the best model of target 283 [Fig. 5(B)]. This is the point at about 120% GDT_TS units in Figure 4(A), and is officially a difficult TBM target. In this case, Figure 5(B) shows that the model is Figure 5 Examples of added value over a naïve template model. In each plot, the blue curve shows the accuracy of the best model, as the difference between it and the target structure for each residue s Ca atom. The thin yellow line shows maximum accuracy of a model built from the best template, as the difference between the best template and the target, and where present, the thin white line shows the difference between a second template and the target. Thick yellow and white lines show the difference between the corresponding templates and the model. Thus, negative regions in these curves are places where the model is more accurate than could achieved with the information from a single template alone. A. Target 315, an example of added value through incorporation of information from a second template to successfully model short loop regions. There are four regions where the model is substantially more accurate than could be obtained from the best template alone (dips in the thick yellow curve), and in each place, the model is close to the second template (thick white curve near to zero). Conversely, several regions where the second template would have provided a poor model (dips in the thick white curve) are built from the first template. B. Target 283, an example of improvement relative to the best available template through use a refinement procedure and starting from a more remote structure. It is clear that for much of the sequence, the model is much better than the best template (thick yellow curve well into negative territory), which was not used as the starting model. Here the alternative template shown (white) is the 10th best, and the model is close to it for a short region in the middle of the structure (thick white line close to zero). C. Target 380, an example of refinement starting from a nearby template. In this case, a refinement procedure has succeeded in improving the accuracy of the model over the template starting point substantially in two regions (very negative regions in the thick yellow curve), one at the N terminus, one near the C terminus. There are also other more minor improvements in many places. D. Target 328, an example of FM methods providing part of structure not available from a template. The thick yellow curve shows that the model is substantially more accurate than the template in the region between residues 30 and 50. The thin yellow curve shows the template is of little use there, and no other template is available for that region. 200 PROTEINS DOI /prot

8 Progress from CASP6 to CASP7 Figure 5 DOI /prot PROTEINS 201

9 A. Kryshtafovych et al. Figure 6 A. Percentage of server predictions among the best 6 models for three categories of target difficulty in CASPs 5, 6, and 7. Relative server performance improved overall, and the most for easy comparative models. CASP7 also shows a substantial improvement in the middle category of difficulty. There are likely too few targets in the most difficult category for trends to be reliable. (For CASPs 5 and 6: CM: comparative models, e : easy, h : hard; For CASP7, TMB_easy, and TBM_Hard are PSI-BLAST and non-psi-blast level TBM targets respectively; FRH: Homologous fold recognition; FRA: analogous fold recognition; NF: new folds). B. Percentage of modeling targets for which at least one server is among the top six best models. Here too, it is evident that relative server performance has improved overall, and most dramatically in the easy comparative modeling category. C. Ratio of the quality of best server models to best human models as a function of target difficulty, for CASPs 5, 6, and 7, as measured by GDT_TS. The spline fit for CASP7 runs very close to unity (equal quality for server and human models) over much of the difficulty range. There are two outstanding cases (T0321_1 and T0356_2) where the best server models have GDT_TS scores more than 25% higher than the corresponding human ones. There are also still cases (for example T0386_2 and T0283) where human input and extra time does greatly improve model accuracy over that provided by any server. D. As for 6C, but for performance of the best six server models compared with the best six human groups on each target. Here it is clear that humans still have the edge, with the spline significantly below unity for all CASPs for much of the difficulty range. Never-the-less, CASP7 performance is improved relative to CASP6. E. GDT_TS scores of best and average over 6 best server models for all targets in CASPs 6 and 7, as a function of target difficulty. Each point represents one target. Lines show best fits for each category. It is clear that there has been progress in server accuracy over the full range of target difficulty. more accurate than the best template for most of the structure. The submitters of this model (Group 20) explain that the refinement was started from a template free model of another member of the family, and that very extensive changes and additional features were introduced by a refinement procedure. This is thus an outstanding example of refinement in CASP7. An example of refinement from a nearby template is provided by target 380 [Fig. 5(C)]. This was a target with a relatively close template available, but never-theless the best model shows a substantial improvement over that template (almost 11% in GDT_TS). In general, large improvements over the template are harder to achieve for easier targets, because the template errors are small to begin with, and this is therefore an exceptional result. It is apparent that the overall improvement over the template is distributed over the full sequence, with two regions of major change, at the N terminus, and near to the C terminus. In all, seven out of nine possible regions have moved closer to the target structure. Finally, target 328 provides an example where FM methods were successfully used to model part of an otherwise template based structure. Figure 5(D) shows the relationship between template and model accuracy along the sequence. Between residues 30 and 50, there is discontinuous information from the template, and no alternative template can be identified. The best model (submitted by group 24) incorporated FM-based steps there and is reasonably accurate. SERVER PERFORMANCE In early CASPs, most predictions were produced by a combination of computer methods and human judgment, in a manner analogous to the process of obtaining a structure experimentally, using crystallography or NMR. One of the most striking developments has been the increasingly effective role played by fully automatic methods: first in providing starting models for human teams, but now increasingly providing models of comparable quality to that the very best human groups can produce. In addition, servers are the only option for high throughput modeling, so their performance is independently important. As in the previous evaluations of progress, 9 we have compared the accuracy of server models to that in earlier rounds, as well as to the corresponding human predictions. There were 53, 50, and 68 server groups submitting tertiary structure models in CASPs 5, 6, and 7, respectively. Figure 6(A) shows the fraction of server models that were among the best 6 overall and for three categories of target difficulty. Strikingly, this fraction increased substantially in CASP7, both over-all and in the two tem- 202 PROTEINS DOI /prot

10 Progress from CASP6 to CASP7 Figure 6 (Continued) DOI /prot PROTEINS 203

11 A. Kryshtafovych et al. Figure 7 Model quality for the best (part A) and averaged over the six best (part B) new fold category targets, for CASPs 6 and 7. For each target, the lowest bars show the number of residues superimposed between model and target to closer than 1 Å, the next bar, the number superimposed to 2 Å, then 4 Å, then8å. The open bars show number of residues superimposed to greater than 8 Å. Greek letters indicate the fold type. By this measure, there is no obvious progress from CASP6 to CASP7. plate modeling categories. For the easy comparative modeling category (BLAST detectable templates), the fraction has increased from about 10% in CASP5, to nearly 20% in CASP6, to over 30% in CASP7. For the harder TBM targets there is also a significant increase in CASP7. The signal is less clear for the template free category, but the small number of targets makes trends harder to reliably identify. Figure 6(B) shows another view of these data, the fraction of targets for which at least one of the best six models came from a server. A very similar pattern of increase is apparent, particularly in CASP7. Figure 6(C) shows the ratio of GDT_TS scores for the best server and best human models for each CASP target, arranged by target difficulty. This ratio has improved in CASP7. Strikingly, there are two CASP7 targets where the best models produced by servers were substantially superior to any human prediction (T0321_1 and T0356_2). More generally, there were 19 targets (18.5%) in CASP7 where servers were at least on par with humans (three or more models in the best 6) and very few targets where the best server model was significantly worse than the best human one. Are humans becoming redundant in protein structure modeling? A possible alternative explanation for the convergence of human server performance is that humans are getting worse, rather than servers getting better. Figure 6(E) shows this is not the case, by means of a comparison of server performance in CASP6 and CASP7 over the full range of target difficulty. It is clear that the accuracy of 204 PROTEINS DOI /prot

12 Progress from CASP6 to CASP7 Figure 8 Distribution of success in predicting new fold targets for individual groups in CASP6 (part A) and CASP7 (part B), compared with that expected by chance. Left hand panels (green bars) show the number of groups submitting predictions for at least 1, 2,... up to the maximum number of targets in each of these CASPs. In the right panels, red bars show the number of groups ranked in the top six for 1 target, two targets, and so on. Yellow bars show the distribution of ranking expected by chance. In CASP6, there are three groups with six top ranked predictions, already very significantly more than random, and also single groups top ranked for 7, 8, 9, and 13 targets. In CASP7, with a smaller number of targets, there are still five groups with rankings counts falling completely outside the random distribution, and in general, observed rankings are shifted substantially away from random. Thus, the trend towards much better than random performance has been maintained, but not obviously strengthened. the best models and the best six models did improve significantly. TEMPLATE FREE METHODS Template free modeling methods must be used when either the target protein or domain has a fold topology not previously known (i.e., it s a so-called new fold ), or possible templates are too remote in sequence to be detected. From a very poor beginning, we have seen substantial progress in this area over the history of CASP, up to about CASP5. At that point, there were a small number of targets (usually less than about 120 residues) for which very impressive and accurate models were submitted, but the quality of the best models for the majority of targets was very much poorer than for targets where templates could be used. 7 In CASP6, there were again a small number of targets with excellent models, and notably these included mixed a/b structures, previously problematic. However, no significant overall increase could be detected. 9 An ongoing issue in this area is the need for better metrics to evaluate very approximate models. This problem is still not solved, so for CASP7, we again use the measures of earlier analyses. Specifically, we consider the four terms that contribute to the GDT_TS measure separately, rather than just that single value. That is, the number of residues which can be superimposed under 1, 2, 4, and 8 Å. There are a total of 19 targets in CASP7, compared with 25 in CASP6. As in the earlier analyses, all domains identified by the assessors are treated as separate targets. Figure 7 shows the results of the GDT threshold analysis for CASPs 6 and 7. Targets are ordered by size and fold type is indicated by the usual Greek letter classifica- DOI /prot PROTEINS 205

13 A. Kryshtafovych et al. tion. The stacked bars show the number of residues super-imposed under the distance thresholds of 1, 2, 4, and 8 Å, i.e., the number of residues for which the largest error is less than or equal to each threshold. The RMS error on such a set is typically about half the threshold, thus substructures meeting the 8 Å threshold would usually be judged excellent by visual inspection. In part (A), the performance in terms of the best models for each target is shown, and in part (B), the performance averaging over the six best models for each target. By this measure, it is very hard to detect any improvement between CASP6 and CASP7. An analysis of the number of residues falling under the GDT4 and GDT8 thresholds for CASPs 6 and 7, as a function of target size also shows little signs of progress (data not shown). As described in the assessment article though, there are a number of targets for which note-worthy models were submitted, 5 similar to that found in CASP6. ANALYSIS OF SUSTAINED PERFORMANCE FOR NEW FOLD TARGETS As in previous analyses, we wish to determine to what extent particular groups are performing consistently, as opposed to groups occasionally getting lucky, and happening to produce the best score for a target. As before, we address that by comparing the distribution of success of individual groups with the distribution of success expected by chance. Success is measured as the number of targets for which a group had a model ranking among the top six. The chance distribution was generated by randomly choosing six groups as the best scoring for each target. The chance distribution was constrained so that only groups predicting on that target were included, and the draw was weighted by the number of models submitted, i.e., a group submitting four models was four times as likely to be selected as one submitting a single model for a particular target. Figure 8 shows these data for the 25 CASP6 targets and 19 CASP7 targets. Also shown is information on how many groups submitted models for different number of targets. The green bars show the number of prediction groups submitting predictions on at least 1, 2, 3,... up to the maximum number of targets in each CASP. The yellow bars show the probability of a group scoring among the top six for one target, two targets, three targets, and so on, if the results were random. The red bars show the number of groups actually falling among the top six for one target, two targets, and so on. The more different this distribution from random, the more significant the results. For CASP7, because of the relatively small number of targets in this category, random expectation is that a large number of groups (47) would have one model among the six best of a single target. Reassuringly, this is not the case, and best performances are distributed over a higher numbers of targets. In particular, there are more groups than expected scoring among the best six models for all entries three and higher, and three groups score well outside the random distribution. These results continue the trend to consistent scoring seen earlier. CONCLUSIONS By the well-established GDT_TS and alignment accuracy measures, progress from CASP6 to CASP7 has been modest at best. However, a closer look reveals two interesting and significant changes. First, the closing of the gap between human prediction groups and automatic servers. As Figure 6 makes clear, server performance has been steadily improving relative to that of the best humans. This is in spite of the fact that most human assisted prediction begins with one or more server generated models, and involves a great deal more effort. Second, and most significantly, comparison of predictor performance with results using a model based on knowledge of a single best template structure is very encouraging. For template based targets, almost 80% of best models have added value over a simple template, as measured by GDT_TS. In a few cases, the improvement is dramatic. As illustrated earlier, improvements over template are being generated by all three possible mechanisms combining information from multiple templates, template FM of regions not present in a template, and refinement. As noted earlier, some caution is needed here because of sensitivity of the results to the exact definition of a template based model, but we tried six different approaches and found the results in every set were consistent, showing the same upward CASP-to-CASP trend. An obvious question is why progress is apparent relative to best template models, and not in absolute terms. The likely explanation is that the use of an intermediate reference state, rather than the absolute GDT_TS value, helps extract the otherwise weak signal. Against these encouraging points must be set the small overall progress picture. The long running analysis of progress by smoothed GDT_TS as a function of target difficulty [Fig. 2(A)] shows only a modest improvement at the more difficult end of the scale. Particularly in template free modeling, the new methods that caused such excitement from about CASP4 onwards seem to have run out of steam. These do produce spectacular successes for a few of the smaller targets, but do not seem to be expandable to a broader set. Still, there is every reason to hope that this energetic and creative community will find new ways to move us forward. METHODS Target difficulty The difficulty of a target is calculated by comparing it with every structure in the appropriate release of the pro- 206 PROTEINS DOI /prot

14 Progress from CASP6 to CASP7 tein databank, using the LGA structure superposition program. 10 For CASP7, templates were taken from the PDB releases accessible before each target deadline. Templates for the previous CASPs are the same as those used in the earlier analyses. 7 For each target, the most similar structure, as determined by LGA, in the appropriate version of the PDB is chosen as the representative template. Similarity between a target structure and a potential template is measured as the number of target-template Ca atom pairs that are within 5 Å in the LGA superposition, irrespective of continuity in the sequence, or sequence relatedness. The 5 Å threshold maintains compatibility with earlier target/template comparisons, 6,13,14 which were made using Prosup 15 software. It is a little larger than we now consider most appropriate (3.8 Å), and there is some times significant superposition score between unrelated structures, particularly for small proteins. Sequence identity is defined as the fraction of structurally aligned residues that are identical, maintaining sequence order. Note that basing sequence identity on structurally equivalent regions will usually yield a higher value than obtained by sequence comparison alone. In cases where several templates display comparable structural similarity to the target (coverage differing by less than 3%), but one has clearly higher sequence similarity (7% or more) the template with the highest sequence identity was selected. There were a total of 11 such instances in CASP7. Difficulty scale We project the data in Figure 1 into one dimension, using the following relationship: Relative Difficulty ¼ðRANK STR ALN + RANK SEQ IDÞ=2; where RANK_STR_ALN is the rank of the target along the horizontal axis of Figure 1 (i.e., ranking by % of the template structure aligned to the target) and RANK_SE- Q_ID is the rank along the vertical axis (ranking by % sequence identity in the structurally aligned regions). GDT_TS The GDT_TS value of a model is determined as follows. A large sample of possible structure super-positions of the model on the corresponding experimental structure is generated by superposing all sets of three, five, and seven consecutive Ca atoms along the backbone (each peptide segment provides one super-position). Each of these initial super-positions is iteratively extended, including all residue pairs under a specified threshold in the next iteration, and continuing until there is no change in included residues. 10 The procedure is carried out using thresholds of 1, 2, 4, and 8 Å, and the final super-position that includes the maximum number of residues is selected for each threshold. Super-imposed residues are not required to be continuous in the sequence, nor is there necessarily any relationship between the sets of residues super-imposed at different thresholds. GDT_TS is then obtained by averaging over the four super-position scores for the different thresholds: GDT TS ¼½1=4Š½N1 þ N2 þ N4 þ N8Š; where Nn is the number of residues superimposed under a distance threshold of n Å. GDT_TS may be thought of as an approximation of the area under the curve of accuracy versus the fraction of the structure included. Different thresholds play different roles in different modeling regimes. ACKNOWLEDGMENT The authors thank Zinoviy Dmytriv for assisting in generating data. REFERENCES 1. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A. Critical assessment of methods of protein structure prediction (CASP) round 6. Proteins 2005;61(Suppl 7): Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A. Critical assessment of methods of protein structure prediction Round VII. Proteins 2007;69(Suppl 8): Read R, Chavali G. Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins 2007;69 (Suppl 8): Kopp K, Bordoli L, Battey JNB, Kiefer F, Schwede T. Assessment of CASP7 predictions for template-based modeling targets. Proteins 2007;69(Suppl 8): Jauch R, Yeo HC, Kolatkar PR, Clarke ND. Assessment of CASP7 structure predictions for template free targets. Proteins 2007;69 (Suppl 8): Venclovas C, Zemla A, Fidelis K, Moult J. Comparison of performance in successive CASP experiments. Proteins 2001;Suppl 5: Venclovas C, Zemla A, Fidelis K, Moult J. Assessment of progress over the CASP experiments. Proteins 2003;53(Suppl 6): Clarke ND, Ezkurdia I, Kopp J, Read RJ, Schwede T, Tress M. Domain definition and target classification for CASP7. Proteins 2007;69(Suppl 8): Kryshtafovych A, Venclovas C, Fidelis K, Moult J. Progress over the first decade of CASP experiments. Proteins 2005;61(Suppl 7): Zemla A. LGA a method for finding 3D similarities in protein structures. Nucleic Acids Res 2003;31: Zemla A, Venclovas C, Moult J, Fidelis K. Processing and evaluation of predictions in CASP4. Proteins 2001;Suppl 5: Fiser A, Sali A. Modeller: generation and refinement of homologybased protein structure models. Methods Enzymol 2003;374: Venclovas C, Zemla A, Fidelis K, Moult J. Some measures of comparative performance in the three CASPs. Proteins 1999;Suppl 3: Lackner P, Koppensteiner WA, Domingues FS, Sippl MJ. Automated large scale evaluation of protein structure predictions. Proteins 1999;37(Suppl 3): Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: a refined tool for protein structure alignment. Protein Eng 2000; 13: DOI /prot PROTEINS 207

ProcessingandEvaluationofPredictionsinCASP4

ProcessingandEvaluationofPredictionsinCASP4 PROTEINS: Structure, Function, and Genetics Suppl 5:13 21 (2001) DOI 10.1002/prot.10052 ProcessingandEvaluationofPredictionsinCASP4 AdamZemla, 1 ČeslovasVenclovas, 1 JohnMoult, 2 andkrzysztoffidelis 1

More information

proteins High Accuracy Assessment Assessment of CASP7 predictions in the high accuracy template-based modeling category

proteins High Accuracy Assessment Assessment of CASP7 predictions in the high accuracy template-based modeling category proteins STRUCTURE O FUNCTION O BIOINFORMATICS High Accuracy Assessment Assessment of CASP7 predictions in the high accuracy template-based modeling category Randy J. Read* and Gayatri Chavali Department

More information

RMS/Coverage Graphs: A Qualitative Method for Comparing Three-Dimensional Protein Structure Predictions

RMS/Coverage Graphs: A Qualitative Method for Comparing Three-Dimensional Protein Structure Predictions PROTEINS: Structure, Function, and Genetics Suppl 3:15 21 (1999) RMS/Coverage Graphs: A Qualitative Method for Comparing Three-Dimensional Protein Structure Predictions Tim J.P. Hubbard* Sanger Centre,

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Tertiary Structure Prediction CMPS 6630: Introduction to Computational Biology and Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the

More information

CMPS 3110: Bioinformatics. Tertiary Structure Prediction

CMPS 3110: Bioinformatics. Tertiary Structure Prediction CMPS 3110: Bioinformatics Tertiary Structure Prediction Tertiary Structure Prediction Why Should Tertiary Structure Prediction Be Possible? Molecules obey the laws of physics! Conformation space is finite

More information

Multiple Mapping Method: A Novel Approach to the Sequence-to-Structure Alignment Problem in Comparative Protein Structure Modeling

Multiple Mapping Method: A Novel Approach to the Sequence-to-Structure Alignment Problem in Comparative Protein Structure Modeling 63:644 661 (2006) Multiple Mapping Method: A Novel Approach to the Sequence-to-Structure Alignment Problem in Comparative Protein Structure Modeling Brajesh K. Rai and András Fiser* Department of Biochemistry

More information

Protein Structure Prediction

Protein Structure Prediction Page 1 Protein Structure Prediction Russ B. Altman BMI 214 CS 274 Protein Folding is different from structure prediction --Folding is concerned with the process of taking the 3D shape, usually based on

More information

Template-Based Modeling of Protein Structure

Template-Based Modeling of Protein Structure Template-Based Modeling of Protein Structure David Constant Biochemistry 218 December 11, 2011 Introduction. Much can be learned about the biology of a protein from its structure. Simply put, structure

More information

Protein Structure Determination from Pseudocontact Shifts Using ROSETTA

Protein Structure Determination from Pseudocontact Shifts Using ROSETTA Supporting Information Protein Structure Determination from Pseudocontact Shifts Using ROSETTA Christophe Schmitz, Robert Vernon, Gottfried Otting, David Baker and Thomas Huber Table S0. Biological Magnetic

More information

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity.

2MHR. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. Protein structure classification is important because it organizes the protein structure universe that is independent of sequence similarity. A global picture of the protein universe will help us to understand

More information

CS612 - Algorithms in Bioinformatics

CS612 - Algorithms in Bioinformatics Fall 2017 Databases and Protein Structure Representation October 2, 2017 Molecular Biology as Information Science > 12, 000 genomes sequenced, mostly bacterial (2013) > 5x10 6 unique sequences available

More information

The typical end scenario for those who try to predict protein

The typical end scenario for those who try to predict protein A method for evaluating the structural quality of protein models by using higher-order pairs scoring Gregory E. Sims and Sung-Hou Kim Berkeley Structural Genomics Center, Lawrence Berkeley National Laboratory,

More information

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics.

Procheck output. Bond angles (Procheck) Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics. Structure verification and validation Bond lengths (Procheck) Introduction to Bioinformatics Iosif Vaisman Email: ivaisman@gmu.edu ----------------------------------------------------------------- Bond

More information

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison

CMPS 6630: Introduction to Computational Biology and Bioinformatics. Structure Comparison CMPS 6630: Introduction to Computational Biology and Bioinformatics Structure Comparison Protein Structure Comparison Motivation Understand sequence and structure variability Understand Domain architecture

More information

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ

Chapter 5. Proteomics and the analysis of protein sequence Ⅱ Proteomics Chapter 5. Proteomics and the analysis of protein sequence Ⅱ 1 Pairwise similarity searching (1) Figure 5.5: manual alignment One of the amino acids in the top sequence has no equivalent and

More information

Protein structures and comparisons ndrew Torda Bioinformatik, Mai 2008

Protein structures and comparisons ndrew Torda Bioinformatik, Mai 2008 Protein structures and comparisons ndrew Torda 67.937 Bioinformatik, Mai 2008 Ultimate aim how to find out the most about a protein what you can get from sequence and structure information On the way..

More information

Prediction and refinement of NMR structures from sparse experimental data

Prediction and refinement of NMR structures from sparse experimental data Prediction and refinement of NMR structures from sparse experimental data Jeff Skolnick Director Center for the Study of Systems Biology School of Biology Georgia Institute of Technology Overview of talk

More information

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics

Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Statistical Machine Learning Methods for Bioinformatics IV. Neural Network & Deep Learning Applications in Bioinformatics Jianlin Cheng, PhD Department of Computer Science University of Missouri, Columbia

More information

Magnetic Case Study: Raglan Mine Laura Davis May 24, 2006

Magnetic Case Study: Raglan Mine Laura Davis May 24, 2006 Magnetic Case Study: Raglan Mine Laura Davis May 24, 2006 Research Objectives The objective of this study was to test the tools available in EMIGMA (PetRos Eikon) for their utility in analyzing magnetic

More information

TASSER: An Automated Method for the Prediction of Protein Tertiary Structures in CASP6

TASSER: An Automated Method for the Prediction of Protein Tertiary Structures in CASP6 PROTEINS: Structure, Function, and Bioinformatics Suppl 7:91 98 (2005) TASSER: An Automated Method for the Prediction of Protein Tertiary Structures in CASP6 Yang Zhang, Adrian K. Arakaki, and Jeffrey

More information

TOUCHSTONE: A Unified Approach to Protein Structure Prediction

TOUCHSTONE: A Unified Approach to Protein Structure Prediction PROTEINS: Structure, Function, and Genetics 53:469 479 (2003) TOUCHSTONE: A Unified Approach to Protein Structure Prediction Jeffrey Skolnick, 1 * Yang Zhang, 1 Adrian K. Arakaki, 1 Andrzej Kolinski, 1,2

More information

Homology Modeling. Roberto Lins EPFL - summer semester 2005

Homology Modeling. Roberto Lins EPFL - summer semester 2005 Homology Modeling Roberto Lins EPFL - summer semester 2005 Disclaimer: course material is mainly taken from: P.E. Bourne & H Weissig, Structural Bioinformatics; C.A. Orengo, D.T. Jones & J.M. Thornton,

More information

AUTOMATED TEMPLATE MATCHING METHOD FOR NMIS AT THE Y-12 NATIONAL SECURITY COMPLEX

AUTOMATED TEMPLATE MATCHING METHOD FOR NMIS AT THE Y-12 NATIONAL SECURITY COMPLEX AUTOMATED TEMPLATE MATCHING METHOD FOR NMIS AT THE Y-1 NATIONAL SECURITY COMPLEX J. A. Mullens, J. K. Mattingly, L. G. Chiang, R. B. Oberer, J. T. Mihalczo ABSTRACT This paper describes a template matching

More information

Interaction effects for continuous predictors in regression modeling

Interaction effects for continuous predictors in regression modeling Interaction effects for continuous predictors in regression modeling Testing for interactions The linear regression model is undoubtedly the most commonly-used statistical model, and has the advantage

More information

Introduction to Comparative Protein Modeling. Chapter 4 Part I

Introduction to Comparative Protein Modeling. Chapter 4 Part I Introduction to Comparative Protein Modeling Chapter 4 Part I 1 Information on Proteins Each modeling study depends on the quality of the known experimental data. Basis of the model Search in the literature

More information

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1

Tiffany Samaroo MB&B 452a December 8, Take Home Final. Topic 1 Tiffany Samaroo MB&B 452a December 8, 2003 Take Home Final Topic 1 Prior to 1970, protein and DNA sequence alignment was limited to visual comparison. This was a very tedious process; even proteins with

More information

SUPPLEMENTARY MATERIALS

SUPPLEMENTARY MATERIALS SUPPLEMENTARY MATERIALS Enhanced Recognition of Transmembrane Protein Domains with Prediction-based Structural Profiles Baoqiang Cao, Aleksey Porollo, Rafal Adamczak, Mark Jarrell and Jaroslaw Meller Contact:

More information

Algorithms in Bioinformatics

Algorithms in Bioinformatics Algorithms in Bioinformatics Sami Khuri Department of Computer Science San José State University San José, California, USA khuri@cs.sjsu.edu www.cs.sjsu.edu/faculty/khuri RNA Structure Prediction Secondary

More information

proteins Refinement by shifting secondary structure elements improves sequence alignments

proteins Refinement by shifting secondary structure elements improves sequence alignments proteins STRUCTURE O FUNCTION O BIOINFORMATICS Refinement by shifting secondary structure elements improves sequence alignments Jing Tong, 1,2 Jimin Pei, 3 Zbyszek Otwinowski, 1,2 and Nick V. Grishin 1,2,3

More information

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009

114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009 114 Grundlagen der Bioinformatik, SS 09, D. Huson, July 6, 2009 9 Protein tertiary structure Sources for this chapter, which are all recommended reading: D.W. Mount. Bioinformatics: Sequences and Genome

More information

proteins Comprehensive description of protein structures using protein folding shape code Jiaan Yang* INTRODUCTION

proteins Comprehensive description of protein structures using protein folding shape code Jiaan Yang* INTRODUCTION proteins STRUCTURE O FUNCTION O BIOINFORMATICS Comprehensive description of protein structures using protein folding shape code Jiaan Yang* MicrotechNano, LLC, Indianapolis, Indiana 46234 ABSTRACT Understanding

More information

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche

Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche Protein Structure Prediction II Lecturer: Serafim Batzoglou Scribe: Samy Hamdouche The molecular structure of a protein can be broken down hierarchically. The primary structure of a protein is simply its

More information

Bioinformatics. Dept. of Computational Biology & Bioinformatics

Bioinformatics. Dept. of Computational Biology & Bioinformatics Bioinformatics Dept. of Computational Biology & Bioinformatics 3 Bioinformatics - play with sequences & structures Dept. of Computational Biology & Bioinformatics 4 ORGANIZATION OF LIFE ROLE OF BIOINFORMATICS

More information

Linear Model Selection and Regularization

Linear Model Selection and Regularization Linear Model Selection and Regularization Recall the linear model Y = β 0 + β 1 X 1 + + β p X p + ɛ. In the lectures that follow, we consider some approaches for extending the linear model framework. In

More information

Introduction to Evolutionary Concepts

Introduction to Evolutionary Concepts Introduction to Evolutionary Concepts and VMD/MultiSeq - Part I Zaida (Zan) Luthey-Schulten Dept. Chemistry, Beckman Institute, Biophysics, Institute of Genomics Biology, & Physics NIH Workshop 2009 VMD/MultiSeq

More information

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB

Homology Modeling (Comparative Structure Modeling) GBCB 5874: Problem Solving in GBCB Homology Modeling (Comparative Structure Modeling) Aims of Structural Genomics High-throughput 3D structure determination and analysis To determine or predict the 3D structures of all the proteins encoded

More information

Programme Last week s quiz results + Summary Fold recognition Break Exercise: Modelling remote homologues

Programme Last week s quiz results + Summary Fold recognition Break Exercise: Modelling remote homologues Programme 8.00-8.20 Last week s quiz results + Summary 8.20-9.00 Fold recognition 9.00-9.15 Break 9.15-11.20 Exercise: Modelling remote homologues 11.20-11.40 Summary & discussion 11.40-12.00 Quiz 1 Feedback

More information

Assessment of the model refinement category in CASP12

Assessment of the model refinement category in CASP12 Received: 19 June 2017 Revised: 3 October 2017 Accepted: 24 October 2017 DOI: 10.1002/prot.25409 RESEARCH ARTICLE Assessment of the model refinement category in CASP12 Ladislav Hovan 1 * Vladimiras Oleinikovas

More information

Universal Similarity Measure for Comparing Protein Structures

Universal Similarity Measure for Comparing Protein Structures Marcos R. Betancourt Jeffrey Skolnick Laboratory of Computational Genomics, The Donald Danforth Plant Science Center, 893. Warson Rd., Creve Coeur, MO 63141 Universal Similarity Measure for Comparing Protein

More information

Protein quality assessment

Protein quality assessment Protein quality assessment Speaker: Renzhi Cao Advisor: Dr. Jianlin Cheng Major: Computer Science May 17 th, 2013 1 Outline Introduction Paper1 Paper2 Paper3 Discussion and research plan Acknowledgement

More information

An Introduction to Sequence Similarity ( Homology ) Searching

An Introduction to Sequence Similarity ( Homology ) Searching An Introduction to Sequence Similarity ( Homology ) Searching Gary D. Stormo 1 UNIT 3.1 1 Washington University, School of Medicine, St. Louis, Missouri ABSTRACT Homologous sequences usually have the same,

More information

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007

Molecular Modeling. Prediction of Protein 3D Structure from Sequence. Vimalkumar Velayudhan. May 21, 2007 Molecular Modeling Prediction of Protein 3D Structure from Sequence Vimalkumar Velayudhan Jain Institute of Vocational and Advanced Studies May 21, 2007 Vimalkumar Velayudhan Molecular Modeling 1/23 Outline

More information

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5

Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Sequence and Structure Alignment Z. Luthey-Schulten, UIUC Pittsburgh, 2006 VMD 1.8.5 Why Look at More Than One Sequence? 1. Multiple Sequence Alignment shows patterns of conservation 2. What and how many

More information

proteins Effect of using suboptimal alignments in template-based protein structure prediction Hao Chen 1 and Daisuke Kihara 1,2,3 * INTRODUCTION

proteins Effect of using suboptimal alignments in template-based protein structure prediction Hao Chen 1 and Daisuke Kihara 1,2,3 * INTRODUCTION proteins STRUCTURE O FUNCTION O BIOINFORMATICS Effect of using suboptimal alignments in template-based protein structure prediction Hao Chen 1 and Daisuke Kihara 1,2,3 * 1 Department of Biological Sciences,

More information

Can protein model accuracy be. identified? NO! CBS, BioCentrum, Morten Nielsen, DTU

Can protein model accuracy be. identified? NO! CBS, BioCentrum, Morten Nielsen, DTU Can protein model accuracy be identified? Morten Nielsen, CBS, BioCentrum, DTU NO! Identification of Protein-model accuracy Why is it important? What is accuracy RMSD, fraction correct, Protein model correctness/quality

More information

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror

Protein structure prediction. CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror Protein structure prediction CS/CME/BioE/Biophys/BMI 279 Oct. 10 and 12, 2017 Ron Dror 1 Outline Why predict protein structure? Can we use (pure) physics-based methods? Knowledge-based methods Two major

More information

Measuring quaternary structure similarity using global versus local measures.

Measuring quaternary structure similarity using global versus local measures. Supplementary Figure 1 Measuring quaternary structure similarity using global versus local measures. (a) Structural similarity of two protein complexes can be inferred from a global superposition, which

More information

Protein Structure Determination

Protein Structure Determination Protein Structure Determination Given a protein sequence, determine its 3D structure 1 MIKLGIVMDP IANINIKKDS SFAMLLEAQR RGYELHYMEM GDLYLINGEA 51 RAHTRTLNVK QNYEEWFSFV GEQDLPLADL DVILMRKDPP FDTEFIYATY 101

More information

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 8. Chapter 8. Classification: Basic Concepts Data Mining: Concepts and Techniques (3 rd ed.) Chapter 8 1 Chapter 8. Classification: Basic Concepts Classification: Basic Concepts Decision Tree Induction Bayes Classification Methods Rule-Based Classification

More information

proteins Prediction Methods and Reports

proteins Prediction Methods and Reports proteins STRUCTURE O FUNCTION O BIOINFORMATICS Prediction Methods and Reports Automated protein structure modeling in CASP9 by I-TASSER pipeline combined with QUARK-based ab initio folding and FG-MD-based

More information

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment

Algorithms in Bioinformatics FOUR Pairwise Sequence Alignment. Pairwise Sequence Alignment. Convention: DNA Sequences 5. Sequence Alignment Algorithms in Bioinformatics FOUR Sami Khuri Department of Computer Science San José State University Pairwise Sequence Alignment Homology Similarity Global string alignment Local string alignment Dot

More information

Methods of Protein Structure Comparison. Irina Kufareva and Ruben Abagyan

Methods of Protein Structure Comparison. Irina Kufareva and Ruben Abagyan Chapter 1 Methods of Protein Structure Comparison Irina Kufareva and Ruben Abagyan Abstract Despite its apparent simplicity, the problem of quantifying the differences between two structures of the same

More information

In-Depth Assessment of Local Sequence Alignment

In-Depth Assessment of Local Sequence Alignment 2012 International Conference on Environment Science and Engieering IPCBEE vol.3 2(2012) (2012)IACSIT Press, Singapoore In-Depth Assessment of Local Sequence Alignment Atoosa Ghahremani and Mahmood A.

More information

PROTEIN FOLD RECOGNITION USING THE GRADIENT BOOST ALGORITHM

PROTEIN FOLD RECOGNITION USING THE GRADIENT BOOST ALGORITHM 43 1 PROTEIN FOLD RECOGNITION USING THE GRADIENT BOOST ALGORITHM Feng Jiao School of Computer Science, University of Waterloo, Canada fjiao@cs.uwaterloo.ca Jinbo Xu Toyota Technological Institute at Chicago,

More information

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748

Giri Narasimhan. CAP 5510: Introduction to Bioinformatics. ECS 254; Phone: x3748 CAP 5510: Introduction to Bioinformatics Giri Narasimhan ECS 254; Phone: x3748 giri@cis.fiu.edu www.cis.fiu.edu/~giri/teach/bioinfs07.html 2/15/07 CAP5510 1 EM Algorithm Goal: Find θ, Z that maximize Pr

More information

A Multistage Modeling Strategy for Demand Forecasting

A Multistage Modeling Strategy for Demand Forecasting Paper SAS5260-2016 A Multistage Modeling Strategy for Demand Forecasting Pu Wang, Alex Chien, and Yue Li, SAS Institute Inc. ABSTRACT Although rapid development of information technologies in the past

More information

proteins Estimating quality of template-based protein models by alignment stability Hao Chen 1 and Daisuke Kihara 1,2,3,4 * INTRODUCTION

proteins Estimating quality of template-based protein models by alignment stability Hao Chen 1 and Daisuke Kihara 1,2,3,4 * INTRODUCTION proteins STRUCTURE O FUNCTION O BIOINFORMATICS Estimating quality of template-based protein models by alignment stability Hao Chen 1 and Daisuke Kihara 1,2,3,4 * 1 Department of Biological Sciences, College

More information

As of December 30, 2003, 23,000 solved protein structures

As of December 30, 2003, 23,000 solved protein structures The protein structure prediction problem could be solved using the current PDB library Yang Zhang and Jeffrey Skolnick* Center of Excellence in Bioinformatics, University at Buffalo, 901 Washington Street,

More information

proteins Prediction Report Template-based modeling and free modeling by I-TASSER in CASP7 Yang Zhang* 108 PROTEINS VC 2007 WILEY-LISS, INC.

proteins Prediction Report Template-based modeling and free modeling by I-TASSER in CASP7 Yang Zhang* 108 PROTEINS VC 2007 WILEY-LISS, INC. proteins STRUCTURE O FUNCTION O BIOINFORMATICS Prediction Report Template-based modeling and free modeling by I-TASSER in CASP7 Yang Zhang* Center for Bioinformatics, Department of Molecular Biosciences,

More information

0. Introduction 1 0. INTRODUCTION

0. Introduction 1 0. INTRODUCTION 0. Introduction 1 0. INTRODUCTION In a very rough sketch we explain what algebraic geometry is about and what it can be used for. We stress the many correlations with other fields of research, such as

More information

C. Watson, E. Churchwell, R. Indebetouw, M. Meade, B. Babler, B. Whitney

C. Watson, E. Churchwell, R. Indebetouw, M. Meade, B. Babler, B. Whitney Reliability and Completeness for the GLIMPSE Survey C. Watson, E. Churchwell, R. Indebetouw, M. Meade, B. Babler, B. Whitney Abstract This document examines the GLIMPSE observing strategy and criteria

More information

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha

Neural Networks for Protein Structure Prediction Brown, JMB CS 466 Saurabh Sinha Neural Networks for Protein Structure Prediction Brown, JMB 1999 CS 466 Saurabh Sinha Outline Goal is to predict secondary structure of a protein from its sequence Artificial Neural Network used for this

More information

Description of the ED library Basic Atoms

Description of the ED library Basic Atoms Description of the ED library Basic Atoms Simulation Software / Description of the ED library BASIC ATOMS Enterprise Dynamics Copyright 2010 Incontrol Simulation Software B.V. All rights reserved Papendorpseweg

More information

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1

, (1) e i = ˆσ 1 h ii. c 2016, Jeffrey S. Simonoff 1 Regression diagnostics As is true of all statistical methodologies, linear regression analysis can be a very effective way to model data, as along as the assumptions being made are true. For the regression

More information

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition

09/06/25. Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Non-uniform distribution of folds. Scheme of protein structure predicition Sequence identity Structural similarity Computergestützte Strukturbiologie (Strukturelle Bioinformatik) Fold recognition Sommersemester 2009 Peter Güntert Structural similarity X Sequence identity Non-uniform

More information

Pairwise sequence alignments

Pairwise sequence alignments Pairwise sequence alignments Volker Flegel VI, October 2003 Page 1 Outline Introduction Definitions Biological context of pairwise alignments Computing of pairwise alignments Some programs VI, October

More information

Student Sheet: Self-Assessment

Student Sheet: Self-Assessment Student s Name Date Class Student Sheet: Self-Assessment Directions: Use the space provided to prepare a KWL chart. In the first column, write things you already know about energy, forces, and motion.

More information

Introduction to Bioinformatics Online Course: IBT

Introduction to Bioinformatics Online Course: IBT Introduction to Bioinformatics Online Course: IBT Multiple Sequence Alignment Building Multiple Sequence Alignment Lec1 Building a Multiple Sequence Alignment Learning Outcomes 1- Understanding Why multiple

More information

PROTEIN STRUCTURE PREDICTION II

PROTEIN STRUCTURE PREDICTION II PROTEIN STRUCTURE PREDICTION II Jeffrey Skolnick 1,2 Yang Zhang 1 Because the molecular function of a protein depends on its three dimensional structure, which is often unknown, protein structure prediction

More information

Week 10: Homology Modelling (II) - HHpred

Week 10: Homology Modelling (II) - HHpred Week 10: Homology Modelling (II) - HHpred Course: Tools for Structural Biology Fabian Glaser BKU - Technion 1 2 Identify and align related structures by sequence methods is not an easy task All comparative

More information

PHYS 281 General Physics Laboratory

PHYS 281 General Physics Laboratory King Abdul-Aziz University Faculty of Science Physics Department PHYS 281 General Physics Laboratory Student Name: ID Number: Introduction Advancement in science and engineering has emphasized the microscopic

More information

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids Science in China Series C: Life Sciences 2007 Science in China Press Springer-Verlag Grouping of amino acids and recognition of protein structurally conserved regions by reduced alphabets of amino acids

More information

Genki Terashi, Mayuko Takeda-Shitaka, Kazuhiko Kanou, Mitsuo Iwadate, Daisuke Takaya, Akio Hosoi, Kazuhiro Ohta, and Hideaki Umeyama*

Genki Terashi, Mayuko Takeda-Shitaka, Kazuhiko Kanou, Mitsuo Iwadate, Daisuke Takaya, Akio Hosoi, Kazuhiro Ohta, and Hideaki Umeyama* proteins STRUCTURE O FUNCTION O BIOINFORMATICS Prediction Report fams-ace: A combined method to select the best model after remodeling all server models Genki Terashi, Mayuko Takeda-Shitaka, Kazuhiko Kanou,

More information

Identification of correct regions in protein models using structural, alignment, and consensus information

Identification of correct regions in protein models using structural, alignment, and consensus information Identification of correct regions in protein models using structural, alignment, and consensus information BJO RN WALLNER AND ARNE ELOFSSON Stockholm Bioinformatics Center, Stockholm University, SE-106

More information

Atomic-Level Protein Structure Refinement Using Fragment-Guided Molecular Dynamics Conformation Sampling

Atomic-Level Protein Structure Refinement Using Fragment-Guided Molecular Dynamics Conformation Sampling Article Atomic-Level Protein Structure Refinement Using Fragment-Guided Molecular Dynamics Conformation Sampling Jian Zhang, 1 Yu Liang, 2 and Yang Zhang 1,3, * 1 Center for Computational Medicine and

More information

Protein Fold Recognition Using Gradient Boost Algorithm

Protein Fold Recognition Using Gradient Boost Algorithm Protein Fold Recognition Using Gradient Boost Algorithm Feng Jiao 1, Jinbo Xu 2, Libo Yu 3 and Dale Schuurmans 4 1 School of Computer Science, University of Waterloo, Canada fjiao@cs.uwaterloo.ca 2 Toyota

More information

Computational Biology

Computational Biology Computational Biology Lecture 6 31 October 2004 1 Overview Scoring matrices (Thanks to Shannon McWeeney) BLAST algorithm Start sequence alignment 2 1 What is a homologous sequence? A homologous sequence,

More information

Analysis and Prediction of Protein Structure (I)

Analysis and Prediction of Protein Structure (I) Analysis and Prediction of Protein Structure (I) Jianlin Cheng, PhD School of Electrical Engineering and Computer Science University of Central Florida 2006 Free for academic use. Copyright @ Jianlin Cheng

More information

Development and Large Scale Benchmark Testing of the PROSPECTOR_3 Threading Algorithm

Development and Large Scale Benchmark Testing of the PROSPECTOR_3 Threading Algorithm PROTEINS: Structure, Function, and Bioinformatics 56:502 518 (2004) Development and Large Scale Benchmark Testing of the PROSPECTOR_3 Threading Algorithm Jeffrey Skolnick,* Daisuke Kihara, and Yang Zhang

More information

Bioinformatics. Macromolecular structure

Bioinformatics. Macromolecular structure Bioinformatics Macromolecular structure Contents Determination of protein structure Structure databases Secondary structure elements (SSE) Tertiary structure Structure analysis Structure alignment Domain

More information

Improving De novo Protein Structure Prediction using Contact Maps Information

Improving De novo Protein Structure Prediction using Contact Maps Information CIBCB 2017 IEEE Conference on Computational Intelligence in Bioinformatics and Computational Biology Improving De novo Protein Structure Prediction using Contact Maps Information Karina Baptista dos Santos

More information

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity

frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity 1 frmsdalign: Protein Sequence Alignment Using Predicted Local Structure Information for Pairs with Low Sequence Identity HUZEFA RANGWALA and GEORGE KARYPIS Department of Computer Science and Engineering

More information

Building 3D models of proteins

Building 3D models of proteins Building 3D models of proteins Why make a structural model for your protein? The structure can provide clues to the function through structural similarity with other proteins With a structure it is easier

More information

Massachusetts Tests for Educator Licensure (MTEL )

Massachusetts Tests for Educator Licensure (MTEL ) Massachusetts Tests for Educator Licensure (MTEL ) BOOKLET 2 Mathematics Subtest Copyright 2010 Pearson Education, Inc. or its affiliate(s). All rights reserved. Evaluation Systems, Pearson, P.O. Box 226,

More information

Linker Dependent Bond Rupture Force Measurements in Single-Molecule Junctions

Linker Dependent Bond Rupture Force Measurements in Single-Molecule Junctions Supplemental Information Linker Dependent Bond Rupture Force Measurements in Single-Molecule Junctions M. Frei 1, S Aradhya 1, M. S. Hybertsen 2, L. Venkataraman 1 1 Department of Applied Physics and Applied

More information

Detection of homologous proteins by an intermediate sequence search

Detection of homologous proteins by an intermediate sequence search by an intermediate sequence search BINO JOHN 1 AND ANDREJ SALI 2 1 Laboratory of Molecular Biophysics, Pels Family Center for Biochemistry and Structural Biology, The Rockefeller University, New York,

More information

Analysis of the 500 mb height fields and waves: testing Rossby wave theory

Analysis of the 500 mb height fields and waves: testing Rossby wave theory Analysis of the 500 mb height fields and waves: testing Rossby wave theory Jeffrey D. Duda, Suzanne Morris, Michelle Werness, and Benjamin H. McNeill Department of Geologic and Atmospheric Sciences, Iowa

More information

Protein Modeling Methods. Knowledge. Protein Modeling Methods. Fold Recognition. Knowledge-based methods. Introduction to Bioinformatics

Protein Modeling Methods. Knowledge. Protein Modeling Methods. Fold Recognition. Knowledge-based methods. Introduction to Bioinformatics Protein Modeling Methods Introduction to Bioinformatics Iosif Vaisman Ab initio methods Energy-based methods Knowledge-based methods Email: ivaisman@gmu.edu Protein Modeling Methods Ab initio methods:

More information

Example: What number is the arrow pointing to?

Example: What number is the arrow pointing to? Number Lines Investigation 1 Inv. 1 To draw a number line, begin by drawing a line. Next, put tick marks on the line, keeping an equal distance between the marks. Then label the tick marks with numbers.

More information

Protein Secondary Structure Prediction

Protein Secondary Structure Prediction Protein Secondary Structure Prediction Doug Brutlag & Scott C. Schmidler Overview Goals and problem definition Existing approaches Classic methods Recent successful approaches Evaluating prediction algorithms

More information

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years.

Copyright Mark Brandt, Ph.D A third method, cryogenic electron microscopy has seen increasing use over the past few years. Structure Determination and Sequence Analysis The vast majority of the experimentally determined three-dimensional protein structures have been solved by one of two methods: X-ray diffraction and Nuclear

More information

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference

hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference CS 229 Project Report (TR# MSB2010) Submitted 12/10/2010 hsnim: Hyper Scalable Network Inference Machine for Scale-Free Protein-Protein Interaction Networks Inference Muhammad Shoaib Sehgal Computer Science

More information

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences

Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Statistical Machine Learning Methods for Biomedical Informatics II. Hidden Markov Model for Biological Sequences Jianlin Cheng, PhD William and Nancy Thompson Missouri Distinguished Professor Department

More information

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment

Molecular Modeling Lecture 7. Homology modeling insertions/deletions manual realignment Molecular Modeling 2018-- Lecture 7 Homology modeling insertions/deletions manual realignment Homology modeling also called comparative modeling Sequences that have similar sequence have similar structure.

More information

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University

Sequence Alignment: A General Overview. COMP Fall 2010 Luay Nakhleh, Rice University Sequence Alignment: A General Overview COMP 571 - Fall 2010 Luay Nakhleh, Rice University Life through Evolution All living organisms are related to each other through evolution This means: any pair of

More information

Supporting Online Material for

Supporting Online Material for www.sciencemag.org/cgi/content/full/309/5742/1868/dc1 Supporting Online Material for Toward High-Resolution de Novo Structure Prediction for Small Proteins Philip Bradley, Kira M. S. Misura, David Baker*

More information

BIOINFORMATICS: An Introduction

BIOINFORMATICS: An Introduction BIOINFORMATICS: An Introduction What is Bioinformatics? The term was first coined in 1988 by Dr. Hwa Lim The original definition was : a collective term for data compilation, organisation, analysis and

More information

proteins Comparison of structure-based and threading-based approaches to protein functional annotation Michal Brylinski, and Jeffrey Skolnick*

proteins Comparison of structure-based and threading-based approaches to protein functional annotation Michal Brylinski, and Jeffrey Skolnick* proteins STRUCTURE O FUNCTION O BIOINFORMATICS Comparison of structure-based and threading-based approaches to protein functional annotation Michal Brylinski, and Jeffrey Skolnick* Center for the Study

More information

Hit Finding and Optimization Using BLAZE & FORGE

Hit Finding and Optimization Using BLAZE & FORGE Hit Finding and Optimization Using BLAZE & FORGE Kevin Cusack,* Maria Argiriadi, Eric Breinlinger, Jeremy Edmunds, Michael Hoemann, Michael Friedman, Sami Osman, Raymond Huntley, Thomas Vargo AbbVie, Immunology

More information

Machine Learning for OR & FE

Machine Learning for OR & FE Machine Learning for OR & FE Regression II: Regularization and Shrinkage Methods Martin Haugh Department of Industrial Engineering and Operations Research Columbia University Email: martin.b.haugh@gmail.com

More information