proteins CASP Progress Report Progress from CASP6 to CASP7 Andriy Kryshtafovych, 1 Krzysztof Fidelis, 1 and John Moult 2 *

Size: px

Start display at page:

Download "proteins CASP Progress Report Progress from CASP6 to CASP7 Andriy Kryshtafovych, 1 Krzysztof Fidelis, 1 and John Moult 2 *"

Albert Rice
5 years ago
Views:

1 proteins STRUCTURE O FUNCTION O BIOINFORMATICS CASP Progress Report Progress from CASP6 to CASP7 Andriy Kryshtafovych, 1 Krzysztof Fidelis, 1 and John Moult 2 * 1 Genome Center, University of California, Davis, California Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, Rockville, Maryland ABSTRACT This article describes the general quality of models of three dimensional structure submitted to CASP7 and analyzes progress since the previous experiment, primarily using measures that were used in earlier analyses. Overall improvement in model accuracy compared to CASP6 is modest, but there are two developments of note: server performance has moved closer to that of humans, and there has been a significant improvement in the fraction of targets for which the best model is superior to that obtainable using knowledge of a single best template structure. Proteins 2007; 69(Suppl 8): VC 2007 Wiley-Liss, Inc. Key words: protein structure prediction; community wide experiment; CASP. INTRODUCTION This article examines progress in structure modeling between the CASP6 experiment 1 held in 2004 and CASP7, held in 2006, in the context of the full record of the CASP experiments, carried out every 2 years since Details of the CASP7 experiment can be found in the introduction to this special issue of Proteins 2 and the articles by the three assessors focus more on the state of the art now. 3 5 The analysis methods have mostly been introduced in the earlier articles. 6,7 In addition, we have introduced a more thorough analysis of the relationship between the best models submitted for each target and those that a naïve modeler might produce. We include only results in the core areas of CASP, concerned with producing three dimensional models of complete proteins or protein domains. As before, we analyze two aspects of progress how the quality of the very best models is improving, and to a lesser extent, how the quality of models produced in the field as a whole is advancing. Best performance is evaluated by comparing the most accurate models of targets of comparable difficulty in different CASPs. Progress in the field as a whole is evaluated by comparing the average accuracy of the six best models for a target with the average accuracy of models in other CASPs for targets of similar difficulty. GENERAL CONSIDERATIONS Relative target difficulty As in the earlier progress assessments, we use a two-dimensional scale to estimate difficulty, incorporating the similarity of the protein sequence to that of a protein with known structure, and the similarity of the structure of the target protein to potential templates (see Methods). Some other significant factors that affect modeling difficulty are not considered, particularly the number and phylogenetic distribution of related sequences and the number and structural distribution of available templates. The set of related sequences will influence whether or not an evolutionary relationship can be detected, and also the quality of the alignment that can be generated. As discussed later, The authors state no conflict of interest. Grant sponsor: NIH; Grant number: LM *Correspondence to: John Moult, Center for Advanced Research in Biotechnology, University of Maryland Biotechnology Institute, 9600 Gudelsky Drive, Rockville, MD moult@umbi.umd.edu Received 14 May 2007; Revised 21 July 2007; Accepted 23 July 2007 Published online 5 October 2007 in Wiley InterScience ( DOI: /prot PROTEINS VC 2007 WILEY-LISS, INC.

$For each CASP, the difficulty of producing an accurate model is shown as function of the fraction of each target that can be superimposed on a known structure (horizontal axis) and the sequence$

2 Progress from CASP6 to CASP7 Figure 1 Distribution of target difficulty. For each CASP, the difficulty of producing an accurate model is shown as function of the fraction of each target that can be superimposed on a known structure (horizontal axis) and the sequence identity between target and template for the superimposed portion (vertical axis). In all CASPs, targets span a wide range of difficulty. additional templates may provide models for regions of structure not present in the single best one. Omitting these factors adds some noise to the relationship between model quality and our difficulty scale. Domains Many target structures consist of two or more structural domains. Since domains within the same structure may present modeling problems of different difficulty, assessment in CASPs 4 7 has treated each identifiable domain as a separate target. Assessors have the advantage of deriving domains from the experimental structure, whereas predictors only have the sequence. In any case, domain definitions are nearly always subjective. For most of the analysis, we subdivide template based modeling (TBM) targets into domains only if these divisions are likely identifiable by a predictor, and require different modeling approaches (i.e., belong to different difficulty categories), or the domains are sequentially related to different templates. There are seven such targets in CASP7. For evaluation of nontemplate based models all domains identified by the assessors are treated as separate targets. TARGET DIFFICULTY ANALYSIS Figure 1 shows the distribution of target difficulty for all CASPs, as a function of structure and sequence similarity between the experimental structure of each target and the corresponding best available template. 8 Targets span a wide range of structure and sequence similarity in all the CASPs (see Ref. 9 for a more detailed discussion). The distribution for CASP7 is similar, with the notable feature that there are fewer targets at a low level of structure superposability. As in previous CASPs, we derive a one dimensional scale of target difficulty, combining the structure superposability and sequence identity (see Methods). As discussed previously, 9 this difficulty scale has obvious deficiencies. However, so far neither ourselves nor others have suggested a more appropriate alternative. In CASP7, targets were divided on the basis of difficulty a little differently than in previous experiments. Targets with a high degree of sequence and structural similarity to an available template were considered as producing potentially high accuracy models. Targets were assigned to this category on the basis of two criteria. First, to ensure that there was a good template in the DOI /prot PROTEINS 195

3 A. Kryshtafovych et al. PDB at the time predictions were made, structural superpositions had to identify at least one template with an LGA-S score 10 of greater than 80. Second, to ensure that it was possible to construct a good model, it was required that at least one model must give a GDT-TS score (see Methods and next section) of greater than 80. These high accuracy targets and all other targets that could be built with the aid of a domain level template were considered as TBM targets, and remainder, those for which there was no available template, were considered to be Free modeling (FM) targets. (These changes are discussed more fully elsewhere in this issue 2 ). As with the previous divisions, these map approximately to the difficulty scale, with high accuracy targets at the easy end, the rest of the template based targets taking up the bulk of the range, and template free targets at the most difficult end. As before, though, the mapping boundaries are not completely sharp. OVERALL MODEL QUALITY As before, the GDT_TS measure 11 is the key standard for the overall quality. GDT_TS is based on the fraction of residues of a target for which the Ca atoms are within 1, 2, 4, and 8 Å of the corresponding atoms in a model (see Methods for a precise definition). For relatively accurate comparative models (in the High Accuracy regime), almost all residues will likely fall under the 8 Å cutoff, and many will be under 4 Å, so that the 1 and 2 Å thresholds capture most of the variations in model quality. In the most difficult template FM regime, on the other hand, few residues fall under the 1 and 2 Å thresholds, and the larger thresholds capture most of the variation between models. For the bulk of the template based models, all four thresholds will often play a significant role. Template free models are often very approximate and measuring model quality is more difficult. In this CASP, the assessor carried out a careful analysis of the relationship between GDT_TS ranking and that obtained by visual inspection. As in previous CASPs these sometimes differ, but we have not considered these variations in the analysis. For High Accuracy regime models, accuracy improvements are inherently small scale, since templates and targets tend to be quite similar. For this reason, we also considered the finer-grained measure GDT_HA, where the thresholds are 0.5, 1, 2, and 4 Å. Generally, the conclusions are not affected by which measure is used (results not shown), so only GDT_TS results are presented here. Figure 2(A) shows the GDT_TS scores for the best model on each target as a function of target difficulty, for all CASPs, with each point an average over the target itself and the two targets to the left and to the right (if available) on the difficulty scale. Quadratic splines have been fitted through the data for each CASP. Note that throughout the article, best is selected from all submitted models. Predictors may submit up to five models per target, and are asked to rank expected quality. Assessment of template-based models usually included only model 1. At the high accuracy (left) side of the plot, models based on 50% or higher sequence identity have GDT_TS scores around 90 or higher, close to the limit imposed by experimental accuracy. As target difficulty increases, best scores fall off rapidly and steadily. From one point of view, only the models based on very high sequence identity are close to perfect by this criterion. From another, it is very encouraging to note the steady improvement in the trend lines in successive CASPs. In the mid-range of difficulty, scores have about doubled, from around 30 in CASP1 to 60 in CASPs 6 and 7. Between any particular pair of CASPs, improvement has been different in magnitude and in region. Between 6 and 7, the only significant change has been at the hard end of the difficulty scale, where two targets with high quality best models (283 and 350) pull up the curve. Figure 2(B) shows the smoothed average GDT_TS values over the six best models from different groups, rather than the single best. There is a similar improvement at the difficult end of TBM region. ALIGNMENT ACCURACY For template-based models correct alignment of the target sequence onto available template structures is a critical and often demanding step. As in previous analyses, we measure alignment accuracy (AL0) by counting the number of correctly aligned residues in the LGA 5 Å superposition of the modeled and experimental structures of a target. A model residue is considered to be correctly aligned if the Ca atom falls within 3.8 Å of the corresponding atom in the experimental structure, and there is no other experimental structure Ca atom nearer. Figure 3 shows the smoothed alignment accuracy for the best models of each target in all the CASPs. Features here are similar to those in the GDT_TS plot (2A), with alignment accuracy near 100% for the easiest targets, but falling steadily with target difficulty. The spline fits show a steady improvement in alignment accuracy over the CASPs, this time with the gain at the difficult end of TBM, in a manner very similar to that seen with GDT_TS. FEATURES OF MODELS NOT AVAILABLE FROM A SINGLE BEST TEMPLATE An important criterion in establishing the effectiveness of TBM methods is whether or not a model contains features in addition to those that could be simply copied from the best possible template. That is, there is added value over and above the information in the template, and the model is therefore superior to that which an intelligent but naïve modeler might produce. To evaluate that, we con- 196 PROTEINS DOI /prot

Progress from CASP6 to CASP7 Figure 2 GDT_TS scores of submitted models for targets in all CASPs, as a function of target difficulty. Each point represents one target.

4 Progress from CASP6 to CASP7 Figure 2 GDT_TS scores of submitted models for targets in all CASPs, as a function of target difficulty. Each point represents one target. Data are smoothed by averaging over sets of five consecutive targets. Part A shows the scores for the best models on each target, and part B, the average score over the top six models from different groups. Trend lines show a clear though sometimes modest improvement from each successive CASP to the next, for both the best models and the best sets of models. In CASP7, there is apparent progress only at the more difficult end of the target spectrum, where two targets for which outstandingly good models were made (283 and 350) influence the average. DOI /prot PROTEINS 197

A. Kryshtafovych et al. Figure 3 Percentage of residues correctly aligned for the best model of each target in all CASPs, smoothed by averaging over sets of five adjacent targets.

5 A. Kryshtafovych et al. Figure 3 Percentage of residues correctly aligned for the best model of each target in all CASPs, smoothed by averaging over sets of five adjacent targets. Trend lines here follow those in the equivalent GDT_TS plot [Fig. 2(A)] indicating that for many targets, alignment accuracy, together with the fraction of residues that can be aligned to a single template, dominate model quality. In CASP7, the same apparent improvement for the more difficult targets is present. structed a set of models, two for each target (one for each measure considered GDT_TS and AL0), utilizing knowledge of the structure of the best template. It should be noted that the conclusions reached depend to significant extent on how these best template models are built. Here, we begin by finding an optimal templatetarget alignment as follows: We first find all target Ca atoms that are within 3.8 Å of any template Ca atom in the 5 Å LGA sequence independent superposition. Then, we use a dynamic programming procedure that determines the longest alignment between the two structures using these preselected atoms, in such a way that no atom is taken twice and all the atoms in the alignment are in the order of the sequence. The maximum alignability (Smith Waterman alignment score SWALI) is then the fraction of aligned Ca atoms in the target. Then we assign the coordinates of template s backbone residues to the aligned target residues. It is not always the case that the template with the best structural coverage of the target gets the highest score (GDT_TS or AL0) when superimposed onto the target. To ensure that templates yielding the highest score possible were used, for each target, we built template models based on each of the 10 templates with the best structural coverage, and picked the template models with the highest GDT_TS and AL0 score for the subsequent analysis (in many cases, the same model was selected and used for both analyses). These models were then evaluated exactly like the submitted ones. Figure 4(A) shows the difference between the GDT_TS score for the best submitted model and the GDT_TS highest scored template model, for all targets in CASPs 5, 6, and 7. Values greater than zero indicate that overall the submitted model is more accurate than the one obtained from the best template. Spline fits show that the average improvement of best models over template has increased for the template-based modeling targets in successive CASPs. The spline lines are at positive values for the easier targets, dipping below zero for the hardest ones. There are seven targets where the best model scores more than 10% higher in GDT_TS than the best template one. We also investigated the extent of value added features in the best models compared to single template models built using several other methods. All analyses revealed similar trends (results not shown). One of these methods used the Modeller 12 package in a restricted way, starting from the correct structural alignment to the best single 198 PROTEINS DOI /prot

Progress from CASP6 to CASP7 Figure 4 A. Difference in GDT_TS score between the best submitted model for each target, and a naïve model based on knowledge of the best single template.

6 Progress from CASP6 to CASP7 Figure 4 A. Difference in GDT_TS score between the best submitted model for each target, and a naïve model based on knowledge of the best single template. Values greater than zero indicate added value in the best model. For most of the target range, on average, best models do have added value. There are a few cases where this is substantial in excess of 10% improvement in GDT_TS. In the difficult target range, a small number of very poorly modeled targets tend to pull down the trend line, but most models still are more accurate than a naïve model. B. Percentage difference in correctly aligned residues for the best submitted model for each target and for the corresponding naïve model. Similar trends as in Fig. 5A are seen, except that the trend lines are only above zero for the 1/3 or so easiest targets. C. Fraction of targets for which the best submitted models are superior to the best template models in CASPs 5, 6, and 7. For both GDT_TS and AL0, the fraction has increased substantially from CASP5 to CASP6 and from CASP6 to CASP7. In CASP7, almost 80% of the TBM targets have a higher GDT_TS score than a naïve template model. DOI /prot PROTEINS 199

7 A. Kryshtafovych et al. template. The resulting models were better than the simple single template models by an average of around 2% in GDT_TS for the template-based modeling targets. But even with this improvement, the majority of these reference models are still less accurate than the corresponding best models. It should be emphasized that we constrained Modeller to use only one template, and compared the results of a single method with the best models produced by any method this is in no way a test of Modeller performance. Figure 4(B) shows the difference between the fraction of residues that are correctly aligned in the best template model and the fraction correctly aligned in the best submitted model. By this measure too, a large fraction of best models are better than the template and that fraction has been improving over successive CASPs. The falloff in alignment improvement over template with target difficulty is more pronounced than in the GDT_TS plot 4A, but never-the-less, there are some impressive improvements even for difficult targets, including one with 20% more residues aligned in the best model compared with the best template model. Note that Figure 4 compares the alignment accuracy of two models in terms of AL0 one submitted, the other based on knowledge of the template. The alignment evaluation algorithm used in CASP (see Methods) assigns a residue as correctly aligned if it is within 3.8 Å of the corresponding residue in the target structure in the sequence independent LGA superposition and no other experimental structure residue is closer, a demanding requirement. Figure 4(C) summarizes the comparison of the best submitted models with the best template ones, for CASPs 5, 6, and 7. The fraction of all best submitted models with a higher GDT_TS value than the corresponding best single template one has increased from 53% in CASP5, through 59% in CASP6% to 67% in CASP7 (a). If only the TBM targets are considered, the numbers are even more impressive: 66, 71, and 79% for CASPs 5, 6, and 7, respectively (b). A similar picture is apparent for alignment accuracy, with increases for each CASP, and CASP7 with 50% of all best models and 60% of best models for the template-based modeling targets more accurately aligned than the best template models (c and d). Model features not present in the best template may have been derived in three ways: by the identification of features that are present in other available template structures, by refining aligned regions away from a template structure towards the experimental one, or by the use of template FM methods for features not present in the template. Examples of all three processes can be found in the CASP7 results, and are illustrated below using an error difference function along the sequence. For each residue i : De i ¼ e i ðmodelþ e i ðtemplateþ; where e i (model) is the Ca error in the model for residue i, and e i (template) is the distance between the target Ca atom of residue i and the correctly aligned Ca atom from the template. These inter Ca distances are taken from the LGA sequence independent superposition of target and template. Negative De values thus represent regions of a model that are more accurate than could be obtained by simply correctly aligning the single best template. Target 315 is an example of a target for which a superior model was constructed with the use of multiple templates. Figure 5(A) shows the error difference function along the sequence for the best model. Two template/target differences are shown, for the best template, and the next best template. It is apparent that although the best template provides the most accurate basis for a model over-all, there are three regions where it is less accurate than the second choice. In these regions the best model successfully incorporates those parts of the second template. An example of an improvement using FM followed by refinement is provided by the best model of target 283 [Fig. 5(B)]. This is the point at about 120% GDT_TS units in Figure 4(A), and is officially a difficult TBM target. In this case, Figure 5(B) shows that the model is Figure 5 Examples of added value over a naïve template model. In each plot, the blue curve shows the accuracy of the best model, as the difference between it and the target structure for each residue s Ca atom. The thin yellow line shows maximum accuracy of a model built from the best template, as the difference between the best template and the target, and where present, the thin white line shows the difference between a second template and the target. Thick yellow and white lines show the difference between the corresponding templates and the model. Thus, negative regions in these curves are places where the model is more accurate than could achieved with the information from a single template alone. A. Target 315, an example of added value through incorporation of information from a second template to successfully model short loop regions. There are four regions where the model is substantially more accurate than could be obtained from the best template alone (dips in the thick yellow curve), and in each place, the model is close to the second template (thick white curve near to zero). Conversely, several regions where the second template would have provided a poor model (dips in the thick white curve) are built from the first template. B. Target 283, an example of improvement relative to the best available template through use a refinement procedure and starting from a more remote structure. It is clear that for much of the sequence, the model is much better than the best template (thick yellow curve well into negative territory), which was not used as the starting model. Here the alternative template shown (white) is the 10th best, and the model is close to it for a short region in the middle of the structure (thick white line close to zero). C. Target 380, an example of refinement starting from a nearby template. In this case, a refinement procedure has succeeded in improving the accuracy of the model over the template starting point substantially in two regions (very negative regions in the thick yellow curve), one at the N terminus, one near the C terminus. There are also other more minor improvements in many places. D. Target 328, an example of FM methods providing part of structure not available from a template. The thick yellow curve shows that the model is substantially more accurate than the template in the region between residues 30 and 50. The thin yellow curve shows the template is of little use there, and no other template is available for that region. 200 PROTEINS DOI /prot

8 Progress from CASP6 to CASP7 Figure 5 DOI /prot PROTEINS 201

A. Kryshtafovych et al. Figure 6 A. Percentage of server predictions among the best 6 models for three categories of target difficulty in CASPs 5, 6, and 7.

9 A. Kryshtafovych et al. Figure 6 A. Percentage of server predictions among the best 6 models for three categories of target difficulty in CASPs 5, 6, and 7. Relative server performance improved overall, and the most for easy comparative models. CASP7 also shows a substantial improvement in the middle category of difficulty. There are likely too few targets in the most difficult category for trends to be reliable. (For CASPs 5 and 6: CM: comparative models, e : easy, h : hard; For CASP7, TMB_easy, and TBM_Hard are PSI-BLAST and non-psi-blast level TBM targets respectively; FRH: Homologous fold recognition; FRA: analogous fold recognition; NF: new folds). B. Percentage of modeling targets for which at least one server is among the top six best models. Here too, it is evident that relative server performance has improved overall, and most dramatically in the easy comparative modeling category. C. Ratio of the quality of best server models to best human models as a function of target difficulty, for CASPs 5, 6, and 7, as measured by GDT_TS. The spline fit for CASP7 runs very close to unity (equal quality for server and human models) over much of the difficulty range. There are two outstanding cases (T0321_1 and T0356_2) where the best server models have GDT_TS scores more than 25% higher than the corresponding human ones. There are also still cases (for example T0386_2 and T0283) where human input and extra time does greatly improve model accuracy over that provided by any server. D. As for 6C, but for performance of the best six server models compared with the best six human groups on each target. Here it is clear that humans still have the edge, with the spline significantly below unity for all CASPs for much of the difficulty range. Never-the-less, CASP7 performance is improved relative to CASP6. E. GDT_TS scores of best and average over 6 best server models for all targets in CASPs 6 and 7, as a function of target difficulty. Each point represents one target. Lines show best fits for each category. It is clear that there has been progress in server accuracy over the full range of target difficulty. more accurate than the best template for most of the structure. The submitters of this model (Group 20) explain that the refinement was started from a template free model of another member of the family, and that very extensive changes and additional features were introduced by a refinement procedure. This is thus an outstanding example of refinement in CASP7. An example of refinement from a nearby template is provided by target 380 [Fig. 5(C)]. This was a target with a relatively close template available, but never-theless the best model shows a substantial improvement over that template (almost 11% in GDT_TS). In general, large improvements over the template are harder to achieve for easier targets, because the template errors are small to begin with, and this is therefore an exceptional result. It is apparent that the overall improvement over the template is distributed over the full sequence, with two regions of major change, at the N terminus, and near to the C terminus. In all, seven out of nine possible regions have moved closer to the target structure. Finally, target 328 provides an example where FM methods were successfully used to model part of an otherwise template based structure. Figure 5(D) shows the relationship between template and model accuracy along the sequence. Between residues 30 and 50, there is discontinuous information from the template, and no alternative template can be identified. The best model (submitted by group 24) incorporated FM-based steps there and is reasonably accurate. SERVER PERFORMANCE In early CASPs, most predictions were produced by a combination of computer methods and human judgment, in a manner analogous to the process of obtaining a structure experimentally, using crystallography or NMR. One of the most striking developments has been the increasingly effective role played by fully automatic methods: first in providing starting models for human teams, but now increasingly providing models of comparable quality to that the very best human groups can produce. In addition, servers are the only option for high throughput modeling, so their performance is independently important. As in the previous evaluations of progress, 9 we have compared the accuracy of server models to that in earlier rounds, as well as to the corresponding human predictions. There were 53, 50, and 68 server groups submitting tertiary structure models in CASPs 5, 6, and 7, respectively. Figure 6(A) shows the fraction of server models that were among the best 6 overall and for three categories of target difficulty. Strikingly, this fraction increased substantially in CASP7, both over-all and in the two tem- 202 PROTEINS DOI /prot

10 Progress from CASP6 to CASP7 Figure 6 (Continued) DOI /prot PROTEINS 203

11 A. Kryshtafovych et al. Figure 7 Model quality for the best (part A) and averaged over the six best (part B) new fold category targets, for CASPs 6 and 7. For each target, the lowest bars show the number of residues superimposed between model and target to closer than 1 Å, the next bar, the number superimposed to 2 Å, then 4 Å, then8å. The open bars show number of residues superimposed to greater than 8 Å. Greek letters indicate the fold type. By this measure, there is no obvious progress from CASP6 to CASP7. plate modeling categories. For the easy comparative modeling category (BLAST detectable templates), the fraction has increased from about 10% in CASP5, to nearly 20% in CASP6, to over 30% in CASP7. For the harder TBM targets there is also a significant increase in CASP7. The signal is less clear for the template free category, but the small number of targets makes trends harder to reliably identify. Figure 6(B) shows another view of these data, the fraction of targets for which at least one of the best six models came from a server. A very similar pattern of increase is apparent, particularly in CASP7. Figure 6(C) shows the ratio of GDT_TS scores for the best server and best human models for each CASP target, arranged by target difficulty. This ratio has improved in CASP7. Strikingly, there are two CASP7 targets where the best models produced by servers were substantially superior to any human prediction (T0321_1 and T0356_2). More generally, there were 19 targets (18.5%) in CASP7 where servers were at least on par with humans (three or more models in the best 6) and very few targets where the best server model was significantly worse than the best human one. Are humans becoming redundant in protein structure modeling? A possible alternative explanation for the convergence of human server performance is that humans are getting worse, rather than servers getting better. Figure 6(E) shows this is not the case, by means of a comparison of server performance in CASP6 and CASP7 over the full range of target difficulty. It is clear that the accuracy of 204 PROTEINS DOI /prot

Progress from CASP6 to CASP7 Figure 8 Distribution of success in predicting new fold targets for individual groups in CASP6 (part A) and CASP7 (part B), compared with that expected by chance.

12 Progress from CASP6 to CASP7 Figure 8 Distribution of success in predicting new fold targets for individual groups in CASP6 (part A) and CASP7 (part B), compared with that expected by chance. Left hand panels (green bars) show the number of groups submitting predictions for at least 1, 2,... up to the maximum number of targets in each of these CASPs. In the right panels, red bars show the number of groups ranked in the top six for 1 target, two targets, and so on. Yellow bars show the distribution of ranking expected by chance. In CASP6, there are three groups with six top ranked predictions, already very significantly more than random, and also single groups top ranked for 7, 8, 9, and 13 targets. In CASP7, with a smaller number of targets, there are still five groups with rankings counts falling completely outside the random distribution, and in general, observed rankings are shifted substantially away from random. Thus, the trend towards much better than random performance has been maintained, but not obviously strengthened. the best models and the best six models did improve significantly. TEMPLATE FREE METHODS Template free modeling methods must be used when either the target protein or domain has a fold topology not previously known (i.e., it s a so-called new fold ), or possible templates are too remote in sequence to be detected. From a very poor beginning, we have seen substantial progress in this area over the history of CASP, up to about CASP5. At that point, there were a small number of targets (usually less than about 120 residues) for which very impressive and accurate models were submitted, but the quality of the best models for the majority of targets was very much poorer than for targets where templates could be used. 7 In CASP6, there were again a small number of targets with excellent models, and notably these included mixed a/b structures, previously problematic. However, no significant overall increase could be detected. 9 An ongoing issue in this area is the need for better metrics to evaluate very approximate models. This problem is still not solved, so for CASP7, we again use the measures of earlier analyses. Specifically, we consider the four terms that contribute to the GDT_TS measure separately, rather than just that single value. That is, the number of residues which can be superimposed under 1, 2, 4, and 8 Å. There are a total of 19 targets in CASP7, compared with 25 in CASP6. As in the earlier analyses, all domains identified by the assessors are treated as separate targets. Figure 7 shows the results of the GDT threshold analysis for CASPs 6 and 7. Targets are ordered by size and fold type is indicated by the usual Greek letter classifica- DOI /prot PROTEINS 205

13 A. Kryshtafovych et al. tion. The stacked bars show the number of residues super-imposed under the distance thresholds of 1, 2, 4, and 8 Å, i.e., the number of residues for which the largest error is less than or equal to each threshold. The RMS error on such a set is typically about half the threshold, thus substructures meeting the 8 Å threshold would usually be judged excellent by visual inspection. In part (A), the performance in terms of the best models for each target is shown, and in part (B), the performance averaging over the six best models for each target. By this measure, it is very hard to detect any improvement between CASP6 and CASP7. An analysis of the number of residues falling under the GDT4 and GDT8 thresholds for CASPs 6 and 7, as a function of target size also shows little signs of progress (data not shown). As described in the assessment article though, there are a number of targets for which note-worthy models were submitted, 5 similar to that found in CASP6. ANALYSIS OF SUSTAINED PERFORMANCE FOR NEW FOLD TARGETS As in previous analyses, we wish to determine to what extent particular groups are performing consistently, as opposed to groups occasionally getting lucky, and happening to produce the best score for a target. As before, we address that by comparing the distribution of success of individual groups with the distribution of success expected by chance. Success is measured as the number of targets for which a group had a model ranking among the top six. The chance distribution was generated by randomly choosing six groups as the best scoring for each target. The chance distribution was constrained so that only groups predicting on that target were included, and the draw was weighted by the number of models submitted, i.e., a group submitting four models was four times as likely to be selected as one submitting a single model for a particular target. Figure 8 shows these data for the 25 CASP6 targets and 19 CASP7 targets. Also shown is information on how many groups submitted models for different number of targets. The green bars show the number of prediction groups submitting predictions on at least 1, 2, 3,... up to the maximum number of targets in each CASP. The yellow bars show the probability of a group scoring among the top six for one target, two targets, three targets, and so on, if the results were random. The red bars show the number of groups actually falling among the top six for one target, two targets, and so on. The more different this distribution from random, the more significant the results. For CASP7, because of the relatively small number of targets in this category, random expectation is that a large number of groups (47) would have one model among the six best of a single target. Reassuringly, this is not the case, and best performances are distributed over a higher numbers of targets. In particular, there are more groups than expected scoring among the best six models for all entries three and higher, and three groups score well outside the random distribution. These results continue the trend to consistent scoring seen earlier. CONCLUSIONS By the well-established GDT_TS and alignment accuracy measures, progress from CASP6 to CASP7 has been modest at best. However, a closer look reveals two interesting and significant changes. First, the closing of the gap between human prediction groups and automatic servers. As Figure 6 makes clear, server performance has been steadily improving relative to that of the best humans. This is in spite of the fact that most human assisted prediction begins with one or more server generated models, and involves a great deal more effort. Second, and most significantly, comparison of predictor performance with results using a model based on knowledge of a single best template structure is very encouraging. For template based targets, almost 80% of best models have added value over a simple template, as measured by GDT_TS. In a few cases, the improvement is dramatic. As illustrated earlier, improvements over template are being generated by all three possible mechanisms combining information from multiple templates, template FM of regions not present in a template, and refinement. As noted earlier, some caution is needed here because of sensitivity of the results to the exact definition of a template based model, but we tried six different approaches and found the results in every set were consistent, showing the same upward CASP-to-CASP trend. An obvious question is why progress is apparent relative to best template models, and not in absolute terms. The likely explanation is that the use of an intermediate reference state, rather than the absolute GDT_TS value, helps extract the otherwise weak signal. Against these encouraging points must be set the small overall progress picture. The long running analysis of progress by smoothed GDT_TS as a function of target difficulty [Fig. 2(A)] shows only a modest improvement at the more difficult end of the scale. Particularly in template free modeling, the new methods that caused such excitement from about CASP4 onwards seem to have run out of steam. These do produce spectacular successes for a few of the smaller targets, but do not seem to be expandable to a broader set. Still, there is every reason to hope that this energetic and creative community will find new ways to move us forward. METHODS Target difficulty The difficulty of a target is calculated by comparing it with every structure in the appropriate release of the pro- 206 PROTEINS DOI /prot

14 Progress from CASP6 to CASP7 tein databank, using the LGA structure superposition program. 10 For CASP7, templates were taken from the PDB releases accessible before each target deadline. Templates for the previous CASPs are the same as those used in the earlier analyses. 7 For each target, the most similar structure, as determined by LGA, in the appropriate version of the PDB is chosen as the representative template. Similarity between a target structure and a potential template is measured as the number of target-template Ca atom pairs that are within 5 Å in the LGA superposition, irrespective of continuity in the sequence, or sequence relatedness. The 5 Å threshold maintains compatibility with earlier target/template comparisons, 6,13,14 which were made using Prosup 15 software. It is a little larger than we now consider most appropriate (3.8 Å), and there is some times significant superposition score between unrelated structures, particularly for small proteins. Sequence identity is defined as the fraction of structurally aligned residues that are identical, maintaining sequence order. Note that basing sequence identity on structurally equivalent regions will usually yield a higher value than obtained by sequence comparison alone. In cases where several templates display comparable structural similarity to the target (coverage differing by less than 3%), but one has clearly higher sequence similarity (7% or more) the template with the highest sequence identity was selected. There were a total of 11 such instances in CASP7. Difficulty scale We project the data in Figure 1 into one dimension, using the following relationship: Relative Difficulty ¼ðRANK STR ALN + RANK SEQ IDÞ=2; where RANK_STR_ALN is the rank of the target along the horizontal axis of Figure 1 (i.e., ranking by % of the template structure aligned to the target) and RANK_SE- Q_ID is the rank along the vertical axis (ranking by % sequence identity in the structurally aligned regions). GDT_TS The GDT_TS value of a model is determined as follows. A large sample of possible structure super-positions of the model on the corresponding experimental structure is generated by superposing all sets of three, five, and seven consecutive Ca atoms along the backbone (each peptide segment provides one super-position). Each of these initial super-positions is iteratively extended, including all residue pairs under a specified threshold in the next iteration, and continuing until there is no change in included residues. 10 The procedure is carried out using thresholds of 1, 2, 4, and 8 Å, and the final super-position that includes the maximum number of residues is selected for each threshold. Super-imposed residues are not required to be continuous in the sequence, nor is there necessarily any relationship between the sets of residues super-imposed at different thresholds. GDT_TS is then obtained by averaging over the four super-position scores for the different thresholds: GDT TS ¼½1=4Š½N1 þ N2 þ N4 þ N8Š; where Nn is the number of residues superimposed under a distance threshold of n Å. GDT_TS may be thought of as an approximation of the area under the curve of accuracy versus the fraction of the structure included. Different thresholds play different roles in different modeling regimes. ACKNOWLEDGMENT The authors thank Zinoviy Dmytriv for assisting in generating data. REFERENCES 1. Moult J, Fidelis K, Rost B, Hubbard T, Tramontano A. Critical assessment of methods of protein structure prediction (CASP) round 6. Proteins 2005;61(Suppl 7): Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A. Critical assessment of methods of protein structure prediction Round VII. Proteins 2007;69(Suppl 8): Read R, Chavali G. Assessment of CASP7 predictions in the high accuracy template-based modeling category. Proteins 2007;69 (Suppl 8): Kopp K, Bordoli L, Battey JNB, Kiefer F, Schwede T. Assessment of CASP7 predictions for template-based modeling targets. Proteins 2007;69(Suppl 8): Jauch R, Yeo HC, Kolatkar PR, Clarke ND. Assessment of CASP7 structure predictions for template free targets. Proteins 2007;69 (Suppl 8): Venclovas C, Zemla A, Fidelis K, Moult J. Comparison of performance in successive CASP experiments. Proteins 2001;Suppl 5: Venclovas C, Zemla A, Fidelis K, Moult J. Assessment of progress over the CASP experiments. Proteins 2003;53(Suppl 6): Clarke ND, Ezkurdia I, Kopp J, Read RJ, Schwede T, Tress M. Domain definition and target classification for CASP7. Proteins 2007;69(Suppl 8): Kryshtafovych A, Venclovas C, Fidelis K, Moult J. Progress over the first decade of CASP experiments. Proteins 2005;61(Suppl 7): Zemla A. LGA a method for finding 3D similarities in protein structures. Nucleic Acids Res 2003;31: Zemla A, Venclovas C, Moult J, Fidelis K. Processing and evaluation of predictions in CASP4. Proteins 2001;Suppl 5: Fiser A, Sali A. Modeller: generation and refinement of homologybased protein structure models. Methods Enzymol 2003;374: Venclovas C, Zemla A, Fidelis K, Moult J. Some measures of comparative performance in the three CASPs. Proteins 1999;Suppl 3: Lackner P, Koppensteiner WA, Domingues FS, Sippl MJ. Automated large scale evaluation of protein structure predictions. Proteins 1999;37(Suppl 3): Lackner P, Koppensteiner WA, Sippl MJ, Domingues FS. ProSup: a refined tool for protein structure alignment. Protein Eng 2000; 13: DOI /prot PROTEINS 207

ProcessingandEvaluationofPredictionsinCASP4

PROTEINS: Structure, Function, and Genetics Suppl 5:13 21 (2001) DOI 10.1002/prot.10052 ProcessingandEvaluationofPredictionsinCASP4 AdamZemla, 1 ČeslovasVenclovas, 1 JohnMoult, 2 andkrzysztoffidelis 1