Evaluation of the template-based modeling in CASP12

The article describes results of numerical evaluation of CASP12 models submitted on targets for which structural templates could be identified and for which servers produced models of relatively high accuracy. The emphasis is on analysis of details of models, and how well the models compete with experimental structures. Performance of contributing research groups is measured in terms of backbone accuracy, all-atom local geometry, and the ability to estimate local errors in models. Separate analyses for all participating groups and automatic servers were carried out. Compared with the last CASP, two years ago, there have been significant improvements in a number of areas, particularly the accuracy of protein backbone atoms, accuracy of sequence alignment between models and available structures, increased accuracy over that which can be obtained from simple copying of a closest template, and accuracy of modeling of sub-structures not present in the closest template. These advancements are likely associated with more effective strategies to build non-template regions of the targets ab initio, better algorithms to combine information from multiple templates, enhanced refinement methods, and better methods for estimating model accuracy.


| INTRODUCTION
Template-based modeling is currently the most reliable type of protein structure prediction. A typical template-based modeling procedure involves, among others, two major steps: finding proteins of known structure with sequences similar to that of the target, and building 3D models using the detected homologues as structural templates. Since the number of different protein folds is estimated to be limited and fold coverage increases with the growth of the protein structure database, 1 the applicability of template-based modeling is ever growing.
Accuracy of protein models has increased dramatically from the early CASPs (mid-1990s) to the present day. It is now routinely expected that a good structural model can be built for a target sharing > 20% sequence identity with at least one known protein structure, while cases where good models are built at a lower sequence similarity are no longer unusual. In the latest three CASPs, almost all targets (96%) with homology > 20% were modeled to GDT_TS 2 > 50 (usually implying a topologically correct structure 3,4 ), and around half of the targets (51%) were modeled to a high accuracy of GDT_TS > 80. For low homology targets (seq. id. < 20%), contemporary modeling methods (CASP10-12) still generate models of good overall fold accuracy (GDT_TS > 50) for more than half of the targets (56%), including 8% of cases (16 targets) of high accuracy modeling (GDT_TS > 80); back in the 1990s (CASP1-4), only 15% of such targets could be modeled to GDT_TS > 50, and none to GDT_TS > 80.
Typically, submitted models are automatically evaluated at the Prediction Center [5][6][7][8][9] and then the results are interpreted by independent assessors. In recent CASPs, there has been only incremental progress in the template modeling category, and so no assessor was appointed in CASP12 and the analysis has been performed by Prediction Center staff and the CASP organizing committee. In the event, it turned out that there were interesting improvements in CASP12 and these are discussed below.

| Evaluation measures
A wide suite of numerical measures has been used in CASP to assess accuracy of tertiary structures. 5 In this evaluation we chose to use (1) the rigid-body structure superposition measure GDT_HA, 10,11 (2) three all-atom local structure-based measures: LDDT, 12 CADaa, 13 and SphereGrinder (SG), 6 and (3) a measure of the accuracy of local error estimates, ASE. 14 The GDT_HA measure and the local measures have already been used in previous TBM assessments 15,16 and have proved useful. The GDT_HA scores are highly correlated with the GDT_TS scores 2 widely used in CASP and are usually 10-20 points lower for the same models. The ASE measure was previously used to score model accuracy estimates, 14 and is used here to emphasize the importance of predicting atomic level errors. The CASP tertiary structure prediction format (http://predictioncenter.org/casp12/index.cgi?page=format#TS) requires predictors to provide the atomic error estimates in the temperature factor column of the PDB file.
The "ground truth" deviations of atoms in models from their experimental counterparts are calculated from the optimal model-target superposition established by the LGA program. 17 The aforementioned measures (1)-(3) highlight different aspects of model utility (global fit, all-atom local accuracy, and correctness of local error estimates) and are given equal weight in evaluating an overall relative model quality score (see below). For ranking purposes, we first calculate z-scores (a.k.a. standard scores) for each of the measures according to the following procedure. First, z-scores are calculated from the distribution of raw scores for all models submitted on a target. Then, apparent outliers (ie, models that scored two standard deviations or more below the average) are excluded, and the standard scores are re-calculated based on the mean and standard deviation of the outlier-free model set. All models that scored below the average (ie, those with negative z-scores) are assigned z-scores of 0 in order not to over-penalize the groups attempting novel strategies.
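The two-pass z-score procedure described above can be illustrated with a minimal sketch. It is not the Prediction Center's code: the function name `two_pass_z_scores` and the use of the population standard deviation are assumptions made here for illustration, and the input is taken to be the raw scores of all models on one target for one measure.

```python
from statistics import mean, pstdev


def two_pass_z_scores(raw_scores, outlier_cutoff=-2.0):
    """Per-target z-scores with outlier removal and a floor at zero.

    raw_scores: scores of all models submitted on one target (one measure).
    Assumption: population standard deviation; the text does not say which
    estimator is used.
    """
    mu, sigma = mean(raw_scores), pstdev(raw_scores)
    if sigma == 0:
        return [0.0] * len(raw_scores)

    # Pass 1: z-scores over all submitted models.
    first_pass = [(s - mu) / sigma for s in raw_scores]

    # Exclude models two standard deviations or more below the average.
    kept = [s for s, z in zip(raw_scores, first_pass) if z > outlier_cutoff]

    # Pass 2: recompute mean and standard deviation on the outlier-free set.
    mu2, sigma2 = mean(kept), pstdev(kept)
    if sigma2 == 0:
        return [0.0] * len(raw_scores)

    # Negative z-scores are set to 0 so that weak models are not over-penalized.
    return [max(0.0, (s - mu2) / sigma2) for s in raw_scores]


scores = [50, 52, 54, 56, 58, 60, 62, 64, 66, 0]
z = two_pass_z_scores(scores)
```

In this toy example the model scoring 0 is removed as an outlier before the second pass, so it does not depress the mean used to score the remaining models, and it still receives a final z-score of 0.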
If a group did not submit a prediction on a target or it was impossible to calculate an evaluation score (eg, ASE score for predictions without self-estimates of accuracy), the z-scores were also set to zero. The target-based z-scores are then summed for each group (separately for every measure) and combined in the final ranking with the following weights:
Since the measures in the final ranking formula were developed only after CASP7, for evaluation of progress and comparison of CASP12 results with those from previous CASPs we also use the GDT_TS measure, which has been in use in CASP since CASP4. 2
FIGURE 1 GDT_TS scores of the best and median models submitted on the template-based modeling targets (including TBM and TBM/FM domains) in CASP5 and CASPs 11-12. Points represent best models for each target in CASP11 and CASP12. Data are for the "all-group" targets in CASPs 11 and 12, and for all targets in CASP5. The high-scoring outlier for target T0868 pulls the CASP12 trend line (solid black line) up at the hard difficulty end, but even without this outstanding target, the CASP12 trend line (dotted and dashed black line) stays above the CASP5 and CASP11 lines. Specifics of the targets labeled in the graph are discussed in a separate section below.
FIGURE 2 Difference in GDT_TS score between the best submitted model for each target and the corresponding naïve model built by simple copying of the backbone atoms for the aligned residues of the best single template. Values greater than zero indicate added value in the best model. In contrast to CASP11, value was added for every target in CASP12, and in general the increase is greater than in CASP11. Targets T0868 and T0892-D1 are examples where the best models were significantly better than the models built on a single best template.

| Targets
Based on the performance of the best CASP servers and template availability, CASP12 targets were separated into three difficulty categories:

| Predictions
In CASP, predictors are allowed to submit five models per target. In the TBM assessment, the assessors usually limit themselves to assessing only the models designated by predictors as model number 1 (supposedly the best models), 10,11,15,16,[18][19][20][21] and for ranking purposes we followed this practice. For establishing the progress between different CASPs, we took into account all submitted models.
To measure improvement in the overall backbone accuracy of the TBM models, we compared GDT_TS scores of the best models (and median models) submitted on targets of similar prediction difficulty in different CASPs. Target difficulty is defined as a linear combination of best structural template coverage and sequence identity of the target to the best template. The procedure used was similar to that reported in CASP papers describing overall progress in tertiary structure modeling. [22][23][24] Figure 1 shows the backbone accuracy of template-based models for the latest two CASPs, 11 and 12, and for CASP5 (at which time progress from earlier CASPs had plateaued). Trend lines in CASP12 (for both the best and median models) run noticeably higher than the corresponding trend lines for CASP11, indicating improved performance in CASP12. Backbone accuracy of the best CASP12 models is about 10 GDT_TS units better than that of CASP11 models in the medium range of target difficulty and > 15 GDT_TS units higher for the most difficult template-based modeling targets. If judged by the accuracy of median models, CASP12 methods are about 10 GDT_TS units better than CASP11 methods across the full range of target difficulty. Comparing individual data points, it is apparent that only one CASP12 template-based modeling target had no models scoring above GDT_TS = 50 and only six targets had no models scoring above GDT_TS = 60, while in CASP11 these numbers were significantly worse (11 and 16, respectively). There are several outliers at both ends of the accuracy spectrum. We discuss aspects of specific targets in a later section of the article and discuss possible reasons for the improved performance in Conclusions.
FIGURE 3 (A) Percentage of correctly predicted non-template residues, and (B) difference between the percentages of correctly predicted non-template residues and incorrectly predicted template residues. The data are provided for targets with at least 15 residues missing in the best template. A residue is considered as correctly aligned/predicted in the template/model if its Cα error is < 3.8 Å in the optimal LGA superposition. Values greater than zero in panel (B) indicate net gain in the modeling (ie, more correctly predicted residues from those missing in the template than incorrectly predicted residues from those available in the template). The best model for target T0868 (the highest positive outlier marked in panel B) includes a substantial portion of the structure that was not available from the best template and was modeled ab initio.
The higher accuracy of the main chain prediction in CASP12 models is also supported by comparison of the best models with naïve models built by copying the coordinates of the aligned residues from the best available structural template. Figure 2 shows the difference in GDT_TS scores between such models. In CASP12, for the first time all the best template-based models were better than the naïve models built for the same target (all data points being above the ΔGDT_TS = 0 line).
Modeling of the regions not covered by the principal template is often key to correctly characterizing functional differences between the template protein and the target. Figure 3A shows the percentage of non-template residues that are correctly predicted (Cα atom error < 3.8 Ångströms) in the best model, while Figure 3B shows the difference between the percentage of such residues (ie, the data for Figure 3A) and the percentage of incorrectly predicted residues among those that align with the best template where the average net modeling gain was positive (ie, more non-template residues were correctly modeled than template residues misplaced).
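The two quantities plotted in Figure 3 can be sketched as follows. This is an illustrative reading of the definitions, not the evaluation code: the function name `net_modeling_gain`, the input layout, and the choice of denominators (non-template residues for the first percentage, template-aligned residues for the second) are assumptions.

```python
def net_modeling_gain(ca_errors, in_template, cutoff=3.8):
    """Return (percent correct non-template residues, net gain).

    ca_errors:   per-residue C-alpha deviation (in Angstroms) between the
                 model and the target in the optimal superposition.
    in_template: per-residue flags, True if the residue aligns with the
                 best template, False if it is missing from the template.
    Assumption: each percentage is taken over its own residue class.
    """
    non_template = [e for e, t in zip(ca_errors, in_template) if not t]
    template = [e for e, t in zip(ca_errors, in_template) if t]
    if not non_template or not template:
        raise ValueError("both residue classes must be non-empty")

    # Figure 3A: non-template residues placed within the cutoff.
    pct_correct_non_template = (
        100.0 * sum(e < cutoff for e in non_template) / len(non_template)
    )
    # Template-covered residues that the model nevertheless misplaced.
    pct_incorrect_template = (
        100.0 * sum(e >= cutoff for e in template) / len(template)
    )
    # Figure 3B: positive values indicate a net gain over the template.
    return pct_correct_non_template, pct_correct_non_template - pct_incorrect_template


errors = [1.0, 2.0, 5.0, 1.5, 4.5, 0.5, 3.0, 6.0]
covered = [True, True, True, True, False, False, False, False]
pct_a, gain_b = net_modeling_gain(errors, covered)
```

In this toy input, two of the four non-template residues are within 3.8 Å and one of the four template residues is misplaced, so the sketch reports 50% for panel A and a net gain of 25 percentage points for panel B.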
It is well known, and a long-standing problem in template-based modeling, that model accuracy is dominated by alignment accuracy, together with the fraction of residues that can be aligned to the available template. To measure alignment accuracy, we compute the AL0 score, 2,25 representing the number of correctly aligned residues in the LGA 4 Å superposition of the modeled and experimental structures. Figure 4 shows that alignment in CASP12 is significantly better than in previous CASPs, with the average accuracy around 70% for the targets from the middle range of difficulty, compared to around 60% in CASP11 and 50% in CASP5. Note that these numbers are the percentage of all residues, not the percentage of alignable residues. The maximum alignability line for CASP12 shows that the maximum possible values are not much larger and the alignment errors are quite small, about 5% over the whole range of target difficulty. The trend lines in Figures 1 and 4 are similar, confirming the dependence of overall model accuracy on the alignment accuracy.
FIGURE 4 Percentage of correctly aligned residues (AL0) for the best models submitted on the template-based modeling targets (including TBM and TBM/FM domains) in CASP5 and CASPs 11-12, and the maximum percentage of residues that could be aligned using the single best template (ie, maximum alignability) on CASP12 targets as functions of target difficulty. A model residue is considered correctly aligned if the Cα atom falls within 3.8 Å of the corresponding atom in an optimal model-target superposition, and there is no other experimental structure Cα atom nearer. A template residue is considered alignable if there is at least one experimental residue that is within 3.8 Å (in terms of the Cα-Cα distance) in an optimal template-target superposition. The maximum alignability is the percentage of aligned residues in the longest alignment between the best template and the experimental structure built with the dynamic programming procedure in such a way that no alignable residue is taken twice and all residues in the alignment are in the order of the sequence. The data in the graph are provided for the all-group targets in the latest two CASPs and for all targets in CASP5. The maximum alignability line (dotted black line) shows that CASP12 predictions (solid black line) on harder template-based targets exceeded the alignability limit for single templates. Detailed analysis shows that this result is a consequence of the presence of the extraordinarily well modeled target T0868 in the dataset. While this target has maximum alignability of only 63% (marked on the graph), 90% of its residues were correctly aligned in the best model due to ab initio modeling of non-template regions and successful refinement (as discussed below). Removing T0868 from the target set brings the alignment line for CASP12 models (dotted and dashed black line) about 5% below the maximum single-template alignability line in the whole range of target difficulty.
FIGURE 5 ... (D) the evolutionarily related template most often used by the CASP12 predictors (4g6u, red); (E) the highest scoring HHsearch sequence template (2ghz, magenta); (F) the highest scoring LGA structural template (2cw6, yellow). (G) Cα-Cα distances between the target residues and the aligned residues in the best evolutionarily related template (red dotted line), best server model (green), and the overall best model (blue). Lower values indicate closer residues, and thus better modeling. The secondary structure diagram of the target is provided at the bottom of the panel, with the regions shown in panel A marked on the sequence. (H) Position-specific alignment of the best models to the target structure. The models are sorted according to the number of correctly aligned residues. Green shows regions of perfect alignment in the optimal sequence-independent LGA superposition; yellow, residues misaligned by no more than 4 positions along the sequence; red, residues misaligned by 5 or more positions; and white, residues not aligned. Three regions of the target: (1) the second part of helix α1 together with the loop and strand β1, (2) the first part of the second helix before the kink, α2a, and (3) the small C-terminal helix α4 are missing in the templates (D-F), but included in the models (panels B, C). Two other structural fragments, the β2-loop-β3 and the α3 helix, have different orientation in the best templates, but are well placed in the models (green and blue lines run noticeably lower than the red dotted line in panel G). The best model from an expert group (C) shows overall improvement over the best server model (B) due to the successful refinement (blue line runs generally lower than the green line in panel G). In particular, the best expert model (T0868TS330_2, boxed in the top part of panel H) was able to fix the alignment error in the best server model (T0868TS005_1, boxed at the bottom) in the connector (residues 90-96) between the β3 strand (84-89) and the α2a helix (residues 97-106), and to move the regions α1-β1 and α2b toward the native structure.
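The maximum alignability defined in the Figure 4 caption (the longest order-preserving, one-to-one set of template-target residue pairs within 3.8 Å) can be sketched with a standard dynamic programming recurrence. The code below is a minimal illustration under stated assumptions, not the procedure used at the Prediction Center: the function name `max_alignability` is hypothetical, the coordinates are assumed to be Cα positions already superimposed, and the result is normalized by the target length.

```python
import math


def max_alignability(template_ca, target_ca, cutoff=3.8):
    """Percentage of target residues in the longest sequential alignment.

    template_ca, target_ca: lists of (x, y, z) C-alpha coordinates in
    sequence order, already placed in a common frame by superposition.
    A pair (i, j) may be aligned only if the two atoms are within cutoff.
    """
    n, m = len(template_ca), len(target_ca)
    if m == 0:
        raise ValueError("target must contain at least one residue")

    # dp[i][j]: size of the best alignment using the first i template
    # residues and the first j target residues.
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best = max(dp[i - 1][j], dp[i][j - 1])  # skip one residue
            if math.dist(template_ca[i - 1], target_ca[j - 1]) <= cutoff:
                # Pair residues i and j; each residue is used at most once
                # and sequence order is preserved by construction.
                best = max(best, dp[i - 1][j - 1] + 1)
            dp[i][j] = best
    return 100.0 * dp[n][m] / m


target = [(0, 0, 0), (10, 0, 0), (20, 0, 0), (30, 0, 0)]
template = [(1, 0, 0), (11, 0, 0), (100, 0, 0)]
coverage = max_alignability(template, target)
```

The recurrence is the same as for the longest common subsequence, with the equality test replaced by a distance test, which is what enforces both the "no residue taken twice" and the "in the order of the sequence" conditions.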

3.2 | Targets with unusually high or low accuracy for their difficulty range
T0868 (Figure 5A) is the target representing unusually high modeling performance in Figures 1-4. This target is a bacterial CdiA tRNase toxin in complex with its immunity protein CdiI 26 (PDB ID: 5j4a). The success in modeling this target results from the ability of the best server, Baker-Rosettaserver, to (1) recognize the best evolutionarily related template, 4g6u (Figure 5D), which is not the highest-scoring sequence template (Figure 5E) or the highest-scoring structural template (Figure 5F); and (2) accurately model structure fragments not present in the templates, especially the second part of helix α1 together with the loop and the first part of strand β1 (residues 55-65), the α2a helix with the leader (residues 90-106), and the α4 helix (residues 150-156).
Figure 6A shows alignment of the top four templates. It can be seen that the first two templates cover the target deeper into the C-terminal region, while the third and fourth templates are better in the N-terminal region (regions with mostly yellow and orange colors). Combining templates that cover different regions appears to have helped in building more accurate models for this target. As can be inferred from Figure 6B, the best submitted model (TS126_4_2 from the EdaRose group, blue) seems to be built on two different templates, as it closely follows the 3k7a template in the first part of the sequence, and the 2lg1 template in the last part. Conversely, the second-best model (TS287_5 from the Multicom-cluster group, green) closely follows a single template (3k7a) along the whole domain sequence.
The best server model for target T0882 (Figure 7A), a hypothetical domain from the serine/threonine-protein kinase WNK1 (PDB ID: 5g3q), was built by the Baker-Rosettaserver (TS005_2, Figure 7B) using the structure of protein 2v3s as a main template (Figure 7C).
As can be seen from the alignment plot (Figure 7F), the model (blue line) closely follows this template (red) except for the first approximately 15 residues, where the template misses a β-strand. However, the missing strand is present in other high-scoring templates, 2lru (Figure 7D) and 2kt9 (Figure 7E), and Rosetta combined the N-terminal strand from these templates with the rest of the structure from 2v3s (Figure 7F) to build a complete model (personal communication), which was subsequently refined to GDT_TS = 90.8 (Figure 7B).
Besides the targets with unusually high scores for their difficulty range, Figure 1 shows a couple of targets, T0874 and T0875, with unusually low scores. T0875 (LV2A2) is a protein from Ljungan virus and T0874 (HPeV1) is from Human Parechovirus A. The two proteins are related to each other at around 50% sequence identity. Even though the best structural templates (4dot, 4dpz) were found by the servers, the templates appeared to be hard to improve on. The best models scored only 54 and 45 LGA_S points on targets T0874 and T0875, respectively, and these scores are just slightly better than the scores of the best templates (46/42 LGA_S, respectively). Closer inspection reveals that the best templates cover only the N-terminal parts of the targets. This may have caused an additional complication in modeling these targets.

| Group rankings
In this CASP we ranked participating TBM groups based on the accu- ... the all-group rankings (see Figure S1 in the Supporting Information); however, the LEE and LEEab groups drop to #10 and #11 in the cumulative ranking (see Figure 8) because of their low ASE scores (it was brought to our attention that these groups provided estimates of the crystallographic temperature factors in place of the required distance error estimates). Similarly, the Baker group is positioned #6 in the GDT_HA-only ranking and #4 in the ranking according to the CASP11 assessors' formula, but drops to #15 in the cumulative ranking. Conversely, the McGuffin group is ranked #8 in the GDT_HA-only ranking and #7 in the CASP11-style ranking, but climbs to the top of the cumulative ranking due to its relatively high ASE scores.
The ranking of groups that show similar performance across the different measures remains quite stable. For example, the Zhang and VoroMQA_select groups occupy positions #3 and #4 in the GDT_HA-based ranking and #2 and #3 in the cumulative ranking, respectively. On a separate note, Figure 9 also shows that all four of these groups improved the accuracy of their backbone modeling in CASP12 compared to CASP11 (ie, achieved higher GDT_TS scores).
To obtain insight into the methodological advances of the best methods, we checked the CASP12 Methods Abstracts (http://predictioncenter.org/casp12/doc/CASP12_Abstracts.pdf) and also got in touch with the authors. While every group had their own recipe for success (the detailed account of the better performing TBM methods can be found elsewhere in this issue), we can summarize that much of the CASP12 progress in TBM comes from (1) more effective strategies to combine multiple templates or to build regions missing from the templates ab initio, (2) enhanced refinement methods, and (3) better methods for estimating model accuracy (EMA). The first two points in the list are discussed in the examples provided above. Better EMA methods enabled picking more accurate models from decoy sets for all top-performing groups, and in some cases helped identify low reliability regions as candidates for refolding or refinement (see Yang Zhang's article, this issue). 30 While using contact information in free modeling was a success story in the two most recent CASPs, 31,32 the CASP12 TBM data showed no trend toward better performance of template-based methods on targets with deeper alignments.
This general conclusion is confirmed in Yang Zhang's article (this issue), which shows that results of the Zhang-server on TBM targets with or without using contacts are essentially the same.
Inclusion of the ASE accuracy estimate measure in this evaluation was somewhat controversial and down-ranked some groups (eg, Lee and Baker) that were good performers according to the previous CASP ranking schemes. We strongly encourage the prediction community to take advantage of the FORCASP forum (http://predictioncenter.org/forcasp/) to discuss approaches for evaluation of the TBM category before the next experiment starts, and so ensure that the community's views are fully considered.