Accurate prediction of protein structures and interactions using a 3-track neural network

DeepMind presented remarkably accurate predictions at the recent CASP14 protein structure prediction assessment conference. We explored network architectures incorporating related ideas and obtained the best performance with a 3-track network in which information at the 1D sequence level, the 2D distance map level, and the 3D coordinate level is successively transformed and integrated. The 3-track network produces structure predictions with accuracies approaching those of DeepMind in CASP14, enables the rapid solution of challenging X-ray crystallography and cryo-EM structure modeling problems, and provides insights into the functions of proteins of currently unknown structure. The network also enables rapid generation of accurate protein-protein complex models from sequence information alone, short-circuiting traditional approaches that require modeling of individual subunits followed by docking. We make the method available to the scientific community to speed biological research.


Initial Embedding
We describe the input multiple sequence alignment (MSA) as a matrix x ∈ ℝ^{N×L}, where rows correspond to the N sequences in the MSA and columns to the L positions in the aligned sequences. The input MSA is first tokenized to obtain the initial MSA features for further processing. Individual amino acids and the gap character are treated as character-level tokens (21 in total), which are mapped to vectors of size d_msa through an embedding layer. A sinusoidal positional encoding (37) is added to the residues in each sequence so that the network is aware of their positional relationships. Along the sequence dimension, an indicator for the query sequence is added instead of a positional encoding, because an MSA is an unordered set of sequences apart from the query.
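As an illustration, here is a minimal PyTorch sketch of this embedding step. The module and variable names are ours, not those of the released implementation, and d_msa = 384 is taken from the hyper-parameter list below.

```python
# Minimal sketch of the initial MSA embedding: token embedding, sinusoidal
# positional encoding per residue, and a query-sequence indicator.
import math
import torch
import torch.nn as nn

class MSAEmbedding(nn.Module):
    def __init__(self, n_tokens=21, d_msa=384, max_len=5000):
        super().__init__()
        self.token_emb = nn.Embedding(n_tokens, d_msa)
        # Fixed sinusoidal positional encoding over residue positions.
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_msa, 2) * (-math.log(10000.0) / d_msa))
        pe = torch.zeros(max_len, d_msa)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pos_enc", pe)
        # Learned indicator distinguishing the query (row 0) from the other sequences.
        self.query_emb = nn.Embedding(2, d_msa)

    def forward(self, msa):                      # msa: (N, L) integer tokens
        N, L = msa.shape
        x = self.token_emb(msa)                  # (N, L, d_msa)
        x = x + self.pos_enc[:L].unsqueeze(0)    # positional encoding per residue
        is_query = torch.zeros(N, dtype=torch.long, device=msa.device)
        is_query[0] = 1                          # row 0 is the query sequence
        return x + self.query_emb(is_query).unsqueeze(1)

emb = MSAEmbedding()
msa = torch.randint(0, 21, (8, 100))             # toy MSA: 8 sequences, 100 residues
print(emb(msa).shape)                            # torch.Size([8, 100, 384])
```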
Template information is used to generate the initial pair features: pairwise distances and orientations are extracted from template structures for the aligned positions, together with 1D features (positional similarity and alignment confidence scores) and scalar features (HHsearch probability, sequence similarity, and sequence identity) provided by HHsearch (13). The 1D and scalar features are concatenated to the 2D inputs by tiling them along both axes. Templates are first processed independently by one round of axial attention (row-wise attention followed by column-wise attention) (38) and then merged into a single 2D feature matrix using a pixel-wise attention mechanism. This processed feature matrix is concatenated with the 2D-tiled query sequence embedding and projected to the hidden dimension (d_pair) for the pair features. A 2D sinusoidal positional encoding is also added.
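The tiling step can be illustrated as follows; the feature sizes here are placeholders, not the actual template feature dimensions.

```python
# Sketch of tiling 1D template features along both axes so they can be
# concatenated to the 2D pairwise features.
import torch

L, d1, d2 = 100, 8, 61
feat_1d = torch.randn(L, d1)                   # per-residue template features
feat_2d = torch.randn(L, L, d2)                # pairwise distances/orientations
tiled = torch.cat([feat_2d,
                   feat_1d.unsqueeze(1).expand(L, L, d1),   # tile along axis 1
                   feat_1d.unsqueeze(0).expand(L, L, d1)],  # tile along axis 0
                  dim=-1)
print(tiled.shape)                             # torch.Size([100, 100, 77])
```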

Processing MSA features via self-attention
After embedding the input MSA as described in the previous section, each MSA update step takes ℝ^{N×L×d} features as input and output. The MSA features are processed with the axial attention approach (38), which alternates attention over the rows and columns of the 2D features. To reduce memory usage, we used the Performer architecture (39) for the column attention (attention over the sequence dimension), which reduces the memory requirement from O(LN²) to O(LN). We first compared this MSA encoder together with a coevolution extractor (described in the next section) to an architecture based on hand-crafted features (sequence profiles and inverse covariance matrices). As shown in Table S1 (architecture 1 vs 2), a learnable MSA encoder slightly improves distance and orientation prediction (Δloss=-0.07) as well as top L long-range contact accuracy (Δaccuracy=2%p).
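A hedged sketch of one axial-attention MSA update follows. For clarity it uses full softmax attention in both directions; as noted above, the actual network replaces the column attention with Performer's linear attention to cut memory from O(LN²) to O(LN).

```python
# Axial attention over an MSA: row attention over residues, then column
# attention over sequences, each with a residual connection.
import torch
import torch.nn as nn

class AxialMSAAttention(nn.Module):
    def __init__(self, d=64, heads=4):
        super().__init__()
        self.row_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.col_attn = nn.MultiheadAttention(d, heads, batch_first=True)

    def forward(self, x):                 # x: (N, L, d) MSA features
        # Row attention: each sequence attends over its residue positions.
        r, _ = self.row_attn(x, x, x)
        x = x + r
        # Column attention: each position attends over the sequences in the MSA.
        xc = x.transpose(0, 1)            # (L, N, d)
        c, _ = self.col_attn(xc, xc, xc)
        return x + c.transpose(0, 1)

x = torch.randn(8, 100, 64)
print(AxialMSAAttention()(x).shape)       # torch.Size([8, 100, 64])
```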
For the row attention (attention over the residue dimension), we tested two different attention methods: 1) untied attention and 2) softly tied attention inspired by the MSA Transformer architecture (40). In the MSA Transformer, tied attention for residue-wise attention was first introduced because homologous sequences in the MSA should share similar structures. Here, we modified this tied attention idea to reduce contributions from unaligned regions by introducing a learned position-wise weight factor (see Algorithm 1) to combine attention signals from the sequences in the MSA. We define the soft-tied attention as in Eq. (1), where N is the number of sequences in the MSA, Q_n and K_n are the query and key matrices for the n-th sequence, and W_n is the position-wise weight factor for the corresponding sequence.
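Eq. (1) itself is not reproduced here; the sketch below shows one plausible form of softly tied row attention consistent with the description. How W_n is produced (a learned projection followed by a softmax over sequences) and the exact normalization are our assumptions.

```python
# Softly tied row attention: a single attention map shared by all sequences,
# built from a position-wise weighted sum of per-sequence query-key logits.
import torch
import torch.nn as nn

class SoftTiedRowAttention(nn.Module):
    def __init__(self, d=64):
        super().__init__()
        self.to_q = nn.Linear(d, d, bias=False)
        self.to_k = nn.Linear(d, d, bias=False)
        self.to_v = nn.Linear(d, d, bias=False)
        self.to_w = nn.Linear(d, 1)        # position-wise sequence weight factor W_n
        self.scale = d ** -0.5

    def forward(self, x):                  # x: (N, L, d)
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # Softmax over sequences gives each position a weight per sequence.
        w = torch.softmax(self.to_w(x), dim=0)                    # (N, L, 1)
        # Tied map: weighted sum of per-sequence logits, softmax over keys.
        logits = torch.einsum('nid,njd->ij', w * q, k) * self.scale
        attn = torch.softmax(logits, dim=-1)                      # (L, L)
        return torch.einsum('ij,njd->nid', attn, v)               # applied per sequence

x = torch.randn(8, 100, 64)
print(SoftTiedRowAttention()(x).shape)     # torch.Size([8, 100, 64])
```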

Update pair features with coevolution signal derived from MSA features
To extract residue pairwise interaction information from the MSA features, we adopted the outer product and aggregation idea from the CopulaNet method (41). The outer product captures the correlation between two residues in each sequence, and aggregating the signals from all sequences in the MSA measures the strength of covariation. In the simplest case of a one-hot encoded embedding, for example, we get a 21×21 substitution table for each pair of positions (including gaps). Averaging the substitution tables over all sequences yields 21×21 features whose distributions differ depending on whether the two residues interact in 3D space. Broadly distributed 21×21 features indicate random, uncorrelated mutations, meaning the two residues are less likely to be in contact in 3D space. Conversely, if the aggregated features have sharp distributions (indicating correlated mutations), the residues have a higher chance of interacting directly. In practice, the MSA embeddings learned through the network are used instead of one-hot encodings.
As outer products can require a huge amount of memory (O(d²)), the MSA embeddings are first projected down to a smaller hidden dimension (32 features in this case) to reduce the memory requirements. After taking the outer product of the embeddings derived from each sequence in the MSA for every pair of residues, the network computes weighted averages of the outer products over all sequences using position-wise sequence weights. These aggregated coevolution features are then combined with 1D features (a weighted average of the MSA features) and the residue-wise attention maps from the previous MSA update, and projected down to match the hidden dimension of the pair features.
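A minimal sketch of the outer-product step is shown below. For simplicity it uses uniform sequence weights rather than learned position-wise weights, and it omits the 1D features and attention maps that the full update concatenates; dimensions follow the hyper-parameter list below.

```python
# Coevolution features via a (uniformly weighted) outer-product mean:
# project MSA embeddings down, take per-sequence outer products for every
# residue pair, average over sequences, then project to the pair dimension.
import torch
import torch.nn as nn

d_msa, d_hid, d_pair = 384, 32, 288
proj_down = nn.Linear(d_msa, d_hid)
proj_pair = nn.Linear(d_hid * d_hid, d_pair)

def coevolution_features(msa_feats):          # (N, L, d_msa)
    N, L, _ = msa_feats.shape
    a = proj_down(msa_feats)                  # (N, L, d_hid)
    # Outer product over feature dims for every residue pair, averaged over
    # sequences: (L, L, d_hid, d_hid).
    op = torch.einsum('nic,njd->ijcd', a, a) / N
    return proj_pair(op.reshape(L, L, -1))    # (L, L, d_pair)

feats = coevolution_features(torch.randn(8, 100, d_msa))
print(feats.shape)                             # torch.Size([100, 100, 288])
```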
To combine the newly extracted pair features with the previous pair features, we tested two approaches: 1) adding the two sets of pair features followed by a feed-forward network, and 2) concatenating the two sets of pair features followed by a single residual block of a 2D convolutional network. As shown in Table S1 (architecture 5 vs 6), feature concatenation with 2D convolution clearly performed better, and we used this approach, outlined in Fig. S1B, for our final model.
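The better-performing merge can be sketched as follows; the ELU activation and the exact layer layout are our assumptions.

```python
# Pair-feature merge: concatenate new and previous pair features, project,
# then apply one residual block of 3x3 2D convolutions.
import torch
import torch.nn as nn

class PairMerge(nn.Module):
    def __init__(self, d_pair=288):
        super().__init__()
        self.proj = nn.Conv2d(2 * d_pair, d_pair, kernel_size=1)
        self.res = nn.Sequential(
            nn.Conv2d(d_pair, d_pair, kernel_size=3, padding=1),
            nn.ELU(),
            nn.Conv2d(d_pair, d_pair, kernel_size=3, padding=1),
        )

    def forward(self, pair_prev, pair_new):    # each (L, L, d_pair)
        x = torch.cat([pair_prev, pair_new], dim=-1).permute(2, 0, 1).unsqueeze(0)
        x = self.proj(x)
        x = x + self.res(x)                    # residual 2D convolution block
        return x.squeeze(0).permute(1, 2, 0)   # back to (L, L, d_pair)

p = torch.randn(100, 100, 288)
print(PairMerge()(p, p).shape)                 # torch.Size([100, 100, 288])
```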

Refine pair features via row and column-wise self-attention
The updated pair features based on coevolution signals from the MSA are further refined by axial attention (38) as shown in Fig. S1C. Using axial attention instead of 2D convolution gave a clear improvement in inter-residue geometry prediction (Δloss=-0.35) with an additional gain in contact accuracy (Δaccuracy=2%p), even with a single-track architecture with sequential MSA and pair feature processing (Table S1, architecture 2 vs 3). This recapitulates one of DeepMind's observations: the attention mechanism is better suited to protein structure prediction because it can directly learn relationships between residues that are distant in sequence.
In addition to the axial attention, we used the Performer architecture (39) for the attention algorithm to further reduce memory usage, so that a larger architecture could fit on the GPU for experiments, as larger architectures showed better performance (Table S1, architecture 7 vs 8).

Update MSA features based on structure information encoded in pair features
The most distinctive feature of the AlphaFold2 architecture is that MSA features are updated based on pair features. We experimented with two ways to update the MSA given pair features: 1) cross-attention (encoder-decoder attention) (37) between MSA and pair features, treating the pair-to-MSA update as an encode-and-decode process, and 2) applying attention maps derived from the pair features to the MSA features (called direct-attention here), so that MSA features are updated by attending to positions that are close in 3D space as encoded in the pair features. As shown in Table S1 (architecture 4 vs 5), direct-attention performed clearly better (Δloss=-0.4, Δcontact accuracy=4%p). The attention maps derived from the pair features agreed well with the true contact map (Fig. S12, panels A and C). The final architecture based on direct-attention is outlined in Fig. S1D.
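A sketch of the direct-attention idea follows: attention maps are computed from the pair features alone and the same maps are applied to every sequence in the MSA. The head count and projections are placeholders taken from the hyper-parameter list below.

```python
# Direct-attention MSA update: per-head attention logits come from a linear
# projection of the pair features; values come from the MSA features.
import torch
import torch.nn as nn

class PairToMSA(nn.Module):
    def __init__(self, d_pair=288, d_msa=384, heads=4):
        super().__init__()
        self.heads = heads
        self.to_attn = nn.Linear(d_pair, heads)   # one logit per head per (i, j)
        self.to_v = nn.Linear(d_msa, d_msa)
        self.out = nn.Linear(d_msa, d_msa)

    def forward(self, msa, pair):     # msa: (N, L, d_msa), pair: (L, L, d_pair)
        N, L, d = msa.shape
        attn = torch.softmax(self.to_attn(pair), dim=1)       # (L, L, heads)
        v = self.to_v(msa).reshape(N, L, self.heads, -1)      # split heads
        out = torch.einsum('ijh,njhc->nihc', attn, v)         # same map for all sequences
        return msa + self.out(out.reshape(N, L, d))

msa, pair = torch.randn(8, 100, 384), torch.randn(100, 100, 288)
print(PairToMSA()(msa, pair).shape)   # torch.Size([8, 100, 384])
```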

Initial 3D structure prediction
We employed a Graph Transformer-based architecture (42) (shown in Fig. S1E) to generate initial backbone coordinates for the 3D track (structure track). The input is defined as a fully connected graph with nodes representing the residues in the protein. The node and edge embeddings are learned from the averaged MSA features combined with a one-hot encoded query sequence, and from the pair features along with the sequence separation, respectively. The backbone coordinates are estimated using a stack of four Graph Transformer layers followed by a simple linear transformation that predicts the Cartesian coordinates of the N, Cɑ, and C atoms for each residue node. An SE(3)-Transformer (6) is then employed to refine the given 3D coordinates based on the updated MSA and pair features in the 3-track model (Fig. S1F). The protein graph is defined with nodes representing Cɑ atoms, and each node is connected to its K nearest neighbors. The positions of the N and C atoms are encoded by including their displacement vectors to the corresponding Cɑ atoms as degree-1 node features (vector node features). The node embeddings derived from the averaged MSA features and the one-hot encoded query sequence are used as degree-0 node features (scalar node features). Pair features corresponding to the graph edges are also included as input features for the SE(3)-Transformer. The SE(3)-Transformer predicts shifts of the Cɑ atoms and new displacement vectors for the N and C atoms relative to the updated Cɑ positions. It also produces degree-0 node features (called state features here) that are used to calculate attention maps for the structure-based MSA updates described in the next section.
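Below is a sketch of how the SE(3)-Transformer inputs described here might be assembled: scalar degree-0 node features, N/C displacement vectors as degree-1 features, and K-nearest-neighbor edges. K and the feature sizes are placeholders, and the SE(3)-Transformer layers themselves (6) are not reproduced.

```python
# Assemble graph inputs for an SE(3)-equivariant refinement step.
import torch

def build_graph_inputs(ca, n, c, node_scalar, K=8):
    # ca, n, c: (L, 3) backbone coordinates; node_scalar: (L, d0)
    dist = torch.cdist(ca, ca)                             # (L, L) Cα distances
    knn = dist.topk(K + 1, largest=False).indices[:, 1:]   # K neighbors, self excluded
    # Degree-1 (vector) node features: N and C displacements from Cα.
    vec = torch.stack([n - ca, c - ca], dim=1)             # (L, 2, 3)
    return node_scalar, vec, knn

L = 100
ca, n, c = torch.randn(L, 3), torch.randn(L, 3), torch.randn(L, 3)
scalar, vec, knn = build_graph_inputs(ca, n, c, torch.randn(L, 64))
print(scalar.shape, vec.shape, knn.shape)
# torch.Size([100, 64]) torch.Size([100, 2, 3]) torch.Size([100, 8])
```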

Update MSA features based on a 3D structure
Similar to the MSA update based on pair features in the 2-track model, MSA features are updated based on attention maps derived from the current 3D structure. Four attention maps are calculated from the state features, and they are masked based on Cɑ distances with four different cutoffs (8, 12, 16, and 20 Å) so that each map only attends to neighbors in 3D space. The same attention maps are applied to all sequences in the MSA. A pointwise feed-forward layer further processes the outputs of the masked multi-head attention. The entire process is outlined in Fig. S1G.
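A sketch of this distance-masked attention is given below; how the logits are derived from the state features is an assumption, and the feed-forward layer is omitted.

```python
# Structure-based MSA update: attention logits from state features are
# masked at several Cα-distance cutoffs, and the same maps are applied to
# every sequence in the MSA.
import torch

def masked_structure_attention(state, msa_v, ca, cutoffs=(8.0, 12.0, 16.0, 20.0)):
    # state: (L, d) per-residue state features; msa_v: (N, L, c) values; ca: (L, 3)
    d = state.shape[-1]
    logits = state @ state.T / d ** 0.5               # (L, L) shared logits
    dist = torch.cdist(ca, ca)                        # (L, L) Cα distances
    outs = []
    for cut in cutoffs:                               # one masked map per cutoff
        masked = logits.masked_fill(dist > cut, float('-inf'))
        attn = torch.softmax(masked, dim=-1)          # diagonal always unmasked
        outs.append(torch.einsum('ij,njc->nic', attn, msa_v))
    return torch.cat(outs, dim=-1)                    # (N, L, 4c)

out = masked_structure_attention(torch.randn(100, 32),
                                 torch.randn(8, 100, 16),
                                 torch.randn(100, 3) * 10)
print(out.shape)                                      # torch.Size([8, 100, 64])
```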

Definition of 2-track and 3-track feature processing blocks
We define a 2-track block as the four arrows in Fig. 1A (orange box). It first updates the MSA features through self-attention, extracts coevolution features from the MSA, and combines them with the previous pair features. The pair features are further refined by axial attention, and the MSA features are then updated based on the structural information encoded in the current pair features. For the 3-track blocks (blue box in Fig. 1A), we found that the order of communication between the tracks is important. We experimented with two ways of communicating between the 1D, 2D, and 3D tracks: updating structures before versus after synchronizing the MSA and pair features, as shown in Fig. S13. Updating the 3D coordinates based on synchronized MSA and pair features performed clearly better (Table S1, architecture 10 vs 11).

Residue pairwise distance and orientation prediction
The inter-residue geometry representations (shown in Fig. S14) are predicted through a single residual block consisting of two 2D convolution layers with 3×3 filters, followed by a convolution with 1×1 filters and a softmax activation. Since the maps for Cᵦ-Cᵦ distances and the dihedral angles along the pseudo Cᵦ-Cᵦ bonds are symmetric, we enforce symmetry in the network by using the average of the transposed and untransposed feature maps as input for these predictions.
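The symmetrization can be sketched as follows; the single 1×1 head stands in for the full residual block, and the bin count is a placeholder.

```python
# Enforce symmetry for the Cβ-Cβ distance and ω-dihedral heads by averaging
# the pair-feature map with its transpose before prediction.
import torch
import torch.nn as nn

d_pair, n_bins = 288, 37                                # bin count is an assumption
head = nn.Conv2d(d_pair, n_bins, kernel_size=1)

def predict_symmetric(pair):                            # pair: (L, L, d_pair)
    sym = 0.5 * (pair + pair.transpose(0, 1))           # average with its transpose
    logits = head(sym.permute(2, 0, 1).unsqueeze(0))    # (1, n_bins, L, L)
    return torch.softmax(logits, dim=1)                 # per-bin probabilities

probs = predict_symmetric(torch.randn(100, 100, d_pair))
print(probs.shape)                                      # torch.Size([1, 37, 100, 100])
```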
Additional structure module for iterative refinement through the network
Although structures are explicitly sampled in the 3-track blocks, an additional structure module is introduced to build a model from the combined 1D features and 2D inter-residue geometry predictions when inference is run over multiple discontinuous crops. Initial coordinates for the backbone N, Cɑ, and C atoms are generated using the simple graph-based architecture (see the Initial 3D structure prediction section above) with node and edge features derived from the averaged MSA features and the 2D distance and orientation distributions. These coordinates are further refined with multiple SE(3)-Transformer layers (6) that take the same node and edge features used to generate the initial coordinates. After the final SE(3)-Transformer layer, the residue-wise Cɑ-lDDT (43) is also estimated from that layer's degree-0 features.
We did not use any iteration during training; the parameters were optimized in a single pass through the network. However, we found that this structure module can serve as an iterative refinement tool at inference time by feeding the output coordinates of the final SE(3)-Transformer layer back to the first SE(3) layer as input (Fig. S15). The predicted Cɑ-lDDT is used as a scoring function to decide when to stop iterating and to select the final model from all sampled structures.

Comparison between 2-track end-to-end model and 3-track model
AlphaFold2 passed information from the 2-track trunk into a 3D equivariant network operating directly on 3D coordinates. AlphaFold2 also employed end-to-end training, updating all model parameters by backpropagation from a loss function computed on the 3D coordinates produced after many SE(3)-equivariant layers. As an experiment, we built a model with SE(3)-Transformer layers on top of the graph-based initial coordinate generation following the 2-track model. We found that adding SE(3)-Transformer layers improved the accuracy of the structures generated by the simple graph-based network (Fig. S16), but this 2-track end-to-end model was not as good as the 3-track end-to-end model (Table S1, architecture 9 vs 12).

Training details
The extended trRosetta training set (22,922 clusters at a 30% sequence identity cutoff, covering 208,659 protein chains released in the PDB as of 02/17/2020) was used to train RoseTTAFold. We cycled through all sequence clusters every training epoch, picking a random protein chain from each cluster. For each selected protein chain, a subsampled MSA (with at most N×L = 2¹⁴ tokens) and up to 10 randomly selected templates were used to augment the training data. During training, protein chains longer than 260 residues were cropped to fit into GPU memory.
The loss function used to train the model consists of 1) a distance and orientation prediction loss (cross-entropy) with 0.5 Å and 10° bins, 2) coordinate and distance RMSDs of the predicted coordinates, and 3) the mean squared error of the predicted Cɑ-lDDT score. During training, the weights for the coordinate and distance RMSD terms were ramped up from 0.05 to 0.2; the weights of the other loss terms were set to 1.0.
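A minimal sketch of this weighting scheme is shown below; the loss terms are stand-ins for the real prediction heads, and the length of the ramp schedule is an assumption.

```python
# Combined training loss with a linearly ramped weight on the RMSD terms.
import torch

def total_loss(geom_ce, coord_rmsd, dist_rmsd, lddt_mse, step, ramp_steps=10000):
    # Weight on the RMSD terms ramps from 0.05 to 0.2 over training;
    # ramp_steps is an assumption, not the published schedule.
    w = 0.05 + (0.2 - 0.05) * min(step / ramp_steps, 1.0)
    return 1.0 * geom_ce + w * (coord_rmsd + dist_rmsd) + 1.0 * lddt_mse

print(total_loss(torch.tensor(2.1), torch.tensor(8.0),
                 torch.tensor(5.0), torch.tensor(0.02), step=5000))
```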
We trained a 130M-parameter model with eight 2-track blocks and five 3-track blocks. Using eight 32GB V100 GPUs, training for 200 epochs took about four weeks. The following hyper-parameters were used:
• MSA, pair, and template embedding sizes: 384, 288, and 64, respectively
• Number of attention heads for self-attention on MSA, pair, and template features: 12, 8, and 4
• Number of attention heads for MSA updates based on pair features: 4
• Size of node and edge features for initial coordinate generation: 64
• Number of attention heads for initial coordinate generation: 4
• Size of input node and edge features for the SE(3)-Transformer: 32
• SE(3)-Transformer architecture: 2 layers with 16 channels, 4 attention heads, and representation degrees up to 1 (l=0 and l=1 features)
• Number of closest residues defining the graph for the SE(3)-Transformer: 128 for the first two 3-track blocks, 64 for the last three
• Learning rate: 0.0005 with linear decay after 16,000 warm-up steps
• Effective batch size: 64 (8 V100 GPUs, one training example per GPU, 8 gradient accumulation steps)
• Weight decay: 0.0001

RoseTTAFold modeling pipeline
We built a fully automated modeling pipeline based on RoseTTAFold. It first searches iteratively for homologous sequences in the UniRef30 (44) and BFD (45) sequence databases using HHblits (13). The E-value cutoff for the sequence search is gradually relaxed until the resulting MSA contains at least 2,000 sequences with 75% coverage or 5,000 sequences with 50% coverage (both at a 90% sequence identity cutoff). The resulting MSA is then used to perform a template search against the PDB100 database with HHsearch (13).
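A minimal sketch of this iterative search loop follows, assuming a local HHblits installation. The E-value schedule, database name, and coverage bookkeeping here are illustrative; the released pipeline scripts should be consulted for the exact procedure.

```python
# Relax the HHblits E-value cutoff until the MSA is deep enough.
import subprocess

def count_seqs(a3m_path, min_cov):
    """Count sequences whose aligned (non-gap) fraction exceeds min_cov."""
    seqs = [l.strip() for l in open(a3m_path) if l.strip() and not l.startswith('>')]
    qlen = len(seqs[0])                        # query row defines alignment length
    return sum(1 for s in seqs
               if sum(c != '-' for c in s if not c.islower()) / qlen >= min_cov)

def build_msa(query_fa, db='UniRef30_2020_06'):
    out = None
    for e_cut in ('1e-30', '1e-10', '1e-6', '1e-3'):   # gradually relaxed cutoffs
        out = f'msa_{e_cut}.a3m'
        subprocess.run(['hhblits', '-i', query_fa, '-d', db,
                        '-oa3m', out, '-e', e_cut, '-n', '4'], check=True)
        # Stop once the MSA is deep enough at either coverage threshold.
        if count_seqs(out, 0.75) >= 2000 or count_seqs(out, 0.50) >= 5000:
            return out
    return out
```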
With the MSA and the top 10 templates as input, the RoseTTAFold network predicts inter-residue geometries (probability distributions of the 6D coordinates described in Fig. S14) for many 300×300 discontinuous crops (150 residues per segment) and combines them by taking weighted averages based on predicted Cɑ-lDDT values. We used two strategies to generate the final structure model from this combined 6D coordinate distribution: 1) gradient-based folding with a pyRosetta (5) script and 2) the structure module based on the SE(3)-Transformer architecture described above (see the Additional structure module for iterative refinement through the network section). The first method does not require a large-memory GPU, since it only predicts 300×300 patches of 6D coordinate distributions and yields a full-atom model at the end, but it needs more CPU cores and time to run multiple trajectories (15 in total) of gradient-based folding from scratch. The second method models backbone coordinates much faster than gradient-based folding (at a similar accuracy level), but it requires a large-memory GPU (e.g., TITAN RTX) for proteins with more than 400 residues.
For the pyRosetta-based modeling protocol, five of the 15 sampled structures are selected after clustering, based on the predicted lDDT from DeepAccNet (12). The Cɑ RMS error is estimated by converting the predicted non-local Cɑ-lDDT (considering only residue pairs with sequence separation > 12) using Eq. (2); for example, a predicted lDDT of 0.8 corresponds to an error of about 1.0 Å.

err = 1.5 × e^{4 × (0.7 − lDDT)}    Eq. (2)

This pyRosetta-based protocol is implemented in the Robetta server. Both the pyRosetta and end-to-end versions are available at https://github.com/RosettaCommons/RoseTTAFold. The following tutorial shows how to install and run the RoseTTAFold method.
Tutorial. How to install and use the RoseTTAFold method to predict protein structures

Installation
1. Clone the package:
   git clone https://github.com/RosettaCommons/RoseTTAFold
   cd RoseTTAFold
2. Create conda environments using the RoseTTAFold-linux.yml and folding-linux.yml files. The latter is required only to run the pyRosetta version (run_pyrosetta_ver.sh).

Expected outputs
For the pyRosetta version, users will get five final models with the estimated Cɑ RMS error in the B-factor column (model/model_[1-5].crderr.pdb). For the end-to-end version, there will be a single PDB output with the estimated residue-wise Cɑ-lDDT in the B-factor column (t000_.e2e.pdb).

Structure of glycine N-acyltransferase
The structure of glycine N-acyltransferase (GLYAT) from Bos taurus had evaded numerous attempts at solution, despite the availability of excellent data from three crystal forms. Structures of homologues were found using HHpred (46), which revealed that the only known structures were from distant relatives, almost all with low coverage of the target. Only 3 homologues (including the top hit) had greater than 60% coverage; these were only 12% identical in sequence. The top 5 hits were prepared for molecular replacement trials by pruning non-conserved side chains and loops using phenix.sculptor (47). In addition, an ensemble model was prepared by superimposing the individual homologues in phenix.ensembler (48) and trimming parts of the ensemble that are poorly conserved to leave a small conserved core. Molecular replacement trials with Phaser (49), MoRDa (50) and I-TASSER-MR (51) on all three crystal forms, using individual models, ensemble models and domain models, failed to yield any convincing results. Models made with trRosetta (3) also failed in MR calculations with Phaser.
In contrast, molecular replacement was straightforward for all three crystal forms when using the RoseTTAFold models, whether as individual models or trimmed ensembles. An estimate of the effective RMS error is required to calibrate the likelihood target, and a value of 1.2 Å was used for these models.
A post mortem analysis was carried out to verify that model quality was the limiting factor for molecular replacement with models derived from the PDB. This analysis concentrated on a tetragonal crystal form, which diffracts to 1.5 Å resolution and has a single copy in the asymmetric unit. The other two crystal forms each have two copies of the protein in the asymmetric unit.
In the likelihood-based molecular replacement algorithm implemented in Phaser, the log-likelihood-gain (LLG) score is an excellent predictor of success. If LLG scores of 60 or more are achieved in placing a single copy, the solution is almost always correct (52). In contrast, scores below 30 are more likely to correspond to random incorrect placements. By correctly positioning a molecular replacement model and carrying out a rigid-body refinement in Phaser, we can evaluate the score that could have been achieved in the search. This calculation shows that none of the available models came close to providing sufficient signal to solve the structure, giving LLG scores of only 7 to 11 when correctly placed. The best model (with a score of 11) was the top hit in HHpred, PDB entry 1sqh. A full molecular replacement search with this model yielded a top LLG score of 22 for incorrect placement. The high quality of the RoseTTAFold model, especially compared to the model derived from 1sqh, can be seen in Fig. S5A. For this figure, the experimental structure is illustrated using chain A from the current model of the hexagonal crystal form, in which the poorly ordered loop is most clearly defined. Table S2 summarizes the refinement statistics for this structure, as well as other crystal structures discussed below.

Value added by coordinate error estimates for GLYAT structure determination
LLG scores obtained with the RoseTTAFold models were compared either ignoring the RMS error estimates provided for each amino acid or using them to weight each atom's contribution by assigning a B-factor equal to (8π²/3) × (RMS error)² (53). Before applying the B-factor weighting, the LLG scores for the 5 alternative models ranged from 88 to 148; after applying the weighting, they ranged from 117 to 188. Similarly, the LLG score for the trimmed ensemble model increased from 191 to 244. In a more marginal case, such weighting could well be pivotal to success. Fig. S5A illustrates the correlation between predicted and actual errors in the RoseTTAFold model, especially in the poorly ordered loop, which has the highest predicted errors.

Structure of a bacterial oxidoreductase
The structure of an oxidoreductase from a bacterial source could not be solved by molecular replacement using related structures from the PDB, identified with HHpred (46). These efforts likely failed because the available structures had low sequence identity and only moderate sequence coverage: the best structures had ~33% identity over the first 40% of the sequence, or ~25% identity over the first 60%. In addition, the two crystal forms were expected to contain 6 or 12 molecules in the crystallographic asymmetric unit based on the most probable solvent content. The top 5 HHpred structures were prepared for molecular replacement trials by pruning non-conserved side chains and loops using phenix.sculptor (47). In addition, an ensemble model was prepared by superimposing the individual homologues in phenix.ensembler (48) and trimming poorly conserved parts of the ensemble to leave a small conserved core. Molecular replacement trials with Phaser (49) did not produce correct solutions, as judged by significant overlaps between placed molecules and a modest TFZ score of 7.4 in the lower-probability P2 space group.
The top 5 RoseTTAFold models were superimposed using phenix.ensembler, and poorly conserved parts of the ensemble were automatically trimmed. Atomic B-factors were calculated from the estimated RMS errors as described above. Molecular replacement trials with Phaser produced a solution in the more likely P21 space group, albeit with a modest TFZ score of 6.9. Manual inspection of the solution revealed that 4 of the molecules formed 2 dimers, which were expected on the basis of the closest homologue structures and biophysical data. One dimer was extracted from the model and used in a new MR trial, which produced a very clear solution with a TFZ of 17.2. Comparison of the 2 molecular replacement trials showed that the initial search had placed 5 molecules correctly but the 6th incorrectly. The successful dimer-based solution was used as the starting point for phase improvement using statistical density modification methods (54) in Phenix (55). The resulting map showed unambiguous density for the protein, including many regions where the search model differed locally from the true structure. The structure could be completed by applying the automated model building methods in phenix.phase_and_build and phenix.autosol (56), followed by manual rebuilding in Coot (57) in combination with refinement in phenix.refine (58) and validation with MolProbity (59).

Structure of bacterial surface layer protein (SLP)
Excellent diffraction data were available for SLP, but a search for homologues in the PDB using HHpred (46) yielded only one hit at a low significance level (E-value of 6.1, sequence identity of 19%) covering only 38% of the protein sequence. Considering that the crystal contains 4 copies of SLP in the asymmetric unit, it was not surprising that molecular replacement attempts failed before the RoseTTAFold models were available.
Initial attempts to solve the structure using an ensemble made from models of the entire protein were partially successful but failed because of crystal packing clashes. However, when the models were divided into two domains, searches with four copies of an ensemble model for the N-terminal domain gave a clear solution with good signal. This turned out to be sufficient to complete the structure if weak phase information from a mercury derivative was added by MR-SAD (60). Alternatively, the structure could be solved purely by molecular replacement, by adding four copies of an ensemble model for the C-terminal domain, in which B-factors were computed from the estimated RMS errors and residues with a predicted error greater than 1.3 Å were removed. Automated building procedures were sufficient to complete the structure from this point. As a control, further molecular replacement calculations were carried out using models obtained with trRosetta (3), IntFOLD6 (8), RaptorX (61), I-TASSER (62) and QUARK (63), but none of these succeeded.

Structure of secreted fungal protein Lrbp
Diffraction data were available to 1.53 Å resolution, but no significant hits were found in a search for homologues in the PDB using HHpred (46), as the top hit had an E-value of 110. Attempts over the course of 4 years to solve this structure, using a variety of predicted models and small fragments of regular secondary structure, had failed.
The initial MR searches using RoseTTAFold models prepared with the default protocol also failed. However, the diversity of the models was increased by varying the selection criteria for the MSA, and the estimated RMS errors were used to delete residues with errors greater than 1.3 Å. To generate more diverse models, we collected 8 different MSAs using E-value cutoffs of 1e-40, 1e-30, 1e-20, and 1e-10 combined with sequence coverage cutoffs of 50% and 75%. With this strategy, clear solutions for the two copies in the asymmetric unit emerged, leading to a high-quality model. As seen in Fig. S5C, the error estimates give a reliable indication of where confidence should be placed in the model.

GPCR modeling benchmark set construction and evaluation
A benchmark set of 27 GPCR sequences with experimentally determined structures that were not included in the RoseTTAFold training set was constructed. X-ray and cryo-EM structures determined at resolutions worse than 4 Å were excluded. Annotations in the GPCRdb (14) were used to classify GPCR sequences, structures, activation states, and transmembrane-region residues for the analyses. All predicted models were evaluated on the transmembrane regions only. The reference experimental structures were also truncated to the corresponding transmembrane regions, and the TM-score software (33) was used to calculate the Cɑ-RMSD of the models. To check whether templates with similar sequences were available, the sequence identities between the target transmembrane-region sequence and the aligned sequences were recalculated. From the HHblits template search, hits with an E-value below 1e-10 were considered, when present. The highest sequence identity among alignments covering more than 80% of the transmembrane region was used for the analysis. Estimated model accuracy (DAN-lDDT) was obtained by applying DeepAccNet (12) to each truncated model.

Modeling active and inactive states of GPCRs
For each target sequence, active- and inactive-state GPCR template sets were provided separately to two parallel predictions, each generating models of the corresponding state. When no template structure was available for a given state, no model was predicted for that state. For the benchmark test, templates with sequence identity above 70% in the HHsearch (13) results were excluded to make the test fairer.

GPCR benchmark test performance
Models with the highest estimated accuracy (DAN-lDDT) were selected for each of the active and inactive states. RoseTTAFold could predict highly accurate models of both states; examples of good predictions are shown in Fig. S7, panels A and B.
Template-based models of the benchmark-set targets were collected from available GPCR model databases: active-state models were taken from GPCRdb (14), and inactive-state models were downloaded from the Meiler group modeling database (15). Targets that could have been modeled easily using a template with sequence identity > 70% in the same state were excluded from the analysis. The accuracies of the RoseTTAFold models and the corresponding homology models are compared in Fig. S7C. For most targets, RoseTTAFold predicted structures with higher TM-scores.
The best template sequence identity for each GPCR sequence is reported together with the estimated model accuracy (DAN-lDDT) and the actual accuracy in Fig. S7D. When multiple reference experimental structures existed for the corresponding state, the best Cɑ-RMSD is reported, with color representing model accuracy. RoseTTAFold's results on the GPCR benchmark set did not correlate strongly with the best template sequence identity, corroborating that the deep network of RoseTTAFold can predict models with accuracies beyond what can be achieved from homology information alone. However, generating highly accurate active-state models (Cɑ-RMSD < 1.5 Å) was more feasible when templates with higher sequence identities were available.
A DAN-lDDT of 0.80 can be used as a rough threshold to discriminate accurate models (Cɑ-RMSD < 1.5 Å or TM-score > 0.9, Fig. S7D) from inaccurate ones. This guideline applied better to inactive-state models (Fig. S7D); the active-state models turned out to have lower DAN-lDDT than their actual accuracy warranted. Because DeepAccNet was trained on monomeric structures only, the accuracy of a receptor chain in the active state, which would normally require partners such as G-proteins, may be underestimated.

GPCR models of unknown states
In the GPCR benchmark set we constructed, 25 targets (as of May 14th, 2021) lacked a known structure for one state, either inactive or active. We predicted models of the unknown state for each target and obtained models with DAN-lDDT above 0.75 for all of them. These models are provided at http://files.ipd.uw.edu/pub/RoseTTAFold/GPCR_benchmark_one_state_unknown_models.tar.gz

Human GPCR model generation
We collected a set of 298 human GPCR sequences without known experimental structures as of May 14th, 2021. Models in both active and inactive states were predicted with RoseTTAFold. The best template sequence identity and the estimated accuracy (DAN-lDDT) of the models are reported in Fig. S7D. All models with DAN-lDDT values above 0.75 are provided at http://files.ipd.uw.edu/pub/RoseTTAFold/all_human_GPCR_unknown_models.tar.gz. The DAN-lDDT metric can be used to estimate the reliability of each model, and relative per-residue quality estimates can be found in the B-factor column.

Modeling of structurally uncharacterized domains from human proteins
We selected human proteins of biomedical importance based on the number of literature references linked to them in UniProt (64) (>50) and on whether mutations in them are known to cause human disease according to the DBSAV database (65). 7,639 human proteins were selected, and domains were predicted using an HMMER (66) search against the Pfam database (67). A total of 18,233 domains were detected (E-value < 1e-5) in these proteins. The majority of these domains can be modeled confidently from homologous structures in the PDB (68), and of the structurally uncharacterized domains, more than half contain a considerable fraction (> 25%) of residues predicted to be disordered (69). After excluding domains that are disordered or can be modeled by homology, we removed redundancy (domains mapped to the same Pfam family) from the remaining 2,083 domains, resulting in 693 targets to model with our method. We obtained high-quality models (estimated lDDT from DeepAccNet (12) > 0.8) for 245 targets (provided at http://files.ipd.uw.edu/pub/RoseTTAFold/human_prot.tar.gz). Only 28 of the 693 targets have predicted lDDT below 0.5, and half of these turned out to be potentially disordered proteins (predicted by SPOT-Disorder2 (70)). For the remaining 420 targets (predicted lDDT between 0.5 and 0.8), the method failed to produce high-accuracy structures for several reasons, including 1) local inaccuracies in regions that might be stabilized by interactions with binding partners (other proteins or nucleic acids), 2) disordered local regions, and 3) limitations of the method itself. The 245 high-quality models were manually inspected to extract biological insights with the help of the literature, sequence conservation, and remote homology detected by searching for structurally similar proteins.
For three RoseTTAFold structure models that provided insight into their biological function, their sequences (Q6ICL3:1-259 for TANGO2, P27544:98-304 for CERS1, and Q9BZ11:39-167 for ADAM33 prodomain) were checked against the SWISS-MODEL repository (68) for homology models. Their sequences were also submitted to the HHpred server (71) for search against the PDB database (PDB_mmCIF70_17_May) and the ECOD (72) domain database (ECOD_F70_20200717) using default parameters. For the CERS1 example, where no confident hits were identified, a second MSA generation method using PSI-BLAST against nr70 was used to identify possible template homologs. HHpred results are summarized in Table S3, omitting hits below rank 5. To identify related folds for the examples, RoseTTAFold models were used as queries to search the ECOD database with RUPEE (default settings) (73). Potential functional sites for the models were mapped with AL2CO (74) using conservations from multiple sequence alignment (MAFFT, default settings (75)) of orthologs collected from the OMA database (76).
The SWISS-MODEL repository could only generate low-quality models for the TANGO2 sequence. However, HHpred generated alignments for several Ntn templates with high confidence (Table S3). We chose the top two templates (3gvz and 2x1d) to generate homology models using the SWISS-MODEL workspace alignment mode (68). Each of the homology models was of poor quality based on QMEAN scores (77) (-6.12 and -6.11, respectively). These homology models were compared to the RoseTTAFold structure using pairwise DaliLite (78) superpositions (DaliZ 19.1 and 17.9, respectively). Compared to the RoseTTAFold structure (Fig. S9A), each of the homology models displays shifts in alignment and relatively poorly structured loops. Some of the conserved residues that form the RoseTTAFold active site (Fig. S9A, colored red) shift further away from the active site in each of the homology models: R86, G87, and K166 in the 3gvz model (Fig. S9B) and G49, G51, and K166 in the 2x1d model (Fig. S9C).
Template search for the ADAM33 prodomain confidently identified an incorrect template (4on1_B) corresponding to a fragilysin-3 prodomain fold. While both structures possess a similar four-stranded beta-meander, the alpha + beta C-terminus of the fragilysin prodomain extends the beta-meander into a longer sheet (Fig. S17A). In contrast, the N-terminus of ADAM33 continues the beta-meander to form a beta-barrel fold similar to that of lipocalin (Fig. S17B). The HHpred alignment for the fragilysin template incorrectly extended the metalloprotease domain present in both ADAM33 and fragilysin into the prodomain (aligned portions of the prodomains shown in rainbow, Fig. S17). An HHpred search of the ADAM33 prodomain sequence against templates from the ECOD domain database, which separates the prodomain from the metalloprotease, avoids this multi-domain problem.

Hetero-complex structure prediction using RoseTTAFold
Although RoseTTAFold was trained on single protein chains, we exploited its ability to make inferences on discontinuous sequence segments to model hetero-oligomeric complexes. The only modification introduced for hetero-complex structure prediction was a change in the positional encoding: we added 200 to the residue numbers of each subsequent subunit to signal to the network that there are chain breaks between subunits.
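This residue-index trick can be sketched in a few lines; the function name and interface are ours.

```python
# Offset each chain's residue numbering by 200 so the positional encoding
# signals a chain break between subunits.
import torch

def complex_residue_index(chain_lengths, gap=200):
    idx, offset = [], 0
    for L in chain_lengths:
        idx.append(torch.arange(L) + offset)
        offset += L + gap                   # insert a 200-residue jump between chains
    return torch.cat(idx)

print(complex_residue_index([3, 4]))        # tensor([  0,   1,   2, 203, 204, 205, 206])
```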
As a benchmark, we predicted the hetero-oligomeric structures of E. coli proteins from the PDB benchmark set (32). Among the 868 pairs in the PDB benchmark set, we selected 68 interacting pairs with known complex structures of identical or closely homologous proteins (sequence identity > 90%) in the PDB and with interface areas (calculated with naccess (79)) larger than 1,500 Å². The list of 68 interacting pairs and the accuracies of the predicted complex models are provided in Table S4. Complex model accuracy is evaluated with the Interface Contact Similarity (ICS) score (80) and the complex TM-score (33). To see whether RoseTTAFold can predict higher-order oligomer structures, we also predicted the hetero-trimeric complex structures of the bacterial proteins shown in Fig. 4B. In both cases (dimer and trimer prediction), predictions were made from a paired alignment of the target complex without any template information.
We generated a paired alignment for the human IL-12R/IL-12 complex structure prediction by simply pairing the sequences with the same taxonomy ID. Based on the paired sequence alignment and the template structure (the IL-23R/IL-23 complex; PDB 6wdq), the backbone coordinates were predicted using RoseTTAFold. Full-atom structures were generated by FastRelax (81) with restraints derived from the predicted distances and orientations.

Fig. S3. Model accuracy changes upon intensive use of the network for inference. By sampling MSAs randomly and providing predicted structures as templates (y-axis), the 3-track end-to-end model sampled much better model structures than a single pass (x-axis), as shown in the left panel. DeepAccNet was able to select improved structures in some cases (right panel), but there is still room for improvement in model accuracy estimation. Model accuracy is measured by TM-score.

Fig. S4. Experiments on network architecture modification with Perceiver for an efficient MSA encoder. (A) Model accuracy comparison (in TM-score) between predicting the entire structure at once (one-shot prediction) and combining predictions from multiple discontinuous crops; scanning multiple crops generated more accurate predictions. (B) Differences in the subset of sequences used for one-shot and cropped prediction. Due to memory limitations, at most 1,000 sequences were used during prediction: the top 1,000 sequences for one-shot prediction, versus 1,000 sequences with sequence coverage over 50% for cropped prediction.

Fig. S5 (caption fragment). [...] The RoseTTAFold model is colored by estimated RMS error, from blue for the minimum of 0.6 Å to red for 3.5 Å and higher. In these superpositions, 203 Cɑ atoms of 4mkz match the experimental structure with a Cɑ-RMSD of 1.8 Å, whereas 272 Cɑ atoms of the RoseTTAFold model match with a Cɑ-RMSD of 1.34 Å. (C) The full RoseTTAFold model for Lrbp, colored by estimated RMS error from blue for the minimum of 0.84 Å to red for 1.8 Å and higher; the refined structure is shown in gray.

Fig. S7. (A) The best DAN-lDDT (0.81) inactive-state model of GABR1_HUMAN (cyan, UniProt ID Q9UBS5) compared to the native structure (PDB 6w2y chain B, magenta) and the closest homolog of known structure (PDB 4or2 chain A, gray, seqID 18%); transmembrane-region Cɑ-RMSD 1.14 Å; middle and right panels focus on the extracellular regions (top view). (B) The best DAN-lDDT (0.80) active-state model of MC4R_HUMAN (cyan, UniProt ID P32245) compared to the native structure (PDB 7aue chain R, magenta, G-protein helix in red) and the closest homolog of known structure (gray, PDB 3kj6 chain A, seqID 27%); transmembrane-region Cɑ-RMSD 1.49 Å; middle and right panels focus on the intracellular regions (bottom view). (C) Accuracies (in TM-score) of RoseTTAFold models versus template-based models from public databases (14, 15); only transmembrane regions were considered. (D) For each active (o) and inactive (x) state prediction, the best template sequence identity and predicted model accuracy (DAN-lDDT) are reported; the color gradient represents actual model accuracy in Cɑ-RMSD for the subset of proteins of known structure, ranging from 1.2 Å (accurate, blue) to 2.2 Å (inaccurate, red); the human GPCR set with unknown structures is shown in light gray; only data with DAN-lDDT between 0.7 and 0.9 are shown.

Fig. S8. A correlation between predicted lDDT by DeepAccNet and actual Cɑ-RMSD for CASP14 targets.
The predicted lDDT of 0.80 can roughly be used as a threshold to discriminate accurate models (average Cɑ-RMSD of 2.6 Å and TM-score of 0.89) from inaccurate ones.

Fig. S9 (caption fragment). (B) The homology model based on the top HHpred hit to the 3gvz template is colored green (aligned with the RoseTTAFold structure) or red (shifted alignment); three conserved residues (black spheres and sticks) shift away from the active site. (C) The homology model based on the next-best HHpred hit to the 2x1d template is colored blue (aligned with the RoseTTAFold structure) or red (shifted alignment); three conserved residues (black spheres and sticks) shift away from the active site.

Fig. S14. 6D representation of the rigid-body transform between two residues. It includes the distance (d) between Cβ atoms, the dihedral angle (ω) along the virtual bond connecting the two Cβ atoms, and two dihedral angles (θij, θji) and two pseudo-bond angles (φij, φji) specifying the direction of the Cβ atom of one residue in a reference frame centered on the other residue.

Fig. S17 (caption fragment). (A) The prodomain from HHpred template 4on1 is shown in ribbon and adopts a fragilysin-like ɑ+β fold with a central 4-stranded beta-meander; the domain architecture below highlights a C-terminal metalloprotease that is also present in ADAM33; the HHpred template alignment incorrectly extends into the prodomain (aligned sequence in rainbow). (B) The ADAM33 RoseTTAFold structure (oriented by its corresponding central beta-meander) adopts a lipocalin-like beta-barrel; the aligned beta-meander sequence (in rainbow) is unrelated to the alpha + beta sequence of the template.

Table S1. Performance of different model architectures in terms of inter-residue geometry prediction loss (cross-entropy), top L long-range contact accuracy, and Cɑ-lDDT.
