Protein structure determination using metagenome sequence data
Filling in the protein fold picture
Fewer than a third of the 14,849 known protein families have at least one member with an experimentally determined structure. This leaves more than 5000 protein families with no structural information. Protein modeling using residue-residue contacts inferred from evolutionary data has been successful in modeling unknown structures, but it requires large numbers of aligned sequences. Ovchinnikov et al. augmented such sequence alignments with metagenome sequence data (see the Perspective by Söding). They determined the number of sequences required to allow modeling, developed criteria for model quality, and, where possible, improved modeling by matching predicted contacts to known structures. Their method predicted quality structural models for 614 protein families, of which about 140 represent newly discovered protein folds.
Abstract
Despite decades of work by structural biologists, there are still ~5200 protein families with unknown structure outside the range of comparative modeling. We show that Rosetta structure prediction guided by residue-residue contacts inferred from evolutionary information can accurately model proteins that belong to large families and that metagenome sequence data more than triple the number of protein families with sufficient sequences for accurate modeling. We then integrate metagenome data, contact-based structure matching, and Rosetta structure calculations to generate models for 614 protein families with currently unknown structures; 206 are membrane proteins and 137 have folds not represented in the Protein Data Bank. This approach provides the representative models for large protein families originally envisioned as the goal of the Protein Structure Initiative at a fraction of the cost.
There are 14,849 protein families in the Pfam (1) database with 50 or more residues. Of these, 4752 have at least one member with an experimentally determined x-ray crystal or nuclear magnetic resonance (NMR) structure, and for an additional 3984, reliable comparative models can be built on the basis of homologs of known structure detected using the powerful HHsearch fold-recognition program (2). Of the remaining 6113 families, less-confident comparative models can be built for 902, but no structural information is available for the other 5211 (HHsearch E-value ≥ 1). Until recently, computational methods could not generate accurate models for these 5211 families, as they lack homologs of known structure for comparative modeling, and the very large number of conformations accessible to a polypeptide chain made the sampling problem in de novo protein structure prediction intractable for all but the smallest proteins. The original goal of the Protein Structure Initiative was to determine structures for at least one representative of such families, but this proved to be extremely challenging, and the focus of the initiative shifted to targets of immediate biological interest (3).
The increase in the number of known amino acid sequences has enabled the accurate prediction of residue-residue contacts by using evolutionary data (4–10)—substitutions at positions close in space in the three-dimensional structure covary. Such contact predictions have been used for a wide range of protein modeling efforts (11–22). Accurate contact prediction requires large numbers of aligned sequences so that residue-residue covariance is clearly distinguished from lineage effects. Although coevolution-based structure modeling has been used to generate models for individual proteins with fold-level accuracy [template modeling (TM) score (23) is >0.5 (5, 7, 8, 10, 11, 14–18, 21, 22)], it has not been clear whether such data, combined with structure-prediction methodology, can generate accurate models on a larger scale.
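The covariation signal described above can be illustrated with a much simpler statistic than the models used in this work. The sketch below is not GREMLIN; it swaps in mutual information between alignment columns with an average product correction, applied to a hypothetical six-sequence toy alignment in which columns 0 and 3 covary. All function names are illustrative assumptions.

```python
import math
from collections import Counter

def column_mi(col_a, col_b):
    """Mutual information (in nats) between two alignment columns."""
    n = len(col_a)
    pa, pb = Counter(col_a), Counter(col_b)
    pab = Counter(zip(col_a, col_b))
    mi = 0.0
    for (a, b), count in pab.items():
        p = count / n
        mi += p * math.log(p / ((pa[a] / n) * (pb[b] / n)))
    return mi

def apc_scores(alignment):
    """MI for every column pair minus the average product correction (APC)."""
    ncol = len(alignment[0])
    cols = [[seq[i] for seq in alignment] for i in range(ncol)]
    mi = [[0.0] * ncol for _ in range(ncol)]
    for i in range(ncol):
        for j in range(i + 1, ncol):
            mi[i][j] = mi[j][i] = column_mi(cols[i], cols[j])
    # Mean MI of each column with all other columns, and the overall mean.
    col_mean = [sum(row) / (ncol - 1) for row in mi]
    total_mean = sum(col_mean) / ncol
    scores = {}
    for i in range(ncol):
        for j in range(i + 1, ncol):
            apc = col_mean[i] * col_mean[j] / total_mean if total_mean > 0 else 0.0
            scores[(i, j)] = mi[i][j] - apc
    return scores

# Hypothetical toy alignment: columns 0 and 3 change together (A<->E, D<->R),
# column 1 is conserved, and column 2 varies independently of the others.
toy_alignment = ["AKLE", "AKLE", "DKLR", "DKLR", "AKVE", "DKVR"]
toy_scores = apc_scores(toy_alignment)
top_pair = max(toy_scores, key=toy_scores.get)
print(top_pair)
```

With only six sequences the statistic is meaningful here only because the toy data are noise-free; in real families, phylogenetic (lineage) effects inflate such scores, which is why large and diverse alignments are needed.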
Rosetta de novo structure-prediction calculations guided by evolutionary information were recently used to generate models for 58 large protein families (21). The structures of proteins in six of these families have since been published, which provides an opportunity to assess this medium-scale prediction effort. Recently solved structures of the lipoprotein signal peptidase II (24), prolipoprotein diacylglyceryl transferase (25), fluoride ion transporter (26), cytochrome bd oxidase (27), DMT superfamily transporter YddG (28), and fumarate hydratase (29) are all very close to computational models published and publicly released well before the structures were solved (Fig. 1). In the case of the three-subunit cytochrome bd oxidase, the computational model of the 788-residue complex generated using both inter- and intra-subunit contact information was used together with experimental phase information obtained from the three heme irons and a single methionine to solve the structure. Because the phase information was weak, it was only possible to place the transmembrane helices and a subset of the side chains on the basis of the density, but the loops, connectivity, location of the CydX subunit, and registration of the amino acid sequence on many of the helices were unclear. Our Escherichia coli protein model closely overlapped with the traced helices, and Phenix-Rosetta refinement (30) of a model built for the Geobacillus thermodenitrificans protein resolved the above ambiguities, enabling rapid completion of structure determination. The final deposited structure is very similar to our previously published model of the E. coli protein (Fig. 1A) [TM-align score (23) of 0.8]. The power of Rosetta structure-prediction calculations coupled with coevolution data for soluble proteins is illustrated by an extremely accurate blind de novo prediction for a complex protein structure in the CASP11 structure-prediction experiment (31) (Fig. 1E). In all of the cases shown in Fig. 1, standard threading or fold-recognition methods fail to identify the correct fold. Taken together, these data show that Rosetta modeling guided by coevolutionary constraints generates accurate models (in all six cases, the TM-align score is >0.7; the models also illustrate some of the limitations of the approach, including the lack of explicit modeling of ligands, cofactors, and lipids) (see supplementary text).
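For readers unfamiliar with the TM score used throughout, the following minimal sketch evaluates the scoring function of reference (23) for one fixed superposition. It assumes the per-residue distances between aligned pairs are already known; the real TM-score and TM-align programs additionally search for the superposition and alignment that maximize the score. The function names and the 0.5 Å floor on d0 for very short chains are assumptions of this sketch.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 (in angstroms)."""
    if target_length <= 15:
        return 0.5  # assumed floor for very short chains
    return max(0.5, 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8)

def tm_score(distances, target_length):
    """TM score for one superposition.

    distances: distance in angstroms for each aligned residue pair.
    Residues of the target that are not aligned contribute zero, because the
    sum is normalized by the full target length.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# A perfect superposition of all 100 residues scores 1.0.
perfect = tm_score([0.0] * 100, 100)
# The same perfect fit over only half of the residues scores 0.5.
half_aligned = tm_score([0.0] * 50, 100)
print(perfect, half_aligned)
```

Because d0 grows with chain length, a given deviation is penalized less in a large protein than in a small one, which makes scores comparable across the protein sizes shown in Fig. 1.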

Fig. 1 Comparison of Rosetta models (left) to subsequently published crystal structures (right).
The models accurately recapitulate the structural details of the named proteins. The scores are as follows: (A) the cytochrome bd oxidase (TM-align score 0.88), (B) the lipoprotein signal peptidase II (TM-align score 0.70), (C) the DMT superfamily transporter YddG (TM-align score 0.70), (D) the fluoride ion transporter dimer (TM-align score 0.69), (E) the CASP11 target T0806, (F) prolipoprotein diacylglyceryl transferase (TM-align score 0.69), and (G) fumarate hydratase [TM-align score 0.80 for monomer (top) and 0.76 for dimer (bottom)].
Structure models with the accuracy of those in Fig. 1 would have broad utility for framing biological hypotheses about function and interpreting mutational data, as well as for guiding experimental structure determination. To determine the number of aligned sequences required for contact prediction accuracy sufficient to guide generation of accurate 3D models, we carried out Rosetta structure-prediction calculations for a benchmark set of 27 large protein families (table S1) with known structure. We used both the full sequence alignments and alignments of subsets of the sequences for contact prediction. We also performed structure-prediction calculations using Rosetta to hybridize and refine (32) partial structural matches identified by matching predicted contacts with the contact patterns of known protein structures. To do this, we developed an algorithm (map_align) [see the supplementary materials (SM)] that uses iterative double-dynamic programming (33). The two approaches are complementary: De novo structure prediction (using only sequence information) (34) can succeed where there are no related structures in the Protein Data Bank (PDB), whereas making use of matches to known structures can help for large complex proteins that otherwise present a convergence challenge for de novo structure prediction (structural matches can occur in the absence of detectable sequence similarity because structural similarity is retained over larger evolutionary distances). For large sequence families, combining de novo structure-prediction models and map_align structure matches using the Rosetta iterative hybridization protocol improved accuracy in 14 cases and decreased accuracy in only one (solid line in Fig. 2A) (fig. S1; see SM). Contact prediction accuracy, and hence predicted structure accuracy, depends on the number of sequences in the family, the diversity of these sequences, and the length of the protein. 
A measure that incorporates all three factors [Nf, the number of sequence clusters at an 80% sequence identity–clustering threshold divided by the square root of the protein length (21)] correlates well with contact prediction accuracy (21) and model accuracy (Fig. 2A and fig. S1) over a broad range of families.
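The Nf measure defined above can be sketched in a few lines. This is an illustration of the definition as stated in the text, not the authors' implementation: the greedy single-pass clustering, the way identity is computed over gapped columns, and all function names are assumptions, and the toy sequences are hypothetical.

```python
import math

def identity(a, b):
    """Fraction of identical positions, ignoring columns gapped in both."""
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    if not pairs:
        return 0.0
    return sum(x == y for x, y in pairs) / len(pairs)

def count_clusters(aligned, threshold=0.8):
    """Greedy clustering: a sequence starts a new cluster unless it is at
    least `threshold` identical to an existing cluster representative."""
    representatives = []
    for seq in aligned:
        if not any(identity(seq, rep) >= threshold for rep in representatives):
            representatives.append(seq)
    return len(representatives)

def nf(aligned, protein_length, threshold=0.8):
    """Number of sequence clusters divided by sqrt(protein length)."""
    return count_clusters(aligned, threshold) / math.sqrt(protein_length)

# Hypothetical toy alignment: the second sequence is 90% identical to the
# first and joins its cluster; the other two start clusters of their own.
toy = ["ACDEFGHIKL", "ACDEFGHIKV", "MNPQRSTVWY", "ACDEFGHAAA"]
n_clusters = count_clusters(toy)
toy_nf = nf(toy, protein_length=100)
print(n_clusters, toy_nf)
```

Under this definition, a 100-residue protein needs at least 640 clusters to reach Nf = 64, and a 400-residue protein needs at least 1280, which shows how the requirement scales with length.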

Fig. 2 Metagenome data greatly increase the fraction of structures that can be accurately modeled.
(A) Dependence of coevolution-guided Rosetta structure-prediction accuracy on the effective number of sequences Nf (a function of both sequence number and diversity; see methods definition) in the protein family. For each of 27 proteins of known structure, the multiple sequence alignment was subsampled, and residue-residue contacts were predicted by using GREMLIN. Rosetta structure-prediction calculations were then used to generate ~20,000 models, and a single model was selected on the basis of the Rosetta energy and the fit to the coevolution constraints; the average TM score of these selected models over all 27 cases is shown on the y axis (dashed line). Hybridization-based refinement of the top 20 models together with the top 10 map_align-based models for each case increases the average accuracy (solid line); models with fold-level accuracy (TM score of >0.5) are obtained for Nf ≥ 16, and models with accuracy typical of comparative modeling, for Nf of 64. (B) Fraction of protein families of unknown structure with Nf of at least 64. Dashed line: including only sequences in the UniRef100 database; solid line: including sequences in the UniRef100 database together with metagenome sequence data from the Joint Genome Institute (37). (C) Distribution of Nf values for 5211 Pfam families with currently unknown structure, after the addition of metagenomic sequences; 25% of the protein families have Nf > 64, 34% have Nf > 32, and 45% have Nf > 16.
How many protein families with currently unknown structure have Nf values in the range where accurate models can be built? The models in Fig. 1 were all generated for families with Nf > 64; accuracy falls off for lower values of Nf (Fig. 2A). As shown in Fig. 2B, fewer than 8% of families have Nf values of 64 or better. Modeling the remaining 92% of families of unknown structure at reasonable accuracy is not currently possible by using the sequence information in the UniRef100 database (35).
This limitation in structure modeling can be largely overcome by taking advantage of progress in a completely different research area. Metagenome sequencing projects, in which complex biological samples are shotgun sequenced, have provided insights into biological communities and provide a treasure trove of new sequence data (36, 37). The number of protein sequences determined in metagenome sequence projects is growing considerably faster than the UniRef100 database (solid versus dashed line in Fig. 2B). With the inclusion of metagenome sequence data, the number of sequences increases by as much as 100-fold for some families (table S2), and the fraction of families with unknown structure that can be accurately modeled using coevolution-guided structure-prediction methods increases dramatically. At Nf ≥ 64, the fraction increases from 0.08 to 0.25, and at Nf ≥ 32 [where fold-level accuracy can be achieved (Fig. 2A)], the fraction increases from 0.16 to 0.33. To assess structure-prediction and model evaluation accuracy using metagenome data, we carried out a second set of benchmark calculations on 81 Pfam domains with recently solved structures and Nf ≥ 64 (fig. S1, E and F, and table S5). Structure-prediction accuracy was correlated with the extent of convergence of the lowest energy models and the fraction of predicted contacts present in these models (figs. S1F and S2). For 42 families, the predictions converged with most of the predicted contacts satisfied (see SM for convergence criteria); of these, 25 had a TM score >0.7 and 13 a TM score >0.6. [Three of the four remaining cases are NMR structures of small transmembrane proteins, for which our models fit the predicted contacts much better; the last case is an intertwined dimer, for which our monomer model contained all the correct contacts (fig. S13).]
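One of the two quality indicators mentioned above, the fraction of predicted contacts present in a model, can be sketched as follows. The 8 Å distance cutoff, the use of one representative atom per residue, and the function name are assumptions of this sketch; the criteria actually used are described in the SM.

```python
import math

def fraction_satisfied(coords, predicted_contacts, cutoff=8.0):
    """Fraction of predicted residue pairs within `cutoff` angstroms.

    coords: one (x, y, z) tuple per residue (for example the C-beta atom).
    predicted_contacts: list of (i, j) residue index pairs, 0-based.
    """
    if not predicted_contacts:
        return 0.0
    hits = sum(
        1 for i, j in predicted_contacts
        if math.dist(coords[i], coords[j]) <= cutoff
    )
    return hits / len(predicted_contacts)

# Hypothetical toy model: ten residues on a straight line, 3.8 angstroms apart.
toy_coords = [(3.8 * i, 0.0, 0.0) for i in range(10)]
# Pairs (0, 2) and (3, 4) are within 8 angstroms; (0, 5) and (1, 9) are not.
toy_contacts = [(0, 2), (0, 5), (3, 4), (1, 9)]
toy_fraction = fraction_satisfied(toy_coords, toy_contacts)
print(toy_fraction)
```

A low fraction for a converged model suggests either a wrong fold or contacts that are satisfied only in another state, such as an oligomer, as in the intertwined dimer case noted above.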
We generated coevolution-based contact predictions using GREMLIN (4, 12) for the 1297 protein families with Nf ≥ 64 and built models for the 921 protein families (1024 domains) with many contacts between positions separated by more than five residues along the linear sequence (number of long-range contacts > half the number of residues in the protein). The structure-prediction calculations converged on models with predicted TM scores (based on the benchmark calculations) greater than 0.65 for 614 of the 1024 domains. A list of the Pfam families covered by these models is in table S3; the models are available at http://gremlin.bakerlab.org/meta/, along with an interactive 3D interface powered by 3Dmol.js (38) and D3.js (39) for visualization of coevolution contacts on the models. These structures provide close templates for comparative modeling of 487,306 UniRef100 and 3,868,268 Integrated Microbial Genomes metagenomic unique (less than 80% pairwise identity) sequences.
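The selection rule stated above (more long-range contacts than half the number of residues) is simple enough to sketch directly. The sketch reads "separated by more than five residues" as an index difference greater than five, and it takes the list of predicted contacts as given; how contacts are thresholded before counting is not specified here, so both points are assumptions, as are the function names.

```python
def count_long_range(contacts, min_separation=6):
    """Count contacts whose sequence separation |i - j| is more than five."""
    return sum(1 for i, j in contacts if abs(i - j) >= min_separation)

def has_enough_long_range(contacts, length):
    """True if long-range contacts outnumber half the residues in the protein."""
    return count_long_range(contacts) > length / 2

# Hypothetical predicted contacts: six are long-range, two are local.
toy_contacts = [(0, 6), (1, 8), (2, 9), (0, 7), (3, 9), (1, 7), (0, 3), (4, 5)]
n_long = count_long_range(toy_contacts)
print(n_long, has_enough_long_range(toy_contacts, 10))
```

Local contacts are excluded because they mostly reflect secondary structure and say little about the overall fold, whereas long-range contacts constrain how distant segments pack together.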
The converged models for the 614 Pfam families (table S3) provide a view of the hitherto unseen protein universe. To determine whether the models belong to known protein folds, we carried out structure-structure comparisons against the Structural Classification of Proteins (SCOP) (40) domain database. For 477 of the families, the models matched a protein of known structure over nearly the entire length and, hence, can be assigned to SCOP folds (52 distinct all-alpha, 29 alpha/beta, 51 alpha+beta, and 28 all-beta folds). In a number of cases, the SCOP classifications are consistent with previous functional information; for example, the restriction endonuclease Xho I is assigned to the restriction enzyme fold, and a family of prokaryotic putative ubiquitin-like proteins is assigned the beta-grasp fold (to which ubiquitin belongs). For 137 of the domains, there were no significant structure matches of the models to the PDB (TM-align score < 0.5), and hence, these have new folds. Space limitations preclude showing more than a small fraction of the 614 models here; we show a selection of the 3D structures in Fig. 3. They include the key developmental regulator Chordin; a key enzyme in cobalamin synthesis; a metalloendopeptidase; and mercury and iron transporters. Six are transmembrane proteins, four have new folds, and several have complex topologies. These and the remaining 590 structure models not shown in Fig. 3 should provide a basis for understanding molecular function and mechanisms and should guide experimental structure determination (such efforts should be informed of the limitations of the modeling approach described in the supplementary text). While this manuscript was in preparation, crystal structures of members of 5 of the 614 families were published and are similar to the corresponding models (TM-align score ≥ 0.7) (see fig. S3 and table S4).

Fig. 3 Representative structure models for selected Pfam families.
Membrane proteins are on the top row; new folds are on the bottom right. The multidomain models of the iron transporter and RNA helicase and the dimeric model of CobS, an enzyme in vitamin B12 synthesis, are guided by both intra- and inter-chain coevolution restraints.
The models presented in this paper fill in about 12% of the structural information missing for known protein families. That this could be accomplished using computational modeling methods was not at all apparent 5 years ago. This progress required integration of advances in disparate research areas: metagenome sequencing, coevolutionary analysis, and de novo protein structure-prediction methodology. This combined approach has a bright future: Extrapolating from the data in Fig. 2B suggests that in several years the majority of families will have sufficient numbers of sequences for accurate structure modeling. A current limitation is that most sequence data are for prokaryotes, but as fungal and other simple eukaryote genome sequencing projects ramp up, the approach should become applicable to eukaryote-specific protein families.
Acknowledgments
We thank P. Di Lena, N. Malod-Dognin, and R. Andonov for providing the source code for their software (Al-eigen and a_purva) and for their discussion and advice on contact map alignment. The 3D structures of 614 Pfam domains modeled in the study are available at http://gremlin.bakerlab.org/meta/. Other data are archived at the Dryad Digital Repository (doi:10.5061/dryad.27p4s). We also thank Rosetta@home and Charity Engine participants for donating their computer time. The work performed by N.V., G.A.P., and N.C.K. was supported by the U.S. Department of Energy (DOE) Joint Genome Institute, a DOE Office of Science User Facility, under contract no. DE-AC02-05CH11231. Research reported here was supported by National Institute of General Medical Sciences, NIH, under award number R01GM092802. The content is solely the responsibility of the authors and does not necessarily represent the official views of the NIH.
Supplementary Material
Summary
Materials and Methods
Supplementary Text
Figs. S1 to S13
Tables S1 to S5
Resources
References and Notes
1
R. D. Finn, P. Coggill, R. Y. Eberhardt, S. R. Eddy, J. Mistry, A. L. Mitchell, S. C. Potter, M. Punta, M. Qureshi, A. Sangrador-Vegas, G. A. Salazar, J. Tate, A. Bateman, The Pfam protein families database: Towards a more sustainable future. Nucleic Acids Res. 44 (D1), D279–D285 (2016).
2
J. Söding, Protein homology detection by HMM-HMM comparison. Bioinformatics 21, 951–960 (2005).
3
G. T. Montelione, The Protein Structure Initiative: Achievements and visions for the future. F1000 Biol. Rep. 4, 7 (2012).
4
H. Kamisetty, S. Ovchinnikov, D. Baker, Assessing the utility of coevolution-based residue-residue contact predictions in a sequence- and structure-rich era. Proc. Natl. Acad. Sci. U.S.A. 110, 15674–15679 (2013).
5
D. S. Marks, L. J. Colwell, R. Sheridan, T. A. Hopf, A. Pagnani, R. Zecchina, C. Sander, Protein 3D structure computed from evolutionary sequence variation. PLOS ONE 6, e28766 (2011).
6
F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, M. Weigt, Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl. Acad. Sci. U.S.A. 108, E1293–E1301 (2011).
7
T. A. Hopf, L. J. Colwell, R. Sheridan, B. Rost, C. Sander, D. S. Marks, Three-dimensional structures of membrane proteins from genomic sequencing. Cell 149, 1607–1621 (2012).
8
T. Nugent, D. T. Jones, Accurate de novo structure prediction of large transmembrane protein domains using fragment-assembly and correlated mutation analysis. Proc. Natl. Acad. Sci. U.S.A. 109, E1540–E1547 (2012).
9
D. T. Jones, D. W. Buchan, D. Cozzetto, M. Pontil, PSICOV: Precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics 28, 184–190 (2012).
10
D. S. Marks, T. A. Hopf, C. Sander, Protein structure prediction from sequence variation. Nat. Biotechnol. 30, 1072–1080 (2012).
11
J. I. Sułkowska, F. Morcos, M. Weigt, T. Hwa, J. N. Onuchic, Genomics-aided structure prediction. Proc. Natl. Acad. Sci. U.S.A. 109, 10340–10345 (2012).
12
S. Balakrishnan, H. Kamisetty, J. G. Carbonell, S. I. Lee, C. J. Langmead, Learning generative models for protein fold families. Proteins 79, 1061–1078 (2011).
13
M. Ekeberg, C. Lövkvist, Y. Lan, M. Weigt, E. Aurell, Improved contact prediction in proteins: Using pseudolikelihoods to infer Potts models. Phys. Rev. E Stat. Nonlin. Soft Matter Phys. 87, 012707 (2013).
14
S. Wickles, A. Singharoy, J. Andreani, S. Seemayer, L. Bischoff, O. Berninghausen, J. Soeding, K. Schulten, E. O. van der Sluis, R. Beckmann, A structural model of the active ribosome-bound membrane protein insertase YidC. eLife 3, e03035 (2014).
15
P. Tian, W. Boomsma, Y. Wang, D. E. Otzen, M. H. Jensen, K. Lindorff-Larsen, Structure of a functional amyloid protein subunit computed using sequence variation. J. Am. Chem. Soc. 137, 22–25 (2015).
16
S. Hayat, C. Sander, D. S. Marks, A. Elofsson, All-atom 3D structure prediction of transmembrane β-barrel proteins from sequences. Proc. Natl. Acad. Sci. U.S.A. 112, 5413–5418 (2015).
17
T. A. Hopf, S. Morinaga, S. Ihara, K. Touhara, D. S. Marks, R. Benton, Amino acid coevolution reveals three-dimensional structure and functional domains of insect odorant receptors. Nat. Commun. 6, 6077 (2015).
18
L. A. Abriata, An homology- and coevolution-consistent structural model of bacterial copper-tolerance protein CopM supports function as a “metal sponge” and suggests regions for metal-dependent protein-protein interactions. bioRxiv 10.1101/013581 (2015).
19
S. Ovchinnikov, H. Kamisetty, D. Baker, Robust and accurate prediction of residue-residue interactions across protein interfaces using evolutionary information. eLife 3, e02030 (2014).
20
T. A. Hopf, C. P. I. Schärfe, J. P. G. L. M. Rodrigues, A. G. Green, O. Kohlbacher, C. Sander, A. M. J. J. Bonvin, D. S. Marks, Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife 3, (2014).
21
S. Ovchinnikov, L. Kinch, H. Park, Y. Liao, J. Pei, D. E. Kim, H. Kamisetty, N. V. Grishin, D. Baker, Large-scale determination of previously unsolved protein structures using evolutionary information. eLife 4, e09248 (2015).
22
S. Antala, S. Ovchinnikov, H. Kamisetty, D. Baker, R. E. Dempski, Computation and functional studies provide a model for the structure of the zinc transporter hZIP4. J. Biol. Chem. 290, 17796–17805 (2015).
23
Y. Zhang, J. Skolnick, Scoring function for automated assessment of protein structure template quality. Proteins 57, 702–710 (2004).
24
L. Vogeley, T. El Arnaout, J. Bailey, P. J. Stansfeld, C. Boland, M. Caffrey, Structural basis of lipoprotein signal peptidase II action and inhibition by the antibiotic globomycin. Science 351, 876–880 (2016).
25
G. Mao, Y. Zhao, X. Kang, Z. Li, Y. Zhang, X. Wang, F. Sun, K. Sankaran, X. C. Zhang, Crystal structure of E. coli lipoprotein diacylglyceryl transferase. Nat. Commun. 7, 10198 (2016).
26
R. B. Stockbridge, L. Kolmakova-Partensky, T. Shane, A. Koide, S. Koide, C. Miller, S. Newstead, Crystal structures of a double-barrelled fluoride ion channel. Nature 525, 548–551 (2015).
27
S. Safarian, C. Rajendran, H. Müller, J. Preu, J. D. Langer, S. Ovchinnikov, T. Hirose, T. Kusumoto, J. Sakamoto, H. Michel, Structure of a bd oxidase indicates similar mechanisms for membrane-integrated oxygen reductases. Science 352, 583–586 (2016).
28
H. Tsuchiya, S. Doki, M. Takemoto, T. Ikuta, T. Higuchi, K. Fukui, Y. Usuda, E. Tabuchi, S. Nagatoishi, K. Tsumoto, T. Nishizawa, K. Ito, N. Dohmae, R. Ishitani, O. Nureki, Structural basis for amino acid export by DMT superfamily transporter YddG. Nature 534, 417–420 (2016).
29
P. R. Feliciano, C. L. Drennan, M. C. Nonato, Crystal structure of an Fe-S cluster-containing fumarate hydratase enzyme from Leishmania major reveals a unique protein fold. Proc. Natl. Acad. Sci. U.S.A. 113, 9804–9809 (2016).
30
F. DiMaio, N. Echols, J. J. Headd, T. C. Terwilliger, P. D. Adams, D. Baker, Improved low-resolution crystallographic refinement with Phenix and Rosetta. Nat. Methods 10, 1102–1104 (2013).
31
S. Ovchinnikov, D. E. Kim, R. Y.-R. Wang, Y. Liu, F. DiMaio, D. Baker, Improved de novo structure prediction in CASP11 by incorporating coevolution information into Rosetta. Proteins 84 (suppl. 1), 67–75 (2016).
32
Y. Song, F. DiMaio, R. Y.-R. Wang, D. Kim, C. Miles, T. Brunette, J. Thompson, D. Baker, High-resolution comparative modeling with RosettaCM. Structure 21, 1735–1742 (2013).
33
W. R. Taylor, Protein structure comparison using iterated double dynamic programming. Protein Sci. 8, 654–665 (1999).
34
K. T. Simons, I. Ruczinski, C. Kooperberg, B. A. Fox, C. Bystroff, D. Baker, Improved recognition of native-like protein structures using a combination of sequence-dependent and sequence-independent features of proteins. Proteins 34, 82–95 (1999).
35
B. E. Suzek, Y. Wang, H. Huang, P. B. McGarvey, C. H. Wu, UniProt Consortium, UniRef clusters: A comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926–932 (2015).
36
V. Kunin, A. Copeland, A. Lapidus, K. Mavromatis, P. Hugenholtz, A bioinformatician’s guide to metagenomics. Microbiol. Mol. Biol. Rev. 72, 557–578 (2008).
37
V. M. Markowitz, I.-M. A. Chen, K. Chu, E. Szeto, K. Palaniappan, M. Pillay, A. Ratner, J. Huang, I. Pagani, S. Tringe, M. Huntemann, K. Billis, N. Varghese, K. Tennessen, K. Mavromatis, A. Pati, N. N. Ivanova, N. C. Kyrpides, IMG/M 4 version of the integrated metagenome comparative analysis system. Nucleic Acids Res. 42 (D1), D568–D573 (2014).
38
N. Rego, D. Koes, 3Dmol.js: Molecular visualization with WebGL. Bioinformatics 31, 1322–1324 (2015).
39
M. Bostock, V. Ogievetsky, J. Heer, D3: Data-driven documents. IEEE Trans. Vis. Comput. Graph. 17, 2301–2309 (2011).
40
A. Andreeva, D. Howorth, J.-M. Chandonia, S. E. Brenner, T. J. P. Hubbard, C. Chothia, A. G. Murzin, Data growth and its impact on the SCOP database: New developments. Nucleic Acids Res. 36 (Database), D419–D425 (2008).
41
S. R. Eddy, A new generation of homology search tools based on probabilistic inference. Genome Inform. 23, 205–211 (2009).
42
M. Remmert, A. Biegert, A. Hauser, J. Söding, HHblits: Lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat. Methods 9, 173–175 (2011).
43
S. Seemayer, M. Gruber, J. Söding, CCMpred—fast and precise prediction of protein residue-residue contacts from correlated mutations. Bioinformatics 30, 3128–3130 (2014).
44
T. F. Smith, M. S. Waterman, Identification of common molecular subsequences. J. Mol. Biol. 147, 195–197 (1981).
45
N. Malod-Dognin, N. Yanev, R. Andonov, thesis, Institute for Research in Computer Science and Automation, France (2010).
46
N. Malod-Dognin, N. Pržulj, GR-Align: Fast and flexible alignment of protein 3D structures using graphlet degree similarity. Bioinformatics 30, 1259–1265 (2014).
47
P. Di Lena, P. Fariselli, L. Margara, M. Vassura, R. Casadio, Fast overlapping of protein contact maps by alignment of eigenvectors. Bioinformatics 26, 2250–2258 (2010).
48
D. A. Pelta, J. R. González, M. Moreno Vega, A simple and fast heuristic for protein structure comparison. BMC Bioinformatics 9, 161 (2008).
49
S. B. Needleman, C. D. Wunsch, A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970).
50
J. Ma, S. Wang, Z. Wang, J. Xu, MRFalign: Protein homology detection through alignment of Markov random fields. PLOS Comput. Biol. 10, e1003500 (2014).
51
D. E. Kim, F. Dimaio, R. Yu-Ruei Wang, Y. Song, D. Baker, One contact for every twelve residues allows robust and accurate topology-level protein structure modeling. Proteins 82 (suppl. 2), 208–218 (2014).
52
G. Wang, R. L. Dunbrack Jr., PISCES: A protein sequence culling server. Bioinformatics 19, 1589–1591 (2003).
53
J. Lee, H. A. Scheraga, S. Rackovsky, New optimization method for conformational energy calculations on polypeptides: Conformational space annealing. J. Comput. Chem. 18, 1222–1232 (1997).
54
H. Park, F. DiMaio, D. Baker, The origin of consistent protein structure refinement from structural averaging. Structure 23, 1123–1128 (2015).
55
A. A. Schäffer, L. Aravind, T. L. Madden, S. Shavirin, J. L. Spouge, Y. I. Wolf, E. V. Koonin, S. F. Altschul, Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 29, 2994–3005 (2001).
56
C. Angermüller, A. Biegert, J. Söding, Discriminative modelling of context-specific amino acid substitution probabilities. Bioinformatics 28, 3240–3247 (2012).
57
R Development Core Team, R: A language and environment for statistical computing. (R Foundation for Statistical Computing, Vienna, 2013).
Information & Authors
Information
Published In

Science
Volume 355 | Issue 6322
20 January 2017
Copyright
Copyright © 2017, American Association for the Advancement of Science.
Submission history
Received: 22 June 2016
Accepted: 22 November 2016
Published in print: 20 January 2017