nmrshiftdb2 is an NMR database (web database) for organic structures and their nuclear magnetic resonance (NMR) spectra. It allows for spectrum prediction (13C, 1H and other nuclei) as well as for searching spectra, structures and other properties. The nmrshiftdb2 software is open source, the data is published under an open content license. The core of nmrshiftdb2 consists of fully assigned spectra with raw data and peak lists (we have pure peak lists as well). Those datasets are peer reviewed by a . The project is supported by a . nmrshiftdb2 is part of the and will provide a component for a curated repository there. Please consult the for more detailed information.

            
  

News about nmrshiftdb2

NMR prediction history      Sun, 25 Aug 2024 16:44:19 -0000 We have made available an updated version of the "NMR prediction history" originally published in our review paper. The online version will be updated with new data as they become available. Ideas and suggestions are welcome!

nmrshiftdb2 and FAIRness      Tue, 25 Jun 2024 18:14:11 -0000 We can proudly say we have been FAIR even before the term existed. I compiled an overview of the FAIRness of nmrshiftdb2.

New server active      Mon, 17 Jun 2024 16:08:14 -0000 We have had new hardware running the database for about three weeks now. We believe everything is running smoothly, and the occasional speed issues that occurred on the old hardware have been resolved. The database should be fit to run for the foreseeable future.

20 years of nmrshiftdb2      Wed, 20 Dec 2023 10:59:55 -0000 We can celebrate the 20th anniversary of the database. We published a paper in MRC reviewing the development of the database.

13C/1H correlation search      Sun, 03 Dec 2023 20:44:32 -0000 We have added a 13C/1H correlation search. This searches for structures which have pairs of directly connected carbon and hydrogen atoms with certain shifts. For example, "135;5" would search for a structure which has a carbon with a shift of 135 ppm with a hydrogen attached with a shift of 5 ppm. Search is via the spectrum search, choose "13C/1H correlation" as "Spectrum type". Total/subspectrum search work as normal.
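The query format described above can be illustrated with a minimal Python sketch. This is not the nmrshiftdb2 implementation: it only assumes that a query has the form "carbon shift;hydrogen shift" as in the "135;5" example. The function names, the tolerance values and the sample data are hypothetical.

```python
from typing import List, Tuple


def parse_correlation_query(query: str) -> Tuple[float, float]:
    """Split a query such as "135;5" into (13C shift, 1H shift) in ppm."""
    carbon_text, hydrogen_text = query.split(";")
    return float(carbon_text), float(hydrogen_text)


def has_matching_pair(
    ch_pairs: List[Tuple[float, float]],
    query: str,
    carbon_tol: float = 1.0,    # hypothetical tolerance in ppm
    hydrogen_tol: float = 0.1,  # hypothetical tolerance in ppm
) -> bool:
    """Return True if any directly bonded C/H pair matches the query."""
    carbon, hydrogen = parse_correlation_query(query)
    return any(
        abs(c - carbon) <= carbon_tol and abs(h - hydrogen) <= hydrogen_tol
        for c, h in ch_pairs
    )


# Made-up (13C shift, 1H shift) pairs for one structure, for illustration only.
pairs = [(135.2, 5.05), (21.4, 1.2)]
match = has_matching_pair(pairs, "135;5")
```

Here `match` is true because the first pair lies within the assumed tolerances of 135 ppm and 5 ppm.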

Structure dereplication in nmrshiftdb2      Sun, 11 Dec 2022 22:21:01 -0000 We recently worked with Jean-Marc Nuzillard on a paper about using predicted data for structure dereplication. Those data are part of nmrshiftdb2 and can be used for searches (if calculated spectra are used for searches, which is the case by default).

NMRium project      Tue, 13 Jul 2021 16:37:02 -0000 nmrium is a sister project of nmrshiftdb2. It is a web-based visualizer and editor for 1D and 2D NMR spectra. On the website, you can already test 1D and 2D NMR functionalities like peak picking, integration, assignment, and more, without installing software, completely in the browser. A close integration with the next version of nmrshiftdb2 is planned.

Raw data in downloads      Wed, 24 Mar 2021 21:10:39 -0000 We have added links to the raw data to the downloads, where raw data are available. In the sd files, there is an additional tag rawdata, the NMReDATA file has the link in the Spectrum_Location, and the cml files have an attribute rawdata in spectrum. All of this applies only to entries that have raw data. We currently have more than 500 spectra with raw data for 1H and 13C, and about 300 2D spectra of various nuclei. Until now, the raw data were available only via the web interface.
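As a rough illustration of how the rawdata tag in an sd file might be read, the sketch below parses the data items of a single SD record with plain Python. It assumes the tag is spelled exactly "rawdata" as in the text above. The sample record, its URL and the function name are hypothetical, and the molfile block is reduced to its last line.

```python
from typing import Dict


def read_sd_tags(record: str) -> Dict[str, str]:
    """Collect the data items ("> <name>" followed by value lines) of one SD record."""
    tags: Dict[str, str] = {}
    lines = record.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(">") and "<" in line and ">" in line[1:]:
            # The field name sits between the angle brackets of the header line.
            name = line[line.index("<") + 1 : line.rindex(">")]
            i += 1
            values = []
            # Value lines run until a blank line or the record delimiter.
            while i < len(lines) and lines[i].strip() and lines[i] != "$$$$":
                values.append(lines[i])
                i += 1
            tags[name] = "\n".join(values)
        else:
            i += 1
    return tags


# Hypothetical record: only the end of the molfile block and two data items.
SAMPLE = (
    "M  END\n"
    "> <rawdata>\n"
    "https://example.org/raw/spectrum.zip\n"
    "\n"
    "> <other>\n"
    "value\n"
    "\n"
    "$$$$\n"
)

tags = read_sd_tags(SAMPLE)
raw_link = tags.get("rawdata")  # None when the entry has no raw data
```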

Downloads are back      Wed, 10 Mar 2021 12:59:54 -0000 It turned out that due to a size limitation on sourceforge's svn, the download of the full data no longer worked. We have changed that to a file download, the new links are on the help page. For interested developers, the svn repository still exists and can be used. Only the http download is affected by the limit.

nmrshiftdb2 in identifiers.org      Sat, 01 Aug 2020 22:21:52 -0000 nmrshiftdb2 entries can now be resolved using identifiers.org. The prefix is nmrshiftdb2 and the id is the molecule id. For example, the identifier nmrshiftdb2:234 identifies the first entry in NMRShiftDB. A link via identifiers.org would be http://identifiers.org/resolve?query=nmrshiftdb2:234
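The link pattern from the example above can be assembled programmatically. This is a small sketch using only the prefix and resolver URL quoted in the text; the function names are hypothetical.

```python
from urllib.parse import urlencode

PREFIX = "nmrshiftdb2"                      # prefix named in the text
RESOLVER = "http://identifiers.org/resolve"  # resolver URL from the example


def compact_identifier(molecule_id: int) -> str:
    """Build the identifier, e.g. nmrshiftdb2:234."""
    return f"{PREFIX}:{molecule_id}"


def resolver_url(molecule_id: int) -> str:
    """Build the identifiers.org link for a molecule id."""
    # safe=":" keeps the colon unescaped, matching the example URL.
    query = urlencode({"query": compact_identifier(molecule_id)}, safe=":")
    return f"{RESOLVER}?{query}"


url = resolver_url(234)
```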


Latest Additions

nmr assignment database

2002-2010, © Stefan Kuhn 2010 - 2024.

BMRB makes bio-NMR data FAIR.

Findable, accessible, interoperable, re-usable.

BMRB collects, annotates, archives, and disseminates spectral and quantitative data derived from NMR spectroscopic investigations of biological macromolecules and metabolites.

Recently released at BMRB:

Entry 51319: Phosphorylation motif dictates GPCR C-terminal domain conformation and arrestin interaction

Thank you for your letters of support!

Did you know that you can use NMR to study coronaviruses?


Let's get started. What do you want to see?

Notice some changes? The BMRB has recently gotten a redesign! You'll notice a brand new home page as well as updated styling throughout the web site. If you have any feedback, please let us know using the "Support" icon in the lower right of all pages. If you'd like to continue viewing the page using the old styling, we'll preserve that option as long as feasible at legacy.bmrb.io.

About NMRtist

NMRtist is a cloud computing service for the fully automated analysis of protein NMR spectra (e.g. peak picking, chemical shift assignment, structure determination) using deep learning-based approaches. Each project created in NMRtist receives 30 GB of private storage, which can be filled with experimental data and analyzed using the available applications. You don't need any hardware resources and don't have to follow complex software configuration processes. NMRtist applications can be executed with just a few mouse clicks in your web browser. All calculations are executed on NMRtist computational nodes, and the results are made available for download from the NMRtist website.

ARTINA is a deep learning-based application for end-to-end protein structure determination by NMR spectroscopy. Using as input NMR spectra and the protein sequence, the method identifies automatically (strictly without any human intervention): cross-peak positions, chemical shift assignments, upper limit distance restraints, and the protein structure. ARTINA deep learning models have been trained with over 600 000 cross-peak examples from more than 1300 2D-4D spectra. The method demonstrated its ability to solve structures with a median backbone RMSD of 1.44 Å to PDB reference, and identified correctly 91.36% of the chemical shift assignments. View our short video tutorial to learn how to get started with ARTINA.

ARTINA and NMRtist can automate tasks such as cross-peak detection in 2D-4D NMR spectra, de novo chemical shift assignment of protein monomers and protein-ligand complexes, de novo structure calculation, structure-based chemical shift assignment, and chemical shift transfer. A more comprehensive list of system use cases can be found in the Articles & Tutorials section.

You can use the NMRtist platform free-of-charge (academic users) to perform automated peak picking, shift assignment, or full structure determination. Create a free account to use all functions of the service, or start an anonymous project by pressing the button below.

Recommended articles

Video tutorial.

This video tutorial introduces beginners to the NMRtist system, guiding them through the process of submitting an automated protein structure determination job, and showcasing representative results from such a job.

ARTINA manuscript

Artificial Intelligence for NMR Applications (ARTINA) is a deep learning-based approach to fully automated NMR protein structure determination. The method takes as input only NMR spectra and the protein sequence, and delivers automatically: peak lists, shift assignments, distance restraints, and the structure.

NMRtist Use Cases

This article summarizes the most common use cases of NMRtist and ARTINA, such as structure-based chemical shift assignment, chemical shift transfer, or de novo protein structure determination.

Examples of automatically determined structures


July 31, 2024, 11:01 a.m.

New Release: custom protein systems and enhanced data validation

In the new NMRtist release, our developments were aimed at facilitating a broader range of data analysis tasks in macromolecular NMR spectroscopy. In addition to existing features, you can now define custom protein systems while creating new projects, which includes, among others, custom residue types and complexes of proteins with small molecules, peptides, and metal ions. In principle, the new version may be used to facilitate other types of macromolecules, such as RNA and DNA, but our algorithms and deep learning models have been trained and tested explicitly for protein data.

The new version also brings updates that provide new insights into NMR data and make the use of the platform more convenient. For example, the chemical shift assignment now presents a visualization of “control points” that helps resolve issues with spectra referencing – one of the common hurdles NMRtist users faced in the previous version of the system.

We are currently working to enable methyl assignment and further support for solid-state experiments on our platform.

Jan. 15, 2024, 8:14 p.m.

[Manuscript] The 100-protein NMR spectra dataset: A resource for biomolecular NMR data analysis

Open dataset containing 1329 2D-4D NMR spectra that allow the reproduction of 100 protein structures from original measurements. This dataset was originally compiled for the development of the ARTINA deep learning-based spectra analysis method (see https://nmrdb.ethz.ch and the manuscript).

Nov. 30, 2023, 8:15 p.m.

[Manuscript] Time-optimized protein NMR assignment with an integrative deep learning approach using AlphaFold and chemical shift prediction

Our new study, recently accepted in Science Advances ( https://www.science.org/doi/full/10.1126/sciadv.adi9323 ), explores the integration of in-silico predictions like AlphaFold with ARTINA, enhancing the efficiency and accuracy of NMR data analysis. This research represents a significant leap towards data-efficient use of our system for protein studies.

Feb. 2, 2023, 8:39 p.m.

[Manuscript] NMRtist: an online platform for automated biomolecular NMR spectra analysis

Our manuscript (application note), presenting the NMRtist platform, has been accepted for publication in Bioinformatics ( https://doi.org/10.1093/bioinformatics/btad066 ).

Dec. 21, 2022, midnight

NMRtist usage

Since the release of the platform in February 2022, NMRtist has analysed 4 368 2D/3D/4D NMR spectra and completed 1 100 automated chemical shift assignment jobs and 444 automated structure determination jobs.

Dec. 20, 2022, midnight

ARTINA and NMRtist presented to a broader audience

Between 06.2022 and 01.2023, we presented ARTINA and NMRtist at several NMR events, including: Chianti Workshop (Principina Terra, Italy), EUROMAR (Utrecht, The Netherlands), EMBO Practical Course (Basel, Switzerland), EMBO Lecture Course (Berhampur, India), Biomolecular NMR: Advanced Tools, Machine Learning (Gothenburg, Sweden), and ICMRBS (Boston, USA).

Oct. 19, 2022, midnight

[Manuscript] Rapid protein assignments and structures from raw NMR spectra with the deep learning technique ARTINA

Our manuscript, presenting the ARTINA workflow for rapid assignment and structure determination, has been published in Nature Communications ( https://doi.org/10.1038/s41467-022-33879-5 ).

Oct. 2, 2021, midnight

Biomolecular NMR: Advanced Tools workshop

NMRtist was presented at the Biomolecular NMR: Advanced Tools workshop (29.09-01.10 2021). All participants of the training, supervised by Prof. Peter Güntert and Dr. Piotr Klukowski, submitted datasets to the platform, obtaining automatically determined structures and/or assignments.


Biomolecular NMR Assignments

  • Provides an avenue for depositing these data into a public database at BioMagResBank.
  • Assignment Notes are published in biannual editions in June and December.
  • No page charges or fees for online color images.
  • Optional color images in print and open access publication fees apply.
  • Christina Redfield


Latest issue

Volume 18, Issue 1

Latest articles

Solution NMR backbone assignment of the N-terminal tandem Zα1-Zα2 domains of Z-DNA binding protein 1

  • Lily G. Beck
  • Jeffrey B. Krall
  • Beat Vögeli


NMR-based solution structure of the Caulobacter crescentus ProXp-ala trans-editing enzyme

  • Antonia D. Duran
  • Eric M. Danhart
  • Mark P. Foster


Solution NMR backbone resonance assignment of the full-length resistance-related calcium-binding protein Sorcin

  • Kathleen Joyce Carillo


Chemical shift assignments of the α-actinin C-terminal EF-hand domain bound to a cytosolic C0 domain of GluN1 (residues 841–865) from the NMDA receptor

  • Johannes W. Hell
  • James B. Ames


1H, 15N and 13C resonance assignments of eggcase silk protein 3


Journal information

  • Biological Abstracts
  • Chemical Abstracts Service (CAS)
  • Google Scholar
  • INIS Atomindex
  • Japanese Science and Technology Agency (JST)
  • Norwegian Register for Scientific Journals and Series
  • OCLC WorldCat Discovery Service
  • Science Citation Index Expanded (SCIE)
  • TD Net Discovery Service
  • UGC-CARE List (India)

Rights and permissions

Editorial policies

© Springer Nature B.V.

  • Find a journal
  • Publish with us
  • Track your research

NMR spectra processing for everybody

Unrestrained access to first-class online software for NMR spectra processing. It is free and you can get started right away from your browser.

Process directly online

You don't have to go through the hassle of installing any software or applications. Click here to start.

1D and 2D spectra

NMRium accepts 1D and 2D spectra. A 1D spectrum can be either a FID or a Fourier-transformed spectrum. Currently, only FT 2D spectra are allowed.
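To make the distinction between a FID and a Fourier-transformed spectrum concrete, here is a minimal sketch in plain Python. It does not reflect how NMRium processes data. It builds a synthetic, exponentially decaying FID and applies a naive discrete Fourier transform; all parameter values and function names are assumptions chosen for illustration.

```python
import cmath
import math
from typing import List


def synthetic_fid(freq_hz: float, t2_s: float, n_points: int, dwell_s: float) -> List[complex]:
    """Complex time-domain signal: one oscillation with exponential decay."""
    return [
        cmath.exp(2j * math.pi * freq_hz * n * dwell_s) * math.exp(-n * dwell_s / t2_s)
        for n in range(n_points)
    ]


def dft(signal: List[complex]) -> List[complex]:
    """Naive O(N^2) discrete Fourier transform, fine for a small example."""
    n_points = len(signal)
    return [
        sum(x * cmath.exp(-2j * math.pi * k * n / n_points) for n, x in enumerate(signal))
        for k in range(n_points)
    ]


DWELL_S = 0.01  # assumed sampling interval, giving a 100 Hz spectral width
fid = synthetic_fid(freq_hz=12.5, t2_s=0.2, n_points=64, dwell_s=DWELL_S)
spectrum = dft(fid)

# The frequency-domain peak sits at the bin matching the oscillation frequency.
magnitudes = [abs(value) for value in spectrum]
peak_bin = max(range(len(magnitudes)), key=magnitudes.__getitem__)
peak_hz = peak_bin / (len(fid) * DWELL_S)
```

The FID is the time-domain list `fid`; the spectrum is its transform, in which the single resonance appears as one peak at `peak_hz`.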

Smart peak picking

NMRium includes advanced peak picking for 1D and 2D spectra and is able to generate the NMR string required for publication or patent.

All the processing and assignment can be stored as a “.nmrium” file. This file contains the original data as well as all the processing that was applied to the spectrum. Assignments of the molecule are also saved in the file.

Not just signal processing

NMRium also handles chemical structures. They can be imported from an MDL Molfile, copy-pasted directly into the molecule panel, or drawn.

Perfect for teaching

Try out our structure elucidation exercises or create your own exercises! They are great for students.

Great user experience

To provide an optimal user experience, the spectra processing is efficiently performed within the web browser.

Opens multiple file formats

Just drag and drop a JCAMP-DX file, a Bruker folder or a JEOL file.

UNIVERSITY OF COLOGNE


Faculty of Mathematics and Natural Sciences NMR facility

Open access NMR database with integrated LIMS, electronic structure elucidation tools.

nmrshiftdb2 is an NMR web database for organic structures and their NMR spectra. It allows for spectrum prediction (13C, 1H and other nuclei) as well as for searching spectra, structures and other properties. Last but not least, it features peer-reviewed submission of datasets by its users.

The recently added lab information management system (LIMS) features (i) easy administration of orders and the retrieved data, (ii) a concise overview of lab workload and (iii) work accomplished. In addition to managing their orders, users can assign their spectra and therefore benefit from all nmrshiftdb2 functions. The nmrshiftdb2 software is open source, the data is published under an open content license. More details are described in our flyer.

  • Magn. Reson. Chem. 2015, 53, 582-589, doi: 10.1002/mrc.4263.

More recently, we are also involved in electronic assignment and new, digital workflows for NMR data from lab to publication. We are participating in the NMReDATA project and the IDNMR initiative of the magnetic resonance division within the German Chemical Society (GDCh):

  • Magn. Reson. Chem. 2018, 56, 513-519, doi: 10.1002/mrc.4675.

Spectra Search NMR Spectrum

NMR Search provides a powerful interface for searching the database. You can build up queries that support a wide range of conditions, including Frequency, Tolerance, Exact Mass Range for 1H/13C. To get started, click the "Load Example" button to perform an example search.

Instructions:

  • You can filter by Frequency and Tolerance. These fields are mandatory. Exact Mass is not.
  • The minimum tolerance is 0.01 for a 1H search and 0.1 for a 13C search.
  • Peaks and intensities must be numbers separated by new lines.
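The rules in the instructions above can be sketched as a small input check. This is an illustration under stated assumptions, not the search page's actual validation code: it assumes one number per line in the peak field, and the function names and error messages are hypothetical. Only the minimum tolerances and the mandatory fields come from the instructions.

```python
from typing import List, Optional

# Minimum tolerances quoted in the instructions.
MIN_TOLERANCE = {"1H": 0.01, "13C": 0.1}


def parse_numbers(text: str) -> List[float]:
    """Parse newline-separated numbers; float() raises ValueError on bad input."""
    return [float(line) for line in text.splitlines() if line.strip()]


def validate_query(
    nucleus: str,
    frequency: Optional[float],
    tolerance: Optional[float],
    peaks_text: str,
) -> List[float]:
    """Check mandatory fields and tolerance, then return the parsed peaks."""
    if frequency is None or tolerance is None:
        raise ValueError("Frequency and Tolerance are mandatory")
    if tolerance < MIN_TOLERANCE[nucleus]:
        raise ValueError(f"Tolerance for {nucleus} must be at least {MIN_TOLERANCE[nucleus]}")
    return parse_numbers(peaks_text)


# Made-up 13C query for illustration.
peaks = validate_query("13C", frequency=100.0, tolerance=0.5, peaks_text="135.2\n21.4\n")
```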


Molecules: Special Issue on "NMR Spectroscopy in Natural Product Structure Elucidation"

See here for details, quality considerations of published nmr-data, a summary of typos, misassignments and structure revisions, latest publications.

NPS Data Hub: a Web-based Community Driven Analytical Data Repository for New Psychoactive Substances Aaron Urbas, Torsten Schoenberger, Charlotte Corbett, Katrice Lippa, Felix Rudolphi, Wolfgang Robien Forensic Chemistry, 9, 76 (2018);   DOI: 10.1016/j.forc.2018.05.003

NMReDATA, a standard to report the NMR assignment and parameters of organic compounds Marion Pupier, Jean‐Marc Nuzillard, Julien Wist, Nils E. Schlörer, Stefan Kuhn, Mate Erdelyi, Christoph Steinbeck, Antony J. Williams, Craig Butts, Tim D.W. Claridge, Bozhana Mikhova, Wolfgang Robien, Hesam Dashti, Hamid R. Eghbalnia, Christophe Farès, Christian Adam, Pavel Kessler, Fabrice Moriaud, Mikhail Elyashberg, Dimitris Argyropoulos, Manuel Pérez, Patrick Giraudeau, Roberto R. Gil, Paul Trevorrow, Damien Jeannerat Magnetic Resonance in Chemistry, 2018, 1-13;   DOI: 10.1002/mrc.4737

Wiley's announcement of the 13C-NMR Data Checker

Wiley Data Checker

Wiley's Data Checker on Twitter

And the winner is ..... Analytical and Bioanalytical Chemistry "Through the looking-glass challenge"

 
C-NMR Spectrum: don't enter any lines
C-NMR Peaklist: enter as many lines as necessary; optionally assign as many lines as possible
C-NMR Peaklist: based on CSEARCH technology; enter your structure and the peaklist and all will be done for you
C-NMR Peaklist: you can use this feature directly from Bruker's TOPSPIN
C-NMR Peaklist and hundreds/thousands of Structure Proposals
  • Erlangen 2008
  • Erlangen 2010
  • Erlangen 2011
  • Erlangen 2012
  • Erlangen 2013
  • Erlangen 2014
  • Erlangen 2015
  • Erlangen 2016
  • Erlangen 2017
  • Chemietage-Salzburg 2017
  • Porto 2019 / 1st NMReDATA-Meeting@SMASH
2014/Feb/01    915,296
2015/Feb/01    932,111    + 16,815
2016/Feb/01    954,458    + 22,347
C13-NMR spectra used for the CSEARCH-Robot-Referee:
Predicted C13-NMR spectra for Similarity-Searches:
Evaluation of C13-NMR Assignments: YES, see
C13-NMR based Spectral Similarity-Searches: YES, see
Accessing evaluation directly from Bruker's TOPSPIN program family: YES
Accessing the Spectral Similarity Search directly from Bruker's TOPSPIN program family: YES
Long-term storing of own spectral data: YES
Using own spectral data for prediction of C13-NMR chemical shifts: YES
Using own spectral data for evaluating new data: YES
Donating spectra to the community possible? YES
Which type of classification is given? Similar to the usual classification of peer-reviewing: ACCEPT, MINOR & MAJOR REVISION and REJECT
Is this "seal of approval" requested by journals? YES; an increasing number of referees are already requesting this evaluation
What happens when the evaluation gives a MAJOR REVISION or REJECT? Depending on the personal setting, a Spectral Similarity Search or a Structure Generation Process can be automatically launched
Transfer of results to mobile devices possible: YES, using QR-codes
Can I cite the evaluation in my publication? YES, the given QR-code is a permanent URL
Resulting pages are protected against manipulation: YES
Complete history of all requests for a certain compound visible: YES, on a per-email basis
Can I implement accessing this system into my "ELN"-software? YES, ask for specifications
Is this system already in use with a Chemistry-related journal? see here:
You need a more detailed evaluation including drilling down into your data: Use a more sophisticated implementation of CSEARCH-technology and/or data from my commercial cooperation partners:

and



Why do we need such a Robot-Referee for 13C-NMR assignments based on high-quality data and excellent algorithms?

The publication. The referee-report. The complete story.

I am quite sure you agree that we do not want to solve future structure elucidation problems based on such "reference material"!

Wiley Data Checker available online

Protein NMR

A practical guide, introduction.

Most books on Protein NMR focus on theoretical aspects and pulse sequences with only little space devoted to resonance assignment and structure calculations. At the same time many software manuals provide detailed information on how to use the software, but assume prior knowledge of the concepts of assignment and structure calculation. This has produced a gap in this area which these webpages aim to bridge by describing the concepts of assignment in detail with the help of many illustrations. Much space and discussion is devoted to practical aspects.

The implementation of protein NMR assignment is described using the program CCPNmr Analysis. This program has been developed by CCPN and actively seeks input from the NMR community. CCPNmr Analysis is based on the detailed and well thought-out CCPN Data Model which has the advantage (a) that it feeds directly into the CCPN Format Converter thus simplifying the import from and export to other programs, and (b) that as more and more NMR-related programs adopt the CCPN Data Model it is likely to take on a key role in NMR data management – in a similar way to CCP4 for protein X-ray crystallography. CCPNmr Analysis is already one of the best assignment programs available while still being developed and provides excellent support via the CCPN Mailing List (a manual is also available). (Although I now work for the CCPN group, these webpages and my recommendation to use this program far predate this!)

Webpages include:

  • description of several resonance assignment strategies
  • simple descriptions and discussions of many multidimensional NMR experiments commonly used in protein NMR
  • basic usage of CCPNmr Analysis (versions 1 and 2)
  • how to make publication quality figures using CCPNmr Analysis
  • advice for using CCPNmr Analysis with solid-state MAS NMR data
  • tutorial on protein assignment using solid-state MAS NMR data
  • description of isotopic labelling strategies commonly used in protein NMR
  • links to a large number of protein NMR software packages
  • suggested literature for further reading
  • links to other useful NMR webpages
rNMR Open Source Software for NMR Data Analysis - rnmr.nmrfam.wisc.edu
MNOVA acdlabs https://www.acdlabs.com/solutions/nmr-spectroscopy/
NMRbox is a resource for biomolecular NMR (Nuclear Magnetic Resonance) software.
  • 4DSPOT: Protein chemical shift prediction in 4-dimensions, with molecular flexibility as the 4th dimension
  • ABACUS: Combines assignment of protein NOESY spectra and structure determination
  • ADAPT-NMR Enhancer: Visualize the tilted 2D plane data from ADAPT-NMR
  • AFNMR: Quantum chemical estimates chemical shifts in proteins and nucleic acids
  • ALATIS: A tool for assigning unique and reproducible labels to all atoms of small molecules
  • ALMOST: All atom molecular simulation toolkit
  • AlphaFold: Neural network prediction of protein structure
  • Amaterasu: Simplify the screening, acquisition, processing and model fitting of R1ρ relaxation dispersion NMR datasets
  • Amaterasu'kai: Simplify the screening, acquisition, processing and model fitting of R1ρ relaxation dispersion NMR datasets
  • AmberTools: Amber is a set of molecular mechanical force fields and a package of molecular simulation programs
  • ANATOLIA: NMR software for spectral analysis of total lineshape
  • ANSURR: ANSURR uses backbone chemical shifts to validate the accuracy of a protein structure
  • AQUA: AQUA is a suite of programs for Analyzing the QUAlity of biomolecular structures
  • ARIA: Automates NOE assignment and NMR structure calculation
  • ArShift: Structure based predictor of protein aromatic side-chain proton chemical shifts
  • ASDP: Automated determination of protein structures and NOE assignments from NMR data
  • Assign_SLP: Genetic algorithm search for correct assignments of HSQC crosspeaks
  • ATSAS: A program suite for small-angle scattering data analysis from biological macromolecules
  • AutoAssign: Automating the analysis of backbone resonance assignments
  • AutoDock Vina: An open-source program for doing molecular docking
  • Azara: A suite of programs to process and view NMR data
  • BATMAN: An R package for estimating metabolite concentrations from NMR spectral data using a specialised MCMC algorithm
  • BLAST: Finds regions of similarity between biological sequences
  • BLAST (legacy version): Finds regions of similarity between biological sequences
  • CS-Rosetta (BMRB): Submission tool for sending jobs to the BMRB CS-Rosetta server
  • calRW: Distance-dependent atomic potential for protein structure modeling and structure decoy recognition
  • calRW+: Orientation-dependent atomic potential for protein structure modeling and structure decoy recognition
  • CambridgeCS: Reconstruction of non-uniform spectra with compressed sensing
  • CAMERA: Maximum Entropy reconstruction of nonuniformly sparsely sampled data
  • CARA: The analysis of NMR spectra and computer aided resonance assignment
  • Carma: Analysis of molecular dynamics trajectories
  • CATIA: Analyze CPMG relaxation dispersion data and extract chemical exchange parameters of a two-site chemically exchanging system
  • CcpNmr Analysis: An NMR spectrum visualisation, resonance assignment and data analysis program
  • CcpNmr Analysis Assign: An NMR spectrum visualisation, resonance assignment and data analysis program
  • CcpNmr Analysis Metabolomics: Module of CcpNmr Analysis for analyzing metabolomics data
  • CcpNmr Analysis Screen: Module of CcpNmr Analysis for screening
  • CcpNmr ChemBuild: A graphical tool to construct chemical compound definitions for NMR
  • CcpNmr SpecView: SpecView provides a fast and easy way to visualize NMR spectrum and peak data
  • CH3Shift: Structure-based prediction of protein methyl group chemical shifts
  • CHEMEX: Fit chemical exchange induced shift and relaxation data
  • Chimera: An extensible interactive molecular visualization and analysis system
  • CLUSTAL W/X: Multiple alignment of nucleic acid and protein sequences
  • Cluster 3.0: Clustering software for gene expression and data analysis
  • CNS v1.3, 1.21-ARIA: Provides a hierarchical approach for the most commonly used algorithms in structure determination
  • COMPASS: Experimental protein structure verification by scoring with a single, unassigned NMR spectrum
  • CONNJUR Workflow Builder: An open-source framework for software and data integration in bio-NMR
  • Connjur Widgets: Widgets built into the File Browser for parsing NMR metadata
  • CPMG-Fit (Korzhnev): Fits CPMG relaxation dispersion data for analysis of chemical exchange in NMR spectroscopy
  • CPMGFit (Palmer): Fits CPMG relaxation dispersion data for analysis of chemical exchange in NMR spectroscopy
  • CSCDP: Chemical shift calculation for chemically denatured proteins
  • CS-GAMDy: A robust algorithm for refining protein structures with NMR chemical shifts
  • CS-Rosetta: System for chemical shifts based protein structure prediction using ROSETTA
  • CTFFIND: Program for finding CTFs of electron micrographs
  • CYRANGE: Identification of domains from NMR structure bundles
  • CYTOSCAPE: Platform for visualizing molecular interaction networks and biological pathways and integrating networks with annotations
  • DANGLE: Predicts protein backbone angles and assignments of chemical shifts using a DB of known structures and shifts
  • DASHA: Process heteronuclear NMR relaxation data
  • dataChord Spectrum Analyst: NMR data processing and analysis software for big, small, and mixtures of molecules
  • dataChord Spectrum Miner: Integrated application for NMR metabolomics and spectrum mining
  • DATAWARRIOR: Data visualization and analysis program with embedded chemical intelligence
  • DEEP Picker: A deep neural network for accurate deconvolution of complex two-dimensional NMR spectra
  • DOSY Toolbox: For processing PFG-NMR diffusion data
  • drawnmr: Module for viewing NMR data in Python
  • DYNAMO: The NMR molecular dynamics and analysis system
  • EISD: Probability scores to ensembles of Intrinsically Disordered Proteins (IDPs) based on their fit to experimental data
  • ENSEMBLE: Tools to determine and analyze the weighted ensemble of structures in unfolded states
  • FANDAS 2.0: Tool to predict peaks in multidimensional NMR experiments on proteins
  • Farseer-NMR: A software suite for automatic treatment, analysis and plotting of large and multivariable datasets of bioNMR peaklists
  • FASTModelFree: Rapid automated analysis of solution NMR spin-relaxation data
  • FID-Net: Deep Neural Networks for Analysing NMR time domain data.
  • FitNMR: Open-source R package for extracting peak parameters
  • Flexible-meccano: A tool for the generation of explicit ensemble descriptions of intrinsically disordered proteins and their associated experimental observables
  • FoXS: FoXS is a method for computing a theoretical scattering profile of a structure and fitting of experimental profile
  • FuDA: Analyse nD NMR correlation spectra
  • Gctf: GPU accelerate real-time determination and correction of the contrast transfer function
  • Geometric-Approximation: Computational approach to characterize protein dynamics from adiabatic relaxation dispersion experiments
  • GISSMO: Efficient calculation and refinement of spin system matrices
  • GLOVE: Fit relaxation dispersion data and test exchange models
  • GNAT: A general tool for processing NMR data
  • GOAP: All-atom statistical potential for protein structure prediction
  • GREMLIN: Method to learn a statistical model of a protein family capturing conservation and co-evolution patterns
  • GROMACS: A versatile package to perform molecular dynamics
  • GUARDD: Organizes, automates, and enhances the analytical procedures which operate on CPMG RD data
  • HMMER: Hidden Markov models for sequence profile analysis
  • hmsIST: Estimate the spectrum for nonuniformly sampled data using iterative soft thresholding
  • HullRad: Algorithm for calculating hydrodynamic properties of a macro-molecule from a structure file
  • hydroNMR: Calculation of NMR relaxation parameters of small macromolecules from a PDB file
  • ICOSHIFT: Solves signal alignment problems in metabonomic NMR data analysis.
  • ImageJ-Fiji: Image processing distribution of ImageJ, bundling many plugins to facilitate image analysis
  • IMP: IMP provides an open source C++ and Python toolbox for solving complex modeling problems
  • InChI: InChI provides unique labels for well-defined chemical substances
  • INFOS: Tool for fitting complicated spectra to better quantify peak amplitudes, integrals, and positions
  • Jellyfish: Simulate and view NMR spectra of spin systems in the liquid state experiencing J-couplings
  • Jupyter: Allows you to create and share documents that contain live code, equations, visualizations and narrative text
  • KdCalc: Determines binding constants by fitting NMR titration data in the fast exchange regime
  • LarmorCa: Predicts protein backbone (HN, N, HA, CA, CB) chemical shifts from a PDB structure
  • LARMORD: Simple and efficient program for predicting non-exchangeable 1H and protonated 13C RNA chemical shifts
  • localCIDER: Calculates and presents various sequence parameters associated with disordered protein sequences
  • M2MTool: A software tool to facilitate depositing data to BMRB from within NMRbox.
  • MADByTE: Using 2D NMR to streamline Natural Product Research
  • MARS: Robust automatic backbone assignment of proteins even with extreme chemical shift degeneracy
  • MATLAB: Numerical computing environment
  • MATLAB Compiler Runtime: Run compiled MATLAB applications or components without installing MATLAB
  • MD2NOE: Direct NOE simulation from long MD trajectories
  • MDAnalysis: MDAnalysis is a Python library to analyze molecular dynamics trajectories
  • MddNMR: A program for processing of non-uniformly sampled multidimensional NMR spectra
  • MDTraj: Read, write and analyze MD trajectories with only a few lines of Python code
  • MESMER: Analyzes the experimentally averaged data obtained from any number of experimental techniques such as SAXS and NMR
  • MestReNova (Mnova): A top class software suite to process your analytical chemistry data
  • MetaboAnalystR: An R package for comprehensive analysis of metabolomics data
  • Metabolomics toolbox: Metabolomics toolbox
  • MetScape: A bioinformatics framework for the visualization and interpretation of metabolomic and expression profiling data
  • MMTSB Toolset: A collection of perl-based utilities and libraries for multiscale protein structure modeling
  • Modelfree: Optimizing “Lipari-Szabo model free” parameters to heteronuclear relaxation data
  • MODELLER: Homology or comparative modeling of protein three-dimensional structures
  • Module2: Analyzes residual dipolar couplings and residual chemical shifts measured in partially aligned proteins and nucleic acids
  • mol2sphere: Convert a molecule into a set of spheres of variable radii for visualization and modeling
  • Mollib: Program and Python library for the validation, quality analysis and manipulation of molecular structures
  • MOLMOL: Molecular graphics program for displaying, analyzing, and manipulating biological macromolecules
  • MolProbity: All-atom structure validation for macromolecules
  • MoSART: To provide an easily extensible application for computing biomolecular structure from NMR data
  • MotionCor2: Corrects electron beam-induced sample motion
mTM-align Efficient protein structure comparisons MVAPACK Tools for processing and analyzing chemometric data NAMD NAMD is computer software for molecular dynamics simulation, written using the Charm++ parallel programming model NESSY Analyse NMR relaxation dispersion data of either CPMG or R1p (R1rho) dispersion experiments NESTA-NMR Fast and accurate reconstruction of NUS data nightshift Python command line utility and library for plotting simulated 2D and 3D NMR spectra from assigned chemical shifts in the BMRB NMRDraw NMRDraw is the companion graphical interface for NMRPipe and its processing tools NMRFAM-SPARKY A graphical NMR assignment and integration program for proteins, nucleic acids, and other polymers nmrfit Quantitative NMR analysis through least-squares fit of spectroscopy data NMRFx Analyst Data processing program utilizing Python for scripts and a full Java based GUI NMRFx Processor Data processing program utilizing Python for scripts and a full Java based GUI NMRFx Structure Features for structure calculation and chemical shift prediction nmrglue A module for working with NMR data in Python NmrLineGuru A graphical user interface ( GUI ) based user-friendly tool to simulate and fit NMR line shapes with multi-state equilibrium models NMRmix A Tool for the Optimization of Compound Mixtures in 1D 1H NMR Ligand Affinity Screens NMRPipe Multidimensional spectral processing and analysis of NMR data NMRPy A Python module for processing NMR spectra NMR-scripts A collection of small scripts of various functions nmrstarlib A Python library that facilitates reading and writing NMR-STAR formatted files used by BMRB for archival of NMR data NMRViewJ The Application for Visualization and Analysis of Macromolecular NMR Software nmr_wash Suppression of artifacts in NMR spectra obtained from sparsely sampled data NUScon Workflow tool for running NUS reconstructions on challenge data in support of NUScon evaluation nus-tool Utility for generating and analyzing NUS 
sample schedules Open Babel A chemical toolbox designed to search, convert, modify, or analyze chemical files OpenMM A high performance toolkit for molecular simulation OpenVnmrJ Open source version of Varian's/Agilent's VnmrJ software OSPREY Suite of programs for computational structure-based protein design PALES Prediction of sterically induced alignment in a dilute liquid crystalline phase PANAV PANAV is a Java based structure-independent chemical shift validation and re-referencing tool ParmEd A tool for aiding in investigations of biomolecular systems using molecular simulation packages (Amber, CHARMM, and OpenMM) PATI Predicts the alignment tensor and RDC's under steric alignment PDBStat NMR restraint analysis software and converter pdb-tools A swiss army knife for manipulating and editing PDB files. PAL A library of programs to assist in peak assignment and validation PEAKY Peak detection in NMR spectra for 1D to 4D spectra PINT A user-friendly software for rapid and accurate analysis of NMR spectra PLUMED The community-developed PLUgin for MolEcular Dynamics PLUQ Predict amino acid residue types and secondary structure assignments from chemical shifts POISSON-GENERATOR Generates NUS Poisson Gap sample schedules POMONA Chemical shift guided protein structure alignment PONDEROSA Software package for automated protein 3D structure determination POOL Suite of programs for protein loop backbone structure determination using RDCs PPM/PPM_ONE Chemical shift prediction for a single structure (PPM_One) or to account for motional averaging to molecular ensembles, such as MD simulations Probe Evaluates atomic packing by generating “contact dots” where atoms are in close proximity PROCHECK Checks the stereochemical quality of a protein structure PROMEGA Proline Omega angle prediction from sequence and chemical shifts PROTEIN-DYNAMICS Predict NH and methyl order parameters from structure py4xs py4xs: a python package for processing x-ray scattering data pyIPINE Python 
script for submitting and retrieving an I-PINE job PyMOL A molecular visualization system on an open source foundation, maintained and distributed by Schrödinger PyNMR-STAR A Python module for reading, writing, and manipulating NMR-STAR files PyShifts A PyMOL plugin for assessing the global quality of RNA structures using NMR chemical shifts Qhull Computes the convex hull of a shape, such as a protein QSched Quantile-directed nonuniform sampling RASMOL Molecular Graphics Visualisation Tool RASPnmr Uses structure-based chemical shift predictions to solve the backbone resonance assignment problem raw Processing and analysis of Small Angle X-ray Scattering (SAXS) data RCI Predict protein flexibility using secondary chemical shifts RCS A program for computing NMR aromatic ring current shifts RDC-PANDA/Analytic Programs for NOE assignment and structure determination starting with a global fold calculated from exact solutions to the RDC equations REDCAT The analysis of residual dipolar couplings (RDCs) for structure validation and elucidation REDCRAFT Tool for determining a protein's structure using residual dipolar couplings (RDCs) Reduce A program for adding hydrogens to a PDB molecular structure file relax Analysis software for Model-free, NMR relaxation (R1, R2, NOE), reduced spectral density mapping, relaxation dispersion RELION Empirical Bayesian approach to refinement of 3D reconstructions or 2D class averages in CryoEM Remediator Converts PDB files between PDBv2.3 and PDBv3.2 formats in either direction RESMAP Computes the local resolution of 3D density maps studied in structural biology, primarily in CryoEM Ring NMR Dynamics Characterization of protein and nucleic acid conformational dynamics and kinetics using solution and solid-state NMR RNAstructure RNAstructure is a complete package for RNA and DNA secondary structure prediction and analysis rNMR Visualizing and interpreting one and two dimensional NMR data RNMRTK General-purpose NMR data processing package, 
including maximum entropy spectral reconstruction Rosetta A software suite of algorithms for computational modeling and analysis of protein structures ROTDIF3 Determines the overall rotational diffusion tensor from spin-relaxation data RUNER Enables seamless modifications of atom force field parameters in the molecular modeling software package Xplor-NIH SEER Program for the reconstruction of non-uniformly sampled NMR data SES Recovers a representative conformational ensemble from underdetermined RDC data SHIFTS Computes proton chemical shifts from empirical formulas ShiftX Predicts 1H, 13C and 15N chemical shifts for your favorite protein SHIFTX2 Predicts both the backbone and side chain 1H, 13C and 15N chemical shifts for proteins SHIMpanzee A program for the simulation of NMR shim lineshapes SIMPSON General-purpose software package for simulating virtually all kinds of solid-state NMR experiments SMILE Algorithm to integrate a priori information about NMR signals for reconstruction of non-uniformly sampled (NUS) multidimensional data SPARTA+ Neural network algorithm to make rapid chemical shift prediction on the basis of known structure CONNJUR Spectrum Translator CONNJUR Spectrum Translator is a free, extensible, and open source application for NMR File Spectral Format Conversion Spinach Spinach is a fast spin dynamics simulation library SpinDrops An interactive quantum spin simulator that uses the DROPS Representation ssNake Versatile tool for processing and analysing NMR spectra SSP Secondary structure propensities from chemical shifts and 13C chemical shift referencing STARXML XML to and from NMR-STAR converter STRIDE Protein secondary structure assignment from atomic coordinates TALOS+ Prediction of protein backbone torsion angles from NMR chemical shifts TALOS-N Prediction of protein backbone and sidechain torsion angles from NMR chemical shifts tameNMR Suite of tools for processing and analysis of NMR data from metabolomics experiments taurenmd A 
command-line interface for analysis routines of Molecular Dynamics data Tensor2 NMR Relaxation analysis of internal motions using the Lipari-Szabo or extended Lipari-Szabo method TensorView A software tool for displaying NMR tensors tiger A periodic table with detailed information on their NMR properties TITAN 2D NMR lineshape analysis to monitor protein-ligand interactions TopSpin Bruker's software package for NMR data processing and analysis TreeView Gene Expression Visualization Tool TreeView3 Java app for visualizing large data matrices. It can load a dataset, cluster it, browse it, customize its appearance and export it into a figure. TREND Resolve trends of change in imaging, spectra, or other data with ease UCBShift Predict chemical shifts for backbone atoms and β-carbon of a protein in solution using machine learning Unblur Aligns the frames of movies recorded on an EM to reduce image blurring due to beam-induced motion VMD VMD is a molecular visualization program for displaying, animating, and analyzing large biomolecular systems VMD-XPLOR Combination of the X-PLOR structure determination program and VMD Visual Molecular Dynamics Wattos Programs to analyze, annotate, parse, and disseminate NMR data WHAT IF Versatile molecular modelling package that is specialized on working with proteins and the molecules in their environment XDrawChem Molecule drawing program XEASY Interactive, computer-supported NMR spectrum analysis Xipp NMR analysis software for biomolecules Xplor-NIH A structure determination program which builds on the X-PLOR program XSSP A series of PDB-related databanks for everyday needs xyza2pipe Cross conversion environment of higher dimensional NMR spectra in several different formats

  • Open access
  • Published: 18 December 2020

A method for validating the accuracy of NMR protein structures

  • Nicholas J. Fowler (ORCID: orcid.org/0000-0002-6005-935X) 1,
  • Adnan Sljoka (ORCID: orcid.org/0000-0002-2398-9523) 2, 3 &
  • Mike P. Williamson (ORCID: orcid.org/0000-0001-5572-1903) 1

Nature Communications volume 11, Article number: 6321 (2020)

  • Computational biophysics
  • Solution-state NMR
  • Structural biology

We present a method that measures the accuracy of NMR protein structures. It compares random coil index [RCI] against local rigidity predicted by mathematical rigidity theory, calculated from NMR structures [FIRST], using a correlation score (which assesses secondary structure), and an RMSD score (which measures overall rigidity). We test its performance using: structures refined in explicit solvent, which are much better than unrefined structures; decoy structures generated for 89 NMR structures; and conventional predictors of accuracy such as number of restraints per residue, restraint violations, energy of structure, ensemble RMSD, Ramachandran distribution, and clashscore. Restraint violations and RMSD are poor measures of accuracy. Comparisons of NMR to crystal structures show that secondary structure is equally accurate, but crystal structures are typically too rigid in loops, whereas NMR structures are typically too floppy overall. We show that the method is a useful addition to existing measures of accuracy.


Introduction

Protein structures are probably the single most important resource for understanding protein function, and are deposited in the Protein Data Bank (PDB), which currently contains around 160,000 structures, of which around 90% are X-ray diffraction structures, 8% are nuclear magnetic resonance (NMR) structures, and the rest are mainly from electron microscopy (EM) 1. The NMR structures are relatively small in number, but are important because they include a high proportion of small proteins with under-represented folds. Most NMR structures are determined in solution, whereas X-ray structures are determined in a crystalline environment. Arguably this makes NMR structures more representative of in vivo structures. However, structures are only useful if they are accurate (i.e., close to the “true” structure) and (equally importantly) can be shown to be accurate. The PDB has therefore become increasingly concerned about validation of structures in the database: the community needs objective and reliable measures to check whether the structure deposited is accurate. The PDB set up four task forces to provide recommendations for validation: for crystallography, NMR, EM, and small-angle scattering, which have all reported 2, 3, 4, 5 and have created a suite of validation tools for the PDB 6. They concluded that validation cannot be based on a single measure. The measures used comprise a combination of geometrical tests, and comparison to input data. Because it is expected that crystal structures and solution structures have the same physical forces underlying them, the geometrical tests for crystal and NMR structures are identical, and include clashscore (how well atoms are packed together), an analysis of Ramachandran outliers (how well the backbone dihedral angles comply with structural norms), and an analysis of sidechain outliers. The comparisons to input data are necessarily different for X-ray and NMR structures. 
For X-ray structures there is a very good measure, namely the R factor, which is the difference between the intensities of experimental diffraction data, and those calculated from the final structure. If the R factor is low (typically less than about 20%) then the structure is almost certainly essentially correct. In structural biology there is a strong temptation to over-fit the data, i.e., to add extra detail in order to improve the fit between experimental data and structure. Hence, a second measure was developed: Rfree, which is an R factor calculated using 10% of the diffraction data that was set aside and not used in the refinement 7. Rfree should be similar in size to R for a structure that is not over-refined. Together these two measures provide a reliable guide to the accuracy of the crystal structure.
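To make the two crystallographic measures concrete, the following minimal sketch computes an R factor and sets aside a free set. It assumes the conventional definition based on structure-factor amplitudes, R = Σ| |F_obs| − |F_calc| | / Σ|F_obs|, which the text above does not spell out; the function names and the random 10% split are illustrative and are not taken from any particular refinement package.

```python
import random


def r_factor(f_obs, f_calc):
    """Conventional R factor: sum of | |F_obs| - |F_calc| | divided by sum of |F_obs|."""
    if len(f_obs) != len(f_calc) or not f_obs:
        raise ValueError("need two non-empty lists of equal length")
    numerator = sum(abs(abs(o) - abs(c)) for o, c in zip(f_obs, f_calc))
    denominator = sum(abs(o) for o in f_obs)
    return numerator / denominator


def split_free_set(n_reflections, free_fraction=0.10, seed=0):
    """Return (work_indices, free_indices); the free set is never used in refinement."""
    rng = random.Random(seed)
    indices = list(range(n_reflections))
    rng.shuffle(indices)
    n_free = max(1, round(n_reflections * free_fraction))
    return sorted(indices[n_free:]), sorted(indices[:n_free])
```

R would be evaluated on the work reflections and Rfree on the free reflections, using the same `r_factor` function; a large gap between the two suggests over-fitting.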

Unfortunately, no such measure exists for NMR structures 8, 9, 10, 11. The original experimental data have no direct mathematical relationship with the structure in the way that diffraction data do; and the experimental input restraints, of which the most common and useful are distance restraints obtained from NOESY spectra, require extensive manipulation and interpretation of the original data before they can be used as restraints. Furthermore, the quantity of information comprising the experimental restraints is far less for NMR, and the information is much more local. This makes NMR structures inherently less precise, and probably less accurate too, and also means that cross-validation by missing out 10% of the data, as used for Rfree, is not generally possible for NMR structures 12. NMR structures thus tend to be validated using an unsatisfactory set of restraint comparisons, typically comprising number of restraints per residue, restraint violations, and structure precision (RMS distance between members of the ensemble) 5, 13. None of these is a direct comparison to the input data, and the third of these is explicitly a measure of precision, not of accuracy, and it is already well established that there is little relationship between precision and accuracy 14, 15, 16, 17.

Hence there is a pressing need to find a better validation measure for NMR structures. Here, we present such a measure. A good validation method should (like the R factor) as far as possible compare input data directly to structure. The most obvious input data for NMR structures is the spectra. There have been attempts to do this 18, 19 but there are major difficulties: there is no good way of accurately calculating chemical shifts from structures; dynamics in solution have big effects on spectra; there are many experimental artifacts in NMR spectra; and the number and variety of input spectra used in structure calculations makes it hard to define or measure what should be compared. Hence, we have here used backbone chemical shifts as our input data. These can usually be obtained reliably and rapidly, and there is little or no manipulation or sorting required, by contrast to distance restraints. The method described here is named ANSURR (Accuracy of NMR Structures using Random Coil Index and Rigidity).

The structure of this paper is that we outline the method before demonstrating how we have validated the method using a range of “good” and “bad” structures and by comparing to other typical measures of structure accuracy. We then demonstrate the power of the method by using it to make comparisons between crystal structures and NMR structures.

Outline of the method

Backbone chemical shift assignments (i.e., HN, 15N, 13Cα, 13Cβ, Hα, and C′) can usually be obtained rapidly, semi-automatically, and reliably from a set of triple resonance spectra obtained from 15N, 13C double-labeled protein. In order to determine a protein NMR structure, shift assignments are the necessary first stage 20, meaning that any protein that has an NMR structure must have backbone shift assignments (which are now required to be submitted with the structures). Crucially, shift assignments are subject to minimal manipulation. This is very different from distance restraints obtained from NOE spectra. For distance restraints there are inevitably many stages of data sorting and rejection, no matter whether the restraints are inputted manually or automatically. Some person or computer must decide which signals to include, how to assign them, when to reject or modify the restraints, and how to set the calibration between peak intensity and distance restraint. All of these reduce the value of distance restraints as independent quality measures. For all these reasons, backbone assignments are better validation input than distance restraints.

In our method, backbone chemical shift assignments are compared to a structure. Although a number of programs can calculate shifts from structures, they are not sufficiently accurate to perform a useful comparison except in rather general terms 14, 21. Hence, the heart of our method is that the backbone shifts are used to calculate the local rigidity of the backbone, based on an established measure, the random coil index (RCI), which calculates how similar each of the six backbone shifts is to a tabulated “random coil shift” value 22. It has been shown to provide a remarkably reliable guide to local rigidity, whether measured by NMR relaxation or by crystallographic B factor 22, 23.

We compare local rigidity as predicted by RCI to that computed from a structure using techniques from mathematical rigidity theory. Several software packages and methodologies relying on rigidity theory, such as the program Floppy Inclusions and Rigid Substructure Topography (FIRST) 24, 25 and its various implementations and extensions, have been developed for fast computational predictions of rigidity and flexibility of protein structures. Starting with a protein structure, FIRST creates a topological graph (a constrained network consisting of nodes and edges), where atoms are represented by vertices (nodes), and edges represent the constraints corresponding to the intramolecular interactions of a protein, e.g., covalent bonds, hydrogen bonds and hydrophobic interactions. Applying the mathematically well-established pebble game algorithm and molecular theorem 26, FIRST then determines locally rigid subgraphs (rigid regions in the network), a process referred to as rigid cluster decomposition. The degree of flexibility can be quantified as a function of hydrogen bond energy by repeating rigid cluster decomposition as edges corresponding to hydrogen bonds are removed incrementally from the graph, and noting the energy at which the Cα atom of a residue no longer belongs to a rigid subgraph, i.e., becomes flexible. We convert this energy to a Boltzmann population ratio, effectively giving the probability that a residue is flexible.
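The last step, converting an energy into a population ratio, can be illustrated with a short sketch. The text does not give the exact functional form, so the expression exp(E / kT) used here is an assumption, as are the temperature, the kcal/mol units and the function name. It takes the hydrogen bond energy cutoff E (zero or negative) at which a residue's Cα atom first leaves a rigid cluster, and maps it to a value between 0 and 1.

```python
import math

# Boltzmann (gas) constant in kcal/(mol*K).
K_B_KCAL_PER_MOL_K = 0.0019872041


def flexibility_probability(cutoff_energy_kcal, temperature_k=298.15):
    """Map the energy cutoff at which a residue becomes flexible to a value in (0, 1].

    Assumed form: exp(E / kT). A residue that is already flexible at E = 0
    gets 1.0; a residue that stays rigid until strong (very negative)
    hydrogen bonds are removed gets a value close to 0.
    """
    if cutoff_energy_kcal > 0:
        raise ValueError("the energy cutoff is expected to be zero or negative")
    return math.exp(cutoff_energy_kcal / (K_B_KCAL_PER_MOL_K * temperature_k))


# Hypothetical per-residue cutoffs (kcal/mol), as might come from repeated
# rigid cluster decomposition while hydrogen bonds are removed one by one.
cutoffs = {1: 0.0, 2: -0.3, 3: -1.5, 4: -4.0}
profile = {residue: flexibility_probability(e) for residue, e in cutoffs.items()}
```

The resulting per-residue profile is on the same 0-to-1 footing as a flexibility prediction, which is what makes a residue-by-residue comparison with RCI possible.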

The two measures of local rigidity (RCI and FIRST) are then compared and a numerical comparison gives a score: a measure of how well the local rigidities match, and thus whether the structures produce a local rigidity that matches the one described by the RCI. Following extensive trials, we use two different measures of similarity: (a) The correlation between the two. This tests whether the peaks and troughs are in the same places. Peaks are locally mobile regions while troughs are locally rigid regions, generally regular secondary structure. This comparison therefore mainly shows whether the secondary structure is correct. (b) The root-mean square deviation (RMSD) between the two. This tests whether overall the structure is too rigid or too floppy. It is strongly influenced by the geometry of hydrogen bonds and other non-covalent interactions in the structure. As discussed below, the overall rigidity of a structure is determined by not just backbone but also sidechain interactions. Protein structures are often compared by superimposing backbones (often cartoons). Two structures can look very similar in a comparison like this, but one can be much worse than the other in terms of the accuracy of the hydrogen bond network or side chain orientations. In order to assess the relationship between structure and function, it is important that sidechain positions should be correct. The RMSD measure between RCI and FIRST is therefore important because it measures the kind of accuracy needed to interpret function.
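The two similarity measures can be sketched in a few lines. The text says only "correlation", so the Pearson coefficient used here is an assumption (a rank correlation would fit the description equally well), and the two short profiles are made-up values chosen to show how the measures can disagree.

```python
import math


def pearson_correlation(x, y):
    """Pearson correlation between two equal-length per-residue profiles."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two profiles of equal length (at least 2 residues)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / (spread_x * spread_y)


def rmsd(x, y):
    """Root-mean-square deviation between two equal-length profiles."""
    if len(x) != len(y) or not x:
        raise ValueError("need two non-empty profiles of equal length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical profiles: peaks and troughs coincide, but the structure-derived
# profile is uniformly more flexible ("too floppy") than the RCI profile.
rci_profile = [0.1, 0.5, 0.1, 0.5]
first_profile = [0.3, 0.7, 0.3, 0.7]
correlation = pearson_correlation(rci_profile, first_profile)
deviation = rmsd(rci_profile, first_profile)
```

Here the correlation is essentially perfect because the peaks line up, while the RMSD of about 0.2 exposes the constant offset; this is why the two measures are reported separately.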

Correlation and RMSD are simple numerical values, but they do not scale linearly to intuitive measures of accuracy. In the output from ANSURR, we therefore present the numerical values, but we also calculate the percentile of each measure relative to all NMR structures in the PDB with good chemical shift completeness (see below for further discussion of completeness), which we term correlation score and RMSD score, respectively. These are relative values (and are thus likely to change slightly as more structures are added to the PDB), but are easier for the user to interpret. The crystallographic validations in the PDB adopt a similar procedure for both geometrical tests and Rfree. In what follows, we report the scores rather than the numerical values.
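A percentile score of this kind can be sketched as follows. The handling of ties and the `higher_is_better` switch are assumptions made for illustration; the reference lists stand in for the values obtained from all suitable NMR structures in the PDB and are invented numbers.

```python
from bisect import bisect_left, bisect_right


def percentile_score(value, reference_values, higher_is_better=True):
    """Percentage of reference structures that this value outperforms.

    For correlation a higher value is better; for RMSD a lower value is
    better, so pass higher_is_better=False.
    """
    reference = sorted(reference_values)
    if not reference:
        raise ValueError("the reference set must not be empty")
    if higher_is_better:
        outperformed = bisect_left(reference, value)  # strictly lower values
    else:
        outperformed = len(reference) - bisect_right(reference, value)  # strictly higher
    return 100.0 * outperformed / len(reference)


# Hypothetical reference distributions.
correlation_score = percentile_score(0.8, [0.2, 0.4, 0.6, 0.9])
rmsd_score = percentile_score(0.1, [0.05, 0.2, 0.3, 0.4], higher_is_better=False)
```

With this orientation a high score is good for both measures, which is consistent with the most accurate models appearing in the top right-hand corner when the two scores are plotted against each other.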

Correlation and RMSD scores highlight different aspects of accuracy, so we decided not to combine them into a single score to represent overall accuracy. Instead, we plot both on a single graph, as demonstrated in Fig. 1 for four different models of the same protein. The most accurate models (those with good scores for both correlation and RMSD) appear in the top right-hand corner of the plot.

Fig. 1. In the four plots, the blue lines show the flexibility predicted by RCI while the orange lines show flexibility predicted by FIRST. In the center of the figure is the ANSURR analysis showing the RMSD and correlation scores derived from the four models. The two models on the right are from the CNW dataset 27 (refined in explicit solvent), while the two on the left are from the CNS set (refined in vacuo). As is typical, the CNW-refined structures have better RMSD, meaning that the calculated flexibilities compare well on average. The two models at the bottom have poor correlations, because the locations of the peaks do not match well between RCI and FIRST. The two at the top both have good correlations, because the locations of the peaks do match, even though (in the case of the top left structure) their heights are very different.

RECOORD CNS (unrefined) vs. CNW (refined) structures

There is currently no accepted method for measuring the accuracy of an NMR structure. There are also no databases of “good” or “bad” structures. We have therefore created or adopted datasets that can reasonably be assumed to be bad or good. There is also a range of methods that have been used to measure structure quality, including the geometrical methods described above. We compare our findings to these methods in turn.

The RECOORD project 27 set out to standardize and tabulate methods for NMR structure calculation. It produced a curated set of structure restraints, which were applied in a consistent manner to more than 500 proteins from the PDB, and then analysed the resultant structures. It carried out two sets of structure calculations on each protein: one using a typical simulated annealing calculation in vacuo using CNS (termed CNS) and another using CYANA (termed CYA) 28, 29. They then took these two sets of structures and refined them in explicit water using ARIA (termed CNW and CYW, respectively) 30. There is an extensive literature indicating that refinement of NMR structures in explicit water produces better geometries and generally better quality structures 31, so not surprisingly, the CNW/CYW structures are better.

We have therefore carried out a comparison of those CNS and CNW datasets for which there is sufficient (>75%) chemical shift completeness, which comprises a set of 173 ensembles each made up of 25 models (see Supplementary Table 1 for details). From here on we refer to these datasets as CNS75 and CNW75, respectively. In Fig. 2a, the differences in average correlation and RMSD score for each of the 173 ensembles are depicted in a histogram. There is no real improvement in correlation score on refinement in water, with an average improvement of only 1.0. This is expected, as the secondary structure, which ultimately determines the location of peaks and troughs and therefore correlation, changes very little during refinement. As an example, Fig. 2b shows the lack of change in fold for one model. In contrast, RMSD scores are greatly improved, with an average increase of 36.2 and with only one ensemble scoring worse after refinement. This is mostly due to the improvement in hydrogen bonding which acts to rigidify the entire protein. This can be seen in the difference in computed rigidity before and after refinement (Fig. 2c).

Fig. 2. (a) Histogram showing the change in average correlation score (blue) and RMSD score (orange), comparing ensembles from the CNS75 to the CNW75 sets. RMSD scores improve dramatically while there is no significant change in correlation scores. (b) Backbone superposition of CNS model 14 and CNW model 14 of the restriction of telomere capping protein 3 from S. cerevisiae (PDB ID 1nyn), as a typical example of the effect of refinement in explicit solvent. Although the RMSD score is much better after refinement, the backbones do not look very different. (c) Comparisons of RCI (gray) with flexibility calculated using FIRST for representative models from CNS (blue) and CNW (orange) refinements. The colored bars at the top of each plot show the regular secondary structures: α-helix (red) and β-sheet (blue). The three proteins are (top) the N-terminal domain of VAM3P from S. cerevisiae (CNS/CNW model 4, PDB ID 1hs7), a largely helical protein; (middle) a single-domain antibody from Brucella (CNS/CNW model 20, PDB ID 1ieh), a largely β-sheet protein; and (bottom) the restriction of telomere capping protein 3 from S. cerevisiae (CNS/CNW model 14, PDB ID 1nyn), a mixed α/β protein.

Decoy vs experimental structures

A straightforward way to generate a pool of structures of varying accuracy is to calculate decoys. We used the 3DRobot web server 32, which begins from a crystal or NMR structure, identifies possible structure scaffolds from a library, assembles them together, and then refines them. The sets of structures generated using 3DRobot are designed to have a high density of structures close to the native state with good hydrogen bonding and compactness, and of high diversity. In other words, they should look like genuine proteins, with good packing and hydrogen bonds, and they should span a range, from structures that closely resemble the native state, to ones that are very different, although still with good packing and hydrogen bonding. These sets therefore allow us to test whether ANSURR can discriminate between structures that are all geometrically good structures, but differ in their accuracy.

For about half (79 of 173) of the ensembles in the CNW75 dataset (see Supplementary Table 2 for a list of the chosen models), we calculated a group of 300 decoys. These decoys were then compared to the experimental structure using a Global Distance Test (GDT), which measures the similarity between two structures, calculated as the largest set of Cα atoms in the model structure falling within a defined cut-off of their position in the test structure, after superimposing the structures 33. A selection of results is shown in Fig. 3a (results for all 79 sets of decoys are depicted in Supplementary Fig. 1). The score for the experimental structure is indicated by a black asterisk and scores for decoys are circles, colored according to their GDT.
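The core of a GDT-style comparison can be sketched as below. This is a simplification: it assumes the two structures have already been superimposed and have matching Cα atoms in the same order, whereas a full GDT calculation searches over many superpositions to maximize the number of atoms within each cut-off. The averaging over 1, 2, 4 and 8 Å cut-offs follows the common GDT_TS convention, which is an assumption here because the text mentions only "a defined cut-off".

```python
import math


def fraction_within_cutoff(model_ca, reference_ca, cutoff):
    """Fraction of Cα atoms lying within `cutoff` (Å) of their reference position."""
    if len(model_ca) != len(reference_ca) or not model_ca:
        raise ValueError("need two non-empty coordinate lists of equal length")
    hits = sum(
        1 for m, r in zip(model_ca, reference_ca) if math.dist(m, r) <= cutoff
    )
    return hits / len(model_ca)


def gdt_ts(model_ca, reference_ca, cutoffs=(1.0, 2.0, 4.0, 8.0)):
    """Average percentage of Cα atoms within each cut-off (pre-superimposed input)."""
    fractions = [fraction_within_cutoff(model_ca, reference_ca, c) for c in cutoffs]
    return 100.0 * sum(fractions) / len(cutoffs)


# Hypothetical four-residue example; the displacements are 0, 0.5, 3 and 10 Å.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
decoy = [(0.0, 0.0, 0.0), (1.0, 0.5, 0.0), (2.0, 3.0, 0.0), (3.0, 10.0, 0.0)]
score = gdt_ts(decoy, reference)
```

A decoy identical to the reference scores 100, and the score falls as more Cα atoms drift beyond the cut-offs.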

Fig. 3. (a) Each plot shows one protein, indicated by its PDB code and the percentage of α-helix and β-sheet in the experimental structure, according to DSSP 61. The experimental structure is indicated by an asterisk and is the best scoring model in the NMR ensemble, according to our method. The other data show decoys generated by 3DRobot 32, and color coded by their Global Distance Test (GDT), a measure of similarity to the target 33, as indicated by the color bar on the right. For two proteins, red boxes indicate the set of decoys used to calculate mean hydrogen bond correctness, as discussed in the text. (b) A comparison of experimental structure (orange) and best decoy (blue) for the protein 1gh5.

From inspection of the examples shown in Fig.  3a , it can be seen that the experimental model is usually one of the best structures, as one would expect. Also apparent is that as GDT increases (i.e., as decoys become more like the experimental structure), both the validation scores tend towards those of the experimental structure, confirming that our method does specifically validate accuracy. There is a consistent difference between α-helical proteins (e.g., 1itf) and β-sheet proteins (e.g., 1gh5). Helical proteins tend to improve more in their correlation score than in their RMSD score. This seems reasonable: helices are almost always rigid 26 , but not necessarily in the correct location, whereas β-sheet proteins tend to improve more in their RMSD score, because β-sheets can adopt a wide range of local geometries, implying that β-sheet proteins can appear almost correct but have poor hydrogen bonds and thus be much too floppy. Scores for proteins with both α-helical and β-sheet content tend to move in a diagonal, a combination of both effects.

The protein 1bqz presents an interesting example. It is DnaJ, a largely helical protein, and unusually there are many decoys that have a better correlation score but a considerably worse RMSD score than the experimental structure, despite most having a GDT of around 80 and some being close to 100. However, calculated hydrogen bond correctness scores 34 , i.e., the percentage of hydrogen bonds in the experimental structure that also appear in the decoy, show that these high correlation score decoys (indicated in Fig. 3a with a red box) have poor hydrogen bond geometries (average hydrogen bond correctness of only 47%), and hence a poor RMSD score. By contrast, decoys for 1cfc that approach the accuracy of the experimental structure have good RMSD and correlation scores and have better hydrogen bond geometries (average hydrogen bond correctness of 69%).
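The hydrogen bond correctness described above can be sketched as a set overlap. The function name and the representation of a hydrogen bond as a (donor residue, acceptor residue) pair are assumptions for illustration; how hydrogen bonds are actually detected 34 is not shown.

```python
def hbond_correctness(reference_hbonds, decoy_hbonds):
    """Percentage of reference hydrogen bonds that also appear in the decoy.

    Each hydrogen bond is any hashable identifier, here assumed to be a
    (donor_residue, acceptor_residue) tuple.
    """
    reference = set(reference_hbonds)
    if not reference:
        raise ValueError("reference structure has no hydrogen bonds")
    shared = reference & set(decoy_hbonds)
    return 100.0 * len(shared) / len(reference)
```

Hydrogen bonds present only in the decoy do not raise the score, because the measure is defined relative to the experimental structure.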

Another interesting example is the beta-fold protein 1gh5 (an antifungal protein from S. tendae ). There are some decoys with better correlation and only marginally worse RMSD scores than the experimental structure, suggesting that they are actually more accurate. Figure  3b compares the experimental structure and best scoring decoy. Immediately obvious (and reassuring) is that at backbone level, both structures are very similar. We note that the experimental structure has a relatively poor correlation score. It is therefore possible that some of the refined decoys genuinely are more accurate: such behavior has been noted before 35 . Inspection of the full dataset in Supplementary Fig.  1 suggests that this is not uncommon. NMR structure refinement is a joint optimization against NMR restraints and known properties of proteins. The observation that some decoys have better scores than NMR structures implies that in some NMR structure calculations, the balance is not yet optimal, and more weight needs to be given to packing and hydrogen bonding for example. We therefore feel that this finding is not a problem with the method: on the contrary, it shows that the method is useful for identifying incompletely refined structures and improving them.

Comparison between ANSURR and conventional predictors of accuracy

Conventional predictors of accuracy include the number of restraints per residue used to generate a structure, the number of restraint violations, and the total energy of the structure. The RMSD between models in an ensemble is often used to gauge precision, and by proxy to provide a guide to accuracy. Whilst these measures are expected to be related to accuracy, they do not explicitly determine it. Here we compare these measures to the average RMSD score (Fig.  4a ) and correlation score (Fig.  4b ) for each ensemble in the CNW75 dataset.

Figure 4. a RMSD score, b correlation score. For each plot, the line of best fit and the Pearson correlation coefficient are shown. For the comparisons with ensemble RMSD, fits are shown for all points (red) and for only those points with an ensemble RMSD ≤ 2.5 Å (blue). The statistical significance of the correlation coefficient is indicated by *** p < 0.001, ** p < 0.01, and * p < 0.05, determined using a two-tailed Pearson test. p values are (by row, left to right) a 3 × 10^−8, 7 × 10^−5, 6 × 10^−6, 3 × 10^−6, 0.56, 0.70, 9 × 10^−3, 2 × 10^−6, 2 × 10^−6 (red), 5 × 10^−6 (blue) and b 0.43, 0.22, 0.09, 0.32, 0.07, 0.30, 0.047, 0.13, 4 × 10^−3 (red), 0.02 (blue).

Overall, the correlations are much stronger for RMSD score than for correlation score. This is not surprising. These predictors largely assess local accuracy, and thus relate to RMSD score better than to correlation score.

There is a moderate positive correlation between the number of distance restraints per residue and RMSD score. This is reasonable: a structure with a higher density of distance restraints is expected to be more tightly defined and therefore more (correctly) rigid overall 36 . Categorizing distance restraints according to whether they are sequential, medium-range or long-range reveals a slightly better correlation for medium/long-range restraints than for sequential restraints. This is again expected, as medium/long-range restraints provide more information on protein fold, and for this reason are considered a better predictor of accuracy 37 .

The number of distance restraint violations per residue does not correlate with either validation score. Roughly two thirds of structures do not have any violations at all, because structures are normally refined until there are no, or no significant, violations. It is fairly common practice that restraints that are routinely violated during a structure calculation will be discarded along the way. In fact, programs which automate NMR structure calculation do exactly that. For this reason, restraint violations are clearly not a good predictor of accuracy 8 , 13 , 38 .

The number of dihedral restraints per residue does not correlate with either validation score, but the number of dihedral restraint violations does. This is probably because the restraints themselves are relatively weak, so that they do not particularly guide the structure to become more accurate. However, the weak negative correlation with dihedral restraint violations suggests that these kinds of restraints successfully flag major issues.

There is a moderate negative correlation to the total energy of the structure. Typically, the selection of the final set of structures to represent the ensemble is based on total energy, and the correlation seen here suggests that this is a reasonable way of identifying good structures.

Both RMSD score and correlation score are negatively correlated with ensemble RMSD, suggesting that more precise ensembles do also tend to be more accurate. However, if those ensembles with RMSD larger than 2.5 Å are excluded (blue fit lines) then the gradient becomes almost zero, suggesting that for better structures, ensemble RMSD is a poor guide to accuracy. Similar comments have been made previously 14 , 15 , 16 , 17 , 39 .
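The two fits compared above (all ensembles versus only those with ensemble RMSD ≤ 2.5 Å) can be sketched as follows. The function names are hypothetical, the calculation is a plain least-squares line with a Pearson coefficient, and the significance test reported in the figure is not reproduced.

```python
import math


def pearson_fit(x, y):
    """Return (r, slope, intercept) of the least-squares line through (x, y)."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    r = sxy / math.sqrt(sxx * syy)
    slope = sxy / sxx
    return r, slope, mean_y - slope * mean_x


def fit_all_and_precise(ensemble_rmsd, score, max_rmsd=2.5):
    """Fit all ensembles, then only those with ensemble RMSD <= max_rmsd."""
    all_fit = pearson_fit(ensemble_rmsd, score)
    kept = [(x, s) for x, s in zip(ensemble_rmsd, score) if x <= max_rmsd]
    precise_fit = pearson_fit([x for x, _ in kept], [s for _, s in kept])
    return all_fit, precise_fit
```

Comparing the two slopes shows whether an overall trend is driven mainly by a few imprecise ensembles, which is the point made in the text.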

In summary, our measures of accuracy match reasonably well to expectations: the number of distance restraints per residue is a fairly good predictor of accuracy, while dihedral restraints, and distance and angle violations, are not. Precision (ensemble RMSD) is a poor predictor of accuracy, while overall energy is surprisingly good as a predictor of accuracy.

Comparison between ANSURR and geometry-based validation measures

It is unclear whether a correlation should be expected between geometrical quality and accuracy. However, given that NMR structure calculation is to a large extent an optimization of models, using both NMR-derived restraints and knowledge-derived geometrical factors simultaneously, it is reasonable to expect that an accurate structure should also have good geometrical quality. We therefore compared our validation scores with two widely used indicators of geometrical quality: Ramachandran outliers and clashscore 40 . The program ramalyze (part of the Molprobity suite of validation tools) was used to compute the φ/ψ angles for each residue in the CNW75 dataset and categorize them as either favorable, allowed or outlier. The program clashscore (also part of Molprobity) was used to compute the average number of clashes per 1000 atoms for each ensemble in the CNW75 dataset. In Fig.  5a, b , the results for each ensemble are plotted against RMSD score and correlation score, respectively.

Figure 5. The top part shows correlations between geometry-based measures and a RMSD score, b correlation score. The statistical significance of the correlation coefficient is indicated by *** p < 0.001, ** p < 0.01, and * p < 0.05, determined using a two-tailed Pearson test. p values are (top to bottom) a 1 × 10^−16, 4 × 10^−17, 2 × 10^−13, 0.44 and b 8 × 10^−5, 2 × 10^−4, 2 × 10^−5, 8 × 10^−3. c Comparison of ANSURR to ResProx, using 300 decoys generated by 3DRobot for the test PDB file 1cfc. The horizontal axis is the Global Distance Test, a measure of similarity to the test structure (see Fig. 3), which is indicated by the red asterisk. The left box assesses the decoys using ANSURR, where for simplicity we have combined the RMSD score and correlation score into a single sum. There are no decoys with better ANSURR score than the test structure. The right box assesses the same set of decoys using ResProx. There are 57 decoys with better (i.e., lower) ResProx values than the test structure. See Supplementary Fig. 2 for more comparisons.

The correlation between Ramachandran distribution and RMSD score is the best for any of the measures presented here. In other words, an ensemble with good Ramachandran distribution (high percentage in the favored category, low percentage in the additionally allowed category, small percentage in the outlier category) is likely to have good accuracy. It seems reasonable to find that the most accurate structures are in general those with the best backbone geometry, as was proposed many years ago 41 .

Geometrical measures have previously been combined together into a consensus quality indicator called Resolution-by-proxy or ResProx, which combines 25 geometrical measures, and has excellent agreement ( R  = 0.92) with X-ray structure resolution 42 . In Fig.  5c we take one PDB structure (1cfc) and generate 300 decoys (i.e., structures with good protein quality, but spanning a range of similarity to the 1cfc structure as assessed by the Global Distance Test), and show that there is a reasonable match between ResProx score and GDT. In other words, structures that are closer to the NMR structure are in general of better geometrical quality. However, we also show that the match is much better for ANSURR: in other words, ANSURR performs much better than a consensus goodness measure based simply on geometrical features. Supplementary Fig.  2 includes results for a range of other proteins, with similar results in all cases.

We have also carried out a similar comparison, but against the consensus measure PROSESS, which combines a wide range of both geometry-based and restraint-based measures, and is thus the closest available consensus test for ANSURR 43 . The PROSESS scores are critically dependent on NOE restraint violations, and are thus subject to the same problems as discussed in the previous section. A more detailed discussion can be found in  Supplementary Information .

Comparison of NMR and X-ray crystal structures

An obvious first test for this method is to compare NMR and X-ray crystal structures. It is important to stress here that because we compare the structures to time-averaged chemical shifts obtained using solution NMR, we are explicitly testing how well the structures compare to the average state of the protein in solution. Crystal structures are almost always based on many more experimental values, and more precisely measured values, than NMR structures. One would therefore inherently expect them to be more accurate, except that crystal structures represent the structure of the protein in a crystalline environment, whereas the NMR chemical shifts measure structural rigidity in solution. We are therefore here making a somewhat unfair, but important, comparison, namely how well X-ray structures represent the structure of a protein in solution.

Here we compare X-ray structures for 68 proteins taken from the set used to train the SHIFTX2 program for predicting chemical shifts 44 with corresponding NMR structures taken from the PDB (see “Methods” section for details). We validated each structure using our method and averaged the validation scores over each chain for X-ray structures, and each model for NMR ensembles. The results are shown in Fig.  6 . The correlation scores for X-ray and NMR structures are very similar. In other words, the locations of rigid and flexible regions, generally representing regular secondary structure in solution, are calculated similarly well by both methods. The slightly lower correlation score for X-ray structures originates from some loops seeming to be too rigid. That is, X-ray structures are missing some peaks in flexibility that should be there according to RCI. Crystal structures are obtained from crystalline arrays, and are usually obtained at cryo-temperatures, both of which will tend to reduce the observed flexibility. There is a large body of evidence 45 , 46 , 47 that crystal structures obtained at room temperature show much more local variability than do structures obtained at cryo-temperatures, and calculations on lysozyme confirm that the room temperature structures have flexibility that matches the RCI data much better than cryo-temperature structures (Supplementary Note  1 and Supplementary Figs.  3 – 6 ). By contrast, in the RMSD score comparison, on average crystal structures are significantly better. When one inspects the data for individual proteins, it is clear that NMR structures are in general much too flexible, particularly in loop regions. This is not unexpected, as NMR structures often have few restraints in loops.

Figure 6. a RMSD score and b correlation score. The mean values for each score are shown in the inset box.

Discussion

We present a method for determining the accuracy of NMR structures. A range of methods have been proposed previously 10 , 13 , 41 , 48 , including various attempts at an NMR R factor 18 , 19 , 49 , 50 , 51 . Our method has the merits of being simple, rapid, and in agreement with intuitive expectations. Considering that the first NMR structure of a globular protein was published in 1985 52 , it is remarkable that it has taken this long to come up with a workable measure. The lack of a good measure of accuracy has inhibited researchers from using NMR structures; it is hoped that this method will give users more confidence in the use of structural data from NMR. ANSURR is not a reliable measure of accuracy on its own: as is done for X-ray crystallography, it needs to be combined with other measures, typically geometrical tests.

Because there are no general methods for measuring accuracy, and thus no agreed sets of “good” or “bad” NMR structures, we have been forced to create our own comparisons. Similarly, there are a range of measures that have been proposed for measuring accuracy. In particular, the PDB NMR validation task force 5 has recommended a set of measures, combining geometrical comparisons and comparisons to input data. These measures are investigated here. We find that the best current indicator of accuracy is a Ramachandran analysis, using either the proportion of residues in the favored region or the proportion of outliers. We find that the RMSD between models in an ensemble is a poor measure of accuracy (though an excellent measure of precision, reinforcing the concept that accuracy and precision are largely independent). Other common restraint-based measures of accuracy, such as restraints per residue 8 or restraint violations, are also poor measures of accuracy 53 . We suspect that part of the problem is that the route from NOE spectrum to distance restraint contains a large number of user-defined decisions (many of which are increasingly being made by the programs, and are thus becoming even more opaque), so that the link between spectrum and restraint is ill defined.

An interesting conclusion to come from this comparison is that the most common measure of structural similarity, backbone RMSD, misses many of the interesting differences. Structures can look very similar when superimposed on the backbone, but contain large variability in sidechain position and hydrogen bond geometry, which has major impact on docking algorithms and on functional aspects such as allostery, enzyme catalysis 54 , and dynamics.

Now that we have a reliable measure of accuracy, it can be applied to some key problems, for example: (1) how good are the NMR ensembles in the PDB? (2) Can we determine which structures in an ensemble are good, and which are not, and can we therefore improve the ensemble? (3) Is it possible to use experimental NMR data to validate or refine protein structure prediction methods? (4) Can one use these methods to identify local errors in NMR structures? We plan to address these questions in the future.

Methods

Random coil index (RCI)

RCI quantifies local (i.e., per residue) protein flexibility by calculating an inverse weighted average of backbone secondary chemical shifts. We calculate RCI essentially as done by Berjanskii and Wishart 22 , though with a few differences. In the originally published method, the weighting coefficients were not normalized. That is, the sum of the weights for different combinations of shifts did not add up to the same value and therefore the baseline rigidity measure could vary when comparing RCI values calculated with different combinations of shifts. We addressed this by simply dividing the sum of weighted secondary shifts by the sum of the weighting coefficients. We therefore compute RCI as:
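Following that description (a weighted sum of secondary shifts, divided by the sum of the weighting coefficients, then inverted), the expression takes the form below. The pairing of the coefficients A to F with particular backbone nuclei is an assumption made here for illustration:

\[ \mathrm{RCI} = \left( \frac{A\,|\Delta\delta_{\mathrm{C}\alpha}| + B\,|\Delta\delta_{\mathrm{CO}}| + C\,|\Delta\delta_{\mathrm{C}\beta}| + D\,|\Delta\delta_{\mathrm{N}}| + E\,|\Delta\delta_{\mathrm{NH}}| + F\,|\Delta\delta_{\mathrm{H}\alpha}|}{A + B + C + D + E + F} \right)^{-1} \]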

where the \(\Delta\delta_i\) are secondary chemical shifts and A – F are weighting coefficients. Some nuclei (Cα, Cβ) are more descriptive than others (HN, NH) and so have larger weighting coefficients. Missing chemical shifts have a weighting coefficient of zero. Another difference is that we use random coil values and nearest neighbor sequence corrections using data obtained from intrinsically disordered proteins 55 , rather than data based on unfolded peptides or proteins (see e.g., 56 ). A result of these differences is that our approach outputs a value between 0 and 0.2, rather than between 0 and 0.6 as in the originally published method.
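The effect of the normalization can be illustrated with a minimal sketch. The function name, the nucleus labels and the weights used in the usage note are hypothetical placeholders, not the published coefficients, and the smoothing, sequence corrections and limiting of extreme values used in practice are left out.

```python
import math


def normalised_rci(secondary_shifts, weights):
    """Inverse of the weight-normalised average of absolute secondary shifts.

    secondary_shifts: dict mapping nucleus name -> secondary shift, or None
                      when the shift is missing (treated as weight zero).
    weights: dict mapping nucleus name -> weighting coefficient (hypothetical).
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for nucleus, shift in secondary_shifts.items():
        if shift is None:
            continue  # a missing shift contributes a weighting coefficient of zero
        weighted_sum += weights[nucleus] * abs(shift)
        total_weight += weights[nucleus]
    if total_weight == 0:
        raise ValueError("no chemical shifts available for this residue")
    average = weighted_sum / total_weight
    # A zero average would correspond to unbounded flexibility in this sketch.
    return math.inf if average == 0 else 1.0 / average
```

With placeholder weights `{"CA": 1.0, "N": 0.5}`, a residue with secondary shifts of 2.0 for both nuclei and a residue with only the Cα shift of 2.0 assigned give the same value (0.5), which is the baseline consistency that dividing by the sum of the weights is meant to provide.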

We use the set of optimized weighting coefficients for each of the 63 different combinations of backbone chemical shifts as found in the downloadable Python version of RCI ( http://www.randomcoilindex.com/ ). For some combinations, we found that the similarity between flexibility predicted by RCI and FIRST is significantly decreased, suggesting that, in these instances, RCI is a poor predictor of flexibility. Ultimately, the most reliable validation scores are obtained when a full complement of backbone chemical shifts is provided. Our method will allow validation with any combination/completeness of shifts, but the resulting validation score is flagged as less reliable if total chemical shift completeness drops below 75%. For proteins with sufficient chemical shift completeness (≥75%), we assume that residues with completely missing backbone chemical shift assignments are missing because the residues are highly mobile. We assign such residues a secondary chemical shift of zero (i.e., they are assumed to be entirely random coil-like) prior to 3-residue smoothing. However, these data points are not used when calculating validation scores. We note that artificially reducing chemical shift completeness by randomly removing some assignments resulted in worse RMSD and correlation scores, indicating that RCI is more accurate with a greater shift completeness (Supplementary Fig. 7 ).
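The 75% completeness check can be sketched as below. The nucleus labels and function names are hypothetical, and the sketch makes the simplifying assumption that every residue is expected to have all six backbone shifts, which ignores residue-specific exceptions such as glycine (no Cβ) and proline (no amide proton).

```python
BACKBONE_NUCLEI = ("CA", "CB", "C", "N", "H", "HA")  # assumed labels


def shift_completeness(residue_shifts, nuclei=BACKBONE_NUCLEI):
    """Percentage of expected backbone shifts that are assigned.

    residue_shifts: list with one dict per residue, mapping nucleus -> shift.
    """
    expected = len(residue_shifts) * len(nuclei)
    if expected == 0:
        raise ValueError("no residues or nuclei supplied")
    assigned = sum(
        1
        for shifts in residue_shifts
        for nucleus in nuclei
        if shifts.get(nucleus) is not None
    )
    return 100.0 * assigned / expected


def is_reliable(residue_shifts, threshold=75.0):
    """True when completeness reaches the threshold stated in the text."""
    return shift_completeness(residue_shifts) >= threshold
```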

Floppy inclusions and rigid substructure topography (FIRST)

Given a protein structure, FIRST 25 generates a graph (constraint network) composed of vertices (nodes), which represent atoms; and edges, which represent constraints imposed by the local geometry. Single covalent bonds are modeled by five edges between bonded atoms; double bonds by six; hydrophobic interactions, which are less geometrically constraining, by two; and hydrogen bonds by between one and five, depending on how one chooses to model them. Overall this multigraph represents a generic realization of a molecular body-bar framework in rigidity theory 26 . Typically, rigidity analysis is performed at a range of hydrogen bond energy cut-off values, where hydrogen bonds that meet the cut-off threshold are assigned five edges while weaker interactions are ignored.

Atoms are considered to be rigid bodies each with six degrees of freedom (three position and three orientation). These degrees of freedom are removed as constraints are added between them. One edge removes up to one degree of freedom; e.g., a single covalent bond, modeled by five edges, can remove up to five degrees of freedom between the two bonded atoms. FIRST then uses the combinatorial pebble game algorithm (which checks the counting condition prescribed by rigidity theory 57 ) to rapidly decompose the graph into maximum rigid clusters and flexible regions, a process known as rigid cluster decomposition. We consider a residue to be rigid if the Cα atom belongs to a rigid cluster that contains at least 15 atoms: this is a useful caveat because it prevents prolines and aromatic residues automatically showing up as rigid.
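The counting behind this description can be illustrated with a naive tally of degrees of freedom. This is not the pebble game: it treats every edge as independent, so it only gives a lower bound on the remaining internal flexibility and cannot locate rigid clusters. The function and dictionary names are hypothetical, and hydrogen bonds are assigned the five edges quoted for bonds that meet the energy cut-off.

```python
# Edges contributed by each constraint type, as given in the text.
EDGES_PER_CONSTRAINT = {"single": 5, "double": 6, "hydrophobic": 2, "hbond": 5}


def naive_internal_dof(n_atoms, constraints):
    """Lower bound on internal degrees of freedom from simple counting.

    n_atoms: number of atoms, each a rigid body with six degrees of freedom.
    constraints: iterable of constraint kinds, one entry per constraint.
    Six global degrees of freedom (rigid translation and rotation of the
    whole molecule) are subtracted, and each edge removes at most one more.
    """
    edges = sum(EDGES_PER_CONSTRAINT[kind] for kind in constraints)
    return max(0, 6 * n_atoms - 6 - edges)
```

Two atoms joined by a single bond keep one internal degree of freedom (rotation about the bond), whereas a double bond removes it, which is the distinction between five and six edges.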

Relative flexibility is quantified using a process termed hydrogen bond dilution, which is analogous to the thermal denaturation of a protein. Dilution involves incrementally removing edges associated with hydrogen bonds in the graph (weakest to strongest), repeating rigid cluster decomposition and noting the hydrogen bond energy at which the Cα atom of each residue is no longer part of a rigid cluster, i.e., becomes flexible. An important benefit of the dilution plot is that the exact energy of each hydrogen bond is not critical to the analysis. We have adapted this slightly, choosing to convert the energies to a Boltzmann population ratio at 298.15 K to represent the probability that a residue is flexible.
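One plausible form of the conversion is sketched below. The functional form exp(E/RT), the use of kcal/mol for the hydrogen bond energy E, and the function name are all assumptions; the text states only that energies are converted to a Boltzmann population ratio at 298.15 K.

```python
import math

GAS_CONSTANT = 0.0019872041  # kcal / (mol K)


def flexibility_probability(energy_kcal_per_mol, temperature=298.15):
    """Boltzmann population ratio for the energy at which a residue becomes flexible.

    energy_kcal_per_mol: hydrogen bond energy cut-off (zero or negative) at
    which the residue's Calpha atom leaves its rigid cluster (assumed units).
    Returns a value in (0, 1]; 1 means flexible even with all hydrogen bonds kept.
    """
    if energy_kcal_per_mol > 0:
        raise ValueError("hydrogen bond energies are expected to be <= 0")
    return math.exp(energy_kcal_per_mol / (GAS_CONSTANT * temperature))
```

Under this assumption, a residue that only becomes flexible once strong hydrogen bonds are removed receives a value close to zero, and a residue that is flexible from the outset receives one.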

Comparing RCI and FIRST

A simple comparison of RCI and FIRST is not ideal, because the frequency distributions of RCI and FIRST output values are different (Supplementary Fig. 8a, b ). The main difference is that RCI is calculated as the inverse of averaged secondary chemical shifts and therefore it is not possible to achieve an RCI value of zero. We decided to rescale RCI values so that the mode RCI value (0.024) becomes “zero” and round up any subsequent negative values to zero. At the other end of the scale, particularly noticeable is a large spike in RCI values at 0.2, which is composed of terminal residues. A similar spike, also composed of terminal residues, is present in the frequency distribution of FIRST at a Boltzmann population ratio equal to one (i.e., completely flexible at 298.15 K). We therefore decided to scale RCI values so that these spikes align. Subsequent values above one (i.e., apparently more flexible than terminal residues) are rounded down to one, although such instances are very rare. The equation below outlines how we compute rescaled RCI (\(R_{\mathrm{RCI}}^\prime\)) from the original RCI values (\(R_{\mathrm{RCI}}\)):
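Written out from the description above (the mode, 0.024, mapped to zero, the terminal-residue spike at 0.2 mapped to one, and values outside that range clipped), and assuming the mapping between those two points is linear, the rescaling is:

\[ R_{\mathrm{RCI}}^\prime = \min\left( 1,\ \max\left( 0,\ \frac{R_{\mathrm{RCI}} - 0.024}{0.2 - 0.024} \right) \right) \]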

Comparing the frequency distribution of the rescaled RCI and FIRST output values shows good agreement (Supplementary Fig.  8c, d ).
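The rescaling can be sketched in a few lines. The function name is hypothetical and the mapping is assumed to be linear between the mode (0.024) and the terminal-residue spike (0.2), with clipping outside that range as described.

```python
RCI_MODE = 0.024  # mode of the RCI distribution, mapped to zero
RCI_MAX = 0.2     # terminal-residue spike, mapped to one


def rescale_rci(rci):
    """Map an RCI value onto the 0-1 scale used for comparison with FIRST."""
    scaled = (rci - RCI_MODE) / (RCI_MAX - RCI_MODE)
    # Round negative values up to zero and values above one down to one.
    return min(1.0, max(0.0, scaled))
```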

Validation scores

RCI and FIRST are compared using two different measures. One is the correlation, calculated using a Spearman rank correlation coefficient. The other is the root mean square deviation (RMSD), calculated as:
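Using the symbols defined immediately below, with the index i running over residues (the index is added here for clarity), the root mean square deviation takes the standard form:

\[ \mathrm{RMSD} = \sqrt{ \frac{1}{N} \sum_{i=1}^{N} \left( R_{\mathrm{RCI},i}^\prime - R_{\mathrm{FIRST},i} \right)^2 } \]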

where N is the number of residues in the protein, \(R_{\mathrm{RCI}}^\prime\) is the local rigidity computed with RCI and rescaled as described above, and \(R_{\mathrm{FIRST}}\) is the local rigidity computed with FIRST. The numerical values of correlation score and RMSD score are reported as the percentiles relative to a reference dataset formed of structures from the CNS and CNW datasets from the RECOORD recalculated structure database, which provide a representative selection of different fold types, before and after explicit solvent refinement.
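The two raw measures can be sketched with the standard library alone. Function names are hypothetical, the inputs are per-residue lists of rescaled RCI and FIRST values, and the conversion of the raw values to percentiles against the reference dataset is not shown.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least two points")
    rx, ry = average_ranks(x), average_ranks(y)
    mean_x, mean_y = sum(rx) / len(rx), sum(ry) / len(ry)
    sxx = sum((a - mean_x) ** 2 for a in rx)
    syy = sum((b - mean_y) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant data")
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    return sxy / math.sqrt(sxx * syy)


def rmsd(rescaled_rci, first):
    """Root mean square deviation between rescaled RCI and FIRST values."""
    if len(rescaled_rci) != len(first) or not first:
        raise ValueError("need two non-empty sequences of equal length")
    squared = [(a - b) ** 2 for a, b in zip(rescaled_rci, first)]
    return math.sqrt(sum(squared) / len(squared))
```

The rank correlation responds to whether rigid and flexible regions occur in the same places, while the RMSD responds to whether the absolute level of flexibility agrees, which is why the two scores can diverge.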

Dataset of comparable X-ray and NMR structures

To build a dataset of comparable X-ray and NMR structures, we made use of the set of X-ray structures that were used to train the SHIFTX2 program for predicting chemical shifts 44 . This set comprises 197 high-resolution and high-quality structures, which are representative of different fold types. We extracted structures which had corresponding NMR structures in the PDB, and backbone chemical shift completeness of at least 75%. Our final dataset consisted of 80 X-ray structures and 121 corresponding NMR structures for 68 different proteins. PDB and BMRB IDs are provided in Supplementary Table  3 .

X-ray structures required some processing. If the structure contained multiple conformations (typical in high resolution X-ray structures), then we only considered the first of these as they appeared in the PDB file. Missing atoms and small breaks in the protein structure were identified using an in-house program and fixed using MODELLER 58 . MODELLER was also used to replace non-standard residues related to conditions required for crystallization (e.g., selenomethionine was replaced with methionine). Structures were protonated using REDUCE with the option to optimize adjustable groups 59 .
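Keeping only the first listed conformation can be sketched as a filter over PDB-format coordinate lines. This is not the in-house program mentioned above: the function name is hypothetical, and the sketch assumes a single-model file with standard fixed-column ATOM/HETATM records, keeping the first occurrence of each atom within each residue.

```python
def keep_first_conformation(pdb_lines):
    """Drop alternate conformations, keeping each atom's first occurrence.

    Uses fixed PDB columns: atom name 13-16, chain 22, residue number 23-26,
    insertion code 27 (1-based). Non-coordinate lines pass through unchanged.
    """
    seen = set()
    kept = []
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            key = (line[21:22], line[22:26], line[26:27], line[12:16])
            if key in seen:
                continue  # a later alternate location of an atom already kept
            seen.add(key)
        kept.append(line)
    return kept
```

The alternate location indicator itself (column 17) is left in place on the retained lines; downstream tools may need it blanked.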

Reporting summary

Further information on research design is available in the  Nature Research Reporting Summary linked to this article.

Data availability

Source data are listed in Supplementary Information and are from publicly available databases: specifically, the Protein Data Bank ( www.rcsb.org ), Biological Magnetic Resonance Bank (BMRB: www.bmrb.io ) and RECOORD ( www.ebi.ac.uk/pdbe/recalculated-nmr-data ). The accession codes of PDB and BMRB entries used in this study are listed in the Supplementary Information file. Data supporting the findings of this work are available within the paper and its Supplementary Information . The datasets generated and analysed during the current study are available from the corresponding author (MPW) upon request.

Code availability

The program and associated documentation can be downloaded from github.com/nickjf/ANSURR, https://doi.org/10.5281/zenodo.4161586 60 . A typical calculation on an ensemble of 20 models for a 150-residue protein takes less than a minute.

References

Berman, H. M. et al. The protein data bank. Nucleic Acids Res. 28 , 235–242 (2000).

Read, R. J. et al. A new generation of crystallographic validation tools for the Protein Data Bank. Structure 19 , 1395–1412 (2011).

Henderson, R. et al. Outcome of the first electron microscopy validation task force meeting. Structure 20 , 205–214 (2012).

Trewhella, J. et al. Report of the wwPDB small-angle scattering task force: data requirements for biomolecular modeling and the PDB. Structure 21 , 875–881 (2013).

Montelione, G. T. et al. Recommendations of the wwPDB NMR validation task force. Structure 21 , 1563–1570 (2013).

Gore, S. et al. Validation of structures in the Protein Data Bank. Structure 25 , 1916–1927 (2017).

Brunger, A. T. Free R-value: a novel statistical quantity for assessing the accuracy of crystal structures. Nature 355 , 472–475 (1992).

Snyder, D. A., Bhattacharya, A., Huang, Y. P. J. & Montelione, G. T. Assessing precision and accuracy of protein structures derived from NMR data. Proteins 59 , 655–661 (2005).

Vuister, G. W., Fogh, R. H., Hendrickx, P. M. S., Doreleijers, J. F. & Gutmanas, A. An overview of tools for the validation of protein NMR structures. J. Biomol. NMR 58 , 259–285 (2014).

Spronk, C. A. E. M., Nabuurs, S. B., Krieger, E., Vriend, G. & Vuister, G. W. Validation of protein structures derived by NMR spectroscopy. Progr. NMR Spectrosc. 45 , 315–337 (2004).

Nabuurs, S. B., Spronk, C. A. E. M., Vuister, G. W. & Vriend, G. Traditional biomolecular structure determination by NMR spectroscopy allows for major errors. PLoS Comput. Biol. 2 , 71–79 (2006).

Brünger, A. T., Clore, G. M., Gronenborn, A. M., Saffrich, R. & Nilges, M. Assessing the quality of solution nuclear magnetic resonance structures by complete cross-validation. Science 261 , 328–331 (1993).

Huang, Y. J., Rosato, A., Singh, G. & Montelione, G. T. RPF: a quality assessment tool for protein NMR structures. Nucleic Acids Res. 40 , W542–W546 (2012).

Williamson, M. P., Kikuchi, J. & Asakura, T. Application of 1 H NMR chemical shifts to measure the quality of protein structures. J. Mol. Biol. 247 , 541–546 (1995).

Zhao, D. Q. & Jardetzky, O. An assessment of the precision and accuracy of protein structures determined by NMR: dependence on distance errors. J. Mol. Biol. 239 , 601–607 (1994).

Saccenti, E. & Rosato, A. The war of tools: how can NMR spectroscopists detect errors in their structures? J. Biomol. NMR 40 , 251–261 (2008).

Spronk, C. A. E. M. et al. The precision of NMR structure ensembles revisited. J. Biomol. NMR 25 , 225–234 (2003).

Gronwald, W. et al. RFAC, a program for automated NMR R-factor estimation. J. Biomol. NMR 17 , 137–151 (2000).

Gronwald, W. et al. AUREMOL-RFAC-3D, combination of R-factors and their use for automated quality assessment of protein solution structures. J. Biomol. NMR 37 , 15–30 (2007).

Wüthrich, K. NMR of Proteins and Nucleic Acids . (Wiley, New York, 1986).

Wishart, D. S. Interpreting protein chemical shift data. Prog. Nucl. Magn. Reson. Spectrosc. 58 , 62–87 (2011).

Berjanskii, M. V. & Wishart, D. S. Application of the random coil index to studying protein flexibility. J. Biomol. NMR 40 , 31–48 (2008).

Berjanskii, M. V. & Wishart, D. S. A simple method to predict protein flexibility using secondary chemical shifts. J. Am. Chem. Soc. 127 , 14970–14971 (2005).

Sljoka, A. & Wilson, D. Probing protein ensemble rigidity and hydrogen-deuterium exchange. Phys. Biol. 10 , 056013 (2013).

Jacobs, D. J., Rader, A. J., Kuhn, L. A. & Thorpe, M. F. Protein flexibility predictions using graph theory. Proteins 44 , 150–165 (2001).

Whiteley, W. Counting out to the flexibility of molecules. Phys. Biol. 2 , S116–S126 (2005).

Nederveen, A. J. et al. RECOORD: a recalculated coordinate database of 500+ proteins from the PDB using restraints from the BioMagResBank. Proteins 59 , 662–672 (2005).

Brunger, A. T. et al. Crystallography & NMR system: a new software suite for macromolecular structure determination. Acta Cryst. D 54 , 905–921 (1998).

Güntert, P. Automated NMR protein structure calculation. Progr. NMR Spectrosc. 43 , 105–125 (2003).

Linge, J. P., Habeck, M., Rieping, W. & Nilges, M. ARIA: automated NOE assignment and NMR structure calculation. Bioinformatics 19 , 315–316 (2003).

Linge, J. P., Williams, M. A., Spronk, C. A. E. M., Bonvin, A. M. J. J. & Nilges, M. Refinement of protein structures in explicit solvent. Proteins 50 , 496–506 (2003).

Deng, H., Jia, Y. & Zhang, Y. 3DRobot: automated generation of diverse and well-packed protein structure decoys. Bioinformatics 32 , 378–387 (2016).

Zemla, A. LGA: a method for finding 3D similarities in protein structures. Nucleic Acids Res. 31 , 3370–3374 (2003).

Keedy, D. A. et al. The other 90% of the protein: Assessment beyond the Cαs for CASP8 template-based and high-accuracy models. Proteins 77 , 29–49 (2009).

Mao, B., Tejero, R., Baker, D. & Montelione, G. T. Protein NMR structures refined with Rosetta have higher accuracy relative to corresponding X-ray crystal structures. J. Am. Chem. Soc. 136 , 1893–1906 (2014).

Clore, G. M., Robien, M. A. & Gronenborn, A. M. Exploring the limits of precision and accuracy of protein structures determined by nuclear magnetic resonance spectroscopy. J. Mol. Biol. 231 , 82–102 (1993).

Nabuurs, S. B. et al. Quantitative evaluation of experimental NMR restraints. J. Am. Chem. Soc. 125 , 12026–12034 (2003).

Huang, Y. P. J. et al. An integrated platform for automated analysis of protein NMR structures. Methods Enzymol. 394 , 111–141 (2005).

Simon, K., Xu, J., Kim, C. & Skrynnikov, N. Estimating the accuracy of protein structures using residual dipolar couplings. J. Biomol. NMR 33 , 83–93 (2005).

Chen, V. B. et al. MolProbity: all-atom structure validation for macromolecular crystallography. Acta Crystallogr. Sect. D 66 , 12–21 (2010).

Laskowski, R. A., Rullmann, J. A. C., MacArthur, M. W., Kaptein, R. & Thornton, J. M. AQUA and PROCHECK-NMR: Programs for checking the quality of protein structures solved by NMR. J. Biomol. NMR 8 , 477–486 (1996).

Berjanskii, M., Zhou, J., Liang, Y., Lin, G. & Wishart, D. S. Resolution-by-proxy: a simple measure for assessing and comparing the overall quality of NMR protein structures. J. Biomol. NMR 53 , 167–180 (2012).

Berjanskii, M. et al. PROSESS: a protein structure evaluation suite and server. Nucleic Acids Res. 38 , W633–W640 (2010).

Han, B., Liu, Y., Ginzinger, S. W. & Wishart, D. S. SHIFTX2: significantly improved protein chemical shift prediction. J. Biomol. NMR 50 , 43–57 (2011).

Tilton, R. F., Dewan, J. C. & Petsko, G. A. Effects of temperature on protein structure and dynamics: X-ray crystallographic studies of the protein ribonuclease-A at 9 different temperatures from 98 K to 320 K. Biochemistry 31 , 2469–2481 (1992).

Fraser, J. S. et al. Accessing protein conformational ensembles using room-temperature X-ray crystallography. Proc. Natl Acad. Sci. USA 108 , 16247–16252 (2011).

Halle, B. Biomolecular cryocrystallography: Structural changes during flash-cooling. Proc. Natl Acad. Sci. USA 101 , 4793–4798 (2004).

Doreleijers, J. F., Rullmann, J. A. C. & Kaptein, R. Quality assessment of NMR structures: a statistical survey. J. Mol. Biol. 281 , 149–164 (1998).

Gonzalez, C., Rullmann, J. A. C., Bonvin, A. M. J. J., Boelens, R. & Kaptein, R. Toward an NMR R factor. J. Magn. Reson. 91 , 659–664 (1991).

ADS   CAS   Google Scholar  

Thomas, P. D., Basus, V. J. & James, T. L. Protein structure determination using distances from 2-dimensional nuclear Overhauser effect experiments: effect of approximations on the accuracy of derived structures. Proc. Natl Acad. Sci. USA 88 , 1237–1241 (1991).

Withka, J. M., Srinivasan, J. & Bolton, P. H. Problems with, and alternatives to, the NMR R factor. J. Magn. Reson. 98 , 611–617 (1992).

Williamson, M. P., Havel, T. F. & Wüthrich, K. Solution conformation of proteinase inhibitor IIA from bull seminal plasma by 1 H nuclear magnetic resonance and distance geometry. J. Mol. Biol. 182 , 295–315 (1985).

Vranken, W. F. NMR structure validation in relation to dynamics and structure determination. Prog. Nucl. Magn. Reson. Spectrosc. 82 , 27–38 (2014).

Kim, T. H. et al. The role of dimer asymmetry and protomer dynamics in enzyme catalysis. Science 355 , eaag2355 (2017).

Article   PubMed   CAS   Google Scholar  

Tamiola, K., Acar, B. & Mulder, F. A. A. Sequence-specific random coil chemical shifts of intrinsically disordered proteins. J. Am. Chem. Soc. 132 , 18000–18003 (2010).

Schwarzinger, S. et al. Sequence-dependent correction of random coil NMR chemical shifts. J. Am. Chem. Soc. 123 , 2970–2978 (2001).

Katoh, N. & Tanigawa, S. A proof of the molecular conjecture. Discret. Comput. Geom. 45 , 647–700 (2011).

Article   MathSciNet   MATH   Google Scholar  

Webb, B. & Sali, A. Comparative protein structure modeling using MODELLER. Curr. Protoc. Protein Sci. 86 , 5.6.1–5.6.37 (2016).

Article   Google Scholar  

Word, J. M., Lovell, S. C., Richardson, J. S. & Richardson, D. C. Asparagine and glutamine: using hydrogen atom contacts in the choice of side-chain amide orientation. J. Mol. Biol. 285 , 1735–1747 (1999).

Fowler, N. J., Sljoka, A. & Williamson, M. P. A method for validating the accuracy of NMR protein structures. GitHub.com/nickjf/ANSURR https://doi.org/10.5281/zenodo.4161586 (2020).

Kabsch, W. & Sander, C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers 22 , 2577–2637 (1983).

Download references

Acknowledgements

We thank the Biotechnology and Biological Science Research Council (BBSRC) for funding to N.J.F. (BB/P020038/1), and CREST, Japan Science and Technology Agency (JST) JPMJCR1402 and PRISM JPMJCR18Z3 for funding to A.S.

Author information

Authors and affiliations.

Dept of Molecular Biology and Biotechnology, University of Sheffield, Sheffield, UK

Nicholas J. Fowler & Mike P. Williamson

RIKEN Center for Advanced Intelligence Project, RIKEN, 1-4-1 Nihombashi, Chuo-ku, Tokyo, 103-0027, Japan

Adnan Sljoka

Dept of Chemistry, University of Toronto, UTM, 3359 Mississauga Road North, Mississauga, ON, L5L 1C6, Canada


Contributions

M.P.W. and A.S. conceived the study. N.J.F. wrote the code and did the analysis. All authors wrote the manuscript.

Corresponding authors

Correspondence to Adnan Sljoka or Mike P. Williamson .

Ethics declarations

Competing interests.

The authors declare no competing interests.

Additional information

Peer review information Nature Communications thanks the anonymous reviewer(s) for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/ .


About this article

Cite this article.

Fowler, N.J., Sljoka, A. & Williamson, M.P. A method for validating the accuracy of NMR protein structures. Nat Commun 11 , 6321 (2020). https://doi.org/10.1038/s41467-020-20177-1


Received : 25 September 2020

Accepted : 13 November 2020

Published : 18 December 2020

DOI : https://doi.org/10.1038/s41467-020-20177-1


nmr assignment database

 
   
Tools for NMR spectroscopists
  • Predict 1D 1 H NMR spectra
  • Predict 1D 13 C NMR spectra
  • Predict COSY spectra
  • Predict HSQC / HMBC spectra
  • Simulated second order effect in 1 H NMR spectra
  • Make some NMR exercises or share them with your students

1 H NMR prediction was made possible by the tool of the FCT-Universidade NOVA de Lisboa developed by Yuri Binev and Joao Aires-de-Sousa. Y. Binev, M.M. Marques, J. Aires-de-Sousa, Prediction of 1H NMR coupling constants with associative neural networks trained for chemical shifts, J. Chem. Inf. Model. 2007, 47(6), 2089-2097.

This website does not contain any database of NMR spectra, but it makes it easy to predict 13 C as well as 1 H spectra.



Toward creation of a universal NMR database for stereochemical assignment: complete structure of the desertomycin/oasomycin class of natural products

  • PMID: 11456839
  • DOI: 10.1021/ja004154q


Similar articles

  • Toward creation of a universal NMR database for the stereochemical assignment of acyclic compounds: proof of concept. Lee J, Kobayashi Y, Tezuka K, Kishi Y. Org Lett. 1999 Dec 30;1(13):2181-4. doi: 10.1021/ol990379y. PMID: 10836073
  • Toward the creation of NMR databases in chiral solvents for assignments of relative and absolute stereochemistry: scope and limitation. Hayashi N, Kobayashi Y, Kishi Y. Org Lett. 2001 Jul 12;3(14):2249-52. doi: 10.1021/ol010109r. PMID: 11440591
  • Stereochemical Assignment of the C21-C38 Portion of the Desertomycin/Oasomycin Class of Natural Products by Using Universal NMR Databases: Proof. Tan CH, Kobayashi Y, Kishi Y. Angew Chem Int Ed Engl. 2000 Dec 1;39(23):4282-4284. doi: 10.1002/1521-3773(20001201)39:23 3.0.CO;2-U. PMID: 29711890 No abstract available.
  • Macrolactam analogues of macrolide natural products. Hügel HM, Smith AT, Rizzacasa MA. Org Biomol Chem. 2016 Dec 7;14(48):11301-11316. doi: 10.1039/c6ob02149b. PMID: 27812587 Review.
  • Desertomycin: a potentially interesting antibiotic. Uri JV. Acta Microbiol Hung. 1986;33(4):271-83. PMID: 3307272 Review. No abstract available.
  • Dactylides A-C, three new bioactive 22-membered macrolides produced by Dactylosporangium aurantiacum. Kumar P, Nalli Y, Singh S, Wakchaure PD, Gor R, Ghadge VA, Kim E, Ramalingam S, Azger Dusthackeer VN, Yoon YJ, Ganguly B, Shinde PB. J Antibiot (Tokyo). 2023 Sep;76(9):503-510. doi: 10.1038/s41429-023-00632-z. Epub 2023 May 19. PMID: 37208457
  • Identification of a New Antimicrobial, Desertomycin H, Utilizing a Modified Crowded Plate Technique. Mohamed OG, Dorandish S, Lindow R, Steltz M, Shoukat I, Shoukat M, Chehade H, Baghdadi S, McAlister-Raeburn M, Kamal A, Abebe D, Ali K, Ivy C, Antonova M, Schultz P, Angell M, Clemans D, Friebe T, Sherman D, Casper AM, Price PA, Tripathi A. Mar Drugs. 2021 Jul 27;19(8):424. doi: 10.3390/md19080424. PMID: 34436264 Free PMC article.
  • Synthesis-Driven Stereochemical Assignment of Marine Polycyclic Ether Natural Products. Fuwa H. Mar Drugs. 2021 Apr 29;19(5):257. doi: 10.3390/md19050257. PMID: 33947080 Free PMC article. Review.
  • Isolation, Structure Elucidation and Biological Evaluation of Lagunamide D: A New Cytotoxic Macrocyclic Depsipeptide from Marine Cyanobacteria. Luo D, Putra MY, Ye T, Paul VJ, Luesch H. Mar Drugs. 2019 Feb 1;17(2):83. doi: 10.3390/md17020083. PMID: 30717076 Free PMC article.
  • Heat Shock Protein-Inducing Property of Diarylheptanoid Containing Chalcone Moiety from Alpinia katsumadai. Nam JW, Lee YS. Molecules. 2017 Oct 17;22(10):1750. doi: 10.3390/molecules22101750. PMID: 29039794 Free PMC article.


Journal: Metabolites. PMCID: PMC11123270

Accurate Prediction of 1 H NMR Chemical Shifts of Small Molecules Using Machine Learning

Tanvir Sajed

1 Department of Biological Sciences, University of Alberta, Edmonton, AB T6G 2E9, Canada

Zinat Sayeeda

Brian L. Lee, Mark Berjanskii

2 Department of Computing Science, University of Alberta, Edmonton, AB T6G 2E8, Canada

Vasuk Gautam

David S. Wishart

3 Department of Laboratory Medicine and Pathology, University of Alberta, Edmonton, AB T6G 2B7, Canada

4 Faculty of Pharmacy and Pharmaceutical Sciences, University of Alberta, Edmonton, AB T6G 2H7, Canada

Associated Data

The 1 H predictor is available as a webserver at https://prospre.ca . Likewise, the data used in training and testing PROSPRE are also available through the same webserver address.

NMR is widely considered the gold standard for organic compound structure determination. As such, NMR is routinely used in organic compound identification, drug metabolite characterization, natural product discovery, and the deconvolution of metabolite mixtures in biofluids (metabolomics and exposomics). In many cases, compound identification by NMR is achieved by matching measured NMR spectra to experimentally collected NMR spectral reference libraries. Unfortunately, the number of available experimental NMR reference spectra, especially for metabolomics, medical diagnostics, or drug-related studies, is quite small. This experimental gap could be filled by predicting NMR chemical shifts for known compounds using computational methods such as machine learning (ML). Here, we describe how a deep learning algorithm that is trained on a high-quality, “solvent-aware” experimental dataset can be used to predict 1 H chemical shifts more accurately than any other known method. The new program, called PROSPRE (PROton Shift PREdictor) can accurately (mean absolute error of <0.10 ppm) predict 1 H chemical shifts in water (at neutral pH), chloroform, dimethyl sulfoxide, and methanol from a user-submitted chemical structure. PROSPRE (pronounced “prosper”) has also been used to predict 1 H chemical shifts for >600,000 molecules in many popular metabolomic, drug, and natural product databases.

1. Introduction

NMR is ideal for determining the structure of small organic molecules, both natural and synthetic. This is because NMR spectra are characterized by sharp, well-defined peaks that can be directly associated with specific atoms within a given molecule. These peaks correspond to the chemical shifts, which can often be assigned to specific atoms or atomic groups in the molecule of interest. NMR chemical shifts, including 1 H, 13 C, and 15 N chemical shifts, are very sensitive to the electronic environment surrounding each nucleus and can provide a wealth of information about a molecule’s covalent and non-covalent structure. Not only are the chemical shifts sensitive to the type and character of nearby atoms but chemical shifts are also remarkably consistent or “predictive” for different chemical groups or chemical environments. This sensitivity and behavioural consistency have allowed chemists to produce various chemical shift tables that provide chemical shift ranges for various chemical groups and to use these tables to deduce the identity of key chemical groups and thereby determine the precise structures of additional small molecules.

As a result, NMR has become routinely used in the determination of novel structures prepared via organic synthesis, in characterizing newly discovered compounds or contaminants [ 1 , 2 , 3 ], in drug metabolite characterization [ 4 , 5 ], in natural product discovery [ 6 ], and in the deconvolution of metabolite mixtures in biofluids, especially in metabolomics and exposomics [ 7 , 8 ]. The 1 H and 13 C chemical shift assignments for many of these molecules have been deposited into a variety of NMR spectral reference libraries. These include the Human Metabolome Database (HMDB) [ 9 ], the Biological Magnetic Resonance Databank (BMRB) [ 10 ], NMRShiftDB2 [ 11 ], the Spectral Database System (SDBS) [ 12 ], and the Natural Products Magnetic Resonance Database (NP-MRD) [ 13 ]. In addition, several commercial NMR spectral libraries have been developed, including those from Advanced Chemistry Development (ACD/Labs) and the Wiley spectral database collection.

The intention of these experimentally collected NMR spectral libraries is to help others more easily characterize novel compounds or characterize/quantify known compounds using NMR analysis. Specifically, by matching or partially matching measured NMR spectra to experimentally collected NMR spectral reference libraries, it is hoped that the chemical shift assignment of new compounds can be facilitated, or the identification of previously known compounds can be rapidly performed. Unfortunately, the number of available experimental NMR reference spectra for applications in NMR-based metabolomics, NMR-based medical diagnostics, or NMR-based drug-related studies is quite small. For instance, in the field of metabolomics, fewer than 1000 compounds with high-quality NMR spectra have been deposited into the HMDB [ 9 ]. This compares to the >250,000 chemicals that are in the HMDB (which translates to <0.5% compound coverage). Likewise, the number of experimentally assigned NMR spectra in DrugBank [ 14 ] is <200, whereas the number of known drugs and drug metabolites in DrugBank is >12,700 (which translates to <1.6% compound coverage). Similarly, the number of experimentally assigned NMR spectra in the NP-MRD is <20,000, whereas the number of known natural products in the NP-MRD is >300,000 (which translates to <7% compound coverage). With the ever-increasing number of known human metabolites, known drugs or drug metabolites, and known natural products being studied and identified, collecting experimental NMR data on each of these compounds and completing their assignments is an almost impossible task.
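The coverage figures quoted above follow from simple ratios. The short sketch below recomputes them from the bounds given in the text (an upper bound on the number of compounds with spectra, a lower bound on the number of compounds); the dictionary and function names are illustrative and not part of any database API.

```python
# Bounds quoted in the text: (compounds with experimental NMR spectra, total compounds).
# The first number is an upper bound and the second a lower bound, so each
# computed percentage is itself an upper bound on coverage.
LIBRARIES = {
    "HMDB": (1_000, 250_000),
    "DrugBank": (200, 12_700),
    "NP-MRD": (20_000, 300_000),
}


def coverage_percent(n_with_spectra: int, n_compounds: int) -> float:
    """Percentage of compounds that have experimental NMR spectra."""
    return 100.0 * n_with_spectra / n_compounds


coverage = {name: coverage_percent(*counts) for name, counts in LIBRARIES.items()}

for name, pct in coverage.items():
    print(f"{name}: coverage below {pct:.2f}%")
```

Each computed value falls under the rounded threshold stated in the text (0.5%, 1.6%, and 7%, respectively).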

To address this gap between measured experimental NMR data and known structural data, a number of individuals have proposed “in silico” or “reference-free” approaches to small molecule characterization [ 15 , 16 ]. In particular, by accurately predicting the NMR chemical shifts (or other observables such as mass spectra or retention times) using known or predicted chemical structures, it may be possible to greatly accelerate compound identification or confirmation. Indeed, accurate prediction of NMR chemical shifts or NMR spectra of the millions of known compounds would allow the creation of an enormous library of predicted NMR spectra that could be readily used for the identification (and quantification) of compounds in almost any sample. More specifically, these in silico databases could confirm and validate structures of newly synthesized drugs or drug metabolites, facilitate the characterization of natural products with compelling medicinal properties, or assist with the NMR-based metabolomic analysis of urine, blood, or cerebrospinal fluid to aid in medical diagnoses.

NMR chemical shift prediction is nearly 70 years old [ 17 ] and hundreds of papers have been published on the subject (reviewed in [ 17 ]). There are four general approaches: (1) rule-based methods; (2) structure similarity approaches; (3) quantum mechanical (QM) approaches; and (4) machine learning (ML) methods. The earliest rule-based approaches date from the 1950s [ 18 , 19 ] and were used to estimate 13 C chemical shifts of methylene groups. Since then, many extensions of this rule-based or additive approach have been developed, enabling the prediction of chemical shifts for many different classes of organic compounds. However, because of their high level of uncertainty and the limited applicability of additive rules to more exotic structures, work on rule-based methods for chemical shift prediction has largely stopped.

Structure similarity methods use databases of structure fragments and their chemical shifts to predict 1 H and/or 13 C chemical shifts [ 11 , 20 , 21 ]. In these methods, the structure is queried against a large database of structures and experimental 1 H/ 13 C shifts to identify exactly matching or similar substructures. When similar substructures are found, the predicted chemical shifts are returned as the weighted average of the experimental chemical shift values corresponding to the matched structures. A popular method for encoding atomic environment information is the Hierarchically Ordered Spherical description of Environment (HOSE) code method [ 21 ], described in 1978 and first used for chemical shift prediction in 2003 [ 11 ]. NMRShiftDB provides an openly accessible HOSE-code-based chemical shift prediction tool [ 22 ]. HOSE code methods can achieve 1 H chemical shift predictions with errors (MAE) of 0.2–0.3 ppm [ 22 ].
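The weighted-average step of a structure similarity predictor can be sketched in a few lines. This is a minimal illustration, not the NMRShiftDB implementation: the function name, the (shift, weight) representation of a match, and the example values are all assumptions made for the sketch, and the text does not specify how the weights are chosen.

```python
from typing import Sequence, Tuple

# One match = (experimental shift in ppm, similarity weight).
# How weights are derived (e.g. from the depth of the matching environment)
# is an assumption left open here.
Match = Tuple[float, float]


def predict_shift(matches: Sequence[Match]) -> float:
    """Return the weighted average of the experimental shifts of all matches."""
    total_weight = sum(weight for _, weight in matches)
    if total_weight <= 0:
        raise ValueError("no usable matches: total weight must be positive")
    return sum(shift * weight for shift, weight in matches) / total_weight


# Illustrative (made-up) matches for a single proton environment.
example_matches = [(7.26, 3.0), (7.30, 2.0), (7.10, 1.0)]
predicted = predict_shift(example_matches)
print(f"predicted shift: {predicted:.3f} ppm")
```

Matches with a higher weight pull the prediction towards their experimental value, which is why closely matching substructures dominate the result.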

More recently, QM calculations that employ Density Functional Theory (DFT) techniques have become particularly popular [ 23 ]. DFT can provide chemical shift predictions that are reasonably close to experimental values, with RMSEs (root mean square errors) of 0.2–0.4 ppm for 1 H shifts [ 24 , 25 ]. Unfortunately, the time required to perform a DFT calculation, even for small organic molecules, is very long and grows steeply with the number of atoms. Prediction speed is a very important criterion, especially if one is trying to calculate chemical shifts for millions of molecules. As a result, there has been a move towards faster approaches that use ML.

ML-based approaches to predict NMR chemical shifts are often 100–1000× faster than QM approaches and offer similar accuracy. The first ML methods used relatively simple Artificial Neural Networks (ANNs) [ 26 ]. Meiler et al. [ 27 ] developed an ANN model that had superior performance in comparison with rule-based methods. Aires-de-Sousa et al. used counter propagation neural networks (CPNNs) [ 28 ] and later Feed Forward Neural Networks (FFNNs) [ 29 ] and Associative Neural Networks (ASNNs) to predict 1 H chemical shifts, achieving a mean absolute error (MAE) of 0.19 ppm. More recently, deep neural networks such as Graph Neural Networks (GNNs) have shown particularly promising results. Jonas and Kuhn [ 30 ] used a GNN to predict both 1 H and 13 C chemical shifts and found that their GNN either matched or outperformed the traditional HOSE code method. In particular, their predictor had a reported MAE of 1.43 ppm for 13 C and 0.28 ppm for 1 H. In 2021, Guan et al. [ 24 ] tried an approach called transfer learning (TL). They developed a GNN model, which they named CASCADE, using DFT-calculated chemical shift data to predict chemical shifts, and then applied TL to incrementally improve the DFT-trained model. Interestingly, this approach bypassed the problems with collecting and curating (fixing/cleaning) the large chemical shift datasets needed for ML-based training and testing. Despite these advances, the accuracy of NMR chemical shift prediction remains stuck in a state where the best predictors can only predict 1 H shifts with an error (MAE) of ~0.20 ppm and 13 C shifts with an MAE of >2.00 ppm [ 11 , 24 ].
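The accuracy figures in this section are quoted either as MAE or as RMSE, which are not directly interchangeable. The sketch below computes both for the same set of errors; the function names and the example shift values are illustrative only and are not taken from any of the cited studies.

```python
import math
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """MAE: average magnitude of the prediction errors (ppm)."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


def root_mean_square_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """RMSE: square root of the mean squared error (ppm); penalizes outliers more."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_sq = sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(predicted)
    return math.sqrt(mean_sq)


# Made-up 1H shifts (ppm) for illustration.
predicted_ppm = [1.0, 2.0, 3.0]
observed_ppm = [1.1, 1.9, 3.4]

mae = mean_absolute_error(predicted_ppm, observed_ppm)
rmse = root_mean_square_error(predicted_ppm, observed_ppm)
print(f"MAE = {mae:.3f} ppm, RMSE = {rmse:.3f} ppm")
```

For any set of errors the RMSE is at least as large as the MAE, so an RMSE of 0.2–0.4 ppm and an MAE of 0.2–0.3 ppm should be compared with that in mind.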

Our own experience in building experimental NMR spectral databases for HMDB, NP-MRD, and DrugBank showed that many of the training datasets used in previously published ML-based methods had significant problems with erroneous chemical shift assignments, incorrect chemical shift referencing, and a lack of appropriate accommodation for solvent effects. We hypothesized that by correcting for these database problems, the accuracy of 1 H (and as will be shown in an upcoming publication, 13 C) chemical shift prediction could be improved. In this paper, we first describe how we built a high-quality, reference-corrected, “solvent-aware” experimental NMR dataset for developing ML predictors of 1 H chemical shifts. We then demonstrate how this dataset was used to train a neural network for predicting 1 H shifts via transfer learning from an existing GNN that was trained on DFT chemical shifts. Finally, we present a web-based implementation of this 1 H chemical shift predictor which we call PROSPRE (PROton Shift PREdictor). PROSPRE takes a chemical structure (as a SMILES string) as input and accurately (MAE ~0.10 ppm) predicts its 1 H chemical shifts in water (at neutral pH), chloroform, methanol, and dimethyl sulfoxide ( Figure 1 ).

Figure 1. A flowchart of the ( A ) transfer learning process and ( B ) 1 H predictions. Please see the explanation in the text.

2.1. Creating a Solvent-Aware 1 H Chemical Shift Dataset for Training and Validation

Accurately predicting 1 H chemical shifts using ML methods requires large collections of correct chemical structures with correct placement of all protons and accurate, experimentally assigned 1 H chemical shifts. These structure/shift collections also must have consistent atomic numbering schemes and information about solvents that were used to prepare NMR samples. NMR solvents are known to significantly affect the observed 1 H chemical shifts, the presence/absence of 1 H signals, and the time-averaged structures of organic molecules [ 31 , 32 , 33 ]. Different solvents also require the use of different chemical shift reference standards (such as tetramethylsilane [TMS] or trimethylsilylpropanoic acid [TSP]) which can also lead to systematic chemical shift changes [ 34 ]. In the fields of NMR-based diagnostics, metabolomics, exposomics, and drug metabolism, almost all chemical compounds are dissolved in water. On the other hand, in the fields of organic chemistry and natural product research, almost all chemical compounds are dissolved in organic solvents (methanol, dimethyl sulfoxide, chloroform, etc.). As our primary interest is in biological systems, our initial focus was on assembling a high-quality dataset of small molecule 1 H chemical shift assignments in water. Based on the quality, coverage and solvent choices among existing NMR databases, we decided to work with just three NMR spectral libraries: (1) the Human Metabolome Database (HMDB), (2) the Biological Magnetic Resonance Databank (BMRB), and (3) the Guided Ideographic Spin System Model Optimization (GISSMO) library [ 35 ]. The HMDB [ 9 ] is a comprehensive, high-quality, freely available online database of the small molecule metabolites found in the human body. It contains experimentally collected 1 H NMR spectra for 768 compounds. We found the experimental NMR data and 1 H chemical shift assignments were of very high quality and almost all were collected in water. 
The second NMR spectral library we used was the BMRB [ 10 ]. The BMRB contains over 1000 biological small molecules with assigned 1 H chemical shifts at multiple spectrometer frequencies. We found that the experimental NMR data and 1 H chemical shift assignments in the BMRB were of high quality (although a few assignment errors were evident) and almost all chemical shifts were collected in water. The third chemical shift library we chose was the GISSMO library [ 35 ]. The GISSMO database contains about 1000 small molecules and small molecule fragments with assigned 1 H chemical shifts. Almost all the chemical shifts in GISSMO were collected in water. Chemical shifts in these databases were mostly referenced to the internal standard DSS (4,4-dimethyl-4-silapentane-1-sulfonic acid) at 0.00 ppm and acquired at a pH between 7.0 and 7.4. To round out our dataset with 1 H chemical shift assignments in non-aqueous solvents and to extend the utility of our predictor to other applications (natural products and organic synthesis), we also extracted structures and chemical shift data from the NMRShiftDB database. NMRShiftDB contains 1 H NMR assignments for mostly non-biological or synthetic compounds, for which the most common solvents are chloroform, dimethyl sulfoxide, and methanol.

2.1.1. The Training Dataset

The training dataset consisted of 577 molecules with complete 3D structures (with attached protons) and fully assigned 1 H chemical shifts in water. A total of 430 of these molecules were obtained from the HMDB library. These 430 molecules had a total of 3333 experimentally measured 1 H chemical shift values. Another 103 molecules were obtained from the BMRB library, which corresponded to 508 experimentally measured 1 H chemical shifts. The last set of 44 molecules was collected from the GISSMO library, which contributed 366 experimentally measured 1 H chemical shifts. Altogether, our training dataset consisted of 4207 experimentally measured 1 H chemical shift values from 577 diverse molecules. These 577 molecules had an average molecular weight of 162 Daltons (Da), ranging from 31 Da to 566 Da. All 1 H chemical shifts in the training dataset were collected in water and referenced to DSS. The assembled training dataset contained a structurally diverse range of molecules including organic acids, alcohols, amino acids, and nucleotides. Note that most of the molecules chosen were relatively water soluble and had a biological origin (microbial, plant or animal). The bias towards human metabolites and natural products was deliberate as we are primarily interested in predicting 1 H chemical shifts for compounds that can be used as biomarkers for diagnostics, for metabolomics or exposomics applications, and for drug research. All 1 H chemical shift assignments were checked and confirmed by multiple NMR experts through manual inspection of the available 1D and 2D NMR spectra and by comparison to both published literature assignments and suggestions provided by commercial NMR assignment tools (see details in Section 2.2 ).
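The composition of the training dataset can be cross-checked with a short tally. The counts below are taken directly from the paragraph above; the variable names are illustrative.

```python
# (number of molecules, number of experimentally measured 1H shifts) per source,
# as stated in the text.
TRAINING_SOURCES = {
    "HMDB": (430, 3333),
    "BMRB": (103, 508),
    "GISSMO": (44, 366),
}

total_molecules = sum(molecules for molecules, _ in TRAINING_SOURCES.values())
total_shifts = sum(shifts for _, shifts in TRAINING_SOURCES.values())
shifts_per_molecule = total_shifts / total_molecules

print(f"{total_molecules} molecules, {total_shifts} shifts "
      f"({shifts_per_molecule:.1f} shifts per molecule on average)")
```

The per-source counts sum to the stated totals of 577 molecules and 4207 chemical shifts.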

2.1.2. The Holdout Datasets

Two holdout sets, not previously seen by our ML model, were used to test the performance of the different trained ML models for 1 H chemical shift prediction. Our first holdout dataset consisted of 36 structurally diverse molecules chosen at random from the HMDB, BMRB, or GISSMO, each of which was dissolved in water and referenced to DSS. These 36 molecules had a total of 272 experimentally measured 1 H chemical shifts and an average molecular weight of 156 Da (ranging from 78 to 307 Da).

The second holdout dataset consisted of 22 organic compounds that were chosen at random from the NP-MRD database. These 22 compounds had a total of 442 experimentally determined 1 H chemical shifts. All 22 compounds were dissolved in deuterated chloroform and referenced to tetramethylsilane (TMS). These solvent and chemical shift reference conditions differ from those in the first holdout set. Therefore, to bring the chemical shift data in line with what is reported for compounds dissolved in water and referenced to DSS, we made two chemical shift adjustments. First, based on data provided by Wishart et al. [ 34 , 36 ], we adjusted all TMS-referenced 1 H chemical shifts in the second holdout set to match DSS-referenced 1 H chemical shifts. Second, because chloroform has a different polarity and hydrogen bonding character than water, we adjusted the reported 1 H chemical shifts to match those reported in water, using a locally developed solvent scaling equation (see Supplementary Materials, Figure S1 ). For the molecules in this second holdout set, the average molecular weight was 306 Da, ranging from 224 to 429 Da.

2.2. Data Cleaning and Correction

A persistent problem with chemical shift assignments is that there is no standard or consistent way to label which atom numbers are assigned to which 1 H chemical shifts. Typically, chemical shift assignments are presented visually with atom labels marked on an image of the molecular structure and the chemical shifts are presented separately in a table with the corresponding atom labels from the structural image. While this visual approach to structural or chemical shift mapping works well for humans, it is not computer readable. Further complicating the matter is the fact that atom numbering of most molecules drawn with commercial software tools varies depending on how it was drawn by each user. When we analyzed publicly available NMR assignments and corresponding structures, we found that the molecular structures did not have the same pattern of atom numbering. Moreover, not all structure files were consistent. We found that some of the molecular structure files for some chemicals were rendered as “flat” two-dimensional structures, whereas others were rendered as proper 3D structures.

To overcome these problems, we first used a program called Atom Label Assignment Tool using InChI String (ALATIS) [ 37 ] to generate robust 3D molecular structures and consistent atom numbering. Next, using Marvin Sketch (version 20) from ChemAxon (Budapest, Hungary), we rotated the structures around different axes to align with the molecular images available in the databases. After performing these manipulations, we manually mapped the two atom number schemes to each other by comparing their images side by side. We then manually changed the atom numbers in the chemical shift assignment files.

After completing the structure “cleaning” and remediation process, we manually checked all the 1 H chemical shift assignments for all the molecules in both the training and the two holdout datasets. To facilitate this checking and correction procedure, we used a commercial program called MNOVA [ 38 ]. MNOVA is a popular NMR data analysis package which offers a full selection of software tools for processing and visualizing high-resolution NMR spectra. We used MNOVA-predicted chemical shifts to identify manually assigned chemical shifts that seemed unusual or questionable. If the difference between the MNOVA-predicted shift and the observed/reported shift was >1.0 ppm for any hydrogen atom in any given molecule, we manually rechecked those assignments by inspecting the available 1 H and/or 1 H- 13 C NMR spectra and made appropriate corrections where errors were found. If we could not rationalize the difference, we discarded that entry. We also used information from the Reich 1 H chemical shift database [ 39 ] to cross-check the experimentally reported 1 H NMR chemical shift values against those predicted based on their known positions within molecules. Additionally, we used the BMRB database to compare reported 1 H chemical shift assignments against those reported in the HMDB database (where structural overlaps occurred). This also helped correct misassigned chemical shifts. To further confirm the chemical shift assignments or assignment changes, several NMR experts with >10 years of NMR experience reviewed each other’s assignments.
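The screening rule described above (flag any hydrogen whose predicted and reported shifts differ by more than 1.0 ppm) can be sketched as follows. This is a minimal illustration only: the function name, the atom labels, and the shift values are hypothetical, and the reference predictions in the study came from MNOVA rather than from any code shown here.

```python
def flag_questionable(observed, predicted, threshold_ppm=1.0):
    """Return atom labels whose |predicted - observed| exceeds the threshold.

    observed / predicted: dicts mapping an atom label to a shift in ppm.
    Atoms without a reference prediction are skipped.
    """
    flagged = []
    for atom, obs_shift in observed.items():
        if atom not in predicted:
            continue
        if abs(predicted[atom] - obs_shift) > threshold_ppm:
            flagged.append(atom)
    return flagged


# Made-up example values (ppm); H2 deviates by 2.05 ppm and is flagged.
observed = {"H1": 1.25, "H2": 4.10, "H3": 7.30}
predicted = {"H1": 1.20, "H2": 2.05, "H3": 7.45}
print(flag_questionable(observed, predicted))  # ['H2']
```

Flagged atoms are candidates for manual re-inspection of the spectra, not automatic corrections.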

2.3. Machine Learning Method

To train our 1 H NMR predictor, we used a graph neural network (GNN) and a fine-tuning or TL strategy similar to the one previously employed for refining 13 C NMR predictions in CASCADE [ 24 ]. Specifically, the CASCADE GNN ( Figure S2 ), which was originally trained on the DFT8K dataset (consisting of 8000 DFT optimized structures and ~200,000 DFT computed 1 H chemical shifts), served as the starting point for our fine-tuning process [ 24 ]. The input for our modified GNN model included nodes that encode atom types with edges representing interatomic distances, targets for the chemical shift values, and connectivity between atoms in a tensor form. Feature initialization involved creating embeddings. These embeddings included 256 entries each for node and edge features, based on atom types and interatomic distances, respectively. For later steps in the network, edge features were updated by combining edge and node features through trainable weights and activation functions. Unlike previous layers, weights in dense layers of the message passing and edge network were kept trainable. Only 6 layers in our GNN were allowed to be trainable or tunable so that original weights in most of the other layers of the GNN remained unaffected. After the edge feature update, the message-passing step allowed atoms to exchange information based on their spatial and chemical contexts by combining updated edge features with atom (i.e., node) features. If multiple messages to the same node were present, they were pooled into a single message before updating the node features. Just as with previous steps, weights in the message passing and node updating steps were frozen. The final prediction of NMR chemical shifts was achieved by passing the updated node features through three dense layers with sizes of 256, 256, and 128. The final readout layer generated a single number (i.e., chemical shift value).
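To make the message-passing and pooling step more concrete, the following pure-Python sketch performs one simplified update on a toy graph. It is not the CASCADE or PROSPRE implementation: the trainable dense layers are omitted, and the choices of an element-wise product for the message, sum pooling, and a residual node update are assumptions made for illustration.

```python
def message_passing_step(node_feats, edges, edge_feats):
    """One simplified message-passing step with sum pooling.

    node_feats: list of feature vectors, one per atom
    edges:      list of (sender, receiver) atom index pairs
    edge_feats: list of feature vectors, one per edge (same length as nodes')
    """
    dim = len(node_feats[0])
    pooled = [[0.0] * dim for _ in node_feats]
    for (src, dst), e in zip(edges, edge_feats):
        # Message = element-wise product of edge and sender-node features.
        message = [e_k * h_k for e_k, h_k in zip(e, node_feats[src])]
        # Messages arriving at the same atom are pooled by summation.
        for k in range(dim):
            pooled[dst][k] += message[k]
    # Residual update: new node feature = old feature + pooled message.
    return [[h_k + m_k for h_k, m_k in zip(h, m)]
            for h, m in zip(node_feats, pooled)]


# Toy graph: atoms 0 and 2 both send a message to atom 1; atom 1 sends to 0.
nodes = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
edges = [(0, 1), (2, 1), (1, 0)]
edge_feats = [[0.5, 0.5], [1.0, 1.0], [2.0, 2.0]]
updated = message_passing_step(nodes, edges, edge_feats)
print(updated)  # [[1.0, 2.0], [2.5, 3.0], [2.0, 2.0]]
```

Atom 2 receives no messages in this toy graph, so its features are unchanged.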

Our GNN was implemented using the Keras (version 2.3.1) [ 40 ] and TensorFlow (version 2.2.0) [ 41 ] frameworks. Model training was conducted with a batch size of 32 on an in-house Dell Precision 5820 with a 24 GB Nvidia RTX A5000 (Nvidia Corporation, Santa Clara, USA). The models were optimized with the Adam optimizer, a first-order gradient-based optimization method, using MAE as the loss function. The initial learning rate was set to 5 × 10⁻⁴, with a learning rate decay of 4% every 70 epochs. The maximal number of epochs was set to 1200. An early stopping mechanism evaluated the validation loss every 10 epochs. The termination rule was to stop the training when the validation loss increased by more than 10% compared to the previous checkpoint and then to select the model from the iteration exhibiting the lowest validation loss for further use.
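The learning-rate schedule and stopping rule described above can be sketched in plain Python. Two assumptions are made here that the text does not spell out: the 4% decay is treated as a multiplicative step decay, and the "lowest validation loss" is selected among the checkpoints seen up to the stopping point. The function names are illustrative and do not come from the original code base.

```python
def learning_rate(epoch, initial_lr=5e-4, decay=0.04, step=70):
    """Step decay: reduce the learning rate by 4% every 70 epochs."""
    return initial_lr * (1.0 - decay) ** (epoch // step)


def select_checkpoint(val_losses, tolerance=0.10):
    """Apply the stopping rule to losses recorded at successive checkpoints.

    Training stops at the first checkpoint whose validation loss exceeds the
    previous checkpoint's loss by more than `tolerance` (10%).
    Returns (index of best checkpoint seen, index where training stopped).
    """
    stop_at = len(val_losses) - 1
    for i in range(1, len(val_losses)):
        if val_losses[i] > val_losses[i - 1] * (1.0 + tolerance):
            stop_at = i
            break
    seen = val_losses[: stop_at + 1]
    best = min(range(len(seen)), key=seen.__getitem__)
    return best, stop_at


# Made-up validation losses, one per checkpoint (i.e., every 10 epochs).
losses = [0.30, 0.20, 0.15, 0.16, 0.19, 0.12]
best, stopped = select_checkpoint(losses)
print(best, stopped)  # 2 4
```

In this example the loss at checkpoint 4 (0.19) is more than 10% above checkpoint 3 (0.16), so training stops there and the model from checkpoint 2 is kept.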

2.4. 1 H NMR Chemical Shift Predictions for Different Solvents and Internal Standards

All of the training data for our 1 H chemical shift predictor were determined with compounds dissolved in H 2 O. While water is a common solvent used in NMR-based metabolomics, in the world of natural product chemistry and organic chemical synthesis, most compounds are dissolved in other solvents, such as methanol, chloroform, or dimethyl sulfoxide. It is also known that different solvents will cause systematic “solvent” shifts (due to anisotropic effects) that will move chemical shifts upfield or downfield relative to those measured in water. Likewise, organic solvents tend to prevent hydrogen exchange (unlike water) and so labile hydrogens attached to OH and NH functional groups will be visible in the NMR spectrum. To determine the systematic shift arising from methanol, chloroform, and dimethyl sulfoxide relative to water, we evaluated the reported 1 H chemical shift values of a number of identical compounds dissolved in water, methanol, chloroform, and dimethyl sulfoxide [ 33 ]. With this information in hand, we were able to identify straightforward linear relationships between the 1 H chemical shift values reported in water and those reported in methanol, chloroform, and dimethyl sulfoxide. These equations and the quality of the fit between the different pairs of 1 H chemical shifts are shown in Figure S1 . The equations have been incorporated into PROSPRE (PROton Shift PREdictor) to adjust the predicted 1 H chemical shift values for molecules dissolved in methanol, chloroform, and dimethyl sulfoxide, respectively. The linear relationship we determined for solvent correction was quite surprising but has proven to be robust for the solvents evaluated in subsequent studies. To adjust chemical shifts for different internal chemical shift referencing standards, we used correction factors published elsewhere [ 42 ].
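The two-stage adjustment described above (a linear solvent correction followed by a reference correction) could be organized as in the sketch below. Every numerical coefficient in it is a HYPOTHETICAL placeholder chosen only so that the code runs; the fitted equations are those in Figure S1 and the reference correction factors are those of [ 42 ], neither of which is reproduced here. The class and function names are likewise illustrative.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LinearCorrection:
    """shift_in_solvent = slope * shift_in_water + intercept (ppm)."""
    slope: float
    intercept: float

    def apply(self, shift_ppm):
        return self.slope * shift_ppm + self.intercept


# HYPOTHETICAL coefficients for illustration only (not the Figure S1 values).
SOLVENT_CORRECTIONS = {
    "water": LinearCorrection(1.0, 0.0),        # identity: training solvent
    "chloroform": LinearCorrection(1.1, -0.2),  # placeholder numbers
}

# HYPOTHETICAL additive offsets in ppm relative to DSS (placeholders).
REFERENCE_OFFSETS_PPM = {"DSS": 0.0, "TMS": 0.02}


def adjust_shift(shift_water_dss, solvent, reference):
    """Convert a water/DSS prediction to another solvent and reference."""
    in_solvent = SOLVENT_CORRECTIONS[solvent].apply(shift_water_dss)
    return in_solvent + REFERENCE_OFFSETS_PPM[reference]


print(adjust_shift(2.0, "water", "DSS"))  # 2.0 (no adjustment needed)
```

Keeping the solvent and reference corrections in separate lookup tables lets each be updated independently when new fitted values become available.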

3.1. Performance Evaluation

To evaluate PROSPRE, we first assessed the improvement achieved via fine tuning of our GNN on the training set of 4207 1 H chemical shifts. Prior to fine tuning (using the original CASCADE model), the MAE between the predicted and the observed 1 H chemical shifts was 0.28 ppm for the training set. After fine tuning, the MAE was just 0.08 ppm. Clearly, fine tuning led to a substantial improvement in the accuracy of our predictor. Next, we assessed both the chemical shift correlation and the 1 H chemical shift errors (MAE) between predicted and observed 1 H chemical shifts for the two holdout datasets. As noted earlier, one holdout set was for compounds dissolved in water and the other holdout set was for compounds dissolved in chloroform. The correlation between PROSPRE-predicted and experimental 1 H chemical shifts for the holdout datasets is shown in Figure 2 A,C. In addition, we compared PROSPRE’s accuracy with the accuracies of other popular algorithms, including MNOVA [ 38 ], NMRShiftDB2 [ 11 ], and CASCADE [ 24 ] ( Figure 2 B,D and Figure S3 ). Specifically, for the first holdout dataset of 272 1 H chemical shifts from 36 HMDB entries dissolved in water, we found that PROSPRE substantially outperformed all three other predictors. In particular, PROSPRE had an MAE of 0.10 ppm for the first holdout dataset. MNOVA, NMRShiftDB2, and CASCADE yielded MAEs of 0.15, 0.17, and 0.21 ppm, respectively. To further test the performance of PROSPRE, we also evaluated it against a second holdout dataset. This second holdout set consisted of 1 H chemical shift assignments from the NP-MRD that included 22 molecules with 442 experimental 1 H chemical shift assignments in chloroform. PROSPRE had an MAE of 0.19 ppm for the second holdout dataset. In comparison, MNOVA, NMRShiftDB2, and CASCADE had MAEs of 0.20, 0.25, and 0.46 ppm, respectively ( Table 1 ). However, all MAEs from the second holdout set were higher than those of the first holdout set. The higher MAE for PROSPRE with the second holdout set was not unexpected because PROSPRE was trained on water-soluble compounds, which tend to be chemically less diverse than water-insoluble compounds.
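For reference, the MAE used throughout this comparison is the mean of the absolute differences between predicted and observed shifts, MAE = (1/N) Σ |predicted_i − observed_i|. A minimal sketch follows; the shift values are made up for illustration and are not taken from either holdout set.

```python
def mean_absolute_error(predicted, observed):
    """Mean absolute error (ppm) between paired predicted/observed shifts."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not predicted:
        raise ValueError("at least one shift pair is required")
    total = sum(abs(p - o) for p, o in zip(predicted, observed))
    return total / len(predicted)


# Made-up shifts (ppm): absolute errors are 0.25, 0.0, 0.5 and 0.25.
predicted = [1.0, 2.5, 7.0, 3.5]
observed = [1.25, 2.5, 6.5, 3.75]
print(mean_absolute_error(predicted, observed))  # 0.25
```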

Figure 2. Correlation of 1 H chemical shifts predicted with PROSPRE (A, C) and CASCADE (B, D) with experimental shifts for holdout dataset 1 (A, B) and holdout dataset 2 (C, D). Mean absolute error (MAE, in ppm) and R² (coefficient of determination) are shown on the plots. Regression trend lines (shown in red) were obtained by fitting the data with the equation Y = AX, where A is the slope.
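A trend line of the form Y = AX (no intercept) has the closed-form least-squares slope A = Σ x_i y_i / Σ x_i². The sketch below computes that slope and an R² value. Note that the text does not state how R² was computed for the intercept-free fit; the conventional definition 1 − SS_res/SS_tot (with a mean-centered SS_tot) is assumed here, and the data points are invented.

```python
def fit_through_origin(x, y):
    """Least-squares slope A for the model Y = A * X (no intercept)."""
    sxx = sum(xi * xi for xi in x)
    sxy = sum(xi * yi for xi, yi in zip(x, y))
    return sxy / sxx


def r_squared(x, y, slope):
    """Coefficient of determination, 1 - SS_res / SS_tot (assumed form)."""
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - slope * xi) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


# Invented, perfectly proportional data: slope 2 and R^2 of 1.
x = [1.0, 2.0, 3.0]
y = [2.0, 4.0, 6.0]
slope = fit_through_origin(x, y)
print(slope, r_squared(x, y, slope))  # 2.0 1.0
```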

Table 1. Performance of PROSPRE, NMRShiftDB, MNOVA, and CASCADE for predicting 1 H chemical shifts for holdout datasets #1 and #2.

Method        Holdout Dataset #1 (MAE)    Holdout Dataset #2 (MAE)
PROSPRE       0.10 ppm                    0.19 ppm
NMRShiftDB    0.17 ppm                    0.25 ppm
MNOVA         0.15 ppm                    0.20 ppm
CASCADE       0.21 ppm                    0.46 ppm

MAE: mean absolute error.

3.2. Applications

The high quality of PROSPRE’s 1 H chemical shift predictions led us to use PROSPRE to predict the chemical shifts for >400,000 biologically important compounds. These are compounds that have structures but do not have experimental 1 H chemical shift assignments. Specifically, we applied PROSPRE to the prediction of 1 H chemical shifts (and the generation of the corresponding 1D 1 H NMR spectra at multiple spectrometer frequencies) for nearly 250,000 molecules in the latest release of HMDB [ 9 ], for nearly 13,000 molecules in the latest release of DrugBank [ 14 ], and for nearly 280,000 molecules in the latest release of NP-MRD [ 13 ]. Plans are being made to apply PROSPRE to the prediction of 1 H chemical shifts for all compounds in MiMeDB [ 43 ] (a microbial metabolite database), ECMDB [ 44 ] (an E. coli metabolome database), YMDB [ 45 ] (a yeast metabolome database), the NORMAN-SLE [ 46 ] (a database of exposure and exposome compounds), and DARK-NPS [ 47 ] (a database of 8.9 million hypothesized novel psychoactive substances). The intent of these accurate, large-scale predictions is to generate sufficient quantities of high-quality NMR data to facilitate NMR spectral matching for facile compound identification (of known unknowns) and to support the development of resources for in silico metabolomics for the identification of unknown unknowns [ 48 ]. Requests for large scale or custom 1 H chemical shift predictions and the generation of corresponding predicted 1 H NMR spectra at multiple NMR spectrometer frequencies are welcome and can be made directly to the corresponding author.

3.3. The PROSPRE Webserver

We programmed PROSPRE as a comprehensive suite to support the prediction of 1 H NMR chemical shifts in multiple solvents. It accepts user input in the form of SMILES via ChemAxon’s JChem interface [ 49 ], translates the SMILES notation into 3D atomic coordinates in the SDF format, and restores and/or renumbers hydrogen atoms using the RDKit library. Subsequently, the GNN algorithm calculates ML features from the 3D model and predicts 1 H NMR chemical shifts. The front end of PROSPRE is coded with Ruby on Rails, while all backend calculations are done with Python. PROSPRE is available at https://prospre.ca as of 10 May 2024. A separate version of PROSPRE can also be found on the NP-MRD database ( https://np-mrd.org/ ) at the top of the homepage under “Utilities” as “ 1 H NMR Predictor” in the dropdown menu.

To operate the PROSPRE webserver, users must provide: (1) a SMILES string or SDF file, which can be directly pasted into the MarvinJS applet (or users can draw the structure in the MarvinJS applet), (2) the type of solvent, and (3) the reference. For the type of solvent, users can choose from methanol, water, chloroform, or dimethyl sulfoxide from the dropdown menu. For the type of reference, users can choose from TMS, DSS, or TSP. After pressing the “Predict” button, the submitted structure and predicted 1 H chemical shifts are generated in a separate window. To assist users in running PROSPRE, two example compounds (Example 1 and Example 2) are provided. Clicking on the corresponding “Load Example” buttons will autofill the required fields, after which users can press the “Predict” button to obtain the NMR prediction. A sample input interface of PROSPRE for ethyl acetate (HMDB0031217) is shown in Figure 3 A. The SMILES string of ethyl acetate (CCOC(C)=O) was converted by ChemAxon’s JChem plugin to atomic coordinates and displayed in a standard 2D format. Users must then select the solvent and internal standard from the pull-down options listed under “Solvent” and “Reference”, respectively. The output page of PROSPRE ( Figure 3 B) shows a model of ethyl acetate with numbered atoms rendered with the Jmol plugin [ 50 , 51 ], the predicted 1 H chemical shift values, the selected solvent, and the chemical shift reference. Predicted chemical shifts can be downloaded from the webserver as a CSV file.

Figure 3. (A) An example of the PROSPRE webserver input page where the SMILES string for ethyl acetate was inserted and converted by ChemAxon’s JChem (version 22) into a 2D structural model. The menu for solvent selection is shown on the right. (B) An example of a PROSPRE output page. The predicted 1 H chemical shift values (right) and a structural model of ethyl acetate with numbered atoms (left) are shown.

4. Discussion

Our results demonstrated that using a carefully curated “solvent-aware” training set of experimental 1 H shifts, with detailed information about solvents and chemical shift reference compounds, made it possible to generate a high-quality predictive ML model for 1 H chemical shift prediction. As shown in the Results section, PROSPRE outperformed other well-regarded, popular 1 H chemical shift prediction tools that were tested in this study. Indeed, as far as we are aware, PROSPRE appears to be the most accurate 1 H chemical shift predictor that has so far been described. We attribute this result to the careful, painstaking curation of the training dataset that was done in this study. As noted earlier, the performance of PROSPRE was higher for the first HMDB-derived holdout dataset than for the second, NP-MRD-derived dataset. We suspected that the reduced performance by PROSPRE for the second holdout dataset was due to undertraining on chemical structure classes that were more frequent in the second holdout dataset but under-represented in the training dataset and holdout dataset #1.

To test this hypothesis, we used ClassyFire (version 1.0) [ 52 ] to quantitatively assess the chemical structure classes seen in the PROSPRE training dataset and the two (HMDB/water and NP-MRD/chloroform) holdout datasets. ClassyFire is a computer program that automatically classifies all known chemical compounds into one of more than 4800 different structural categories using chemical structure information. Using ClassyFire, we found that our original training dataset contained molecules from 90 different chemical subclasses. For the first holdout dataset (with 36 molecules from the HMDB), 34/36 had structures that belonged to at least one of these chemical subclasses. On the other hand, for the second holdout dataset (with 22 molecules from the NP-MRD), only 3/22 molecules belonged to chemical subclasses found in the original training dataset. Table S1 shows the chemical subclass distribution for the training dataset, the first holdout dataset (from HMDB), and the second holdout dataset (NP-MRD). We also evaluated the chemical similarity of the two holdout sets against the training dataset via a cosine similarity score computed from the percentage of compounds in each ClassyFire chemical subclass ( Table S1 ). The cosine similarity between holdout set 1 and the training set was 0.95, while the cosine similarity between holdout set 2 and the training set was just 0.22. Given the data distribution, the variation in the structures in the training dataset, and the cosine similarity scores, we conclude that PROSPRE’s inferior performance for the NP-MRD (second) holdout dataset was largely due to the fact that PROSPRE had not been trained on a sufficient number of molecules belonging to the chemical subclasses seen in that dataset. Given the focus on water-soluble metabolites for the training set of molecules and chemical shifts originally used to develop PROSPRE, this was not entirely unexpected.
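The cosine similarity between two subclass distributions can be computed by treating each dataset as a vector of per-subclass percentages, with a subclass absent from one dataset counted as zero. The sketch below illustrates the calculation; the subclass labels and percentages are invented for the example and are not the Table S1 values.

```python
import math


def cosine_similarity(dist_a, dist_b):
    """Cosine similarity between two {subclass: percentage} distributions."""
    keys = set(dist_a) | set(dist_b)
    # Subclasses missing from one distribution contribute zero to the dot product.
    dot = sum(dist_a.get(k, 0.0) * dist_b.get(k, 0.0) for k in keys)
    norm_a = math.sqrt(sum(v * v for v in dist_a.values()))
    norm_b = math.sqrt(sum(v * v for v in dist_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("distributions must contain a non-zero entry")
    return dot / (norm_a * norm_b)


# Invented subclass percentages for illustration only.
training = {"Amino acids": 3.0, "Carbohydrates": 4.0}
holdout_similar = {"Amino acids": 4.0, "Carbohydrates": 3.0}
holdout_disjoint = {"Terpenoids": 5.0}

print(cosine_similarity(training, holdout_similar))   # close to 0.96
print(cosine_similarity(training, holdout_disjoint))  # 0.0
```

A score near 1 indicates that two datasets draw on the same subclasses in similar proportions, while a score near 0 indicates little overlap.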

Therefore, future efforts will be focused on accumulating 1 H NMR assignments and corresponding molecular structures from classes that are under-represented in the PROSPRE training set ( Figure 4 , Table S1 ). In addition, we would like to evaluate how much the inclusion of multiple conformers (generated via rapid conformer generation tools such as RDKit [ 53 ] or OpenBabel [ 54 ]) could help improve the accuracy of PROSPRE’s 1 H chemical shift predictions.

Figure 4. The distribution of compounds by chemical subclass in the two holdout datasets compared to the PROSPRE training dataset. The last bar indicates the total number of compounds for which the chemical subclass was unknown or could not be determined by ClassyFire.

5. Conclusions

1 H NMR spectroscopy is widely used in organic synthetic chemistry for organic compound identification. It is also used for drug metabolite characterization, natural product discovery, and the deconvolution of metabolite mixtures in biofluids (metabolomics and exposomics). In many cases, compound identification by NMR can be achieved by matching measured NMR spectra to experimentally collected NMR spectral reference libraries. However, the limited availability of experimental NMR reference spectra, especially for many biologically relevant molecules, has significantly hindered this process. Indeed, with <5% of many biologically relevant compounds having experimental 1 H NMR spectra, the fields of NMR-based metabolomics, exposomics, and natural product chemistry have suffered enormously. PROSPRE is intended to alleviate this problem by enabling the accurate prediction of 1 H NMR chemical shifts using only a chemical structure as input. As shown in this manuscript, PROSPRE achieves the highest accuracy yet reported for 1 H chemical shift prediction, especially for water-soluble, biologically relevant compounds. PROSPRE is also capable of accurately predicting 1 H chemical shifts in a number of other solvents commonly used in NMR spectroscopy, including chloroform, dimethyl sulfoxide, and methanol. This ability to handle different solvents enhances the versatility and applicability of PROSPRE across different experimental conditions.

In addition to making PROSPRE freely available as an easy-to-use webserver, we have applied PROSPRE to the prediction of 1 H chemical shifts (and the generation of 1 H NMR spectra) for nearly 600,000 known, biologically relevant compounds. This information has been deposited into publicly available databases such as HMDB, DrugBank, and the NP-MRD. These spectra should facilitate the identification of known unknowns for applications in metabolomics, exposomics, pharmacology, and clinical diagnostics. Through this work, we believe that PROSPRE will significantly expand the coverage of metabolites that can be analyzed using NMR spectroscopy, thereby broadening the potential scope of metabolomics studies. We are in the process of providing similar predicted 1 H chemical shift data and NMR spectral datasets to facilitate the identification of unknown unknowns for applications in natural product chemistry, drug metabolism, and forensic science. We are also planning to update PROSPRE to include predictions for molecules dissolved in aromatic solvents such as pyridine or benzene. For these solvents, we expect non-linear effects to be more evident, so more complex solvent correction methods will have to be developed. Overall, our hope is that PROSPRE will allow the fields of NMR-based metabolomics, exposomics, drug discovery, and clinical diagnostics to prosper well into the 21st century.

Acknowledgments

The authors thank Marcia LeVatte for her help proofreading and editing the manuscript and administrative help entering the references.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/metabo14050290/s1 , Figure S1: Linear equations that can be used to predict the 1 H chemical shift values of hydrogen atoms for molecules dissolved in chloroform (CDCl 3 ), DMSO ((CD 3 ) 2 SO), and methanol (CD 3 OD) relative to those dissolved in water; Figure S2: Illustration of the modified GNN process used to create 1 H chemical shift predictions; Figure S3: Correlation of 1 H chemical shifts predicted with NMRShiftDB ( A , C ) and MNOVA ( B , D ), with experimental shifts for holdout dataset 1 ( A , B ) and holdout dataset 2 ( C , D ); Table S1: Distribution (by percentage) of compounds by chemical subclass in the PROSPRE training dataset compared to the first holdout dataset (from HMDB) and the second holdout dataset (NP-MRD).

Funding Statement

The research was funded by the National Center for Complementary and Integrative Health (NCCIH), the Office of Dietary Supplements (ODS) of the National Institutes of Health (NIH), grant number U24 AT010811, the Natural Sciences and Engineering Research Council (NSERC), Genome Canada, and the Canada Foundation for Innovation (CFI).

Author Contributions

Conceptualization, D.S.W.; methodology, T.S., Z.S. and F.W.; software, T.S. and B.L.L.; validation, T.S., B.L.L. and M.B.; formal analysis, B.L.L. and M.B.; resources, D.S.W. and Z.S.; data curation, Z.S.; writing—original draft preparation, M.B.; writing—review and editing, D.S.W.; visualization, M.B.; supervision, D.S.W.; project administration, D.S.W. and V.G.; funding acquisition, D.S.W. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Data availability statement, conflicts of interest.

The authors declare no conflicts of interest.


IMAGES

  1. Assisted NMR assignments using the predictions

    nmr assignment database

  2. 2D NMR- Worked Example 3 (Full Spectral Assignment)

    nmr assignment database

  3. A Step-By-Step Guide to 1D and 2D NMR Interpretation

    nmr assignment database

  4. A Step-By-Step Guide to 1D and 2D NMR Interpretation

    nmr assignment database

  5. NMR assignment and mapping of binding sites of Gαi1. (A) 2D [ 15 N, 1

    nmr assignment database

  6. NMR resonance assignment of Dz5C–RNA2ʹF a) Extracts of a [¹H,¹H]-NOESY

    nmr assignment database

VIDEO

  1. How to upload to Firebase Storage using Angular 2

  2. Top 5 Players to watch heading into CWL Pro League Division A

  3. #myqueen🎶#abhira💕 #armaan #abhimaan #samriddhi #rohit #yrkkh #viral #trending #love #shorts#status 📸

  4. SDBS database

  5. POKY: Manual Protein NMR Backbone Assignment by Mikayla Truong

  6. Calculate and Analyze NMR using WebMO

COMMENTS

  1. nmrshiftdb2

    nmrshiftdb2 is a NMR database (web database) for organic structures and their nuclear magnetic resonance (nmr) spectra. ... integration, assignment, and more, without installing software, completely in the browser. A close integration with the next version of nmrshiftdb2 is planned. Raw data in downloads Wed, 24 Mar 2021 21:10:39 -0000

  2. BMRB

    BMRB makes bio-NMR data FAIR. Findable, Accessible, Interoperable, Re-usable. BMRB collects, annotates, archives, and disseminates spectral and quantitative data derived from NMR spectroscopic investigations of biological macromolecules and metabolites.

  3. NMRtist

    NMRtist is a cloud computing service for the fully automated analysis of protein NMR spectra (e.g. peak picking, chemical shift assignment, structure determination) using deep learning-based approaches. Each project created in NMRtist receives 30 GB of private storage, which can be filled by experimental data and analyzed using the available applications.

  4. Home

    Biomolecular NMR Assignments is a dedicated forum for publishing sequence-specific resonance assignments for proteins and nucleic acids. Provides an avenue for depositing these data into a public database at BioMagResBank. Assignment Notes are published in biannual editions in June and December. No page charges or fees for online color images.

  5. NMRbox.org

    NMRbox is a resource for biomolecular NMR (Nuclear Magnetic Resonance) software. It provides tools for finding the software you need, documentation and tutorials for getting the most out of the software, and cloud-based virtual machines for executing the software. To ensure consistent performance and reliability, we have scheduled routine ...

  6. NMRium

    NMRium includes an advanced peak picking detection for 1D and 2D spectras and is able to generate the NMR string required for publication or patent. Export. All the processing and assignment can be stored as a ".nmrium" file. This file contains the original data as well as all the processing that was applied on the spectrum.

  7. NMRshiftDB2

    nmrshiftdb2 is a NMR web database for organic structures and their NMR spectra. It allows for spectrum prediction (13C, 1H and other nuclei) as well as for searching spectra, structures and other properties. ... More recently, we are also involved in electronic assignment and new, digital workflows for NMR data from lab to publication.

  8. Twenty years of nmrshiftdb2: A case study of an open database for

    Commercial and non-commercial NMR databases have been on the market for a long time, sometimes disappearing and reappearing under a different name or the same data being used in multiple products. ... the other hand. This allows recording relationships on that level as well. In particular, it is possible to record an assignment on the database ...

  9. NP-MRD: Spectra Search NMR Spectrum

    Spectra Search. NMR Spectrum. NMR Search. NMR Search provides a powerful interface for searching the database. You can build up queries that support a wide range of conditions, including Frequency, Tolerance, Exact Mass Range for 1H/13C. To get started, click the "Load Example" button to perform an example search.

  10. CSEARCH-NMR-Server

    Evaluation of C13-NMR Assignments: YES, see here: C13-NMR based Spectral Similarity-Searches: ... Create a private database - Every user can select previously performed evaluations as reference database for upcoming requests. It is strongly recommended to use only datasets of superior quality for this purpose.

  11. The 100-protein NMR spectra dataset: A resource for ...

    The fundamental data produced by biomolecular NMR spectroscopy are multidimensional NMR spectra. All NMR-based information is derived from these spectra, generally by chemical shift assignment ...

  12. Assigning NMR spectra of RNA, peptides and small organic molecules

    NMR assignment typically involves analysis of peaks across multiple NMR spectra. Chemical shifts of peaks are measured before being assigned to atoms using a variety of methods. ... Database proton NMR chemical shifts for RNA signal assignment and validation. J Biomol NMR. 2013; 55:33-46. doi: 10.1007/s10858-012-9683-9. [PMC free article ...

  13. Protein NMR

    Much space and discussion is devoted to practical aspects. The implementation of protein NMR assignment is described using the program CCPNmr Analysis. This program has been developed by CCPN and actively seeks input from the NMR community. CCPNmr Analysis is based on the detailed and well thought-out CCPN Data Model which has the advantage (a ...

  14. NP-MRD: the Natural Products Magnetic Resonance Database

    The Natural Products Magnetic Resonance Database. (NP-MRD) is a comprehensive, freely available elec-. tronic resource for the deposition, distribution, searching and retrieval of nuc lear ...

  15. NMR Data Analysis

    A graphical NMR assignment and integration program for proteins, nucleic acids, and other polymers. nmrfit. Quantitative NMR analysis through least-squares fit of spectroscopy data. NMRFx Analyst. Data processing program utilizing Python for scripts and a full Java based GUI. NMRFx Processor

  16. Toward Creation of a Universal NMR Database for the Stereochemical

    Using triol 1 as a representative example of natural products containing two contiguous propionate units, 13C and 1H NMR databases for the stereochemical assignment of acyclic compounds have been created. Chemical shift increments due to the presence of additional functional groups as well as solvent effects are discussed.

  17. A method for validating the accuracy of NMR protein structures

    In order to determine a protein NMR structure, shift assignments are the necessary first stage 20, meaning that any protein that has an NMR structure must have backbone shift assignments (which ...

  18. Simulate and predict NMR spectra

    This website does not contain any database of NMR spectra but allows you to easily predict 13C as well as 1H spectra. Simulate and predict NMR spectra directly from your web browser using standard HTML5. You can also simulate 13C, 1H as well as 2D spectra like COSY, HSQC, HMBC. Second-order effects like AB, ABX, AA'XX' can be simulated as well.

  19. PDF NMR chemical shift assignments of RNA oligonucleotides to ...

    NMR-based studies are often rate-limited by the assignment of chemical shifts. Automation of the chemical shift assignment process can greatly facilitate structural studies, however, accurate chemical shift predictions rely on a robust and complete chemical shift database for training. We searched the Biological Magnetic Resonance Data ...

  20. Toward creation of a universal NMR database for stereochemical ...

    Toward creation of a universal NMR database for stereochemical assignment: complete structure of the desertomycin/oasomycin class of natural products J Am Chem Soc . 2001 Mar 7;123(9):2076-8. doi: 10.1021/ja004154q.

  21. Toward Creation of a Universal NMR Database for the Stereochemical

    Using the C.5−C.10 portion of the oasomycin class of natural products, the reliability and usefulness of an NMR database for the stereochemical assignment of acyclic compounds has been demonstrated. The predicted relative stereochemistry based on the NMR database has unambiguously been established via synthesis.

  22. Accurate Prediction of 1H NMR Chemical Shifts of Small Molecules Using

    To round out our dataset for 1H chemical shift assignments in non-aqueous solvents and to extend the utility of our predictor to other applications (natural products and organic synthesis), we also extracted structures and chemical shift data from the NMRShiftDB database. The NMRShiftDB contains 1H NMR assignments for mostly non-biological or ...