What belongs in the Protein Data Bank?
The rise of high-throughput crystallography is among the most exciting recent developments for fragment finding. Historically deemed too slow for primary screening, crystallography was reserved for select hits from an assay cascade. Now up-front crystallographic screens sometimes yield hundreds of hits, and many of the resulting structures have been deposited in the Protein Data Bank (PDB). In a recent (open access) Protein Sci. commentary, Mariusz Jaskolski (Mickiewicz University), Bernhard Rupp (Medical University Innsbruck), and collaborators in the US question this practice.
In particular, the researchers ask whether molecules processed using Pan-Dataset Density Analysis (PanDDA) belong in the PDB. The method, which we described here, is typically used when hundreds of compounds have been soaked into crystals of the same protein. Most molecules will not bind, and these empty structures can be averaged to provide a background map to better identify weakly bound ligands that may have only partial occupancy.
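For readers who think better in code, here is a toy sketch of the background-averaging idea. It is not the PanDDA implementation: the one-dimensional "maps," the function names, and the `background_fraction` parameter are all invented for illustration, and real electron density maps are three-dimensional grids that first need to be aligned and scaled.

```python
from statistics import mean, stdev

def background_map(empty_maps):
    """Voxel-wise mean and standard deviation across ligand-free maps."""
    voxels = list(zip(*empty_maps))
    return [mean(v) for v in voxels], [stdev(v) for v in voxels]

def z_map(dataset_map, mean_map, sd_map, min_sd=1e-6):
    """How many standard deviations each voxel sits from the background."""
    return [(d - m) / max(s, min_sd)
            for d, m, s in zip(dataset_map, mean_map, sd_map)]

def background_subtracted(dataset_map, mean_map, background_fraction):
    """Subtract an assumed fraction of the background from one dataset."""
    return [d - background_fraction * m
            for d, m in zip(dataset_map, mean_map)]

# Hypothetical density values at five voxels for four crystals with no ligand.
empty_maps = [
    [1.0, 0.2, 0.1, 0.1, 0.9],
    [1.1, 0.1, 0.2, 0.0, 1.0],
    [0.9, 0.2, 0.1, 0.2, 1.1],
    [1.0, 0.3, 0.0, 0.1, 1.0],
]
# A soaked crystal with weak extra density at voxel 3 (a partial-occupancy ligand).
soaked = [1.0, 0.2, 0.1, 0.6, 1.0]

mean_map, sd_map = background_map(empty_maps)
z = z_map(soaked, mean_map, sd_map)
strongest = max(range(len(z)), key=z.__getitem__)

# Assume, purely for illustration, that 80% of the crystal is unbound protein.
event = background_subtracted(soaked, mean_map, background_fraction=0.8)
print("voxel with the strongest deviation:", strongest)
```

In the raw soaked map, voxel 3 is weaker than the protein density at voxels 0 and 4, so a conventional map could easily miss it. Compared against the averaged background, it is the only voxel that stands out.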
The researchers seem suspicious of this technique, referring to “supposed ligands” that may “confuse most biomedical researchers” and “degrade the PDB integrity,” the effect of which “could be disastrous.” To support their argument, they provide two examples from the PDB where the atomic models diverge from the electron density calculated using conventional methods and one with wonky statistics.
To avoid “contamination of the PDB by suboptimal structures,” the researchers suggest depositing structures from large-scale crystallographic screens in a separate database. Alternatively, they suggest clearer annotation. (To be fair, all three of the examples cited are already prominently marked “PanDDA analysis group deposition.”)
Needless to say, this is controversial. In a bioRxiv preprint, Manfred Weiss (Helmholtz-Zentrum Berlin) and collaborators in the US, Germany, Sweden, and the Netherlands, some of whom co-developed PanDDA, take a different view.
The researchers agree that group depositions need to be marked clearly, but they argue that they squarely belong in the PDB rather than in a separate repository. Moreover, “commentaries that underestimate the knowledge of PDB users, that ignore the opportunities present in heterogenous crystallographic data, and that miss out on chances for education on structure quality do more harm than good.”
The three examples described by Jaskolski and colleagues are re-examined, and while it is true that two of them do show poor occupancy using conventional methods, the ligands are clearly visible when PanDDA is used. (In the third case, there was an error in the resolution cutoff during automated processing, but the data could be successfully reprocessed manually.)
PanDDA was developed specifically to identify small, low occupancy ligands, so the researchers argue that these entries “cannot and should not be treated in the same way” as other ligands. Banning them from the PDB would potentially impede future research.
Weiss and colleagues refer to the Structural Genomics campaign of the late 1990s and early 2000s to solve myriad structures of diverse proteins, most of which were not being otherwise studied. At the time some commentators derided this effort as “stamp collecting.” Yet the number and diversity of structures thus deposited into the PDB likely contributed to the success of automated protein folding algorithms such as AlphaFold2.
Similarly, including structures from PanDDA processing could lead to unforeseen advances. For example, Weiss and colleagues suggest we may be able to “extract all aspects of conformational as well as of compositional heterogeneity out of all these data sets.” A better understanding of the role of protein dynamics in ligand binding is likely to require thousands of similar datasets of the kind being uploaded.
Personally, I believe that scientists should be wary of all published information. As the old saying goes, trust, but verify. As evidenced by my five-part series “Getting misled by crystal structures,” even conventional structures in the PDB should not necessarily be taken at face value. With that precaution, I’ll hold with the conclusion of Weiss and colleagues: “As long as the data is there, let’s embrace it and make it available!”