By way of background, the world of protein structure modelling has been abuzz for the last couple of years, as leading groups working on protein fold prediction (in particular DeepMind's AlphaFold team and the developers of RoseTTAFold) have been making advances towards predicting protein structures from their amino acid sequences alone. Indeed, the DeepMind team went considerably further than previous efforts in the CASP blind test with their machine-learning-driven methods, which critically incorporated the concept of co-evolution. Co-evolution of pairs of residues gives prediction algorithms a heads-up: if two residues in a sequence tend to change in concert over evolution, that suggests the residues are interacting within the fold of the protein, and so the prediction can be constrained to keep them in contact.
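As a toy illustration of that co-variation signal (emphatically not the method AlphaFold itself uses; modern predictors learn far richer couplings), the simplest measure of whether two alignment columns change in concert is their mutual information. The function name `column_mi` and the tiny alignment below are invented for this sketch:

```python
from collections import Counter
from math import log2

def column_mi(msa, i, j):
    """Mutual information (in bits) between columns i and j of a
    multiple sequence alignment. A high value means the two positions
    tend to vary in concert, hinting that they may be in contact in
    the folded structure."""
    n = len(msa)
    pi = Counter(seq[i] for seq in msa)             # marginal counts, column i
    pj = Counter(seq[j] for seq in msa)             # marginal counts, column j
    pij = Counter((seq[i], seq[j]) for seq in msa)  # joint counts
    mi = 0.0
    for (a, b), c in pij.items():
        p_ab = c / n
        mi += p_ab * log2(p_ab / ((pi[a] / n) * (pj[b] / n)))
    return mi

# Toy alignment: columns 0 and 2 always change together
# (A pairs with E, D pairs with K); column 1 varies independently.
msa = ["AGE", "ACE", "DGK", "DCK"]
print(column_mi(msa, 0, 2))  # 1.0 bit: perfectly coupled
print(column_mi(msa, 0, 1))  # 0.0 bits: no co-variation
```

Real pipelines work on alignments of thousands of sequences and correct for phylogenetic bias and indirect couplings, but the underlying intuition is the one above.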
Recently, the groups developing fold-prediction software have released their code to the community, and this has set some protein crystallographers scrambling to see whether, by using it, they could solve some of their trickier datasets. Perhaps the modelling code can generate a model that would give them the refinement starting point that has so far eluded them.
But now the crystallographers won't have to run the software at all. Yesterday, the EMBL went further. They have clearly had the code for a while, along with access to sufficient computational resources, as the team have generated a database of more than 350,000 predicted protein structures. To put that in context, there are 53,484 structures in the PDB with Homo sapiens as their source organism (and many of those structures will be of the same protein, perhaps with different bound substrates or in different conditions). Suddenly the world has models of many proteins that it didn't have yesterday; if the models are good enough, the science this enables is mind-boggling. It could render protein crystallography redundant, right?
The catch, however, is that there's a lot riding on that clause, "if the models are good enough". The likelihood is that some of the models will be excellent (and indeed a lot of biologists are looking at them right now …) and some won't be. (Each model comes with per-residue confidence estimates for the prediction.)
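Those confidence values are easy to get at: in the PDB-format files the AlphaFold models are distributed in, the per-residue confidence (pLDDT, on a 0-100 scale) is stored in the B-factor field of each ATOM record. A minimal sketch of pulling them out (both `plddt_per_residue` and the fixed-width `atom_line` helper used to fabricate test records are names invented here):

```python
def plddt_per_residue(pdb_text):
    """Collect per-residue confidence (pLDDT) from an AlphaFold-style
    PDB file, where it sits in the B-factor field (columns 61-66) of
    every ATOM record."""
    scores = {}
    for line in pdb_text.splitlines():
        if line.startswith("ATOM"):
            resseq = int(line[22:26])            # residue sequence number
            scores[resseq] = float(line[60:66])  # B-factor column = pLDDT
    return scores

def atom_line(serial, name, resname, chain, resseq, x, y, z, bfac):
    """Fabricate a fixed-width PDB ATOM record (testing only)."""
    return (f"ATOM  {serial:>5} {name:<4}{resname:>4} {chain}{resseq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}{1.00:6.2f}{bfac:6.2f}")

model = "\n".join([
    atom_line(1, " CA", "MET", "A", 1, 11.1, 13.2, 2.1, 92.50),
    atom_line(2, " CA", "GLY", "A", 2, 12.0, 14.0, 3.0, 34.20),
])
print(plddt_per_residue(model))  # {1: 92.5, 2: 34.2}
```

Real code would use a proper structure library such as Biopython or gemmi, but the fixed-column read is all the format requires, and it makes it easy to flag low-confidence stretches before trusting a model.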
What the release of this data enables is exciting. One can imagine going a lot further: for example, these are static snapshots, but they might enable molecular dynamics simulations of key structures, which could yield information on an enzyme's suitability as a therapeutic target. Drug hunters could do that now, rather than first having to wait for a published structure. The same could be said for people working to understand the mechanisms of action of proteins: suddenly they have models of the parts; perhaps they can start to form more general hypotheses for how those parts interact with each other, or about how mutations will affect a given structure, and so on. If you are interested in reading more, there's a nice piece here about the potential impacts.
That said, I’m pretty certain we are not at the "Porky Pig" moment for the wwPDB yet.
Protein crystallography will be enhanced by this amazing resource, but we will still need experimental protein structures going forward. The first structure of a protein enables so many experimental things that just can't be replaced by computation (try as we may!) that I think the PDB is going to receive plenty more structures for many years to come, and those structures will exist symbiotically with fold models that keep improving as more experimental data become available.
In some ways, the deposition of all these amazing models is fitting in the year the PDB celebrates its 50th birthday. Without the research of many structural biologists, and the collection and curation of all that information by the wwPDB consortium members, it would not have been possible for the developers of AlphaFold and the EMBL to do this. To paraphrase Newton: "If they have seen further, it is by standing on the shoulders of Giants."