Proteins are complex molecules that make up the bulk of the functional parts of living things. They are encoded in DNA and RNA as genes. By sequencing DNA, or sometimes even the proteins themselves, we can learn the amino acid sequence that makes up a protein. But the way these chemical strands fold into the correct 3D shape to form a functional protein is hard to predict. It’s also often difficult to observe the structure of a protein directly, since proteins are small enough that the important details involve the positions of individual atoms.
Nevertheless, there has been much effort to understand the structures of proteins in humans and other organisms. The structures can explain why a gene and the protein it encodes are essential, why a particular mutation causes cancer, or which drug molecules can fit in a protein pocket to alter the protein’s activity. In short, we can learn how living things work (or don't work) and how we can intervene.
There has been an impressive diversity of approaches for predicting protein structures. For example, over the last decade I’ve been intrigued by Foldit, a computer game used to crowdsource human problem solving to find protein structures that best satisfy realistic chemical constraints. Of course, many techniques beyond human intuition are used for prediction too.
Researchers have organized to objectively benchmark all their structure prediction methods as they improve them. Generally, previously unknown protein structures are found experimentally, and predictions are tested against them. An existing Metaculus question provides a good overview of the protein folding problem and the role of these competitions:
The most prominent is the Critical Assessment of Techniques for Protein Structure Prediction, or CASP, which is held every two years. DeepMind, a subsidiary of Alphabet Inc, made a splash at CASP13 in 2018 with their AlphaFold model. Its performance was a big improvement over the previous competition’s top model, standing out even amidst an uptick in general progress in the field.
DeepMind returned at CASP14 in 2020 with AlphaFold2, which delivered an even bigger improvement. Top prediction methods are now accurate enough that such improvement cannot really happen again without reformulating the objective, or at least the way performance is measured.
Alphabet looks eager to continue applying AI to biochemical problems, given the recent introduction of a new subsidiary company called Isomorphic Labs to focus on the drug discovery process. Personnel, intellectual property, and projects relevant to protein folding are likely changing hands within Alphabet, but I am mainly curious whether the general group of people and resources behind AlphaFold will once again top the leaderboard.
One way competing groups could catch up quickly is to use the AlphaFold source code that was released this year. This is in fact already being done to predict (or “compute”, a term made appropriate by these accuracy levels) protein complexes. Others have created OpenFold, a more trainable version of AlphaFold2.
CASP15 is scheduled to take place in mid-2022. I find it unlikely that the focus will still be on the basic folding problem, since in many cases the experimental “ground truth” is no longer more reliable than the predictions.
Some people even suggest that the protein structure prediction problem is now solved, by which they mean that we’ve generally figured out how to predict the basic structures for natural proteins—and not just the easy cases. But we certainly have a lot to learn about how protein folding works.
It reminds me of the human genome sequencing effort, which was declared “complete” in 2003, while researchers kept working on the difficult regions until this year. The complex systems that have arisen from evolution always have special cases and contexts that lead to more specific areas of focus.
CASP already has multiple ranking categories, with AlphaFold2 having topped the “Regular targets” chart. Perhaps CASP will remove this category and continue with specialty categories for more difficult cases. Or, they could continue using a general ranking with an assortment of especially difficult proteins.
If that’s the case, who will top it? If a main winner is declared for CASP15, will it be an entry from an Alphabet company?
I would predict that if an Alphabet entry were submitted to CASP15, it would win some kind of overall ranking. Given AlphaFold’s improvement from CASP13 to CASP14, plus the advent of Isomorphic Labs, I expect Alphabet to put a lot of resources into protein structure research.
Most of my uncertainty is in whether they will compete. The publicity they gained from CASP must have helped in the formation of Isomorphic Labs, but perhaps their next major demonstration will be more applied. If that involves designing new proteins or finding drugs that interact with proteins, CASP may no longer be a useful venue for them.
I’m fairly uncertain about the outcome, but I lean towards the negative. A 50/50 chance of an Alphabet participant seems reasonable to me, and there is of course a chance that other groups could prevail. Overall I’d give a 40% chance that the winner, if declared, will be an Alphabet company.
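The arithmetic behind that 40% can be made explicit. The sketch below assumes the forecast decomposes as P(Alphabet enters) × P(Alphabet wins, given that it enters). The conditional win probability is not stated in the text; it is simply the value implied by the 50% and 40% figures, and the variable names are illustrative.

```python
# Decompose the forecast:
#   P(Alphabet wins) = P(Alphabet enters) * P(Alphabet wins | Alphabet enters)
p_enters = 0.50        # stated in the text: 50/50 chance of an Alphabet participant
p_wins_overall = 0.40  # stated in the text: overall chance of an Alphabet winner

# Implied by the two numbers above, not stated in the text.
p_wins_given_enters = p_wins_overall / p_enters

print(f"Implied P(win | enter) = {p_wins_given_enters:.0%}")
```

Under that assumption, the 40% figure corresponds to roughly an 80% chance of winning if an Alphabet entry is submitted, which is consistent with the expectation that such an entry would probably top the ranking while leaving room for other groups to prevail.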
With all this talk of protein structure prediction, we should also appreciate the exciting progress in the experimental determination of protein structures. The structures of proteins in different contexts are collected in the Protein Data Bank. The number of structures deposited each year has been increasing roughly linearly, rising from 2,938 in 2000 to 15,436 in 2020. These come from different types of experiments, mainly X-ray crystallography, NMR, and cryo-electron microscopy (cryo-EM).
Cryo-EM has developed especially rapidly due to recent advances in technology and image processing. Certain types of structures that have been difficult to observe with other methods can now be determined with high resolution.
Given these changes, how many structures will be newly deposited to the Protein Data Bank archive in 2025?
A linear trend would suggest 17,112. Given the increasing ease of determining structures and the utility of structures for diverse proteins, I predict a higher number. I’d also expect the pandemic to spur the routine determination of many variants of important viral proteins. Taking a hint from the trends I’ve seen with other biological repositories, such as those for DNA sequences, I’ll fit an exponential curve and predict 24,030 structures deposited in 2025.
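To illustrate how much the choice of linear versus exponential growth matters, here is a rough sketch that extrapolates from only the two counts quoted above (2,938 in 2000 and 15,436 in 2020). It is not a reconstruction of the fits behind the 17,112 and 24,030 figures, which presumably used the full yearly series; that series is not reproduced here, so this two-point version will give somewhat different numbers.

```python
import math

# Only these two counts are quoted in the text; the yearly values in between are not.
year_a, count_a = 2000, 2938
year_b, count_b = 2020, 15436
target_year = 2025

# Linear assumption: the same number of additional structures every year.
slope = (count_b - count_a) / (year_b - year_a)
linear_2025 = count_b + slope * (target_year - year_b)

# Exponential assumption: the same percentage growth every year.
rate = math.log(count_b / count_a) / (year_b - year_a)
exponential_2025 = count_b * math.exp(rate * (target_year - year_b))

print(f"annual growth rate (exponential): {math.exp(rate) - 1:.1%}")
print(f"linear extrapolation for {target_year}: {linear_2025:,.0f}")
print(f"exponential extrapolation for {target_year}: {exponential_2025:,.0f}")
```

With these two endpoints, the linear extrapolation lands near 18,600 and the exponential one near 23,400, so the gap between the two assumptions is several thousand structures by 2025, in the same direction as the fitted forecasts above.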
Going forward, the line between observing and predicting protein structure may become less clear. Structures will be inferred from images using complex algorithms, while predictions will be increasingly accurate. It will be interesting to see how advances in one will affect progress and forecasting in the other.