Proteins are large, complex molecules essential to sustaining life. Nearly every function our body performs (contracting muscles, sensing light, or turning food into energy) can be traced back to one or more proteins and how they move and change. The recipes for those proteins, called genes, are encoded in our DNA.
What any given protein can do depends on its unique 3D structure. For example, the antibody proteins used by our immune systems are ‘Y-shaped’ and act like unique hooks. By latching on to viruses and bacteria, antibody proteins are able to detect and tag disease-causing microorganisms for destruction. Similarly, collagen proteins are shaped like cords, which transmit tension between cartilage, ligaments, bones, and skin.
Other examples include Cas9, the enzyme at the heart of CRISPR gene editing, which acts like scissors to cut DNA; antifreeze proteins, whose 3D structure allows them to bind to ice crystals and prevent organisms from freezing; and ribosomes, molecular machines built from proteins and RNA that act like a programmed assembly line to build proteins themselves.
But working out the 3D shape of a protein purely from its genetic sequence is a task that has challenged scientists for decades. DNA contains information only about the sequence of a protein’s building blocks, called amino acid residues, which form long chains. Predicting how those chains will fold into the intricate 3D structure of a protein is what’s known as the “protein folding problem”.
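To make concrete what "only the sequence" means, here is a minimal sketch of how a stretch of coding DNA maps to a chain of amino acids under the standard genetic code. The function name `translate` and the example sequence are illustrative assumptions, and real genes involve further steps (such as transcription and splicing) that are omitted here.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# "*" marks a stop codon; letters are one-letter amino acid codes.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA string into one-letter amino acid codes.

    Reads three bases at a time and stops at the first stop codon.
    """
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3].upper()]
        if amino_acid == "*":  # stop codon: the chain ends here
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical example sequence, chosen only for illustration.
print(translate("ATGGCCATTGTAATGGGCCGCTGA"))
```

The output is just a string of letters. Nothing in it says how the chain bends, twists, or packs in three dimensions, which is exactly the gap that structure prediction has to fill.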
The bigger the protein, the more complicated and difficult it is to model because there are more interactions between amino acids to take into account. As noted in Levinthal’s paradox, it would take longer than the age of the universe to enumerate all the possible configurations of a typical protein before reaching the right 3D structure.
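The arithmetic behind Levinthal's paradox can be sketched in a few lines. The numbers below (a 100-residue chain, three conformations per residue, and 10^13 conformations sampled per second) are illustrative assumptions rather than figures from this article, and the function name is hypothetical.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate, about 13.8 billion years


def brute_force_search_years(n_residues=100, states_per_residue=3,
                             samples_per_second=1e13):
    """Years needed to visit every conformation once, under the assumptions above."""
    conformations = states_per_residue ** n_residues
    return conformations / samples_per_second / SECONDS_PER_YEAR


years = brute_force_search_years()
print(f"Exhaustive search: about {years:.1e} years")
print(f"That is about {years / AGE_OF_UNIVERSE_YEARS:.1e} times the age of the universe")
```

Under these assumptions the search takes on the order of 10^27 years. Real proteins nonetheless fold in milliseconds to seconds, so they cannot be finding their shape by exhaustive search.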
The ability to predict a protein’s shape is useful to scientists because it is fundamental to understanding its role within the body, as well as diagnosing and treating diseases believed to be caused by misfolded proteins, such as Alzheimer’s, Parkinson’s, Huntington’s and cystic fibrosis.
An understanding of protein folding will also assist in protein design, which could unlock a tremendous number of benefits. For example, advances in enzymes that break down waste, which protein design could enable, might help manage pollutants like plastic and oil in ways that are friendlier to our environment. In fact, researchers have already begun engineering bacteria to secrete proteins that will make waste biodegradable and easier to process.
Over the past five decades, scientists have been able to determine the shapes of proteins in labs using experimental techniques like cryo-electron microscopy, nuclear magnetic resonance or X-ray crystallography. But each method depends on a lot of trial and error, which can take years and cost tens of thousands of dollars per structure. This is why biologists are turning to AI methods as an alternative to this long and laborious process for difficult proteins.
Critical Assessment of protein Structure Prediction, or CASP, is a community-wide, worldwide experiment in protein structure prediction that has taken place every two years since 1994. CASP gives research groups an opportunity to test their structure prediction methods objectively, and it delivers an independent assessment of the state of the art in protein structure modeling to the research community and software users.
Even though the primary goal of CASP is to help advance methods for determining a protein's three-dimensional structure from its amino acid sequence, many view the experiment more as a “world championship” in this field of science. More than 100 research groups from all over the world participate in CASP on a regular basis, and it is not uncommon for entire groups to suspend their other research for months while they focus on getting their servers ready for the experiment and on performing the detailed predictions.
In the most recent CASP experiment, 98 entries were accepted for 43 protein structures. The entry ranked second correctly solved three of the 43 protein structures, for a success rate of 7%.
The entry ranked first, Google DeepMind's AlphaFold algorithm, correctly solved 25 of the 43 protein structures, a success rate of 58.1%. Both a non-technical press article and DeepMind's own blog post describe the feat.
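As a quick check of the figures quoted above, the following sketch recomputes both success rates from the counts given in the text. The variable and function names are illustrative.

```python
TOTAL_STRUCTURES = 43
SOLVED = {"first-ranked entry (AlphaFold)": 25, "second-ranked entry": 3}


def success_rate_percent(solved, total=TOTAL_STRUCTURES):
    """Share of structures solved, as a percentage rounded to one decimal place."""
    return round(100 * solved / total, 1)


for entry, solved in SOLVED.items():
    print(f"{entry}: {solved}/{TOTAL_STRUCTURES} = {success_rate_percent(solved)}%")
```

By these counts, the first-ranked entry solved more than eight times as many structures as the runner-up.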