So, back to the three ingredients: Outstanding Domain Expertise + Outstanding Machine Learning Expertise + Outstanding Compute. How did they each play a role? Here are some of the things each brought to the party:
1. Domain expertise
Knowledge of everything the field had accomplished before and how; hypotheses on how to improve upon it; design of new input features; smoothing approaches for the final structure optimisation; interpretation of model failures and domain-specific corrections to them; and so on.
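On the smoothing point: the first AlphaFold turned its predicted distance histograms into a smooth potential that standard gradient-based optimisers could descend during the final structure optimisation. The toy sketch below captures the flavour of that idea only; the bin centres, the stand-in distogram and the spline choice are all my assumptions:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Turn a discrete per-pair distance histogram into a smooth, differentiable
# potential that structure optimisation can minimise by gradient descent.
bin_centres = np.linspace(2.0, 22.0, 64)      # distance bins in Angstroms
probs = np.random.dirichlet(np.ones(64))      # stand-in distogram for one pair
potential = CubicSpline(bin_centres, -np.log(probs + 1e-8))

d = 8.5                                       # candidate inter-residue distance
energy, grad = potential(d), potential(d, 1)  # value and derivative for descent
```

A smooth, differentiable potential is what lets off-the-shelf optimisers do the final refinement, and knowing what a physically sensible potential looks like is exactly where the domain expertise earns its keep.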
2. Machine Learning Expertise
The key innovation in the first AlphaFold was to predict a full probability distribution over distances for each amino-acid pair, rather than a single distance number, a predisposition that is very much in the blood of machine learning researchers. Similarly, the use of end-to-end optimisation in AlphaFold 2 fits squarely with the ethos that the ML team will have brought to the table.
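To make that concrete, here is a minimal sketch of a distogram-style output head in PyTorch: a softmax over distance bins for every residue pair, instead of one regressed number. The channel and bin counts are illustrative assumptions, not AlphaFold's actual values:

```python
import torch
import torch.nn as nn

class DistogramHead(nn.Module):
    """Toy head that predicts a probability distribution over distance
    bins for every residue pair, rather than a single distance value.
    Sizes here are illustrative, not AlphaFold's."""

    def __init__(self, channels: int = 128, n_bins: int = 64):
        super().__init__()
        self.logits = nn.Conv2d(channels, n_bins, kernel_size=1)

    def forward(self, pair_features: torch.Tensor) -> torch.Tensor:
        # pair_features: [batch, channels, L, L] pairwise representation
        # returns:       [batch, n_bins, L, L], a distribution per pair
        return torch.softmax(self.logits(pair_features), dim=1)
```

Training such a head against binned true distances with cross-entropy gives the model a notion of uncertainty that a single regressed distance cannot carry.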
31,000 structures is a very small dataset by deep learning standards. There were no doubt a number of data-efficiency techniques deployed to squeeze as much information as possible out of this, frankly, tiny dataset (e.g. self-supervised learning, simulated data generation, active learning).
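As one illustration of the self-supervised flavour, here is a generic BERT-style masking objective: hide a fraction of residues and train the model to reconstruct them, so that abundant unlabelled sequences contribute signal. This is a textbook sketch, not a claim about DeepMind's actual recipe:

```python
import torch

AA_VOCAB = 21   # 20 amino acids plus a mask token (illustrative)
MASK_ID = 20

def mask_residues(seq_ids: torch.Tensor, p: float = 0.15):
    """Corrupt a random ~15% of residues; the model is then trained to
    predict the original identity at the masked positions only."""
    mask = torch.rand(seq_ids.shape) < p
    corrupted = seq_ids.clone()
    corrupted[mask] = MASK_ID
    return corrupted, mask
```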
Architecturally, the first AlphaFold is based on a deep convolutional neural net, borrowed from the world of image processing, whilst AlphaFold 2 appears to be based on a bespoke transformer architecture, borrowed from the world of natural language processing.
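The contrast is easy to sketch. Below, a dilated convolution block of the kind used in image models sits next to an off-the-shelf transformer encoder layer of the kind used in NLP. Both are toy stand-ins; the sizes are assumptions, and the real AlphaFold 2 blocks are far richer than a stock transformer layer:

```python
import torch.nn as nn

# Illustrative contrast only; neither is the actual AlphaFold architecture.

# AlphaFold 1 flavour: stacked dilated convolutions over the L x L pairwise
# feature map, treating it much like an image.
conv_block = nn.Sequential(
    nn.Conv2d(128, 128, kernel_size=3, padding=2, dilation=2),
    nn.ELU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
)

# AlphaFold 2 flavour: attention lets every position attend to every other,
# as in NLP transformers.
attn_block = nn.TransformerEncoderLayer(d_model=128, nhead=8, batch_first=True)
```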
Many can write these things down, but few can do all of them to the level of sophistication that was required. DeepMind is discovering, developing and improving ML techniques day in, day out across a wide range of applied and pure research domains (self-driving cars, data centre energy optimisation, recommendation engines, game-playing, neuroscience, robotics, speech, and more). For a subset of these it will absolutely be best in the world. The wide range of authors cited on the first AlphaFold paper, representing diverse ML research interests, shouldn’t be a surprise.
3. Compute
AlphaFold 2 required 16 TPUv3s (roughly equivalent to 100–200 GPUs) running for several weeks to train the model, and then between 5 and 40 GPUs for hours to days to predict the structure of each new protein (refs. 1, 8). This level of compute is beyond the reach of most academic teams, perhaps every academic team. Two of the most advanced of the other participants said they used roughly 4 GPUs to train their models for a couple of weeks (ref. 9). DeepMind was able to deploy nearly two orders of magnitude more compute, yet described the resource as “relatively modest” (ref. 1), which of course, by their standards, it was.
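A quick back-of-the-envelope check on the “nearly two orders of magnitude” claim, using only the figures quoted above (the midpoints are my assumptions):

```python
# GPU-weeks of training compute, all values approximate.
deepmind_gpus, deepmind_weeks = 150, 3   # midpoint of 100-200 GPUs, "several" weeks
academic_gpus, academic_weeks = 4, 2     # ~4 GPUs for a couple of weeks

ratio = (deepmind_gpus * deepmind_weeks) / (academic_gpus * academic_weeks)
print(ratio)   # ~56x; taking the upper ends of the quoted ranges reaches ~100x
```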
A whole greater than the sum of its parts
But the three competencies don’t work in isolation. The core approach will have been a product of domain and ML expertise working together. The compute resource meant the team were able to iterate through different ideas, techniques and tweaks much faster than their peers. The final solutions won’t have been the only composite approaches they tried; they will have cycled through and discarded a range of other candidate techniques along the way. Access to compute made this possible. And state-of-the-art ML engineering will, in turn, have kept the compute bill as low as it was. The whole is very much greater than the sum of the parts.