Metascience

Beyond AlphaFold

How to target investments to develop new AI models that can uncover natural laws
February 11th 2026

Overview

AlphaFold exemplifies the potential of AI-for-science. With targeted investment, science funders can unlock more models of this class, each tailored to tackle previously unsolvable scientific challenges.

The requirements for developing more such models are well understood and primarily center on sophisticated scientific datasets. These datasets hold far more information than traditional analytic methods can parse; AI models trained on this data can help uncover the natural laws that govern the phenomena embedded in it. Scientists can then use the trained models to accelerate the design of new drugs, materials, or other breakthrough technologies.

This piece provides a framework for understanding how AlphaFold-like AI models fit in the broader AI-for-science landscape and introduces the concept of Natural Law Models (NLMs): scientific AI models trained on experimental data or physics-based simulations, explicitly designed to learn the underlying natural phenomena. Calling these systems “natural law models” does not necessarily imply that they directly encode final, closed-form scientific theories. Rather, NLMs often produce intermediate representations or high-fidelity heuristics that can guide further discovery.

Developing effective NLMs requires three inputs:

  1. Large, standardized, high-quality scientific datasets;
  2. Experts with deep knowledge of both AI methods and the relevant scientific domain; and
  3. Access to substantial computing resources.

Different scientific fields face distinct challenges in obtaining the necessary training data. By targeting precise pinch points, science funders can support the development of new NLMs that accelerate discovery and generate spillover benefits for the broader AI ecosystem.

What distinguishes NLMs from other AI-for-science systems?

In 2024, the developers of the AI model AlphaFold2 won the Nobel Prize in Chemistry for solving a long-intractable problem: how to predict the structure of a protein from its sequence of amino acids. AlphaFold learned a powerful empirical approximation of the principles governing protein folding — one that has already reshaped structural biology and drug discovery, though debate continues over the extent to which it “understands” protein biophysics. The latest iteration of AlphaFold is helping researchers generate new drug candidates more quickly and precisely by predicting how drugs will bind to proteins to alter their function.

NLMs like AlphaFold are just one part of the AI-enabled scientific ecosystem. Other applications of AI-for-science include general-purpose tools that use Large Language Models (LLMs). Researchers use these tools to supercharge their workflows, speeding up coding, writing, literature comprehension and synthesis, and data analysis. LLMs also underpin agentic AI “scientific copilots,” which are designed to perform steps of the scientific workflow independently, aiding or even replacing the human researcher. AI systems, such as Kosmos and Asta Agents, are already capable of synthesizing and summarizing scientific literature, analyzing data, generating hypotheses, and reasoning across domains.1

There is considerable investment in applying AI to the entire scientific process. Startups such as Lila Sciences, Periodic Labs, and Radical AI aim to connect these agentic AI “scientific copilots” with autonomous labs, creating full-stack platforms in which AI agents rapidly design and test hypotheses in materials science and biology without human input. The Department of Energy’s Genesis Mission is similarly focused on creating AI platforms for scientific discovery, national security, and energy technologies. The most sophisticated of these platforms also use NLMs.

NLMs are increasingly critical to both human and AI-driven scientific workflows. State-of-the-art datasets are often too complex, high-dimensional, or noisy for traditional analytic methods alone to reveal the natural laws that they contain; NLMs enable researchers to extract structure from the data. Though trained NLMs will ideally fully capture the principles of the underlying phenomena in the data, they can still be useful for guiding experiments and technology development even if their embedded understanding is incomplete. NLMs can also massively speed up simulations, either by learning the dynamics of known natural laws from simulated data or by using neural operators, a specialized class of neural network architectures, to solve complex equations more directly.2 Scientists and LLM agents in AI platforms then use these NLMs to develop and refine hypotheses.
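
To make the idea of learning dynamics from simulated data concrete, here is a minimal toy sketch, not drawn from any of the systems discussed in this piece: a small neural network is trained on trajectories from a simple pendulum simulator and then used as a surrogate time-stepper. The system, network size, and step size are illustrative assumptions.

```python
# Toy stand-in for training an NLM-style surrogate on simulated data.
# A small network learns the map from the pendulum state at time t to the
# state 50 steps later, then rolls forward on its own. Real systems (e.g.,
# neural operators) are far more sophisticated; the workflow is the point.
import numpy as np
from sklearn.neural_network import MLPRegressor

def simulate(theta0, omega0, dt=0.001, steps=5000, g=9.81, length=1.0, damping=0.1):
    """Reference physics-based simulator: a damped pendulum, explicit Euler."""
    states = np.empty((steps + 1, 2))
    states[0] = (theta0, omega0)
    for i in range(steps):
        theta, omega = states[i]
        states[i + 1] = (theta + dt * omega,
                         omega + dt * (-(g / length) * np.sin(theta) - damping * omega))
    return states

# Build (state_t -> state_{t + 50*dt}) training pairs from many simulated runs.
rng = np.random.default_rng(0)
stride = 50
X, Y = [], []
for _ in range(40):
    traj = simulate(rng.uniform(-1.5, 1.5), rng.uniform(-1.0, 1.0))
    X.append(traj[:-stride:stride])
    Y.append(traj[stride::stride])
X, Y = np.vstack(X), np.vstack(Y)

# The surrogate: a small neural network that advances the state 50 steps at once.
surrogate = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=3000, random_state=0)
surrogate.fit(X, Y)

# Roll the surrogate forward from an unseen initial condition and compare.
state, reference = np.array([[1.0, 0.0]]), simulate(1.0, 0.0)
for k in range(1, 6):
    state = surrogate.predict(state).reshape(1, 2)
    print(f"t={k * stride * 0.001:.2f}s  surrogate={state.ravel().round(3)}  "
          f"reference={reference[k * stride].round(3)}")
```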

Notably, NLMs differ from LLMs in implementation. While LLMs are trained mostly on text corpora, systems like AlphaFold are trained specifically on scientific data and have different neural network architectures.3 The different elements of their architectures often map onto distinct processes of the natural phenomena under investigation.

Some observers argue that general-purpose models will ultimately outperform NLMs. This view draws on the “bitter lesson” of machine learning: the observation that general approaches that scale with compute tend to outperform domain-specific methods over time, because they can better exploit the ongoing exponential increase in compute that follows from Moore’s Law. Under this view, LLMs will eventually supersede the need for NLMs. But NLMs offer tremendous value for advancing scientific discovery today and can lead to breakthroughs that save millions of lives while we wait to see whether LLMs live up to their promised potential. Furthermore, the datasets needed to develop new NLMs will also be necessary for advancing science with general-purpose models. Even if LLMs prevail, the data can be added to the training dataset of new models, used to fine-tune existing models, or fed into established models for analysis.

Ingredients for NLMs

Studying the development of state-of-the-art NLMs shows that they tend to require similar inputs:

  • AlphaFold was trained using experimental protein structures from the Protein Data Bank (PDB) and genetic sequences from the National Institutes of Health’s (NIH) GenBank, and the most recent version of the AlphaFold system combines diffusion models, transformers, and other neural network elements. The PDB was founded in 1971 and has been a bastion of structural biology ever since. It is maintained by a consortium of data centers around the world, each funded by government science agencies and/or non-profits. GenBank dates back to 1982 and is supported by the US federal government.
  • Evo2 is a model that learns the structure of genomes, helping answer DNA-, RNA-, and protein-related questions directly from DNA sequences. The model can discern interactions between DNA elements that are very far apart, revealing how different parts of the genome interact in healthy cells and predicting which genetic mutations might lead to illness. It can also generate novel stretches of DNA that could, for example, be used to engineer yeast to make antibodies that combat disease. Evo2 was trained on OpenGenome2, a publicly available dataset that contains genetic sequences from all domains of life. Researchers compiled this dataset from many different sources and have made it freely available for others to use.
  • GNoME and MatterGen are materials science models that can generate new materials and predict their properties. These materials could outperform today’s semiconductors, thermoelectrics, and magnets. Both models were trained using materials properties from the Materials Project database, which is supported by the Department of Energy. MatterGen also used data from the Alexandria Materials Database. The materials properties in these databases were calculated with simulations that rely on quantum mechanics, requiring considerable compute.
  • GenCast is a weather model that outperforms conventional models on a wide range of metrics while running more quickly and at lower cost. The model was trained using the publicly available ERA5 reanalysis dataset, which is hosted on the EU-sponsored Climate Data Store. While weather models like GenCast have not discovered new weather phenomena, their advantages over conventional models have led to their use in disaster preparedness and in helping farmers decide when to plant their crops.

The ingredients that these NLMs have in common include:

  1. Large, well-curated datasets or physics-based simulations
    Like any AI model, NLMs are trained on large datasets. The training data can be real or simulated — AlphaFold uncovered the relationship between amino acid sequences and measured protein structures using real data, while GenCast was trained on a simulated weather dataset that merged a physics-based model with real-world observations.

    Large datasets are an incredibly valuable scientific resource, independent of their use in an NLM or LLM. They can be used by scientists across the world to derive new insights without having to rerun the underlying experiments, and their flaws or gaps can help direct scientific effort to fix the root cause.
  2. Scientific expertise
    NLMs require deep domain knowledge, both in machine learning methods and in the field of interest. John Jumper, one of the two DeepMind researchers awarded the Nobel Prize for AlphaFold, has a PhD in theoretical chemistry. One of the DeepMind principal investigators for GenCast worked on global circulation models at Caltech. And Evo2, a DNA foundation model from the Arc Institute, was developed by professors of computational biology, bioengineering, and chemical engineering. Each of these NLMs rests on a new, specialized model architecture designed specifically for its scientific domain. These models often map different parts of their neural network architecture to different elements of the underlying physical processes. Model creators who understand both the science and the machine learning algorithms seem to be a key ingredient for effectively performing this mapping.
  3. Dedicated compute time and resources
    As with LLMs, training NLMs can require significant compute time, which comes at great cost. Training Evo2, for example, was made possible by NVIDIA giving the scientists access to 2,000 of their H100 GPUs, which would have cost ~$10 million for the final model training run. These resources are out of reach for most academic researchers. However, not all models require this amount of compute for training. GenCast, for example, was trained on 32 TPUv5 instances over 5 days, which likely cost closer to $10,000 (a rough version of this arithmetic is sketched after this list).
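
For a sense of the arithmetic behind the figures in item 3, the sketch below converts the quoted device counts into rough dollar costs. Only the device counts and the GenCast duration come from the text; the hourly rates (and therefore the implied Evo2 training duration) are assumptions chosen to illustrate orders of magnitude.

```python
# Back-of-the-envelope compute costs for the two training runs mentioned above.
# Hourly rates are assumed placeholders; real cloud pricing varies widely.
H100_RATE = 3.00   # assumed $/GPU-hour for an H100
TPU_RATE = 2.50    # assumed $/device-hour for a TPU v5

# GenCast: 32 TPU v5 devices for 5 days.
gencast_device_hours = 32 * 5 * 24
print(f"GenCast: {gencast_device_hours:,} TPU-hours "
      f"~ ${gencast_device_hours * TPU_RATE:,.0f}")

# Evo2: the ~$10 million figure implies roughly this many H100-hours at the
# assumed rate, i.e., 2,000 GPUs running continuously for about this long.
evo2_gpu_hours = 10_000_000 / H100_RATE
print(f"Evo2: ~{evo2_gpu_hours:,.0f} H100-hours "
      f"~ {evo2_gpu_hours / (2_000 * 24):.0f} days on 2,000 GPUs")
```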

Field-specific bottlenecks

Different fields face distinct bottlenecks in assembling the necessary data to train an NLM. For the most part, state-of-the-art models have already seized the low-hanging fruit. Developing NLMs for the remaining fields will therefore require considerable innovation and effort to meet ongoing challenges of measurement, data collection, and scientifically informed algorithm development.

The table below provides a field-specific breakdown of data limitations for a few example areas of science. More detail can be found in Appendix 2.

How to unlock more models

The federal research enterprise is poised to take on the effort needed to enable more NLM development. The Genesis Mission is tasked with unleashing “a new age of AI‑accelerated innovation and discovery that can solve the most challenging problems of this century,” and the White House’s FY27 R&D budget memo emphasizes the need for “AI for the acceleration of scientific discovery.”

The UK government has also outlined a clear AI for Science Strategy that mirrors the framing and policy suggestions detailed here.

Based on the prerequisites for NLMs, funders, program officers, and researchers should pursue the following strategies to catalyze new ones:

  1. Generate new datasets
  2. Stimulate new algorithm development through novel funding mechanisms
  3. Train model trainers
  4. Provide compute

Generate new datasets

When considering how to develop a dataset for a given scientific domain, science funders should first answer the following questions:

  • If the necessary data exists, does it need to be curated and aggregated to be useful for model training? In that case, focus on developing data standards and/or repositories.
  • If the data doesn’t exist, can it be generated by running more well-defined experiments? In that case, automated labs could help.
  • If the datasets cannot be generated with existing approaches, which new experimental tools or techniques need to be developed to obtain the data?

Establish data standards and repositories

AI models are trained on large datasets of uniformly formatted data. Thus, one of the first challenges when training an AI model is finding enough data and formatting it to meet a common standard.

Before building a new scientific dataset on which to train a model, researchers must define or select data standards for formatting the corpus. They also need to establish or identify a data repository that can store the large datasets necessary for training. The repositories may be public or private, depending on the access requirements.

Governments around the world have invested in programs to establish data standards and curation to support many areas of biology research. These programs are in their early days, but will likely surface valuable lessons about implementation that future initiatives can learn from:

  • Bridge2AI, an NIH-backed program to assemble new flagship datasets for biomedical research.
  • The National Microbiome Initiative, which hosted a workshop “focused on developing a vision for microbiome data science to address gaps in existing infrastructure.” This workshop led to a pilot that developed standards for data, metadata, and bioinformatic workflows.
  • The EMBER Archive, the NIH BRAIN Initiative’s data archive for neuroscience data.

Establishing data standards and repositories has secondary benefits for science. Many granting agencies and scientific publications require that researchers make their data publicly available. However, these entities rarely offer guidelines on how to do so, forcing each lab to determine its own data standards and protocols for generating, analyzing, and storing data. This often results in bespoke standards that complicate data sharing across labs and make it difficult for external parties to aggregate the resulting datasets. Both of these problems can be solved by establishing data standards and common repositories. Establishing repositories would also reduce the number of stranded datasets and allow smaller labs to participate in analysis efforts.4

What can science funders do?

  1. Hold workshops to set standards. Government or non-profit-sponsored workshops can assemble stakeholders to agree on standards across a given scientific community.
  2. Host or support repositories. Governments and non-profits can build and host new repositories or support existing public repositories, and require that they maintain high data standards and rigorous data curation protocols.
  3. Sponsor efforts to use LLMs to standardize and aggregate data. Rather than requiring scientists to manage their data, fund computer and data scientists to develop software tools, facilitated by frontier AI models, that ingest a relevant dataset, format it, and publish it to a common repository (a minimal sketch of such a tool follows this list). Given the reliability issues with some LLMs, these tools must be carefully tested to ensure the validity of the resulting datasets.
  4. Support data management lab staff. Scientists are increasingly expected to be experts in data and computer science on top of their core responsibilities. For example, funders and journals require data and code to be made publicly available, a process that takes considerable time and effort, yet few additional resources are provided to support these requirements. Scientists also often lack training in software engineering, which can lead to widespread errors that could be exacerbated with AI. Science funders can help fill the gap by allowing scientists to hire contractors or new lab members to help with data and coding, or by providing software tools to handle data management.
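
As a deliberately simplified illustration of the tooling imagined in item 3 above, the sketch below uses an LLM call to map a lab’s idiosyncratic column names onto a shared schema and validates the result before publication. The ask_llm function, the schema, and the column names are all hypothetical placeholders rather than any real repository’s API.

```python
# Sketch of an LLM-assisted curation step: map nonstandard column names onto an
# agreed community schema, then validate before uploading to a repository.
import json
import pandas as pd

STANDARD_SCHEMA = {
    "sample_id": str,        # unique identifier
    "temperature_K": float,  # measurement temperature in kelvin
    "band_gap_eV": float,    # measured band gap in electron-volts
}

def ask_llm(prompt: str) -> str:
    """Placeholder for a frontier-model API call. A real tool would query the
    model; here we return a canned answer so the sketch runs end to end."""
    return json.dumps({"sampleID": "sample_id", "temp_k": "temperature_K",
                       "Eg_eV": "band_gap_eV"})

def standardize(raw: pd.DataFrame) -> pd.DataFrame:
    prompt = (
        "Map these column names onto the target schema and reply with JSON "
        f"of the form {{raw_name: standard_name}}.\n"
        f"Raw columns: {list(raw.columns)}\nTarget schema: {list(STANDARD_SCHEMA)}"
    )
    mapping = json.loads(ask_llm(prompt))

    # Never trust the model blindly: check the mapping before renaming.
    extra = set(mapping.values()) - set(STANDARD_SCHEMA)
    missing = set(STANDARD_SCHEMA) - set(mapping.values())
    if extra or missing:
        raise ValueError(f"Bad mapping: extra={extra}, missing={missing}")

    tidy = raw.rename(columns=mapping)[list(STANDARD_SCHEMA)]
    return tidy.astype(STANDARD_SCHEMA)  # type checks; unit checks still need a domain expert

print(standardize(pd.DataFrame({"sampleID": ["a1"], "temp_k": [300.0], "Eg_eV": [1.1]})))
```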

Support automated labs

The lab work necessary to generate scientific data can be tedious and time-consuming. Automating even part of the process would free scientists to spend more time developing new hypotheses, analyzing datasets, or collaborating. Automated labs — robotic labs in which machines perform all of the experiments and/or measurements — can generate new datasets for training models, provide data for model reinforcement learning, and test model predictions. In this context, a model trained via reinforcement learning would determine which new data would be most informative, direct the lab to obtain that data, and then update itself. This approach was recently used to slash the amount of testing data needed to assess battery lifetimes.
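
A minimal sketch of that closed loop, assuming a toy one-dimensional “experiment” and an off-the-shelf Gaussian-process model in place of a real instrument and a real NLM: the model requests the condition it is most uncertain about, the simulated lab measures it, and the model retrains.

```python
# Active-learning loop: the model chooses the next experiment, the "lab" runs it,
# and the model is refit on the enlarged dataset. The target curve, noise level,
# kernel, and number of rounds are all illustrative choices.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def run_experiment(x):
    """Stand-in for the automated lab: returns a noisy measurement at condition x."""
    rng = np.random.default_rng(int(x * 1e6) % 2**32)
    return np.sin(3 * x) + 0.5 * x + rng.normal(0, 0.05)

candidates = np.linspace(0, 3, 300).reshape(-1, 1)   # conditions the lab could test
X = candidates[[0, -1]]                              # two seed measurements
y = np.array([run_experiment(x[0]) for x in X])

model = GaussianProcessRegressor(kernel=RBF(length_scale=0.5), alpha=1e-3)
for round_ in range(10):
    model.fit(X, y)
    _, std = model.predict(candidates, return_std=True)
    nxt = candidates[np.argmax(std)]                 # most informative next condition
    X = np.vstack([X, nxt])
    y = np.append(y, run_experiment(nxt[0]))
    print(f"round {round_}: queried x={nxt[0]:.2f}, max predictive std={std.max():.3f}")
```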

National labs and universities have already begun to explore the potential of automated labs for materials science, chemistry, and biology through efforts such as:

But automated labs need to be designed thoughtfully. Automation is ideal for well-defined parameter sweeps, in which experimental conditions are varied incrementally over a range of values to find an optimal output, but flexible exploratory work still requires human-driven methods. Many groundbreaking discoveries have come from developing new measurement techniques or chasing down an unexpected finding. Automation should thus not be viewed as a cure-all.

Further, building a one-size-fits-all lab will be almost impossible for most fields, given the wide variety of experimental conditions and measurement possibilities. Automated labs should therefore be carefully tailored to fit a defined subspace of scientific questions.

What can science funders do?

Support ongoing efforts to develop tailored autonomous labs at National Labs, universities, and startups.5 This could include assembling stakeholders, identifying key focus areas, and then directing resources accordingly, similar to the roadmap development proposed in the next section.

Define data acquisition roadmaps and support tool development

Often, the necessary training data cannot yet be measured due to technological or scientific limitations. In this case, roadmaps to develop new tools and techniques can provide pathways for generating the desired datasets.

Data acquisition roadmaps are multi-year visions of the intermediate steps necessary to achieve a sufficiently large and detailed dataset, often developed by a governing council or by convening relevant stakeholders. Roadmaps can help unify a field around a shared vision, reduce unnecessary duplication, and spark new collaborations. They can also help to identify which stakeholders are best positioned to tackle a given challenge in the field.

Examples of successful roadmapping efforts include:

New tool development alone can often lead to new scientific discoveries: The microscope led to the discovery of cells; X-ray crystallography led to the discovery of the structure of DNA; and radio telescopes validated general relativity.

What can science funders do?

  1. Facilitate roadmap development. For fields in which the desired dataset can be readily described but is not yet available, funders can task the scientific community with developing a roadmap to obtain this data through advisory councils or workshops. Roadmapping efforts have successfully allocated tasks and resources and spurred new collaborations in engineering and technical fields like quantum computing and fusion, but remain underutilized in biomedical research. Currently, many areas of basic biomedical research are funded in parallel by the National Institutes of Health and non-profits such as the Chan Zuckerberg Initiative or the Howard Hughes Medical Institute, but with little coordination. Funders can organize and sponsor coordinating bodies to align and direct these parallel efforts.
  2. Sponsor tool development. Funders can direct resources specifically toward engineering the new technologies a field needs (e.g., by forming focused research teams). Adding competition to the scientific instrument industry, which is heavily consolidated, may also be necessary to unlock the innovation needed to automate more lab equipment.

Stimulate new algorithm development through novel funding mechanisms

Once new datasets are established, they can then be incorporated into grand challenges to spur new model development. Grand challenges offer a prize to any team that develops a model that exceeds a well-defined performance metric when executing a clearly defined task with the data.6 Launching a grand challenge can focus the scientific community on specific problems and spur model innovation as entrants compete for the top score. AlphaFold, for example, rose to prominence by achieving a significantly higher score than other competitors at a grand challenge for protein structure prediction.7

Nontraditional funding mechanisms can also be used to promote the development of new algorithms and systems that improve model accessibility or that test and validate model predictions. Lean, for example, is a programming language and proof assistant specifically designed to formalize math and to check either human- or machine-generated proofs. Lean’s ongoing development is led by a Focused Research Organization (FRO), a non-profit established with the sole purpose of developing a concrete new tool or dataset that neither industry nor academia would be incentivized to create on their own. A similar concept, X-Labs, has been proposed as a federal science funding mechanism to tackle similar challenges.

What can science funders do?

  1. Commission new grand challenges and evaluations. Funders can develop and fund competitions in areas where AI could drive breakthroughs for national priorities.
  2. Organize and fund FROs or X-Labs centered around building or supporting NLMs.

Train model trainers

Successful NLMs have required development teams with deep knowledge of both AI and the scientific domain of interest. As noted previously, many of the lead developers of Google DeepMind’s frontier NLMs are also PhD-level experts in those scientific areas. This interdisciplinary expertise enables a deep understanding of both the strengths and weaknesses of the training datasets and the core scientific principles that underlie their field. Domain experts can ensure that scientific principles inform the model structure, mapping different physical processes onto different neural network elements.

Experts can also recognize flaws in datasets and design solutions to work around them. These flaws might arise from 1) combining multiple datasets that together provide incomplete coverage of the desired data resource, or that contain measurement errors or missing information, or 2) using simulated data that incompletely captures the true properties of the modeled physical phenomena. For example, the ERA5 dataset is produced by forcing a physics-based model to match observational data, yielding a fine-grained, simulated dataset. However, the fitting process can leave the simulated data no longer obeying physical conservation laws, making it unrealistic. A model trained on this dataset risks learning malformed rules that defy the laws of physics.
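
As one illustration of the kind of sanity check a domain expert might run before training, the sketch below asks whether an area-weighted global mean of surface pressure, a rough proxy for atmospheric mass conservation, drifts over a reanalysis-style dataset. The file name, variable names, and tolerance are invented for the example and do not reflect the actual ERA5 layout.

```python
# Does a blended model/observation dataset approximately conserve atmospheric mass?
# Assumes a hypothetical NetCDF extract with latitude/longitude/time coordinates
# and a surface_pressure variable; requires numpy and xarray.
import numpy as np
import xarray as xr

ds = xr.open_dataset("reanalysis_subset.nc")          # hypothetical extract
weights = np.cos(np.deg2rad(ds["latitude"]))          # area weighting on a lat/lon grid

global_mean_ps = ds["surface_pressure"].weighted(weights).mean(dim=["latitude", "longitude"])
drift = float(global_mean_ps.max() - global_mean_ps.min())

# If mass were exactly conserved, the drift would be tiny. A large drift warns that
# a model trained on this data may learn physically inconsistent rules.
print(f"Global-mean surface pressure drift over the record: {drift:.1f} Pa")
assert drift < 100.0, "suspiciously large mass drift; inspect before training"  # tolerance is an assumption
```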

Teaching scientists how to build and train neural network models will therefore be essential for developing new NLMs. Scientists have been quick to adopt machine learning and AI, but they are not always trained in best practices. This lack of training can lead to significant methodological problems with their work, such as data leakage, which affects model results and reproducibility.
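
Data leakage is easy to demonstrate. In the sketch below, screening features against the labels on the full dataset before cross-validation makes a model look predictive on pure noise, while doing the same screening inside the cross-validation pipeline gives the honest, near-chance answer. The synthetic data and model choices are illustrative only.

```python
# A common leakage mistake and its fix, as a minimal sketch with scikit-learn.
# The data are pure noise, so an honest evaluation should hover near chance.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2000))          # 2,000 noise features
y = rng.integers(0, 2, size=100)          # labels unrelated to X

# Leaky: screening features against the labels on the FULL dataset before
# cross-validation lets information from the test folds pick the features.
X_screened = SelectKBest(f_classif, k=20).fit_transform(X, y)
leaky = cross_val_score(LogisticRegression(max_iter=1000), X_screened, y, cv=5)

# Correct: the selector sits inside the pipeline, so it is refit on each
# training fold only and the test folds stay genuinely unseen.
pipe = make_pipeline(SelectKBest(f_classif, k=20), LogisticRegression(max_iter=1000))
clean = cross_val_score(pipe, X, y, cv=5)

print(f"leaky accuracy:   {leaky.mean():.2f}   # looks impressive, is an artifact")
print(f"correct accuracy: {clean.mean():.2f}   # near chance, as it should be for noise")
```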

Even if general-purpose models prove dominant, domain experts will remain essential for shaping training objectives, evaluating model outputs, and translating a model’s predictions into scientific insights.

Further, novel model architectures are likely to arise from efforts to develop NLMs for new scientific applications. These architectures could offer new functionality, greater energy efficiency, or other advances that could be incorporated into nonscientific AI models.

What can science funders do?

  1. Fund training programs for scientists to learn how to develop NLMs. The Department of Energy’s National Labs are a logical place to host these trainings, as they contain the nation’s supercomputers and employ a deep bench of technical experts.
  2. Support hybrid teams, with machine learning (ML) engineers embedded in scientific labs. Theorists and experimentalists have long worked together in the lab to drive discovery. For example, the transistor was developed at Bell Labs by John Bardeen, a theorist; Walter Brattain, an experimentalist; and William Shockley, an applied physicist who managed the project. Embedding experienced ML engineers in experimental labs is likely to unlock similarly innovative advances in NLMs by allowing for rapid feedback and collaboration between thinkers and tinkerers.
  3. Reward scientists for rigorous NLM development and use. Researchers are strongly incentivized to publish as many papers as quickly as possible, which can lead them to rush through important steps in validating new models that they develop, documenting how to set up and use those models, or confirming that they are using existing models correctly. Prizes aimed at incentivizing these steps can help mitigate these failure modes.

Provide compute

The true potential of AI was only realized with the advent of powerful GPU/TPU supercomputers. Training models on these supercomputers can take days and cost tens of millions of dollars, if not more. Access to these resources and to the capital needed to use them is a key part of developing any model.

Examples of existing and forthcoming supercomputing resources for science include:

Allowing scientists access to more compute time to develop NLMs may also lead to new ways to extract more value from each training run. Scientists already have experience using supercomputers for technical applications (e.g., physics-based modeling) and have developed clever ways to get the most out of their algorithms in those use cases.

What can science funders do?

  1. Survey how much compute NLMs need. An increasing number of funders are interested in providing compute for NLMs, but it is unclear how much is actually needed for training. Surveying frontier labs, startups, and academics to determine the scale of the need would help funders plan accordingly.
  2. Provide compute. Once the scale of the need is better understood, funders can respond by a) buying or coordinating node time on industry GPU supercomputers, or b) building new supercomputer user facilities specifically for NLMs.

Extracting insights from NLMs

Studying an NLM itself can reveal new natural phenomena. Once NLMs learn representations of new natural laws, scientists can extract these representations (as physical equations, visual features, and so on) to add to the corpus of scientific knowledge; these insights can then inform new hypotheses. For example, from a model trained to recognize different types of synapses in the brain, experts extracted visual characteristics of those synapse types that had previously eluded researchers. These visual features will likely inform future hypotheses about synaptic structure and function.

Extracting insights from NLMs will require investing in AI interpretability efforts. AI interpretability is the study of the inner workings of AI models in order to explain how they generate their outputs. Interpretability efforts for NLMs can draw on current research on the “mind” embedded in LLMs, but will also need to develop new methods for NLMs, since the models differ in structure. To be able to build on NLM discoveries, science funders should therefore direct funds toward both the interpretability of existing NLMs and new model development. One example to emulate could be the National Science Foundation’s National Deep Inference Fabric, which enables “scientists and students to perform detailed and reproducible experiments on large pretrained AI models.”
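
One common starting point, sketched below under the assumption of a PyTorch model, is to capture a trained model’s intermediate representations with forward hooks so that experts can probe, cluster, or visualize them. The tiny network and random inputs stand in for a real trained NLM.

```python
# Minimal interpretability workflow: capture the intermediate representations a
# trained model computes for a batch of inputs, so domain experts can inspect,
# cluster, or correlate them with known physics.
import torch
import torch.nn as nn

model = nn.Sequential(                     # stand-in for a trained scientific model
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
model.eval()

activations = {}
def save_activation(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()
    return hook

# Register hooks on the layers whose representations we want to study.
for idx, layer in enumerate(model):
    if isinstance(layer, nn.ReLU):
        layer.register_forward_hook(save_activation(f"relu_{idx}"))

with torch.no_grad():
    _ = model(torch.randn(128, 32))        # a batch of (placeholder) inputs

for name, act in activations.items():
    print(name, tuple(act.shape), f"mean activation {act.mean():.3f}")
# From here, experts might cluster these vectors, probe them with linear models,
# or visualize which input features drive particular units.
```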

Potential risks

Science currently suffers from a lack of transparency about negative results; efforts to develop NLMs should not repeat this pattern, so failed attempts should be carefully documented and broadly publicized.8 Failure studies can guide future model builders away from failed neural network architectures or point to flaws or challenges associated with a given dataset. Knowledge sharing can also help demonstrate where new models provide only marginal gains at the cost of substantial additional compute, or emphasize other pitfalls to avoid, such as data leakage or overly precise hyperparameter tuning that leads to a loss of generality. For these reasons and others, many efforts to develop NLMs are likely to fail. However, if the sources of failure are well publicized, future developers can learn and course-correct, ensuring that the original efforts are not for naught.

Developers also need to contend with the risk that successful models are used by bad actors to design harmful viruses or materials, or otherwise threaten our health, safety, or security. NLMs present dual-use risks across multiple threat domains, including biosecurity, chemical weapons development, and cyberattacks, where the same capabilities that accelerate beneficial research could also lower barriers for malicious actors. LLMs, for example, already provide expert-level virology troubleshooting, which can assist both researchers and bad actors. NLMs can exacerbate these risks by providing novice users with expert-level technical knowledge and by enabling autonomous completion of complex research tasks. Developing and/or requiring safety or governance mechanisms can help mitigate possible misuse. For high-risk dual-use domains, policymakers and NLM builders and funders could consider levers like model evaluation or red-teaming, tiered access or controlled release, know-your-customer requirements that verify and monitor who accesses powerful capabilities, and incident response planning. Model-specific safeguards, such as safety fine-tuning, constitutional AI, or input/output filtering systems that screen responses for dangerous content, can provide additional layers of protection at the model level. Designing and implementing these mechanisms is beyond the scope of this piece, but we encourage science funders, NLM builders, and policymakers to seriously consider and preempt possible concerns with the model development efforts they fund.

Conclusion

AI models, including both agentic AI scientists and the NLMs discussed here, offer a path toward breakthroughs that promise to revolutionize scientific discovery.

NLMs have the potential to rapidly accelerate the pace of scientific discovery in two ways: 1) by extracting heuristics for undiscovered natural laws from large experimental datasets, and 2) by greatly decreasing simulation time and/or compute requirements in well-understood scientific and engineering domains. Successful models have had similar development needs: large and well-curated datasets; teams with deep expertise in both machine learning and the relevant science; and ample node time on GPU or TPU supercomputers.

Even if general-purpose models subsume the need for domain-specific NLMs, the core constraints for using these models for scientific advancement remain the same. By investing in field-specific solutions to address these needs — including developing new measurement capabilities, data standards or repositories, technical expertise, or access to the necessary compute — science funders can accelerate discovery. These investments are likely to unlock broader secondary benefits for both science and AI, regardless of which modeling paradigm ultimately proves dominant.

Appendix 1: Data requirements for surrogate vs. generative models

NLMs can be grouped into two categories, surrogate models and generative models:

  • Surrogate models perform pattern completion to make new predictions. These models are often viewed as the latest advancement in a long history of applied machine learning methods for science — distant descendants of fitting curves to points to create trend lines. AlphaFold, for example, learned to fit protein structures to amino acid sequences.
  • Generative models can produce entirely new natural constructs. These models undergo unsupervised training to learn the underlying relationships in the training data and then use those relationships to generate novel constructs. Evo2, which can generate new candidate genomes for entirely imagined organisms, is one such example.

Surrogate models and generative models rely on different types of data. Surrogate models map one type of data (e.g., amino acid sequences) to another (e.g., protein structures), while generative models look for relationships between elements within a data type (e.g., the relationship between nucleotides in DNA).
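
The sketch below illustrates that data distinction with toy stand-ins: the surrogate learns from paired descriptors and properties, while the generative model sees only the descriptors and learns enough of their structure to propose new candidates. The synthetic data and the choice of a Gaussian mixture as the “generative” model are illustrative assumptions, not how Evo2 or AlphaFold work.

```python
# Toy contrast between the two data requirements described above. The surrogate
# needs paired inputs and outputs; the generative model needs only unlabeled
# examples, from which it learns internal structure and then samples new ones.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Surrogate: map a 5-dimensional "descriptor" to a measured "property".
descriptors = rng.normal(size=(500, 5))
property_ = descriptors @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=500)
surrogate = MLPRegressor(hidden_layer_sizes=(32,), max_iter=2000, random_state=0)
surrogate.fit(descriptors, property_)
print("surrogate prediction for a new descriptor:", surrogate.predict(descriptors[:1]))

# Generative: learn the distribution of descriptors alone, then propose new ones.
generator = GaussianMixture(n_components=3, random_state=0).fit(descriptors)
new_candidates, _ = generator.sample(3)
print("three generated candidate descriptors:\n", new_candidates.round(2))
```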

Appendix 2: Select field-specific bottlenecks

Materials science and chemistry

Inorganic materials

Functional materials discovery

Functional materials leverage the unique electronic and optical properties of a material to perform a given function, such as absorbing or emitting light or performing computation.

The bottleneck in functional materials discovery is materials synthesis; it can take years to make a new material. Making new materials often requires constructing new equipment with exacting tolerances for temperature, pressure, and concentration of material precursors, and then exploring a large parameter space to find the right conditions to make high-quality samples.

Robust theories already exist to predict materials properties from first principles and have led to a wealth of new candidate materials for synthesis. They primarily rely on a simulation approach known as Density Functional Theory (DFT), which uses quantum mechanics. GNoME and MatterGen both rely on DFT simulations to train their models. The trained models then provide a DFT heuristic that can be run more quickly than the physics-based simulations. This approach allows for accelerated throughput that can help explore the large space of materials with four or more elements. Having a longer list to select from before trying to make a new material is useful. However, materials scientists are skeptical that most of the proposed new materials can actually be synthesized and that their properties would differ all that much from existing materials.9

Structural materials development

Structural materials, such as metal alloys and composites, are used to build load-bearing objects and components.

Bottlenecks in developing new structural materials lie in linking their microstructure to their thermal, mechanical, and other physical properties. Microstructure characteristics include crystal grain size, shape, and orientation, phase distribution, dislocation density, etc. Autonomous labs can be used to rapidly synthesize materials with a range of these properties to measure their performance, and to then train models to predict the performance from the properties.

Component design

The bottleneck in component design is simulation run time, which AI can help reduce. Components such as heat exchangers and suspensions are often made of multiple materials, and their physical properties are approximated by simulations. These simulations discretize space and time to approximate how the component will perform under operating conditions. Simulation accuracy requires very fine discretizations, but the smaller the time step and the finer the spatial grid, the longer the run time. AI can help by optimizing these discretizations and by iterating more efficiently toward an optimal overall design.
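
The sketch below illustrates the scaling at the heart of this bottleneck, using a one-dimensional explicit heat-equation solver as a stand-in for a full component simulation. The grid sizes and stability factor are arbitrary illustrative choices.

```python
# Why discretization drives run time: in an explicit scheme, halving the grid
# spacing forces a ~4x smaller stable time step, so total work grows roughly 8x
# in 1D (and far faster in 3D), which is why learned surrogates are attractive.
import numpy as np

def simulate_heat(n_points, alpha=1.0, length=1.0, t_end=0.01):
    dx = length / (n_points - 1)
    dt = 0.4 * dx**2 / alpha                 # explicit-scheme stability limit
    steps = int(np.ceil(t_end / dt))
    u = np.sin(np.pi * np.linspace(0, length, n_points))   # initial temperature
    for _ in range(steps):
        u[1:-1] += alpha * dt / dx**2 * (u[2:] - 2 * u[1:-1] + u[:-2])
    return steps * n_points                  # rough proxy for total work

for n in (101, 201, 401):
    print(f"{n:4d} grid points -> ~{simulate_heat(n):,} cell-updates")
```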

Fusion

Challenges in fusion are primarily challenges in materials or component design: how to confine a plasma, how to design a plasma blanket that will be robust to sustained operation, how to consistently provide fuel, and so on. Hence, the bottlenecks to fusion fall into the materials design categories listed above. AI has been successfully used to control fusion plasma, reducing plasma disruption by improving the control of the magnetic field via real-time feedback with sensors.

Physics and engineering

Quantum computing

Quantum computing faces two bottlenecks: the need to develop more algorithms for performing computations on quantum computers and the need to create more qubits in quantum computers by improving the interplay between computational algorithms, algorithms for error correction, and device fabrication tolerances. Mathematical AI models, which are not covered here, could likely help with algorithm development on both of these fronts.

Earth science

GenCast is a useful weather model, but like most weather models, it has a limited time horizon. The intrinsically chaotic nature of weather limits how far into the future any weather model can see.

Furthermore, current weather forecasting and climate modeling have difficulty modeling subseasonal timescales — lead times between two weeks and two months. These timescales are strongly influenced by complex coupling between different parts of the environment, such as between the atmosphere and ocean or between the troposphere and stratosphere. AI could help model these interactions. AI could also be used to simulate small-scale processes, such as cloud and aerosol formation, which are notoriously difficult to capture in global climate models.

Earthquake science is also increasingly turning to deep neural networks for prediction, though data for large earthquakes is, thankfully, scarce.

Biology

Systems neuroscience

Systems neuroscience faces multiple bottlenecks, from cataloging the different types of neurons in the brain to measuring their connectivity and biophysical properties to recording their activity with sufficient spatial and temporal resolution. The technology needed to address each of these challenges is still evolving, and so previous attempts to establish clear data types and data repositories have not yet gained full acceptance across the field. Multiple organizations, including the Chan Zuckerberg Initiative, the Allen Institute, the Howard Hughes Medical Institute, and the National Institutes of Health, each fund research and engineering efforts to solve these problems. Given the many stakeholders, this field is ripe for roadmapping efforts.

Cell biology

Tracking the interactions of multiple different molecules within a cell over time is a major bottleneck in cell biology. As in systems neuroscience, the technology is still evolving, and multiple stakeholders are funding tracking efforts. This field would also benefit from roadmapping.

DNA models also still have room to improve. For example, Evo2 cannot yet grapple with interactions between multiple genetic variants; if mutations to more than one gene are the cause of an illness, it is unlikely to identify them. Opportunities to address this and other limitations include generating more genetic data linked to physical and behavioral phenotypes, integrating DNA models with protein models, and adding knowledge about molecular interaction networks in cells.

Protein binding and drug discovery

Current AI protein models face some limitations. AlphaFold can only predict structured regions of proteins that are similar to proteins already in its training dataset. Proteins often contain both structured regions that are relatively fixed and unstructured regions that are “floppy.” These unstructured regions can interact in important ways with other molecules in a cell, and AlphaFold does not yield any insights into those interactions. Building in simulation tools or known biophysical mechanisms or interactions could offer a path forward.

However, just as LLMs have proven clearly useful even as their data and models continue to improve, protein models in active development have already influenced biology research. Protein models would benefit from cross-species genomic data, which could help identify evolutionarily conserved regions and possible interactions and enable further fine-tuning of the models.

Protein structure and other biological AI models are also already being used to design new candidate drugs. More — and better formatted — screening and toxicology data, such as compounds and antibodies that bind to proteins and peptides, would help speed the design and testing process.

Acknowledgements

Thanks to Adam Marblestone, Jeff Snyder, Karthik Duraisamy, Steven Henle, Molly Menzel, Nick Sofroniew, Kristin Branson, Oliver Stephenson, and Max Katz for helpful discussion.


  1. These scientific agents may be the precursors to the “country of geniuses in a datacenter” that Dario Amodei envisioned in Machines of Loving Grace.

  2. Care should be taken to confirm that the models learn the correct laws, as they can learn invalid approximations.

  3. Though NLMs often use the transformer architecture that underlies LLMs, they may also feature elements of diffusion models, convolutional neural networks, graph neural networks, neural operators, or other neural network elements. See Appendix 1 for a discussion of surrogate vs. generative NLMs.

  4. Stranded datasets are large datasets, many of which cost millions of dollars to generate, that are often under-analyzed due to barriers faced by outside groups when trying to examine them. These barriers include limited documentation of the dataset’s structure or API (which often requires considerable technical expertise to use even when well documented) and a lack of access to the local storage and compute needed to analyze the data.

  5. Dean Ball provides further detail on how DOE could build and support autonomous labs for materials science.

  6. Grand challenges have also been referred to as Common Task Methods.

  7. The 14th Critical Assessment of Structure Prediction (CASP) competition, supported by a grant from the National Institutes of Health.

  8. Nick McGreivy details one such failure in his reflections on the AI-for-science hype, where he describes uncovering limitations in using AI models for problems in fluid mechanics.

  9. Cheetham, Anthony K., and Ram Seshadri. “Artificial Intelligence Driving Materials Discovery? Perspective on the Article: Scaling Deep Learning for Materials Discovery,” Chemistry of Materials vol. 36, no. 8 (2024), p. 3490–95, https://doi.org/10.1021/acs.chemmater.4c00643.