This piece was originally published on IFP Senior Technology Fellow Tim Hwang’s Substack Macroscience on December 13th, 2024, as a response to an open RFP for short papers on “negative metascience,” diagnosing places where the infrastructure for science has broken down and how we might do better.
Introduction
For all of human history until the past 100 years, infectious disease has been our deadliest foe. In the first decades of the 20th century, nearly one in a hundred Americans would die of an infectious disease every year. To put that into context, the American infectious disease death rate during the height of the COVID-19 pandemic in 2021 was still 10 times lower than that. The relief we enjoy from the ancient specter of deadly disease is largely due to antibiotic treatments like penicillin, as well as improved sanitation and nutrition.
But this relief may soon be coming to an end. If nothing is done, antibiotic resistance promises a return to the historical norm of more frequent death from infectious diseases. As humans use more antibiotics, we inadvertently run the world’s largest selective breeding program for bacteria which can survive our onslaught of drugs. By the late 1960s, 80% of cases of Staphylococcus aureus, a common and notorious bacterial infection agent, had grown resistant to penicillin. Since then, we have discovered many more powerful antibiotic drugs, but our usage is accelerating — while our discovery rate is, at best, stagnating.
As a result, antibiotic resistance is spreading. Today, certain forms of Staphylococcus aureus, like MRSA, are resistant to many of our most powerful antibiotics, and MRSA alone causes roughly 20,000 deaths every year in the US. One promising solution to antibiotic resistance is hidden in dragon blood.
Komodo dragons, native to a few small islands in Indonesia, are the world’s largest lizards. They eat carrion and live in swamps. Their saliva hosts many of the world’s most stubborn and infectious bacteria, but Komodos almost never get infected. Even when they have open wounds, Komodo dragons can trudge happily along through rotting corpses and mud without worry.
Their resilience is partially due to an arsenal of chemicals in their blood called antimicrobial peptides. These peptides are short sequences of amino acids, the building blocks of proteins. These chemical chains glom onto negatively charged bacteria (but not neutrally charged animal cells) and force open holes in the membrane, killing the infectious bacterium. Humans have peptides too, and we use them for everything from regulating blood sugar with insulin to fighting infections.
Peptides are especially promising candidates against antibiotic-resistant pathogens for two reasons:
- They are easy to program and synthesize. A peptide’s properties and structure follow directly from its linear chain of amino acids, so it is easy to work with peptides computationally and to apply machine learning (ML) and bioinformatics.
- Peptides are resistant to resistance. Researchers can use them to target more fundamental properties of bacteria, whereas antibiotics target particular molecular pathways that are often closed off by a single, small mutation. For example, bacterial membranes are almost universally negatively charged; this is a feature of their physiology that is not easily mutated away. Peptides that use this negative charge to seek out and destroy invading bacteria are therefore hard for bacteria to evade, even after generations of the intense selective pressure that comes from being targeted.
Even though peptides are short — usually less than 50 amino acids — the combinatorial space of peptide sequences is vast. It’s difficult to search this space for peptides that effectively combat the resistant superbugs that threaten to return us to the medieval world of deadly infections. However, searching for these peptides is a well-defined problem with easy-to-measure inputs and outputs. The fundamental research problem is perfectly poised to benefit from rapid advances in computation. The newest research in this field builds ML models that predict which sequences of amino acids will be bioactive against certain pathogens (similar to DeepMind’s AlphaFold), then synthesizes those peptides and tests the model’s predictions.
But progress in this field is too slow to meet the challenge of antibiotic resistance. This isn’t just due to inherent difficulties in the science — progress towards antimicrobial peptides is slowed by scattered, poorly maintained, and small datasets of peptide sequences paired with experimentally verified properties. ML thrives on big data, but the largest database of peptides only has a few thousand experimentally validated sequences and only tracks three or four chemical properties, like antimicrobial activity and host toxicity. These properties are often difficult to compare across peptide databases.
Most importantly, there is almost zero negative data in these sources. Scientists test hundreds or thousands of peptides to find one that is active against some pathogen, then publish a paper about the one that succeeded. That success might go into the database, but all of the preceding failures are kept in the file drawer, even though they are, at current margins, far more valuable for ML models than one more successful data point.
Making a better dataset is feasible and desirable, but no actor in science today has the incentives to do it. Open datasets are a public good, so private research organizations will tend to underinvest in them. The non-pecuniary rewards in academia, like publications and prestige, point toward splashy results in big journals rather than foundational pieces of infrastructure such as datasets.
This problem is solvable with an investment in public data production. A massive, standardized, and detailed dataset of one million peptide sequences and their antimicrobial properties (or lack thereof) would accelerate progress toward new drugs that can kill antibiotic-resistant pathogens. This would replicate the success of efforts like the Protein Structure Initiative and the Human Genome Project and put us on track to defeat these drug-resistant diseases before they roll back the clock on the medical progress of the past century.
What are peptides, and how do they work?
Proteins are the machinery of biology: they constitute the motors, factories, and control surfaces of cellular life. Some proteins, like the motor proteins that haul cargo around our cells, are incredibly complex machines made of thousands of amino acids.
Peptides are a particular kind of protein. They are short and simple without many moving parts. Instead of using intricate and specialized binding sites like larger proteins, peptides use thousands of copies of themselves and preferential chemical attractions to perform various tasks in the human body, like regulating blood sugar or pain sensitivity.
Antimicrobial peptides kill pathogens that invade the body, and they are the subject of active research in microbiology. Our bodies employ antimicrobial peptides naturally: peptides like the defensins and LL-37 are most often found on our skin and in our mouths and noses, serving as the first line of defense against the pathogens we come into contact with.
Much is still unknown about how peptides work and how they find their targets, but antimicrobial peptides tend to have a positive charge and two different surfaces along their structure that either attract or repel water. The positive charge attracts them to pathogenic bacteria, which have negatively charged membranes. Then, the peptide’s hydrophobic and hydrophilic surfaces interact with the membrane to drill holes in the bacterium, and the bacterial cell collapses and dies. Lower concentrations of peptides may not kill the invading pathogens, but they will slow down their metabolic processes, giving a head start to the rest of our immune system.
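To make those two properties concrete, here is a minimal sketch of how net charge and hydrophobicity can be read directly off an amino acid sequence. The residue tables are deliberately simplified, and the example sequence is a made-up cationic peptide rather than a real candidate:

```python
# Rough illustration: estimate the net charge and hydrophobic fraction of a
# peptide from its amino acid sequence. The residue tables and the example
# sequence below are simplified and illustrative, not a validated model.

POSITIVE = set("KR")            # lysine, arginine: roughly +1 at neutral pH (histidine ignored)
NEGATIVE = set("DE")            # aspartate, glutamate: roughly -1
HYDROPHOBIC = set("AVILMFWYC")  # a common, simplified hydrophobic set

def net_charge(seq: str) -> int:
    """Approximate net charge at physiological pH, ignoring termini and histidine."""
    return sum(1 for aa in seq if aa in POSITIVE) - sum(1 for aa in seq if aa in NEGATIVE)

def hydrophobic_fraction(seq: str) -> float:
    """Fraction of residues that are hydrophobic under the simplified table above."""
    return sum(1 for aa in seq if aa in HYDROPHOBIC) / len(seq)

# Hypothetical cationic, amphipathic peptide (not a real drug candidate).
example = "KWKLFKKIGAVLKVL"
print(net_charge(example), round(hydrophobic_fraction(example), 2))  # 5 0.6
```

The inputs are nothing more than a string of letters, which is part of why this field is so amenable to computation.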
Eukaryotic cells, including normal human cells, build their membranes from different fats, which leaves them much closer to neutrally charged and far less vulnerable to these membrane attacks. Peptides can also distinguish gram-positive from gram-negative bacteria, preferentially attaching to bacteria with thin, single-layer membranes or thick, multilayered ones. This specificity is important because it can help preserve non-pathogenic, beneficial bacteria while still attacking invaders.
None of this targeting is perfectly accurate. Peptides are sent out millions at a time and, since their effect grows as the concentration on a cell increases, small differences in chemical preference lead to big differences in activity. Some of our cells will bump into these peptides by chance and potentially be affected, but hundreds of times more peptides will be reliably drawn to markers like the negative charge and particular chemicals on bacterial cell walls. This is similar to how traditional antibiotics work: there is some degree of targeting, but a heavy dose of antibiotics will still harm beneficial bacteria and human cells. That tradeoff is often worth it in the fight against a deadly disease.
Peptides have two big advantages over antibiotics.
- Peptides resist resistance. While antibiotics often target very narrow biochemical pathways in a bacterium’s metabolism or particular proteins found in the cytoplasm of pathogens, peptides target general properties of a bacterium’s entire membrane, like charge or lipid composition. This means that antibiotics are slightly more specific, but also that antibiotics are easier to resist: changing one residue in a target protein is much easier than changing the electric charge over the entire bacterial surface. This general targeting has allowed antimicrobial peptides to remain effective first defenses against pathogens for millions of years without changing much.
- Peptides are easier to synthesize and mass manufacture than antibiotics. Biology has done most of the heavy lifting for us here. Proteins are so versatile and fundamental to so many biological processes that nearly every cell has completely general purpose protein factories. We can take single-celled organisms that are simple and easy to grow, like yeast, insert the right DNA instructions, add sugar, and the yeast will start pumping out copies of the desired protein. There are dozens of companies that will synthesize custom proteins on demand for reasonable prices. By rapidly synthesizing and testing hundreds of different peptides, you can screen for effective and non-toxic treatments and scale them up in six or seven days. This is a stark contrast to small molecule antibiotic manufacturing, where figuring out how to synthesize a particular chemical can take years of trial and error, and making that synthesis efficient can take even longer.
The broad-spectrum chemical warfare and mass manufacturing ease of antimicrobial peptides make them a promising avenue for combating antibiotic-resistant pathogens. Their ability to disrupt fundamental properties of bacterial cells, rather than specific molecular pathways, suggests that peptide-based treatments could remain effective over longer periods compared to traditional antibiotics, and the ease of synthesis means that new treatments can be made in weeks instead of years when the need does arise.
The frontier of peptide research
Peptides have shown in vitro effects against the toughest antibiotic-resistant infections, including MRSA, as well as HIV, fungal infections, and even cancer. But they still aren’t common on pharmacy shelves or in hospital treatments. Current clinical trials may change this, but the main barrier is still the fundamental research.
Peptides are chains that vary in length, and each link in the chain is one of 20 amino acids. Thus, the combinatorial space of possible peptides is incomprehensibly massive. We have mapped a tiny fraction of this space. Only a few thousand peptides are registered in databases, and there are even fewer with all the important information on not only antimicrobial activity, but also specific targeting and host cell toxicity. Much of the research on peptides has started by indexing naturally occurring peptides. Although this research makes use of evolution’s exploration of this combinatorial space over billions of years, it’s nowhere close to comprehensive.
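A quick back-of-the-envelope count makes the scale vivid (the lengths are just illustrative):

```python
# Back-of-the-envelope size of the peptide sequence space: 20 possible amino
# acids at each position. The lengths chosen here are illustrative.
for length in (10, 20, 40):
    print(f"{length}-residue peptides: 20^{length} = {20**length:.2e}")
```

Even at modest lengths, the space dwarfs the few thousand sequences we have actually characterized.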
DeepMind’s AlphaFold has made significant progress in predicting the 3D structure of proteins based only on their amino acid sequence. The frontier of research on antimicrobial peptides uses similar techniques but with slightly different inputs and outputs. Like AlphaFold, antimicrobial peptide researchers use ML to make predictions based on a protein’s amino acid sequence, helping them explore the vast space of possible peptides and filter it down to the most promising candidates. However, ML models of peptides more directly target the medical properties of the peptides, like human toxicity or activity against bacteria, rather than just predicting their 3D structure. ML prediction on peptides may also be more tractable than AlphaFold’s task because peptides are so much shorter than most proteins.
Based on databases of a few thousand peptide sequences, researchers have used ML techniques to predict new peptides that are active (in vitro) against MRSA, HIV, or cancer, often at higher rates than naturally occurring analogs. One way they did this is by splicing, shuffling, and combining some of the existing sequences into new ones. Other approaches apply successive filters to the database and then combine the properties of the filtered sequences into a new peptide. Both of these approaches created peptides with high degrees of in vitro activity against multidrug-resistant infections like Staphylococcus aureus.
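For a flavor of what this looks like in practice, here is a minimal sketch of sequence-to-activity prediction using amino acid k-mer counts and an off-the-shelf classifier. The sequences and labels are invented for illustration, and real studies use far larger datasets and richer features and models than this:

```python
# Minimal sketch of sequence-to-activity prediction: featurize peptides as
# amino acid 2-mer counts and fit a simple classifier. The sequences and
# labels below are made up; real studies train on curated databases.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical training data: 1 = antimicrobial, 0 = inactive (the kind of
# negative example that is rarely published).
peptides = ["KWKLFKKIGAVLKVL", "GIGKFLKKAKKFGKA", "AAAAGGGGSSSS", "DEDEDEGGGG"]
labels   = [1, 1, 0, 0]

# Count overlapping 2-mers of amino acids as features.
vectorizer = CountVectorizer(analyzer="char", ngram_range=(2, 2))
X = vectorizer.fit_transform(peptides)

model = LogisticRegression().fit(X, labels)

# Score an unseen candidate sequence (also hypothetical).
candidate = ["KKLLKKVLKVLF"]
print(model.predict_proba(vectorizer.transform(candidate))[0, 1])
```

The entire pipeline runs on plain strings of amino acids, which is why the quality and quantity of labeled sequences is the binding constraint.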
All of this research is very promising, but it’s still moving slowly because of one main constraint: data.
The problem: not enough data
ML needs data. Google’s AlphaGo trained on 30 million moves from human games and orders of magnitude more from games it played against itself. The largest language models are trained on at least 60 terabytes of text. AlphaFold was trained on just over 100,000 3D protein structures from the Protein Data Bank.
The data available for antimicrobial peptides is nowhere near these benchmarks. Some databases contain a few thousand peptides each, but they are scattered, unstandardized, incomplete, and often duplicative. Data on a few thousand peptide sequences, paired with a scattershot view of their biological properties, is simply not sufficient to get accurate ML predictions for a system as complex as protein-chemical interactions. For example, the APD3 database is small, with just under 4,000 sequences, but it is among the most tightly curated and detailed. However, most of its sequences come from frogs and other amphibians, due to the path-dependent history of peptide discovery in that taxon. Another database, CAMPR4, has on the order of 20,000 sequences, but around half are “predicted” or synthetic peptides that may lack experimental validation, and its entries contain less information about source and activity. The formatting of each of these sources is different, so it’s not easy to put all the sequences into one model. More inconsistencies and idiosyncrasies stack up across the dozens of other datasets available.
There is even less negative training data; that is, data on all the amino acid sequences without interesting, publishable properties. In current ML research, labs will test dozens or even hundreds of peptide sequences for activity against certain pathogens, but they usually only publish and upload the sequences that worked. Training a model without this data makes it extremely difficult to avoid false positive predictions. Since most data currently available is “positive” — i.e., peptides that do have antimicrobial properties — negative data is especially valuable.
Expanding the dataset of peptides and including negative observations is feasible and desirable, but no one in science has an incentive to do it. Open datasets are a public good: anyone can costlessly copy a dataset, so it is difficult and often socially wasteful to put one behind a paywall. Therefore, we can’t rely on private pharmaceutical companies to invest sufficiently in this kind of open data infrastructure. Even if they did, they would fight hard to keep the data a trade secret. That would help firms recoup their investment, but it would prevent other firms and scientists from using the data, undercutting the reason it was so valuable in the first place.
Non-monetary rewards in academia, like publications and prestige, point toward splashy results in big journals, not toward foundational infrastructure like open datasets. Scientists are often altruistic, openly sharing datasets and tools they originally developed for their own use. In the field of antimicrobial peptides, researchers host open peptide databases and prediction tools that are free for anyone to use. They are motivated by a genuine desire to see progress in the field, but genuine desire doesn’t pay for all of the equipment and labor required to scale these databases up to ML-efficient size.
The most common funding mechanisms for researchers in this field reinforce the shortfall in data infrastructure investment. Project-based grants, like the NIH’s R01, are focused on specific research questions or outcomes. These grants usually have relatively short timelines (e.g., 3-5 years) and emphasize novel findings and publications as key metrics of success.
This emphasis on short-term, project-based grants stems from a desire for measurable outcomes, accountability, and novelty. University tenure committees and academics themselves heavily weigh high-impact publications and grant funding. Building infrastructure, while valuable to the scientific community, typically generates fewer publications, is often seen as less prestigious or less interesting, and has more spillover benefits that aren’t credited. NIH program officers also want clear metrics of their impact, and the agency’s higher-ups need to convince Congress that they aren’t wasting billions of dollars, so funding decisions are held accountable to those metrics. Accountability is easier with smaller projects that have a shorter gap between investment and return, and mistakes are less damaging when the funding amounts are small and more of the responsibility for funding decisions lies outside of the NIH, in expert external review panels. The NIH’s remit from Congress also explicitly prizes novelty in research and its results. Internal and external calls for the NIH to pursue more “high-risk, high-reward” research reinforce this preference for discrete projects with novel designs over expansions of already established scientific techniques.
The million-peptide database project is not a high-risk, high-reward experiment or a counterintuitive result that can turn into a highly cited paper or patent. Instead, it’s a massive scale-up of established procedures for synthesizing and testing peptides, one that will be more expensive and time-consuming than a project-based grant and will have a less legible connection to the metrics of success tracked by academics, the NIH, and Congress.
The solution: a million-peptide database
The data problem facing peptide research is solvable with targeted investments in data infrastructure. We can make a million-peptide database.
There are no significant scientific barriers to generating a 1,000x or 10,000x larger peptide dataset. Several high-throughput testing methods have been successfully demonstrated, with some screening as many as 800,000 peptide sequences and nearly doubling the number of unique antimicrobial peptides reported in publicly available databases. These methods will need to be scaled up: not only by testing more peptides, but also by testing them against different bacteria, checking for human toxicity, and measuring other chemical properties. That scaling, however, is an infrastructure problem, not a scientific one.
This strategy of targeted data infrastructure investments has three successful precedents: PubChem, the Human Genome Project, and the Protein Data Bank.
- The NIH’s PubChem is a database of 118 million small-molecule chemical compounds that contains nearly 300 million biological tests of their activity, e.g. their toxicity or activity against bacteria. More than the peptide database proposed here, PubChem is about aggregation and standardization rather than direct data creation: it combined existing databases and invited academics to add new molecules to the collection. Although PubChem began in the early 2000s and was first released in 2004, it is still incredibly useful to the chemistry research community. With an annual budget of $3 million, PubChem exceeded the size of the leading private molecule database from Advanced Chemistry Development by around 10,000x in 2011 and made the data free. PubChem is credited with supporting a renaissance in ML for chemistry.
- Another success is the Human Genome Project. Unlike PubChem, the Human Genome Project couldn’t rely on collating existing data, and had to industrialize DNA sequencing to get through the 3 billion base pairs of human DNA in time. This 13-year effort began in the early 1990s and cost about $3.8 billion. Over the course of the project, the per-base cost of DNA sequencing plummeted by ~100,000-fold. By 2011, sequencing machines could read about 250 billion bases in a week, compared to 25,000 in 1990 and 5 million in 2000. Before the HGP, gene therapies were less than 1% of clinical trials; today they comprise more than 16%, all building off the data infrastructure foundation laid by the project.
- Perhaps the closest analog to the million-peptide database proposal is the Protein Data Bank (PDB), a database of more than 150,000 complex proteins and their 3D structures. This open database began as a project of the Department of Energy’s Brookhaven National Laboratory in the early ‘70s and has since evolved into an international scientific collaboration between research centers in the US, Europe, and Japan. Like PubChem, the PDB has become the primary repository for protein structure discoveries — and like the Human Genome Project, the PDB was paired with a large data generation program, the Protein Structure Initiative (PSI).
The Protein Structure Initiative was a $764 million project funded by the U.S. National Institute of General Medical Sciences between 2000 and 2015. The PSI developed high-throughput methods for protein structure determination and contributed thousands of unique protein structures to the database. By 2006, PSI centers were responsible for about two-thirds of worldwide structural genomics output. The hundreds of thousands of detailed 3D protein structures in the Protein Data Bank became the essential training data behind the success of AlphaFold.
These projects cut against the NIH’s structural incentives for smaller, shorter, investigator-led grants, but they still succeeded. PubChem was housed within the National Library of Medicine, which already had a mandate for data infrastructure. Rather than competing with R01s, PubChem received dedicated funding through the NIH Common Fund. It also managed some of the drawbacks of data infrastructure projects in legibility and credit assignment by creating clear metrics of success around database usage, downloads, and a formal citation mechanism for database entries. Similarly, the Protein Structure Initiative was funded through the National Center for Research Resources, another NIH division with an explicit focus on research infrastructure.
The Human Genome Project overcame its barriers through a strong presidential endorsement and dedicated Congressional funding that bypassed normal NIH processes. It sustained this political momentum by developing clear technical milestones, like cost per base pair, that could be evaluated without relying on traditional academic metrics.
Here’s how a scientific funder like the NIH can adapt the success of the Protein Data Bank, the Protein Structure Initiative, PubChem, and the Human Genome Project to create a million-peptide database:
Like PubChem, start by merging and standardizing existing peptide datasets, and open them to all. This alone would be a big help for ML in peptide research. A researcher today who wants to use all available peptide data in their model has to collect dozens of files, interpret poorly documented variables, and filter everything into a standardized format. Hundreds of researchers are currently duplicating all of this work for their projects. Thousands of hours of their time could be saved if the NIH or NSF paid to organize this data once and for all and opened the results to all interested researchers. Setting a Schelling point for all future data additions would also help keep the data standardized as the dataset grows.
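As a rough illustration of that one-time curation work, the sketch below loads exports from two hypothetical databases with mismatched column names, maps them onto a single schema, and deduplicates by sequence. The file names, column names, and schema are stand-ins, not the actual formats of APD3, CAMPR4, or any other real source:

```python
# Sketch of the one-time curation work: load exports from different peptide
# databases, map their inconsistent column names onto one schema, and
# deduplicate by sequence. File names and column mappings are hypothetical.
import pandas as pd

SCHEMA = ["sequence", "target_organism", "mic_ug_per_ml", "hemolytic"]

sources = {
    "db_a_export.csv": {"Sequence": "sequence", "Organism": "target_organism",
                        "MIC": "mic_ug_per_ml", "Hemolysis": "hemolytic"},
    "db_b_export.csv": {"peptide_seq": "sequence", "species": "target_organism",
                        "mic_ugml": "mic_ug_per_ml", "toxic_rbc": "hemolytic"},
}

frames = []
for path, column_map in sources.items():
    df = pd.read_csv(path).rename(columns=column_map)
    df["sequence"] = df["sequence"].str.upper().str.strip()
    frames.append(df[SCHEMA])

merged = (pd.concat(frames, ignore_index=True)
            .drop_duplicates(subset="sequence"))
merged.to_csv("standardized_peptides.csv", index=False)
```

Doing this once, centrally, and publishing both the merged data and the schema is what saves every downstream lab from repeating the same cleanup.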
Collecting existing data won’t be nearly enough to get to a million-peptide database. The next step, like the Protein Structure Initiative and the Human Genome Project, is to industrialize peptide testing. Mass-produced protein synthesis and testing are already well-established techniques in the field, so this project won’t need any 100,000x advances in technology to succeed like the HGP did. A scientific funding organization like the NIH only needs to support scaling up these existing techniques. Researchers can already test tens or hundreds of thousands of peptides simultaneously.
Industrializing peptide testing is more complicated than the demonstrations in individual research papers because we need to screen for many variables, not just the single measure of antimicrobial activity that those projects track. We want to know a peptide’s activity against a broad range of bacteria, viruses, fungi, and cancer cells; its effects on benign human cells and beneficial bacteria, so it doesn’t do too much collateral damage; and which peptides failed to have any interesting effects, so our ML models know what to avoid. For peptide testing to reach the scale ML models need, it has to be funded beyond the resources available for a single paper.
This effort requires a purpose-made grant from a scientific funding agency like the NIH or the NSF, not a standard academic-led research project grant. The focus here should not be papers, citations, or prestige. The sole focus should be data. With a grant like this, a million-peptide database is achievable well below the budget and timeline standard set by the Protein Structure Initiative and the Human Genome Project.
Retail custom proteins cost $5-$10 per amino acid. At an average peptide length of 20 amino acids, that’s $100 to $200 per peptide. That cost covers only the synthesis, not all of the time and labor required for testing, so a reasonable upper bound on the cost of a million-peptide database is $350 million. Even this large upper-bound cost is likely justified by the potential impact of antimicrobial peptides. The direct treatment costs for just six drug-resistant infections are around $4.6 billion annually in the US, with a far greater cost coming from excess mortality and damaged health.
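Spelling out that upper-bound arithmetic, using the retail figures above plus a rough allowance for testing labor (the allowance is an assumption chosen to reach the stated bound):

```python
# Upper-bound cost arithmetic from the retail figures above.
cost_per_amino_acid = 10          # $ per residue, high end of retail pricing
avg_length = 20                   # residues per peptide
n_peptides = 1_000_000

synthesis = cost_per_amino_acid * avg_length * n_peptides   # $200M at retail
testing_overhead = 150_000_000    # rough allowance for testing time and labor (assumption)
print(f"${(synthesis + testing_overhead) / 1e6:.0f}M upper bound")  # $350M
```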
The actual cost is likely far less than this $350 million upper bound. Performing protein synthesis in-house and in bulk, rather than buying retail, can greatly reduce costs. Additionally, these synthesis costs are for the highest-quality resin synthesis. High-throughput methods like SPOT synthesis can be less than 1% of the cost per peptide and allow researchers to synthesize thousands of peptides at once. Clinical use of the tested peptides would probably require retesting with more expensive, higher purity methods, but you’d only need to retest the few most promising candidates. For the purpose of supplying millions of data points to an ML model, the purity of this high throughput method is more than sufficient.
Other methods use mass-produced DNA plasmids to induce bacteria like E. coli to produce peptides on long chains attached to their membranes which, if the peptides are antimicrobial, end up killing the host cell. Researchers can then blend up all of the E. coli and check which of the DNA plasmids copied themselves and which did not. The plasmids that didn’t reproduce are the ones that encoded antimicrobial peptides and prevented their host bacteria from multiplying. This method allowed University of Texas researchers to test 800,000 peptides at once, at a cost significantly lower than any other high-throughput testing method. The downside is that you never get to isolate the actual peptide from the bacterial culture, which limits the types of tests you can run. But scaling up this process could easily generate hundreds of thousands of peptide candidates with some verified antimicrobial activity that can then move on to more detailed tests.
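The readout logic of that screen is simple enough to sketch: plasmids whose abundance crashes after growth likely encoded peptides that killed their hosts. The counts and depletion threshold below are invented for illustration:

```python
# Sketch of the readout logic for the plasmid-depletion screen described
# above: plasmids whose counts crash after growth likely encoded peptides
# that killed their host cells. Counts and threshold are made up.
before = {"plasmid_A": 950, "plasmid_B": 1020, "plasmid_C": 980}
after  = {"plasmid_A": 12,  "plasmid_B": 1500, "plasmid_C": 940}

hits = [p for p in before
        if after[p] / before[p] < 0.1]   # >10x depletion flags a likely antimicrobial peptide
print(hits)   # ['plasmid_A']
```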
The time required to build a million-peptide database is also reasonable, perhaps less than five years. A single researcher can synthesize 400 peptides on a 20×20 cm cellulose sheet in 6 days using SPOT synthesis and can probably perform tests for antimicrobial activity, human toxicity, and other traits in another week. With an automated pipetting machine, the yield increases to 6,000-8,000 peptides in the same six days. A rate of 8,000 peptides synthesized and tested every two weeks would get to a million peptides in about 1,750 days, just under five years. Most importantly, almost all of these processes are highly parallelizable, so scaling up the number of peptides you want to test doesn’t necessarily increase the amount of time it takes, as long as you can set up another researcher or pipetting machine working in parallel.
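The timeline arithmetic, using the throughput figures above for a single automated synthesis-and-testing line:

```python
# Timeline arithmetic for a single automated synthesis-and-testing line,
# using the throughput figures above.
peptides_per_cycle = 8_000        # per two-week synthesis + testing cycle
cycle_days = 14
n_peptides = 1_000_000

cycles = n_peptides / peptides_per_cycle          # 125 cycles
days = cycles * cycle_days                        # 1,750 days
print(f"{days:.0f} days, about {days / 365:.1f} years on one line")
# Parallel lines divide this roughly linearly: five lines would finish in about a year.
```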
The failure of standard scientific incentives to fund the creation of the peptide database is solvable. A single concentrated effort over several years would lay a foundation for an ML renaissance in antimicrobial peptide research, as PubChem, the HGP, and the Protein Data Bank did for their respective fields.
Peptide research can fix antibiotics
The infectious diseases that harried humanity for millennia are regaining strength as antibiotic resistance spreads. Every year, antibiotic-resistant infections claim over 1.2 million lives worldwide. Peptides, in dragon blood and human spit, have been nature’s first line of defense against these infections for millions of years. We can learn from and improve upon nature’s example, making effective new treatments for some of the world’s deadliest and most intransigent diseases.
More than simply preserving the 20th-century safety that antibiotics created, peptides might exceed the effectiveness and versatility of antibiotics. Peptides are just short proteins, and proteins are the machinery of all living things. Peptides can thus target not only bacterial infections, as antibiotics do; they have also demonstrated in vitro activity against viruses, fungal infections, and cancer. Peptides are also programmable and easy to manufacture. Once we figure out how the properties of a peptide change as we substitute different amino acid building blocks, we might be able to design, test, and mass-manufacture new treatments within weeks, rather than the decades it takes for new antibiotics to come to market.
The path toward this future is clear. ML prediction on sequences of amino acids is a promising and tractable way to advance our understanding of and control over the properties of antimicrobial peptides. The most difficult scientific hurdles along this path have been cleared; all we need for the next step is scale.
That means we need data. The existing data infrastructure for antimicrobial peptides is tiny and scattered: a few thousand sequences, with a handful of useful biological assays, spread across dozens of data providers. No one in science today has the incentive to create this data; pharma companies can’t make money from it, and researchers can’t get splashy publications out of it. As a result, researchers are duplicating expensive legwork collating and cleaning this data, and still getting suboptimal results, because there simply isn’t enough information to take full advantage of the ML approach.
Scientific funding organizations like the NIH or the NSF can fix this problem. The scientific knowledge required to massively scale the data we have on antimicrobial peptides is well established and ready to go. It wouldn’t be too expensive or take too long to build a clean dataset of a million peptides or more, with detailed information on their activity against the most important resistant pathogens and their toxicity to human cells. This is well within the scale of successful projects that these organizations have funded in the past, like PubChem, the HGP, and the Protein Data Bank.
We can meet this challenge and solve it quickly if we target our resources towards building open data infrastructure that thousands of research projects will use. Let’s not wait while antibiotic-resistant pathogens get stronger.