In 1965, Margaret Dayhoff published the Atlas of Protein Sequence and Structure, which collated the 65 proteins whose amino acid sequences were then known. Inspired by that Atlas, today we are releasing the Dayhoff Atlas of protein sequence data and protein language models.

Jul 25, 2025 · 8:14 PM UTC

The Dayhoff Atlas dramatically expands the scale and diversity of publicly available protein data by providing the largest open dataset of natural proteins to date, GigaRef, and a first-in-class, large-scale dataset of synthetic proteins, BackboneRef.
How do dataset choice and model scale affect the quality of proteins generated by the Dayhoff model? In the first study of its kind, we generated sequences from different Dayhoff models and tested them head-to-head in the lab, measuring whether they expressed in E. coli.
Learning on GigaRef yielded a small increase in the fraction of expressed proteins. Increasing model and dataset scale further improved the expression rate. Augmenting training with structure-based synthetic data from BackboneRef produced the highest expression success rate.
The Dayhoff Atlas was a big team effort, including @SarahAlamdari, Alex J Lee, Kaeli Kaymak-Loveless, @samir_char, @garykbrixi, @cdomingoenrich, @WChentong, Suyue Lyu, @nfusi, @ntenenz, and @avapamini.
Replying to @KevinKaichuang
Another question: in the EvoDiff paper you guys used ProtT5, and here you use ProtBert for the FPD calculations. I'm interested in using FPD myself but was wondering how to choose the pLM. Is there any reason you guys preferred ProtBert over other pLMs, or in your exp. are all ok?
They're actually both ProtT5 -- gotta correct the EvoDiff preprint. In practice, you'll get very similar results using any reasonable PLM.
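For anyone wanting to try FPD with their own pLM, here is a rough sketch of the Fréchet distance between two sets of embeddings. It assumes you have already embedded each sequence into a fixed-length vector (e.g. mean-pooled ProtT5 features) and stacked them into arrays of shape (n_sequences, dim). The function name `frechet_distance` is made up for illustration, and this is not necessarily how the Dayhoff or EvoDiff code computes it.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets.

    emb_a, emb_b: arrays of shape (n_sequences, dim), one pLM embedding
    per sequence (assumed to be precomputed elsewhere).
    """
    emb_a = np.asarray(emb_a, dtype=np.float64)
    emb_b = np.asarray(emb_b, dtype=np.float64)

    # Fit a Gaussian (mean and covariance) to each embedding set.
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))

    # tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b; clip tiny negatives from round-off.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

Lower is closer. Comparing generated sequences against a held-out natural set with the same pLM on both sides is the usual setup, and swapping the pLM only changes how the input arrays are produced.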
Replying to @KevinKaichuang
This is really cool, I especially like the idea behind unrolling the MSA! And just curious, how much compute/time was required to fold the ~2.4M ProteinMPNN sequences with OmegaFold?