Kevin K. Yang 楊凱筌 · Jul 25, 2025 · 8:14 PM UTC

Kevin K. Yang 楊凱筌 @KevinKaichuang

In 1965, Margaret Dayhoff published the Atlas of Protein Sequence and Structure, which collated the 65 proteins whose amino acid sequences were then known. Inspired by that Atlas, today we are releasing the Dayhoff Atlas of protein sequence data and protein language models.

Kevin K. Yang 楊凱筌 · Jul 25, 2025 · 8:14 PM UTC

The Dayhoff Atlas dramatically expands the scale and diversity of publicly available protein data by providing the largest open dataset of natural proteins to date, GigaRef, and a first-in-class, large-scale dataset of synthetic proteins, BackboneRef.

Kevin K. Yang 楊凱筌 · Jul 25, 2025 · 8:14 PM UTC

How do dataset choice and model scale affect the quality of proteins generated by the Dayhoff models? In the first study of its kind, we generated sequences from different Dayhoff models and tested them head-to-head in the lab, measuring whether they expressed in E. coli.

Kevin K. Yang 楊凱筌 · Jul 25, 2025 · 8:14 PM UTC

Training on GigaRef yielded a small increase in the fraction of expressed proteins. Increasing model and dataset scale further improved the expression rate. Augmenting training with structure-based synthetic data from BackboneRef produced the highest expression success rate.

Kevin K. Yang 楊凱筌 · Jul 25, 2025 · 8:14 PM UTC

Our models, code, and data are openly available on GitHub, Zenodo, and Hugging Face. huggingface.co/collections/m… zenodo.org/records/15265289 github.com/microsoft/dayhoff

Kevin K. Yang 楊凱筌 · Jul 25, 2025 · 8:15 PM UTC

There's a lot more in the preprint! biorxiv.org/content/10.1101/…

The Dayhoff Atlas: scaling sequence diversity for improved protein generation

Modern biology is powered by the organization of biological information, a framework pioneered in 1965 by Margaret Dayhoff’s Atlas of Protein Sequence and Structure. Databases descended from this...

Kevin K. Yang 楊凱筌 · Jul 25, 2025 · 8:15 PM UTC

The Dayhoff Atlas was a big team effort, including @SarahAlamdari, Alex J Lee, Kaeli Kaymak-Loveless, @samir_char, @garykbrixi, @cdomingoenrich, @WChentong, Suyue Lyu, @nfusi, @ntenenz, and @avapamini.

youngsu ko · Jul 26, 2025 · 3:35 AM UTC

youngsu ko @youngsuko9

Replying to @KevinKaichuang

Another question: in the EvoDiff paper you guys used ProtT5, and here you use ProtBert for the FPD calculations. I'm interested in using FPD myself but was wondering how to choose the pLM. Is there any reason you guys preferred ProtBert over other pLMs, or in your experience are they all OK?

Kevin K. Yang 楊凱筌 · Jul 31, 2025 · 7:52 PM UTC

They're actually both ProtT5 -- gotta correct the EvoDiff preprint. In practice, you'll get very similar results using any reasonable PLM.
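
For anyone wanting to try FPD with their own pLM, here is a minimal sketch of the Fréchet distance step only, assuming you already have per-sequence embeddings as two (n_sequences, dim) arrays. It is not the Dayhoff or EvoDiff code: the function name `frechet_distance` is hypothetical, the embedding step (ProtT5 or any other pLM) is left out, and the random arrays below only stand in for real embeddings.

```python
import numpy as np


def frechet_distance(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """Frechet distance between Gaussians fit to two (n, dim) embedding sets.

    Hypothetical helper for illustration; not taken from the Dayhoff repo.
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.atleast_2d(np.cov(emb_a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(emb_b, rowvar=False))

    # Tr((cov_a cov_b)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_a @ cov_b, which are real and non-negative for
    # covariance matrices (up to numerical noise, hence the clip).
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)


# Placeholder data: random vectors standing in for pLM embeddings.
rng = np.random.default_rng(0)
natural = rng.normal(size=(200, 4))
generated = natural + 3.0  # same spread, mean shifted by 3 in each of 4 dims

same = frechet_distance(natural, natural)
shifted = frechet_distance(natural, generated)
```

With identical covariances the trace terms cancel, so `shifted` reduces to the squared mean offset (3² × 4 = 36) and `same` is zero up to floating-point error. Real pLM embeddings have around a thousand dimensions, so you need many more sequences than this toy example for a stable covariance estimate.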

youngsu ko · Jul 26, 2025 · 3:29 AM UTC

Replying to @KevinKaichuang

This is really cool, I especially like the idea behind unrolling the MSA! And just curious, how much compute/time was required to fold the ~2.4M ProteinMPNN sequences with OmegaFold?