PhD student @mldcmu | CS ‘22 @nyutandon

Joined February 2023
Ellie Haber retweeted
Final version of our L2G paper is now published at @TmlrOrg! Kudos to @WenduoC for leading this work. Paper: openreview.net/forum?id=5NM4…
Can we skip genomic Foundation Model pretraining? Our work L2G repurposes natural-language LLMs for genomics via cross-modal transfer, matching fine-tuned genomic FMs. Kudos to @WenduoC & amazing collab w/ @atalwalkar. L2G, language to genome; L2G, life’s too good biorxiv.org/content/10.1101/…
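(For readers curious what "cross-modal transfer" can look like in practice, here is a minimal sketch: treat DNA as text and fine-tune a pretrained language model with a small classification head. The base model, toy task, and data below are illustrative assumptions, not L2G's actual pipeline.)

```python
# Hedged sketch: fine-tune a pretrained natural-language LM directly on
# DNA strings. "gpt2" and the enhancer-vs-background toy task are
# illustrative assumptions, not the L2G recipe.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # gpt2 ships without a pad token
model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

seqs = ["ACGTACGTGGCA", "TTGACCGTAAAC"]   # toy DNA sequences
labels = torch.tensor([1, 0])             # e.g., enhancer vs. background

batch = tokenizer(seqs, return_tensors="pt", padding=True)
loss = model(**batch, labels=labels).loss
loss.backward()  # an optimizer step would complete one fine-tuning update
```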
Ellie Haber retweeted
My amazing PhD student Wendy Yang @muyu_wendy_yang is graduating this summer & seeking industry R&D roles! She has published in ISMB and Nature Methods and interned at Genentech. Strong in #AI/#ML for gene regulation. Looking for top AI+bio talent? Contact Wendy: muyuy@andrew.cmu.edu
Congrats to @muyu_wendy_yang on a successful PhD thesis defense today! Wendy developed a series of ML methods to study genome organization & function, and genome editing - expanding our toolkit for uncovering genome principles. Here is a photo with the happy thesis committee 🎉
Ellie Haber retweeted
RL with verifiable reward has shown impressive results in improving LLM reasoning, but what can we do when we do not have ground truth answers? Introducing Self-Rewarding Training (SRT): where language models provide their own reward for RL training! 🧵 1/n
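(A minimal sketch of the self-rewarding idea, under one simple assumption: the model rewards agreement with its own majority-vote answer. `generate_answer` is a hypothetical stub, not the paper's implementation.)

```python
# Hedged sketch: reward agreement with the model's own majority answer.
from collections import Counter
import random

def generate_answer(prompt: str) -> str:
    # Hypothetical stub standing in for sampling an LLM answer.
    return random.choice(["42", "42", "41"])

def self_rewarded_samples(prompt: str, n: int = 8):
    answers = [generate_answer(prompt) for _ in range(n)]
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [float(a == majority) for a in answers]  # self-assigned reward
    # These (answer, reward) pairs would feed an RL update in place of
    # ground-truth verification.
    return list(zip(answers, rewards))

print(self_rewarded_samples("What is 6 * 7?"))
```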
Check out this incredible work led by @alam_shahul 🥼
We introduce #POPARI, an interpretable, spatially-aware factor-based model for multi-sample #spatialtranscriptomics. Huge kudos to @alam_shahul for his incredible effort (yes, 80+ equations!). In collaboration with @immunoliugy and @insitubiology. biorxiv.org/content/10.1101/…
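(A toy sketch of what "spatially-aware, factor-based" can mean: factorize an expression matrix while a graph-Laplacian penalty encourages neighboring spots to share factor loadings. The sizes, chain-graph neighbors, and update rule below are illustrative assumptions, not POPARI's actual model.)

```python
# Hedged sketch of a spatially-aware factor model (not POPARI itself):
# X ~ W @ H with a smoothness penalty on spatially adjacent spots.
import numpy as np

rng = np.random.default_rng(0)
n_spots, n_genes, k = 50, 100, 5
X = rng.poisson(2.0, size=(n_spots, n_genes)).astype(float)
A = np.diag(np.ones(n_spots - 1), 1)  # chain graph as a toy neighbor graph
A = A + A.T
L = np.diag(A.sum(1)) - A             # graph Laplacian

W = rng.random((n_spots, k))
H = rng.random((k, n_genes))
lam, lr = 0.1, 1e-3
for _ in range(200):
    R = W @ H - X
    grad_W = R @ H.T + lam * (L @ W)       # spatial smoothness on loadings
    grad_H = W.T @ R
    W = np.clip(W - lr * grad_W, 0, None)  # keep factors nonnegative
    H = np.clip(H - lr * grad_H, 0, None)

print("reconstruction error:", np.linalg.norm(W @ H - X))
```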
Ellie Haber retweeted
In our lab, we take developing computational methods seriously - and we name them well! Here are some of our #spatialtranscriptomics ML methods so far 👇 More to come…
Unified integration of spatial transcriptomics across platforms biorxiv.org/content/10.1101/… 🧬🖥️🧪 github.com/elliehaber07/LLOK…
Ellie Haber retweeted
George R.R. Martin holds the first new dire wolf born in 10,000 years
Ellie Haber retweeted
Training with more data = better LLMs, right? 🚨 False! Scaling language models by adding more pre-training data can decrease your performance after post-training! Introducing "catastrophic overtraining." 🥁🧵+arXiv 👇 1/9
Ellie Haber retweeted
Interacting with the external world and reacting based on outcomes are crucial capabilities of agentic systems, but existing LLMs’ ability to do so is limited. Introducing Paprika 🌶️, our work on making LLMs general decision makers that can solve new tasks zero-shot. 🧵 1/n
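(For context, a minimal sketch of the act-observe-adapt loop such agentic training targets: act, observe the outcome, and condition on the accumulated history. The policy and environment below are hypothetical stubs, not Paprika's setup.)

```python
# Hedged sketch of an agentic interaction loop; all components are stubs.
import random

def llm_policy(history):
    # Stand-in for an LLM choosing the next action given past feedback.
    return random.choice(["guess_low", "guess_high"])

def environment_step(action, target):
    reward = 1.0 if action == target else 0.0
    return f"outcome: reward={reward}", reward

history, total = [], 0.0
for t in range(5):
    action = llm_policy(history)
    obs, reward = environment_step(action, "guess_high")
    history.append((action, obs))  # outcomes become context for later steps
    total += reward
print(history, "total reward:", total)
```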
Ellie Haber retweeted
Proper and meaningful benchmark datasets are crucial for advancing genomic LLMs/FMs, and ML methods for genomics in general. Fantastic collab w/ @lileics's group. Amazing work led by @WenduoC @ZhenqiaoSong @zocean636
DNALONGBENCH: A Benchmark Suite for Long-Range DNA Prediction Tasks biorxiv.org/cgi/content/shor… #biorxiv_bioinfo
Ellie Haber retweeted
Selecting good pretraining data is crucial, but rarely economical. Introducing ADO, an online solution to data selection with minimal overhead. 🧵 1/n
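(A rough sketch of what online data selection with minimal overhead can look like: per-domain sampling weights updated from a cheap running loss statistic. The update rule and the `train_step_on` stub are illustrative assumptions, not ADO's algorithm.)

```python
# Hedged sketch: per-domain weights updated online from running losses.
import math
import random

domains = ["web", "code", "papers"]
weights = {d: 1.0 for d in domains}
ema_loss = {d: 1.0 for d in domains}

def train_step_on(domain):
    # Hypothetical stub returning the batch loss observed for that domain.
    return max(0.1, ema_loss[domain] + random.gauss(0.0, 0.05) - 0.02)

for step in range(100):
    # Sample the next batch's domain in proportion to its current weight.
    domain = random.choices(domains, weights=[weights[d] for d in domains])[0]
    loss = train_step_on(domain)
    ema_loss[domain] = 0.9 * ema_loss[domain] + 0.1 * loss
    weights[domain] = math.exp(ema_loss[domain])  # favor high-loss domains

print({d: round(w, 2) for d, w in weights.items()})
```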
Ellie Haber retweeted
Thrilled to launch the AI4BIO Center @CarnegieMellon! Our goal is to tackle grand challenges in understanding how cells work using AI/ML. Excited to help recruit faculty and foster collaboration across @SCSatCMU and campus. There is truly no place like CMU cs.cmu.edu/news/2024/ai4bio
Ellie Haber retweeted
Attention is all you need; at least the matrices are, if you want to distill Transformers into alternative architectures, like Mamba, with our new distillation method: MOHAWK! We also release a fully subquadratic, performant 1.5B model distilled from Phi-1.5 with only 3B tokens!
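(A simplified sketch of the matrix-matching idea: align a student's causal token-mixing matrix with a teacher attention matrix under a Frobenius loss. A real Mamba student parameterizes its mixing implicitly through an SSM; the dense learnable matrix below is purely illustrative.)

```python
# Hedged sketch: Frobenius matching between a teacher attention matrix
# and a learnable causal mixing matrix (a dense stand-in for an SSM).
import torch

T, d = 16, 32
torch.manual_seed(0)
Q, K = torch.randn(T, d), torch.randn(T, d)
scores = (Q @ K.T) / d**0.5
causal = torch.triu(torch.ones(T, T), diagonal=1).bool()
attn = scores.masked_fill(causal, float("-inf")).softmax(-1)  # teacher

M = torch.zeros(T, T, requires_grad=True)  # student mixing matrix
mask = torch.tril(torch.ones(T, T))
opt = torch.optim.Adam([M], lr=1e-2)
for _ in range(500):
    loss = ((M * mask - attn) ** 2).sum()  # match the mixing matrices
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final matching loss:", loss.item())
```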
Ellie Haber retweeted
Models with different randomness make different predictions at test time even if they are trained on the same data. In our latest ICLR paper (oral), we investigate how models learn different features, and the effect this has on agreement and (potentially) calibration. 1/
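(A minimal sketch of the agreement statistic in question: train two models that differ only in random seed and measure the fraction of test points on which their predictions match. The tiny MLPs and synthetic data are stand-ins for the paper's models.)

```python
# Hedged sketch: agreement rate between two models differing only in seed.
import torch

X = torch.randn(1000, 10, generator=torch.Generator().manual_seed(123))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).long()

def train_model(seed):
    torch.manual_seed(seed)  # different init and batch order per seed
    model = torch.nn.Sequential(
        torch.nn.Linear(10, 32), torch.nn.ReLU(), torch.nn.Linear(32, 2)
    )
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        idx = torch.randperm(len(X))[:128]
        loss = torch.nn.functional.cross_entropy(model(X[idx]), y[idx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

p1 = train_model(0)(X).argmax(-1)
p2 = train_model(1)(X).argmax(-1)
print("agreement:", (p1 == p2).float().mean().item())
```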
Ellie Haber retweeted
Excited to share scGHOST, now published @NatureMethods, a graph-based #ML method that identifies #3Dgenome subcompartments in single cells. Kudos to @KyleXiongCMU & @RuochiZhang, who worked so closely on this. Exciting time for single-cell epigenomics & multiomics! nature.com/articles/s41592-0…
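(A conceptual sketch, not scGHOST's actual algorithm, of graph-based subcompartment calling: treat a contact map as a graph, embed genomic bins spectrally, and cluster the embeddings. The toy contact matrix and cluster count are assumptions.)

```python
# Hedged sketch: spectral embedding + clustering of a toy contact graph.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_bins, k = 200, 5
C = rng.poisson(1.0, (n_bins, n_bins)).astype(float)  # toy contact map
C = (C + C.T) / 2                                     # symmetrize

d = C.sum(1)
L = np.diag(d) - C                 # graph Laplacian of the contact graph
vals, vecs = np.linalg.eigh(L)
emb = vecs[:, 1:k + 1]             # skip the trivial constant eigenvector

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(emb)
print("bins per putative subcompartment:", np.bincount(labels))
```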