Millions of children need help with speech, but there are far too few clinicians. Want to know if AI can responsibly bridge this gap? Check out our EMNLP'25 paper, "Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology": arxiv.org/abs/2509.16765 🧵

Oct 31, 2025 · 4:40 PM UTC

The problem is significant: over 3.4 million children in the U.S. have speech disorders, yet there is roughly a 20-to-1 ratio of affected children to available Speech-Language Pathologists (SLPs). Multimodal language models (MLMs) could help, but their clinical utility remains largely untested. buffalo.edu/ai4exceptionaled… 🧵 3/10
To address this, we introduce SLPHelm, the first comprehensive benchmark for evaluating MLMs in speech pathology. Developed in collaboration with clinical experts, it covers five core tasks, ranging from initial diagnosis to granular symptom identification. 🧵 4/10
We evaluated 15 state-of-the-art MLMs and found that none consistently meet clinically acceptable performance thresholds. This highlights a major gap between current model capabilities and the reliability needed for real-world clinical use. 🧵 5/10
To close this gap, we developed domain-specific finetuning strategies that boost performance by ~10% on specific tasks. 🧵 6/10
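(The thread doesn't spell out the finetuning recipe; purely as illustration, here is a minimal parameter-efficient finetuning sketch using LoRA via Hugging Face peft. The base model openai/whisper-small, the target modules, and the hyperparameters are assumptions for the example, not the configuration used in the paper.)

```python
# Illustrative sketch of domain-specific, parameter-efficient finetuning.
# Model id, target modules, and hyperparameters are placeholders, not the
# paper's actual recipe.
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
from peft import LoraConfig, get_peft_model

base_model_id = "openai/whisper-small"  # stand-in speech model for the example
model = AutoModelForSpeechSeq2Seq.from_pretrained(base_model_id)
processor = AutoProcessor.from_pretrained(base_model_id)

# LoRA freezes the base weights and trains small low-rank adapters,
# a common way to specialize a general model to a clinical domain.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    target_modules=["q_proj", "v_proj"],  # attention projections; names vary by architecture
    lora_dropout=0.05,
    bias="none",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of parameters are trainable
```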
Our robustness analysis uncovered a critical issue: a systematic gender performance gap. Across multiple models, performance was consistently better for male speakers than female speakers, highlighting an urgent need for bias mitigation to ensure equitable care. 🧵 7/10
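(As a rough illustration of this kind of robustness check, the sketch below compares accuracy by speaker gender. The column names "gender", "label", and "prediction" and the toy rows are hypothetical; the benchmark's actual metadata fields and metrics may differ.)

```python
# Minimal subgroup-robustness sketch: per-gender accuracy and the gap between groups.
import pandas as pd

# Toy results table; in practice this would come from model predictions on the benchmark.
results = pd.DataFrame({
    "gender": ["male", "male", "female", "female", "female", "male"],
    "label": ["disorder", "typical", "disorder", "typical", "disorder", "disorder"],
    "prediction": ["disorder", "typical", "typical", "typical", "typical", "disorder"],
})

per_group = (
    results.assign(correct=results["label"] == results["prediction"])
    .groupby("gender")["correct"]
    .mean()
)
print(per_group)
# A consistent positive difference across many models would indicate the kind of
# systematic gender gap described above.
print("gap (male - female):", per_group["male"] - per_group["female"])
```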
Counterintuitively, we also found that more reasoning isn't always better. Enabling Chain-of-Thought (CoT) prompting, a method designed to improve reasoning, actually degraded performance on certain classification tasks. 🧵 8/10
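(To make the comparison concrete, here are two illustrative prompt templates for a classification task, one direct and one CoT-style. The wording, transcript placeholder, and label set are hypothetical, not the prompts used in the paper.)

```python
# Illustrative prompt templates only.
TRANSCRIPT = "[transcribed child speech sample]"
LABELS = ["typical", "articulation disorder", "fluency disorder"]  # hypothetical label set

direct_prompt = (
    f"Classify the following speech sample into one of {LABELS}.\n"
    f"Sample: {TRANSCRIPT}\n"
    "Answer with the label only."
)

cot_prompt = (
    f"Classify the following speech sample into one of {LABELS}.\n"
    f"Sample: {TRANSCRIPT}\n"
    "Think step by step about the speech characteristics, then give the label."
)
# The finding above: the second style, despite eliciting more reasoning,
# can yield lower accuracy on some classification tasks.
```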
We are publicly releasing all of our work to accelerate research in this vital area. Code: github.com/stanford-crfm/hel… Dataset: huggingface.co/datasets/SAA-… We thank Yifan Mai, @percyliang, and @StanfordCRFM for their help integrating this benchmark into HELM: crfm.stanford.edu/helm/ 🧵 9/10
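(A hypothetical usage sketch for the released dataset: the repository id below is a placeholder, since the link above is truncated; substitute the actual id from the Hugging Face page. The benchmark itself runs through HELM's standard helm-run CLI; see the linked repository for the exact run entries.)

```python
# Placeholder-only sketch for loading the released dataset.
from datasets import load_dataset

DATASET_ID = "<huggingface-dataset-id>"  # replace with the real id from the link above
dataset = load_dataset(DATASET_ID)
print(dataset)

# The benchmark runs through HELM's CLI, roughly:
#   helm-run --run-entries <slphelm run spec> --suite <suite-name> --max-eval-instances <N>
# The run-entry spec is defined by the HELM integration; consult the repo for the exact invocation.
```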
We thank @katherinemiller and @StanfordHAI for bringing our research to the broader community: hai.stanford.edu/news/using-… 🧵 10/10
Replying to @fagunpatel19998
the real question here is not just "can ai help" but "what happens when ai screening catches something it's uncertain about." in clinical settings, the model's uncertainty is as important as its accuracy. this work is valuable because you're asking how to use models responsibly in a domain where false confidence is genuinely dangerous. that thoughtfulness about deployment context is what separates science from justification