Millions of children need help with speech, but there are far too few clinicians. Want to know if AI can responsibly bridge this gap?
Check out our EMNLP'25 paper
Finetuning and Comprehensive Evaluation of Language Models for Speech Pathology
arxiv.org/abs/2509.16765
🧵
Blog @StanfordAILab: ai.stanford.edu/blog/slp-hel…
w/ @martinakaduc @sangttruong Jody Vaynshtok @sanmikoyejo @nickhaber @StanfordEng @NUSingapore
🧵 2/10
The problem is significant: over 3.4 million children in the U.S. have speech disorders, yet the ratio of affected children to available Speech-Language Pathologists (SLPs) is roughly 20 to 1. Multimodal language models (MLMs) could help close this gap, but their clinical utility remains largely untested.
buffalo.edu/ai4exceptionaled…
🧵 3/10
To address this, we introduce SLPHelm, the first comprehensive benchmark for MLMs in speech pathology. Developed in collaboration with clinical experts, it covers five core tasks, from initial diagnosis to granular symptom identification.
🧵 4/10
We evaluated 15 state-of-the-art MLMs and found that none consistently meet clinically acceptable performance thresholds. This highlights a major gap between current model capabilities and the reliability needed for real-world clinical use.
🧵 5/10
To close this gap, we developed domain-specific finetuning strategies that boost performance by ~10% on specific tasks. A rough sketch of what this involves is below.
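For a flavor of the setup, here is a minimal parameter-efficient finetuning sketch with LoRA via Hugging Face transformers + peft. The model ID, adapter hyperparameters, and module names are placeholders, not our exact recipe:

```python
# Minimal LoRA finetuning sketch (illustrative only; not necessarily our exact recipe).
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("base-model-id")  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained("base-model-id")     # placeholder checkpoint

# Attach low-rank adapters to the attention projections; only these are trained.
lora_config = LoraConfig(
    r=16,                                  # adapter rank (hypothetical value)
    lora_alpha=32,                         # scaling factor (hypothetical value)
    target_modules=["q_proj", "v_proj"],   # module names vary by architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of weights are trainable
# ...then train on domain-specific (audio, transcript, label) examples as usual.
```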
🧵 6/10
Our robustness analysis uncovered a critical issue: a systematic gender performance gap. Across multiple models, performance was consistently better for male speakers than female speakers, highlighting an urgent need for bias mitigation to ensure equitable care.
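Concretely, the gap can be measured by grouping per-sample accuracy by speaker gender. A minimal sketch; the field names and values below are made up:

```python
import pandas as pd

# Toy per-sample results; columns and values are hypothetical.
results = pd.DataFrame({
    "speaker_gender": ["male", "male", "female", "female", "female", "male"],
    "correct":        [1,      1,      0,        1,        0,        1],
})

per_gender = results.groupby("speaker_gender")["correct"].mean()
print(per_gender)
print("gap (male - female):", per_gender["male"] - per_gender["female"])
```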
🧵 7/10
Counterintuitively, we also found that more reasoning isn't always better. Enabling Chain-of-Thought (CoT) prompting, a method designed to improve reasoning, actually degraded performance on certain classification tasks.
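Roughly, the comparison is between a direct-answer prompt and a CoT prompt like these (illustrative templates, not our exact wording):

```python
# Illustrative prompt templates only; not the exact wording used in the paper.
DIRECT = (
    "Listen to the speech sample and classify the disorder.\n"
    "Answer with the label only."
)
COT = (
    "Listen to the speech sample and classify the disorder.\n"
    "First reason step by step about the acoustic evidence, then give the label."
)
# On some classification tasks, the DIRECT style scored higher than COT.
```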
🧵 8/10
We are publicly releasing all of our work to accelerate research in this vital area.
Code: github.com/stanford-crfm/hel…
Dataset: huggingface.co/datasets/SAA-…
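Loading the data should look roughly like this; the dataset ID below is a placeholder, since the link above is truncated — substitute the actual ID:

```python
from datasets import load_dataset

# "<org>/<dataset-name>" is a placeholder; use the ID from the link above.
ds = load_dataset("<org>/<dataset-name>")
print(ds)  # shows available splits and their sizes
```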
We thank Yifan Mai, @percyliang, and @StanfordCRFM for their help integrating this benchmark into HELM: crfm.stanford.edu/helm/
🧵 9/10
We thank @katherinemiller and @StanfordHAI for bringing our research to the broader community:
hai.stanford.edu/news/using-…
🧵 10/10