EMNLP 2024 Main: "A Peek into Token Bias: Large Language Models Are Not Yet Genuine Reasoners"
arxiv.org/abs/2406.11050
TL;DR: Reasoning capabilities still fail to generalize because of token bias. What looks like reasoning is probabilistic pattern matching rather than genuine reasoning.
🕑Tuesday 16:00 - 17:30
📍Riverfront Hall
We support the findings in🍎Apple's trending GSM-Symbolic paper, which references our work in questioning the true reasoning abilities of LLMs. LLMs perform impressively on reasoning benchmarks 📊, but is that performance a mirage? 🤔
We'll be at EMNLP
@emnlpmeeting tomorrow at 4 PM in the poster session🌟 at Riverfront Hall, sharing our latest research with guidance from
@DanRothNLP @camillo_taylor @weijie444 from Penn
@PennEngineers @Wharton and
@tanwimallick from Argonne
@argonne
🤖 We developed a hypothesis-testing framework that evaluates models on classic logic problems. We perturb tokens, especially tokens irrelevant to the underlying logic, and observe statistically significant performance shifts. We call this token bias💡
For instance, the famous "Linda Problem" in psychology👩⚕️ is usually answered correctly. However, when we change it to the "Bob Problem"👨⚕️, performance shifts. Similarly, we swap "horses🐎" for "bunnies🐰" in the "twenty-five horses problem" from graph theory. These changes don't affect the underlying logic, but they expose memorization🧠 over genuine reasoning🤔
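The paper's actual statistical procedure may differ; here's a minimal sketch of the core idea, testing whether a surface-level token swap (e.g., "Linda" → "Bob") significantly shifts accuracy, using an exact McNemar test on paired per-problem outcomes (function name and example counts are hypothetical):

```python
# Hypothetical sketch: pair each problem's outcome before and after a
# logically irrelevant token swap, then run an exact McNemar test
# (two-sided binomial test on the discordant pairs).
from math import comb

def mcnemar_exact(correct_to_wrong: int, wrong_to_correct: int) -> float:
    """Exact two-sided McNemar p-value from the two discordant counts."""
    n = correct_to_wrong + wrong_to_correct
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a shift
    k = min(correct_to_wrong, wrong_to_correct)
    # P(X <= k) under Binomial(n, 0.5), doubled for a two-sided test
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# Illustrative numbers: 8 problems flip correct->wrong after the swap,
# only 1 flips wrong->correct.
print(mcnemar_exact(8, 1))  # ~0.039: significant at the 0.05 level
```

If token identity were irrelevant, flips in either direction would be equally likely; a lopsided flip count is the signature of token bias.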
Come chat with us at the session!