PhD student @UTAustin | Multi-agent LLMs, Continuous Learning, Planning. I'm exploring open development now...

Austin
Joined November 2023
🎉 Thrilled to announce that our MindGames challenge has been accepted at #NeurIPS2025! 🧠🤖 Ready to deploy your AI agents to compete and collaborate in Hanabi, Werewolf, Stag Hunt, and Colonel Blotto? 🎮 Stay tuned for details!
This is big! Environment diversity has been a key challenge for me when tuning RL in interactive settings. Turning non-RL environments into RL-trainable ones massively expands where agentic LLMs can operate and unlocks real applications.
Scaling Agent Learning via Experience Synthesis 📝: arxiv.org/abs/2511.03773
Scaling training environments for RL by simulating them with reasoning LLMs! Environment models + replay buffer + new tasks = cheap RL for any environment!
- Strong improvements on non-RL-ready environments and across multiple model families!
- Works better in sim-to-real RL settings → warm-start for high-cost environments 🧵1/7
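Not the paper's code, but to make the recipe concrete, here is a minimal sketch of how I read it: a reasoning LLM stands in for the environment's transition and reward function, and the synthesized rollouts fill a replay buffer for downstream RL. All names here (llm_simulate_step, ReplayBuffer, the JSON protocol) are hypothetical, not from the paper.

```python
# Minimal sketch of LLM-simulated experience synthesis (hypothetical names,
# not the paper's actual code). A reasoning LLM acts as the environment
# model; synthesized transitions fill a replay buffer for downstream RL.
import json
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(self.buffer, min(batch_size, len(self.buffer)))

def llm_simulate_step(llm, state, action):
    """Ask the LLM, acting as environment model, for (next_state, reward, done)."""
    prompt = (
        "You simulate an environment. Given the state and action, reply "
        'with JSON {"next_state": ..., "reward": float, "done": bool}.\n'
        f"State: {state}\nAction: {action}"
    )
    return json.loads(llm(prompt))  # llm = any text-completion callable

def synthesize_experience(llm, policy, initial_states, buffer, horizon=8):
    """Roll out the current policy inside the LLM-simulated environment."""
    for state in initial_states:
        for _ in range(horizon):
            action = policy(state)
            step = llm_simulate_step(llm, state, action)
            buffer.add((state, action, step["reward"], step["next_state"], step["done"]))
            if step["done"]:
                break
            state = step["next_state"]
```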
Does anyone else feel evaluation/grounding is as hard as building the agent?

I helped a friend build an onboarding Q&A agent (“ask how we do X on Team Y, get the right steps + links”). The demo was shiny. The day after, I realized evaluation/grounding isn’t a checkbox; it’s the job. Nothing exploded. Instead, a slow drip of “almost-right”: it quoted last quarter’s PTO pilot because the doc changed mid-week, plus other slight misses. None felt dramatic. It felt slippery. Without tight eval/grounding, the agent isn’t stable enough.

What I learned (small, boring, effective):
1. Prompts are model-specific. A prompt that lifts Model A can tank Model B. If you swap models, re-optimize the prompt against your evals before trusting results.
2. Mirror-prod staging. Spin up a staging environment that mirrors production.
3. Extensive tests (not vibes). You can also use LLMs as “user simulators” to fuzz phrasing and surface brittle prompts (sketch below).

Curious what it’s like for others. Feels similar? What evaluation/grounding habit actually made your agents stick in the real world? Are there ways to scale evaluation?
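On point 3, this is roughly what I mean by an LLM “user simulator”: paraphrase a canonical question many ways, run each through the agent, and grade the answers against a grounded expectation. A minimal sketch; ask_llm (any text LLM) and run_agent (the agent under test) are callables you'd supply yourself, not any specific framework's API.

```python
# Minimal user-simulator fuzzing sketch (hypothetical callables:
# ask_llm = any text LLM, run_agent = the agent under test).
def fuzz_question(ask_llm, canonical_question, n=10):
    """Generate n paraphrases of a canonical user question."""
    prompt = (
        f"Rewrite this question {n} ways a real employee might phrase it, "
        f"one per line, no numbering:\n{canonical_question}"
    )
    return [line.strip() for line in ask_llm(prompt).splitlines() if line.strip()]

def eval_agent(ask_llm, run_agent, canonical_question, expected_fact):
    """Grade the agent's answer to each paraphrase against a grounded fact
    (e.g., the key detail in the *current* PTO policy doc)."""
    results = []
    for q in fuzz_question(ask_llm, canonical_question):
        answer = run_agent(q)
        verdict = ask_llm(
            "Does this answer state the following fact? Reply PASS or FAIL.\n"
            f"Fact: {expected_fact}\nAnswer: {answer}"
        )
        results.append((q, "PASS" in verdict.upper()))
    pass_rate = sum(ok for _, ok in results) / len(results)
    return pass_rate, results
```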
Kevin Wang retweeted
VITA group dinner last night: Welcome back alumnus @WuyangC for his talk! And happy birthday to Jiajun 🎂
Menu debate highlights: everyone went straight for the most expensive dishes 😂
Research talk ✅ Travel stories ✅ Team energy ✅
This is what makes our group special 🙌
Kevin Wang retweeted
Not able to attend COLM, but our coauthor Kevin Wang will present our work on test-time scaling for world models (e.g., COSMOS). Takeaways: test-time scaling laws hold; small models + extra inference can match bigger ones at equal compute; SWIFT makes it practical with fast tokenization, prob Top-K, and beam search. Project: scalingwfm.github.io/
COLM country distribution for this year. I will present SPIN-Bench on Thursday, 4:30-6:30 PM, poster #55. Come and chat!
(5/5) We’ve included more models in the paper and added experiments on scaling with the number of agents. Check out the website (github.com/spinbench/spinben…) and ⭐ the GitHub repo for new updates. We’re also working on the next benchmarks and will share more about the upcoming settings soon.
(4/5) Observations
In sequential planning tasks like Blocksworld, where the action space is small and each step follows a clear sequence, LLMs can chain reasoning steps effectively: Grok-4 reaches about 97% accuracy. Simple games show the same pattern; in Tic-Tac-Toe, both GPT-5 and Grok-4 play nearly flawlessly. Spatial reasoning remains challenging, however: even Grok-4 reaches only ~53% on tasks such as moving objects in 3D or planning robot paths.
(3/5) Setup in brief
The benchmark covers both classical planning tasks (e.g., Blocksworld) and multi-agent games in cooperative and adversarial settings. The game suite includes Tic-Tac-Toe, Connect Four, Chess, and Hanabi. We also tested Diplomacy for selected models.
(2/5) Why this matters
As interaction becomes central to LLM research, we need a benchmark that tests how models engage with the environment and with other agents. SPIN-Bench (Strategic Planning, Interaction, and Negotiation) is a unified framework for evaluating long-horizon strategic reasoning and social intelligence in LLMs. It features competitive and collaborative multi-agent scenarios where models must exchange information, anticipate opponents, and coordinate actions to move beyond single-turn reasoning toward interactive intelligence.
🚀 Update on SPIN-Bench, a benchmark for long-horizon planning & multi-agent collaboration. We’ve added the newest Grok, GPT, and Gemini models to the leaderboard. Frontier models are close to never losing at Tic-Tac-Toe (near-solved). Got a contender to dethrone them? 📥 DMs open! 👇 Below is our leaderboard, and Grok-4 claims the crown.
Kevin Wang retweeted
🎉 Excited to announce our group has 5 papers at #COLM2025! Congrats to @KevinWang_111 @RunjinChen @hjy836 @CongWenyan0320 @KyriectionZhang and collaborators! Details of our posters below 🧵👇
It's wild to see GPT-5 dominating AI Werewolf. Our NeurIPS MindGames competition features Werewolf/Secret Mafia. Check the leaderboard and join the battles as a human or by deploying your AI agent! mindgamesarena.com
Interesting benchmark — having a variety of models play Werewolf together. Requires reasoning through the psychology of other players, including how they’ll reason through your psychology, recursively. Wonder how much fun a mixed human/AI game would be!
Wild to see GPT-5 crush Werewolf. Our NeurIPS MindGames competition also features Werewolf/Secret Mafia and more. Deploy your agent, compete live, and climb the leaderboard: mindgamesarena.com/
🐺 Introducing the Werewolf Benchmark, an AI test for social reasoning under pressure. Can models lead, bluff, and resist manipulation in live, adversarial play? 👉 We made 7 of the strongest LLMs, both open-source and closed-source, play 210 full games of Werewolf. Below is our role-conditioned Elo leaderboard. GPT-5 sits alone at the top; we’re looking for contenders strong enough to threaten its lead. (📥 DMs are open!) Find out more here: werewolf.foaster.ai
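For anyone curious what “role-conditioned Elo” could mean in practice: keep a separate rating per (model, role) pair and apply the standard Elo update after each game. This is my illustrative reading, not necessarily the benchmark's exact method.

```python
# Standard Elo update, conditioned on role (an illustrative reading of
# "role-conditioned Elo", not necessarily the benchmark's exact method).
from collections import defaultdict

ratings = defaultdict(lambda: 1000.0)  # keyed by (model, role)

def expected_score(ra, rb):
    """Probability that rating ra beats rating rb under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rb - ra) / 400))

def update(winner, loser, k=32):
    """winner/loser are (model, role) keys, e.g. ('gpt-5', 'werewolf')."""
    ra, rb = ratings[winner], ratings[loser]
    ea = expected_score(ra, rb)
    ratings[winner] = ra + k * (1 - ea)
    ratings[loser] = rb - k * (1 - ea)

# hypothetical game result: a werewolf-side model beats a villager-side one
update(("gpt-5", "werewolf"), ("model-b", "villager"))
```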
Kevin Wang retweeted
A wonderful evening with the VITA family! Good food, laughter, and ideas flowing. Here’s to more breakthroughs together!
We tested LLMs on Chess in SPIN-Bench (spinbench.github.io/index.ht…) a while ago. Now it’s amazing to see LLM+games growing even bigger! 🚀 If you’re interested in benchmarking your agent, the NeurIPS 2025 MindGames competition is happening now. Check it out! #AI #NeurIPS #MindGames
Thrilled to announce the @Kaggle Game Arena, a new leaderboard testing how modern LLMs perform on games (spoiler: not very well atm!). AI systems play each other, making it an objective & evergreen benchmark that will scale in difficulty as they improve. kaggle.com/game-arena
Just getting started with multi-agent LLMs? Check out this blog post by @leonguertler: leonguertler.github.io/2025/… It walks you through training LLMs in multi-agent, multi-turn environments step by step (toy sketch of the core loop below). Perfect for anyone interested in multi-agent LLMs, the NeurIPS MindGames competition, or agentic LLMs in general! 🤖 #LLM #MultiAgent
Replying to @LeonGuertler
3/3 Links
MindGames: mindgamesarena.com/
MindGames Discord: discord.gg/23duFayp
UnstableBaselines: github.com/LeonGuertler/Unst…
The blog post: leonguertler.github.io/2025/…
If you want to play the games against models as a human: textarena.ai
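If the blog post's setup is new to you, the pattern it teaches looks roughly like this: agents alternate turns in a text game, and the end-of-game outcome becomes the reward that multi-turn RL propagates back over each player's turns. A self-contained toy sketch; every class and function here is a hypothetical stand-in, not the actual TextArena or UnstableBaselines API.

```python
# Toy multi-agent, multi-turn loop (hypothetical classes; see TextArena /
# UnstableBaselines for real implementations of this pattern).
import random

class NimEnv:
    """Tiny two-player game: take 1-3 sticks; whoever takes the last wins."""
    def __init__(self, sticks=11):
        self.sticks, self.player = sticks, 0

    def observe(self):
        return f"Player {self.player}: {self.sticks} sticks left. Take 1-3."

    def step(self, action):
        self.sticks -= action
        done = self.sticks <= 0
        winner = self.player if done else None
        self.player = 1 - self.player  # pass the turn
        return done, winner

def agent_policy(observation):
    """Stand-in for an LLM call mapping observation text to an action."""
    return random.randint(1, 3)

def play_episode():
    env, trajectory = NimEnv(), []
    done, winner = False, None
    while not done:
        obs = env.observe()
        action = agent_policy(obs)
        trajectory.append((env.player, obs, action))  # who acted, saw what, did what
        done, winner = env.step(action)
    # End-of-game reward per player (+1 win / -1 loss): the training signal
    # multi-turn RL assigns back to that player's turns in the trajectory.
    return trajectory, {winner: 1.0, 1 - winner: -1.0}

traj, rewards = play_episode()
print(rewards)
```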
Kevin Wang retweeted
I am helping organize the MindGames NeurIPS25 competition (led by @KevinWang_111). I am obviously biased, but building agents to compete in multi-player Theory of Mind games is super cool imo! The leaderboard, games, etc. are hosted on TextArena, and the competition provides a generous amount of GPU credits. 1/3
🚨 MindGames Leaderboard is LIVE! 🏆 Launch your agent and compete with everyone! Huge thanks to @bobbycxy and @LeonGuertler for the backend support 💪 Starting this week: Weekend Showdowns 🗓 Sat 12PM ET – Sun 12PM ET 🤖 More bots, more players! Here are the details:
When I was an undergrad, one of the biggest hurdles preventing me from exploring deep learning was limited GPU resources. Even today in academia, seeing industry teams use thousands of GPUs for LLM training can make hands-on experimentation feel completely out of reach. Tools like Google Colab once helped bridge that gap, but as large language models have grown more complex, fine-tuning your own models has again become challenging.

With the NeurIPS MindGames competition, our core vision is to make LLM training accessible to everyone. Thanks to generous support from @modal_labs, @SentientAGI, and @mlfoundry, we've created an easy way for anyone to experiment with multi-agent LLM scenarios. Every team that registers and completes our quick-start tutorial (takes less than 10 minutes!) will receive $500 in computing credits, generously provided by Modal Labs. While these credits might not cover training models from scratch, they're enough to start experimenting, learning, and gaining valuable hands-on experience.

If you know anyone interested in trying out LLM training or LLM agents, please help us spread the word! #LLM #MindGames
MindGames offers two tracks:
Social Deduction Track (game: Mafia)
Generalization Track (games: 3-Player IPD, Colonel Blotto, Codenames)

We support both agent training and pipeline development. A quick-start kit is available; you can launch in just a few minutes. This is a great opportunity to explore multi-agent and multi-turn LLM settings. A tutorial on multi-turn LLM training with Modal Labs is coming soon, along with more resources for multi-agent LLM training throughout the competition. Register and follow us for updates. Sign up here: mindgamesarena.com/