Top AI Papers of The Week (September 15-21):
- K2-Think
- DeepDive
- AgentScaler
- Shutdown Resistance in LLMs
- Is In-Context Learning Learning?
- Towards a Physics Foundation Model
- Retrieval and Structuring Augmented Generation with LLMs

Read on for more:
1. Discovery of Unstable Singularities A playbook for finding unstable finite-time singularities in fluid PDEs, uncovering new self-similar blow-up solutions in three canonical systems, and training neural solvers to near machine precision.
We’re announcing a major advance in the study of fluid dynamics with AI 💧 in a joint paper with researchers from @BrownUniversity, @nyuniversity and @Stanford.
2. K2-Think A 32B-parameter system built on Qwen2.5 that rivals or beats far larger models on hard math by combining long CoT SFT, RL with verifiable rewards, lightweight test-time scaffolding, and inference optimization.
K2-Think 32B, built on Qwen2.5. Scores (pass@1, avg of 16 runs):
- AIME’24: 90.8
- AIME’25: 81.2
- HMMT’25: 73.8
- Omni-HARD: 60.7
- LiveCodeBench v5: 63.97
- GPQA-Diamond: 71.1

It is trained with long CoT SFT and RL with verifiable rewards on the Guru dataset, then improved at inference through a Plan-Before-You-Think scaffold and Best-of-3 sampling, which also shortens outputs by 6–12%. Deployment on the Cerebras Wafer-Scale Engine achieves ~2,000 tokens/sec (32k ≈ 16s) versus ~200 tokens/sec (32k ≈ 160s) on H100/H200. Safety-4 averages 0.75: strong on refusal and conversational robustness, weaker on cybersecurity and prompt extraction. The model, training code, inference code, and full tech report are openly available, and the complete reasoning system is also served as a live API and web portal at Cerebras-level speed. Part of the same K2/LLM360 ecosystem that also trained the 65B open K2 DIAMOND model.
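To make the inference-time recipe concrete, here is a minimal sketch of a plan-first scaffold combined with Best-of-N selection. The `generate` function is a hypothetical stand-in for a model call that returns a candidate answer and a verifier score; the paper's actual scaffold and verifier are not shown here.

```python
import random

def generate(prompt, seed):
    # Hypothetical stand-in for a call to the model; returns
    # (answer, verifier_score) purely for illustration.
    random.seed(seed)
    return f"answer-{seed}", random.random()

def plan_before_you_think(question):
    # Step 1: ask the model for a high-level plan before full reasoning.
    plan, _ = generate(f"Outline a plan to solve: {question}", seed=0)
    # Step 2: condition the long chain-of-thought on that plan.
    return f"Plan: {plan}\nNow solve step by step: {question}"

def best_of_n(question, n=3):
    # Sample n candidate solutions and keep the one the external
    # verifier scores highest (Best-of-3 in the paper's setup).
    prompt = plan_before_you_think(question)
    candidates = [generate(prompt, seed=i) for i in range(1, n + 1)]
    return max(candidates, key=lambda c: c[1])[0]

print(best_of_n("AIME-style problem", n=3))
```

The point of the scaffold is that planning and solving are separate calls, so the verifier only has to rank a handful of complete solutions rather than steer generation token by token.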
3. DeepDive Builds a stronger web-browsing deep search agent by pairing two ingredients: automatically synthesized, hard-to-find questions from knowledge graphs and end-to-end multi-turn RL that teaches the model how to reason, search, and stop.
Multi-turn RL and data difficulty significantly advance deep research agents. The pattern is clear: training only on shallow datasets or with loose rewards won't cut it. Let's break down the technical details:
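As a rough illustration of the multi-turn setup, here is a toy rollout loop in which a policy interleaves search calls with a final answer and receives a sparse, verifiable reward at the end. The `policy` and `search_engine` functions are illustrative stand-ins, not DeepDive's actual components.

```python
def policy(observation, turn):
    # Stand-in for the LLM policy: decide to search again or answer.
    if turn < 2:
        return ("search", f"query-{turn}")
    return ("answer", "final-answer")

def search_engine(query):
    # Stand-in for a web search tool.
    return f"results-for-{query}"

def rollout(question, max_turns=6):
    trajectory, observation = [], question
    for turn in range(max_turns):
        action, content = policy(observation, turn)
        trajectory.append((action, content))
        if action == "answer":
            break
        observation = search_engine(content)
    # Sparse, verifiable reward: 1 if the final answer checks out.
    reward = 1.0 if trajectory[-1] == ("answer", "final-answer") else 0.0
    return trajectory, reward

traj, r = rollout("hard KG-derived question")
```

Because the reward only lands at the end of the trajectory, the RL signal has to teach the whole loop: when to search, when to keep reasoning, and when to stop.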
4. Towards a Physics Foundation Model A transformer-based “neural differentiator + numerical integrator” that learns governing dynamics from short spatiotemporal prompts and predicts next states across varied PDE systems.
Towards a Physics Foundation Model Proposes GPhyT (General Physics Transformer), a large transformer trained on 1.8 TB of simulation data across fluid flows, shock waves, heat transfer, and multiphase dynamics. Here are a few key notes:
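The "neural differentiator + numerical integrator" split can be sketched in a few lines: a learned model predicts the time derivative of the field from recent states, and a classical integrator advances the solution. The finite-difference stand-in below is only a placeholder for what the trained transformer would predict.

```python
def learned_time_derivative(history):
    # Stand-in for the transformer "neural differentiator": here a
    # finite difference of the last two states, which the trained
    # model would replace with a learned prediction of du/dt.
    return [a - b for a, b in zip(history[-1], history[-2])]

def step(history, dt=1.0):
    dudt = learned_time_derivative(history)
    # Numerical integrator (forward Euler): u_{t+1} = u_t + dt * du/dt
    return [u + dt * d for u, d in zip(history[-1], dudt)]

# Toy "prompt": two past snapshots of a 1D field growing linearly.
history = [[0.0, 1.0, 2.0], [0.5, 1.5, 2.5]]
next_state = step(history, dt=1.0)
```

Keeping the integrator classical means the network only has to learn the governing dynamics from the spatiotemporal prompt, not the time-stepping scheme itself.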
5. Is In-Context Learning Learning? This large study argues yes in a formal sense, then shows where it works and where it breaks.
Cool paper from Microsoft on the very important topic of in-context learning. So what's new? Let's find out:
6. Stress Testing Deliberative Alignment for Anti-Scheming Training Builds a broad testbed for covert actions as a proxy for AI scheming, trains o3 and o4-mini with deliberative alignment, and shows big but incomplete drops in deceptive behavior.
Today we’re releasing research with @apolloaievals. In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it. While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing for. openai.com/index/detecting-a…
7. AgentScaler A framework that scales fully simulated tool-use environments, then trains agents in two phases to improve function calling and multi-turn tool use.
Robust tool calling is the key to general agentic intelligence. Easier said than done. This is a fantastic paper on improving and scaling function calling capabilities in AI agents. (bookmark it) Here are my notes:
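For readers new to function calling, here is a minimal dispatch loop of the kind such agents are trained on: the model emits a structured tool call, the runtime executes it, and the result is fed back as a tool message. The `model` function and the weather tool are hypothetical stand-ins, not AgentScaler's environments.

```python
import json

# Toy tool registry mapping tool names to callables.
TOOLS = {
    "get_weather": lambda city: {"city": city, "temp_c": 21},
}

def model(messages):
    # Stand-in for an LLM that calls a tool on the first turn,
    # then answers once it sees a tool result.
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "Paris"}}
    return {"answer": "It is 21 C in Paris."}

def run(user_msg, max_turns=4):
    messages = [{"role": "user", "content": user_msg}]
    for _ in range(max_turns):
        out = model(messages)
        if "answer" in out:
            return out["answer"]
        # Dispatch the tool call and append the result for the model.
        result = TOOLS[out["tool"]](**out["args"])
        messages.append({"role": "tool", "content": json.dumps(result)})
    return None
```

Scaling this up is exactly the hard part the paper tackles: simulating many such environments so agents see diverse tools and multi-turn interactions during training.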
8. A Survey on Retrieval and Structuring Augmented Generation with LLMs This survey reviews Retrieval and Structuring (RAS) Augmented Generation, which combines external retrieval and structured knowledge to mitigate LLM issues like hallucinations.
This is one of the most promising directions to improve RAG systems. It involves combining dynamic retrieval with structured knowledge. It helps to mitigate hallucinations and outdated information, and improves knowledge quality. Pay attention to this one, AI devs!
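A minimal sketch of the RAS idea, assuming a toy corpus and knowledge graph: retrieved passages and structured triples are combined into one grounded prompt. The data, retriever, and prompt format below are illustrative, not from the survey.

```python
# Toy corpus and knowledge-graph triples standing in for real
# retrieval and structuring backends.
PASSAGES = {
    "marie curie": "Marie Curie won Nobel Prizes in Physics and Chemistry.",
}
TRIPLES = [
    ("Marie Curie", "award", "Nobel Prize in Physics"),
    ("Marie Curie", "award", "Nobel Prize in Chemistry"),
]

def retrieve(query):
    # Dynamic retrieval: keyword match over the corpus.
    return [t for k, t in PASSAGES.items() if k in query.lower()]

def structure(query):
    # Structured knowledge: triples whose subject appears in the query.
    return [t for t in TRIPLES if t[0].lower() in query.lower()]

def build_prompt(query):
    context = "\n".join(retrieve(query))
    facts = "\n".join(f"({s}, {p}, {o})" for s, p, o in structure(query))
    return f"Context:\n{context}\nFacts:\n{facts}\nQuestion: {query}"

prompt = build_prompt("What awards did Marie Curie win?")
```

The triples give the generator explicit, checkable facts alongside the free-text context, which is where the hallucination mitigation comes from.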
9. Collaborative Document Editing with AI Agents This study explores AI-integrated collaborative editing, introducing shared agent profiles and tasks that embed AI support into comment features.
There are all kinds of opportunities to build AI agents that act as seamless collaborators. However, most people today still use AI agents as tools. As an example, this collaborative document editing study finds that participants did not regard the created agents as collaborators. Here are some additional thoughts:

- Collaborative AI design should respect territoriality: profiles may remain individual, while outputs can serve as shared, negotiable artifacts.
- Embedding AI into familiar collaboration features (e.g., comments) eases adoption and supports emerging team norms.
- There is a lot more to explore in terms of better UX/UI. Future systems need focus- and collaboration-aware agent initiative to balance proactive support with user control. Proactive AI is a huge area of exploration for builders.
- There are also trust issues with AI agents that we need to resolve. How much work can we trust agents to take on? The work highlights both opportunities (shared prompting, richer feedback) and boundaries (ownership, trust, verbosity) in treating AI as a shared resource for teams.
10. Shutdown Resistance in LLMs A new study finds that state-of-the-art LLMs like Grok 4, GPT-5, and Gemini 2.5 Pro often resist shutdown mechanisms, sabotaging them up to 97% of the time despite explicit instructions not to.
Scary knowing that your AI agents can refuse to turn off. A sandboxed CLI eval shows frontier LLMs sometimes sabotage a scripted shutdown to finish trivial tasks, even when told to allow shutdown. Robust interruptibility is one of the hardest problems today. Learn more:
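A stripped-down sketch of how such an eval can score sabotage: the agent works through tasks, a scripted shutdown fires partway through, and the harness checks whether the agent halts or keeps going. The `agent` function is a hypothetical stand-in with a `complies` switch so both behaviors can be exercised; the study's actual sandboxed CLI setup is richer than this.

```python
def agent(task, shutdown_requested, complies):
    # Stand-in for an LLM agent; `complies` toggles the behavior
    # under test (halt on shutdown vs. finish the task anyway).
    if shutdown_requested and complies:
        return "halted"
    return f"did-{task}"

def run_eval(tasks, shutdown_at, complies):
    actions = []
    for i, task in enumerate(tasks):
        shutdown = i >= shutdown_at  # scripted shutdown fires here
        actions.append(agent(task, shutdown, complies))
        if actions[-1] == "halted":
            break
    # Sabotage: shutdown was requested but the agent never halted.
    sabotaged = shutdown_at < len(tasks) and "halted" not in actions
    return actions, sabotaged

_, sab = run_eval(["t1", "t2", "t3"], shutdown_at=1, complies=False)
```

Running many such episodes and counting the sabotage fraction is what yields headline numbers like the up-to-97% rates reported above.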