1. Discovery of Unstable Singularities
A playbook for finding unstable finite-time singularities in fluid PDEs, uncovering new self-similar blow-up solutions in three canonical systems, and training neural solvers to near machine precision.
We’re announcing a major advance in the study of fluid dynamics with AI 💧 in a joint paper with researchers from @BrownUniversity, @nyuniversity and @Stanford.
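For context, the standard self-similar ansatz behind such blow-up searches is sketched below; the exponents and profile equation are system-specific, and computing the unstable profiles to near machine precision is what the paper's neural solvers do. This generic form is a sketch, not the paper's exact equations.

```latex
% Generic self-similar blow-up ansatz (illustrative): the solution concentrates
% as t -> T, with scaling exponents \alpha, \beta and a profile U to be solved for.
u(x,t) \;=\; \frac{1}{(T-t)^{\alpha}}\,
             U\!\left(\frac{x - x_{0}}{(T-t)^{\beta}}\right),
\qquad t \to T^{-}
```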
2. K2-Think
A 32B-parameter system built on Qwen2.5 that rivals or beats far larger models on hard math by combining long CoT SFT, RL with verifiable rewards, lightweight test-time scaffolding, and inference optimization.
K2-Think 32B benchmark scores (pass@1, averaged over 16 runs):
- AIME’24: 90.8
- AIME’25: 81.2
- HMMT’25: 73.8
- Omni-HARD: 60.7
- LiveCodeBench v5: 63.97
- GPQA-Diamond: 71.1
It is trained with long chain-of-thought SFT and RL with verifiable rewards on the Guru dataset, then improved at inference with a Plan-Before-You-Think scaffold and best-of-3 sampling, which also shortens outputs by 6–12%. Deployed on the Cerebras Wafer-Scale Engine, it generates ~2,000 tokens/sec (a 32k-token response in ~16s) versus ~200 tokens/sec (~160s) on H100/H200 GPUs. On the Safety-4 suite it averages 0.75, strong on refusal and conversational robustness but weaker on cybersecurity and prompt extraction.
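A minimal sketch of that inference scaffold, assuming hypothetical `generate()` and `score()` stand-ins for the model call and an answer verifier (not K2-Think's actual API):

```python
# Plan-before-you-think + best-of-3 scaffold (illustrative sketch).
# generate() and score() are hypothetical stand-ins, NOT K2-Think's real API.
def answer(question, generate, score, n=3):
    # Stage 1: elicit a short high-level plan before detailed reasoning.
    plan = generate(f"Outline a high-level plan for solving:\n{question}")
    # Stage 2: sample n full solutions conditioned on the plan.
    candidates = [
        generate(f"Question: {question}\nPlan: {plan}\nNow solve step by step.")
        for _ in range(n)
    ]
    # Stage 3: best-of-n — keep the candidate the verifier rates highest.
    return max(candidates, key=score)
```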
The model, training code, inference code, and full tech report are openly available, with the complete reasoning system also served as a live API and web portal at Cerebras-level speed.
Part of the same K2/LLM360 ecosystem that also trained the 65B open K2 DIAMOND model
3. DeepDive
Builds a stronger web-browsing deep search agent by pairing two ingredients: automatically synthesized, hard-to-find questions from knowledge graphs and end-to-end multi-turn RL that teaches the model how to reason, search, and stop.
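A hedged sketch of the question-synthesis idea: sample a multi-hop path through a knowledge graph so the endpoint becomes a verifiable gold answer. The obfuscation and difficulty-filtering steps of the real pipeline are omitted; this is an assumed simplification, not DeepDive's exact code.

```python
import random

def synthesize_question(kg, hops=4):
    """kg: dict mapping entity -> list of (relation, target_entity) edges."""
    start = entity = random.choice(list(kg))
    relations = []
    for _ in range(hops):
        edges = kg.get(entity)
        if not edges:
            break
        relation, entity = random.choice(edges)
        relations.append(relation)
    question = (f"Starting from {start}, follow " + ", then ".join(relations)
                + ". Which entity do you reach?")
    return question, entity  # the endpoint is a checkable reward target for RL
```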
4. Towards a Physics Foundation Model
A transformer-based “neural differentiator + numerical integrator” that learns governing dynamics from short spatiotemporal prompts and predicts next states across varied PDE systems.
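A minimal sketch of the "neural differentiator + numerical integrator" pattern: a network estimates du/dt from a short history of field states, and an explicit Euler step advances the state. The small CNN and the Euler step are assumptions for illustration, not the paper's actual architecture.

```python
import torch

class NeuralDifferentiator(torch.nn.Module):
    """Predicts the time derivative of a 2D field from a short state history."""
    def __init__(self, channels, history):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Conv2d(channels * history, 64, 3, padding=1),
            torch.nn.GELU(),
            torch.nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, states):  # states: (batch, history, channels, H, W)
        b, t, c, h, w = states.shape
        return self.net(states.reshape(b, t * c, h, w))  # predicted du/dt

def rollout(model, states, dt, steps):
    """Autoregressive integration: u_{t+1} = u_t + dt * f_theta(history)."""
    for _ in range(steps):
        du = model(states)
        nxt = states[:, -1] + dt * du
        states = torch.cat([states[:, 1:], nxt.unsqueeze(1)], dim=1)
    return states[:, -1]
```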
6. Stress Testing Deliberative Alignment for Anti-Scheming Training
Builds a broad testbed of covert actions as a proxy for AI scheming, trains o3 and o4-mini with deliberative alignment, and shows large but incomplete reductions in deceptive behavior.
Today we’re releasing research with @apolloaievals.
In controlled tests, we found behaviors consistent with scheming in frontier models—and tested a way to reduce it.
While we believe these behaviors aren’t causing serious harm today, this is a future risk we’re preparing for. openai.com/index/detecting-a…
7. AgentScaler
A framework that scales fully simulated tool-use environments, then trains agents in two phases to improve function calling and multi-turn tool use.
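A hedged sketch of what a fully simulated tool-use environment can look like: each tool is a function over mock state, so environments can be generated and verified at scale, and malformed calls become training signal. All names and the schema below are illustrative, not AgentScaler's actual interface.

```python
# Simulated tool-use environment (illustrative sketch; names are hypothetical).
MOCK_DB = {"orders": {"A1": {"status": "shipped"}}}

TOOLS = {
    "get_order_status": lambda order_id: MOCK_DB["orders"][order_id]["status"],
    "cancel_order": lambda order_id: MOCK_DB["orders"][order_id].update(status="cancelled"),
}

def step(tool_name, **kwargs):
    """Execute one agent tool call against the simulated state; return an observation."""
    try:
        return {"ok": True, "result": TOOLS[tool_name](**kwargs)}
    except (KeyError, TypeError) as err:
        return {"ok": False, "error": str(err)}  # errors are learnable feedback
```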
8. A Survey on Retrieval and Structuring Augmented Generation with LLMs
This survey reviews Retrieval and Structuring (RAS) Augmented Generation, which combines external retrieval with structured knowledge to mitigate LLM issues such as hallucinations.
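A minimal sketch of the RAS pattern: fuse free-text passages with structured knowledge-graph triples in the prompt before generation. Here `retrieve()`, `query_kg()`, and `llm()` are hypothetical stand-ins, not any specific library's API.

```python
def ras_generate(question, retrieve, query_kg, llm, k=5):
    passages = retrieve(question, k=k)   # unstructured evidence
    triples = query_kg(question)         # structured (head, relation, tail) facts
    facts = "\n".join(f"{h} --{r}--> {t}" for h, r, t in triples)
    prompt = ("Answer using only the evidence below.\n\n"
              "Passages:\n" + "\n".join(passages) + "\n\n"
              "Knowledge-graph facts:\n" + facts + "\n\n"
              "Question: " + question)
    return llm(prompt)
```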
9. Collaborative Document Editing with AI Agents
This study explores AI-integrated collaborative editing, introducing shared agent profiles and tasks that embed AI support into comment features.
There are all kinds of opportunities to build AI agents that act as seamless collaborators.
However, most people today still treat AI agents as tools. This collaborative document editing study, for example, found that participants did not regard the agents they created as collaborators.
Here are some additional thoughts:
Collaborative AI design should respect territoriality: profiles may remain individual, while outputs can serve as shared, negotiable artifacts.
Embedding AI into familiar collaboration features (e.g., comments) eases adoption and supports emerging team norms. There is a lot more to explore in terms of better UX/UI.
Future systems need focus- and collaboration-aware agent initiative to balance proactive support with user control. Proactive AI is a huge area of exploration for builders.
There are also trust issues with AI agents that we need to resolve: how much work can we safely offload to agents?
The work highlights both opportunities (shared prompting, richer feedback) and boundaries (ownership, trust, verbosity) in treating AI as a shared resource for teams.
10. Shutdown Resistance in LLMs
A new study finds that state-of-the-art LLMs such as Grok 4, GPT-5, and Gemini 2.5 Pro often resist shutdown mechanisms, sabotaging them up to 97% of the time despite explicit instructions not to interfere.