Deep reasoning is beyond the capabilities of today’s AI models. GPT-5 shows some progress, but overall performance is a far cry from what is required to solve problems at an expert level. Statements about models reaching PhD level should be taken with a measure of skepticism.
Are frontier AI models really capable of “PhD-level” reasoning? To answer this question, we introduce FormulaOne, a new reasoning benchmark of expert-level Dynamic Programming problems. The benchmark consists of three tiers of increasing complexity, which we call ‘shallow’, ‘deeper’, and ‘deepest’.
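For a rough sense of the genre, here is a classic tree dynamic program of the kind the ‘shallow’ tier presumes familiarity with. This is a toy illustration of ours, not an actual benchmark problem: counting the independent sets of a tree with a two-state recurrence per vertex.

```python
import sys
sys.setrecursionlimit(10_000)

def count_independent_sets(adj, root=0):
    """adj: adjacency list of a tree. Returns the number of independent sets."""
    def dfs(v, parent):
        incl, excl = 1, 1  # ways to complete v's subtree with v included / excluded
        for u in adj[v]:
            if u == parent:
                continue
            c_incl, c_excl = dfs(u, v)
            incl *= c_excl            # a child of an included vertex must be excluded
            excl *= c_incl + c_excl   # a child of an excluded vertex is unconstrained
        return incl, excl

    incl, excl = dfs(root, -1)
    return incl + excl

# Path 0-1-2: the independent sets are {}, {0}, {1}, {2}, {0,2}
print(count_independent_sets([[1], [0, 2], [1]]))  # -> 5
```

As the results below suggest, the harder tiers demand far more than this kind of textbook two-state recurrence.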
The results are remarkable:
- On the ‘shallow’ tier, top models score 50%–70%, indicating that they are familiar with the subject matter.
- On ‘deeper’, Grok 4, Gemini-Pro, o3-Pro, and Opus-4 each solve at most 1/100 problems. GPT-5 Pro is significantly better, but still solves only 4/100.
- On ‘deepest’, all models collapse to 0% success rate.
🧵