gary IH fung · Nov 7, 2025 · 1:58 AM UTC


gary IH fung

@garyfung

Nov 7

Damn. The original Kimi was already on par with or at SOTA on creative writing (x.com/garyfung/status/194667…), so the gap is likely widening. It remains to be seen how good this is at coding, and whether it beats MiniMax M2. ❤️ seeing open weights winning or closely following ClosedAI

Karmay

@karmay007

Nov 6

From my tests, Kimi K2 Thinking is better than everything xAI, Anthropic, and Google have to offer atm. The only things better than it are GPT-5 Codex (at code) and GPT-5 Pro (at high-level algorithm design). It beats the SOTA at creative writing by a mile. Good work @crystalsssup!

gary IH fung · Nov 7, 2025 · 2:55 AM UTC


gary IH fung

@garyfung

Nov 7

Holy shit, it beats Grok 4 Heavy on HLE too, using Kimi K2 Thinking's own heavy mode

elie

@eliebakouch

Nov 6

ok we're at 51% with "heavy" mode

> Heavy Mode: K2 Thinking Heavy Mode employs an efficient parallel strategy: it first rolls out eight trajectories simultaneously, then reflectively aggregates all outputs to generate the final result.
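The rollout-then-aggregate pattern in that quote can be sketched roughly as follows. This is a minimal illustration, not Moonshot's implementation: `heavy_mode`, `toy_rollout`, and `majority_vote` are hypothetical names, the rollout is a stub rather than a model call, and a simple majority vote is swapped in for the "reflective" aggregation the quote describes, which presumably has the model itself write the final synthesis.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def heavy_mode(
    prompt: str,
    rollout: Callable[[str, int], str],
    aggregate: Callable[[str, List[str]], str],
    n_trajectories: int = 8,
) -> str:
    """Run n independent rollouts in parallel, then merge them into one answer."""
    with ThreadPoolExecutor(max_workers=n_trajectories) as pool:
        # Each rollout gets its own seed so the trajectories can diverge.
        futures = [pool.submit(rollout, prompt, seed) for seed in range(n_trajectories)]
        candidates = [f.result() for f in futures]
    return aggregate(prompt, candidates)


def toy_rollout(prompt: str, seed: int) -> str:
    # Hypothetical stand-in for a model call: most seeds agree, a few do not.
    return "42" if seed % 4 != 0 else "41"


def majority_vote(prompt: str, candidates: List[str]) -> str:
    # Stand-in aggregator (assumption): pick the most common candidate.
    return Counter(candidates).most_common(1)[0][0]


answer = heavy_mode("What is 6 * 7?", toy_rollout, majority_vote)
```

Passing the rollout and aggregator in as functions keeps the parallel structure separate from whatever model backs it, so a real aggregator could be a second model call that reads all eight candidates.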


gary IH fung · Nov 7, 2025 · 3:02 AM UTC

gary IH fung

@garyfung

Nov 7

Since you already have Kimi K2 non-thinking in your garden, when is K2 Thinking getting added, @windsurf? GPT-5 high-reasoning-level intelligence with higher speed and lower cost: want! grok.com/share/bGVnYWN5_de3c…

gary IH fung · Nov 7, 2025 · 3:31 AM UTC

gary IH fung

@garyfung

Nov 7

This one-shotted a Word clone? 🤯 How much more can this thing do when iterating with me? Functional app showcase at moonshotai.github.io/Kimi-K2…

gary IH fung · Nov 7, 2025 · 3:54 AM UTC

gary IH fung

@garyfung

Nov 7

For agentic coding: an updated table of additional models I care about using, with direct or normalized scores on
- SWE-bench Verified
- Terminal-Bench

according to Kimi kimi.com/share/19a5c71b-18e2…, made while I also test its agentic web search capability. A bit better on factuality and comprehensiveness than Grok, which has been my daily driver!

gary IH fung · Nov 7, 2025 · 9:43 AM UTC

gary IH fung

@garyfung

Nov 7

Reranked with τ²-Bench Telecom added. The weighted score is based on 50% SWE-bench Verified, 25% Terminal-Bench, and 25% τ²-Bench Telecom. With that, the top 3 models are nearly neck and neck for agentic coding: a mix of intelligent "SWE" problem solving and coding plus reliable tool calls
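The weighting above amounts to a simple linear blend. A minimal sketch, assuming all three benchmarks are already on the same 0-100 scale; the model names and scores below are placeholders for illustration only, not results from the table:

```python
# Weights as stated in the post: 50% / 25% / 25%.
WEIGHTS = {
    "swe_bench_verified": 0.50,
    "terminal_bench": 0.25,
    "tau2_bench_telecom": 0.25,
}


def weighted_score(scores: dict) -> float:
    """Blend per-benchmark scores (assumed 0-100) into one number."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing benchmark scores: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def rerank(models: dict) -> list:
    """Return model names sorted from highest to lowest weighted score."""
    return sorted(models, key=lambda name: weighted_score(models[name]), reverse=True)


# Placeholder inputs (hypothetical, not real benchmark results).
example = {
    "model_a": {"swe_bench_verified": 80, "terminal_bench": 40, "tau2_bench_telecom": 60},
    "model_b": {"swe_bench_verified": 70, "terminal_bench": 50, "tau2_bench_telecom": 90},
    "model_c": {"swe_bench_verified": 60, "terminal_bench": 30, "tau2_bench_telecom": 50},
}
ranking = rerank(example)
```

In this toy data `model_b` ranks above `model_a` despite a lower SWE-bench Verified score, which is the kind of reshuffle that adding a tool-calling benchmark can produce.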

gary IH fung · Nov 8, 2025 · 12:02 AM UTC

gary IH fung

@garyfung

Nov 8

AA full scores:

Artificial Analysis

@ArtificialAnlys

Nov 7

Replying to @ArtificialAnlys

Individual results across all evaluations in the Artificial Analysis Intelligence Index: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME 2025, IFBench, AA-LCR, Terminal-Bench Hard, 𝜏²-Bench Telecom

gary IH fung · Nov 9, 2025 · 4:26 AM UTC

gary IH fung

@garyfung

12h

Near SOTA at math too. How does that square with being SOTA in creative writing? Entirely different disciplines 🤔

Chase Brower

@ChaseBrowe32432

Nov 8

Kimi K2 Thinking is now the top-performing non-TTC model for math