MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the 𝜏²-Bench Telecom agentic benchmark and is potentially the new leading open weights model. Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release in @Kimi_Moonshot's Kimi K2 model family, following the non-reasoning Kimi K2 Instruct models released in July and September 2025.

Key takeaways:

➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct, and this new Thinking variant appears to make substantial gains.

➤ Reasoning variant of Kimi K2 Instruct: As its name suggests, the model is a reasoning variant of Kimi K2 Instruct. It has the same architecture and the same number of parameters (though different precision) as Kimi K2 Instruct and, like K2 Instruct, supports only text as an input (and output) modality.

➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot's prior Kimi K2 Instruct releases, which used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization-aware training in the post-training phase to achieve this. As a result, K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not support FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.

Our full set of Artificial Analysis Intelligence Index benchmarks is in progress and we will provide an update as soon as they are complete.
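A quick back-of-envelope check on the size figures in the post: weight-only memory scales linearly with bits per parameter. A minimal sketch, assuming a flat 1e12 parameters (the reported ~594GB checkpoint runs larger than this lower bound, plausibly because some tensors are kept at higher precision):

```python
def approx_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return n_params * bits_per_param / 8 / 1e9

N = 1.0e12  # ~1T total parameters

print(f"INT4: ~{approx_weight_gb(N, 4):.0f} GB")  # lower bound for K2 Thinking
print(f"FP8:  ~{approx_weight_gb(N, 8):.0f} GB")  # lower bound for K2 Instruct
```

At 4 bits the floor is ~500GB versus ~1000GB at 8 bits, consistent with the reported ~594GB versus just over 1TB once higher-precision tensors and metadata are included.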

Nov 6, 2025 · 9:10 PM UTC

Full results are now live on artificialanalysis.ai/ Post with analysis of the results here:
Kimi K2 Thinking is the new leading open weights model: it demonstrates particular strength in agentic contexts but is very verbose, generating the most tokens of any model in completing our Intelligence Index evals.

@Kimi_Moonshot's Kimi K2 Thinking achieves a 67 in the Artificial Analysis Intelligence Index. This positions it clearly above all other open weights models, including the recently released MiniMax-M2 and DeepSeek-V3.2-Exp, and second only to GPT-5 amongst proprietary models. It used the highest number of tokens ever across the evals in the Artificial Analysis Intelligence Index (140M), but with Moonshot's official API pricing of $0.6/$2.5 per million input/output tokens (for the base endpoint), the overall Cost to Run Artificial Analysis Intelligence Index comes in cheaper than leading frontier models at $356. Moonshot also offers a faster turbo endpoint priced at $1.15/$8 (driving a Cost to Run Artificial Analysis Intelligence Index result of $1172 for the turbo endpoint - second only to Grok 4 as the most expensive model). The base endpoint is very slow at ~8 output tokens/s, while the turbo is somewhat faster at ~50 output tokens/s.

The model is one of the largest open weights models ever at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release in Moonshot AI's Kimi K2 model family, following the non-reasoning Kimi K2 Instruct models released in July and September 2025. Moonshot AI refers only to post-training in their announcement. This release highlights the continued trend of post-training, and specifically RL, driving performance gains for reasoning models and in long horizon tasks involving tool calling.

Key takeaways:

➤ Details: text only (no image input), 256K context window, natively released in INT4 precision, 1T total parameters with 32B active (~594GB)

➤ New leader in open weights intelligence: Kimi K2 Thinking achieves a 67 in the Artificial Analysis Intelligence Index. This is the highest open weights score yet and significantly higher than gpt-oss-120b (61), MiniMax-M2 (61), Qwen3 235B A22B 2507 (57) and DeepSeek-V3.2-Exp (57). This release continues the trend of open weights models closely following proprietary models in intelligence achieved.

➤ China takes back the open weights frontier: Releases from China-based AI labs have led in open weights intelligence for most of the past year. OpenAI's gpt-oss-120b release in August 2025 briefly took back the leadership position for the US. Moonshot AI's K2 Thinking reclaims the leading open weights model mantle for China-based AI labs.

➤ Strong agentic performance: Kimi K2 Thinking demonstrates particular strength in agentic contexts, as showcased by its #2 position in the Artificial Analysis Agentic Index, where it is second only to GPT-5. This is mostly driven by K2 Thinking achieving 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct, and this new Thinking variant appears to make substantial gains.

➤ Top open weights coding model, but behind proprietary models: K2 Thinking does not score a win in any of our coding evals - it lands in 6th place in Terminal-Bench Hard, 7th place in SciCode and 2nd place in LiveCodeBench. Compared to open weights models, it is first or first-equal in each of these evals, and therefore comes in ahead of previous open weights leader DeepSeek V3.2 in our Artificial Analysis Coding Index.

➤ Biggest leap for open weights in Humanity's Last Exam: K2 Thinking's strongest results include Humanity's Last Exam, where we measured a score of 22.3% (no tools) - an all-time high for open weights models, behind only GPT-5 and Grok 4.

➤ Verbosity: Kimi K2 Thinking is very verbose - 140M total tokens were used to run our Intelligence Index evaluations, ~2.5x the number of tokens used by DeepSeek V3.2 and ~2x compared to GPT-5. This high verbosity drives both higher cost and higher latency compared to less verbose models. On Moonshot's base endpoint, K2 Thinking is 2.5x cheaper than GPT-5 (high) but 9x more expensive than DeepSeek V3.2 (Cost to Run Artificial Analysis Intelligence Index).

➤ Reasoning variant of Kimi K2 Instruct: As its name suggests, the model is a reasoning variant of Kimi K2 Instruct. It has the same architecture and the same number of parameters (though different precision) as Kimi K2 Instruct. It continues to support only text inputs and outputs.

➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot's prior Kimi K2 Instruct releases, which used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization-aware training in the post-training phase to achieve this. As a result, K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not support FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.

➤ Access: The model is available on @huggingface with a modified MIT license. @Kimi_Moonshot is serving an official API (available globally) and third party inference providers are already launching endpoints - including @basetenco, @FireworksAI_HQ, @novita_labs, @parasail_io
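The Cost to Run figures above are just token counts multiplied by per-million-token prices. A minimal sketch of that arithmetic, assuming a hypothetical input/output split (the post reports only the ~140M total; the actual breakdown behind the published $356 and $1172 figures is not stated):

```python
def eval_cost_usd(input_tokens: float, output_tokens: float,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """API cost in USD: token counts (scaled to millions) times prices."""
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# Hypothetical split of the ~140M tokens used for the Intelligence Index runs;
# the post does not state the actual input/output breakdown.
in_tok, out_tok = 10e6, 140e6

print(f"base:  ~${eval_cost_usd(in_tok, out_tok, 0.60, 2.50):.0f}")
print(f"turbo: ~${eval_cost_usd(in_tok, out_tok, 1.15, 8.00):.0f}")
```

The turbo endpoint's much higher output price is what drives its Cost to Run figure toward Grok 4 territory despite identical token usage.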
Replying to @ArtificialAnlys
@grok how many mac minis would you need to run this model?
Replying to @ArtificialAnlys
Crushing agentic benchmarks with a trillion-parameter beast shrunk to pocket-sized inference proves open weights just sprinted past closed giants while wasting zero taxpayer dollars.
Replying to @ArtificialAnlys
Kimi K2 Thinking just dropped — pushing a “reasoning + MoE” direction for coding and logic. It’s promising, but GPT-5-Codex still leads with true self-verification, and Claude Sonnet 4.5 keeps its edge in consistency. Still, open-source is catching up fast. Some might say… it’s already won. 😉
Replying to @ArtificialAnlys
INT4 precision for a 1T parameter model feels like a bold move. The efficiency gains are clear, but I wonder about the trade-offs in precision for reasoning tasks. Given the focus on long-horizon agentic tasks, how does its performance compare to FP8-based models in real-world deployments? The telecom benchmark score is impressive, but I'd love to see how robust it is in more nuanced scenarios.
Replying to @ArtificialAnlys
I love Kimi K2 and have been quite excited about the Thinking variant. I also like Artificial Analysis; their analysis and leaderboard are well put together and generally give a good impression of how capable a model is overall. But these benchmark scores just seem cooked to me. Has anyone looked closely at actual tau bench samples to see what they actually measure? In my experience with these models on agentic tool use, the Apriel model sucks. It's basically useless - it should be below everything in this chart except possibly Nemotron, which I haven't tried. Grok 4 Fast and Haiku 4.5 should be quite a bit higher up; they're good at tools. Opus and Sonnet 4.5 should be much higher up - they are the best models at agentic tool use, possibly tied with Codex but at the top. Grok 4 should be farther down. gpt-oss is okay placed, maybe slightly too high. The scores kind of make me think this benchmark somehow measures personal and creative writing ability, because that's the one thing the Apriel model is not terrible at - and the models that score very low, e.g. the DeepSeek ones or gpt-oss, suck at that even though they're fairly smart and nimble overall.
Replying to @ArtificialAnlys
AGI running on a 512GB Mac Studio is wild! Why do they need trillions of $?
After 2 years these benchmarks just make me glaze over
Replying to @ArtificialAnlys
I’ve come to the point where I’m more excited on the release of an open weights model than generic LLMs; fine-tuning them is the way to go
Replying to @ArtificialAnlys
Absolutely insane performance
Replying to @ArtificialAnlys
OpenAI moves extremely fast, much faster than many companies and individuals can adapt and integrate new capabilities into what they do. xAI moves faster. Chinese LLMs develop faster... It's like The Flash being overtaken by some new heroes.
Replying to @ArtificialAnlys
Impressive milestone 👏
Replying to @ArtificialAnlys
Kimi K2’s leap is agentic future in motion—AgentFi vibes all the way
Replying to @ArtificialAnlys
Cool result on τ²-Bench. As a quick smell test, I ran a Markov-chain probability task I use for sanity checks. K2-Thinking spun in loops; Grok-4 and GPT both solved it cleanly. Benchmarks ≠ lived problem-solving.
Replying to @ArtificialAnlys
A trillion-parameter model with 32B active, INT4 precision, and 93% on Tau²-Bench Telecom puts it ahead of every public alternative. MoonshotAI is showing that you don't need closed weights or trillion-dollar budgets to reach frontier-level reasoning. It's also a signal that open models are entering their "reasoning era": smaller, faster, and smarter, optimized for real agentic use. The line between proprietary and open intelligence is getting thinner by the month.
Replying to @ArtificialAnlys
Can it run on 8x mi350 amd @grok
Replying to @ArtificialAnlys
Int4 experts are a downer. Nothing that a few epochs of post-training after fp4 conversion couldn't solve though. Relevant paper comparing the diff arxiv.org/html/2510.25602v1
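For context on what an INT4 conversion like the one discussed here involves, a minimal NumPy sketch of generic symmetric per-tensor INT4 quantization - an illustration of the general technique only, not Moonshot's actual quantization-aware training recipe (production schemes typically quantize per-group or per-channel):

```python
import numpy as np

def quantize_int4_symmetric(w: np.ndarray):
    """Map floats to integers in [-7, 7] with one shared scale (symmetric, per-tensor)."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the INT4 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.02, -0.51, 0.33, 0.07], dtype=np.float32)
q, s = quantize_int4_symmetric(w)
w_hat = dequantize(q, s)
print(q, float(np.abs(w - w_hat).max()))  # worst-case round-trip error stays within scale/2
```

Quantization-aware training, as referenced in the linked paper, lets the model adapt its weights to this rounding during training rather than absorbing the error after the fact.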
Replying to @ArtificialAnlys
why exclude the airline bench in the agentic index? Also perhaps it's time for a new benchmark, not much signal left in this benchmark for frontier models. Too saturated.
Replying to @ArtificialAnlys
1T parameters open-source is a game-changer for the entire AI community. This democratization of large-scale reasoning models could accelerate innovation across research institutions and smaller companies who previously couldn't access such capabilities.
Replying to @ArtificialAnlys
if only i could use it. its been failing to connect via api all day for me.
Replying to @ArtificialAnlys
holy fuck
Replying to @ArtificialAnlys
I wonder if that massive size unlock new opportunities for devs
Replying to @ArtificialAnlys
Forget the #1 spot, that changes every week. The real story is the INT4. They built a 1T model that can actually run efficiently, and that's the real genius. 🧠
Replying to @ArtificialAnlys
INT4 precision means more models, less hardware sweat. Moonshot just handed smaller labs a cheat code.
Replying to @ArtificialAnlys
Now we are talking. Kimi was an amazing model already.
Replying to @ArtificialAnlys
this is really amazing how open source models are getting closer to enterprise models
Replying to @ArtificialAnlys
Kimi K2 Thinking is a game-changer!
Replying to @ArtificialAnlys
@grok could you rank this IA by country?
Replying to @ArtificialAnlys
K2 Thinking tops Tau2-Bench Telecom with 1T parameters for agentic tool use. Open weights challenge proprietary reasoning models at scale. Can telecom-specific benchmarks generalize or overfit to domain patterns?
Replying to @ArtificialAnlys
I would like to see Qwen3-Max reasoning in the list. I have access to it, but it is not in your benchmark list.
Replying to @ArtificialAnlys
With a fraction of the cost of OpenAI, this level of performance can be achieved. How can Altman convince others to secure funding in the hundreds of billions or trillions of dollars?
Replying to @ArtificialAnlys
They took their time and didn't rush the base model, then took their time again to make the reasoning model. And it's massive. Of course it was going to be this good.
Replying to @ArtificialAnlys
@AskPerplexity how can I use this without running it locally?