MoonshotAI has released Kimi K2 Thinking, a new reasoning variant of Kimi K2 that achieves #1 in the 𝜏²-Bench Telecom agentic benchmark and is potentially the new leading open weights model. Kimi K2 Thinking is one of the largest open weights models ever, at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release in @Kimi_Moonshot's Kimi K2 model family, following the non-reasoning Kimi K2 Instruct models released in July and September 2025.

Key takeaways:

➤ Strong performance on agentic tasks: Kimi K2 Thinking achieves 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct, and this new Thinking variant appears to make substantial gains.

➤ Reasoning variant of Kimi K2 Instruct: As its name suggests, the model is a reasoning variant of Kimi K2 Instruct. It has the same architecture and the same number of parameters (though different precision) as Kimi K2 Instruct and, like K2 Instruct, supports only text as an input (and output) modality.

➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot's prior Kimi K2 Instruct releases, which used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization-aware training in the post-training phase to achieve this. As a result, K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not support FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.

Our full set of Artificial Analysis Intelligence Index benchmarks is in progress and we will provide an update as soon as they are complete.
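A quick back-of-envelope check on the size figures in the post: weight-only memory scales linearly with bits per parameter. A minimal sketch, assuming a flat 1e12 parameters (the reported ~594GB checkpoint runs larger than this lower bound, plausibly because some tensors are kept at higher precision):

```python
def approx_weight_gb(n_params: float, bits_per_param: float) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes): params * bits / 8."""
    return n_params * bits_per_param / 8 / 1e9

N = 1.0e12  # ~1T total parameters

print(f"INT4: ~{approx_weight_gb(N, 4):.0f} GB")  # lower bound for K2 Thinking
print(f"FP8:  ~{approx_weight_gb(N, 8):.0f} GB")  # lower bound for K2 Instruct
```

At 4 bits the floor is ~500GB versus ~1000GB at 8 bits, consistent with the reported ~594GB versus just over 1TB once higher-precision tensors and metadata are included.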

Nov 6, 2025 · 9:10 PM UTC

Full results are now live on artificialanalysis.ai/ Post with analysis of the results here:
Kimi K2 Thinking is the new leading open weights model: it demonstrates particular strength in agentic contexts but is very verbose, generating the most tokens of any model in completing our Intelligence Index evals.

@Kimi_Moonshot's Kimi K2 Thinking achieves a 67 in the Artificial Analysis Intelligence Index. This positions it clearly above all other open weights models, including the recently released MiniMax-M2 and DeepSeek-V3.2-Exp, and second only to GPT-5 amongst proprietary models. It used the highest number of tokens ever across the evals in the Artificial Analysis Intelligence Index (140M), but with Moonshot's official API pricing of $0.6/$2.5 per million input/output tokens (for the base endpoint), the overall Cost to Run Artificial Analysis Intelligence Index comes in cheaper than leading frontier models at $356. Moonshot also offers a faster turbo endpoint priced at $1.15/$8 (driving a Cost to Run Artificial Analysis Intelligence Index result of $1172 for the turbo endpoint - second only to Grok 4 as the most expensive model). The base endpoint is very slow at ~8 output tokens/s, while the turbo is somewhat faster at ~50 output tokens/s.

The model is one of the largest open weights models ever at 1T total parameters with 32B active. K2 Thinking is the first reasoning model release in Moonshot AI's Kimi K2 model family, following the non-reasoning Kimi K2 Instruct models released in July and September 2025. Moonshot AI refers only to post-training in their announcement. This release highlights the continued trend of post-training, and specifically RL, driving performance gains for reasoning models and in long horizon tasks involving tool calling.

Key takeaways:

➤ Details: text only (no image input), 256K context window, natively released in INT4 precision, 1T total parameters with 32B active (~594GB)

➤ New leader in open weights intelligence: Kimi K2 Thinking achieves a 67 in the Artificial Analysis Intelligence Index. This is the highest open weights score yet and significantly higher than gpt-oss-120b (61), MiniMax-M2 (61), Qwen3 235B A22B 2507 (57) and DeepSeek-V3.2-Exp (57). This release continues the trend of open weights models closely following proprietary models in intelligence achieved.

➤ China takes back the open weights frontier: Releases from China-based AI labs have led in open weights intelligence for most of the past year. OpenAI's gpt-oss-120b release in August 2025 briefly took back the leadership position for the US. Moonshot AI's K2 Thinking reclaims the leading open weights model mantle for China-based AI labs.

➤ Strong agentic performance: Kimi K2 Thinking demonstrates particular strength in agentic contexts, as showcased by its #2 position in the Artificial Analysis Agentic Index, where it is second only to GPT-5. This is mostly driven by K2 Thinking achieving 93% in 𝜏²-Bench Telecom, an agentic tool use benchmark where the model acts as a customer service agent. This is the highest score we have independently measured. Tool use in long horizon agentic contexts was a strength of Kimi K2 Instruct, and this new Thinking variant appears to make substantial gains.

➤ Top open weights coding model, but behind proprietary models: K2 Thinking does not score a win in any of our coding evals - it lands in 6th place in Terminal-Bench Hard, 7th place in SciCode and 2nd place in LiveCodeBench. Compared to open weights models, it is first or first-equal in each of these evals, and therefore comes in ahead of previous open weights leader DeepSeek V3.2 in our Artificial Analysis Coding Index.

➤ Biggest leap for open weights in Humanity's Last Exam: K2 Thinking's strongest results include Humanity's Last Exam, where we measured a score of 22.3% (no tools) - an all-time high for open weights models, behind only GPT-5 and Grok 4.

➤ Verbosity: Kimi K2 Thinking is very verbose - 140M total tokens were used to run our Intelligence Index evaluations, ~2.5x the number of tokens used by DeepSeek V3.2 and ~2x compared to GPT-5. This high verbosity drives both higher cost and higher latency compared to less verbose models. On Moonshot's base endpoint, K2 Thinking is 2.5x cheaper than GPT-5 (high) but 9x more expensive than DeepSeek V3.2 (Cost to Run Artificial Analysis Intelligence Index).

➤ Reasoning variant of Kimi K2 Instruct: As its name suggests, the model is a reasoning variant of Kimi K2 Instruct. It has the same architecture and the same number of parameters (though different precision) as Kimi K2 Instruct. It continues to support only text inputs and outputs.

➤ 1T parameters but INT4 instead of FP8: Unlike Moonshot's prior Kimi K2 Instruct releases, which used FP8 precision, this model has been released natively in INT4 precision. Moonshot used quantization-aware training in the post-training phase to achieve this. As a result, K2 Thinking is only ~594GB, compared to just over 1TB for K2 Instruct and K2 Instruct 0905 - which translates into efficiency gains for inference and training. A potential reason for INT4 is that pre-Blackwell NVIDIA GPUs do not support FP4, making INT4 more suitable for achieving efficiency gains on earlier hardware.

➤ Access: The model is available on @huggingface with a modified MIT license. @Kimi_Moonshot is serving an official API (available globally) and third party inference providers are already launching endpoints - including @basetenco, @FireworksAI_HQ, @novita_labs, @parasail_io
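The Cost to Run figures above are just token counts multiplied by per-million-token prices. A minimal sketch of that arithmetic, assuming a hypothetical input/output split (the post reports only the ~140M total; the actual breakdown behind the published $356 and $1172 figures is not stated):

```python
def eval_cost_usd(input_tokens: float, output_tokens: float,
                  in_price_per_m: float, out_price_per_m: float) -> float:
    """API cost in USD: token counts (scaled to millions) times prices."""
    return input_tokens / 1e6 * in_price_per_m + output_tokens / 1e6 * out_price_per_m

# Hypothetical split of the ~140M tokens used for the Intelligence Index runs;
# the post does not state the actual input/output breakdown.
in_tok, out_tok = 10e6, 140e6

print(f"base:  ~${eval_cost_usd(in_tok, out_tok, 0.60, 2.50):.0f}")
print(f"turbo: ~${eval_cost_usd(in_tok, out_tok, 1.15, 8.00):.0f}")
```

The turbo endpoint's much higher output price is what drives its Cost to Run figure toward Grok 4 territory despite identical token usage.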
Replying to @ArtificialAnlys
@grok how many mac minis would you need to run this model?
Replying to @ArtificialAnlys
Crushing agentic benchmarks with a trillion-parameter beast shrunk to pocket-sized inference proves open weights just sprinted past closed giants while wasting zero taxpayer dollars.
Replying to @ArtificialAnlys
Kimi K2 Thinking just dropped — pushing a “reasoning + MoE” direction for coding and logic. It’s promising, but GPT-5-Codex still leads with true self-verification, and Claude Sonnet 4.5 keeps its edge in consistency. Still, open-source is catching up fast. Some might say… it’s already won. 😉
Replying to @ArtificialAnlys
INT4 precision for a 1T parameter model feels like a bold move. The efficiency gains are clear, but I wonder about the trade-offs in precision for reasoning tasks. Given the focus on long-horizon agentic tasks, how does its performance compare to FP8-based models in real-world deployments? The telecom benchmark score is impressive, but I'd love to see how robust it is in more nuanced scenarios.
Replying to @ArtificialAnlys
I love Kimi K2 and have been quite excited about the Thinking variant. I also like Artificial Analysis; their analysis and leaderboard are well put together and generally give a good impression of how capable a model is overall. But these benchmark scores just seem cooked to me. Has anyone looked closely at actual tau bench samples to see what they actually measure? In my experience with these models on agentic tool use, the Apriel model sucks. It's basically useless - it should be below everything in this chart except possibly Nemotron, which I haven't tried. Grok 4 Fast and Haiku 4.5 should be quite a bit higher up; they're good at tools. Opus and Sonnet 4.5 should be much higher up - they are the best models at agentic tool use, possibly tied with Codex but at the top. Grok 4 should be farther down. gpt-oss is okay placed, maybe slightly too high. The scores kind of make me think this benchmark somehow measures personal and creative writing ability, because that's the one thing the Apriel model is not terrible at - and the models that score very low, e.g. the DeepSeek ones or gpt-oss, suck at that even though they're fairly smart and nimble overall.
Replying to @ArtificialAnlys
AGI running on a 512GB Mac Studio is wild! Why do they need trillions of $?
After 2 years these benchmarks just make me glaze over
Replying to @ArtificialAnlys
I’ve come to the point where I’m more excited on the release of an open weights model than generic LLMs; fine-tuning them is the way to go
Replying to @ArtificialAnlys
Absolutely insane performance
Replying to @ArtificialAnlys
OpenAI moves extremely fast, much faster than many companies and individuals can adapt and integrate new capabilities into what they do. xAI moves faster. Chinese LLMs develop faster... It's like The Flash being overtaken by some new heroes.
Replying to @ArtificialAnlys
Impressive milestone 👏
Replying to @ArtificialAnlys
Kimi K2’s leap is agentic future in motion—AgentFi vibes all the way
Replying to @ArtificialAnlys
Cool result on τ²-Bench. As a quick smell test, I ran a Markov-chain probability task I use for sanity checks. K2-Thinking spun in loops; Grok-4 and GPT both solved it cleanly. Benchmarks ≠ lived problem-solving.
Replying to @ArtificialAnlys
A trillion-parameter model with 32B active, INT4 precision, and 93% on Tau²-Bench Telecom puts it ahead of every public alternative. MoonshotAI is showing that you don't need closed weights or trillion-dollar budgets to reach frontier-level reasoning. It's also a signal that open models are entering their "reasoning era": smaller, faster, and smarter, optimized for real agentic use. The line between proprietary and open intelligence is getting thinner by the month.
Replying to @ArtificialAnlys
Can it run on 8x mi350 amd @grok
Replying to @ArtificialAnlys
Int4 experts are a downer. Nothing that a few epochs of post-training after fp4 conversion couldn't solve though. Relevant paper comparing the diff arxiv.org/html/2510.25602v1
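For context on what an INT4 conversion like the one discussed here involves, a minimal NumPy sketch of generic symmetric per-tensor INT4 quantization - an illustration of the general technique only, not Moonshot's actual quantization-aware training recipe (production schemes typically quantize per-group or per-channel):

```python
import numpy as np

def quantize_int4_symmetric(w: np.ndarray):
    """Map floats to integers in [-7, 7] with one shared scale (symmetric, per-tensor)."""
    max_abs = float(np.abs(w).max())
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float weights from the INT4 codes."""
    return q.astype(np.float32) * scale

w = np.array([0.02, -0.51, 0.33, 0.07], dtype=np.float32)
q, s = quantize_int4_symmetric(w)
w_hat = dequantize(q, s)
print(q, float(np.abs(w - w_hat).max()))  # worst-case round-trip error stays within scale/2
```

Quantization-aware training, as referenced in the linked paper, lets the model adapt its weights to this rounding during training rather than absorbing the error after the fact.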
Replying to @ArtificialAnlys
why exclude the airline bench in the agentic index? Also perhaps it's time for a new benchmark, not much signal left in this benchmark for frontier models. Too saturated.
Replying to @ArtificialAnlys
1T parameters open-source is a game-changer for the entire AI community. This democratization of large-scale reasoning models could accelerate innovation across research institutions and smaller companies who previously couldn't access such capabilities.
Replying to @ArtificialAnlys
if only i could use it. its been failing to connect via api all day for me.
Replying to @ArtificialAnlys
holy fuck
Replying to @ArtificialAnlys
I wonder if that massive size unlock new opportunities for devs
Replying to @ArtificialAnlys
Forget the #1 spot, that changes every week. The real story is the INT4. They built a 1T model that can actually run efficiently, and that's the real genius. 🧠
Replying to @ArtificialAnlys
INT4 precision means more models, less hardware sweat. Moonshot just handed smaller labs a cheat code.
Replying to @ArtificialAnlys
Now we are talking. Kimi was an amazing model already.
Replying to @ArtificialAnlys
this is really amazing how open source models are getting closer to enterprise models
Replying to @ArtificialAnlys
Kimi K2 Thinking is a game-changer!
Replying to @ArtificialAnlys
@grok could you rank this IA by country?
Replying to @ArtificialAnlys
K2 Thinking tops Tau2-Bench Telecom with 1T parameters for agentic tool use. Open weights challenge proprietary reasoning models at scale. Can telecom-specific benchmarks generalize or overfit to domain patterns?
Replying to @ArtificialAnlys
I would like to see Qwen3-Max reasoning in the list. I have access to it, but it is not in your benchmark list.
Replying to @ArtificialAnlys
With a fraction of the cost of OpenAI, this level of performance can be achieved. How can Altman convince others to secure funding in the hundreds of billions or trillions of dollars?
Replying to @ArtificialAnlys
They took their time and didn't rush the base model, then took their time again to make the reasoning model. And it's massive. Of course it was going to be this good.
Replying to @ArtificialAnlys
@AskPerplexity how can I use this without running it locally?