
Moonshot's K2 Thinking tops open‑source AI, beating GPT‑5 and MiniMax‑M2


When I first saw Moonshot’s K2 Thinking appear, it felt like the open-source crowd had finally gotten a serious contender. Within weeks of release it was already topping a handful of the benchmarks that developers and researchers keep an eye on. That’s a bit surprising, given that MiniMax-M2, the open-weight leader that landed only a few weeks earlier, was still fresh in everyone’s mind.

Yet K2 Thinking is being run through the same test suite and, on several of those tasks, it seems to pull ahead. What’s catching people’s attention is its knack for complex reasoning and fluid language generation, all without the licensing fees you see on commercial models. It’s hard to say whether this edge will hold up as the next round of evaluations rolls out, but the early numbers do hint at a shift.

Open-source projects might be starting to nibble at the margins that big players like GPT-5 and Claude Sonnet 4.5 have long owned, and that could change the adoption game.

Across these tasks, K2 Thinking consistently outperforms GPT-5's corresponding scores and surpasses the previous open-weight leader MiniMax-M2, released just weeks earlier by Chinese rival MiniMax AI.

Open Model Outperforms Proprietary Systems

GPT-5 and Claude Sonnet 4.5 Thinking remain the leading proprietary "thinking" models. Yet in the same benchmark suite, K2 Thinking's agentic reasoning scores exceed both: on BrowseComp, for instance, the open model's 60.2% decisively leads GPT-5's 54.9% and Claude 4.5's 24.1%. K2 Thinking also edges GPT-5 on GPQA Diamond (85.7% vs. 84.5%) and matches it on mathematical reasoning tasks such as AIME 2025 and HMMT 2025.


It looks like a free, open-source model might actually outpace the flagship from a well-funded U.S. lab. Moonshot AI’s Kimi K2 Thinking claims the top spot, based on third-party benchmarks that put it ahead of GPT-5, Claude Sonnet 4.5 and the new MiniMax-M2.

In most of the tests, K2 Thinking posted higher scores on the same benchmarks that once crowned MiniMax-M2 as the open-weight leader. The report, however, doesn’t spell out which datasets or metrics were used, so the real size of the edge is hard to gauge. The coverage also highlights a Chinese startup’s rise while flagging worries about OpenAI’s aggressive expansion and heavy spending.

I’m not sure yet whether this performance gap will turn into wider adoption - the field moves so fast it’s hard to predict. Still, seeing a free model beat a paid proprietary one feels like a subtle shift in the competition, and many of us will be watching for more proof.


Common Questions Answered

How does Moonshot's K2 Thinking compare to GPT-5 on the benchmark suite?

According to third-party benchmark results, K2 Thinking consistently outperforms GPT-5 across the same suite of tasks, achieving higher scores in areas such as agentic reasoning. On BrowseComp, for example, the open-source model scored 60.2% against GPT-5's 54.9%.

What advantage does K2 Thinking have over the previous open‑weight champion MiniMax‑M2?

K2 Thinking surpasses MiniMax-M2 across the evaluated benchmark suite, posting higher scores despite MiniMax-M2 being released only weeks earlier. The reported edge centers on K2 Thinking's stronger agentic reasoning results.

Which specific task highlighted K2 Thinking's superiority, and what was the reported score?

The BrowseComp task highlighted K2 Thinking's superiority: the model achieved a decisive 60.2%, outpacing GPT-5 (54.9%) and Claude Sonnet 4.5 (24.1%). This result underscores the model's advanced reasoning and information-retrieval abilities.

What limitations does the article note about the benchmark results for K2 Thinking?

The article points out that the benchmark report does not disclose the specific datasets or evaluation criteria used, leaving some uncertainty about the exact conditions of the tests. Consequently, while the scores favor K2 Thinking, the lack of methodological detail limits full verification.