
Moonshot AI's Kimi K2 Thinking scores 71.3% on SWE‑Bench, beating leading models


When I first saw Moonshot AI’s new model, Kimi K2 Thinking, I wondered if it could actually cut through the usual hype. The team ran it through SWE-Bench, that set of real-world coding tasks that checks whether a language model can spit out a correct software patch. Those scores tend to matter a lot; developers and companies often glance at them before swapping a commercial tool for an open-source one.
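To make the benchmark concrete: SWE-Bench hands a model a real repository and a bug report, the model emits a patch, and the harness applies that patch and runs the project's tests. A minimal sketch of that pass/fail check (not SWE-Bench's actual harness; the function name, paths, and test command here are illustrative assumptions) might look like:

```python
import subprocess


def evaluate_patch(repo_dir: str, patch_text: str, test_cmd: list) -> bool:
    """Apply a model-generated unified diff to a checkout and run its tests.

    Returns True only if the patch applies cleanly AND the test command
    exits with status 0 -- roughly the pass criterion SWE-Bench-style
    evaluations use. This is a simplified sketch, not the real harness.
    """
    # `git apply` rejects malformed or non-applying diffs outright.
    apply_proc = subprocess.run(
        ["git", "apply", "-"],
        input=patch_text.encode(),
        cwd=repo_dir,
        capture_output=True,
    )
    if apply_proc.returncode != 0:
        return False  # the model's patch didn't even apply

    # Run the project's test suite; a zero exit code counts as a pass.
    test_proc = subprocess.run(test_cmd, cwd=repo_dir, capture_output=True)
    return test_proc.returncode == 0
```

The real benchmark adds a lot on top of this (pinned environments, fail-to-pass test selection, per-instance Docker images), but the core idea is exactly this binary check, which is why the scores are easy to compare across models.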

It’s not a given that open-source LLMs can keep up, but K2 Thinking’s numbers look surprisingly close to the leaders. Moonshot’s own chart pits the model against the usual heavyweights, pointing out a few spots where it actually nudges ahead. Compared side-by-side with GPT-5, Claude Sonnet 4.5, or even China’s Deepseek-V3.2, the gap seems to be narrowing, though it’s still a bit early to call it a win.

Still, the results hint that the balance might be shifting, and I’m curious to see how the detailed figures line up.

For coding, K2 Thinking scored 71.3 percent on SWE-Bench Verified and 61.1 percent on SWE-Multilingual. Moonshot's comparison chart shows that these results put K2 Thinking ahead of some leading commercial models like GPT-5 and Claude Sonnet 4.5, as well as Chinese rival Deepseek-V3.2 in certain tests. To show off its coding skills, Moonshot highlights a demo where Kimi K2 Thinking generated a fully functional Word-style document editor from a single prompt.

The company says the model delivers strong results on HTML, React, and other front-end tasks, turning prompts into responsive, production-ready apps. Moonshot also points to K2 Thinking's step-by-step reasoning abilities.


Kimi K2 Thinking hits 71.3% on SWE-Bench Verified and 61.1% on SWE-Multilingual, nudging past a handful of well-known commercial models in those particular tests. The scores look good, but they’re limited to the benchmarks listed; it’s unclear whether the edge holds up on everyday coding work. Reportedly the training bill was about $4.6 million, a hefty spend for an open-source project.

The modified MIT license adds a twist: any company pulling in over $20 million a month or serving more than 100 million monthly active users has to display the Kimi K2 name prominently in its product. That clause probably stems from worries about U.S. firms switching to cheaper Chinese-origin models, yet we don’t really know how it will affect adoption.

In the comparison chart, K2 Thinking sits ahead of GPT-5, Claude Sonnet 4.5, and Deepseek-V3.2 on some test slices, though those slices aren’t fully described. Bottom line: the model posts an impressive record for open-source LLMs, but whether those numbers translate into real-world advantage or shift the market remains an open question.

Common Questions Answered

What scores did Moonshot AI's Kimi K2 Thinking achieve on the SWE‑Bench benchmarks?

Kimi K2 Thinking scored 71.3 percent on the SWE‑Bench Verified benchmark and 61.1 percent on the SWE‑Multilingual benchmark. These results place it ahead of several leading commercial models in those specific tests.

How does Kimi K2 Thinking's performance compare to commercial models like GPT‑5 and Claude Sonnet 4.5?

According to Moonshot AI's comparison chart, K2 Thinking outperformed GPT‑5, Claude Sonnet 4.5, and the Chinese model Deepseek‑V3.2 on certain SWE‑Bench tests. The advantage is specific to the benchmark metrics reported, not necessarily all coding tasks.

What does the demo involving a Word‑style document editor illustrate about Kimi K2 Thinking?

The demo shows K2 Thinking generating a fully functional Word‑style document editor from a single prompt, highlighting its ability to produce complex, end‑to‑end code solutions. This example serves as a practical illustration of the model's coding capabilities beyond raw benchmark numbers.

What are the reported development costs and licensing terms for Kimi K2 Thinking?

Moonshot AI states that training Kimi K2 Thinking cost about $4.6 million, indicating a substantial investment for an open‑source effort. The model is released under a modified MIT license that permits broad use and modification, with an added clause requiring very large deployments to credit the Kimi K2 name.