Moonshot AI's Kimi K2 Thinking scores 71.3% on SWE‑Bench Verified, beating leading models
Moonshot AI’s latest release, Kimi K2 Thinking, has been put through the same coding gauntlet that separates hype from real progress. The model was evaluated on SWE‑Bench, a benchmark that measures how well language models can generate correct software patches for real‑world issues drawn from open‑source repositories. Those numbers matter because developers and enterprises often look to these scores when deciding whether an open‑source alternative can replace a commercial offering.
While many open‑source LLMs still lag behind the big players, K2 Thinking’s performance suggests a shift in the balance. The company’s own comparison chart lines up the results against the current heavyweights, highlighting where the new model overtakes them. This sets the stage for a closer look at the exact figures and how they stack up against GPT‑5, Claude Sonnet 4.5 and even China’s Deepseek‑V3.2.
For coding, K2 Thinking scored 71.3 percent on SWE-Bench Verified and 61.1 percent on SWE-Multilingual. Moonshot's comparison chart shows these results putting K2 Thinking ahead of leading commercial models such as GPT-5 and Claude Sonnet 4.5, as well as Chinese rival Deepseek-V3.2, on certain tests. To show off its coding skills, Moonshot highlights a demo in which Kimi K2 Thinking generated a fully functional Word-style document editor from a single prompt.
The company says the model delivers strong results on HTML, React, and other front-end tasks, turning prompts into responsive, production-ready apps. Moonshot also points to K2 Thinking's step-by-step reasoning abilities.
What does Kimi K2 Thinking’s benchmark performance really mean? At 71.3 percent on SWE‑Bench Verified and 61.1 percent on SWE‑Multilingual, the model outpaces several well‑known commercial systems in those specific tests. The numbers, however, are confined to the benchmarks cited; it’s unclear whether the advantage persists in broader coding tasks.
Training reportedly cost about $4.6 million, a figure that suggests a significant investment for an open‑source effort. The license adds a commercial twist: the model ships under a modified MIT license that requires firms pulling in over $20 million a month, or serving more than 100 million monthly active users, to display the Kimi K2 name prominently. That clause appears aimed at U.S. companies adopting cheaper Chinese‑origin models, but its actual impact on adoption remains uncertain. The comparison chart places K2 Thinking ahead of GPT‑5, Claude Sonnet 4.5, and Deepseek‑V3.2 in certain test slices, though the scope of those slices isn’t fully detailed. In short, the model sets a notable record among open‑source LLMs, but whether the scores translate into a real‑world advantage or a market shift is still open.
Further Reading
- Moonshot's $4.6 million 'Kimi K2 Thinking' takes top spots on reasoning benchmarks, beating GPT-5 and Claude - Implicator
- Introducing Kimi K2 Thinking - Moonshot AI (Official Blog)
- kimi-k2-thinking - Ollama
Common Questions Answered
What scores did Moonshot AI's Kimi K2 Thinking achieve on the SWE‑Bench benchmarks?
Kimi K2 Thinking scored 71.3 percent on the SWE‑Bench Verified benchmark and 61.1 percent on the SWE‑Multilingual benchmark. These results place it ahead of several leading commercial models in those specific tests.
How does Kimi K2 Thinking's performance compare to commercial models like GPT‑5 and Claude Sonnet 4.5?
According to Moonshot AI's comparison chart, K2 Thinking outperformed GPT‑5, Claude Sonnet 4.5, and the Chinese model Deepseek‑V3.2 on certain SWE‑Bench tests. The advantage is specific to the benchmark metrics reported, not necessarily all coding tasks.
What does the benchmark demo involving a Word‑style document editor illustrate about Kimi K2 Thinking?
The demo shows K2 Thinking generating a fully functional Word‑style document editor from a single prompt, highlighting its ability to produce complex, end‑to‑end code solutions. This example serves as a practical illustration of the model's coding capabilities beyond raw benchmark numbers.
What are the reported development costs and licensing terms for Kimi K2 Thinking?
Moonshot AI states that training Kimi K2 Thinking cost about $4.6 million, indicating a substantial investment for an open‑source effort. The model is released under a modified MIT license that allows broad use and modification, with one condition: firms earning over $20 million a month or serving more than 100 million monthly active users must display the Kimi K2 name prominently.