

Moonshot AI's Kimi K2.6: 300 Agents, Advanced Reasoning

Moonshot AI launches Kimi K2.6, scores 54.0 on HLE-Full, scales to 300 agents


Moonshot AI’s newest release, Kimi K2.6, pushes the envelope on two fronts: it can stitch together code that spans dozens of reasoning cycles, and it can marshal a swarm of up to 300 sub‑agents to carry out 4,000 coordinated steps. The open‑source model is built for “long‑horizon” tasks, meaning it isn’t just answering a single prompt but managing a cascade of actions that unfold over time. That capability matters because most benchmarks still reward short, isolated answers, leaving a gap when real‑world problems demand sustained, multi‑agent collaboration.

Here’s the thing: the community has long used Humanity’s Last Exam (HLE‑Full) as a litmus test for that kind of depth. It’s widely regarded as one of the toughest knowledge benchmarks, especially when tools are in play. So when a fresh contender posts a score that tops even the latest proprietary offerings, the result draws a clear line in the sand.

For agentic workloads, the most striking number is Humanity's Last Exam (HLE-Full) with tools: K2.6 scores 54.0, leading every model in the comparison, including GPT-5.4 (52.1), Claude Opus 4.6 (53.0), and Gemini 3.1 Pro (51.4). The with-tools variant specifically tests how well a model can leverage external resources autonomously. Internally, Moonshot tracks long-horizon coding gains on Kimi Code Bench, an in-house benchmark covering diverse, complicated end-to-end tasks across languages and domains, where K2.6 shows significant improvements over K2.5.

Kimi K2.6 arrives as an open-sourced, multimodal agentic model built for long-horizon coding tasks and front-end generation from natural language. The release team highlights its ability to coordinate up to 300 specialized sub-agents across 4,000 steps as the headline capability for practical deployment.

Still, the result invites scrutiny. The release offers no data on real-world software-engineering workloads beyond the benchmark, so it remains unclear whether the reported gains will translate into consistent productivity for developers. Moreover, the claim of “massively parallel agent swarms” rests on internal testing; external verification is absent.

Will the open‑source community adopt K2.6 and validate its performance at scale? The answer will depend on subsequent experiments and integration experiences, which remain to be documented.


Common Questions Answered

How does Kimi K2.6 perform on the Humanity's Last Exam (HLE-Full) benchmark?

Kimi K2.6 scored 54.0 on the HLE-Full with-tools benchmark, outperforming other leading models such as GPT-5.4, Claude Opus 4.6, and Gemini 3.1 Pro. The benchmark is considered one of the most challenging knowledge tests, and the with-tools variant specifically evaluates a model's ability to autonomously leverage external resources.

What makes Kimi K2.6's agent coordination capabilities unique?

Kimi K2.6 can coordinate up to 300 specialized sub-agents across 4,000 coordinated steps, representing a significant advancement in long-horizon task management. This capability allows the model to stitch together complex reasoning cycles and manage cascading actions that unfold over extended periods.

What type of tasks is Kimi K2.6 designed to handle?

Kimi K2.6 is built for “long-horizon” tasks, meaning it can manage complex, multi-step processes rather than just answering single prompts. The open-source, multimodal agentic model is aimed at coordinating extended workflows, particularly long-horizon coding and front-end generation from natural language.