Terminal‑Bench 2.0 launches with Harbor, testing any container‑installable agent
Why does a new testing framework matter now? Companies are racing to ship AI agents that can run inside containers, yet there’s no single place to measure them end‑to‑end. Terminal‑Bench 2.0 arrives with Harbor, a companion system that promises to fill that gap.
While the tech is impressive, the real question is whether developers can actually plug any container‑installable agent into a reproducible benchmark without rebuilding pipelines from scratch. Harbor is positioned as a one‑stop shop for evaluating, fine‑tuning and benchmarking agents, and it has already been used internally to validate the new benchmark. Its integration with Terminal‑Bench 2.0 means users won't need to juggle separate tools for supervised fine‑tuning or reinforcement‑learning loops.
Instead, they get a unified environment that can scale with their workloads. The promise is clear—flexibility across architectures, custom benchmark creation and seamless deployment—all wrapped in a single framework.
Designed to generalize across agent architectures, Harbor supports:

- Evaluation of any container-installable agent (see the sketch below)
- Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines
- Custom benchmark creation and deployment
- Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.
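To make "container-installable" concrete: any agent that can be installed and driven inside an isolated container can, in principle, be dropped into a reproducible evaluation loop. The sketch below illustrates that idea using the Docker Python SDK; it is not Harbor's actual API, and the base image, install command, agent CLI, and verification step are all placeholder assumptions.

```python
# Conceptual illustration of evaluating a "container-installable" agent.
# NOT Harbor's API: the image, install command, agent CLI, and check below
# are placeholders chosen only to show the shape of the loop.
import docker

client = docker.from_env()

# Start a clean, isolated task environment (placeholder base image).
container = client.containers.run(
    "python:3.11-slim", command="sleep infinity", detach=True
)

try:
    # Install the agent inside the container (placeholder install command).
    exit_code, output = container.exec_run("pip install my-agent-cli")
    assert exit_code == 0, output.decode()

    # Ask the agent to attempt a task, then verify the outcome reproducibly.
    container.exec_run("my-agent-cli 'create a file named done.txt'")
    check_code, _ = container.exec_run("test -f done.txt")
    print("task solved" if check_code == 0 else "task failed")
finally:
    container.remove(force=True)
```

Because the environment is rebuilt from an image on every run, results stay repeatable and comparable across agents, which is the reproducibility property the benchmark is built around.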
Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command-line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate -- the highest among all agents tested so far. Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):
- Codex CLI (GPT-5) -- 49.6%
- Codex CLI (GPT-5-Codex) -- 44.3%
- OpenHands (GPT-5) -- 43.8%
- Terminus 2 (GPT-5-Codex) -- 43.4%
- Terminus 2 (Claude Sonnet 4.5) -- 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation. The submission command takes roughly the following form, with placeholders standing in for the dataset, model, agent, and jobs directory:

harbor run -d <dataset> -m "<model-name>" -a "<agent-name>" --n-attempts 5 --jobs-dir <jobs-dir>

Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark.

Aiming for Standardization

The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure.
Harbor arrives with a clear purpose. It lets developers drop any container‑installable agent into a controlled testbed, then run Terminal‑Bench 2.0’s tougher, more realistic tasks. The framework promises scalable supervised fine‑tuning and reinforcement‑learning pipelines, plus the ability to craft custom benchmarks on demand.
Integration is tight; Harbor feeds results straight into Terminal‑Bench, and the team says they have already used the combo internally. Yet, whether the broader community will adopt a container‑centric workflow remains uncertain, especially given the effort required to containerize existing agents. The release tackles long‑standing pain points around reproducibility and scaling, but it does not yet demonstrate how well it handles agents that were not designed with containers in mind.
If the promised generality holds up under diverse architectures, the dual launch could smooth the path from prototype to production. Until independent users put Harbor through its paces, the practical impact of these tools will stay largely speculative.
Further Reading
- Introducing Terminal-Bench 2.0 and Harbor - Terminal-Bench News
- Pushing Claude Code, OpenAI Codex, Factory Droid, et al to the limits - YouTube (Interview with Terminal-Bench creators)
- Terminal-Bench - Vals AI
Common Questions Answered
What does Harbor enable developers to do with any container‑installable agent in Terminal‑Bench 2.0?
Harbor allows developers to drop any container‑installable agent into a controlled testbed without rebuilding pipelines, feeding the results directly into Terminal‑Bench 2.0 for evaluation. This integration supports end‑to‑end testing of agents on realistic, tougher tasks.
How does Harbor support scalable supervised fine‑tuning (SFT) and reinforcement learning (RL) pipelines?
Harbor provides built‑in pipelines that can scale supervised fine‑tuning and reinforcement learning workloads across many rollouts, as demonstrated by tens of thousands of internal runs during benchmark creation. These pipelines are designed to work seamlessly with Terminal‑Bench 2.0’s evaluation framework.
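As a rough illustration of what an SFT data pipeline over agent rollouts can look like, the sketch below filters successful rollouts and turns them into prompt/target training pairs. The JSONL layout and field names ("instruction", "trajectory", "success", "command") are assumptions for illustration, not Harbor's actual schema.

```python
# Illustrative only: turning recorded agent rollouts into SFT training pairs.
# The JSONL layout and field names here are assumptions, not Harbor's schema.
import json
from pathlib import Path

def load_sft_examples(rollout_dir: str):
    """Yield (prompt, target) pairs from rollouts that solved their task."""
    for path in Path(rollout_dir).glob("*.jsonl"):
        for line in path.read_text().splitlines():
            rollout = json.loads(line)
            if not rollout.get("success"):
                continue  # keep only successful trajectories for SFT
            prompt = rollout["instruction"]
            target = "\n".join(step["command"] for step in rollout["trajectory"])
            yield prompt, target

examples = list(load_sft_examples("jobs/"))
print(f"collected {len(examples)} training examples")
```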
Can users create custom benchmarks with Harbor, and if so, how is this achieved?
Yes, Harbor includes functionality for custom benchmark creation and deployment, allowing users to define their own tasks and metrics. The framework then integrates these custom benchmarks into Terminal‑Bench 2.0, ensuring results are captured in the same reproducible environment.
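As a hypothetical sketch of the kind of information a custom benchmark task needs to capture (an instruction for the agent, a container environment, and a verification command), the example below uses made-up field names and defaults; it is not Harbor's real task schema.

```python
# Hypothetical sketch of a custom benchmark task definition.
# Field names and defaults are assumptions, not Harbor's real schema.
from dataclasses import dataclass

@dataclass
class BenchmarkTask:
    task_id: str
    instruction: str        # natural-language task given to the agent
    docker_image: str       # container environment the task runs in
    verification_cmd: str   # command whose exit code decides pass/fail
    max_seconds: int = 600  # per-task time limit

custom_suite = [
    BenchmarkTask(
        task_id="compress-logs",
        instruction="Compress every .log file under /var/data into logs.tar.gz.",
        docker_image="ubuntu:22.04",
        verification_cmd="test -f /var/data/logs.tar.gz",
    ),
]
print(f"defined {len(custom_suite)} custom task(s)")
```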
What evidence does the article provide that Harbor and Terminal‑Bench 2.0 are ready for public use?
The article notes that Harbor has been used internally to run tens of thousands of rollouts while building the new benchmark, and it is now publicly available via harborframework.com with documentation for testing and submitting agents. This demonstrates both extensive internal validation and readiness for broader community adoption.