Terminal‑Bench 2.0 launches with Harbor, testing any container‑installable agent
We’ve started seeing a lot of AI agents being packaged as containers lately, and it’s a bit of a mess trying to test them end-to-end. That’s where Terminal-Bench 2.0 and its new companion, Harbor, come in. The idea is simple: drop any container-ready agent into a benchmark that’s already set up, without having to rebuild your whole pipeline.
Harbor bills itself as a single place to evaluate, fine-tune and benchmark agents, and the team has already used it internally for their latest releases. If the integration works as advertised, you shouldn’t have to juggle separate tools for supervised fine-tuning or reinforcement-learning loops. Instead, you get one environment that can grow with your workload.
The claim is pretty straightforward: flexibility across different agent architectures, the ability to craft custom benchmarks, and a smoother deployment process, all inside one framework.
Designed to generalize across agent architectures, Harbor supports:
- Evaluation of any container-installable agent (a minimal packaging sketch appears below)
- Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines
- Custom benchmark creation and deployment
- Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.
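To make "container-installable agent" concrete, here is a minimal, purely illustrative packaging sketch. The base image, file names, and entrypoint are assumptions made for the example; the article does not spell out Harbor's packaging requirements, so treat this as a sketch rather than a recipe.

# Purely illustrative: wrap a hypothetical agent in a container image.
# None of these names come from Harbor's documentation.
cat > Dockerfile <<'EOF'
FROM python:3.11-slim
WORKDIR /agent
COPY . /agent
RUN pip install --no-cache-dir -r requirements.txt
# Hypothetical entrypoint: reads a task prompt and drives a shell session.
ENTRYPOINT ["python", "run_agent.py"]
EOF

docker build -t my-terminal-agent:latest .

Once an agent is packaged this way, the same image can in principle be reused for evaluation, fine-tuning data collection, and leaderboard runs without per-benchmark glue code.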
Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command-line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far. Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):
- Codex CLI (GPT-5): 49.6%
- Codex CLI (GPT-5-Codex): 44.3%
- OpenHands (GPT-5): 43.8%
- Terminus 2 (GPT-5-Codex): 43.4%
- Terminus 2 (Claude Sonnet 4.5): 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation. A run takes the form:

harbor run -d terminal-bench@2.0 -m "<model-name>" -a "<agent-name>" --n-attempts 5 --jobs-dir <jobs-directory>

where the model name, agent name, and jobs directory are filled in by the user.
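As a purely illustrative usage example, a leaderboard-style submission with five attempts might look like the following; the model identifier, agent name, and output path are hypothetical placeholders, not values from the article.

# Hypothetical invocation: substitute your own model, agent, and jobs directory.
harbor run -d terminal-bench@2.0 \
  -m "my-provider/my-model" \
  -a "my-agent" \
  --n-attempts 5 \
  --jobs-dir ./tb2-jobs

The --n-attempts 5 flag matches the five-run requirement for leaderboard submissions, and the jobs directory is what gets shared with the developers for validation.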
" -a " " --n-attempts 5 --jobs-dir Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark. Aiming for Standardization The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure.
Harbor shows up with a pretty straightforward goal. You can drop any container-installable agent into its testbed and then throw the tougher, more realistic tasks from Terminal-Bench 2.0 at it. The setup claims to support scalable supervised fine-tuning and reinforcement-learning pipelines, and it even lets you spin up custom benchmarks on the fly.
Integration looks tight: Harbor pipes results straight into Terminal-Bench, and the developers say they've already used the combination themselves. Still, it's unclear whether the wider community will bite on a container-first workflow, especially since containerizing existing agents can take real work. The release does address long-standing headaches around reproducibility and scaling, but we haven't yet seen how it copes with agents that weren't built for containers.
If the promised flexibility holds up across a variety of agent architectures, the two tools together could make moving from prototype to production smoother. Until independent users put Harbor through its paces, though, the real impact remains mostly speculative.
Common Questions Answered
What does Harbor enable developers to do with any container‑installable agent in Terminal‑Bench 2.0?
Harbor allows developers to drop any container‑installable agent into a controlled testbed without rebuilding pipelines, feeding the results directly into Terminal‑Bench 2.0 for evaluation. This integration supports end‑to‑end testing of agents on realistic, tougher tasks.
How does Harbor support scalable supervised fine‑tuning (SFT) and reinforcement learning (RL) pipelines?
Harbor provides built‑in pipelines that can scale supervised fine‑tuning and reinforcement learning workloads across many rollouts, as demonstrated by tens of thousands of internal runs during benchmark creation. These pipelines are designed to work seamlessly with Terminal‑Bench 2.0’s evaluation framework.
Can users create custom benchmarks with Harbor, and if so, how is this achieved?
Yes, Harbor includes functionality for custom benchmark creation and deployment, allowing users to define their own tasks and metrics. The framework then integrates these custom benchmarks into Terminal‑Bench 2.0, ensuring results are captured in the same reproducible environment.
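The article does not describe Harbor's task schema, so the following is only an assumption-laden sketch of what a custom task might look like, loosely modeled on the directory-style layout earlier Terminal-Bench releases used (an instruction, a container definition, and a test script). Every file name and field below is hypothetical.

# Hypothetical layout for one custom task; Harbor's real schema may differ.
mkdir -p my-benchmark/fix-broken-build/tests

cat > my-benchmark/fix-broken-build/task.yaml <<'EOF'
# Hypothetical fields, not confirmed by Harbor's documentation.
instruction: "The project in /app fails to compile. Fix the build so make succeeds."
difficulty: medium
max_agent_timeout_sec: 900
EOF

cat > my-benchmark/fix-broken-build/tests/run-tests.sh <<'EOF'
#!/usr/bin/env bash
# Pass only if the agent left the project in a buildable state.
cd /app && make
EOF

In this style, grading stays deterministic: the test script runs inside the same container after the agent finishes, which is what keeps results reproducible across runs.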
What evidence does the article provide that Harbor and Terminal‑Bench 2.0 are ready for public use?
The article notes that Harbor has been used internally to run tens of thousands of rollouts while building the new benchmark, and it is now publicly available via harborframework.com with documentation for testing and submitting agents. This demonstrates both extensive internal validation and readiness for broader community adoption.