Terminal‑Bench 2.0 launches with Harbor, testing any container‑installable agent
We’ve started seeing a lot of AI agents being packaged as containers lately, and it’s a bit of a mess trying to test them end-to-end. That’s where Terminal-Bench 2.0 and its new companion, Harbor, come in. The idea is simple: drop any container-ready agent into a benchmark that’s already set up, without having to rebuild your whole pipeline.
Harbor bills itself as a single place to evaluate, fine-tune and benchmark agents, and the team has already used it internally for their latest releases. If the integration works as advertised, you shouldn’t have to juggle separate tools for supervised fine-tuning or reinforcement-learning loops. Instead, you get one environment that can grow with your workload.
The claim is pretty straightforward: flexibility across different agent architectures, the ability to craft custom benchmarks, and a smoother deployment process, all inside one framework.
Designed to generalize across agent architectures, Harbor supports:
- Evaluation of any container-installable agent (a minimal packaging sketch appears below)
- Scalable supervised fine-tuning (SFT) and reinforcement learning (RL) pipelines
- Custom benchmark creation and deployment
- Full integration with Terminal-Bench 2.0

Harbor was used internally to run tens of thousands of rollouts during the creation of the new benchmark. It is now publicly available via harborframework.com, with documentation for testing and submitting agents to the public leaderboard.
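To make "container-installable agent" concrete, here is a minimal, purely illustrative packaging sketch. The base image, file names, and entrypoint are assumptions made for the example; the article does not spell out Harbor's packaging requirements, so treat this as a sketch rather than a recipe.

# Purely illustrative: wrap a hypothetical agent in a container image.
# None of these names come from Harbor's documentation.
cat > Dockerfile <<'EOF'
FROM python:3.11-slim
WORKDIR /agent
COPY . /agent
RUN pip install --no-cache-dir -r requirements.txt
# Hypothetical entrypoint: reads a task prompt and drives a shell session.
ENTRYPOINT ["python", "run_agent.py"]
EOF

docker build -t my-terminal-agent:latest .

Once an agent is packaged this way, the same image can in principle be reused for evaluation, fine-tuning data collection, and leaderboard runs without per-benchmark glue code.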
Early Results: GPT-5 Leads in Task Success

Initial results from the Terminal-Bench 2.0 leaderboard show OpenAI's Codex CLI (command-line interface), a GPT-5-powered variant, in the lead with a 49.6% success rate, the highest among all agents tested so far. Close behind are other GPT-5 variants and Claude Sonnet 4.5-based agents.

Top 5 Agent Results (Terminal-Bench 2.0):
- Codex CLI (GPT-5): 49.6%
- Codex CLI (GPT-5-Codex): 44.3%
- OpenHands (GPT-5): 43.8%
- Terminus 2 (GPT-5-Codex): 43.4%
- Terminus 2 (Claude Sonnet 4.5): 42.8%

The close clustering among top models indicates active competition across platforms, with no single agent solving more than half the tasks.
Submission and Use

To test or submit an agent, users install Harbor and run the benchmark using simple CLI commands. Submissions to the leaderboard require five benchmark runs, and results can be emailed to the developers along with job directories for validation. A run takes the form:

harbor run -d terminal-bench@2.0 -m "<model-name>" -a "<agent-name>" --n-attempts 5 --jobs-dir <jobs-directory>

where the model name, agent name, and jobs directory are filled in by the user.
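As a purely illustrative usage example, a leaderboard-style submission with five attempts might look like the following; the model identifier, agent name, and output path are hypothetical placeholders, not values from the article.

# Hypothetical invocation: substitute your own model, agent, and jobs directory.
harbor run -d terminal-bench@2.0 \
  -m "my-provider/my-model" \
  -a "my-agent" \
  --n-attempts 5 \
  --jobs-dir ./tb2-jobs

The --n-attempts 5 flag matches the five-run requirement for leaderboard submissions, and the jobs directory is what gets shared with the developers for validation.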
" -a " " --n-attempts 5 --jobs-dir Terminal-Bench 2.0 is already being integrated into research workflows focused on agentic reasoning, code generation, and tool use. According to co-creator Mike Merrill, a postdoctoral researcher at Stanford, a detailed preprint is in progress covering the verification process and design methodology behind the benchmark. Aiming for Standardization The combined release of Terminal-Bench 2.0 and Harbor marks a step toward more consistent and scalable agent evaluation infrastructure.
Harbor shows up with a pretty straightforward goal. You can drop any container-installable agent into its testbed and then throw the tougher, more realistic tasks from Terminal-Bench 2.0 at it. The setup claims to support scalable supervised fine-tuning and reinforcement-learning pipelines, and it even lets you spin up custom benchmarks on the fly.
Integration looks tight: Harbor pipes results straight into Terminal-Bench, and the developers say they've already used the combination themselves. Still, it's unclear whether the wider community will bite on a container-first workflow, especially since containerizing existing agents can take real work. The release does address long-standing headaches around reproducibility and scaling, but we haven't yet seen how it copes with agents that weren't built for containers.
If the promised flexibility holds up across a variety of agent architectures, the two tools together could make moving from prototype to production smoother. Until independent users put Harbor through its paces, though, the real impact remains mostly speculative.
Common Questions Answered
What does Harbor enable developers to do with any container‑installable agent in Terminal‑Bench 2.0?
Harbor allows developers to drop any container‑installable agent into a controlled testbed without rebuilding pipelines, feeding the results directly into Terminal‑Bench 2.0 for evaluation. This integration supports end‑to‑end testing of agents on realistic, tougher tasks.
How does Harbor support scalable supervised fine‑tuning (SFT) and reinforcement learning (RL) pipelines?
Harbor provides built‑in pipelines that can scale supervised fine‑tuning and reinforcement learning workloads across many rollouts, as demonstrated by tens of thousands of internal runs during benchmark creation. These pipelines are designed to work seamlessly with Terminal‑Bench 2.0’s evaluation framework.
Can users create custom benchmarks with Harbor, and if so, how is this achieved?
Yes, Harbor includes functionality for custom benchmark creation and deployment, allowing users to define their own tasks and metrics. The framework then integrates these custom benchmarks into Terminal‑Bench 2.0, ensuring results are captured in the same reproducible environment.
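The article does not describe Harbor's task schema, so the following is only an assumption-laden sketch of what a custom task might look like, loosely modeled on the directory-style layout earlier Terminal-Bench releases used (an instruction, a container definition, and a test script). Every file name and field below is hypothetical.

# Hypothetical layout for one custom task; Harbor's real schema may differ.
mkdir -p my-benchmark/fix-broken-build/tests

cat > my-benchmark/fix-broken-build/task.yaml <<'EOF'
# Hypothetical fields, not confirmed by Harbor's documentation.
instruction: "The project in /app fails to compile. Fix the build so make succeeds."
difficulty: medium
max_agent_timeout_sec: 900
EOF

cat > my-benchmark/fix-broken-build/tests/run-tests.sh <<'EOF'
#!/usr/bin/env bash
# Pass only if the agent left the project in a buildable state.
cd /app && make
EOF

In this style, grading stays deterministic: the test script runs inside the same container after the agent finishes, which is what keeps results reproducible across runs.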
What evidence does the article provide that Harbor and Terminal‑Bench 2.0 are ready for public use?
The article notes that Harbor has been used internally to run tens of thousands of rollouts while building the new benchmark, and it is now publicly available via harborframework.com with documentation for testing and submitting agents. This demonstrates both extensive internal validation and readiness for broader community adoption.