
Harbor Framework Enables Sandbox Agent Execution on Docker, Modal, Daytona


Evaluating the DeepAgents command‑line interface on Terminal Bench 2.0 raises a practical question: how do you run dozens of agent trials without drowning in setup overhead? Researchers need a way to spin up isolated environments, feed each model the same benchmark tasks, and collect results automatically. The challenge isn’t just reproducibility; it’s also about scaling those runs across different container platforms without rewriting code for each provider.

While the test suite itself is straightforward, the infrastructure behind it can become a bottleneck, especially when experiments span Docker, cloud‑native services, or emerging sandbox runtimes. Here’s the thing: a single framework that abstracts those details could free teams to focus on algorithmic tweaks rather than plumbing. That’s the gap the description below addresses.

---

Harbor: Sandboxed Agent Execution


This is where Harbor comes in. Harbor is a framework for evaluating agents in containerized environments at scale, supporting Docker, Modal, Daytona, E2B, and Runloop as sandbox providers. It handles:

- Automatic test execution on benchmark tasks
- Automated reward scoring to verify task completion
- Registry of pre-built evaluation datasets like Terminal Bench

Harbor handles all the infrastructure complexity of running agents in isolated environments, letting you focus on improving your agent.
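The post doesn't show Harbor's own API, so here is a purely conceptual sketch of what "one interface, many sandbox providers" means in practice. Every name in it (SandboxProvider, LocalDockerSandbox, start/exec/stop) is a hypothetical illustration, not Harbor's actual interface; only the docker CLI calls are real.

```python
# Conceptual sketch only: class and method names are hypothetical,
# not Harbor's real API. Shows how a harness can target one interface
# while the underlying container platform stays swappable.
import subprocess
from abc import ABC, abstractmethod


class SandboxProvider(ABC):
    """Uniform surface an evaluation harness could target, regardless of
    whether the container runs on Docker, Modal, Daytona, E2B, or Runloop."""

    @abstractmethod
    def start(self, image: str) -> None: ...

    @abstractmethod
    def exec(self, command: str) -> str:
        """Run a shell command inside the sandbox and return its output."""

    @abstractmethod
    def stop(self) -> None: ...


class LocalDockerSandbox(SandboxProvider):
    """Hypothetical Docker-backed provider using the docker CLI."""

    def start(self, image: str) -> None:
        out = subprocess.run(
            ["docker", "run", "-d", image, "sleep", "infinity"],
            capture_output=True, text=True, check=True,
        )
        self.container_id = out.stdout.strip()

    def exec(self, command: str) -> str:
        out = subprocess.run(
            ["docker", "exec", self.container_id, "sh", "-c", command],
            capture_output=True, text=True, check=True,
        )
        return out.stdout

    def stop(self) -> None:
        subprocess.run(["docker", "rm", "-f", self.container_id], check=True)
```

The point of such an abstraction is that the harness only ever calls exec(), so moving a run from Docker to Modal, Daytona, E2B, or Runloop means swapping the provider class, not rewriting the evaluation code.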

Harbor offers a sandbox environment with shell-execution capabilities. We built a HarborSandbox backend that wraps this environment and implements file-system tools (e.g., edit_file, read_file, write_file, ls) on top of shell commands. The benchmark measures how well agents operate in computer environments via the terminal.
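The HarborSandbox backend's code isn't shown in the post. As a minimal sketch of the idea, assuming only a generic exec(command) shell primitive (the ShellSandbox protocol and FileSystemTools class below are illustrative names, not the actual backend), the file-system tools can be built entirely out of shell commands:

```python
import shlex
from typing import Protocol


class ShellSandbox(Protocol):
    """Assumed minimal surface: anything that can run a shell command."""
    def exec(self, command: str) -> str: ...


class FileSystemTools:
    """Illustrative file-system tools layered on shell execution,
    mirroring the edit_file/read_file/write_file/ls tools described above."""

    def __init__(self, sandbox: ShellSandbox) -> None:
        self.sandbox = sandbox

    def ls(self, path: str = ".") -> str:
        return self.sandbox.exec(f"ls -la {shlex.quote(path)}")

    def read_file(self, path: str) -> str:
        return self.sandbox.exec(f"cat {shlex.quote(path)}")

    def write_file(self, path: str, content: str) -> str:
        # printf %s avoids the shell interpreting the file contents.
        return self.sandbox.exec(
            f"printf %s {shlex.quote(content)} > {shlex.quote(path)}"
        )

    def edit_file(self, path: str, old: str, new: str) -> str:
        # Naive search-and-replace edit; a real tool would diff and validate.
        text = self.read_file(path).replace(old, new)
        return self.write_file(path, text)
```

Building the tools out of shell commands keeps the backend portable: anything that can run sh inside the container can expose the same tool surface to the agent.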

Example tasks:

- path-tracing: Reverse-engineer C program from rendered image
- chess-best-move: Find optimal move using chess engine
- git-multibranch: Complex git operations with merge conflicts
- sqlite-with-gcov: Build SQLite with code coverage, analyze reports

Tasks have a wide range of difficulty: some require many actions (e.g., cobol-modernization taking close to 10 minutes with 100+ tool calls) while simpler tasks complete in seconds.

Automated Verification: Each task includes verification logic that Harbor runs automatically, assigning a reward score (0 for incorrect, 1 for correct) based on whether the agent's solution meets the task requirements.

Baseline Results

We ran the DeepAgents CLI with claude-sonnet-4-5 on Terminal Bench 2.0 across 2 trials, achieving scores of 44.9% and 40.4% (mean: 42.65%).
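The verification code itself isn't included in the post. As a minimal sketch of the 0/1 reward convention and the two-trial averaging described above (the verify_task helper and its check_command argument are hypothetical, not Harbor's API), the pieces fit together roughly like this:

```python
import subprocess


def verify_task(check_command: str) -> int:
    """Hypothetical verifier: run a task's check script (locally here for
    illustration; Harbor runs it automatically inside the container) and
    map success/failure to the binary reward described above."""
    result = subprocess.run(["sh", "-c", check_command])
    return 1 if result.returncode == 0 else 0


# Aggregating per-trial accuracy the way the baseline numbers are reported:
trial_scores = [0.449, 0.404]          # 44.9% and 40.4% on Terminal Bench 2.0
mean_score = sum(trial_scores) / len(trial_scores)
print(f"mean: {mean_score:.2%}")       # -> mean: 42.65%
```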


Did the DeepAgents CLI meet expectations? The Terminal Bench 2.0 run offered a concrete set of 89 tasks spanning software engineering, biology, security, and gaming. Harbor supplied the containerized backbone, routing each benchmark through Docker, Modal, Daytona, E2B, or Runloop without manual intervention.

Automatic test execution and reward calculation were handled by the framework, freeing the authors to focus on agent behavior. Results showed the CLI could invoke shell commands, manipulate files, and retain memory across steps, yet the two-trial run yields only aggregate scores, without per-domain breakdowns or error margins. Consequently, it remains unclear whether DeepAgents consistently solves tasks or struggles with particular domains.

The evaluation demonstrates that sandboxed, scalable testing is feasible, but the data presented does not settle questions about real‑world applicability. Further detail, such as per-task results across more trials, would be needed to gauge practical usefulness. For now, Harbor proves a functional layer for systematic agent assessment, while DeepAgents’ true capabilities stay partially hidden behind the benchmark’s aggregate numbers.


Common Questions Answered

Which sandbox providers does the Harbor framework support for containerized agent execution?

Harbor supports Docker, Modal, Daytona, E2B, and Runloop as sandbox providers. This variety allows researchers to run agents on multiple container platforms without rewriting code for each environment.

How does Harbor automate test execution and reward scoring for the Terminal Bench 2.0 benchmark?

Harbor automatically runs each benchmark task in an isolated container and calculates a reward score to verify task completion. The framework handles both execution and scoring, eliminating manual setup and result collection.

What types of tasks are included in the Terminal Bench 2.0 suite used with the DeepAgents CLI?

Terminal Bench 2.0 comprises 89 tasks that cover software engineering, biology, security, and gaming domains. These diverse tasks test the CLI's ability to invoke shell commands, manipulate files, and solve domain‑specific problems.

In what ways does Harbor reduce the overhead of running dozens of agent trials?

Harbor abstracts the infrastructure complexity by routing each benchmark through Docker, Modal, Daytona, E2B, or Runloop without manual intervention. This automation frees researchers to focus on agent behavior rather than environment setup.
