Know your AI agent actually works.

A CLI-first framework for sandboxed agent evaluation. Fully Harbor compatible, just faster.

Harbor compatible · Powered by Rust · Open source

 curl -fsSL https://seaport.run/install.sh | bash

Why Seaport

Everything Harbor does, faster

The same tasks and datasets, on a rebuilt performance core.

Drop-in Harbor compatible

Same task format, same datasets, same scripts. Point Seaport at your existing Harbor tasks and they run unchanged. No migration, no rewrite.

Fast by default

Task environments are built and pulled once, then cached and reused. Identical images are never pulled twice, so warm runs start almost instantly.

Prepares, then runs

A preflight phase resolves, pulls, and builds every environment up front and in parallel, so the run itself is spent on agents, not setup.

Sandboxed by default

Agents run fully isolated, so untrusted code can't touch your machine. Test boldly without worrying about what your agent might do.

Works with any agent

Claude Code, Codex, or your own homegrown agent. If it runs in a terminal, Seaport can evaluate it. Swap agents with a single flag.

Numbers you can trust

Every run is deterministic and lands as clean JSON. Track pass rates over time, compare agents head-to-head, and catch regressions early.

Workflow

From idea to score in four steps

No new framework to learn. If you've written a shell script, you already know how to use Seaport.

01

Describe the task

Write what the agent should do and a quick test that checks if it nailed it. That's the whole setup.

02

Choose your agent

Plug in Claude Code, Codex, or your own. Want a baseline? Seaport can run the known-good solution to sanity-check the task itself.

03

Let it run

Seaport spins up a clean, isolated environment, hands the task to your agent, and grades the result. Totally hands-off.

04

Read the score

Get a clear pass rate plus a full transcript of what your agent tried, so you know not just if it failed, but why.

hello-world/

 hello-world/
├── instruction.md # the prompt given to the agent
├── task.toml # metadata, timeouts, environment
├── environment/
│   └── Dockerfile
├── solution/
│   └── solve.sh # used by the oracle agent
└── tests/
    └── test.sh # writes reward.txt: 1 or 0

task.toml

 [task]
name = "acme/hello-world"
description = "Create the expected output file."

[environment]
docker_image = "ubuntu:24.04"
network_mode = "no-network"
build_timeout_sec = 600.0
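
As a rough illustration of the tests/test.sh entry in the hello-world layout, here is a minimal sketch of a verifier script that writes 1 or 0 to reward.txt. The output file name, the expected text, and the location of reward.txt are assumptions made up for this example, not documented Seaport behavior, so adjust them to your own task.

```shell
#!/usr/bin/env bash
# Sketch of tests/test.sh: write 1 to reward.txt on success, 0 otherwise.
# ASSUMPTIONS: OUTPUT_FILE, EXPECTED, and REWARD_FILE are hypothetical names
# chosen for illustration; check where your verifier expects reward.txt.
set -u

OUTPUT_FILE="${OUTPUT_FILE:-hello.txt}"
EXPECTED="${EXPECTED:-Hello, world}"
REWARD_FILE="${REWARD_FILE:-reward.txt}"

grade() {
  # Pass only if the file exists and its content matches exactly
  # (command substitution drops the trailing newline before comparing).
  if [ -f "$OUTPUT_FILE" ] && [ "$(cat "$OUTPUT_FILE")" = "$EXPECTED" ]; then
    echo 1 > "$REWARD_FILE"
  else
    echo 0 > "$REWARD_FILE"
  fi
}

grade
```

A matching solution/solve.sh for this hypothetical task could be a single line that writes the expected text to the output file, which gives the oracle run something to pass against.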

Safe to run anything

Let agents loose without losing sleep

You're handing real code to an AI and letting it run. Seaport keeps every run sealed off from your machine, so a misbehaving agent can't do any damage. You just see the result and move on.

Every run starts from a clean, throwaway environment
Agents can't reach your files, secrets, or network
Strict time and resource limits, so nothing runs away
Locked down by default, no config required

jobs/seaport-<run-id>/

 jobs/seaport-<run-id>/
├── config.json
├── result.json # pass/fail counts, avg reward
└── <task-name>/
    ├── config.json
    ├── result.json
    ├── agent/
    │   └── trajectory.json # command, exit, stdout/err
    └── verifier/
        ├── reward.txt
        ├── test-stdout.txt
        └── test-stderr.txt

Results

See exactly what happened

Get a clean pass rate at a glance, and a full record of every attempt when you want to dig in. It's all plain JSON, so you can drop it into a dashboard, a spreadsheet, or your CI pipeline and watch your agent improve.

Pass rate at a glance

Full logs of every attempt

Plain JSON, CI-ready
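
To make the CI idea concrete, here is a small sketch that derives a pass rate directly from the per-task verifier/reward.txt files in a run directory. It assumes each reward.txt holds exactly 1 or 0, as in the hello-world layout; the function name pass_rate and the 80 percent threshold in the comment are hypothetical, and the schema of result.json is deliberately not relied on.

```shell
#!/usr/bin/env bash
# Sketch: compute a pass rate from the reward.txt files of one run.
# ASSUMPTION: every <task-name>/verifier/reward.txt contains 1 or 0.
# pass_rate is a hypothetical helper, not a Seaport command.
set -u

pass_rate() {
  local run_dir="$1" total=0 passed=0 f
  for f in "$run_dir"/*/verifier/reward.txt; do
    [ -f "$f" ] || continue              # glob matched nothing
    total=$((total + 1))
    if [ "$(tr -d '[:space:]' < "$f")" = "1" ]; then
      passed=$((passed + 1))
    fi
  done
  if [ "$total" -eq 0 ]; then
    echo "no reward files found in $run_dir" >&2
    return 1
  fi
  # Prints: <passed> <total> <integer percent>
  echo "$passed $total $((100 * passed / total))"
}

# Example CI gate (the threshold is illustrative):
#   read -r passed total pct < <(pass_rate "jobs/seaport-<run-id>")
#   [ "$pct" -ge 80 ] || exit 1
```

Reading reward.txt keeps the sketch independent of any particular JSON schema; if you prefer to parse result.json, check its actual field names first.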

Get started

Your first eval, in two minutes

One line to install, one line to run.

 curl -fsSL https://seaport.run/install.sh | bash


Works on macOS, Linux, and Windows