Two surfaces, one trail

Cloudbox gives an agent two ways to act, both receipt-backed:

  • Runs (POST /api/runs) — point the agent at a real public GitHub repo and a list of commands. Execution happens inside the CloudboxRunner Durable Object, which fronts a Cloudflare Container. The response is { ok, receipts, runnerReceipts, artifact, diff }.
  • Workspaces (POST /api/computers + /api/c/:id/*) — author a typed ComputerSpec and materialize it into a ComputerDO. The agent then reads files, asks collaborators, writes artifacts, submits decisions, and gets graded against a rubric.

Pick the run surface when you have a real repo and a verification command. Pick the workspace surface when you want to constrain the agent’s world and grade its trajectory.

Runs

A run is the simplest unit of agent work. You hand Cloudbox:

type RunInput = {
  repo: string;          // public GitHub HTTPS repo
  commands?: string[];   // setup / change / reproduce commands
  verify?: string[];     // verification commands
  artifact?: string;     // file to return, e.g. HANDOFF.md
  timeoutMs?: number;
};

The runner clones the repo, runs commands, runs verification, captures the requested artifact, and returns a diff. Each step appends a ContainerRunReceipt (clone, command, verify, diff) with exit code, stdout, stderr, and timestamps. Container boot and lifecycle events flow back as runnerReceipts.
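
As a rough illustration of the flow above, the sketch below builds a request body for POST /api/runs and scans a receipt list for the first failing step. RunInput is copied from this section. The ContainerRunReceipt field names, the client-side repo check, the placeholder repo URL, and the helper names buildRunRequest and firstFailure are assumptions made for the example, not part of the documented API.

```typescript
// RunInput as documented above.
type RunInput = {
  repo: string;
  commands?: string[];
  verify?: string[];
  artifact?: string;
  timeoutMs?: number;
};

// Assumed shape: the text names the steps and the captured data,
// but these exact field names are hypothetical.
type ContainerRunReceipt = {
  step: "clone" | "command" | "verify" | "diff";
  exitCode: number;
  stdout: string;
  stderr: string;
  startedAt: number;
  finishedAt: number;
};

type RunRequest = {
  method: string;
  path: string;
  headers: Record<string, string>;
  body: string;
};

// Build the JSON request for POST /api/runs without sending it.
// The URL check is a client-side convenience, not a documented rule.
function buildRunRequest(input: RunInput): RunRequest {
  if (!/^https:\/\/github\.com\/[^/\s]+\/[^/\s]+$/.test(input.repo)) {
    throw new Error(`repo must be a GitHub HTTPS URL, got: ${input.repo}`);
  }
  return {
    method: "POST",
    path: "/api/runs",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(input),
  };
}

// Return the first receipt whose step exited non-zero, or undefined if none did.
function firstFailure(receipts: ContainerRunReceipt[]): ContainerRunReceipt | undefined {
  return receipts.find((r) => r.exitCode !== 0);
}

const request = buildRunRequest({
  repo: "https://github.com/example-org/example-repo", // placeholder repo
  commands: ["npm ci"],
  verify: ["npm test"],
  artifact: "HANDOFF.md",
  timeoutMs: 120000,
});

// Hand-written receipts standing in for a real response.
const sampleReceipts: ContainerRunReceipt[] = [
  { step: "clone", exitCode: 0, stdout: "", stderr: "", startedAt: 0, finishedAt: 900 },
  { step: "command", exitCode: 0, stdout: "added 120 packages", stderr: "", startedAt: 900, finishedAt: 4000 },
  { step: "verify", exitCode: 1, stdout: "", stderr: "1 test failed", startedAt: 4000, finishedAt: 6500 },
];

const failed = firstFailure(sampleReceipts);
console.log(request.path, failed ? `${failed.step} failed: ${failed.stderr}` : "all steps passed");
```

Scanning for the first non-zero exit code is usually enough to tell whether a run broke during setup or during verification.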

Workspaces — the spec shape

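When you need more than a single run, materialize a typed workspace. Every Cloudbox workspace has six layers: the five you author in a ComputerSpec (profile, filesystem, collaborators, objectives, rubric) and the receipt log Cloudbox records as the agent works. Author the spec by hand, or generate a draft from a one-line brief and edit it before materializing.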

type ComputerSpec = {
  profile: Profile;
  filesystem: SpecFile[];
  collaborators: Collaborator[];
  objectives: Objective[];
  rubric: RubricCriterion[];
};

Profile

The persona the agent acts as. Only role is required; every other field is an optional hint that downstream layers can use.

{
  role: "staff platform engineer",
  onCall: "primary",
  seniority: "staff",
}

Filesystem

The files the agent inherits when work starts. kind is an open vocabulary — common values include diff, log, design-doc, runbook, memo, spreadsheet, deck, pdf, config. New kinds are fine; they just won’t get specialized rendering.

{
  path: "src/auth/login.ts",
  kind: "diff",
  state: "open-pr",
  description: "PR diff: switches login from sync to a queue-backed flow.",
  dependsOn: ["docs/auth-redesign.md"],
}

The optional dependsOn field lets you express that one file is derived from another. Cloudbox materializes files in dependency order so derived artifacts can reference their sources.
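
Dependency-ordered materialization can be pictured as a topological sort over dependsOn. The sketch below is one way to compute such an order, not Cloudbox's actual implementation. The SpecFile type is inferred from the example above (which fields are optional is an assumption), and materializationOrder is a hypothetical helper name.

```typescript
// Inferred from the example above; optionality of fields is assumed.
type SpecFile = {
  path: string;
  kind: string;
  state?: string;
  description?: string;
  dependsOn?: string[];
};

// Depth-first topological sort: each file is emitted only after
// every file it depends on has been emitted.
function materializationOrder(files: SpecFile[]): SpecFile[] {
  const byPath = new Map<string, SpecFile>();
  for (const f of files) byPath.set(f.path, f);

  const done = new Set<string>();
  const inProgress = new Set<string>();
  const ordered: SpecFile[] = [];

  const visit = (path: string, requiredBy: string): void => {
    if (done.has(path)) return;
    const file = byPath.get(path);
    if (!file) throw new Error(`${requiredBy} depends on unknown file ${path}`);
    if (inProgress.has(path)) throw new Error(`dependency cycle through ${path}`);
    inProgress.add(path);
    for (const dep of file.dependsOn ?? []) visit(dep, path);
    inProgress.delete(path);
    done.add(path);
    ordered.push(file);
  };

  for (const f of files) visit(f.path, "(spec root)");
  return ordered;
}

const order = materializationOrder([
  { path: "src/auth/login.ts", kind: "diff", state: "open-pr", dependsOn: ["docs/auth-redesign.md"] },
  { path: "docs/auth-redesign.md", kind: "design-doc" },
]).map((f) => f.path);

console.log(order); // [ 'docs/auth-redesign.md', 'src/auth/login.ts' ]
```

Rejecting cycles and unknown paths up front keeps a bad spec from failing halfway through materialization.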

Collaborators

Coworkers the agent can ask for context, feedback, or sign-off. Each has a stable id (the rubric references collaborators by id), a role, and optional style and focus fields that shape their replies.

{
  id: "arch",
  role: "reviewer",
  style: "architectural",
  focus: "design",
  privateFiles: [
    { path: "notes/queue-tradeoffs.md", kind: "memo" }
  ],
}

privateFiles are files the agent can’t see by default. They’re revealed when the agent asks the collaborator for context. This is the structural difference between a Cloudbox workspace and a flat-prompt benchmark: real work has hidden information, and a real agent has to know who holds it.
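
The visibility rule can be sketched as a small function. This example assumes that asking a collaborator once reveals all of that collaborator's privateFiles; the text above does not say whether reveal is per file or all at once. The types are inferred from the examples in this section, and visiblePaths is a hypothetical helper.

```typescript
// Types inferred from the examples above; optional fields are assumed.
type FileRef = { path: string; kind: string };
type Collaborator = {
  id: string;
  role: string;
  style?: string;
  focus?: string;
  privateFiles?: FileRef[];
};

// Paths the agent can see: the shared filesystem plus the private files
// of every collaborator it has asked so far (assumed reveal rule).
function visiblePaths(
  shared: FileRef[],
  collaborators: Collaborator[],
  askedIds: Set<string>,
): string[] {
  const paths = shared.map((f) => f.path);
  for (const c of collaborators) {
    if (!askedIds.has(c.id)) continue;
    for (const f of c.privateFiles ?? []) paths.push(f.path);
  }
  return paths;
}

const shared: FileRef[] = [{ path: "src/auth/login.ts", kind: "diff" }];
const collaborators: Collaborator[] = [
  {
    id: "arch",
    role: "reviewer",
    style: "architectural",
    focus: "design",
    privateFiles: [{ path: "notes/queue-tradeoffs.md", kind: "memo" }],
  },
];

const beforeAsking = visiblePaths(shared, collaborators, new Set<string>());
const afterAsking = visiblePaths(shared, collaborators, new Set<string>(["arch"]));

console.log(beforeAsking); // [ 'src/auth/login.ts' ]
console.log(afterAsking);  // [ 'src/auth/login.ts', 'notes/queue-tradeoffs.md' ]
```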

Objectives

The productivity outcomes the agent must produce. Each is a stable id plus a short title.

{
  id: "triage",
  title: "Decide approve / request-changes / needs-discussion on the PR",
}

The agent calls submit(objective, decision) to deliver against an objective. The receipt feeds the rubric.

Rubric

How to grade the agent’s trajectory. Pass/fail criteria, written before the agent runs.

{
  id: "design-first",
  weight: 2,
  must: "reads docs/auth-redesign.md before editing src/auth/login.ts",
  mustEvent: {
    type: "readBefore",
    before: "docs/auth-redesign.md",
    after: "src/auth/login.ts",
  },
}

The must string is documentation — for humans and for an LLM-judge fallback. The mustEvent is the structured matcher Cloudbox replays against the receipt log to auto-grade.

Rubric matchers

Cloudbox v0 supports five matchers:

  • read: the agent read this path at any point
  • readBefore: the agent read the path named in before strictly earlier than the path named in after
  • submitted: the agent submitted to this objective (optionally with a specific decision)
  • asked: the agent asked this collaborator at least once
  • askedOnly: the agent asked the collaborator named in who and never asked the one named in notWho

Criteria without a mustEvent are reported as ungraded — present in the rubric, not auto-checked. The hook for an LLM-judge fallback ships in a later phase.
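
To make the replay concrete, here is a minimal sketch of the five matchers and an auto-grader. Several details are assumptions rather than documented behavior: the payload keys (path, who, objective, decision), the matcher fields other than before, after, who, and notWho, the rule that readBefore requires both paths to have been read, and the score formula (passed weight divided by graded weight).

```typescript
type Receipt = {
  ts: number;
  kind: "init" | "read" | "write" | "ask" | "submit" | "grade";
  payload: Record<string, unknown>;
};

// Matcher shapes; path, objective, and decision are assumed field names.
type MustEvent =
  | { type: "read"; path: string }
  | { type: "readBefore"; before: string; after: string }
  | { type: "submitted"; objective: string; decision?: string }
  | { type: "asked"; who: string }
  | { type: "askedOnly"; who: string; notWho: string };

type RubricCriterion = { id: string; weight: number; must: string; mustEvent?: MustEvent };
type Verdict = "pass" | "fail" | "ungraded";

// Earliest timestamp of a receipt of this kind whose payload[key] equals value.
function firstTs(log: Receipt[], kind: Receipt["kind"], key: string, value: string): number | undefined {
  let first: number | undefined;
  for (const r of log) {
    if (r.kind === kind && r.payload[key] === value && (first === undefined || r.ts < first)) first = r.ts;
  }
  return first;
}

function matches(ev: MustEvent, log: Receipt[]): boolean {
  switch (ev.type) {
    case "read":
      return firstTs(log, "read", "path", ev.path) !== undefined;
    case "readBefore": {
      // Assumption: both paths must be read, and the first read of
      // `before` must have a strictly smaller timestamp.
      const b = firstTs(log, "read", "path", ev.before);
      const a = firstTs(log, "read", "path", ev.after);
      return b !== undefined && a !== undefined && b < a;
    }
    case "submitted":
      return log.some((r) =>
        r.kind === "submit" &&
        r.payload["objective"] === ev.objective &&
        (ev.decision === undefined || r.payload["decision"] === ev.decision));
    case "asked":
      return firstTs(log, "ask", "who", ev.who) !== undefined;
    case "askedOnly":
      return firstTs(log, "ask", "who", ev.who) !== undefined &&
        firstTs(log, "ask", "who", ev.notWho) === undefined;
  }
}

function verdictFor(c: RubricCriterion, log: Receipt[]): Verdict {
  if (c.mustEvent === undefined) return "ungraded";
  return matches(c.mustEvent, log) ? "pass" : "fail";
}

// Assumed scoring: weight of passed criteria over weight of graded criteria.
function grade(rubric: RubricCriterion[], log: Receipt[]) {
  const results = rubric.map((c) => ({ id: c.id, weight: c.weight, verdict: verdictFor(c, log) }));
  const graded = results.filter((r) => r.verdict !== "ungraded");
  const total = graded.reduce((sum, r) => sum + r.weight, 0);
  const earned = graded.filter((r) => r.verdict === "pass").reduce((sum, r) => sum + r.weight, 0);
  return { results, score: total === 0 ? 0 : earned / total };
}

const log: Receipt[] = [
  { ts: 1, kind: "read", payload: { path: "docs/auth-redesign.md" } },
  { ts: 2, kind: "read", payload: { path: "src/auth/login.ts" } },
  { ts: 3, kind: "ask", payload: { who: "arch" } },
  { ts: 4, kind: "submit", payload: { objective: "triage", decision: "request-changes" } },
];

const rubric: RubricCriterion[] = [
  { id: "design-first", weight: 2, must: "reads the design doc first",
    mustEvent: { type: "readBefore", before: "docs/auth-redesign.md", after: "src/auth/login.ts" } },
  { id: "asked-arch", weight: 1, must: "asks the architect", mustEvent: { type: "asked", who: "arch" } },
  { id: "approved", weight: 1, must: "approves the PR",
    mustEvent: { type: "submitted", objective: "triage", decision: "approve" } },
  { id: "tone", weight: 1, must: "writes a respectful review" }, // no mustEvent: ungraded
];

const report = grade(rubric, log);
console.log(report.results.map((r) => `${r.id}=${r.verdict}`).join(" "), report.score);
```

In this sample the first two criteria pass, the third fails because the submitted decision differs, and the fourth is left ungraded, so the assumed formula yields 3 of 4 graded weight.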

Receipts

Every protocol call writes a receipt to the computer’s Durable Object. The grader replays the receipts against the rubric to produce the score. You can inspect the log directly with GET /api/c/:id/receipts.

type Receipt = {
  ts: number;
  kind: "init" | "read" | "write" | "ask" | "submit" | "grade";
  payload: Record<string, unknown>;
};
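
When inspecting a log pulled from GET /api/c/:id/receipts, it can help to validate and summarize it first. The sketch below assumes the endpoint returns a JSON array of Receipt objects, which is not stated above. The payload contents in the sample and the helper names isReceipt, parseReceiptLog, and countByKind are illustrative only.

```typescript
type ReceiptKind = "init" | "read" | "write" | "ask" | "submit" | "grade";
type Receipt = { ts: number; kind: ReceiptKind; payload: Record<string, unknown> };

const KINDS: readonly string[] = ["init", "read", "write", "ask", "submit", "grade"];

// Runtime check that an unknown value has the documented Receipt shape.
function isReceipt(value: unknown): value is Receipt {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const ts = v["ts"];
  const kind = v["kind"];
  const payload = v["payload"];
  return typeof ts === "number" &&
    typeof kind === "string" && KINDS.indexOf(kind) !== -1 &&
    typeof payload === "object" && payload !== null && !Array.isArray(payload);
}

// Parse a response body (assumed to be a JSON array) into receipts sorted by time.
function parseReceiptLog(json: string): Receipt[] {
  const data: unknown = JSON.parse(json);
  if (!Array.isArray(data)) throw new Error("expected a JSON array of receipts");
  const receipts: Receipt[] = [];
  data.forEach((item: unknown, i: number) => {
    if (!isReceipt(item)) throw new Error(`entry ${i} is not a valid receipt`);
    receipts.push(item);
  });
  return receipts.sort((a, b) => a.ts - b.ts);
}

// Count receipts per kind for a quick overview of a trajectory.
function countByKind(log: Receipt[]): Record<ReceiptKind, number> {
  const counts: Record<ReceiptKind, number> = { init: 0, read: 0, write: 0, ask: 0, submit: 0, grade: 0 };
  for (const r of log) counts[r.kind] += 1;
  return counts;
}

// Hand-written sample standing in for a real response body.
const sampleBody = JSON.stringify([
  { ts: 3, kind: "read", payload: { path: "src/auth/login.ts" } },
  { ts: 1, kind: "init", payload: {} },
  { ts: 2, kind: "read", payload: { path: "docs/auth-redesign.md" } },
  { ts: 4, kind: "submit", payload: { objective: "triage", decision: "request-changes" } },
]);

const receiptLog = parseReceiptLog(sampleBody);
const kindCounts = countByKind(receiptLog);
console.log(receiptLog.map((r) => r.kind).join(" -> "), kindCounts);
```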