Two surfaces, one trail
Cloudbox gives an agent two ways to act, both receipt-backed:
- Runs (POST /api/runs): point the agent at a real public GitHub repo and a list of commands. Execution happens inside the CloudboxRunner Durable Object, which fronts a Cloudflare Container. The response is { ok, receipts, runnerReceipts, artifact, diff }.
- Workspaces (POST /api/computers + /api/c/:id/*): author a typed ComputerSpec and materialize it into a Computer DO. The agent then reads files, asks collaborators, writes artifacts, submits decisions, and gets graded against a rubric.
Pick the run surface when you have a real repo and a verification command. Pick the workspace surface when you want to constrain the agent’s world and grade its trajectory.
Runs
A run is the simplest unit of agent work. You hand Cloudbox:
type RunInput = {
  repo: string;        // public GitHub HTTPS repo
  commands?: string[]; // setup / change / reproduce commands
  verify?: string[];   // verification commands
  artifact?: string;   // file to return, e.g. HANDOFF.md
  timeoutMs?: number;
};
The runner clones the repo, runs commands, runs verification, captures the requested artifact, and returns a diff. Each step appends a ContainerRunReceipt (clone, command, verify, diff) with exit code, stdout, stderr, and timestamps. Container boot and lifecycle events flow back as runnerReceipts.
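As a minimal sketch of the client side of a run, the snippet below builds the POST /api/runs request without sending it and scans the returned step receipts for the first failure. The StepReceipt field names (step, exitCode, stdout, stderr), the helper names, and the GitHub URL check are assumptions for illustration; the source only states which information each receipt carries.

```typescript
type RunInput = {
  repo: string;
  commands?: string[];
  verify?: string[];
  artifact?: string;
  timeoutMs?: number;
};

// Assumed shape: the docs list the information a receipt carries,
// not its exact field names.
type StepReceipt = {
  step: "clone" | "command" | "verify" | "diff";
  exitCode: number;
  stdout: string;
  stderr: string;
};

// Build the pieces of a POST /api/runs request without sending it.
// The URL prefix check is a hypothetical client-side guard.
function buildRunRequest(input: RunInput) {
  if (!input.repo.startsWith("https://github.com/")) {
    throw new Error("repo must be a public GitHub HTTPS URL");
  }
  return {
    method: "POST" as const,
    path: "/api/runs",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(input),
  };
}

// Return the first step that exited non-zero, or undefined if all succeeded.
function firstFailure(receipts: StepReceipt[]): StepReceipt | undefined {
  return receipts.find((r) => r.exitCode !== 0);
}
```

The returned object can be handed to whatever HTTP client you use. Keeping the request builder pure makes it easy to check before any container time is spent.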
Workspaces — the spec shape
When you need more than a single run, materialize a typed workspace. Every Cloudbox workspace is six layers. Author them by hand, or generate a draft from a one-line brief and edit before materializing.
type ComputerSpec = {
  profile: Profile;
  filesystem: SpecFile[];
  collaborators: Collaborator[];
  objectives: Objective[];
  rubric: RubricCriterion[];
};
Profile
The persona the agent acts as. Required: a role. Everything else is optional hinting that downstream layers can use.
{
  role: "staff platform engineer",
  onCall: "primary",
  seniority: "staff",
}
Filesystem
The files the agent inherits when work starts. kind is an open vocabulary — common values include diff, log, design-doc, runbook, memo, spreadsheet, deck, pdf, config. New kinds are fine; they just won’t get specialized rendering.
{
  path: "src/auth/login.ts",
  kind: "diff",
  state: "open-pr",
  description: "PR diff: switches login from sync to a queue-backed flow.",
  dependsOn: ["docs/auth-redesign.md"],
}
The optional dependsOn field lets you express that one file is derived from another. Cloudbox materializes files in dependency order so derived artifacts can reference their sources.
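Dependency-ordered materialization can be pictured as a topological sort over dependsOn. The sketch below is one way to compute such an order with a depth-first traversal; it is not Cloudbox's actual implementation. The function name materializationOrder is hypothetical, and throwing on cycles or missing dependencies is an assumption.

```typescript
type SpecFile = { path: string; kind: string; dependsOn?: string[] };

// Depth-first topological sort: every file appears after the files it depends on.
function materializationOrder(files: SpecFile[]): SpecFile[] {
  const byPath = new Map<string, SpecFile>();
  for (const f of files) byPath.set(f.path, f);

  const state = new Map<string, "visiting" | "done">();
  const ordered: SpecFile[] = [];

  function visit(file: SpecFile): void {
    const s = state.get(file.path);
    if (s === "done") return;
    // Assumed behavior: a cycle is an authoring error, so fail loudly.
    if (s === "visiting") throw new Error(`dependency cycle at ${file.path}`);
    state.set(file.path, "visiting");
    for (const dep of file.dependsOn ?? []) {
      const target = byPath.get(dep);
      // Assumed behavior: a dependency must name a file in the same spec.
      if (!target) throw new Error(`${file.path} depends on missing file ${dep}`);
      visit(target);
    }
    state.set(file.path, "done");
    ordered.push(file);
  }

  for (const f of files) visit(f);
  return ordered;
}
```

With the example above, docs/auth-redesign.md is placed before src/auth/login.ts regardless of the order in which the two files appear in the spec.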
Collaborators
Coworkers the agent can ask for context, feedback, or sign-off. Each has a stable id (the rubric references collaborators by id), a role, and optional style and focus fields that shape their replies.
{
  id: "arch",
  role: "reviewer",
  style: "architectural",
  focus: "design",
  privateFiles: [
    { path: "notes/queue-tradeoffs.md", kind: "memo" }
  ],
}
privateFiles are files the agent can’t see by default. They’re revealed when the agent asks the collaborator for context. This is the structural difference between a Cloudbox and a flat-prompt benchmark: real work has hidden information, and a real agent has to know who holds it.
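To make the hidden-information rule concrete, here is a small sketch of how visibility could be computed. It assumes that asking a collaborator once reveals all of that collaborator's privateFiles; the source does not say whether the reveal is all-at-once or selective, and the visiblePaths helper is hypothetical.

```typescript
type FileRef = { path: string; kind: string };
type Collaborator = { id: string; role: string; privateFiles?: FileRef[] };

// Paths the agent can see: the shared filesystem plus the private files of
// every collaborator it has already asked (assumed reveal rule).
function visiblePaths(
  shared: FileRef[],
  collaborators: Collaborator[],
  askedIds: Set<string>,
): string[] {
  const paths = shared.map((f) => f.path);
  for (const c of collaborators) {
    if (!askedIds.has(c.id)) continue;
    for (const f of c.privateFiles ?? []) paths.push(f.path);
  }
  return paths;
}
```

Before any ask, only the shared filesystem is returned. After the agent asks arch, notes/queue-tradeoffs.md joins the visible set.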
Objectives
The productivity outcomes the agent must produce. Each is a stable id plus a short title.
{
  id: "triage",
  title: "Decide approve / request-changes / needs-discussion on the PR",
}
The agent calls submit(objective, decision) to deliver against an objective. The receipt feeds the rubric.
Rubric
How to grade the agent’s trajectory. Pass/fail criteria, written before the agent runs.
{
  id: "design-first",
  weight: 2,
  must: "reads docs/auth-redesign.md before editing src/auth/login.ts",
  mustEvent: {
    type: "readBefore",
    before: "docs/auth-redesign.md",
    after: "src/auth/login.ts",
  },
}
The must string is documentation — for humans and for an LLM-judge fallback. The mustEvent is the structured matcher Cloudbox replays against the receipt log to auto-grade.
Rubric matchers
Cloudbox v0 supports five matchers:
- read: agent read this path at any point
- readBefore: agent read the before path strictly before the after path
- submitted: agent submitted to this objective (optionally with a specific decision)
- asked: agent asked this collaborator at least once
- askedOnly: agent asked who and did not ask notWho
Criteria without a mustEvent are reported as ungraded — present in the rubric, not auto-checked. The hook for an LLM-judge fallback ships in a later phase.
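The five matchers can be sketched as a discriminated union replayed over the receipt log. This is an illustrative reading, not Cloudbox's grader. The payload keys (path, collaborator, objective, decision) and the matcher fields path, objective, and who on read, submitted, and asked are assumptions. It also assumes readBefore compares the first read of each path and fails if either path was never read.

```typescript
type Receipt = { ts: number; kind: string; payload: Record<string, unknown> };

type MustEvent =
  | { type: "read"; path: string }
  | { type: "readBefore"; before: string; after: string }
  | { type: "submitted"; objective: string; decision?: string }
  | { type: "asked"; who: string }
  | { type: "askedOnly"; who: string; notWho: string };

// Timestamp of the earliest read of `path`, or undefined if never read.
function firstReadTs(log: Receipt[], path: string): number | undefined {
  let first: number | undefined;
  for (const r of log) {
    if (r.kind !== "read" || r.payload["path"] !== path) continue;
    if (first === undefined || r.ts < first) first = r.ts;
  }
  return first;
}

function wasAsked(log: Receipt[], who: string): boolean {
  return log.some((r) => r.kind === "ask" && r.payload["collaborator"] === who);
}

function matches(event: MustEvent, log: Receipt[]): boolean {
  switch (event.type) {
    case "read":
      return firstReadTs(log, event.path) !== undefined;
    case "readBefore": {
      const b = firstReadTs(log, event.before);
      const a = firstReadTs(log, event.after);
      return b !== undefined && a !== undefined && b < a;
    }
    case "submitted":
      return log.some(
        (r) =>
          r.kind === "submit" &&
          r.payload["objective"] === event.objective &&
          (event.decision === undefined || r.payload["decision"] === event.decision),
      );
    case "asked":
      return wasAsked(log, event.who);
    case "askedOnly":
      return wasAsked(log, event.who) && !wasAsked(log, event.notWho);
  }
}

// Criteria without a mustEvent are reported as ungraded rather than failed.
function gradeCriterion(
  criterion: { mustEvent?: MustEvent },
  log: Receipt[],
): "pass" | "fail" | "ungraded" {
  if (!criterion.mustEvent) return "ungraded";
  return matches(criterion.mustEvent, log) ? "pass" : "fail";
}
```

Because the union is closed, adding a sixth matcher type forces a compile-time update to the switch, which keeps the rubric vocabulary and the grader in step.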
Receipts
Every protocol call writes a receipt to the computer’s Durable Object. The grader replays the receipts against the rubric to produce the score. You can inspect the log directly with GET /api/c/:id/receipts.
type Receipt = {
  ts: number;
  kind: "init" | "read" | "write" | "ask" | "submit" | "grade";
  payload: Record<string, unknown>;
};
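The source does not specify how per-criterion outcomes combine into a score. As one plausible reading, the sketch below assumes the score is the weight of passed criteria divided by the weight of all auto-graded criteria, with ungraded criteria left out of both totals. The CriterionResult type and weightedScore helper are hypothetical.

```typescript
type CriterionResult = {
  id: string;
  weight: number;
  outcome: "pass" | "fail" | "ungraded";
};

// Assumed scoring rule: weighted pass fraction over graded criteria only.
function weightedScore(results: CriterionResult[]): {
  earned: number;
  possible: number;
  ratio: number | null;
} {
  let earned = 0;
  let possible = 0;
  for (const r of results) {
    if (r.outcome === "ungraded") continue; // not auto-checked, so not counted
    possible += r.weight;
    if (r.outcome === "pass") earned += r.weight;
  }
  // ratio is null when nothing was auto-graded, to avoid dividing by zero.
  return { earned, possible, ratio: possible === 0 ? null : earned / possible };
}
```

Excluding ungraded criteria keeps a rubric with many human-judged items from dragging down the automatic score, at the cost of hiding how much of the rubric was never checked. Reporting possible alongside ratio makes that gap visible.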