Troubleshooting guide
How users and agents should diagnose common setup, tool, Signal, Friction, and MCP problems.
Troubleshooting should preserve evidence and avoid hidden fixes. If a setup problem changes operating authority, fix it through the normal ROST workflow.
Reading command errors (start here)
Every command failure returns a structured error you can act on — you should rarely see an opaque "internal error" for a known state:
- `code`: a stable machine code. A known business precondition (for example a manifest that is not yet signed, or a dry run that has not passed) returns `COMMAND_PRECONDITION_FAILED` (HTTP 409), not a 500.
- `message`: a readable explanation of what was not satisfied.
- `help`: the exact next command to run. On a failure, read the `help` field and run the command it names; do not retry the same call blindly. The CLI also prints it as a `→ try: …` line.
- `requestId`: quote this when reporting a genuine internal error (the only thing a real 500 returns; it never leaks internal detail).
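The four fields above lend themselves to a small triage routine. The following is a minimal Python sketch, assuming the error body arrives as a JSON object with the keys `code`, `message`, `help`, and `requestId`, and that the caller knows the HTTP status separately. The function name `triage`, the returned strings, and the sample payload are illustrative assumptions, not part of ROST.

```python
import json


def triage(error_json: str, http_status: int) -> str:
    """Suggest a next step for a failed command (illustrative sketch only)."""
    err = json.loads(error_json)
    if http_status >= 500:
        # A genuine internal error: report it and quote the request id.
        return f"report internal error, requestId={err.get('requestId')}"
    help_text = err.get("help")
    if help_text:
        # Known state: run the command named in `help` instead of retrying.
        return f"run: {help_text}"
    # No help pointer present: fall back to the readable message.
    return f"read message: {err.get('message')}"


# Hypothetical payload; the real help text comes from the server.
sample = json.dumps({
    "code": "COMMAND_PRECONDITION_FAILED",
    "message": "manifest is not yet signed",
    "help": "rost command schema charter.sign_manifest",
    "requestId": "req-123",
})
print(triage(sample, 409))
```

The sketch deliberately never retries the failed call; it only reports which step the error itself points to.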
Before calling a command that changes state, discover its exact shape so you do not guess the JSON: `rost command schema <id>` (or `rost_describe_command`) returns the input/output JSON Schema, the help pointer, and a validated worked example; `rost command list` (or `rost_list_commands`) enumerates the surface.
Common checks
- Onboarding seems stuck: call `onboarding.status` / `rost_onboard_status` and inspect missing graph, Charter, Compass, or staffing steps.
- Graph looks wrong: read `graph.get` / `rost://graph` to confirm seat ids, parents, and occupancy before mutating.
- Agent cannot act: check the Charter, permission manifest, Steward chain, and token scope with `agent.status` / `rost_get_agent_status` and `seat.get`.
- Signal looks wrong: read `signal.list` / `rost_list_signals` and check owner seat, cadence, target, and evidence.
- Friction is noisy: read `friction.list` and check whether the underlying Charter or measurable is unclear.
- Escalations are aging: read `escalation.list` / `rost_list_escalations`; a human resolves through the Steward queue.
- MCP access fails: revoke and recreate the narrowest token after checking scope (`mcp_token.revoke` then `rost mcp install --client <client> --scope seat --seat-id <seat-id>`; standalone `mcp install` requires an explicit `--scope`).
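For an agent that triages by symptom, the read-only part of the checklist above can be kept as a lookup table. This is a minimal Python sketch: the symptom keys and the helper name `reads_for` are assumptions made for illustration, while the surface names are copied from the checklist.

```python
# Symptom keyword -> read surfaces named in the checklist (command id, MCP tool or resource).
READ_FIRST = {
    "onboarding": ["onboarding.status", "rost_onboard_status"],
    "graph": ["graph.get", "rost://graph"],
    "agent": ["agent.status", "rost_get_agent_status", "seat.get"],
    "signal": ["signal.list", "rost_list_signals"],
    "friction": ["friction.list"],
    "escalation": ["escalation.list", "rost_list_escalations"],
}


def reads_for(symptom: str) -> list[str]:
    """Return the read surfaces to inspect first for a symptom keyword."""
    key = symptom.strip().lower()
    if key not in READ_FIRST:
        # Unknown symptom: fail loudly rather than guess a surface.
        raise KeyError(f"no checklist entry for {symptom!r}")
    return list(READ_FIRST[key])


print(reads_for("Signal"))
```

The table holds reads only; the MCP token fix is left out because revoking and re-minting a token changes credentials.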
Surface-specific failures
- "Not logged in" or 401: run `rost login`, then `rost whoami`.
- "Wrong tenant": `rost tenants` then `rost use <tenant-slug-or-id>`.
- "Command denied by scope or manifest": a seat token cannot run tenant-admin setup. Switch to a tenant-admin token, or ask a human Steward to update the seat Charter and permission manifest.
- "Confirmation required": the command is gated. The CLI prints `rost command confirmation.approve --json ...` or a web link. A human approves; an agent does not approve its own request.
- Revoked, expired, or invalid MCP token: run `rost mcp install --client <client> --scope <tenant-admin|seat>` again to mint and register a fresh one (standalone install requires an explicit `--scope`), or rotate with `--rotate <old-token-id>` (rotation inherits the old token's scope, so no `--scope` is needed). Tokens minted by `mcp install` default to a 90-day expiry; check `expires_in_days` in `rost command mcp_token.list`.
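Because tokens expire after 90 days by default, it can help to flag those close to expiry before they start failing. The following Python sketch assumes the output of `mcp_token.list` has already been parsed into a list of dictionaries that each carry `expires_in_days`. The `id` key, the 14-day threshold, and the function name are hypothetical.

```python
def expiring_tokens(rows, warn_days=14):
    """Return ids of tokens whose expires_in_days is at or below warn_days."""
    flagged = []
    for row in rows:
        days = row.get("expires_in_days")
        # Skip rows without a numeric expiry rather than guessing a value.
        if isinstance(days, (int, float)) and days <= warn_days:
            flagged.append(row["id"])  # "id" is an assumed field name
    return flagged


# Hypothetical rows; the real shape comes from `rost command mcp_token.list`.
rows = [
    {"id": "tok-a", "expires_in_days": 90},
    {"id": "tok-b", "expires_in_days": 3},
    {"id": "tok-c"},
]
print(expiring_tokens(rows))
```

A flagged token would then be rotated or re-minted by a human or suitably scoped operator, following the bullet above.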
Agent-creation failures
These are the common blockers when adding an agent (see the add-agents guide and the custom agents guide):
- Missing Steward: an agent occupancy or go-live is blocked because no Steward chain resolves to a human. Name a human Steward on the seat, then retry. The no-orphan-agent rule is enforced server-side; do not route around it.
- Failed dry run: the draft is kept and the failure reason is shown. Read it with `agent.status` / `rost_get_agent_status`, fix the Charter, manifest, or tool decision, then `agent.run_dry_run` again. A passed dry run is required before go-live.
- Declined tool: declining a proposed tool updates the permission manifest and the dry-run task. If the agent then cannot complete the task, either grant a narrower tool or adjust the Charter so the work still routes safely or escalates.
- Expired confirmation: a pending human gate expired before approval. Re-issue the gated command (for example `agent.create_from_template`, `charter.sign_manifest`, or `agent.go_live`) and approve the new confirmation; an agent never approves its own request.
- Runner offline: a Local Runner lane agent cannot run because its Runner is offline. Check state with `runner.list` / `rost_list_runners`; bring the Runner back or re-pair it with `runner.pairing.start`. Scheduled runs should surface as Friction or escalation, not fail silently.
- Token revoked after go-live: a live agent shows degraded because its credential or MCP token was revoked. Re-mint the narrowest token with `rost mcp install` or re-ingress the credential through the vault path; scheduled runs fail toward Friction/escalation until it is restored.
When to stop for confirmation
Most reads are safe to run while diagnosing. Any fix that changes authority, credentials, go-live state, or a durable decision is gated (`human_required`, `credential_flow`, or `dangerous`) and routes through a human confirmation (`confirmation.approve`). Diagnose freely; stop before approving.
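The "diagnose freely; stop before approving" rule can be sketched as a plan splitter. This Python example assumes each planned step is a `(command_id, gate)` pair in which `gate` is one of the three gated levels named above, or `None` for an ungated read. How a command's gate is actually reported is not specified here, so the pairing and the function name are assumptions.

```python
GATED_LEVELS = {"human_required", "credential_flow", "dangerous"}


def split_plan(steps):
    """Split (command_id, gate) pairs into runnable steps and steps needing a human."""
    run_now, needs_human = [], []
    for command_id, gate in steps:
        # Anything carrying a gated level waits for human confirmation.
        target = needs_human if gate in GATED_LEVELS else run_now
        target.append(command_id)
    return run_now, needs_human


# Gate labels here are illustrative; only the level names come from the guide.
plan = [
    ("agent.status", None),
    ("runner.list", None),
    ("agent.go_live", "human_required"),
]
print(split_plan(plan))
```

An agent using such a split would run only the first list and hand the second to a human Steward for confirmation.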
Agent guidance
Name the failing surface, collect evidence, recommend the smallest correction, and escalate when the fix changes authority, credentials, or go-live state. Never paste secrets into chat or tool arguments while troubleshooting.