Troubleshooting guide

How users and agents should diagnose common setup, tool, Signal, Friction, and MCP problems.

Troubleshooting should preserve evidence and avoid hidden fixes. If a setup problem changes operating authority, fix it through the normal ROST workflow.

Reading command errors (start here)

Every command failure returns a structured error you can act on — you should rarely see an opaque "internal error" for a known state:

  • code: a stable machine code. A known business precondition (for example, a manifest that is not yet signed, or a dry run that has not passed) returns COMMAND_PRECONDITION_FAILED (HTTP 409), not a 500.
  • message: a readable explanation of what was not satisfied.
  • help: the exact next command to run. On a failure, run the command named here instead of retrying the same call blindly. The CLI also prints it as a → try: … line.
  • requestId: quote this when reporting a genuine internal error. It is the only thing a real 500 returns, and the response never leaks internal detail.
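
As an illustration of how an agent or script might act on these fields, here is a minimal Python sketch. It assumes the error arrives as a JSON object with exactly the four keys listed above; the sample payload, its message text, and the function name next_step are hypothetical, not taken from the product.

```python
import json


def next_step(error):
    """Suggest a next action from a structured command error (illustrative only)."""
    help_text = error.get("help")
    request_id = error.get("requestId")
    if help_text:
        # Known state: run the command named in `help` rather than retrying blindly.
        return "try: " + help_text
    if request_id:
        # No help pointer: treat it as a genuine internal error and quote the requestId.
        return "report internal error, requestId=" + request_id
    return "unrecognized error shape; collect evidence and escalate"


# Hypothetical payload; only the key names come from the guide.
sample = json.loads("""
{
  "code": "COMMAND_PRECONDITION_FAILED",
  "message": "A precondition was not satisfied.",
  "help": "rost command schema <id>",
  "requestId": "req_example"
}
""")

print(next_step(sample))
```

The sketch checks help first because the guide treats it as the authoritative next step; the requestId only matters when no help pointer is available.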

Before calling a command that changes state, look up its exact shape so you do not guess the JSON. Run rost command schema <id> (or call rost_describe_command) to get the input/output JSON Schema, the help pointer, and a validated worked example; run rost command list (or call rost_list_commands) to enumerate the available commands.

Common checks

  • Onboarding seems stuck: call onboarding.status / rost_onboard_status and inspect missing graph, Charter, Compass, or staffing steps.
  • Graph looks wrong: read graph.get / rost://graph to confirm seat ids, parents, and occupancy before mutating.
  • Agent cannot act: check the Charter, permission manifest, Steward chain, and token scope with agent.status / rost_get_agent_status and seat.get.
  • Signal looks wrong: read signal.list / rost_list_signals and check owner seat, cadence, target, and evidence.
  • Friction is noisy: read friction.list and check whether the underlying Charter or measurable is unclear.
  • Escalations are aging: read escalation.list / rost_list_escalations; a human resolves through the Steward queue.
  • MCP access fails: check the token's scope, then revoke the token and recreate the narrowest one (mcp_token.revoke, then rost mcp install --client <client> --scope seat --seat-id <seat-id>; standalone mcp install requires an explicit --scope).
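
The read-first pattern in the checks above can be sketched as a small lookup table. This is only an illustration: the symptom keys and the helper name reads_for are hypothetical labels, while the command ids are the ones named in the list.

```python
# Symptom label (hypothetical) -> read commands to run before mutating anything.
READ_CHECKS = {
    "onboarding_stuck": ["onboarding.status"],
    "graph_wrong": ["graph.get"],
    "agent_cannot_act": ["agent.status", "seat.get"],
    "signal_wrong": ["signal.list"],
    "friction_noisy": ["friction.list"],
    "escalations_aging": ["escalation.list"],
}


def reads_for(symptom):
    """Return the read commands to inspect for a symptom, or an empty list if unknown."""
    return READ_CHECKS.get(symptom, [])


for command_id in reads_for("agent_cannot_act"):
    print("read first:", command_id)
```

MCP access failures are left out of the table on purpose, because their fix (revoking and re-minting a token) changes credentials rather than only reading state.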

Surface-specific failures

  • "Not logged in" or 401: run rost login, then rost whoami.
  • "Wrong tenant": run rost tenants, then rost use <tenant-slug-or-id>.
  • "Command denied by scope or manifest": a seat token cannot run tenant-admin setup. Switch to a tenant-admin token, or ask a human Steward to update the seat Charter and permission manifest.
  • "Confirmation required": the command is gated. The CLI prints rost command confirmation.approve --json ... or a web link. A human approves; an agent does not approve its own request.
  • Revoked, expired, or invalid MCP token: run rost mcp install --client <client> --scope <tenant-admin|seat> again to mint and register a fresh one (standalone install requires an explicit --scope), or rotate with --rotate <old-token-id> (rotation inherits the old token's scope, so --scope is not needed). Tokens minted by mcp install expire after 90 days by default; check expires_in_days in rost command mcp_token.list.
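
To catch expiring tokens before they cause failures, a script could scan the token list for low expires_in_days values. The sketch below is a guess at how that might look: it assumes the output of mcp_token.list has already been parsed into a list of dictionaries, that each record carries expires_in_days (named in the guide) and an id field (an assumption), and it uses an arbitrary 14-day warning window.

```python
def tokens_needing_rotation(tokens, warn_within_days=14):
    """Return ids of tokens that expire within the warning window (illustrative)."""
    due = []
    for token in tokens:
        days_left = token.get("expires_in_days")
        if days_left is None:
            # Field missing: skip rather than guess at the expiry.
            continue
        if days_left <= warn_within_days:
            due.append(token["id"])
    return due


# Hypothetical records; real output may carry different or additional fields.
example_tokens = [
    {"id": "tok_a", "expires_in_days": 90},
    {"id": "tok_b", "expires_in_days": 3},
    {"id": "tok_c", "expires_in_days": 0},
]

print(tokens_needing_rotation(example_tokens))
```

Any token the sketch flags would still be rotated through rost mcp install with --rotate <old-token-id>, as described above.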

Agent-creation failures

These are the common blockers when adding an agent (see the add-agents guide and the custom agents guide):

  • Missing Steward: an agent occupancy or go-live is blocked because no Steward chain resolves to a human. Name a human Steward on the seat, then retry. The no-orphan-agent rule is enforced server-side; do not route around it.
  • Failed dry run: the draft is kept and the failure reason is shown. Read it with agent.status / rost_get_agent_status, fix the Charter, manifest, or tool decision, then run agent.run_dry_run again. A passed dry run is required before go-live.
  • Declined tool: declining a proposed tool updates the permission manifest and the dry-run task. If the agent then cannot complete the task, either grant a narrower tool or adjust the Charter so the work still routes safely or escalates.
  • Expired confirmation: a pending human gate expired before approval. Re-issue the gated command (for example agent.create_from_template, charter.sign_manifest, or agent.go_live) and approve the new confirmation; an agent never approves its own request.
  • Runner offline: a Local Runner lane agent cannot run because its Runner is offline. Check state with runner.list / rost_list_runners; bring the Runner back or re-pair it with runner.pairing.start. Scheduled runs should surface as Friction or escalation, not fail silently.
  • Token revoked after go-live: a live agent shows degraded because its credential or MCP token was revoked. Re-mint the narrowest token with rost mcp install or re-ingress the credential through the vault path; scheduled runs fail toward Friction/escalation until it is restored.
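
Several of the blockers above can be thought of as a checklist to run before requesting go-live. The following sketch is purely illustrative: the AgentDraft record and its boolean fields are hypothetical stand-ins for whatever agent.status, seat.get, and runner.list actually report, and the real enforcement happens server-side.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class AgentDraft:
    """Hypothetical summary of the state gathered from read commands."""
    has_human_steward: bool
    dry_run_passed: bool
    runner_online: bool
    token_valid: bool


def go_live_blockers(draft: AgentDraft) -> List[str]:
    """List the blockers from this guide that apply to the draft."""
    blockers = []
    if not draft.has_human_steward:
        blockers.append("Missing Steward: name a human Steward on the seat")
    if not draft.dry_run_passed:
        blockers.append("Failed dry run: fix the cause, then rerun agent.run_dry_run")
    if not draft.runner_online:
        blockers.append("Runner offline: check runner.list or re-pair the Runner")
    if not draft.token_valid:
        blockers.append("Token revoked: re-mint the narrowest token")
    return blockers


draft = AgentDraft(has_human_steward=False, dry_run_passed=False,
                   runner_online=True, token_valid=True)
for blocker in go_live_blockers(draft):
    print(blocker)
```

A client-side checklist like this only saves a round trip; it does not replace the server-side rules, and it should never be used to route around them.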

When to stop for confirmation

Most reads are safe to run while diagnosing. Any fix that changes authority, credentials, go-live state, or a durable decision is gated (human_required, credential_flow, or dangerous) and routes through a human confirmation (confirmation.approve). Diagnose freely; stop before approving.
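
An agent could encode this stop rule as a simple guard. The sketch below assumes that each command's metadata exposes a gate label as a string, which the guide does not confirm; only the three gate names come from the text, and the function name and the "read" label are hypothetical.

```python
from typing import Optional

# Gate labels named in this guide; any of them means a human must confirm.
GATED = {"human_required", "credential_flow", "dangerous"}


def may_run_unattended(gate: Optional[str]) -> bool:
    """Return True only when the command carries none of the gated labels."""
    return gate not in GATED


for label in ("read", "dangerous", None):
    action = "run" if may_run_unattended(label) else "stop for human confirmation"
    print(label, "->", action)
```

The guard deliberately returns a decision rather than approving anything itself, since an agent never approves its own request.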

Agent guidance

Name the failing surface, collect evidence, recommend the smallest correction, and escalate when the fix changes authority, credentials, or go-live state. Never paste secrets into chat or tool arguments while troubleshooting.