chatcli daemon start --detach) — but they work identically in in-process mode if you keep chatcli open.
1. Terraform deploy + K8s wait + validate
Apply infrastructure, wait for the deployment to become Available, then print the final state. One-liner; you can close the CLI.
2. Docker compose + healthcheck + smoke tests
The and(...) condition is satisfied only when the container is healthy and the endpoint returns 200, which avoids the classic “container up but app still initializing” case.
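To make the composite condition concrete, here is a minimal Python sketch of the and-combinator idea. It is an illustration only, not chatcli's evaluator: `all_of`, `container_healthy`, and `endpoint_ok` are hypothetical names, and the probes are stubbed with a small in-memory state instead of a real health check or HTTP request.

```python
from typing import Callable

Condition = Callable[[], bool]

def all_of(*conditions: Condition) -> Condition:
    """Return a condition that holds only when every sub-condition holds."""
    def check() -> bool:
        return all(cond() for cond in conditions)
    return check

# Stubbed probe state standing in for a real health check and HTTP request.
state = {"health": "starting", "http_status": 503}

def container_healthy() -> bool:
    return state["health"] == "healthy"

def endpoint_ok() -> bool:
    return state["http_status"] == 200

ready = all_of(container_healthy, endpoint_ok)

state["health"] = "healthy"   # container is up, app still initializing
print(ready())                # False

state["http_status"] = 200    # app finished starting
print(ready())                # True
```

The first check fails even though the container reports healthy, which is exactly the window the combined condition is meant to cover.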
3. Nightly backup via cron
- Runs every day at 02:00 local time.
- 1h timeout (long restore tolerated).
- Up to 3 retries with exponential backoff.
- TTL of 7 days in /jobs history.
- Filterable by tag in /jobs list --tag category=backup.
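If you want a feel for what "up to 3 retries with exponential backoff" means in practice, the sketch below models it in Python. The base delay, multiplier, and cap are assumptions picked for illustration, not chatcli defaults, and `run_with_retries` is a hypothetical helper rather than part of the scheduler.

```python
from typing import Callable, List

def backoff_delays(retries: int, base: float = 30.0,
                   factor: float = 2.0, cap: float = 600.0) -> List[float]:
    """Seconds to wait before each retry: grows by `factor`, bounded by `cap`."""
    return [min(base * factor ** attempt, cap) for attempt in range(retries)]

def run_with_retries(action: Callable[[], None], retries: int,
                     sleep: Callable[[float], None]) -> int:
    """Run `action`; on failure wait and retry. Returns the attempts used."""
    delays = backoff_delays(retries)
    for attempt in range(retries + 1):
        try:
            action()
            return attempt + 1
        except Exception:
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            sleep(delays[attempt])

# Demo: a backup that fails twice, then succeeds. Sleeps are recorded, not slept.
waited: List[float] = []
attempts = {"count": 0}

def flaky_backup() -> None:
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("simulated failure")

print(run_with_retries(flaky_backup, retries=3, sleep=waited.append))  # 3
print(waited)  # [30.0, 60.0]
```

Passing `sleep` as a parameter keeps the sketch testable; a real runner would pass `time.sleep`.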
4. Wait and notify Slack
No scheduling, just a wait that pings when the DB comes back:
--async returns immediately (you get the job ID). When the port opens, the webhook fires.
5. DAG pipeline — multi-stage
Canary deploy → smoke → rollout → notify:
If an upstream stage fails, rollout and notify become StatusFailed with dependency failed in the message.
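The way a failure travels down the pipeline can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the scheduler's implementation: the status strings and the `run_pipeline` helper are hypothetical, and the example assumes the smoke stage is the one that fails.

```python
from typing import Dict, List

# stage -> stages it depends on (the four-stage pipeline from the text)
DEPENDS_ON: Dict[str, List[str]] = {
    "canary": [],
    "smoke": ["canary"],
    "rollout": ["smoke"],
    "notify": ["rollout"],
}

def run_pipeline(results: Dict[str, bool]) -> Dict[str, str]:
    """`results` maps each stage to whether its own action would succeed."""
    status: Dict[str, str] = {}
    # The dict above is already listed in dependency order.
    for stage, deps in DEPENDS_ON.items():
        if any(status[dep].startswith("failed") for dep in deps):
            # The stage never runs; it inherits the failure from upstream.
            status[stage] = "failed: dependency failed"
        elif results[stage]:
            status[stage] = "completed"
        else:
            status[stage] = "failed"
    return status

outcome = run_pipeline(
    {"canary": True, "smoke": False, "rollout": True, "notify": True}
)
for stage, state in outcome.items():
    print(stage, "->", state)
```

Here canary completes, smoke fails on its own, and both rollout and notify are marked failed with the dependency message without ever running.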
6. Agent scheduling itself (ReAct)
You ask for something long-running; the agent decides to pause and come back:
schedule with async:true + triggers to chain; on the next conversation the result is already in context.
7. Self-limited job (budget)
Explicit rate-limit and budget to avoid LLM cost runaway:
- One retry only.
- 2m cap on the LLM call (fast-fail if the model hangs).
- CHATCLI_SCHEDULER_RATE_LIMIT_OWNER_RPS for a global cap.
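For intuition about what a requests-per-second cap does, here is a minimal token-bucket sketch in Python. Whether chatcli's rate limiter is actually a token bucket is an assumption; `TokenBucket`, its `burst` parameter, and the explicit `now` argument are hypothetical and exist only to keep the example deterministic.

```python
class TokenBucket:
    """Allow roughly `rps` operations per second, with bursts up to `burst`."""

    def __init__(self, rps: float, burst: int) -> None:
        self.rps = rps
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now: float) -> bool:
        # Refill in proportion to elapsed time, never above capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rps)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # caller should reject the enqueue as rate-limited

bucket = TokenBucket(rps=1.0, burst=2)
print(bucket.allow(now=0.0))  # True
print(bucket.allow(now=0.0))  # True
print(bucket.allow(now=0.0))  # False (burst exhausted)
print(bucket.allow(now=1.0))  # True  (one token refilled after 1s)
```

A runaway loop that enqueues jobs as fast as it can gets at most the burst plus one job per second through; everything else is rejected up front.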
8. Periodic probe with auto circuit breaker
If the wait times out, --on-timeout fire_anyway fires the action anyway, which in turn asks the agent to open the issue.
After 5 consecutive failures in 60s, the http_status circuit breaker opens for 30s — subsequent probes return OutcomeBreakerOff without hammering the downed API.
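The breaker behavior described above (5 consecutive failures within 60s, then open for 30s) can be modeled with a small Python class. This is a sketch of the general pattern using the numbers from the text, not chatcli's code; `CircuitBreaker` and its method names are hypothetical, and time is passed in explicitly so the example is deterministic.

```python
from typing import List

class CircuitBreaker:
    def __init__(self, threshold: int = 5, window: float = 60.0,
                 cooldown: float = 30.0) -> None:
        self.threshold = threshold       # consecutive failures needed to trip
        self.window = window             # seconds in which they must occur
        self.cooldown = cooldown         # seconds the breaker stays open
        self.failures: List[float] = []  # timestamps of consecutive failures
        self.open_until = 0.0

    def allow(self, now: float) -> bool:
        """False while the breaker is open; the probe should be skipped."""
        return now >= self.open_until

    def record(self, ok: bool, now: float) -> None:
        if ok:
            self.failures.clear()        # a success breaks the streak
            return
        # Drop failures that fell out of the window, then add this one.
        self.failures = [t for t in self.failures if now - t < self.window]
        self.failures.append(now)
        if len(self.failures) >= self.threshold:
            self.open_until = now + self.cooldown
            self.failures.clear()

breaker = CircuitBreaker()
for second in range(5):                  # failures at t = 0, 1, 2, 3, 4
    breaker.record(ok=False, now=float(second))

print(breaker.allow(now=5.0))    # False: opened at t=4, stays open until t=34
print(breaker.allow(now=34.0))   # True: cooldown elapsed, probing resumes
```

While `allow` returns False, a caller would skip the HTTP request entirely, which is what keeps the probe from hammering an API that is already down.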
9. Cancel in bulk by tag
There is no /jobs cancel --tag directly; the IPC + shell combination is enough.
10. Audit and troubleshooting
Each entry includes type, timestamp, job_id, status, message, and the full event payload; use it as the basis for alerts in Splunk/Datadog/Loki.
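As a starting point for alerting, the sketch below filters failed events out of an audit stream in Python. It assumes the log is JSON Lines with the fields listed above; the on-disk format, the `payload` key name, the `failed` status value, the event type strings, and the sample records are all made up for illustration.

```python
import json
from typing import Iterable, List

# Made-up sample records; a real audit log may use different values.
SAMPLE = [
    '{"type": "job.finished", "timestamp": "2024-01-01T02:00:05Z", '
    '"job_id": "a1", "status": "completed", "message": "ok", "payload": {}}',
    '{"type": "job.finished", "timestamp": "2024-01-01T02:10:00Z", '
    '"job_id": "b2", "status": "failed", "message": "dependency failed", '
    '"payload": {}}',
]

def failed_events(lines: Iterable[str]) -> List[dict]:
    """Parse JSON Lines and keep only events whose status is 'failed'."""
    events = (json.loads(line) for line in lines if line.strip())
    return [event for event in events if event.get("status") == "failed"]

for event in failed_events(SAMPLE):
    print(event["job_id"], event["message"])  # b2 dependency failed
```

In practice you would feed the function the lines of the audit file and forward the result to whichever alerting backend you use.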
Common troubleshooting
Job stays pending and never fires
Check:
- Does /config scheduler show enabled?
- Does CHATCLI_SCHEDULER_ACTION_ALLOWLIST include the action type?
- Did the rate limiter reject it? (see chatcli_scheduler_enqueue_errors_total{reason="rate_limited"})
- Is the daemon running? Check chatcli daemon status.
Wait keeps polling forever
- Did you set --timeout? (default 30m)
- Check breaker state: /config scheduler shows breakers in the daemon section; or use the metric chatcli_scheduler_breaker_state.
- Test the condition in isolation with /wait --until <cond> --async; when it fires, you know the evaluator works.
Daemon won't start ('address already in use')
Stale socket from a previous crash. start tries to clean it automatically, but may fail if another process is still holding it:
/schedule fails with ErrShellPolicyAsk or ErrShellPolicyDeny
Your shell command is classified by CoderMode at /schedule time (preflight). Two cases:
ErrShellPolicyDeny: the command (or a matching pattern) is on the denylist in ~/.chatcli/coder_policy.json. --i-know does NOT clear this rejection; the denylist is authoritative. Remove with:
Or edit the JSON directly if you prefer bulk edits.
ErrShellPolicyAsk: the command would hit an “Allow once / Allow always / Deny” prompt if run interactively. The scheduler never prompts. Three ways out:
- Pre-authorize this specific job with --i-know:
- Permanently add to the allowlist (one-liner):
  You can also run the command once through interactive /coder and choose “Allow always” on the prompt (same PolicyManager.AddRule infrastructure underneath). Both persist to ~/.chatcli/coder_policy.json.
- Convert to agent_task or slash_cmd: commands run via agent go through interactive policy at fire time.
ErrShellPolicyDeny in the log; it won’t run.
Shell action blocked — I want a full bypass (trusted CI)
For automation in an ephemeral CI container where you control the sandbox:
Both must be present: bypass_safety alone without the operator’s env var is rejected. This prevents an agent from writing bypass_safety:true in a spec and slipping past the policy without the operator having enabled it.
Next steps
Feature doc
Full architecture, invariants and plug-in patterns.
Command reference
All flags and subcommands.