ChatCLI ships a set of optimizations that keep token consumption in check during long /agent and /coder sessions. This page explains what runs out of the box and which knobs are available when the default behaviour doesn't fit your workflow.
Every optimization on this page works across all supported providers (Anthropic direct, Bedrock, OpenAI, xAI, ZAI, MiniMax, Moonshot (Kimi), Google AI, Ollama, Copilot, GitHub Models, OpenRouter, OpenAI Assistant, StackSpot). Providers with explicit prompt caching (Anthropic, Bedrock Anthropic) or automatic prefix caching (OpenAI, xAI) benefit most.
The real problem
A poorly calibrated ReAct loop can turn a trivial question into a large token bill. Without the optimizations below, a query like "who won the last Flamengo game" can easily burn 20k+ tokens because:
- The full system prompt is re-sent on every turn (uncached).
- Tool definitions (15+ JSON schemas) are also re-sent per turn.
- Nothing breaks the loop when the model repeats the same tool_call without converging.
- Large @webfetch bodies land in the context raw.
In aggregate, each turn carries 4-8k tokens of pure overhead. Over the 3-5 turns of a typical session, that is 12-40k tokens before we even count the useful answer.
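The estimate is plain multiplication. A minimal sketch of the arithmetic, using the per-turn and turn-count ranges quoted above (the function name is illustrative and not part of ChatCLI):

```python
def session_overhead(per_turn_tokens, turns):
    """Overhead tokens spent on re-sent prompt and tool schemas in one session."""
    return per_turn_tokens * turns


low = session_overhead(4_000, 3)   # cheapest case: 4k overhead, 3 turns
high = session_overhead(8_000, 5)  # worst case: 8k overhead, 5 turns
print(f"{low:,} to {high:,} overhead tokens")
```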
1. Structured prompt caching
Anthropic caching (and OpenAI/xAI prefix auto-caching) works by prefix: a breakpoint only hits when every byte before it is identical to a prior request. The golden rule is simple: stable blocks first, volatile blocks last. The system prompt is assembled exactly that way:
Stable prefix (cached, cache_control: ephemeral):
| Block | Content | Why it's stable |
|---|---|---|
| Core | Persona + format rules (Coder/Agent) + language hint | Never changes between turns |
| Tools | Plugin descriptions + workspace hint | Changes only when a plugin is loaded/unloaded |
| Orchestrator | Multi-agent orchestrator catalog | Stable within the session |
| Memory index | Compact memory digest (index mode) | Turn-independent (see section 7) |
Volatile suffix (no cache marker; changes every turn):
| Block | Content | Why it's volatile |
|---|---|---|
| Memory (full) | Hint-driven MEMORY.md retrieval (full mode) | Varies with the turn's hints |
| Skills | Query-activated skills | Depends on the question |
| MCP channel | Recent push messages from MCP servers | Updates every turn |
| Dynamic context | Date/time + current directory | Changes by definition |
Why ordering matters (a fixed defect): previously the workspace+memory block carried cache_control but held, near the top of the prompt, two volatile items: the wall-clock timestamp (to the second) and hint-driven retrieval. That guaranteed a cache miss on that block every turn and invalidated every cached block below it (/context attachments, pinned skills, MCP catalog): you paid cache creation (1.25×) each time and never earned a read. The timestamp now lives in its own trailing block, and the volatile memory has moved out of the prefix, so the genuinely stable blocks form a contiguous, cacheable prefix.
Each stable block carries cache_control: ephemeral for Anthropic-family providers (respecting the 4-breakpoint cap, with automatic coalescing). For providers with prefix auto-caching (OpenAI, xAI), the stable ordering ensures cache hits naturally. Chat follows the same ordering; being tool-less, it does not pull memory on demand (see section 7).
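To make the ordering rule concrete, here is a minimal Python sketch of how a system prompt could be assembled for an Anthropic-style request. It is not ChatCLI's actual code: `Block`, `assemble_system`, and the merge-the-leading-blocks coalescing strategy are assumptions made for illustration. Only the `cache_control: {"type": "ephemeral"}` marker and the 4-breakpoint cap come from the text above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    text: str
    stable: bool  # True for prefix blocks, False for per-turn blocks


def assemble_system(blocks, max_breakpoints=4):
    """Return system blocks with stable ones first (marked) and volatile ones last."""
    stable = [b for b in blocks if b.stable]
    volatile = [b for b in blocks if not b.stable]

    # Assumed coalescing strategy: merge the two leading stable blocks until
    # the number of cache markers fits the breakpoint cap.
    while len(stable) > max_breakpoints and len(stable) > 1:
        first, second, *rest = stable
        merged = Block(f"{first.name}+{second.name}",
                       first.text + "\n\n" + second.text, True)
        stable = [merged, *rest]

    out = [
        {"type": "text", "text": b.text, "cache_control": {"type": "ephemeral"}}
        for b in stable
    ]
    out += [{"type": "text", "text": b.text} for b in volatile]
    return out


blocks = [
    Block("core", "Persona and format rules", stable=True),
    Block("skills", "Query-activated skills", stable=False),
    Block("tools", "Plugin descriptions", stable=True),
    Block("dynamic", "Date/time and current directory", stable=False),
    Block("orchestrator", "Agent catalog", stable=True),
    Block("memory_index", "Compact memory digest", stable=True),
]
system = assemble_system(blocks)
```

Because a marker on the tools array also counts toward the same cap, a caller in this sketch would pass `max_breakpoints=3` when the tool definitions carry their own marker.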
The last tool definition sent to Anthropic also carries cache_control: ephemeral, turning the whole tools array into a cacheable prefix. In a /coder session with 15 coder tools + 2 web tools, this is ~19KB of JSON that stops being re-tokenized every turn.
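A minimal sketch of that marking step, assuming tool definitions are plain dictionaries in Anthropic's `name` / `description` / `input_schema` shape. The helper name and the two sample tools are hypothetical:

```python
import copy


def mark_tools_cacheable(tools):
    """Return a copy of the tools array with a cache marker on the last entry."""
    if not tools:
        return []
    marked = copy.deepcopy(tools)  # leave the caller's definitions untouched
    marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked


tools = [
    {"name": "read_file", "description": "Read a file",
     "input_schema": {"type": "object", "properties": {}}},
    {"name": "webfetch", "description": "Fetch a URL",
     "input_schema": {"type": "object", "properties": {}}},
]
cacheable = mark_tools_cacheable(tools)
```

Only the final entry is marked because a breakpoint covers everything before it, so one marker is enough to cache the whole array.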
Cache visibility
| Provider | Field populated in UsageInfo |
|---|---|
| Anthropic / Bedrock Anthropic | CacheReadInputTokens, CacheCreationInputTokens |
| OpenAI Chat Completions (auto-caching) | CacheReadInputTokens (from prompt_tokens_details.cached_tokens) |
| OpenAI Responses API (auto-caching) | CacheReadInputTokens (from input_tokens_details.cached_tokens) |
| OpenAI reasoning models (o-series / GPT-5) | ReasoningTokens (from *_tokens_details.reasoning_tokens); informational, already counted in CompletionTokens |
| Other providers | Not reported, but a stable prefix still benefits internal caches |
OpenAI cached tokens require no opt-in: prompt caching is automatic on gpt-4o and newer (including o-series and GPT-5), triggered when the prompt prefix is ≥1,024 tokens, with hits served in 128-token increments. For streaming Chat Completions, ChatCLI sets stream_options: {include_usage: true} so the terminal usage chunk arrives before [DONE]; for the Responses API, usage rides on the response.completed SSE event and needs no extra flag.
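The field mapping for the two OpenAI APIs can be sketched as follows. This is an illustration in Python, not ChatCLI's implementation: `normalize_openai_usage` is a hypothetical helper, and the two sample payloads contain made-up numbers. The JSON field names are the ones listed in the table above.

```python
def normalize_openai_usage(usage):
    """Map an OpenAI usage payload onto UsageInfo-style field names."""
    prompt_details = (usage.get("prompt_tokens_details")    # Chat Completions
                      or usage.get("input_tokens_details")  # Responses API
                      or {})
    completion_details = (usage.get("completion_tokens_details")
                          or usage.get("output_tokens_details")
                          or {})
    return {
        "CacheReadInputTokens": int(prompt_details.get("cached_tokens") or 0),
        "ReasoningTokens": int(completion_details.get("reasoning_tokens") or 0),
    }


# Made-up payloads; 1920 = 1024 + 7 * 128, consistent with 128-token increments.
chat_usage = {
    "prompt_tokens": 2006,
    "completion_tokens": 300,
    "prompt_tokens_details": {"cached_tokens": 1920},
}
responses_usage = {
    "input_tokens": 2006,
    "output_tokens": 300,
    "input_tokens_details": {"cached_tokens": 1920},
    "output_tokens_details": {"reasoning_tokens": 128},
}
chat_info = normalize_openai_usage(chat_usage)
responses_info = normalize_openai_usage(responses_usage)
```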
Verify real session impact via /cost: cache hits appear as a separate line. The chat envelope also shows N↑ M↓ on the right border for every provider that surfaces usage, including all OpenAI APIs.
2. Stagnation early-exit
When the model enters a reflection loop, emitting exactly the same batch of tool_calls turn after turn without new information, ChatCLI detects it and breaks the loop.
How it works
Each turn, ChatCLI computes a fingerprint of the tool_calls batch (name + normalized args, order-independent, truncated SHA-256). After three consecutive turns with the same fingerprint, the loop is stopped with a clear message.
Parameters
| Variable | Default | Description |
|---|---|---|
| CHATCLI_AGENT_EARLY_EXIT | 1 (on) | Toggle the detector. 0/false/off disables. |
| CHATCLI_AGENT_EARLY_EXIT_TURNS | 3 | Consecutive repeats required to trigger (clamped to [2, 10]). |
The fingerprint is order-independent: [read A, read B] and [read B, read A] hash to the same value, so cosmetic re-ordering doesn't fool the detector.
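The detector described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not ChatCLI's code: the JSON-based argument normalization, the 16-hex-character truncation, the reset on an empty batch, and the names `fingerprint` and `StagnationDetector` are all choices made for the example.

```python
import hashlib
import json


def fingerprint(tool_calls):
    """Order-independent, truncated SHA-256 over (name, normalized args)."""
    normalized = sorted(
        json.dumps({"name": c["name"], "args": c.get("args", {})},
                   sort_keys=True, separators=(",", ":"))
        for c in tool_calls
    )
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()
    return digest[:16]  # truncation length is an assumption


class StagnationDetector:
    def __init__(self, threshold=3):
        self.threshold = max(2, min(10, threshold))  # clamp to [2, 10]
        self.last = None
        self.repeats = 0

    def observe(self, tool_calls):
        """Record one turn; return True when the loop should be stopped."""
        if not tool_calls:
            self.last, self.repeats = None, 0
            return False
        fp = fingerprint(tool_calls)
        if fp == self.last:
            self.repeats += 1
        else:
            self.last, self.repeats = fp, 1
        return self.repeats >= self.threshold


batch = [
    {"name": "read_file", "args": {"path": "A"}},
    {"name": "read_file", "args": {"path": "B"}},
]
reordered = list(reversed(batch))
detector = StagnationDetector()
verdicts = [detector.observe(batch), detector.observe(reordered), detector.observe(batch)]
```

Sorting the serialized calls before hashing is what makes the re-ordered batch count as a repeat.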
3. Smart chat ↔ agent routing
Not every query deserves a full ReAct loop. Conversational / factual questions ("what is a mutex?", "difference between slice and array") are answered by a single chat-mode turn.
The classifier identifies trivial queries from lexical signals:
- Question leaders (what, why, how does, explain, …)
- Absence of task verbs (create, build, run, fix, …)
- No workspace references (@file, @git, paths, code extensions)
- Short length + a question mark
| CHATCLI_AGENT_SMART_ROUTE value | Behaviour |
|---|---|
| off / 0 / false / no | Fully off. /agent and /run always enter the loop. |
| hint (default) / 1 / on / true | Detect and print a short tip, but respect user intent and still run the loop. |
| auto / redirect / 2 | Auto-reroute trivial queries to chat mode. Maximum savings; can surprise on edge cases. |
/coder is never rerouted; that mode exists for structured tasks. Even seemingly trivial questions there are treated as work requests.
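A rough sketch of how these signals and modes could combine, written in Python for illustration. The word lists are only the examples given above, and the way the signals are combined (task verbs and workspace references veto; otherwise a question leader or a short question qualifies), the 12-word limit, and all function names are assumptions. The real classifier may weigh the signals differently.

```python
import re
from typing import Optional

QUESTION_LEADER = re.compile(r"^(?:what|why|how does|explain)\b")
TASK_VERBS = {"create", "build", "run", "fix"}
# @-references, slash-separated paths, and a few code file extensions.
WORKSPACE_REF = re.compile(r"@\w+|\S+/\S+|\b[\w-]+\.(?:go|py|js|ts|rs|java)\b")


def parse_route_mode(value: Optional[str]) -> str:
    """Map a CHATCLI_AGENT_SMART_ROUTE value to off / hint / auto."""
    v = (value or "").strip().lower()
    if v in {"off", "0", "false", "no"}:
        return "off"
    if v in {"auto", "redirect", "2"}:
        return "auto"
    return "hint"  # default; also covers 1 / on / true


def looks_conversational(query: str, max_words: int = 12) -> bool:
    text = query.strip().lower()
    words = re.findall(r"[a-z0-9_']+", text)
    if TASK_VERBS.intersection(words) or WORKSPACE_REF.search(text):
        return False
    short_question = len(words) <= max_words and text.endswith("?")
    return bool(QUESTION_LEADER.match(text)) or short_question


def route(command: str, query: str, mode: str) -> str:
    """Return 'chat' or 'agent' for a query issued under the given command."""
    if command == "/coder" or mode != "auto":
        return "agent"  # hint mode only prints a tip; the loop still runs
    return "chat" if looks_conversational(query) else "agent"
```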
Example
$ /agent "what is a channel in Go?"
ℹ Tip: this query looks conversational - the /agent loop was skipped.
Use /chat or just type the question to force chat mode, or /run to force agent mode.
# With CHATCLI_AGENT_SMART_ROUTE=auto, the question goes straight to chat.
# With the default (hint), you see the tip but agent mode still runs.
4. Smart auto-save in WebFetch
@webfetch is tuned to never dump giant pages into the context. See WebFetch & WebSearch for complete documentation.
| Parameter | Before | Now |
|---|---|---|
| max_length default | 50,000 chars | 20,000 chars |
| Auto-save when body > 10KB with no filter | no | yes (saved to the scratch dir, with a compact preview returned) |
Escape hatch
| Variable | Description |
|---|---|
| CHATCLI_WEBFETCH_AUTOSAVE_BYTES | Byte threshold for auto-save. Default: 10000. |
Auto-save always persists the full pre-filter body to $CHATCLI_AGENT_TMPDIR, and the return carries:
[auto-saved: response was 142318 bytes - too large to inline.
Full body is at /tmp/chatcli-agent-.../webfetch_1712....txt.
Preview below; use read_file with start/end or rerun with
filter/from_line/to_line for specific ranges.]
[first ~5000 chars of extracted text]
...(auto-truncated - full body saved to disk)
The agent typically issues a read_file against that path with the right start/end, paying only for the lines that matter.
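The decision between inlining and auto-saving can be sketched as below. This is an illustrative Python approximation, not the tool's implementation: the function names, the file naming, the fallback on an unparsable threshold, and the exact wording of the header are assumptions. The 10000-byte default, the ~5000-character preview, and the variable name come from this page.

```python
import os
import tempfile

PREVIEW_CHARS = 5_000  # mirrors the "first ~5000 chars" preview described above


def autosave_threshold():
    """Read CHATCLI_WEBFETCH_AUTOSAVE_BYTES, falling back to the documented default."""
    try:
        return int(os.environ.get("CHATCLI_WEBFETCH_AUTOSAVE_BYTES", "10000"))
    except ValueError:
        return 10_000  # assumption: ignore unparsable values


def inline_or_autosave(body, scratch_dir, has_filter=False, threshold=None):
    """Return (text_for_context, saved_path_or_None)."""
    limit = autosave_threshold() if threshold is None else threshold
    raw = body.encode("utf-8")
    if has_filter or len(raw) <= limit:
        return body, None

    fd, path = tempfile.mkstemp(prefix="webfetch_", suffix=".txt", dir=scratch_dir)
    with os.fdopen(fd, "wb") as fh:
        fh.write(raw)  # persist the full body before building the preview

    header = (f"[auto-saved: response was {len(raw)} bytes, too large to inline.\n"
              f"Full body is at {path}.]\n")
    return header + body[:PREVIEW_CHARS] + "\n...(auto-truncated)", path
```

In this sketch the context receives at most the header plus the preview, while the file on disk holds everything for later ranged reads.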
5. Slimmer system prompts
Mode-specific prompts were condensed without semantic loss: every original rule remains, only redundancy and repeated examples were removed:
| Prompt | Before | Now | Reduction |
|---|---|---|---|
| CoderSystemPrompt | ~1,647 tokens | ~1,000 tokens | ~40% |
| CoderFormatInstructions | ~560 tokens | ~390 tokens | ~30% |
| AgentFormatInstructions | ~324 tokens | ~230 tokens | ~30% |
| OrchestratorSystemPrompt | ~2,111 tokens | ~1,050 tokens | ~50% |
Because these prompts live in the core cache block, cache-enabled providers (Anthropic/Bedrock/OpenAI) see the reduction only on the first turn of a session. Providers without caching save those tokens on every turn.
6. Tool result compaction
Old tool results (file reads, search, git-diff, etc.) are progressively compressed in the history to keep the payload lean. See Tool Result Management for the full pipeline.
Defaults are conservative to protect multi-turn workflows with cross-references (large refactors, review sessions). Users who want to squeeze harder can tune:
| Variable | Default | Description |
|---|---|---|
| CHATCLI_MICROCOMPACT_TRUNCATE_TURNS | 2 | After how many turns old tool results are truncated to head+tail preview. |
| CHATCLI_MICROCOMPACT_SUMMARIZE_TURNS | 4 | After how many turns tool results are replaced by a one-line summary. |
| CHATCLI_MICROCOMPACT_HEAD_CHARS | 2000 | Head size kept during truncation. |
| CHATCLI_MICROCOMPACT_TAIL_CHARS | 500 | Tail size kept during truncation. |
| CHATCLI_MICROCOMPACT_MIN_CONTENT | 3000 | Minimum tool result size to become a compaction candidate. |
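To show how the five knobs interact, here is a small Python sketch of age-based compaction. It is an approximation under assumptions: the function name, the wording of the omission marker and the summary line, and the choice of `>=` for the turn thresholds are illustrative and not taken from ChatCLI's pipeline.

```python
def compact_tool_result(content, age_turns, *, truncate_turns=2, summarize_turns=4,
                        head_chars=2000, tail_chars=500, min_content=3000):
    """Return the form of an old tool result that stays in the history."""
    # Small or recent results are left alone.
    if len(content) < min_content or age_turns < truncate_turns:
        return content

    # Oldest results collapse to a one-line summary.
    if age_turns >= summarize_turns:
        lines = content.splitlines()
        first_line = lines[0][:80] if lines else ""
        return f"[tool result summarized: {len(content)} chars, began with {first_line!r}]"

    # In between: keep a head and a tail, drop the middle.
    omitted = len(content) - head_chars - tail_chars
    if omitted <= 0:
        return content
    tail = content[-tail_chars:] if tail_chars > 0 else ""
    return content[:head_chars] + f"\n...[{omitted} chars omitted]...\n" + tail


result = "x" * 5000
fresh = compact_tool_result(result, age_turns=0)
truncated = compact_tool_result(result, age_turns=2)
summarized = compact_tool_result(result, age_turns=4)
```

Lowering `truncate_turns` or the head/tail sizes in this sketch shrinks the history sooner, which is the trade-off the variables above expose.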
For chat/lookup sessions where speed and low tokens matter more than long-term recall, try:
export CHATCLI_MICROCOMPACT_TRUNCATE_TURNS=1
export CHATCLI_MICROCOMPACT_SUMMARIZE_TURNS=3
export CHATCLI_MICROCOMPACT_HEAD_CHARS=1200
export CHATCLI_MICROCOMPACT_TAIL_CHARS=300
export CHATCLI_MICROCOMPACT_MIN_CONTENT=2000
7. Pull-first memory (index + recall)
Pushing the whole memory into the system prompt every turn doesn't scale: the cost grows with the store size and is re-sent each turn. ChatCLI now defaults to a pull model: it injects only a stable digest and lets the agent pull detail on demand via the @memory recall tool.
Controlled by CHATCLI_MEMORY_MODE:
| Mode | Behavior | When to use |
|---|---|---|
| index (default) | Injects a compact, stable index (profile summary + top topic/project names + fact tally by category) and a directive for the agent to call @memory recall when it needs detail. | Default. Per-turn cost stays bounded as memory grows. |
| full | Injects the full hint-driven retrieval every turn (previous behavior). | When you want the agent to always see relevant memory without relying on it pulling. |
| off | Injects no memory (bootstrap still applies). | Sessions where long-term memory only gets in the way. |
The index is stable (turn-independent, no timestamp), so it lives in the cached prefix (section 1) and is size-capped regardless of store size. @memory recall uses the full retrieval stack (HyDE + vector cosine search + keyword extraction), so pulled detail matches the quality of the old push.
Measured impact
Measured against a real store of 500 facts (MEMORY.md ~32KB, fact index ~270KB):
| Per turn (agent/coder) | chars | ~tokens |
|---|---|---|
| Push (full) | 3,946 | ~986 |
| Index (index) | 486 | ~121 |
That is −87.7% on the per-turn memory block. And unlike full (capped by CHATCLI_MEMORY_RETRIEVAL_BUDGET), the index does not grow as memory grows.
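The figures in the table can be reproduced with a few lines of arithmetic. The 4-characters-per-token ratio below is an assumption; it happens to match the token estimates shown, but real tokenizers vary by model and language.

```python
def approx_tokens(chars):
    # Assumption: roughly 4 characters per token.
    return chars // 4


push_chars, index_chars = 3_946, 486
reduction_pct = round(100 * (push_chars - index_chars) / push_chars, 1)
print(approx_tokens(push_chars), approx_tokens(index_chars), reduction_pct)
```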
Chat is tool-less by design and cannot pull on demand: in chat, index automatically falls back to full, and only off suppresses memory. The active mode is shown in /config memory.
In index mode the agent/coder no longer sees the whole memory automatically; it must call @memory recall. The index gives it the "map" (what exists) so it knows what to pull. If you notice the agent missing context it should recall, switch back to CHATCLI_MEMORY_MODE=full (the section 1 cache savings still apply).
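The mode rules from this section, including the chat fallback, fit in one small function. The sketch below is illustrative: `effective_memory_mode` is a hypothetical name, and treating unrecognized values as the default is an assumption not stated on this page.

```python
from typing import Optional


def effective_memory_mode(configured: Optional[str], tool_less: bool) -> str:
    """Resolve CHATCLI_MEMORY_MODE for a session.

    tool_less is True for chat, which cannot call @memory recall.
    """
    mode = (configured or "index").strip().lower()
    if mode not in {"index", "full", "off"}:
        mode = "index"  # assumption: unknown values fall back to the default
    if tool_less and mode == "index":
        return "full"  # chat cannot pull, so the index falls back to a push
    return mode
```

For example, an unset variable resolves to `index` for /agent and /coder but to `full` for chat, while `off` suppresses memory in every mode.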
Measuring impact
Run your session normally and check /cost at the end:
Session cost summary
Provider: CLAUDEAI | Model: claude-sonnet-4-6
─────────────────────────────────────────────
Input tokens: 12,345
Output tokens: 3,210
Cache read: 87,650 ← ideal: growing per turn
Cache creation: 4,100
─────────────────────────────────────────────
Total cost: $0.0234
Signals that the optimizations are active and working:
- Cache read > 0 and growing per turn: structured caching is hitting the prefix.
- Few/no FORMAT ERROR in logs: reminders are holding format on smaller models.
- Turns with tool_calls = 0 followed by quick completion: early-exit detected convergence.
- [auto-saved: response was N bytes] markers in @webfetch results: inline budget is protecting context.
Variable summary
Every variable on this page in one place:
| Variable | Default | Disable with |
|---|---|---|
| CHATCLI_MEMORY_MODE | index | full (push) / off |
| CHATCLI_AGENT_EARLY_EXIT | 1 (on) | 0 / false / off / no |
| CHATCLI_AGENT_EARLY_EXIT_TURNS | 3 | n/a (clamped to [2, 10]) |
| CHATCLI_AGENT_SMART_ROUTE | hint | off / 0 / false / no |
| CHATCLI_WEBFETCH_AUTOSAVE_BYTES | 10000 | set a very high value |
| CHATCLI_MICROCOMPACT_TRUNCATE_TURNS | 2 | high value (e.g., 100) |
| CHATCLI_MICROCOMPACT_SUMMARIZE_TURNS | 4 | high value |
| CHATCLI_MICROCOMPACT_HEAD_CHARS | 2000 | high value |
| CHATCLI_MICROCOMPACT_TAIL_CHARS | 500 | high value |
| CHATCLI_MICROCOMPACT_MIN_CONTENT | 3000 | very high value |