Token Efficiency

ChatCLI ships a set of optimizations that keep token consumption in check during long /agent and /coder sessions. This page explains what runs out of the box and which knobs are available when the default behaviour doesn’t fit your workflow.

Every optimization on this page works across all supported providers (Anthropic direct, Bedrock, OpenAI, xAI, ZAI, MiniMax, Moonshot (Kimi), Google AI, Ollama, Copilot, GitHub Models, OpenRouter, OpenAI Responses, StackSpot). Providers with explicit prompt caching (Anthropic, Bedrock Anthropic) or automatic prefix caching (OpenAI, xAI) benefit most.

The real problem

A poorly calibrated ReAct loop can turn a trivial question into a huge token bill. Without the optimizations below, a query like “who won the last Flamengo game” can easily burn 20k+ tokens because:

The full system prompt is re-sent on every turn (uncached).
Tool definitions (15+ JSON schemas) are also re-sent per turn.
Nothing breaks the loop when the model repeats the same tool_call without converging.
Large @webfetch bodies land in the context raw.

Taken together, each turn carries 4-8k tokens of pure overhead. Over the 3-5 turns of a typical session, that adds up to 12-40k tokens before the useful answer is even counted.

1. Structured prompt caching

Anthropic caching (and OpenAI/xAI prefix auto-caching) works by prefix: a breakpoint only hits when every byte before it is identical to a prior request. The golden rule is simple: stable blocks first, volatile blocks last. The system prompt is assembled exactly that way.

Stable prefix (cached, cache_control: ephemeral):

Block	Content	Why it’s stable
Core	Persona + format rules (Coder/Agent) + language hint	Never changes between turns
Tools	Plugin descriptions + workspace hint	Changes only when a plugin is loaded/unloaded
Orchestrator	Multi-agent orchestrator catalog	Stable within the session
Memory index	Compact memory digest (`index` mode)	Turn-independent — see section 7

Volatile suffix (no cache marker — changes every turn):

Block	Content	Why it’s volatile
Memory (full)	Hint-driven MEMORY.md retrieval (`full` mode)	Varies with the turn’s hints
Skills	Query-activated skills	Depends on the question
MCP channel	Recent push messages from MCP servers	Updates every turn
Dynamic context	Date/time + current directory	Changes by definition

Why ordering matters (a fixed defect): previously the workspace+memory block carried cache_control but held the wall-clock timestamp to the second plus hint-driven retrieval — both volatile — near the top of the prompt. That guaranteed a cache miss on that block every turn and poisoned every cached block below it (/context attachments, pinned skills, MCP catalog): you paid cache creation (1.25×) each time and never earned a read. The timestamp now lives in its own trailing block, and the volatile memory left the prefix. The genuinely stable blocks form a contiguous, cacheable prefix.

Each stable block carries cache_control: ephemeral for Anthropic-family providers (respecting the 4-breakpoint cap, with automatic coalescing). For providers with prefix auto-caching (OpenAI, xAI), the stable ordering ensures cache hits naturally. Chat follows the same ordering; being tool-less, it does not pull memory on demand (see section 7).
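The ordering rule can be sketched as follows. This is a minimal illustration, not ChatCLI's actual code: the helper name `build_system_blocks` and the coalescing strategy (merging the earliest stable blocks) are assumptions. Only the `cache_control` marker shape and the 4-breakpoint limit come from the Anthropic Messages API.

```python
MAX_BREAKPOINTS = 4  # Anthropic accepts at most 4 cache_control breakpoints per request


def build_system_blocks(stable, volatile, max_breakpoints=MAX_BREAKPOINTS):
    """Return Anthropic-style system blocks: stable text first, volatile last.

    `stable` and `volatile` are lists of strings. Hypothetical helper; the
    coalescing strategy below is an assumption made for illustration.
    """
    stable = [s for s in stable if s]
    # Coalesce the earliest stable blocks until the breakpoint cap is respected.
    while len(stable) > max_breakpoints:
        stable = [stable[0] + "\n\n" + stable[1]] + stable[2:]

    blocks = [
        {"type": "text", "text": text, "cache_control": {"type": "ephemeral"}}
        for text in stable
    ]
    # Volatile blocks carry no marker and sit last, so they never change the prefix.
    blocks += [{"type": "text", "text": text} for text in volatile if text]
    return blocks
```

If the tools array also carries a marker, it consumes one of the four breakpoints, so a caller would pass a smaller `max_breakpoints` for the system blocks.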

Cache on tool definitions

The last tool definition sent to Anthropic also carries cache_control: ephemeral, turning the whole tools array into a cacheable prefix. In a /coder session with 15 coder tools + 2 web tools, this is ~19KB of JSON that stops being re-tokenized every turn.
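A minimal sketch of that marker, assuming tool definitions are plain dicts in the Anthropic `tools` format. The helper name `mark_tools_cacheable` is hypothetical.

```python
import copy


def mark_tools_cacheable(tools):
    """Return a copy of `tools` whose last entry carries an ephemeral cache marker."""
    marked = copy.deepcopy(tools)  # leave the caller's definitions untouched
    if marked:
        marked[-1]["cache_control"] = {"type": "ephemeral"}
    return marked
```

Because caching is prefix-based, the tool list also has to be emitted in the same order on every turn for the marker to produce hits.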

Cache visibility

Provider	Field populated in `UsageInfo`
Anthropic / Bedrock Anthropic	`CacheReadInputTokens`, `CacheCreationInputTokens`
OpenAI Chat Completions (auto-caching)	`CacheReadInputTokens` (from `prompt_tokens_details.cached_tokens`)
OpenAI Responses API (auto-caching)	`CacheReadInputTokens` (from `input_tokens_details.cached_tokens`)
OpenAI reasoning models (o-series / GPT-5)	`ReasoningTokens` (from `*_tokens_details.reasoning_tokens`) — informational, already counted in `CompletionTokens`
Other providers	Not reported — but a stable prefix still benefits internal caches

OpenAI cached tokens require no opt-in — prompt caching is automatic on gpt-4o and newer (including o-series and GPT-5), triggered when the prompt prefix is ≥1,024 tokens, with hits served in 128-token increments. For streaming Chat Completions, ChatCLI sets stream_options: {include_usage: true} so the terminal usage chunk arrives before [DONE]; for the Responses API, usage rides on the response.completed SSE event and needs no extra flag.
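The field mapping in the table above can be sketched like this. The `Usage` dataclass and `parse_openai_usage` are hypothetical names that loosely mirror `UsageInfo`; the JSON keys are the ones the two OpenAI APIs return in their `usage` object.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    cache_read_input_tokens: int = 0
    reasoning_tokens: int = 0  # informational, already part of completion_tokens


def parse_openai_usage(usage):
    """Normalize a Chat Completions or Responses API `usage` object (as a dict)."""
    usage = usage or {}
    if "input_tokens" in usage:  # Responses API shape
        in_details = usage.get("input_tokens_details") or {}
        out_details = usage.get("output_tokens_details") or {}
        prompt = usage.get("input_tokens", 0)
        completion = usage.get("output_tokens", 0)
    else:  # Chat Completions shape
        in_details = usage.get("prompt_tokens_details") or {}
        out_details = usage.get("completion_tokens_details") or {}
        prompt = usage.get("prompt_tokens", 0)
        completion = usage.get("completion_tokens", 0)
    return Usage(
        prompt_tokens=prompt,
        completion_tokens=completion,
        cache_read_input_tokens=in_details.get("cached_tokens") or 0,
        reasoning_tokens=out_details.get("reasoning_tokens") or 0,
    )
```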

Verify real session impact via /cost — cache hits appear as a separate line. The chat envelope also shows N↑ M↓ on the right border for every provider that surfaces usage, including all OpenAI APIs.

2. Stagnation early-exit

When the model enters a reflection loop — emitting exactly the same batch of tool_calls turn after turn without new information — ChatCLI detects it and breaks the loop.

How it works

Each turn, ChatCLI computes a fingerprint of the tool_calls batch (name + normalized args, order-independent, truncated SHA-256). When the same fingerprint appears on three consecutive turns (the default), the loop is stopped with a clear message.

Parameters

Variable	Default	Description
`CHATCLI_AGENT_EARLY_EXIT`	`1` (on)	Toggle the detector. `0`/`false`/`off` disables.
`CHATCLI_AGENT_EARLY_EXIT_TURNS`	`3`	Consecutive repeats required to trigger (clamped to [2, 10]).

The fingerprint is order-independent: [read A, read B] and [read B, read A] hash to the same value, so cosmetic re-ordering doesn’t fool the detector.
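The detector can be sketched as follows. This is an illustration under stated assumptions, not the shipped implementation: arguments are normalized as JSON with sorted keys, the digest is truncated to 16 hex characters, and the names `fingerprint` and `StagnationDetector` are hypothetical.

```python
import hashlib
import json


def fingerprint(tool_calls, digest_len=16):
    """Order-independent fingerprint of a batch of tool calls.

    Each call is a dict with "name" and "args". The digest length is an
    illustrative choice.
    """
    normalized = sorted(
        json.dumps({"name": c["name"], "args": c.get("args", {})},
                   sort_keys=True, separators=(",", ":"))
        for c in tool_calls
    )
    digest = hashlib.sha256("\n".join(normalized).encode("utf-8")).hexdigest()
    return digest[:digest_len]


class StagnationDetector:
    def __init__(self, turns=3):
        self.turns = max(2, min(10, turns))  # clamp to [2, 10]
        self.last = None
        self.repeats = 0

    def observe(self, tool_calls):
        """Record one turn; return True when the loop should be stopped."""
        if not tool_calls:
            self.last, self.repeats = None, 0
            return False
        fp = fingerprint(tool_calls)
        self.repeats = self.repeats + 1 if fp == self.last else 1
        self.last = fp
        return self.repeats >= self.turns
```

Sorting the serialized calls before hashing is what makes [read A, read B] and [read B, read A] collide on purpose.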

3. Smart chat ↔ agent routing

Not every query deserves a full ReAct loop. Conversational / factual questions (“what is a mutex?”, “difference between slice and array”) are answered by a single chat-mode turn. The classifier identifies trivial queries from lexical signals:

Question leaders (what, why, how does, explain, …)
Absence of task verbs (create, build, run, fix, …)
No workspace references (@file, @git, paths, code extensions)
Short length + a question mark
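The signals above could be combined roughly like this. It is a deliberately small sketch: the word lists contain only the examples named on this page, and the regexes, the 12-word limit, and the way signals are combined are assumptions, not the real classifier.

```python
import re

# Only the examples listed above; the real lists are longer.
LEADER = re.compile(r"(?:what|why|how does|explain)\b")
TASK_VERBS = {"create", "build", "run", "fix"}
# @mentions, slash-separated paths, or a file name with a common code extension.
WORKSPACE_REF = re.compile(
    r"@\w+|[\w.-]+/[\w./-]+|\b[\w-]+\.(?:go|py|js|ts|rs|java|md|json|ya?ml)\b"
)


def looks_conversational(query, max_words=12):
    """Heuristic: True when the query looks like a chat question, not a task."""
    text = query.strip().lower()
    words = re.findall(r"[a-z']+", text)
    # Task verbs and workspace references veto the chat route outright.
    if any(w in TASK_VERBS for w in words) or WORKSPACE_REF.search(text):
        return False
    has_leader = LEADER.match(text) is not None
    return has_leader or text.endswith("?") or len(words) <= max_words
```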

Modes

`CHATCLI_AGENT_SMART_ROUTE` value	Behaviour
`off` / `0` / `false` / `no`	Fully off. `/agent` and `/run` always enter the loop.
`hint` (default) / `1` / `on` / `true`	Detect and print a short tip, but respect user intent and still run the loop.
`auto` / `redirect` / `2`	Auto-reroute trivial queries to chat mode. Maximum savings; can surprise on edge cases.

/coder is never rerouted — that mode exists for structured tasks. Even seemingly trivial questions there are treated as work requests.
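The accepted values in the Modes table map onto three behaviours, which can be sketched as below. The function name `parse_smart_route` is hypothetical, and both the case-insensitive matching and the fallback to `hint` for unrecognized values are assumptions.

```python
def parse_smart_route(value):
    """Map a CHATCLI_AGENT_SMART_ROUTE value to 'off', 'hint', or 'auto'."""
    v = (value or "").strip().lower()
    if v in {"off", "0", "false", "no"}:
        return "off"
    if v in {"auto", "redirect", "2"}:
        return "auto"
    # Default; also covers "hint", "1", "on", "true" and (by assumption) unknown values.
    return "hint"
```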

Example

$ /agent "what is a channel in Go?"
  ℹ Tip: this query looks conversational — the /agent loop was skipped.
    Use /chat or just type the question to force chat mode, or /run to force agent mode.

# With CHATCLI_AGENT_SMART_ROUTE=auto, the question goes straight to chat.
# With the default (hint), you see the tip but agent mode still runs.

4. Smart auto-save in WebFetch

@webfetch is tuned to never dump giant pages into the context. See WebFetch & WebSearch for complete documentation.

Parameter	Before	Now
`max_length` default	50,000 chars	20,000 chars
Auto-save when body > 10KB with no filter	no	yes — saved to scratch dir with a compact preview returned

Escape hatch

Variable	Description
`CHATCLI_WEBFETCH_AUTOSAVE_BYTES`	Byte threshold for auto-save. Default: `10000`.

Auto-save always persists the full pre-filter body to $CHATCLI_AGENT_TMPDIR, and the tool result returned to the model carries:

[auto-saved: response was 142318 bytes — too large to inline.
 Full body is at /tmp/chatcli-agent-.../webfetch_1712....txt.
 Preview below; use read_file with start/end or rerun with
 filter/from_line/to_line for specific ranges.]

[first ~5000 chars of extracted text]
...(auto-truncated — full body saved to disk)

The agent typically issues a read_file against that path with the right start/end, paying only for the lines that matter.
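The decision can be sketched as follows, assuming the threshold is compared against the UTF-8 byte length of the body. `inline_or_autosave` is a hypothetical helper, the file naming is illustrative, and the marker text is a simplified version of the one shown above.

```python
import os
import tempfile
import time

PREVIEW_CHARS = 5000  # matches the "first ~5000 chars" preview shown above


def inline_or_autosave(body, has_filter=False, threshold=None, tmpdir=None):
    """Return `body` inline, or save it and return a pointer plus a preview."""
    if threshold is None:
        threshold = int(os.environ.get("CHATCLI_WEBFETCH_AUTOSAVE_BYTES", "10000"))
    raw = body.encode("utf-8")
    if has_filter or len(raw) <= threshold:
        return body  # small or already filtered: inline as-is

    tmpdir = tmpdir or os.environ.get("CHATCLI_AGENT_TMPDIR") or tempfile.gettempdir()
    os.makedirs(tmpdir, exist_ok=True)
    path = os.path.join(tmpdir, f"webfetch_{time.time_ns()}.txt")
    with open(path, "wb") as fh:
        fh.write(raw)  # persist the full body before truncating

    header = f"[auto-saved: response was {len(raw)} bytes. Full body is at {path}.]"
    footer = "...(auto-truncated, full body saved to disk)"
    return f"{header}\n\n{body[:PREVIEW_CHARS]}\n{footer}"
```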

5. Slimmer system prompts

Mode-specific prompts were condensed without semantic loss — every original rule remains, only redundancy and repeated examples were removed:

Prompt	Before	Now	Reduction
`CoderSystemPrompt`	~1,647 tokens	~1,000 tokens	~40%
`CoderFormatInstructions`	~560 tokens	~390 tokens	~30%
`AgentFormatInstructions`	~324 tokens	~230 tokens	~30%
`OrchestratorSystemPrompt`	~2,111 tokens	~1,050 tokens	~50%

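Because these prompts live in the core cache block, cache-enabled providers (Anthropic/Bedrock/OpenAI) see the full reduction mainly on the first turn of a session, when the prefix is written to the cache; later turns already read that prefix at the discounted cached rate. Providers without caching save those tokens on every turn.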

6. Proactive tool result compaction

Old tool results (file reads, search, git-diff, etc.) are progressively compressed in the history to keep the payload lean. See Tool Result Management for the full pipeline. Defaults are conservative to protect multi-turn workflows with cross-references (large refactors, review sessions). Users who want to squeeze harder can tune:

Variable	Default	Description
`CHATCLI_MICROCOMPACT_TRUNCATE_TURNS`	`2`	After how many turns old tool results are truncated to head+tail preview.
`CHATCLI_MICROCOMPACT_SUMMARIZE_TURNS`	`4`	After how many turns tool results are replaced by a one-line summary.
`CHATCLI_MICROCOMPACT_HEAD_CHARS`	`2000`	Head size kept during truncation.
`CHATCLI_MICROCOMPACT_TAIL_CHARS`	`500`	Tail size kept during truncation.
`CHATCLI_MICROCOMPACT_MIN_CONTENT`	`3000`	Minimum tool result size to become a compaction candidate.

For chat/lookup sessions where speed and low tokens matter more than long-term recall, try:

export CHATCLI_MICROCOMPACT_TRUNCATE_TURNS=1
export CHATCLI_MICROCOMPACT_SUMMARIZE_TURNS=3
export CHATCLI_MICROCOMPACT_HEAD_CHARS=1200
export CHATCLI_MICROCOMPACT_TAIL_CHARS=300
export CHATCLI_MICROCOMPACT_MIN_CONTENT=2000
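How the five knobs interact can be sketched like this. It is a simplification under stated assumptions: `compact_tool_result` is a hypothetical name, the age comparison uses `>=`, and the summary and omission markers are invented for illustration. The real pipeline is described in Tool Result Management.

```python
def compact_tool_result(content, age_turns, truncate_turns=2, summarize_turns=4,
                        head_chars=2000, tail_chars=500, min_content=3000):
    """Return the history representation of a tool result that is `age_turns` old."""
    if len(content) < min_content:
        return content  # too small to be a compaction candidate

    if age_turns >= summarize_turns:
        stripped = content.strip()
        first_line = stripped.splitlines()[0][:80] if stripped else ""
        return f"[tool result summarized: {len(content)} chars, starts with {first_line!r}]"

    if age_turns >= truncate_turns and head_chars + tail_chars < len(content):
        omitted = len(content) - head_chars - tail_chars
        tail = content[-tail_chars:] if tail_chars > 0 else ""
        return f"{content[:head_chars]}\n...[{omitted} chars omitted]...\n{tail}"

    return content  # still fresh: keep the full result
```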

7. Pull-first memory (index + recall)

Pushing the whole memory into the system prompt every turn doesn’t scale: the cost grows with the store size and is re-sent each turn. ChatCLI now defaults to a pull model: it injects only a stable digest and lets the agent pull detail on demand via the @memory recall tool. Controlled by CHATCLI_MEMORY_MODE:

Mode	Behavior	When to use
`index` (default)	Injects a compact, stable index (profile summary + top topic/project names + fact tally by category) and a directive for the agent to call `@memory recall` when it needs detail.	Default. Per-turn cost stays bounded as memory grows.
`full`	Injects the full hint-driven retrieval every turn (previous behavior).	When you want the agent to always see relevant memory without relying on it pulling.
`off`	Injects no memory (bootstrap still applies).	Sessions where long-term memory only gets in the way.

The index is stable (turn-independent, no timestamp), so it lives in the cached prefix (section 1) and is size-capped regardless of store size. @memory recall uses the full retrieval stack (HyDE + vector cosine search + keyword extraction), so pulled detail matches the quality of the old push. Proactive auto-recall (CHATCLI_MEMORY_AUTORECALL, on by default) adds a tiny hint-driven top-3-facts block per turn — but it rides in the uncached trailing region next to the wall-clock context, never in the stable digest, so the cached prefix stays byte-identical and none of the savings above are given back. See Bootstrap and Memory for details.

Measured impact

Measured against a real store of 500 facts (MEMORY.md ~32KB, fact index ~270KB):

Per turn (agent/coder)	chars	~tokens
Push (`full`)	3,946	~986
Index (`index`)	486	~121

−87.7% on the per-turn memory block — and unlike full (capped by CHATCLI_MEMORY_RETRIEVAL_BUDGET), the index does not grow as memory grows.
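The figures in the table follow from simple arithmetic, assuming the rough 4-characters-per-token heuristic that the ~tokens column appears to use:

```python
full_chars, index_chars = 3946, 486
chars_per_token = 4  # rough heuristic, consistent with the ~tokens column

full_tokens = full_chars // chars_per_token    # 986
index_tokens = index_chars // chars_per_token  # 121
saved_per_turn = full_tokens - index_tokens    # 865
reduction_pct = round(100 * (full_chars - index_chars) / full_chars, 1)  # 87.7
```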

Chat is tool-less by design and cannot pull on demand, so in chat index automatically falls back to full, and only off suppresses memory. The active mode is shown in /config memory.
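The mode resolution described here can be sketched as below. `effective_memory_mode` is a hypothetical name, and treating unrecognized values as `index` is an assumption.

```python
def effective_memory_mode(configured, chat=False):
    """Resolve CHATCLI_MEMORY_MODE to the mode actually applied for this turn."""
    mode = (configured or "index").strip().lower()
    if mode not in {"index", "full", "off"}:
        mode = "index"  # assumption: unknown values use the default
    if chat and mode == "index":
        return "full"  # chat has no tools, so it cannot call @memory recall
    return mode
```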

In index mode the agent/coder no longer sees the whole memory automatically — it must call @memory recall. The index gives it the “map” (what exists) so it knows what to pull. If you notice the agent missing context it should recall, switch back to CHATCLI_MEMORY_MODE=full (the section 1 cache savings still apply).

Measuring impact

Run your session normally and check /cost at the end:

Session cost summary
  Provider: CLAUDEAI | Model: claude-sonnet-4-6
  ─────────────────────────────────────────────
  Input tokens:       12,345
  Output tokens:       3,210
  Cache read:         87,650   ← ideal: growing per turn
  Cache creation:      4,100
  ─────────────────────────────────────────────
  Total cost: $0.0234

Signals that the optimizations are active and working:

Cache read > 0 and growing per turn → structured caching is hitting the prefix.
Few/no FORMAT ERROR in logs → reminders are holding format on smaller models.
Turns with tool_calls = 0 followed by quick completion → early-exit detected convergence.
[auto-saved: response was N bytes] markers in @webfetch results → inline budget is protecting context.
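To turn the first signal into a number, you can compute what share of the prompt side was served from cache. This sketch assumes the "Input tokens" line excludes cached tokens, as Anthropic's usage fields do; `cache_read_share` is a hypothetical helper, not part of /cost.

```python
def cache_read_share(input_tokens, cache_read, cache_creation):
    """Fraction of prompt-side tokens that were served from the cache."""
    total = input_tokens + cache_read + cache_creation
    return cache_read / total if total else 0.0


# Values from the sample /cost output above.
share = cache_read_share(12_345, 87_650, 4_100)
print(f"{share:.1%} of prompt tokens were cache reads")
```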

Variable summary

The main tuning variables from this page in one place:

Variable	Default	Disable with
`CHATCLI_MEMORY_MODE`	`index`	`full` (push) / `off`
`CHATCLI_AGENT_EARLY_EXIT`	`1` (on)	`0` / `false` / `off` / `no`
`CHATCLI_AGENT_EARLY_EXIT_TURNS`	`3`	— (clamped to [2, 10])
`CHATCLI_AGENT_SMART_ROUTE`	`hint`	`off` / `0` / `false` / `no`
`CHATCLI_WEBFETCH_AUTOSAVE_BYTES`	`10000`	set a very high value
`CHATCLI_MICROCOMPACT_TRUNCATE_TURNS`	`2`	high value (e.g., `100`)
`CHATCLI_MICROCOMPACT_SUMMARIZE_TURNS`	`4`	high value
`CHATCLI_MICROCOMPACT_HEAD_CHARS`	`2000`	high value
`CHATCLI_MICROCOMPACT_TAIL_CHARS`	`500`	high value
`CHATCLI_MICROCOMPACT_MIN_CONTENT`	`3000`	very high value

Related

Native Tool Use: how cache_control: ephemeral propagates to Anthropic/Bedrock.
Tool Result Management: the compaction pipeline in detail.
Cost Tracking: how /cost and the catalog compute real per-turn cost.
WebFetch & WebSearch: filters, save_to_file, and backend fallback.
