ChatCLI implements an automatic context recovery system that handles three common failure types in long sessions: model context window overflow ("prompt too long"), corporate proxy/gateway payload limits (413 / WAF 403 / silent EOF), and output token limits. When the API rejects a request for any of these reasons, the system applies progressively more aggressive strategies to recover the session without losing the conversation.

Context Overflow Recovery

When the API returns a "context too long" error, ChatCLI applies up to 3 recovery levels before giving up:
First attempt: halves the budget limits and repairs mismatched tool call/result pairs.
Actions:
  • Repairs tool result pairing (removes orphans, injects synthetics)
  • Reduces DefaultTurnBudgetChars and DefaultPerResultMaxChars to 50% of their original values
  • Applies budget enforcement with reduced limits
  • Truncates long assistant messages to 5,000 chars
The original limits are restored after application. Only the current history is affected by the reduction.
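The first-attempt actions can be sketched roughly as follows. This is a minimal illustration, not ChatCLI's implementation: the `Budget` and `Message` types and the `scaled` and `truncateAssistant` helpers are hypothetical names, and the 200,000 / 20,000 starting values are the normal limits quoted elsewhere on this page. Pairing repair is not modelled.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a hypothetical, simplified history entry.
type Message struct {
	Role    string
	Content string
}

// Budget groups the two limits named in the text; the struct itself is an assumption.
type Budget struct {
	TurnBudgetChars   int
	PerResultMaxChars int
}

// scaled returns a copy with both limits multiplied by ratio.
// Working on a copy leaves the original limits untouched for later turns.
func (b Budget) scaled(ratio float64) Budget {
	return Budget{
		TurnBudgetChars:   int(float64(b.TurnBudgetChars) * ratio),
		PerResultMaxChars: int(float64(b.PerResultMaxChars) * ratio),
	}
}

// truncateAssistant returns a new history in which assistant messages
// are capped at maxChars runes; other roles are copied unchanged.
func truncateAssistant(history []Message, maxChars int) []Message {
	out := make([]Message, len(history))
	for i, m := range history {
		if m.Role == "assistant" {
			if r := []rune(m.Content); len(r) > maxChars {
				m.Content = string(r[:maxChars])
			}
		}
		out[i] = m
	}
	return out
}

func main() {
	normal := Budget{TurnBudgetChars: 200000, PerResultMaxChars: 20000}
	reduced := normal.scaled(0.5)
	fmt.Println(reduced.TurnBudgetChars, reduced.PerResultMaxChars)

	history := []Message{{Role: "assistant", Content: strings.Repeat("x", 6000)}}
	fmt.Println(len(truncateAssistant(history, 5000)[0].Content))
}
```

Returning copies rather than mutating shared state is one simple way to get the "original limits are restored" behaviour described above.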

Error Detection: model overflow

The system recognizes multiple forms of overflow errors:
| Error Message | Provider |
| --- | --- |
| context length exceeded | Anthropic |
| prompt is too long | OpenAI |
| request too large | Various |
| max_tokens exceed | Various |
| input too long | Google |
| token limit | Generic |
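A detector over these patterns could look like the sketch below. The function name `isContextOverflow` is hypothetical, and case-insensitive substring matching is an assumption; the page only lists the message fragments, not how ChatCLI matches them.

```go
package main

import (
	"fmt"
	"strings"
)

// overflowPatterns holds the message fragments from the table above, lowercased.
var overflowPatterns = []string{
	"context length exceeded",
	"prompt is too long",
	"request too large",
	"max_tokens exceed",
	"input too long",
	"token limit",
}

// isContextOverflow reports whether an API error message contains
// any known overflow fragment (case-insensitive; an assumption).
func isContextOverflow(msg string) bool {
	lower := strings.ToLower(msg)
	for _, p := range overflowPatterns {
		if strings.Contains(lower, p) {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(isContextOverflow("Prompt is too long: 210000 tokens > 200000 maximum"))
	fmt.Println(isContextOverflow("401 unauthorized"))
}
```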

Corporate proxy / gateway recovery

Enterprise environments often sit behind a proxy or gateway that enforces a POST body size cap, typically 1-5 MB, completely independent of the model's context window. You can be well within Anthropic's 200K-token window (~800 KB) and still get an unexplained rejection from the proxy. Worse, many proxies don't return a clean 413: some send a WAF 403 (Cloudflare, Akamai, mod_security) or a 431 (header too large), and others simply drop the TCP connection mid-POST, which surfaces as EOF / connection reset on the client. ChatCLI detects all three patterns (413-style rejections, WAF 403s, and dropped connections) and funnels them through the same recovery flow as context overflow.

Error Detection: proxy/gateway

| Pattern detected | Example | Function |
| --- | --- | --- |
| HTTP 413 + variants | 413 Payload Too Large, request entity too large, body too large, maximum request size, 431 Request Header Fields Too Large | IsPayloadTooLargeError |
| 403 with WAF/firewall signals | 403 with firewall, waf, security policy, blocked by, cloudflare, cf-ray, mod_security, akamai, proxy denied, policy violation | IsProxyWAFRejection |
| 403 with HTML body (Bedrock/AWS SDK) | StatusCode: 403 ... deserialization failed ... invalid character '<' looking for beginning of value (SDK trying to decode an HTML proxy block page as JSON) | IsProxyWAFRejection |
| EOF / reset with large history | unexpected eof, connection reset, broken pipe, stream error and history > 500 KB | IsLikelyPayloadProblem (heuristic) |
WAF detection is conservative: a 403 without firewall signals continues to be treated as an auth error (OAuth refresh + retry). Only when a 403 carries specific proxy/WAF signals is it reclassified as a recoverable payload failure. This prevents invalidating valid OAuth credentials when the real problem is on the network layer.
The corporate Bedrock case: when the proxy/WAF intercepts the POST to Bedrock Runtime and returns an HTML block page with status 403, the AWS SDK tries to parse the body as JSON and fails with "invalid character '<' looking for beginning of value" and an empty RequestID. That pattern is an unambiguous middlebox fingerprint (a real AWS 403 returns well-formed JSON), so ChatCLI reclassifies it as a recoverable payload failure and triggers the same recovery ladder.
EOF / connection-reset detection applies a history-size threshold (500 KB) before suspecting payload. Small requests that hit EOF keep being treated as transient network failures (normal retry). Only when the history is already suspiciously large is EOF reclassified as a probable body cap.
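The classification rules described in this section can be condensed into one decision function. The sketch below is an illustration under stated assumptions: `classify`, `FailureKind`, and the signal lists are hypothetical stand-ins rather than the real `IsPayloadTooLargeError`, `IsProxyWAFRejection`, or `IsLikelyPayloadProblem` functions, and the 500 KB threshold is read as 500 × 1024 bytes.

```go
package main

import (
	"fmt"
	"strings"
)

// FailureKind is a hypothetical label for how a failed request should be handled.
type FailureKind int

const (
	Transient    FailureKind = iota // plain network retry
	AuthError                       // OAuth refresh + retry
	PayloadLimit                    // compact history, then retry
)

// largeHistoryBytes is the 500 KB threshold from the text (binary KB is an assumption).
const largeHistoryBytes = 500 * 1024

var (
	payloadSignals = []string{"payload too large", "request entity too large",
		"body too large", "maximum request size", "request header fields too large"}
	wafSignals = []string{"firewall", "waf", "security policy", "blocked by", "cloudflare",
		"cf-ray", "mod_security", "akamai", "proxy denied", "policy violation"}
	dropSignals = []string{"unexpected eof", "connection reset", "broken pipe", "stream error"}
)

func containsAny(s string, needles []string) bool {
	for _, n := range needles {
		if strings.Contains(s, n) {
			return true
		}
	}
	return false
}

// classify applies the rules in order: explicit size rejections first,
// then 403s (payload only with WAF or HTML-body signals), then dropped
// connections, which count as payload problems only for large histories.
func classify(status int, msg string, historyBytes int) FailureKind {
	m := strings.ToLower(msg)
	switch {
	case status == 413 || status == 431 || containsAny(m, payloadSignals):
		return PayloadLimit
	case status == 403:
		if containsAny(m, wafSignals) || strings.Contains(m, "invalid character '<'") {
			return PayloadLimit
		}
		return AuthError
	case containsAny(m, dropSignals) && historyBytes > largeHistoryBytes:
		return PayloadLimit
	}
	return Transient
}

func main() {
	fmt.Println(classify(403, "Forbidden: blocked by security policy", 0) == PayloadLimit)
	fmt.Println(classify(403, "invalid_grant: token expired", 0) == AuthError)
	fmt.Println(classify(0, "unexpected EOF", 600*1024) == PayloadLimit)
}
```

Checking the 403 branch before the EOF heuristic keeps the conservative behaviour: a bare 403 never reaches the payload path.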

Pre-flight check

On every agent turn, the history is measured before the request goes out. There are two paths.
With CHATCLI_MAX_PAYLOAD set: if the history crosses 85% of the cap, BudgetRatio is forced to 0.40 up front (aggressive preventive compaction). The user sees:
ℹ pre-flight: history 4.3 MB ≈ 86% of configured cap (5.0 MB) — compacting
Without a cap set: if the history crosses 2.5 MB, a one-shot warning suggests setting the env var. It fires at most once per session to avoid noise.
ℹ history 2.8 MB — if you are behind a proxy/gateway, export CHATCLI_MAX_PAYLOAD=5MB (adjust to the proxy limit)
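The two pre-flight paths amount to a small decision function. The following is a sketch only: `Preflight`, `Decision`, and `Check` are hypothetical names, the thresholds are the ones stated above, and treating 2.5 MB as 2,500,000 bytes is an assumption.

```go
package main

import "fmt"

const (
	capTriggerRatio   = 0.85      // compact once history reaches 85% of the cap
	forcedBudgetRatio = 0.40      // BudgetRatio applied when the trigger fires
	warnThreshold     = 2_500_000 // 2.5 MB (decimal megabytes assumed)
)

// Preflight carries the configured cap and the once-per-session warning flag.
type Preflight struct {
	MaxPayload int64 // 0 means CHATCLI_MAX_PAYLOAD is unset
	warned     bool
}

// Decision says what the caller should do before sending the request.
type Decision struct {
	ForceRatio float64 // 0 means leave the budget ratio alone
	Warn       bool    // true means print the one-shot hint
}

// Check measures historyBytes against the cap, or against the warning
// threshold when no cap is configured.
func (p *Preflight) Check(historyBytes int64) Decision {
	if p.MaxPayload > 0 {
		if float64(historyBytes) >= capTriggerRatio*float64(p.MaxPayload) {
			return Decision{ForceRatio: forcedBudgetRatio}
		}
		return Decision{}
	}
	if historyBytes > warnThreshold && !p.warned {
		p.warned = true // never re-trigger in the same run
		return Decision{Warn: true}
	}
	return Decision{}
}

func main() {
	capped := &Preflight{MaxPayload: 5_000_000}
	fmt.Println(capped.Check(4_300_000).ForceRatio)

	uncapped := &Preflight{}
	fmt.Println(uncapped.Check(2_800_000).Warn, uncapped.Check(2_900_000).Warn)
}
```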

Reactive auto-cap after 413

If a 413/WAF/EOF failure fires and the user hasn't configured CHATCLI_MAX_PAYLOAD, ChatCLI automatically assumes a 4 MB cap for the rest of the session, because the retry will very likely pass through the same proxy:
⚠ Recoverable failure (proxy/WAF rejection (403 + security signals)) — compacting and retrying
ℹ Assuming 4 MB payload cap — export CHATCLI_MAX_PAYLOAD (e.g. 5MB, 512KB) to adjust

System notice injected in history

After a payload-limit-triggered recovery, ChatCLI injects a user message before the retry instructing the model to prefer smaller reads going forward. This breaks the model's loop of trying to re-read the same huge file that caused the 413 in the first place:
[SYSTEM NOTICE — PAYLOAD LIMIT HIT] A proxy/gateway rejected the previous
request due to body size. History was compacted to recover. Going forward:
(1) When reading files, prefer targeted reads with line ranges
    (e.g. sed -n '100,200p' file, or read_file with offset+limit) instead
    of reading entire files.
(2) Prefer grep/ripgrep with specific patterns over full-file reads.
(3) If you previously read a large file, its full content is persisted at
    the path shown in the tool-result preview β€” re-read specific ranges
    from that file rather than repeating the original read.
(4) Summarize findings incrementally rather than accumulating raw tool output.
This hint is intentionally injected in English. Models tend to follow English instructions more faithfully, even when the user's session is in pt-BR. The message is never shown to the user; it only enters the history sent to the model.

Max Output Token Escalation

When the model stops generating because it hit the max_tokens limit, ChatCLI can automatically escalate:
| Attempt | Action |
| --- | --- |
| 1st | Doubles the current max_tokens (up to the provider's cap) |
| 2nd | Doubles again (up to the provider's cap) |
| 3rd+ | Stops escalating, returns partial content |
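The escalation schedule in the table reduces to "double, clamp to the provider cap, stop after the configured number of escalations". A minimal sketch, assuming a hypothetical helper `nextMaxTokens` and assuming that escalation also stops once the cap has already been reached:

```go
package main

import "fmt"

// nextMaxTokens returns the max_tokens value for the next attempt and
// whether an escalation happened. It doubles the current value, clamps it
// to providerCap, and refuses once maxEscalations have been used.
func nextMaxTokens(current, providerCap, used, maxEscalations int) (int, bool) {
	if used >= maxEscalations || current >= providerCap {
		return current, false // stop escalating; caller returns partial content
	}
	next := current * 2
	if next > providerCap {
		next = providerCap
	}
	return next, true
}

func main() {
	tokens, providerCap := 4096, 32000
	for used := 0; ; used++ {
		next, ok := nextMaxTokens(tokens, providerCap, used, 2)
		if !ok {
			break
		}
		fmt.Println("escalating to", next)
		tokens = next
	}
}
```

With a starting value of 4096 and two allowed escalations, the loop prints 8192 and then 16384 before stopping.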

Continuation Message

When the model is interrupted by a token limit, ChatCLI injects a continuation message:
Your response was cut off at the token limit.
Resume DIRECTLY from where you stopped -- do not repeat any content.
Continue the implementation or explanation from the exact point of interruption.
The message instructs the model to continue from where it left off, avoiding repetition of already generated content.

Configuration

| Environment Variable | Description | Default |
| --- | --- | --- |
| CHATCLI_CONTEXT_WINDOW | Global context-window override (in tokens), for any provider/model. Takes precedence over the catalog. Use it when your gateway/agent's real window differs from what ChatCLI assumes; the compaction budget derives from this value. | (auto from catalog) |
| CHATCLI_MAX_RECOVERY_ATTEMPTS | Maximum context recovery attempts | 3 |
| CHATCLI_MAX_TOKEN_ESCALATIONS | Maximum max_tokens escalations | 2 |
| CHATCLI_EMERGENCY_KEEP_MESSAGES | Messages kept in emergency truncation | 10 |
| CHATCLI_MAX_PAYLOAD | Human-friendly ceiling for POST body size (e.g. 5MB, 512KB, 2.5MB, 5=5MB). When set, the compactor respects this ceiling as an extra cap on top of the model's context window, and pre-flight forces compaction on crossing 85% of it. | (unset, no cap) |
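To make the CHATCLI_MAX_PAYLOAD formats concrete, here is one way such values could be parsed. This is a sketch, not ChatCLI's parser: `parsePayloadSize` is a hypothetical name, and reading 1 MB as 1,048,576 bytes is an assumption (the page does not say whether units are binary or decimal). A bare number is taken as megabytes, matching the "5=5MB" example.

```go
package main

import (
	"fmt"
	"math"
	"strconv"
	"strings"
)

// parsePayloadSize converts values such as "5MB", "512KB", "2.5MB" or "5"
// into a byte count. Suffixes are case-insensitive.
func parsePayloadSize(s string) (int64, error) {
	v := strings.ToUpper(strings.TrimSpace(s))
	mult := float64(1 << 20) // default unit: megabytes
	switch {
	case strings.HasSuffix(v, "KB"):
		mult = 1 << 10
		v = strings.TrimSuffix(v, "KB")
	case strings.HasSuffix(v, "MB"):
		v = strings.TrimSuffix(v, "MB")
	}
	n, err := strconv.ParseFloat(strings.TrimSpace(v), 64)
	if err != nil || n <= 0 || math.IsNaN(n) || math.IsInf(n, 0) {
		return 0, fmt.Errorf("invalid payload size %q", s)
	}
	return int64(n * mult), nil
}

func main() {
	for _, in := range []string{"5MB", "512KB", "2.5MB", "5", "abc"} {
		n, err := parsePayloadSize(in)
		fmt.Println(in, n, err)
	}
}
```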

Live feedback during compaction

Since this release, the terminal no longer "freezes" during a long compaction. HistoryCompactor emits status at each pipeline phase via SetStatusCallback:
│ 📦 Compacting history (23 msgs, 4.2 MB → target 2.9 MB)
│ 🧹 Trim: stripping reasoning/dedup (no LLM)…
│ 🧠 Summarizing old messages via LLM (may take 30-90s, ESC cancels)…
│ ✓ Summary applied (23 → 9 msgs, 4.2 MB → 1.8 MB)
Cancellation: the summarization LLM call now derives its context from the turn, so Ctrl+C / ESC propagates correctly and aborts compaction without corrupting history (it returns ctx.Err() instead of blindly falling through to emergency truncation).
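The cancellation behaviour can be illustrated with Go's standard `context` package. In the sketch below, `compact`, `summarizeFunc`, `fakeSummarize`, and `demo` are all hypothetical; only the rule "return ctx.Err() on cancellation, fall back to truncation on other failures" comes from the text.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// summarizeFunc stands in for the LLM summarization call (hypothetical signature).
type summarizeFunc func(ctx context.Context, msgs []string) (string, error)

// tail returns the last n entries of h.
func tail(h []string, n int) []string {
	if len(h) > n {
		return h[len(h)-n:]
	}
	return h
}

// compact summarizes old messages. If the turn was cancelled it returns the
// history untouched plus ctx.Err(); any other failure falls back to truncation.
func compact(ctx context.Context, history []string, keep int, summarize summarizeFunc) ([]string, error) {
	summary, err := summarize(ctx, history)
	if err != nil {
		if ctxErr := ctx.Err(); ctxErr != nil {
			return history, ctxErr // user pressed Ctrl+C / ESC
		}
		return tail(history, keep), nil // emergency truncation
	}
	return append([]string{summary}, tail(history, keep)...), nil
}

// fakeSummarize honours cancellation the way a network call would.
func fakeSummarize(ctx context.Context, msgs []string) (string, error) {
	if err := ctx.Err(); err != nil {
		return "", err
	}
	return fmt.Sprintf("summary of %d messages", len(msgs)), nil
}

// demo compacts five messages, optionally after cancelling the turn context.
func demo(cancelFirst bool) (int, error) {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	if cancelFirst {
		cancel()
	}
	out, err := compact(ctx, []string{"a", "b", "c", "d", "e"}, 2, fakeSummarize)
	return len(out), err
}

func main() {
	n, err := demo(false)
	fmt.Println(n, err) // 3 <nil>
	n, err = demo(true)
	fmt.Println(n, errors.Is(err, context.Canceled)) // 5 true
}
```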

Microcompact (pre-budget)

Before NeedsCompaction checks whether history exceeds budget, the agent loop applies ApplyMicrocompact, a pure-Go, no-LLM, no-network pass that progressively truncates/summarizes old tool results (2+ turns old → head+tail preview; 4+ turns old → one-line summary). In most cases this keeps history inside budget without triggering the (expensive) Level 2.
🗜 microcompact: 3 truncated, 2 summarized, 1.7 MB freed
Configurable via env:
| Environment Variable | Description | Default |
| --- | --- | --- |
| CHATCLI_MICROCOMPACT_TRUNCATE_TURNS | Age (in turns) at which tool results start getting truncated | 2 |
| CHATCLI_MICROCOMPACT_SUMMARIZE_TURNS | Age (in turns) at which tool results are replaced by a one-line summary | 4 |
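An age-based pass of this kind might look like the sketch below. It is not ApplyMicrocompact itself: the `ToolResult` type, the `microcompact` function, the preview size, and the summary wording are all assumptions made for illustration. Content is sliced by bytes for brevity, which could split a multi-byte character in real data.

```go
package main

import (
	"fmt"
	"strings"
)

// ToolResult is a hypothetical, simplified tool output tagged with its turn.
type ToolResult struct {
	Turn    int
	Name    string
	Content string
}

// microcompact shrinks old tool results without any LLM call:
// results at least summarizeAge turns old become a one-line summary,
// results at least truncateAge turns old keep only a head+tail preview.
// It returns a new slice and the number of bytes freed.
func microcompact(results []ToolResult, currentTurn, truncateAge, summarizeAge, previewChars int) ([]ToolResult, int) {
	out := make([]ToolResult, len(results))
	freed := 0
	for i, r := range results {
		age := currentTurn - r.Turn
		before := len(r.Content)
		switch {
		case age >= summarizeAge:
			// Only replace when the summary is actually shorter.
			if s := fmt.Sprintf("[%s: %d bytes elided]", r.Name, before); len(s) < before {
				r.Content = s
			}
		case age >= truncateAge && before > 2*previewChars:
			r.Content = r.Content[:previewChars] + "\n[...]\n" + r.Content[before-previewChars:]
		}
		freed += before - len(r.Content)
		out[i] = r
	}
	return out, freed
}

func main() {
	big := strings.Repeat("x", 1000)
	results := []ToolResult{
		{Turn: 1, Name: "read_file", Content: big},
		{Turn: 3, Name: "grep", Content: big},
		{Turn: 5, Name: "ls", Content: "short"},
	}
	out, freed := microcompact(results, 5, 2, 4, 100)
	fmt.Println(out[0].Content, len(out[1].Content), freed)
}
```

The defaults used in `main` (truncate at 2 turns, summarize at 4) match the table above.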

Aggressive Budget Ratio

At level 1, the tool result budget limits are multiplied by 0.5 (50%). This means:
| Parameter | Normal | Level 1 Recovery |
| --- | --- | --- |
| Budget per turn | 200,000 chars | 100,000 chars |
| Max per result | 20,000 chars | 10,000 chars |

Recovery Flow

API returns recoverable error (context overflow | 413 | WAF 403 | EOF w/ large history)
  │
  ├─ System notice injected into history (payload-related only) ──────┐
  │                                                                   │
  ├─ Attempt 1: Aggressive budget (50%) + pairing cleanup             │
  │   └─ Resend to API                                                │
  │       ├─ Success → continues normally                             │
  │       └─ Failure → next attempt                                   │
  │                                                                   │
  ├─ Attempt 2: Emergency truncate (system + last 10 msgs)            │
  │   └─ Resend to API                                                │
  │       ├─ Success → continues with reduced history                 │
  │       └─ Failure → next attempt                                   │
  │                                                                   │
  └─ Attempt 3: Nuclear truncate (system + last 4 msgs)               │
      └─ Resend to API                                                │
          ├─ Success → continues with minimal history                 │
          └─ Failure → error reported to user                         │
                                                                      │
  For payload-related (413/WAF/EOF): CHATCLI_MAX_PAYLOAD auto-set ◄───┘
                                      to 4MB if not configured
After nuclear truncation (level 3), the model loses almost all context from the earlier conversation: only the system prompt and the last 4 messages (2 exchanges) are kept. Use /compact proactively to avoid reaching this point.
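The three-step ladder can be sketched as a loop over progressively harsher transformations. Everything here is illustrative: `recoverWithLadder`, `keepTail`, `sendFunc`, and `errTooLarge` are hypothetical names, level 1 is reduced to a pass-through placeholder because budget enforcement is not modelled, and a real implementation would also have to keep tool call/result pairs intact when cutting history.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical, simplified history entry.
type Message struct{ Role, Content string }

// sendFunc stands in for the API call; it returns an error on rejection.
type sendFunc func([]Message) error

var errTooLarge = errors.New("payload too large")

// keepTail keeps every system message plus the last n other messages.
func keepTail(history []Message, n int) []Message {
	var system, rest []Message
	for _, m := range history {
		if m.Role == "system" {
			system = append(system, m)
		} else {
			rest = append(rest, m)
		}
	}
	if len(rest) > n {
		rest = rest[len(rest)-n:]
	}
	return append(system, rest...)
}

// recoverWithLadder tries each level in turn and returns the history that
// was accepted together with the level (1-3) that succeeded.
func recoverWithLadder(history []Message, send sendFunc, emergencyKeep int) ([]Message, int, error) {
	levels := []func([]Message) []Message{
		func(h []Message) []Message { return h }, // level 1 placeholder: 50% budget + pairing repair
		func(h []Message) []Message { return keepTail(h, emergencyKeep) },
		func(h []Message) []Message { return keepTail(h, 4) }, // nuclear
	}
	var err error
	for i, shrink := range levels {
		candidate := shrink(history)
		if err = send(candidate); err == nil {
			return candidate, i + 1, nil
		}
	}
	return history, 0, err // all levels failed: report to the user
}

func main() {
	history := []Message{{Role: "system", Content: "rules"}}
	for i := 0; i < 30; i++ {
		history = append(history, Message{Role: "user", Content: fmt.Sprint("msg ", i)})
	}
	// Pretend the proxy only accepts six messages or fewer.
	send := func(h []Message) error {
		if len(h) > 6 {
			return errTooLarge
		}
		return nil
	}
	out, level, err := recoverWithLadder(history, send, 10)
	fmt.Println(len(out), level, err)
}
```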

Interaction with Other Systems

Context recovery works in conjunction with:

Tool Result Budget

The result budget is the first line of defense. Recovery activates when the budget was not sufficient.

Microcompaction

Progressive compaction reduces context growth over time.

Conversation Control

The /compact command is the proactive way to prevent overflow.

Cost Tracking

Monitor context usage to anticipate when /compact will be needed.