ChatCLI supports an automatic failover chain between LLM providers. When the primary provider fails (rate limit, timeout, server error), the system automatically tries the next provider in the chain, transparently to the user.

How It Works

The fallback chain is an ordered list of providers. Each request traverses the list until it succeeds or all options are exhausted:
Request -> OpenAI (primary)
              | failed (rate limit)
              v
           Claude (secondary)
              | failed (timeout)
              v
           Google AI (tertiary)
              | success
              v
           Response returned to user

Configuration

# Ordered list of providers (first = highest priority)
export CHATCLI_FALLBACK_PROVIDERS="OPENAI,CLAUDEAI,GOOGLEAI,COPILOT"

# Specific model per provider (optional)
export CHATCLI_FALLBACK_MODEL_OPENAI="gpt-4o"
export CHATCLI_FALLBACK_MODEL_CLAUDEAI="claude-sonnet-4-20250514"
export CHATCLI_FALLBACK_MODEL_GOOGLEAI="gemini-2.0-flash"
export CHATCLI_FALLBACK_MODEL_COPILOT="gpt-4o"

# Retry and cooldown control
export CHATCLI_FALLBACK_MAX_RETRIES="2"        # attempts per provider
export CHATCLI_FALLBACK_COOLDOWN_BASE="30s"    # base cooldown
export CHATCLI_FALLBACK_COOLDOWN_MAX="5m"      # maximum cooldown
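Since `CHATCLI_FALLBACK_PROVIDERS` is a comma-separated, ordered list, parsing it is straightforward. The helper below is a hypothetical sketch of that step (the function name and normalization choices are assumptions, not ChatCLI's actual parser):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// parseProviders splits a CHATCLI_FALLBACK_PROVIDERS value into an ordered
// slice (first entry = highest priority). Hypothetical helper for
// illustration; trims whitespace and normalizes to upper case.
func parseProviders(raw string) []string {
	var out []string
	for _, p := range strings.Split(raw, ",") {
		if p = strings.TrimSpace(p); p != "" {
			out = append(out, strings.ToUpper(p))
		}
	}
	return out
}

func main() {
	os.Setenv("CHATCLI_FALLBACK_PROVIDERS", "OPENAI, claudeai ,GOOGLEAI")
	fmt.Println(parseProviders(os.Getenv("CHATCLI_FALLBACK_PROVIDERS")))
}
```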

Error Classification

The system automatically classifies each failure to decide the strategy:
Class             Behavior                                Examples
rate_limit        Waits for backoff, then retries         HTTP 429, "too many requests"
timeout           Retries up to maxRetries                Deadline exceeded, connection timeout
server_error      Retries up to maxRetries                HTTP 500, 502, 503
auth_error        Does not retry; advances in the chain   HTTP 401, 403, "invalid api key"
model_not_found   Does not retry; advances in the chain   HTTP 404, "model not found"
context_too_long  Does not retry; advances in the chain   "context length exceeded"
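A classifier like the one in the table can be sketched as a switch over the HTTP status and error message. This is a simplified illustration, not ChatCLI's actual code; the real implementation also inspects provider-specific error payloads:

```go
package main

import (
	"fmt"
	"strings"
)

// classify maps an HTTP status and error message to an error class,
// following the table above. Hypothetical sketch: match order and
// substrings are assumptions.
func classify(status int, msg string) string {
	msg = strings.ToLower(msg)
	switch {
	case status == 429 || strings.Contains(msg, "too many requests"):
		return "rate_limit" // retried after backoff
	case status == 401 || status == 403 || strings.Contains(msg, "invalid api key"):
		return "auth_error" // no retry: chain advances
	case status == 404 || strings.Contains(msg, "model not found"):
		return "model_not_found" // no retry: chain advances
	case strings.Contains(msg, "context length exceeded"):
		return "context_too_long" // no retry: chain advances
	case status >= 500:
		return "server_error" // retried up to maxRetries
	case strings.Contains(msg, "timeout") || strings.Contains(msg, "deadline exceeded"):
		return "timeout" // retried up to maxRetries
	default:
		return "unknown"
	}
}

func main() {
	fmt.Println(classify(429, ""))                // rate_limit
	fmt.Println(classify(401, "invalid api key")) // auth_error
}
```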

Exponential Cooldown

After consecutive failures, the provider enters cooldown with exponential backoff:
Consecutive Failures   Cooldown
1                      30s
2                      60s
3                      120s
4                      240s
5+                     300s (max)
Authentication errors receive immediate maximum cooldown (5m). A successful request clears all cooldown for the provider. Use ResetCooldowns() to clear manually (e.g., after updating credentials).

Health Monitoring

The chain tracks the state of each provider in real time:
health := chain.GetHealth()
for _, h := range health {
    fmt.Printf("Provider: %s, Available: %v, Fails: %d, Cooldown: %v\n",
        h.Name, h.Available, h.ConsecutiveFails, h.CooldownUntil)
}
Fields tracked per provider:
Field             Description
Available         Whether the provider is available for requests
ConsecutiveFails  Number of consecutive failures
LastErrorClass    Type of the last failure
CooldownUntil     When the cooldown expires
LastErrorAt       Timestamp of the last failure

Tool Use with Fallback

The fallback chain also supports SendPromptWithTools for providers that implement the ToolAwareClient interface. Providers without native tool use support are automatically skipped in the tool call chain.
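In Go, skipping providers without tool support naturally maps to a type assertion against the interface. The sketch below illustrates that pattern; the `ToolAwareClient` signature shown here is simplified and assumed, not ChatCLI's exact interface:

```go
package main

import "fmt"

// ToolAwareClient sketches the interface gate described above; the real
// ChatCLI interface carries more context (history, tool schemas, etc.).
type ToolAwareClient interface {
	SendPromptWithTools(prompt string) (string, error)
}

type basicClient struct{} // no tool support
type toolClient struct{}  // implements ToolAwareClient

func (toolClient) SendPromptWithTools(p string) (string, error) {
	return "handled with tools: " + p, nil
}

// supportsTools shows how a chain can decide, via a type assertion,
// whether a provider participates in the tool call chain.
func supportsTools(c any) bool {
	_, ok := c.(ToolAwareClient)
	return ok
}

func main() {
	fmt.Println(supportsTools(toolClient{}), supportsTools(basicClient{})) // true false
}
```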

Best Practices

Order by cost-effectiveness

Place the cheapest/fastest provider first in the chain.

Diversify providers

Mix providers from different companies for real resilience.

Configure models per provider

Use models with equivalent capabilities to maintain quality.

Monitor health

Regularly check if any provider is in persistent cooldown.
Each provider in the chain needs its own configured API key. Make sure to configure the keys for all providers listed in CHATCLI_FALLBACK_PROVIDERS.