/coder, /agent, one-shot (-p) and across the gateway channels (Telegram, WhatsApp, Slack, Discord, webhook).
How to attach
Point @file at an image: it detects the type and attaches it as vision input (not inlined as text).
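How the type detection works is not described here. As a minimal sketch, assuming detection by leading "magic" bytes, the idea can be illustrated as follows. The names `sniff_image_mime` and `classify_attachment` are hypothetical and are not part of ChatCLI.

```python
from typing import Optional

# Leading "magic" bytes for common image formats.
_SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"\xff\xd8\xff", "image/jpeg"),
    (b"GIF87a", "image/gif"),
    (b"GIF89a", "image/gif"),
]


def sniff_image_mime(data: bytes) -> Optional[str]:
    """Return the image MIME type for `data`, or None if it is not a known image."""
    for magic, mime in _SIGNATURES:
        if data.startswith(magic):
            return mime
    # WebP is a RIFF container: "RIFF", 4 size bytes, then "WEBP".
    if data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None


def classify_attachment(data: bytes) -> str:
    """Hypothetical routing: images become vision input, anything else is inlined as text."""
    return "vision" if sniff_image_mime(data) is not None else "text"
```

Sniffing the content rather than trusting the file extension keeps a mislabeled file from being sent as an image block.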
Chat is tool-less by design, but attaching an image in chat works (it’s an attachment, not a tool). To generate or edit an image, use /coder or /agent with the @image tool.
Hybrid strategy (B + A)
ChatCLI decides automatically, looking at the active model’s vision capability in the catalog:
- Vision model (GPT-4o/4.1/5.x, Claude 3+/4.x, Gemini, Kimi, GLM, Bedrock Claude…) → the image goes natively, the model truly sees the pixels. (Path B.)
- Non-vision model → describe-fallback: a vision model describes the image and the text is folded into the prompt, so a text-only model can still reason about the content. (Path A.)
- No vision model available → a clear warning and the answer continues text-only (never breaks).
@file image just works. CHATCLI_VISION_PROVIDER/CHATCLI_VISION_MODEL only override the fallback captioner (e.g. gpt-4o-mini as a cheap captioner).
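The three outcomes of the hybrid strategy can be sketched roughly as below. This is an illustration of the described behavior, not ChatCLI's implementation: `route_image`, `pick_captioner`, and the `has_vision` callback are hypothetical names, and choosing the first available vision model as captioner is an assumption.

```python
from typing import Callable, Iterable, Mapping, Optional, Tuple


def pick_captioner(
    env: Mapping[str, str],
    available_models: Iterable[str],
    has_vision: Callable[[str], bool],
) -> Optional[str]:
    """Choose the describe-fallback captioner, honoring CHATCLI_VISION_MODEL."""
    override = env.get("CHATCLI_VISION_MODEL", "").strip()
    if override:
        return override
    # Assumption: otherwise take the first available model with vision.
    for model in available_models:
        if has_vision(model):
            return model
    return None


def route_image(
    active_model: str,
    has_vision: Callable[[str], bool],
    captioner: Optional[str],
) -> Tuple[str, Optional[str]]:
    """Return (path, model that looks at the image)."""
    if has_vision(active_model):
        return ("native", active_model)   # Path B: the model sees the pixels
    if captioner is not None:
        return ("describe", captioner)    # Path A: caption folded into the prompt
    return ("text-only", None)            # warn and continue without the image
```

The active model always answers the prompt. On the describe path, the second element only names the model that produces the caption.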
Off-catalog vision models
Models fetched from a provider’s /models API may have no catalog entry. The decision is layered (CHATCLI_VISION_INPUT):
- Override CHATCLI_VISION_INPUT=native|describe|off: explicit control.
- Catalog (vision capability): authoritative for known models.
- Conservative heuristic: if the id carries an unambiguous vision marker (-vl, vl-, vision, pixtral, llava, internvl, qwen-vl, omni, multimodal), treat it as native. Those names exist only on multimodal models → near-zero false positives.
- Otherwise → describe-fallback.
claude-3-5-haiku/o3-mini), so it never sends an image block to a model that would hard-error. If you know your off-catalog model supports vision, set CHATCLI_VISION_INPUT=native.
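The layered decision can be read as a short chain of checks. The sketch below assumes the markers are matched as case-insensitive substrings of the model id, which the text does not specify. `decide_vision_input` and its `catalog_vision` parameter (True/False for catalog models, None for off-catalog ones) are hypothetical.

```python
from typing import Mapping, Optional

# Marker list taken from the heuristic described above.
VISION_MARKERS = (
    "-vl", "vl-", "vision", "pixtral", "llava",
    "internvl", "qwen-vl", "omni", "multimodal",
)


def decide_vision_input(
    model_id: str,
    catalog_vision: Optional[bool],
    env: Mapping[str, str],
) -> str:
    """Return "native", "describe", or "off" for the given model."""
    # 1. An explicit override wins; "auto" (or unset) falls through.
    override = env.get("CHATCLI_VISION_INPUT", "auto").strip().lower()
    if override in ("native", "describe", "off"):
        return override
    # 2. The catalog is authoritative for models it knows.
    if catalog_vision is not None:
        return "native" if catalog_vision else "describe"
    # 3. Conservative heuristic (assumed: substring match on the lowercased id).
    lowered = model_id.lower()
    if any(marker in lowered for marker in VISION_MARKERS):
        return "native"
    # 4. Anything else gets the describe-fallback.
    return "describe"
```

For example, an off-catalog id such as `qwen2.5-vl-72b` matches `-vl` and goes native, while `o3-mini` matches no marker and is described instead.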
Per-provider coverage (native vision)
Image serialization is done by a shared helper, across 6 dialects covering the vision-capable providers:
| Dialect | Providers |
|---|---|
| OpenAI image_url | OpenAI, xAI, Z.AI, OpenRouter, Copilot, GitHub Models, Moonshot, MiniMax, Bedrock-OpenAI |
| Anthropic blocks | Anthropic (API-key + OAuth), Bedrock-Claude, MiniMax-Anthropic |
| Gemini inline_data | Google Gemini |
| Bedrock SDK | Bedrock Converse |
| Ollama images[] | Ollama |
| Responses input_image | OpenAI Responses |
catalog.HasCapability(provider, model, "vision"). Text-only API providers (StackSpot, OpenAI Assistants) automatically fall back to describe.
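To make the six dialects concrete, here is a rough sketch of what one image looks like in each wire format. The field layouts follow the providers' public API documentation as best understood and should be checked against it. `encode_image` is a hypothetical name, not ChatCLI's shared helper, and the Ollama result is a message fragment rather than a standalone content block.

```python
import base64
from typing import Any, Dict


def encode_image(dialect: str, data: bytes, mime: str) -> Dict[str, Any]:
    """Serialize raw image bytes into the payload shape of one dialect."""
    b64 = base64.b64encode(data).decode("ascii")
    data_url = f"data:{mime};base64,{b64}"
    if dialect == "openai":      # Chat Completions content part
        return {"type": "image_url", "image_url": {"url": data_url}}
    if dialect == "anthropic":   # Messages API content block
        return {
            "type": "image",
            "source": {"type": "base64", "media_type": mime, "data": b64},
        }
    if dialect == "gemini":      # generateContent part
        return {"inline_data": {"mime_type": mime, "data": b64}}
    if dialect == "bedrock":     # Converse content block; the SDK takes raw bytes
        return {"image": {"format": mime.split("/", 1)[1], "source": {"bytes": data}}}
    if dialect == "ollama":      # field merged into the chat message itself
        return {"images": [b64]}
    if dialect == "responses":   # OpenAI Responses input item
        return {"type": "input_image", "image_url": data_url}
    raise ValueError(f"unknown dialect: {dialect}")
```

Keeping every shape behind one function means the base64 and MIME handling is written once, and each provider client only picks its dialect name.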
In the gateway (messaging channels)
- Receive an image: send a photo on Telegram/WhatsApp/etc. and the gateway downloads it, then the configured model sees it (native or describe-fallback, same logic as above). Image-only messages (no text) get a default analysis request.
- Send an image: if the agent generated/edited an image during the reply (via @image), it is attached automatically to the reply on photo-capable adapters.
| Variable | Purpose | Default |
|---|---|---|
| CHATCLI_GATEWAY_IMAGE_REPLY | Attach a generated/edited image to the reply: auto / never | auto |
| CHATCLI_GATEWAY_MAX_IMAGE_BYTES | Inbound image download cap (bytes) | 20 MB |
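As a rough illustration of how these two variables might be applied, consider the sketch below. All function names are hypothetical, the default prompt wording is invented for the example, and reading "20 MB" as 20 × 1024 × 1024 bytes is an assumption. What the gateway actually does with an oversized image is not stated here, so the sketch only reports whether the image is within the cap.

```python
from typing import Mapping, Optional, Tuple

DEFAULT_MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: "20 MB" read as 20 MiB
DEFAULT_IMAGE_PROMPT = "Describe and analyze this image."  # hypothetical wording


def max_image_bytes(env: Mapping[str, str]) -> int:
    """Read CHATCLI_GATEWAY_MAX_IMAGE_BYTES, falling back to the default."""
    raw = env.get("CHATCLI_GATEWAY_MAX_IMAGE_BYTES", "").strip()
    if raw.isdigit() and int(raw) > 0:
        return int(raw)
    return DEFAULT_MAX_IMAGE_BYTES


def prepare_inbound(
    text: Optional[str], image_size: int, env: Mapping[str, str]
) -> Tuple[bool, str]:
    """Return (image within cap?, prompt to send with the image)."""
    within_cap = image_size <= max_image_bytes(env)
    # Image-only messages get a default analysis request.
    prompt = text.strip() if text and text.strip() else DEFAULT_IMAGE_PROMPT
    return within_cap, prompt


def should_attach_reply(
    env: Mapping[str, str], image_generated: bool, adapter_supports_photos: bool
) -> bool:
    """Attach a generated image unless the mode is "never" or the adapter cannot send photos."""
    mode = env.get("CHATCLI_GATEWAY_IMAGE_REPLY", "auto").strip().lower()
    return bool(mode != "never" and image_generated and adapter_supports_photos)
```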
Configuration
| Variable | Purpose | Default |
|---|---|---|
| CHATCLI_VISION_INPUT | Mode: auto/native/describe/off | auto |
| CHATCLI_VISION_PROVIDER | describe-fallback provider | (auto) |
| CHATCLI_VISION_MODEL | describe-fallback model | (auto) |
Notes
- Images cost prompt tokens (a large image can be worth hundreds/thousands of tokens). Confirm before attaching large batches on paid models.
- The image stays in history and is re-sent on subsequent turns (standard multimodal behavior).
- To force a specific fallback captioner:
CHATCLI_VISION_PROVIDER=openai CHATCLI_VISION_MODEL=gpt-4o-mini.
Related
- Image Generation & Editing (@image) — the model creates/edits images
- Chat Gateway
- Voice Replies