Image Input (Vision)

Beyond generating images, ChatCLI lets the model see and understand images you attach — and answer based on them. It works in chat, /coder, /agent, one-shot (-p) and across the gateway channels (Telegram, WhatsApp, Slack, Discord, webhook).

How to attach

Point @file at an image — it detects the type and attaches it as vision input (not inlined as text):

@file diagram.png what does this flow do?
@file ~/Downloads/error.jpg why does this stacktrace happen?
@file screenshot.png       # image only, no text, also works

Supported formats: PNG, JPEG, GIF, WebP. Local path or an image inside a directory. Images with the wrong extension are still detected by content (MIME sniff).

Chat is tool-less by design — attaching an image in chat works (it’s an attachment, not a tool). To generate/edit an image, use /coder or /agent with the @image tool.

Hybrid strategy (B + A)

ChatCLI decides automatically, looking at the active model’s vision capability in the catalog:

Vision model (GPT-4o/4.1/5.x, Claude 3+/4.x, Gemini, Kimi, GLM, Bedrock Claude…) → the image goes natively, the model truly sees the pixels. (Path B.)
Non-vision model → describe-fallback: a vision model describes the image and the text is folded into the prompt, so a text-only model can still reason about the content. (Path A.)
No vision model available → a clear warning and the answer continues text-only (never breaks).

No variable is required — @file image just works. CHATCLI_VISION_PROVIDER/CHATCLI_VISION_MODEL only override the fallback captioner (e.g. gpt-4o-mini as a cheap captioner).

Off-catalog vision models

Models fetched from a provider’s /models API may have no catalog entry. The decision is layered (CHATCLI_VISION_INPUT):

Override CHATCLI_VISION_INPUT=native|describe|off — explicit control.
Catalog (vision capability) — authoritative for known models.
Conservative heuristic — if the id carries an unambiguous vision marker (-vl, vl-, vision, pixtral, llava, internvl, qwen-vl, omni, multimodal), treat it as native. Those names exist only on multimodal models → near-zero false positives.
Otherwise → describe-fallback.

The heuristic matches explicit markers only, never family prefixes (which have text-only exceptions like claude-3-5-haiku/o3-mini), so it never sends an image block to a model that would hard-error. Know your off-catalog model sees? CHATCLI_VISION_INPUT=native.

Per-provider coverage (native vision)

Image serialization is done by a shared helper, across 6 dialects covering the vision-capable providers:

Dialect	Providers
OpenAI `image_url`	OpenAI, xAI, Z.AI, OpenRouter, Copilot, GitHub Models, Moonshot, MiniMax, Bedrock-OpenAI
Anthropic blocks	Anthropic (API-key + OAuth), Bedrock-Claude, MiniMax-Anthropic
Gemini `inline_data`	Google Gemini
Bedrock SDK	Bedrock Converse
Ollama `images[]`	Ollama
Responses `input_image`	OpenAI Responses

The gate is generic: catalog.HasCapability(provider, model, "vision"). Text-only API providers (StackSpot, OpenAI Assistants) automatically fall back to describe.

In the gateway (messaging channels)

Receive an image — send a photo on Telegram/WhatsApp/etc. and the gateway downloads it, then the configured model sees it (native or describe-fallback, same logic above). Image-only messages (no text) get a default analysis request. Send an image — if the agent generated/edited an image during the reply (via @image), it is attached automatically to the reply on photo-capable adapters.

Variable	Purpose	Default
`CHATCLI_GATEWAY_IMAGE_REPLY`	Attach a generated/edited image to the reply: `auto` / `never`	`auto`
`CHATCLI_GATEWAY_MAX_IMAGE_BYTES`	Inbound image download cap (bytes)	20 MB

Configuration

/config integrations          # shows CHATCLI_VISION_*, CHATCLI_GATEWAY_IMAGE_REPLY etc.

Variable	Purpose	Default
`CHATCLI_VISION_INPUT`	Mode: `auto`/`native`/`describe`/`off`	`auto`
`CHATCLI_VISION_PROVIDER`	describe-fallback provider	(auto)
`CHATCLI_VISION_MODEL`	describe-fallback model	(auto)

Notes

Images cost prompt tokens (a large image can be worth hundreds/thousands of tokens). Confirm before attaching large batches on paid models.
The image stays in history and is re-sent on subsequent turns (standard multimodal behavior).
To force a specific fallback captioner: CHATCLI_VISION_PROVIDER=openai CHATCLI_VISION_MODEL=gpt-4o-mini.

Image Generation & Editing (@image) — the model creates/edits images
Chat Gateway
Voice Replies

​How to attach

​Hybrid strategy (B + A)

​Off-catalog vision models

​Per-provider coverage (native vision)

​In the gateway (messaging channels)

​Configuration

​Notes

​Related

How to attach

Hybrid strategy (B + A)

Off-catalog vision models

Per-provider coverage (native vision)

In the gateway (messaging channels)

Configuration

Notes

Related