Beyond generating images, ChatCLI lets the model see and understand images you attach — and answer based on them. It works in chat, /coder, /agent, one-shot (-p) and across the gateway channels (Telegram, WhatsApp, Slack, Discord, webhook).

How to attach

Point @file at an image — it detects the type and attaches it as vision input (not inlined as text):
@file diagram.png what does this flow do?
@file ~/Downloads/error.jpg why does this stacktrace happen?
@file screenshot.png       # image only, no text, also works
Supported formats: PNG, JPEG, GIF, WebP. The path can point to a local image file or to an image inside a directory. Images with the wrong extension are still detected by content (MIME sniff).
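Content-based detection of the four supported formats can be sketched with their well-known magic bytes. This is an illustrative Python sketch, not ChatCLI's actual implementation; the function name `sniff_image_mime` is hypothetical.

```python
from typing import Optional


def sniff_image_mime(data: bytes) -> Optional[str]:
    """Return the image MIME type inferred from leading magic bytes, or None."""
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "image/png"
    if data.startswith(b"\xff\xd8\xff"):
        return "image/jpeg"
    if data.startswith((b"GIF87a", b"GIF89a")):
        return "image/gif"
    # WebP is a RIFF container: "RIFF", a 4-byte size, then "WEBP".
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WEBP":
        return "image/webp"
    return None
```

With a check like this, a PNG saved as `error.jpg` is still identified as `image/png`, because only the file content is inspected.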
Chat is tool-less by design — attaching an image in chat works (it’s an attachment, not a tool). To generate/edit an image, use /coder or /agent with the @image tool.

Hybrid strategy (B + A)

ChatCLI decides automatically, looking at the active model’s vision capability in the catalog:
  • Vision model (GPT-4o/4.1/5.x, Claude 3+/4.x, Gemini, Kimi, GLM, Bedrock Claude…) → the image goes natively, the model truly sees the pixels. (Path B.)
  • Non-vision model → describe-fallback: a vision model describes the image and the text is folded into the prompt, so a text-only model can still reason about the content. (Path A.)
  • No vision model available → a clear warning and the answer continues text-only (never breaks).
No variable is required — @file image just works. CHATCLI_VISION_PROVIDER/CHATCLI_VISION_MODEL only override the fallback captioner (e.g. gpt-4o-mini as a cheap captioner).
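The three outcomes above, plus the captioner override, can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, and it assumes both CHATCLI_VISION_PROVIDER and CHATCLI_VISION_MODEL must be set for the override to apply, which the text does not confirm.

```python
from typing import Mapping, Optional, Tuple

Captioner = Tuple[str, str]  # (provider, model)


def pick_captioner(env: Mapping[str, str],
                   auto_choice: Optional[Captioner]) -> Optional[Captioner]:
    """Prefer the explicit override; otherwise keep the automatic choice."""
    provider = env.get("CHATCLI_VISION_PROVIDER")
    model = env.get("CHATCLI_VISION_MODEL")
    if provider and model:  # assumption: both variables are needed
        return (provider, model)
    return auto_choice


def route_image(active_model_has_vision: bool,
                captioner: Optional[Captioner]) -> str:
    """Pick the path for an attached image."""
    if active_model_has_vision:
        return "native"      # Path B: the model sees the pixels
    if captioner is not None:
        return "describe"    # Path A: a captioner's text is folded into the prompt
    return "text-only"       # warn and continue without the image
```

For example, `route_image(False, pick_captioner({"CHATCLI_VISION_PROVIDER": "openai", "CHATCLI_VISION_MODEL": "gpt-4o-mini"}, None))` selects the describe path with the cheap captioner.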

Off-catalog vision models

Models fetched from a provider’s /models API may have no catalog entry. The decision is layered (CHATCLI_VISION_INPUT):
  1. Override CHATCLI_VISION_INPUT=native|describe|off — explicit control.
  2. Catalog (vision capability) — authoritative for known models.
  3. Conservative heuristic — if the id carries an unambiguous vision marker (-vl, vl-, vision, pixtral, llava, internvl, qwen-vl, omni, multimodal), treat it as native. Those names exist only on multimodal models → near-zero false positives.
  4. Otherwise → describe-fallback.
The heuristic matches explicit markers only, never family prefixes (which have text-only exceptions like claude-3-5-haiku/o3-mini), so it avoids sending an image block to a model that would hard-error. If you know your off-catalog model supports vision, set CHATCLI_VISION_INPUT=native.
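The layered decision can be sketched as below. This is an illustrative Python sketch, not ChatCLI's code: the catalog is modelled as a plain dict from model id to a vision flag, the markers are matched as case-insensitive substrings, and the function names are hypothetical. All three are assumptions.

```python
from typing import Mapping, Optional

# Markers listed in the text; substring matching is an assumption.
VISION_MARKERS = ("-vl", "vl-", "vision", "pixtral", "llava",
                  "internvl", "qwen-vl", "omni", "multimodal")


def has_vision_marker(model_id: str) -> bool:
    lowered = model_id.lower()
    return any(marker in lowered for marker in VISION_MARKERS)


def resolve_vision_mode(model_id: str,
                        override: str = "auto",
                        catalog: Optional[Mapping[str, bool]] = None) -> str:
    # 1. An explicit override wins.
    if override in ("native", "describe", "off"):
        return override
    # 2. The catalog is authoritative for models it knows.
    catalog = catalog or {}
    if model_id in catalog:
        return "native" if catalog[model_id] else "describe"
    # 3. Conservative heuristic on explicit markers only.
    if has_vision_marker(model_id):
        return "native"
    # 4. Everything else uses the describe-fallback.
    return "describe"
```

Under these assumptions, `resolve_vision_mode("qwen2.5-vl-72b")` resolves to native through the heuristic, while `claude-3-5-haiku` and `o3-mini` carry no marker and fall through to describe.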

Per-provider coverage (native vision)

Image serialization is done by a shared helper, across 6 dialects covering the vision-capable providers:
Dialect | Providers
OpenAI image_url | OpenAI, xAI, Z.AI, OpenRouter, Copilot, GitHub Models, Moonshot, MiniMax, Bedrock-OpenAI
Anthropic blocks | Anthropic (API-key + OAuth), Bedrock-Claude, MiniMax-Anthropic
Gemini inline_data | Google Gemini
Bedrock SDK | Bedrock Converse
Ollama images[] | Ollama
Responses input_image | OpenAI Responses
The gate is generic: catalog.HasCapability(provider, model, "vision"). Text-only API providers (StackSpot, OpenAI Assistants) automatically fall back to describe.
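To make the dialects concrete, here is a hedged Python sketch of what one image looks like in each wire format. The payload shapes follow the providers' public APIs; the function name `serialize_image` and the dialect keys are hypothetical and do not reflect ChatCLI's shared helper.

```python
import base64
from typing import Any, Dict


def serialize_image(dialect: str, mime: str, data: bytes) -> Dict[str, Any]:
    """Build the image payload fragment for a given dialect (illustrative)."""
    b64 = base64.b64encode(data).decode("ascii")
    data_url = f"data:{mime};base64,{b64}"
    if dialect == "openai":
        return {"type": "image_url", "image_url": {"url": data_url}}
    if dialect == "anthropic":
        return {"type": "image",
                "source": {"type": "base64", "media_type": mime, "data": b64}}
    if dialect == "gemini":
        return {"inline_data": {"mime_type": mime, "data": b64}}
    if dialect == "responses":
        return {"type": "input_image", "image_url": data_url}
    if dialect == "bedrock":
        # Converse takes raw bytes plus a short format name such as "png".
        return {"image": {"format": mime.split("/", 1)[1],
                          "source": {"bytes": data}}}
    if dialect == "ollama":
        # Ollama carries images on the message itself, as base64 strings.
        return {"images": [b64]}
    raise ValueError(f"unknown dialect: {dialect}")
```

The first four fragments are content parts inside a message, whereas the Ollama fragment is a field merged into the message object.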

In the gateway (messaging channels)

  • Receive an image: send a photo on Telegram/WhatsApp/etc. and the gateway downloads it, then the configured model sees it (native or describe-fallback, same logic as above). Image-only messages (no text) get a default analysis request.
  • Send an image: if the agent generated/edited an image during the reply (via @image), it is attached automatically to the reply on photo-capable adapters.
Variable | Purpose | Default
CHATCLI_GATEWAY_IMAGE_REPLY | Attach a generated/edited image to the reply: auto / never | auto
CHATCLI_GATEWAY_MAX_IMAGE_BYTES | Inbound image download cap (bytes) | 20 MB
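How a gateway might read these two variables can be sketched as below. This is a Python illustration under assumptions the table does not settle: "20 MB" is read as 20 × 1024 × 1024 bytes, invalid or non-positive values fall back to the default, and any value other than never behaves like auto. The function names are hypothetical.

```python
from typing import Mapping

# Assumption: the documented "20 MB" default is interpreted as 20 MiB.
DEFAULT_MAX_IMAGE_BYTES = 20 * 1024 * 1024


def max_image_bytes(env: Mapping[str, str]) -> int:
    """Read the inbound download cap, falling back to the default."""
    raw = env.get("CHATCLI_GATEWAY_MAX_IMAGE_BYTES", "")
    try:
        value = int(raw)
    except ValueError:
        return DEFAULT_MAX_IMAGE_BYTES
    return value if value > 0 else DEFAULT_MAX_IMAGE_BYTES


def should_attach_image(env: Mapping[str, str],
                        adapter_supports_photos: bool,
                        image_generated: bool) -> bool:
    """Decide whether a generated/edited image is attached to the reply."""
    mode = env.get("CHATCLI_GATEWAY_IMAGE_REPLY", "auto").lower()
    return mode != "never" and adapter_supports_photos and image_generated
```

An inbound photo would then be rejected when its size exceeds `max_image_bytes(os.environ)`, and an outbound image is skipped on adapters that cannot send photos.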

Configuration

/config integrations          # shows CHATCLI_VISION_*, CHATCLI_GATEWAY_IMAGE_REPLY etc.
Variable | Purpose | Default
CHATCLI_VISION_INPUT | Mode: auto/native/describe/off | auto
CHATCLI_VISION_PROVIDER | describe-fallback provider | (auto)
CHATCLI_VISION_MODEL | describe-fallback model | (auto)

Notes

  • Images cost prompt tokens (a single large image can cost hundreds or thousands of tokens). Confirm before attaching large batches on paid models.
  • The image stays in history and is re-sent on subsequent turns (standard multimodal behavior).
  • To force a specific fallback captioner: CHATCLI_VISION_PROVIDER=openai CHATCLI_VISION_MODEL=gpt-4o-mini.