ChatCLI synthesizes speech through the llm/tts package, which mirrors llm/transcription: the @speak tool produces an audio file, and the gateway can reply with voice. Backend selection is local-first and keyless-first and is not tied to a single provider. It includes an embedded engine (Kokoro via sherpa-onnx) that runs offline on Linux, macOS and Windows with no API key.

Backend selection (local first)

tts.NewFromEnv picks in this order, degrading to null (disabled) when nothing is configured:
  1. CHATCLI_TTS_CMD → a local command (template with {text} and {output}). Keyless.
  2. CHATCLI_TTS_URL → an OpenAI-compatible endpoint (/audio/speech), keyless unless CHATCLI_TTS_KEY.
  3. Embedded engine (Kokoro) — used automatically when its cache is already provisioned; auto mode never triggers the download on its own (pin CHATCLI_TTS_PROVIDER=embedded to provision).
  4. Local CLI on PATH: macOS say, espeak-ng, espeak — used automatically with zero config.
  5. OPENAI_API_KEY → OpenAI TTS.
  6. GROQ_API_KEY → Groq TTS (same OpenAI shape).
  7. GOOGLEAI_API_KEY/GEMINI_API_KEY → native Gemini TTS (returns PCM, wrapped in WAV).
CHATCLI_TTS_PROVIDER pins a backend (embedded/kokoro | command | url | openai | groq | google). CHATCLI_TTS_VOICE and CHATCLI_TTS_MODEL tune voice/model.
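The selection order above can be sketched as a plain decision function. This is an illustration of the documented order only, not the body of tts.NewFromEnv: the function name pickBackend, the injected getenv/onPath helpers, the kokoroCached flag and the returned labels "local-cli:…" and "null" are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// pickBackend walks the documented auto-selection order and returns a label.
// getenv and onPath are injected so the order can be exercised without
// touching the real environment; kokoroCached stands in for "the embedded
// cache is already provisioned". All names here are hypothetical.
func pickBackend(getenv func(string) string, onPath func(string) bool, kokoroCached bool) string {
	// An explicit pin short-circuits auto-selection.
	if pinned := getenv("CHATCLI_TTS_PROVIDER"); pinned != "" {
		if pinned == "kokoro" {
			return "embedded" // documented alias for the embedded engine
		}
		return pinned
	}
	if getenv("CHATCLI_TTS_CMD") != "" {
		return "command"
	}
	if getenv("CHATCLI_TTS_URL") != "" {
		return "url"
	}
	if kokoroCached {
		return "embedded" // auto mode only reuses a cache, it never downloads
	}
	for _, bin := range []string{"say", "espeak-ng", "espeak"} {
		if onPath(bin) {
			return "local-cli:" + bin
		}
	}
	if getenv("OPENAI_API_KEY") != "" {
		return "openai"
	}
	if getenv("GROQ_API_KEY") != "" {
		return "groq"
	}
	if getenv("GOOGLEAI_API_KEY") != "" || getenv("GEMINI_API_KEY") != "" {
		return "google"
	}
	return "null" // nothing configured: TTS is disabled
}

func main() {
	onPath := func(bin string) bool {
		_, err := exec.LookPath(bin)
		return err == nil
	}
	fmt.Println("backend:", pickBackend(os.Getenv, onPath, false))
}
```

Injecting the environment lookup keeps the ordering testable: a map-backed getenv is enough to check that a cloud key never wins over a local command.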
The design is genuinely multi-provider: any server that speaks /audio/speech works once CHATCLI_TTS_URL points at it. Anthropic and most of the 14 chat providers do not expose a speech API, which is why TTS cannot be “extended to all” of them: the endpoint does not exist.
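For reference, a request in the OpenAI /audio/speech shape looks roughly like the sketch below. The JSON fields (model, input, voice, response_format) follow the public OpenAI speech API; treating CHATCLI_TTS_URL as a base URL to which /audio/speech is appended is an assumption, as are the helper names speechRequest and newSpeechRequest.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"os"
	"strings"
)

// speechRequest is the JSON body of an OpenAI-style /audio/speech call.
type speechRequest struct {
	Model          string `json:"model"`
	Input          string `json:"input"`
	Voice          string `json:"voice"`
	ResponseFormat string `json:"response_format,omitempty"`
}

// newSpeechRequest builds the POST request. The Authorization header is only
// set when a key is present, which keeps keyless local servers working.
// Assumption: baseURL is a base such as http://host/v1, not the full path.
func newSpeechRequest(baseURL, key string, body speechRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	url := strings.TrimRight(baseURL, "/") + "/audio/speech"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if key != "" {
		req.Header.Set("Authorization", "Bearer "+key)
	}
	return req, nil
}

func main() {
	base := os.Getenv("CHATCLI_TTS_URL")
	if base == "" {
		fmt.Println("CHATCLI_TTS_URL is not set; nothing to do")
		return
	}
	req, err := newSpeechRequest(base, os.Getenv("CHATCLI_TTS_KEY"), speechRequest{
		Model:          os.Getenv("CHATCLI_TTS_MODEL"),
		Input:          "Build finished",
		Voice:          os.Getenv("CHATCLI_TTS_VOICE"),
		ResponseFormat: "mp3",
	})
	if err != nil {
		fmt.Println("build request:", err)
		return
	}
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		fmt.Println("send request:", err)
		return
	}
	defer resp.Body.Close()
	audio, err := io.ReadAll(resp.Body) // the response body is the raw audio
	if err != nil {
		fmt.Println("read body:", err)
		return
	}
	fmt.Printf("status %s, %d bytes\n", resp.Status, len(audio))
}
```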

Embedded engine (Kokoro) — offline neural voice, no key

export CHATCLI_TTS_PROVIDER=embedded
On the first synthesis, ChatCLI downloads into the user cache (one time, ~150 MB):
  • the prebuilt sherpa-onnx-offline-tts CLI for your platform (~25 MB, shared build), and
  • the kokoro-int8-multi-lang-v1_0 model (~126 MB) — 53 neural voices across English, Brazilian Portuguese, Spanish, French, Hindi, Italian, Japanese and Mandarin.
No cgo, no server, no API key: the ChatCLI release binary stays untouched; synthesis runs in a child process per clip. Supported platforms: linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64.
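A quick way to check whether the current machine is on the supported list is to compare runtime.GOOS and runtime.GOARCH against it. The sketch below only encodes the list above; the helper name embeddedSupported is hypothetical and this is not necessarily how ChatCLI performs the check.

```go
package main

import (
	"fmt"
	"runtime"
)

// supported lists the platform pairs named in the documentation.
var supported = map[string]bool{
	"linux/amd64":   true,
	"linux/arm64":   true,
	"darwin/amd64":  true,
	"darwin/arm64":  true,
	"windows/amd64": true,
}

// embeddedSupported reports whether a GOOS/GOARCH pair is on that list.
func embeddedSupported(goos, goarch string) bool {
	return supported[goos+"/"+goarch]
}

func main() {
	fmt.Printf("%s/%s supported: %v\n",
		runtime.GOOS, runtime.GOARCH, embeddedSupported(runtime.GOOS, runtime.GOARCH))
}
```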

Voice routing by language

Each reply routes to the voice of its detected language — a mixed pt/en conversation answers every message with the right voice and accent:
| Variable | Role | Default |
| --- | --- | --- |
| CHATCLI_TTS_VOICE | Voice for English replies (and overall default) | bm_george (British) |
| CHATCLI_TTS_VOICE_PT | Voice for Portuguese replies | pm_alex (pt-BR) |
Other notable voices: bm_daniel, bm_fable, bm_lewis (British), pm_santa, pf_dora (pt-BR), af_*/am_* (American). An explicit voice passed to @speak always overrides routing.
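The routing rule can be sketched as a small precedence function: an explicit voice wins, then the per-language variable, then the documented default. Language detection itself is out of scope here; the sketch assumes a language tag such as "pt" or "en" is already available, and the name pickVoice is hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// pickVoice resolves the voice for one reply.
// explicit is the voice passed to @speak (may be empty); lang is an
// already-detected language tag such as "pt-BR" or "en".
func pickVoice(explicit, lang string, getenv func(string) string) string {
	if explicit != "" {
		return explicit // an explicit voice always overrides routing
	}
	if strings.HasPrefix(strings.ToLower(lang), "pt") {
		if v := getenv("CHATCLI_TTS_VOICE_PT"); v != "" {
			return v
		}
		return "pm_alex" // documented pt-BR default
	}
	// English and every other language fall back to the overall default.
	if v := getenv("CHATCLI_TTS_VOICE"); v != "" {
		return v
	}
	return "bm_george" // documented overall default
}

func main() {
	for _, lang := range []string{"en", "pt-BR"} {
		fmt.Printf("%s -> %s\n", lang, pickVoice("", lang, os.Getenv))
	}
}
```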

Provisioning — atomic and auditable

  • .part download → .tmp extraction → atomic rename → a versioned marker written last: a crash mid-provision never leaves a half-built cache that masquerades as installed.
  • Tar extraction is confined (no path traversal, no escaping symlinks, decompression-bomb ceiling) and the cache scan runs inside an os.Root (kernel-enforced confinement).
  • CHATCLI_TTS_CACHE_DIR relocates the cache (must be an absolute path) — useful for shared caches and air-gapped pre-seeding.
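The marker-last idea in the first bullet can be illustrated with a minimal sketch. It is not ChatCLI's provisioning code: the file names (bundle-<version>, installed-<version>), the injected download and extract callbacks, and the demo helper are all hypothetical, and the confined tar extraction is left out entirely.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

func markerPath(cacheDir, version string) string {
	return filepath.Join(cacheDir, "installed-"+version)
}

// installed trusts only the marker, which is written as the very last step.
func installed(cacheDir, version string) bool {
	_, err := os.Stat(markerPath(cacheDir, version))
	return err == nil
}

// provision runs: .part download -> .tmp extraction -> rename -> marker.
func provision(cacheDir, version string, download func(io.Writer) error, extract func(archive, dir string) error) error {
	if !filepath.IsAbs(cacheDir) {
		return errors.New("cache dir must be an absolute path")
	}
	if installed(cacheDir, version) {
		return nil
	}
	if err := os.MkdirAll(cacheDir, 0o755); err != nil {
		return err
	}
	part := filepath.Join(cacheDir, "bundle.part")
	tmp := filepath.Join(cacheDir, "bundle.tmp")
	final := filepath.Join(cacheDir, "bundle-"+version)

	f, err := os.Create(part) // 1. download into a .part file
	if err != nil {
		return err
	}
	if err := download(f); err != nil {
		f.Close()
		return err
	}
	if err := f.Close(); err != nil {
		return err
	}
	if err := os.RemoveAll(tmp); err != nil { // 2. extract into a fresh .tmp dir
		return err
	}
	if err := os.MkdirAll(tmp, 0o755); err != nil {
		return err
	}
	if err := extract(part, tmp); err != nil {
		return err
	}
	if err := os.RemoveAll(final); err != nil { // leftovers without a marker
		return err
	}
	if err := os.Rename(tmp, final); err != nil { // 3. swap into place
		return err
	}
	os.Remove(part) // best-effort cleanup
	// 4. Only now does the cache count as installed.
	return os.WriteFile(markerPath(cacheDir, version), []byte(version+"\n"), 0o644)
}

// demo provisions into a throwaway directory twice: first with an extractor
// that fails (a simulated crash), then with one that succeeds.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "tts-cache-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	if dir, err = filepath.Abs(dir); err != nil {
		return false, false, err
	}
	download := func(w io.Writer) error { _, err := io.WriteString(w, "fake archive"); return err }
	broken := func(archive, dir string) error { return errors.New("simulated crash") }
	working := func(archive, dir string) error {
		return os.WriteFile(filepath.Join(dir, "model.onnx"), []byte("fake"), 0o644)
	}
	_ = provision(dir, "v1", download, broken)
	afterFailure := installed(dir, "v1")
	if err := provision(dir, "v1", download, working); err != nil {
		return afterFailure, false, err
	}
	return afterFailure, installed(dir, "v1"), nil
}

func main() {
	afterFailure, afterSuccess, err := demo()
	fmt.Println("installed after failed run:", afterFailure)
	fmt.Println("installed after good run:", afterSuccess, "error:", err)
}
```

Because readers of the cache consult only the marker, a failure at any earlier step leaves at most stray .part or .tmp entries, which the next run overwrites.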
With ffmpeg on PATH, clips come out as OGG/Opus (native Telegram voice note). Without ffmpeg, output degrades to WAV — the reply never gets lost.
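The fallback amounts to a PATH lookup followed by an optional transcode. The sketch below shows one way to express it; the flags are standard ffmpeg options for encoding with libopus (which requires an ffmpeg build that includes that encoder), but the exact command ChatCLI runs is not documented here, and chooseFormat and opusArgs are hypothetical names.

```go
package main

import (
	"fmt"
	"os/exec"
)

// chooseFormat picks the output container based on ffmpeg availability.
func chooseFormat(haveFFmpeg bool) string {
	if haveFFmpeg {
		return "ogg" // OGG/Opus, playable as a Telegram voice note
	}
	return "wav" // degraded but still delivered
}

// opusArgs builds an ffmpeg argument list that re-encodes in to OGG/Opus.
// -y overwrites the output file; -c:a selects the audio encoder.
func opusArgs(in, out string) []string {
	return []string{"-y", "-i", in, "-c:a", "libopus", out}
}

func main() {
	_, err := exec.LookPath("ffmpeg")
	format := chooseFormat(err == nil)
	fmt.Println("output format:", format)
	if format == "ogg" {
		fmt.Println("would run: ffmpeg", opusArgs("clip.wav", "clip.ogg"))
	}
}
```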

The @speak tool

<tool_call name="@speak" args='{"cmd":"say","args":{"text":"Build finished","format":"mp3"}}' />
<tool_call name="@speak" args='{"cmd":"status"}' />
| Subcommand | Role |
| --- | --- |
| say {text, voice?, format?, out?} | Synthesize to a file (formats: mp3/wav/opus/ogg/aac/flac); out is optional |
| status | Show the effective backend |
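When a program generates these calls, the args payload can be built and validated before it is embedded in the tool call. The sketch below produces the JSON for say and rejects formats outside the documented list; the type and function names are hypothetical, and how ChatCLI itself parses the payload is not shown here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// sayArgs mirrors the documented say parameters; only text is required.
type sayArgs struct {
	Text   string `json:"text"`
	Voice  string `json:"voice,omitempty"`
	Format string `json:"format,omitempty"`
	Out    string `json:"out,omitempty"`
}

// speakCall is the payload carried in the args attribute of the tool call.
type speakCall struct {
	Cmd  string   `json:"cmd"`
	Args *sayArgs `json:"args,omitempty"`
}

// formats is the documented set of output formats.
var formats = map[string]bool{
	"mp3": true, "wav": true, "opus": true, "ogg": true, "aac": true, "flac": true,
}

// sayJSON validates the arguments and returns the JSON payload for say.
func sayJSON(a sayArgs) (string, error) {
	if a.Text == "" {
		return "", fmt.Errorf("text is required")
	}
	if a.Format != "" && !formats[a.Format] {
		return "", fmt.Errorf("unsupported format %q", a.Format)
	}
	b, err := json.Marshal(speakCall{Cmd: "say", Args: &a})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	payload, err := sayJSON(sayArgs{Text: "Build finished", Format: "mp3"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(payload)
}
```

Text containing a single quote would need extra escaping before being placed inside the single-quoted args attribute; the sketch stops at the JSON payload for that reason.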

Voice replies over the gateway

The gateway replies in kind by default (CHATCLI_GATEWAY_VOICE_REPLY=auto): send audio, get audio back; send text, get text. Users also toggle it by asking in the conversation itself (“answer me in audio” / “stop sending audio”) via the @voice tool. The Gateway Voice Replies page covers everything: the auto|always|never modes, the per-conversation toggle, speech-aware writing (no emoji/markdown in the audio) and the voice-note transcode. See the live state in /config (Integrations · Gateway · Voice reply (TTS)).
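The three modes reduce to a small decision, sketched below. It covers only the auto|always|never setting; the per-conversation toggle and its precedence are described on the Gateway Voice Replies page and are deliberately left out. The function name replyWithVoice is hypothetical, and treating an unrecognized value as auto is an assumption.

```go
package main

import (
	"fmt"
	"os"
)

// replyWithVoice decides whether a reply should be sent as audio.
// mode is the value of CHATCLI_GATEWAY_VOICE_REPLY; incomingIsAudio says
// whether the user's message arrived as audio.
func replyWithVoice(mode string, incomingIsAudio bool) bool {
	switch mode {
	case "always":
		return true
	case "never":
		return false
	default:
		// "auto", and (by assumption) empty or unknown values:
		// reply in kind.
		return incomingIsAudio
	}
}

func main() {
	mode := os.Getenv("CHATCLI_GATEWAY_VOICE_REPLY")
	fmt.Println("audio in  -> voice reply:", replyWithVoice(mode, true))
	fmt.Println("text in   -> voice reply:", replyWithVoice(mode, false))
}
```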