ChatCLI synthesizes speech through the llm/tts package, which mirrors llm/transcription: the @speak tool produces an audio file, and the gateway can reply with voice. The design is local-first and keyless-first, and it is not tied to one provider. It includes an embedded engine (Kokoro via sherpa-onnx) that runs offline on Linux, macOS and Windows with no API key.
Backend selection (local first)
tts.NewFromEnv picks in this order, degrading to null (disabled) when nothing is configured:
- CHATCLI_TTS_CMD → a local command (template with {text} and {output}). Keyless.
- CHATCLI_TTS_URL → an OpenAI-compatible endpoint (/audio/speech), keyless unless CHATCLI_TTS_KEY is set.
- Embedded engine (Kokoro): used automatically when its cache is already provisioned; auto mode never triggers the download on its own (pin CHATCLI_TTS_PROVIDER=embedded to provision).
- Local CLI on PATH (macOS say, espeak-ng, espeak): used automatically with zero config.
- OPENAI_API_KEY → OpenAI TTS.
- GROQ_API_KEY → Groq TTS (same OpenAI shape).
- GOOGLEAI_API_KEY/GEMINI_API_KEY → native Gemini TTS (returns PCM, wrapped in WAV).
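The Gemini entry says raw PCM is wrapped in WAV. A minimal sketch of that wrapping follows. The function name `wrapPCM` is hypothetical, and the sample rate, channel count and bit depth used in `main` are assumptions for illustration, not values stated here.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// wrapPCM prepends a canonical 44-byte WAV header to raw little-endian PCM.
// The caller must pass the parameters that actually describe the PCM stream.
func wrapPCM(pcm []byte, sampleRate, channels, bitsPerSample int) []byte {
	blockAlign := channels * bitsPerSample / 8
	byteRate := sampleRate * blockAlign

	var buf bytes.Buffer
	le := binary.LittleEndian
	buf.WriteString("RIFF")
	binary.Write(&buf, le, uint32(36+len(pcm))) // size of everything after this field
	buf.WriteString("WAVE")
	buf.WriteString("fmt ")
	binary.Write(&buf, le, uint32(16)) // fmt chunk size for plain PCM
	binary.Write(&buf, le, uint16(1))  // audio format 1 = PCM
	binary.Write(&buf, le, uint16(channels))
	binary.Write(&buf, le, uint32(sampleRate))
	binary.Write(&buf, le, uint32(byteRate))
	binary.Write(&buf, le, uint16(blockAlign))
	binary.Write(&buf, le, uint16(bitsPerSample))
	buf.WriteString("data")
	binary.Write(&buf, le, uint32(len(pcm)))
	buf.Write(pcm)
	return buf.Bytes()
}

func main() {
	// Assumed stream parameters: 24 kHz, mono, 16-bit.
	wav := wrapPCM(make([]byte, 4800), 24000, 1, 16)
	fmt.Println(len(wav)) // 4844: 44 header bytes plus the payload
}
```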
CHATCLI_TTS_PROVIDER pins a backend (embedded/kokoro | command | url | openai | groq | google). CHATCLI_TTS_VOICE and CHATCLI_TTS_MODEL tune voice/model.
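The selection order and the pin can be sketched as a single function. This is an illustration of the documented order, not the actual tts.NewFromEnv code: `pickBackend`, the `probe` struct and the returned labels `local-cli` and `null` are hypothetical, and how the real selector treats an unrecognized pin is not covered.

```go
package main

import "fmt"

// probe holds facts a real selector would discover at runtime.
// Both fields are hypothetical stand-ins for this sketch.
type probe struct {
	embeddedCached bool   // the Kokoro cache is already provisioned
	localCLI       string // first of say/espeak-ng/espeak found on PATH, or ""
}

// pickBackend follows the documented order. getenv is injected so the
// function can be exercised without touching the real environment.
func pickBackend(getenv func(string) string, p probe) string {
	// A pin short-circuits auto selection; "kokoro" is an alias for "embedded".
	if pin := getenv("CHATCLI_TTS_PROVIDER"); pin != "" {
		if pin == "kokoro" {
			return "embedded"
		}
		return pin
	}
	switch {
	case getenv("CHATCLI_TTS_CMD") != "":
		return "command"
	case getenv("CHATCLI_TTS_URL") != "":
		return "url"
	case p.embeddedCached:
		return "embedded"
	case p.localCLI != "":
		return "local-cli"
	case getenv("OPENAI_API_KEY") != "":
		return "openai"
	case getenv("GROQ_API_KEY") != "":
		return "groq"
	case getenv("GOOGLEAI_API_KEY") != "" || getenv("GEMINI_API_KEY") != "":
		return "google"
	}
	return "null" // nothing configured: TTS is disabled
}

func main() {
	env := map[string]string{"GROQ_API_KEY": "example"}
	getenv := func(k string) string { return env[k] }
	fmt.Println(pickBackend(getenv, probe{}))                      // groq
	fmt.Println(pickBackend(getenv, probe{localCLI: "espeak-ng"})) // local-cli
}
```

Injecting `getenv` rather than calling `os.Getenv` directly keeps the ordering logic testable in isolation.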
The design is genuinely multi-provider: any server that speaks /audio/speech works once CHATCLI_TTS_URL points at it. Anthropic and most of the 14 chat providers do not expose a speech API, which is why TTS cannot be “extended to all” of them: the endpoint doesn’t exist.
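For readers wiring up their own server, here is a rough sketch of what a request to an OpenAI-compatible /audio/speech endpoint looks like. The JSON fields follow the OpenAI speech API shape. The helper `newSpeechRequest`, the localhost URL, and treating the configured URL as a base to which /audio/speech is appended are assumptions of this sketch, not confirmed ChatCLI behavior.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// speechRequest is the JSON body of an OpenAI-style speech request.
type speechRequest struct {
	Model          string `json:"model"`
	Input          string `json:"input"`
	Voice          string `json:"voice"`
	ResponseFormat string `json:"response_format,omitempty"`
}

// newSpeechRequest builds the POST request. The Authorization header is
// only added when a key is supplied, so keyless local servers work as-is.
func newSpeechRequest(baseURL, key string, body speechRequest) (*http.Request, error) {
	payload, err := json.Marshal(body)
	if err != nil {
		return nil, err
	}
	url := strings.TrimRight(baseURL, "/") + "/audio/speech"
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(payload))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if key != "" {
		req.Header.Set("Authorization", "Bearer "+key)
	}
	return req, nil
}

func main() {
	// Hypothetical local server; model and voice names depend on that server.
	req, err := newSpeechRequest("http://localhost:8880/v1", "", speechRequest{
		Model: "tts-1", Input: "Build finished", Voice: "alloy", ResponseFormat: "mp3",
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.String())
}
```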
Embedded engine (Kokoro) — offline neural voice, no key
export CHATCLI_TTS_PROVIDER=embedded
On the first synthesis, ChatCLI downloads into the user cache (one time, ~150 MB):
- the prebuilt sherpa-onnx-offline-tts CLI for your platform (~25 MB, shared build), and
- the kokoro-int8-multi-lang-v1_0 model (~126 MB), with 53 neural voices across English, Brazilian Portuguese, Spanish, French, Hindi, Italian, Japanese and Mandarin.
No cgo, no server, no API key: the ChatCLI release binary stays untouched; synthesis runs in a child process per clip. Supported platforms: linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64.
Voice routing by language
Each reply routes to the voice of its detected language — a mixed pt/en conversation answers every message with the right voice and accent:
| Variable | Role | Default |
|---|---|---|
| CHATCLI_TTS_VOICE | Voice for English replies (and overall default) | bm_george (British) |
| CHATCLI_TTS_VOICE_PT | Voice for Portuguese replies | pm_alex (pt-BR) |
Other notable voices: bm_daniel, bm_fable, bm_lewis (British), pm_santa, pf_dora (pt-BR), af_*/am_* (American). An explicit voice passed to @speak always overrides routing.
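The routing rule can be summarized in a few lines. This is a sketch under assumptions: `voiceFor` is a hypothetical name, the language is taken as an already-detected code such as "pt" or "en", and how detection itself works is not shown. The defaults come from the table above.

```go
package main

import (
	"fmt"
	"strings"
)

// voiceFor picks a voice: an explicit voice wins, then the Portuguese
// voice for Portuguese text, then the overall default.
func voiceFor(explicit, lang string, getenv func(string) string) string {
	if explicit != "" {
		return explicit
	}
	if strings.HasPrefix(strings.ToLower(lang), "pt") {
		if v := getenv("CHATCLI_TTS_VOICE_PT"); v != "" {
			return v
		}
		return "pm_alex"
	}
	if v := getenv("CHATCLI_TTS_VOICE"); v != "" {
		return v
	}
	return "bm_george"
}

func main() {
	noEnv := func(string) string { return "" }
	fmt.Println(voiceFor("", "pt-BR", noEnv))     // pm_alex
	fmt.Println(voiceFor("", "en", noEnv))        // bm_george
	fmt.Println(voiceFor("pf_dora", "en", noEnv)) // pf_dora
}
```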
Provisioning — atomic and auditable
- .part download → .tmp extraction → atomic rename → a versioned marker written last: a crash mid-provision never leaves a half-built cache that masquerades as installed.
- Tar extraction is confined (no path traversal, no escaping symlinks, decompression-bomb ceiling) and the cache scan runs inside an os.Root (kernel-enforced confinement).
- CHATCLI_TTS_CACHE_DIR relocates the cache (must be an absolute path), useful for shared caches and air-gapped pre-seeding.
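The "marker written last" idea from the first bullet can be illustrated with a small sketch. It is not the actual provisioning code: the marker file name, its location inside the final directory, the `model.bin` file and the byte-slice "download" are all hypothetical simplifications, and real tar extraction with its confinement checks is omitted.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

const markerName = ".provisioned" // hypothetical marker file name

// provision stages everything under temporary names and writes the
// versioned marker only after the final rename has succeeded.
func provision(cacheDir, name, version string, payload []byte) error {
	part := filepath.Join(cacheDir, name+".part")
	tmp := filepath.Join(cacheDir, name+".tmp")
	final := filepath.Join(cacheDir, name)

	// 1. "Download" into a .part file.
	if err := os.WriteFile(part, payload, 0o644); err != nil {
		return err
	}
	defer os.Remove(part)

	// 2. "Extract" into a fresh .tmp directory (here: copy the payload).
	if err := os.RemoveAll(tmp); err != nil {
		return err
	}
	if err := os.MkdirAll(tmp, 0o755); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(tmp, "model.bin"), payload, 0o644); err != nil {
		return err
	}

	// 3. Move the staged directory into place with a single rename.
	if err := os.RemoveAll(final); err != nil {
		return err
	}
	if err := os.Rename(tmp, final); err != nil {
		return err
	}

	// 4. Write the marker last; a crash before this point leaves no marker.
	return os.WriteFile(filepath.Join(final, markerName), []byte(version), 0o644)
}

// installed reports whether the marker exists and matches the version.
func installed(cacheDir, name, version string) bool {
	b, err := os.ReadFile(filepath.Join(cacheDir, name, markerName))
	return err == nil && string(b) == version
}

// demo provisions into a throwaway directory and reports the marker
// state before and after.
func demo() (bool, bool, error) {
	dir, err := os.MkdirTemp("", "tts-cache-")
	if err != nil {
		return false, false, err
	}
	defer os.RemoveAll(dir)
	before := installed(dir, "kokoro", "v1")
	if err := provision(dir, "kokoro", "v1", []byte("weights")); err != nil {
		return before, false, err
	}
	return before, installed(dir, "kokoro", "v1"), nil
}

func main() {
	before, after, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(before, after) // false true
}
```

Because `installed` trusts only the marker, any interruption in steps 1 to 3 reads as "not installed" and the next run simply starts over.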
With ffmpeg on PATH, clips come out as OGG/Opus (native Telegram voice note). Without ffmpeg, output degrades to WAV — the reply never gets lost.
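The fallback described here amounts to a PATH check. A minimal sketch, assuming hypothetical helper names and the short labels "ogg" and "wav" for the two outcomes:

```go
package main

import (
	"fmt"
	"os/exec"
)

// haveFFmpeg reports whether an ffmpeg executable is found on PATH.
func haveFFmpeg() bool {
	_, err := exec.LookPath("ffmpeg")
	return err == nil
}

// outputFormat picks OGG/Opus when transcoding is possible and
// degrades to WAV otherwise, so a clip is always produced.
func outputFormat(ffmpegAvailable bool) string {
	if ffmpegAvailable {
		return "ogg"
	}
	return "wav"
}

func main() {
	fmt.Println(outputFormat(haveFFmpeg()))
}
```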
<tool_call name="@speak" args='{"cmd":"say","args":{"text":"Build finished","format":"mp3"}}' />
<tool_call name="@speak" args='{"cmd":"status"}' />
| Subcommand | Role |
|---|
say {text, voice?, format?, out?} | synthesize to a file (formats: mp3/wav/opus/ogg/aac/flac); out optional |
status | show the effective backend |
Voice replies over the gateway
The gateway replies in kind by default (CHATCLI_GATEWAY_VOICE_REPLY=auto): send audio, get audio back; send text, get text. Users can also toggle it by asking in the conversation itself (“answer me in audio” / “stop sending audio”) via the @voice tool.
The Gateway Voice Replies page covers everything: the auto|always|never modes, the per-conversation toggle, speech-aware writing (no emoji/markdown in the audio) and the voice-note transcode.
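As a quick mental model of the three modes, the decision can be sketched as below. Only the auto behavior is described on this page; the meaning of always and never is inferred from their names, the per-conversation toggle is not modeled, and `replyWithVoice` is a hypothetical name.

```go
package main

import "fmt"

// replyWithVoice decides whether the reply should be audio.
// Unknown or empty modes are treated as "auto" in this sketch.
func replyWithVoice(mode string, inboundIsAudio bool) bool {
	switch mode {
	case "always":
		return true
	case "never":
		return false
	default: // "auto": reply in kind
		return inboundIsAudio
	}
}

func main() {
	fmt.Println(replyWithVoice("auto", true))  // true
	fmt.Println(replyWithVoice("auto", false)) // false
}
```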
See the live state in /config (Integrations · Gateway · Voice reply (TTS)).