Text-to-Speech & Gateway Voice

ChatCLI synthesizes speech through the llm/tts package, which mirrors llm/transcription: the @speak tool produces an audio file, and the gateway can reply with voice. Backend selection is local-first and keyless-first, and it is not tied to one provider. It includes an embedded engine (Kokoro via sherpa-onnx) that runs offline on Linux, macOS and Windows with no API key.

Backend selection (local first)

tts.NewFromEnv picks in this order, degrading to null (disabled) when nothing is configured:

1. CHATCLI_TTS_CMD → a local command (a template with {text} and {output} placeholders). Keyless.
2. CHATCLI_TTS_URL → an OpenAI-compatible endpoint (/audio/speech); keyless unless CHATCLI_TTS_KEY is set.
3. Embedded engine (Kokoro): used automatically when its cache is already provisioned. Auto mode never triggers the download on its own (pin CHATCLI_TTS_PROVIDER=embedded to provision).
4. Local CLI on PATH (macOS say, espeak-ng, espeak): used automatically with zero config.
5. OPENAI_API_KEY → OpenAI TTS.
6. GROQ_API_KEY → Groq TTS (same OpenAI request shape).
7. GOOGLEAI_API_KEY/GEMINI_API_KEY → native Gemini TTS (returns PCM, which is wrapped in WAV).
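The precedence above can be sketched as a small function. This is an illustrative Python model of the documented order, not the actual tts.NewFromEnv code: the name pick_backend, the returned labels and the embedded_cached flag are assumptions made for the example, and the CHATCLI_TTS_PROVIDER pin is deliberately left out.

```python
import os
import shutil

# Local CLIs probed on PATH, in the documented order.
LOCAL_CLIS = ("say", "espeak-ng", "espeak")


def pick_backend(env, embedded_cached=False, which=shutil.which):
    """Return a label for the first backend that is configured (auto mode only)."""
    if env.get("CHATCLI_TTS_CMD"):
        return "command"
    if env.get("CHATCLI_TTS_URL"):
        return "openai-compatible"
    if embedded_cached:
        # Auto mode only uses the embedded engine if its cache already exists.
        return "embedded"
    for cli in LOCAL_CLIS:
        if which(cli):
            return "cli:" + cli
    if env.get("OPENAI_API_KEY"):
        return "openai"
    if env.get("GROQ_API_KEY"):
        return "groq"
    if env.get("GOOGLEAI_API_KEY") or env.get("GEMINI_API_KEY"):
        return "gemini"
    return "null"  # nothing configured: TTS is disabled


if __name__ == "__main__":
    print(pick_backend(dict(os.environ)))
```

Passing `which` and `embedded_cached` as parameters keeps the sketch testable without touching the real PATH or cache.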

This is genuinely multi-provider: any server that speaks /audio/speech works once CHATCLI_TTS_URL points at it. Anthropic and most of the 14 chat providers do not expose a speech API, so TTS cannot be “extended to all” of them: the endpoint does not exist.
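For orientation, here is a rough sketch of what a request to an OpenAI-compatible /audio/speech endpoint looks like. The JSON fields (model, input, voice, response_format) follow the public OpenAI speech API. The helper name build_speech_request, the default model and the way the base URL is joined are assumptions for illustration, and may not match how ChatCLI interprets CHATCLI_TTS_URL.

```python
import json
import urllib.request


def build_speech_request(base_url, text, voice, model="tts-1", fmt="mp3", key=None):
    """Build (but do not send) a POST request for an OpenAI-style speech endpoint."""
    headers = {"Content-Type": "application/json"}
    if key:
        # The key is optional: many local servers accept unauthenticated calls.
        headers["Authorization"] = "Bearer " + key
    body = {"model": model, "input": text, "voice": voice, "response_format": fmt}
    return urllib.request.Request(
        base_url.rstrip("/") + "/audio/speech",
        data=json.dumps(body).encode("utf-8"),
        headers=headers,
        method="POST",
    )
```

Sending the request with `urllib.request.urlopen(req)` and writing the response bytes to a file would yield the audio clip, assuming a compatible server is listening at the given base URL.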

Embedded engine (Kokoro) — offline neural voice, no key

export CHATCLI_TTS_PROVIDER=embedded

On the first synthesis, ChatCLI downloads into the user cache (one time, ~150 MB):

the prebuilt sherpa-onnx-offline-tts CLI for your platform (~25 MB, shared build), and
the kokoro-int8-multi-lang-v1_0 model (~126 MB) — 53 neural voices across English, Brazilian Portuguese, Spanish, French, Hindi, Italian, Japanese and Mandarin.

No cgo, no server, no API key: the ChatCLI release binary stays untouched; synthesis runs in a child process per clip. Supported platforms: linux/amd64, linux/arm64, darwin/amd64, darwin/arm64, windows/amd64.
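As a quick way to check whether a machine falls in the supported list, the following sketch maps Python's platform names onto the os/arch pairs used above. The alias table and function names are assumptions for illustration; ChatCLI itself would rely on its own build-time platform values.

```python
import platform

# Supported pairs, copied from the list above.
SUPPORTED = {
    ("linux", "amd64"),
    ("linux", "arm64"),
    ("darwin", "amd64"),
    ("darwin", "arm64"),
    ("windows", "amd64"),
}

# Python reports x86_64/aarch64 where the list above says amd64/arm64.
ARCH_ALIASES = {"x86_64": "amd64", "amd64": "amd64", "aarch64": "arm64", "arm64": "arm64"}


def normalized_platform(system=None, machine=None):
    """Return an (os, arch) pair in the naming style used by the list above."""
    system = (system or platform.system()).lower()
    machine = (machine or platform.machine()).lower()
    return system, ARCH_ALIASES.get(machine, machine)


def embedded_supported(system=None, machine=None):
    """True when the (os, arch) pair is in the documented support list."""
    return normalized_platform(system, machine) in SUPPORTED
```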

Voice routing by language

Each reply routes to the voice of its detected language — a mixed pt/en conversation answers every message with the right voice and accent:

| Variable | Role | Default |
| --- | --- | --- |
| `CHATCLI_TTS_VOICE` | Voice for English replies (and overall default) | `bm_george` (British) |
| `CHATCLI_TTS_VOICE_PT` | Voice for Portuguese replies | `pm_alex` (pt-BR) |

Other notable voices: bm_daniel, bm_fable, bm_lewis (British), pm_santa, pf_dora (pt-BR), af_*/am_* (American). An explicit voice passed to @speak always overrides routing.
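The routing rule can be summarized in a few lines. This is a minimal sketch, assuming the language has already been detected and is passed in as a code such as "pt" or "en"; the function name route_voice is hypothetical and language detection itself is out of scope.

```python
DEFAULT_VOICE = "bm_george"   # documented default for CHATCLI_TTS_VOICE
DEFAULT_VOICE_PT = "pm_alex"  # documented default for CHATCLI_TTS_VOICE_PT


def route_voice(lang, env, explicit=None):
    """Pick a voice: explicit choice first, then per-language, then the default."""
    if explicit:
        # An explicit voice passed to @speak always overrides routing.
        return explicit
    if lang.lower().startswith("pt"):
        return env.get("CHATCLI_TTS_VOICE_PT") or DEFAULT_VOICE_PT
    # English and every other language fall back to the overall default voice.
    return env.get("CHATCLI_TTS_VOICE") or DEFAULT_VOICE
```

In a mixed conversation, a call like `route_voice("pt-BR", os.environ)` and a call like `route_voice("en", os.environ)` would therefore select different voices for consecutive replies.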

Provisioning — atomic and auditable

.part download → .tmp extraction → atomic rename → a versioned marker written last: a crash mid-provision never leaves a half-built cache that masquerades as installed.
Tar extraction is confined (no path traversal, no escaping symlinks, decompression-bomb ceiling) and the cache scan runs inside an os.Root (kernel-enforced confinement).
CHATCLI_TTS_CACHE_DIR relocates the cache (must be an absolute path) — useful for shared caches and air-gapped pre-seeding.
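The sequence described above (.part, then .tmp, then rename, then marker) can be illustrated with a short sketch. It is not ChatCLI's Go implementation: the marker suffix, the size ceiling and the fetch callback are hypothetical, and the link check is stricter than the documented behaviour because it rejects every link rather than only escaping ones.

```python
import os
import shutil
import tarfile

MARKER_SUFFIX = ".ok-v1"  # hypothetical versioned marker name
MAX_BYTES = 1 << 30       # hypothetical decompression ceiling (1 GiB)


def checked_members(tar, dest, max_bytes=MAX_BYTES):
    """Validate every entry before extraction; raise ValueError on anything unsafe."""
    root = os.path.realpath(dest)
    total = 0
    members = []
    for member in tar.getmembers():
        target = os.path.realpath(os.path.join(root, member.name))
        if os.path.commonpath([root, target]) != root:
            raise ValueError("entry escapes the extraction root: " + member.name)
        if member.issym() or member.islnk():
            raise ValueError("links are rejected in this sketch: " + member.name)
        total += member.size
        if total > max_bytes:
            raise ValueError("archive exceeds the size ceiling")
        members.append(member)
    return members


def is_provisioned(cache_dir, name):
    """Only the marker counts: a directory without it is treated as not installed."""
    return os.path.isfile(os.path.join(cache_dir, name + MARKER_SUFFIX))


def provision(cache_dir, name, fetch):
    """fetch(path) is a caller-supplied function that writes the archive to path."""
    if not os.path.isabs(cache_dir):
        raise ValueError("cache dir must be an absolute path")
    os.makedirs(cache_dir, exist_ok=True)
    part = os.path.join(cache_dir, name + ".part")
    tmp = os.path.join(cache_dir, name + ".tmp")
    final = os.path.join(cache_dir, name)
    marker = final + MARKER_SUFFIX

    if os.path.exists(marker):
        os.remove(marker)                 # invalidate before touching the install
    fetch(part)                           # 1. download into the .part file
    shutil.rmtree(tmp, ignore_errors=True)
    os.makedirs(tmp)
    with tarfile.open(part) as tar:       # 2. confined extraction into .tmp
        tar.extractall(tmp, members=checked_members(tar, tmp))
    shutil.rmtree(final, ignore_errors=True)
    os.replace(tmp, final)                # 3. atomic rename (same filesystem)
    with open(marker, "w") as handle:     # 4. marker is written last
        handle.write("v1\n")
    os.remove(part)
```

Because the marker is the last thing written, a crash at any earlier step leaves `is_provisioned` returning False, so the next run starts over instead of trusting a partial cache.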

With ffmpeg on PATH, clips come out as OGG/Opus (native Telegram voice note). Without ffmpeg, output degrades to WAV — the reply never gets lost.
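The fallback can be pictured as follows. This is a sketch under assumptions: it presumes the synthesized clip starts as a WAV file, uses the standard ffmpeg options `-i` and `-c:a libopus`, and also falls back to WAV when the transcode itself fails, which the text above does not state explicitly. The function names are hypothetical.

```python
import os
import shutil
import subprocess


def opus_command(ffmpeg, wav_path, ogg_path):
    """ffmpeg invocation that re-encodes a WAV clip as OGG/Opus."""
    return [ffmpeg, "-y", "-i", wav_path, "-c:a", "libopus", ogg_path]


def finalize_clip(wav_path, ffmpeg=None):
    """Return the path of the clip to send: OGG/Opus if possible, else the WAV."""
    if ffmpeg is None:
        ffmpeg = shutil.which("ffmpeg")
    if not ffmpeg:
        return wav_path  # no ffmpeg on PATH: keep the WAV
    ogg_path = os.path.splitext(wav_path)[0] + ".ogg"
    try:
        result = subprocess.run(opus_command(ffmpeg, wav_path, ogg_path),
                                capture_output=True)
    except OSError:
        return wav_path  # ffmpeg could not be started
    return ogg_path if result.returncode == 0 else wav_path
```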

The @speak tool

<tool_call name="@speak" args='{"cmd":"say","args":{"text":"Build finished","format":"mp3"}}' />
<tool_call name="@speak" args='{"cmd":"status"}' />

| Subcommand | Role |
| --- | --- |
| `say {text, voice?, format?, out?}` | synthesize to a file (formats: mp3/wav/opus/ogg/aac/flac); `out` optional |
| `status` | show the effective backend |
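When tool calls are generated programmatically, the args attribute is easiest to get right by serializing JSON rather than concatenating strings. The helper below is a hypothetical sketch that reproduces the shape of the two examples above. Encoding apostrophes as the JSON escape \u0027 is an assumption about what keeps the single-quoted attribute intact; ChatCLI's actual escaping rules are not documented here.

```python
import json


def speak_call(cmd, **args):
    """Render an @speak tool call in the same shape as the examples above."""
    payload = {"cmd": cmd}
    if args:
        payload["args"] = args
    encoded = json.dumps(payload, separators=(",", ":"), ensure_ascii=False)
    # Apostrophes can only occur inside JSON strings, so a JSON unicode escape
    # keeps the payload valid while protecting the single-quoted attribute.
    encoded = encoded.replace("'", "\\u0027")
    return "<tool_call name=\"@speak\" args='" + encoded + "' />"


if __name__ == "__main__":
    print(speak_call("say", text="Build finished", format="mp3"))
    print(speak_call("status"))
```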

Voice replies over the gateway

The gateway replies in kind by default (CHATCLI_GATEWAY_VOICE_REPLY=auto): send audio, get audio back; send text, get text. Users can also toggle it from within the conversation itself (“answer me in audio” / “stop sending audio”) via the @voice tool. The Gateway Voice Replies page covers the details: the auto|always|never modes, the per-conversation toggle, speech-aware writing (no emoji/markdown in the audio) and the voice-note transcode. The live state is visible in /config (Integrations · Gateway · Voice reply (TTS)).
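The three modes reduce to a small decision function. This sketch assumes that always and never mean what their names suggest and that auto mirrors the inbound message type, as described above. The per-conversation toggle is not modeled, since its interaction with the mode is covered on the Gateway Voice Replies page rather than here. The function name is hypothetical.

```python
def reply_with_voice(mode, inbound_is_audio):
    """Decide whether a gateway reply should be sent as audio."""
    mode = (mode or "auto").lower()
    if mode == "always":
        return True
    if mode == "never":
        return False
    if mode == "auto":
        # Reply in kind: audio in, audio out; text in, text out.
        return inbound_is_audio
    raise ValueError("unknown CHATCLI_GATEWAY_VOICE_REPLY mode: " + mode)
```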

Related

Gateway Voice Replies: modes, per-conversation toggle, speech-aware
Chat Gateway, including inbound audio transcription
Image Generation (@image)
