Send a voice message on Telegram and the answer comes back as a voice note — transcription on the way in (embedded Whisper by default, zero config), synthesis on the way out, with any configured TTS backend. Everything here is provider-agnostic: from macOS say to the embedded Kokoro engine.
Modes (CHATCLI_GATEWAY_VOICE_REPLY)
| Value | Behavior |
|---|---|
| auto (default) | Reply in kind: voice message → answer with audio; text → text only. |
| always | Every final answer carries audio. |
| never | Text-only replies. |
Legacy boolean values still work: true → always, false → never. An unknown value falls back to auto — a typo never silences the gateway.
```bash
# auto is already the default; export only for always/never
export CHATCLI_GATEWAY_VOICE_REPLY=always
```
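The value handling described above (three modes, legacy booleans, fallback to auto) can be sketched as follows. This is an illustration only, not the gateway's actual parser: the function name `normalize_voice_mode` is hypothetical, and the whitespace trimming and case-insensitive matching are assumptions.

```python
from typing import Optional

VALID_MODES = ("auto", "always", "never")
LEGACY_BOOLEANS = {"true": "always", "false": "never"}


def normalize_voice_mode(raw: Optional[str]) -> str:
    """Map a raw CHATCLI_GATEWAY_VOICE_REPLY value to auto, always, or never."""
    # Assumption: surrounding whitespace and letter case are ignored.
    value = (raw or "").strip().lower()
    if value in VALID_MODES:
        return value
    # Legacy booleans keep working; anything else (typo, empty, unset)
    # falls back to auto so a bad value never silences the gateway.
    return LEGACY_BOOLEANS.get(value, "auto")
```

For example, `normalize_voice_mode("alwys")` yields `"auto"` under these assumptions, which matches the documented fallback.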
The daemon inherits the environment of the shell that ran /gateway start — and .env never overrides an already-exported variable. If behavior doesn’t match your .env, check the effective mode at daemon boot: gateway.log records voice replies enabled (mode=...).
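To check the effective mode without reading the log by eye, a small helper along these lines may be enough. It assumes only that the boot line contains the literal text `voice replies enabled (mode=<value>)`; the rest of the log line layout and the location of gateway.log are not specified here, so the helper takes the log text as input. The name `effective_mode` is hypothetical.

```python
import re
from typing import Optional

# Assumption: the boot line contains exactly this fragment.
MODE_LINE = re.compile(r"voice replies enabled \(mode=([^)]*)\)")


def effective_mode(log_text: str) -> Optional[str]:
    """Return the mode from the most recent matching line, or None if absent."""
    matches = MODE_LINE.findall(log_text)
    # The last match corresponds to the latest daemon boot in the log.
    return matches[-1] if matches else None
```

If this returns `always` while your .env says otherwise, the variable was most likely already exported in the shell that started the daemon.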
Each conversation can set its own preference. Ask in natural language: the model calls the @voice tool, and the choice is stored per session (it survives daemon restarts) and takes precedence over the global mode.
| Ask in the chat | Effect |
|---|---|
| “answer me in audio” / “I want to hear your voice” | @voice on: every reply in this conversation is spoken |
| “stop sending audio” | @voice off: text only in this conversation |
| “back to normal” | @voice auto: back to the default (voice answers voice) |
The preference lives in ~/.chatcli/gateway_voice_prefs.json (atomic writes), keyed by platform:chat. Decision hierarchy: conversation preference → global mode → in-kind.
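An atomic write of a preferences file keyed by platform:chat could look like the sketch below. The JSON layout (a flat object mapping each key to `on`, `off`, or `auto`) is an assumption, as are the helper names `pref_key` and `save_prefs`; only the file's location, the keying scheme, and the atomic-write property come from the description above.

```python
import json
import os
import tempfile


def pref_key(platform: str, chat_id: object) -> str:
    """Build the platform:chat key, e.g. telegram:42."""
    return f"{platform}:{chat_id}"


def save_prefs(path: str, prefs: dict) -> None:
    """Write prefs so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file lives in the target directory so the final rename
    # stays on one filesystem, which is what makes it atomic.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(prefs, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp, path)
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Writing to a temporary file and renaming it means a crash mid-write leaves the previous preferences intact instead of a truncated file.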
Written for the ear (speech-aware)
When a reply will become audio, two layers guarantee natural speech:
- The model knows before writing: a directive says the answer will be spoken — conversational prose, short sentences, no emojis, no lists/tables/markdown.
- A hard guarantee in the sanitizer: before synthesis, StripForSpeech flattens markdown (code dropped, links collapse to labels, tables become prose) and removes emoji and pictographs, because TTS engines read their Unicode names out loud, burying the message. Portuguese accents stay intact.
The visual text in the chat keeps its full formatting; only the audio is sanitized.
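The sanitizer layer can be approximated with a few regular expressions. The sketch below is not StripForSpeech itself: the function name `strip_for_speech_sketch` is hypothetical, the emoji ranges are a rough approximation, inline code is unwrapped rather than dropped (an assumption), and table-to-prose conversion is left out.

```python
import re

FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
INLINE_CODE = re.compile(r"`([^`\n]*)`")
LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")
LINE_PREFIX = re.compile(r"(?m)^[ \t]{0,3}(?:#{1,6}[ \t]+|[-*+][ \t]+|\d+\.[ \t]+)")
EMPHASIS = re.compile(r"\*{1,2}([^*\n]+)\*{1,2}")
# Rough pictograph ranges plus variation selector and zero-width joiner.
# Accented Latin letters (for example in Portuguese) are outside these ranges.
EMOJI = re.compile(r"[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")


def strip_for_speech_sketch(text: str) -> str:
    """Flatten common markdown and remove emoji before TTS (simplified)."""
    text = FENCED_CODE.sub("", text)      # fenced code is dropped entirely
    text = INLINE_CODE.sub(r"\1", text)   # assumption: keep inline code words
    text = LINK.sub(r"\1", text)          # links collapse to their labels
    text = LINE_PREFIX.sub("", text)      # heading and list markers
    text = EMPHASIS.sub(r"\1", text)      # bold and italic markers
    text = EMOJI.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Only the string handed to the TTS backend goes through a function like this; the chat message keeps its original markdown.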
A voice note that actually plays (transcode)
Backends that ignore the format hint (macOS say emits aiff, espeak emits wav) used to produce a file that Telegram shows with a size but cannot play. With ffmpeg on PATH, the gateway transcodes wav/aiff → OGG/Opus (voice-note profile: 48 kHz mono) for every provider, and already-compressed formats pass through. Without ffmpeg, the original clip is sent as an audio file: visible degradation, never a lost reply.
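The three outcomes above (transcode, pass through, degrade to an audio file) can be sketched as a small planning function. The names `plan_delivery` and `opus_command` are hypothetical, and the exact ffmpeg arguments the gateway uses are not documented here; the flags below are standard ffmpeg options that yield 48 kHz mono Opus in an OGG container, assuming an ffmpeg build with libopus.

```python
import os
import shutil
from typing import List, Optional, Tuple

RAW_EXTENSIONS = {".wav", ".aiff"}


def opus_command(src: str, dst: str) -> List[str]:
    """ffmpeg invocation for a 48 kHz mono OGG/Opus voice note."""
    return ["ffmpeg", "-y", "-i", src,
            "-c:a", "libopus", "-ar", "48000", "-ac", "1", dst]


def plan_delivery(src: str, ffmpeg_path: Optional[str]) -> Tuple[str, Optional[List[str]]]:
    """Decide how a synthesized clip is delivered; returns (action, command)."""
    base, ext = os.path.splitext(src)
    if ext.lower() not in RAW_EXTENSIONS:
        return ("pass_through", None)      # already compressed
    if ffmpeg_path is None:
        return ("send_as_audio", None)     # degraded, but the reply still arrives
    return ("transcode", opus_command(src, base + ".ogg"))


if __name__ == "__main__":
    # shutil.which returns None when ffmpeg is not on PATH.
    print(plan_delivery("reply.aiff", shutil.which("ffmpeg")))
```

Passing the ffmpeg lookup result in as an argument keeps the decision testable without ffmpeg installed.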
The full pipeline
```
voice note (Telegram) ─→ transcription (local-first whisper) ─→ agent loop
                                                                │ speech-aware directive
                                                                ▼
reply text ─→ StripForSpeech ─→ TTS (any backend) ─→ ogg/opus transcode ─→ sendVoice
```
The “does this reply speak?” decision is a single rule (conversation preference → global mode → in-kind) shared between the runner and the agent loop — the two can never diverge.
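The single rule can be written as one small function. This is a sketch, not the gateway's code: the name `reply_speaks` is hypothetical, and it assumes that a conversation preference of `auto` (or no stored preference) defers to the global mode rather than forcing in-kind behavior.

```python
from typing import Optional


def reply_speaks(conversation_pref: Optional[str],
                 global_mode: str,
                 inbound_is_voice: bool) -> bool:
    """Conversation preference, then global mode, then in-kind."""
    # 1. An explicit per-conversation choice wins.
    if conversation_pref == "on":
        return True
    if conversation_pref == "off":
        return False
    # 2. Assumption: "auto" or None falls through to the global mode.
    if global_mode == "always":
        return True
    if global_mode == "never":
        return False
    # 3. In kind: voice answers voice, text answers text.
    return inbound_is_voice
```

Keeping the rule in one function that both callers share is what prevents the runner and the agent loop from disagreeing about whether a reply is spoken.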
Troubleshooting
| Symptom | Likely cause | Check |
|---|---|---|
| Everything comes back with audio, even text | always mode active (env exported in the daemon’s shell) | gateway.log: voice replies enabled (mode=...) |
| Audio arrives but won’t play | Raw format with no ffmpeg to transcode | which ffmpeg; install it for native voice notes |
| Never replies with voice | never mode, or no TTS backend | /config → Integrations · Gateway · Voice reply |
| Wrong voice for the language | Embedded engine voices | CHATCLI_TTS_VOICE (en) / CHATCLI_TTS_VOICE_PT (pt) |