Send a voice message on Telegram and the answer comes back as a voice note — transcription on the way in (embedded Whisper by default, zero config), synthesis on the way out, with any configured TTS backend. Everything here is provider-agnostic: from macOS say to the embedded Kokoro engine.

Modes (CHATCLI_GATEWAY_VOICE_REPLY)

| Value | Behavior |
| --- | --- |
| auto (default) | Reply in kind: voice message → answer with audio; text → text only. |
| always | Every final answer carries audio. |
| never | Text-only replies. |
Legacy boolean values still work: true → always, false → never. An unknown value falls back to auto, so a typo never silences the gateway.
# auto is already the default; export only for always/never
export CHATCLI_GATEWAY_VOICE_REPLY=always
The daemon inherits the environment of the shell that ran /gateway start — and .env never overrides an already-exported variable. If behavior doesn’t match your .env, check the effective mode at daemon boot: gateway.log records voice replies enabled (mode=...).
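
The normalization rules above (three modes, legacy booleans, fallback to auto) can be sketched as a small function. This is an illustration under assumptions, not the gateway's actual code: the name `parse_voice_reply_mode` is hypothetical, and trimming whitespace and ignoring case are guesses about how lenient the real parser is.

```python
VALID_MODES = {"auto", "always", "never"}
LEGACY_BOOLEANS = {"true": "always", "false": "never"}


def parse_voice_reply_mode(raw):
    """Normalize a CHATCLI_GATEWAY_VOICE_REPLY value to auto, always or never."""
    # Assumption: surrounding whitespace and letter case are ignored.
    value = (raw or "").strip().lower()
    if value in VALID_MODES:
        return value
    if value in LEGACY_BOOLEANS:
        return LEGACY_BOOLEANS[value]
    # Unset, empty, or misspelled values fall back to the default.
    return "auto"
```

A caller would typically pass `os.environ.get("CHATCLI_GATEWAY_VOICE_REPLY")`, which is `None` when the variable is not exported.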

Per-conversation control — the @voice tool

Each conversation can set its own preference in natural language: the model calls the @voice tool, and the choice is stored per session (it survives daemon restarts) and takes precedence over the global mode.
| Ask in the chat | Effect |
| --- | --- |
| “answer me in audio” / “I want to hear your voice” | @voice on: every reply in this conversation is spoken |
| “stop sending audio” | @voice off: text only in this conversation |
| “back to normal” | @voice auto: back to the default (voice answers voice) |
The preference lives in ~/.chatcli/gateway_voice_prefs.json (atomic writes), keyed by platform:chat. Decision hierarchy: conversation preference → global mode → in-kind.
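
To make "atomic writes, keyed by platform:chat" concrete, here is a minimal sketch of such a store. The on-disk schema (a flat JSON object mapping `platform:chat` to `on`, `off` or `auto`) and the function names are assumptions for illustration; the real file layout may differ. The atomicity comes from writing a temporary file in the same directory and renaming it over the target.

```python
import json
import os
import tempfile


def load_voice_prefs(path):
    """Return the stored preferences, or an empty dict if the file is missing."""
    try:
        with open(path, encoding="utf-8") as handle:
            return json.load(handle)
    except FileNotFoundError:
        return {}


def save_voice_pref(path, platform, chat_id, pref):
    """Persist one conversation's preference without ever exposing a half-written file."""
    if pref not in ("on", "off", "auto"):
        raise ValueError(f"unknown voice preference: {pref!r}")
    prefs = load_voice_prefs(path)
    prefs[f"{platform}:{chat_id}"] = pref  # hypothetical key layout: platform:chat

    directory = os.path.dirname(os.path.abspath(path))
    os.makedirs(directory, exist_ok=True)
    # Temp file in the same directory, then rename: os.replace is atomic
    # when source and destination are on the same filesystem.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(prefs, handle, ensure_ascii=False, indent=2)
        os.replace(tmp, path)
    except Exception:
        os.unlink(tmp)
        raise
```

A reader that opens the file at any moment sees either the previous complete version or the new one, which is what lets the preference survive a daemon restart mid-write.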

Written for the ear (speech-aware)

When a reply will become audio, two layers guarantee natural speech:
  1. The model knows before writing: a directive says the answer will be spoken — conversational prose, short sentences, no emojis, no lists/tables/markdown.
  2. A hard guarantee in the sanitizer: before synthesis, StripForSpeech flattens markdown (code dropped, links collapse to labels, tables become prose) and removes emoji and pictographs — TTS engines read their Unicode names out loud, burying the message. Portuguese accents stay intact.
The visual text in the chat keeps its full formatting; only the audio is sanitized.
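
The sanitizer layer can be approximated with a handful of regular expressions. The sketch below is not the gateway's StripForSpeech: the function name `strip_for_speech`, the exact patterns, and the emoji ranges are assumptions, and table-to-prose conversion is left out for brevity. It shows the core idea: drop code, keep link labels, remove markup and pictographs, and leave accented letters untouched.

```python
import re

# Assumption: a coarse set of pictographic ranges, not the gateway's real list.
_EMOJI = re.compile(
    "["
    "\U0001F000-\U0001FAFF"  # emoji, pictographs, transport, supplemental symbols
    "\u2600-\u27BF"          # miscellaneous symbols and dingbats
    "\u2B00-\u2BFF"          # miscellaneous symbols and arrows
    "\uFE0F\u200D"           # variation selector and zero-width joiner
    "]+"
)


def strip_for_speech(text):
    """Flatten common markdown and remove emoji before handing text to a TTS engine."""
    text = re.sub(r"```.*?```", " ", text, flags=re.DOTALL)  # fenced code is dropped
    text = re.sub(r"`[^`\n]*`", " ", text)                    # inline code is dropped
    text = re.sub(r"!?\[([^\]]*)\]\([^)]*\)", r"\1", text)   # links collapse to labels
    text = re.sub(r"^\s{0,3}#{1,6}\s+", "", text, flags=re.MULTILINE)       # headings
    text = re.sub(r"^\s*(?:[-*+]|\d+\.)\s+", "", text, flags=re.MULTILINE)  # list markers
    text = re.sub(r"(\*\*|\*|~~)(?=\S)(.+?)(?<=\S)\1", r"\2", text)         # emphasis
    text = _EMOJI.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

Because the emoji pattern only targets symbol blocks, Latin letters with diacritics (é, ã, ç) pass through unchanged.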

A voice note that actually plays (transcode)

Backends that ignore the format hint (macOS say emits aiff, espeak emits wav) used to produce a file Telegram shows with a size but cannot play. With ffmpeg on PATH, the gateway transcodes wav/aiff → OGG/Opus (voice-note profile: 48 kHz mono) for every provider; already-compressed formats pass through, and without ffmpeg the original clip is sent as an audio file — visible degradation, never a lost reply.
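
The transcode step could look roughly like the sketch below. The ffmpeg flags are standard (`-c:a libopus`, `-ar 48000`, `-ac 1`), but everything else is an assumption: the function names, the status strings, treating only `.wav` and `.aiff` as raw formats, and falling back to the original clip when ffmpeg exits with an error. It also presumes an ffmpeg build that includes the libopus encoder.

```python
import os
import shutil
import subprocess

RAW_FORMATS = {".wav", ".aiff"}  # formats the text says need transcoding


def opus_command(ffmpeg, src, dst):
    """Build an ffmpeg invocation for a 48 kHz mono OGG/Opus clip."""
    return [ffmpeg, "-y", "-i", src, "-c:a", "libopus", "-ar", "48000", "-ac", "1", dst]


def prepare_voice_note(src, which=shutil.which, run=subprocess.run):
    """Return (path, status); status is 'passthrough', 'transcoded' or 'fallback'."""
    stem, ext = os.path.splitext(src)
    if ext.lower() not in RAW_FORMATS:
        return src, "passthrough"      # already compressed: send as-is
    ffmpeg = which("ffmpeg")
    if ffmpeg is None:
        return src, "fallback"         # no ffmpeg on PATH: send the original clip
    dst = stem + ".ogg"
    result = run(opus_command(ffmpeg, src, dst), capture_output=True)
    if result.returncode != 0:
        return src, "fallback"         # assumption: a failed transcode degrades too
    return dst, "transcoded"
```

The `which` and `run` parameters exist only so the logic can be exercised without ffmpeg installed; in normal use both defaults apply.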

The full pipeline

voice note (Telegram) ─→ transcription (local-first whisper) ─→ agent loop
                                                                  │ speech-aware directive

reply text ─→ StripForSpeech ─→ TTS (any backend) ─→ ogg/opus transcode ─→ sendVoice
The “does this reply speak?” decision is a single rule (conversation preference → global mode → in-kind) shared between the runner and the agent loop — the two can never diverge.
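
Expressed as code, the single rule might look like the sketch below. The function name and argument shapes are hypothetical. One point is an explicit assumption: a conversation preference of `auto` (or no stored preference) is treated here as deferring to the global mode; the text does not say whether `@voice auto` instead forces in-kind replies regardless of the global mode.

```python
def should_speak(conversation_pref, global_mode, incoming_was_voice):
    """Decide whether a reply is spoken: conversation preference, then global mode, then in-kind."""
    # 1. An explicit per-conversation choice wins over everything else.
    if conversation_pref == "on":
        return True
    if conversation_pref == "off":
        return False
    # 2. Assumption: "auto" or a missing preference defers to the global mode.
    if global_mode == "always":
        return True
    if global_mode == "never":
        return False
    # 3. Global auto: reply in kind (voice answers voice, text answers text).
    return incoming_was_voice
```

Keeping the decision in one pure function is what allows two callers, such as a runner and an agent loop, to share it without drifting apart.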

Troubleshooting

| Symptom | Likely cause | Check |
| --- | --- | --- |
| Everything comes back with audio, even text | always mode active (env exported in the daemon’s shell) | gateway.log: voice replies enabled (mode=...) |
| Audio arrives but won’t play | Raw format with no ffmpeg to transcode | which ffmpeg; install it for native voice notes |
| Never replies with voice | never mode, or no TTS backend/config | Integrations · Gateway · Voice reply |
| Wrong voice for the language | Embedded engine voices | CHATCLI_TTS_VOICE (en) / CHATCLI_TTS_VOICE_PT (pt) |