Gateway Voice Replies

Send a voice message on Telegram and the answer comes back as a voice note: the gateway transcribes the incoming audio (embedded Whisper by default, zero config) and synthesizes the reply with any configured TTS backend. Everything here is provider-agnostic, from macOS `say` to the embedded Kokoro engine.

Modes (`CHATCLI_GATEWAY_VOICE_REPLY`)

| Value | Behavior |
| --- | --- |
| `auto` (default) | Reply in kind: voice message → answer with audio; text → text only. |
| `always` | Every final answer carries audio. |
| `never` | Text-only replies. |

Legacy boolean values still work: `true` maps to `always` and `false` maps to `never`. An unknown value falls back to `auto`, so a typo never silences the gateway.
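The normalization rules above can be sketched in a few lines. This is an illustration only, not the gateway's actual parser: the function name `parse_voice_reply_mode` is hypothetical, and trimming whitespace and ignoring case are assumptions the text does not confirm.

```python
VALID_MODES = {"auto", "always", "never"}
LEGACY_BOOLEANS = {"true": "always", "false": "never"}


def parse_voice_reply_mode(raw):
    """Normalize a CHATCLI_GATEWAY_VOICE_REPLY value to auto, always, or never."""
    # Assumption: values are compared case-insensitively after trimming.
    value = (raw or "").strip().lower()
    if value in VALID_MODES:
        return value
    if value in LEGACY_BOOLEANS:
        # Legacy booleans map onto the new modes.
        return LEGACY_BOOLEANS[value]
    # Unknown or empty values fall back to the default instead of disabling voice.
    return "auto"
```

With this shape, a misspelled value such as `alwyas` resolves to `auto` rather than raising an error or silencing replies.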

# auto is already the default; export only for always/never
export CHATCLI_GATEWAY_VOICE_REPLY=always

The daemon inherits the environment of the shell that ran `/gateway start`, and `.env` never overrides an already-exported variable. If the behavior doesn't match your `.env`, check the effective mode at daemon boot: `gateway.log` records `voice replies enabled (mode=...)`.
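The precedence between exported variables and `.env` can be illustrated with a tiny loader. This is a sketch of the general "do not override" pattern, not the gateway's real `.env` handling; `apply_dotenv` is a hypothetical helper and the `CHATCLI_TTS_VOICE` value is a placeholder.

```python
def apply_dotenv(lines, environ):
    """Apply KEY=VALUE lines without overriding variables that are already set."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # setdefault keeps an existing (exported) value and only fills gaps.
        environ.setdefault(key.strip(), value.strip())
    return environ


# The shell that started the daemon exported "always".
env = {"CHATCLI_GATEWAY_VOICE_REPLY": "always"}
apply_dotenv(
    ["CHATCLI_GATEWAY_VOICE_REPLY=never", "CHATCLI_TTS_VOICE=example"],
    env,
)
```

Here the `.env` line asking for `never` is ignored because the variable was already exported, which is the situation the log check above helps diagnose.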

Per-conversation control — the `@voice` tool

Each conversation sets its own preference in natural language: when you ask, the model calls the `@voice` tool, and the choice is stored per session (it survives daemon restarts) and takes precedence over the global mode.

| Ask in the chat | Effect |
| --- | --- |
| “answer me in audio” / “I want to hear your voice” | `@voice on`: every reply in this conversation is spoken |
| “stop sending audio” | `@voice off`: text only in this conversation |
| “back to normal” | `@voice auto`: back to the default (voice answers voice) |

The preference lives in ~/.chatcli/gateway_voice_prefs.json (atomic writes), keyed by platform:chat. Decision hierarchy: conversation preference → global mode → in-kind.
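An atomic write of a per-conversation preference might look like the sketch below. The file layout (a flat JSON object mapping `platform:chat` to `on`, `off`, or `auto`) and the helper name `save_pref` are assumptions for illustration; only the key format and the atomic-write property come from the text above.

```python
import json
import os
import tempfile


def save_pref(path, platform, chat_id, pref):
    """Store one conversation preference, replacing the file atomically."""
    prefs = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            prefs = json.load(fh)
    prefs[f"{platform}:{chat_id}"] = pref

    # Write to a temp file in the same directory, then rename over the target.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(prefs, fh)
            fh.flush()
            os.fsync(fh.fileno())
        os.replace(tmp_path, path)  # atomic when both paths share a filesystem
    except BaseException:
        os.unlink(tmp_path)
        raise
    return prefs
```

Because the rename replaces the file in one step, a reader sees either the old preferences or the new ones, never a half-written file.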

Written for the ear (speech-aware)

When a reply will become audio, two layers guarantee natural speech:

- The model knows before writing: a directive tells it the answer will be spoken, so it writes conversational prose in short sentences, with no emojis, lists, tables, or markdown.
- The sanitizer enforces it: before synthesis, `StripForSpeech` flattens markdown (code is dropped, links collapse to their labels, tables become prose) and removes emoji and pictographs, because TTS engines otherwise read their Unicode names out loud and bury the message. Portuguese accents stay intact.

The visual text in the chat keeps its full formatting; only the audio is sanitized.
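To make the sanitizer step concrete, here is a rough Python approximation of what a function like `StripForSpeech` does. It is not the real implementation: the regular expressions are deliberately narrow, table-to-prose conversion is omitted, and unwrapping inline code (rather than dropping it) is an assumption.

```python
import re

FENCED_CODE = re.compile(r"```.*?```", re.DOTALL)
INLINE_CODE = re.compile(r"`([^`\n]*)`")
LINK = re.compile(r"\[([^\]]+)\]\([^)]*\)")
# Heading, bullet, numbered-list, and blockquote markers at the start of a line.
MARKERS = re.compile(r"^[ \t]{0,3}(?:#{1,6}[ \t]+|[-*+][ \t]+|\d+\.[ \t]+|>[ \t]?)", re.MULTILINE)
EMPHASIS = re.compile(r"\*\*|__|~~|\*")
# Assumed, incomplete emoji ranges: pictograph planes, misc symbols, dingbats, joiners.
EMOJI = re.compile("[\U0001F000-\U0001FAFF\u2600-\u27BF\uFE0F\u200D]")


def strip_for_speech(text):
    """Flatten markdown and remove emoji so the text reads naturally aloud."""
    text = FENCED_CODE.sub(" ", text)    # fenced code is dropped
    text = INLINE_CODE.sub(r"\1", text)  # inline code keeps its words (assumption)
    text = LINK.sub(r"\1", text)         # links collapse to their labels
    text = MARKERS.sub("", text)
    text = EMPHASIS.sub("", text)
    text = EMOJI.sub("", text)
    return re.sub(r"\s+", " ", text).strip()


sample = (
    "## Status ✅\n\n**Deploy** finished. 🚀 "
    "See [the logs](https://example.com/logs).\n\n"
    "```\nmake deploy\n```\nAção concluída."
)
spoken = strip_for_speech(sample)
```

Accented letters such as `ç` and `ã` fall outside the emoji ranges, so they pass through untouched.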

A voice note that actually plays (transcode)

Backends that ignore the format hint (macOS `say` emits AIFF, espeak emits WAV) used to produce a file that Telegram lists with a size but cannot play. With ffmpeg on PATH, the gateway transcodes WAV/AIFF to OGG/Opus (voice-note profile: 48 kHz mono) for every provider, and already-compressed formats pass through. Without ffmpeg, the original clip is sent as an audio file: a visible degradation, but never a lost reply.
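The transcode decision can be sketched as a small planning function. The function name, the action labels, and the exact ffmpeg arguments are assumptions; the text only states the target profile (OGG/Opus, 48 kHz, mono), which maps onto the standard ffmpeg options `-c:a libopus -ar 48000 -ac 1`.

```python
import shutil
from pathlib import Path

RAW_FORMATS = {".wav", ".aiff", ".aif"}


def plan_delivery(clip, ffmpeg_path):
    """Return (action, command) for a synthesized clip.

    ffmpeg_path is the result of shutil.which("ffmpeg"), or None if missing.
    """
    clip = Path(clip)
    if clip.suffix.lower() not in RAW_FORMATS:
        # Already-compressed formats are sent as they are.
        return "passthrough", None
    if ffmpeg_path is None:
        # Degraded but not lost: deliver the raw clip as a regular audio file.
        return "send_as_audio_file", None
    out = clip.with_suffix(".ogg")
    command = [
        ffmpeg_path, "-y", "-i", str(clip),
        "-c:a", "libopus",  # Opus audio in an OGG container
        "-ar", "48000",     # 48 kHz sample rate
        "-ac", "1",         # mono
        str(out),
    ]
    return "transcode", command


action, command = plan_delivery("reply.aiff", shutil.which("ffmpeg"))
```

When the action is `transcode`, the command list can be passed to `subprocess.run(command, check=True)` before the resulting `.ogg` file is uploaded.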

The full pipeline

voice note (Telegram) ─→ transcription (local-first whisper) ─→ agent loop
                                                                  │ speech-aware directive
                                                                  ▼
reply text ─→ StripForSpeech ─→ TTS (any backend) ─→ ogg/opus transcode ─→ sendVoice

The “does this reply speak?” decision is a single rule (conversation preference → global mode → in-kind) shared between the runner and the agent loop — the two can never diverge.
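The single rule can be written as one pure function that both callers share. This is a sketch under stated assumptions: the name `should_speak` is hypothetical, and a stored `auto` preference is treated here as "reply in kind" for that conversation, while a missing preference defers to the global mode. The source does not say whether `@voice auto` instead clears the stored preference.

```python
def should_speak(conversation_pref, global_mode, inbound_is_voice):
    """Decide whether a reply is spoken.

    conversation_pref: "on", "off", "auto", or None when nothing is stored.
    global_mode: "auto", "always", or "never".
    inbound_is_voice: True when the user's message was a voice note.
    """
    # 1. The conversation preference wins when present.
    if conversation_pref == "on":
        return True
    if conversation_pref == "off":
        return False
    if conversation_pref == "auto":
        # Assumption: explicit auto means in-kind for this conversation.
        return inbound_is_voice
    # 2. Otherwise the global mode decides.
    if global_mode == "always":
        return True
    if global_mode == "never":
        return False
    # 3. Default: reply in kind.
    return inbound_is_voice
```

Keeping the rule free of I/O means the runner and the agent loop can call the same function with the same inputs and always get the same answer.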

Troubleshooting

| Symptom | Likely cause | Check |
| --- | --- | --- |
| Everything comes back with audio, even text | `always` mode active (env exported in the daemon’s shell) | `gateway.log`: `voice replies enabled (mode=...)` |
| Audio arrives but won’t play | Raw format with no ffmpeg to transcode | `which ffmpeg`; install it for native voice notes |
| Never replies with voice | `never` mode, or no TTS backend | `/config` → Integrations · Gateway · Voice reply |
| Wrong voice for the language | Embedded engine voices | `CHATCLI_TTS_VOICE` (en) / `CHATCLI_TTS_VOICE_PT` (pt) |

Related

- Text-to-Speech: backends, embedded Kokoro engine, @speak
- Chat Gateway: inbound transcription, platforms, daemon
- Conversation Hub: cross-channel continuity
