# Stentor — Usage Guide

> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V

Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the **Daedalus** web application backend — there is no separate Stentor server process.

---

## Table of Contents

- [How It Works](#how-it-works)
- [Components](#components)
- [ESP32 Device Setup](#esp32-device-setup)
- [Daedalus Configuration](#daedalus-configuration)
- [Device Registration Flow](#device-registration-flow)
- [Voice Conversation Flow](#voice-conversation-flow)
- [WebSocket Protocol](#websocket-protocol)
- [API Endpoints](#api-endpoints)
- [Observability](#observability)
- [Troubleshooting](#troubleshooting)
- [Architecture Overview](#architecture-overview)

---

## How It Works

1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
4. On-device VAD (Voice Activity Detection) detects speech and silence
5. Audio streams to Daedalus, which runs: **Speaches STT** → **Pallas Agent (MCP)** → **Speaches TTS**
6. The response audio streams back to the ESP32 speaker
7. Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations

---

## Components

| Component | Location | Purpose |
|-----------|----------|---------|
| **stentor-ear** | `stentor/stentor-ear/` | ESP32-S3 firmware — microphone, speaker, wake word, VAD |
| **Daedalus voice module** | `daedalus/backend/daedalus/voice/` | Voice pipeline — STT, MCP agent calls, TTS |
| **Daedalus voice API** | `daedalus/backend/daedalus/api/v1/voice.py` | WebSocket + REST endpoints for devices and sessions |
| **Daedalus web UI** | `daedalus/frontend/` | Device management, conversation history |

The Python gateway code that was previously in `stentor/stentor-gateway/` has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.

---

## ESP32 Device Setup

The ESP32-S3-AUDIO-Board firmware needs one configuration value:

| Setting | Description | Example |
|---------|-------------|---------|
| Daedalus URL | Base URL of the Daedalus instance | `http://puck.incus:22181` |

On first boot, the device:

1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
2. Registers with Daedalus via `POST /api/v1/voice/devices/register`
3. Keeps that UUID across reboots, so the device retains its identity

---

## Daedalus Configuration

Voice settings are configured via environment variables with the `DAEDALUS_` prefix:

| Variable | Description | Default |
|----------|-------------|---------|
| `DAEDALUS_VOICE_STT_URL` | Speaches STT endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_TTS_URL` | Speaches TTS endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `DAEDALUS_VOICE_TTS_MODEL` | TTS model name | `kokoro` |
| `DAEDALUS_VOICE_TTS_VOICE` | TTS voice ID | `af_heart` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` | Sample rate in Hz | `16000` |
| `DAEDALUS_VOICE_AUDIO_CHANNELS` | Audio channels | `1` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` | Seconds of silence before auto-end | `120` |

---

## Device Registration Flow

```
ESP32                               Daedalus
│                                       │
│  [First boot — UUID generated]        │
├─ POST /api/v1/voice/devices/register ▶│
│   {device_id, firmware_version}       │
│◀─ {status: "registered"} ─────────────┤
│                                       │
│  [Device appears in Daedalus          │
│   Settings → Voice Devices]           │
│                                       │
│  [User assigns workspace + agent      │
│   via web UI]                         │
│                                       │
│  [Subsequent boots — same UUID]       │
├─ POST /api/v1/voice/devices/register ▶│
│   {device_id, firmware_version}       │
│◀─ {status: "already_registered"} ─────┤
│                                       │
```

After registration, the device appears in the Daedalus settings page. The user assigns it:

- A **name** (e.g. "Kitchen Speaker")
- A **description** (optional)
- A **workspace** (which workspace voice conversations go to)
- An **agent** (which Pallas agent to target)

Until assigned, the device cannot process voice.
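
The registration handshake above can be sketched from the device's point of view. This is a minimal illustration in Python, not the actual firmware: a JSON file stands in for NVS, and the helper names `load_or_create_device_id` and `build_register_request` are assumptions made for this example. Only the route and the `device_id` / `firmware_version` fields come from the flow above.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request

REGISTER_PATH = "/api/v1/voice/devices/register"


def load_or_create_device_id(store: Path) -> str:
    """Return the persisted device UUID, generating one on first boot.

    `store` is a JSON file standing in for the ESP32's NVS partition.
    """
    if store.exists():
        return json.loads(store.read_text())["device_id"]
    device_id = str(uuid.uuid4())
    store.write_text(json.dumps({"device_id": device_id}))
    return device_id


def build_register_request(base_url: str, device_id: str, firmware_version: str) -> Request:
    """Build (but do not send) the idempotent registration request."""
    body = json.dumps({"device_id": device_id, "firmware_version": firmware_version})
    return Request(
        base_url.rstrip("/") + REGISTER_PATH,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned request to `urllib.request.urlopen` would send it. Per the flow above, the response status should be `registered` on first boot and `already_registered` on later boots.
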

---

## Voice Conversation Flow

A voice conversation is a multi-turn session driven by on-device VAD:

```
ESP32                           Daedalus
│                                   │
├─ [Wake word detected]             │
├─ WS /api/v1/voice/realtime ──────▶│
├─ session.start ──────────────────▶│  → Create Conversation in DB
│◀──── session.created ─────────────┤  {session_id, conversation_id}
│◀──── status: listening ───────────┤
│                                   │
│  [VAD: user speaks]               │
├─ input_audio_buffer.append ×N ───▶│
│  [VAD: silence detected]          │
├─ input_audio_buffer.commit ──────▶│
│◀──── status: transcribing ────────┤  → STT
│◀──── transcript.done ─────────────┤  → Save user message
│◀──── status: thinking ────────────┤  → MCP call to Pallas
│◀──── response.text.done ──────────┤  → Save assistant message
│◀──── status: speaking ────────────┤  → TTS
│◀──── response.audio.delta ×N ─────┤
│◀──── response.audio.done ─────────┤
│◀──── response.done ───────────────┤
│◀──── status: listening ───────────┤
│                                   │
│  [VAD: user speaks again]         │  (same conversation)
├─ (next turn cycle) ──────────────▶│
│                                   │
│  [Conversation ends by:]          │
│   • 120s silence → timeout        │
│   • Agent says goodbye            │
│   • WebSocket disconnect          │
│◀──── session.end ─────────────────┤
```

### Conversation End

A conversation can end in one of three ways:

1. **Inactivity timeout** — no speech for `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` seconds (default 120)
2. **Agent-initiated** — the Pallas agent recognizes the conversation is over and signals it
3. **Client disconnect** — the ESP32 sends `session.close` or the WebSocket drops

All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
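
The three end conditions can be captured in a small piece of bookkeeping. The sketch below is one hypothetical way a gateway could track them, not the Daedalus implementation: the `ConversationTracker` class and its method names are invented for illustration, while the reason strings follow the `session.end` values listed in the WebSocket protocol (`timeout`, `agent`, `client`).

```python
import time
from typing import Optional


class ConversationTracker:
    """Tracks why and when a voice conversation should end (illustrative)."""

    def __init__(self, timeout_s: float = 120.0, now: Optional[float] = None) -> None:
        self.timeout_s = timeout_s
        self.last_speech = time.monotonic() if now is None else now
        self.end_reason: Optional[str] = None

    def on_speech(self, now: Optional[float] = None) -> None:
        """Reset the inactivity clock whenever the user speaks."""
        self.last_speech = time.monotonic() if now is None else now

    def on_agent_goodbye(self) -> None:
        """Record that the agent signalled the end of the conversation."""
        self._end("agent")

    def on_client_disconnect(self) -> None:
        """Record a `session.close` message or a dropped WebSocket."""
        self._end("client")

    def check_timeout(self, now: Optional[float] = None) -> Optional[str]:
        """Return the end reason, marking a timeout if silence lasted too long."""
        current = time.monotonic() if now is None else now
        if self.end_reason is None and current - self.last_speech >= self.timeout_s:
            self._end("timeout")
        return self.end_reason

    def _end(self, reason: str) -> None:
        # The first reason wins; later signals do not overwrite it.
        if self.end_reason is None:
            self.end_reason = reason
```

The optional `now` arguments exist only so the timing logic can be exercised deterministically; in normal use the monotonic clock is read on each call.
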

---

## WebSocket Protocol

### Connection

```
WS /api/v1/voice/realtime?device_id={uuid}
```

### Client → Gateway Messages

| Type | Description | Fields |
|------|-------------|--------|
| `session.start` | Start a new conversation | `client_id` (optional), `audio_config` (optional) |
| `input_audio_buffer.append` | Audio chunk | `audio` (base64 PCM) |
| `input_audio_buffer.commit` | End of speech, trigger pipeline | — |
| `session.close` | End the session | — |

### Gateway → Client Messages

| Type | Description | Fields |
|------|-------------|--------|
| `session.created` | Session started | `session_id`, `conversation_id` |
| `status` | Processing state | `state` (`listening` / `transcribing` / `thinking` / `speaking`) |
| `transcript.done` | User's speech as text | `text` |
| `response.text.done` | Agent's text response | `text` |
| `response.audio.delta` | Audio chunk (streamed) | `delta` (base64 PCM) |
| `response.audio.done` | Audio streaming complete | — |
| `response.done` | Turn complete | — |
| `session.end` | Conversation ended | `reason` (`timeout` / `agent` / `client`) |
| `error` | Error occurred | `message`, `code` |

### Audio Format

All audio is **PCM signed 16-bit little-endian** (`pcm_s16le`), base64-encoded in JSON:

- **Sample rate:** 16,000 Hz
- **Channels:** 1 (mono)
- **Bit depth:** 16-bit

---

## API Endpoints

All endpoints are served by the Daedalus FastAPI backend.
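
Because the firmware is configured with a single base URL, the routes in this section, including the WebSocket one, can be derived from it. The helper below is a sketch under that assumption: `voice_urls` is a hypothetical name, and mapping `http` to `ws` (and `https` to `wss`) is common WebSocket practice rather than something this guide specifies.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit

API_PREFIX = "/api/v1/voice"


def voice_urls(base_url: str, device_id: str) -> dict:
    """Derive REST and WebSocket URLs from the Daedalus base URL."""
    parts = urlsplit(base_url)
    root = parts.path.rstrip("/") + API_PREFIX
    # WebSocket URLs use ws:// for http:// and wss:// for https://.
    ws_scheme = "wss" if parts.scheme == "https" else "ws"

    def rest(suffix: str) -> str:
        return urlunsplit((parts.scheme, parts.netloc, root + suffix, "", ""))

    return {
        "register": rest("/devices/register"),
        "devices": rest("/devices"),
        "health": rest("/health"),
        "realtime": urlunsplit(
            (ws_scheme, parts.netloc, root + "/realtime",
             urlencode({"device_id": device_id}), "")
        ),
    }
```
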

### Voice Device Management

| Method | Route | Purpose |
|--------|-------|---------|
| `POST` | `/api/v1/voice/devices/register` | ESP32 self-registration (idempotent) |
| `GET` | `/api/v1/voice/devices` | List all registered devices |
| `GET` | `/api/v1/voice/devices/{id}` | Get device details |
| `PUT` | `/api/v1/voice/devices/{id}` | Update device (name, description, workspace, agent) |
| `DELETE` | `/api/v1/voice/devices/{id}` | Remove a device |

### Voice Sessions

| Method | Route | Purpose |
|--------|-------|---------|
| `WS` | `/api/v1/voice/realtime?device_id={id}` | WebSocket for audio conversations |
| `GET` | `/api/v1/voice/sessions` | List active voice sessions |

### Voice Configuration & Health

| Method | Route | Purpose |
|--------|-------|---------|
| `GET` | `/api/v1/voice/config` | Current voice configuration |
| `PUT` | `/api/v1/voice/config` | Update voice settings |
| `GET` | `/api/v1/voice/health` | STT + TTS reachability check |

---

## Observability

### Prometheus Metrics

Voice metrics are exposed at Daedalus's `GET /metrics` endpoint with the `daedalus_voice_` prefix:

| Metric | Type | Description |
|--------|------|-------------|
| `daedalus_voice_sessions_active` | gauge | Active WebSocket sessions |
| `daedalus_voice_pipeline_duration_seconds` | histogram | Full pipeline latency |
| `daedalus_voice_stt_duration_seconds` | histogram | STT latency |
| `daedalus_voice_tts_duration_seconds` | histogram | TTS latency |
| `daedalus_voice_agent_duration_seconds` | histogram | Agent (MCP) latency |
| `daedalus_voice_transcriptions_total` | counter | Total STT calls |
| `daedalus_voice_conversations_total` | counter | Conversations by end reason |
| `daedalus_voice_devices_online` | gauge | Currently connected devices |

### Logs

Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.

Key log events: `voice_device_registered`, `voice_session_started`, `voice_pipeline_complete`, `voice_conversation_ended`, `voice_pipeline_error`.

---

## Troubleshooting

### Device not appearing in Daedalus settings

- Check the ESP32 can reach the Daedalus URL
- Verify the registration endpoint responds: `curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'`

### Device registered but voice doesn't work

- Assign the device to a workspace and agent in **Settings → Voice Devices**
- Unassigned devices get: `{"type": "error", "code": "no_workspace"}`

### STT returns empty transcripts

- Check Speaches STT is running: `curl http://perseus.helu.ca:22070/v1/models`
- Check the voice health endpoint: `curl http://puck.incus:22181/api/v1/voice/health`

### High latency

- Check `daedalus_voice_pipeline_duration_seconds` in Prometheus/Grafana
- Break the latency down by stage: the STT, agent, and TTS histograms identify the bottleneck
- Agent latency depends on the Pallas agent and its downstream MCP servers

### Audio sounds wrong (chipmunk / slow)

- Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
- Verify `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` matches the ESP32's playback rate

---

## Architecture Overview

```
┌──────────────────┐        WebSocket        ┌──────────────────────────────────────┐
│ ESP32-S3 Board   │◀══════════════════════▶ │  Daedalus Backend (FastAPI)          │
│ (stentor-ear)    │   JSON + base64 audio   │  puck.incus                          │
│ UUID in NVS      │                         │                                      │
│ Wake Word + VAD  │                         │  voice/ module:                      │
└──────────────────┘                         │    STT → MCP (Pallas) → TTS          │
                                             │    Conversations → PostgreSQL        │
                                             └──────┬──────────┬────────┬───────────┘
                                                    │          │        │
                                                MCP │     HTTP │   HTTP │
                                                    ▼          ▼        ▼
                                              ┌──────────┐ ┌────────┐ ┌────────┐
                                              │  Pallas  │ │Speaches│ │Speaches│
                                              │  Agents  │ │  STT   │ │  TTS   │
                                              └──────────┘ └────────┘ └────────┘
```

For full architectural details including Mermaid diagrams, see [architecture.md](architecture.md).

For the complete integration specification, see [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md).