Stentor — Usage Guide
"Stentor, whose voice was as powerful as fifty voices of other men." — Homer, Iliad, Book V
Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the Daedalus web application backend — there is no separate Stentor server process.
Table of Contents
- How It Works
- Components
- ESP32 Device Setup
- Daedalus Configuration
- Device Registration Flow
- Voice Conversation Flow
- WebSocket Protocol
- API Endpoints
- Observability
- Troubleshooting
- Architecture Overview
How It Works
- An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
- A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
- When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
- On-device VAD (Voice Activity Detection) detects speech and silence
- Audio streams to Daedalus, which runs: Speaches STT → Pallas Agent (MCP) → Speaches TTS
- The response audio streams back to the ESP32 speaker
- Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations
Components
| Component | Location | Purpose |
|---|---|---|
| stentor-ear | stentor/stentor-ear/ | ESP32-S3 firmware: microphone, speaker, wake word, VAD |
| Daedalus voice module | daedalus/backend/daedalus/voice/ | Voice pipeline: STT, MCP agent calls, TTS |
| Daedalus voice API | daedalus/backend/daedalus/api/v1/voice.py | WebSocket + REST endpoints for devices and sessions |
| Daedalus web UI | daedalus/frontend/ | Device management, conversation history |
The Python gateway code that was previously in stentor/stentor-gateway/ has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.
ESP32 Device Setup
The ESP32-S3-AUDIO-Board firmware needs one configuration value:
| Setting | Description | Example |
|---|---|---|
| Daedalus URL | Base URL of the Daedalus instance | http://puck.incus:22181 |
On first boot, the device:
- Generates a UUID v4 and stores it in NVS (non-volatile storage)
- Registers with Daedalus via POST /api/v1/voice/devices/register
- The UUID persists across reboots, so the device keeps its identity
Daedalus Configuration
Voice settings are configured via environment variables with the DAEDALUS_ prefix:
| Variable | Description | Default |
|---|---|---|
| DAEDALUS_VOICE_STT_URL | Speaches STT endpoint | http://perseus.helu.ca:22070 |
| DAEDALUS_VOICE_TTS_URL | Speaches TTS endpoint | http://perseus.helu.ca:22070 |
| DAEDALUS_VOICE_STT_MODEL | Whisper model for STT | Systran/faster-whisper-small |
| DAEDALUS_VOICE_TTS_MODEL | TTS model name | kokoro |
| DAEDALUS_VOICE_TTS_VOICE | TTS voice ID | af_heart |
| DAEDALUS_VOICE_AUDIO_SAMPLE_RATE | Sample rate in Hz | 16000 |
| DAEDALUS_VOICE_AUDIO_CHANNELS | Audio channels | 1 |
| DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH | Bits per sample | 16 |
| DAEDALUS_VOICE_CONVERSATION_TIMEOUT | Seconds of silence before auto-end | 120 |
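As an illustration of how these variables map onto settings, here is a minimal standard-library sketch. It is not the Daedalus implementation: `VoiceSettings` and `load_voice_settings` are hypothetical names, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container; field names mirror the DAEDALUS_VOICE_* variables."""
    stt_url: str = "http://perseus.helu.ca:22070"
    tts_url: str = "http://perseus.helu.ca:22070"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    conversation_timeout: int = 120


def load_voice_settings(env=None):
    """Build VoiceSettings from a mapping, falling back to the documented defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(VoiceSettings):
        raw = env.get("DAEDALUS_VOICE_" + field.name.upper())
        if raw is not None:
            # Cast to the type of the documented default (str or int).
            overrides[field.name] = type(field.default)(raw)
    return VoiceSettings(**overrides)
```

Calling `load_voice_settings()` with no argument reads the process environment; passing a plain dict is convenient for tests.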
Device Registration Flow
ESP32 Daedalus
│ │
│ [First boot — UUID generated] │
├─ POST /api/v1/voice/devices/register ▶│
│ {device_id, firmware_version} │
│◀─ {status: "registered"} ─────────┤
│ │
│ [Device appears in Daedalus │
│ Settings → Voice Devices] │
│ │
│ [User assigns workspace + agent │
│ via web UI] │
│ │
│ [Subsequent boots — same UUID] │
├─ POST /api/v1/voice/devices/register ▶│
│ {device_id, firmware_version} │
│◀─ {status: "already_registered"} ──┤
│ │
After registration, the device appears in the Daedalus settings page. The user assigns it:
- A name (e.g. "Kitchen Speaker")
- A description (optional)
- A workspace (which workspace voice conversations go to)
- An agent (which Pallas agent to target)
Until assigned, the device cannot process voice.
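For testing without hardware, the registration call can be imitated from a workstation. The sketch below uses only the route and the two body fields shown in the flow above; the function names are hypothetical, and the exact shape of the reply beyond the `status` field is an assumption.

```python
import json
import urllib.request
import uuid


def build_register_request(base_url, device_id, firmware_version):
    """Build (but do not send) the registration POST described above."""
    body = json.dumps(
        {"device_id": device_id, "firmware_version": firmware_version}
    ).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/v1/voice/devices/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def register(base_url, device_id, firmware_version, timeout=5.0):
    """Send the request and return the decoded JSON reply."""
    req = build_register_request(base_url, device_id, firmware_version)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


# A test client can mint a UUID v4, as the firmware is described as doing.
device_id = str(uuid.uuid4())
request = build_register_request("http://puck.incus:22181", device_id, "1.0")
```

Because registration is described as idempotent, calling `register(...)` twice with the same `device_id` should be safe; per the flow above, the second reply would carry `already_registered`.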
Voice Conversation Flow
A voice conversation is a multi-turn session driven by on-device VAD:
ESP32 Daedalus
│ │
├─ [Wake word detected] │
├─ WS /api/v1/voice/realtime ──────▶│
├─ session.start ───────────────────▶│ → Create Conversation in DB
│◀──── session.created ─────────────┤ {session_id, conversation_id}
│◀──── status: listening ────────────┤
│ │
│ [VAD: user speaks] │
├─ input_audio_buffer.append ×N ────▶│
│ [VAD: silence detected] │
├─ input_audio_buffer.commit ───────▶│
│◀──── status: transcribing ────────┤ → STT
│◀──── transcript.done ─────────────┤ → Save user message
│◀──── status: thinking ────────────┤ → MCP call to Pallas
│◀──── response.text.done ──────────┤ → Save assistant message
│◀──── status: speaking ────────────┤ → TTS
│◀──── response.audio.delta ×N ─────┤
│◀──── response.audio.done ─────────┤
│◀──── response.done ───────────────┤
│◀──── status: listening ────────────┤
│ │
│ [VAD: user speaks again] │ (same conversation)
├─ (next turn cycle) ──────────────▶│
│ │
│ [Conversation ends by:] │
│ • 120s silence → timeout │
│ • Agent says goodbye │
│ • WebSocket disconnect │
│◀──── session.end ─────────────────┤
Conversation End
A conversation ends in one of three ways:
- Inactivity timeout: no speech for VOICE_CONVERSATION_TIMEOUT seconds (default 120)
- Agent-initiated: the Pallas agent recognizes the conversation is over and signals it
- Client disconnect: the ESP32 sends session.close or the WebSocket drops
All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
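The three end conditions can be modelled with a small helper. This is a sketch only: `ConversationTimer` is a hypothetical name, and the precedence used when several conditions hold at once (client, then agent, then timeout) is an assumption, not documented behaviour. The returned strings match the `reason` values of `session.end`.

```python
import time


class ConversationTimer:
    """Hypothetical helper tracking the inactivity timeout for one session."""

    def __init__(self, timeout_s=120.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_speech = clock()

    def mark_speech(self):
        """Call whenever the client commits audio for a new turn."""
        self._last_speech = self._clock()

    def end_reason(self, agent_done=False, client_closed=False):
        """Return 'client', 'agent', 'timeout', or None if the session continues."""
        if client_closed:
            return "client"
        if agent_done:
            return "agent"
        if self._clock() - self._last_speech >= self.timeout_s:
            return "timeout"
        return None
```

Injecting the clock keeps the helper testable without waiting for real time to pass.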
WebSocket Protocol
Connection
WS /api/v1/voice/realtime?device_id={uuid}
Client → Gateway Messages
| Type | Description | Fields |
|---|---|---|
| session.start | Start a new conversation | client_id (optional), audio_config (optional) |
| input_audio_buffer.append | Audio chunk | audio (base64 PCM) |
| input_audio_buffer.commit | End of speech, trigger pipeline | (none) |
| session.close | End the session | (none) |
Gateway → Client Messages
| Type | Description | Fields |
|---|---|---|
| session.created | Session started | session_id, conversation_id |
| status | Processing state | state (listening / transcribing / thinking / speaking) |
| transcript.done | User's speech as text | text |
| response.text.done | Agent's text response | text |
| response.audio.delta | Audio chunk (streamed) | delta (base64 PCM) |
| response.audio.done | Audio streaming complete | (none) |
| response.done | Turn complete | (none) |
| session.end | Conversation ended | reason (timeout / agent / client) |
| error | Error occurred | message, code |
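A test client can frame these messages with plain JSON. The sketch below covers the client messages and a minimal dispatcher for gateway messages. The helper names are hypothetical, the structure of `audio_config` is not specified here so it is passed through untouched, and raising on `error` is an illustrative choice rather than required behaviour.

```python
import base64
import json


def session_start(client_id=None, audio_config=None):
    """Serialize session.start, including only the optional fields provided."""
    msg = {"type": "session.start"}
    if client_id is not None:
        msg["client_id"] = client_id
    if audio_config is not None:
        msg["audio_config"] = audio_config
    return json.dumps(msg)


def audio_append(pcm_bytes):
    """Serialize one audio chunk as base64 PCM."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_bytes).decode("ascii"),
    })


def audio_commit():
    """Serialize the end-of-speech marker that triggers the pipeline."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def handle_gateway_message(raw, playback):
    """Process one gateway message and return its type.

    Decoded audio from response.audio.delta is appended to `playback`
    (a bytearray); an error message raises RuntimeError.
    """
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "response.audio.delta":
        playback.extend(base64.b64decode(msg["delta"]))
    elif kind == "error":
        raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
    return kind
```

These strings would be sent as text frames over any WebSocket client library; the transport itself is left out to keep the sketch dependency-free.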
Audio Format
All audio is PCM signed 16-bit little-endian (pcm_s16le), base64-encoded in JSON:
- Sample rate: 16,000 Hz
- Channels: 1 (mono)
- Bit depth: 16-bit
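To make the format concrete, the following sketch generates a test tone in pcm_s16le and converts between byte counts and durations. The function names and the 20 ms chunk size in the note below are illustrative assumptions; only the sample rate, channel count, and bit depth come from the list above.

```python
import math
import struct

SAMPLE_RATE = 16000     # Hz
CHANNELS = 1            # mono
BYTES_PER_SAMPLE = 2    # 16-bit


def sine_pcm(freq_hz, duration_s, amplitude=0.5):
    """Generate a mono sine tone as little-endian signed 16-bit PCM bytes."""
    n = round(SAMPLE_RATE * duration_s)
    samples = [
        int(amplitude * 32767 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    ]
    return struct.pack(f"<{n}h", *samples)


def duration_seconds(pcm_bytes):
    """Playback duration of a PCM buffer in this format."""
    return len(pcm_bytes) / (SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE)
```

At this format one second of audio is 32,000 bytes, so a 20 ms chunk is 640 bytes before base64 encoding.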
API Endpoints
All endpoints are served by the Daedalus FastAPI backend.
Voice Device Management
| Method | Route | Purpose |
|---|---|---|
| POST | /api/v1/voice/devices/register | ESP32 self-registration (idempotent) |
| GET | /api/v1/voice/devices | List all registered devices |
| GET | /api/v1/voice/devices/{id} | Get device details |
| PUT | /api/v1/voice/devices/{id} | Update device (name, description, workspace, agent) |
| DELETE | /api/v1/voice/devices/{id} | Remove a device |
Voice Sessions
| Method | Route | Purpose |
|---|---|---|
| WS | /api/v1/voice/realtime?device_id={id} | WebSocket for audio conversations |
| GET | /api/v1/voice/sessions | List active voice sessions |
Voice Configuration & Health
| Method | Route | Purpose |
|---|---|---|
| GET | /api/v1/voice/config | Current voice configuration |
| PUT | /api/v1/voice/config | Update voice settings |
| GET | /api/v1/voice/health | STT + TTS reachability check |
Observability
Prometheus Metrics
Voice metrics are exposed at Daedalus's GET /metrics endpoint with the daedalus_voice_ prefix:
| Metric | Type | Description |
|---|---|---|
| daedalus_voice_sessions_active | gauge | Active WebSocket sessions |
| daedalus_voice_pipeline_duration_seconds | histogram | Full pipeline latency |
| daedalus_voice_stt_duration_seconds | histogram | STT latency |
| daedalus_voice_tts_duration_seconds | histogram | TTS latency |
| daedalus_voice_agent_duration_seconds | histogram | Agent (MCP) latency |
| daedalus_voice_transcriptions_total | counter | Total STT calls |
| daedalus_voice_conversations_total | counter | Conversations by end reason |
| daedalus_voice_devices_online | gauge | Currently connected devices |
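Prometheus histograms expose `_sum` and `_count` series alongside their buckets, so a mean latency per stage can be computed straight from the `/metrics` text without a Prometheus server. The sketch below is a hypothetical helper and assumes sample lines of the form `name value` or `name{labels} value`, with no trailing timestamps and no spaces inside label values.

```python
def mean_latencies(metrics_text, prefix="daedalus_voice_"):
    """Return {stage: mean seconds} from <name>_sum and <name>_count samples."""
    sum_suffix = "_duration_seconds_sum"
    count_suffix = "_duration_seconds_count"
    sums, counts = {}, {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics from other modules.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        if name.endswith(sum_suffix):
            stage = name[len(prefix):-len(sum_suffix)]
            sums[stage] = sums.get(stage, 0.0) + float(value)
        elif name.endswith(count_suffix):
            stage = name[len(prefix):-len(count_suffix)]
            counts[stage] = counts.get(stage, 0.0) + float(value)
    # Stages with a zero count are omitted to avoid dividing by zero.
    return {s: sums[s] / counts[s] for s in sums if counts.get(s)}
```

The result is keyed by stage name (`pipeline`, `stt`, `tts`, `agent`). For percentiles rather than means, a `histogram_quantile` query in Prometheus over the bucket series is the usual route.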
Logs
Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.
Key log events: voice_device_registered, voice_session_started, voice_pipeline_complete, voice_conversation_ended, voice_pipeline_error.
Troubleshooting
Device not appearing in Daedalus settings
- Check the ESP32 can reach the Daedalus URL
- Verify the registration endpoint responds:
curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'
Device registered but voice doesn't work
- Assign the device to a workspace and agent in Settings → Voice Devices
- Unassigned devices get:
{"type": "error", "code": "no_workspace"}
STT returns empty transcripts
- Check Speaches STT is running: curl http://perseus.helu.ca:22070/v1/models
- Check the voice health endpoint: curl http://puck.incus:22181/api/v1/voice/health
High latency
- Check daedalus_voice_pipeline_duration_seconds in Prometheus/Grafana
- Break latency down by stage: the STT, Agent, and TTS histograms identify the bottleneck
- Agent latency depends on the Pallas agent and its downstream MCP servers
Audio sounds wrong (chipmunk / slow)
- Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
- Verify DAEDALUS_VOICE_AUDIO_SAMPLE_RATE matches the ESP32's playback rate
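A rate mismatch changes playback speed by the ratio of the two rates: 24 kHz audio played at 16 kHz runs at two-thirds speed (slow and low), while 16 kHz audio played at 24 kHz runs at 1.5x (chipmunk). The sketch below shows that ratio and a naive linear-interpolation resampler for mono pcm_s16le. It illustrates the idea only and is not the resampler Daedalus uses; it applies no anti-aliasing filter, so a production pipeline would normally use a dedicated DSP library.

```python
import struct


def playback_speed(data_rate, playback_rate):
    """Speed factor heard when audio sampled at data_rate is played at playback_rate."""
    return playback_rate / data_rate


def resample_linear(pcm_bytes, src_rate, dst_rate):
    """Resample mono pcm_s16le by linear interpolation between neighbouring samples."""
    n_src = len(pcm_bytes) // 2
    if n_src == 0:
        return b""
    src = struct.unpack(f"<{n_src}h", pcm_bytes[: n_src * 2])
    n_dst = max(1, n_src * dst_rate // src_rate)
    out = []
    for i in range(n_dst):
        pos = i * src_rate / dst_rate          # fractional index into src
        j = int(pos)
        frac = pos - j
        a = src[j]
        b = src[min(j + 1, n_src - 1)]         # clamp at the final sample
        out.append(int(round(a + (b - a) * frac)))
    return struct.pack(f"<{n_dst}h", *out)
```

Resampling one second of 24 kHz audio (48,000 bytes) with `resample_linear(data, 24000, 16000)` yields 32,000 bytes, which is one second at the 16 kHz device rate.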
Architecture Overview
┌──────────────────┐ WebSocket ┌──────────────────────────────────────┐
│ ESP32-S3 Board │◀══════════════════════▶ │ Daedalus Backend (FastAPI) │
│ (stentor-ear) │ JSON + base64 audio │ puck.incus │
│ UUID in NVS │ │ │
│ Wake Word + VAD │ │ voice/ module: │
└──────────────────┘ │ STT → MCP (Pallas) → TTS │
│ Conversations → PostgreSQL │
└──────┬──────────┬────────┬───────────┘
│ │ │
MCP │ HTTP │ HTTP │
▼ ▼ ▼
┌──────────┐ ┌────────┐ ┌────────┐
│ Pallas │ │Speaches│ │Speaches│
│ Agents │ │ STT │ │ TTS │
└──────────┘ └────────┘ └────────┘
For full architectural details including Mermaid diagrams, see architecture.md.
For the complete integration specification, see daedalus/docs/stentor_integration.md.