# Stentor — Usage Guide

> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V

Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the **Daedalus** web application backend — there is no separate Stentor server process.

---

## Table of Contents

- [How It Works](#how-it-works)
- [Components](#components)
- [ESP32 Device Setup](#esp32-device-setup)
- [Daedalus Configuration](#daedalus-configuration)
- [Device Registration Flow](#device-registration-flow)
- [Voice Conversation Flow](#voice-conversation-flow)
- [WebSocket Protocol](#websocket-protocol)
- [API Endpoints](#api-endpoints)
- [Observability](#observability)
- [Troubleshooting](#troubleshooting)
- [Architecture Overview](#architecture-overview)

---

## How It Works

1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
4. On-device VAD (Voice Activity Detection) detects speech and silence
5. Audio streams to Daedalus, which runs: **Speaches STT** → **Pallas Agent (MCP)** → **Speaches TTS**
6. The response audio streams back to the ESP32 speaker
7. Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations

---

## Components

| Component | Location | Purpose |
|-----------|----------|---------|
| **stentor-ear** | `stentor/stentor-ear/` | ESP32-S3 firmware — microphone, speaker, wake word, VAD |
| **Daedalus voice module** | `daedalus/backend/daedalus/voice/` | Voice pipeline — STT, MCP agent calls, TTS |
| **Daedalus voice API** | `daedalus/backend/daedalus/api/v1/voice.py` | WebSocket + REST endpoints for devices and sessions |
| **Daedalus web UI** | `daedalus/frontend/` | Device management, conversation history |

The Python gateway code that was previously in `stentor/stentor-gateway/` has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.

---

## ESP32 Device Setup

The ESP32-S3-AUDIO-Board firmware needs one configuration value:

| Setting | Description | Example |
|---------|-------------|---------|
| Daedalus URL | Base URL of the Daedalus instance | `http://puck.incus:22181` |

On first boot, the device:

1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
2. Registers with Daedalus via `POST /api/v1/voice/devices/register`

The UUID persists across reboots, so the device keeps its identity.
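
The first-boot behaviour above can be sketched in Python for illustration. The real firmware runs on the ESP32 and stores the UUID in NVS; here a plain file stands in for NVS, and the helper names `load_or_create_device_id` and `registration_payload` are hypothetical.

```python
import json
import uuid
from pathlib import Path


def load_or_create_device_id(store: Path) -> str:
    """Return the persisted device UUID, generating one on first boot.

    `store` is a stand-in for NVS (assumption: any persistent key/value
    slot would do on real hardware).
    """
    if store.exists():
        return store.read_text().strip()
    device_id = str(uuid.uuid4())  # random (version 4) UUID
    store.write_text(device_id)
    return device_id


def registration_payload(device_id: str, firmware_version: str) -> str:
    """JSON body for POST /api/v1/voice/devices/register."""
    return json.dumps(
        {"device_id": device_id, "firmware_version": firmware_version}
    )
```

Calling `load_or_create_device_id` twice with the same path returns the same value, which is the property that lets Daedalus recognise a device after a reboot.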

---

## Daedalus Configuration

Voice settings are configured via environment variables with the `DAEDALUS_` prefix:

| Variable | Description | Default |
|----------|-------------|---------|
| `DAEDALUS_VOICE_STT_URL` | Speaches STT endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_TTS_URL` | Speaches TTS endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `DAEDALUS_VOICE_TTS_MODEL` | TTS model name | `kokoro` |
| `DAEDALUS_VOICE_TTS_VOICE` | TTS voice ID | `af_heart` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` | Sample rate in Hz | `16000` |
| `DAEDALUS_VOICE_AUDIO_CHANNELS` | Audio channels | `1` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` | Seconds of silence before auto-end | `120` |
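
As a rough illustration of how these variables resolve, the sketch below reads them from a mapping and falls back to the defaults in the table. It uses only the standard library and is not the Daedalus settings code; `VoiceSettings` and `load_voice_settings` are hypothetical names.

```python
import os
from dataclasses import dataclass

PREFIX = "DAEDALUS_VOICE_"

# Defaults copied from the table above.
DEFAULTS = {
    "STT_URL": "http://perseus.helu.ca:22070",
    "TTS_URL": "http://perseus.helu.ca:22070",
    "STT_MODEL": "Systran/faster-whisper-small",
    "TTS_MODEL": "kokoro",
    "TTS_VOICE": "af_heart",
    "AUDIO_SAMPLE_RATE": "16000",
    "AUDIO_CHANNELS": "1",
    "AUDIO_SAMPLE_WIDTH": "16",
    "CONVERSATION_TIMEOUT": "120",
}


@dataclass(frozen=True)
class VoiceSettings:
    stt_url: str
    tts_url: str
    stt_model: str
    tts_model: str
    tts_voice: str
    audio_sample_rate: int
    audio_channels: int
    audio_sample_width: int
    conversation_timeout: int


def load_voice_settings(env=None) -> VoiceSettings:
    """Build settings from `env` (defaults to os.environ)."""
    env = os.environ if env is None else env

    def get(name: str) -> str:
        return env.get(PREFIX + name, DEFAULTS[name])

    return VoiceSettings(
        stt_url=get("STT_URL"),
        tts_url=get("TTS_URL"),
        stt_model=get("STT_MODEL"),
        tts_model=get("TTS_MODEL"),
        tts_voice=get("TTS_VOICE"),
        audio_sample_rate=int(get("AUDIO_SAMPLE_RATE")),
        audio_channels=int(get("AUDIO_CHANNELS")),
        audio_sample_width=int(get("AUDIO_SAMPLE_WIDTH")),
        conversation_timeout=int(get("CONVERSATION_TIMEOUT")),
    )
```

For example, `load_voice_settings({"DAEDALUS_VOICE_CONVERSATION_TIMEOUT": "30"})` shortens the inactivity timeout and leaves every other value at its default.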

---

## Device Registration Flow

```
ESP32                                    Daedalus
  │                                          │
  │  [First boot — UUID generated]           │
  ├─ POST /api/v1/voice/devices/register ───▶│
  │  {device_id, firmware_version}           │
  │◀─ {status: "registered"} ────────────────┤
  │                                          │
  │  [Device appears in Daedalus             │
  │   Settings → Voice Devices]              │
  │                                          │
  │  [User assigns workspace + agent         │
  │   via web UI]                            │
  │                                          │
  │  [Subsequent boots — same UUID]          │
  ├─ POST /api/v1/voice/devices/register ───▶│
  │  {device_id, firmware_version}           │
  │◀─ {status: "already_registered"} ────────┤
  │                                          │
```

After registration, the device appears in the Daedalus settings page. The user assigns it:

- A **name** (e.g. "Kitchen Speaker")
- A **description** (optional)
- A **workspace** (which workspace voice conversations go to)
- An **agent** (which Pallas agent to target)

Until assigned, the device cannot process voice.
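
The idempotent registration and the assignment gate can be sketched with an in-memory registry. This is an illustration only: Daedalus keeps devices in its own storage, and the `Device` class and `register` function below are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Device:
    device_id: str
    firmware_version: str
    workspace: Optional[str] = None
    agent: Optional[str] = None

    @property
    def can_process_voice(self) -> bool:
        # A device needs both a workspace and an agent before voice works.
        return self.workspace is not None and self.agent is not None


def register(devices: Dict[str, Device], device_id: str,
             firmware_version: str) -> dict:
    """Register a device; repeating the call for a known UUID is harmless."""
    if device_id in devices:
        return {"status": "already_registered"}
    devices[device_id] = Device(device_id, firmware_version)
    return {"status": "registered"}
```

Because a repeated call returns `already_registered` without creating a second record, the firmware can safely call the endpoint on every boot.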

---

## Voice Conversation Flow

A voice conversation is a multi-turn session driven by on-device VAD:

```
ESP32                                Daedalus
  │                                      │
  ├─ [Wake word detected]                │
  ├─ WS /api/v1/voice/realtime ─────────▶│
  ├─ session.start ─────────────────────▶│   → Create Conversation in DB
  │◀──── session.created ────────────────┤   {session_id, conversation_id}
  │◀──── status: listening ──────────────┤
  │                                      │
  │  [VAD: user speaks]                  │
  ├─ input_audio_buffer.append ×N ──────▶│
  │  [VAD: silence detected]             │
  ├─ input_audio_buffer.commit ─────────▶│
  │◀──── status: transcribing ───────────┤   → STT
  │◀──── transcript.done ────────────────┤   → Save user message
  │◀──── status: thinking ───────────────┤   → MCP call to Pallas
  │◀──── response.text.done ─────────────┤   → Save assistant message
  │◀──── status: speaking ───────────────┤   → TTS
  │◀──── response.audio.delta ×N ────────┤
  │◀──── response.audio.done ────────────┤
  │◀──── response.done ──────────────────┤
  │◀──── status: listening ──────────────┤
  │                                      │
  │  [VAD: user speaks again]            │   (same conversation)
  ├─ (next turn cycle) ─────────────────▶│
  │                                      │
  │  [Conversation ends by:]             │
  │   • 120s silence → timeout           │
  │   • Agent says goodbye               │
  │   • WebSocket disconnect             │
  │◀──── session.end ────────────────────┤
```

### Conversation End

A conversation can end in one of three ways:

1. **Inactivity timeout** — no speech for `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` seconds (default 120)
2. **Agent-initiated** — the Pallas agent recognizes the conversation is over and signals it
3. **Client disconnect** — the ESP32 sends `session.close` or the WebSocket drops

All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
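
The three end conditions map onto the `reason` values carried by `session.end` (`timeout`, `agent`, `client`). A minimal sketch of that decision follows; the function name, its parameters, and the order in which the conditions are checked are assumptions rather than the Daedalus implementation.

```python
from typing import Optional


def end_reason(idle_seconds: float, timeout_seconds: float,
               agent_signalled_end: bool,
               client_connected: bool) -> Optional[str]:
    """Return the session.end reason, or None if the conversation continues.

    Check order (an assumption): a dropped client wins, then an explicit
    agent signal, then the inactivity timeout.
    """
    if not client_connected:
        return "client"
    if agent_signalled_end:
        return "agent"
    if idle_seconds >= timeout_seconds:
        return "timeout"
    return None
```

With the default timeout, `end_reason(121.0, 120, False, True)` yields `"timeout"`, while an active session with recent speech yields `None`.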

---

## WebSocket Protocol

### Connection

```
WS /api/v1/voice/realtime?device_id={uuid}
```

### Client → Gateway Messages

| Type | Description | Fields |
|------|-------------|--------|
| `session.start` | Start a new conversation | `client_id` (optional), `audio_config` (optional) |
| `input_audio_buffer.append` | Audio chunk | `audio` (base64 PCM) |
| `input_audio_buffer.commit` | End of speech, trigger pipeline | — |
| `session.close` | End the session | — |
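
The client messages in the table could be built as follows. This is a sketch: the `type` key matches the error example in the Troubleshooting section, but the exact JSON layout beyond the listed fields is an assumption, and the helper names are hypothetical.

```python
import base64
import json
from typing import Optional


def session_start(client_id: Optional[str] = None) -> str:
    """Build a session.start message; client_id is optional."""
    msg = {"type": "session.start"}
    if client_id is not None:
        msg["client_id"] = client_id
    return json.dumps(msg)


def audio_append(pcm: bytes) -> str:
    """Wrap one raw PCM chunk as base64 inside input_audio_buffer.append."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def audio_commit() -> str:
    """Signal end of speech so the gateway runs the pipeline."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def session_close() -> str:
    """Ask the gateway to end the session."""
    return json.dumps({"type": "session.close"})
```

A turn is then a `session_start()` (once per conversation), any number of `audio_append(chunk)` messages while VAD reports speech, and one `audio_commit()` when silence is detected.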

### Gateway → Client Messages

| Type | Description | Fields |
|------|-------------|--------|
| `session.created` | Session started | `session_id`, `conversation_id` |
| `status` | Processing state | `state` (`listening` / `transcribing` / `thinking` / `speaking`) |
| `transcript.done` | User's speech as text | `text` |
| `response.text.done` | Agent's text response | `text` |
| `response.audio.delta` | Audio chunk (streamed) | `delta` (base64 PCM) |
| `response.audio.done` | Audio streaming complete | — |
| `response.done` | Turn complete | — |
| `session.end` | Conversation ended | `reason` (`timeout` / `agent` / `client`) |
| `error` | Error occurred | `message`, `code` |
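
On the receiving side, a client can fold one turn's messages into a transcript, a reply, and a block of audio. The sketch below assumes each message is a JSON object with a `type` key and the fields listed in the table; `collect_turn` is a hypothetical helper, not part of the firmware.

```python
import base64
import json
from typing import Iterable


def collect_turn(messages: Iterable[str]) -> dict:
    """Consume gateway messages until response.done and summarise the turn."""
    states = []
    transcript = None
    reply = None
    audio = bytearray()
    for raw in messages:
        msg = json.loads(raw)
        kind = msg["type"]
        if kind == "status":
            states.append(msg["state"])
        elif kind == "transcript.done":
            transcript = msg["text"]
        elif kind == "response.text.done":
            reply = msg["text"]
        elif kind == "response.audio.delta":
            audio += base64.b64decode(msg["delta"])  # append PCM chunk
        elif kind == "error":
            raise RuntimeError(f"{msg.get('code')}: {msg.get('message')}")
        elif kind == "response.done":
            break  # turn complete; later messages belong to the next turn
    return {"states": states, "transcript": transcript,
            "reply": reply, "audio": bytes(audio)}
```

A real device would play each `response.audio.delta` chunk as it arrives instead of buffering; collecting the chunks here just keeps the example easy to check.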

### Audio Format

All audio is **PCM signed 16-bit little-endian** (`pcm_s16le`), base64-encoded in JSON:

- **Sample rate:** 16,000 Hz
- **Channels:** 1 (mono)
- **Bit depth:** 16-bit
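
These parameters fix the data rate: 16,000 samples/s × 2 bytes × 1 channel = 32,000 bytes per second of raw PCM, and base64 adds roughly a third on top. The sketch below does that arithmetic; the 20 ms chunk size in the usage note is only an illustrative choice, not something the protocol specifies.

```python
SAMPLE_RATE_HZ = 16_000
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # 16-bit samples


def pcm_bytes(duration_ms: int) -> int:
    """Raw PCM size in bytes for a chunk of the given duration."""
    return SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * duration_ms // 1000


def base64_chars(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of `raw_bytes` bytes."""
    return 4 * ((raw_bytes + 2) // 3)
```

For a hypothetical 20 ms chunk, `pcm_bytes(20)` is 640 bytes and `base64_chars(640)` is 856 characters, before the surrounding JSON.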

---

## API Endpoints

All endpoints are served by the Daedalus FastAPI backend.

### Voice Device Management

| Method | Route | Purpose |
|--------|-------|---------|
| `POST` | `/api/v1/voice/devices/register` | ESP32 self-registration (idempotent) |
| `GET` | `/api/v1/voice/devices` | List all registered devices |
| `GET` | `/api/v1/voice/devices/{id}` | Get device details |
| `PUT` | `/api/v1/voice/devices/{id}` | Update device (name, description, workspace, agent) |
| `DELETE` | `/api/v1/voice/devices/{id}` | Remove a device |
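
For scripting against these routes, a registration request can be assembled with the standard library. The sketch only builds the request and does not send it; `build_register_request` is a hypothetical helper, and the body fields follow the registration flow described earlier.

```python
import json
from urllib.request import Request


def build_register_request(base_url: str, device_id: str,
                           firmware_version: str) -> Request:
    """Prepare POST /api/v1/voice/devices/register without sending it."""
    body = json.dumps(
        {"device_id": device_id, "firmware_version": firmware_version}
    ).encode("utf-8")
    return Request(
        url=f"{base_url.rstrip('/')}/api/v1/voice/devices/register",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` would perform the call against a reachable Daedalus instance.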

### Voice Sessions

| Method | Route | Purpose |
|--------|-------|---------|
| `WS` | `/api/v1/voice/realtime?device_id={id}` | WebSocket for audio conversations |
| `GET` | `/api/v1/voice/sessions` | List active voice sessions |

### Voice Configuration & Health

| Method | Route | Purpose |
|--------|-------|---------|
| `GET` | `/api/v1/voice/config` | Current voice configuration |
| `PUT` | `/api/v1/voice/config` | Update voice settings |
| `GET` | `/api/v1/voice/health` | STT + TTS reachability check |

---

## Observability

### Prometheus Metrics

Voice metrics are exposed at Daedalus's `GET /metrics` endpoint with the `daedalus_voice_` prefix:

| Metric | Type | Description |
|--------|------|-------------|
| `daedalus_voice_sessions_active` | gauge | Active WebSocket sessions |
| `daedalus_voice_pipeline_duration_seconds` | histogram | Full pipeline latency |
| `daedalus_voice_stt_duration_seconds` | histogram | STT latency |
| `daedalus_voice_tts_duration_seconds` | histogram | TTS latency |
| `daedalus_voice_agent_duration_seconds` | histogram | Agent (MCP) latency |
| `daedalus_voice_transcriptions_total` | counter | Total STT calls |
| `daedalus_voice_conversations_total` | counter | Conversations by end reason |
| `daedalus_voice_devices_online` | gauge | Currently connected devices |

### Logs

Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.

Key log events: `voice_device_registered`, `voice_session_started`, `voice_pipeline_complete`, `voice_conversation_ended`, `voice_pipeline_error`.

---

## Troubleshooting

### Device not appearing in Daedalus settings

- Check that the ESP32 can reach the Daedalus URL
- Verify the registration endpoint responds: `curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'`

### Device registered but voice doesn't work

- Assign the device to a workspace and agent in **Settings → Voice Devices**
- Unassigned devices get: `{"type": "error", "code": "no_workspace"}`

### STT returns empty transcripts

- Check Speaches STT is running: `curl http://perseus.helu.ca:22070/v1/models`
- Check the voice health endpoint: `curl http://puck.incus:22181/api/v1/voice/health`

### High latency

- Check `daedalus_voice_pipeline_duration_seconds` in Prometheus/Grafana
- Compare the per-stage histograms (STT, agent, TTS) to identify the bottleneck
- Agent latency depends on the Pallas agent and its downstream MCP servers
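
When the dashboards are not at hand, the same per-stage comparison can be done ad hoc by timing each stage of a test call. This is a generic standard-library sketch, not the instrumentation Daedalus uses; `timed` and `slowest_stage` are hypothetical names.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(stage: str, durations: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[stage] = time.perf_counter() - start


def slowest_stage(durations: Dict[str, float]) -> str:
    """Name of the stage that took longest."""
    return max(durations, key=durations.get)
```

Wrapping the STT, agent, and TTS calls in `with timed("stt", d): ...` and so on, then calling `slowest_stage(d)`, points at the stage worth investigating first.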

### Audio sounds wrong (chipmunk / slow)

- Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
- Verify `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` matches the ESP32's playback rate
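
Pitch problems come from a rate mismatch: 24 kHz samples played back at 16 kHz last 1.5 times as long and sound slow and deep, while 16 kHz samples played at 24 kHz sound fast and high. The sketch below shows the 3:2 sample-count change with a simple linear-interpolation resampler. It is not the resampler used in the Daedalus pipeline, and it omits the low-pass filtering a production downsampler would apply.

```python
from array import array


def resample_linear(samples: array, src_rate: int, dst_rate: int) -> array:
    """Resample signed 16-bit mono PCM by linear interpolation."""
    if src_rate == dst_rate or len(samples) == 0:
        return array("h", samples)
    out_len = len(samples) * dst_rate // src_rate
    out = array("h")
    for i in range(out_len):
        pos = i * src_rate / dst_rate       # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        value = samples[left] * (1 - frac) + samples[right] * frac
        out.append(round(value))
    return out
```

One second of TTS output (24,000 samples) becomes 16,000 samples, which is one second again at the 16 kHz playback rate.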

---

## Architecture Overview

```
┌──────────────────┐        WebSocket        ┌──────────────────────────────────────┐
│  ESP32-S3 Board  │◀══════════════════════▶ │      Daedalus Backend (FastAPI)      │
│  (stentor-ear)   │   JSON + base64 audio   │              puck.incus              │
│  UUID in NVS     │                         │                                      │
│  Wake Word + VAD │                         │  voice/ module:                      │
└──────────────────┘                         │    STT → MCP (Pallas) → TTS          │
                                             │    Conversations → PostgreSQL        │
                                             └──────┬──────────┬────────┬───────────┘
                                                    │          │        │
                                                MCP │     HTTP │   HTTP │
                                                    ▼          ▼        ▼
                                              ┌──────────┐ ┌────────┐ ┌────────┐
                                              │  Pallas  │ │Speaches│ │Speaches│
                                              │  Agents  │ │  STT   │ │  TTS   │
                                              └──────────┘ └────────┘ └────────┘
```

For full architectural details including Mermaid diagrams, see [architecture.md](architecture.md).

For the complete integration specification, see [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md).