stentor/docs/stentor.md
Robert Helewka 912593b796 feat: scaffold stentor-gateway with FastAPI voice pipeline
Initialize the stentor-gateway project with WebSocket-based voice
pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
2026-03-21 19:11:48 +00:00

# Stentor — Usage Guide
> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V
Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the **Daedalus** web application backend — there is no separate Stentor server process.
---
## Table of Contents
- [How It Works](#how-it-works)
- [Components](#components)
- [ESP32 Device Setup](#esp32-device-setup)
- [Daedalus Configuration](#daedalus-configuration)
- [Device Registration Flow](#device-registration-flow)
- [Voice Conversation Flow](#voice-conversation-flow)
- [WebSocket Protocol](#websocket-protocol)
- [API Endpoints](#api-endpoints)
- [Observability](#observability)
- [Troubleshooting](#troubleshooting)
- [Architecture Overview](#architecture-overview)
---
## How It Works
1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
4. On-device VAD (Voice Activity Detection) detects speech and silence
5. Audio streams to Daedalus, which runs: **Speaches STT** → **Pallas Agent (MCP)** → **Speaches TTS**
6. The response audio streams back to the ESP32 speaker
7. Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations
---
## Components
| Component | Location | Purpose |
|-----------|----------|---------|
| **stentor-ear** | `stentor/stentor-ear/` | ESP32-S3 firmware — microphone, speaker, wake word, VAD |
| **Daedalus voice module** | `daedalus/backend/daedalus/voice/` | Voice pipeline — STT, MCP agent calls, TTS |
| **Daedalus voice API** | `daedalus/backend/daedalus/api/v1/voice.py` | WebSocket + REST endpoints for devices and sessions |
| **Daedalus web UI** | `daedalus/frontend/` | Device management, conversation history |
The Python gateway code that was previously in `stentor/stentor-gateway/` has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.
---
## ESP32 Device Setup
The ESP32-S3-AUDIO-Board firmware needs one configuration value:
| Setting | Description | Example |
|---------|-------------|---------|
| Daedalus URL | Base URL of the Daedalus instance | `http://puck.incus:22181` |
On first boot, the device:
1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
2. Registers with Daedalus via `POST /api/v1/voice/devices/register`
3. The UUID persists across reboots — the device keeps its identity
---
## Daedalus Configuration
Voice settings are configured via environment variables with the `DAEDALUS_` prefix:
| Variable | Description | Default |
|----------|-------------|---------|
| `DAEDALUS_VOICE_STT_URL` | Speaches STT endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_TTS_URL` | Speaches TTS endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `DAEDALUS_VOICE_TTS_MODEL` | TTS model name | `kokoro` |
| `DAEDALUS_VOICE_TTS_VOICE` | TTS voice ID | `af_heart` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` | Sample rate in Hz | `16000` |
| `DAEDALUS_VOICE_AUDIO_CHANNELS` | Audio channels | `1` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` | Seconds of silence before auto-end | `120` |
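
The table maps naturally onto environment lookups with typed defaults. The following is a minimal stdlib-only sketch, not the actual Daedalus settings code: the `VoiceSettings` class and `load_voice_settings` helper are hypothetical names, and only the variable names and defaults are taken from the table above.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSettings:
    """Hypothetical container mirroring the DAEDALUS_VOICE_* table."""
    stt_url: str = "http://perseus.helu.ca:22070"
    tts_url: str = "http://perseus.helu.ca:22070"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    conversation_timeout: int = 120


def load_voice_settings(env=None) -> VoiceSettings:
    """Read DAEDALUS_VOICE_* variables, falling back to the documented defaults."""
    env = os.environ if env is None else env
    defaults = VoiceSettings()
    values = {}
    for name, default in vars(defaults).items():
        raw = env.get(f"DAEDALUS_VOICE_{name.upper()}")
        # Cast to the type of the default so "16000" becomes 16000.
        values[name] = default if raw is None else type(default)(raw)
    return VoiceSettings(**values)
```

Passing a plain dict instead of `os.environ` makes the override behaviour easy to check in isolation.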
---
## Device Registration Flow
```
ESP32                                    Daedalus
│                                           │
│  [First boot: UUID generated]             │
├─ POST /api/v1/voice/devices/register ────▶│
│  {device_id, firmware_version}            │
│◀─ {status: "registered"} ─────────────────┤
│                                           │
│  [Device appears in Daedalus              │
│   Settings → Voice Devices]               │
│                                           │
│  [User assigns workspace + agent          │
│   via web UI]                             │
│                                           │
│  [Subsequent boots: same UUID]            │
├─ POST /api/v1/voice/devices/register ────▶│
│  {device_id, firmware_version}            │
│◀─ {status: "already_registered"} ─────────┤
│                                           │
```
After registration, the device appears in the Daedalus settings page. The user assigns it:
- A **name** (e.g. "Kitchen Speaker")
- A **description** (optional)
- A **workspace** (which workspace voice conversations go to)
- An **agent** (which Pallas agent to target)
Until assigned, the device cannot process voice.
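
Registration is idempotent because the device sends the same persisted UUID on every boot. The sketch below imitates that behaviour in Python for testing against a Daedalus instance. It is not the firmware: a plain file stands in for NVS, and the helper names `load_or_create_device_id` and `build_registration_request` are hypothetical. The request is built but not sent.

```python
import json
import uuid
from pathlib import Path
from urllib.request import Request


def load_or_create_device_id(store: Path) -> str:
    """Return the persisted UUID, generating a v4 UUID on the first call."""
    if store.exists():
        return store.read_text().strip()
    device_id = str(uuid.uuid4())
    store.write_text(device_id)
    return device_id


def build_registration_request(base_url: str, device_id: str, firmware_version: str) -> Request:
    """Build (but do not send) the POST shown in the diagram above."""
    body = json.dumps({"device_id": device_id, "firmware_version": firmware_version})
    return Request(
        f"{base_url}/api/v1/voice/devices/register",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```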
---
## Voice Conversation Flow
A voice conversation is a multi-turn session driven by on-device VAD:
```
ESP32                              Daedalus
│                                     │
├─ [Wake word detected]               │
├─ WS /api/v1/voice/realtime ────────▶│
├─ session.start ────────────────────▶│  → Create Conversation in DB
│◀──── session.created ───────────────┤  {session_id, conversation_id}
│◀──── status: listening ─────────────┤
│                                     │
│  [VAD: user speaks]                 │
├─ input_audio_buffer.append ×N ─────▶│
│  [VAD: silence detected]            │
├─ input_audio_buffer.commit ────────▶│
│◀──── status: transcribing ──────────┤  → STT
│◀──── transcript.done ───────────────┤  → Save user message
│◀──── status: thinking ──────────────┤  → MCP call to Pallas
│◀──── response.text.done ────────────┤  → Save assistant message
│◀──── status: speaking ──────────────┤  → TTS
│◀──── response.audio.delta ×N ───────┤
│◀──── response.audio.done ───────────┤
│◀──── response.done ─────────────────┤
│◀──── status: listening ─────────────┤
│                                     │
│  [VAD: user speaks again]           │  (same conversation)
├─ (next turn cycle) ────────────────▶│
│                                     │
│  [Conversation ends by:]            │
│   • 120s silence → timeout          │
│   • Agent says goodbye              │
│   • WebSocket disconnect            │
│◀──── session.end ───────────────────┤
```
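
One turn of the diagram above can be sketched as a generator that yields the gateway messages in order. This is an illustration of the message sequence, not the Daedalus implementation: `run_turn` is a hypothetical name, and `stt`, `agent`, and `tts` are injected stand-ins for the Speaches and Pallas clients. The field names (`state`, `text`, `delta`) follow the tables in the WebSocket Protocol section.

```python
import base64
from typing import Callable, Iterable, Iterator


def run_turn(
    pcm: bytes,
    stt: Callable[[bytes], str],
    agent: Callable[[str], str],
    tts: Callable[[str], Iterable[bytes]],
) -> Iterator[dict]:
    """Yield the gateway -> client messages for one committed audio buffer."""
    yield {"type": "status", "state": "transcribing"}
    transcript = stt(pcm)
    yield {"type": "transcript.done", "text": transcript}

    yield {"type": "status", "state": "thinking"}
    reply = agent(transcript)
    yield {"type": "response.text.done", "text": reply}

    yield {"type": "status", "state": "speaking"}
    for chunk in tts(reply):
        # Audio travels as base64-encoded PCM inside JSON.
        encoded = base64.b64encode(chunk).decode("ascii")
        yield {"type": "response.audio.delta", "delta": encoded}
    yield {"type": "response.audio.done"}
    yield {"type": "response.done"}
    yield {"type": "status", "state": "listening"}
```

Injecting the three stages as callables keeps the ordering logic testable without a running STT, agent, or TTS service.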
### Conversation End
A conversation ends in one of three ways:
1. **Inactivity timeout**: no speech for `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` seconds (default 120)
2. **Agent-initiated**: the Pallas agent recognizes that the conversation is over and signals it
3. **Client disconnect**: the ESP32 sends `session.close` or the WebSocket drops
All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
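
The three end conditions map onto the `reason` values of `session.end` (`timeout`, `agent`, `client`). A minimal sketch of that decision follows, assuming a monotonic clock and an illustrative precedence (client first, then agent, then timeout) that this guide does not specify. `ConversationTimer` is a hypothetical name.

```python
import time
from typing import Optional


class ConversationTimer:
    """Tracks the last speech time and reports why a conversation should end."""

    def __init__(self, timeout_s: float = 120.0, clock=time.monotonic):
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_speech = clock()

    def mark_speech(self) -> None:
        """Call whenever the client commits an audio buffer."""
        self._last_speech = self._clock()

    def end_reason(self, agent_done: bool = False, client_closed: bool = False) -> Optional[str]:
        """Return 'client', 'agent', 'timeout', or None if the session continues."""
        if client_closed:
            return "client"
        if agent_done:
            return "agent"
        if self._clock() - self._last_speech >= self.timeout_s:
            return "timeout"
        return None
```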
---
## WebSocket Protocol
### Connection
```
WS /api/v1/voice/realtime?device_id={uuid}
```
### Client → Gateway Messages
| Type | Description | Fields |
|------|-------------|--------|
| `session.start` | Start a new conversation | `client_id` (optional), `audio_config` (optional) |
| `input_audio_buffer.append` | Audio chunk | `audio` (base64 PCM) |
| `input_audio_buffer.commit` | End of speech, trigger pipeline | — |
| `session.close` | End the session | — |
### Gateway → Client Messages
| Type | Description | Fields |
|------|-------------|--------|
| `session.created` | Session started | `session_id`, `conversation_id` |
| `status` | Processing state | `state` (`listening` / `transcribing` / `thinking` / `speaking`) |
| `transcript.done` | User's speech as text | `text` |
| `response.text.done` | Agent's text response | `text` |
| `response.audio.delta` | Audio chunk (streamed) | `delta` (base64 PCM) |
| `response.audio.done` | Audio streaming complete | — |
| `response.done` | Turn complete | — |
| `session.end` | Conversation ended | `reason` (`timeout` / `agent` / `client`) |
| `error` | Error occurred | `message`, `code` |
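
A client can check each incoming frame against the table above before acting on it. The sketch below is one hedged way to do that. `GATEWAY_FIELDS` and `parse_gateway_message` are hypothetical names, and treating the listed fields as required is an assumption. The `type` key follows the error example in the Troubleshooting section.

```python
import json

# Fields per message type, copied from the table above.
# "message" is treated as optional for errors because the troubleshooting
# example shows an error frame carrying only "code".
GATEWAY_FIELDS = {
    "session.created": ("session_id", "conversation_id"),
    "status": ("state",),
    "transcript.done": ("text",),
    "response.text.done": ("text",),
    "response.audio.delta": ("delta",),
    "response.audio.done": (),
    "response.done": (),
    "session.end": ("reason",),
    "error": ("code",),
}


def parse_gateway_message(raw: str) -> dict:
    """Decode one JSON frame and check it against the documented shape."""
    msg = json.loads(raw)
    msg_type = msg.get("type")
    if msg_type not in GATEWAY_FIELDS:
        raise ValueError(f"unknown message type: {msg_type!r}")
    missing = [f for f in GATEWAY_FIELDS[msg_type] if f not in msg]
    if missing:
        raise ValueError(f"{msg_type} is missing fields: {missing}")
    return msg
```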
### Audio Format
All audio is **PCM signed 16-bit little-endian** (`pcm_s16le`), base64-encoded in JSON:
- **Sample rate:** 16,000 Hz
- **Channels:** 1 (mono)
- **Bit depth:** 16-bit
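
At 16,000 Hz, mono, 2 bytes per sample, one second of audio is 16,000 × 1 × 2 = 32,000 bytes before base64 encoding. The sketch below generates a test tone in this format and splits it into `input_audio_buffer.append` frames. The 20 ms chunk size and the helper names `sine_pcm` and `append_messages` are assumptions for illustration. This guide does not prescribe a chunk size.

```python
import base64
import json
import math
import struct

SAMPLE_RATE = 16_000   # Hz
CHANNELS = 1           # mono
SAMPLE_WIDTH = 2       # bytes per sample (16-bit)


def sine_pcm(freq_hz: float, duration_s: float) -> bytes:
    """Generate a half-amplitude test tone as signed 16-bit little-endian PCM."""
    n = int(SAMPLE_RATE * duration_s)
    samples = (
        int(32767 * 0.5 * math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE))
        for i in range(n)
    )
    return b"".join(struct.pack("<h", s) for s in samples)


def append_messages(pcm: bytes, chunk_ms: int = 20):
    """Split PCM into chunk_ms-sized append frames (chunk size is an assumption)."""
    chunk_bytes = SAMPLE_RATE * CHANNELS * SAMPLE_WIDTH * chunk_ms // 1000
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Base64 expands the payload by about one third, which is worth keeping in mind when sizing buffers on the ESP32.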
---
## API Endpoints
All endpoints are served by the Daedalus FastAPI backend.
### Voice Device Management
| Method | Route | Purpose |
|--------|-------|---------|
| `POST` | `/api/v1/voice/devices/register` | ESP32 self-registration (idempotent) |
| `GET` | `/api/v1/voice/devices` | List all registered devices |
| `GET` | `/api/v1/voice/devices/{id}` | Get device details |
| `PUT` | `/api/v1/voice/devices/{id}` | Update device (name, description, workspace, agent) |
| `DELETE` | `/api/v1/voice/devices/{id}` | Remove a device |
### Voice Sessions
| Method | Route | Purpose |
|--------|-------|---------|
| `WS` | `/api/v1/voice/realtime?device_id={id}` | WebSocket for audio conversations |
| `GET` | `/api/v1/voice/sessions` | List active voice sessions |
### Voice Configuration & Health
| Method | Route | Purpose |
|--------|-------|---------|
| `GET` | `/api/v1/voice/config` | Current voice configuration |
| `PUT` | `/api/v1/voice/config` | Update voice settings |
| `GET` | `/api/v1/voice/health` | STT + TTS reachability check |
---
## Observability
### Prometheus Metrics
Voice metrics are exposed at Daedalus's `GET /metrics` endpoint with the `daedalus_voice_` prefix:
| Metric | Type | Description |
|--------|------|-------------|
| `daedalus_voice_sessions_active` | gauge | Active WebSocket sessions |
| `daedalus_voice_pipeline_duration_seconds` | histogram | Full pipeline latency |
| `daedalus_voice_stt_duration_seconds` | histogram | STT latency |
| `daedalus_voice_tts_duration_seconds` | histogram | TTS latency |
| `daedalus_voice_agent_duration_seconds` | histogram | Agent (MCP) latency |
| `daedalus_voice_transcriptions_total` | counter | Total STT calls |
| `daedalus_voice_conversations_total` | counter | Conversations by end reason |
| `daedalus_voice_devices_online` | gauge | Currently connected devices |
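
Prometheus histograms expose `_sum` and `_count` series alongside their buckets, so a mean latency per stage can be derived from a scrape of `/metrics`. The helper below is a rough sketch for ad-hoc inspection: `mean_latencies` is a hypothetical name, and it assumes sample lines carry no trailing timestamp. For dashboards, a PromQL query in Grafana is the more usual route.

```python
def mean_latencies(metrics_text: str, prefix: str = "daedalus_voice_") -> dict:
    """Return mean seconds per duration histogram, computed as _sum / _count."""
    sums, counts = {}, {}
    for line in metrics_text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and metrics outside the prefix.
        if not line or line.startswith("#") or not line.startswith(prefix):
            continue
        name_part, _, value = line.rpartition(" ")
        name = name_part.split("{", 1)[0]  # drop any label set
        if name.endswith("_duration_seconds_sum"):
            key = name[: -len("_sum")]
            sums[key] = sums.get(key, 0.0) + float(value)
        elif name.endswith("_duration_seconds_count"):
            key = name[: -len("_count")]
            counts[key] = counts.get(key, 0.0) + float(value)
    # Only report histograms that have at least one observation.
    return {k: sums[k] / counts[k] for k in sums if counts.get(k)}
```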
### Logs
Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.
Key log events: `voice_device_registered`, `voice_session_started`, `voice_pipeline_complete`, `voice_conversation_ended`, `voice_pipeline_error`.
---
## Troubleshooting
### Device not appearing in Daedalus settings
- Check the ESP32 can reach the Daedalus URL
- Verify the registration endpoint responds: `curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'`
### Device registered but voice doesn't work
- Assign the device to a workspace and agent in **Settings → Voice Devices**
- Unassigned devices get: `{"type": "error", "code": "no_workspace"}`
### STT returns empty transcripts
- Check Speaches STT is running: `curl http://perseus.helu.ca:22070/v1/models`
- Check the voice health endpoint: `curl http://puck.incus:22181/api/v1/voice/health`
### High latency
- Check `daedalus_voice_pipeline_duration_seconds` in Prometheus/Grafana
- Break latency down by stage: the STT, agent, and TTS histograms identify the bottleneck
- Agent latency depends on the Pallas agent and its downstream MCP servers
### Audio sounds wrong (chipmunk / slow)
- Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
- Verify `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` matches the ESP32's playback rate
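
A rate mismatch explains both symptoms: 24 kHz samples played at 16 kHz run at two-thirds speed (slow and low-pitched), while 16 kHz samples played at 24 kHz run fast and high-pitched. The sketch below shows the 24 kHz to 16 kHz conversion with plain linear interpolation. It is an illustration only, not the resampler Daedalus uses: `resample_s16le` is a hypothetical name, and a production resampler would normally low-pass filter before downsampling to limit aliasing.

```python
import struct


def resample_s16le(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit little-endian PCM by linear interpolation."""
    n_src = len(pcm) // 2
    if n_src == 0:
        return b""
    src = struct.unpack(f"<{n_src}h", pcm[: n_src * 2])
    n_dst = n_src * dst_rate // src_rate
    out = []
    for i in range(n_dst):
        pos = i * src_rate / dst_rate        # fractional index into the source
        left = int(pos)
        right = min(left + 1, n_src - 1)     # clamp at the final sample
        frac = pos - left
        out.append(round(src[left] * (1 - frac) + src[right] * frac))
    return struct.pack(f"<{n_dst}h", *out)
```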
---
## Architecture Overview
```
┌──────────────────┐        WebSocket         ┌──────────────────────────────────────┐
│  ESP32-S3 Board  │◀════════════════════════▶│  Daedalus Backend (FastAPI)          │
│  (stentor-ear)   │   JSON + base64 audio    │  puck.incus                          │
│  UUID in NVS     │                          │                                      │
│  Wake Word + VAD │                          │  voice/ module:                      │
└──────────────────┘                          │    STT → MCP (Pallas) → TTS          │
                                              │    Conversations → PostgreSQL        │
                                              └──────┬──────────┬────────┬───────────┘
                                                     │          │        │
                                                 MCP │     HTTP │   HTTP │
                                                     ▼          ▼        ▼
                                              ┌──────────┐ ┌────────┐ ┌────────┐
                                              │  Pallas  │ │Speaches│ │Speaches│
                                              │  Agents  │ │  STT   │ │  TTS   │
                                              └──────────┘ └────────┘ └────────┘
```
For full architectural details including Mermaid diagrams, see [architecture.md](architecture.md).
For the complete integration specification, see [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md).