feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice
pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project

commit 912593b796 (parent 9ba9435883)
2026-03-21 19:11:48 +00:00
27 changed files with 3985 additions and 138 deletions

docs/api-reference.md (new file, 315 lines)

# Stentor Gateway API Reference
> Version 0.1.0
## Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Dashboard (Bootstrap UI) |
| `/api/v1/realtime` | WebSocket | Real-time audio conversation |
| `/api/v1/info` | GET | Gateway information and configuration |
| `/api/live/` | GET | Liveness probe (Kubernetes) |
| `/api/ready/` | GET | Readiness probe (Kubernetes) |
| `/api/metrics` | GET | Prometheus-compatible metrics |
| `/api/docs` | GET | Interactive API documentation (Swagger UI) |
| `/api/openapi.json` | GET | OpenAPI schema |
---
## WebSocket: `/api/v1/realtime`
Real-time voice conversation endpoint. Protocol inspired by the OpenAI Realtime API.
### Connection
```
ws://{host}:{port}/api/v1/realtime
```
### Client Events
#### `session.start`
Initiates a new conversation session. Must be sent first.
```json
{
"type": "session.start",
"client_id": "esp32-kitchen",
"audio_config": {
"sample_rate": 16000,
"channels": 1,
"sample_width": 16,
"encoding": "pcm_s16le"
}
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"session.start"` |
| `client_id` | string | | Client identifier for tracking |
| `audio_config` | object | | Audio format configuration |
#### `input_audio_buffer.append`
Sends a chunk of audio data. Stream chunks continuously while the user is speaking.
```json
{
"type": "input_audio_buffer.append",
"audio": "<base64-encoded PCM audio>"
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"input_audio_buffer.append"` |
| `audio` | string | ✔ | Base64-encoded PCM S16LE audio |
#### `input_audio_buffer.commit`
Signals end of speech. Triggers the STT → Agent → TTS pipeline.
```json
{
"type": "input_audio_buffer.commit"
}
```
#### `session.close`
Requests session termination. The WebSocket connection will close.
```json
{
"type": "session.close"
}
```
### Server Events
#### `session.created`
Acknowledges session creation.
```json
{
"type": "session.created",
"session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
#### `status`
Processing status update. Use for LED feedback on ESP32.
```json
{
"type": "status",
"state": "listening"
}
```
| State | Description | Suggested LED |
|-------|-------------|--------------|
| `listening` | Ready for audio input | Green |
| `transcribing` | Running STT | Yellow |
| `thinking` | Waiting for agent response | Yellow |
| `speaking` | Playing TTS audio | Cyan |
#### `transcript.done`
Transcript of what the user said.
```json
{
"type": "transcript.done",
"text": "What is the weather like today?"
}
```
#### `response.text.done`
AI agent's response text.
```json
{
"type": "response.text.done",
"text": "I don't have weather tools yet, but I can help with other things."
}
```
#### `response.audio.delta`
Streamed audio response chunk.
```json
{
"type": "response.audio.delta",
"delta": "<base64-encoded PCM audio>"
}
```
#### `response.audio.done`
Audio response streaming complete.
```json
{
"type": "response.audio.done"
}
```
#### `response.done`
Full response cycle complete. Gateway returns to listening state.
```json
{
"type": "response.done"
}
```
#### `error`
Error event.
```json
{
"type": "error",
"message": "STT service unavailable",
"code": "stt_error"
}
```
| Code | Description |
|------|-------------|
| `invalid_json` | Client sent malformed JSON |
| `validation_error` | Message failed schema validation |
| `no_session` | Action requires an active session |
| `empty_buffer` | Audio buffer was empty on commit |
| `empty_transcript` | STT returned no speech |
| `empty_response` | Agent returned empty response |
| `pipeline_error` | Internal pipeline failure |
| `unknown_event` | Unrecognized event type |
| `internal_error` | Unexpected server error |
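
Which of these codes a client should retry is not specified above; a rough classification for illustration — the grouping is an assumption, not gateway behavior:

```python
# Rough client-side classification of the error codes above.
CLIENT_BUGS = {"invalid_json", "validation_error", "no_session", "unknown_event"}
EMPTY_INPUT = {"empty_buffer", "empty_transcript", "empty_response"}
TRANSIENT = {"pipeline_error", "internal_error"}


def should_retry(error_event: dict) -> bool:
    """Retry only transient server-side failures; everything else needs a
    protocol fix or fresh audio input on the client."""
    return error_event.get("code") in TRANSIENT
```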
---
## REST: `/api/v1/info`
Returns gateway information and current configuration.
**Response:**
```json
{
"name": "stentor-gateway",
"version": "0.1.0",
"endpoints": {
"realtime": "/api/v1/realtime",
"live": "/api/live/",
"ready": "/api/ready/",
"metrics": "/api/metrics"
},
"config": {
"stt_url": "http://perseus.incus:8000",
"tts_url": "http://pan.incus:8000",
"agent_url": "http://localhost:8001",
"stt_model": "Systran/faster-whisper-small",
"tts_model": "kokoro",
"tts_voice": "af_heart",
"audio_sample_rate": 16000,
"audio_channels": 1,
"audio_sample_width": 16
}
}
```
---
## REST: `/api/live/`
Kubernetes liveness probe.
**Response (200):**
```json
{
"status": "ok"
}
```
---
## REST: `/api/ready/`
Kubernetes readiness probe. Checks connectivity to STT, TTS, and Agent services.
**Response (200 — all services reachable):**
```json
{
"status": "ready",
"checks": {
"stt": true,
"tts": true,
"agent": true
}
}
```
**Response (503 — one or more services unavailable):**
```json
{
"status": "not_ready",
"checks": {
"stt": true,
"tts": false,
"agent": true
}
}
```
---
## REST: `/api/metrics`
Prometheus-compatible metrics in text exposition format.
**Metrics exported:**
| Metric | Type | Description |
|--------|------|-------------|
| `stentor_sessions_active` | Gauge | Current active WebSocket sessions |
| `stentor_transcriptions_total` | Counter | Total STT transcription calls |
| `stentor_tts_requests_total` | Counter | Total TTS synthesis calls |
| `stentor_agent_requests_total` | Counter | Total agent message calls |
| `stentor_pipeline_duration_seconds` | Histogram | Full pipeline latency |
| `stentor_stt_duration_seconds` | Histogram | STT transcription latency |
| `stentor_tts_duration_seconds` | Histogram | TTS synthesis latency |
| `stentor_agent_duration_seconds` | Histogram | Agent response latency |
---
## Configuration
All configuration is supplied via environment variables (12-factor):
| Variable | Description | Default |
|----------|-------------|---------|
| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
| `STENTOR_PORT` | Gateway bind port | `8600` |
| `STENTOR_STT_URL` | Speaches STT endpoint | `http://perseus.incus:8000` |
| `STENTOR_TTS_URL` | Speaches TTS endpoint | `http://pan.incus:8000` |
| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |
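
The gateway itself uses Pydantic Settings; the same `STENTOR_`-prefixed lookup can be sketched with only the stdlib, with defaults mirroring the table above (a subset of variables is shown):

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class GatewaySettings:
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    audio_sample_rate: int = 16000

    @classmethod
    def from_env(cls) -> "GatewaySettings":
        """Read STENTOR_* variables, falling back to the documented defaults."""
        def get(name, default, cast=str):
            raw = os.environ.get(f"STENTOR_{name}")
            return default if raw is None else cast(raw)
        return cls(
            host=get("HOST", cls.host),
            port=get("PORT", cls.port, int),
            stt_url=get("STT_URL", cls.stt_url),
            tts_url=get("TTS_URL", cls.tts_url),
            agent_url=get("AGENT_URL", cls.agent_url),
            audio_sample_rate=get("AUDIO_SAMPLE_RATE", cls.audio_sample_rate, int),
        )
```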

docs/architecture.md (new file, 222 lines)

# Stentor Architecture
> Version 0.2.0 — Daedalus-integrated architecture
## Overview
Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:
1. **stentor-ear** — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
2. **Daedalus voice module** — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline
The Python gateway that was previously a standalone service (`stentor-gateway/`) has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.
## System Architecture
```mermaid
graph TB
subgraph "ESP32-S3-AUDIO-Board"
MIC["Mic Array<br/>ES7210 ADC"]
WW["Wake Word<br/>ESP-SR"]
VAD["VAD<br/>On-Device"]
SPK["Speaker<br/>ES8311 DAC"]
LED["LED Ring<br/>WS2812B"]
NVS["NVS<br/>Device UUID"]
MIC --> WW
MIC --> VAD
end
subgraph "Daedalus Backend (puck.incus)"
REG["Device Registry<br/>/api/v1/voice/devices"]
WS["WebSocket Server<br/>/api/v1/voice/realtime"]
PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
DB["PostgreSQL<br/>Conversations & Messages"]
MCP["MCP Connection Manager<br/>Pallas Agents"]
end
subgraph "Speech Services"
STT["Speaches STT<br/>Whisper (perseus)"]
TTS["Speaches TTS<br/>Kokoro (perseus)"]
end
subgraph "AI Agents"
PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
end
NVS -->|"POST /register"| REG
WW -->|"WebSocket<br/>JSON + base64 audio"| WS
VAD -->|"commit on silence"| WS
WS --> PIPE
PIPE -->|"POST /v1/audio/transcriptions"| STT
PIPE -->|"MCP call_tool"| MCP
MCP -->|"MCP Streamable HTTP"| PALLAS
PIPE -->|"POST /v1/audio/speech"| TTS
STT -->|"transcript text"| PIPE
PALLAS -->|"response text"| MCP
MCP --> PIPE
TTS -->|"PCM audio stream"| PIPE
PIPE --> DB
PIPE --> WS
WS -->|"audio + status"| SPK
WS -->|"status events"| LED
```
## Device Registration & Lifecycle
```mermaid
sequenceDiagram
participant ESP as ESP32
participant DAE as Daedalus
participant UI as Daedalus Web UI
Note over ESP: First boot — generate UUID, store in NVS
ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
DAE->>ESP: {status: "registered"}
Note over UI: User sees new device in Settings → Voice Devices
UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}
Note over ESP: Wake word detected
ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
ESP->>DAE: session.start
DAE->>ESP: session.created {session_id, conversation_id}
```
## Voice Pipeline
```mermaid
sequenceDiagram
participant ESP as ESP32
participant GW as Daedalus Voice
participant STT as Speaches STT
participant MCP as MCP Manager
participant PALLAS as Pallas Agent
participant TTS as Speaches TTS
participant DB as PostgreSQL
Note over ESP: VAD: speech detected
loop Audio streaming
ESP->>GW: input_audio_buffer.append (base64 PCM)
end
Note over ESP: VAD: silence detected
ESP->>GW: input_audio_buffer.commit
GW->>ESP: status: transcribing
GW->>STT: POST /v1/audio/transcriptions (WAV)
STT->>GW: {"text": "..."}
GW->>ESP: transcript.done
GW->>DB: Save Message(role="user", content=transcript)
GW->>ESP: status: thinking
GW->>MCP: call_tool(workspace, agent, tool, {message})
MCP->>PALLAS: MCP Streamable HTTP
PALLAS->>MCP: CallToolResult
MCP->>GW: response text
GW->>ESP: response.text.done
GW->>DB: Save Message(role="assistant", content=response)
GW->>ESP: status: speaking
GW->>TTS: POST /v1/audio/speech
TTS->>GW: PCM audio stream
loop Audio chunks
GW->>ESP: response.audio.delta (base64 PCM)
end
GW->>ESP: response.audio.done
GW->>ESP: response.done
GW->>ESP: status: listening
Note over GW: Timeout timer starts (120s default)
alt Timeout — no speech
GW->>ESP: session.end {reason: "timeout"}
else Agent ends conversation
GW->>ESP: session.end {reason: "agent"}
else User speaks again
Note over ESP: VAD triggers next turn (same conversation)
end
```
## Component Communication
| Source | Destination | Protocol | Format |
|--------|------------|----------|--------|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |
## Network Topology
```mermaid
graph LR
ESP["ESP32<br/>WiFi"]
DAE["Daedalus<br/>puck.incus:8000"]
STT["Speaches STT<br/>perseus.helu.ca:22070"]
TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
DB["PostgreSQL<br/>portia.incus:5432"]
ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
DAE -->|"HTTP"| STT
DAE -->|"HTTP"| TTS
DAE -->|"MCP"| PALLAS
DAE -->|"SQL"| DB
```
## Audio Flow
```mermaid
graph LR
MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
GW -->|"WAV header wrap"| STT["Speaches STT"]
TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```
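
The 24 kHz → 16 kHz resampling step can be sketched in pure Python with linear interpolation; the real pipeline may use a higher-quality resampler, and this assumes a little-endian host so `array("h")` matches PCM S16LE:

```python
import array


def resample_s16le(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Linearly resample mono PCM S16LE audio between sample rates."""
    src = array.array("h")
    src.frombytes(pcm)
    n_out = int(len(src) * dst_rate / src_rate)
    out = array.array("h", bytes(2 * n_out))
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional source index
        j = int(pos)
        frac = pos - j
        a = src[j]
        b = src[j + 1] if j + 1 < len(src) else a
        out[i] = int(a + (b - a) * frac)     # interpolate between neighbors
    return out.tobytes()
```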
## Key Design Decisions
| Decision | Why |
|----------|-----|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP32-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
## Repository Structure
```
stentor/ # This repository
├── docs/
│ ├── stentor.md # Usage guide (updated)
│ └── architecture.md # This file
├── stentor-ear/ # ESP32 firmware
│ ├── main/
│ ├── components/
│ └── ...
├── stentor-gateway/ # Legacy — gateway code migrated to Daedalus
│ └── ...
└── README.md
daedalus/ # Separate repository
├── backend/daedalus/voice/ # Voice module (migrated from stentor-gateway)
│ ├── audio.py
│ ├── models.py
│ ├── pipeline.py
│ ├── stt_client.py
│ └── tts_client.py
├── backend/daedalus/api/v1/
│ └── voice.py # Voice REST + WebSocket endpoints
└── docs/
└── stentor_integration.md # Full integration specification
```

docs/stentor.md (new file, 315 lines)

# Stentor — Usage Guide
> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V
Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the **Daedalus** web application backend — there is no separate Stentor server process.
---
## Table of Contents
- [How It Works](#how-it-works)
- [Components](#components)
- [ESP32 Device Setup](#esp32-device-setup)
- [Daedalus Configuration](#daedalus-configuration)
- [Device Registration Flow](#device-registration-flow)
- [Voice Conversation Flow](#voice-conversation-flow)
- [WebSocket Protocol](#websocket-protocol)
- [API Endpoints](#api-endpoints)
- [Observability](#observability)
- [Troubleshooting](#troubleshooting)
- [Architecture Overview](#architecture-overview)
---
## How It Works
1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
4. On-device VAD (Voice Activity Detection) detects speech and silence
5. Audio streams to Daedalus, which runs: **Speaches STT** → **Pallas Agent (MCP)** → **Speaches TTS**
6. The response audio streams back to the ESP32 speaker
7. Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations
---
## Components
| Component | Location | Purpose |
|-----------|----------|---------|
| **stentor-ear** | `stentor/stentor-ear/` | ESP32-S3 firmware — microphone, speaker, wake word, VAD |
| **Daedalus voice module** | `daedalus/backend/daedalus/voice/` | Voice pipeline — STT, MCP agent calls, TTS |
| **Daedalus voice API** | `daedalus/backend/daedalus/api/v1/voice.py` | WebSocket + REST endpoints for devices and sessions |
| **Daedalus web UI** | `daedalus/frontend/` | Device management, conversation history |
The Python gateway code that was previously in `stentor/stentor-gateway/` has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.
---
## ESP32 Device Setup
The ESP32-S3-AUDIO-Board firmware needs one configuration value:
| Setting | Description | Example |
|---------|-------------|---------|
| Daedalus URL | Base URL of the Daedalus instance | `http://puck.incus:22181` |
On first boot, the device:
1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
2. Registers with Daedalus via `POST /api/v1/voice/devices/register`
3. Reuses the stored UUID on every subsequent boot, so the device keeps its identity
---
## Daedalus Configuration
Voice settings are configured via environment variables with the `DAEDALUS_` prefix:
| Variable | Description | Default |
|----------|-------------|---------|
| `DAEDALUS_VOICE_STT_URL` | Speaches STT endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_TTS_URL` | Speaches TTS endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `DAEDALUS_VOICE_TTS_MODEL` | TTS model name | `kokoro` |
| `DAEDALUS_VOICE_TTS_VOICE` | TTS voice ID | `af_heart` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` | Sample rate in Hz | `16000` |
| `DAEDALUS_VOICE_AUDIO_CHANNELS` | Audio channels | `1` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` | Seconds of silence before auto-end | `120` |
---
## Device Registration Flow
```
ESP32 Daedalus
│ │
│ [First boot — UUID generated] │
├─ POST /api/v1/voice/devices/register ▶│
│ {device_id, firmware_version} │
│◀─ {status: "registered"} ─────────┤
│ │
│ [Device appears in Daedalus │
│ Settings → Voice Devices] │
│ │
│ [User assigns workspace + agent │
│ via web UI] │
│ │
│ [Subsequent boots — same UUID] │
├─ POST /api/v1/voice/devices/register ▶│
│ {device_id, firmware_version} │
│◀─ {status: "already_registered"} ──┤
│ │
```
After registration, the device appears in the Daedalus settings page. The user assigns it:
- A **name** (e.g. "Kitchen Speaker")
- A **description** (optional)
- A **workspace** (which workspace voice conversations go to)
- An **agent** (which Pallas agent to target)
Until assigned, the device cannot process voice.
---
## Voice Conversation Flow
A voice conversation is a multi-turn session driven by on-device VAD:
```
ESP32 Daedalus
│ │
├─ [Wake word detected] │
├─ WS /api/v1/voice/realtime ──────▶│
├─ session.start ───────────────────▶│ → Create Conversation in DB
│◀──── session.created ─────────────┤ {session_id, conversation_id}
│◀──── status: listening ────────────┤
│ │
│ [VAD: user speaks] │
├─ input_audio_buffer.append ×N ────▶│
│ [VAD: silence detected] │
├─ input_audio_buffer.commit ───────▶│
│◀──── status: transcribing ────────┤ → STT
│◀──── transcript.done ─────────────┤ → Save user message
│◀──── status: thinking ────────────┤ → MCP call to Pallas
│◀──── response.text.done ──────────┤ → Save assistant message
│◀──── status: speaking ────────────┤ → TTS
│◀──── response.audio.delta ×N ─────┤
│◀──── response.audio.done ─────────┤
│◀──── response.done ───────────────┤
│◀──── status: listening ────────────┤
│ │
│ [VAD: user speaks again] │ (same conversation)
├─ (next turn cycle) ──────────────▶│
│ │
│ [Conversation ends by:] │
│ • 120s silence → timeout │
│ • Agent says goodbye │
│ • WebSocket disconnect │
│◀──── session.end ─────────────────┤
```
### Conversation End
A conversation ends in three ways:
1. **Inactivity timeout** — no speech for `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` seconds (default 120)
2. **Agent-initiated** — the Pallas agent recognizes the conversation is over and signals it
3. **Client disconnect** — ESP32 sends `session.close` or WebSocket drops
All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
---
## WebSocket Protocol
### Connection
```
WS /api/v1/voice/realtime?device_id={uuid}
```
### Client → Gateway Messages
| Type | Description | Fields |
|------|-------------|--------|
| `session.start` | Start a new conversation | `client_id` (optional), `audio_config` (optional) |
| `input_audio_buffer.append` | Audio chunk | `audio` (base64 PCM) |
| `input_audio_buffer.commit` | End of speech, trigger pipeline | — |
| `session.close` | End the session | — |
### Gateway → Client Messages
| Type | Description | Fields |
|------|-------------|--------|
| `session.created` | Session started | `session_id`, `conversation_id` |
| `status` | Processing state | `state` (`listening` / `transcribing` / `thinking` / `speaking`) |
| `transcript.done` | User's speech as text | `text` |
| `response.text.done` | Agent's text response | `text` |
| `response.audio.delta` | Audio chunk (streamed) | `delta` (base64 PCM) |
| `response.audio.done` | Audio streaming complete | — |
| `response.done` | Turn complete | — |
| `session.end` | Conversation ended | `reason` (`timeout` / `agent` / `client`) |
| `error` | Error occurred | `message`, `code` |
### Audio Format
All audio is **PCM signed 16-bit little-endian** (`pcm_s16le`), base64-encoded in JSON:
- **Sample rate:** 16,000 Hz
- **Channels:** 1 (mono)
- **Bit depth:** 16-bit
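
Before the STT upload (multipart/form-data WAV), the raw PCM buffer is wrapped in a WAV container; with the stdlib `wave` module that step looks roughly like:

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width_bytes: int = 2) -> bytes:
    """Wrap raw PCM S16LE bytes in a WAV container for the STT request."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)  # 2 bytes = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```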
---
## API Endpoints
All endpoints are served by the Daedalus FastAPI backend.
### Voice Device Management
| Method | Route | Purpose |
|--------|-------|---------|
| `POST` | `/api/v1/voice/devices/register` | ESP32 self-registration (idempotent) |
| `GET` | `/api/v1/voice/devices` | List all registered devices |
| `GET` | `/api/v1/voice/devices/{id}` | Get device details |
| `PUT` | `/api/v1/voice/devices/{id}` | Update device (name, description, workspace, agent) |
| `DELETE` | `/api/v1/voice/devices/{id}` | Remove a device |
### Voice Sessions
| Method | Route | Purpose |
|--------|-------|---------|
| `WS` | `/api/v1/voice/realtime?device_id={id}` | WebSocket for audio conversations |
| `GET` | `/api/v1/voice/sessions` | List active voice sessions |
### Voice Configuration & Health
| Method | Route | Purpose |
|--------|-------|---------|
| `GET` | `/api/v1/voice/config` | Current voice configuration |
| `PUT` | `/api/v1/voice/config` | Update voice settings |
| `GET` | `/api/v1/voice/health` | STT + TTS reachability check |
---
## Observability
### Prometheus Metrics
Voice metrics are exposed at Daedalus's `GET /metrics` endpoint with the `daedalus_voice_` prefix:
| Metric | Type | Description |
|--------|------|-------------|
| `daedalus_voice_sessions_active` | gauge | Active WebSocket sessions |
| `daedalus_voice_pipeline_duration_seconds` | histogram | Full pipeline latency |
| `daedalus_voice_stt_duration_seconds` | histogram | STT latency |
| `daedalus_voice_tts_duration_seconds` | histogram | TTS latency |
| `daedalus_voice_agent_duration_seconds` | histogram | Agent (MCP) latency |
| `daedalus_voice_transcriptions_total` | counter | Total STT calls |
| `daedalus_voice_conversations_total` | counter | Conversations by end reason |
| `daedalus_voice_devices_online` | gauge | Currently connected devices |
### Logs
Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.
Key log events: `voice_device_registered`, `voice_session_started`, `voice_pipeline_complete`, `voice_conversation_ended`, `voice_pipeline_error`.
---
## Troubleshooting
### Device not appearing in Daedalus settings
- Check the ESP32 can reach the Daedalus URL
- Verify the registration endpoint responds: `curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'`
### Device registered but voice doesn't work
- Assign the device to a workspace and agent in **Settings → Voice Devices**
- Unassigned devices get: `{"type": "error", "code": "no_workspace"}`
### STT returns empty transcripts
- Check Speaches STT is running: `curl http://perseus.helu.ca:22070/v1/models`
- Check the voice health endpoint: `curl http://puck.incus:22181/api/v1/voice/health`
### High latency
- Check `daedalus_voice_pipeline_duration_seconds` in Prometheus/Grafana
- Break down by stage: the STT, Agent, and TTS duration histograms identify the bottleneck
- Agent latency depends on the Pallas agent and its downstream MCP servers
### Audio sounds wrong (chipmunk / slow)
- Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
- Verify `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` matches the ESP32's playback rate
---
## Architecture Overview
```
┌──────────────────┐ WebSocket ┌──────────────────────────────────────┐
│ ESP32-S3 Board │◀══════════════════════▶ │ Daedalus Backend (FastAPI) │
│ (stentor-ear) │ JSON + base64 audio │ puck.incus │
│ UUID in NVS │ │ │
│ Wake Word + VAD │ │ voice/ module: │
└──────────────────┘ │ STT → MCP (Pallas) → TTS │
│ Conversations → PostgreSQL │
└──────┬──────────┬────────┬───────────┘
│ │ │
MCP │ HTTP │ HTTP │
▼ ▼ ▼
┌──────────┐ ┌────────┐ ┌────────┐
│ Pallas │ │Speaches│ │Speaches│
│ Agents │ │ STT │ │ TTS │
└──────────┘ └────────┘ └────────┘
```
For full architectural details including Mermaid diagrams, see [architecture.md](architecture.md).
For the complete integration specification, see [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md).