feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
2026-03-21 19:11:48 +00:00
parent 9ba9435883
commit 912593b796
27 changed files with 3985 additions and 138 deletions
--- a/docs/api-reference.md
+++ b/docs/api-reference.md
@@ -0,0 +1,315 @@
+# Stentor Gateway API Reference
+
+> Version 0.1.0
+
+## Endpoints
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Dashboard (Bootstrap UI) |
+| `/api/v1/realtime` | WebSocket | Real-time audio conversation |
+| `/api/v1/info` | GET | Gateway information and configuration |
+| `/api/live/` | GET | Liveness probe (Kubernetes) |
+| `/api/ready/` | GET | Readiness probe (Kubernetes) |
+| `/api/metrics` | GET | Prometheus-compatible metrics |
+| `/api/docs` | GET | Interactive API documentation (Swagger UI) |
+| `/api/openapi.json` | GET | OpenAPI schema |
+
+---
+
+## WebSocket: `/api/v1/realtime`
+
+Real-time voice conversation endpoint. Protocol inspired by the OpenAI Realtime API.
+
+### Connection
+
+```
+ws://{host}:{port}/api/v1/realtime
+```
+
+### Client Events
+
+#### `session.start`
+
+Initiates a new conversation session. Must be sent first.
+
+```json
+{
+  "type": "session.start",
+  "client_id": "esp32-kitchen",
+  "audio_config": {
+    "sample_rate": 16000,
+    "channels": 1,
+    "sample_width": 16,
+    "encoding": "pcm_s16le"
+  }
+}
+```
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `type` | string | ✔ | Must be `"session.start"` |
+| `client_id` | string | | Client identifier for tracking |
+| `audio_config` | object | | Audio format configuration |
+
+#### `input_audio_buffer.append`
+
+Sends a chunk of audio data. Stream chunks continuously while the user is speaking.
+
+```json
+{
+  "type": "input_audio_buffer.append",
+  "audio": "<base64-encoded PCM audio>"
+}
+```
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `type` | string | ✔ | Must be `"input_audio_buffer.append"` |
+| `audio` | string | ✔ | Base64-encoded PCM S16LE audio |
+
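As a rough illustration of the two client events above, the following Python sketch builds a `session.start` message and wraps a raw PCM chunk in an `input_audio_buffer.append` message. The helper names `session_start` and `audio_append` are hypothetical and not part of the gateway; only the JSON field names come from the reference.

```python
import base64
import json


def session_start(client_id: str) -> str:
    """Build the first message a client sends (hypothetical helper)."""
    return json.dumps({
        "type": "session.start",
        "client_id": client_id,
        "audio_config": {
            "sample_rate": 16000,
            "channels": 1,
            "sample_width": 16,
            "encoding": "pcm_s16le",
        },
    })


def audio_append(pcm_chunk: bytes) -> str:
    """Wrap one chunk of raw PCM S16LE audio in an append event."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


# Two samples of silence (4 bytes of PCM S16LE), purely illustrative.
event = json.loads(audio_append(b"\x00\x00\x00\x00"))
```

A real client would send these strings as WebSocket text frames; the transport is left out here so the sketch stays dependency-free.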
+#### `input_audio_buffer.commit`
+
+Signals end of speech. Triggers the STT → Agent → TTS pipeline.
+
+```json
+{
+  "type": "input_audio_buffer.commit"
+}
+```
+
+#### `session.close`
+
+Requests session termination. The WebSocket connection will close.
+
+```json
+{
+  "type": "session.close"
+}
+```
+
+### Server Events
+
+#### `session.created`
+
+Acknowledges session creation.
+
+```json
+{
+  "type": "session.created",
+  "session_id": "550e8400-e29b-41d4-a716-446655440000"
+}
+```
+
+#### `status`
+
+Processing status update. Use for LED feedback on ESP32.
+
+```json
+{
+  "type": "status",
+  "state": "listening"
+}
+```
+
+| State | Description | Suggested LED |
+|-------|-------------|--------------|
+| `listening` | Ready for audio input | Green |
+| `transcribing` | Running STT | Yellow |
+| `thinking` | Waiting for agent response | Yellow |
+| `speaking` | Playing TTS audio | Cyan |
+
+#### `transcript.done`
+
+Transcript of what the user said.
+
+```json
+{
+  "type": "transcript.done",
+  "text": "What is the weather like today?"
+}
+```
+
+#### `response.text.done`
+
+AI agent's response text.
+
+```json
+{
+  "type": "response.text.done",
+  "text": "I don't have weather tools yet, but I can help with other things."
+}
+```
+
+#### `response.audio.delta`
+
+Streamed audio response chunk.
+
+```json
+{
+  "type": "response.audio.delta",
+  "delta": "<base64-encoded PCM audio>"
+}
+```
+
+#### `response.audio.done`
+
+Audio response streaming complete.
+
+```json
+{
+  "type": "response.audio.done"
+}
+```
+
+#### `response.done`
+
+Full response cycle complete. Gateway returns to listening state.
+
+```json
+{
+  "type": "response.done"
+}
+```
+
+#### `error`
+
+Error event.
+
+```json
+{
+  "type": "error",
+  "message": "STT service unavailable",
+  "code": "pipeline_error"
+}
+```
+
+| Code | Description |
+|------|-------------|
+| `invalid_json` | Client sent malformed JSON |
+| `validation_error` | Message failed schema validation |
+| `no_session` | Action requires an active session |
+| `empty_buffer` | Audio buffer was empty on commit |
+| `empty_transcript` | STT returned no speech |
+| `empty_response` | Agent returned empty response |
+| `pipeline_error` | Internal pipeline failure |
+| `unknown_event` | Unrecognized event type |
+| `internal_error` | Unexpected server error |
+
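One way a client could react to `status` and `error` events is sketched below in Python. The LED colours are the suggestions from the status table; the split into recoverable and fatal error codes is an assumption made for illustration, not documented gateway behaviour, and `handle_server_event` is a hypothetical name.

```python
import json

# Suggested LED colours, taken from the status table.
LED_FOR_STATE = {
    "listening": "green",
    "transcribing": "yellow",
    "thinking": "yellow",
    "speaking": "cyan",
}

# Assumption: these codes mean "nothing usable was heard", so the
# client can simply go back to listening. This policy is hypothetical.
RECOVERABLE = {"empty_buffer", "empty_transcript", "empty_response"}


def handle_server_event(raw: str) -> str:
    """Return a short action string for one server event.

    Only `status` and `error` are handled; every other event type
    (transcripts, audio deltas, ...) is reported as "ignore".
    """
    event = json.loads(raw)
    kind = event.get("type")
    if kind == "status":
        return "led:" + LED_FOR_STATE.get(event.get("state"), "off")
    if kind == "error":
        code = event.get("code", "internal_error")
        return "retry" if code in RECOVERABLE else "abort:" + code
    return "ignore"
```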
+---
+
+## REST: `/api/v1/info`
+
+Returns gateway information and current configuration.
+
+**Response:**
+
+```json
+{
+  "name": "stentor-gateway",
+  "version": "0.1.0",
+  "endpoints": {
+    "realtime": "/api/v1/realtime",
+    "live": "/api/live/",
+    "ready": "/api/ready/",
+    "metrics": "/api/metrics"
+  },
+  "config": {
+    "stt_url": "http://perseus.incus:8000",
+    "tts_url": "http://pan.incus:8000",
+    "agent_url": "http://localhost:8001",
+    "stt_model": "Systran/faster-whisper-small",
+    "tts_model": "kokoro",
+    "tts_voice": "af_heart",
+    "audio_sample_rate": 16000,
+    "audio_channels": 1,
+    "audio_sample_width": 16
+  }
+}
+```
+
+---
+
+## REST: `/api/live/`
+
+Kubernetes liveness probe.
+
+**Response (200):**
+
+```json
+{
+  "status": "ok"
+}
+```
+
+---
+
+## REST: `/api/ready/`
+
+Kubernetes readiness probe. Checks connectivity to STT, TTS, and Agent services.
+
+**Response (200 — all services reachable):**
+
+```json
+{
+  "status": "ready",
+  "checks": {
+    "stt": true,
+    "tts": true,
+    "agent": true
+  }
+}
+```
+
+**Response (503 — one or more services unavailable):**
+
+```json
+{
+  "status": "not_ready",
+  "checks": {
+    "stt": true,
+    "tts": false,
+    "agent": true
+  }
+}
+```
+
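The mapping from individual checks to the two responses above can be sketched as a small pure function. This is a minimal Python illustration that assumes the probe is ready only when every check passes; `readiness` is a hypothetical name, and how each check is actually performed is not shown.

```python
def readiness(checks: dict[str, bool]) -> tuple[int, dict]:
    """Map per-service check results to an HTTP status code and body."""
    ready = all(checks.values())
    body = {"status": "ready" if ready else "not_ready", "checks": checks}
    return (200 if ready else 503), body


status_code, body = readiness({"stt": True, "tts": False, "agent": True})
```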
+---
+
+## REST: `/api/metrics`
+
+Prometheus-compatible metrics in text exposition format.
+
+**Metrics exported:**
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `stentor_sessions_active` | Gauge | Current active WebSocket sessions |
+| `stentor_transcriptions_total` | Counter | Total STT transcription calls |
+| `stentor_tts_requests_total` | Counter | Total TTS synthesis calls |
+| `stentor_agent_requests_total` | Counter | Total agent message calls |
+| `stentor_pipeline_duration_seconds` | Histogram | Full pipeline latency |
+| `stentor_stt_duration_seconds` | Histogram | STT transcription latency |
+| `stentor_tts_duration_seconds` | Histogram | TTS synthesis latency |
+| `stentor_agent_duration_seconds` | Histogram | Agent response latency |
+
+---
+
+## Configuration
+
+All configuration is supplied via environment variables (12-factor):
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
+| `STENTOR_PORT` | Gateway bind port | `8600` |
+| `STENTOR_STT_URL` | Speaches STT endpoint | `http://perseus.incus:8000` |
+| `STENTOR_TTS_URL` | Speaches TTS endpoint | `http://pan.incus:8000` |
+| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
+| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
+| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
+| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
+| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
+| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
+| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
+| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |
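
To illustrate how the `STENTOR_`-prefixed variables in the configuration table of `docs/api-reference.md` map onto typed settings, here is a standard-library Python sketch. The commit message mentions Pydantic Settings; the dataclass below is only a stand-in for that, covers a subset of the variables, and the names `GatewayConfig` and `load_config` are hypothetical.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional


@dataclass(frozen=True)
class GatewayConfig:
    # Defaults copied from the configuration table (subset only).
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    audio_sample_rate: int = 16000
    log_level: str = "INFO"


def load_config(env: Optional[Mapping[str, str]] = None) -> GatewayConfig:
    """Read STENTOR_* variables, falling back to the defaults above."""
    env = os.environ if env is None else env
    defaults = GatewayConfig()
    overrides = {}
    for field in fields(GatewayConfig):
        raw = env.get("STENTOR_" + field.name.upper())
        if raw is not None:
            # Coerce the string to the type of the default (int or str here).
            overrides[field.name] = type(getattr(defaults, field.name))(raw)
    return GatewayConfig(**overrides)


config = load_config({"STENTOR_PORT": "9000"})
```

Passing a plain dict instead of `os.environ` keeps the loader easy to exercise in tests.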
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -0,0 +1,222 @@
+# Stentor Architecture
+
+> Version 0.2.0 — Daedalus-integrated architecture
+
+## Overview
+
+Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:
+
+1. **stentor-ear** — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
+2. **Daedalus voice module** — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline
+
+The Python gateway that was previously a standalone service (`stentor-gateway/`) has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.
+
+## System Architecture
+
+```mermaid
+graph TB
+    subgraph "ESP32-S3-AUDIO-Board"
+        MIC["Mic Array<br/>ES7210 ADC"]
+        WW["Wake Word<br/>ESP-SR"]
+        VAD["VAD<br/>On-Device"]
+        SPK["Speaker<br/>ES8311 DAC"]
+        LED["LED Ring<br/>WS2812B"]
+        NVS["NVS<br/>Device UUID"]
+        MIC --> WW
+        MIC --> VAD
+    end
+
+    subgraph "Daedalus Backend (puck.incus)"
+        REG["Device Registry<br/>/api/v1/voice/devices"]
+        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
+        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
+        DB["PostgreSQL<br/>Conversations & Messages"]
+        MCP["MCP Connection Manager<br/>Pallas Agents"]
+    end
+
+    subgraph "Speech Services"
+        STT["Speaches STT<br/>Whisper (perseus)"]
+        TTS["Speaches TTS<br/>Kokoro (perseus)"]
+    end
+
+    subgraph "AI Agents"
+        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
+    end
+
+    NVS -->|"POST /register"| REG
+    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
+    VAD -->|"commit on silence"| WS
+    WS --> PIPE
+    PIPE -->|"POST /v1/audio/transcriptions"| STT
+    PIPE -->|"MCP call_tool"| MCP
+    MCP -->|"MCP Streamable HTTP"| PALLAS
+    PIPE -->|"POST /v1/audio/speech"| TTS
+    STT -->|"transcript text"| PIPE
+    PALLAS -->|"response text"| MCP
+    MCP --> PIPE
+    TTS -->|"PCM audio stream"| PIPE
+    PIPE --> DB
+    PIPE --> WS
+    WS -->|"audio + status"| SPK
+    WS -->|"status events"| LED
+```
+
+## Device Registration & Lifecycle
+
+```mermaid
+sequenceDiagram
+    participant ESP as ESP32
+    participant DAE as Daedalus
+    participant UI as Daedalus Web UI
+
+    Note over ESP: First boot — generate UUID, store in NVS
+    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
+    DAE->>ESP: {status: "registered"}
+
+    Note over UI: User sees new device in Settings → Voice Devices
+    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}
+
+    Note over ESP: Wake word detected
+    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
+    ESP->>DAE: session.start
+    DAE->>ESP: session.created {session_id, conversation_id}
+```
+
+## Voice Pipeline
+
+```mermaid
+sequenceDiagram
+    participant ESP as ESP32
+    participant GW as Daedalus Voice
+    participant STT as Speaches STT
+    participant MCP as MCP Manager
+    participant PALLAS as Pallas Agent
+    participant TTS as Speaches TTS
+    participant DB as PostgreSQL
+
+    Note over ESP: VAD: speech detected
+    loop Audio streaming
+        ESP->>GW: input_audio_buffer.append (base64 PCM)
+    end
+    Note over ESP: VAD: silence detected
+    ESP->>GW: input_audio_buffer.commit
+
+    GW->>ESP: status: transcribing
+    GW->>STT: POST /v1/audio/transcriptions (WAV)
+    STT->>GW: {"text": "..."}
+    GW->>ESP: transcript.done
+    GW->>DB: Save Message(role="user", content=transcript)
+
+    GW->>ESP: status: thinking
+    GW->>MCP: call_tool(workspace, agent, tool, {message})
+    MCP->>PALLAS: MCP Streamable HTTP
+    PALLAS->>MCP: CallToolResult
+    MCP->>GW: response text
+    GW->>ESP: response.text.done
+    GW->>DB: Save Message(role="assistant", content=response)
+
+    GW->>ESP: status: speaking
+    GW->>TTS: POST /v1/audio/speech
+    TTS->>GW: PCM audio stream
+
+    loop Audio chunks
+        GW->>ESP: response.audio.delta (base64 PCM)
+    end
+
+    GW->>ESP: response.audio.done
+    GW->>ESP: response.done
+    GW->>ESP: status: listening
+
+    Note over GW: Timeout timer starts (120s default)
+
+    alt Timeout — no speech
+        GW->>ESP: session.end {reason: "timeout"}
+    else Agent ends conversation
+        GW->>ESP: session.end {reason: "agent"}
+    else User speaks again
+        Note over ESP: VAD triggers next turn (same conversation)
+    end
+```
+
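The event order within a single turn of the sequence diagram can be sketched as a Python generator. This is an illustration under simplifying assumptions: the STT, agent, and TTS stages are plain callables supplied by the caller, and database persistence, timeouts, and error handling are omitted. `run_turn` is a hypothetical name, not the function in `pipeline.py`.

```python
import base64
from typing import Callable, Iterable, Iterator


def run_turn(
    audio: bytes,
    stt: Callable[[bytes], str],
    agent: Callable[[str], str],
    tts: Callable[[str], Iterable[bytes]],
) -> Iterator[dict]:
    """Yield the server events for one turn, in diagram order."""
    yield {"type": "status", "state": "transcribing"}
    transcript = stt(audio)
    yield {"type": "transcript.done", "text": transcript}

    yield {"type": "status", "state": "thinking"}
    reply = agent(transcript)
    yield {"type": "response.text.done", "text": reply}

    yield {"type": "status", "state": "speaking"}
    for chunk in tts(reply):
        delta = base64.b64encode(chunk).decode("ascii")
        yield {"type": "response.audio.delta", "delta": delta}
    yield {"type": "response.audio.done"}
    yield {"type": "response.done"}
    yield {"type": "status", "state": "listening"}


# Stub stages standing in for Speaches and a Pallas agent.
events = list(run_turn(
    b"\x00\x00",
    stt=lambda pcm: "hello",
    agent=lambda text: "hi there",
    tts=lambda text: [b"\x01\x00", b"\x02\x00"],
))
```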
+## Component Communication
+
+| Source | Destination | Protocol | Format |
+|--------|------------|----------|--------|
+| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
+| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
+| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
+| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
+| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
+| Daedalus | PostgreSQL | SQL | Conversations + Messages |
+
+## Network Topology
+
+```mermaid
+graph LR
+    ESP["ESP32<br/>WiFi"]
+    DAE["Daedalus<br/>puck.incus:8000"]
+    STT["Speaches STT<br/>perseus.helu.ca:22070"]
+    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
+    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
+    DB["PostgreSQL<br/>portia.incus:5432"]
+
+    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
+    DAE -->|"HTTP"| STT
+    DAE -->|"HTTP"| TTS
+    DAE -->|"MCP"| PALLAS
+    DAE -->|"SQL"| DB
+```
+
+## Audio Flow
+
+```mermaid
+graph LR
+    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
+    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
+    GW -->|"WAV header wrap"| STT["Speaches STT"]
+
+    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
+    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
+    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
+```
+
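The two conversions in the audio flow (wrapping PCM in a WAV header for STT, and resampling 24 kHz TTS output to 16 kHz) can be sketched with the Python standard library. The resampler below uses plain linear interpolation without an anti-aliasing filter, which is an assumption chosen for brevity; the method used in `audio.py` is not stated here. Function names are hypothetical.

```python
import io
import struct
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Prepend a WAV header to mono 16-bit PCM."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # bytes per sample, i.e. 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def resample_linear(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Resample mono PCM S16LE by linear interpolation."""
    n = len(pcm) // 2
    if n == 0:
        return b""
    src = struct.unpack(f"<{n}h", pcm[: n * 2])
    out_len = n * dst_rate // src_rate
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate  # fractional index into the source
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        out.append(int(round(src[lo] * (1 - frac) + src[hi] * frac)))
    return struct.pack(f"<{len(out)}h", *out)


ramp_24k = struct.pack("<6h", 0, 300, 600, 900, 1200, 1500)
ramp_16k = resample_linear(ramp_24k)
```

Six input samples at 24 kHz become four output samples at 16 kHz, which is the 3:2 ratio shown in the diagram.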
+## Key Design Decisions
+
+| Decision | Why |
+|----------|-----|
+| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
+| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
+| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
+| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
+| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
+| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
+| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
+| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
+
+## Repository Structure
+
+```
+stentor/                        # This repository
+├── docs/
+│   ├── stentor.md              # Usage guide (updated)
+│   └── architecture.md         # This file
+├── stentor-ear/                # ESP32 firmware
+│   ├── main/
+│   ├── components/
+│   └── ...
+├── stentor-gateway/            # Legacy — gateway code migrated to Daedalus
+│   └── ...
+└── README.md
+
+daedalus/                       # Separate repository
+├── backend/daedalus/voice/     # Voice module (migrated from stentor-gateway)
+│   ├── audio.py
+│   ├── models.py
+│   ├── pipeline.py
+│   ├── stt_client.py
+│   └── tts_client.py
+├── backend/daedalus/api/v1/
+│   └── voice.py                # Voice REST + WebSocket endpoints
+└── docs/
+    └── stentor_integration.md  # Full integration specification
+```
--- a/docs/stentor.md
+++ b/docs/stentor.md
@@ -0,0 +1,315 @@
+# Stentor — Usage Guide
+
+> *"Stentor, whose voice was as powerful as fifty voices of other men."*
+> — Homer, *Iliad*, Book V
+
+Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the **Daedalus** web application backend — there is no separate Stentor server process.
+
+---
+
+## Table of Contents
+
+- [How It Works](#how-it-works)
+- [Components](#components)
+- [ESP32 Device Setup](#esp32-device-setup)
+- [Daedalus Configuration](#daedalus-configuration)
+- [Device Registration Flow](#device-registration-flow)
+- [Voice Conversation Flow](#voice-conversation-flow)
+- [WebSocket Protocol](#websocket-protocol)
+- [API Endpoints](#api-endpoints)
+- [Observability](#observability)
+- [Troubleshooting](#troubleshooting)
+- [Architecture Overview](#architecture-overview)
+
+---
+
+## How It Works
+
+1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
+2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
+3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
+4. On-device VAD (Voice Activity Detection) detects speech and silence
+5. Audio streams to Daedalus, which runs: **Speaches STT** → **Pallas Agent (MCP)** → **Speaches TTS**
+6. The response audio streams back to the ESP32 speaker
+7. Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations
+
+---
+
+## Components
+
+| Component | Location | Purpose |
+|-----------|----------|---------|
+| **stentor-ear** | `stentor/stentor-ear/` | ESP32-S3 firmware — microphone, speaker, wake word, VAD |
+| **Daedalus voice module** | `daedalus/backend/daedalus/voice/` | Voice pipeline — STT, MCP agent calls, TTS |
+| **Daedalus voice API** | `daedalus/backend/daedalus/api/v1/voice.py` | WebSocket + REST endpoints for devices and sessions |
+| **Daedalus web UI** | `daedalus/frontend/` | Device management, conversation history |
+
+The Python gateway code that was previously in `stentor/stentor-gateway/` has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.
+
+---
+
+## ESP32 Device Setup
+
+The ESP32-S3-AUDIO-Board firmware needs one configuration value:
+
+| Setting | Description | Example |
+|---------|-------------|---------|
+| Daedalus URL | Base URL of the Daedalus instance | `http://puck.incus:22181` |
+
+On first boot, the device:
+1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
+2. Registers with Daedalus via `POST /api/v1/voice/devices/register`
+
+The UUID persists across reboots, so the device keeps its identity.
+
+---
+
+## Daedalus Configuration
+
+Voice settings are configured via environment variables with the `DAEDALUS_` prefix:
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `DAEDALUS_VOICE_STT_URL` | Speaches STT endpoint | `http://perseus.helu.ca:22070` |
+| `DAEDALUS_VOICE_TTS_URL` | Speaches TTS endpoint | `http://perseus.helu.ca:22070` |
+| `DAEDALUS_VOICE_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
+| `DAEDALUS_VOICE_TTS_MODEL` | TTS model name | `kokoro` |
+| `DAEDALUS_VOICE_TTS_VOICE` | TTS voice ID | `af_heart` |
+| `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` | Sample rate in Hz | `16000` |
+| `DAEDALUS_VOICE_AUDIO_CHANNELS` | Audio channels | `1` |
+| `DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
+| `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` | Seconds of silence before auto-end | `120` |
+
+---
+
+## Device Registration Flow
+
+```
+ESP32                              Daedalus
+  │                                    │
+  │  [First boot — UUID generated]     │
+  ├─ POST /api/v1/voice/devices/register ▶│
+  │   {device_id, firmware_version}    │
+  │◀─ {status: "registered"} ─────────┤
+  │                                    │
+  │  [Device appears in Daedalus       │
+  │   Settings → Voice Devices]        │
+  │                                    │
+  │  [User assigns workspace + agent   │
+  │   via web UI]                      │
+  │                                    │
+  │  [Subsequent boots — same UUID]    │
+  ├─ POST /api/v1/voice/devices/register ▶│
+  │   {device_id, firmware_version}    │
+  │◀─ {status: "already_registered"} ──┤
+  │                                    │
+```
+
+After registration, the device appears in the Daedalus settings page. The user assigns it:
+- A **name** (e.g. "Kitchen Speaker")
+- A **description** (optional)
+- A **workspace** (which workspace voice conversations go to)
+- An **agent** (which Pallas agent to target)
+
+Until assigned, the device cannot process voice.
+
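The idempotent registration and the "unassigned devices cannot process voice" rule can be sketched with an in-memory registry in Python. This is only an illustration: the real backend persists devices in its database, and the class and method names here are hypothetical. Updating the stored firmware version on re-registration and answering unknown devices with `no_workspace` are both assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    device_id: str
    firmware_version: str
    workspace: Optional[str] = None
    agent: Optional[str] = None


class DeviceRegistry:
    def __init__(self) -> None:
        self._devices: dict[str, Device] = {}

    def register(self, device_id: str, firmware_version: str) -> dict:
        """Idempotent: repeated calls with the same UUID do not duplicate."""
        if device_id in self._devices:
            # Assumption: a re-registration refreshes the firmware version.
            self._devices[device_id].firmware_version = firmware_version
            return {"status": "already_registered"}
        self._devices[device_id] = Device(device_id, firmware_version)
        return {"status": "registered"}

    def assign(self, device_id: str, workspace: str, agent: str) -> None:
        device = self._devices[device_id]
        device.workspace = workspace
        device.agent = agent

    def session_error(self, device_id: str) -> Optional[dict]:
        """Return an error event if the device may not start a session."""
        device = self._devices.get(device_id)
        # Assumption: unknown devices are treated like unassigned ones.
        if device is None or device.workspace is None or device.agent is None:
            return {"type": "error", "code": "no_workspace"}
        return None


registry = DeviceRegistry()
first = registry.register("test-uuid", "1.0")
second = registry.register("test-uuid", "1.0")
```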
+---
+
+## Voice Conversation Flow
+
+A voice conversation is a multi-turn session driven by on-device VAD:
+
+```
+ESP32                              Daedalus
+  │                                    │
+  ├─ [Wake word detected]             │
+  ├─ WS /api/v1/voice/realtime ──────▶│
+  ├─ session.start ───────────────────▶│  → Create Conversation in DB
+  │◀──── session.created ─────────────┤    {session_id, conversation_id}
+  │◀──── status: listening ────────────┤
+  │                                    │
+  │  [VAD: user speaks]                │
+  ├─ input_audio_buffer.append ×N ────▶│
+  │  [VAD: silence detected]           │
+  ├─ input_audio_buffer.commit ───────▶│
+  │◀──── status: transcribing ────────┤  → STT
+  │◀──── transcript.done ─────────────┤  → Save user message
+  │◀──── status: thinking ────────────┤  → MCP call to Pallas
+  │◀──── response.text.done ──────────┤  → Save assistant message
+  │◀──── status: speaking ────────────┤  → TTS
+  │◀──── response.audio.delta ×N ─────┤
+  │◀──── response.audio.done ─────────┤
+  │◀──── response.done ───────────────┤
+  │◀──── status: listening ────────────┤
+  │                                    │
+  │  [VAD: user speaks again]          │  (same conversation)
+  ├─ (next turn cycle) ──────────────▶│
+  │                                    │
+  │  [Conversation ends by:]           │
+  │  • 120s silence → timeout          │
+  │  • Agent says goodbye              │
+  │  • WebSocket disconnect            │
+  │◀──── session.end ─────────────────┤
+```
+
+### Conversation End
+
+A conversation can end in one of three ways:
+
+1. **Inactivity timeout**: no speech for `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` seconds (default 120)
+2. **Agent-initiated**: the Pallas agent recognizes the conversation is over and signals it
+3. **Client disconnect**: the ESP32 sends `session.close` or the WebSocket drops
+
+All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
+
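The inactivity timeout can be pictured as a timer that is reset on every turn and checked while the gateway is listening. The Python sketch below is a minimal illustration with an injectable clock so it can be tested without waiting; `InactivityTimer` is a hypothetical name, and the actual implementation may instead rely on an asyncio timeout.

```python
import time
from typing import Callable


class InactivityTimer:
    """Tracks the time since the last activity in a conversation."""

    def __init__(
        self,
        timeout_s: float = 120.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._timeout_s = timeout_s
        self._clock = clock
        self._last = clock()

    def touch(self) -> None:
        """Call whenever a turn completes or speech arrives."""
        self._last = self._clock()

    def expired(self) -> bool:
        """True once the configured silence period has elapsed."""
        return self._clock() - self._last >= self._timeout_s


# A fake clock makes the behaviour deterministic for the example.
now = [0.0]
timer = InactivityTimer(120.0, clock=lambda: now[0])
now[0] = 119.0
still_open = not timer.expired()
```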
+---
+
+## WebSocket Protocol
+
+### Connection
+
+```
+WS /api/v1/voice/realtime?device_id={uuid}
+```
+
+### Client → Gateway Messages
+
+| Type | Description | Fields |
+|------|-------------|--------|
+| `session.start` | Start a new conversation | `client_id` (optional), `audio_config` (optional) |
+| `input_audio_buffer.append` | Audio chunk | `audio` (base64 PCM) |
+| `input_audio_buffer.commit` | End of speech, trigger pipeline | — |
+| `session.close` | End the session | — |
+
+### Gateway → Client Messages
+
+| Type | Description | Fields |
+|------|-------------|--------|
+| `session.created` | Session started | `session_id`, `conversation_id` |
+| `status` | Processing state | `state` (`listening` / `transcribing` / `thinking` / `speaking`) |
+| `transcript.done` | User's speech as text | `text` |
+| `response.text.done` | Agent's text response | `text` |
+| `response.audio.delta` | Audio chunk (streamed) | `delta` (base64 PCM) |
+| `response.audio.done` | Audio streaming complete | — |
+| `response.done` | Turn complete | — |
+| `session.end` | Conversation ended | `reason` (`timeout` / `agent` / `client`) |
+| `error` | Error occurred | `message`, `code` |
+
+### Audio Format
+
+All audio is **PCM signed 16-bit little-endian** (`pcm_s16le`), base64-encoded in JSON:
+
+- **Sample rate:** 16,000 Hz
+- **Channels:** 1 (mono)
+- **Bit depth:** 16-bit
+
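The audio format above fixes the data rate, which is useful when sizing buffers and WebSocket messages. The short Python calculation below works through the arithmetic; the 20 ms chunk duration is an arbitrary example value, not something the protocol prescribes.

```python
import math

SAMPLE_RATE_HZ = 16_000
CHANNELS = 1
BYTES_PER_SAMPLE = 2  # 16-bit samples


def pcm_bytes(duration_ms: int) -> int:
    """Raw PCM size for a chunk of the given duration."""
    return SAMPLE_RATE_HZ * CHANNELS * BYTES_PER_SAMPLE * duration_ms // 1000


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes."""
    return 4 * math.ceil(n_bytes / 3)


per_second = pcm_bytes(1000)       # bytes of PCM per second of audio
chunk = pcm_bytes(20)              # bytes in a hypothetical 20 ms chunk
chunk_encoded = base64_len(chunk)  # characters once base64-encoded
```

Base64 adds roughly a third to the payload size, which is the overhead the JSON framing accepts in exchange for simplicity.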
+---
+
+## API Endpoints
+
+All endpoints are served by the Daedalus FastAPI backend.
+
+### Voice Device Management
+
+| Method | Route | Purpose |
+|--------|-------|---------|
+| `POST` | `/api/v1/voice/devices/register` | ESP32 self-registration (idempotent) |
+| `GET` | `/api/v1/voice/devices` | List all registered devices |
+| `GET` | `/api/v1/voice/devices/{id}` | Get device details |
+| `PUT` | `/api/v1/voice/devices/{id}` | Update device (name, description, workspace, agent) |
+| `DELETE` | `/api/v1/voice/devices/{id}` | Remove a device |
+
+### Voice Sessions
+
+| Method | Route | Purpose |
+|--------|-------|---------|
+| `WS` | `/api/v1/voice/realtime?device_id={id}` | WebSocket for audio conversations |
+| `GET` | `/api/v1/voice/sessions` | List active voice sessions |
+
+### Voice Configuration & Health
+
+| Method | Route | Purpose |
+|--------|-------|---------|
+| `GET` | `/api/v1/voice/config` | Current voice configuration |
+| `PUT` | `/api/v1/voice/config` | Update voice settings |
+| `GET` | `/api/v1/voice/health` | STT + TTS reachability check |
+
+---
+
+## Observability
+
+### Prometheus Metrics
+
+Voice metrics are exposed at Daedalus's `GET /metrics` endpoint with the `daedalus_voice_` prefix:
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `daedalus_voice_sessions_active` | gauge | Active WebSocket sessions |
+| `daedalus_voice_pipeline_duration_seconds` | histogram | Full pipeline latency |
+| `daedalus_voice_stt_duration_seconds` | histogram | STT latency |
+| `daedalus_voice_tts_duration_seconds` | histogram | TTS latency |
+| `daedalus_voice_agent_duration_seconds` | histogram | Agent (MCP) latency |
+| `daedalus_voice_transcriptions_total` | counter | Total STT calls |
+| `daedalus_voice_conversations_total` | counter | Conversations by end reason |
+| `daedalus_voice_devices_online` | gauge | Currently connected devices |
+
+### Logs
+
+Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.
+
+Key log events: `voice_device_registered`, `voice_session_started`, `voice_pipeline_complete`, `voice_conversation_ended`, `voice_pipeline_error`.
+
+---
+
+## Troubleshooting
+
+### Device not appearing in Daedalus settings
+
+- Check the ESP32 can reach the Daedalus URL
+- Verify the registration endpoint responds: `curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'`
+
+### Device registered but voice doesn't work
+
+- Assign the device to a workspace and agent in **Settings → Voice Devices**
+- Unassigned devices get: `{"type": "error", "code": "no_workspace"}`
+
+### STT returns empty transcripts
+
+- Check Speaches STT is running: `curl http://perseus.helu.ca:22070/v1/models`
+- Check the voice health endpoint: `curl http://puck.incus:22181/api/v1/voice/health`
+
+### High latency
+
+- Check `daedalus_voice_pipeline_duration_seconds` in Prometheus/Grafana
+- Break the latency down by stage: the STT, agent, and TTS histograms show which stage is the bottleneck
+- Agent latency depends on the Pallas agent and its downstream MCP servers
+
+### Audio sounds wrong (chipmunk / slow)
+
+- Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
+- Verify `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` matches the ESP32's playback rate
+
+---
+
+## Architecture Overview
+
+```
+┌──────────────────┐       WebSocket         ┌──────────────────────────────────────┐
+│  ESP32-S3 Board  │◀══════════════════════▶ │   Daedalus Backend (FastAPI)         │
+│  (stentor-ear)   │   JSON + base64 audio   │   puck.incus                         │
+│  UUID in NVS     │                         │                                      │
+│  Wake Word + VAD │                         │   voice/ module:                     │
+└──────────────────┘                         │     STT → MCP (Pallas) → TTS        │
+                                             │     Conversations → PostgreSQL       │
+                                             └──────┬──────────┬────────┬───────────┘
+                                                    │          │        │
+                                              MCP   │    HTTP  │  HTTP  │
+                                                    ▼          ▼        ▼
+                                             ┌──────────┐ ┌────────┐ ┌────────┐
+                                             │  Pallas  │ │Speaches│ │Speaches│
+                                             │  Agents  │ │  STT   │ │  TTS   │
+                                             └──────────┘ └────────┘ └────────┘
+```
+
+For full architectural details including Mermaid diagrams, see [architecture.md](architecture.md).
+
+For the complete integration specification, see [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md).