feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
2026-03-21 19:11:48 +00:00
parent 9ba9435883
commit 912593b796
27 changed files with 3985 additions and 138 deletions
--- a/docs/architecture.md
+++ b/docs/architecture.md
@@ -0,0 +1,222 @@
+# Stentor Architecture
+
+> Version 0.2.0 — Daedalus-integrated architecture
+
+## Overview
+
+Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:
+
+1. **stentor-ear** — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
+2. **Daedalus voice module** — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline
+
+The Python gateway that was previously a standalone service (`stentor-gateway/`) has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.
+
+## System Architecture
+
+```mermaid
+graph TB
+    subgraph "ESP32-S3-AUDIO-Board"
+        MIC["Mic Array<br/>ES7210 ADC"]
+        WW["Wake Word<br/>ESP-SR"]
+        VAD["VAD<br/>On-Device"]
+        SPK["Speaker<br/>ES8311 DAC"]
+        LED["LED Ring<br/>WS2812B"]
+        NVS["NVS<br/>Device UUID"]
+        MIC --> WW
+        MIC --> VAD
+    end
+
+    subgraph "Daedalus Backend (puck.incus)"
+        REG["Device Registry<br/>/api/v1/voice/devices"]
+        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
+        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
+        DB["PostgreSQL<br/>Conversations & Messages"]
+        MCP["MCP Connection Manager<br/>Pallas Agents"]
+    end
+
+    subgraph "Speech Services"
+        STT["Speaches STT<br/>Whisper (perseus)"]
+        TTS["Speaches TTS<br/>Kokoro (perseus)"]
+    end
+
+    subgraph "AI Agents"
+        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
+    end
+
+    NVS -->|"POST /register"| REG
+    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
+    VAD -->|"commit on silence"| WS
+    WS --> PIPE
+    PIPE -->|"POST /v1/audio/transcriptions"| STT
+    PIPE -->|"MCP call_tool"| MCP
+    MCP -->|"MCP Streamable HTTP"| PALLAS
+    PIPE -->|"POST /v1/audio/speech"| TTS
+    STT -->|"transcript text"| PIPE
+    PALLAS -->|"response text"| MCP
+    MCP --> PIPE
+    TTS -->|"PCM audio stream"| PIPE
+    PIPE --> DB
+    PIPE --> WS
+    WS -->|"audio + status"| SPK
+    WS -->|"status events"| LED
+```
+
+## Device Registration & Lifecycle
+
+```mermaid
+sequenceDiagram
+    participant ESP as ESP32
+    participant DAE as Daedalus
+    participant UI as Daedalus Web UI
+
+    Note over ESP: First boot — generate UUID, store in NVS
+    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
+    DAE->>ESP: {status: "registered"}
+
+    Note over UI: User sees new device in Settings → Voice Devices
+    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}
+
+    Note over ESP: Wake word detected
+    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
+    ESP->>DAE: session.start
+    DAE->>ESP: session.created {session_id, conversation_id}
+```
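
As a rough illustration of the registration flow above, the sketch below models an idempotent device registry in memory. The `Device` and `DeviceRegistry` names, the firmware strings, and the example name, workspace, and agent values are all assumptions for illustration; the real Daedalus storage and schema are not shown in this document.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass
class Device:
    device_id: str
    firmware: str
    # The fields below stay empty until the user configures the device in the web UI.
    name: Optional[str] = None
    workspace: Optional[str] = None
    agent: Optional[str] = None


class DeviceRegistry:
    """Hypothetical in-memory stand-in for the device registry."""

    def __init__(self) -> None:
        self._devices: dict[str, Device] = {}

    def register(self, device_id: str, firmware: str) -> dict:
        # Re-registering (for example after a reboot) keeps the user's
        # configuration and only refreshes the reported firmware version.
        device = self._devices.get(device_id)
        if device is None:
            self._devices[device_id] = Device(device_id, firmware)
        else:
            device.firmware = firmware
        return {"status": "registered"}

    def configure(self, device_id: str, name: str, workspace: str, agent: str) -> Device:
        device = self._devices[device_id]  # raises KeyError if never registered
        device.name, device.workspace, device.agent = name, workspace, agent
        return device


registry = DeviceRegistry()
device_id = str(uuid.uuid4())  # the ESP32 would generate this once and keep it in NVS
registry.register(device_id, firmware="0.2.0")  # hypothetical firmware version
registry.configure(device_id, name="kitchen", workspace="home", agent="orchestrator")
registry.register(device_id, firmware="0.2.1")  # same UUID after a firmware update
```

Keying everything on the UUID is what would let a device reboot or update without losing its workspace assignment.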
+
+## Voice Pipeline
+
+```mermaid
+sequenceDiagram
+    participant ESP as ESP32
+    participant GW as Daedalus Voice
+    participant STT as Speaches STT
+    participant MCP as MCP Manager
+    participant PALLAS as Pallas Agent
+    participant TTS as Speaches TTS
+    participant DB as PostgreSQL
+
+    Note over ESP: VAD: speech detected
+    loop Audio streaming
+        ESP->>GW: input_audio_buffer.append (base64 PCM)
+    end
+    Note over ESP: VAD: silence detected
+    ESP->>GW: input_audio_buffer.commit
+
+    GW->>ESP: status: transcribing
+    GW->>STT: POST /v1/audio/transcriptions (WAV)
+    STT->>GW: {"text": "..."}
+    GW->>ESP: transcript.done
+    GW->>DB: Save Message(role="user", content=transcript)
+
+    GW->>ESP: status: thinking
+    GW->>MCP: call_tool(workspace, agent, tool, {message})
+    MCP->>PALLAS: MCP Streamable HTTP
+    PALLAS->>MCP: CallToolResult
+    MCP->>GW: response text
+    GW->>ESP: response.text.done
+    GW->>DB: Save Message(role="assistant", content=response)
+
+    GW->>ESP: status: speaking
+    GW->>TTS: POST /v1/audio/speech
+    TTS->>GW: PCM audio stream
+
+    loop Audio chunks
+        GW->>ESP: response.audio.delta (base64 PCM)
+    end
+
+    GW->>ESP: response.audio.done
+    GW->>ESP: response.done
+    GW->>ESP: status: listening
+
+    Note over GW: Timeout timer starts (120s default)
+
+    alt Timeout — no speech
+        GW->>ESP: session.end {reason: "timeout"}
+    else Agent ends conversation
+        GW->>ESP: session.end {reason: "agent"}
+    else User speaks again
+        Note over ESP: VAD triggers next turn (same conversation)
+    end
+```
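
The append/commit exchange at the start of the sequence above can be sketched as a small server-side buffer. The event names come from the diagram; the JSON field names `type` and `audio`, and the `AudioBuffer` and `append_event` helpers, are assumptions made for this example rather than the documented wire format.

```python
import base64
import json
from typing import Optional


class AudioBuffer:
    """Accumulates base64 PCM chunks until the device commits the turn."""

    def __init__(self) -> None:
        self._chunks: list[bytes] = []

    def handle(self, raw_message: str) -> Optional[bytes]:
        """Return the full PCM payload on commit, otherwise None."""
        event = json.loads(raw_message)
        kind = event.get("type")  # assumed field name
        if kind == "input_audio_buffer.append":
            # "audio" is an assumed field name for the base64 payload.
            self._chunks.append(base64.b64decode(event["audio"]))
            return None
        if kind == "input_audio_buffer.commit":
            pcm = b"".join(self._chunks)
            self._chunks.clear()  # the next turn starts with an empty buffer
            return pcm
        return None  # other event types are ignored in this sketch


def append_event(pcm: bytes) -> str:
    """Build the JSON text frame a device might send for one audio chunk."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


buffer = AudioBuffer()
buffer.handle(append_event(b"\x01\x00\x02\x00"))
buffer.handle(append_event(b"\x03\x00"))
pcm = buffer.handle(json.dumps({"type": "input_audio_buffer.commit"}))
```

After the commit, `pcm` holds the concatenated chunks and would be handed to the STT step.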
+
+## Component Communication
+
+| Source | Destination | Protocol | Format |
+|--------|------------|----------|--------|
+| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
+| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
+| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
+| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
+| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
+| Daedalus | PostgreSQL | SQL | Conversations + Messages |
+
+## Network Topology
+
+```mermaid
+graph LR
+    ESP["ESP32<br/>WiFi"]
+    DAE["Daedalus<br/>puck.incus:8000"]
+    STT["Speaches STT<br/>perseus.helu.ca:22070"]
+    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
+    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
+    DB["PostgreSQL<br/>portia.incus:5432"]
+
+    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
+    DAE -->|"HTTP"| STT
+    DAE -->|"HTTP"| TTS
+    DAE -->|"MCP"| PALLAS
+    DAE -->|"SQL"| DB
+```
+
+## Audio Flow
+
+```mermaid
+graph LR
+    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
+    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
+    GW -->|"WAV header wrap"| STT["Speaches STT"]
+
+    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
+    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
+    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
+```
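
The two conversions in the audio flow above (WAV header wrap on the way in, 24 kHz to 16 kHz resample on the way out) can be sketched with the standard library alone. The function names are hypothetical, and plain linear interpolation is an assumption chosen for brevity; the actual audio utilities may use a different resampler with proper anti-alias filtering.

```python
import array
import io
import sys
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw 16-bit mono PCM (S16LE) in a WAV container."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return out.getvalue()


def resample_linear(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Resample 16-bit mono PCM by linear interpolation (no anti-alias filter)."""
    samples = array.array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()  # input is little-endian on the wire
    if not samples:
        return b""
    n_out = len(samples) * dst_rate // src_rate
    out = array.array("h")
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    if sys.byteorder == "big":
        out.byteswap()  # emit little-endian again
    return out.tobytes()


silence_10ms = b"\x00\x00" * 160           # 10 ms of 16 kHz mono silence
wav_bytes = pcm_to_wav(silence_10ms)
tts_chunk = b"\x10\x00" * 240              # 10 ms of 24 kHz mono audio
speaker_chunk = resample_linear(tts_chunk)  # 160 samples at 16 kHz
```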
+
+## Key Design Decisions
+
+| Decision | Why |
+|----------|-----|
+| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
+| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
+| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
+| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
+| JSON + base64 over WebSocket | Simple for v1; binary frames planned for a future version |
+| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
+| Timeout + LLM-initiated end | Two natural ways to close a session: the silence timeout expires or the agent recognizes a goodbye |
+| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
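
The "Timeout + LLM-initiated end" decision in the table above could be implemented along these lines. This is a minimal sketch: the `wait_for_next_turn` helper, the use of an `asyncio.Queue` as the event source, and the `type` field name are assumptions for illustration; only the 120 s default and the `session.end` reason come from this document.

```python
import asyncio


async def wait_for_next_turn(events: asyncio.Queue, timeout_s: float = 120.0) -> dict:
    """Wait for the next device event, or end the session after silence."""
    try:
        return await asyncio.wait_for(events.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # No speech arrived in time, so the server closes the session.
        return {"type": "session.end", "reason": "timeout"}


async def demo() -> list:
    events: asyncio.Queue = asyncio.Queue()
    await events.put({"type": "input_audio_buffer.append"})
    # A short timeout keeps the demo fast; the documented default is 120 s.
    first = await wait_for_next_turn(events, timeout_s=0.05)
    second = await wait_for_next_turn(events, timeout_s=0.05)  # queue is now empty
    return [first, second]


results = asyncio.run(demo())
```

The first call returns the queued event and the second falls through to the timeout branch. An agent-initiated end would be a separate code path that emits `session.end` with reason `"agent"`.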
+
+## Repository Structure
+
+```
+stentor/                        # This repository
+├── docs/
+│   ├── stentor.md              # Usage guide (updated)
+│   └── architecture.md         # This file
+├── stentor-ear/                # ESP32 firmware
+│   ├── main/
+│   ├── components/
+│   └── ...
+├── stentor-gateway/            # Legacy — gateway code migrated to Daedalus
+│   └── ...
+└── README.md
+
+daedalus/                       # Separate repository
+├── backend/daedalus/voice/     # Voice module (migrated from stentor-gateway)
+│   ├── audio.py
+│   ├── models.py
+│   ├── pipeline.py
+│   ├── stt_client.py
+│   └── tts_client.py
+├── backend/daedalus/api/v1/
+│   └── voice.py                # Voice REST + WebSocket endpoints
+└── docs/
+    └── stentor_integration.md  # Full integration specification
+```