feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice pipeline
orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
222
docs/architecture.md
Normal file
@@ -0,0 +1,222 @@
# Stentor Architecture

> Version 0.2.0: Daedalus-integrated architecture

## Overview

Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:

1. **stentor-ear**: ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and voice activity detection (VAD)
2. **Daedalus voice module**: Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline

The Python gateway that was previously a standalone service (`stentor-gateway/`) has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.

## System Architecture

```mermaid
graph TB
    subgraph "ESP32-S3-AUDIO-Board"
        MIC["Mic Array<br/>ES7210 ADC"]
        WW["Wake Word<br/>ESP-SR"]
        VAD["VAD<br/>On-Device"]
        SPK["Speaker<br/>ES8311 DAC"]
        LED["LED Ring<br/>WS2812B"]
        NVS["NVS<br/>Device UUID"]
        MIC --> WW
        MIC --> VAD
    end

    subgraph "Daedalus Backend (puck.incus)"
        REG["Device Registry<br/>/api/v1/voice/devices"]
        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
        DB["PostgreSQL<br/>Conversations & Messages"]
        MCP["MCP Connection Manager<br/>Pallas Agents"]
    end

    subgraph "Speech Services"
        STT["Speaches STT<br/>Whisper (perseus)"]
        TTS["Speaches TTS<br/>Kokoro (perseus)"]
    end

    subgraph "AI Agents"
        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
    end

    NVS -->|"POST /register"| REG
    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
    VAD -->|"commit on silence"| WS
    WS --> PIPE
    PIPE -->|"POST /v1/audio/transcriptions"| STT
    PIPE -->|"MCP call_tool"| MCP
    MCP -->|"MCP Streamable HTTP"| PALLAS
    PIPE -->|"POST /v1/audio/speech"| TTS
    STT -->|"transcript text"| PIPE
    PALLAS -->|"response text"| MCP
    MCP --> PIPE
    TTS -->|"PCM audio stream"| PIPE
    PIPE --> DB
    PIPE --> WS
    WS -->|"audio + status"| SPK
    WS -->|"status events"| LED
```

## Device Registration & Lifecycle

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant DAE as Daedalus
    participant UI as Daedalus Web UI

    Note over ESP: First boot (generate UUID, store in NVS)
    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
    DAE->>ESP: {status: "registered"}

    Note over UI: User sees new device in Settings → Voice Devices
    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}

    Note over ESP: Wake word detected
    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
    ESP->>DAE: session.start
    DAE->>ESP: session.created {session_id, conversation_id}
```
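
The registration and connection steps above can be sketched from the device's point of view. This is a minimal illustration in Python rather than firmware code: the paths and the `device_id` and `firmware` field names come from the diagram, while the helper names, the host, and the firmware string are placeholders assumed for the example.

```python
import json
import uuid
from urllib.parse import urlencode, urlunsplit

API_PREFIX = "/api/v1/voice"  # path prefix taken from the diagram


def registration_body(device_id: str, firmware: str) -> str:
    """JSON body for POST /api/v1/voice/devices/register."""
    return json.dumps({"device_id": device_id, "firmware": firmware})


def realtime_url(host: str, device_id: str, secure: bool = False) -> str:
    """WebSocket URL the device opens once a wake word is detected."""
    scheme = "wss" if secure else "ws"
    query = urlencode({"device_id": device_id})
    return urlunsplit((scheme, host, f"{API_PREFIX}/realtime", query, ""))


# Generated once on first boot; the firmware would persist it in NVS.
device_id = str(uuid.uuid4())

# "daedalus.example:8000" and "0.0.0-example" are placeholders, not real values.
print(registration_body(device_id, "0.0.0-example"))
print(realtime_url("daedalus.example:8000", device_id))
```

Because the UUID is generated on the device and only sent to the backend, the same identifier can be reused for every later WebSocket session without a separate provisioning step.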

## Voice Pipeline

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant GW as Daedalus Voice
    participant STT as Speaches STT
    participant MCP as MCP Manager
    participant PALLAS as Pallas Agent
    participant TTS as Speaches TTS
    participant DB as PostgreSQL

    Note over ESP: VAD: speech detected
    loop Audio streaming
        ESP->>GW: input_audio_buffer.append (base64 PCM)
    end
    Note over ESP: VAD: silence detected
    ESP->>GW: input_audio_buffer.commit

    GW->>ESP: status: transcribing
    GW->>STT: POST /v1/audio/transcriptions (WAV)
    STT->>GW: {"text": "..."}
    GW->>ESP: transcript.done
    GW->>DB: Save Message(role="user", content=transcript)

    GW->>ESP: status: thinking
    GW->>MCP: call_tool(workspace, agent, tool, {message})
    MCP->>PALLAS: MCP Streamable HTTP
    PALLAS->>MCP: CallToolResult
    MCP->>GW: response text
    GW->>ESP: response.text.done
    GW->>DB: Save Message(role="assistant", content=response)

    GW->>ESP: status: speaking
    GW->>TTS: POST /v1/audio/speech
    TTS->>GW: PCM audio stream

    loop Audio chunks
        GW->>ESP: response.audio.delta (base64 PCM)
    end

    GW->>ESP: response.audio.done
    GW->>ESP: response.done
    GW->>ESP: status: listening

    Note over GW: Timeout timer starts (120s default)

    alt Timeout (no speech)
        GW->>ESP: session.end {reason: "timeout"}
    else Agent ends conversation
        GW->>ESP: session.end {reason: "agent"}
    else User speaks again
        Note over ESP: VAD triggers next turn (same conversation)
    end
```
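
The ordering of events within one turn can be sketched as a single coroutine. This is an illustration under stated assumptions, not the Daedalus implementation: the event names come from the diagram, but the JSON field names (`type`, `state`, `text`, `audio`), the `run_turn` function, and the stub services are hypothetical, and database persistence is left out.

```python
import asyncio
import base64
from typing import Awaitable, Callable, List

Send = Callable[[dict], Awaitable[None]]


async def run_turn(
    wav: bytes,
    transcribe: Callable[[bytes], Awaitable[str]],
    ask_agent: Callable[[str], Awaitable[str]],
    synthesize: Callable[[str], Awaitable[List[bytes]]],
    send: Send,
) -> None:
    """Run one STT -> agent -> TTS turn, emitting events in diagram order."""
    await send({"type": "status", "state": "transcribing"})
    text = await transcribe(wav)
    await send({"type": "transcript.done", "text": text})

    await send({"type": "status", "state": "thinking"})
    reply = await ask_agent(text)
    await send({"type": "response.text.done", "text": reply})

    await send({"type": "status", "state": "speaking"})
    for chunk in await synthesize(reply):
        audio = base64.b64encode(chunk).decode("ascii")
        await send({"type": "response.audio.delta", "audio": audio})
    await send({"type": "response.audio.done"})
    await send({"type": "response.done"})
    await send({"type": "status", "state": "listening"})


async def demo() -> List[str]:
    events: List[dict] = []

    # Stub services standing in for Speaches and a Pallas agent.
    async def send(event: dict) -> None:
        events.append(event)

    async def transcribe(_wav: bytes) -> str:
        return "hello"

    async def ask_agent(text: str) -> str:
        return f"you said {text}"

    async def synthesize(_text: str) -> List[bytes]:
        return [b"\x00\x00" * 160, b"\x00\x00" * 160]

    await run_turn(b"", transcribe, ask_agent, synthesize, send)
    return [e.get("state", e["type"]) for e in events]


order = asyncio.run(demo())
print(order)
```

Passing the three services in as callables keeps the ordering logic testable without a network, which is the main point of the sketch.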

## Component Communication

| Source | Destination | Protocol | Format |
|--------|------------|----------|--------|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |
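
The "JSON + base64 PCM" framing in the first row can be illustrated with a small encoder and a server-side buffer. The message type names appear in the pipeline diagram; the `type` and `audio` field names, `encode_append`, and the `AudioBuffer` class are assumptions made for this sketch.

```python
import base64
import json
from typing import Optional


def encode_append(pcm: bytes) -> str:
    """Wrap one chunk of raw PCM in an input_audio_buffer.append message."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


class AudioBuffer:
    """Collects appended chunks until the device commits the utterance."""

    def __init__(self) -> None:
        self._pcm = bytearray()

    def handle(self, raw: str) -> Optional[bytes]:
        """Return the full utterance on commit, otherwise None."""
        msg = json.loads(raw)
        if msg["type"] == "input_audio_buffer.append":
            self._pcm += base64.b64decode(msg["audio"])
            return None
        if msg["type"] == "input_audio_buffer.commit":
            utterance, self._pcm = bytes(self._pcm), bytearray()
            return utterance
        return None


# A 20 ms frame at 16 kHz, 16-bit mono is 320 samples, i.e. 640 bytes.
frame = bytes(640)
buffer = AudioBuffer()
buffer.handle(encode_append(frame))
buffer.handle(encode_append(frame))
utterance = buffer.handle(json.dumps({"type": "input_audio_buffer.commit"}))
print(len(utterance))
```

Base64 encodes every 3 bytes as 4 characters, so the text framing costs roughly a third more bandwidth than the raw PCM it carries.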

## Network Topology

```mermaid
graph LR
    ESP["ESP32<br/>WiFi"]
    DAE["Daedalus<br/>puck.incus:8000"]
    STT["Speaches STT<br/>perseus.helu.ca:22070"]
    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
    DB["PostgreSQL<br/>portia.incus:5432"]

    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
    DAE -->|"HTTP"| STT
    DAE -->|"HTTP"| TTS
    DAE -->|"MCP"| PALLAS
    DAE -->|"SQL"| DB
```

## Audio Flow

```mermaid
graph LR
    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
    GW -->|"WAV header wrap"| STT["Speaches STT"]

    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```
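
The two conversions in this flow, wrapping PCM in a WAV header and resampling 24 kHz to 16 kHz, can be sketched with the standard library alone. The function names are hypothetical, and the resampler uses plain linear interpolation purely for illustration; the source does not say which method the voice module uses, and a production resampler would normally low-pass filter first to limit aliasing.

```python
import array
import io
import sys
import wave


def pcm_to_wav(pcm: bytes, rate: int = 16000) -> bytes:
    """Prepend a WAV header to 16-bit mono PCM for the STT upload."""
    out = io.BytesIO()
    with wave.open(out, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(pcm)
    return out.getvalue()


def resample_linear(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Resample little-endian 16-bit mono PCM by linear interpolation."""
    src = array.array("h")
    src.frombytes(pcm)
    if sys.byteorder == "big":
        src.byteswap()  # input is S16LE regardless of host byte order
    if not src:
        return b""
    n_out = len(src) * dst_rate // src_rate
    dst = array.array("h")
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional index into src
        left = int(pos)
        right = min(left + 1, len(src) - 1)
        frac = pos - left
        dst.append(round(src[left] * (1 - frac) + src[right] * frac))
    if sys.byteorder == "big":
        dst.byteswap()
    return dst.tobytes()


one_second_24k = bytes(24000 * 2)            # one second of silence at 24 kHz
resampled = resample_linear(one_second_24k)
wav_bytes = pcm_to_wav(resampled)
print(len(resampled) // 2, len(wav_bytes))
```

Resampling on the server keeps the device side at a single fixed format (16 kHz, 16-bit, mono) in both directions.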

## Key Design Decisions

| Decision | Why |
|----------|-----|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend; no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
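
The silence-timeout half of the "Timeout + LLM-initiated end" decision could look roughly like the sketch below. It assumes committed utterances arrive on an `asyncio.Queue`; the queue, the `wait_for_next_turn` helper, and the internal `turn` marker are hypothetical, while the 120 second default and the `session.end` reason come from the pipeline diagram.

```python
import asyncio

SESSION_TIMEOUT_S = 120.0  # default from the pipeline diagram


async def wait_for_next_turn(
    commits: asyncio.Queue, timeout: float = SESSION_TIMEOUT_S
) -> dict:
    """Wait for the next committed utterance or end the session on silence."""
    try:
        audio = await asyncio.wait_for(commits.get(), timeout)
    except asyncio.TimeoutError:
        return {"type": "session.end", "reason": "timeout"}
    # "turn" is an internal marker for this sketch, not a protocol message.
    return {"type": "turn", "audio": audio}


async def demo():
    commits: asyncio.Queue = asyncio.Queue()
    # Nothing queued: the short timeout fires and the session ends.
    first = await wait_for_next_turn(commits, timeout=0.05)
    # An utterance is queued: the same conversation continues.
    await commits.put(b"\x00\x00")
    second = await wait_for_next_turn(commits, timeout=0.05)
    return first, second


first, second = asyncio.run(demo())
print(first, second["type"])
```

The agent-initiated path would send the same `session.end` message with `reason` set to `"agent"`; how the agent signals that is not specified here.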

## Repository Structure

```
stentor/                          # This repository
├── docs/
│   ├── stentor.md                # Usage guide (updated)
│   └── architecture.md           # This file
├── stentor-ear/                  # ESP32 firmware
│   ├── main/
│   ├── components/
│   └── ...
├── stentor-gateway/              # Legacy: gateway code migrated to Daedalus
│   └── ...
└── README.md

daedalus/                         # Separate repository
├── backend/daedalus/voice/       # Voice module (migrated from stentor-gateway)
│   ├── audio.py
│   ├── models.py
│   ├── pipeline.py
│   ├── stt_client.py
│   └── tts_client.py
├── backend/daedalus/api/v1/
│   └── voice.py                  # Voice REST + WebSocket endpoints
└── docs/
    └── stentor_integration.md    # Full integration specification
```