# Stentor Architecture

> Version 0.2.0 — Daedalus-integrated architecture

## Overview

Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:

1. **stentor-ear** — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
2. **Daedalus voice module** — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline

The Python gateway that was previously a standalone service (`stentor-gateway/`) has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.

## System Architecture

```mermaid
graph TB
    subgraph "ESP32-S3-AUDIO-Board"
        MIC["Mic Array<br/>ES7210 ADC"]
        WW["Wake Word<br/>ESP-SR"]
        VAD["VAD<br/>On-Device"]
        SPK["Speaker<br/>ES8311 DAC"]
        LED["LED Ring<br/>WS2812B"]
        NVS["NVS<br/>Device UUID"]
        MIC --> WW
        MIC --> VAD
    end

    subgraph "Daedalus Backend (puck.incus)"
        REG["Device Registry<br/>/api/v1/voice/devices"]
        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
        DB["PostgreSQL<br/>Conversations & Messages"]
        MCP["MCP Connection Manager<br/>Pallas Agents"]
    end

    subgraph "Speech Services"
        STT["Speaches STT<br/>Whisper (perseus)"]
        TTS["Speaches TTS<br/>Kokoro (perseus)"]
    end

    subgraph "AI Agents"
        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
    end

    NVS -->|"POST /register"| REG
    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
    VAD -->|"commit on silence"| WS
    WS --> PIPE
    PIPE -->|"POST /v1/audio/transcriptions"| STT
    PIPE -->|"MCP call_tool"| MCP
    MCP -->|"MCP Streamable HTTP"| PALLAS
    PIPE -->|"POST /v1/audio/speech"| TTS
    STT -->|"transcript text"| PIPE
    PALLAS -->|"response text"| MCP
    MCP --> PIPE
    TTS -->|"PCM audio stream"| PIPE
    PIPE --> DB
    PIPE --> WS
    WS -->|"audio + status"| SPK
    WS -->|"status events"| LED
```

## Device Registration & Lifecycle

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant DAE as Daedalus
    participant UI as Daedalus Web UI

    Note over ESP: First boot — generate UUID, store in NVS
    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
    DAE->>ESP: {status: "registered"}

    Note over UI: User sees new device in Settings → Voice Devices
    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}

    Note over ESP: Wake word detected
    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
    ESP->>DAE: session.start
    DAE->>ESP: session.created {session_id, conversation_id}
```

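
The registration and connection steps above can be sketched from the device's point of view. This is a minimal illustration in Python rather than the actual ESP32 firmware: the paths and the `device_id` and `firmware` fields come from the diagram, while the helper names, the `ws://` scheme, the `daedalus.example` host, and the firmware string are assumptions made for the example.

```python
import json
import uuid
from urllib.parse import urlencode

# Paths as shown in the sequence diagram.
REGISTER_PATH = "/api/v1/voice/devices/register"
REALTIME_PATH = "/api/v1/voice/realtime"


def new_device_id() -> str:
    """Generate the UUID a device would create on first boot and keep in NVS."""
    return str(uuid.uuid4())


def registration_body(device_id: str, firmware: str) -> str:
    """Build the JSON body for the registration POST (hypothetical helper)."""
    return json.dumps({"device_id": device_id, "firmware": firmware})


def realtime_url(host: str, device_id: str) -> str:
    """Build the WebSocket URL opened after a wake word (scheme is assumed)."""
    query = urlencode({"device_id": device_id})
    return f"ws://{host}{REALTIME_PATH}?{query}"


device_id = new_device_id()
body = registration_body(device_id, "1.0.0")       # firmware value is illustrative
url = realtime_url("daedalus.example", device_id)  # host is a placeholder
```

Because the UUID is generated once and reused, the same `device_id` appears in both the registration body and the WebSocket query string, which is what lets the backend map a connection to the device configured in the web UI.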
## Voice Pipeline

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant GW as Daedalus Voice
    participant STT as Speaches STT
    participant MCP as MCP Manager
    participant PALLAS as Pallas Agent
    participant TTS as Speaches TTS
    participant DB as PostgreSQL

    Note over ESP: VAD: speech detected
    loop Audio streaming
        ESP->>GW: input_audio_buffer.append (base64 PCM)
    end
    Note over ESP: VAD: silence detected
    ESP->>GW: input_audio_buffer.commit

    GW->>ESP: status: transcribing
    GW->>STT: POST /v1/audio/transcriptions (WAV)
    STT->>GW: {"text": "..."}
    GW->>ESP: transcript.done
    GW->>DB: Save Message(role="user", content=transcript)

    GW->>ESP: status: thinking
    GW->>MCP: call_tool(workspace, agent, tool, {message})
    MCP->>PALLAS: MCP Streamable HTTP
    PALLAS->>MCP: CallToolResult
    MCP->>GW: response text
    GW->>ESP: response.text.done
    GW->>DB: Save Message(role="assistant", content=response)

    GW->>ESP: status: speaking
    GW->>TTS: POST /v1/audio/speech
    TTS->>GW: PCM audio stream

    loop Audio chunks
        GW->>ESP: response.audio.delta (base64 PCM)
    end

    GW->>ESP: response.audio.done
    GW->>ESP: response.done
    GW->>ESP: status: listening

    Note over GW: Timeout timer starts (120s default)

    alt Timeout — no speech
        GW->>ESP: session.end {reason: "timeout"}
    else Agent ends conversation
        GW->>ESP: session.end {reason: "agent"}
    else User speaks again
        Note over ESP: VAD triggers next turn (same conversation)
    end
```

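
The ordering of events within one turn can be sketched as a generator that takes the three services as plain callables. This is not the code in `pipeline.py`: the event type names come from the diagram, but the `run_turn` function, the JSON field names (`state`, `text`, `delta`), and the stand-in services are assumptions, and the database writes are left out for brevity.

```python
import base64
from typing import Callable, Iterable, Iterator

Event = dict  # one JSON-serialisable message sent to the device


def run_turn(
    wav_bytes: bytes,
    transcribe: Callable[[bytes], str],
    ask_agent: Callable[[str], str],
    synthesize: Callable[[str], Iterable[bytes]],
) -> Iterator[Event]:
    """Yield the events for one voice turn in the order the diagram shows."""
    yield {"type": "status", "state": "transcribing"}
    transcript = transcribe(wav_bytes)
    yield {"type": "transcript.done", "text": transcript}

    yield {"type": "status", "state": "thinking"}
    reply = ask_agent(transcript)
    yield {"type": "response.text.done", "text": reply}

    yield {"type": "status", "state": "speaking"}
    for chunk in synthesize(reply):
        # Audio is base64-encoded so it can travel inside a JSON message.
        delta = base64.b64encode(chunk).decode("ascii")
        yield {"type": "response.audio.delta", "delta": delta}
    yield {"type": "response.audio.done"}
    yield {"type": "response.done"}
    yield {"type": "status", "state": "listening"}


# Stand-in services so the sketch runs without a network.
events = list(
    run_turn(
        b"RIFF",
        transcribe=lambda wav: "hello",
        ask_agent=lambda text: f"you said {text}",
        synthesize=lambda text: [b"\x00\x01", b"\x02\x03"],
    )
)
event_types = [event["type"] for event in events]
```

Passing the services in as callables keeps the ordering logic testable without Speaches or Pallas running, and a real implementation would replace them with the STT, MCP, and TTS clients.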
## Component Communication

| Source | Destination | Protocol | Format |
|--------|------------|----------|--------|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |

## Network Topology

```mermaid
graph LR
    ESP["ESP32<br/>WiFi"]
    DAE["Daedalus<br/>puck.incus:8000"]
    STT["Speaches STT<br/>perseus.helu.ca:22070"]
    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
    DB["PostgreSQL<br/>portia.incus:5432"]

    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
    DAE -->|"HTTP"| STT
    DAE -->|"HTTP"| TTS
    DAE -->|"MCP"| PALLAS
    DAE -->|"SQL"| DB
```

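
The endpoints in the topology above could be gathered into a single settings object. The following is only a sketch, not the Daedalus configuration: the hosts, ports, and API paths are taken from the diagrams, while the `VoiceEndpoints` class, the `http://` scheme, the reading of `23031-33` as three consecutive ports, and the `STENTOR_SPEACHES_BASE` variable name are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Tuple

DEFAULT_SPEACHES_BASE = "http://perseus.helu.ca:22070"  # scheme is assumed


@dataclass(frozen=True)
class VoiceEndpoints:
    """Hypothetical container for the service addresses in the topology."""

    speaches_base: str = DEFAULT_SPEACHES_BASE
    pallas_host: str = "puck.incus"
    pallas_ports: Tuple[int, ...] = (23031, 23032, 23033)
    postgres_host: str = "portia.incus"
    postgres_port: int = 5432

    @property
    def stt_url(self) -> str:
        return f"{self.speaches_base}/v1/audio/transcriptions"

    @property
    def tts_url(self) -> str:
        return f"{self.speaches_base}/v1/audio/speech"

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "VoiceEndpoints":
        """Allow the Speaches base URL to be overridden (variable name assumed)."""
        base = env.get("STENTOR_SPEACHES_BASE", DEFAULT_SPEACHES_BASE)
        return cls(speaches_base=base)


defaults = VoiceEndpoints()
overridden = VoiceEndpoints.from_env({"STENTOR_SPEACHES_BASE": "http://localhost:8001"})
```

STT and TTS share one base URL here because both Speaches services are shown on the same host and port.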
## Audio Flow

```mermaid
graph LR
    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
    GW -->|"WAV header wrap"| STT["Speaches STT"]

    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```

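
The three transformations in this flow (WAV wrapping, resampling, and base64 framing) can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the contents of `audio.py`: the function names and the `audio` JSON field are hypothetical, and plain linear interpolation stands in for whatever resampler the voice module actually uses. Linear interpolation applies no low-pass filter, so a production path would likely want a proper resampling routine.

```python
import base64
import io
import json
import struct
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap raw mono S16LE PCM in a WAV container for the STT upload."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def resample_linear(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Resample mono S16LE PCM by linear interpolation between neighbours."""
    n_in = len(pcm) // 2
    src = struct.unpack(f"<{n_in}h", pcm[: n_in * 2])
    n_out = n_in * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate        # fractional index into the source
        left = int(pos)
        right = min(left + 1, n_in - 1)
        frac = pos - left
        out.append(round(src[left] * (1 - frac) + src[right] * frac))
    return struct.pack(f"<{n_out}h", *out)


def append_event(pcm: bytes) -> str:
    """Frame a PCM chunk as an input_audio_buffer.append message (field name assumed)."""
    return json.dumps(
        {"type": "input_audio_buffer.append",
         "audio": base64.b64encode(pcm).decode("ascii")}
    )


# Six samples at 24 kHz become four samples at 16 kHz.
tts_pcm = struct.pack("<6h", 0, 300, 600, 900, 1200, 1500)
speaker_pcm = resample_linear(tts_pcm)
resampled = list(struct.unpack("<4h", speaker_pcm))

wav_bytes = pcm_to_wav(speaker_pcm)
message = append_event(speaker_pcm)
```

The 24 kHz to 16 kHz ratio is 3:2, so every output sample either lands on a source sample or halfway between two of them, which is why the ramp above maps to evenly spaced values.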
## Key Design Decisions

| Decision | Why |
|----------|-----|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |

## Repository Structure

```
stentor/                              # This repository
├── docs/
│   ├── stentor.md                    # Usage guide (updated)
│   └── architecture.md               # This file
├── stentor-ear/                      # ESP32 firmware
│   ├── main/
│   ├── components/
│   └── ...
├── stentor-gateway/                  # Legacy — gateway code migrated to Daedalus
│   └── ...
└── README.md

daedalus/                             # Separate repository
├── backend/daedalus/voice/           # Voice module (migrated from stentor-gateway)
│   ├── audio.py
│   ├── models.py
│   ├── pipeline.py
│   ├── stt_client.py
│   └── tts_client.py
├── backend/daedalus/api/v1/
│   └── voice.py                      # Voice REST + WebSocket endpoints
└── docs/
    └── stentor_integration.md        # Full integration specification
```