# Stentor Architecture

Version 0.2.0 — Daedalus-integrated architecture

## Overview
Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:
- stentor-ear — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
- Daedalus voice module — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline
The Python gateway that was previously a standalone service (stentor-gateway/) has been merged into the Daedalus backend as daedalus/backend/daedalus/voice/. See daedalus/docs/stentor_integration.md for the full integration specification.
## System Architecture

```mermaid
graph TB
    subgraph "ESP32-S3-AUDIO-Board"
        MIC["Mic Array<br/>ES7210 ADC"]
        WW["Wake Word<br/>ESP-SR"]
        VAD["VAD<br/>On-Device"]
        SPK["Speaker<br/>ES8311 DAC"]
        LED["LED Ring<br/>WS2812B"]
        NVS["NVS<br/>Device UUID"]
        MIC --> WW
        MIC --> VAD
    end
    subgraph "Daedalus Backend (puck.incus)"
        REG["Device Registry<br/>/api/v1/voice/devices"]
        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
        DB["PostgreSQL<br/>Conversations & Messages"]
        MCP["MCP Connection Manager<br/>Pallas Agents"]
    end
    subgraph "Speech Services"
        STT["Speaches STT<br/>Whisper (perseus)"]
        TTS["Speaches TTS<br/>Kokoro (perseus)"]
    end
    subgraph "AI Agents"
        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
    end
    NVS -->|"POST /register"| REG
    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
    VAD -->|"commit on silence"| WS
    WS --> PIPE
    PIPE -->|"POST /v1/audio/transcriptions"| STT
    PIPE -->|"MCP call_tool"| MCP
    MCP -->|"MCP Streamable HTTP"| PALLAS
    PIPE -->|"POST /v1/audio/speech"| TTS
    STT -->|"transcript text"| PIPE
    PALLAS -->|"response text"| MCP
    MCP --> PIPE
    TTS -->|"PCM audio stream"| PIPE
    PIPE --> DB
    PIPE --> WS
    WS -->|"audio + status"| SPK
    WS -->|"status events"| LED
```
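The "JSON + base64 audio" framing on the ESP32 → WebSocket edge can be sketched as follows. This is illustrative: the event names match those used in the sequence diagrams below, but the exact schema is defined by the Daedalus voice module.

```python
import base64
import json


def encode_audio_event(pcm_chunk: bytes) -> str:
    """Frame a chunk of 16 kHz / 16-bit mono PCM as a JSON event string."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm_chunk).decode("ascii"),
    })


def decode_audio_event(message: str) -> bytes:
    """Recover raw PCM bytes from an event carrying an `audio` field
    (e.g. response.audio.delta on the downlink)."""
    event = json.loads(message)
    return base64.b64decode(event["audio"])
```

Base64 inflates the audio by roughly a third, which is part of why binary WebSocket frames are noted as a future option in the design decisions below.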
## Device Registration & Lifecycle

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant DAE as Daedalus
    participant UI as Daedalus Web UI
    Note over ESP: First boot — generate UUID, store in NVS
    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
    DAE->>ESP: {status: "registered"}
    Note over UI: User sees new device in Settings → Voice Devices
    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}
    Note over ESP: Wake word detected
    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
    ESP->>DAE: session.start
    DAE->>ESP: session.created {session_id, conversation_id}
```
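The first-boot registration step can be sketched in Python as below. The real logic lives in the ESP32 firmware with the UUID persisted in NVS; here a dict stands in for NVS, and any field beyond `device_id`/`firmware` is an assumption.

```python
import uuid


def registration_payload(nvs: dict, firmware: str) -> dict:
    """Build the body for POST /api/v1/voice/devices/register.

    Mirrors the sequence diagram: a UUID is generated exactly once on
    first boot, persisted (NVS on the real device, a dict here), and
    reused on every later boot so the device keeps its identity.
    """
    if "device_id" not in nvs:
        nvs["device_id"] = str(uuid.uuid4())  # first boot only
    return {"device_id": nvs["device_id"], "firmware": firmware}
```

Because the same `device_id` is sent on every boot, registration is idempotent on the Daedalus side and the user's workspace/agent assignment in the web UI survives firmware updates.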
## Voice Pipeline

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant GW as Daedalus Voice
    participant STT as Speaches STT
    participant MCP as MCP Manager
    participant PALLAS as Pallas Agent
    participant TTS as Speaches TTS
    participant DB as PostgreSQL
    Note over ESP: VAD: speech detected
    loop Audio streaming
        ESP->>GW: input_audio_buffer.append (base64 PCM)
    end
    Note over ESP: VAD: silence detected
    ESP->>GW: input_audio_buffer.commit
    GW->>ESP: status: transcribing
    GW->>STT: POST /v1/audio/transcriptions (WAV)
    STT->>GW: {"text": "..."}
    GW->>ESP: transcript.done
    GW->>DB: Save Message(role="user", content=transcript)
    GW->>ESP: status: thinking
    GW->>MCP: call_tool(workspace, agent, tool, {message})
    MCP->>PALLAS: MCP Streamable HTTP
    PALLAS->>MCP: CallToolResult
    MCP->>GW: response text
    GW->>ESP: response.text.done
    GW->>DB: Save Message(role="assistant", content=response)
    GW->>ESP: status: speaking
    GW->>TTS: POST /v1/audio/speech
    TTS->>GW: PCM audio stream
    loop Audio chunks
        GW->>ESP: response.audio.delta (base64 PCM)
    end
    GW->>ESP: response.audio.done
    GW->>ESP: response.done
    GW->>ESP: status: listening
    Note over GW: Timeout timer starts (120s default)
    alt Timeout — no speech
        GW->>ESP: session.end {reason: "timeout"}
    else Agent ends conversation
        GW->>ESP: session.end {reason: "agent"}
    else User speaks again
        Note over ESP: VAD triggers next turn (same conversation)
    end
```
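One pipeline turn from the diagram above can be sketched as a single async function. This is a hypothetical outline: the client objects (`stt`, `mcp`, `tts`, `db`), their method names, and the `"chat"` tool name are assumptions for illustration, not the actual Daedalus voice API.

```python
import base64


async def run_turn(ws, stt, mcp, tts, db, audio: bytes, ctx) -> None:
    """Drive one STT → MCP → TTS turn, mirroring the sequence diagram.

    `ws` is the device WebSocket; `ctx` carries conversation/workspace
    state (field names are assumptions).
    """
    await ws.send_json({"type": "status", "state": "transcribing"})
    transcript = await stt.transcribe(audio)  # POST /v1/audio/transcriptions
    await ws.send_json({"type": "transcript.done", "text": transcript})
    await db.save_message(ctx["conversation_id"], "user", transcript)

    await ws.send_json({"type": "status", "state": "thinking"})
    # Same MCP path as text chat: call_tool over Streamable HTTP.
    reply = await mcp.call_tool(ctx["workspace"], ctx["agent"], "chat",
                                {"message": transcript})
    await ws.send_json({"type": "response.text.done", "text": reply})
    await db.save_message(ctx["conversation_id"], "assistant", reply)

    await ws.send_json({"type": "status", "state": "speaking"})
    async for chunk in tts.stream_speech(reply):  # POST /v1/audio/speech
        await ws.send_json({
            "type": "response.audio.delta",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
    await ws.send_json({"type": "response.audio.done"})
    await ws.send_json({"type": "response.done"})
    await ws.send_json({"type": "response.done" if False else "status",
                        "state": "listening"})
```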
## Component Communication
| Source | Destination | Protocol | Format |
|---|---|---|---|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |
## Network Topology

```mermaid
graph LR
    ESP["ESP32<br/>WiFi"]
    DAE["Daedalus<br/>puck.incus:8000"]
    STT["Speaches STT<br/>perseus.helu.ca:22070"]
    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
    DB["PostgreSQL<br/>portia.incus:5432"]
    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
    DAE -->|"HTTP"| STT
    DAE -->|"HTTP"| TTS
    DAE -->|"MCP"| PALLAS
    DAE -->|"SQL"| DB
```
## Audio Flow

```mermaid
graph LR
    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
    GW -->|"WAV header wrap"| STT["Speaches STT"]
    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```
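The two server-side transforms in this flow, the WAV header wrap before STT and the 24 kHz → 16 kHz resample after TTS, can be sketched as below. This is a minimal illustration, not the actual audio utilities; in particular the resampler just drops every third sample, whereas a production pipeline would low-pass filter first (e.g. `scipy.signal.resample_poly`) to avoid aliasing.

```python
import struct


def wrap_wav(pcm: bytes, sample_rate: int = 16_000) -> bytes:
    """Prepend a 44-byte RIFF/WAVE header to raw S16LE mono PCM,
    as needed for the multipart upload to Speaches STT."""
    channels, bits = 1, 16
    byte_rate = sample_rate * channels * bits // 8
    block_align = channels * bits // 8
    return (b"RIFF" + struct.pack("<I", 36 + len(pcm)) + b"WAVE"
            + b"fmt " + struct.pack("<IHHIIHH", 16, 1, channels,
                                    sample_rate, byte_rate, block_align, bits)
            + b"data" + struct.pack("<I", len(pcm)) + pcm)


def resample_24k_to_16k(pcm: bytes) -> bytes:
    """Naive 3:2 decimation (keep 2 of every 3 samples) for the
    24 kHz TTS output → 16 kHz speaker path. Illustrative only."""
    samples = struct.unpack("<%dh" % (len(pcm) // 2), pcm)
    kept = [s for i, s in enumerate(samples) if i % 3 != 2]
    return struct.pack("<%dh" % len(kept), *kept)
```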
## Key Design Decisions
| Decision | Why |
|---|---|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
## Repository Structure

```
stentor/                          # This repository
├── docs/
│   ├── stentor.md                # Usage guide (updated)
│   └── architecture.md           # This file
├── stentor-ear/                  # ESP32 firmware
│   ├── main/
│   ├── components/
│   └── ...
├── stentor-gateway/              # Legacy — gateway code migrated to Daedalus
│   └── ...
└── README.md

daedalus/                         # Separate repository
├── backend/daedalus/voice/       # Voice module (migrated from stentor-gateway)
│   ├── audio.py
│   ├── models.py
│   ├── pipeline.py
│   ├── stt_client.py
│   └── tts_client.py
├── backend/daedalus/api/v1/
│   └── voice.py                  # Voice REST + WebSocket endpoints
└── docs/
    └── stentor_integration.md    # Full integration specification
```