# Stentor Architecture
> Version 0.2.0 — Daedalus-integrated architecture
## Overview
Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:
1. **stentor-ear** — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
2. **Daedalus voice module** — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline
The Python gateway, previously a standalone service (`stentor-gateway/`), has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.
## System Architecture
```mermaid
graph TB
subgraph "ESP32-S3-AUDIO-Board"
MIC["Mic Array<br/>ES7210 ADC"]
WW["Wake Word<br/>ESP-SR"]
VAD["VAD<br/>On-Device"]
SPK["Speaker<br/>ES8311 DAC"]
LED["LED Ring<br/>WS2812B"]
NVS["NVS<br/>Device UUID"]
MIC --> WW
MIC --> VAD
end
subgraph "Daedalus Backend (puck.incus)"
REG["Device Registry<br/>/api/v1/voice/devices"]
WS["WebSocket Server<br/>/api/v1/voice/realtime"]
PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
DB["PostgreSQL<br/>Conversations & Messages"]
MCP["MCP Connection Manager<br/>Pallas Agents"]
end
subgraph "Speech Services"
STT["Speaches STT<br/>Whisper (perseus)"]
TTS["Speaches TTS<br/>Kokoro (perseus)"]
end
subgraph "AI Agents"
PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
end
NVS -->|"POST /register"| REG
WW -->|"WebSocket<br/>JSON + base64 audio"| WS
VAD -->|"commit on silence"| WS
WS --> PIPE
PIPE -->|"POST /v1/audio/transcriptions"| STT
PIPE -->|"MCP call_tool"| MCP
MCP -->|"MCP Streamable HTTP"| PALLAS
PIPE -->|"POST /v1/audio/speech"| TTS
STT -->|"transcript text"| PIPE
PALLAS -->|"response text"| MCP
MCP --> PIPE
TTS -->|"PCM audio stream"| PIPE
PIPE --> DB
PIPE --> WS
WS -->|"audio + status"| SPK
WS -->|"status events"| LED
```
## Device Registration & Lifecycle
```mermaid
sequenceDiagram
participant ESP as ESP32
participant DAE as Daedalus
participant UI as Daedalus Web UI
Note over ESP: First boot — generate UUID, store in NVS
ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
DAE->>ESP: {status: "registered"}
Note over UI: User sees new device in Settings → Voice Devices
UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}
Note over ESP: Wake word detected
ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
ESP->>DAE: session.start
DAE->>ESP: session.created {session_id, conversation_id}
```
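The registration and connection steps above can be sketched from the device's point of view. This is a minimal Python sketch aimed at a desktop test client, not the firmware itself. The paths and the `device_id` and `firmware` fields come from the diagram; the helper names (`new_device_id`, `registration_body`, `realtime_url`), the `ws://` scheme, and the example values are assumptions made for illustration.

```python
import json
import uuid
from urllib.parse import urlencode

# Paths taken from the sequence diagram.
REGISTER_PATH = "/api/v1/voice/devices/register"
REALTIME_PATH = "/api/v1/voice/realtime"


def new_device_id() -> str:
    """Generate the UUID a device would create once and keep in NVS."""
    return str(uuid.uuid4())


def registration_body(device_id: str, firmware: str) -> bytes:
    """JSON body for POST /api/v1/voice/devices/register."""
    payload = {"device_id": device_id, "firmware": firmware}
    return json.dumps(payload).encode("utf-8")


def realtime_url(host: str, device_id: str) -> str:
    """WebSocket URL opened after the wake word (scheme is an assumption)."""
    query = urlencode({"device_id": device_id})
    return f"ws://{host}{REALTIME_PATH}?{query}"


if __name__ == "__main__":
    device_id = new_device_id()
    print(registration_body(device_id, "0.2.0"))  # firmware string is illustrative
    print(realtime_url("puck.incus:8000", device_id))
```

A real client would send the body with any HTTP library and then open the URL with a WebSocket client; both are left out here to keep the sketch free of third-party dependencies.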
## Voice Pipeline
```mermaid
sequenceDiagram
participant ESP as ESP32
participant GW as Daedalus Voice
participant STT as Speaches STT
participant MCP as MCP Manager
participant PALLAS as Pallas Agent
participant TTS as Speaches TTS
participant DB as PostgreSQL
Note over ESP: VAD: speech detected
loop Audio streaming
ESP->>GW: input_audio_buffer.append (base64 PCM)
end
Note over ESP: VAD: silence detected
ESP->>GW: input_audio_buffer.commit
GW->>ESP: status: transcribing
GW->>STT: POST /v1/audio/transcriptions (WAV)
STT->>GW: {"text": "..."}
GW->>ESP: transcript.done
GW->>DB: Save Message(role="user", content=transcript)
GW->>ESP: status: thinking
GW->>MCP: call_tool(workspace, agent, tool, {message})
MCP->>PALLAS: MCP Streamable HTTP
PALLAS->>MCP: CallToolResult
MCP->>GW: response text
GW->>ESP: response.text.done
GW->>DB: Save Message(role="assistant", content=response)
GW->>ESP: status: speaking
GW->>TTS: POST /v1/audio/speech
TTS->>GW: PCM audio stream
loop Audio chunks
GW->>ESP: response.audio.delta (base64 PCM)
end
GW->>ESP: response.audio.done
GW->>ESP: response.done
GW->>ESP: status: listening
Note over GW: Timeout timer starts (120s default)
alt Timeout — no speech
GW->>ESP: session.end {reason: "timeout"}
else Agent ends conversation
GW->>ESP: session.end {reason: "agent"}
else User speaks again
Note over ESP: VAD triggers next turn (same conversation)
end
```
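The ordering of events within one turn can be sketched as a single function. This is a simplified, synchronous illustration rather than the actual `pipeline.py`: the event names follow the diagram, but the function name `run_turn`, the `type`, `state`, `text`, and `audio` keys, and the callable parameters are assumptions. Database writes and error handling are omitted.

```python
import base64
from typing import Callable, Iterable


def run_turn(
    audio: bytes,
    transcribe: Callable[[bytes], str],
    ask_agent: Callable[[str], str],
    synthesize: Callable[[str], Iterable[bytes]],
    send: Callable[[dict], None],
) -> None:
    """Run one committed utterance through STT, the agent, and TTS."""
    # 1. Speech to text.
    send({"type": "status", "state": "transcribing"})
    transcript = transcribe(audio)
    send({"type": "transcript.done", "text": transcript})

    # 2. Agent call (an MCP call_tool in the real system).
    send({"type": "status", "state": "thinking"})
    reply = ask_agent(transcript)
    send({"type": "response.text.done", "text": reply})

    # 3. Text to speech, forwarded chunk by chunk as base64 PCM.
    send({"type": "status", "state": "speaking"})
    for chunk in synthesize(reply):
        encoded = base64.b64encode(chunk).decode("ascii")
        send({"type": "response.audio.delta", "audio": encoded})
    send({"type": "response.audio.done"})
    send({"type": "response.done"})

    # 4. Back to listening; the idle timeout would start here.
    send({"type": "status", "state": "listening"})


if __name__ == "__main__":
    run_turn(
        b"\x00\x00",
        transcribe=lambda pcm: "hello",
        ask_agent=lambda text: "hi there",
        synthesize=lambda text: [b"ab", b"cd"],
        send=print,
    )
```

Passing the three services in as plain callables keeps the ordering logic testable without a network; the stubs in the `__main__` block stand in for Speaches and Pallas.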
## Component Communication
| Source | Destination | Protocol | Format |
|--------|------------|----------|--------|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |
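The "JSON + base64 PCM" framing in the first row can be illustrated with a short sketch. The event names `input_audio_buffer.append` and `input_audio_buffer.commit` appear in the voice pipeline diagram; the `type` and `audio` keys and the helper names are assumptions, so the real field names should be checked against the integration specification.

```python
import base64
import json


def append_event(pcm: bytes) -> str:
    """Wrap a chunk of raw PCM in a JSON text frame (key names assumed)."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def commit_event() -> str:
    """Signal that VAD detected silence and the buffer is complete."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def decode_audio(message: str) -> bytes:
    """Recover the raw PCM bytes from an event that carries audio."""
    event = json.loads(message)
    return base64.b64decode(event["audio"])


if __name__ == "__main__":
    chunk = bytes(640)  # 20 ms of 16 kHz, 16-bit mono silence
    frame = append_event(chunk)
    assert decode_audio(frame) == chunk
    print(len(chunk), "PCM bytes ->", len(frame), "frame bytes")
```

Base64 expands the payload by a factor of 4/3, so the 32,000 bytes per second produced by 16 kHz, 16-bit mono audio become roughly 43 kB per second before JSON overhead.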
## Network Topology
```mermaid
graph LR
ESP["ESP32<br/>WiFi"]
DAE["Daedalus<br/>puck.incus:8000"]
STT["Speaches STT<br/>perseus.helu.ca:22070"]
TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
DB["PostgreSQL<br/>portia.incus:5432"]
ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
DAE -->|"HTTP"| STT
DAE -->|"HTTP"| TTS
DAE -->|"MCP"| PALLAS
DAE -->|"SQL"| DB
```
## Audio Flow
```mermaid
graph LR
MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
GW -->|"WAV header wrap"| STT["Speaches STT"]
TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```
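The two conversions in this flow, the WAV header wrap and the 24 kHz to 16 kHz resample, can be sketched with the standard library alone. This is an illustration under assumptions, not the contents of `audio.py`: the function names are hypothetical, and linear interpolation is swapped in as the simplest possible resampler.

```python
import io
import sys
import wave
from array import array


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Prepend a WAV header to raw 16-bit mono PCM for the STT upload."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


def resample_linear(pcm: bytes, src_rate: int = 24000,
                    dst_rate: int = 16000) -> bytes:
    """Downsample S16LE mono PCM by linear interpolation.

    Expects an even number of bytes. No low-pass filter is applied,
    so content above half the target rate will alias.
    """
    src = array("h")
    src.frombytes(pcm)
    if sys.byteorder == "big":
        src.byteswap()           # input is little-endian on the wire
    if not src:
        return b""
    out = array("h")
    for i in range(len(src) * dst_rate // src_rate):
        pos = i * src_rate / dst_rate   # fractional index into the source
        left = int(pos)
        right = min(left + 1, len(src) - 1)
        frac = pos - left
        out.append(round(src[left] + (src[right] - src[left]) * frac))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()


if __name__ == "__main__":
    tts_chunk = b"\xe8\x03" * 2400           # 100 ms at 24 kHz, constant 1000
    speaker_chunk = resample_linear(tts_chunk)
    print(len(speaker_chunk) // 2, "samples at 16 kHz")
    print(len(pcm_to_wav(speaker_chunk)), "bytes including WAV header")
```

Linear interpolation keeps the sketch short, but a production path would more likely use a filtered resampler to avoid aliasing artifacts in the speaker output.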
## Key Design Decisions
| Decision | Why |
|----------|-----|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames are planned for a future version |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close a session: a silence timeout, or the agent recognizing a goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
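The "Timeout + LLM-initiated end" decision can be sketched as a small idle timer plus a helper that builds the closing event. The 120-second default and the `timeout` and `agent` reasons come from the voice pipeline diagram; the class name `IdleTimeout`, the `session_end` helper, and the `type` key are assumptions for illustration.

```python
import time
from typing import Callable


class IdleTimeout:
    """Tracks the listening window that follows each completed turn."""

    def __init__(self, seconds: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._seconds = seconds
        self._clock = clock          # injectable so tests need not sleep
        self._deadline = clock() + seconds

    def restart(self) -> None:
        """Call when a turn finishes and the session returns to listening."""
        self._deadline = self._clock() + self._seconds

    def expired(self) -> bool:
        """True once the window has passed with no new speech."""
        return self._clock() >= self._deadline


def session_end(reason: str) -> dict:
    """Build the closing event; only the two documented reasons are allowed."""
    if reason not in ("timeout", "agent"):
        raise ValueError(f"unknown session end reason: {reason}")
    return {"type": "session.end", "reason": reason}


if __name__ == "__main__":
    timer = IdleTimeout(seconds=0.05)
    time.sleep(0.1)
    if timer.expired():
        print(session_end("timeout"))
```

Using a monotonic clock avoids spurious timeouts when the wall clock is adjusted, and injecting the clock lets the timer be tested without waiting in real time.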
## Repository Structure
```
stentor/ # This repository
├── docs/
│ ├── stentor.md # Usage guide (updated)
│ └── architecture.md # This file
├── stentor-ear/ # ESP32 firmware
│ ├── main/
│ ├── components/
│ └── ...
├── stentor-gateway/ # Legacy — gateway code migrated to Daedalus
│ └── ...
└── README.md
daedalus/ # Separate repository
├── backend/daedalus/voice/ # Voice module (migrated from stentor-gateway)
│ ├── audio.py
│ ├── models.py
│ ├── pipeline.py
│ ├── stt_client.py
│ └── tts_client.py
├── backend/daedalus/api/v1/
│ └── voice.py # Voice REST + WebSocket endpoints
└── docs/
└── stentor_integration.md # Full integration specification
```