# Stentor Architecture

> Version 0.2.0: Daedalus-integrated architecture

## Overview

Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:

1. **stentor-ear**: ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and voice activity detection (VAD)
2. **Daedalus voice module**: Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline

The Python gateway that was previously a standalone service (`stentor-gateway/`) has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.

## System Architecture

```mermaid
graph TB
    subgraph "ESP32-S3-AUDIO-Board"
        MIC["Mic Array<br/>ES7210 ADC"]
        WW["Wake Word<br/>ESP-SR"]
        VAD["VAD<br/>On-Device"]
        SPK["Speaker<br/>ES8311 DAC"]
        LED["LED Ring<br/>WS2812B"]
        NVS["NVS<br/>Device UUID"]
        MIC --> WW
        MIC --> VAD
    end

    subgraph "Daedalus Backend (puck.incus)"
        REG["Device Registry<br/>/api/v1/voice/devices"]
        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
        DB["PostgreSQL<br/>Conversations & Messages"]
        MCP["MCP Connection Manager<br/>Pallas Agents"]
    end

    subgraph "Speech Services"
        STT["Speaches STT<br/>Whisper (perseus)"]
        TTS["Speaches TTS<br/>Kokoro (perseus)"]
    end

    subgraph "AI Agents"
        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
    end

    NVS -->|"POST /register"| REG
    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
    VAD -->|"commit on silence"| WS
    WS --> PIPE
    PIPE -->|"POST /v1/audio/transcriptions"| STT
    PIPE -->|"MCP call_tool"| MCP
    MCP -->|"MCP Streamable HTTP"| PALLAS
    PIPE -->|"POST /v1/audio/speech"| TTS
    STT -->|"transcript text"| PIPE
    PALLAS -->|"response text"| MCP
    MCP --> PIPE
    TTS -->|"PCM audio stream"| PIPE
    PIPE --> DB
    PIPE --> WS
    WS -->|"audio + status"| SPK
    WS -->|"status events"| LED
```

## Device Registration & Lifecycle

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant DAE as Daedalus
    participant UI as Daedalus Web UI

    Note over ESP: First boot: generate UUID, store in NVS
    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
    DAE->>ESP: {status: "registered"}

    Note over UI: User sees new device in Settings → Voice Devices
    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}

    Note over ESP: Wake word detected
    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
    ESP->>DAE: session.start
    DAE->>ESP: session.created {session_id, conversation_id}
```

## Voice Pipeline

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant GW as Daedalus Voice
    participant STT as Speaches STT
    participant MCP as MCP Manager
    participant PALLAS as Pallas Agent
    participant TTS as Speaches TTS
    participant DB as PostgreSQL

    Note over ESP: VAD: speech detected
    loop Audio streaming
        ESP->>GW: input_audio_buffer.append (base64 PCM)
    end
    Note over ESP: VAD: silence detected
    ESP->>GW: input_audio_buffer.commit

    GW->>ESP: status: transcribing
    GW->>STT: POST /v1/audio/transcriptions (WAV)
    STT->>GW: {"text": "..."}
    GW->>ESP: transcript.done
    GW->>DB: Save Message(role="user", content=transcript)

    GW->>ESP: status: thinking
    GW->>MCP: call_tool(workspace, agent, tool, {message})
    MCP->>PALLAS: MCP Streamable HTTP
    PALLAS->>MCP: CallToolResult
    MCP->>GW: response text
    GW->>ESP: response.text.done
    GW->>DB: Save Message(role="assistant", content=response)

    GW->>ESP: status: speaking
    GW->>TTS: POST /v1/audio/speech
    TTS->>GW: PCM audio stream
    loop Audio chunks
        GW->>ESP: response.audio.delta (base64 PCM)
    end
    GW->>ESP: response.audio.done
    GW->>ESP: response.done

    GW->>ESP: status: listening
    Note over GW: Timeout timer starts (120s default)

    alt Timeout, no speech
        GW->>ESP: session.end {reason: "timeout"}
    else Agent ends conversation
        GW->>ESP: session.end {reason: "agent"}
    else User speaks again
        Note over ESP: VAD triggers next turn (same conversation)
    end
```

## Component Communication

| Source | Destination | Protocol | Format |
|--------|------------|----------|--------|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |

## Network Topology

```mermaid
graph LR
    ESP["ESP32<br/>WiFi"]
    DAE["Daedalus<br/>puck.incus:8000"]
    STT["Speaches STT<br/>perseus.helu.ca:22070"]
    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
    DB["PostgreSQL<br/>portia.incus:5432"]

    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
    DAE -->|"HTTP"| STT
    DAE -->|"HTTP"| TTS
    DAE -->|"MCP"| PALLAS
    DAE -->|"SQL"| DB
```

## Audio Flow

```mermaid
graph LR
    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
    GW -->|"WAV header wrap"| STT["Speaches STT"]

    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```

## Key Design Decisions

| Decision | Why |
|----------|-----|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, and frontend; no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |

## Repository Structure

```
stentor/                          # This repository
├── docs/
│   ├── stentor.md                # Usage guide (updated)
│   └── architecture.md           # This file
├── stentor-ear/                  # ESP32 firmware
│   ├── main/
│   ├── components/
│   └── ...
├── stentor-gateway/              # Legacy: gateway code migrated to Daedalus
│   └── ...
└── README.md

daedalus/                         # Separate repository
├── backend/daedalus/voice/       # Voice module (migrated from stentor-gateway)
│   ├── audio.py
│   ├── models.py
│   ├── pipeline.py
│   ├── stt_client.py
│   └── tts_client.py
├── backend/daedalus/api/v1/
│   └── voice.py                  # Voice REST + WebSocket endpoints
└── docs/
    └── stentor_integration.md    # Full integration specification
```
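
To make the uplink half of the Audio Flow section concrete, here is a minimal Python sketch of the two transformations it names: base64-encoding a PCM chunk into an `input_audio_buffer.append` message, and wrapping buffered PCM in a WAV container before the STT request. The JSON field names `type` and `audio` and all function names are assumptions made for illustration; the integration specification is the place to check the actual message schema.

```python
import base64
import io
import json
import wave

SAMPLE_RATE_HZ = 16_000   # microphone rate from the Audio Flow diagram
SAMPLE_WIDTH_BYTES = 2    # 16-bit samples (PCM S16LE)
CHANNELS = 1              # mono


def encode_append_event(pcm: bytes) -> str:
    """Build an append message. Field names are assumed, not from the spec."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def decode_append_event(message: str) -> bytes:
    """Recover the raw PCM bytes from an append message."""
    event = json.loads(message)
    if event.get("type") != "input_audio_buffer.append":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    return base64.b64decode(event["audio"])


def wrap_pcm_as_wav(pcm: bytes) -> bytes:
    """Prepend a WAV header so the buffer can be uploaded as a WAV file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        wav.writeframes(pcm)
    return buf.getvalue()
```

On the server side, the decoded chunks would be concatenated until `input_audio_buffer.commit` arrives, and the result of `wrap_pcm_as_wav` would then form the file part of the multipart upload to `/v1/audio/transcriptions`.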
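
The downlink half of the Audio Flow section resamples TTS output from 24 kHz to 16 kHz before it reaches the speaker. The document does not say which resampler Daedalus uses, so the following is only a sketch of one simple option: linear interpolation at the fixed 3:2 ratio, written against the standard library. The function name is hypothetical, and a production resampler would normally low-pass filter first to limit aliasing.

```python
import array
import sys


def resample_24k_to_16k(pcm_24k: bytes) -> bytes:
    """Linear-interpolation resampler for mono S16LE audio (illustrative only).

    Every 3 input samples yield 2 output samples. Output sample j sits at
    input position 1.5 * j: even j lands exactly on an input sample, odd j
    lands halfway between two input samples.
    """
    usable = len(pcm_24k) - (len(pcm_24k) % 2)  # drop a trailing odd byte
    src = array.array("h")
    src.frombytes(pcm_24k[:usable])
    if sys.byteorder == "big":
        src.byteswap()  # input is little-endian on the wire

    out = array.array("h")
    n_out = len(src) * 2 // 3
    for j in range(n_out):
        base = (j // 2) * 3
        if j % 2 == 0:
            out.append(src[base])
        else:
            # Midpoint of the two neighbouring input samples.
            out.append((src[base + 1] + src[base + 2]) // 2)

    if sys.byteorder == "big":
        out.byteswap()  # emit little-endian again
    return out.tobytes()
```

One second of 24 kHz audio (24,000 samples) becomes 16,000 samples, which matches the 16kHz/16-bit/mono format the speaker path expects.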
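
The end of the Voice Pipeline diagram describes an idle timer: after `status: listening`, the session either receives another turn or ends with `session.end {reason: "timeout"}` after 120 seconds by default. A minimal sketch of that branch using `asyncio` might look like the following. The queue-based event source, the function name, and the `type` field are assumptions for illustration rather than the actual Daedalus implementation.

```python
import asyncio

DEFAULT_IDLE_TIMEOUT_S = 120.0  # default from the Voice Pipeline diagram


async def wait_for_next_turn(
    events: asyncio.Queue,
    timeout_s: float = DEFAULT_IDLE_TIMEOUT_S,
) -> dict:
    """Return the next client event, or a session.end event on idle timeout.

    `events` is a hypothetical queue fed by the WebSocket receive loop.
    """
    try:
        return await asyncio.wait_for(events.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        # No speech arrived in time, so the server closes the session.
        return {"type": "session.end", "reason": "timeout"}
```

A caller would loop on `wait_for_next_turn`, forwarding a returned `session.end` event to the device and closing the WebSocket, or processing the next turn within the same conversation otherwise.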