stentor/docs/architecture.md
Robert Helewka 912593b796 feat: scaffold stentor-gateway with FastAPI voice pipeline
Initialize the stentor-gateway project with WebSocket-based voice
pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
2026-03-21 19:11:48 +00:00


Stentor Architecture

Version 0.2.0 — Daedalus-integrated architecture

Overview

Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:

  1. stentor-ear — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
  2. Daedalus voice module — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline

The Python gateway, previously a standalone service (stentor-gateway/), has been merged into the Daedalus backend as daedalus/backend/daedalus/voice/. See daedalus/docs/stentor_integration.md for the full integration specification.

System Architecture

graph TB
    subgraph "ESP32-S3-AUDIO-Board"
        MIC["Mic Array<br/>ES7210 ADC"]
        WW["Wake Word<br/>ESP-SR"]
        VAD["VAD<br/>On-Device"]
        SPK["Speaker<br/>ES8311 DAC"]
        LED["LED Ring<br/>WS2812B"]
        NVS["NVS<br/>Device UUID"]
        MIC --> WW
        MIC --> VAD
    end

    subgraph "Daedalus Backend (puck.incus)"
        REG["Device Registry<br/>/api/v1/voice/devices"]
        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
        DB["PostgreSQL<br/>Conversations & Messages"]
        MCP["MCP Connection Manager<br/>Pallas Agents"]
    end

    subgraph "Speech Services"
        STT["Speaches STT<br/>Whisper (perseus)"]
        TTS["Speaches TTS<br/>Kokoro (perseus)"]
    end

    subgraph "AI Agents"
        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
    end

    NVS -->|"POST /register"| REG
    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
    VAD -->|"commit on silence"| WS
    WS --> PIPE
    PIPE -->|"POST /v1/audio/transcriptions"| STT
    PIPE -->|"MCP call_tool"| MCP
    MCP -->|"MCP Streamable HTTP"| PALLAS
    PIPE -->|"POST /v1/audio/speech"| TTS
    STT -->|"transcript text"| PIPE
    PALLAS -->|"response text"| MCP
    MCP --> PIPE
    TTS -->|"PCM audio stream"| PIPE
    PIPE --> DB
    PIPE --> WS
    WS -->|"audio + status"| SPK
    WS -->|"status events"| LED

Device Registration & Lifecycle

sequenceDiagram
    participant ESP as ESP32
    participant DAE as Daedalus
    participant UI as Daedalus Web UI

    Note over ESP: First boot — generate UUID, store in NVS
    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
    DAE->>ESP: {status: "registered"}

    Note over UI: User sees new device in Settings → Voice Devices
    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}

    Note over ESP: Wake word detected
    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
    ESP->>DAE: session.start
    DAE->>ESP: session.created {session_id, conversation_id}
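
The registration and connection steps above can be sketched from the point of view of a test client. This is a minimal illustration, not the firmware (which runs on the ESP32): the paths and the `device_id` / `firmware` keys come from the diagram, while the helper names, the firmware string format, and the `ws://daedalus.example` base URL are assumptions.

```python
import json
import uuid
from urllib.parse import urlencode

# Paths taken from the sequence diagram above.
REGISTER_PATH = "/api/v1/voice/devices/register"
REALTIME_PATH = "/api/v1/voice/realtime"


def new_device_id() -> str:
    """Generate the identifier a device would create on first boot.

    On real hardware this value would be written to NVS once and reused.
    """
    return str(uuid.uuid4())


def registration_body(device_id: str, firmware: str) -> str:
    """Build the JSON body for the registration POST.

    The two keys mirror the diagram; the value format is an assumption.
    """
    return json.dumps({"device_id": device_id, "firmware": firmware})


def realtime_url(base: str, device_id: str) -> str:
    """Build the WebSocket URL that carries the device_id query parameter."""
    return f"{base}{REALTIME_PATH}?{urlencode({'device_id': device_id})}"


if __name__ == "__main__":
    device_id = new_device_id()
    print(REGISTER_PATH, registration_body(device_id, "0.2.0"))
    # Hypothetical host; the real one depends on the deployment.
    print(realtime_url("ws://daedalus.example", device_id))
```

Keeping the identifier generation separate from the request builders reflects the diagram: the UUID is created once, while registration and WebSocket connections reuse it.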

Voice Pipeline

sequenceDiagram
    participant ESP as ESP32
    participant GW as Daedalus Voice
    participant STT as Speaches STT
    participant MCP as MCP Manager
    participant PALLAS as Pallas Agent
    participant TTS as Speaches TTS
    participant DB as PostgreSQL

    Note over ESP: VAD: speech detected
    loop Audio streaming
        ESP->>GW: input_audio_buffer.append (base64 PCM)
    end
    Note over ESP: VAD: silence detected
    ESP->>GW: input_audio_buffer.commit

    GW->>ESP: status: transcribing
    GW->>STT: POST /v1/audio/transcriptions (WAV)
    STT->>GW: {"text": "..."}
    GW->>ESP: transcript.done
    GW->>DB: Save Message(role="user", content=transcript)

    GW->>ESP: status: thinking
    GW->>MCP: call_tool(workspace, agent, tool, {message})
    MCP->>PALLAS: MCP Streamable HTTP
    PALLAS->>MCP: CallToolResult
    MCP->>GW: response text
    GW->>ESP: response.text.done
    GW->>DB: Save Message(role="assistant", content=response)

    GW->>ESP: status: speaking
    GW->>TTS: POST /v1/audio/speech
    TTS->>GW: PCM audio stream

    loop Audio chunks
        GW->>ESP: response.audio.delta (base64 PCM)
    end

    GW->>ESP: response.audio.done
    GW->>ESP: response.done
    GW->>ESP: status: listening

    Note over GW: Timeout timer starts (120s default)

    alt Timeout — no speech
        GW->>ESP: session.end {reason: "timeout"}
    else Agent ends conversation
        GW->>ESP: session.end {reason: "agent"}
    else User speaks again
        Note over ESP: VAD triggers next turn (same conversation)
    end
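
The event order in the diagram above can be expressed as a small async generator. This is a sketch under stated assumptions, not the actual `pipeline.py`: the event names come from the diagram, but the `type` / `state` / `text` / `audio` field names, the function names, and the stub STT, agent, and TTS callables are hypothetical, and database writes are omitted.

```python
import asyncio
import base64
from typing import AsyncIterator, Awaitable, Callable, Optional

Transcribe = Callable[[bytes], Awaitable[str]]
Ask = Callable[[str], Awaitable[str]]
Synthesize = Callable[[str], AsyncIterator[bytes]]


async def run_turn(
    pcm: bytes, stt: Transcribe, agent: Ask, tts: Synthesize
) -> AsyncIterator[dict]:
    """Yield the events for one voice turn in the order shown in the diagram."""
    yield {"type": "status", "state": "transcribing"}
    transcript = await stt(pcm)
    yield {"type": "transcript.done", "text": transcript}

    yield {"type": "status", "state": "thinking"}
    reply = await agent(transcript)
    yield {"type": "response.text.done", "text": reply}

    yield {"type": "status", "state": "speaking"}
    async for chunk in tts(reply):
        audio = base64.b64encode(chunk).decode("ascii")
        yield {"type": "response.audio.delta", "audio": audio}
    yield {"type": "response.audio.done"}
    yield {"type": "response.done"}
    yield {"type": "status", "state": "listening"}


async def next_turn_or_timeout(
    commits: asyncio.Queue, timeout_s: float = 120.0
) -> Optional[bytes]:
    """Wait for the next committed audio buffer; None means the session timed out."""
    try:
        return await asyncio.wait_for(commits.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return None


# Stand-ins for the real STT, agent, and TTS clients (hypothetical).
async def fake_stt(pcm: bytes) -> str:
    return "hello"


async def fake_agent(text: str) -> str:
    return f"you said {text}"


async def fake_tts(text: str) -> AsyncIterator[bytes]:
    for _ in range(2):
        yield b"\x00\x00" * 4  # four silent 16-bit samples


async def collect_demo_events() -> list:
    return [e async for e in run_turn(b"", fake_stt, fake_agent, fake_tts)]


if __name__ == "__main__":
    for event in asyncio.run(collect_demo_events()):
        print(event["type"])
```

Passing the three services in as callables keeps the turn logic testable without network access. A caller would send a `session.end` event with reason `timeout` when `next_turn_or_timeout` returns `None`.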

Component Communication

| Source | Destination | Protocol | Format |
|---|---|---|---|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |
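
To make the "JSON + base64 PCM" format concrete, here is a minimal sketch of how one audio frame might be wrapped and unwrapped. The event name `input_audio_buffer.append` comes from the pipeline diagram; the `type` and `audio` field names and both helper functions are assumptions, so check them against the integration specification.

```python
import base64
import json


def encode_append(pcm: bytes) -> str:
    """Wrap raw PCM bytes in a JSON text frame (field names are assumed)."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii"),
        }
    )


def decode_append(message: str) -> bytes:
    """Extract the PCM bytes from an append event, rejecting other events."""
    event = json.loads(message)
    if event.get("type") != "input_audio_buffer.append":
        raise ValueError(f"unexpected event type: {event.get('type')!r}")
    # validate=True rejects characters outside the base64 alphabet.
    return base64.b64decode(event["audio"], validate=True)


if __name__ == "__main__":
    frame = encode_append(b"\x01\x00\x02\x00")
    print(frame)
    print(decode_append(frame))
```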

Network Topology

graph LR
    ESP["ESP32<br/>WiFi"]
    DAE["Daedalus<br/>puck.incus:8000"]
    STT["Speaches STT<br/>perseus.helu.ca:22070"]
    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
    DB["PostgreSQL<br/>portia.incus:5432"]

    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
    DAE -->|"HTTP"| STT
    DAE -->|"HTTP"| TTS
    DAE -->|"MCP"| PALLAS
    DAE -->|"SQL"| DB

Audio Flow

graph LR
    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
    GW -->|"WAV header wrap"| STT["Speaches STT"]

    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
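
The two conversions in this flow, wrapping PCM in a WAV header and resampling 24 kHz to 16 kHz, can be sketched with the standard library alone. This is an illustration, not the contents of `audio.py`: the function names are hypothetical, and the linear-interpolation resampler stands in for whatever method the module actually uses. It applies no low-pass filter, so it may alias on real speech.

```python
import io
import struct
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000) -> bytes:
    """Wrap mono 16-bit little-endian PCM in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def resample_s16le(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Resample mono S16LE audio by linear interpolation between neighbours."""
    n_in = len(pcm) // 2
    if n_in == 0:
        return b""
    samples = struct.unpack(f"<{n_in}h", pcm[: n_in * 2])
    n_out = n_in * dst_rate // src_rate
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate      # fractional index into the input
        left = int(pos)
        right = min(left + 1, n_in - 1)    # clamp at the last input sample
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return struct.pack(f"<{n_out}h", *out)


if __name__ == "__main__":
    ramp = struct.pack("<6h", 0, 2, 4, 6, 8, 10)  # six samples at 24 kHz
    resampled = resample_s16le(ramp)
    print(struct.unpack("<4h", resampled))
    print(len(pcm_to_wav(resampled)))
```

For the 3:2 ratio used here, every second output sample falls halfway between two input samples, which is why a six-sample ramp yields four evenly spaced values.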

Key Design Decisions

| Decision | Why |
|---|---|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, and frontend; no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
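
The cost of choosing base64 over binary frames can be estimated with a short calculation. It uses the 16 kHz, 16-bit, mono format from the Audio Flow section. The 20 ms chunk size is an assumption for illustration only, and the JSON envelope around each chunk is not counted.

```python
import base64

SAMPLE_RATE_HZ = 16_000   # from the Audio Flow section
BYTES_PER_SAMPLE = 2      # 16-bit mono


def base64_len(n_bytes: int) -> int:
    """Length of the padded base64 encoding of n_bytes of input."""
    return 4 * ((n_bytes + 2) // 3)


raw_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE   # 32000 bytes
encoded_per_second = base64_len(raw_per_second)      # 42668 characters

# Assumed 20 ms chunk, purely illustrative.
chunk_bytes = raw_per_second * 20 // 1000            # 640 bytes
chunk_encoded = base64_len(chunk_bytes)              # 856 characters

if __name__ == "__main__":
    overhead = encoded_per_second / raw_per_second - 1
    print(f"raw: {raw_per_second} B/s, base64: {encoded_per_second} B/s")
    print(f"overhead: {overhead:.1%}")
    # Cross-check the formula against the real encoder.
    assert len(base64.b64encode(bytes(raw_per_second))) == encoded_per_second
```

Base64 adds roughly one third to the audio payload, which is modest at this sample rate and consistent with treating binary frames as a later optimization.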

Repository Structure

stentor/                        # This repository
├── docs/
│   ├── stentor.md              # Usage guide (updated)
│   └── architecture.md         # This file
├── stentor-ear/                # ESP32 firmware
│   ├── main/
│   ├── components/
│   └── ...
├── stentor-gateway/            # Legacy — gateway code migrated to Daedalus
│   └── ...
└── README.md

daedalus/                       # Separate repository
├── backend/daedalus/voice/     # Voice module (migrated from stentor-gateway)
│   ├── audio.py
│   ├── models.py
│   ├── pipeline.py
│   ├── stt_client.py
│   └── tts_client.py
├── backend/daedalus/api/v1/
│   └── voice.py                # Voice REST + WebSocket endpoints
└── docs/
    └── stentor_integration.md  # Full integration specification