# Stentor Architecture

> Version 0.2.0 — Daedalus-integrated architecture

## Overview

Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:

1. **stentor-ear**: ESP32-S3 firmware that handles microphone input, speaker output, wake word detection, and voice activity detection (VAD)
2. **Daedalus voice module**: Python code integrated into the Daedalus FastAPI backend that runs the speech-to-text (STT) → Agent → text-to-speech (TTS) pipeline

The Python gateway, previously a standalone service (`stentor-gateway/`), has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.

## System Architecture

```mermaid
graph TB
    subgraph "ESP32-S3-AUDIO-Board"
        MIC["Mic Array<br/>ES7210 ADC"]
        WW["Wake Word<br/>ESP-SR"]
        VAD["VAD<br/>On-Device"]
        SPK["Speaker<br/>ES8311 DAC"]
        LED["LED Ring<br/>WS2812B"]
        NVS["NVS<br/>Device UUID"]
        MIC --> WW
        MIC --> VAD
    end

    subgraph "Daedalus Backend (puck.incus)"
        REG["Device Registry<br/>/api/v1/voice/devices"]
        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
        DB["PostgreSQL<br/>Conversations & Messages"]
        MCP["MCP Connection Manager<br/>Pallas Agents"]
    end

    subgraph "Speech Services"
        STT["Speaches STT<br/>Whisper (perseus)"]
        TTS["Speaches TTS<br/>Kokoro (perseus)"]
    end

    subgraph "AI Agents"
        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
    end

    NVS -->|"POST /register"| REG
    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
    VAD -->|"commit on silence"| WS
    WS --> PIPE
    PIPE -->|"POST /v1/audio/transcriptions"| STT
    PIPE -->|"MCP call_tool"| MCP
    MCP -->|"MCP Streamable HTTP"| PALLAS
    PIPE -->|"POST /v1/audio/speech"| TTS
    STT -->|"transcript text"| PIPE
    PALLAS -->|"response text"| MCP
    MCP --> PIPE
    TTS -->|"PCM audio stream"| PIPE
    PIPE --> DB
    PIPE --> WS
    WS -->|"audio + status"| SPK
    WS -->|"status events"| LED
```

## Device Registration & Lifecycle

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant DAE as Daedalus
    participant UI as Daedalus Web UI

    Note over ESP: First boot — generate UUID, store in NVS
    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
    DAE->>ESP: {status: "registered"}

    Note over UI: User sees new device in Settings → Voice Devices
    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}

    Note over ESP: Wake word detected
    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
    ESP->>DAE: session.start
    DAE->>ESP: session.created {session_id, conversation_id}
```
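
The registration handshake above can be exercised from a desktop test client. The following is a minimal Python sketch, not the firmware's implementation: the `device_id` and `firmware` fields and both endpoint paths come from the diagram, while the helper names, the example firmware string, and the `ws://` scheme and host are assumptions for illustration.

```python
import json
import uuid
from urllib.parse import urlencode

# Paths taken from the sequence diagram above.
REGISTER_PATH = "/api/v1/voice/devices/register"
REALTIME_PATH = "/api/v1/voice/realtime"


def new_device_id() -> str:
    """Generate the UUID a device would create on first boot and keep in NVS."""
    return str(uuid.uuid4())


def build_register_body(device_id: str, firmware: str) -> str:
    """Build the JSON body for POST /api/v1/voice/devices/register."""
    return json.dumps({"device_id": device_id, "firmware": firmware})


def realtime_url(host: str, device_id: str) -> str:
    """Build the WebSocket URL opened after wake word detection.

    The ws:// scheme and the host value are assumptions of this sketch.
    """
    query = urlencode({"device_id": device_id})
    return f"ws://{host}{REALTIME_PATH}?{query}"
```

A test client would send `build_register_body(...)` to `REGISTER_PATH` once, then open `realtime_url(...)` for each voice session.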

## Voice Pipeline

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant GW as Daedalus Voice
    participant STT as Speaches STT
    participant MCP as MCP Manager
    participant PALLAS as Pallas Agent
    participant TTS as Speaches TTS
    participant DB as PostgreSQL

    Note over ESP: VAD: speech detected
    loop Audio streaming
        ESP->>GW: input_audio_buffer.append (base64 PCM)
    end
    Note over ESP: VAD: silence detected
    ESP->>GW: input_audio_buffer.commit

    GW->>ESP: status: transcribing
    GW->>STT: POST /v1/audio/transcriptions (WAV)
    STT->>GW: {"text": "..."}
    GW->>ESP: transcript.done
    GW->>DB: Save Message(role="user", content=transcript)

    GW->>ESP: status: thinking
    GW->>MCP: call_tool(workspace, agent, tool, {message})
    MCP->>PALLAS: MCP Streamable HTTP
    PALLAS->>MCP: CallToolResult
    MCP->>GW: response text
    GW->>ESP: response.text.done
    GW->>DB: Save Message(role="assistant", content=response)

    GW->>ESP: status: speaking
    GW->>TTS: POST /v1/audio/speech
    TTS->>GW: PCM audio stream

    loop Audio chunks
        GW->>ESP: response.audio.delta (base64 PCM)
    end

    GW->>ESP: response.audio.done
    GW->>ESP: response.done
    GW->>ESP: status: listening

    Note over GW: Timeout timer starts (120s default)

    alt Timeout — no speech
        GW->>ESP: session.end {reason: "timeout"}
    else Agent ends conversation
        GW->>ESP: session.end {reason: "agent"}
    else User speaks again
        Note over ESP: VAD triggers next turn (same conversation)
    end
```
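
The ordering of server-to-device events in a single turn can be sketched as one coroutine. This is an illustrative outline only: the event names and status values come from the diagram, but the message shape (`type`, `state`, `text`, `delta` fields), the `run_turn` function, and the stub services are hypothetical, and the database writes are omitted.

```python
import asyncio
import base64


async def run_turn(pcm, send, stt, agent, tts):
    """Run one voice turn, emitting events in the order shown in the diagram."""
    await send({"type": "status", "state": "transcribing"})
    transcript = await stt(pcm)
    await send({"type": "transcript.done", "text": transcript})

    await send({"type": "status", "state": "thinking"})
    reply = await agent(transcript)
    await send({"type": "response.text.done", "text": reply})

    await send({"type": "status", "state": "speaking"})
    async for chunk in tts(reply):
        delta = base64.b64encode(chunk).decode("ascii")
        await send({"type": "response.audio.delta", "delta": delta})
    await send({"type": "response.audio.done"})
    await send({"type": "response.done"})
    await send({"type": "status", "state": "listening"})


# Stub services standing in for Speaches STT, a Pallas agent, and Speaches TTS.
async def fake_stt(pcm):
    return "hello"


async def fake_agent(text):
    return text.upper()


async def fake_tts(text):
    for _ in range(2):
        yield b"\x00\x00" * 4  # four silent 16-bit samples per chunk


def demo():
    """Run one turn against the stubs and return the emitted event types."""
    sent = []

    async def send(msg):
        sent.append(msg)

    asyncio.run(run_turn(b"", send, fake_stt, fake_agent, fake_tts))
    return [m["type"] for m in sent]
```

Passing the services in as callables keeps the ordering logic testable without a running STT, agent, or TTS backend.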

## Component Communication

| Source | Destination | Protocol | Format |
|--------|------------|----------|--------|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |
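
The first row of the table, JSON with base64 PCM over WebSocket, could be framed as in the sketch below. The event name `input_audio_buffer.append` comes from the pipeline diagram; the `type` and `audio` field names and both helper functions are assumptions, so the integration specification remains the reference for the actual schema.

```python
import base64
import json

APPEND_EVENT = "input_audio_buffer.append"


def encode_append(pcm: bytes) -> str:
    """Wrap raw PCM bytes in a JSON text frame (field names are assumed)."""
    return json.dumps({
        "type": APPEND_EVENT,
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def decode_append(raw: str) -> bytes:
    """Recover the PCM bytes from an append frame, rejecting other event types."""
    msg = json.loads(raw)
    if msg.get("type") != APPEND_EVENT:
        raise ValueError(f"unexpected event type: {msg.get('type')!r}")
    return base64.b64decode(msg["audio"])
```

At 16 kHz, 16-bit mono, a 20 ms frame is 640 bytes of PCM, which base64 expands to 856 characters before the JSON framing is added.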

## Network Topology

```mermaid
graph LR
    ESP["ESP32<br/>WiFi"]
    DAE["Daedalus<br/>puck.incus:8000"]
    STT["Speaches STT<br/>perseus.helu.ca:22070"]
    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
    DB["PostgreSQL<br/>portia.incus:5432"]

    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
    DAE -->|"HTTP"| STT
    DAE -->|"HTTP"| TTS
    DAE -->|"MCP"| PALLAS
    DAE -->|"SQL"| DB
```

## Audio Flow

```mermaid
graph LR
    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
    GW -->|"WAV header wrap"| STT["Speaches STT"]

    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```
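
The two server-side transformations in this flow, the WAV header wrap and the 24 kHz to 16 kHz resample, can be sketched with the Python standard library. This is an assumption-laden illustration, not the contents of the Daedalus `audio.py` module: the function names are hypothetical, and the resampler uses plain linear interpolation with no anti-aliasing filter, which a production implementation would likely add.

```python
import io
import struct
import wave


def pcm_to_wav(pcm: bytes, rate: int = 16000) -> bytes:
    """Wrap mono 16-bit little-endian PCM in a WAV container for the STT request."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)   # mono
        w.setsampwidth(2)   # 16-bit samples
        w.setframerate(rate)
        w.writeframes(pcm)
    return buf.getvalue()


def resample_24k_to_16k(pcm: bytes) -> bytes:
    """Resample mono S16LE PCM from 24 kHz to 16 kHz by linear interpolation.

    Every output sample i sits at input position i * 3 / 2.
    """
    n = len(pcm) // 2
    src = struct.unpack(f"<{n}h", pcm[: n * 2])
    n_out = n * 2 // 3
    out = []
    for i in range(n_out):
        j, rem = divmod(i * 3, 2)           # position = j + rem / 2
        a = src[j]
        b = src[j + 1] if j + 1 < n else a  # clamp at the final sample
        out.append(a + (b - a) * rem // 2)
    return struct.pack(f"<{n_out}h", *out)
```

One second of 24 kHz TTS output (48,000 bytes) becomes 32,000 bytes at 16 kHz, matching the speaker format shown above.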

## Key Design Decisions

| Decision | Why |
|----------|-----|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames are planned for a future version |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close a session: a silence timeout, or the agent recognizing a goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |
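
The silence timeout from the table can be illustrated with `asyncio.wait_for`. This is a minimal sketch under stated assumptions: the 120 second default and the `session.end` event with `reason: "timeout"` come from the pipeline diagram, while the queue-based inbox, the `next_event` helper, and the message shape are hypothetical.

```python
import asyncio

DEFAULT_TIMEOUT_S = 120.0  # default idle timeout from the pipeline diagram


async def next_event(inbox: asyncio.Queue, timeout_s: float = DEFAULT_TIMEOUT_S) -> dict:
    """Return the next device message, or a session.end event if none arrives in time."""
    try:
        return await asyncio.wait_for(inbox.get(), timeout=timeout_s)
    except asyncio.TimeoutError:
        return {"type": "session.end", "reason": "timeout"}


def demo_timeout() -> dict:
    """No message arrives, so the idle timer ends the session."""
    async def main():
        return await next_event(asyncio.Queue(), timeout_s=0.01)
    return asyncio.run(main())


def demo_speech() -> dict:
    """A message is already waiting, so the conversation continues."""
    async def main():
        inbox = asyncio.Queue()
        await inbox.put({"type": "input_audio_buffer.commit"})
        return await next_event(inbox, timeout_s=0.01)
    return asyncio.run(main())
```

The agent-initiated close would follow a different path: the server would send `session.end` with `reason: "agent"` after the final response rather than waiting on the timer.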

## Repository Structure

```
stentor/                        # This repository
├── docs/
│   ├── stentor.md              # Usage guide
│   └── architecture.md         # This file
├── stentor-ear/                # ESP32 firmware
│   ├── main/
│   ├── components/
│   └── ...
├── stentor-gateway/            # Legacy — gateway code migrated to Daedalus
│   └── ...
└── README.md

daedalus/                       # Separate repository
├── backend/daedalus/voice/     # Voice module (migrated from stentor-gateway)
│   ├── audio.py
│   ├── models.py
│   ├── pipeline.py
│   ├── stt_client.py
│   └── tts_client.py
├── backend/daedalus/api/v1/
│   └── voice.py                # Voice REST + WebSocket endpoints
└── docs/
    └── stentor_integration.md  # Full integration specification
```