stentor/docs/stentor.md
Robert Helewka 912593b796 feat: scaffold stentor-gateway with FastAPI voice pipeline
Initialize the stentor-gateway project with WebSocket-based voice
pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
2026-03-21 19:11:48 +00:00


Stentor — Usage Guide

"Stentor, whose voice was as powerful as fifty voices of other men." — Homer, Iliad, Book V

Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the Daedalus web application backend — there is no separate Stentor server process.




How It Works

  1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
  2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
  3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
  4. On-device VAD (Voice Activity Detection) detects speech and silence
  5. Audio streams to Daedalus, which runs: Speaches STT → Pallas Agent (MCP) → Speaches TTS
  6. The response audio streams back to the ESP32 speaker
  7. Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations
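
The pipeline in step 5 can be pictured as three awaited stages chained together. The following is a minimal sketch, not the Daedalus implementation: `run_turn` and the stage signatures are hypothetical, and the three stub stages stand in for Speaches and Pallas so the example runs without any services.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical stage signatures; the real Daedalus clients may differ.
SttFn = Callable[[bytes], Awaitable[str]]
AgentFn = Callable[[str], Awaitable[str]]
TtsFn = Callable[[str], Awaitable[bytes]]


async def run_turn(audio: bytes, stt: SttFn, agent: AgentFn, tts: TtsFn) -> dict:
    """Run one voice turn: audio in, transcript, reply text and reply audio out."""
    transcript = await stt(audio)         # speech to text
    reply_text = await agent(transcript)  # agent call (MCP in Daedalus)
    reply_audio = await tts(reply_text)   # text to speech
    return {
        "transcript": transcript,
        "reply_text": reply_text,
        "reply_audio": reply_audio,
    }


# Stub stages so the sketch runs without Speaches or Pallas.
async def fake_stt(audio: bytes) -> str:
    return f"<{len(audio)} bytes of speech>"


async def fake_agent(text: str) -> str:
    return f"You said: {text}"


async def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


def demo() -> dict:
    """Drive one turn with the stubs above."""
    return asyncio.run(run_turn(b"\x00\x01" * 4, fake_stt, fake_agent, fake_tts))
```

Passing the stages in as callables keeps the orchestration independent of the services behind it, which is also what makes it easy to test with stubs.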

Components

| Component | Location | Purpose |
| --- | --- | --- |
| stentor-ear | stentor/stentor-ear/ | ESP32-S3 firmware: microphone, speaker, wake word, VAD |
| Daedalus voice module | daedalus/backend/daedalus/voice/ | Voice pipeline: STT, MCP agent calls, TTS |
| Daedalus voice API | daedalus/backend/daedalus/api/v1/voice.py | WebSocket + REST endpoints for devices and sessions |
| Daedalus web UI | daedalus/frontend/ | Device management, conversation history |

The Python gateway code that was previously in stentor/stentor-gateway/ has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.


ESP32 Device Setup

The ESP32-S3-AUDIO-Board firmware needs one configuration value:

| Setting | Description | Example |
| --- | --- | --- |
| Daedalus URL | Base URL of the Daedalus instance | http://puck.incus:22181 |

On first boot, the device:

  1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
  2. Registers with Daedalus via POST /api/v1/voice/devices/register
  3. Reuses the stored UUID on every later boot, so the device keeps its identity
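
The first-boot behaviour can be modelled on a desktop machine. This is an illustrative sketch in Python rather than the firmware itself: a file stands in for NVS, and `load_or_create_device_id` and `registration_body` are hypothetical helper names. The payload fields `device_id` and `firmware_version` are the ones shown in the registration flow in this guide.

```python
import json
import tempfile
import uuid
from pathlib import Path


def load_or_create_device_id(store: Path) -> str:
    """Return the stored UUID, generating and saving one on first boot."""
    if store.exists():
        return store.read_text().strip()
    device_id = str(uuid.uuid4())  # random (version 4) UUID
    store.write_text(device_id)
    return device_id


def registration_body(device_id: str, firmware_version: str) -> str:
    """JSON body for POST /api/v1/voice/devices/register."""
    return json.dumps({"device_id": device_id, "firmware_version": firmware_version})


# Simulate two boots against the same storage (a temp file, not real NVS).
with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp) / "device_id"
    first_boot = load_or_create_device_id(store)
    second_boot = load_or_create_device_id(store)
```

Both boots return the same identifier, which is what lets Daedalus recognise a device it has already seen.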

Daedalus Configuration

Voice settings are configured via environment variables with the DAEDALUS_ prefix:

| Variable | Description | Default |
| --- | --- | --- |
| DAEDALUS_VOICE_STT_URL | Speaches STT endpoint | http://perseus.helu.ca:22070 |
| DAEDALUS_VOICE_TTS_URL | Speaches TTS endpoint | http://perseus.helu.ca:22070 |
| DAEDALUS_VOICE_STT_MODEL | Whisper model for STT | Systran/faster-whisper-small |
| DAEDALUS_VOICE_TTS_MODEL | TTS model name | kokoro |
| DAEDALUS_VOICE_TTS_VOICE | TTS voice ID | af_heart |
| DAEDALUS_VOICE_AUDIO_SAMPLE_RATE | Sample rate in Hz | 16000 |
| DAEDALUS_VOICE_AUDIO_CHANNELS | Audio channels | 1 |
| DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH | Bits per sample | 16 |
| DAEDALUS_VOICE_CONVERSATION_TIMEOUT | Seconds of silence before auto-end | 120 |
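
To show how these variables map onto typed settings, here is a standard-library sketch. It is an assumption-laden stand-in, not the actual Daedalus settings class: `VoiceSettings` and `from_env` are hypothetical names, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping, Optional

PREFIX = "DAEDALUS_VOICE_"


@dataclass(frozen=True)
class VoiceSettings:
    """Defaults mirror the table above; field names are illustrative."""
    stt_url: str = "http://perseus.helu.ca:22070"
    tts_url: str = "http://perseus.helu.ca:22070"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    conversation_timeout: int = 120

    @classmethod
    def from_env(cls, env: Optional[Mapping[str, str]] = None) -> "VoiceSettings":
        """Build settings from DAEDALUS_VOICE_* variables, falling back to defaults."""
        env = os.environ if env is None else env
        overrides = {}
        for f in fields(cls):
            raw = env.get(PREFIX + f.name.upper())
            if raw is not None:
                # Coerce to the type of the default (str or int).
                overrides[f.name] = type(f.default)(raw)
        return cls(**overrides)


# Example: override only the sample rate, keep every other default.
settings = VoiceSettings.from_env({"DAEDALUS_VOICE_AUDIO_SAMPLE_RATE": "24000"})
```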

Device Registration Flow

ESP32                              Daedalus
  │                                    │
  │  [First boot — UUID generated]     │
  ├─ POST /api/v1/voice/devices/register ▶│
  │   {device_id, firmware_version}    │
  │◀─ {status: "registered"} ─────────┤
  │                                    │
  │  [Device appears in Daedalus       │
  │   Settings → Voice Devices]        │
  │                                    │
  │  [User assigns workspace + agent   │
  │   via web UI]                      │
  │                                    │
  │  [Subsequent boots — same UUID]    │
  ├─ POST /api/v1/voice/devices/register ▶│
  │   {device_id, firmware_version}    │
  │◀─ {status: "already_registered"} ──┤
  │                                    │

After registration, the device appears in the Daedalus settings page. The user assigns it:

  • A name (e.g. "Kitchen Speaker")
  • A description (optional)
  • A workspace (which workspace voice conversations go to)
  • An agent (which Pallas agent to target)

Until assigned, the device cannot process voice.
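
The registration rules above (idempotent registration, and no voice until a workspace and agent are assigned) can be sketched with an in-memory registry. This is a hypothetical model, not the Daedalus code: `Device`, `DeviceRegistry` and their methods are invented for illustration, and updating the firmware version on re-registration is an assumption. Only the two status strings come from the flow diagram.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Device:
    device_id: str
    firmware_version: str
    name: Optional[str] = None
    description: Optional[str] = None
    workspace: Optional[str] = None
    agent: Optional[str] = None

    @property
    def can_process_voice(self) -> bool:
        """A device needs both a workspace and an agent before it can talk."""
        return self.workspace is not None and self.agent is not None


class DeviceRegistry:
    def __init__(self) -> None:
        self._devices: Dict[str, Device] = {}

    def register(self, device_id: str, firmware_version: str) -> dict:
        """Idempotent: registering a known UUID again changes no identity."""
        existing = self._devices.get(device_id)
        if existing is not None:
            existing.firmware_version = firmware_version  # assumption
            return {"status": "already_registered"}
        self._devices[device_id] = Device(device_id, firmware_version)
        return {"status": "registered"}

    def assign(self, device_id: str, name: str, workspace: str, agent: str) -> Device:
        device = self._devices[device_id]
        device.name, device.workspace, device.agent = name, workspace, agent
        return device

    def get(self, device_id: str) -> Device:
        return self._devices[device_id]
```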


Voice Conversation Flow

A voice conversation is a multi-turn session driven by on-device VAD:

ESP32                              Daedalus
  │                                    │
  ├─ [Wake word detected]             │
  ├─ WS /api/v1/voice/realtime ──────▶│
  ├─ session.start ───────────────────▶│  → Create Conversation in DB
  │◀──── session.created ─────────────┤    {session_id, conversation_id}
  │◀──── status: listening ────────────┤
  │                                    │
  │  [VAD: user speaks]                │
  ├─ input_audio_buffer.append ×N ────▶│
  │  [VAD: silence detected]           │
  ├─ input_audio_buffer.commit ───────▶│
  │◀──── status: transcribing ────────┤  → STT
  │◀──── transcript.done ─────────────┤  → Save user message
  │◀──── status: thinking ────────────┤  → MCP call to Pallas
  │◀──── response.text.done ──────────┤  → Save assistant message
  │◀──── status: speaking ────────────┤  → TTS
  │◀──── response.audio.delta ×N ─────┤
  │◀──── response.audio.done ─────────┤
  │◀──── response.done ───────────────┤
  │◀──── status: listening ────────────┤
  │                                    │
  │  [VAD: user speaks again]          │  (same conversation)
  ├─ (next turn cycle) ──────────────▶│
  │                                    │
  │  [Conversation ends by:]           │
  │  • 120s silence → timeout          │
  │  • Agent says goodbye              │
  │  • WebSocket disconnect            │
  │◀──── session.end ─────────────────┤
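
The ordering of gateway messages within a single turn can be written down as a generator. This is a sketch of the sequence in the diagram above, not the server code: `turn_messages` is a hypothetical helper, and the field names (`state`, `text`, `delta`) follow the protocol tables in this guide.

```python
from typing import Iterator, List


def turn_messages(transcript: str, reply_text: str,
                  audio_chunks: List[str]) -> Iterator[dict]:
    """Yield the gateway messages for one turn, in the documented order.

    audio_chunks are already base64-encoded PCM strings.
    """
    yield {"type": "status", "state": "transcribing"}
    yield {"type": "transcript.done", "text": transcript}
    yield {"type": "status", "state": "thinking"}
    yield {"type": "response.text.done", "text": reply_text}
    yield {"type": "status", "state": "speaking"}
    for chunk in audio_chunks:
        yield {"type": "response.audio.delta", "delta": chunk}
    yield {"type": "response.audio.done"}
    yield {"type": "response.done"}
    # Back to listening: the same conversation continues with the next turn.
    yield {"type": "status", "state": "listening"}


sequence = [m["type"] for m in turn_messages("hi", "hello", ["AAAA", "AAAA"])]
```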

Conversation End

A conversation can end in one of three ways:

  1. Inactivity timeout: no speech for DAEDALUS_VOICE_CONVERSATION_TIMEOUT seconds (default 120)
  2. Agent-initiated: the Pallas agent recognizes that the conversation is over and signals it
  3. Client disconnect: the ESP32 sends session.close or the WebSocket connection drops

All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
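
One way to model the inactivity timeout is a timer that is reset whenever speech arrives. The sketch below is hypothetical (`InactivityTimer` and `session_end` are not Daedalus names); the injectable clock exists only to make the logic testable, and the three reason strings are the ones listed for `session.end` in the protocol tables.

```python
import time
from typing import Callable

END_REASONS = ("timeout", "agent", "client")  # values of session.end "reason"


class InactivityTimer:
    """Tracks time since the last detected speech in a conversation."""

    def __init__(self, timeout_s: float = 120.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_s = timeout_s
        self._clock = clock
        self._last_speech = clock()

    def touch(self) -> None:
        """Call whenever the user speaks, for example on each committed buffer."""
        self._last_speech = self._clock()

    def expired(self) -> bool:
        return self._clock() - self._last_speech >= self.timeout_s


def session_end(reason: str) -> dict:
    """Build the session.end message for one of the three end reasons."""
    if reason not in END_REASONS:
        raise ValueError(f"unknown end reason: {reason!r}")
    return {"type": "session.end", "reason": reason}
```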


WebSocket Protocol

Connection

WS /api/v1/voice/realtime?device_id={uuid}

Client → Gateway Messages

| Type | Description | Fields |
| --- | --- | --- |
| session.start | Start a new conversation | client_id (optional), audio_config (optional) |
| input_audio_buffer.append | Audio chunk | audio (base64 PCM) |
| input_audio_buffer.commit | End of speech, trigger pipeline | |
| session.close | End the session | |
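
As an illustration of the client messages, the helpers below build each one as a JSON string. This is a sketch under assumptions: the helper names are hypothetical, the top-level `type` key is inferred from the error example in the Troubleshooting section, and the shape of `audio_config` is not specified in this guide, so it is passed through unchanged.

```python
import base64
import json
from typing import Optional


def session_start(client_id: Optional[str] = None,
                  audio_config: Optional[dict] = None) -> str:
    msg = {"type": "session.start"}
    if client_id is not None:
        msg["client_id"] = client_id
    if audio_config is not None:
        msg["audio_config"] = audio_config  # structure not defined here
    return json.dumps(msg)


def audio_append(pcm: bytes) -> str:
    """Wrap a raw PCM chunk as base64 inside input_audio_buffer.append."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm).decode("ascii"),
    })


def audio_commit() -> str:
    return json.dumps({"type": "input_audio_buffer.commit"})


def session_close() -> str:
    return json.dumps({"type": "session.close"})
```

A turn is then `audio_append(...)` repeated while the VAD reports speech, followed by a single `audio_commit()` once silence is detected.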

Gateway → Client Messages

| Type | Description | Fields |
| --- | --- | --- |
| session.created | Session started | session_id, conversation_id |
| status | Processing state | state (listening / transcribing / thinking / speaking) |
| transcript.done | User's speech as text | text |
| response.text.done | Agent's text response | text |
| response.audio.delta | Audio chunk (streamed) | delta (base64 PCM) |
| response.audio.done | Audio streaming complete | |
| response.done | Turn complete | |
| session.end | Conversation ended | reason (timeout / agent / client) |
| error | Error occurred | message, code |
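
A client can handle these messages with a simple dispatch on `type`. The class below is a hypothetical sketch of such a handler (it is Python for readability, whereas the real client is ESP32 firmware); `EarSession` and its attributes are invented names, while the message types and fields follow the table above.

```python
import base64
import json


class EarSession:
    """Accumulates the state a client needs from gateway messages."""

    def __init__(self) -> None:
        self.state = None
        self.session_id = None
        self.conversation_id = None
        self.transcript = None
        self.reply_text = None
        self.audio = bytearray()  # decoded PCM waiting for playback
        self.turn_done = False
        self.end_reason = None
        self.errors = []

    def handle(self, raw: str) -> None:
        msg = json.loads(raw)
        kind = msg.get("type")
        if kind == "session.created":
            self.session_id = msg["session_id"]
            self.conversation_id = msg["conversation_id"]
        elif kind == "status":
            self.state = msg["state"]
        elif kind == "transcript.done":
            self.transcript = msg["text"]
        elif kind == "response.text.done":
            self.reply_text = msg["text"]
        elif kind == "response.audio.delta":
            self.audio.extend(base64.b64decode(msg["delta"]))
        elif kind == "response.done":
            self.turn_done = True
        elif kind == "session.end":
            self.end_reason = msg.get("reason")
        elif kind == "error":
            self.errors.append((msg.get("code"), msg.get("message")))
        # response.audio.done and unknown types need no state change here.
```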

Audio Format

All audio is PCM signed 16-bit little-endian (pcm_s16le), base64-encoded in JSON:

  • Sample rate: 16,000 Hz
  • Channels: 1 (mono)
  • Bit depth: 16-bit
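
The format above fixes the arithmetic for chunk sizes: 16,000 samples per second × 1 channel × 2 bytes is 32,000 bytes of PCM per second. The sketch below packs and unpacks samples with the standard library; the helper names are hypothetical.

```python
import base64
import struct
from typing import List

SAMPLE_RATE = 16000      # Hz
CHANNELS = 1             # mono
BYTES_PER_SAMPLE = 2     # 16-bit
BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE  # 32000


def encode_samples(samples: List[int]) -> str:
    """Pack signed 16-bit little-endian samples and base64-encode them."""
    pcm = struct.pack(f"<{len(samples)}h", *samples)
    return base64.b64encode(pcm).decode("ascii")


def decode_samples(b64: str) -> List[int]:
    """Inverse of encode_samples."""
    pcm = base64.b64decode(b64)
    return list(struct.unpack(f"<{len(pcm) // BYTES_PER_SAMPLE}h", pcm))


def duration_ms(pcm_bytes: int) -> float:
    """Playback duration of a raw PCM buffer of the given size."""
    return pcm_bytes * 1000 / BYTES_PER_SECOND
```

For example, a 20 ms chunk is 640 bytes of raw PCM before base64 encoding, and base64 grows the payload by roughly a third.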

API Endpoints

All endpoints are served by the Daedalus FastAPI backend.

Voice Device Management

| Method | Route | Purpose |
| --- | --- | --- |
| POST | /api/v1/voice/devices/register | ESP32 self-registration (idempotent) |
| GET | /api/v1/voice/devices | List all registered devices |
| GET | /api/v1/voice/devices/{id} | Get device details |
| PUT | /api/v1/voice/devices/{id} | Update device (name, description, workspace, agent) |
| DELETE | /api/v1/voice/devices/{id} | Remove a device |

Voice Sessions

| Method | Route | Purpose |
| --- | --- | --- |
| WS | /api/v1/voice/realtime?device_id={id} | WebSocket for audio conversations |
| GET | /api/v1/voice/sessions | List active voice sessions |

Voice Configuration & Health

| Method | Route | Purpose |
| --- | --- | --- |
| GET | /api/v1/voice/config | Current voice configuration |
| PUT | /api/v1/voice/config | Update voice settings |
| GET | /api/v1/voice/health | STT + TTS reachability check |
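
The REST routes above can be exercised from a script. The sketch below only constructs the requests with the standard library and does not send them. `voice_request` is a hypothetical helper, the base URL is the example Daedalus URL used elsewhere in this guide, the device ID `1234` is a placeholder, and the JSON key `name` in the update body is an assumption since the request schema is not given here.

```python
import json
from typing import Optional
from urllib.request import Request

BASE_URL = "http://puck.incus:22181"  # example Daedalus URL from this guide
PREFIX = "/api/v1/voice"


def voice_request(method: str, route: str, body: Optional[dict] = None) -> Request:
    """Build (but do not send) a request against the voice API."""
    data = None if body is None else json.dumps(body).encode("utf-8")
    headers = {"Content-Type": "application/json"} if body is not None else {}
    return Request(BASE_URL + PREFIX + route, data=data, headers=headers, method=method)


list_devices = voice_request("GET", "/devices")
# "1234" is a placeholder device ID; the "name" key is an assumed field name.
rename = voice_request("PUT", "/devices/1234", {"name": "Kitchen Speaker"})
health = voice_request("GET", "/health")
```

Passing any of these objects to `urllib.request.urlopen` would perform the call against a reachable Daedalus instance.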

Observability

Prometheus Metrics

Voice metrics are exposed at Daedalus's GET /metrics endpoint with the daedalus_voice_ prefix:

| Metric | Type | Description |
| --- | --- | --- |
| daedalus_voice_sessions_active | gauge | Active WebSocket sessions |
| daedalus_voice_pipeline_duration_seconds | histogram | Full pipeline latency |
| daedalus_voice_stt_duration_seconds | histogram | STT latency |
| daedalus_voice_tts_duration_seconds | histogram | TTS latency |
| daedalus_voice_agent_duration_seconds | histogram | Agent (MCP) latency |
| daedalus_voice_transcriptions_total | counter | Total STT calls |
| daedalus_voice_conversations_total | counter | Conversations by end reason |
| daedalus_voice_devices_online | gauge | Currently connected devices |
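
Without Grafana at hand, per-stage mean latency can be estimated from a scrape of `/metrics` by dividing each histogram's `_sum` by its `_count`. The sketch below is a deliberately small parser for the Prometheus text format: it assumes samples carry no timestamps and that the duration histograms are unlabelled. The `SAMPLE` text is made-up data for illustration, not real output.

```python
from typing import Dict, Optional

# Made-up scrape for illustration only.
SAMPLE = """\
# TYPE daedalus_voice_sessions_active gauge
daedalus_voice_sessions_active 2
daedalus_voice_stt_duration_seconds_sum 1.5
daedalus_voice_stt_duration_seconds_count 3
daedalus_voice_agent_duration_seconds_sum 8.0
daedalus_voice_agent_duration_seconds_count 4
daedalus_voice_tts_duration_seconds_sum 1.0
daedalus_voice_tts_duration_seconds_count 4
"""


def parse_metrics(text: str) -> Dict[str, float]:
    """Map 'name' or 'name{labels}' to its value, skipping comment lines."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.rpartition(" ")
        samples[key] = float(value)
    return samples


def mean_latency(samples: Dict[str, float], stage: str) -> Optional[float]:
    """Mean seconds per call for a stage: histogram sum divided by count."""
    base = f"daedalus_voice_{stage}_duration_seconds"
    count = samples.get(base + "_count", 0.0)
    if count == 0:
        return None
    return samples[base + "_sum"] / count


samples = parse_metrics(SAMPLE)
means = {stage: mean_latency(samples, stage) for stage in ("stt", "agent", "tts")}
slowest = max((s for s in means if means[s] is not None), key=lambda s: means[s])
```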

Logs

Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.

Key log events: voice_device_registered, voice_session_started, voice_pipeline_complete, voice_conversation_ended, voice_pipeline_error.
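
When tailing stdout directly, the voice events can be filtered out of the log stream. The sketch below assumes the logs are rendered as one JSON object per line with the event name under the `event` key; that is structlog's usual key, but whether Daedalus uses a JSON renderer is an assumption. `voice_events` is a hypothetical helper and the sample lines are invented.

```python
import json
from typing import Iterable, Iterator

VOICE_EVENTS = {
    "voice_device_registered",
    "voice_session_started",
    "voice_pipeline_complete",
    "voice_conversation_ended",
    "voice_pipeline_error",
}


def voice_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield parsed log records whose event is one of the key voice events."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not a JSON log line
        if isinstance(record, dict) and record.get("event") in VOICE_EVENTS:
            yield record


# Invented sample lines for illustration.
sample_lines = [
    '{"event": "voice_session_started", "device_id": "abc"}',
    '{"event": "http_request", "path": "/metrics"}',
    "plain text line",
    '{"event": "voice_pipeline_error", "error": "stt unreachable"}',
]
matched = [r["event"] for r in voice_events(sample_lines)]
```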


Troubleshooting

Device not appearing in Daedalus settings

  • Check the ESP32 can reach the Daedalus URL
  • Verify the registration endpoint responds: curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'

Device registered but voice doesn't work

  • Assign the device to a workspace and agent in Settings → Voice Devices
  • Unassigned devices get: {"type": "error", "code": "no_workspace"}

STT returns empty transcripts

  • Check Speaches STT is running: curl http://perseus.helu.ca:22070/v1/models
  • Check the voice health endpoint: curl http://puck.incus:22181/api/v1/voice/health

High latency

  • Check daedalus_voice_pipeline_duration_seconds in Prometheus/Grafana
  • Break the latency down by stage: the STT, agent, and TTS histograms show which stage is the bottleneck
  • Agent latency depends on the Pallas agent and its downstream MCP servers

Audio sounds wrong (chipmunk / slow)

  • Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
  • Verify DAEDALUS_VOICE_AUDIO_SAMPLE_RATE matches the ESP32's playback rate
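
The speed error follows directly from the ratio of rates: audio produced at one rate and played at another runs at `playback_rate / data_rate` times normal speed. The sketch below shows that ratio together with a naive linear-interpolation resampler. It illustrates the idea only; the resampler Daedalus actually uses is not described in this guide, and this version applies no anti-aliasing filter.

```python
from typing import List


def playback_speed(data_rate: int, playback_rate: int) -> float:
    """Speed factor heard when data_rate audio is played at playback_rate."""
    return playback_rate / data_rate


def resample_linear(samples: List[int], src_rate: int, dst_rate: int) -> List[int]:
    """Naive linear-interpolation resampler (illustration, no filtering)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    n_out = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # position in the source signal
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(round(samples[left] * (1 - frac) + samples[right] * frac))
    return out


# 24 samples at 24 kHz (1 ms of audio) become 16 samples at 16 kHz.
ramp = list(range(0, 48, 2))
resampled = resample_linear(ramp, 24000, 16000)
```

Unresampled 24 kHz audio played at 16 kHz runs at about 0.67× speed (slow and low-pitched), while 16 kHz audio played at 24 kHz runs at 1.5× (the chipmunk effect).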

Architecture Overview

┌──────────────────┐       WebSocket         ┌──────────────────────────────────────┐
│  ESP32-S3 Board  │◀══════════════════════▶ │   Daedalus Backend (FastAPI)         │
│  (stentor-ear)   │   JSON + base64 audio   │   puck.incus                         │
│  UUID in NVS     │                         │                                      │
│  Wake Word + VAD │                         │   voice/ module:                     │
└──────────────────┘                         │     STT → MCP (Pallas) → TTS        │
                                             │     Conversations → PostgreSQL       │
                                             └──────┬──────────┬────────┬───────────┘
                                                    │          │        │
                                              MCP   │    HTTP  │  HTTP  │
                                                    ▼          ▼        ▼
                                             ┌──────────┐ ┌────────┐ ┌────────┐
                                             │  Pallas  │ │Speaches│ │Speaches│
                                             │  Agents  │ │  STT   │ │  TTS   │
                                             └──────────┘ └────────┘ └────────┘

For full architectural details including Mermaid diagrams, see architecture.md.

For the complete integration specification, see daedalus/docs/stentor_integration.md.