Robert Helewka 912593b796 feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice
pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project

2026-03-21 19:11:48 +00:00


Stentor — Usage Guide

"Stentor, whose voice was as powerful as fifty voices of other men." — Homer, Iliad, Book V

Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the Daedalus web application backend — there is no separate Stentor server process.

How It Works
Components
ESP32 Device Setup
Daedalus Configuration
Device Registration Flow
Voice Conversation Flow
WebSocket Protocol
API Endpoints
Observability
Troubleshooting
Architecture Overview

How It Works

1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
4. On-device VAD (Voice Activity Detection) detects speech and silence
5. Audio streams to Daedalus, which runs: Speaches STT → Pallas Agent (MCP) → Speaches TTS
6. The response audio streams back to the ESP32 speaker
7. Transcripts are saved as conversations in PostgreSQL, where they are visible in the Daedalus web UI alongside text conversations

Components

Component	Location	Purpose
stentor-ear	`stentor/stentor-ear/`	ESP32-S3 firmware — microphone, speaker, wake word, VAD
Daedalus voice module	`daedalus/backend/daedalus/voice/`	Voice pipeline — STT, MCP agent calls, TTS
Daedalus voice API	`daedalus/backend/daedalus/api/v1/voice.py`	WebSocket + REST endpoints for devices and sessions
Daedalus web UI	`daedalus/frontend/`	Device management, conversation history

The Python gateway code that was previously in stentor/stentor-gateway/ has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.

ESP32 Device Setup

The ESP32-S3-AUDIO-Board firmware needs one configuration value:

Setting	Description	Example
Daedalus URL	Base URL of the Daedalus instance	`http://puck.incus:22181`

On first boot, the device:

1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
2. Registers with Daedalus via POST /api/v1/voice/devices/register

The UUID persists across reboots, so the device keeps its identity. On later boots it registers again with the same UUID.
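
The first-boot behaviour can be sketched in Python for testing against a Daedalus instance without hardware. This is a simulation under stated assumptions, not the stentor-ear firmware: a plain file stands in for NVS, and the function names `load_or_create_device_id` and `build_registration_body` are hypothetical. The `device_id` and `firmware_version` field names are taken from the registration flow described in this guide.

```python
import json
import uuid
from pathlib import Path


def load_or_create_device_id(store: Path) -> str:
    """Return the persisted device UUID, generating a UUID v4 on first call.

    `store` is a file that stands in for the ESP32's NVS (assumption).
    """
    if store.exists():
        return store.read_text().strip()
    device_id = str(uuid.uuid4())
    store.write_text(device_id)
    return device_id


def build_registration_body(device_id: str, firmware_version: str) -> str:
    """Build the JSON body for POST /api/v1/voice/devices/register."""
    return json.dumps(
        {"device_id": device_id, "firmware_version": firmware_version}
    )
```

Calling `load_or_create_device_id` twice with the same path returns the same UUID, which mirrors the device keeping its identity across reboots.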

Daedalus Configuration

Voice settings are configured via environment variables with the DAEDALUS_ prefix:

Variable	Description	Default
`DAEDALUS_VOICE_STT_URL`	Speaches STT endpoint	`http://perseus.helu.ca:22070`
`DAEDALUS_VOICE_TTS_URL`	Speaches TTS endpoint	`http://perseus.helu.ca:22070`
`DAEDALUS_VOICE_STT_MODEL`	Whisper model for STT	`Systran/faster-whisper-small`
`DAEDALUS_VOICE_TTS_MODEL`	TTS model name	`kokoro`
`DAEDALUS_VOICE_TTS_VOICE`	TTS voice ID	`af_heart`
`DAEDALUS_VOICE_AUDIO_SAMPLE_RATE`	Sample rate in Hz	`16000`
`DAEDALUS_VOICE_AUDIO_CHANNELS`	Audio channels	`1`
`DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH`	Bits per sample	`16`
`DAEDALUS_VOICE_CONVERSATION_TIMEOUT`	Seconds of silence before auto-end	`120`
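
To illustrate how the variables and defaults in the table fit together, here is a minimal standard-library sketch. It is not the Daedalus settings class (the project may use Pydantic Settings or another mechanism); `VoiceSettings` and `load_voice_settings` are hypothetical names, and only the variable names and defaults come from the table above.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class VoiceSettings:
    # Defaults copied from the configuration table.
    stt_url: str = "http://perseus.helu.ca:22070"
    tts_url: str = "http://perseus.helu.ca:22070"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    conversation_timeout: int = 120


def load_voice_settings(environ=None) -> VoiceSettings:
    """Read DAEDALUS_VOICE_* variables, falling back to the defaults."""
    env = os.environ if environ is None else environ
    overrides = {}
    for field in fields(VoiceSettings):
        raw = env.get(f"DAEDALUS_VOICE_{field.name.upper()}")
        if raw is None:
            continue
        # Integer settings are parsed; string settings are used verbatim.
        overrides[field.name] = int(raw) if isinstance(field.default, int) else raw
    return VoiceSettings(**overrides)
```

Passing a dictionary instead of relying on `os.environ` keeps the function easy to test, for example `load_voice_settings({"DAEDALUS_VOICE_CONVERSATION_TIMEOUT": "30"})`.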

Device Registration Flow

ESP32                              Daedalus
  │                                    │
  │  [First boot — UUID generated]     │
  ├─ POST /api/v1/voice/devices/register ▶│
  │   {device_id, firmware_version}    │
  │◀─ {status: "registered"} ─────────┤
  │                                    │
  │  [Device appears in Daedalus       │
  │   Settings → Voice Devices]        │
  │                                    │
  │  [User assigns workspace + agent   │
  │   via web UI]                      │
  │                                    │
  │  [Subsequent boots — same UUID]    │
  ├─ POST /api/v1/voice/devices/register ▶│
  │   {device_id, firmware_version}    │
  │◀─ {status: "already_registered"} ──┤
  │                                    │

After registration, the device appears in the Daedalus settings page. The user assigns it:

A name (e.g. "Kitchen Speaker")
A description (optional)
A workspace (which workspace voice conversations go to)
An agent (which Pallas agent to target)

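Until a workspace and an agent are assigned, the device cannot process voice. Session attempts from an unassigned device receive an error with code no_workspace.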
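
The idempotent registration and the assignment gate can be sketched with an in-memory registry. This is an illustration only: Daedalus stores devices in its own backend, and `Device`, `DeviceRegistry`, `workspace_id`, and `agent_id` are hypothetical names. The two status strings and the `no_workspace` error code are the ones documented in this guide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Device:
    device_id: str
    firmware_version: str
    name: Optional[str] = None
    workspace_id: Optional[str] = None  # hypothetical field name
    agent_id: Optional[str] = None      # hypothetical field name


class DeviceRegistry:
    """In-memory stand-in for the device table (assumption)."""

    def __init__(self) -> None:
        self._devices: Dict[str, Device] = {}

    def register(self, device_id: str, firmware_version: str) -> dict:
        # Registering a known UUID again is a no-op, which makes the call idempotent.
        if device_id in self._devices:
            return {"status": "already_registered"}
        self._devices[device_id] = Device(device_id, firmware_version)
        return {"status": "registered"}

    def assign(self, device_id: str, name: str, workspace_id: str, agent_id: str) -> None:
        device = self._devices[device_id]
        device.name = name
        device.workspace_id = workspace_id
        device.agent_id = agent_id

    def session_error(self, device_id: str) -> Optional[dict]:
        """Return the error an unassigned device would receive, else None."""
        device = self._devices[device_id]
        if device.workspace_id is None or device.agent_id is None:
            return {"type": "error", "code": "no_workspace"}
        return None
```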

Voice Conversation Flow

A voice conversation is a multi-turn session driven by on-device VAD:

ESP32                              Daedalus
  │                                    │
  ├─ [Wake word detected]             │
  ├─ WS /api/v1/voice/realtime ──────▶│
  ├─ session.start ───────────────────▶│  → Create Conversation in DB
  │◀──── session.created ─────────────┤    {session_id, conversation_id}
  │◀──── status: listening ────────────┤
  │                                    │
  │  [VAD: user speaks]                │
  ├─ input_audio_buffer.append ×N ────▶│
  │  [VAD: silence detected]           │
  ├─ input_audio_buffer.commit ───────▶│
  │◀──── status: transcribing ────────┤  → STT
  │◀──── transcript.done ─────────────┤  → Save user message
  │◀──── status: thinking ────────────┤  → MCP call to Pallas
  │◀──── response.text.done ──────────┤  → Save assistant message
  │◀──── status: speaking ────────────┤  → TTS
  │◀──── response.audio.delta ×N ─────┤
  │◀──── response.audio.done ─────────┤
  │◀──── response.done ───────────────┤
  │◀──── status: listening ────────────┤
  │                                    │
  │  [VAD: user speaks again]          │  (same conversation)
  ├─ (next turn cycle) ──────────────▶│
  │                                    │
  │  [Conversation ends by:]           │
  │  • 120s silence → timeout          │
  │  • Agent says goodbye              │
  │  • WebSocket disconnect            │
  │◀──── session.end ─────────────────┤
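
The server-side events of a single turn, as drawn above, can be written down as an ordered list and used to check a captured message log. This is a sketch that assumes each turn carries at least one `response.audio.delta` and that events arrive exactly in the order shown in the diagram; `TURN_EVENTS` and `is_complete_turn` are hypothetical names.

```python
from typing import List, Optional, Tuple

# (message type, status state) pairs for one turn, in diagram order.
TURN_EVENTS: List[Tuple[str, Optional[str]]] = [
    ("status", "transcribing"),
    ("transcript.done", None),
    ("status", "thinking"),
    ("response.text.done", None),
    ("status", "speaking"),
    ("response.audio.delta", None),  # repeated deltas are collapsed to one
    ("response.audio.done", None),
    ("response.done", None),
    ("status", "listening"),
]


def is_complete_turn(messages: List[dict]) -> bool:
    """Check that decoded gateway messages form exactly one full turn."""
    seen: List[Tuple[str, Optional[str]]] = []
    for msg in messages:
        key = (msg["type"], msg.get("state"))
        # Collapse consecutive audio chunks into a single entry.
        if key[0] == "response.audio.delta" and seen and seen[-1] == key:
            continue
        seen.append(key)
    return seen == TURN_EVENTS
```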

Conversation End

A conversation ends in one of three ways:

1. Inactivity timeout: no speech for DAEDALUS_VOICE_CONVERSATION_TIMEOUT seconds (default 120)
2. Agent-initiated: the Pallas agent recognizes the conversation is over and signals it
3. Client disconnect: the ESP32 sends session.close or the WebSocket drops

All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.
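
The three end conditions map onto the `reason` values of the `session.end` message (`timeout`, `agent`, `client`). A minimal sketch of that decision follows; the function name `end_reason` and the precedence between conditions (client first, then agent, then timeout) are assumptions, since this guide does not specify what happens when several conditions hold at once.

```python
from typing import Optional


def end_reason(
    seconds_since_speech: float,
    agent_signalled_end: bool,
    client_closed: bool,
    timeout: float = 120.0,
) -> Optional[str]:
    """Return the session.end reason, or None if the conversation continues."""
    if client_closed:
        return "client"
    if agent_signalled_end:
        return "agent"
    if seconds_since_speech >= timeout:
        return "timeout"
    return None
```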

WebSocket Protocol

Connection

WS /api/v1/voice/realtime?device_id={uuid}
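
A client that only knows the Daedalus base URL can derive the WebSocket URL from it. The sketch below assumes the base URL has no path prefix and that an `https` base maps to `wss`; `realtime_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def realtime_url(base_url: str, device_id: str) -> str:
    """Build the voice WebSocket URL from the Daedalus base URL."""
    parts = urlsplit(base_url)
    scheme = "wss" if parts.scheme == "https" else "ws"
    query = urlencode({"device_id": device_id})
    return urlunsplit((scheme, parts.netloc, "/api/v1/voice/realtime", query, ""))
```

For example, `realtime_url("http://puck.incus:22181", "abc")` yields `ws://puck.incus:22181/api/v1/voice/realtime?device_id=abc`.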

Client → Gateway Messages

Type	Description	Fields
`session.start`	Start a new conversation	`client_id` (optional), `audio_config` (optional)
`input_audio_buffer.append`	Audio chunk	`audio` (base64 PCM)
`input_audio_buffer.commit`	End of speech, trigger pipeline	—
`session.close`	End the session	—

Gateway → Client Messages

Type	Description	Fields
`session.created`	Session started	`session_id`, `conversation_id`
`status`	Processing state	`state` (`listening` / `transcribing` / `thinking` / `speaking`)
`transcript.done`	User's speech as text	`text`
`response.text.done`	Agent's text response	`text`
`response.audio.delta`	Audio chunk (streamed)	`delta` (base64 PCM)
`response.audio.done`	Audio streaming complete	—
`response.done`	Turn complete	—
`session.end`	Conversation ended	`reason` (`timeout` / `agent` / `client`)
`error`	Error occurred	`message`, `code`
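
The message tables translate directly into JSON encoding and decoding. The sketch below uses only the standard library and the message types and field names listed above; the helper names are hypothetical, and the `type` key for the message kind is inferred from the error example in the Troubleshooting section.

```python
import base64
import json


def append_message(pcm: bytes) -> str:
    """Encode one audio chunk as an input_audio_buffer.append message."""
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm).decode("ascii"),
        }
    )


def commit_message() -> str:
    """Signal end of speech so the gateway runs the pipeline."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def handle_server_message(raw: str, playback: bytearray) -> str:
    """Decode one gateway message, collecting audio into `playback`.

    Returns the message type so the caller can track state.
    """
    msg = json.loads(raw)
    kind = msg["type"]
    if kind == "response.audio.delta":
        playback.extend(base64.b64decode(msg["delta"]))
    elif kind == "error":
        raise RuntimeError(f"{msg.get('code')}: {msg.get('message', '')}")
    return kind
```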

Audio Format

All audio is PCM signed 16-bit little-endian (pcm_s16le), base64-encoded in JSON:

Sample rate: 16,000 Hz
Channels: 1 (mono)
Bit depth: 16-bit
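
These parameters fix the data rate: 16,000 samples per second × 1 channel × 2 bytes per sample = 32,000 bytes per second, before base64 encoding adds roughly one third. The following sketch shows the arithmetic and the little-endian packing; the function names are hypothetical.

```python
import struct
from typing import Sequence

SAMPLE_RATE = 16000      # Hz
CHANNELS = 1             # mono
BYTES_PER_SAMPLE = 2     # 16-bit


def pcm_duration_seconds(pcm: bytes) -> float:
    """Duration of a raw pcm_s16le buffer at the documented format."""
    return len(pcm) / (SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE)


def encode_samples(samples: Sequence[int]) -> bytes:
    """Pack integer samples as signed 16-bit little-endian PCM."""
    return struct.pack(f"<{len(samples)}h", *samples)
```

A 20 ms chunk is therefore 640 bytes of PCM, or 856 characters once base64-encoded.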

API Endpoints

All endpoints are served by the Daedalus FastAPI backend.

Voice Device Management

Method	Route	Purpose
`POST`	`/api/v1/voice/devices/register`	ESP32 self-registration (idempotent)
`GET`	`/api/v1/voice/devices`	List all registered devices
`GET`	`/api/v1/voice/devices/{id}`	Get device details
`PUT`	`/api/v1/voice/devices/{id}`	Update device (name, description, workspace, agent)
`DELETE`	`/api/v1/voice/devices/{id}`	Remove a device

Voice Sessions

Method	Route	Purpose
`WS`	`/api/v1/voice/realtime?device_id={id}`	WebSocket for audio conversations
`GET`	`/api/v1/voice/sessions`	List active voice sessions

Voice Configuration & Health

Method	Route	Purpose
`GET`	`/api/v1/voice/config`	Current voice configuration
`PUT`	`/api/v1/voice/config`	Update voice settings
`GET`	`/api/v1/voice/health`	STT + TTS reachability check

Observability

Prometheus Metrics

Voice metrics are exposed at Daedalus's GET /metrics endpoint with the daedalus_voice_ prefix:

Metric	Type	Description
`daedalus_voice_sessions_active`	gauge	Active WebSocket sessions
`daedalus_voice_pipeline_duration_seconds`	histogram	Full pipeline latency
`daedalus_voice_stt_duration_seconds`	histogram	STT latency
`daedalus_voice_tts_duration_seconds`	histogram	TTS latency
`daedalus_voice_agent_duration_seconds`	histogram	Agent (MCP) latency
`daedalus_voice_transcriptions_total`	counter	Total STT calls
`daedalus_voice_conversations_total`	counter	Conversations by end reason
`daedalus_voice_devices_online`	gauge	Currently connected devices

Logs

Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.

Key log events: voice_device_registered, voice_session_started, voice_pipeline_complete, voice_conversation_ended, voice_pipeline_error.

Troubleshooting

Device not appearing in Daedalus settings

Check the ESP32 can reach the Daedalus URL
Verify the registration endpoint responds: curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'

Device registered but voice doesn't work

Assign the device to a workspace and agent in Settings → Voice Devices
Unassigned devices get: {"type": "error", "code": "no_workspace"}

STT returns empty transcripts

Check Speaches STT is running: curl http://perseus.helu.ca:22070/v1/models
Check the voice health endpoint: curl http://puck.incus:22181/api/v1/voice/health

High latency

Check daedalus_voice_pipeline_duration_seconds in Prometheus/Grafana
Compare the per-stage histograms (STT, agent, TTS) to identify the bottleneck
Agent latency depends on the Pallas agent and its downstream MCP servers
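
Without Grafana, a rough per-stage average can be computed from the raw `/metrics` text, because Prometheus histograms expose `_sum` and `_count` series. The sketch below assumes the exposition lines carry no trailing timestamps and sums across any label sets; `mean_stage_latency` is a hypothetical helper, not part of Daedalus.

```python
from typing import Dict

STAGES = ("stt", "agent", "tts")


def mean_stage_latency(metrics_text: str) -> Dict[str, float]:
    """Return mean seconds per stage from Prometheus text exposition output."""
    totals: Dict[str, float] = {}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        totals[name] = totals.get(name, 0.0) + float(value)

    means: Dict[str, float] = {}
    for stage in STAGES:
        base = f"daedalus_voice_{stage}_duration_seconds"
        count = totals.get(f"{base}_count", 0.0)
        if count:
            means[stage] = totals.get(f"{base}_sum", 0.0) / count
    return means
```

The stage with the largest mean, `max(means, key=means.get)`, is the first place to look.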

Audio sounds wrong (chipmunk / slow)

Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
Verify DAEDALUS_VOICE_AUDIO_SAMPLE_RATE matches the ESP32's playback rate
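
Audio played at a higher rate than it was produced at sounds fast and high-pitched; played at a lower rate, it sounds slow and low. To show what the 24 kHz to 16 kHz step involves, here is a minimal linear-interpolation resampler for mono pcm_s16le. It is an illustration, not the Daedalus implementation: it applies no anti-aliasing filter, and `resample_pcm16` is a hypothetical name.

```python
import struct


def resample_pcm16(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Resample mono 16-bit little-endian PCM by linear interpolation."""
    n_src = len(pcm) // 2
    if n_src == 0:
        return b""
    src = struct.unpack(f"<{n_src}h", pcm[: n_src * 2])
    n_dst = n_src * dst_rate // src_rate
    out = []
    for i in range(n_dst):
        # Position of output sample i on the source time axis.
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, n_src - 1)
        frac = pos - left
        out.append(round(src[left] * (1 - frac) + src[right] * frac))
    return struct.pack(f"<{n_dst}h", *out)
```

Six input samples at 24 kHz become four output samples at 16 kHz, so the duration is preserved while the sample count drops by one third.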

Architecture Overview

┌──────────────────┐       WebSocket         ┌──────────────────────────────────────┐
│  ESP32-S3 Board  │◀══════════════════════▶ │   Daedalus Backend (FastAPI)         │
│  (stentor-ear)   │   JSON + base64 audio   │   puck.incus                         │
│  UUID in NVS     │                         │                                      │
│  Wake Word + VAD │                         │   voice/ module:                     │
└──────────────────┘                         │     STT → MCP (Pallas) → TTS        │
                                             │     Conversations → PostgreSQL       │
                                             └──────┬──────────┬────────┬───────────┘
                                                    │          │        │
                                              MCP   │    HTTP  │  HTTP  │
                                                    ▼          ▼        ▼
                                             ┌──────────┐ ┌────────┐ ┌────────┐
                                             │  Pallas  │ │Speaches│ │Speaches│
                                             │  Agents  │ │  STT   │ │  TTS   │
                                             └──────────┘ └────────┘ └────────┘

For full architectural details including Mermaid diagrams, see architecture.md.

For the complete integration specification, see daedalus/docs/stentor_integration.md.
