feat: scaffold stentor-gateway with FastAPI voice pipeline
Initialize the stentor-gateway project with WebSocket-based voice pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
docs/api-reference.md
@@ -0,0 +1,315 @@
# Stentor Gateway API Reference

> Version 0.1.0

## Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Dashboard (Bootstrap UI) |
| `/api/v1/realtime` | WebSocket | Real-time audio conversation |
| `/api/v1/info` | GET | Gateway information and configuration |
| `/api/live/` | GET | Liveness probe (Kubernetes) |
| `/api/ready/` | GET | Readiness probe (Kubernetes) |
| `/api/metrics` | GET | Prometheus-compatible metrics |
| `/api/docs` | GET | Interactive API documentation (Swagger UI) |
| `/api/openapi.json` | GET | OpenAPI schema |

---

## WebSocket: `/api/v1/realtime`

Real-time voice conversation endpoint. The protocol is inspired by the OpenAI Realtime API.

### Connection

```
ws://{host}:{port}/api/v1/realtime
```

### Client Events

#### `session.start`

Initiates a new conversation session. It must be the first message sent on a connection.

```json
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"session.start"` |
| `client_id` | string | | Client identifier for tracking |
| `audio_config` | object | | Audio format configuration |

#### `input_audio_buffer.append`

Sends a chunk of audio data. Stream chunks continuously while the user is speaking.

```json
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM audio>"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"input_audio_buffer.append"` |
| `audio` | string | ✔ | Base64-encoded PCM S16LE audio |
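
As a rough client-side sketch, the `input_audio_buffer.append` message can be assembled with the standard library alone. The helper name `build_append_event` is hypothetical and not part of the gateway; only the `type` and `audio` fields come from the table above.

```python
import base64
import json


def build_append_event(pcm_chunk: bytes) -> str:
    """Wrap raw PCM S16LE bytes in an input_audio_buffer.append message.

    Hypothetical helper for illustration; the field names follow the
    table above, everything else is an assumption.
    """
    # PCM S16LE uses two bytes per sample, so an odd length means a torn sample.
    if len(pcm_chunk) % 2 != 0:
        raise ValueError("PCM S16LE data must contain whole 16-bit samples")
    return json.dumps(
        {
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(pcm_chunk).decode("ascii"),
        }
    )


# 20 ms of silence at 16 kHz mono: 320 samples of 2 bytes each.
event = build_append_event(b"\x00\x00" * 320)
```

Rejecting odd-length chunks early is a design choice of this sketch; the gateway's own validation behaviour is not described here.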

#### `input_audio_buffer.commit`

Signals the end of speech. Triggers the STT → Agent → TTS pipeline.

```json
{
  "type": "input_audio_buffer.commit"
}
```

#### `session.close`

Requests session termination. The WebSocket connection will close.

```json
{
  "type": "session.close"
}
```

### Server Events

#### `session.created`

Acknowledges session creation.

```json
{
  "type": "session.created",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

#### `status`

Processing status update. Use it for LED feedback on the ESP32.

```json
{
  "type": "status",
  "state": "listening"
}
```

| State | Description | Suggested LED |
|-------|-------------|--------------|
| `listening` | Ready for audio input | Green |
| `transcribing` | Running STT | Yellow |
| `thinking` | Waiting for agent response | Yellow |
| `speaking` | Playing TTS audio | Cyan |
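
The state-to-LED mapping above could be expressed on the client roughly as follows. This is a sketch in Python for readability (the real firmware runs on the ESP32); the RGB values and the `led_color_for` helper are assumptions, and only the state names and colour names come from the table.

```python
from typing import Optional, Tuple

# Colour names are from the table above; the RGB triples are assumed values.
LED_COLORS = {
    "listening": (0, 255, 0),       # Green
    "transcribing": (255, 255, 0),  # Yellow
    "thinking": (255, 255, 0),      # Yellow
    "speaking": (0, 255, 255),      # Cyan
}


def led_color_for(event: dict) -> Optional[Tuple[int, int, int]]:
    """Return the LED colour for a status event, or None if not applicable."""
    if event.get("type") != "status":
        return None
    # Unknown states also map to None so the caller can keep the current colour.
    return LED_COLORS.get(event.get("state"))
```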

#### `transcript.done`

Transcript of what the user said.

```json
{
  "type": "transcript.done",
  "text": "What is the weather like today?"
}
```

#### `response.text.done`

The AI agent's response text.

```json
{
  "type": "response.text.done",
  "text": "I don't have weather tools yet, but I can help with other things."
}
```

#### `response.audio.delta`

Streamed audio response chunk.

```json
{
  "type": "response.audio.delta",
  "delta": "<base64-encoded PCM audio>"
}
```

#### `response.audio.done`

Audio response streaming complete.

```json
{
  "type": "response.audio.done"
}
```

#### `response.done`

Full response cycle complete. The gateway returns to the `listening` state.

```json
{
  "type": "response.done"
}
```

#### `error`

Error event.

```json
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
```

| Code | Description |
|------|-------------|
| `invalid_json` | Client sent malformed JSON |
| `validation_error` | Message failed schema validation |
| `no_session` | Action requires an active session |
| `empty_buffer` | Audio buffer was empty on commit |
| `empty_transcript` | STT returned no speech |
| `empty_response` | Agent returned empty response |
| `pipeline_error` | Internal pipeline failure |
| `unknown_event` | Unrecognized event type |
| `internal_error` | Unexpected server error |
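
A client might separate `error` events from normal server events along these lines. This is a minimal sketch: `GatewayError` and `parse_server_event` are hypothetical names, and falling back to `internal_error` when `code` is missing is an assumption. Note that the sample payload above uses `stt_error`, which is not in the table, so the sketch tolerates unlisted codes rather than rejecting them.

```python
import json

# Error codes listed in the table above.
KNOWN_ERROR_CODES = {
    "invalid_json", "validation_error", "no_session", "empty_buffer",
    "empty_transcript", "empty_response", "pipeline_error",
    "unknown_event", "internal_error",
}


class GatewayError(Exception):
    """Raised when the gateway sends an `error` event (hypothetical class)."""

    def __init__(self, code: str, message: str) -> None:
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        # Codes outside the documented table are kept, only flagged.
        self.known = code in KNOWN_ERROR_CODES


def parse_server_event(raw: str) -> dict:
    """Decode one server message, raising GatewayError for error events."""
    event = json.loads(raw)
    if event.get("type") == "error":
        # Assumption: a missing code is treated as internal_error.
        raise GatewayError(event.get("code", "internal_error"), event.get("message", ""))
    return event
```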

---

## REST: `/api/v1/info`

Returns gateway information and current configuration.

**Response:**

```json
{
  "name": "stentor-gateway",
  "version": "0.1.0",
  "endpoints": {
    "realtime": "/api/v1/realtime",
    "live": "/api/live/",
    "ready": "/api/ready/",
    "metrics": "/api/metrics"
  },
  "config": {
    "stt_url": "http://perseus.incus:8000",
    "tts_url": "http://pan.incus:8000",
    "agent_url": "http://localhost:8001",
    "stt_model": "Systran/faster-whisper-small",
    "tts_model": "kokoro",
    "tts_voice": "af_heart",
    "audio_sample_rate": 16000,
    "audio_channels": 1,
    "audio_sample_width": 16
  }
}
```

---

## REST: `/api/live/`

Kubernetes liveness probe.

**Response (200):**

```json
{
  "status": "ok"
}
```

---

## REST: `/api/ready/`

Kubernetes readiness probe. Checks connectivity to STT, TTS, and Agent services.

**Response (200 — all services reachable):**

```json
{
  "status": "ready",
  "checks": {
    "stt": true,
    "tts": true,
    "agent": true
  }
}
```

**Response (503 — one or more services unavailable):**

```json
{
  "status": "not_ready",
  "checks": {
    "stt": true,
    "tts": false,
    "agent": true
  }
}
```
```
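
The relationship between the per-service checks and the HTTP status shown above can be sketched as a pure function. `readiness_response` is a hypothetical name; how the gateway actually probes each service is not covered here.

```python
from typing import Dict, Tuple


def readiness_response(checks: Dict[str, bool]) -> Tuple[int, dict]:
    """Map per-service reachability flags to (HTTP status, JSON body).

    Illustrative only: the body shape follows the two examples above.
    """
    ready = all(checks.values())
    body = {"status": "ready" if ready else "not_ready", "checks": dict(checks)}
    return (200 if ready else 503), body


status, body = readiness_response({"stt": True, "tts": False, "agent": True})
```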

---

## REST: `/api/metrics`

Prometheus-compatible metrics in text exposition format.

**Metrics exported:**

| Metric | Type | Description |
|--------|------|-------------|
| `stentor_sessions_active` | Gauge | Current active WebSocket sessions |
| `stentor_transcriptions_total` | Counter | Total STT transcription calls |
| `stentor_tts_requests_total` | Counter | Total TTS synthesis calls |
| `stentor_agent_requests_total` | Counter | Total agent message calls |
| `stentor_pipeline_duration_seconds` | Histogram | Full pipeline latency |
| `stentor_stt_duration_seconds` | Histogram | STT transcription latency |
| `stentor_tts_duration_seconds` | Histogram | TTS synthesis latency |
| `stentor_agent_duration_seconds` | Histogram | Agent response latency |

---

## Configuration

All configuration is supplied via environment variables (12-factor):

| Variable | Description | Default |
|----------|-------------|---------|
| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
| `STENTOR_PORT` | Gateway bind port | `8600` |
| `STENTOR_STT_URL` | Speaches STT endpoint | `http://perseus.incus:8000` |
| `STENTOR_TTS_URL` | Speaches TTS endpoint | `http://pan.incus:8000` |
| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |
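
To illustrate how the `STENTOR_`-prefixed variables and defaults in this table fit together, here is a standard-library sketch. The commit message says the gateway uses Pydantic Settings; this `Settings` dataclass and `load_settings` helper are stand-ins for illustration, not the gateway's actual configuration class.

```python
import os
from dataclasses import dataclass, fields
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Defaults copied from the table above (illustrative stand-in)."""

    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    log_level: str = "INFO"


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from STENTOR_* variables, falling back to defaults."""
    overrides = {}
    for field in fields(Settings):
        raw = env.get(f"STENTOR_{field.name.upper()}")
        if raw is not None:
            # Convert using the type of the default value (str or int).
            overrides[field.name] = type(field.default)(raw)
    return Settings(**overrides)
```

For example, `load_settings({"STENTOR_PORT": "9000"})` overrides only the port and leaves every other value at its documented default.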
docs/architecture.md
@@ -0,0 +1,222 @@
# Stentor Architecture

> Version 0.2.0 — Daedalus-integrated architecture

## Overview

Stentor is a voice interface that connects physical audio hardware to AI agents via speech services. The system consists of two main components:

1. **stentor-ear** — ESP32-S3 firmware handling microphone input, speaker output, wake word detection, and VAD
2. **Daedalus voice module** — Python code integrated into the Daedalus FastAPI backend, handling the STT → Agent → TTS pipeline

The Python gateway that was previously a standalone service (`stentor-gateway/`) has been merged into the Daedalus backend as `daedalus/backend/daedalus/voice/`. See [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md) for the full integration specification.

## System Architecture

```mermaid
graph TB
    subgraph "ESP32-S3-AUDIO-Board"
        MIC["Mic Array<br/>ES7210 ADC"]
        WW["Wake Word<br/>ESP-SR"]
        VAD["VAD<br/>On-Device"]
        SPK["Speaker<br/>ES8311 DAC"]
        LED["LED Ring<br/>WS2812B"]
        NVS["NVS<br/>Device UUID"]
        MIC --> WW
        MIC --> VAD
    end

    subgraph "Daedalus Backend (puck.incus)"
        REG["Device Registry<br/>/api/v1/voice/devices"]
        WS["WebSocket Server<br/>/api/v1/voice/realtime"]
        PIPE["Voice Pipeline<br/>STT → MCP → TTS"]
        DB["PostgreSQL<br/>Conversations & Messages"]
        MCP["MCP Connection Manager<br/>Pallas Agents"]
    end

    subgraph "Speech Services"
        STT["Speaches STT<br/>Whisper (perseus)"]
        TTS["Speaches TTS<br/>Kokoro (perseus)"]
    end

    subgraph "AI Agents"
        PALLAS["Pallas MCP Servers<br/>Research · Infra · Orchestrator"]
    end

    NVS -->|"POST /register"| REG
    WW -->|"WebSocket<br/>JSON + base64 audio"| WS
    VAD -->|"commit on silence"| WS
    WS --> PIPE
    PIPE -->|"POST /v1/audio/transcriptions"| STT
    PIPE -->|"MCP call_tool"| MCP
    MCP -->|"MCP Streamable HTTP"| PALLAS
    PIPE -->|"POST /v1/audio/speech"| TTS
    STT -->|"transcript text"| PIPE
    PALLAS -->|"response text"| MCP
    MCP --> PIPE
    TTS -->|"PCM audio stream"| PIPE
    PIPE --> DB
    PIPE --> WS
    WS -->|"audio + status"| SPK
    WS -->|"status events"| LED
```

## Device Registration & Lifecycle

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant DAE as Daedalus
    participant UI as Daedalus Web UI

    Note over ESP: First boot — generate UUID, store in NVS
    ESP->>DAE: POST /api/v1/voice/devices/register {device_id, firmware}
    DAE->>ESP: {status: "registered"}

    Note over UI: User sees new device in Settings → Voice Devices
    UI->>DAE: PUT /api/v1/voice/devices/{id} {name, workspace, agent}

    Note over ESP: Wake word detected
    ESP->>DAE: WS /api/v1/voice/realtime?device_id=uuid
    ESP->>DAE: session.start
    DAE->>ESP: session.created {session_id, conversation_id}
```

## Voice Pipeline

```mermaid
sequenceDiagram
    participant ESP as ESP32
    participant GW as Daedalus Voice
    participant STT as Speaches STT
    participant MCP as MCP Manager
    participant PALLAS as Pallas Agent
    participant TTS as Speaches TTS
    participant DB as PostgreSQL

    Note over ESP: VAD: speech detected
    loop Audio streaming
        ESP->>GW: input_audio_buffer.append (base64 PCM)
    end
    Note over ESP: VAD: silence detected
    ESP->>GW: input_audio_buffer.commit

    GW->>ESP: status: transcribing
    GW->>STT: POST /v1/audio/transcriptions (WAV)
    STT->>GW: {"text": "..."}
    GW->>ESP: transcript.done
    GW->>DB: Save Message(role="user", content=transcript)

    GW->>ESP: status: thinking
    GW->>MCP: call_tool(workspace, agent, tool, {message})
    MCP->>PALLAS: MCP Streamable HTTP
    PALLAS->>MCP: CallToolResult
    MCP->>GW: response text
    GW->>ESP: response.text.done
    GW->>DB: Save Message(role="assistant", content=response)

    GW->>ESP: status: speaking
    GW->>TTS: POST /v1/audio/speech
    TTS->>GW: PCM audio stream

    loop Audio chunks
        GW->>ESP: response.audio.delta (base64 PCM)
    end

    GW->>ESP: response.audio.done
    GW->>ESP: response.done
    GW->>ESP: status: listening

    Note over GW: Timeout timer starts (120s default)

    alt Timeout — no speech
        GW->>ESP: session.end {reason: "timeout"}
    else Agent ends conversation
        GW->>ESP: session.end {reason: "agent"}
    else User speaks again
        Note over ESP: VAD triggers next turn (same conversation)
    end
```

## Component Communication

| Source | Destination | Protocol | Format |
|--------|------------|----------|--------|
| ESP32 | Daedalus | WebSocket | JSON + base64 PCM |
| ESP32 | Daedalus | HTTP POST | JSON (device registration) |
| Daedalus | Speaches STT | HTTP POST | multipart/form-data (WAV) |
| Daedalus | Pallas Agents | MCP Streamable HTTP | MCP call_tool |
| Daedalus | Speaches TTS | HTTP POST | JSON request, binary PCM response |
| Daedalus | PostgreSQL | SQL | Conversations + Messages |

## Network Topology

```mermaid
graph LR
    ESP["ESP32<br/>WiFi"]
    DAE["Daedalus<br/>puck.incus:8000"]
    STT["Speaches STT<br/>perseus.helu.ca:22070"]
    TTS["Speaches TTS<br/>perseus.helu.ca:22070"]
    PALLAS["Pallas Agents<br/>puck.incus:23031-33"]
    DB["PostgreSQL<br/>portia.incus:5432"]

    ESP <-->|"WS :22181<br/>(via Nginx)"| DAE
    DAE -->|"HTTP"| STT
    DAE -->|"HTTP"| TTS
    DAE -->|"MCP"| PALLAS
    DAE -->|"SQL"| DB
```

## Audio Flow

```mermaid
graph LR
    MIC["Microphone<br/>16kHz/16-bit/mono"] -->|"PCM S16LE"| B64["Base64 Encode"]
    B64 -->|"WebSocket JSON"| GW["Daedalus Voice<br/>Audio Buffer"]
    GW -->|"WAV header wrap"| STT["Speaches STT"]

    TTS["Speaches TTS"] -->|"PCM 24kHz"| RESAMPLE["Resample<br/>24kHz → 16kHz"]
    RESAMPLE -->|"PCM 16kHz"| B64OUT["Base64 Encode"]
    B64OUT -->|"WebSocket JSON"| SPK["Speaker<br/>16kHz/16-bit/mono"]
```
```
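
The two conversions in the audio flow above, the WAV header wrap before STT and the 24 kHz → 16 kHz resample after TTS, can be sketched with the standard library. This is not the `audio.py` implementation: `pcm_to_wav` and `resample_linear` are hypothetical names, and plain linear interpolation stands in for whatever resampler the pipeline really uses (it applies no anti-aliasing filter).

```python
import array
import io
import sys
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1,
               sample_width_bytes: int = 2) -> bytes:
    """Prefix raw PCM with a WAV header so it can be uploaded as a file."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()


def resample_linear(pcm: bytes, src_rate: int = 24000, dst_rate: int = 16000) -> bytes:
    """Resample mono PCM S16LE by linear interpolation (sketch, no low-pass)."""
    src = array.array("h")
    src.frombytes(pcm)
    if sys.byteorder == "big":
        src.byteswap()  # the wire format is little-endian
    if not src:
        return b""
    n_out = len(src) * dst_rate // src_rate
    out = array.array("h")
    for i in range(n_out):
        pos = i * src_rate / dst_rate          # fractional index into src
        j = int(pos)
        frac = pos - j
        nxt = src[min(j + 1, len(src) - 1)]    # clamp at the last sample
        out.append(round(src[j] * (1 - frac) + nxt * frac))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```

Skipping the resample step while still playing at 16 kHz would make 24 kHz audio sound slow and low-pitched, which is why the output rate has to match the speaker's playback rate.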

## Key Design Decisions

| Decision | Why |
|----------|-----|
| Gateway merged into Daedalus | Shares MCP connections, DB, auth, metrics, frontend — no duplicate infrastructure |
| Agent calls via MCP (not POST /message) | Same Pallas path as text chat; unified connection management and health checks |
| Device self-registration with UUID in NVS | Plug-and-play; user configures workspace assignment in web UI |
| VAD on ESP32, not server-side | Reduces bandwidth; ESP-SR provides reliable on-device VAD |
| JSON + base64 over WebSocket | Simple for v1; binary frames planned for future |
| One conversation per WebSocket session | Multi-turn within a session; natural mapping to voice interaction |
| Timeout + LLM-initiated end | Two natural ways to close: silence timeout or agent recognizes goodbye |
| No audio storage | Only transcripts persisted; audio processed in-memory and discarded |

## Repository Structure

```
stentor/                              # This repository
├── docs/
│   ├── stentor.md                    # Usage guide (updated)
│   └── architecture.md               # This file
├── stentor-ear/                      # ESP32 firmware
│   ├── main/
│   ├── components/
│   └── ...
├── stentor-gateway/                  # Legacy — gateway code migrated to Daedalus
│   └── ...
└── README.md

daedalus/                             # Separate repository
├── backend/daedalus/voice/           # Voice module (migrated from stentor-gateway)
│   ├── audio.py
│   ├── models.py
│   ├── pipeline.py
│   ├── stt_client.py
│   └── tts_client.py
├── backend/daedalus/api/v1/
│   └── voice.py                      # Voice REST + WebSocket endpoints
└── docs/
    └── stentor_integration.md        # Full integration specification
```
docs/stentor.md
@@ -0,0 +1,315 @@
# Stentor — Usage Guide

> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V

Stentor is a voice interface that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents via speech services. The voice gateway runs as part of the **Daedalus** web application backend — there is no separate Stentor server process.

---

## Table of Contents

- [How It Works](#how-it-works)
- [Components](#components)
- [ESP32 Device Setup](#esp32-device-setup)
- [Daedalus Configuration](#daedalus-configuration)
- [Device Registration Flow](#device-registration-flow)
- [Voice Conversation Flow](#voice-conversation-flow)
- [WebSocket Protocol](#websocket-protocol)
- [API Endpoints](#api-endpoints)
- [Observability](#observability)
- [Troubleshooting](#troubleshooting)
- [Architecture Overview](#architecture-overview)

---

## How It Works

1. An ESP32-S3-AUDIO-Board generates a UUID on first boot and registers itself with Daedalus
2. A user assigns the device to a workspace and Pallas agent via the Daedalus web UI
3. When the ESP32 detects a wake word, it opens a WebSocket to Daedalus and starts a voice session
4. On-device VAD (Voice Activity Detection) detects speech and silence
5. Audio streams to Daedalus, which runs: **Speaches STT** → **Pallas Agent (MCP)** → **Speaches TTS**
6. The response audio streams back to the ESP32 speaker
7. Transcripts are saved as conversations in PostgreSQL — visible in the Daedalus web UI alongside text conversations

---

## Components

| Component | Location | Purpose |
|-----------|----------|---------|
| **stentor-ear** | `stentor/stentor-ear/` | ESP32-S3 firmware — microphone, speaker, wake word, VAD |
| **Daedalus voice module** | `daedalus/backend/daedalus/voice/` | Voice pipeline — STT, MCP agent calls, TTS |
| **Daedalus voice API** | `daedalus/backend/daedalus/api/v1/voice.py` | WebSocket + REST endpoints for devices and sessions |
| **Daedalus web UI** | `daedalus/frontend/` | Device management, conversation history |

The Python gateway code that was previously in `stentor/stentor-gateway/` has been merged into Daedalus. That directory is retained for reference but is no longer deployed as a standalone service.

---

## ESP32 Device Setup

The ESP32-S3-AUDIO-Board firmware needs one configuration value:

| Setting | Description | Example |
|---------|-------------|---------|
| Daedalus URL | Base URL of the Daedalus instance | `http://puck.incus:22181` |

On first boot, the device:
1. Generates a UUID v4 and stores it in NVS (non-volatile storage)
2. Registers with Daedalus via `POST /api/v1/voice/devices/register`
3. The UUID persists across reboots — the device keeps its identity
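
The first-boot steps above amount to "create the ID once, then reuse it". The firmware itself is not written in Python; the sketch below only illustrates the logic, with a plain dict standing in for NVS and hypothetical helper names. The `device_id` and `firmware_version` field names are taken from the registration flow in this guide.

```python
import json
import uuid


def load_or_create_device_id(store: dict) -> str:
    """Return the persisted device ID, generating a UUID v4 on first use.

    `store` is a stand-in for NVS; any persistent key-value store would do.
    """
    if "device_id" not in store:
        store["device_id"] = str(uuid.uuid4())
    return store["device_id"]


def build_registration_payload(store: dict, firmware_version: str) -> str:
    """JSON body for POST /api/v1/voice/devices/register."""
    return json.dumps(
        {
            "device_id": load_or_create_device_id(store),
            "firmware_version": firmware_version,
        }
    )
```

Because the ID is read back from the store on later boots, repeated registration calls carry the same `device_id`, which is what lets the endpoint behave idempotently.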

---

## Daedalus Configuration

Voice settings are configured via environment variables with the `DAEDALUS_` prefix:

| Variable | Description | Default |
|----------|-------------|---------|
| `DAEDALUS_VOICE_STT_URL` | Speaches STT endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_TTS_URL` | Speaches TTS endpoint | `http://perseus.helu.ca:22070` |
| `DAEDALUS_VOICE_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `DAEDALUS_VOICE_TTS_MODEL` | TTS model name | `kokoro` |
| `DAEDALUS_VOICE_TTS_VOICE` | TTS voice ID | `af_heart` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` | Sample rate in Hz | `16000` |
| `DAEDALUS_VOICE_AUDIO_CHANNELS` | Audio channels | `1` |
| `DAEDALUS_VOICE_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` | Seconds of silence before auto-end | `120` |

---

## Device Registration Flow

```
ESP32                                   Daedalus
  │                                         │
  │  [First boot — UUID generated]          │
  ├─ POST /api/v1/voice/devices/register ──▶│
  │  {device_id, firmware_version}          │
  │◀─ {status: "registered"} ───────────────┤
  │                                         │
  │  [Device appears in Daedalus            │
  │   Settings → Voice Devices]             │
  │                                         │
  │  [User assigns workspace + agent        │
  │   via web UI]                           │
  │                                         │
  │  [Subsequent boots — same UUID]         │
  ├─ POST /api/v1/voice/devices/register ──▶│
  │  {device_id, firmware_version}          │
  │◀─ {status: "already_registered"} ───────┤
  │                                         │
```

After registration, the device appears in the Daedalus settings page. The user assigns it:
- A **name** (e.g. "Kitchen Speaker")
- A **description** (optional)
- A **workspace** (which workspace voice conversations go to)
- An **agent** (which Pallas agent to target)

Until assigned, the device cannot process voice.

---

## Voice Conversation Flow

A voice conversation is a multi-turn session driven by on-device VAD:

```
ESP32                              Daedalus
  │                                    │
  ├─ [Wake word detected]              │
  ├─ WS /api/v1/voice/realtime ───────▶│
  ├─ session.start ───────────────────▶│  → Create Conversation in DB
  │◀──── session.created ──────────────┤  {session_id, conversation_id}
  │◀──── status: listening ────────────┤
  │                                    │
  │  [VAD: user speaks]                │
  ├─ input_audio_buffer.append ×N ────▶│
  │  [VAD: silence detected]           │
  ├─ input_audio_buffer.commit ───────▶│
  │◀──── status: transcribing ─────────┤  → STT
  │◀──── transcript.done ──────────────┤  → Save user message
  │◀──── status: thinking ─────────────┤  → MCP call to Pallas
  │◀──── response.text.done ───────────┤  → Save assistant message
  │◀──── status: speaking ─────────────┤  → TTS
  │◀──── response.audio.delta ×N ──────┤
  │◀──── response.audio.done ──────────┤
  │◀──── response.done ────────────────┤
  │◀──── status: listening ────────────┤
  │                                    │
  │  [VAD: user speaks again]          │  (same conversation)
  ├─ (next turn cycle) ───────────────▶│
  │                                    │
  │  [Conversation ends by:]           │
  │   • 120s silence → timeout         │
  │   • Agent says goodbye             │
  │   • WebSocket disconnect           │
  │◀──── session.end ──────────────────┤
```

### Conversation End

A conversation can end in one of three ways:

1. **Inactivity timeout** — no speech for `DAEDALUS_VOICE_CONVERSATION_TIMEOUT` seconds (default 120)
2. **Agent-initiated** — the Pallas agent recognizes the conversation is over and signals it
3. **Client disconnect** — the ESP32 sends `session.close` or the WebSocket drops

All conversations are saved in PostgreSQL and visible in the Daedalus workspace chat history.

---

## WebSocket Protocol

### Connection

```
WS /api/v1/voice/realtime?device_id={uuid}
```

### Client → Gateway Messages

| Type | Description | Fields |
|------|-------------|--------|
| `session.start` | Start a new conversation | `client_id` (optional), `audio_config` (optional) |
| `input_audio_buffer.append` | Audio chunk | `audio` (base64 PCM) |
| `input_audio_buffer.commit` | End of speech, trigger pipeline | — |
| `session.close` | End the session | — |

### Gateway → Client Messages

| Type | Description | Fields |
|------|-------------|--------|
| `session.created` | Session started | `session_id`, `conversation_id` |
| `status` | Processing state | `state` (`listening` / `transcribing` / `thinking` / `speaking`) |
| `transcript.done` | User's speech as text | `text` |
| `response.text.done` | Agent's text response | `text` |
| `response.audio.delta` | Audio chunk (streamed) | `delta` (base64 PCM) |
| `response.audio.done` | Audio streaming complete | — |
| `response.done` | Turn complete | — |
| `session.end` | Conversation ended | `reason` (`timeout` / `agent` / `client`) |
| `error` | Error occurred | `message`, `code` |

### Audio Format

All audio is **PCM signed 16-bit little-endian** (`pcm_s16le`), base64-encoded in JSON:

- **Sample rate:** 16,000 Hz
- **Channels:** 1 (mono)
- **Bit depth:** 16-bit
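
These three parameters fix the raw data rate at 16,000 × 1 × 2 = 32,000 bytes per second, and base64 adds roughly a third on top. The helpers below are a small arithmetic sketch (hypothetical names) that can be handy when choosing a chunk size.

```python
def pcm_duration_seconds(num_bytes: int, sample_rate: int = 16000,
                         channels: int = 1, bits_per_sample: int = 16) -> float:
    """Playback time of a raw PCM buffer with the given format."""
    bytes_per_second = sample_rate * channels * bits_per_sample // 8
    return num_bytes / bytes_per_second


def base64_length(num_bytes: int) -> int:
    """Length of the padded base64 encoding of num_bytes raw bytes."""
    return 4 * ((num_bytes + 2) // 3)


# A 100 ms chunk is 3,200 raw bytes and 4,268 base64 characters.
chunk_bytes = 3200
chunk_seconds = pcm_duration_seconds(chunk_bytes)
chunk_b64_chars = base64_length(chunk_bytes)
```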

---

## API Endpoints

All endpoints are served by the Daedalus FastAPI backend.

### Voice Device Management

| Method | Route | Purpose |
|--------|-------|---------|
| `POST` | `/api/v1/voice/devices/register` | ESP32 self-registration (idempotent) |
| `GET` | `/api/v1/voice/devices` | List all registered devices |
| `GET` | `/api/v1/voice/devices/{id}` | Get device details |
| `PUT` | `/api/v1/voice/devices/{id}` | Update device (name, description, workspace, agent) |
| `DELETE` | `/api/v1/voice/devices/{id}` | Remove a device |

### Voice Sessions

| Method | Route | Purpose |
|--------|-------|---------|
| `WS` | `/api/v1/voice/realtime?device_id={id}` | WebSocket for audio conversations |
| `GET` | `/api/v1/voice/sessions` | List active voice sessions |

### Voice Configuration & Health

| Method | Route | Purpose |
|--------|-------|---------|
| `GET` | `/api/v1/voice/config` | Current voice configuration |
| `PUT` | `/api/v1/voice/config` | Update voice settings |
| `GET` | `/api/v1/voice/health` | STT + TTS reachability check |

---

## Observability

### Prometheus Metrics

Voice metrics are exposed at Daedalus's `GET /metrics` endpoint with the `daedalus_voice_` prefix:

| Metric | Type | Description |
|--------|------|-------------|
| `daedalus_voice_sessions_active` | gauge | Active WebSocket sessions |
| `daedalus_voice_pipeline_duration_seconds` | histogram | Full pipeline latency |
| `daedalus_voice_stt_duration_seconds` | histogram | STT latency |
| `daedalus_voice_tts_duration_seconds` | histogram | TTS latency |
| `daedalus_voice_agent_duration_seconds` | histogram | Agent (MCP) latency |
| `daedalus_voice_transcriptions_total` | counter | Total STT calls |
| `daedalus_voice_conversations_total` | counter | Conversations by end reason |
| `daedalus_voice_devices_online` | gauge | Currently connected devices |

### Logs

Voice events flow through the standard Daedalus logging pipeline: structlog → stdout → syslog → Alloy → Loki.

Key log events: `voice_device_registered`, `voice_session_started`, `voice_pipeline_complete`, `voice_conversation_ended`, `voice_pipeline_error`.

---

## Troubleshooting

### Device not appearing in Daedalus settings

- Check that the ESP32 can reach the Daedalus URL
- Verify the registration endpoint responds: `curl -X POST http://puck.incus:22181/api/v1/voice/devices/register -H 'Content-Type: application/json' -d '{"device_id":"test","firmware_version":"1.0"}'`

### Device registered but voice doesn't work

- Assign the device to a workspace and agent in **Settings → Voice Devices**
- Unassigned devices get: `{"type": "error", "code": "no_workspace"}`

### STT returns empty transcripts

- Check Speaches STT is running: `curl http://perseus.helu.ca:22070/v1/models`
- Check the voice health endpoint: `curl http://puck.incus:22181/api/v1/voice/health`

### High latency

- Check `daedalus_voice_pipeline_duration_seconds` in Prometheus/Grafana
- Break latency down by stage: the STT, agent, and TTS histograms identify the bottleneck
- Agent latency depends on the Pallas agent and its downstream MCP servers

### Audio sounds wrong (chipmunk / slow)

- Speaches TTS outputs at 24 kHz; the pipeline resamples to 16 kHz
- Verify `DAEDALUS_VOICE_AUDIO_SAMPLE_RATE` matches the ESP32's playback rate

---

## Architecture Overview

```
┌──────────────────┐       WebSocket         ┌──────────────────────────────────────┐
│  ESP32-S3 Board  │◀══════════════════════▶ │  Daedalus Backend (FastAPI)          │
│  (stentor-ear)   │  JSON + base64 audio    │  puck.incus                          │
│  UUID in NVS     │                         │                                      │
│  Wake Word + VAD │                         │  voice/ module:                      │
└──────────────────┘                         │  STT → MCP (Pallas) → TTS            │
                                             │  Conversations → PostgreSQL          │
                                             └──────┬──────────┬────────┬───────────┘
                                                    │          │        │
                                                MCP │     HTTP │   HTTP │
                                                    ▼          ▼        ▼
                                             ┌──────────┐ ┌────────┐ ┌────────┐
                                             │ Pallas   │ │Speaches│ │Speaches│
                                             │ Agents   │ │  STT   │ │  TTS   │
                                             └──────────┘ └────────┘ └────────┘
```

For full architectural details including Mermaid diagrams, see [architecture.md](architecture.md).

For the complete integration specification, see [daedalus/docs/stentor_integration.md](../../daedalus/docs/stentor_integration.md).