Robert Helewka 912593b796 feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice
pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project

2026-03-21 19:11:48 +00:00


Stentor — Voice Gateway Specification

"Stentor, whose voice was as powerful as fifty voices of other men." — Homer, Iliad, Book V

Stentor is a voice gateway that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents (FastAgent) via speech services (Speaches), enabling real-time voice conversations with AI.

Architecture Overview

┌──────────────────────────────────┐
│  ESP32-S3-AUDIO-Board            │
│  (stentor-ear)                   │
│                                  │
│  Mic Array ──▶ Wake Word ──▶ WS  │
│  Speaker   ◀── PCM playback ◀──  │
└──────────────┬───────────────────┘
               │ WebSocket (audio)
               ▼
┌──────────────────────────────────┐
│  Stentor Gateway                 │
│  (Python service)                │
│                                  │
│  WS Server ──▶ STT ──▶ Agent    │
│  WS Server ◀── TTS ◀── Agent    │
└──────┬──────────┬──────────┬─────┘
       │          │          │
       ▼          ▼          ▼
   Speaches    FastAgent   Speaches
   STT         (HTTP)      TTS

Components

1. Stentor Gateway (`stentor-gateway/`)

Language: Python 3.12+
Framework: FastAPI + WebSockets
Deployment: Agathos container (Caliban or dedicated)

The gateway is the central orchestrator. It:

Accepts WebSocket connections from ESP32 clients
Streams audio to Speaches STT for transcription
Sends transcripts to FastAgent for AI response
Sends AI response text to Speaches TTS for synthesis
Streams synthesized audio back to the ESP32 client

Configuration

All configuration via environment variables (12-factor):

Variable	Description	Default
`STENTOR_HOST`	Gateway bind address	`0.0.0.0`
`STENTOR_PORT`	Gateway bind port	`8600`
`STENTOR_STT_URL`	Speaches STT endpoint (OpenAI-compatible)	`http://perseus.incus:8000`
`STENTOR_TTS_URL`	Speaches TTS endpoint (OpenAI-compatible)	`http://pan.incus:8000`
`STENTOR_AGENT_URL`	FastAgent HTTP endpoint	`http://localhost:8001`
`STENTOR_STT_MODEL`	Whisper model for STT	`Systran/faster-whisper-small`
`STENTOR_TTS_MODEL`	TTS model name	`kokoro`
`STENTOR_TTS_VOICE`	TTS voice ID	`af_heart`
`STENTOR_AUDIO_SAMPLE_RATE`	Audio sample rate in Hz	`16000`
`STENTOR_AUDIO_CHANNELS`	Audio channel count	`1`
`STENTOR_AUDIO_SAMPLE_WIDTH`	Bits per sample	`16`
`STENTOR_LOG_LEVEL`	Logging level	`INFO`
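
As a rough illustration of how the variables above map onto typed settings, the sketch below reads a subset of them using only the standard library. The gateway itself is specified to use Pydantic Settings, so this is a dependency-free stand-in rather than the project's code; the `Settings` dataclass and `load_settings` helper are hypothetical names.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Subset of the gateway settings, with the defaults from the table."""
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    audio_sample_rate: int = 16000


def load_settings(env=None) -> Settings:
    """Build Settings from STENTOR_* variables, falling back to defaults."""
    env = os.environ if env is None else env
    values = {}
    for name, field in Settings.__dataclass_fields__.items():
        raw = env.get(f"STENTOR_{name.upper()}")
        if raw is not None:
            # Coerce to the type of the default (int or str in this sketch).
            values[name] = type(field.default)(raw)
    return Settings(**values)
```

Passing a plain dict instead of `os.environ` (for example `load_settings({"STENTOR_PORT": "9000"})`) keeps the helper easy to exercise in tests.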

WebSocket Protocol

Endpoint: ws://{STENTOR_HOST}:{STENTOR_PORT}/ws/audio

Client → Gateway Messages

// Session start (sent once on connect)
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}

// Audio data (sent continuously while listening)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}

// Client signals end of speech
{
  "type": "audio.end"
}

// Client requests session termination
{
  "type": "session.end"
}

Gateway → Client Messages

// Acknowledge session
{
  "type": "session.ready",
  "session_id": "uuid"
}

// Gateway is processing (ESP32 can show "thinking" LED)
{
  "type": "status",
  "state": "listening" | "transcribing" | "thinking" | "speaking"
}

// Transcript of what user said (for debugging/display)
{
  "type": "transcript",
  "text": "What is the weather like today?"
}

// AI response text (for debugging/display)
{
  "type": "response.text",
  "text": "I don't have weather tools yet, but..."
}

// Audio response (streamed in chunks)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}

// Audio response complete
{
  "type": "audio.end"
}

// Error
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
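
The `audio.chunk` framing used in both directions can be sketched in a few lines of Python. This is a minimal illustration assuming the JSON shapes shown above; `encode_audio_chunk` and `decode_message` are hypothetical helper names, not part of the gateway.

```python
import base64
import json


def encode_audio_chunk(pcm: bytes) -> str:
    """Wrap raw PCM bytes in an audio.chunk JSON message."""
    return json.dumps({
        "type": "audio.chunk",
        "data": base64.b64encode(pcm).decode("ascii"),
    })


def decode_message(raw: str) -> dict:
    """Parse a message; audio.chunk payloads are decoded back to bytes."""
    msg = json.loads(raw)
    if "type" not in msg:
        raise ValueError("message has no 'type' field")
    if msg["type"] == "audio.chunk":
        msg["data"] = base64.b64decode(msg["data"])
    return msg
```

Base64 grows the payload by roughly a third: a 1024-byte chunk (512 samples of 16-bit audio) becomes 1368 characters before the JSON envelope is added.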

Processing Pipeline

1. ESP32 detects wake word locally (ESP-SR)
2. ESP32 opens WebSocket, sends session.start
3. Gateway sends session.ready
4. ESP32 streams audio.chunk messages
5. Gateway buffers audio (with Voice Activity Detection)
6. ESP32 sends audio.end (or VAD detects silence)
7. Gateway sends status: "transcribing"
8. Gateway calls Speaches STT:
   POST {STENTOR_STT_URL}/v1/audio/transcriptions
   - model: {STENTOR_STT_MODEL}
   - file: collected audio as WAV
   - response_format: json
9. Gateway sends transcript message to client
10. Gateway sends status: "thinking"
11. Gateway calls FastAgent:
    POST {STENTOR_AGENT_URL}/message
    - content: transcript text
12. Gateway sends response.text to client
13. Gateway sends status: "speaking"
14. Gateway calls Speaches TTS:
    POST {STENTOR_TTS_URL}/v1/audio/speech
    - model: {STENTOR_TTS_MODEL}
    - voice: {STENTOR_TTS_VOICE}
    - input: agent response text
    - response_format: pcm
    - speed: 1.0
15. Gateway streams audio.chunk messages to client
16. Gateway sends audio.end
17. Gateway returns to listening state (step 4)
    OR ESP32 sends session.end to disconnect
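
Steps 7 to 16 above can be sketched as a single coroutine. This is an illustrative outline, not the gateway's `pipeline.py`: the `stt`, `agent`, `tts` and `send` parameters are hypothetical injected callables standing in for the real clients and the WebSocket, and the `pipeline_error` code is an assumed value (the spec only shows `stt_error`).

```python
import asyncio


async def run_turn(pcm, stt, agent, tts, send):
    """Run one STT -> Agent -> TTS turn, emitting messages in spec order.

    stt and agent are async callables, tts is an async generator function
    yielding already-encoded audio chunks, and send delivers one message
    dict to the client.
    """
    try:
        await send({"type": "status", "state": "transcribing"})
        text = await stt(pcm)
        await send({"type": "transcript", "text": text})

        await send({"type": "status", "state": "thinking"})
        reply = await agent(text)
        await send({"type": "response.text", "text": reply})

        await send({"type": "status", "state": "speaking"})
        async for chunk in tts(reply):
            await send({"type": "audio.chunk", "data": chunk})
        await send({"type": "audio.end"})
    except Exception as exc:
        # v0.1 policy: report the failure to the client and keep it simple.
        await send({"type": "error", "message": str(exc),
                    "code": "pipeline_error"})
```

Injecting the three service calls keeps the ordering logic testable without Speaches or FastAgent running.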

Health & Observability

GET /health — Returns 200 if gateway is running
GET /health/ready — Returns 200 if STT, TTS, and Agent endpoints are reachable
GET /metrics — Prometheus-compatible metrics
- stentor_sessions_active — Current active WebSocket sessions
- stentor_transcriptions_total — Total STT calls
- stentor_tts_requests_total — Total TTS calls
- stentor_agent_requests_total — Total agent calls
- stentor_pipeline_duration_seconds — Histogram of full pipeline latency
- stentor_stt_duration_seconds — Histogram of STT latency
- stentor_tts_duration_seconds — Histogram of TTS latency
- stentor_agent_duration_seconds — Histogram of agent response latency

Dependencies (Python)

[project]
name = "stentor-gateway"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
    "websockets>=14.0",
    "httpx>=0.28",
    "pydantic>=2.10",
    "pydantic-settings>=2.7",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.25",
    "ruff>=0.9",
]

Project Structure

stentor-gateway/
├── pyproject.toml
├── src/
│   └── stentor/
│       ├── __init__.py
│       ├── main.py              # FastAPI app, WebSocket endpoint
│       ├── config.py            # Pydantic Settings configuration
│       ├── models.py            # WebSocket message models
│       ├── pipeline.py          # Orchestration: STT → Agent → TTS
│       ├── stt_client.py        # Speaches STT client (OpenAI-compatible)
│       ├── tts_client.py        # Speaches TTS client (OpenAI-compatible)
│       ├── agent_client.py      # FastAgent HTTP client
│       ├── audio.py             # Audio utilities (PCM↔WAV, resampling)
│       └── health.py            # Health check endpoints
├── tests/
│   ├── test_pipeline.py
│   ├── test_stt_client.py
│   ├── test_tts_client.py
│   └── test_websocket.py
└── Dockerfile

2. Stentor Ear (`stentor-ear/`)

Platform: ESP32-S3-AUDIO-Board (Waveshare)
Framework: ESP-IDF 5.x + ESP-ADF
Language: C

The ESP32 firmware handles local audio capture, wake word detection, and communication with the Stentor Gateway.

Hardware Capabilities (Waveshare ESP32-S3-AUDIO-Board)

Component	Chip	Purpose
Microphone array	2× MEMS via ES7210 ADC	Audio capture, noise reduction
Speaker output	ES8311 DAC + Class-D amp	Audio playback
Display	1.85" LCD (ST7789V2)	Status display (optional)
LEDs	WS2812B RGB ring (×2)	Visual status feedback
Storage	TF card slot	Local audio cache (optional)
Camera	OV2640 (2MP)	Not used for Stentor
RTC	PCF85063	Timestamping (optional)
Connectivity	WiFi 802.11 b/g/n + BLE 5	Network communication
Wake word	ESP-SR	Local "Hi ESP" detection

Reference: Waveshare wiki documents ESP-SR wake word "hi esp" with command recognition support [2].

Firmware State Machine

                   ┌─────────────────────┐
         ┌─────────│ IDLE                │◀────────┐
         │         │ (LEDs: dim blue)    │         │
         │         └──────────┬──────────┘         │
         │                    │ Wake word          │
         │                    │ detected           │
         │                    ▼                    │
         │         ┌─────────────────────┐         │
         │         │ CONNECT             │         │
         │         │ (LEDs: pulse white) │         │
         │         └──────────┬──────────┘         │
         │                    │ WS connected       │
         │                    ▼                    │
         │         ┌─────────────────────┐         │
  Timeout│         │ LISTEN              │         │ audio.end
  / Error│         │ (LEDs: green)       │         │ received
         │         └──────────┬──────────┘         │
         │                    │ Silence /          │
         │                    │ button press       │
         │                    ▼                    │
         │         ┌─────────────────────┐         │
         │         │ WAITING             │         │
         │         │ (LEDs: pulse yellow)│         │
         │         └──────────┬──────────┘         │
         │                    │ audio.chunk        │
         │                    │ from gateway       │
         │                    ▼                    │
         │         ┌─────────────────────┐         │
         │         │ PLAYING             ├─────────┤
         │         │ (LEDs: pulse cyan)  │         │
         │         └─────────────────────┘         │
         │                                         │
         └─────────────────────────────────────────┘

Audio Configuration

Parameter	Value	Notes
Sample rate	16000 Hz	Standard for speech recognition
Bit depth	16-bit	Signed, little-endian (PCM_S16LE)
Channels	1 (mono)	Downmixed from dual-mic array
Chunk size	512 samples (32ms)	Balance between latency and overhead
Buffer	4096 samples (256ms)	Ring buffer for network jitter
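
The durations in the table follow directly from the sample rate, and the same arithmetic gives the raw payload size per chunk. A quick check, with hypothetical helper names:

```python
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2  # 16-bit PCM
CHANNELS = 1


def duration_ms(samples: int, rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Playback time of a block of samples, in milliseconds."""
    return samples * 1000 / rate_hz


def size_bytes(samples: int) -> int:
    """Raw PCM size of a block of samples, before base64 encoding."""
    return samples * BYTES_PER_SAMPLE * CHANNELS


print(duration_ms(512), size_bytes(512))    # 32.0 1024
print(duration_ms(4096), size_bytes(4096))  # 256.0 8192
```

At 32 ms per chunk the client sends about 31 `audio.chunk` messages per second while listening.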

WiFi & Network

WiFi credentials via menuconfig or BLE provisioning
mDNS discovery for gateway: _stentor._tcp.local
Fallback to configured static gateway URL
Auto-reconnect with exponential backoff
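
The reconnect schedule can be illustrated as follows. The firmware is written in C, so this Python generator only shows the shape of the delay sequence; the base, factor, cap and jitter values are assumptions, since the spec does not fix them.

```python
import random


def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.0,
                   rng=random.random):
    """Yield reconnect delays in seconds: base, base*factor, ... up to cap."""
    delay = base
    while True:
        # Optional jitter spreads out reconnects from several devices.
        yield min(delay, cap) * (1.0 + jitter * rng())
        delay = min(delay * factor, cap)
```

With `base=1.0`, `factor=2.0` and `cap=10.0` the delays are 1, 2, 4, 8, 10, 10, ... seconds; the counter would be reset to `base` after a successful connection.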

LED Feedback (WS2812B Ring)

State	Pattern	Color
Idle / standby	Dim steady	Blue
Wake word detected / connecting	Pulse	White
Listening (recording)	Steady	Green
Transcribing / thinking	Rotating pulse	Yellow
Speaking (playback)	Pulse with audio	Cyan
Error	Flash	Red

Firmware Project Structure

stentor-ear/
├── CMakeLists.txt
├── sdkconfig.defaults
├── main/
│   ├── CMakeLists.txt
│   ├── main.c                  # App entry, state machine
│   ├── audio_capture.c/.h      # I2S mic input via ES7210
│   ├── audio_playback.c/.h     # I2S speaker output via ES8311
│   ├── wake_word.c/.h          # ESP-SR integration
│   ├── ws_client.c/.h          # WebSocket client to gateway
│   ├── led_status.c/.h         # WS2812B LED patterns
│   ├── wifi_manager.c/.h       # WiFi connection management
│   └── config.h                # Compile-time configuration
├── components/
│   └── (ESP-ADF / ESP-SR as managed components)
└── partitions.csv

3. FastAgent Integration

The Stentor Gateway communicates with FastAgent via its HTTP transport mode.

Starting FastAgent for Stentor

# Expose Harper (or any agent) as an HTTP server
uv run fast-agent --transport http --port 8001

Agent Request/Response

The gateway sends a simple message and receives the agent's response. The exact API shape depends on FastAgent's HTTP transport implementation — expected to follow MCP-over-HTTP conventions:

POST /message
Content-Type: application/json

{
  "content": "What can you tell me about the Agathos infrastructure?"
}

Response:
{
  "content": "Agathos is a set of Incus containers on a 10.10.0.0/24 network..."
}

Note for Cline: The FastAgent HTTP transport API may need investigation. Check the fast-agent source for the actual endpoint shape. The gateway's agent_client.py should be the thinnest possible adapter so it's easy to swap.

4. Speaches Integration

Speaches provides OpenAI API-compatible STT and TTS endpoints.

STT (Speech-to-Text)

POST {STENTOR_STT_URL}/v1/audio/transcriptions
Content-Type: multipart/form-data

- file: audio.wav (PCM wrapped in WAV header)
- model: "Systran/faster-whisper-small"
- language: "en" (optional, auto-detect if omitted)
- response_format: "json"

Response:
{
  "text": "What is the weather like today?"
}
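
Wrapping the buffered PCM in a WAV header before upload needs nothing beyond the standard library `wave` module. A minimal sketch, with `pcm_to_wav` as a hypothetical helper name (the spec places this kind of utility in `audio.py`):

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000, channels: int = 1,
               sample_width_bytes: int = 2) -> bytes:
    """Prepend a WAV (RIFF) header to raw PCM_S16LE samples."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)  # bytes, not bits: 16-bit -> 2
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Note the unit difference: `STENTOR_AUDIO_SAMPLE_WIDTH` is expressed in bits (16), while `wave` expects bytes per sample (2).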

TTS (Text-to-Speech)

POST {STENTOR_TTS_URL}/v1/audio/speech
Content-Type: application/json

{
  "model": "kokoro",
  "voice": "af_heart",
  "input": "I don't have weather tools yet, but I can help with other things.",
  "response_format": "pcm",
  "speed": 1.0
}

Response: Binary PCM audio stream (24kHz, mono, 16-bit)

Note: Speaches TTS may output at 24kHz. The gateway's audio.py module should handle resampling to 16kHz if the ESP32 is configured for 16kHz playback. Alternatively, configure the ESP32 for 24kHz playback.
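
One way to handle the 24kHz to 16kHz conversion without extra dependencies is plain linear interpolation, sketched below. This is an assumption about how `audio.py` might do it, not the project's implementation; `resample_s16le` is a hypothetical name, and the sketch applies no low-pass filter, so a production version would likely add one to limit aliasing.

```python
import struct


def resample_s16le(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample mono 16-bit little-endian PCM by linear interpolation."""
    n_src = len(pcm) // 2
    pcm = pcm[: n_src * 2]  # drop a trailing odd byte, if any
    if n_src == 0 or src_rate == dst_rate:
        return pcm
    src = struct.unpack(f"<{n_src}h", pcm)
    n_dst = n_src * dst_rate // src_rate
    out = []
    for i in range(n_dst):
        pos = i * src_rate / dst_rate  # fractional index into src
        lo = int(pos)
        hi = min(lo + 1, n_src - 1)
        frac = pos - lo
        # Weighted average of the two neighbouring input samples.
        out.append(round(src[lo] * (1 - frac) + src[hi] * frac))
    return struct.pack(f"<{n_dst}h", *out)
```

For 24kHz input and 16kHz output every three input samples produce two output samples, so a 6-sample ramp `0, 300, ..., 1500` becomes `0, 450, 900, 1350`.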

Build & Run

Gateway

cd stentor-gateway
pip install -e .
stentor

Or with more control:

python -m stentor
# or: uvicorn stentor.main:app --host 0.0.0.0 --port 8600

See docs/stentor.md for full usage documentation.

ESP32 Firmware

cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig  # Set WiFi creds, gateway URL
idf.py build flash monitor

Phase 2 (Future)

These are explicitly out of scope for v0.1 but documented for planning:

Streaming STT — Use Speaches SSE streaming for real-time partial transcripts
Streaming TTS — Stream TTS audio chunks as they're generated (reduce time-to-first-audio)
Barge-in / Interruption — Allow user to interrupt AI playback with new speech
Multi-turn context — Maintain conversation history across turns within a session
PJSUA2 SIP bridge — Enable calling Stentor from a phone via SIP
Multiple agents — Route to different FastAgent instances based on wake word or command
Local fallback — ESP32 handles simple commands locally when gateway is unreachable
OTA firmware updates — Update stentor-ear firmware over WiFi
LCD UI — Show transcript and response text on the 1.85" display

Design Decisions & Rationale

Decision	Rationale
WebSocket over SIP/RTP	No SIP infrastructure needed; ESP-IDF has native WS support; Speaches speaks HTTP not SIP
Base64 audio in JSON	Simplicity for v0.1; binary WS frames are a Phase 2 optimization
Wake word on ESP32	Zero-latency wake detection; no audio streaming when idle; privacy-preserving
Gateway as separate service	Decouples hardware from AI; can swap agents, add SIP later, scale independently
PCM_S16LE format	Universal, no codec overhead, both ESP32 and Speaches handle it natively
Pydantic Settings for config	Type-safe, env-var-driven, zero config files to manage

Notes for Cline

Start with the gateway. The ESP32 firmware is more complex and needs the gateway to test against.
Use httpx.AsyncClient for all HTTP calls — it supports streaming responses which you'll need for TTS.
The WebSocket protocol is intentionally simple. JSON messages with base64 audio. Optimize to binary frames later.
Test with a Python WebSocket client first before touching ESP32 hardware. Write a test_client.py that sends a WAV file over the WebSocket and plays back the response.
FastAgent HTTP transport needs investigation — check the fast-agent repo for actual endpoint signatures.
Audio resampling — Speaches TTS may output 24kHz; ESP32 may expect 16kHz. Handle this in audio.py.
Don't overthink error handling in v0.1: log errors, send error messages to the client, keep the pipeline simple.

Στέντωρ (Stentor)

"Stentor, whose voice was as powerful as fifty voices of other men." — Homer, Iliad, Book V

Voice gateway for AI agents. Bridges physical audio hardware to AI agents via speech services.

Architecture

ESP32-S3-AUDIO-Board  ◀──WebSocket──▶  Stentor Gateway  ◀──HTTP──▶  Speaches (STT/TTS)
     (stentor-ear)                      (Python)                     FastAgent (AI)

Stentor Gateway orchestrates the voice pipeline:

Receives audio from ESP32 via WebSocket
Transcribes speech via Speaches STT (OpenAI-compatible API)
Sends transcript to FastAgent for AI response
Synthesizes speech via Speaches TTS
Streams audio back to ESP32 for playback

Stentor Ear is the ESP32-S3 firmware:

Local wake word detection ("Hi ESP" via ESP-SR)
Audio capture from dual-mic array
Audio playback via onboard speaker
LED ring feedback for conversation state

Quick Start

Gateway

cd stentor-gateway
uv sync
uv run uvicorn stentor.main:app --host 0.0.0.0 --port 8600

ESP32 Firmware

cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig  # Set WiFi creds, gateway URL
idf.py build flash monitor

Documentation

Full Specification

Status

🚧 Pre-alpha — Under active development.

License

MIT
