Initialize the stentor-gateway project with WebSocket-based voice pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
Stentor — Voice Gateway Specification
"Stentor, whose voice was as powerful as fifty voices of other men." — Homer, Iliad, Book V
Stentor is a voice gateway that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents (FastAgent) via speech services (Speaches), enabling real-time voice conversations with AI.
Architecture Overview
┌──────────────────────────────────┐
│       ESP32-S3-AUDIO-Board       │
│          (stentor-ear)           │
│                                  │
│  Mic Array ──▶ Wake Word ──▶ WS  │
│  Speaker ◀── PCM playback ◀──    │
└──────────────┬───────────────────┘
               │ WebSocket (audio)
               ▼
┌──────────────────────────────────┐
│         Stentor Gateway          │
│         (Python service)         │
│                                  │
│  WS Server ──▶ STT ──▶ Agent     │
│  WS Server ◀── TTS ◀── Agent     │
└──────┬──────────┬──────────┬─────┘
       │          │          │
       ▼          ▼          ▼
   Speaches   FastAgent   Speaches
      STT      (HTTP)       TTS
Components
1. Stentor Gateway (stentor-gateway/)
Language: Python 3.12+
Framework: FastAPI + WebSockets
Deployment: Agathos container (Caliban or dedicated)
The gateway is the central orchestrator. It:
- Accepts WebSocket connections from ESP32 clients
- Streams audio to Speaches STT for transcription
- Sends transcripts to FastAgent for AI response
- Sends AI response text to Speaches TTS for synthesis
- Streams synthesized audio back to the ESP32 client
Configuration
All configuration via environment variables (12-factor):
| Variable | Description | Default |
|---|---|---|
| STENTOR_HOST | Gateway bind address | 0.0.0.0 |
| STENTOR_PORT | Gateway bind port | 8600 |
| STENTOR_STT_URL | Speaches STT endpoint (OpenAI-compatible) | http://perseus.incus:8000 |
| STENTOR_TTS_URL | Speaches TTS endpoint (OpenAI-compatible) | http://pan.incus:8000 |
| STENTOR_AGENT_URL | FastAgent HTTP endpoint | http://localhost:8001 |
| STENTOR_STT_MODEL | Whisper model for STT | Systran/faster-whisper-small |
| STENTOR_TTS_MODEL | TTS model name | kokoro |
| STENTOR_TTS_VOICE | TTS voice ID | af_heart |
| STENTOR_AUDIO_SAMPLE_RATE | Audio sample rate in Hz | 16000 |
| STENTOR_AUDIO_CHANNELS | Audio channel count | 1 |
| STENTOR_AUDIO_SAMPLE_WIDTH | Bits per sample | 16 |
| STENTOR_LOG_LEVEL | Logging level | INFO |
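The table maps naturally onto a settings object. The spec names Pydantic Settings for this; the sketch below is a standard-library stand-in (a frozen dataclass plus an environment lookup) that only illustrates the `STENTOR_` prefix and the defaults. The names `Settings` and `load_settings` are hypothetical, not taken from the gateway source.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Settings:
    """Defaults mirror the configuration table; field names are illustrative."""
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    log_level: str = "INFO"


def load_settings(env=None, prefix="STENTOR_"):
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(Settings):
        raw = env.get(prefix + field.name.upper())
        if raw is not None:
            # Coerce to the type of the default value (int or str here).
            overrides[field.name] = type(field.default)(raw)
    return Settings(**overrides)
```

Passing a plain dict as `env` keeps the loader easy to test; in the real service, Pydantic Settings would add validation on top of the same variable names.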
WebSocket Protocol
Endpoint: ws://{STENTOR_HOST}:{STENTOR_PORT}/ws/audio
Client → Gateway Messages
// Session start (sent once on connect)
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}
// Audio data (sent continuously while listening)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}
// Client signals end of speech
{
  "type": "audio.end"
}
// Client requests session termination
{
  "type": "session.end"
}
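As a rough illustration of the client side, the helpers below build `session.start` and `audio.chunk` messages and decode a chunk back into PCM bytes. The function names are hypothetical; only the message fields come from the protocol above.

```python
import base64
import json


def session_start(client_id, sample_rate=16000, channels=1, sample_width=16):
    """Serialize the session.start message sent once on connect."""
    return json.dumps({
        "type": "session.start",
        "client_id": client_id,
        "audio_config": {
            "sample_rate": sample_rate,
            "channels": channels,
            "sample_width": sample_width,
            "encoding": "pcm_s16le",
        },
    })


def audio_chunk(pcm: bytes) -> str:
    """Wrap raw PCM bytes in an audio.chunk message (base64 payload)."""
    return json.dumps({
        "type": "audio.chunk",
        "data": base64.b64encode(pcm).decode("ascii"),
    })


def decode_audio_chunk(message: str) -> bytes:
    """Extract the PCM bytes from an audio.chunk message."""
    msg = json.loads(message)
    if msg.get("type") != "audio.chunk":
        raise ValueError(f"expected audio.chunk, got {msg.get('type')!r}")
    return base64.b64decode(msg["data"])
```

Base64 expands the payload by a factor of 4/3, so a 1024-byte chunk becomes 1368 characters before the JSON envelope is added.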
Gateway → Client Messages
// Acknowledge session
{
  "type": "session.ready",
  "session_id": "uuid"
}
// Gateway is processing (ESP32 can show "thinking" LED)
{
  "type": "status",
  "state": "listening" | "transcribing" | "thinking" | "speaking"
}
// Transcript of what user said (for debugging/display)
{
  "type": "transcript",
  "text": "What is the weather like today?"
}
// AI response text (for debugging/display)
{
  "type": "response.text",
  "text": "I don't have weather tools yet, but..."
}
// Audio response (streamed in chunks)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}
// Audio response complete
{
  "type": "audio.end"
}
// Error
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
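A test client might validate incoming gateway messages before acting on them. The sketch below checks the `type` field against the message types listed above and the `state` field against the four documented states; `parse_gateway_message` and `error_message` are hypothetical helper names.

```python
import json

GATEWAY_TYPES = {
    "session.ready", "status", "transcript", "response.text",
    "audio.chunk", "audio.end", "error",
}
STATES = {"listening", "transcribing", "thinking", "speaking"}


def parse_gateway_message(raw: str) -> dict:
    """Decode a gateway message and reject unknown types or states."""
    msg = json.loads(raw)
    kind = msg.get("type")
    if kind not in GATEWAY_TYPES:
        raise ValueError(f"unknown message type: {kind!r}")
    if kind == "status" and msg.get("state") not in STATES:
        raise ValueError(f"unknown status state: {msg.get('state')!r}")
    return msg


def error_message(code: str, message: str) -> str:
    """Serialize an error message in the shape shown above."""
    return json.dumps({"type": "error", "message": message, "code": code})
```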
Processing Pipeline
1. ESP32 detects wake word locally (ESP-SR)
2. ESP32 opens WebSocket, sends session.start
3. Gateway sends session.ready
4. ESP32 streams audio.chunk messages
5. Gateway buffers audio (with Voice Activity Detection)
6. ESP32 sends audio.end (or VAD detects silence)
7. Gateway sends status: "transcribing"
8. Gateway calls Speaches STT:
   POST {STENTOR_STT_URL}/v1/audio/transcriptions
   - model: {STENTOR_STT_MODEL}
   - file: collected audio as WAV
   - response_format: json
9. Gateway sends transcript message to client
10. Gateway sends status: "thinking"
11. Gateway calls FastAgent:
    POST {STENTOR_AGENT_URL}/message
    - content: transcript text
12. Gateway sends response.text to client
13. Gateway sends status: "speaking"
14. Gateway calls Speaches TTS:
    POST {STENTOR_TTS_URL}/v1/audio/speech
    - model: {STENTOR_TTS_MODEL}
    - voice: {STENTOR_TTS_VOICE}
    - input: agent response text
    - response_format: pcm
    - speed: 1.0
15. Gateway streams audio.chunk messages to client
16. Gateway sends audio.end
17. Gateway returns to listening state (step 4)
    OR ESP32 sends session.end to disconnect
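Steps 7 to 16 can be sketched as a single coroutine in which the STT, agent, and TTS calls are injected as async callables. This is a minimal outline under that assumption, not the contents of `pipeline.py`; `run_turn`, its parameters, and the 1024-byte chunk size are all hypothetical.

```python
import asyncio
import base64


async def run_turn(pcm, send, transcribe, ask_agent, synthesize, chunk_bytes=1024):
    """Run one utterance through STT -> Agent -> TTS, reporting progress via send()."""
    await send({"type": "status", "state": "transcribing"})
    transcript = await transcribe(pcm)        # step 8: Speaches STT
    await send({"type": "transcript", "text": transcript})

    await send({"type": "status", "state": "thinking"})
    reply = await ask_agent(transcript)       # step 11: FastAgent
    await send({"type": "response.text", "text": reply})

    await send({"type": "status", "state": "speaking"})
    audio = await synthesize(reply)           # step 14: Speaches TTS
    for start in range(0, len(audio), chunk_bytes):
        chunk = audio[start:start + chunk_bytes]
        await send({"type": "audio.chunk",
                    "data": base64.b64encode(chunk).decode("ascii")})
    await send({"type": "audio.end"})
    return transcript, reply


if __name__ == "__main__":
    async def demo():
        async def send(msg):                  # stand-in for websocket.send_json
            print(msg["type"])

        async def stt(_pcm):
            return "hello"

        async def agent(text):
            return text.upper()

        async def tts(_text):
            return bytes(2500)

        await run_turn(b"", send, stt, agent, tts)

    asyncio.run(demo())
```

Injecting the three service calls keeps the orchestration testable without Speaches or FastAgent running.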
Health & Observability
- GET /health: Returns 200 if gateway is running
- GET /health/ready: Returns 200 if STT, TTS, and Agent endpoints are reachable
- GET /metrics: Prometheus-compatible metrics
  - stentor_sessions_active: Current active WebSocket sessions
  - stentor_transcriptions_total: Total STT calls
  - stentor_tts_requests_total: Total TTS calls
  - stentor_agent_requests_total: Total agent calls
  - stentor_pipeline_duration_seconds: Histogram of full pipeline latency
  - stentor_stt_duration_seconds: Histogram of STT latency
  - stentor_tts_duration_seconds: Histogram of TTS latency
  - stentor_agent_duration_seconds: Histogram of agent response latency
Dependencies (Python)
[project]
name = "stentor-gateway"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
    "websockets>=14.0",
    "httpx>=0.28",
    "pydantic>=2.10",
    "pydantic-settings>=2.7",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.25",
    "ruff>=0.9",
]
Project Structure
stentor-gateway/
├── pyproject.toml
├── src/
│   └── stentor/
│       ├── __init__.py
│       ├── main.py              # FastAPI app, WebSocket endpoint
│       ├── config.py            # Pydantic Settings configuration
│       ├── models.py            # WebSocket message models
│       ├── pipeline.py          # Orchestration: STT → Agent → TTS
│       ├── stt_client.py        # Speaches STT client (OpenAI-compatible)
│       ├── tts_client.py        # Speaches TTS client (OpenAI-compatible)
│       ├── agent_client.py      # FastAgent HTTP client
│       ├── audio.py             # Audio utilities (PCM↔WAV, resampling)
│       └── health.py            # Health check endpoints
├── tests/
│   ├── test_pipeline.py
│   ├── test_stt_client.py
│   ├── test_tts_client.py
│   └── test_websocket.py
└── Dockerfile
2. Stentor Ear (stentor-ear/)
Platform: ESP32-S3-AUDIO-Board (Waveshare)
Framework: ESP-IDF 5.x + ESP-ADF
Language: C
The ESP32 firmware handles local audio capture, wake word detection, and communication with the Stentor Gateway.
Hardware Capabilities (Waveshare ESP32-S3-AUDIO-Board)
| Component | Chip | Purpose |
|---|---|---|
| Microphone array | 2× MEMS via ES7210 ADC | Audio capture, noise reduction |
| Speaker output | ES8311 DAC + Class-D amp | Audio playback |
| Display | 1.85" LCD (ST7789V2) | Status display (optional) |
| LEDs | WS2812B RGB ring (×2) | Visual status feedback |
| Storage | TF card slot | Local audio cache (optional) |
| Camera | OV2640 (2MP) | Not used for Stentor |
| RTC | PCF85063 | Timestamping (optional) |
| Connectivity | WiFi 802.11 b/g/n + BLE 5 | Network communication |
| Wake word | ESP-SR | Local "Hi ESP" detection |
Reference: Waveshare wiki documents ESP-SR wake word "hi esp" with command recognition support [2].
Firmware State Machine
                 ┌──────────┐
       ┌─────────│   IDLE   │◀──────────┐
       │         │ (LEDs: dim blue)     │
       │         └────┬─────┘           │
       │              │ Wake word       │
       │              │ detected        │
       │              ▼                 │
       │         ┌──────────┐           │
       │         │ CONNECT  │           │
       │         │ (LEDs: pulse white)  │
       │         └────┬─────┘           │
       │              │ WS connected    │
       │              ▼                 │
       │         ┌──────────┐           │
Timeout│         │  LISTEN  │           │ audio.end
/ Error│         │ (LEDs: green)        │ received
       │         └────┬─────┘           │
       │              │ Silence /       │
       │              │ button press    │
       │              ▼                 │
       │         ┌──────────┐           │
       │         │ WAITING  │           │
       │         │ (LEDs: pulse yellow) │
       │         └────┬─────┘           │
       │              │ audio.chunk     │
       │              │ from gateway    │
       │              ▼                 │
       │         ┌──────────┐           │
       │         │ PLAYING  ├───────────┘
       │         │ (LEDs: pulse cyan)   │
       │         └──────────┘           │
       │                                │
       └────────────────────────────────┘
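The firmware itself is written in C, but the transitions in the diagram can be modelled in a few lines of Python for host-side testing. This sketch reads the diagram as: any timeout or error returns to IDLE, and `audio.end` during playback also returns to IDLE. The event names and the `next_state` helper are hypothetical.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    CONNECT = "connect"
    LISTEN = "listen"
    WAITING = "waiting"
    PLAYING = "playing"


# (current state, event) -> next state, following the arrows in the diagram.
TRANSITIONS = {
    (State.IDLE, "wake_word"): State.CONNECT,
    (State.CONNECT, "ws_connected"): State.LISTEN,
    (State.LISTEN, "silence"): State.WAITING,
    (State.LISTEN, "button_press"): State.WAITING,
    (State.WAITING, "audio.chunk"): State.PLAYING,
    (State.PLAYING, "audio.end"): State.IDLE,
}


def next_state(state: State, event: str) -> State:
    """Return the next state; unknown events leave the state unchanged."""
    if event in ("timeout", "error"):
        return State.IDLE
    return TRANSITIONS.get((state, event), state)
```

A table-driven layout like this translates fairly directly to a C `switch` or lookup array in `main.c`.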
Audio Configuration
| Parameter | Value | Notes |
|---|---|---|
| Sample rate | 16000 Hz | Standard for speech recognition |
| Bit depth | 16-bit | Signed, little-endian (PCM_S16LE) |
| Channels | 1 (mono) | Downmixed from dual-mic array |
| Chunk size | 512 samples (32ms) | Balance between latency and overhead |
| Buffer | 4096 samples (256ms) | Ring buffer for network jitter |
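The durations in the table follow directly from the sample rate, and the same arithmetic gives the byte sizes. A small worked check (helper names are illustrative):

```python
SAMPLE_RATE_HZ = 16000
BITS_PER_SAMPLE = 16
CHANNELS = 1


def duration_ms(samples: int, rate: int = SAMPLE_RATE_HZ) -> float:
    """Playback time of a block of samples, in milliseconds."""
    return samples * 1000 / rate


def size_bytes(samples: int, bits: int = BITS_PER_SAMPLE, channels: int = CHANNELS) -> int:
    """Raw PCM size of a block of samples, in bytes."""
    return samples * channels * bits // 8


print(duration_ms(512), size_bytes(512))      # chunk:  32.0 ms, 1024 bytes
print(duration_ms(4096), size_bytes(4096))    # buffer: 256.0 ms, 8192 bytes
print(size_bytes(SAMPLE_RATE_HZ))             # 32000 bytes per second of audio
```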
WiFi & Network
- WiFi credentials via menuconfig or BLE provisioning
- mDNS discovery for gateway: _stentor._tcp.local
- Fallback to configured static gateway URL
- Auto-reconnect with exponential backoff
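The spec does not fix the backoff parameters, so the values below (0.5 s base, doubling, 30 s cap) are assumptions chosen only to illustrate the reconnect schedule. The firmware would implement the same idea in C; `backoff_delays` is a hypothetical name.

```python
import itertools
import random


def backoff_delays(base=0.5, factor=2.0, cap=30.0, jitter=0.0, rng=random.random):
    """Yield reconnect delays in seconds: exponential growth, capped, optional jitter."""
    delay = base
    while True:
        # Jitter adds up to `jitter` * delay on top, to avoid synchronized retries.
        yield min(cap, delay) * (1 + jitter * rng())
        delay = min(cap, delay * factor)


print(list(itertools.islice(backoff_delays(), 8)))
# [0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```

A successful connection should reset the generator so the next outage starts again from the base delay.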
LED Feedback (WS2812B Ring)
| State | Pattern | Color |
|---|---|---|
| Idle / standby | Dim steady | Blue |
| Wake word detected / connecting | Pulse | White |
| Listening (recording) | Steady | Green |
| Transcribing / thinking | Rotating pulse | Yellow |
| Speaking (playback) | Pulse with audio | Cyan |
| Error | Flash | Red |
Firmware Project Structure
stentor-ear/
├── CMakeLists.txt
├── sdkconfig.defaults
├── main/
│   ├── CMakeLists.txt
│   ├── main.c                 # App entry, state machine
│   ├── audio_capture.c/.h     # I2S mic input via ES7210
│   ├── audio_playback.c/.h    # I2S speaker output via ES8311
│   ├── wake_word.c/.h         # ESP-SR integration
│   ├── ws_client.c/.h         # WebSocket client to gateway
│   ├── led_status.c/.h        # WS2812B LED patterns
│   ├── wifi_manager.c/.h      # WiFi connection management
│   └── config.h               # Compile-time configuration
├── components/
│   └── (ESP-ADF / ESP-SR as managed components)
└── partitions.csv
3. FastAgent Integration
The Stentor Gateway communicates with FastAgent via its HTTP transport mode.
Starting FastAgent for Stentor
# Expose Harper (or any agent) as an HTTP server
uv run fast-agent --transport http --port 8001
Agent Request/Response
The gateway sends a simple message and receives the agent's response. The exact API shape depends on FastAgent's HTTP transport implementation — expected to follow MCP-over-HTTP conventions:
POST /message
Content-Type: application/json
{
  "content": "What can you tell me about the Agathos infrastructure?"
}
Response:
{
  "content": "Agathos is a set of Incus containers on a 10.10.0.0/24 network..."
}
Note for Cline: The FastAgent HTTP transport API may need investigation. Check the fast-agent source for the actual endpoint shape. The gateway's agent_client.py should be the thinnest possible adapter so it's easy to swap.
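One way to keep the adapter thin is to inject the HTTP call, so the only knowledge in the class is the path and the `content` field. The sketch below assumes the request/response shape shown above, which the spec itself flags as unverified; `AgentClient` and the `post` callable signature are hypothetical. In the gateway, `post` would presumably wrap `httpx.AsyncClient`.

```python
import asyncio


class AgentClient:
    """Thin adapter: knows the endpoint path and the 'content' field, nothing else."""

    def __init__(self, post, base_url="http://localhost:8001"):
        # post: async callable (url, json_payload) -> decoded JSON dict (assumed shape)
        self._post = post
        self._url = base_url.rstrip("/") + "/message"

    async def send_message(self, content: str) -> str:
        reply = await self._post(self._url, {"content": content})
        text = reply.get("content")
        if not isinstance(text, str):
            raise ValueError("agent reply has no 'content' string")
        return text


if __name__ == "__main__":
    async def fake_post(url, payload):        # stand-in transport for local testing
        return {"content": f"echo: {payload['content']}"}

    print(asyncio.run(AgentClient(fake_post).send_message("hello")))
```

If the real FastAgent endpoint turns out to differ, only this class needs to change.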
4. Speaches Integration
Speaches provides OpenAI API-compatible STT and TTS endpoints.
STT (Speech-to-Text)
POST {STENTOR_STT_URL}/v1/audio/transcriptions
Content-Type: multipart/form-data
  - file: audio.wav (PCM wrapped in WAV header)
  - model: "Systran/faster-whisper-small"
  - language: "en" (optional, auto-detect if omitted)
  - response_format: "json"
Response:
{
  "text": "What is the weather like today?"
}
}
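Wrapping the buffered PCM in a WAV header needs only the standard-library `wave` module. A minimal sketch of what `audio.py` might contain; the function name `pcm_to_wav` is hypothetical.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate=16000, channels=1, sample_width_bits=16) -> bytes:
    """Return the PCM data wrapped in a RIFF/WAV container, held in memory."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bits // 8)   # wave expects bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

The resulting bytes can be sent as the `file` part of the multipart request shown above.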
TTS (Text-to-Speech)
POST {STENTOR_TTS_URL}/v1/audio/speech
Content-Type: application/json
{
  "model": "kokoro",
  "voice": "af_heart",
  "input": "I don't have weather tools yet, but I can help with other things.",
  "response_format": "pcm",
  "speed": 1.0
}
Response: Binary PCM audio stream (24kHz, mono, 16-bit)
Note: Speaches TTS may output at 24kHz. The gateway's audio.py module should handle resampling to 16kHz if the ESP32 is configured for 16kHz playback. Alternatively, configure the ESP32 for 24kHz playback.
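For the 24kHz to 16kHz case, a dependency-free starting point is linear interpolation over the 16-bit samples. This is a sketch only: it applies no low-pass filter before downsampling, so some aliasing is possible, and a proper resampler may be preferable later. `resample_s16le` is a hypothetical name.

```python
import sys
from array import array


def resample_s16le(pcm: bytes, src_rate=24000, dst_rate=16000) -> bytes:
    """Resample mono PCM_S16LE by linear interpolation (no anti-aliasing filter)."""
    if src_rate == dst_rate or not pcm:
        return pcm
    samples = array("h")
    samples.frombytes(pcm)
    if sys.byteorder == "big":
        samples.byteswap()                    # input is little-endian on the wire
    n_out = len(samples) * dst_rate // src_rate
    step = src_rate / dst_rate                # source samples per output sample
    last = len(samples) - 1
    out = array("h")
    for i in range(n_out):
        pos = i * step
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, last)]
        out.append(int(round(a + (b - a) * frac)))
    if sys.byteorder == "big":
        out.byteswap()
    return out.tobytes()
```

One second of 24kHz audio (48000 bytes) comes out as 16000 samples (32000 bytes).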
Build & Run
Gateway
cd stentor-gateway
pip install -e .
stentor
Or with more control:
python -m stentor
# or: uvicorn stentor.main:app --host 0.0.0.0 --port 8600
See docs/stentor.md for full usage documentation.
ESP32 Firmware
cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig # Set WiFi creds, gateway URL
idf.py build flash monitor
Phase 2 (Future)
These are explicitly out of scope for v0.1 but documented for planning:
- Streaming STT — Use Speaches SSE streaming for real-time partial transcripts
- Streaming TTS — Stream TTS audio chunks as they're generated (reduce time-to-first-audio)
- Barge-in / Interruption — Allow user to interrupt AI playback with new speech
- Multi-turn context — Maintain conversation history across turns within a session
- PJSUA2 SIP bridge — Enable calling Stentor from a phone via SIP
- Multiple agents — Route to different FastAgent instances based on wake word or command
- Local fallback — ESP32 handles simple commands locally when gateway is unreachable
- OTA firmware updates — Update stentor-ear firmware over WiFi
- LCD UI — Show transcript and response text on the 1.85" display
Design Decisions & Rationale
| Decision | Rationale |
|---|---|
| WebSocket over SIP/RTP | No SIP infrastructure needed; ESP-IDF has native WS support; Speaches speaks HTTP not SIP |
| Base64 audio in JSON | Simplicity for v0.1; binary WS frames are a Phase 2 optimization |
| Wake word on ESP32 | Zero-latency wake detection; no audio streaming when idle; privacy-preserving |
| Gateway as separate service | Decouples hardware from AI; can swap agents, add SIP later, scale independently |
| PCM_S16LE format | Universal, no codec overhead, both ESP32 and Speaches handle it natively |
| Pydantic Settings for config | Type-safe, env-var-driven, zero config files to manage |
Notes for Cline
- Start with the gateway. The ESP32 firmware is more complex and needs the gateway to test against.
- Use httpx.AsyncClient for all HTTP calls; it supports streaming responses, which you'll need for TTS.
- The WebSocket protocol is intentionally simple. JSON messages with base64 audio. Optimize to binary frames later.
- Test with a Python WebSocket client first before touching ESP32 hardware. Write a test_client.py that sends a WAV file over the WebSocket and plays back the response.
- FastAgent HTTP transport needs investigation; check the fast-agent repo for actual endpoint signatures.
- Audio resampling: Speaches TTS may output 24kHz; ESP32 may expect 16kHz. Handle this in audio.py.
- Don't overthink error handling in v0.1: log errors, send error messages to the client, keep the pipeline simple.
Στέντωρ (Stentor)
"Stentor, whose voice was as powerful as fifty voices of other men." — Homer, Iliad, Book V
Voice gateway for AI agents. Bridges physical audio hardware to AI agents via speech services.
Architecture
ESP32-S3-AUDIO-Board ◀──WebSocket──▶ Stentor Gateway ◀──HTTP──▶ Speaches (STT/TTS)
   (stentor-ear)                        (Python)                FastAgent (AI)
Stentor Gateway orchestrates the voice pipeline:
- Receives audio from ESP32 via WebSocket
- Transcribes speech via Speaches STT (OpenAI-compatible API)
- Sends transcript to FastAgent for AI response
- Synthesizes speech via Speaches TTS
- Streams audio back to ESP32 for playback
Stentor Ear is the ESP32-S3 firmware:
- Local wake word detection ("Hi ESP" via ESP-SR)
- Audio capture from dual-mic array
- Audio playback via onboard speaker
- LED ring feedback for conversation state
Quick Start
Gateway
cd stentor-gateway
uv sync
uv run uvicorn stentor.main:app --host 0.0.0.0 --port 8600
ESP32 Firmware
cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig # Set WiFi creds, gateway URL
idf.py build flash monitor
Documentation
Status
🚧 Pre-alpha — Under active development.
License
MIT