feat: scaffold stentor-gateway with FastAPI voice pipeline
Initialize the stentor-gateway project with WebSocket-based voice pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
README.md
# Stentor — Voice Gateway Specification

> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V

**Stentor** is a voice gateway that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents (FastAgent) via speech services (Speaches), enabling real-time voice conversations with AI.

---

## Architecture Overview

```
┌──────────────────────────────────┐
│     ESP32-S3-AUDIO-Board         │
│     (stentor-ear)                │
│                                  │
│  Mic Array ──▶ Wake Word ──▶ WS  │
│  Speaker ◀── PCM playback ◀──    │
└──────────────┬───────────────────┘
               │ WebSocket (audio)
               ▼
┌──────────────────────────────────┐
│        Stentor Gateway           │
│        (Python service)          │
│                                  │
│  WS Server ──▶ STT ──▶ Agent     │
│  WS Server ◀── TTS ◀── Agent     │
└──────┬──────────┬──────────┬─────┘
       │          │          │
       ▼          ▼          ▼
   Speaches   FastAgent   Speaches
     STT       (HTTP)       TTS
```

---
## Components

### 1. Stentor Gateway (`stentor-gateway/`)

**Language:** Python 3.12+
**Framework:** FastAPI + WebSockets
**Deployment:** Agathos container (Caliban or dedicated)

The gateway is the central orchestrator. It:

- Accepts WebSocket connections from ESP32 clients
- Streams audio to Speaches STT for transcription
- Sends transcripts to FastAgent for AI response
- Sends AI response text to Speaches TTS for synthesis
- Streams synthesized audio back to the ESP32 client

#### Configuration

All configuration via environment variables (12-factor):

| Variable | Description | Default |
|----------|-------------|---------|
| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
| `STENTOR_PORT` | Gateway bind port | `8600` |
| `STENTOR_STT_URL` | Speaches STT endpoint (OpenAI-compatible) | `http://perseus.incus:8000` |
| `STENTOR_TTS_URL` | Speaches TTS endpoint (OpenAI-compatible) | `http://pan.incus:8000` |
| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |
#### WebSocket Protocol

Endpoint: `ws://{STENTOR_HOST}:{STENTOR_PORT}/ws/audio`

##### Client → Gateway Messages

```json
// Session start (sent once on connect)
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}

// Audio data (sent continuously while listening)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}

// Client signals end of speech
{
  "type": "audio.end"
}

// Client requests session termination
{
  "type": "session.end"
}
```
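One way these messages could be modeled in `models.py` is sketched below; only the wire fields come from the protocol above, while the class names and the union wiring are illustrative assumptions.

```python
# Hypothetical models.py sketch: one Pydantic model per message "type".
from typing import Literal, Union

from pydantic import BaseModel


class AudioConfig(BaseModel):
    sample_rate: int = 16000
    channels: int = 1
    sample_width: int = 16
    encoding: str = "pcm_s16le"


class SessionStart(BaseModel):
    type: Literal["session.start"] = "session.start"
    client_id: str
    audio_config: AudioConfig


class AudioChunk(BaseModel):
    type: Literal["audio.chunk"] = "audio.chunk"
    data: str  # base64-encoded PCM


class AudioEnd(BaseModel):
    type: Literal["audio.end"] = "audio.end"


class SessionEnd(BaseModel):
    type: Literal["session.end"] = "session.end"


ClientMessage = Union[SessionStart, AudioChunk, AudioEnd, SessionEnd]
```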
##### Gateway → Client Messages

```json
// Acknowledge session
{
  "type": "session.ready",
  "session_id": "uuid"
}

// Gateway is processing (ESP32 can show "thinking" LED)
{
  "type": "status",
  "state": "listening" | "transcribing" | "thinking" | "speaking"
}

// Transcript of what user said (for debugging/display)
{
  "type": "transcript",
  "text": "What is the weather like today?"
}

// AI response text (for debugging/display)
{
  "type": "response.text",
  "text": "I don't have weather tools yet, but..."
}

// Audio response (streamed in chunks)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}

// Audio response complete
{
  "type": "audio.end"
}

// Error
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
```
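The base64 framing used by `audio.chunk` in both directions is simple enough to sketch with the standard library; the chunk size below is arbitrary, not part of the protocol.

```python
# Frame raw PCM into audio.chunk messages and reassemble them. This is the
# same framing a test client (or the ESP32) would perform.
import base64
import json


def frame_audio(pcm: bytes, chunk_bytes: int = 1024):
    """Yield audio.chunk JSON messages for a raw PCM buffer."""
    for i in range(0, len(pcm), chunk_bytes):
        yield json.dumps({
            "type": "audio.chunk",
            "data": base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii"),
        })


def deframe(messages) -> bytes:
    """Reassemble raw PCM from a sequence of audio.chunk messages."""
    pcm = bytearray()
    for m in messages:
        msg = json.loads(m)
        if msg["type"] == "audio.chunk":
            pcm.extend(base64.b64decode(msg["data"]))
    return bytes(pcm)
```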
#### Processing Pipeline

```
1.  ESP32 detects wake word locally (ESP-SR)
2.  ESP32 opens WebSocket, sends session.start
3.  Gateway sends session.ready
4.  ESP32 streams audio.chunk messages
5.  Gateway buffers audio (with Voice Activity Detection)
6.  ESP32 sends audio.end (or VAD detects silence)
7.  Gateway sends status: "transcribing"
8.  Gateway calls Speaches STT:
      POST {STENTOR_STT_URL}/v1/audio/transcriptions
      - model: {STENTOR_STT_MODEL}
      - file: collected audio as WAV
      - response_format: json
9.  Gateway sends transcript message to client
10. Gateway sends status: "thinking"
11. Gateway calls FastAgent:
      POST {STENTOR_AGENT_URL}/message
      - content: transcript text
12. Gateway sends response.text to client
13. Gateway sends status: "speaking"
14. Gateway calls Speaches TTS:
      POST {STENTOR_TTS_URL}/v1/audio/speech
      - model: {STENTOR_TTS_MODEL}
      - voice: {STENTOR_TTS_VOICE}
      - input: agent response text
      - response_format: pcm
      - speed: 1.0
15. Gateway streams audio.chunk messages to client
16. Gateway sends audio.end
17. Gateway returns to listening state (step 4),
    or the ESP32 sends session.end to disconnect
```
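A single turn of steps 7–16 could be orchestrated roughly as below; this is a sketch, and the `transcribe`/`respond`/`synthesize` method names on the three client objects are assumptions, not fixed interfaces.

```python
# One conversational turn of the pipeline. ws_send is an async callable that
# delivers a gateway → client message; stt, agent, and tts are assumed client
# objects with the illustrative method names shown.
async def run_turn(ws_send, audio_wav: bytes, stt, agent, tts) -> None:
    await ws_send({"type": "status", "state": "transcribing"})
    transcript = await stt.transcribe(audio_wav)           # step 8
    await ws_send({"type": "transcript", "text": transcript})

    await ws_send({"type": "status", "state": "thinking"})
    reply = await agent.respond(transcript)                # step 11
    await ws_send({"type": "response.text", "text": reply})

    await ws_send({"type": "status", "state": "speaking"})
    async for chunk in tts.synthesize(reply):              # step 14, base64 chunks
        await ws_send({"type": "audio.chunk", "data": chunk})
    await ws_send({"type": "audio.end"})
```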
#### Health & Observability

- `GET /health` — Returns 200 if gateway is running
- `GET /health/ready` — Returns 200 if STT, TTS, and Agent endpoints are reachable
- `GET /metrics` — Prometheus-compatible metrics
  - `stentor_sessions_active` — Current active WebSocket sessions
  - `stentor_transcriptions_total` — Total STT calls
  - `stentor_tts_requests_total` — Total TTS calls
  - `stentor_agent_requests_total` — Total agent calls
  - `stentor_pipeline_duration_seconds` — Histogram of full pipeline latency
  - `stentor_stt_duration_seconds` — Histogram of STT latency
  - `stentor_tts_duration_seconds` — Histogram of TTS latency
  - `stentor_agent_duration_seconds` — Histogram of agent response latency
#### Dependencies (Python)

```toml
[project]
name = "stentor-gateway"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
    "websockets>=14.0",
    "httpx>=0.28",
    "pydantic>=2.10",
    "pydantic-settings>=2.7",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.25",
    "ruff>=0.9",
]
```
#### Project Structure

```
stentor-gateway/
├── pyproject.toml
├── src/
│   └── stentor/
│       ├── __init__.py
│       ├── main.py            # FastAPI app, WebSocket endpoint
│       ├── config.py          # Pydantic Settings configuration
│       ├── models.py          # WebSocket message models
│       ├── pipeline.py        # Orchestration: STT → Agent → TTS
│       ├── stt_client.py      # Speaches STT client (OpenAI-compatible)
│       ├── tts_client.py      # Speaches TTS client (OpenAI-compatible)
│       ├── agent_client.py    # FastAgent HTTP client
│       ├── audio.py           # Audio utilities (PCM↔WAV, resampling)
│       └── health.py          # Health check endpoints
├── tests/
│   ├── test_pipeline.py
│   ├── test_stt_client.py
│   ├── test_tts_client.py
│   └── test_websocket.py
└── Dockerfile
```

---
### 2. Stentor Ear (`stentor-ear/`)

**Platform:** ESP32-S3-AUDIO-Board (Waveshare)
**Framework:** ESP-IDF 5.x + ESP-ADF
**Language:** C

The ESP32 firmware handles local audio capture, wake word detection, and communication with the Stentor Gateway.

#### Hardware Capabilities (Waveshare ESP32-S3-AUDIO-Board)

| Component | Chip | Purpose |
|-----------|------|---------|
| Microphone array | 2× MEMS via ES7210 ADC | Audio capture, noise reduction |
| Speaker output | ES8311 DAC + Class-D amp | Audio playback |
| Display | 1.85" LCD (ST7789V2) | Status display (optional) |
| LEDs | WS2812B RGB ring (×2) | Visual status feedback |
| Storage | TF card slot | Local audio cache (optional) |
| Camera | OV2640 (2MP) | Not used for Stentor |
| RTC | PCF85063 | Timestamping (optional) |
| Connectivity | WiFi 802.11 b/g/n + BLE 5 | Network communication |
| Wake word | ESP-SR | Local "Hi ESP" detection |

Reference: the Waveshare wiki documents the ESP-SR wake word "hi esp" with command recognition support [2].
#### Firmware State Machine

```
              ┌──────────────────────┐
    ┌─────────│         IDLE         │◀──────────┐
    │         │   (LEDs: dim blue)   │           │
    │         └──────────┬───────────┘           │
    │                    │ Wake word             │
    │                    │ detected              │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
    │         │       CONNECT        │           │
    │         │ (LEDs: pulse white)  │           │
    │         └──────────┬───────────┘           │
    │                    │ WS connected          │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
Timeout       │        LISTEN        │           │ audio.end
/ Error       │    (LEDs: green)     │           │ received
    │         └──────────┬───────────┘           │
    │                    │ Silence /             │
    │                    │ button press          │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
    │         │       WAITING        │           │
    │         │ (LEDs: pulse yellow) │           │
    │         └──────────┬───────────┘           │
    │                    │ audio.chunk           │
    │                    │ from gateway          │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
    │         │       PLAYING        ├───────────┘
    │         │  (LEDs: pulse cyan)  │
    │         └──────────┬───────────┘
    │                    │
    └────────────────────┘
```
#### Audio Configuration

| Parameter | Value | Notes |
|-----------|-------|-------|
| Sample rate | 16000 Hz | Standard for speech recognition |
| Bit depth | 16-bit | Signed, little-endian (PCM_S16LE) |
| Channels | 1 (mono) | Downmixed from dual-mic array |
| Chunk size | 512 samples (32 ms) | Balance between latency and overhead |
| Buffer | 4096 samples (256 ms) | Ring buffer for network jitter |
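On the gateway side, this PCM format has to be wrapped in a WAV container before it can be posted to the STT endpoint; a possible `audio.py` helper using only the standard library's `wave` module:

```python
# Wrap raw PCM_S16LE bytes in a WAV container, and extract them back out.
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width_bytes: int = 2) -> bytes:
    """Wrap raw PCM in a WAV header (width is in bytes: 2 = 16-bit)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width_bytes)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def wav_to_pcm(wav_bytes: bytes) -> bytes:
    """Extract raw PCM frames from a WAV container."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.readframes(w.getnframes())
```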
#### WiFi & Network

- WiFi credentials via `menuconfig` or BLE provisioning
- mDNS discovery for gateway: `_stentor._tcp.local`
- Fallback to configured static gateway URL
- Auto-reconnect with exponential backoff

#### LED Feedback (WS2812B Ring)

| State | Pattern | Color |
|-------|---------|-------|
| Idle / standby | Dim steady | Blue |
| Wake word detected / connecting | Pulse | White |
| Listening (recording) | Steady | Green |
| Transcribing / thinking | Rotating pulse | Yellow |
| Speaking (playback) | Pulse with audio | Cyan |
| Error | Flash | Red |
#### Firmware Project Structure

```
stentor-ear/
├── CMakeLists.txt
├── sdkconfig.defaults
├── main/
│   ├── CMakeLists.txt
│   ├── main.c               # App entry, state machine
│   ├── audio_capture.c/.h   # I2S mic input via ES7210
│   ├── audio_playback.c/.h  # I2S speaker output via ES8311
│   ├── wake_word.c/.h       # ESP-SR integration
│   ├── ws_client.c/.h       # WebSocket client to gateway
│   ├── led_status.c/.h      # WS2812B LED patterns
│   ├── wifi_manager.c/.h    # WiFi connection management
│   └── config.h             # Compile-time configuration
├── components/
│   └── (ESP-ADF / ESP-SR as managed components)
└── partitions.csv
```

---
### 3. FastAgent Integration

The Stentor Gateway communicates with FastAgent via its HTTP transport mode.

#### Starting FastAgent for Stentor

```bash
# Expose Harper (or any agent) as an HTTP server
uv run fast-agent --transport http --port 8001
```

#### Agent Request/Response

The gateway sends a simple message and receives the agent's response. The exact API shape depends on FastAgent's HTTP transport implementation — expected to follow MCP-over-HTTP conventions:

```
POST /message
Content-Type: application/json

{
  "content": "What can you tell me about the Agathos infrastructure?"
}

Response:
{
  "content": "Agathos is a set of Incus containers on a 10.10.0.0/24 network..."
}
```

> **Note for Cline:** The FastAgent HTTP transport API may need investigation. Check the fast-agent source for the actual endpoint shape. The gateway's `agent_client.py` should be the thinnest possible adapter so it's easy to swap.
---

### 4. Speaches Integration

Speaches provides OpenAI API-compatible STT and TTS endpoints.

#### STT (Speech-to-Text)

```
POST {STENTOR_STT_URL}/v1/audio/transcriptions
Content-Type: multipart/form-data

- file: audio.wav (PCM wrapped in WAV header)
- model: "Systran/faster-whisper-small"
- language: "en" (optional, auto-detect if omitted)
- response_format: "json"

Response:
{
  "text": "What is the weather like today?"
}
```
#### TTS (Text-to-Speech)

```
POST {STENTOR_TTS_URL}/v1/audio/speech
Content-Type: application/json

{
  "model": "kokoro",
  "voice": "af_heart",
  "input": "I don't have weather tools yet, but I can help with other things.",
  "response_format": "pcm",
  "speed": 1.0
}

Response: Binary PCM audio stream (24kHz, mono, 16-bit)
```

> **Note:** Speaches TTS may output at 24kHz. The gateway's `audio.py` module should handle resampling to 16kHz if the ESP32 is configured for 16kHz playback. Alternatively, configure the ESP32 for 24kHz playback.
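If `audio.py` does resample, a dependency-free linear-interpolation pass is adequate for speech; the sketch below avoids the deprecated `audioop` module (a proper polyphase filter would give higher quality, but this keeps v0.1 simple).

```python
# Linear-interpolation resampler for mono PCM_S16LE (e.g. 24 kHz → 16 kHz).
# Input length must be an even number of bytes (whole 16-bit samples).
import struct


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    samples = struct.unpack("<%dh" % (len(pcm) // 2), pcm)
    if src_rate == dst_rate or not samples:
        return pcm
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(a + (b - a) * frac))  # stays within int16 range
    return struct.pack("<%dh" % n_out, *out)
```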
---

## Build & Run

### Gateway

```bash
cd stentor-gateway
pip install -e .
stentor
```

Or with more control:

```bash
python -m stentor
# or: uvicorn stentor.main:app --host 0.0.0.0 --port 8600
```

See [docs/stentor.md](docs/stentor.md) for full usage documentation.

### ESP32 Firmware

```bash
cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig   # Set WiFi creds, gateway URL
idf.py build flash monitor
```

---
## Phase 2 (Future)

These are explicitly **out of scope** for v0.1 but documented for planning:

- [ ] **Streaming STT** — Use Speaches SSE streaming for real-time partial transcripts
- [ ] **Streaming TTS** — Stream TTS audio chunks as they're generated (reduce time-to-first-audio)
- [ ] **Barge-in / Interruption** — Allow user to interrupt AI playback with new speech
- [ ] **Multi-turn context** — Maintain conversation history across turns within a session
- [ ] **PJSUA2 SIP bridge** — Enable calling Stentor from a phone via SIP
- [ ] **Multiple agents** — Route to different FastAgent instances based on wake word or command
- [ ] **Local fallback** — ESP32 handles simple commands locally when gateway is unreachable
- [ ] **OTA firmware updates** — Update stentor-ear firmware over WiFi
- [ ] **LCD UI** — Show transcript and response text on the 1.85" display

---

## Design Decisions & Rationale

| Decision | Rationale |
|----------|-----------|
| WebSocket over SIP/RTP | No SIP infrastructure needed; ESP-IDF has native WS support; Speaches speaks HTTP, not SIP |
| Base64 audio in JSON | Simplicity for v0.1; binary WS frames are a Phase 2 optimization |
| Wake word on ESP32 | Zero-latency wake detection; no audio streaming when idle; privacy-preserving |
| Gateway as separate service | Decouples hardware from AI; can swap agents, add SIP later, scale independently |
| PCM_S16LE format | Universal, no codec overhead; both ESP32 and Speaches handle it natively |
| Pydantic Settings for config | Type-safe, env-var-driven, zero config files to manage |

---
## Notes for Cline

1. **Start with the gateway.** The ESP32 firmware is more complex and needs the gateway to test against.
2. **Use `httpx.AsyncClient`** for all HTTP calls — it supports streaming responses, which you'll need for TTS.
3. **The WebSocket protocol is intentionally simple.** JSON messages with base64 audio. Optimize to binary frames later.
4. **Test with a Python WebSocket client first** before touching ESP32 hardware. Write a `test_client.py` that sends a WAV file over the WebSocket and plays back the response.
5. **FastAgent HTTP transport** needs investigation — check the fast-agent repo for actual endpoint signatures.
6. **Audio resampling** — Speaches TTS may output 24kHz; the ESP32 may expect 16kHz. Handle this in `audio.py`.
7. **Don't overthink error handling in v0.1** — log errors, send error messages to the client, keep the pipeline simple.
SPECEOF
echo 'SPEC.md written'"}}
</tool_call>

<tool_result>
SPEC.md written
</tool_result>
Now let me push it to the repo:

<tool_call>
{"name": "execute_command", "arguments": {"command": "cd /tmp/stentor && git init && git remote add origin https://gitea.service.consul/robert/stentor.git && git fetch origin && git checkout main && cp SPEC.md . && git add SPEC.md && git commit -m 'Add Stentor voice gateway specification' && git push origin main"}}
</tool_call>

<tool_result>
hint: Using 'master' as the name for the initial branch. This name is subject to change. To
hint: configure the initial branch name to use in all of your new repositories, which will
hint: suppress this warning, call:
hint:
hint: git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 'development'. The
hint: just-created branch can be renamed via this command:
hint:
hint: git branch -m <name>
From https://gitea.service.consul/robert/stentor
 * [new branch]      main       -> origin/main
branch 'main' set up to track 'origin/main'.
Already on 'main'
[main 2e4e0b5] Add Stentor voice gateway specification
 1 file changed, 362 insertions(+)
 create mode 100644 SPEC.md
To https://gitea.service.consul/robert/stentor.git
   7e5e7b2..2e4e0b5  main -> main
</tool_result>
Now let me also create a basic README and scaffold the directory structure:

<tool_call>
{"name": "execute_command", "arguments": {"command": "cd /tmp/stentor && cat > README.md << 'EOF'
# Στέντωρ (Stentor)

> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V

Voice gateway for AI agents. Bridges physical audio hardware to AI agents via speech services.

## Architecture

```
ESP32-S3-AUDIO-Board ◀──WebSocket──▶ Stentor Gateway ◀──HTTP──▶ Speaches (STT/TTS)
    (stentor-ear)                        (Python)                FastAgent (AI)
```

**Stentor Gateway** orchestrates the voice pipeline:

1. Receives audio from ESP32 via WebSocket
2. Transcribes speech via Speaches STT (OpenAI-compatible API)
3. Sends transcript to FastAgent for AI response
4. Synthesizes speech via Speaches TTS
5. Streams audio back to ESP32 for playback

**Stentor Ear** is the ESP32-S3 firmware:

- Local wake word detection ("Hi ESP" via ESP-SR)
- Audio capture from dual-mic array
- Audio playback via onboard speaker
- LED ring feedback for conversation state

## Quick Start

### Gateway

```bash
cd stentor-gateway
uv sync
uv run uvicorn stentor.main:app --host 0.0.0.0 --port 8600
```

### ESP32 Firmware

```bash
cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig   # Set WiFi creds, gateway URL
idf.py build flash monitor
```

## Documentation

- [Full Specification](SPEC.md)

## Status

🚧 **Pre-alpha** — Under active development.

## License

MIT