feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice pipeline
orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
2026-03-21 19:11:48 +00:00
parent 9ba9435883
commit 912593b796
27 changed files with 3985 additions and 138 deletions
--- a/docs/api-reference.md
+++ b/docs/api-reference.md
@@ -0,0 +1,315 @@
+# Stentor Gateway API Reference
+
+> Version 0.1.0
+
+## Endpoints
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/` | GET | Dashboard (Bootstrap UI) |
+| `/api/v1/realtime` | WebSocket | Real-time audio conversation |
+| `/api/v1/info` | GET | Gateway information and configuration |
+| `/api/live/` | GET | Liveness probe (Kubernetes) |
+| `/api/ready/` | GET | Readiness probe (Kubernetes) |
+| `/api/metrics` | GET | Prometheus-compatible metrics |
+| `/api/docs` | GET | Interactive API documentation (Swagger UI) |
+| `/api/openapi.json` | GET | OpenAPI schema |
+
+---
+
+## WebSocket: `/api/v1/realtime`
+
+Real-time voice conversation endpoint. Protocol inspired by the OpenAI Realtime API.
+
+### Connection
+
+```
+ws://{host}:{port}/api/v1/realtime
+```
+
+### Client Events
+
+#### `session.start`
+
+Initiates a new conversation session. Must be the first message sent on the connection.
+
+```json
+{
+  "type": "session.start",
+  "client_id": "esp32-kitchen",
+  "audio_config": {
+    "sample_rate": 16000,
+    "channels": 1,
+    "sample_width": 16,
+    "encoding": "pcm_s16le"
+  }
+}
+```
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `type` | string | ✔ | Must be `"session.start"` |
+| `client_id` | string | | Client identifier for tracking |
+| `audio_config` | object | | Audio format configuration |
+
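The `audio_config` values determine how many bytes one second of raw PCM occupies. A minimal sketch of that arithmetic, assuming `sample_width` is expressed in bits (as the example value `16` alongside `pcm_s16le` suggests). The helper names are hypothetical and not part of the gateway:

```python
def pcm_bytes_per_second(sample_rate: int, channels: int, sample_width_bits: int) -> int:
    """Raw PCM byte rate for the given audio_config values."""
    return sample_rate * channels * (sample_width_bits // 8)


def pcm_chunk_bytes(duration_ms: int, sample_rate: int = 16000,
                    channels: int = 1, sample_width_bits: int = 16) -> int:
    """Bytes of raw PCM that cover duration_ms of audio."""
    rate = pcm_bytes_per_second(sample_rate, channels, sample_width_bits)
    return rate * duration_ms // 1000
```

With the example configuration this gives 32,000 bytes per second, so a 20 ms chunk is 640 bytes before base64 encoding.
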
+#### `input_audio_buffer.append`
+
+Sends a chunk of audio data. Stream chunks continuously while the user is speaking.
+
+```json
+{
+  "type": "input_audio_buffer.append",
+  "audio": "<base64-encoded PCM audio>"
+}
+```
+
+| Field | Type | Required | Description |
+|-------|------|----------|-------------|
+| `type` | string | ✔ | Must be `"input_audio_buffer.append"` |
+| `audio` | string | ✔ | Base64-encoded PCM S16LE audio |
+
+#### `input_audio_buffer.commit`
+
+Signals end of speech. Triggers the STT → Agent → TTS pipeline.
+
+```json
+{
+  "type": "input_audio_buffer.commit"
+}
+```
+
+#### `session.close`
+
+Requests session termination. The WebSocket connection will close.
+
+```json
+{
+  "type": "session.close"
+}
+```
+
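Taken together, the client events above form one turn: `session.start`, a run of `input_audio_buffer.append` messages, then `input_audio_buffer.commit`. The sketch below only builds the JSON messages and does not open a socket. `build_turn_messages` and the 640-byte default chunk size are illustrative assumptions, not part of the gateway:

```python
import base64
import json


def build_turn_messages(pcm: bytes, client_id: str = "esp32-kitchen",
                        chunk_size: int = 640) -> list[str]:
    """Return the JSON messages for a session start plus one spoken turn."""
    messages = [json.dumps({
        "type": "session.start",
        "client_id": client_id,
        "audio_config": {
            "sample_rate": 16000,
            "channels": 1,
            "sample_width": 16,
            "encoding": "pcm_s16le",
        },
    })]
    # Split the raw PCM into chunks and base64-encode each one.
    for offset in range(0, len(pcm), chunk_size):
        chunk = pcm[offset:offset + chunk_size]
        messages.append(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    # Signal end of speech so the gateway runs the pipeline.
    messages.append(json.dumps({"type": "input_audio_buffer.commit"}))
    return messages
```

Each returned string would be sent as one WebSocket message, in order. A `session.close` message can follow once the client has no more turns.
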
+### Server Events
+
+#### `session.created`
+
+Acknowledges session creation.
+
+```json
+{
+  "type": "session.created",
+  "session_id": "550e8400-e29b-41d4-a716-446655440000"
+}
+```
+
+#### `status`
+
+Processing status update. Use it to drive LED feedback on the ESP32.
+
+```json
+{
+  "type": "status",
+  "state": "listening"
+}
+```
+
+| State | Description | Suggested LED |
+|-------|-------------|--------------|
+| `listening` | Ready for audio input | Green |
+| `transcribing` | Running STT | Yellow |
+| `thinking` | Waiting for agent response | Yellow |
+| `speaking` | Playing TTS audio | Cyan |
+
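On a client, the table above reduces to a lookup. A sketch, assuming the suggested colors are used as-is. `led_color_for` and the fallback value `"off"` are hypothetical:

```python
import json

# Suggested LED colors from the status table.
STATUS_LED = {
    "listening": "green",
    "transcribing": "yellow",
    "thinking": "yellow",
    "speaking": "cyan",
}


def led_color_for(raw_message: str, default: str = "off") -> str:
    """Map a `status` event to an LED color; any other event gets the default."""
    event = json.loads(raw_message)
    if event.get("type") != "status":
        return default
    return STATUS_LED.get(event.get("state"), default)
```
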
+#### `transcript.done`
+
+Transcript of what the user said.
+
+```json
+{
+  "type": "transcript.done",
+  "text": "What is the weather like today?"
+}
+```
+
+#### `response.text.done`
+
+AI agent's response text.
+
+```json
+{
+  "type": "response.text.done",
+  "text": "I don't have weather tools yet, but I can help with other things."
+}
+```
+
+#### `response.audio.delta`
+
+Streamed audio response chunk.
+
+```json
+{
+  "type": "response.audio.delta",
+  "delta": "<base64-encoded PCM audio>"
+}
+```
+
+#### `response.audio.done`
+
+Audio response streaming complete.
+
+```json
+{
+  "type": "response.audio.done"
+}
+```
+
+#### `response.done`
+
+Full response cycle complete. Gateway returns to listening state.
+
+```json
+{
+  "type": "response.done"
+}
+```
+
+#### `error`
+
+Error event.
+
+```json
+{
+  "type": "error",
+  "message": "STT service unavailable",
+  "code": "pipeline_error"
+}
+```
+
+| Code | Description |
+|------|-------------|
+| `invalid_json` | Client sent malformed JSON |
+| `validation_error` | Message failed schema validation |
+| `no_session` | Action requires an active session |
+| `empty_buffer` | Audio buffer was empty on commit |
+| `empty_transcript` | STT returned no speech |
+| `empty_response` | Agent returned empty response |
+| `pipeline_error` | Internal pipeline failure |
+| `unknown_event` | Unrecognized event type |
+| `internal_error` | Unexpected server error |
+
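A client can treat the server events from `transcript.done` through `response.done` as one response cycle. The sketch below folds a list of already-received JSON messages into a single result. `collect_response` and the result keys are illustrative. It assumes each `delta` decodes to raw PCM bytes and that an `error` event ends the current turn, which this reference does not state explicitly:

```python
import base64
import json


def collect_response(frames: list[str]) -> dict:
    """Fold the server events for one turn into transcript, reply text, and audio."""
    result = {"transcript": None, "text": None, "audio": b"",
              "error": None, "done": False}
    for frame in frames:
        event = json.loads(frame)
        kind = event.get("type")
        if kind == "transcript.done":
            result["transcript"] = event["text"]
        elif kind == "response.text.done":
            result["text"] = event["text"]
        elif kind == "response.audio.delta":
            # Each delta is base64; append the decoded bytes in arrival order.
            result["audio"] += base64.b64decode(event["delta"])
        elif kind == "error":
            result["error"] = event.get("code")
            break
        elif kind == "response.done":
            result["done"] = True
            break
        # Other events (status, session.created, response.audio.done)
        # carry nothing that needs accumulating here.
    return result
```
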
+---
+
+## REST: `/api/v1/info`
+
+Returns gateway information and current configuration.
+
+**Response:**
+
+```json
+{
+  "name": "stentor-gateway",
+  "version": "0.1.0",
+  "endpoints": {
+    "realtime": "/api/v1/realtime",
+    "live": "/api/live/",
+    "ready": "/api/ready/",
+    "metrics": "/api/metrics"
+  },
+  "config": {
+    "stt_url": "http://perseus.incus:8000",
+    "tts_url": "http://pan.incus:8000",
+    "agent_url": "http://localhost:8001",
+    "stt_model": "Systran/faster-whisper-small",
+    "tts_model": "kokoro",
+    "tts_voice": "af_heart",
+    "audio_sample_rate": 16000,
+    "audio_channels": 1,
+    "audio_sample_width": 16
+  }
+}
+```
+
+---
+
+## REST: `/api/live/`
+
+Kubernetes liveness probe.
+
+**Response (200):**
+
+```json
+{
+  "status": "ok"
+}
+```
+
+---
+
+## REST: `/api/ready/`
+
+Kubernetes readiness probe. Checks connectivity to STT, TTS, and Agent services.
+
+**Response (200 — all services reachable):**
+
+```json
+{
+  "status": "ready",
+  "checks": {
+    "stt": true,
+    "tts": true,
+    "agent": true
+  }
+}
+```
+
+**Response (503 — one or more services unavailable):**
+
+```json
+{
+  "status": "not_ready",
+  "checks": {
+    "stt": true,
+    "tts": false,
+    "agent": true
+  }
+}
+```
+
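The two responses differ only in whether every check passed. A sketch of that rule, not the gateway's actual handler. `readiness_response` is a hypothetical name:

```python
def readiness_response(checks: dict[str, bool]) -> tuple[int, dict]:
    """Return (HTTP status, JSON body) for a set of dependency checks."""
    ready = all(checks.values())
    body = {"status": "ready" if ready else "not_ready", "checks": checks}
    return (200 if ready else 503), body
```
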
+---
+
+## REST: `/api/metrics`
+
+Prometheus-compatible metrics in text exposition format.
+
+**Metrics exported:**
+
+| Metric | Type | Description |
+|--------|------|-------------|
+| `stentor_sessions_active` | Gauge | Current active WebSocket sessions |
+| `stentor_transcriptions_total` | Counter | Total STT transcription calls |
+| `stentor_tts_requests_total` | Counter | Total TTS synthesis calls |
+| `stentor_agent_requests_total` | Counter | Total agent message calls |
+| `stentor_pipeline_duration_seconds` | Histogram | Full pipeline latency |
+| `stentor_stt_duration_seconds` | Histogram | STT transcription latency |
+| `stentor_tts_duration_seconds` | Histogram | TTS synthesis latency |
+| `stentor_agent_duration_seconds` | Histogram | Agent response latency |
+
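For a quick check without a Prometheus server, the gauge and counter values can be read straight from the text exposition format. A rough sketch that handles only unlabelled samples and skips labelled lines such as histogram buckets. `parse_simple_metrics` is a hypothetical helper, not a full parser:

```python
def parse_simple_metrics(text: str) -> dict[str, float]:
    """Parse unlabelled samples from Prometheus text exposition format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled samples (e.g. buckets).
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples
```
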
+---
+
+## Configuration
+
+All configuration is supplied via environment variables (12-factor):
+
+| Variable | Description | Default |
+|----------|-------------|---------|
+| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
+| `STENTOR_PORT` | Gateway bind port | `8600` |
+| `STENTOR_STT_URL` | Speaches STT endpoint | `http://perseus.incus:8000` |
+| `STENTOR_TTS_URL` | Speaches TTS endpoint | `http://pan.incus:8000` |
+| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
+| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
+| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
+| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
+| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
+| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
+| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
+| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |
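
The table maps directly onto a settings object. The commit describes a Pydantic Settings class. The sketch below is a standard-library stand-in that reads the same variables with the same defaults, and `GatewaySettings` and `load_settings` are hypothetical names rather than the project's own:

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GatewaySettings:
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    log_level: str = "INFO"


def load_settings(env=None) -> GatewaySettings:
    """Build settings from STENTOR_* variables, using documented defaults otherwise."""
    env = os.environ if env is None else env
    defaults = GatewaySettings()
    values = {}
    for field in fields(GatewaySettings):
        raw = env.get(f"STENTOR_{field.name.upper()}")
        if raw is not None:
            # Coerce to the type of the default value (int or str).
            values[field.name] = type(getattr(defaults, field.name))(raw)
    return GatewaySettings(**values)
```

Passing a plain dict, as in `load_settings({"STENTOR_PORT": "9000"})`, overrides only the port and leaves every other field at its default.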