feat: scaffold stentor-gateway with FastAPI voice pipeline

Initialize the stentor-gateway project with WebSocket-based voice
pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
2026-03-21 19:11:48 +00:00
parent 9ba9435883
commit 912593b796
27 changed files with 3985 additions and 138 deletions

docs/api-reference.md Normal file

@@ -0,0 +1,315 @@
# Stentor Gateway API Reference
> Version 0.1.0
## Endpoints
| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Dashboard (Bootstrap UI) |
| `/api/v1/realtime` | WebSocket | Real-time audio conversation |
| `/api/v1/info` | GET | Gateway information and configuration |
| `/api/live/` | GET | Liveness probe (Kubernetes) |
| `/api/ready/` | GET | Readiness probe (Kubernetes) |
| `/api/metrics` | GET | Prometheus-compatible metrics |
| `/api/docs` | GET | Interactive API documentation (Swagger UI) |
| `/api/openapi.json` | GET | OpenAPI schema |
---
## WebSocket: `/api/v1/realtime`
Real-time voice conversation endpoint. Protocol inspired by the OpenAI Realtime API.
### Connection
```
ws://{host}:{port}/api/v1/realtime
```
### Client Events
#### `session.start`
Initiates a new conversation session. Must be sent first.
```json
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"session.start"` |
| `client_id` | string | | Client identifier for tracking |
| `audio_config` | object | | Audio format configuration |
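As a rough illustration of the fields above, a client could assemble the `session.start` payload as follows. The helper name `build_session_start` and its default arguments are hypothetical; only the field names and example values come from this reference.

```python
import json


def build_session_start(client_id=None, sample_rate=16000, channels=1, sample_width=16):
    """Return a session.start event as a JSON string (hypothetical helper)."""
    message = {"type": "session.start"}
    # client_id is optional according to the field table, so omit it when unset.
    if client_id is not None:
        message["client_id"] = client_id
    message["audio_config"] = {
        "sample_rate": sample_rate,
        "channels": channels,
        "sample_width": sample_width,  # bits per sample, as in the example above
        "encoding": "pcm_s16le",
    }
    return json.dumps(message)


if __name__ == "__main__":
    print(build_session_start("esp32-kitchen"))
```

The resulting string would be sent as the first text frame on the WebSocket.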
#### `input_audio_buffer.append`
Sends a chunk of audio data. Stream chunks continuously while the user is speaking.
```json
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM audio>"
}
```
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"input_audio_buffer.append"` |
| `audio` | string | ✔ | Base64-encoded PCM S16LE audio |
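A minimal sketch of how raw PCM could be split into `input_audio_buffer.append` events. The chunk size is an assumption: this reference does not prescribe one, and 3200 bytes is simply 100 ms of 16 kHz mono 16-bit audio. The function name `audio_append_events` is hypothetical.

```python
import base64
import json


def audio_append_events(pcm: bytes, chunk_bytes: int = 3200):
    """Yield input_audio_buffer.append events (JSON strings) for raw PCM S16LE bytes."""
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            # The audio field carries base64 text, not raw bytes.
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


if __name__ == "__main__":
    silence = b"\x00\x00" * 4000  # 8000 bytes of silence
    for event in audio_append_events(silence):
        print(len(event))
```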
#### `input_audio_buffer.commit`
Signals end of speech. Triggers the STT → Agent → TTS pipeline.
```json
{
  "type": "input_audio_buffer.commit"
}
```
#### `session.close`
Requests session termination. The WebSocket connection will close.
```json
{
  "type": "session.close"
}
```
### Server Events
#### `session.created`
Acknowledges session creation.
```json
{
  "type": "session.created",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```
#### `status`
Processing status update. Clients such as the ESP32 can use it to drive LED feedback.
```json
{
  "type": "status",
  "state": "listening"
}
```
| State | Description | Suggested LED |
|-------|-------------|--------------|
| `listening` | Ready for audio input | Green |
| `transcribing` | Running STT | Yellow |
| `thinking` | Waiting for agent response | Yellow |
| `speaking` | Playing TTS audio | Cyan |
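The state table above maps naturally onto a lookup on the client side. The sketch below assumes a hypothetical helper `led_color_for` and an `"off"` fallback for states the client does not recognize; neither is defined by the gateway.

```python
# Suggested LED colors, copied from the state table above.
LED_FOR_STATE = {
    "listening": "green",
    "transcribing": "yellow",
    "thinking": "yellow",
    "speaking": "cyan",
}


def led_color_for(event: dict, default: str = "off") -> str:
    """Return the suggested LED color for a status event, or the default."""
    if event.get("type") != "status":
        return default
    return LED_FOR_STATE.get(event.get("state"), default)


if __name__ == "__main__":
    print(led_color_for({"type": "status", "state": "listening"}))
```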
#### `transcript.done`
Transcript of what the user said.
```json
{
  "type": "transcript.done",
  "text": "What is the weather like today?"
}
```
#### `response.text.done`
AI agent's response text.
```json
{
  "type": "response.text.done",
  "text": "I don't have weather tools yet, but I can help with other things."
}
```
#### `response.audio.delta`
Streamed audio response chunk.
```json
{
  "type": "response.audio.delta",
  "delta": "<base64-encoded PCM audio>"
}
```
#### `response.audio.done`
Audio response streaming complete.
```json
{
  "type": "response.audio.done"
}
```
#### `response.done`
Full response cycle complete. Gateway returns to listening state.
```json
{
  "type": "response.done"
}
```
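To show how the server events above fit together for one turn, here is a sketch of a client-side reducer that consumes events until `response.done`. The function name `collect_turn` and the shape of the returned dictionary are assumptions made for illustration; a real client would read the events from the WebSocket rather than from a list.

```python
import base64
import json


def collect_turn(raw_events):
    """Fold one turn's server events (JSON strings) into a summary dict."""
    transcript = None
    text = None
    audio = bytearray()
    complete = False
    for raw in raw_events:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "transcript.done":
            transcript = event["text"]
        elif kind == "response.text.done":
            text = event["text"]
        elif kind == "response.audio.delta":
            # Each delta is a base64-encoded PCM chunk; append in arrival order.
            audio.extend(base64.b64decode(event["delta"]))
        elif kind == "response.done":
            complete = True
            break
        # status, response.audio.done, and unknown events are ignored here.
    return {"transcript": transcript, "text": text, "audio": bytes(audio), "complete": complete}
```

Stopping at `response.done` rather than `response.audio.done` follows the descriptions above, where `response.done` marks the end of the full response cycle.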
#### `error`
Error event.
```json
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
```
| Code | Description |
|------|-------------|
| `invalid_json` | Client sent malformed JSON |
| `validation_error` | Message failed schema validation |
| `no_session` | Action requires an active session |
| `empty_buffer` | Audio buffer was empty on commit |
| `empty_transcript` | STT returned no speech |
| `empty_response` | Agent returned empty response |
| `pipeline_error` | Internal pipeline failure |
| `unknown_event` | Unrecognized event type |
| `internal_error` | Unexpected server error |
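One possible way for a client to react to these codes is to group them by likely cause. The grouping below is an interpretation of the descriptions in the table, not something the gateway defines, and the category names are hypothetical. Note that the example event above uses `stt_error`, which is not listed in the table, so a fallback for unlisted codes seems prudent.

```python
# Hypothetical grouping of the documented error codes by likely cause.
CLIENT_ERRORS = {"invalid_json", "validation_error", "no_session", "unknown_event"}
EMPTY_TURN_ERRORS = {"empty_buffer", "empty_transcript", "empty_response"}
SERVER_ERRORS = {"pipeline_error", "internal_error"}


def classify_error(event: dict) -> str:
    """Map an error event to a coarse category the client can act on."""
    code = event.get("code")
    if code in CLIENT_ERRORS:
        return "client_bug"      # the client sent something wrong
    if code in EMPTY_TURN_ERRORS:
        return "empty_turn"      # nothing usable was captured or produced
    if code in SERVER_ERRORS:
        return "server_fault"    # the gateway or a backend failed
    return "unrecognized"        # code not listed in the table
```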
---
## REST: `/api/v1/info`
Returns gateway information and current configuration.
**Response:**
```json
{
  "name": "stentor-gateway",
  "version": "0.1.0",
  "endpoints": {
    "realtime": "/api/v1/realtime",
    "live": "/api/live/",
    "ready": "/api/ready/",
    "metrics": "/api/metrics"
  },
  "config": {
    "stt_url": "http://perseus.incus:8000",
    "tts_url": "http://pan.incus:8000",
    "agent_url": "http://localhost:8001",
    "stt_model": "Systran/faster-whisper-small",
    "tts_model": "kokoro",
    "tts_voice": "af_heart",
    "audio_sample_rate": 16000,
    "audio_channels": 1,
    "audio_sample_width": 16
  }
}
```
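A client can use the `config` block to size its audio buffers. The sketch below derives the raw PCM byte rate, assuming `audio_sample_width` is expressed in bits per sample (as the Configuration table describes it). The function name `pcm_bytes_per_second` is hypothetical.

```python
def pcm_bytes_per_second(config: dict) -> int:
    """Compute the uncompressed PCM byte rate from an /api/v1/info config block."""
    bits_per_second = (
        config["audio_sample_rate"]
        * config["audio_channels"]
        * config["audio_sample_width"]  # assumed to be bits per sample
    )
    return bits_per_second // 8


if __name__ == "__main__":
    example = {"audio_sample_rate": 16000, "audio_channels": 1, "audio_sample_width": 16}
    print(pcm_bytes_per_second(example))
```

With the example values (16000 Hz, 1 channel, 16 bits) this works out to 32000 bytes per second.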
---
## REST: `/api/live/`
Kubernetes liveness probe.
**Response (200):**
```json
{
  "status": "ok"
}
```
---
## REST: `/api/ready/`
Kubernetes readiness probe. Checks connectivity to STT, TTS, and Agent services.
**Response (200 — all services reachable):**
```json
{
  "status": "ready",
  "checks": {
    "stt": true,
    "tts": true,
    "agent": true
  }
}
```
**Response (503 — one or more services unavailable):**
```json
{
  "status": "not_ready",
  "checks": {
    "stt": true,
    "tts": false,
    "agent": true
  }
}
```
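For monitoring scripts, the `checks` object can be reduced to the list of services that failed. This is a small sketch with a hypothetical helper name, `unavailable_services`, operating on an already-decoded response body.

```python
def unavailable_services(body: dict) -> list:
    """Return the sorted names of services whose readiness check is false."""
    checks = body.get("checks", {})
    return sorted(name for name, ok in checks.items() if not ok)


if __name__ == "__main__":
    not_ready = {"status": "not_ready", "checks": {"stt": True, "tts": False, "agent": True}}
    print(unavailable_services(not_ready))
```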
---
## REST: `/api/metrics`
Prometheus-compatible metrics in text exposition format.
**Metrics exported:**
| Metric | Type | Description |
|--------|------|-------------|
| `stentor_sessions_active` | Gauge | Current active WebSocket sessions |
| `stentor_transcriptions_total` | Counter | Total STT transcription calls |
| `stentor_tts_requests_total` | Counter | Total TTS synthesis calls |
| `stentor_agent_requests_total` | Counter | Total agent message calls |
| `stentor_pipeline_duration_seconds` | Histogram | Full pipeline latency |
| `stentor_stt_duration_seconds` | Histogram | STT transcription latency |
| `stentor_tts_duration_seconds` | Histogram | TTS synthesis latency |
| `stentor_agent_duration_seconds` | Histogram | Agent response latency |
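The text exposition format can be read without a Prometheus client for quick checks. The parser below is a deliberately simplified sketch: it assumes samples carry no trailing timestamp and keeps any label set as part of the key. The sample payload is made up for illustration and is not captured gateway output.

```python
def parse_metrics(text: str) -> dict:
    """Parse Prometheus text exposition lines into {sample_name: float_value}."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: no timestamp, so the value is the last space-separated token.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


# Hypothetical payload using metric names from the table above.
SAMPLE = """\
# HELP stentor_sessions_active Current active WebSocket sessions
# TYPE stentor_sessions_active gauge
stentor_sessions_active 2
# TYPE stentor_transcriptions_total counter
stentor_transcriptions_total 17
"""

if __name__ == "__main__":
    print(parse_metrics(SAMPLE))
```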
---
## Configuration
All configuration is supplied via environment variables (12-factor):
| Variable | Description | Default |
|----------|-------------|---------|
| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
| `STENTOR_PORT` | Gateway bind port | `8600` |
| `STENTOR_STT_URL` | Speaches STT endpoint | `http://perseus.incus:8000` |
| `STENTOR_TTS_URL` | Speaches TTS endpoint | `http://pan.incus:8000` |
| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |
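The commit message says the gateway uses Pydantic Settings for this; the standard-library sketch below only illustrates the same idea (a `STENTOR_` prefix plus the defaults from the table) and is not the gateway's actual settings class. The names `GatewaySettings` and `load_settings` are hypothetical.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GatewaySettings:
    """Defaults copied from the configuration table above."""
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    log_level: str = "INFO"


def load_settings(env=None) -> GatewaySettings:
    """Build settings from STENTOR_* variables, falling back to the defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(GatewaySettings):
        key = "STENTOR_" + field.name.upper()
        if key in env:
            # Coerce the string from the environment to the default's type.
            overrides[field.name] = type(field.default)(env[key])
    return GatewaySettings(**overrides)
```

Passing a plain dictionary instead of `os.environ` keeps the function easy to test, for example `load_settings({"STENTOR_PORT": "9000"})`.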