# Stentor Gateway API Reference

> Version 0.1.0

## Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/` | GET | Dashboard (Bootstrap UI) |
| `/api/v1/realtime` | WebSocket | Real-time audio conversation |
| `/api/v1/info` | GET | Gateway information and configuration |
| `/api/live/` | GET | Liveness probe (Kubernetes) |
| `/api/ready/` | GET | Readiness probe (Kubernetes) |
| `/api/metrics` | GET | Prometheus-compatible metrics |
| `/api/docs` | GET | Interactive API documentation (Swagger UI) |
| `/api/openapi.json` | GET | OpenAPI schema |

---

## WebSocket: `/api/v1/realtime`

Real-time voice conversation endpoint. Protocol inspired by the OpenAI Realtime API.

### Connection

```
ws://{host}:{port}/api/v1/realtime
```

### Client Events

#### `session.start`

Initiates a new conversation session. Must be sent first.

```json
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"session.start"` |
| `client_id` | string | | Client identifier for tracking |
| `audio_config` | object | | Audio format configuration |

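
As an illustration of the message shape above, here is a minimal Python sketch that builds a `session.start` event. The helper name `build_session_start` is hypothetical and not part of the gateway. The default values simply mirror the example payload, and `client_id` is omitted when not supplied because the table marks it as optional.

```python
import json


def build_session_start(client_id=None, sample_rate=16000, channels=1, sample_width=16):
    """Return a session.start event serialized as a JSON string.

    Hypothetical client-side helper; defaults mirror the documented example.
    """
    event = {"type": "session.start"}
    if client_id is not None:
        # client_id is optional according to the field table
        event["client_id"] = client_id
    event["audio_config"] = {
        "sample_rate": sample_rate,
        "channels": channels,
        "sample_width": sample_width,  # bits per sample
        "encoding": "pcm_s16le",
    }
    return json.dumps(event)


if __name__ == "__main__":
    print(build_session_start("esp32-kitchen"))
```

The resulting string would be sent as the first text frame after the WebSocket connection opens.
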
#### `input_audio_buffer.append`

Sends a chunk of audio data. Stream chunks continuously while the user is speaking.

```json
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM audio>"
}
```

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `type` | string | ✔ | Must be `"input_audio_buffer.append"` |
| `audio` | string | ✔ | Base64-encoded PCM S16LE audio |

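
The streaming step can be sketched as splitting captured PCM into chunks and wrapping each one in an append event. The chunk size is an assumption, since this reference does not specify one: 3200 bytes corresponds to 100 ms of 16 kHz, mono, 16-bit audio (16000 samples/s × 2 bytes × 0.1 s). The function name `append_events` is hypothetical.

```python
import base64
import json


def append_events(pcm: bytes, chunk_bytes: int = 3200):
    """Yield input_audio_buffer.append events (JSON strings) for raw PCM bytes.

    chunk_bytes is an assumed value: 3200 bytes = 100 ms at 16 kHz mono S16LE.
    """
    for offset in range(0, len(pcm), chunk_bytes):
        chunk = pcm[offset:offset + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })


if __name__ == "__main__":
    silence = bytes(8000)  # 250 ms of silence at the default audio config
    for event in append_events(silence):
        print(len(event))
```
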
#### `input_audio_buffer.commit`

Signals end of speech. Triggers the STT → Agent → TTS pipeline.

```json
{
  "type": "input_audio_buffer.commit"
}
```

#### `session.close`

Requests session termination. The WebSocket connection will close.

```json
{
  "type": "session.close"
}
```

### Server Events

#### `session.created`

Acknowledges session creation.

```json
{
  "type": "session.created",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
```

#### `status`

Processing status update. Clients such as the ESP32 can use it to drive LED feedback.

```json
{
  "type": "status",
  "state": "listening"
}
```

| State | Description | Suggested LED |
|-------|-------------|---------------|
| `listening` | Ready for audio input | Green |
| `transcribing` | Running STT | Yellow |
| `thinking` | Waiting for agent response | Yellow |
| `speaking` | Playing TTS audio | Cyan |

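
The state table maps naturally onto a lookup. The sketch below is written in Python for readability (firmware on an ESP32 would more likely use C or MicroPython). The name `led_for_status` and the `"off"` fallback for unrecognized states are assumptions, not part of the protocol.

```python
import json

# Suggested colors taken from the state table above
LED_COLORS = {
    "listening": "green",
    "transcribing": "yellow",
    "thinking": "yellow",
    "speaking": "cyan",
}


def led_for_status(raw_event: str, default: str = "off"):
    """Return the suggested LED color for a status event, or None for other events.

    Unknown states fall back to `default` (an assumed behavior).
    """
    event = json.loads(raw_event)
    if event.get("type") != "status":
        return None
    return LED_COLORS.get(event.get("state"), default)


if __name__ == "__main__":
    print(led_for_status('{"type": "status", "state": "listening"}'))
```
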
#### `transcript.done`

Transcript of what the user said.

```json
{
  "type": "transcript.done",
  "text": "What is the weather like today?"
}
```

#### `response.text.done`

AI agent's response text.

```json
{
  "type": "response.text.done",
  "text": "I don't have weather tools yet, but I can help with other things."
}
```

#### `response.audio.delta`

Streamed audio response chunk.

```json
{
  "type": "response.audio.delta",
  "delta": "<base64-encoded PCM audio>"
}
```

#### `response.audio.done`

Audio response streaming complete.

```json
{
  "type": "response.audio.done"
}
```

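
On the client side, the `response.audio.delta` chunks are typically decoded and either played as they arrive or collected until `response.audio.done`. A minimal sketch of the collecting variant follows, assuming events arrive as JSON text frames. The helper `collect_response_audio` is hypothetical.

```python
import base64
import json


def collect_response_audio(raw_events):
    """Concatenate decoded response.audio.delta chunks until response.audio.done.

    raw_events: iterable of JSON strings in the order they were received.
    Events of other types are ignored.
    """
    pcm = bytearray()
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("type") == "response.audio.delta":
            pcm.extend(base64.b64decode(event["delta"]))
        elif event.get("type") == "response.audio.done":
            break
    return bytes(pcm)


if __name__ == "__main__":
    frames = [
        json.dumps({"type": "response.audio.delta",
                    "delta": base64.b64encode(b"\x01\x02").decode("ascii")}),
        json.dumps({"type": "response.audio.done"}),
    ]
    print(collect_response_audio(frames))
```

A device with little memory would more likely write each decoded chunk straight to its audio output instead of buffering the whole response.
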
#### `response.done`

Full response cycle complete. Gateway returns to listening state.

```json
{
  "type": "response.done"
}
```

#### `error`

Error event.

```json
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
```

| Code | Description |
|------|-------------|
| `invalid_json` | Client sent malformed JSON |
| `validation_error` | Message failed schema validation |
| `no_session` | Action requires an active session |
| `empty_buffer` | Audio buffer was empty on commit |
| `empty_transcript` | STT returned no speech |
| `empty_response` | Agent returned empty response |
| `pipeline_error` | Internal pipeline failure |
| `unknown_event` | Unrecognized event type |
| `internal_error` | Unexpected server error |

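
One way a client might surface these events is to convert them into exceptions. The sketch below is illustrative only: `GatewayError` and `raise_for_error` are hypothetical names, and the code is kept as an opaque string rather than validated against the table, since the example payload uses `stt_error`, which the table does not list.

```python
import json


class GatewayError(Exception):
    """Hypothetical client-side exception carrying the gateway's error code."""

    def __init__(self, code, message):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message


def raise_for_error(raw_event: str) -> dict:
    """Parse an event; raise GatewayError if it is an error event, else return it."""
    event = json.loads(raw_event)
    if event.get("type") == "error":
        raise GatewayError(event.get("code"), event.get("message", ""))
    return event


if __name__ == "__main__":
    try:
        raise_for_error('{"type": "error", "message": "Audio buffer was empty", "code": "empty_buffer"}')
    except GatewayError as exc:
        print(exc.code)
```
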
---

## REST: `/api/v1/info`

Returns gateway information and current configuration.

**Response:**

```json
{
  "name": "stentor-gateway",
  "version": "0.1.0",
  "endpoints": {
    "realtime": "/api/v1/realtime",
    "live": "/api/live/",
    "ready": "/api/ready/",
    "metrics": "/api/metrics"
  },
  "config": {
    "stt_url": "http://perseus.incus:8000",
    "tts_url": "http://pan.incus:8000",
    "agent_url": "http://localhost:8001",
    "stt_model": "Systran/faster-whisper-small",
    "tts_model": "kokoro",
    "tts_voice": "af_heart",
    "audio_sample_rate": 16000,
    "audio_channels": 1,
    "audio_sample_width": 16
  }
}
```

---

## REST: `/api/live/`

Kubernetes liveness probe.

**Response (200):**

```json
{
  "status": "ok"
}
```

---

## REST: `/api/ready/`

Kubernetes readiness probe. Checks connectivity to STT, TTS, and Agent services.

**Response (200, all services reachable):**

```json
{
  "status": "ready",
  "checks": {
    "stt": true,
    "tts": true,
    "agent": true
  }
}
```

**Response (503, one or more services unavailable):**

```json
{
  "status": "not_ready",
  "checks": {
    "stt": true,
    "tts": false,
    "agent": true
  }
}
```

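
When debugging a failing readiness probe, it can help to extract which backend is down from the response body. A small sketch, assuming the body has the shape shown above. The function name `unavailable_services` is hypothetical.

```python
import json


def unavailable_services(body: str) -> list[str]:
    """Return the sorted names of services whose readiness check is false."""
    payload = json.loads(body)
    checks = payload.get("checks", {})
    return sorted(name for name, ok in checks.items() if not ok)


if __name__ == "__main__":
    body = '{"status": "not_ready", "checks": {"stt": true, "tts": false, "agent": true}}'
    print(unavailable_services(body))
```
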
---

## REST: `/api/metrics`

Prometheus-compatible metrics in text exposition format.

**Metrics exported:**

| Metric | Type | Description |
|--------|------|-------------|
| `stentor_sessions_active` | Gauge | Current active WebSocket sessions |
| `stentor_transcriptions_total` | Counter | Total STT transcription calls |
| `stentor_tts_requests_total` | Counter | Total TTS synthesis calls |
| `stentor_agent_requests_total` | Counter | Total agent message calls |
| `stentor_pipeline_duration_seconds` | Histogram | Full pipeline latency |
| `stentor_stt_duration_seconds` | Histogram | STT transcription latency |
| `stentor_tts_duration_seconds` | Histogram | TTS synthesis latency |
| `stentor_agent_duration_seconds` | Histogram | Agent response latency |

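
For quick checks without a Prometheus server, the text exposition format can be read with a few lines of Python. This is a deliberately simplified sketch: it assumes samples carry no trailing timestamp and keeps any label set as part of the sample name. The helper `parse_metrics` and the sample text in the usage block are hypothetical, not captured gateway output.

```python
def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text exposition lines into {sample_name: value}.

    Simplification: assumes no timestamps; labels stay part of the key.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name, value = line.rsplit(None, 1)
        samples[name] = float(value)
    return samples


if __name__ == "__main__":
    # Hypothetical sample text for illustration only
    sample = (
        "# TYPE stentor_sessions_active gauge\n"
        "stentor_sessions_active 2.0\n"
    )
    print(parse_metrics(sample))
```
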
---

## Configuration

All configuration is supplied through environment variables (12-factor):

| Variable | Description | Default |
|----------|-------------|---------|
| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
| `STENTOR_PORT` | Gateway bind port | `8600` |
| `STENTOR_STT_URL` | Speaches STT endpoint | `http://perseus.incus:8000` |
| `STENTOR_TTS_URL` | Speaches TTS endpoint | `http://pan.incus:8000` |
| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |

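
To illustrate how the `STENTOR_` prefix and the defaults above fit together, here is a standard-library sketch of loading these settings. It is not the gateway's actual settings class: `GatewaySettings` and `load_settings` are hypothetical names, and a real implementation may validate values more strictly.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GatewaySettings:
    """Defaults copied from the configuration table."""
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    log_level: str = "INFO"


def load_settings(env=None) -> GatewaySettings:
    """Build settings from STENTOR_* variables, falling back to the defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for field in fields(GatewaySettings):
        raw = env.get(f"STENTOR_{field.name.upper()}")
        if raw is not None:
            # Cast using the type of the default value (str or int)
            overrides[field.name] = type(field.default)(raw)
    return GatewaySettings(**overrides)


if __name__ == "__main__":
    print(load_settings({"STENTOR_PORT": "9000"}))
```
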