Stentor Gateway API Reference
Version 0.1.0
Endpoints
| Endpoint | Method | Description |
|---|---|---|
| `/` | GET | Dashboard (Bootstrap UI) |
| `/api/v1/realtime` | WebSocket | Real-time audio conversation |
| `/api/v1/info` | GET | Gateway information and configuration |
| `/api/live/` | GET | Liveness probe (Kubernetes) |
| `/api/ready/` | GET | Readiness probe (Kubernetes) |
| `/api/metrics` | GET | Prometheus-compatible metrics |
| `/api/docs` | GET | Interactive API documentation (Swagger UI) |
| `/api/openapi.json` | GET | OpenAPI schema |
WebSocket: /api/v1/realtime
Real-time voice conversation endpoint. Protocol inspired by the OpenAI Realtime API.
Connection
ws://{host}:{port}/api/v1/realtime
Client Events
session.start
Initiates a new conversation session. Must be the first event sent after connecting.
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | ✔ | Must be "session.start" |
| `client_id` | string | | Client identifier for tracking |
| `audio_config` | object | | Audio format configuration |
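As a minimal sketch of how a client might assemble this event, the helper below builds the JSON payload from the fields in the table. The function name `build_session_start` is hypothetical, and the default values simply mirror the example above; it omits `client_id` when none is given, since that field is not marked required.

```python
import json


def build_session_start(client_id=None, sample_rate=16000, channels=1, sample_width=16):
    """Return a session.start event as a JSON string (hypothetical helper)."""
    event = {
        "type": "session.start",
        "audio_config": {
            "sample_rate": sample_rate,
            "channels": channels,
            "sample_width": sample_width,  # bits per sample, as in the example above
            "encoding": "pcm_s16le",
        },
    }
    # client_id is optional, so only include it when the caller provides one.
    if client_id is not None:
        event["client_id"] = client_id
    return json.dumps(event)
```

The returned string can be sent as the first text frame on the WebSocket with whichever client library is in use.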
input_audio_buffer.append
Sends a chunk of audio data. Stream chunks continuously while the user is speaking.
{
  "type": "input_audio_buffer.append",
  "audio": "<base64-encoded PCM audio>"
}
| Field | Type | Required | Description |
|---|---|---|---|
| `type` | string | ✔ | Must be "input_audio_buffer.append" |
| `audio` | string | ✔ | Base64-encoded PCM S16LE audio |
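One way to turn a captured PCM buffer into a stream of these events is sketched below. The helper name `iter_append_events` and the 3200-byte chunk size (100 ms of 16 kHz mono S16LE audio) are assumptions for illustration; the gateway does not document a required chunk size.

```python
import base64
import json


def iter_append_events(pcm: bytes, chunk_bytes: int = 3200):
    """Yield input_audio_buffer.append events for consecutive PCM chunks.

    3200 bytes is 100 ms at 16 kHz, mono, 16-bit (an assumed chunk size).
    """
    if chunk_bytes <= 0 or chunk_bytes % 2:
        raise ValueError("chunk_bytes must be a positive multiple of 2 (16-bit samples)")
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        yield json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        })
```

Keeping the chunk size a multiple of the sample width avoids splitting a 16-bit sample across two events.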
input_audio_buffer.commit
Signals end of speech. Triggers the STT → Agent → TTS pipeline.
{
  "type": "input_audio_buffer.commit"
}
session.close
Requests session termination. The WebSocket connection will close.
{
  "type": "session.close"
}
Server Events
session.created
Acknowledges session creation.
{
  "type": "session.created",
  "session_id": "550e8400-e29b-41d4-a716-446655440000"
}
status
Processing status update. Useful for driving LED feedback on an ESP32 client.
{
  "type": "status",
  "state": "listening"
}
| State | Description | Suggested LED |
|---|---|---|
| `listening` | Ready for audio input | Green |
| `transcribing` | Running STT | Yellow |
| `thinking` | Waiting for agent response | Yellow |
| `speaking` | Playing TTS audio | Cyan |
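The state table above maps naturally onto a lookup. The sketch below is written in Python for readability (an ESP32 client would more likely do this in firmware); `LED_BY_STATE` and `led_for_status` are hypothetical names, and the `"off"` fallback for unrecognized states is an assumption.

```python
import json

# Suggested colours taken from the table above.
LED_BY_STATE = {
    "listening": "green",
    "transcribing": "yellow",
    "thinking": "yellow",
    "speaking": "cyan",
}


def led_for_status(raw: str, default: str = "off"):
    """Return the LED colour for a status event, or None for other event types."""
    event = json.loads(raw)
    if event.get("type") != "status":
        return None
    # Unknown states fall back to the default rather than raising.
    return LED_BY_STATE.get(event.get("state"), default)
```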
transcript.done
Transcript of what the user said.
{
  "type": "transcript.done",
  "text": "What is the weather like today?"
}
response.text.done
AI agent's response text.
{
  "type": "response.text.done",
  "text": "I don't have weather tools yet, but I can help with other things."
}
response.audio.delta
Streamed audio response chunk.
{
  "type": "response.audio.delta",
  "delta": "<base64-encoded PCM audio>"
}
response.audio.done
Audio response streaming complete.
{
  "type": "response.audio.done"
}
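A client that wants to save or play the reply needs to reassemble the `response.audio.delta` chunks. The sketch below concatenates the decoded chunks until `response.audio.done` and wraps the result in a WAV container using the standard library. The helper names are hypothetical, and it assumes the response audio uses the same 16 kHz, mono, 16-bit format as the input; this reference does not state the output format explicitly.

```python
import base64
import io
import json
import wave


def collect_response_audio(messages):
    """Concatenate response.audio.delta payloads until response.audio.done."""
    pcm = bytearray()
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "response.audio.delta":
            pcm.extend(base64.b64decode(event["delta"]))
        elif kind == "response.audio.done":
            break
    return bytes(pcm)


def pcm_to_wav(pcm: bytes, sample_rate=16000, channels=1, sample_width_bits=16) -> bytes:
    """Wrap raw PCM in a WAV container (format parameters are assumed defaults)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bits // 8)  # wave expects bytes per sample
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buf.getvalue()
```

Events of other types that arrive between chunks are ignored, so the same iterable can be shared with other handlers.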
response.done
Full response cycle complete. Gateway returns to listening state.
{
  "type": "response.done"
}
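Taken together, the server events for one utterance can be folded into a single record. The following is a rough sketch under the assumption that a turn ends at `response.done`; the `Turn` dataclass and `consume_turn` function are hypothetical, and the sketch makes no assumption about the order in which the other events arrive.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Turn:
    transcript: str = ""
    response_text: str = ""
    audio_chunks: int = 0
    states: list = field(default_factory=list)


def consume_turn(messages) -> Turn:
    """Fold server events into a Turn, stopping at response.done."""
    turn = Turn()
    for raw in messages:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "status":
            turn.states.append(event["state"])
        elif kind == "transcript.done":
            turn.transcript = event["text"]
        elif kind == "response.text.done":
            turn.response_text = event["text"]
        elif kind == "response.audio.delta":
            turn.audio_chunks += 1
        elif kind == "response.done":
            break
    return turn
```

A real client would also need to handle `error` events, which may end a turn before `response.done` arrives.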
error
Error event.
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
| Code | Description |
|---|---|
| `invalid_json` | Client sent malformed JSON |
| `validation_error` | Message failed schema validation |
| `no_session` | Action requires an active session |
| `empty_buffer` | Audio buffer was empty on commit |
| `empty_transcript` | STT returned no speech |
| `empty_response` | Agent returned empty response |
| `pipeline_error` | Internal pipeline failure |
| `unknown_event` | Unrecognized event type |
| `internal_error` | Unexpected server error |
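A client might surface these events as exceptions. The sketch below is one possible approach, with the hypothetical names `GatewayError` and `raise_on_error`. Note that the example event above uses the code `stt_error`, which does not appear in the table, so the sketch records whether a code is documented instead of rejecting unknown ones.

```python
import json

# Codes listed in the table above.
KNOWN_ERROR_CODES = frozenset({
    "invalid_json", "validation_error", "no_session", "empty_buffer",
    "empty_transcript", "empty_response", "pipeline_error",
    "unknown_event", "internal_error",
})


class GatewayError(Exception):
    """Raised for an error event received from the gateway (hypothetical class)."""

    def __init__(self, code: str, message: str):
        super().__init__(f"{code}: {message}")
        self.code = code
        self.message = message
        self.documented = code in KNOWN_ERROR_CODES


def raise_on_error(raw: str) -> dict:
    """Parse a server event; raise GatewayError if it is an error event."""
    event = json.loads(raw)
    if event.get("type") == "error":
        raise GatewayError(event.get("code", ""), event.get("message", ""))
    return event
```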
REST: /api/v1/info
Returns gateway information and current configuration.
Response:
{
  "name": "stentor-gateway",
  "version": "0.1.0",
  "endpoints": {
    "realtime": "/api/v1/realtime",
    "live": "/api/live/",
    "ready": "/api/ready/",
    "metrics": "/api/metrics"
  },
  "config": {
    "stt_url": "http://perseus.incus:8000",
    "tts_url": "http://pan.incus:8000",
    "agent_url": "http://localhost:8001",
    "stt_model": "Systran/faster-whisper-small",
    "tts_model": "kokoro",
    "tts_voice": "af_heart",
    "audio_sample_rate": 16000,
    "audio_channels": 1,
    "audio_sample_width": 16
  }
}
REST: /api/live/
Kubernetes liveness probe.
Response (200):
{
  "status": "ok"
}
REST: /api/ready/
Kubernetes readiness probe. Checks connectivity to STT, TTS, and Agent services.
Response (200 — all services reachable):
{
  "status": "ready",
  "checks": {
    "stt": true,
    "tts": true,
    "agent": true
  }
}
Response (503 — one or more services unavailable):
{
  "status": "not_ready",
  "checks": {
    "stt": true,
    "tts": false,
    "agent": true
  }
}
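Outside Kubernetes, a script can poll this endpoint and report which dependency is down. The sketch below uses only the standard library; `fetch_readiness` and `unavailable_services` are hypothetical names, and the base URL is supplied by the caller. Because `urllib` raises `HTTPError` for a 503, the sketch catches it and parses the body anyway.

```python
import json
import urllib.error
import urllib.request


def unavailable_services(body: dict) -> list:
    """Return the sorted names of checks that reported false."""
    return sorted(name for name, ok in body.get("checks", {}).items() if not ok)


def fetch_readiness(base_url: str, timeout: float = 2.0):
    """Return (http_status, parsed_body); a 503 is a normal result here."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/ready/", timeout=timeout) as resp:
            return resp.status, json.load(resp)
    except urllib.error.HTTPError as err:
        # The 503 response still carries the JSON body with per-service checks.
        return err.code, json.load(err)
```

For the 503 example above, `unavailable_services` would return `["tts"]`.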
REST: /api/metrics
Prometheus-compatible metrics in text exposition format.
Metrics exported:
| Metric | Type | Description |
|---|---|---|
| `stentor_sessions_active` | Gauge | Current active WebSocket sessions |
| `stentor_transcriptions_total` | Counter | Total STT transcription calls |
| `stentor_tts_requests_total` | Counter | Total TTS synthesis calls |
| `stentor_agent_requests_total` | Counter | Total agent message calls |
| `stentor_pipeline_duration_seconds` | Histogram | Full pipeline latency |
| `stentor_stt_duration_seconds` | Histogram | STT transcription latency |
| `stentor_tts_duration_seconds` | Histogram | TTS synthesis latency |
| `stentor_agent_duration_seconds` | Histogram | Agent response latency |
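For quick checks without a Prometheus server, the gauge and counter values can be read straight from the text exposition format. The sketch below is deliberately minimal: the function name `parse_simple_metrics` is hypothetical, it skips comment lines and any labelled samples (such as histogram buckets), and it assumes unlabelled samples appear as `name value` lines.

```python
def parse_simple_metrics(text: str) -> dict:
    """Return {name: value} for unlabelled samples in Prometheus text format."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, HELP/TYPE comments, and labelled samples like foo{le="1"}.
        if not line or line.startswith("#") or "{" in line:
            continue
        parts = line.split()
        if len(parts) >= 2:
            samples[parts[0]] = float(parts[1])
    return samples
```

A full client would be better served by a dedicated Prometheus parser; this only covers the simplest case.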
Configuration
All configuration is supplied via environment variables (12-factor):
| Variable | Description | Default |
|---|---|---|
| `STENTOR_HOST` | Gateway bind address | 0.0.0.0 |
| `STENTOR_PORT` | Gateway bind port | 8600 |
| `STENTOR_STT_URL` | Speaches STT endpoint | http://perseus.incus:8000 |
| `STENTOR_TTS_URL` | Speaches TTS endpoint | http://pan.incus:8000 |
| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | http://localhost:8001 |
| `STENTOR_STT_MODEL` | Whisper model for STT | Systran/faster-whisper-small |
| `STENTOR_TTS_MODEL` | TTS model name | kokoro |
| `STENTOR_TTS_VOICE` | TTS voice ID | af_heart |
| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | 16000 |
| `STENTOR_AUDIO_CHANNELS` | Audio channel count | 1 |
| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | 16 |
| `STENTOR_LOG_LEVEL` | Logging level | INFO |
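To illustrate how the `STENTOR_` prefix and the defaults above fit together, here is a standard-library sketch of loading these variables. It is not the gateway's own settings code; `GatewaySettings` and `load_settings` are hypothetical names, and values are converted using the type of each default.

```python
import os
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GatewaySettings:
    # Defaults copied from the table above.
    host: str = "0.0.0.0"
    port: int = 8600
    stt_url: str = "http://perseus.incus:8000"
    tts_url: str = "http://pan.incus:8000"
    agent_url: str = "http://localhost:8001"
    stt_model: str = "Systran/faster-whisper-small"
    tts_model: str = "kokoro"
    tts_voice: str = "af_heart"
    audio_sample_rate: int = 16000
    audio_channels: int = 1
    audio_sample_width: int = 16
    log_level: str = "INFO"


def load_settings(env=None) -> GatewaySettings:
    """Build settings from STENTOR_* variables, falling back to the defaults."""
    env = os.environ if env is None else env
    overrides = {}
    for f in fields(GatewaySettings):
        raw = env.get(f"STENTOR_{f.name.upper()}")
        if raw is not None:
            # Convert using the default's type (str or int).
            overrides[f.name] = type(f.default)(raw)
    return GatewaySettings(**overrides)
```

Passing a plain dict as `env` makes the loader easy to exercise in tests without touching the process environment.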