feat: scaffold stentor-gateway with FastAPI voice pipeline
Initialize the stentor-gateway project with WebSocket-based voice pipeline orchestrating STT → Agent → TTS via OpenAI-compatible APIs.

- Add FastAPI app with WebSocket endpoint for audio streaming
- Add pipeline orchestration (stt_client, tts_client, agent_client)
- Add Pydantic Settings configuration and message models
- Add audio utilities for PCM/WAV conversion and resampling
- Add health check endpoints
- Add Dockerfile and pyproject.toml with dependencies
- Add initial test suite (pipeline, STT, TTS, WebSocket)
- Add comprehensive README covering gateway and ESP32 ear design
- Clean up .gitignore for Python/uv project
README.md
# Stentor — Voice Gateway Specification

> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V

**Stentor** is a voice gateway that connects physical audio hardware (ESP32-S3-AUDIO-Board) to AI agents (FastAgent) via speech services (Speaches), enabling real-time voice conversations with AI.

---

## Architecture Overview

```
┌──────────────────────────────────┐
│     ESP32-S3-AUDIO-Board         │
│     (stentor-ear)                │
│                                  │
│  Mic Array ──▶ Wake Word ──▶ WS  │
│  Speaker ◀── PCM playback ◀──    │
└──────────────┬───────────────────┘
               │ WebSocket (audio)
               ▼
┌──────────────────────────────────┐
│        Stentor Gateway           │
│        (Python service)          │
│                                  │
│  WS Server ──▶ STT ──▶ Agent     │
│  WS Server ◀── TTS ◀── Agent     │
└──────┬──────────┬──────────┬─────┘
       │          │          │
       ▼          ▼          ▼
   Speaches   FastAgent   Speaches
     STT       (HTTP)       TTS
```

---
## Components

### 1. Stentor Gateway (`stentor-gateway/`)

**Language:** Python 3.12+
**Framework:** FastAPI + WebSockets
**Deployment:** Agathos container (Caliban or dedicated)

The gateway is the central orchestrator. It:

- Accepts WebSocket connections from ESP32 clients
- Streams audio to Speaches STT for transcription
- Sends transcripts to FastAgent for AI response
- Sends AI response text to Speaches TTS for synthesis
- Streams synthesized audio back to the ESP32 client

#### Configuration

All configuration via environment variables (12-factor):

| Variable | Description | Default |
|----------|-------------|---------|
| `STENTOR_HOST` | Gateway bind address | `0.0.0.0` |
| `STENTOR_PORT` | Gateway bind port | `8600` |
| `STENTOR_STT_URL` | Speaches STT endpoint (OpenAI-compatible) | `http://perseus.incus:8000` |
| `STENTOR_TTS_URL` | Speaches TTS endpoint (OpenAI-compatible) | `http://pan.incus:8000` |
| `STENTOR_AGENT_URL` | FastAgent HTTP endpoint | `http://localhost:8001` |
| `STENTOR_STT_MODEL` | Whisper model for STT | `Systran/faster-whisper-small` |
| `STENTOR_TTS_MODEL` | TTS model name | `kokoro` |
| `STENTOR_TTS_VOICE` | TTS voice ID | `af_heart` |
| `STENTOR_AUDIO_SAMPLE_RATE` | Audio sample rate in Hz | `16000` |
| `STENTOR_AUDIO_CHANNELS` | Audio channel count | `1` |
| `STENTOR_AUDIO_SAMPLE_WIDTH` | Bits per sample | `16` |
| `STENTOR_LOG_LEVEL` | Logging level | `INFO` |
#### WebSocket Protocol

Endpoint: `ws://{STENTOR_HOST}:{STENTOR_PORT}/ws/audio`

##### Client → Gateway Messages

```json
// Session start (sent once on connect)
{
  "type": "session.start",
  "client_id": "esp32-kitchen",
  "audio_config": {
    "sample_rate": 16000,
    "channels": 1,
    "sample_width": 16,
    "encoding": "pcm_s16le"
  }
}

// Audio data (sent continuously while listening)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}

// Client signals end of speech
{
  "type": "audio.end"
}

// Client requests session termination
{
  "type": "session.end"
}
```
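One way these messages could be modeled in `models.py` is sketched below; only the wire fields come from the protocol above, while the class names and the union wiring are illustrative assumptions.

```python
# Hypothetical models.py sketch: one Pydantic model per message "type".
from typing import Literal, Union

from pydantic import BaseModel


class AudioConfig(BaseModel):
    sample_rate: int = 16000
    channels: int = 1
    sample_width: int = 16
    encoding: str = "pcm_s16le"


class SessionStart(BaseModel):
    type: Literal["session.start"] = "session.start"
    client_id: str
    audio_config: AudioConfig


class AudioChunk(BaseModel):
    type: Literal["audio.chunk"] = "audio.chunk"
    data: str  # base64-encoded PCM


class AudioEnd(BaseModel):
    type: Literal["audio.end"] = "audio.end"


class SessionEnd(BaseModel):
    type: Literal["session.end"] = "session.end"


ClientMessage = Union[SessionStart, AudioChunk, AudioEnd, SessionEnd]
```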
##### Gateway → Client Messages

```json
// Acknowledge session
{
  "type": "session.ready",
  "session_id": "uuid"
}

// Gateway is processing (ESP32 can show "thinking" LED)
{
  "type": "status",
  "state": "listening" | "transcribing" | "thinking" | "speaking"
}

// Transcript of what user said (for debugging/display)
{
  "type": "transcript",
  "text": "What is the weather like today?"
}

// AI response text (for debugging/display)
{
  "type": "response.text",
  "text": "I don't have weather tools yet, but..."
}

// Audio response (streamed in chunks)
{
  "type": "audio.chunk",
  "data": "<base64-encoded PCM audio>"
}

// Audio response complete
{
  "type": "audio.end"
}

// Error
{
  "type": "error",
  "message": "STT service unavailable",
  "code": "stt_error"
}
```
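The base64 framing used by `audio.chunk` in both directions is simple enough to sketch with the standard library; the chunk size below is arbitrary, not part of the protocol.

```python
# Frame raw PCM into audio.chunk messages and reassemble them. This is the
# same framing a test client (or the ESP32) would perform.
import base64
import json


def frame_audio(pcm: bytes, chunk_bytes: int = 1024):
    """Yield audio.chunk JSON messages for a raw PCM buffer."""
    for i in range(0, len(pcm), chunk_bytes):
        yield json.dumps({
            "type": "audio.chunk",
            "data": base64.b64encode(pcm[i:i + chunk_bytes]).decode("ascii"),
        })


def deframe(messages) -> bytes:
    """Reassemble raw PCM from a sequence of audio.chunk messages."""
    pcm = bytearray()
    for m in messages:
        msg = json.loads(m)
        if msg["type"] == "audio.chunk":
            pcm.extend(base64.b64decode(msg["data"]))
    return bytes(pcm)
```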
#### Processing Pipeline

```
1.  ESP32 detects wake word locally (ESP-SR)
2.  ESP32 opens WebSocket, sends session.start
3.  Gateway sends session.ready
4.  ESP32 streams audio.chunk messages
5.  Gateway buffers audio (with Voice Activity Detection)
6.  ESP32 sends audio.end (or VAD detects silence)
7.  Gateway sends status: "transcribing"
8.  Gateway calls Speaches STT:
      POST {STENTOR_STT_URL}/v1/audio/transcriptions
      - model: {STENTOR_STT_MODEL}
      - file: collected audio as WAV
      - response_format: json
9.  Gateway sends transcript message to client
10. Gateway sends status: "thinking"
11. Gateway calls FastAgent:
      POST {STENTOR_AGENT_URL}/message
      - content: transcript text
12. Gateway sends response.text to client
13. Gateway sends status: "speaking"
14. Gateway calls Speaches TTS:
      POST {STENTOR_TTS_URL}/v1/audio/speech
      - model: {STENTOR_TTS_MODEL}
      - voice: {STENTOR_TTS_VOICE}
      - input: agent response text
      - response_format: pcm
      - speed: 1.0
15. Gateway streams audio.chunk messages to client
16. Gateway sends audio.end
17. Gateway returns to listening state (step 4),
    or the ESP32 sends session.end to disconnect
```
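A single turn of steps 7–16 could be orchestrated roughly as below; this is a sketch, and the `transcribe`/`respond`/`synthesize` method names on the three client objects are assumptions, not fixed interfaces.

```python
# One conversational turn of the pipeline. ws_send is an async callable that
# delivers a gateway → client message; stt, agent, and tts are assumed client
# objects with the illustrative method names shown.
async def run_turn(ws_send, audio_wav: bytes, stt, agent, tts) -> None:
    await ws_send({"type": "status", "state": "transcribing"})
    transcript = await stt.transcribe(audio_wav)           # step 8
    await ws_send({"type": "transcript", "text": transcript})

    await ws_send({"type": "status", "state": "thinking"})
    reply = await agent.respond(transcript)                # step 11
    await ws_send({"type": "response.text", "text": reply})

    await ws_send({"type": "status", "state": "speaking"})
    async for chunk in tts.synthesize(reply):              # step 14, base64 chunks
        await ws_send({"type": "audio.chunk", "data": chunk})
    await ws_send({"type": "audio.end"})
```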
#### Health & Observability

- `GET /health` — Returns 200 if gateway is running
- `GET /health/ready` — Returns 200 if STT, TTS, and Agent endpoints are reachable
- `GET /metrics` — Prometheus-compatible metrics
  - `stentor_sessions_active` — Current active WebSocket sessions
  - `stentor_transcriptions_total` — Total STT calls
  - `stentor_tts_requests_total` — Total TTS calls
  - `stentor_agent_requests_total` — Total agent calls
  - `stentor_pipeline_duration_seconds` — Histogram of full pipeline latency
  - `stentor_stt_duration_seconds` — Histogram of STT latency
  - `stentor_tts_duration_seconds` — Histogram of TTS latency
  - `stentor_agent_duration_seconds` — Histogram of agent response latency
#### Dependencies (Python)

```toml
[project]
name = "stentor-gateway"
version = "0.1.0"
requires-python = ">=3.12"
dependencies = [
    "fastapi>=0.115",
    "uvicorn[standard]>=0.34",
    "websockets>=14.0",
    "httpx>=0.28",
    "pydantic>=2.10",
    "pydantic-settings>=2.7",
]

[project.optional-dependencies]
dev = [
    "pytest>=8.0",
    "pytest-asyncio>=0.25",
    "ruff>=0.9",
]
```
#### Project Structure

```
stentor-gateway/
├── pyproject.toml
├── src/
│   └── stentor/
│       ├── __init__.py
│       ├── main.py            # FastAPI app, WebSocket endpoint
│       ├── config.py          # Pydantic Settings configuration
│       ├── models.py          # WebSocket message models
│       ├── pipeline.py        # Orchestration: STT → Agent → TTS
│       ├── stt_client.py      # Speaches STT client (OpenAI-compatible)
│       ├── tts_client.py      # Speaches TTS client (OpenAI-compatible)
│       ├── agent_client.py    # FastAgent HTTP client
│       ├── audio.py           # Audio utilities (PCM↔WAV, resampling)
│       └── health.py          # Health check endpoints
├── tests/
│   ├── test_pipeline.py
│   ├── test_stt_client.py
│   ├── test_tts_client.py
│   └── test_websocket.py
└── Dockerfile
```

---
### 2. Stentor Ear (`stentor-ear/`)

**Platform:** ESP32-S3-AUDIO-Board (Waveshare)
**Framework:** ESP-IDF 5.x + ESP-ADF
**Language:** C

The ESP32 firmware handles local audio capture, wake word detection, and communication with the Stentor Gateway.

#### Hardware Capabilities (Waveshare ESP32-S3-AUDIO-Board)

| Component | Chip | Purpose |
|-----------|------|---------|
| Microphone array | 2× MEMS via ES7210 ADC | Audio capture, noise reduction |
| Speaker output | ES8311 DAC + Class-D amp | Audio playback |
| Display | 1.85" LCD (ST7789V2) | Status display (optional) |
| LEDs | WS2812B RGB ring (×2) | Visual status feedback |
| Storage | TF card slot | Local audio cache (optional) |
| Camera | OV2640 (2MP) | Not used for Stentor |
| RTC | PCF85063 | Timestamping (optional) |
| Connectivity | WiFi 802.11 b/g/n + BLE 5 | Network communication |
| Wake word | ESP-SR | Local "Hi ESP" detection |

Reference: the Waveshare wiki documents the ESP-SR wake word "hi esp" with command recognition support [2].
#### Firmware State Machine

```
              ┌──────────────────────┐
    ┌─────────│         IDLE         │◀──────────┐
    │         │   (LEDs: dim blue)   │           │
    │         └──────────┬───────────┘           │
    │                    │ Wake word             │
    │                    │ detected              │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
    │         │       CONNECT        │           │
    │         │ (LEDs: pulse white)  │           │
    │         └──────────┬───────────┘           │
    │                    │ WS connected          │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
Timeout       │        LISTEN        │           │ audio.end
/ Error       │    (LEDs: green)     │           │ received
    │         └──────────┬───────────┘           │
    │                    │ Silence /             │
    │                    │ button press          │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
    │         │       WAITING        │           │
    │         │ (LEDs: pulse yellow) │           │
    │         └──────────┬───────────┘           │
    │                    │ audio.chunk           │
    │                    │ from gateway          │
    │                    ▼                       │
    │         ┌──────────────────────┐           │
    │         │       PLAYING        ├───────────┘
    │         │  (LEDs: pulse cyan)  │
    │         └──────────┬───────────┘
    │                    │
    └────────────────────┘
```
#### Audio Configuration

| Parameter | Value | Notes |
|-----------|-------|-------|
| Sample rate | 16000 Hz | Standard for speech recognition |
| Bit depth | 16-bit | Signed, little-endian (PCM_S16LE) |
| Channels | 1 (mono) | Downmixed from dual-mic array |
| Chunk size | 512 samples (32 ms) | Balance between latency and overhead |
| Buffer | 4096 samples (256 ms) | Ring buffer for network jitter |
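On the gateway side, this PCM format has to be wrapped in a WAV container before it can be posted to the STT endpoint; a possible `audio.py` helper using only the standard library's `wave` module:

```python
# Wrap raw PCM_S16LE bytes in a WAV container, and extract them back out.
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 16000,
               channels: int = 1, sample_width_bytes: int = 2) -> bytes:
    """Wrap raw PCM in a WAV header (width is in bytes: 2 = 16-bit)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(sample_width_bytes)
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def wav_to_pcm(wav_bytes: bytes) -> bytes:
    """Extract raw PCM frames from a WAV container."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as w:
        return w.readframes(w.getnframes())
```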
#### WiFi & Network

- WiFi credentials via `menuconfig` or BLE provisioning
- mDNS discovery for gateway: `_stentor._tcp.local`
- Fallback to configured static gateway URL
- Auto-reconnect with exponential backoff

#### LED Feedback (WS2812B Ring)

| State | Pattern | Color |
|-------|---------|-------|
| Idle / standby | Dim steady | Blue |
| Wake word detected / connecting | Pulse | White |
| Listening (recording) | Steady | Green |
| Transcribing / thinking | Rotating pulse | Yellow |
| Speaking (playback) | Pulse with audio | Cyan |
| Error | Flash | Red |
#### Firmware Project Structure

```
stentor-ear/
├── CMakeLists.txt
├── sdkconfig.defaults
├── main/
│   ├── CMakeLists.txt
│   ├── main.c               # App entry, state machine
│   ├── audio_capture.c/.h   # I2S mic input via ES7210
│   ├── audio_playback.c/.h  # I2S speaker output via ES8311
│   ├── wake_word.c/.h       # ESP-SR integration
│   ├── ws_client.c/.h       # WebSocket client to gateway
│   ├── led_status.c/.h      # WS2812B LED patterns
│   ├── wifi_manager.c/.h    # WiFi connection management
│   └── config.h             # Compile-time configuration
├── components/
│   └── (ESP-ADF / ESP-SR as managed components)
└── partitions.csv
```

---
### 3. FastAgent Integration

The Stentor Gateway communicates with FastAgent via its HTTP transport mode.

#### Starting FastAgent for Stentor

```bash
# Expose Harper (or any agent) as an HTTP server
uv run fast-agent --transport http --port 8001
```

#### Agent Request/Response

The gateway sends a simple message and receives the agent's response. The exact API shape depends on FastAgent's HTTP transport implementation — expected to follow MCP-over-HTTP conventions:

```
POST /message
Content-Type: application/json

{
  "content": "What can you tell me about the Agathos infrastructure?"
}

Response:
{
  "content": "Agathos is a set of Incus containers on a 10.10.0.0/24 network..."
}
```

> **Note for Cline:** The FastAgent HTTP transport API may need investigation. Check the fast-agent source for the actual endpoint shape. The gateway's `agent_client.py` should be the thinnest possible adapter so it's easy to swap.
---

### 4. Speaches Integration

Speaches provides OpenAI API-compatible STT and TTS endpoints.

#### STT (Speech-to-Text)

```
POST {STENTOR_STT_URL}/v1/audio/transcriptions
Content-Type: multipart/form-data

- file: audio.wav (PCM wrapped in WAV header)
- model: "Systran/faster-whisper-small"
- language: "en" (optional, auto-detect if omitted)
- response_format: "json"

Response:
{
  "text": "What is the weather like today?"
}
```
#### TTS (Text-to-Speech)

```
POST {STENTOR_TTS_URL}/v1/audio/speech
Content-Type: application/json

{
  "model": "kokoro",
  "voice": "af_heart",
  "input": "I don't have weather tools yet, but I can help with other things.",
  "response_format": "pcm",
  "speed": 1.0
}

Response: Binary PCM audio stream (24kHz, mono, 16-bit)
```

> **Note:** Speaches TTS may output at 24kHz. The gateway's `audio.py` module should handle resampling to 16kHz if the ESP32 is configured for 16kHz playback. Alternatively, configure the ESP32 for 24kHz playback.
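If `audio.py` does resample, a dependency-free linear-interpolation pass is adequate for speech; the sketch below avoids the deprecated `audioop` module (a proper polyphase filter would give higher quality, but this keeps v0.1 simple).

```python
# Linear-interpolation resampler for mono PCM_S16LE (e.g. 24 kHz → 16 kHz).
# Input length must be an even number of bytes (whole 16-bit samples).
import struct


def resample_pcm16(pcm: bytes, src_rate: int, dst_rate: int) -> bytes:
    samples = struct.unpack("<%dh" % (len(pcm) // 2), pcm)
    if src_rate == dst_rate or not samples:
        return pcm
    n_out = int(len(samples) * dst_rate / src_rate)
    step = src_rate / dst_rate
    out = []
    for i in range(n_out):
        pos = i * step
        j = int(pos)
        frac = pos - j
        a = samples[j]
        b = samples[min(j + 1, len(samples) - 1)]
        out.append(int(a + (b - a) * frac))  # stays within int16 range
    return struct.pack("<%dh" % n_out, *out)
```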
---

## Build & Run

### Gateway

```bash
cd stentor-gateway
pip install -e .
stentor
```

Or with more control:

```bash
python -m stentor
# or: uvicorn stentor.main:app --host 0.0.0.0 --port 8600
```

See [docs/stentor.md](docs/stentor.md) for full usage documentation.

### ESP32 Firmware

```bash
cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig   # Set WiFi creds, gateway URL
idf.py build flash monitor
```

---
## Phase 2 (Future)

These are explicitly **out of scope** for v0.1 but documented for planning:

- [ ] **Streaming STT** — Use Speaches SSE streaming for real-time partial transcripts
- [ ] **Streaming TTS** — Stream TTS audio chunks as they're generated (reduce time-to-first-audio)
- [ ] **Barge-in / Interruption** — Allow user to interrupt AI playback with new speech
- [ ] **Multi-turn context** — Maintain conversation history across turns within a session
- [ ] **PJSUA2 SIP bridge** — Enable calling Stentor from a phone via SIP
- [ ] **Multiple agents** — Route to different FastAgent instances based on wake word or command
- [ ] **Local fallback** — ESP32 handles simple commands locally when gateway is unreachable
- [ ] **OTA firmware updates** — Update stentor-ear firmware over WiFi
- [ ] **LCD UI** — Show transcript and response text on the 1.85" display

---

## Design Decisions & Rationale

| Decision | Rationale |
|----------|-----------|
| WebSocket over SIP/RTP | No SIP infrastructure needed; ESP-IDF has native WS support; Speaches speaks HTTP, not SIP |
| Base64 audio in JSON | Simplicity for v0.1; binary WS frames are a Phase 2 optimization |
| Wake word on ESP32 | Zero-latency wake detection; no audio streaming when idle; privacy-preserving |
| Gateway as separate service | Decouples hardware from AI; can swap agents, add SIP later, scale independently |
| PCM_S16LE format | Universal, no codec overhead; both ESP32 and Speaches handle it natively |
| Pydantic Settings for config | Type-safe, env-var-driven, zero config files to manage |

---
## Notes for Cline

1. **Start with the gateway.** The ESP32 firmware is more complex and needs the gateway to test against.
2. **Use `httpx.AsyncClient`** for all HTTP calls — it supports streaming responses, which you'll need for TTS.
3. **The WebSocket protocol is intentionally simple.** JSON messages with base64 audio. Optimize to binary frames later.
4. **Test with a Python WebSocket client first** before touching ESP32 hardware. Write a `test_client.py` that sends a WAV file over the WebSocket and plays back the response.
5. **FastAgent HTTP transport** needs investigation — check the fast-agent repo for actual endpoint signatures.
6. **Audio resampling** — Speaches TTS may output 24kHz; the ESP32 may expect 16kHz. Handle this in `audio.py`.
7. **Don't overthink error handling in v0.1** — log errors, send error messages to the client, keep the pipeline simple.
SPECEOF
echo 'SPEC.md written'"}}
</tool_call>

<tool_result>
SPEC.md written
</tool_result>
Now let me push it to the repo:

<tool_call>
{"name": "execute_command", "arguments": {"command": "cd /tmp/stentor && git init && git remote add origin https://gitea.service.consul/robert/stentor.git && git fetch origin && git checkout main && cp SPEC.md . && git add SPEC.md && git commit -m 'Add Stentor voice gateway specification' && git push origin main"}}
</tool_call>

<tool_result>
hint: Using 'master' as the name for the initial branch. This name is subject to change. To
hint: configure the initial branch name to use in all of your new repositories, which will
hint: suppress this warning, call:
hint:
hint: git config --global init.defaultBranch <name>
hint:
hint: Names commonly chosen instead of 'master' are 'main', 'trunk' and 'development'. The
hint: just-created branch can be renamed via this command:
hint:
hint: git branch -m <name>
From https://gitea.service.consul/robert/stentor
 * [new branch]      main       -> origin/main
branch 'main' set up to track 'origin/main'.
Already on 'main'
[main 2e4e0b5] Add Stentor voice gateway specification
 1 file changed, 362 insertions(+)
 create mode 100644 SPEC.md
To https://gitea.service.consul/robert/stentor.git
   7e5e7b2..2e4e0b5  main -> main
</tool_result>
Now let me also create a basic README and scaffold the directory structure:

<tool_call>
{"name": "execute_command", "arguments": {"command": "cd /tmp/stentor && cat > README.md << 'EOF'
# Στέντωρ (Stentor)

> *"Stentor, whose voice was as powerful as fifty voices of other men."*
> — Homer, *Iliad*, Book V

Voice gateway for AI agents. Bridges physical audio hardware to AI agents via speech services.

## Architecture

```
ESP32-S3-AUDIO-Board ◀──WebSocket──▶ Stentor Gateway ◀──HTTP──▶ Speaches (STT/TTS)
    (stentor-ear)                        (Python)                FastAgent (AI)
```

**Stentor Gateway** orchestrates the voice pipeline:

1. Receives audio from ESP32 via WebSocket
2. Transcribes speech via Speaches STT (OpenAI-compatible API)
3. Sends transcript to FastAgent for AI response
4. Synthesizes speech via Speaches TTS
5. Streams audio back to ESP32 for playback

**Stentor Ear** is the ESP32-S3 firmware:

- Local wake word detection ("Hi ESP" via ESP-SR)
- Audio capture from dual-mic array
- Audio playback via onboard speaker
- LED ring feedback for conversation state

## Quick Start

### Gateway

```bash
cd stentor-gateway
uv sync
uv run uvicorn stentor.main:app --host 0.0.0.0 --port 8600
```

### ESP32 Firmware

```bash
cd stentor-ear
idf.py set-target esp32s3
idf.py menuconfig   # Set WiFi creds, gateway URL
idf.py build flash monitor
```

## Documentation

- [Full Specification](SPEC.md)

## Status

🚧 **Pre-alpha** — Under active development.

## License

MIT