feat: add initial Hold Slayer AI telephony gateway implementation
Complete project scaffolding and core implementation of an AI-powered telephony system that calls companies, navigates IVR menus, waits on hold, and transfers to the user when a human answers. Key components: - FastAPI server with REST API, WebSocket, and MCP (SSE) interfaces - SIP/VoIP call management via PJSUA2 with RTP audio streaming - LLM-powered IVR navigation using OpenAI/Anthropic with tool calling - Hold detection service combining audio analysis and silence detection - Real-time STT (Whisper/Deepgram) and TTS (OpenAI/Piper) pipelines - Call recording with per-channel and mixed audio capture - Event bus (asyncio pub/sub) for real-time client updates - Web dashboard with live call monitoring - SQLite persistence via SQLAlchemy with call history and analytics - Notification support (email, SMS, webhook, desktop) - Docker Compose deployment with Opal VoIP and Opal Media containers - Comprehensive test suite with unit, integration, and E2E tests - Simplified .gitignore and full project documentation in README
This commit is contained in:
178
docs/architecture.md
Normal file
178
docs/architecture.md
Normal file
@@ -0,0 +1,178 @@
|
||||
# Architecture
|
||||
|
||||
Hold Slayer is a single-process async Python application built on FastAPI. It acts as an intelligent B2BUA (Back-to-Back User Agent) sitting between your SIP trunk (PSTN access) and your desk phone/softphone.
|
||||
|
||||
## System Diagram
|
||||
|
||||
```
|
||||
┌─────────────────────────────────────────────────────────────────┐
|
||||
│ FastAPI Server │
|
||||
│ │
|
||||
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────────┐ │
|
||||
│ │ REST API │ │WebSocket │ │MCP Server │ │ Dashboard │ │
|
||||
│ │ /api/* │ │ /ws/* │ │ (SSE) │ │ /dashboard │ │
|
||||
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ └──────────────┘ │
|
||||
│ │ │ │ │
|
||||
│ ┌────┴──────────────┴──────────────┴────┐ │
|
||||
│ │ Event Bus │ │
|
||||
│ │ (asyncio Queue pub/sub per client) │ │
|
||||
│ └────┬──────────────┬──────────────┬────┘ │
|
||||
│ │ │ │ │
|
||||
│ ┌────┴─────┐ ┌─────┴─────┐ ┌────┴──────────┐ │
|
||||
│ │ Call │ │ Hold │ │ Services │ │
|
||||
│ │ Manager │ │ Slayer │ │ (LLM, STT, │ │
|
||||
│ │ │ │ │ │ Recording, │ │
|
||||
│ │ │ │ │ │ Analytics, │ │
|
||||
│ │ │ │ │ │ Notify) │ │
|
||||
│ └────┬─────┘ └─────┬─────┘ └──────────────┘ │
|
||||
│ │ │ │
|
||||
│ ┌────┴──────────────┴───────────────────┐ │
|
||||
│ │ Sippy B2BUA Engine │ │
|
||||
│ │ (SIP calls, DTMF, conference bridge) │ │
|
||||
│ └────┬──────────────────────────────────┘ │
|
||||
│ │ │
|
||||
└───────┼─────────────────────────────────────────────────────────┘
|
||||
│
|
||||
┌────┴────┐
|
||||
│SIP Trunk│ ──→ PSTN
|
||||
└─────────┘
|
||||
```
|
||||
|
||||
## Component Overview
|
||||
|
||||
### Presentation Layer
|
||||
|
||||
| Component | File | Protocol | Purpose |
|
||||
|-----------|------|----------|---------|
|
||||
| REST API | `api/calls.py`, `api/call_flows.py`, `api/devices.py` | HTTP | Call management, CRUD, configuration |
|
||||
| WebSocket | `api/websocket.py` | WS | Real-time event streaming to clients |
|
||||
| MCP Server | `mcp_server/server.py` | SSE | AI assistant tool integration |
|
||||
|
||||
### Orchestration Layer
|
||||
|
||||
| Component | File | Purpose |
|
||||
|-----------|------|---------|
|
||||
| Gateway | `core/gateway.py` | Top-level orchestrator — owns all services, routes calls |
|
||||
| Call Manager | `core/call_manager.py` | Active call state, lifecycle, transcript tracking |
|
||||
| Event Bus | `core/event_bus.py` | Async pub/sub connecting everything together |
|
||||
|
||||
### Intelligence Layer
|
||||
|
||||
| Component | File | Purpose |
|
||||
|-----------|------|---------|
|
||||
| Hold Slayer | `services/hold_slayer.py` | IVR navigation, hold monitoring, human detection |
|
||||
| Audio Classifier | `services/audio_classifier.py` | Real-time waveform analysis (music/speech/DTMF/silence) |
|
||||
| LLM Client | `services/llm_client.py` | OpenAI-compatible LLM for IVR menu decisions |
|
||||
| Transcription | `services/transcription.py` | Speaches/Whisper STT for live audio |
|
||||
| Call Flow Learner | `services/call_flow_learner.py` | Builds reusable IVR trees from exploration data |
|
||||
|
||||
### Infrastructure Layer
|
||||
|
||||
| Component | File | Purpose |
|
||||
|-----------|------|---------|
|
||||
| Sippy Engine | `core/sippy_engine.py` | SIP signaling (INVITE, BYE, REGISTER, DTMF) |
|
||||
| Media Pipeline | `core/media_pipeline.py` | PJSUA2 RTP media handling, conference bridge, recording |
|
||||
| Recording | `services/recording.py` | WAV file management and storage |
|
||||
| Analytics | `services/call_analytics.py` | Call metrics, hold time stats, trends |
|
||||
| Notifications | `services/notification.py` | WebSocket + SMS alerts |
|
||||
| Database | `db/database.py` | SQLAlchemy async (PostgreSQL or SQLite) |
|
||||
|
||||
## Data Flow — Hold Slayer Call
|
||||
|
||||
```
|
||||
1. User Request
|
||||
POST /api/calls/hold-slayer { number, intent, call_flow_id }
|
||||
│
|
||||
2. Gateway.make_call()
|
||||
├── CallManager.create_call() → track state
|
||||
├── SippyEngine.make_call() → SIP INVITE to trunk
|
||||
└── MediaPipeline.add_stream() → RTP media setup
|
||||
│
|
||||
3. HoldSlayer.run_with_flow() or run_exploration()
|
||||
├── AudioClassifier.classify() → analyze 3s audio windows
|
||||
│ ├── silence? → wait
|
||||
│ ├── ringing? → wait
|
||||
│ ├── DTMF? → detect tones
|
||||
│ ├── music? → HOLD_DETECTED event
|
||||
│ └── speech? → transcribe + decide
|
||||
│
|
||||
├── TranscriptionService.transcribe() → STT on speech audio
|
||||
│
|
||||
├── LLMClient.analyze_ivr_menu() → pick menu option (fallback)
|
||||
│ └── SippyEngine.send_dtmf() → press the button
|
||||
│
|
||||
└── detect_hold_to_human_transition()
|
||||
└── HUMAN_DETECTED! → transfer
|
||||
│
|
||||
4. Transfer
|
||||
├── SippyEngine.bridge() → connect call legs
|
||||
├── MediaPipeline.bridge_streams() → bridge RTP
|
||||
├── EventBus.publish(TRANSFER_STARTED)
|
||||
└── NotificationService → "Pick up your phone!"
|
||||
│
|
||||
5. Real-Time Updates (throughout)
|
||||
EventBus.publish() → WebSocket clients
|
||||
→ MCP server resources
|
||||
→ Notification service
|
||||
→ Analytics tracking
|
||||
```
|
||||
|
||||
## Threading Model
|
||||
|
||||
Hold Slayer is primarily single-threaded async (asyncio), with one exception:
|
||||
|
||||
- **Main thread**: FastAPI + all async services (event bus, hold slayer, classifier, etc.)
|
||||
- **Sippy thread**: Sippy B2BUA runs its own event loop in a dedicated daemon thread. The `SippyEngine` bridges async↔sync via `asyncio.run_in_executor()`.
|
||||
- **PJSUA2**: Runs in the main thread using null audio device (no sound card needed — headless server mode).
|
||||
|
||||
```
|
||||
Main Thread (asyncio)
|
||||
├── FastAPI (uvicorn)
|
||||
├── EventBus
|
||||
├── CallManager
|
||||
├── HoldSlayer
|
||||
├── AudioClassifier
|
||||
├── TranscriptionService
|
||||
├── LLMClient
|
||||
├── MediaPipeline (PJSUA2)
|
||||
├── NotificationService
|
||||
└── RecordingService
|
||||
|
||||
Sippy Thread (daemon)
|
||||
└── Sippy B2BUA event loop
|
||||
├── SIP signaling
|
||||
├── DTMF relay
|
||||
└── Call leg management
|
||||
```
|
||||
|
||||
## Design Decisions
|
||||
|
||||
### Why Sippy B2BUA + PJSUA2?
|
||||
|
||||
We split SIP signaling and media handling into two separate libraries:
|
||||
|
||||
- **Sippy B2BUA** handles SIP signaling (INVITE, BYE, REGISTER, re-INVITE, DTMF relay). It's battle-tested for telephony and handles the complex SIP state machine.
|
||||
- **PJSUA2** handles RTP media (audio streams, conference bridge, recording, tone generation). It provides a clean C++/Python API for media manipulation without needing to deal with raw RTP.
|
||||
|
||||
This split lets us tap into the audio stream (for classification and STT) without interfering with SIP signaling, and bridge calls through a conference bridge for clean transfer.
|
||||
|
||||
### Why asyncio Queue-based EventBus?
|
||||
|
||||
- **Single process** — no need for Redis/RabbitMQ cross-process messaging
|
||||
- **Zero dependencies** — pure asyncio, no external services to deploy
|
||||
- **Per-subscriber queues** — slow consumers don't block fast publishers
|
||||
- **Dead subscriber cleanup** — full queues are automatically removed
|
||||
- **Event history** — late joiners can catch up on recent events
|
||||
|
||||
If scaling to multiple gateway processes becomes necessary, the EventBus interface can be backed by Redis pub/sub without changing consumers.
|
||||
|
||||
### Why OpenAI-compatible LLM API?
|
||||
|
||||
The LLM client uses raw HTTP (httpx) against any OpenAI-compatible endpoint. This means:
|
||||
|
||||
- **Ollama** (local, free) — `http://localhost:11434/v1`
|
||||
- **LM Studio** (local, free) — `http://localhost:1234/v1`
|
||||
- **vLLM** (local, fast) — `http://localhost:8000/v1`
|
||||
- **OpenAI** (cloud) — `https://api.openai.com/v1`
|
||||
|
||||
No SDK dependency. No vendor lock-in. Switch models by changing one env var.
|
||||
Reference in New Issue
Block a user