feat: add initial Hold Slayer AI telephony gateway implementation

Complete project scaffolding and core implementation of an AI-powered
telephony system that calls companies, navigates IVR menus, waits on
hold, and transfers to the user when a human answers.

Key components:
- FastAPI server with REST API, WebSocket, and MCP (SSE) interfaces
- SIP/VoIP call management via PJSUA2 with RTP audio streaming
- LLM-powered IVR navigation using OpenAI/Anthropic with tool calling
- Hold detection service combining audio analysis and silence detection
- Real-time STT (Whisper/Deepgram) and TTS (OpenAI/Piper) pipelines
- Call recording with per-channel and mixed audio capture
- Event bus (asyncio pub/sub) for real-time client updates
- Web dashboard with live call monitoring
- SQLite persistence via SQLAlchemy with call history and analytics
- Notification support (email, SMS, webhook, desktop)
- Docker Compose deployment with Opal VoIP and Opal Media containers
- Comprehensive test suite with unit, integration, and E2E tests
- Simplified .gitignore and full project documentation in README
This commit is contained in:
2026-03-21 19:23:26 +00:00
parent c9ff60702b
commit ecf37658ce
56 changed files with 11601 additions and 164 deletions

178
docs/architecture.md Normal file
View File

@@ -0,0 +1,178 @@
# Architecture
Hold Slayer is a single-process async Python application built on FastAPI. It acts as an intelligent B2BUA (Back-to-Back User Agent) sitting between your SIP trunk (PSTN access) and your desk phone/softphone.
## System Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Server │
│ │
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────────┐ │
│ │ REST API │ │WebSocket │ │MCP Server │ │ Dashboard │ │
│ │ /api/* │ │ /ws/* │ │ (SSE) │ │ /dashboard │ │
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ └──────────────┘ │
│ │ │ │ │
│ ┌────┴──────────────┴──────────────┴────┐ │
│ │ Event Bus │ │
│ │ (asyncio Queue pub/sub per client) │ │
│ └────┬──────────────┬──────────────┬────┘ │
│ │ │ │ │
│ ┌────┴─────┐ ┌─────┴─────┐ ┌────┴──────────┐ │
│ │ Call │ │ Hold │ │ Services │ │
│ │ Manager │ │ Slayer │ │ (LLM, STT, │ │
│ │ │ │ │ │ Recording, │ │
│ │ │ │ │ │ Analytics, │ │
│ │ │ │ │ │ Notify) │ │
│ └────┬─────┘ └─────┬─────┘ └──────────────┘ │
│ │ │ │
│ ┌────┴──────────────┴───────────────────┐ │
│ │ Sippy B2BUA Engine │ │
│ │ (SIP calls, DTMF, conference bridge) │ │
│ └────┬──────────────────────────────────┘ │
│ │ │
└───────┼─────────────────────────────────────────────────────────┘
┌────┴────┐
│SIP Trunk│ ──→ PSTN
└─────────┘
```
## Component Overview
### Presentation Layer
| Component | File | Protocol | Purpose |
|-----------|------|----------|---------|
| REST API | `api/calls.py`, `api/call_flows.py`, `api/devices.py` | HTTP | Call management, CRUD, configuration |
| WebSocket | `api/websocket.py` | WS | Real-time event streaming to clients |
| MCP Server | `mcp_server/server.py` | SSE | AI assistant tool integration |
### Orchestration Layer
| Component | File | Purpose |
|-----------|------|---------|
| Gateway | `core/gateway.py` | Top-level orchestrator — owns all services, routes calls |
| Call Manager | `core/call_manager.py` | Active call state, lifecycle, transcript tracking |
| Event Bus | `core/event_bus.py` | Async pub/sub connecting everything together |
### Intelligence Layer
| Component | File | Purpose |
|-----------|------|---------|
| Hold Slayer | `services/hold_slayer.py` | IVR navigation, hold monitoring, human detection |
| Audio Classifier | `services/audio_classifier.py` | Real-time waveform analysis (music/speech/DTMF/silence) |
| LLM Client | `services/llm_client.py` | OpenAI-compatible LLM for IVR menu decisions |
| Transcription | `services/transcription.py` | Speaches/Whisper STT for live audio |
| Call Flow Learner | `services/call_flow_learner.py` | Builds reusable IVR trees from exploration data |
### Infrastructure Layer
| Component | File | Purpose |
|-----------|------|---------|
| Sippy Engine | `core/sippy_engine.py` | SIP signaling (INVITE, BYE, REGISTER, DTMF) |
| Media Pipeline | `core/media_pipeline.py` | PJSUA2 RTP media handling, conference bridge, recording |
| Recording | `services/recording.py` | WAV file management and storage |
| Analytics | `services/call_analytics.py` | Call metrics, hold time stats, trends |
| Notifications | `services/notification.py` | WebSocket + SMS alerts |
| Database | `db/database.py` | SQLAlchemy async (PostgreSQL or SQLite) |
## Data Flow — Hold Slayer Call
```
1. User Request
POST /api/calls/hold-slayer { number, intent, call_flow_id }
2. Gateway.make_call()
├── CallManager.create_call() → track state
├── SippyEngine.make_call() → SIP INVITE to trunk
└── MediaPipeline.add_stream() → RTP media setup
3. HoldSlayer.run_with_flow() or run_exploration()
├── AudioClassifier.classify() → analyze 3s audio windows
│ ├── silence? → wait
│ ├── ringing? → wait
│ ├── DTMF? → detect tones
│ ├── music? → HOLD_DETECTED event
│ └── speech? → transcribe + decide
├── TranscriptionService.transcribe() → STT on speech audio
├── LLMClient.analyze_ivr_menu() → pick menu option (fallback)
│ └── SippyEngine.send_dtmf() → press the button
└── detect_hold_to_human_transition()
└── HUMAN_DETECTED! → transfer
4. Transfer
├── SippyEngine.bridge() → connect call legs
├── MediaPipeline.bridge_streams() → bridge RTP
├── EventBus.publish(TRANSFER_STARTED)
└── NotificationService → "Pick up your phone!"
5. Real-Time Updates (throughout)
EventBus.publish() → WebSocket clients
→ MCP server resources
→ Notification service
→ Analytics tracking
```
## Threading Model
Hold Slayer is primarily single-threaded async (asyncio), with one exception:
- **Main thread**: FastAPI + all async services (event bus, hold slayer, classifier, etc.)
- **Sippy thread**: Sippy B2BUA runs its own event loop in a dedicated daemon thread. The `SippyEngine` bridges async↔sync via `asyncio.run_in_executor()`.
- **PJSUA2**: Runs in the main thread using null audio device (no sound card needed — headless server mode).
```
Main Thread (asyncio)
├── FastAPI (uvicorn)
├── EventBus
├── CallManager
├── HoldSlayer
├── AudioClassifier
├── TranscriptionService
├── LLMClient
├── MediaPipeline (PJSUA2)
├── NotificationService
└── RecordingService
Sippy Thread (daemon)
└── Sippy B2BUA event loop
├── SIP signaling
├── DTMF relay
└── Call leg management
```
## Design Decisions
### Why Sippy B2BUA + PJSUA2?
We split SIP signaling and media handling into two separate libraries:
- **Sippy B2BUA** handles SIP signaling (INVITE, BYE, REGISTER, re-INVITE, DTMF relay). It's battle-tested for telephony and handles the complex SIP state machine.
- **PJSUA2** handles RTP media (audio streams, conference bridge, recording, tone generation). It provides a clean C++/Python API for media manipulation without needing to deal with raw RTP.
This split lets us tap into the audio stream (for classification and STT) without interfering with SIP signaling, and bridge calls through a conference bridge for clean transfer.
### Why asyncio Queue-based EventBus?
- **Single process** — no need for Redis/RabbitMQ cross-process messaging
- **Zero dependencies** — pure asyncio, no external services to deploy
- **Per-subscriber queues** — slow consumers don't block fast publishers
- **Dead subscriber cleanup** — full queues are automatically removed
- **Event history** — late joiners can catch up on recent events
If scaling to multiple gateway processes becomes necessary, the EventBus interface can be backed by Redis pub/sub without changing consumers.
### Why OpenAI-compatible LLM API?
The LLM client uses raw HTTP (httpx) against any OpenAI-compatible endpoint. This means:
- **Ollama** (local, free) — `http://localhost:11434/v1`
- **LM Studio** (local, free) — `http://localhost:1234/v1`
- **vLLM** (local, fast) — `http://localhost:8000/v1`
- **OpenAI** (cloud) — `https://api.openai.com/v1`
No SDK dependency. No vendor lock-in. Switch models by changing one env var.