# Architecture

Hold Slayer is a single-process async Python application built on FastAPI. It acts as an intelligent B2BUA (Back-to-Back User Agent) sitting between your SIP trunk (PSTN access) and your desk phone/softphone.

## System Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                        FastAPI Server                           │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌──────────────┐    │
│  │ REST API │  │WebSocket │  │MCP Server │  │  Dashboard   │    │
│  │ /api/*   │  │ /ws/*    │  │ (SSE)     │  │  /dashboard  │    │
│  └────┬─────┘  └────┬─────┘  └─────┬─────┘  └──────────────┘    │
│       │             │              │                            │
│  ┌────┴─────────────┴──────────────┴────┐                       │
│  │              Event Bus               │                       │
│  │  (asyncio Queue pub/sub per client)  │                       │
│  └────┬─────────────┬──────────────┬────┘                       │
│       │             │              │                            │
│  ┌────┴─────┐  ┌────┴─────┐  ┌─────┴────────┐                   │
│  │   Call   │  │   Hold   │  │   Services   │                   │
│  │ Manager  │  │  Slayer  │  │  (LLM, STT,  │                   │
│  │          │  │          │  │  Recording,  │                   │
│  │          │  │          │  │  Analytics,  │                   │
│  │          │  │          │  │   Notify)    │                   │
│  └────┬─────┘  └────┬─────┘  └──────────────┘                   │
│       │             │                                           │
│  ┌────┴─────────────┴──────────────────┐                        │
│  │         Sippy B2BUA Engine          │                        │
│  │ (SIP calls, DTMF, conference bridge)│                        │
│  └────┬────────────────────────────────┘                        │
│       │                                                         │
└───────┼─────────────────────────────────────────────────────────┘
        │
   ┌────┴────┐
   │SIP Trunk│ ──→ PSTN
   └─────────┘
```

## Component Overview

### Presentation Layer

| Component | File | Protocol | Purpose |
|-----------|------|----------|---------|
| REST API | `api/calls.py`, `api/call_flows.py`, `api/devices.py` | HTTP | Call management, CRUD, configuration |
| WebSocket | `api/websocket.py` | WS | Real-time event streaming to clients |
| MCP Server | `mcp_server/server.py` | SSE | AI assistant tool integration |

### Orchestration Layer

| Component | File | Purpose |
|-----------|------|---------|
| Gateway | `core/gateway.py` | Top-level orchestrator — owns all services, routes calls |
| Call Manager | `core/call_manager.py` | Active call state, lifecycle, transcript tracking |
| Event Bus | `core/event_bus.py` | Async pub/sub connecting everything together |

### Intelligence Layer

| Component | File | Purpose |
|-----------|------|---------|
| Hold Slayer | `services/hold_slayer.py` | IVR navigation, hold monitoring, human detection |
| Audio Classifier | `services/audio_classifier.py` | Real-time waveform analysis (music/speech/DTMF/silence) |
| LLM Client | `services/llm_client.py` | OpenAI-compatible LLM for IVR menu decisions |
| Transcription | `services/transcription.py` | Speaches/Whisper STT for live audio |
| Call Flow Learner | `services/call_flow_learner.py` | Builds reusable IVR trees from exploration data |

### Infrastructure Layer

| Component | File | Purpose |
|-----------|------|---------|
| Sippy Engine | `core/sippy_engine.py` | SIP signaling (INVITE, BYE, REGISTER, DTMF) |
| Media Pipeline | `core/media_pipeline.py` | PJSUA2 RTP media handling, conference bridge, recording |
| Recording | `services/recording.py` | WAV file management and storage |
| Analytics | `services/call_analytics.py` | Call metrics, hold time stats, trends |
| Notifications | `services/notification.py` | WebSocket + SMS alerts |
| Database | `db/database.py` | SQLAlchemy async (PostgreSQL or SQLite) |
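
The PostgreSQL-or-SQLite switch in the database layer can be driven by a single connection URL. A minimal sketch, assuming an environment variable named `HOLD_SLAYER_DB_URL` (the real setting name may differ):

```python
import os

# Hypothetical default; the project's actual path may differ.
DEFAULT_URL = "sqlite+aiosqlite:///./holdslayer.db"

def database_url() -> str:
    """Return the SQLAlchemy async connection URL.

    PostgreSQL (e.g. postgresql+asyncpg://user:pw@host/db) when configured,
    otherwise a local SQLite file via the aiosqlite driver.
    """
    return os.environ.get("HOLD_SLAYER_DB_URL", DEFAULT_URL)

# The chosen URL would then be handed to
# sqlalchemy.ext.asyncio.create_async_engine(database_url()).
print(database_url())
```

Both backends go through the same async SQLAlchemy API, so swapping them is a deployment choice, not a code change.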

## Data Flow — Hold Slayer Call

```
1. User Request
   POST /api/calls/hold-slayer { number, intent, call_flow_id }
         │
2. Gateway.make_call()
   ├── CallManager.create_call()     → track state
   ├── SippyEngine.make_call()       → SIP INVITE to trunk
   └── MediaPipeline.add_stream()    → RTP media setup
         │
3. HoldSlayer.run_with_flow() or run_exploration()
   ├── AudioClassifier.classify()    → analyze 3s audio windows
   │   ├── silence? → wait
   │   ├── ringing? → wait
   │   ├── DTMF? → detect tones
   │   ├── music? → HOLD_DETECTED event
   │   └── speech? → transcribe + decide
   │
   ├── TranscriptionService.transcribe() → STT on speech audio
   │
   ├── LLMClient.analyze_ivr_menu() → pick menu option (fallback)
   │   └── SippyEngine.send_dtmf()  → press the button
   │
   └── detect_hold_to_human_transition()
       └── HUMAN_DETECTED! → transfer
           │
4. Transfer
   ├── SippyEngine.bridge()          → connect call legs
   ├── MediaPipeline.bridge_streams() → bridge RTP
   ├── EventBus.publish(TRANSFER_STARTED)
   └── NotificationService → "Pick up your phone!"
         │
5. Real-Time Updates (throughout)
   EventBus.publish() → WebSocket clients
                      → MCP server resources
                      → Notification service
                      → Analytics tracking
```
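
Step 3's classify-then-act loop can be illustrated with a toy classifier. The thresholds and the heuristic below are illustrative assumptions, not the real `AudioClassifier` logic — they only show the shape of the decision: silence means wait, steady energy suggests hold music, bursty energy suggests speech worth transcribing.

```python
import math

def rms(samples: list[float]) -> float:
    """Root-mean-square energy of one audio window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def classify(window: list[float], subframes: int = 10) -> str:
    """Toy 3-second-window classifier: silence / music / speech.

    Heuristic only: hold music tends to have steady energy, while speech
    alternates between bursts and pauses, so compare the spread of
    per-subframe RMS values against their mean.
    """
    if rms(window) < 0.01:
        return "silence"
    n = len(window) // subframes
    energies = [rms(window[i * n:(i + 1) * n]) for i in range(subframes)]
    mean = sum(energies) / len(energies)
    spread = max(energies) - min(energies)
    return "music" if spread < 0.5 * mean else "speech"

# Steady sine tone (hold-music stand-in) vs. bursty signal (speech stand-in),
# both 3 s at an assumed 8 kHz sample rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(24000)]
bursts = [0.5 * math.sin(2 * math.pi * 200 * t / 8000) if (t // 2400) % 2 == 0 else 0.0
          for t in range(24000)]
print(classify([0.0] * 24000))  # silence
print(classify(tone))           # music
print(classify(bursts))         # speech
```

The real service presumably adds ringing and DTMF detection on top of this, as the data-flow diagram shows.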

## Threading Model

Hold Slayer is primarily single-threaded async (asyncio), with one exception:

- **Main thread**: FastAPI plus all async services (event bus, hold slayer, classifier, etc.). PJSUA2 also runs here, using the null audio device (no sound card needed — headless server mode).
- **Sippy thread**: the Sippy B2BUA runs its own event loop in a dedicated daemon thread. `SippyEngine` bridges async↔sync via `asyncio.run_in_executor()`.

```
Main Thread (asyncio)
├── FastAPI (uvicorn)
├── EventBus
├── CallManager
├── HoldSlayer
├── AudioClassifier
├── TranscriptionService
├── LLMClient
├── MediaPipeline (PJSUA2)
├── NotificationService
└── RecordingService

Sippy Thread (daemon)
└── Sippy B2BUA event loop
    ├── SIP signaling
    ├── DTMF relay
    └── Call leg management
```
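
The async↔sync bridge between the two threads can be sketched as follows. `place_call_blocking`, `SippyEngineSketch`, and the event callback are stand-ins, not the real `SippyEngine` API; the two mechanisms shown — `run_in_executor()` for asyncio→blocking calls and `call_soon_threadsafe()` for thread→asyncio events — are the standard asyncio patterns for this topology.

```python
import asyncio
import threading
import time

def place_call_blocking(number: str) -> str:
    """Stand-in for a blocking Sippy operation running off the main thread."""
    time.sleep(0.05)  # pretend SIP INVITE / 200 OK round trip
    return f"answered:{number}"

class SippyEngineSketch:
    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop
        self.events: asyncio.Queue[str] = asyncio.Queue()

    async def make_call(self, number: str) -> str:
        # asyncio side -> Sippy side: run blocking work in the default
        # thread pool so the event loop stays responsive.
        return await self.loop.run_in_executor(None, place_call_blocking, number)

    def on_sippy_event(self, event: str) -> None:
        # Sippy side -> asyncio side: hand the event to the loop thread-safely.
        self.loop.call_soon_threadsafe(self.events.put_nowait, event)

async def main() -> str:
    engine = SippyEngineSketch(asyncio.get_running_loop())
    # Simulate the Sippy daemon thread emitting an event mid-call.
    threading.Thread(target=engine.on_sippy_event, args=("DTMF:5",), daemon=True).start()
    result = await engine.make_call("+18005551234")
    event = await engine.events.get()
    return f"{result} {event}"

print(asyncio.run(main()))  # answered:+18005551234 DTMF:5
```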

## Design Decisions

### Why Sippy B2BUA + PJSUA2?

We split SIP signaling and media handling into two separate libraries:

- **Sippy B2BUA** handles SIP signaling (INVITE, BYE, REGISTER, re-INVITE, DTMF relay). It's battle-tested for telephony and handles the complex SIP state machine.
- **PJSUA2** handles RTP media (audio streams, conference bridge, recording, tone generation). It provides a clean C++/Python API for media manipulation without needing to deal with raw RTP.

This split lets us tap into the audio stream (for classification and STT) without interfering with SIP signaling, and bridge calls through a conference bridge for clean transfer.

### Why asyncio Queue-based EventBus?

- **Single process** — no need for Redis/RabbitMQ cross-process messaging
- **Zero dependencies** — pure asyncio, no external services to deploy
- **Per-subscriber queues** — slow consumers don't block fast publishers
- **Dead subscriber cleanup** — full queues are automatically removed
- **Event history** — late joiners can catch up on recent events

If scaling to multiple gateway processes becomes necessary, the EventBus interface can be backed by Redis pub/sub without changing consumers.
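
A minimal sketch of such a bus — bounded per-subscriber queues, drop-and-unsubscribe on overflow, and a short replay history. Method names and sizes are assumptions, not the actual `core/event_bus.py` interface:

```python
import asyncio
from collections import deque

class EventBus:
    """Toy per-subscriber-queue pub/sub, single process, pure asyncio."""

    def __init__(self, queue_size: int = 100, history_size: int = 50):
        self._subscribers: set[asyncio.Queue] = set()
        self._history: deque = deque(maxlen=history_size)
        self._queue_size = queue_size

    def subscribe(self, replay: bool = False) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=self._queue_size)
        if replay:
            for event in self._history:  # late joiners catch up
                q.put_nowait(event)
        self._subscribers.add(q)
        return q

    def publish(self, event: dict) -> None:
        self._history.append(event)
        dead = []
        for q in self._subscribers:
            try:
                q.put_nowait(event)   # never blocks the publisher
            except asyncio.QueueFull:
                dead.append(q)        # slow/dead consumer: remove it
        for q in dead:
            self._subscribers.discard(q)

async def demo() -> list[dict]:
    bus = EventBus()
    bus.publish({"type": "CALL_STARTED"})   # before anyone subscribes
    q = bus.subscribe(replay=True)          # history replay catches it up
    bus.publish({"type": "HOLD_DETECTED"})
    return [await q.get(), await q.get()]

print(asyncio.run(demo()))
# [{'type': 'CALL_STARTED'}, {'type': 'HOLD_DETECTED'}]
```

Because `publish()` only ever does non-blocking `put_nowait()` calls, a stalled WebSocket client can never back-pressure the call-handling path — it just loses its queue.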

### Why OpenAI-compatible LLM API?

The LLM client uses raw HTTP (httpx) against any OpenAI-compatible endpoint. This means:

- **Ollama** (local, free) — `http://localhost:11434/v1`
- **LM Studio** (local, free) — `http://localhost:1234/v1`
- **vLLM** (local, fast) — `http://localhost:8000/v1`
- **OpenAI** (cloud) — `https://api.openai.com/v1`

No SDK dependency. No vendor lock-in. Switch models by changing one env var.
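
The request is the standard OpenAI chat-completions payload, so any HTTP library works. A sketch of building the request body and reading a response — the system prompt, model name, and digit format are illustrative, not the project's actual prompts:

```python
def build_ivr_request(model: str, menu_transcript: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You navigate phone menus. Reply with the single digit to press."},
            {"role": "user", "content": menu_transcript},
        ],
        "temperature": 0,  # deterministic menu decisions
    }

def extract_digit(response: dict) -> str:
    """Pull the model's answer out of a chat-completions response body."""
    return response["choices"][0]["message"]["content"].strip()

body = build_ivr_request("llama3.1", "For billing press 1, for support press 2.")
# In the real client this body is POSTed with httpx to {base_url}/chat/completions;
# a canned response stands in for the network call here:
canned = {"choices": [{"message": {"role": "assistant", "content": "2"}}]}
print(extract_digit(canned))  # 2
```

Switching from Ollama to OpenAI is then literally just changing `base_url` (and the model name) — the payload and response shapes are identical.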