# Architecture

Hold Slayer is a single-process async Python application built on FastAPI. It acts as an intelligent B2BUA (Back-to-Back User Agent) sitting between your SIP trunk (PSTN access) and your desk phone/softphone.

## System Diagram

```
┌─────────────────────────────────────────────────────────────────┐
│                        FastAPI Server                           │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌──────────────┐    │
│  │ REST API │  │WebSocket │  │MCP Server │  │  Dashboard   │    │
│  │ /api/*   │  │ /ws/*    │  │ (SSE)     │  │  /dashboard  │    │
│  └────┬─────┘  └────┬─────┘  └─────┬─────┘  └──────────────┘    │
│       │             │              │                            │
│  ┌────┴─────────────┴──────────────┴────┐                       │
│  │              Event Bus               │                       │
│  │  (asyncio Queue pub/sub per client)  │                       │
│  └────┬─────────────┬──────────────┬────┘                       │
│       │             │              │                            │
│  ┌────┴─────┐  ┌────┴─────┐  ┌─────┴────────┐                   │
│  │   Call   │  │   Hold   │  │   Services   │                   │
│  │ Manager  │  │  Slayer  │  │  (LLM, STT,  │                   │
│  │          │  │          │  │  Recording,  │                   │
│  │          │  │          │  │  Analytics,  │                   │
│  │          │  │          │  │   Notify)    │                   │
│  └────┬─────┘  └────┬─────┘  └──────────────┘                   │
│       │             │                                           │
│  ┌────┴─────────────┴──────────────────┐                        │
│  │         Sippy B2BUA Engine          │                        │
│  │ (SIP calls, DTMF, conference bridge)│                        │
│  └────┬────────────────────────────────┘                        │
│       │                                                         │
└───────┼─────────────────────────────────────────────────────────┘
        │
   ┌────┴────┐
   │SIP Trunk│ ──→ PSTN
   └─────────┘
```

## Component Overview

### Presentation Layer

| Component | File | Protocol | Purpose |
|-----------|------|----------|---------|
| REST API | `api/calls.py`, `api/call_flows.py`, `api/devices.py` | HTTP | Call management, CRUD, configuration |
| WebSocket | `api/websocket.py` | WS | Real-time event streaming to clients |
| MCP Server | `mcp_server/server.py` | SSE | AI assistant tool integration |

### Orchestration Layer

| Component | File | Purpose |
|-----------|------|---------|
| Gateway | `core/gateway.py` | Top-level orchestrator — owns all services, routes calls |
| Call Manager | `core/call_manager.py` | Active call state, lifecycle, transcript tracking |
| Event Bus | `core/event_bus.py` | Async pub/sub connecting everything together |

### Intelligence Layer

| Component | File | Purpose |
|-----------|------|---------|
| Hold Slayer | `services/hold_slayer.py` | IVR navigation, hold monitoring, human detection |
| Audio Classifier | `services/audio_classifier.py` | Real-time waveform analysis (music/speech/DTMF/silence) |
| LLM Client | `services/llm_client.py` | OpenAI-compatible LLM for IVR menu decisions |
| Transcription | `services/transcription.py` | Speaches/Whisper STT for live audio |
| Call Flow Learner | `services/call_flow_learner.py` | Builds reusable IVR trees from exploration data |

### Infrastructure Layer

| Component | File | Purpose |
|-----------|------|---------|
| Sippy Engine | `core/sippy_engine.py` | SIP signaling (INVITE, BYE, REGISTER, DTMF) |
| Media Pipeline | `core/media_pipeline.py` | PJSUA2 RTP media handling, conference bridge, recording |
| Recording | `services/recording.py` | WAV file management and storage |
| Analytics | `services/call_analytics.py` | Call metrics, hold time stats, trends |
| Notifications | `services/notification.py` | WebSocket + SMS alerts |
| Database | `db/database.py` | SQLAlchemy async (PostgreSQL or SQLite) |
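
The PostgreSQL-or-SQLite switch in the database layer can be driven by a single connection URL. A minimal sketch, assuming an environment variable named `HOLD_SLAYER_DB_URL` (the real setting name may differ):

```python
import os

# Hypothetical default; the project's actual path may differ.
DEFAULT_URL = "sqlite+aiosqlite:///./holdslayer.db"

def database_url() -> str:
    """Return the SQLAlchemy async connection URL.

    PostgreSQL (e.g. postgresql+asyncpg://user:pw@host/db) when configured,
    otherwise a local SQLite file via the aiosqlite driver.
    """
    return os.environ.get("HOLD_SLAYER_DB_URL", DEFAULT_URL)

# The chosen URL would then be handed to
# sqlalchemy.ext.asyncio.create_async_engine(database_url()).
print(database_url())
```

Both backends go through the same async SQLAlchemy API, so swapping them is a deployment choice, not a code change.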

## Data Flow — Hold Slayer Call

```
1. User Request
   POST /api/calls/hold-slayer { number, intent, call_flow_id }
         │
2. Gateway.make_call()
   ├── CallManager.create_call()     → track state
   ├── SippyEngine.make_call()       → SIP INVITE to trunk
   └── MediaPipeline.add_stream()    → RTP media setup
         │
3. HoldSlayer.run_with_flow() or run_exploration()
   ├── AudioClassifier.classify()    → analyze 3s audio windows
   │   ├── silence? → wait
   │   ├── ringing? → wait
   │   ├── DTMF? → detect tones
   │   ├── music? → HOLD_DETECTED event
   │   └── speech? → transcribe + decide
   │
   ├── TranscriptionService.transcribe() → STT on speech audio
   │
   ├── LLMClient.analyze_ivr_menu() → pick menu option (fallback)
   │   └── SippyEngine.send_dtmf()  → press the button
   │
   └── detect_hold_to_human_transition()
       └── HUMAN_DETECTED! → transfer
           │
4. Transfer
   ├── SippyEngine.bridge()          → connect call legs
   ├── MediaPipeline.bridge_streams() → bridge RTP
   ├── EventBus.publish(TRANSFER_STARTED)
   └── NotificationService → "Pick up your phone!"
         │
5. Real-Time Updates (throughout)
   EventBus.publish() → WebSocket clients
                      → MCP server resources
                      → Notification service
                      → Analytics tracking
```
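
Step 3's classify-then-act loop can be illustrated with a toy classifier. The thresholds and the heuristic below are illustrative assumptions, not the real `AudioClassifier` logic — they only show the shape of the decision: silence means wait, steady energy suggests hold music, bursty energy suggests speech worth transcribing.

```python
import math

def rms(samples: list[float]) -> float:
    """Root-mean-square energy of one audio window."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def classify(window: list[float], subframes: int = 10) -> str:
    """Toy 3-second-window classifier: silence / music / speech.

    Heuristic only: hold music tends to have steady energy, while speech
    alternates between bursts and pauses, so compare the spread of
    per-subframe RMS values against their mean.
    """
    if rms(window) < 0.01:
        return "silence"
    n = len(window) // subframes
    energies = [rms(window[i * n:(i + 1) * n]) for i in range(subframes)]
    mean = sum(energies) / len(energies)
    spread = max(energies) - min(energies)
    return "music" if spread < 0.5 * mean else "speech"

# Steady sine tone (hold-music stand-in) vs. bursty signal (speech stand-in),
# both 3 s at an assumed 8 kHz sample rate.
tone = [0.5 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(24000)]
bursts = [0.5 * math.sin(2 * math.pi * 200 * t / 8000) if (t // 2400) % 2 == 0 else 0.0
          for t in range(24000)]
print(classify([0.0] * 24000))  # silence
print(classify(tone))           # music
print(classify(bursts))         # speech
```

The real service presumably adds ringing and DTMF detection on top of this, as the data-flow diagram shows.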

## Threading Model

Hold Slayer is primarily single-threaded async (asyncio), with one exception:

- **Main thread**: FastAPI plus all async services (event bus, hold slayer, classifier, etc.). PJSUA2 also runs here, using the null audio device (no sound card needed — headless server mode).
- **Sippy thread**: the Sippy B2BUA runs its own event loop in a dedicated daemon thread. `SippyEngine` bridges async↔sync via `asyncio.run_in_executor()`.

```
Main Thread (asyncio)
├── FastAPI (uvicorn)
├── EventBus
├── CallManager
├── HoldSlayer
├── AudioClassifier
├── TranscriptionService
├── LLMClient
├── MediaPipeline (PJSUA2)
├── NotificationService
└── RecordingService

Sippy Thread (daemon)
└── Sippy B2BUA event loop
    ├── SIP signaling
    ├── DTMF relay
    └── Call leg management
```
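
The async↔sync bridge between the two threads can be sketched as follows. `place_call_blocking`, `SippyEngineSketch`, and the event callback are stand-ins, not the real `SippyEngine` API; the two mechanisms shown — `run_in_executor()` for asyncio→blocking calls and `call_soon_threadsafe()` for thread→asyncio events — are the standard asyncio patterns for this topology.

```python
import asyncio
import threading
import time

def place_call_blocking(number: str) -> str:
    """Stand-in for a blocking Sippy operation running off the main thread."""
    time.sleep(0.05)  # pretend SIP INVITE / 200 OK round trip
    return f"answered:{number}"

class SippyEngineSketch:
    def __init__(self, loop: asyncio.AbstractEventLoop):
        self.loop = loop
        self.events: asyncio.Queue[str] = asyncio.Queue()

    async def make_call(self, number: str) -> str:
        # asyncio side -> Sippy side: run blocking work in the default
        # thread pool so the event loop stays responsive.
        return await self.loop.run_in_executor(None, place_call_blocking, number)

    def on_sippy_event(self, event: str) -> None:
        # Sippy side -> asyncio side: hand the event to the loop thread-safely.
        self.loop.call_soon_threadsafe(self.events.put_nowait, event)

async def main() -> str:
    engine = SippyEngineSketch(asyncio.get_running_loop())
    # Simulate the Sippy daemon thread emitting an event mid-call.
    threading.Thread(target=engine.on_sippy_event, args=("DTMF:5",), daemon=True).start()
    result = await engine.make_call("+18005551234")
    event = await engine.events.get()
    return f"{result} {event}"

print(asyncio.run(main()))  # answered:+18005551234 DTMF:5
```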

## Design Decisions

### Why Sippy B2BUA + PJSUA2?

We split SIP signaling and media handling into two separate libraries:

- **Sippy B2BUA** handles SIP signaling (INVITE, BYE, REGISTER, re-INVITE, DTMF relay). It's battle-tested for telephony and handles the complex SIP state machine.
- **PJSUA2** handles RTP media (audio streams, conference bridge, recording, tone generation). It provides a clean C++/Python API for media manipulation without needing to deal with raw RTP.

This split lets us tap into the audio stream (for classification and STT) without interfering with SIP signaling, and bridge calls through a conference bridge for clean transfer.

### Why asyncio Queue-based EventBus?

- **Single process** — no need for Redis/RabbitMQ cross-process messaging
- **Zero dependencies** — pure asyncio, no external services to deploy
- **Per-subscriber queues** — slow consumers don't block fast publishers
- **Dead subscriber cleanup** — full queues are automatically removed
- **Event history** — late joiners can catch up on recent events

If scaling to multiple gateway processes becomes necessary, the EventBus interface can be backed by Redis pub/sub without changing consumers.
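
A minimal sketch of such a bus — bounded per-subscriber queues, drop-and-unsubscribe on overflow, and a short replay history. Method names and sizes are assumptions, not the actual `core/event_bus.py` interface:

```python
import asyncio
from collections import deque

class EventBus:
    """Toy per-subscriber-queue pub/sub, single process, pure asyncio."""

    def __init__(self, queue_size: int = 100, history_size: int = 50):
        self._subscribers: set[asyncio.Queue] = set()
        self._history: deque = deque(maxlen=history_size)
        self._queue_size = queue_size

    def subscribe(self, replay: bool = False) -> asyncio.Queue:
        q: asyncio.Queue = asyncio.Queue(maxsize=self._queue_size)
        if replay:
            for event in self._history:  # late joiners catch up
                q.put_nowait(event)
        self._subscribers.add(q)
        return q

    def publish(self, event: dict) -> None:
        self._history.append(event)
        dead = []
        for q in self._subscribers:
            try:
                q.put_nowait(event)   # never blocks the publisher
            except asyncio.QueueFull:
                dead.append(q)        # slow/dead consumer: remove it
        for q in dead:
            self._subscribers.discard(q)

async def demo() -> list[dict]:
    bus = EventBus()
    bus.publish({"type": "CALL_STARTED"})   # before anyone subscribes
    q = bus.subscribe(replay=True)          # history replay catches it up
    bus.publish({"type": "HOLD_DETECTED"})
    return [await q.get(), await q.get()]

print(asyncio.run(demo()))
# [{'type': 'CALL_STARTED'}, {'type': 'HOLD_DETECTED'}]
```

Because `publish()` only ever does non-blocking `put_nowait()` calls, a stalled WebSocket client can never back-pressure the call-handling path — it just loses its queue.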

### Why OpenAI-compatible LLM API?

The LLM client uses raw HTTP (httpx) against any OpenAI-compatible endpoint. This means:

- **Ollama** (local, free) — `http://localhost:11434/v1`
- **LM Studio** (local, free) — `http://localhost:1234/v1`
- **vLLM** (local, fast) — `http://localhost:8000/v1`
- **OpenAI** (cloud) — `https://api.openai.com/v1`

No SDK dependency. No vendor lock-in. Switch models by changing one env var.
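
The request is the standard OpenAI chat-completions payload, so any HTTP library works. A sketch of building the request body and reading a response — the system prompt, model name, and digit format are illustrative, not the project's actual prompts:

```python
def build_ivr_request(model: str, menu_transcript: str) -> dict:
    """Build an OpenAI-compatible /v1/chat/completions request body."""
    return {
        "model": model,
        "messages": [
            {"role": "system",
             "content": "You navigate phone menus. Reply with the single digit to press."},
            {"role": "user", "content": menu_transcript},
        ],
        "temperature": 0,  # deterministic menu decisions
    }

def extract_digit(response: dict) -> str:
    """Pull the model's answer out of a chat-completions response body."""
    return response["choices"][0]["message"]["content"].strip()

body = build_ivr_request("llama3.1", "For billing press 1, for support press 2.")
# In the real client this body is POSTed with httpx to {base_url}/chat/completions;
# a canned response stands in for the network call here:
canned = {"choices": [{"message": {"role": "assistant", "content": "2"}}]}
print(extract_digit(canned))  # 2
```

Switching from Ollama to OpenAI is then literally just changing `base_url` (and the model name) — the payload and response shapes are identical.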