hold-slayer/README.md
Robert Helewka ecf37658ce feat: add initial Hold Slayer AI telephony gateway implementation
Complete project scaffolding and core implementation of an AI-powered
telephony system that calls companies, navigates IVR menus, waits on
hold, and transfers to the user when a human answers.

Key components:
- FastAPI server with REST API, WebSocket, and MCP (SSE) interfaces
- SIP/VoIP call management via PJSUA2 with RTP audio streaming
- LLM-powered IVR navigation using OpenAI/Anthropic with tool calling
- Hold detection service combining audio analysis and silence detection
- Real-time STT (Whisper/Deepgram) and TTS (OpenAI/Piper) pipelines
- Call recording with per-channel and mixed audio capture
- Event bus (asyncio pub/sub) for real-time client updates
- Web dashboard with live call monitoring
- SQLite persistence via SQLAlchemy with call history and analytics
- Notification support (email, SMS, webhook, desktop)
- Docker Compose deployment with Opal VoIP and Opal Media containers
- Comprehensive test suite with unit, integration, and E2E tests
- Simplified .gitignore and full project documentation in README
2026-03-21 19:23:26 +00:00

Hold Slayer 🔥

An AI-powered telephony gateway that calls companies, navigates IVR menus, waits on hold, and transfers you when a human picks up.

You give it a phone number and an intent ("dispute a charge on my December statement"). It dials the number through your SIP trunk, navigates the phone tree, sits through the hold music, and rings your desk phone the instant a live person answers. You never hear Vivaldi again.

Caution

Emergency calling (911): Hold Slayer passes 911 and 9911 directly to the PSTN trunk. Your SIP trunk provider must support E911 on your DID and have your correct registered location on file before this system is put into service. VoIP emergency calls are location-dependent, so verify coverage with your provider. Do not rely on this system as your only means of reaching emergency services.

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                        FastAPI Server                           │
│                                                                 │
│  ┌──────────┐  ┌──────────┐  ┌───────────┐  ┌──────────────┐  │
│  │ REST API │  │WebSocket │  │MCP Server │  │  Dashboard   │  │
│  │ /api/*   │  │ /ws/*    │  │ (SSE)     │  │  /dashboard  │  │
│  └────┬─────┘  └────┬─────┘  └─────┬─────┘  └──────────────┘  │
│       │              │              │                            │
│  ┌────┴──────────────┴──────────────┴────┐                     │
│  │             Event Bus                  │                     │
│  │   (asyncio Queue pub/sub per client)   │                     │
│  └────┬──────────────┬──────────────┬────┘                     │
│       │              │              │                            │
│  ┌────┴─────┐  ┌─────┴─────┐  ┌────┴──────────┐               │
│  │   Call   │  │   Hold    │  │   Services    │               │
│  │ Manager  │  │  Slayer   │  │ (LLM, STT,   │               │
│  │          │  │           │  │  Recording,   │               │
│  │          │  │           │  │  Analytics,   │               │
│  │          │  │           │  │  Notify)      │               │
│  └────┬─────┘  └─────┬─────┘  └──────────────┘               │
│       │              │                                          │
│  ┌────┴──────────────┴───────────────────┐                     │
│  │         Sippy B2BUA Engine            │                     │
│  │  (SIP calls, DTMF, conference bridge) │                     │
│  └────┬──────────────────────────────────┘                     │
│       │                                                         │
└───────┼─────────────────────────────────────────────────────────┘
        │
   ┌────┴────┐
   │SIP Trunk│ ──→ PSTN
   └─────────┘

What's Implemented

Core Engine

  • Sippy B2BUA Engine (core/sippy_engine.py) — SIP call control, DTMF, bridging, conference, trunk registration
  • PJSUA2 Media Pipeline (core/media_pipeline.py) — Audio routing, recording ports, conference bridge, WAV playback
  • Call Manager (core/call_manager.py) — Active call state tracking, lifecycle management
  • Event Bus (core/event_bus.py) — Async pub/sub with per-subscriber queues, type filtering, history
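
To make the Event Bus bullet concrete, here is a minimal sketch of an asyncio pub/sub with one queue per subscriber, optional type filtering, and a bounded history. It is not the code in core/event_bus.py; the names `Event`, `EventBus`, `subscribe`, `publish`, and `history` are assumptions chosen for illustration.

```python
from __future__ import annotations

import asyncio
from collections import deque
from dataclasses import dataclass, field
from typing import Any


@dataclass
class Event:
    # Hypothetical event shape: a type tag, the call it refers to, and a payload.
    type: str
    call_id: str
    data: dict[str, Any] = field(default_factory=dict)


class EventBus:
    """Fan-out pub/sub: every subscriber owns its own asyncio.Queue."""

    def __init__(self, history_size: int = 100) -> None:
        # Each entry pairs a queue with the set of event types it accepts
        # (None means "accept everything").
        self._subscribers: list[tuple[asyncio.Queue, set[str] | None]] = []
        self._history: deque[Event] = deque(maxlen=history_size)

    def subscribe(self, types: set[str] | None = None) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers.append((queue, types))
        return queue

    def publish(self, event: Event) -> None:
        # Record the event, then copy it into every matching subscriber queue.
        self._history.append(event)
        for queue, types in self._subscribers:
            if types is None or event.type in types:
                queue.put_nowait(event)

    def history(self) -> list[Event]:
        return list(self._history)


async def demo() -> list[str]:
    bus = EventBus()
    everything = bus.subscribe()
    humans_only = bus.subscribe({"human_detected"})
    bus.publish(Event("hold_detected", "call_abc123"))
    bus.publish(Event("human_detected", "call_abc123"))
    first = await everything.get()
    second = await everything.get()
    only = await humans_only.get()
    return [first.type, second.type, only.type]


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Per-subscriber queues keep a slow WebSocket client from blocking other consumers. A real implementation would also need unsubscribe handling and a policy for full queues.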

Hold Slayer

  • IVR Navigation (services/hold_slayer.py) — Follows stored call flows step-by-step through phone menus
  • Audio Classifier (services/audio_classifier.py) — Real-time waveform analysis: silence, tones, DTMF, music, speech detection
  • Call Flow Learner (services/call_flow_learner.py) — Builds reusable call flows from exploration data, merges new discoveries
  • LLM Fallback — When a LISTEN step has no hardcoded DTMF, the LLM analyzes the transcript and picks the right menu option
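
The Audio Classifier bullet lists DTMF detection among its tasks. One common way to detect DTMF is the Goertzel algorithm, sketched below. Whether services/audio_classifier.py uses Goertzel is an assumption, and `goertzel_power`, `detect_dtmf`, and the 0.1 energy threshold are hypothetical names and values for illustration.

```python
from __future__ import annotations

import math

# Standard DTMF row (low) and column (high) frequencies in Hz.
DTMF_LOW = (697, 770, 852, 941)
DTMF_HIGH = (1209, 1336, 1477, 1633)
KEYS = ("123A", "456B", "789C", "*0#D")


def goertzel_power(samples: list[float], sample_rate: int, freq: float) -> float:
    """Squared magnitude of the signal's spectrum at a single frequency."""
    coeff = 2.0 * math.cos(2.0 * math.pi * freq / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2


def _strongest(freqs: tuple[int, ...], samples: list[float], rate: int) -> tuple[int, float]:
    # Index and power of the strongest frequency in one DTMF group.
    powers = [goertzel_power(samples, rate, f) for f in freqs]
    best = max(range(len(freqs)), key=powers.__getitem__)
    return best, powers[best]


def detect_dtmf(samples: list[float], sample_rate: int = 8000) -> str | None:
    """Return the DTMF key present in the frame, or None if no key is found."""
    energy = sum(x * x for x in samples)
    if energy == 0:
        return None  # pure silence
    row, low_power = _strongest(DTMF_LOW, samples, sample_rate)
    col, high_power = _strongest(DTMF_HIGH, samples, sample_rate)
    # Both tones must carry a meaningful share of the frame's energy;
    # the 0.1 factor is an illustrative threshold, not a tuned value.
    if min(low_power, high_power) < 0.1 * len(samples) * energy:
        return None
    return KEYS[row][col]


if __name__ == "__main__":
    # 50 ms of the "5" key (770 Hz + 1336 Hz) sampled at 8 kHz.
    frame = [
        math.sin(2 * math.pi * 770 * n / 8000) + math.sin(2 * math.pi * 1336 * n / 8000)
        for n in range(400)
    ]
    print(detect_dtmf(frame))
```

Goertzel evaluates only the eight frequencies of interest, which is cheaper than a full FFT per frame. Production detectors typically add twist and duration checks that this sketch omits.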

Intelligence Layer

  • LLM Client (services/llm_client.py) — OpenAI-compatible API client (Ollama, vLLM, LM Studio, OpenAI) with JSON parsing, retry, stats
  • Transcription (services/transcription.py) — Speaches/Whisper STT integration for live call transcription
  • Recording (services/recording.py) — WAV recording with date-organized storage, dual-channel support
  • Call Analytics (services/call_analytics.py) — Hold time stats, success rates, per-company patterns, time-of-day trends
  • Notifications (services/notification.py) — WebSocket + SMS alerts for human detection, call failures, hold status
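
The LLM Client bullet mentions JSON parsing of model replies. A tolerant parser for a menu decision could look roughly like the sketch below. The reply schema (a JSON object with a `dtmf` key) and the function name `parse_menu_choice` are assumptions, not the project's actual contract.

```python
from __future__ import annotations

import json
import re

VALID_DTMF = set("0123456789*#")


def parse_menu_choice(reply: str) -> str | None:
    """Extract DTMF digits from an LLM reply, or return None if unusable.

    Models often wrap JSON in extra prose, so the parser looks for the
    outermost {...} span instead of requiring the whole reply to be JSON.
    """
    match = re.search(r"\{.*\}", reply, flags=re.DOTALL)
    if match is None:
        return None
    try:
        payload = json.loads(match.group(0))
    except json.JSONDecodeError:
        return None
    if not isinstance(payload, dict):
        return None
    digits = str(payload.get("dtmf", ""))
    # Accept only characters that can actually be sent as touch tones.
    if digits and set(digits) <= VALID_DTMF:
        return digits
    return None


if __name__ == "__main__":
    print(parse_menu_choice('Sure: {"dtmf": "2", "reason": "billing disputes"}'))
```

Returning `None` rather than raising lets the caller decide whether to retry the LLM request or fall back to waiting for the next prompt.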

API Surface

  • REST API — Call management, device registration, call flow CRUD, service configuration
  • WebSocket — Real-time call events, transcripts, classification updates
  • MCP Server — 10 tools for AI assistant integration (make calls, send DTMF, get transcripts, manage flows)

Data Models

  • Call — Active call state with classification history, transcript chunks, hold time tracking
  • Call Flow — Stored IVR trees with steps (DTMF, LISTEN, HOLD, TRANSFER, SPEAK)
  • Events — 20+ typed events (call lifecycle, hold slayer, audio, device, system)
  • Device — SIP phone/softphone registration and routing
  • Contact — Phone number management with routing preferences
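
As a rough picture of the Call Flow model, the sketch below represents a stored IVR tree with the five step types named above. The class and field names (`StepType`, `FlowStep`, `CallFlow`, `digits`, `note`) are hypothetical; the real models live in models/call_flow.py and may differ.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class StepType(str, Enum):
    # The five step types listed in this README.
    DTMF = "DTMF"
    LISTEN = "LISTEN"
    HOLD = "HOLD"
    TRANSFER = "TRANSFER"
    SPEAK = "SPEAK"


@dataclass
class FlowStep:
    type: StepType
    digits: str | None = None  # touch tones to send, when known in advance
    note: str = ""             # free-form description of the step


@dataclass
class CallFlow:
    id: str
    number: str
    steps: list[FlowStep] = field(default_factory=list)

    def steps_needing_llm(self) -> list[int]:
        """Indexes of LISTEN steps that have no hardcoded digits."""
        return [
            index
            for index, step in enumerate(self.steps)
            if step.type is StepType.LISTEN and step.digits is None
        ]


if __name__ == "__main__":
    flow = CallFlow(
        id="chase_bank_main",
        number="+18005551234",
        steps=[
            FlowStep(StepType.DTMF, digits="1", note="English"),
            FlowStep(StepType.LISTEN, note="main menu, option unknown"),
            FlowStep(StepType.HOLD),
            FlowStep(StepType.TRANSFER),
        ],
    )
    print(flow.steps_needing_llm())
```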

Project Structure

hold-slayer/
├── main.py                      # FastAPI app + lifespan (service wiring)
├── config.py                    # Pydantic settings from .env
├── core/
│   ├── gateway.py               # Top-level gateway orchestrator
│   ├── sippy_engine.py          # Sippy B2BUA SIP engine
│   ├── media_pipeline.py        # PJSUA2 audio routing
│   ├── call_manager.py          # Active call state management
│   └── event_bus.py             # Async pub/sub event bus
├── services/
│   ├── hold_slayer.py           # IVR navigation + hold detection
│   ├── audio_classifier.py      # Waveform analysis (music/speech/DTMF)
│   ├── call_flow_learner.py     # Auto-learns IVR trees from calls
│   ├── llm_client.py            # OpenAI-compatible LLM client
│   ├── transcription.py         # Speaches/Whisper STT
│   ├── recording.py             # Call recording management
│   ├── call_analytics.py        # Call metrics and insights
│   └── notification.py          # WebSocket + SMS notifications
├── api/
│   ├── calls.py                 # Call management endpoints
│   ├── call_flows.py            # Call flow CRUD
│   ├── devices.py               # Device registration
│   ├── websocket.py             # Real-time event stream
│   └── deps.py                  # FastAPI dependency injection
├── mcp_server/
│   └── server.py                # MCP tools + resources (10 tools)
├── models/
│   ├── call.py                  # Call state models
│   ├── call_flow.py             # IVR tree models
│   ├── events.py                # Event type definitions
│   ├── device.py                # Device models
│   └── contact.py               # Contact models
├── db/
│   └── database.py              # SQLAlchemy async (PostgreSQL/SQLite)
└── tests/
    ├── test_audio_classifier.py # 18 tests — waveform analysis
    ├── test_call_flows.py       # 10 tests — call flow models
    ├── test_hold_slayer.py      # 20 tests — IVR nav, EventBus, CallManager
    └── test_services.py         # 27 tests — LLM, notifications, recording,
                                 #             analytics, learner, EventBus

Quick Start

1. Install

python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"

2. Configure

cp .env.example .env
# Edit .env with your SIP trunk credentials, LLM endpoint, etc.

3. Run

uvicorn main:app --host 0.0.0.0 --port 8100

4. Test

pytest tests/ -v

Usage

REST API

Launch Hold Slayer on a number:

curl -X POST http://localhost:8100/api/calls/hold-slayer \
  -H "Content-Type: application/json" \
  -d '{
    "number": "+18005551234",
    "intent": "dispute Amazon charge from December 15th",
    "call_flow_id": "chase_bank_main",
    "transfer_to": "sip_phone"
  }'
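
The same request can be built from Python using only the standard library. This is a sketch: the helper name `build_hold_slayer_request` is hypothetical, and treating `call_flow_id` and `transfer_to` as optional is an assumption about the API.

```python
from __future__ import annotations

import json
import urllib.request


def build_hold_slayer_request(
    base_url: str,
    number: str,
    intent: str,
    call_flow_id: str | None = None,
    transfer_to: str | None = None,
) -> urllib.request.Request:
    """Build (but do not send) the POST request for a Hold Slayer call."""
    payload = {"number": number, "intent": intent}
    # Assumption: these two fields may be left out of the request body.
    if call_flow_id is not None:
        payload["call_flow_id"] = call_flow_id
    if transfer_to is not None:
        payload["transfer_to"] = transfer_to
    return urllib.request.Request(
        f"{base_url}/api/calls/hold-slayer",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Port 8100 matches the uvicorn command in Quick Start; adjust as needed.
    request = build_hold_slayer_request(
        "http://localhost:8100",
        number="+18005551234",
        intent="dispute Amazon charge from December 15th",
        call_flow_id="chase_bank_main",
        transfer_to="sip_phone",
    )
    with urllib.request.urlopen(request) as response:
        print(json.load(response))
```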

Check call status:

curl http://localhost:8100/api/calls/call_abc123

WebSocket — Real-Time Events

const ws = new WebSocket("ws://localhost:8100/ws/events");
ws.onmessage = (msg) => {
  const event = JSON.parse(msg.data);
  // event.type: "human_detected", "hold_detected", "ivr_step", etc.
  // event.call_id: which call this is about
  // event.data: type-specific payload
};
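
For a Python consumer, the messages can be routed by `type` with a small dispatcher like the sketch below. `EventRouter`, `on`, and `feed` are hypothetical names. The message shape (`type`, `call_id`, `data`) follows the comments in the JavaScript snippet above; the transport is left out so the example stays standard-library only.

```python
from __future__ import annotations

import json
from typing import Callable

# A handler receives the call ID and the type-specific payload.
Handler = Callable[[str, dict], None]


class EventRouter:
    """Dispatch raw JSON event messages to handlers registered per event type."""

    def __init__(self) -> None:
        self._handlers: dict[str, list[Handler]] = {}

    def on(self, event_type: str, handler: Handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def feed(self, raw: str) -> bool:
        """Process one message; return True if at least one handler ran."""
        event = json.loads(raw)
        handlers = self._handlers.get(event["type"], [])
        for handler in handlers:
            handler(event["call_id"], event.get("data", {}))
        return bool(handlers)


if __name__ == "__main__":
    router = EventRouter()
    router.on("human_detected", lambda call_id, data: print("pick up!", call_id, data))
    router.feed('{"type": "human_detected", "call_id": "call_abc123", "data": {}}')
```

Each message received from a WebSocket client library (for example the third-party `websockets` package) could be passed straight to `feed`.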

MCP — AI Assistant Integration

The MCP server exposes 10 tools that any MCP-compatible assistant can use:

| Tool | Description |
|------|-------------|
| make_call | Dial a number through the SIP trunk |
| end_call | Hang up an active call |
| send_dtmf | Send touch-tone digits to navigate menus |
| get_call_status | Check current state of a call |
| get_call_transcript | Get live transcript of a call |
| get_call_recording | Get recording metadata and file path |
| list_active_calls | List all calls in progress |
| get_call_summary | Analytics summary (hold times, success rates) |
| search_call_history | Search past calls by number or company |
| learn_call_flow | Build a reusable call flow from exploration data |

How It Works

  1. You request a call — via REST API, MCP tool, or dashboard
  2. Gateway dials out — Sippy B2BUA places the call through your SIP trunk
  3. Audio classifier listens — Real-time waveform analysis detects IVR prompts, hold music, ringing, silence, and live speech
  4. Transcription runs — Speaches/Whisper converts audio to text in real-time
  5. IVR navigator decides — If a stored call flow exists, it follows the steps. If not, the LLM analyzes the transcript and picks the right menu option
  6. Hold detection — When hold music is detected, the system waits patiently and monitors for transitions
  7. Human detection — The classifier detects the transition from music/silence to live speech
  8. Transfer — Your desk phone rings. Pick up and you're talking to the agent. Zero hold time.
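
Steps 6 and 7 amount to a small state machine over the classifier's per-frame labels. The sketch below shows one way to debounce the hold-to-human transition. The labels `"music"`, `"speech"`, and `"silence"`, the class name `HumanDetector`, and the three-frame threshold are all assumptions for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class HumanDetector:
    """Fire once when sustained speech follows a period of hold music."""

    required_speech_frames: int = 3
    _on_hold: bool = field(default=False, init=False)
    _speech_run: int = field(default=0, init=False)

    def update(self, label: str) -> bool:
        """Feed one classifier label; return True when a human is detected."""
        if label == "music":
            # Hold music (re)starts: enter the hold state, drop partial speech.
            self._on_hold = True
            self._speech_run = 0
        elif label == "speech":
            if self._on_hold:
                self._speech_run += 1
                if self._speech_run >= self.required_speech_frames:
                    self._on_hold = False
                    self._speech_run = 0
                    return True
        else:
            # Silence, tones, and so on interrupt a run of speech frames.
            self._speech_run = 0
        return False


if __name__ == "__main__":
    detector = HumanDetector()
    labels = ["speech", "music", "music", "speech", "silence",
              "speech", "speech", "speech"]
    print([detector.update(label) for label in labels])
```

Requiring several consecutive speech frames avoids transferring on a single misclassified frame. Recorded announcements during hold ("your call is important to us") are also speech, so a real detector would likely need transcript cues as well.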

Configuration

All configuration is via environment variables (see .env.example):

| Variable | Description | Default |
|----------|-------------|---------|
| SIP_TRUNK_HOST | Your SIP provider hostname | |
| SIP_TRUNK_USERNAME | SIP auth username | |
| SIP_TRUNK_PASSWORD | SIP auth password | |
| SIP_TRUNK_DID | Your phone number (E.164) | |
| GATEWAY_SIP_PORT | Port for device registration | 5080 |
| SPEACHES_URL | Speaches/Whisper STT endpoint | http://localhost:22070 |
| LLM_BASE_URL | OpenAI-compatible LLM endpoint | http://localhost:11434/v1 |
| LLM_MODEL | Model name for IVR analysis | llama3 |
| DATABASE_URL | PostgreSQL or SQLite connection | SQLite fallback |
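
To show how these variables and defaults fit together, here is a standard-library sketch of a settings loader. The project's config.py uses Pydantic settings; this example swaps in a plain dataclass so it runs without extra dependencies. Treating the four SIP_TRUNK_* variables as required is an assumption based on their empty defaults, and `Settings` and `load_settings` are hypothetical names.

```python
from __future__ import annotations

import os
from dataclasses import dataclass
from typing import Mapping

# Assumption: variables without a documented default must be provided.
REQUIRED = ("SIP_TRUNK_HOST", "SIP_TRUNK_USERNAME", "SIP_TRUNK_PASSWORD", "SIP_TRUNK_DID")


@dataclass(frozen=True)
class Settings:
    sip_trunk_host: str
    sip_trunk_username: str
    sip_trunk_password: str
    sip_trunk_did: str
    gateway_sip_port: int = 5080
    speaches_url: str = "http://localhost:22070"
    llm_base_url: str = "http://localhost:11434/v1"
    llm_model: str = "llama3"
    database_url: str | None = None  # None stands for the SQLite fallback


def load_settings(env: Mapping[str, str] | None = None) -> Settings:
    """Build Settings from environment variables, applying documented defaults."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise ValueError("missing required settings: " + ", ".join(missing))
    return Settings(
        sip_trunk_host=env["SIP_TRUNK_HOST"],
        sip_trunk_username=env["SIP_TRUNK_USERNAME"],
        sip_trunk_password=env["SIP_TRUNK_PASSWORD"],
        sip_trunk_did=env["SIP_TRUNK_DID"],
        gateway_sip_port=int(env.get("GATEWAY_SIP_PORT", "5080")),
        speaches_url=env.get("SPEACHES_URL", "http://localhost:22070"),
        llm_base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        llm_model=env.get("LLM_MODEL", "llama3"),
        database_url=env.get("DATABASE_URL"),
    )
```

Failing fast on missing trunk credentials surfaces configuration mistakes at startup rather than on the first outbound call.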

Tech Stack

  • Python 3.13 + asyncio — Single-process async architecture
  • FastAPI — REST API + WebSocket server
  • Sippy B2BUA — SIP call control and DTMF
  • PJSUA2 — Media pipeline, conference bridge, recording
  • Speaches (Whisper) — Speech-to-text
  • Ollama / vLLM / OpenAI — LLM for IVR menu analysis
  • SQLAlchemy — Async database (PostgreSQL or SQLite)
  • MCP (Model Context Protocol) — AI assistant integration

Documentation

Full documentation is in /docs:

  • Architecture — System design, data flow, threading model
  • Core Engine — SIP engine, media pipeline, call manager, event bus
  • Hold Slayer Service — IVR navigation, hold detection, human detection
  • Audio Classifier — Waveform analysis, feature extraction, classification
  • Services — LLM client, transcription, recording, analytics, notifications
  • Call Flows — Call flow model, step types, auto-learner
  • API Reference — REST endpoints, WebSocket, request/response schemas
  • MCP Server — MCP tools and resources for AI assistants
  • Configuration — All environment variables, deployment options
  • Development — Setup, testing, contributing

Build Phases

Phase 1: Core Engine

  • Extract EventBus to dedicated module with typed filtering
  • Implement Sippy B2BUA SIP engine (signaling, DTMF, bridging)
  • Implement PJSUA2 media pipeline (conference bridge, audio tapping, recording)
  • Call manager with active call state tracking
  • Gateway orchestrator wiring all components

Phase 2: Intelligence Layer

  • LLM client (OpenAI-compatible — Ollama, vLLM, LM Studio, OpenAI)
  • Hold Slayer IVR navigation with LLM fallback for LISTEN steps
  • Call Flow Learner — auto-builds reusable IVR trees from exploration
  • Recording service with date-organized WAV storage
  • Call analytics with hold time stats, per-company patterns
  • Audio classifier with spectral analysis, DTMF detection, hold-to-human transition

Phase 3: API & Integration

  • REST API — calls, call flows, devices, DTMF
  • WebSocket real-time event streaming
  • MCP server with 10 tools + 3 resources
  • Notification service (WebSocket + SMS)
  • Service wiring in main.py lifespan
  • 75 passing tests across 4 test files

Phase 4: Production Hardening 🔜

  • Alembic database migrations
  • API authentication (API keys / JWT)
  • Rate limiting on API endpoints
  • Structured JSON logging
  • Health check endpoints for all dependencies
  • Graceful degradation (classifier works without STT, etc.)
  • Docker Compose (Hold Slayer + PostgreSQL + Speaches + Ollama)

Phase 5: Additional Services 🔮

  • AI Receptionist — answer inbound calls, screen callers, take messages
  • Spam Filter — detect robocalls using caller ID + audio patterns
  • Smart Routing — time-of-day rules, device priority, DND
  • Noise Cancellation — RNNoise integration in media pipeline
  • TTS/Speech — play prompts into calls (SPEAK step support)

Phase 6: Dashboard & UX 🔮

  • Web dashboard with real-time call monitor
  • Call flow visual editor (drag-and-drop IVR tree builder)
  • Call history with transcript playback
  • Analytics dashboard with hold time graphs
  • Mobile app (or PWA) for on-the-go control

License

MIT