Hold Slayer Service
The Hold Slayer (services/hold_slayer.py) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.
Two Operating Modes
1. Flow-Guided Mode (run_with_flow)
When a stored CallFlow exists for the number being called, the Hold Slayer follows it step-by-step:
```python
await hold_slayer.run_with_flow(call_id, call_flow)
```
The call flow is a tree of steps (see Call Flows). The Hold Slayer walks through them:
```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s      (wait for greeting)
├── Step 2: LISTEN       (transcribe → LLM picks option)
├── Step 3: DTMF "2"     (press 2 for account services)
├── Step 4: LISTEN       (transcribe → LLM picks option)
├── Step 5: DTMF "1"     (press 1 for disputes)
├── Step 6: HOLD         (wait for human)
└── Step 7: TRANSFER     (bridge to your phone)
```
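The tree above maps naturally onto a small data model. A minimal sketch, with illustrative names (`StepType`, `FlowStep`, and `CallFlow` are assumptions, not the project's actual schema):

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional

class StepType(Enum):
    WAIT = "wait"
    LISTEN = "listen"
    DTMF = "dtmf"
    HOLD = "hold"
    SPEAK = "speak"
    TRANSFER = "transfer"

@dataclass
class FlowStep:
    type: StepType
    dtmf: Optional[str] = None   # digits to send (DTMF steps, or a hardcoded LISTEN answer)
    duration: float = 0.0        # seconds to wait (WAIT steps)

@dataclass
class CallFlow:
    name: str
    steps: list[FlowStep] = field(default_factory=list)

# the "Chase Bank Main" flow from the diagram above
chase = CallFlow("Chase Bank Main", [
    FlowStep(StepType.WAIT, duration=3.0),
    FlowStep(StepType.LISTEN),
    FlowStep(StepType.DTMF, dtmf="2"),
    FlowStep(StepType.LISTEN),
    FlowStep(StepType.DTMF, dtmf="1"),
    FlowStep(StepType.HOLD),
    FlowStep(StepType.TRANSFER),
])
```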
Step execution logic:
| Step Type | What Happens |
|---|---|
| DTMF | Send the specified digits via SIP engine |
| WAIT | Sleep for the specified duration |
| LISTEN | Record audio, transcribe, then: use hardcoded DTMF if available, otherwise ask LLM to pick the right option |
| HOLD | Monitor audio classification, wait for human detection |
| SPEAK | Play a WAV file or TTS audio (for interactive prompts) |
| TRANSFER | Bridge the call to the user's device |
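Put together, the dispatch in the table reduces to a branch on the step type. A sketch under stated assumptions: `StubEngine` stands in for the real SIP engine, steps are plain dicts, and LISTEN/HOLD/SPEAK are elided because they need the audio pipeline:

```python
import asyncio

class StubEngine:
    """Records actions instead of driving a real SIP stack (test double)."""
    def __init__(self):
        self.actions = []
    async def send_dtmf(self, call_id, digits):
        self.actions.append(("dtmf", digits))
    async def transfer(self, call_id, device):
        self.actions.append(("transfer", device))

async def run_flow(steps, call_id, engine):
    """Walk a stored flow step by step (step dicts are hypothetical)."""
    for step in steps:
        kind = step["type"]
        if kind == "DTMF":
            await engine.send_dtmf(call_id, step["digits"])
        elif kind == "WAIT":
            await asyncio.sleep(step["seconds"])
        elif kind == "TRANSFER":
            await engine.transfer(call_id, "sip_phone")
        # LISTEN / HOLD / SPEAK would hook into the audio pipeline here

engine = StubEngine()
asyncio.run(run_flow(
    [{"type": "WAIT", "seconds": 0.01},
     {"type": "DTMF", "digits": "2"},
     {"type": "TRANSFER"}],
    call_id="c1", engine=engine,
))
```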
2. Exploration Mode (run_exploration)
When no stored call flow exists, the Hold Slayer explores the IVR autonomously:
```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```
Exploration loop:
```
┌─→ Classify audio (3-second window)
│     ├── SILENCE → wait, increment silence counter
│     ├── RINGING → wait for answer
│     ├── MUSIC → hold detected, monitor for transition
│     ├── DTMF → ignore (echo detection)
│     ├── IVR_PROMPT/SPEECH →
│     │     ├── Transcribe the audio
│     │     ├── Send transcript + intent to LLM
│     │     ├── LLM returns: { "action": "dtmf", "digits": "2" }
│     │     └── Send DTMF
│     └── LIVE_HUMAN → human detected!
│           └── TRANSFER
│
└── Loop until: human detected, max hold time, or call ended
```
Exploration discoveries are recorded and can be fed into the CallFlowLearner to build a reusable flow for next time.
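Stripped of telephony plumbing, the loop above is a classify-then-act cycle. A minimal sketch, with `classify` passed in as a stand-in for the project's AudioClassifier:

```python
def explore(frames, classify, max_frames=100):
    """One pass of the exploration loop: classify each audio window and
    act on the label (sketch; returns the action log for inspection)."""
    log = []
    for frame in frames[:max_frames]:
        label = classify(frame)
        if label in ("SILENCE", "RINGING"):
            continue                      # keep waiting
        if label == "DTMF":
            continue                      # our own tones echoed back; ignore
        if label == "MUSIC":
            log.append("hold_detected")   # monitor for hold-to-human transition
        elif label in ("IVR_PROMPT", "SPEECH"):
            log.append("transcribe_and_ask_llm")
        elif label == "LIVE_HUMAN":
            log.append("transfer")        # human detected: bridge the call
            break
    return log

# demo: labels stand in for raw audio frames
actions = explore(["SILENCE", "MUSIC", "IVR_PROMPT", "LIVE_HUMAN"],
                  classify=lambda frame: frame)
```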
Human Detection
The critical moment — detecting when a live person picks up after hold:
Detection Chain
```
AudioClassifier.classify(audio_frame)
  │
  ├── Feature extraction:
  │     ├── RMS energy (loudness)
  │     ├── Spectral flatness (noise vs tone)
  │     ├── Zero-crossing rate (speech indicator)
  │     ├── Dominant frequency
  │     └── Spectral centroid
  │
  ├── Classification: MUSIC, SILENCE, SPEECH, etc.
  │
  └── Transition detection:
        └── detect_hold_to_human_transition()
              ├── Check last N classifications
              ├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
              ├── Confidence: speech energy > threshold
              └── Result: HUMAN_DETECTED event
```
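The five extracted features are standard signal-processing measures. A pure-Python sketch of how they could be computed per frame (the real service presumably operates on RTP frames with an optimized FFT; the demo frame is a 1 kHz sine sampled at 8 kHz):

```python
import math

def extract_features(frame, sample_rate=8000):
    """Per-frame features for hold/human classification (illustrative sketch)."""
    n = len(frame)
    # RMS energy: overall loudness
    rms = math.sqrt(sum(s * s for s in frame) / n)
    # zero-crossing rate: fraction of adjacent sample pairs changing sign
    zcr = sum(1 for a, b in zip(frame, frame[1:]) if (a < 0) != (b < 0)) / (n - 1)
    # naive magnitude spectrum via DFT, skipping the DC bin (fine for a sketch)
    mags = []
    for k in range(1, n // 2 + 1):
        re = sum(s * math.cos(2 * math.pi * k * i / n) for i, s in enumerate(frame))
        im = sum(-s * math.sin(2 * math.pi * k * i / n) for i, s in enumerate(frame))
        mags.append(math.hypot(re, im) + 1e-12)
    freqs = [k * sample_rate / n for k in range(1, n // 2 + 1)]
    # spectral flatness: geometric mean / arithmetic mean (→1 for noise, →0 for a tone)
    flatness = math.exp(sum(math.log(m) for m in mags) / len(mags)) / (sum(mags) / len(mags))
    # spectral centroid: magnitude-weighted mean frequency
    centroid = sum(f * m for f, m in zip(freqs, mags)) / sum(mags)
    dominant = freqs[max(range(len(mags)), key=mags.__getitem__)]
    return {"rms": rms, "zcr": zcr, "flatness": flatness,
            "centroid": centroid, "dominant_hz": dominant}

# demo: 1 kHz sine at 8 kHz sampling, 64 samples (exactly 8 periods)
frame = [math.sin(2 * math.pi * 1000 * i / 8000) for i in range(64)]
feats = extract_features(frame)
```

A pure tone yields a dominant frequency at 1000 Hz, RMS near 1/√2, and near-zero flatness; hold music and speech fall between the tone and noise extremes.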
What triggers a transfer?
The Hold Slayer considers a human detected when:
- Classification history shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT)
- Energy threshold — the speech audio has sufficient RMS energy (not just background noise)
- Consecutive speech frames — at least 2-3 consecutive speech classifications (avoids false positives from hold music announcements like "your call is important to us")
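These three rules combine into a small sliding-window check. A sketch with illustrative thresholds (`TransitionDetector` and its parameters are assumptions, not the project's actual class):

```python
from collections import deque

HOLD_LABELS = {"MUSIC", "SILENCE"}
SPEECH_LABELS = {"LIVE_HUMAN", "IVR_PROMPT", "SPEECH"}

class TransitionDetector:
    """Fires once hold-like audio gives way to sustained, sufficiently
    loud speech (sketch of the three rules above)."""

    def __init__(self, window=8, min_speech_run=3, energy_threshold=0.05):
        self.history = deque(maxlen=window)   # recent (label, rms) pairs
        self.min_speech_run = min_speech_run
        self.energy_threshold = energy_threshold

    def update(self, label, rms):
        self.history.append((label, rms))
        recent = list(self.history)
        if len(recent) <= self.min_speech_run:
            return False
        run = recent[-self.min_speech_run:]          # most recent frames
        earlier = recent[:-self.min_speech_run]      # context before them
        sustained = all(
            l in SPEECH_LABELS and e > self.energy_threshold for l, e in run
        )
        was_on_hold = any(l in HOLD_LABELS for l, _ in earlier)
        return sustained and was_on_hold

det = TransitionDetector()
frames = [("MUSIC", 0.2)] * 3 + [("LIVE_HUMAN", 0.3)] * 3
fired = [det.update(label, rms) for label, rms in frames]
```

The detector stays quiet through the hold music and through the first one or two speech frames, then fires only once the speech run is long and loud enough.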
False Positive Handling
Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:
- Duration check — Hold announcements are typically short (5-15 seconds). A live agent conversation continues longer.
- Pattern matching — After speech, if audio returns to MUSIC within a few seconds, it was just an announcement.
- Transcript analysis — If transcription is active, the LLM can analyze whether the speech sounds like a recorded announcement vs. a live greeting.
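The duration and pattern checks amount to asking whether music resumes shortly after speech begins. A sketch, assuming a `(timestamp, label)` timeline and an illustrative 20-second window:

```python
def is_announcement(timeline, speech_start, revert_window=20.0):
    """Treat a speech burst as a recorded hold announcement if the audio
    reverts to MUSIC within `revert_window` seconds of the speech starting
    (sketch; the window length is illustrative)."""
    for ts, label in timeline:
        if ts <= speech_start:
            continue
        if ts - speech_start > revert_window:
            return False          # speech sustained past the window: likely live
        if label == "MUSIC":
            return True           # music resumed quickly: just an announcement
    return False

# "your estimated wait time..." burst, music back 8 seconds later
announcement = is_announcement(
    [(0, "MUSIC"), (10, "SPEECH"), (18, "MUSIC")], speech_start=10)
# speech still going 30 seconds in: probably a live agent
live = is_announcement(
    [(0, "MUSIC"), (10, "SPEECH"), (40, "SPEECH")], speech_start=10)
```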
LISTEN Step + LLM Fallback
LISTEN is the most interesting step type. When the Hold Slayer encounters a LISTEN step in a call flow:
```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
    await sip_engine.send_dtmf(call_id, step.dtmf)
# No hardcoded DTMF? Ask the LLM.
else:
    transcript = await transcription.transcribe(audio)
    decision = await llm_client.analyze_ivr_menu(
        transcript=transcript,
        intent=intent,
        previous_selections=previous_steps,
    )
    if decision.get("action") == "dtmf":
        await sip_engine.send_dtmf(call_id, decision["digits"])
```
The LLM receives:
- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)
And returns structured JSON:
```json
{
  "action": "dtmf",
  "digits": "1",
  "reasoning": "Billing is the correct department for charge disputes"
}
```
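Since LLM output is not guaranteed to be well-formed, the reply is worth parsing defensively. A sketch of one way to do that (the action set, helper name, and "degrade to wait" fallback are assumptions, not the project's actual validation):

```python
import json

VALID_ACTIONS = {"dtmf", "wait", "transfer"}

def parse_ivr_decision(raw: str) -> dict:
    """Parse the LLM's structured reply; malformed or unexpected output
    degrades to a safe 'wait' so the call keeps listening (sketch)."""
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "wait"}
    if not isinstance(decision, dict) or decision.get("action") not in VALID_ACTIONS:
        return {"action": "wait"}
    if decision["action"] == "dtmf":
        digits = str(decision.get("digits", ""))
        # only real keypad characters may be sent to the SIP engine
        if not digits or not all(c in "0123456789*#" for c in digits):
            return {"action": "wait"}
        decision["digits"] = digits
    return decision

ok = parse_ivr_decision('{"action": "dtmf", "digits": "1", "reasoning": "Billing"}')
bad = parse_ivr_decision("press one, I guess")
```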
Event Publishing
The Hold Slayer publishes events throughout the process:
| Event | When |
|---|---|
| IVR_STEP | Each step in the call flow is executed |
| IVR_DTMF_SENT | DTMF digits are sent |
| IVR_MENU_DETECTED | An IVR menu prompt is transcribed |
| HOLD_DETECTED | Hold music is detected |
| HUMAN_DETECTED | Live human speech detected after hold |
| TRANSFER_STARTED | Call bridge initiated to user's device |
| TRANSFER_COMPLETE | User's device answered, bridge active |
All events flow through the EventBus to WebSocket clients, MCP server, notification service, and analytics.
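An asyncio pub/sub bus of this shape fits in a few lines. A minimal sketch (the `subscribe`/`publish` interface is illustrative, not the project's actual EventBus API):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Minimal asyncio pub/sub: each subscriber gets its own queue, so a
    slow WebSocket client never blocks the notification service (sketch)."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._subscribers[event_type].append(queue)
        return queue

    async def publish(self, event_type: str, payload: dict) -> None:
        # fan the event out to every queue registered for this type
        for queue in self._subscribers[event_type]:
            await queue.put({"type": event_type, **payload})

async def demo():
    bus = EventBus()
    inbox = bus.subscribe("HUMAN_DETECTED")
    await bus.publish("HUMAN_DETECTED", {"call_id": "c1"})
    return await inbox.get()

event = asyncio.run(demo())
```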
Configuration
| Setting | Description | Default |
|---|---|---|
| MAX_HOLD_TIME | Maximum seconds to wait on hold before giving up | 7200 (2 hours) |
| HOLD_CHECK_INTERVAL | Seconds between audio classification checks | 2.0 |
| DEFAULT_TRANSFER_DEVICE | Device to transfer to when human detected | sip_phone |
| CLASSIFIER_WINDOW_SECONDS | Audio window size for classification | 3.0 |
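The settings above can be captured in a small config object carrying the documented defaults. A sketch; the env-var loading mechanism is an assumption (the project may use pydantic settings or another loader):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class HoldSlayerSettings:
    """Defaults from the table above; field names mirror the env vars."""
    max_hold_time: int = 7200                  # seconds (2 hours)
    hold_check_interval: float = 2.0           # seconds between checks
    default_transfer_device: str = "sip_phone"
    classifier_window_seconds: float = 3.0

    @classmethod
    def from_env(cls) -> "HoldSlayerSettings":
        # each setting is overridable via its like-named environment variable
        return cls(
            max_hold_time=int(os.getenv("MAX_HOLD_TIME", 7200)),
            hold_check_interval=float(os.getenv("HOLD_CHECK_INTERVAL", 2.0)),
            default_transfer_device=os.getenv("DEFAULT_TRANSFER_DEVICE", "sip_phone"),
            classifier_window_seconds=float(os.getenv("CLASSIFIER_WINDOW_SECONDS", 3.0)),
        )

defaults = HoldSlayerSettings()
```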