# Hold Slayer Service
The Hold Slayer (`services/hold_slayer.py`) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.

## Two Operating Modes

### 1. Flow-Guided Mode (`run_with_flow`)

When a stored `CallFlow` exists for the number being called, the Hold Slayer follows it step by step:

```python
await hold_slayer.run_with_flow(call_id, call_flow)
```

The call flow is a tree of steps (see [Call Flows](call-flows.md)). The Hold Slayer walks through them:
```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s (wait for greeting)
├── Step 2: LISTEN (transcribe → LLM picks option)
├── Step 3: DTMF "2" (press 2 for account services)
├── Step 4: LISTEN (transcribe → LLM picks option)
├── Step 5: DTMF "1" (press 1 for disputes)
├── Step 6: HOLD (wait for human)
└── Step 7: TRANSFER (bridge to your phone)
```

**Step execution logic:**

| Step Type | What Happens |
|-----------|--------------|
| `DTMF` | Send the specified digits via the SIP engine |
| `WAIT` | Sleep for the specified duration |
| `LISTEN` | Record audio and transcribe it; use the step's hardcoded DTMF if available, otherwise ask the LLM to pick the right option |
| `HOLD` | Monitor audio classification and wait for human detection |
| `SPEAK` | Play a WAV file or TTS audio (for interactive prompts) |
| `TRANSFER` | Bridge the call to the user's device |
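The dispatch over step types can be sketched as follows. This is a minimal, illustrative sketch: the `Step` model, `FakeSipEngine`, and `run_flow` names are stand-ins, not the project's actual classes.

```python
import asyncio
from dataclasses import dataclass

# Illustrative step record; the real CallFlow step model may differ.
@dataclass
class Step:
    type: str
    dtmf: str = ""
    duration: float = 0.0

class FakeSipEngine:
    """Records actions instead of driving a real SIP call."""
    def __init__(self):
        self.actions = []

    async def send_dtmf(self, call_id, digits):
        self.actions.append(("dtmf", digits))

    async def transfer(self, call_id, device):
        self.actions.append(("transfer", device))

async def execute_step(step, call_id, engine):
    # Dispatch mirroring the step-execution table above.
    if step.type == "DTMF":
        await engine.send_dtmf(call_id, step.dtmf)
    elif step.type == "WAIT":
        await asyncio.sleep(step.duration)
    elif step.type == "TRANSFER":
        await engine.transfer(call_id, "sip_phone")
    # LISTEN, HOLD, and SPEAK omitted here: they need the STT,
    # classifier, and TTS services described elsewhere in this document.

async def run_flow(flow, call_id, engine):
    for step in flow:
        await execute_step(step, call_id, engine)
    return engine.actions

engine = FakeSipEngine()
flow = [Step("WAIT", duration=0.01), Step("DTMF", dtmf="2"), Step("TRANSFER")]
print(asyncio.run(run_flow(flow, "call-1", engine)))
# [('dtmf', '2'), ('transfer', 'sip_phone')]
```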
### 2. Exploration Mode (`run_exploration`)

When no stored call flow exists, the Hold Slayer explores the IVR autonomously:
```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```

**Exploration loop:**

```
┌─→ Classify audio (3-second window)
│    ├── SILENCE → wait, increment silence counter
│    ├── RINGING → wait for answer
│    ├── MUSIC → hold detected, monitor for transition
│    ├── DTMF → ignore (echo detection)
│    ├── IVR_PROMPT/SPEECH →
│    │    ├── Transcribe the audio
│    │    ├── Send transcript + intent to LLM
│    │    ├── LLM returns: { "action": "dtmf", "digits": "2" }
│    │    └── Send DTMF
│    └── LIVE_HUMAN → human detected!
│         └── TRANSFER
│
└── Loop until: human detected, max hold time, or call ended
```

**Exploration discoveries** are recorded and can be fed into the `CallFlowLearner` to build a reusable flow for next time.
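The loop above can be sketched in a few lines. The label names mirror the diagram, but the control flow here is an illustrative simplification (transcription and the LLM call are stubbed out as a single `send_dtmf` action):

```python
import asyncio

# Minimal sketch of the exploration loop over classifier labels.
async def explore(labels, max_iterations=20):
    actions = []
    for label in labels[:max_iterations]:
        if label in ("SILENCE", "RINGING"):
            actions.append("wait")          # nothing to act on yet
        elif label == "MUSIC":
            actions.append("monitor_hold")  # on hold: watch for a transition
        elif label in ("IVR_PROMPT", "SPEECH"):
            actions.append("send_dtmf")     # transcribe + ask LLM (stubbed)
        elif label == "LIVE_HUMAN":
            actions.append("transfer")      # human detected: bridge the call
            break
    return actions

labels = ["RINGING", "IVR_PROMPT", "MUSIC", "MUSIC", "LIVE_HUMAN"]
print(asyncio.run(explore(labels)))
# ['wait', 'send_dtmf', 'monitor_hold', 'monitor_hold', 'transfer']
```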
## Human Detection

The critical moment: detecting when a live person picks up after hold.

### Detection Chain

```
AudioClassifier.classify(audio_frame)
│
├── Feature extraction:
│    ├── RMS energy (loudness)
│    ├── Spectral flatness (noise vs. tone)
│    ├── Zero-crossing rate (speech indicator)
│    ├── Dominant frequency
│    └── Spectral centroid
│
├── Classification: MUSIC, SILENCE, SPEECH, etc.
│
└── Transition detection:
     └── detect_hold_to_human_transition()
          ├── Check last N classifications
          ├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
          ├── Confidence: speech energy > threshold
          └── Result: HUMAN_DETECTED event
```
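Two of the features above are simple enough to compute in a few lines. A rough sketch over a frame of PCM samples (pure Python for illustration; a real classifier would vectorize this):

```python
import math

def rms_energy(samples):
    # Root-mean-square amplitude: the loudness of the frame.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples):
    # Fraction of adjacent sample pairs that change sign; speech tends to
    # have a higher, more variable ZCR than steady hold-music tones.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

# A square wave alternating every sample crosses zero at every step.
frame = [1000, -1000] * 80
print(round(zero_crossing_rate(frame), 3))  # 1.0
print(round(rms_energy(frame)))             # 1000
```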
### What triggers a transfer?

The Hold Slayer considers a human detected when:

1. **Classification history** shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT)
2. **Energy threshold**: the speech audio has sufficient RMS energy (not just background noise)
3. **Consecutive speech frames**: at least 2-3 consecutive speech classifications (this avoids false positives from hold-music announcements like "your call is important to us")
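The three conditions above can be sketched as one check over the recent classification history. This is a hypothetical helper in the spirit of `detect_hold_to_human_transition()`; the label sets and the energy threshold are illustrative, not the project's actual values:

```python
from collections import deque

HOLD_LABELS = {"MUSIC", "SILENCE"}
SPEECH_LABELS = {"LIVE_HUMAN", "IVR_PROMPT", "SPEECH"}

def hold_to_human_transition(history, min_speech_frames=2):
    # history: recent (label, rms_energy) pairs, oldest first.
    labels = [label for label, _ in history]
    if len(labels) < min_speech_frames + 1:
        return False
    # Condition 3: the last N frames must all be speech-like.
    if not all(l in SPEECH_LABELS for l in labels[-min_speech_frames:]):
        return False
    # Condition 1: the frame before the speech run must be hold-like.
    if labels[-min_speech_frames - 1] not in HOLD_LABELS:
        return False
    # Condition 2: enough energy to rule out background noise.
    return all(e > 200 for _, e in list(history)[-min_speech_frames:])

history = deque(maxlen=5)
for entry in [("MUSIC", 300), ("MUSIC", 310), ("LIVE_HUMAN", 900), ("LIVE_HUMAN", 880)]:
    history.append(entry)
print(hold_to_human_transition(history))  # True
```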
### False Positive Handling

Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:

1. **Duration check**: hold announcements are typically short (5-15 seconds), while a live agent conversation continues longer.
2. **Pattern matching**: if the audio returns to MUSIC within a few seconds after speech, it was just an announcement.
3. **Transcript analysis**: if transcription is active, the LLM can judge whether the speech sounds like a recorded announcement or a live greeting.
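The duration and pattern rules combine into a single check over the frames observed from the first speech frame onward. A minimal sketch; the frame length and cutoff are illustrative values, not the project's configuration:

```python
SPEECH_LIKE = ("SPEECH", "LIVE_HUMAN", "IVR_PROMPT")

def classify_speech_burst(frames, frame_seconds=2.0, max_announcement_s=15.0):
    """Decide whether a burst of speech frames was a hold announcement.

    frames: classification labels starting at the first speech frame.
    """
    # Length of the initial run of speech-like frames.
    speech_run = 0
    for label in frames:
        if label in SPEECH_LIKE:
            speech_run += 1
        else:
            break
    duration = speech_run * frame_seconds
    # Pattern rule: speech that reverts to MUSIC was just an announcement...
    reverted = speech_run < len(frames) and frames[speech_run] == "MUSIC"
    # ...combined with the duration rule: announcements are short.
    if reverted and duration <= max_announcement_s:
        return "announcement"
    return "possible_live_human"

print(classify_speech_burst(["SPEECH", "SPEECH", "MUSIC"]))  # announcement
print(classify_speech_burst(["LIVE_HUMAN"] * 10))            # possible_live_human
```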
## LISTEN Step + LLM Fallback

The most interesting step type is `LISTEN`. When the Hold Slayer encounters a `LISTEN` step in a call flow:

```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
    await sip_engine.send_dtmf(call_id, step.dtmf)

# No hardcoded DTMF? Ask the LLM.
else:
    transcript = await transcription.transcribe(audio)
    decision = await llm_client.analyze_ivr_menu(
        transcript=transcript,
        intent=intent,
        previous_selections=previous_steps,
    )
    if decision.get("action") == "dtmf":
        await sip_engine.send_dtmf(call_id, decision["digits"])
```

The LLM receives:
- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)

And returns structured JSON:

```json
{
  "action": "dtmf",
  "digits": "1",
  "reasoning": "Billing is the correct department for charge disputes"
}
```
## Event Publishing

The Hold Slayer publishes events throughout the process:

| Event | When |
|-------|------|
| `IVR_STEP` | Each step in the call flow is executed |
| `IVR_DTMF_SENT` | DTMF digits are sent |
| `IVR_MENU_DETECTED` | An IVR menu prompt is transcribed |
| `HOLD_DETECTED` | Hold music is detected |
| `HUMAN_DETECTED` | Live human speech is detected after hold |
| `TRANSFER_STARTED` | Call bridge initiated to the user's device |
| `TRANSFER_COMPLETE` | User's device answered, bridge active |

All events flow through the EventBus to WebSocket clients, the MCP server, the notification service, and analytics.
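A minimal asyncio pub/sub bus in the spirit of the EventBus described above (the class and method names here are illustrative; the real implementation's API may differ):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Fan out events to every subscriber registered for their type."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    async def publish(self, event_type, payload):
        for callback in self._subscribers[event_type]:
            await callback(event_type, payload)

async def main():
    bus = EventBus()
    received = []

    async def on_event(event_type, payload):
        # A WebSocket client or notifier would react here.
        received.append((event_type, payload["call_id"]))

    bus.subscribe("HUMAN_DETECTED", on_event)
    await bus.publish("HUMAN_DETECTED", {"call_id": "call-1"})
    return received

print(asyncio.run(main()))  # [('HUMAN_DETECTED', 'call-1')]
```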
## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold before giving up | `7200` (2 hours) |
| `HOLD_CHECK_INTERVAL` | Seconds between audio classification checks | `2.0` |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to when a human is detected | `sip_phone` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` |