# Hold Slayer Service

The Hold Slayer (`services/hold_slayer.py`) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.

## Two Operating Modes

### 1. Flow-Guided Mode (`run_with_flow`)

When a stored `CallFlow` exists for the number being called, the Hold Slayer follows it step by step:

```python
await hold_slayer.run_with_flow(call_id, call_flow)
```

The call flow is a tree of steps (see [Call Flows](call-flows.md)). The Hold Slayer walks through them:

```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s (wait for greeting)
├── Step 2: LISTEN (transcribe → LLM picks option)
├── Step 3: DTMF "2" (press 2 for account services)
├── Step 4: LISTEN (transcribe → LLM picks option)
├── Step 5: DTMF "1" (press 1 for disputes)
├── Step 6: HOLD (wait for human)
└── Step 7: TRANSFER (bridge to your phone)
```

**Step execution logic:**

| Step Type | What Happens |
|-----------|--------------|
| `DTMF` | Send the specified digits via the SIP engine |
| `WAIT` | Sleep for the specified duration |
| `LISTEN` | Record audio and transcribe it; use the hardcoded DTMF if available, otherwise ask the LLM to pick the right option |
| `HOLD` | Monitor audio classification and wait for human detection |
| `SPEAK` | Play a WAV file or TTS audio (for interactive prompts) |
| `TRANSFER` | Bridge the call to the user's device |
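
The table above can be sketched as a dispatch loop. This is a minimal illustration only, not the actual `services/hold_slayer.py` code; the `Step` dataclass and the recorded-action list stand in for the real SIP engine and classifier calls:

```python
import asyncio
from dataclasses import dataclass
from typing import Optional

@dataclass
class Step:
    type: str                 # "DTMF" | "WAIT" | "LISTEN" | "HOLD" | "SPEAK" | "TRANSFER"
    digits: Optional[str] = None
    seconds: float = 0.0

async def execute_step(step: Step, actions: list) -> None:
    """Dispatch one call-flow step; `actions` records what happened."""
    if step.type == "DTMF":
        actions.append(("dtmf", step.digits))      # would call sip_engine.send_dtmf()
    elif step.type == "WAIT":
        await asyncio.sleep(step.seconds)          # sleep for the specified duration
        actions.append(("wait", step.seconds))
    elif step.type == "HOLD":
        actions.append(("hold", None))             # would poll the audio classifier
    elif step.type == "TRANSFER":
        actions.append(("transfer", None))         # would bridge to the user's device
    # LISTEN / SPEAK omitted for brevity

async def run_flow(steps: list[Step]) -> list:
    actions: list = []
    for step in steps:
        await execute_step(step, actions)
    return actions
```

In this sketch, `asyncio.run(run_flow([...]))` walks the steps in order, which mirrors how flow-guided mode traverses the stored tree.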

### 2. Exploration Mode (`run_exploration`)

When no stored call flow exists, the Hold Slayer explores the IVR autonomously:

```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```

**Exploration loop:**

```
┌─→ Classify audio (3-second window)
│   ├── SILENCE → wait, increment silence counter
│   ├── RINGING → wait for answer
│   ├── MUSIC → hold detected, monitor for transition
│   ├── DTMF → ignore (echo detection)
│   ├── IVR_PROMPT/SPEECH →
│   │   ├── Transcribe the audio
│   │   ├── Send transcript + intent to LLM
│   │   ├── LLM returns: { "action": "dtmf", "digits": "2" }
│   │   └── Send DTMF
│   └── LIVE_HUMAN → human detected!
│       └── TRANSFER
│
└── Loop until: human detected, max hold time, or call ended
```
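
The classify-and-dispatch branch of the loop can be sketched as a pure function. The `Audio` enum and `explore_once` helper are illustrative names for this sketch, not the project's actual classifier API:

```python
from enum import Enum, auto

class Audio(Enum):
    SILENCE = auto()
    RINGING = auto()
    MUSIC = auto()
    DTMF = auto()
    IVR_PROMPT = auto()
    LIVE_HUMAN = auto()

def explore_once(label: Audio) -> str:
    """Map one classification window to the action the exploration loop takes."""
    if label in (Audio.SILENCE, Audio.RINGING):
        return "wait"                      # keep listening
    if label == Audio.MUSIC:
        return "monitor_hold"              # watch for the hold-to-human transition
    if label == Audio.DTMF:
        return "ignore"                    # our own tones echoed back
    if label == Audio.IVR_PROMPT:
        return "transcribe_and_ask_llm"    # transcript + intent go to the LLM
    return "transfer"                      # LIVE_HUMAN: bridge to the user
```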

**Exploration discoveries** are recorded and can be fed into the `CallFlowLearner` to build a reusable flow for next time.

## Human Detection

The critical moment: detecting when a live person picks up after hold.

### Detection Chain

```
AudioClassifier.classify(audio_frame)
│
├── Feature extraction:
│   ├── RMS energy (loudness)
│   ├── Spectral flatness (noise vs. tone)
│   ├── Zero-crossing rate (speech indicator)
│   ├── Dominant frequency
│   └── Spectral centroid
│
├── Classification: MUSIC, SILENCE, SPEECH, etc.
│
└── Transition detection:
    └── detect_hold_to_human_transition()
        ├── Check last N classifications
        ├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
        ├── Confidence: speech energy > threshold
        └── Result: HUMAN_DETECTED event
```

### What triggers a transfer?

The Hold Slayer considers a human detected when:

1. **Classification history** shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT).
2. **Energy threshold**: the speech audio has sufficient RMS energy (not just background noise).
3. **Consecutive speech frames**: at least 2-3 consecutive speech classifications (avoids false positives from hold-music announcements like "your call is important to us").
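
Taken together, those three conditions amount to a check like the following. This is a hedged sketch: the function name, label strings, and thresholds are assumptions for illustration, not the real `detect_hold_to_human_transition()` signature:

```python
HOLD_LIKE = {"MUSIC", "SILENCE"}
SPEECH_LIKE = {"LIVE_HUMAN", "IVR_PROMPT", "SPEECH"}

def hold_to_human_transition(history: list[str],
                             energies: list[float],
                             min_speech_frames: int = 2,
                             energy_threshold: float = 0.01) -> bool:
    """True when the trailing frames are all speech, preceded by hold-like
    audio, and loud enough to rule out background noise."""
    if len(history) <= min_speech_frames:
        return False
    tail = history[-min_speech_frames:]
    tail_energy = energies[-min_speech_frames:]
    return (
        all(c in SPEECH_LIKE for c in tail)               # consecutive speech frames
        and history[-min_speech_frames - 1] in HOLD_LIKE  # preceded by hold audio
        and all(e > energy_threshold for e in tail_energy)  # sufficient RMS energy
    )
```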

### False Positive Handling

Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:

1. **Duration check**: hold announcements are typically short (5-15 seconds), while a live agent conversation continues longer.
2. **Pattern matching**: if the audio returns to MUSIC within a few seconds of the speech, it was just an announcement.
3. **Transcript analysis**: if transcription is active, the LLM can judge whether the speech sounds like a recorded announcement or a live greeting.
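
Rule 2 (pattern matching) can be illustrated with a small helper: if MUSIC resumes within a few classification frames of the speech, treat the speech as an announcement. `is_announcement` is a hypothetical name for this sketch, not an actual project function:

```python
def is_announcement(window: list[str], resume_within: int = 3) -> bool:
    """True if any speech frame in the window is followed by MUSIC within
    `resume_within` frames, i.e. the hold music resumed after the speech."""
    for i, label in enumerate(window):
        if label in ("SPEECH", "LIVE_HUMAN"):
            following = window[i + 1 : i + 1 + resume_within]
            if "MUSIC" in following:
                return True
    return False
```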

## LISTEN Step + LLM Fallback

The most interesting step type. When the Hold Slayer encounters a `LISTEN` step in a call flow:

```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
    await sip_engine.send_dtmf(call_id, step.dtmf)

# No hardcoded DTMF? Ask the LLM.
else:
    transcript = await transcription.transcribe(audio)
    decision = await llm_client.analyze_ivr_menu(
        transcript=transcript,
        intent=intent,
        previous_selections=previous_steps,
    )
    if decision.get("action") == "dtmf":
        await sip_engine.send_dtmf(call_id, decision["digits"])
```

The LLM receives:

- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)

And returns structured JSON:

```json
{
  "action": "dtmf",
  "digits": "1",
  "reasoning": "Billing is the correct department for charge disputes"
}
```
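
Because the LLM's reply is free-form text, it is worth validating the JSON before any digits are sent. A minimal sketch, assuming a fixed action set (`VALID_ACTIONS` and `parse_ivr_decision` are assumptions for illustration, not the project's actual schema):

```python
import json

VALID_ACTIONS = {"dtmf", "wait", "transfer", "hangup"}  # assumed action set

def parse_ivr_decision(raw: str) -> dict:
    """Parse and sanity-check the LLM's JSON decision before acting on it."""
    decision = json.loads(raw)
    action = decision.get("action")
    if action not in VALID_ACTIONS:
        raise ValueError(f"unknown action: {action!r}")
    if action == "dtmf":
        digits = decision.get("digits", "")
        # DTMF can only ever be digits, star, or pound
        if not digits or not all(c in "0123456789*#" for c in digits):
            raise ValueError(f"invalid DTMF digits: {digits!r}")
    return decision
```

Rejecting malformed output here keeps a hallucinated action from reaching the SIP engine.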

## Event Publishing

The Hold Slayer publishes events throughout the process:

| Event | When |
|-------|------|
| `IVR_STEP` | Each step in the call flow is executed |
| `IVR_DTMF_SENT` | DTMF digits are sent |
| `IVR_MENU_DETECTED` | An IVR menu prompt is transcribed |
| `HOLD_DETECTED` | Hold music is detected |
| `HUMAN_DETECTED` | Live human speech is detected after hold |
| `TRANSFER_STARTED` | A call bridge to the user's device is initiated |
| `TRANSFER_COMPLETE` | The user's device answered and the bridge is active |

All events flow through the EventBus to WebSocket clients, the MCP server, the notification service, and analytics.
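
A minimal version of such an asyncio pub/sub bus might look like this; it is a sketch of the pattern, not the project's actual `EventBus` class:

```python
import asyncio
from collections import defaultdict
from typing import Awaitable, Callable

class EventBus:
    """Minimal asyncio pub/sub: each subscriber receives every event
    published on the topic it subscribed to."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable[[dict], Awaitable[None]]]] = defaultdict(list)

    def subscribe(self, event: str, handler: Callable[[dict], Awaitable[None]]) -> None:
        self._subs[event].append(handler)

    async def publish(self, event: str, payload: dict) -> None:
        # Deliver to every subscriber of this event type, in order
        for handler in self._subs[event]:
            await handler({"event": event, **payload})
```

A WebSocket broadcaster, the notification service, and the analytics writer would each register a handler via `subscribe()`.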

## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold before giving up | `7200` (2 hours) |
| `HOLD_CHECK_INTERVAL` | Seconds between audio classification checks | `2.0` |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to when a human is detected | `sip_phone` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` |
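
One plausible way to load these settings, assuming environment-variable overrides (the real project may use a different settings mechanism; the class name here is an assumption):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class HoldSlayerSettings:
    """Hold Slayer tunables, with environment-variable overrides."""
    max_hold_time: int = 7200             # seconds on hold before giving up
    hold_check_interval: float = 2.0      # seconds between classifier checks
    default_transfer_device: str = "sip_phone"
    classifier_window_seconds: float = 3.0

    @classmethod
    def from_env(cls) -> "HoldSlayerSettings":
        # Fall back to the defaults above when a variable is unset
        return cls(
            max_hold_time=int(os.getenv("MAX_HOLD_TIME", cls.max_hold_time)),
            hold_check_interval=float(os.getenv("HOLD_CHECK_INTERVAL", cls.hold_check_interval)),
            default_transfer_device=os.getenv("DEFAULT_TRANSFER_DEVICE", cls.default_transfer_device),
            classifier_window_seconds=float(os.getenv("CLASSIFIER_WINDOW_SECONDS", cls.classifier_window_seconds)),
        )
```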