feat: add initial Hold Slayer AI telephony gateway implementation

Complete project scaffolding and core implementation of an AI-powered
telephony system that calls companies, navigates IVR menus, waits on
hold, and transfers to the user when a human answers.

Key components:
- FastAPI server with REST API, WebSocket, and MCP (SSE) interfaces
- SIP/VoIP call management via PJSUA2 with RTP audio streaming
- LLM-powered IVR navigation using OpenAI/Anthropic with tool calling
- Hold detection service combining audio analysis and silence detection
- Real-time STT (Whisper/Deepgram) and TTS (OpenAI/Piper) pipelines
- Call recording with per-channel and mixed audio capture
- Event bus (asyncio pub/sub) for real-time client updates
- Web dashboard with live call monitoring
- SQLite persistence via SQLAlchemy with call history and analytics
- Notification support (email, SMS, webhook, desktop)
- Docker Compose deployment with Opal VoIP and Opal Media containers
- Comprehensive test suite with unit, integration, and E2E tests
- Simplified .gitignore and full project documentation in README

Commit ecf37658ce (parent c9ff60702b), 2026-03-21 19:23:26 +00:00
56 changed files with 11601 additions and 164 deletions

docs/hold-slayer-service.md (new file, 168 lines)
# Hold Slayer Service
The Hold Slayer (`services/hold_slayer.py`) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.
## Two Operating Modes
### 1. Flow-Guided Mode (`run_with_flow`)
When a stored `CallFlow` exists for the number being called, the Hold Slayer follows it step-by-step:
```python
await hold_slayer.run_with_flow(call_id, call_flow)
```
The call flow is a tree of steps (see [Call Flows](call-flows.md)). The Hold Slayer walks through them:
```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s (wait for greeting)
├── Step 2: LISTEN (transcribe → LLM picks option)
├── Step 3: DTMF "2" (press 2 for account services)
├── Step 4: LISTEN (transcribe → LLM picks option)
├── Step 5: DTMF "1" (press 1 for disputes)
├── Step 6: HOLD (wait for human)
└── Step 7: TRANSFER (bridge to your phone)
```
**Step execution logic:**
| Step Type | What Happens |
|-----------|-------------|
| `DTMF` | Send the specified digits via SIP engine |
| `WAIT` | Sleep for the specified duration |
| `LISTEN` | Record and transcribe audio; use the step's hardcoded DTMF if present, otherwise ask the LLM to pick the right option |
| `HOLD` | Monitor audio classification, wait for human detection |
| `SPEAK` | Play a WAV file or TTS audio (for interactive prompts) |
| `TRANSFER` | Bridge the call to the user's device |
### 2. Exploration Mode (`run_exploration`)
When no stored call flow exists, the Hold Slayer explores the IVR autonomously:
```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```
**Exploration loop:**
```
┌─→ Classify audio (3-second window)
│ ├── SILENCE → wait, increment silence counter
│ ├── RINGING → wait for answer
│ ├── MUSIC → hold detected, monitor for transition
│ ├── DTMF → ignore (echo detection)
│ ├── IVR_PROMPT/SPEECH →
│ │ ├── Transcribe the audio
│ │ ├── Send transcript + intent to LLM
│ │ ├── LLM returns: { "action": "dtmf", "digits": "2" }
│ │ └── Send DTMF
│ └── LIVE_HUMAN → human detected!
│ └── TRANSFER
└── Loop until: human detected, max hold time, or call ended
```
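The branching above can be sketched as a pure function over a sequence of per-window classifications. This is a simplified sketch: the `AudioClass` labels mirror the diagram, but the real loop runs against live audio rather than a precomputed list.

```python
from enum import Enum

class AudioClass(Enum):
    SILENCE = "silence"
    RINGING = "ringing"
    MUSIC = "music"
    IVR_PROMPT = "ivr_prompt"
    LIVE_HUMAN = "live_human"

def explore(classifications: list[AudioClass]) -> list[str]:
    # Consume one classification per 3-second window and return the
    # action taken for each, stopping once a human is detected.
    actions = []
    for c in classifications:
        if c in (AudioClass.SILENCE, AudioClass.RINGING):
            actions.append("wait")
        elif c is AudioClass.MUSIC:
            actions.append("monitor_hold")
        elif c is AudioClass.IVR_PROMPT:
            actions.append("transcribe_and_ask_llm")
        elif c is AudioClass.LIVE_HUMAN:
            actions.append("transfer")
            break
    return actions
```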
**Exploration discoveries** are recorded and can be fed into the `CallFlowLearner` to build a reusable flow for next time.
## Human Detection
The critical moment — detecting when a live person picks up after hold:
### Detection Chain
```
AudioClassifier.classify(audio_frame)
├── Feature extraction:
│ ├── RMS energy (loudness)
│ ├── Spectral flatness (noise vs tone)
│ ├── Zero-crossing rate (speech indicator)
│ ├── Dominant frequency
│ └── Spectral centroid
├── Classification: MUSIC, SILENCE, SPEECH, etc.
└── Transition detection:
└── detect_hold_to_human_transition()
├── Check last N classifications
├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
├── Confidence: speech energy > threshold
└── Result: HUMAN_DETECTED event
```
### What triggers a transfer?
The Hold Slayer considers a human detected when:
1. **Classification history** shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT)
2. **Energy threshold** — the speech audio has sufficient RMS energy (not just background noise)
3. **Consecutive speech frames** — at least 2-3 consecutive speech classifications (avoids false positives from hold music announcements like "your call is important to us")
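The three conditions can be sketched as one check over the recent classification history. This is a hypothetical version of `detect_hold_to_human_transition` operating on `(label, rms_energy)` tuples (newest last); the label names and thresholds are assumptions.

```python
HOLD_LIKE = {"MUSIC", "SILENCE"}
SPEECH_LIKE = {"LIVE_HUMAN", "IVR_PROMPT"}

def detect_hold_to_human_transition(
    history: list[tuple[str, float]],
    min_speech_frames: int = 2,
    energy_threshold: float = 0.01,
) -> bool:
    # Require: a hold-like frame immediately before a run of at least
    # min_speech_frames speech-like frames, each above the energy floor.
    if len(history) < min_speech_frames + 1:
        return False
    recent = history[-min_speech_frames:]
    before = history[-(min_speech_frames + 1)]
    return before[0] in HOLD_LIKE and all(
        label in SPEECH_LIKE and energy > energy_threshold
        for label, energy in recent
    )
```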
### False Positive Handling
Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:
1. **Duration check** — Hold announcements are typically short (5-15 seconds). A live agent conversation continues longer.
2. **Pattern matching** — After speech, if audio returns to MUSIC within a few seconds, it was just an announcement.
3. **Transcript analysis** — If transcription is active, the LLM can analyze whether the speech sounds like a recorded announcement vs. a live greeting.
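The duration and pattern checks can be combined into a simple heuristic. This is a hypothetical sketch over `(label, duration_seconds)` segments: a speech burst bracketed by MUSIC and shorter than the announcement cutoff is treated as a recorded announcement, not a live agent.

```python
def is_announcement(
    segments: list[tuple[str, float]],
    max_announcement_s: float = 15.0,
) -> bool:
    # Scan for a short SPEECH segment surrounded by MUSIC on both sides.
    for i, (label, duration) in enumerate(segments):
        if label == "SPEECH":
            prev_music = i > 0 and segments[i - 1][0] == "MUSIC"
            next_music = i + 1 < len(segments) and segments[i + 1][0] == "MUSIC"
            if prev_music and next_music and duration <= max_announcement_s:
                return True
    return False
```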
## LISTEN Step + LLM Fallback
This is the most interesting step type. When the Hold Slayer encounters a LISTEN step in a call flow:
```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
await sip_engine.send_dtmf(call_id, step.dtmf)
# No hardcoded DTMF? Ask the LLM.
else:
transcript = await transcription.transcribe(audio)
decision = await llm_client.analyze_ivr_menu(
transcript=transcript,
intent=intent,
previous_selections=previous_steps,
)
if decision.get("action") == "dtmf":
await sip_engine.send_dtmf(call_id, decision["digits"])
```
The LLM receives:
- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)
And returns structured JSON:
```json
{
"action": "dtmf",
"digits": "1",
"reasoning": "Billing is the correct department for charge disputes"
}
```
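Since the LLM's reply is untrusted input, it should be validated before digits reach the SIP engine. A minimal defensive parse might look like this (the function name and fallback shape are assumptions, not the project's actual API):

```python
import json

VALID_DTMF = set("0123456789*#")

def parse_ivr_decision(raw: str) -> dict:
    # Reject malformed JSON, unknown actions, and non-DTMF characters,
    # falling back to a no-op decision rather than raising.
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "none"}
    if not isinstance(decision, dict) or decision.get("action") != "dtmf":
        return {"action": "none"}
    digits = str(decision.get("digits", ""))
    if not digits or not set(digits) <= VALID_DTMF:
        return {"action": "none"}
    return {"action": "dtmf", "digits": digits}
```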
## Event Publishing
The Hold Slayer publishes events throughout the process:
| Event | When |
|-------|------|
| `IVR_STEP` | Each step in the call flow is executed |
| `IVR_DTMF_SENT` | DTMF digits are sent |
| `IVR_MENU_DETECTED` | An IVR menu prompt is transcribed |
| `HOLD_DETECTED` | Hold music is detected |
| `HUMAN_DETECTED` | Live human speech detected after hold |
| `TRANSFER_STARTED` | Call bridge initiated to user's device |
| `TRANSFER_COMPLETE` | User's device answered, bridge active |
All events flow through the EventBus to WebSocket clients, MCP server, notification service, and analytics.
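The asyncio pub/sub pattern behind this fan-out can be sketched in a few lines. This is a minimal illustration, not the project's actual `EventBus` class.

```python
import asyncio
from collections import defaultdict

class EventBus:
    # Minimal async pub/sub: handlers register per event type and are
    # awaited in order when that event is published.
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self._subscribers[event_type].append(handler)

    async def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            await handler(payload)
```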
## Configuration
| Setting | Description | Default |
|---------|-------------|---------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold before giving up | `7200` (2 hours) |
| `HOLD_CHECK_INTERVAL` | Seconds between audio classification checks | `2.0` |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to when human detected | `sip_phone` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` |
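A settings loader mirroring the table's defaults could read these from the environment; this is a sketch using the documented names and defaults, not the project's actual configuration code.

```python
import os

def load_settings(env=os.environ) -> dict:
    # Fall back to the documented defaults when a variable is unset.
    return {
        "MAX_HOLD_TIME": float(env.get("MAX_HOLD_TIME", 7200)),
        "HOLD_CHECK_INTERVAL": float(env.get("HOLD_CHECK_INTERVAL", 2.0)),
        "DEFAULT_TRANSFER_DEVICE": env.get("DEFAULT_TRANSFER_DEVICE", "sip_phone"),
        "CLASSIFIER_WINDOW_SECONDS": float(env.get("CLASSIFIER_WINDOW_SECONDS", 3.0)),
    }
```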