feat: add initial Hold Slayer AI telephony gateway implementation

Complete project scaffolding and core implementation of an AI-powered
telephony system that calls companies, navigates IVR menus, waits on
hold, and transfers to the user when a human answers.

Key components:
- FastAPI server with REST API, WebSocket, and MCP (SSE) interfaces
- SIP/VoIP call management via PJSUA2 with RTP audio streaming
- LLM-powered IVR navigation using OpenAI/Anthropic with tool calling
- Hold detection service combining audio analysis and silence detection
- Real-time STT (Whisper/Deepgram) and TTS (OpenAI/Piper) pipelines
- Call recording with per-channel and mixed audio capture
- Event bus (asyncio pub/sub) for real-time client updates
- Web dashboard with live call monitoring
- SQLite persistence via SQLAlchemy with call history and analytics
- Notification support (email, SMS, webhook, desktop)
- Docker Compose deployment with Opal VoIP and Opal Media containers
- Comprehensive test suite with unit, integration, and E2E tests
- Simplified .gitignore and full project documentation in README

Commit ecf37658ce (parent c9ff60702b), 2026-03-21 19:23:26 +00:00
56 changed files with 11601 additions and 164 deletions

docs/hold-slayer-service.md (new file, 168 lines)
# Hold Slayer Service
The Hold Slayer (`services/hold_slayer.py`) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.
## Two Operating Modes
### 1. Flow-Guided Mode (`run_with_flow`)
When a stored `CallFlow` exists for the number being called, the Hold Slayer follows it step-by-step:
```python
await hold_slayer.run_with_flow(call_id, call_flow)
```
The call flow is a tree of steps (see [Call Flows](call-flows.md)). The Hold Slayer walks through them:
```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s (wait for greeting)
├── Step 2: LISTEN (transcribe → LLM picks option)
├── Step 3: DTMF "2" (press 2 for account services)
├── Step 4: LISTEN (transcribe → LLM picks option)
├── Step 5: DTMF "1" (press 1 for disputes)
├── Step 6: HOLD (wait for human)
└── Step 7: TRANSFER (bridge to your phone)
```
**Step execution logic:**
| Step Type | What Happens |
|-----------|-------------|
| `DTMF` | Send the specified digits via SIP engine |
| `WAIT` | Sleep for the specified duration |
| `LISTEN` | Record and transcribe audio; use the step's hardcoded DTMF if present, otherwise ask the LLM to pick the right option |
| `HOLD` | Monitor audio classification, wait for human detection |
| `SPEAK` | Play a WAV file or TTS audio (for interactive prompts) |
| `TRANSFER` | Bridge the call to the user's device |
### 2. Exploration Mode (`run_exploration`)
When no stored call flow exists, the Hold Slayer explores the IVR autonomously:
```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```
**Exploration loop:**
```
┌─→ Classify audio (3-second window)
│ ├── SILENCE → wait, increment silence counter
│ ├── RINGING → wait for answer
│ ├── MUSIC → hold detected, monitor for transition
│ ├── DTMF → ignore (echo detection)
│ ├── IVR_PROMPT/SPEECH →
│ │ ├── Transcribe the audio
│ │ ├── Send transcript + intent to LLM
│ │ ├── LLM returns: { "action": "dtmf", "digits": "2" }
│ │ └── Send DTMF
│ └── LIVE_HUMAN → human detected!
│ └── TRANSFER
└── Loop until: human detected, max hold time, or call ended
```
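The branching above can be sketched as a pure function over a sequence of per-window classifications. This is a simplified sketch: the `AudioClass` labels mirror the diagram, but the real loop runs against live audio rather than a precomputed list.

```python
from enum import Enum

class AudioClass(Enum):
    SILENCE = "silence"
    RINGING = "ringing"
    MUSIC = "music"
    IVR_PROMPT = "ivr_prompt"
    LIVE_HUMAN = "live_human"

def explore(classifications: list[AudioClass]) -> list[str]:
    # Consume one classification per 3-second window and return the
    # action taken for each, stopping once a human is detected.
    actions = []
    for c in classifications:
        if c in (AudioClass.SILENCE, AudioClass.RINGING):
            actions.append("wait")
        elif c is AudioClass.MUSIC:
            actions.append("monitor_hold")
        elif c is AudioClass.IVR_PROMPT:
            actions.append("transcribe_and_ask_llm")
        elif c is AudioClass.LIVE_HUMAN:
            actions.append("transfer")
            break
    return actions
```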
**Exploration discoveries** are recorded and can be fed into the `CallFlowLearner` to build a reusable flow for next time.
## Human Detection
The critical moment — detecting when a live person picks up after hold:
### Detection Chain
```
AudioClassifier.classify(audio_frame)
├── Feature extraction:
│ ├── RMS energy (loudness)
│ ├── Spectral flatness (noise vs tone)
│ ├── Zero-crossing rate (speech indicator)
│ ├── Dominant frequency
│ └── Spectral centroid
├── Classification: MUSIC, SILENCE, SPEECH, etc.
└── Transition detection:
└── detect_hold_to_human_transition()
├── Check last N classifications
├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
├── Confidence: speech energy > threshold
└── Result: HUMAN_DETECTED event
```
### What triggers a transfer?
The Hold Slayer considers a human detected when:
1. **Classification history** shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT)
2. **Energy threshold** — the speech audio has sufficient RMS energy (not just background noise)
3. **Consecutive speech frames** — at least 2-3 consecutive speech classifications (avoids false positives from hold music announcements like "your call is important to us")
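The three conditions can be sketched as one check over the recent classification history. This is a hypothetical version of `detect_hold_to_human_transition` operating on `(label, rms_energy)` tuples (newest last); the label names and thresholds are assumptions.

```python
HOLD_LIKE = {"MUSIC", "SILENCE"}
SPEECH_LIKE = {"LIVE_HUMAN", "IVR_PROMPT"}

def detect_hold_to_human_transition(
    history: list[tuple[str, float]],
    min_speech_frames: int = 2,
    energy_threshold: float = 0.01,
) -> bool:
    # Require: a hold-like frame immediately before a run of at least
    # min_speech_frames speech-like frames, each above the energy floor.
    if len(history) < min_speech_frames + 1:
        return False
    recent = history[-min_speech_frames:]
    before = history[-(min_speech_frames + 1)]
    return before[0] in HOLD_LIKE and all(
        label in SPEECH_LIKE and energy > energy_threshold
        for label, energy in recent
    )
```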
### False Positive Handling
Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:
1. **Duration check** — Hold announcements are typically short (5-15 seconds). A live agent conversation continues longer.
2. **Pattern matching** — After speech, if audio returns to MUSIC within a few seconds, it was just an announcement.
3. **Transcript analysis** — If transcription is active, the LLM can analyze whether the speech sounds like a recorded announcement vs. a live greeting.
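The duration and pattern checks can be combined into a simple heuristic. This is a hypothetical sketch over `(label, duration_seconds)` segments: a speech burst bracketed by MUSIC and shorter than the announcement cutoff is treated as a recorded announcement, not a live agent.

```python
def is_announcement(
    segments: list[tuple[str, float]],
    max_announcement_s: float = 15.0,
) -> bool:
    # Scan for a short SPEECH segment surrounded by MUSIC on both sides.
    for i, (label, duration) in enumerate(segments):
        if label == "SPEECH":
            prev_music = i > 0 and segments[i - 1][0] == "MUSIC"
            next_music = i + 1 < len(segments) and segments[i + 1][0] == "MUSIC"
            if prev_music and next_music and duration <= max_announcement_s:
                return True
    return False
```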
## LISTEN Step + LLM Fallback
This is the most interesting step type. When the Hold Slayer encounters a LISTEN step in a call flow:
```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
await sip_engine.send_dtmf(call_id, step.dtmf)
# No hardcoded DTMF? Ask the LLM.
else:
transcript = await transcription.transcribe(audio)
decision = await llm_client.analyze_ivr_menu(
transcript=transcript,
intent=intent,
previous_selections=previous_steps,
)
if decision.get("action") == "dtmf":
await sip_engine.send_dtmf(call_id, decision["digits"])
```
The LLM receives:
- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)
And returns structured JSON:
```json
{
"action": "dtmf",
"digits": "1",
"reasoning": "Billing is the correct department for charge disputes"
}
```
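Since the LLM's reply is untrusted input, it should be validated before digits reach the SIP engine. A minimal defensive parse might look like this (the function name and fallback shape are assumptions, not the project's actual API):

```python
import json

VALID_DTMF = set("0123456789*#")

def parse_ivr_decision(raw: str) -> dict:
    # Reject malformed JSON, unknown actions, and non-DTMF characters,
    # falling back to a no-op decision rather than raising.
    try:
        decision = json.loads(raw)
    except json.JSONDecodeError:
        return {"action": "none"}
    if not isinstance(decision, dict) or decision.get("action") != "dtmf":
        return {"action": "none"}
    digits = str(decision.get("digits", ""))
    if not digits or not set(digits) <= VALID_DTMF:
        return {"action": "none"}
    return {"action": "dtmf", "digits": digits}
```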
## Event Publishing
The Hold Slayer publishes events throughout the process:
| Event | When |
|-------|------|
| `IVR_STEP` | Each step in the call flow is executed |
| `IVR_DTMF_SENT` | DTMF digits are sent |
| `IVR_MENU_DETECTED` | An IVR menu prompt is transcribed |
| `HOLD_DETECTED` | Hold music is detected |
| `HUMAN_DETECTED` | Live human speech detected after hold |
| `TRANSFER_STARTED` | Call bridge initiated to user's device |
| `TRANSFER_COMPLETE` | User's device answered, bridge active |
All events flow through the EventBus to WebSocket clients, MCP server, notification service, and analytics.
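The asyncio pub/sub pattern behind this fan-out can be sketched in a few lines. This is a minimal illustration, not the project's actual `EventBus` class.

```python
import asyncio
from collections import defaultdict

class EventBus:
    # Minimal async pub/sub: handlers register per event type and are
    # awaited in order when that event is published.
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type: str, handler) -> None:
        self._subscribers[event_type].append(handler)

    async def publish(self, event_type: str, payload: dict) -> None:
        for handler in self._subscribers[event_type]:
            await handler(payload)
```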
## Configuration
| Setting | Description | Default |
|---------|-------------|---------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold before giving up | `7200` (2 hours) |
| `HOLD_CHECK_INTERVAL` | Seconds between audio classification checks | `2.0` |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to when human detected | `sip_phone` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` |
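A settings loader mirroring the table's defaults could read these from the environment; this is a sketch using the documented names and defaults, not the project's actual configuration code.

```python
import os

def load_settings(env=os.environ) -> dict:
    # Fall back to the documented defaults when a variable is unset.
    return {
        "MAX_HOLD_TIME": float(env.get("MAX_HOLD_TIME", 7200)),
        "HOLD_CHECK_INTERVAL": float(env.get("HOLD_CHECK_INTERVAL", 2.0)),
        "DEFAULT_TRANSFER_DEVICE": env.get("DEFAULT_TRANSFER_DEVICE", "sip_phone"),
        "CLASSIFIER_WINDOW_SECONDS": float(env.get("CLASSIFIER_WINDOW_SECONDS", 3.0)),
    }
```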