# Hold Slayer Service

The Hold Slayer (`services/hold_slayer.py`) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.

## Two Operating Modes

### 1. Flow-Guided Mode (`run_with_flow`)

When a stored `CallFlow` exists for the number being called, the Hold Slayer follows it step by step:

```python
await hold_slayer.run_with_flow(call_id, call_flow)
```

The call flow is a tree of steps (see [Call Flows](call-flows.md)). The Hold Slayer walks through them:

```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s (wait for greeting)
├── Step 2: LISTEN (transcribe → LLM picks option)
├── Step 3: DTMF "2" (press 2 for account services)
├── Step 4: LISTEN (transcribe → LLM picks option)
├── Step 5: DTMF "1" (press 1 for disputes)
├── Step 6: HOLD (wait for human)
└── Step 7: TRANSFER (bridge to your phone)
```

**Step execution logic:**

| Step Type | What Happens |
|-----------|-------------|
| `DTMF` | Send the specified digits via the SIP engine |
| `WAIT` | Sleep for the specified duration |
| `LISTEN` | Record audio, transcribe, then use the hardcoded DTMF if available; otherwise ask the LLM to pick the right option |
| `HOLD` | Monitor audio classification, waiting for human detection |
| `SPEAK` | Play a WAV file or TTS audio (for interactive prompts) |
| `TRANSFER` | Bridge the call to the user's device |
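The dispatch described in the step-execution table could be sketched roughly as below. This is a hypothetical sketch: `Step`, `StepType`, and the engine helpers `monitor_hold` / `bridge_to_user` are assumptions for illustration, not the service's actual classes (only `send_dtmf` appears in this document).

```python
import asyncio
from dataclasses import dataclass
from enum import Enum, auto


class StepType(Enum):
    DTMF = auto()
    WAIT = auto()
    HOLD = auto()
    TRANSFER = auto()


@dataclass
class Step:
    type: StepType
    dtmf: str = ""          # digits for DTMF steps
    duration: float = 0.0   # seconds for WAIT steps


async def execute_step(step: Step, sip_engine, call_id: str) -> None:
    """Dispatch one call-flow step to the right handler (illustrative)."""
    if step.type is StepType.DTMF:
        await sip_engine.send_dtmf(call_id, step.dtmf)
    elif step.type is StepType.WAIT:
        await asyncio.sleep(step.duration)
    elif step.type is StepType.HOLD:
        await sip_engine.monitor_hold(call_id)      # assumed helper
    elif step.type is StepType.TRANSFER:
        await sip_engine.bridge_to_user(call_id)    # assumed helper
```

A real executor would also publish the `IVR_STEP` event per step and handle `LISTEN`/`SPEAK`, which need the transcription and audio services.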
### 2. Exploration Mode (`run_exploration`)

When no stored call flow exists, the Hold Slayer explores the IVR autonomously:

```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```

**Exploration loop:**

```
┌─→ Classify audio (3-second window)
│   ├── SILENCE → wait, increment silence counter
│   ├── RINGING → wait for answer
│   ├── MUSIC → hold detected, monitor for transition
│   ├── DTMF → ignore (echo detection)
│   ├── IVR_PROMPT/SPEECH →
│   │   ├── Transcribe the audio
│   │   ├── Send transcript + intent to LLM
│   │   ├── LLM returns: { "action": "dtmf", "digits": "2" }
│   │   └── Send DTMF
│   └── LIVE_HUMAN → human detected!
│       └── TRANSFER
│
└── Loop until: human detected, max hold time, or call ended
```

**Exploration discoveries** are recorded and can be fed into the `CallFlowLearner` to build a reusable flow for next time.

## Human Detection

The critical moment: detecting when a live person picks up after a hold.

### Detection Chain

```
AudioClassifier.classify(audio_frame)
│
├── Feature extraction:
│   ├── RMS energy (loudness)
│   ├── Spectral flatness (noise vs tone)
│   ├── Zero-crossing rate (speech indicator)
│   ├── Dominant frequency
│   └── Spectral centroid
│
├── Classification: MUSIC, SILENCE, SPEECH, etc.
│
└── Transition detection:
    └── detect_hold_to_human_transition()
        ├── Check last N classifications
        ├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
        ├── Confidence: speech energy > threshold
        └── Result: HUMAN_DETECTED event
```

### What triggers a transfer?

The Hold Slayer considers a human detected when:

1. **Classification history** shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT)
2. **Energy threshold** — the speech audio has sufficient RMS energy (not just background noise)
3. **Consecutive speech frames** — at least 2-3 consecutive speech classifications (this avoids false positives from hold-music announcements like "your call is important to us")

### False Positive Handling

Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:

1. **Duration check** — Hold announcements are typically short (5-15 seconds). A live agent conversation continues longer.
2. **Pattern matching** — After speech, if the audio returns to MUSIC within a few seconds, it was just an announcement.
3. **Transcript analysis** — If transcription is active, the LLM can analyze whether the speech sounds like a recorded announcement or a live greeting.

## LISTEN Step + LLM Fallback

`LISTEN` is the most interesting step type. When the Hold Slayer encounters a LISTEN step in a call flow:

```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
    await sip_engine.send_dtmf(call_id, step.dtmf)
# No hardcoded DTMF? Ask the LLM.
else:
    transcript = await transcription.transcribe(audio)
    decision = await llm_client.analyze_ivr_menu(
        transcript=transcript,
        intent=intent,
        previous_selections=previous_steps,
    )
    if decision.get("action") == "dtmf":
        await sip_engine.send_dtmf(call_id, decision["digits"])
```

The LLM receives:

- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)

And returns structured JSON:

```json
{
  "action": "dtmf",
  "digits": "1",
  "reasoning": "Billing is the correct department for charge disputes"
}
```

## Event Publishing

The Hold Slayer publishes events throughout the process:

| Event | When |
|-------|------|
| `IVR_STEP` | Each step in the call flow is executed |
| `IVR_DTMF_SENT` | DTMF digits are sent |
| `IVR_MENU_DETECTED` | An IVR menu prompt is transcribed |
| `HOLD_DETECTED` | Hold music is detected |
| `HUMAN_DETECTED` | Live human speech is detected after hold |
| `TRANSFER_STARTED` | Call bridge initiated to the user's device |
| `TRANSFER_COMPLETE` | User's device answered; bridge active |

All events flow through the EventBus to WebSocket clients, the MCP server, the notification service, and analytics.

## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold before giving up | `7200` (2 hours) |
| `HOLD_CHECK_INTERVAL` | Seconds between audio classification checks | `2.0` |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to when a human is detected | `sip_phone` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` |