# Hold Slayer Service
The Hold Slayer (`services/hold_slayer.py`) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.

## Two Operating Modes

### 1. Flow-Guided Mode (`run_with_flow`)

When a stored `CallFlow` exists for the number being called, the Hold Slayer follows it step by step:

```python
await hold_slayer.run_with_flow(call_id, call_flow)
```

The call flow is a tree of steps (see [Call Flows](call-flows.md)). The Hold Slayer walks through them:
```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s (wait for greeting)
├── Step 2: LISTEN (transcribe → LLM picks option)
├── Step 3: DTMF "2" (press 2 for account services)
├── Step 4: LISTEN (transcribe → LLM picks option)
├── Step 5: DTMF "1" (press 1 for disputes)
├── Step 6: HOLD (wait for human)
└── Step 7: TRANSFER (bridge to your phone)
```

**Step execution logic:**

| Step Type | What Happens |
|-----------|--------------|
| `DTMF` | Send the specified digits via the SIP engine |
| `WAIT` | Sleep for the specified duration |
| `LISTEN` | Record audio and transcribe it; use the step's hardcoded DTMF if available, otherwise ask the LLM to pick the right option |
| `HOLD` | Monitor audio classification and wait for human detection |
| `SPEAK` | Play a WAV file or TTS audio (for interactive prompts) |
| `TRANSFER` | Bridge the call to the user's device |
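The dispatch over step types can be sketched as follows. This is a minimal, illustrative sketch: the `Step` model, `FakeSipEngine`, and `run_flow` names are stand-ins, not the project's actual classes.

```python
import asyncio
from dataclasses import dataclass

# Illustrative step record; the real CallFlow step model may differ.
@dataclass
class Step:
    type: str
    dtmf: str = ""
    duration: float = 0.0

class FakeSipEngine:
    """Records actions instead of driving a real SIP call."""
    def __init__(self):
        self.actions = []

    async def send_dtmf(self, call_id, digits):
        self.actions.append(("dtmf", digits))

    async def transfer(self, call_id, device):
        self.actions.append(("transfer", device))

async def execute_step(step, call_id, engine):
    # Dispatch mirroring the step-execution table above.
    if step.type == "DTMF":
        await engine.send_dtmf(call_id, step.dtmf)
    elif step.type == "WAIT":
        await asyncio.sleep(step.duration)
    elif step.type == "TRANSFER":
        await engine.transfer(call_id, "sip_phone")
    # LISTEN, HOLD, and SPEAK omitted here: they need the STT,
    # classifier, and TTS services described elsewhere in this document.

async def run_flow(flow, call_id, engine):
    for step in flow:
        await execute_step(step, call_id, engine)
    return engine.actions

engine = FakeSipEngine()
flow = [Step("WAIT", duration=0.01), Step("DTMF", dtmf="2"), Step("TRANSFER")]
print(asyncio.run(run_flow(flow, "call-1", engine)))
# [('dtmf', '2'), ('transfer', 'sip_phone')]
```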
### 2. Exploration Mode (`run_exploration`)

When no stored call flow exists, the Hold Slayer explores the IVR autonomously:
```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```

**Exploration loop:**

```
┌─→ Classify audio (3-second window)
│    ├── SILENCE → wait, increment silence counter
│    ├── RINGING → wait for answer
│    ├── MUSIC → hold detected, monitor for transition
│    ├── DTMF → ignore (echo detection)
│    ├── IVR_PROMPT/SPEECH →
│    │    ├── Transcribe the audio
│    │    ├── Send transcript + intent to LLM
│    │    ├── LLM returns: { "action": "dtmf", "digits": "2" }
│    │    └── Send DTMF
│    └── LIVE_HUMAN → human detected!
│         └── TRANSFER
│
└── Loop until: human detected, max hold time, or call ended
```

**Exploration discoveries** are recorded and can be fed into the `CallFlowLearner` to build a reusable flow for next time.
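The loop above can be sketched in a few lines. The label names mirror the diagram, but the control flow here is an illustrative simplification (transcription and the LLM call are stubbed out as a single `send_dtmf` action):

```python
import asyncio

# Minimal sketch of the exploration loop over classifier labels.
async def explore(labels, max_iterations=20):
    actions = []
    for label in labels[:max_iterations]:
        if label in ("SILENCE", "RINGING"):
            actions.append("wait")          # nothing to act on yet
        elif label == "MUSIC":
            actions.append("monitor_hold")  # on hold: watch for a transition
        elif label in ("IVR_PROMPT", "SPEECH"):
            actions.append("send_dtmf")     # transcribe + ask LLM (stubbed)
        elif label == "LIVE_HUMAN":
            actions.append("transfer")      # human detected: bridge the call
            break
    return actions

labels = ["RINGING", "IVR_PROMPT", "MUSIC", "MUSIC", "LIVE_HUMAN"]
print(asyncio.run(explore(labels)))
# ['wait', 'send_dtmf', 'monitor_hold', 'monitor_hold', 'transfer']
```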
## Human Detection

The critical moment: detecting when a live person picks up after hold.

### Detection Chain

```
AudioClassifier.classify(audio_frame)
│
├── Feature extraction:
│    ├── RMS energy (loudness)
│    ├── Spectral flatness (noise vs. tone)
│    ├── Zero-crossing rate (speech indicator)
│    ├── Dominant frequency
│    └── Spectral centroid
│
├── Classification: MUSIC, SILENCE, SPEECH, etc.
│
└── Transition detection:
     └── detect_hold_to_human_transition()
          ├── Check last N classifications
          ├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
          ├── Confidence: speech energy > threshold
          └── Result: HUMAN_DETECTED event
```
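Two of the features above are simple enough to compute in a few lines. A rough sketch over a frame of PCM samples (pure Python for illustration; a real classifier would vectorize this):

```python
import math

def rms_energy(samples):
    # Root-mean-square amplitude: the loudness of the frame.
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def zero_crossing_rate(samples):
    # Fraction of adjacent sample pairs that change sign; speech tends to
    # have a higher, more variable ZCR than steady hold-music tones.
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)

# A square wave alternating every sample crosses zero at every step.
frame = [1000, -1000] * 80
print(round(zero_crossing_rate(frame), 3))  # 1.0
print(round(rms_energy(frame)))             # 1000
```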
### What triggers a transfer?

The Hold Slayer considers a human detected when:

1. **Classification history** shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT)
2. **Energy threshold**: the speech audio has sufficient RMS energy (not just background noise)
3. **Consecutive speech frames**: at least 2-3 consecutive speech classifications (this avoids false positives from hold-music announcements like "your call is important to us")
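The three conditions above can be sketched as one check over the recent classification history. This is a hypothetical helper in the spirit of `detect_hold_to_human_transition()`; the label sets and the energy threshold are illustrative, not the project's actual values:

```python
from collections import deque

HOLD_LABELS = {"MUSIC", "SILENCE"}
SPEECH_LABELS = {"LIVE_HUMAN", "IVR_PROMPT", "SPEECH"}

def hold_to_human_transition(history, min_speech_frames=2):
    # history: recent (label, rms_energy) pairs, oldest first.
    labels = [label for label, _ in history]
    if len(labels) < min_speech_frames + 1:
        return False
    # Condition 3: the last N frames must all be speech-like.
    if not all(l in SPEECH_LABELS for l in labels[-min_speech_frames:]):
        return False
    # Condition 1: the frame before the speech run must be hold-like.
    if labels[-min_speech_frames - 1] not in HOLD_LABELS:
        return False
    # Condition 2: enough energy to rule out background noise.
    return all(e > 200 for _, e in list(history)[-min_speech_frames:])

history = deque(maxlen=5)
for entry in [("MUSIC", 300), ("MUSIC", 310), ("LIVE_HUMAN", 900), ("LIVE_HUMAN", 880)]:
    history.append(entry)
print(hold_to_human_transition(history))  # True
```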
### False Positive Handling

Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:

1. **Duration check**: hold announcements are typically short (5-15 seconds), while a live agent conversation continues longer.
2. **Pattern matching**: if the audio returns to MUSIC within a few seconds after speech, it was just an announcement.
3. **Transcript analysis**: if transcription is active, the LLM can judge whether the speech sounds like a recorded announcement or a live greeting.
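The duration and pattern rules combine into a single check over the frames observed from the first speech frame onward. A minimal sketch; the frame length and cutoff are illustrative values, not the project's configuration:

```python
SPEECH_LIKE = ("SPEECH", "LIVE_HUMAN", "IVR_PROMPT")

def classify_speech_burst(frames, frame_seconds=2.0, max_announcement_s=15.0):
    """Decide whether a burst of speech frames was a hold announcement.

    frames: classification labels starting at the first speech frame.
    """
    # Length of the initial run of speech-like frames.
    speech_run = 0
    for label in frames:
        if label in SPEECH_LIKE:
            speech_run += 1
        else:
            break
    duration = speech_run * frame_seconds
    # Pattern rule: speech that reverts to MUSIC was just an announcement...
    reverted = speech_run < len(frames) and frames[speech_run] == "MUSIC"
    # ...combined with the duration rule: announcements are short.
    if reverted and duration <= max_announcement_s:
        return "announcement"
    return "possible_live_human"

print(classify_speech_burst(["SPEECH", "SPEECH", "MUSIC"]))  # announcement
print(classify_speech_burst(["LIVE_HUMAN"] * 10))            # possible_live_human
```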
## LISTEN Step + LLM Fallback

The most interesting step type is `LISTEN`. When the Hold Slayer encounters a `LISTEN` step in a call flow:

```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
    await sip_engine.send_dtmf(call_id, step.dtmf)

# No hardcoded DTMF? Ask the LLM.
else:
    transcript = await transcription.transcribe(audio)
    decision = await llm_client.analyze_ivr_menu(
        transcript=transcript,
        intent=intent,
        previous_selections=previous_steps,
    )
    if decision.get("action") == "dtmf":
        await sip_engine.send_dtmf(call_id, decision["digits"])
```

The LLM receives:
- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)

And returns structured JSON:

```json
{
  "action": "dtmf",
  "digits": "1",
  "reasoning": "Billing is the correct department for charge disputes"
}
```
## Event Publishing

The Hold Slayer publishes events throughout the process:

| Event | When |
|-------|------|
| `IVR_STEP` | Each step in the call flow is executed |
| `IVR_DTMF_SENT` | DTMF digits are sent |
| `IVR_MENU_DETECTED` | An IVR menu prompt is transcribed |
| `HOLD_DETECTED` | Hold music is detected |
| `HUMAN_DETECTED` | Live human speech is detected after hold |
| `TRANSFER_STARTED` | Call bridge initiated to the user's device |
| `TRANSFER_COMPLETE` | User's device answered, bridge active |

All events flow through the EventBus to WebSocket clients, the MCP server, the notification service, and analytics.
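A minimal asyncio pub/sub bus in the spirit of the EventBus described above (the class and method names here are illustrative; the real implementation's API may differ):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Fan out events to every subscriber registered for their type."""
    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, event_type, callback):
        self._subscribers[event_type].append(callback)

    async def publish(self, event_type, payload):
        for callback in self._subscribers[event_type]:
            await callback(event_type, payload)

async def main():
    bus = EventBus()
    received = []

    async def on_event(event_type, payload):
        # A WebSocket client or notifier would react here.
        received.append((event_type, payload["call_id"]))

    bus.subscribe("HUMAN_DETECTED", on_event)
    await bus.publish("HUMAN_DETECTED", {"call_id": "call-1"})
    return received

print(asyncio.run(main()))  # [('HUMAN_DETECTED', 'call-1')]
```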
## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold before giving up | `7200` (2 hours) |
| `HOLD_CHECK_INTERVAL` | Seconds between audio classification checks | `2.0` |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to when a human is detected | `sip_phone` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` |