feat: add initial Hold Slayer AI telephony gateway implementation

Complete project scaffolding and core implementation of an AI-powered
telephony system that calls companies, navigates IVR menus, waits on
hold, and transfers to the user when a human answers.

Key components:
- FastAPI server with REST API, WebSocket, and MCP (SSE) interfaces
- SIP/VoIP call management via PJSUA2 with RTP audio streaming
- LLM-powered IVR navigation using OpenAI/Anthropic with tool calling
- Hold detection service combining audio analysis and silence detection
- Real-time STT (Whisper/Deepgram) and TTS (OpenAI/Piper) pipelines
- Call recording with per-channel and mixed audio capture
- Event bus (asyncio pub/sub) for real-time client updates
- Web dashboard with live call monitoring
- SQLite persistence via SQLAlchemy with call history and analytics
- Notification support (email, SMS, webhook, desktop)
- Docker Compose deployment with Opal VoIP and Opal Media containers
- Comprehensive test suite with unit, integration, and E2E tests
- Simplified .gitignore and full project documentation in README
2026-03-21 19:23:26 +00:00
parent c9ff60702b
commit ecf37658ce
56 changed files with 11601 additions and 164 deletions

docs/README.md
@@ -0,0 +1,18 @@
# Hold Slayer Documentation
Comprehensive documentation for the Hold Slayer AI telephony gateway.
## Contents
| Document | Description |
|----------|-------------|
| [Architecture](architecture.md) | System architecture, component diagram, data flow |
| [Core Engine](core-engine.md) | SIP engine, media pipeline, call manager, event bus |
| [Hold Slayer Service](hold-slayer-service.md) | IVR navigation, hold detection, human detection, transfer |
| [Audio Classifier](audio-classifier.md) | Waveform analysis, feature extraction, classification logic |
| [Services](services.md) | LLM client, transcription, recording, analytics, notifications |
| [Call Flows](call-flows.md) | Call flow model, step types, learner, CRUD API |
| [API Reference](api-reference.md) | REST endpoints, WebSocket protocol, request/response schemas |
| [MCP Server](mcp-server.md) | MCP tools and resources for AI assistant integration |
| [Configuration](configuration.md) | Environment variables, settings, deployment options |
| [Development](development.md) | Setup, testing, contributing, project conventions |

docs/api-reference.md
@@ -0,0 +1,378 @@
# API Reference
Hold Slayer exposes a REST API, WebSocket endpoint, and MCP server.
## REST API
Base URL: `http://localhost:8000/api`
### Calls
#### Place an Outbound Call
```
POST /api/calls/outbound
```
**Request:**
```json
{
  "number": "+18005551234",
  "mode": "hold_slayer",
  "intent": "dispute Amazon charge from December 15th",
  "device": "sip_phone",
  "call_flow_id": "chase_bank_disputes",
  "services": {
    "recording": true,
    "transcription": true
  }
}
```
**Call Modes:**
| Mode | Description |
|------|-------------|
| `direct` | Dial and connect to your device immediately |
| `hold_slayer` | Navigate IVR, wait on hold, transfer when human detected |
| `ai_assisted` | Connect with noise cancel, transcription, recording |
**Response:**
```json
{
  "call_id": "call_abc123",
  "status": "trying",
  "number": "+18005551234",
  "mode": "hold_slayer",
  "started_at": "2026-01-15T10:30:00Z"
}
```
#### Launch Hold Slayer
```
POST /api/calls/hold-slayer
```
Convenience endpoint — equivalent to `POST /api/calls/outbound` with `mode=hold_slayer`.
**Request:**
```json
{
  "number": "+18005551234",
  "intent": "dispute Amazon charge from December 15th",
  "call_flow_id": "chase_bank_disputes",
  "transfer_to": "sip_phone"
}
```
#### Get Call Status
```
GET /api/calls/{call_id}
```
**Response:**
```json
{
  "call_id": "call_abc123",
  "status": "on_hold",
  "number": "+18005551234",
  "mode": "hold_slayer",
  "duration": 847,
  "hold_time": 780,
  "audio_type": "music",
  "transcript_excerpt": "...your call is important to us...",
  "classification_history": [
    {"timestamp": 1706000000, "type": "ringing", "confidence": 0.95},
    {"timestamp": 1706000003, "type": "ivr_prompt", "confidence": 0.88},
    {"timestamp": 1706000010, "type": "music", "confidence": 0.92}
  ],
  "services": {"recording": true, "transcription": true}
}
```
#### List Active Calls
```
GET /api/calls
```
**Response:**
```json
{
  "calls": [
    {"call_id": "call_abc123", "status": "on_hold", "number": "+18005551234", "duration": 847},
    {"call_id": "call_def456", "status": "connected", "number": "+18009876543", "duration": 120}
  ],
  "total": 2
}
```
#### End a Call
```
POST /api/calls/{call_id}/hangup
```
#### Transfer a Call
```
POST /api/calls/{call_id}/transfer
```
**Request:**
```json
{
  "device": "sip_phone"
}
```
### Call Flows
#### List Call Flows
```
GET /api/call-flows
GET /api/call-flows?company=Chase+Bank
GET /api/call-flows?tag=banking
```
**Response:**
```json
{
  "flows": [
    {
      "id": "chase_bank_disputes",
      "name": "Chase Bank — Disputes",
      "company": "Chase Bank",
      "phone_number": "+18005551234",
      "step_count": 7,
      "success_count": 12,
      "fail_count": 1,
      "tags": ["banking", "disputes"]
    }
  ]
}
```
#### Get Call Flow
```
GET /api/call-flows/{flow_id}
```
Returns the full call flow with all steps.
#### Create Call Flow
```
POST /api/call-flows
```
**Request:**
```json
{
  "name": "Chase Bank — Disputes",
  "company": "Chase Bank",
  "phone_number": "+18005551234",
  "steps": [
    {"id": "wait", "type": "WAIT", "description": "Wait for greeting", "timeout": 5.0, "next_step": "menu"},
    {"id": "menu", "type": "LISTEN", "description": "Main menu", "next_step": "press3"},
    {"id": "press3", "type": "DTMF", "description": "Account services", "dtmf": "3", "next_step": "hold"},
    {"id": "hold", "type": "HOLD", "description": "Wait for agent", "next_step": "transfer"},
    {"id": "transfer", "type": "TRANSFER", "description": "Connect to user"}
  ]
}
```
#### Update Call Flow
```
PUT /api/call-flows/{flow_id}
```
#### Delete Call Flow
```
DELETE /api/call-flows/{flow_id}
```
### Devices
#### List Registered Devices
```
GET /api/devices
```
**Response:**
```json
{
  "devices": [
    {
      "id": "dev_001",
      "name": "Office SIP Phone",
      "type": "sip_phone",
      "sip_uri": "sip:robert@gateway.helu.ca",
      "is_online": true,
      "priority": 10
    }
  ]
}
```
#### Register a Device
```
POST /api/devices
```
**Request:**
```json
{
  "name": "Office SIP Phone",
  "type": "sip_phone",
  "sip_uri": "sip:robert@gateway.helu.ca",
  "priority": 10,
  "capabilities": ["voice"]
}
```
#### Update Device
```
PUT /api/devices/{device_id}
```
#### Remove Device
```
DELETE /api/devices/{device_id}
```
### Error Responses
All errors follow a consistent format:
```json
{
  "detail": "Call not found: call_xyz789"
}
```
| Status Code | Meaning |
|-------------|---------|
| `400` | Bad request (invalid parameters) |
| `404` | Resource not found (call, flow, device) |
| `409` | Conflict (call already ended, device already registered) |
| `500` | Internal server error |
## WebSocket
### Event Stream
```
ws://localhost:8000/ws/events
ws://localhost:8000/ws/events?call_id=call_abc123
ws://localhost:8000/ws/events?types=human_detected,hold_detected
```
**Query Parameters:**
| Param | Description |
|-------|-------------|
| `call_id` | Filter events for a specific call |
| `types` | Comma-separated event types to receive |
**Event Format:**
```json
{
  "type": "hold_detected",
  "call_id": "call_abc123",
  "timestamp": "2026-01-15T10:35:00Z",
  "data": {
    "audio_type": "music",
    "confidence": 0.92,
    "hold_duration": 0
  }
}
```
### Event Types
| Type | Data Fields |
|------|------------|
| `call_started` | `number`, `mode`, `intent` |
| `call_ringing` | `number` |
| `call_connected` | `number`, `duration` |
| `call_ended` | `number`, `duration`, `reason` |
| `call_failed` | `number`, `error` |
| `hold_detected` | `audio_type`, `confidence` |
| `human_detected` | `confidence`, `transcript_excerpt` |
| `transfer_started` | `device`, `from_call_id` |
| `transfer_complete` | `device`, `bridge_id` |
| `ivr_step` | `step_id`, `step_type`, `description` |
| `ivr_dtmf_sent` | `digits`, `step_id` |
| `ivr_menu_detected` | `transcript`, `options` |
| `audio_classified` | `audio_type`, `confidence`, `features` |
| `transcript_chunk` | `text`, `speaker`, `is_final` |
| `recording_started` | `recording_id`, `path` |
| `recording_stopped` | `recording_id`, `duration`, `file_size` |
### Client Example
```javascript
const ws = new WebSocket("ws://localhost:8000/ws/events");
ws.onopen = () => {
  console.log("Connected to Hold Slayer events");
};

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);
  switch (data.type) {
    case "human_detected":
      alert("🚨 A live person picked up! Pick up your phone!");
      break;
    case "hold_detected":
      console.log("⏳ On hold...");
      break;
    case "transcript_chunk":
      console.log(`📝 ${data.data.speaker}: ${data.data.text}`);
      break;
  }
};

ws.onerror = (error) => {
  console.error("WebSocket error:", error);
};
```
### Python Client Example
```python
import asyncio
import json

import websockets

async def listen():
    async with websockets.connect("ws://localhost:8000/ws/events") as ws:
        async for message in ws:
            event = json.loads(message)
            print(f"[{event['type']}] {event.get('data', {})}")

asyncio.run(listen())
```

docs/architecture.md
@@ -0,0 +1,178 @@
# Architecture
Hold Slayer is a single-process async Python application built on FastAPI. It acts as an intelligent B2BUA (Back-to-Back User Agent) sitting between your SIP trunk (PSTN access) and your desk phone/softphone.
## System Diagram
```
┌─────────────────────────────────────────────────────────────────┐
│ FastAPI Server │
│ │
│ ┌──────────┐ ┌──────────┐ ┌───────────┐ ┌──────────────┐ │
│ │ REST API │ │WebSocket │ │MCP Server │ │ Dashboard │ │
│ │ /api/* │ │ /ws/* │ │ (SSE) │ │ /dashboard │ │
│ └────┬─────┘ └────┬─────┘ └─────┬─────┘ └──────────────┘ │
│ │ │ │ │
│ ┌────┴──────────────┴──────────────┴────┐ │
│ │ Event Bus │ │
│ │ (asyncio Queue pub/sub per client) │ │
│ └────┬──────────────┬──────────────┬────┘ │
│ │ │ │ │
│ ┌────┴─────┐ ┌─────┴─────┐ ┌────┴──────────┐ │
│ │ Call │ │ Hold │ │ Services │ │
│ │ Manager │ │ Slayer │ │ (LLM, STT, │ │
│ │ │ │ │ │ Recording, │ │
│ │ │ │ │ │ Analytics, │ │
│ │ │ │ │ │ Notify) │ │
│ └────┬─────┘ └─────┬─────┘ └──────────────┘ │
│ │ │ │
│ ┌────┴──────────────┴───────────────────┐ │
│ │ Sippy B2BUA Engine │ │
│ │ (SIP calls, DTMF, conference bridge) │ │
│ └────┬──────────────────────────────────┘ │
│ │ │
└───────┼─────────────────────────────────────────────────────────┘
┌────┴────┐
│SIP Trunk│ ──→ PSTN
└─────────┘
```
## Component Overview
### Presentation Layer
| Component | File | Protocol | Purpose |
|-----------|------|----------|---------|
| REST API | `api/calls.py`, `api/call_flows.py`, `api/devices.py` | HTTP | Call management, CRUD, configuration |
| WebSocket | `api/websocket.py` | WS | Real-time event streaming to clients |
| MCP Server | `mcp_server/server.py` | SSE | AI assistant tool integration |
### Orchestration Layer
| Component | File | Purpose |
|-----------|------|---------|
| Gateway | `core/gateway.py` | Top-level orchestrator — owns all services, routes calls |
| Call Manager | `core/call_manager.py` | Active call state, lifecycle, transcript tracking |
| Event Bus | `core/event_bus.py` | Async pub/sub connecting everything together |
### Intelligence Layer
| Component | File | Purpose |
|-----------|------|---------|
| Hold Slayer | `services/hold_slayer.py` | IVR navigation, hold monitoring, human detection |
| Audio Classifier | `services/audio_classifier.py` | Real-time waveform analysis (music/speech/DTMF/silence) |
| LLM Client | `services/llm_client.py` | OpenAI-compatible LLM for IVR menu decisions |
| Transcription | `services/transcription.py` | Speaches/Whisper STT for live audio |
| Call Flow Learner | `services/call_flow_learner.py` | Builds reusable IVR trees from exploration data |
### Infrastructure Layer
| Component | File | Purpose |
|-----------|------|---------|
| Sippy Engine | `core/sippy_engine.py` | SIP signaling (INVITE, BYE, REGISTER, DTMF) |
| Media Pipeline | `core/media_pipeline.py` | PJSUA2 RTP media handling, conference bridge, recording |
| Recording | `services/recording.py` | WAV file management and storage |
| Analytics | `services/call_analytics.py` | Call metrics, hold time stats, trends |
| Notifications | `services/notification.py` | WebSocket + SMS alerts |
| Database | `db/database.py` | SQLAlchemy async (PostgreSQL or SQLite) |
## Data Flow — Hold Slayer Call
```
1. User Request
POST /api/calls/hold-slayer { number, intent, call_flow_id }
2. Gateway.make_call()
├── CallManager.create_call() → track state
├── SippyEngine.make_call() → SIP INVITE to trunk
└── MediaPipeline.add_stream() → RTP media setup
3. HoldSlayer.run_with_flow() or run_exploration()
├── AudioClassifier.classify() → analyze 3s audio windows
│ ├── silence? → wait
│ ├── ringing? → wait
│ ├── DTMF? → detect tones
│ ├── music? → HOLD_DETECTED event
│ └── speech? → transcribe + decide
├── TranscriptionService.transcribe() → STT on speech audio
├── LLMClient.analyze_ivr_menu() → pick menu option (fallback)
│ └── SippyEngine.send_dtmf() → press the button
└── detect_hold_to_human_transition()
└── HUMAN_DETECTED! → transfer
4. Transfer
├── SippyEngine.bridge() → connect call legs
├── MediaPipeline.bridge_streams() → bridge RTP
├── EventBus.publish(TRANSFER_STARTED)
└── NotificationService → "Pick up your phone!"
5. Real-Time Updates (throughout)
EventBus.publish() → WebSocket clients
→ MCP server resources
→ Notification service
→ Analytics tracking
```
## Threading Model
Hold Slayer is primarily single-threaded async (asyncio), with one exception: the Sippy event loop.
- **Main thread**: FastAPI + all async services (event bus, hold slayer, classifier, etc.)
- **Sippy thread**: Sippy B2BUA runs its own event loop in a dedicated daemon thread. The `SippyEngine` bridges async↔sync via `loop.run_in_executor()`.
- **PJSUA2**: runs in the main thread using the null audio device (no sound card needed — headless server mode).
```
Main Thread (asyncio)
├── FastAPI (uvicorn)
├── EventBus
├── CallManager
├── HoldSlayer
├── AudioClassifier
├── TranscriptionService
├── LLMClient
├── MediaPipeline (PJSUA2)
├── NotificationService
└── RecordingService
Sippy Thread (daemon)
└── Sippy B2BUA event loop
├── SIP signaling
├── DTMF relay
└── Call leg management
```
## Design Decisions
### Why Sippy B2BUA + PJSUA2?
We split SIP signaling and media handling into two separate libraries:
- **Sippy B2BUA** handles SIP signaling (INVITE, BYE, REGISTER, re-INVITE, DTMF relay). It's battle-tested for telephony and handles the complex SIP state machine.
- **PJSUA2** handles RTP media (audio streams, conference bridge, recording, tone generation). It provides a clean C++/Python API for media manipulation without needing to deal with raw RTP.
This split lets us tap into the audio stream (for classification and STT) without interfering with SIP signaling, and bridge calls through a conference bridge for clean transfer.
### Why asyncio Queue-based EventBus?
- **Single process** — no need for Redis/RabbitMQ cross-process messaging
- **Zero dependencies** — pure asyncio, no external services to deploy
- **Per-subscriber queues** — slow consumers don't block fast publishers
- **Dead subscriber cleanup** — full queues are automatically removed
- **Event history** — late joiners can catch up on recent events
If scaling to multiple gateway processes becomes necessary, the EventBus interface can be backed by Redis pub/sub without changing consumers.
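The pattern above can be sketched in a few lines. This is an illustrative reduction, not the real `core/event_bus.py` (class and method names here are assumptions):

```python
import asyncio

class EventBus:
    """Per-subscriber-queue pub/sub (illustrative, not the production API)."""

    def __init__(self, max_queue: int = 100) -> None:
        self._queues: set[asyncio.Queue] = set()
        self._max_queue = max_queue

    def subscribe(self) -> asyncio.Queue:
        """Each subscriber gets its own bounded queue."""
        q: asyncio.Queue = asyncio.Queue(maxsize=self._max_queue)
        self._queues.add(q)
        return q

    def publish(self, event: dict) -> None:
        """Fan out without awaiting; a full queue marks a dead subscriber."""
        for q in list(self._queues):
            try:
                q.put_nowait(event)
            except asyncio.QueueFull:
                self._queues.discard(q)

async def demo() -> dict:
    bus = EventBus()
    inbox = bus.subscribe()
    bus.publish({"type": "hold_detected", "call_id": "call_abc123"})
    return await inbox.get()
```

Because `publish` never awaits, a stalled WebSocket client cannot back-pressure the classifier or the SIP engine.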
### Why OpenAI-compatible LLM API?
The LLM client uses raw HTTP (httpx) against any OpenAI-compatible endpoint. This means:
- **Ollama** (local, free) — `http://localhost:11434/v1`
- **LM Studio** (local, free) — `http://localhost:1234/v1`
- **vLLM** (local, fast) — `http://localhost:8000/v1`
- **OpenAI** (cloud) — `https://api.openai.com/v1`
No SDK dependency. No vendor lock-in. Switch models by changing one env var.
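Concretely, every request is a plain `POST {base_url}/chat/completions` with the standard OpenAI chat-completions payload. A stdlib sketch (the production client uses httpx and tool calling; the helper below is illustrative):

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str,
                       api_key: str = "not-needed") -> urllib.request.Request:
    """Build an OpenAI-compatible chat-completions request."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.3,
        "max_tokens": 1024,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )

def complete(req: urllib.request.Request) -> str:
    """Send the request and return the assistant message text."""
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Pointing `base_url` at Ollama, LM Studio, vLLM, or OpenAI is the only change needed between backends.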

docs/audio-classifier.md
@@ -0,0 +1,174 @@
# Audio Classifier
The Audio Classifier (`services/audio_classifier.py`) performs real-time waveform analysis on phone audio to determine what's happening on the call: silence, ringing, hold music, IVR prompts, DTMF tones, or live human speech.
## Classification Types
```python
class AudioClassification(str, Enum):
    SILENCE = "silence"          # No meaningful audio
    MUSIC = "music"              # Hold music
    IVR_PROMPT = "ivr_prompt"    # Recorded voice menu
    LIVE_HUMAN = "live_human"    # Live person speaking
    RINGING = "ringing"          # Ringback tone
    DTMF = "dtmf"                # Touch-tone digits
    UNKNOWN = "unknown"          # Can't classify
```
## Feature Extraction
Every audio frame (typically 3 seconds of 16kHz PCM) goes through feature extraction:
| Feature | What It Measures | How It's Used |
|---------|-----------------|---------------|
| **RMS Energy** | Loudness (root mean square of samples) | Silence detection — below threshold = silence |
| **Spectral Flatness** | How noise-like vs tonal the audio is (0=pure tone, 1=white noise) | Music has low flatness (tonal), speech has higher flatness |
| **Zero-Crossing Rate** | How often the waveform crosses zero | Speech has moderate ZCR, tones have very regular ZCR |
| **Dominant Frequency** | Strongest frequency component (via FFT) | Ringback detection (440Hz), DTMF detection |
| **Spectral Centroid** | "Center of mass" of the frequency spectrum | Speech has higher centroid than music |
| **Tonality** | Whether the audio is dominated by a single frequency | Tones/DTMF are highly tonal, speech is not |
### Feature Extraction Code
```python
def _extract_features(self, audio: np.ndarray) -> dict:
    rms = np.sqrt(np.mean(audio ** 2))

    # FFT for frequency analysis
    fft = np.fft.rfft(audio)
    magnitude = np.abs(fft)
    freqs = np.fft.rfftfreq(len(audio), 1.0 / self._sample_rate)

    # Spectral flatness: geometric mean / arithmetic mean of the magnitude spectrum
    spectral_flatness = np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

    # Zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2

    # Dominant frequency
    dominant_freq = freqs[np.argmax(magnitude)]

    # Spectral centroid
    spectral_centroid = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)

    return {
        "rms": rms,
        "spectral_flatness": spectral_flatness,
        "zcr": zcr,
        "dominant_freq": dominant_freq,
        "spectral_centroid": spectral_centroid,
    }
```
## Classification Logic
Classification follows a priority chain:
```
1. SILENCE — RMS below threshold?
└── Yes → SILENCE (confidence based on how quiet)
2. DTMF — Goertzel algorithm detects dual-tone pairs?
└── Yes → DTMF (with detected digit in details)
3. RINGING — Dominant frequency near 440Hz + tonal?
└── Yes → RINGING
4. SPEECH vs MUSIC discrimination:
├── High spectral flatness + moderate ZCR → LIVE_HUMAN or IVR_PROMPT
│ └── _looks_like_live_human() checks history for hold→speech transition
│ ├── Yes → LIVE_HUMAN
│ └── No → IVR_PROMPT
└── Low spectral flatness + tonal → MUSIC
```
### DTMF Detection
Uses the Goertzel algorithm to detect the dual-tone pairs that make up DTMF digits:
```
1209 Hz 1336 Hz 1477 Hz 1633 Hz
697 Hz 1 2 3 A
770 Hz 4 5 6 B
852 Hz 7 8 9 C
941 Hz * 0 # D
```
Each DTMF digit is two simultaneous frequencies. The Goertzel algorithm efficiently checks for the presence of each specific frequency without computing a full FFT.
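A single-bin Goertzel check and a naive digit picker fit in pure Python. This is an illustrative sketch; the production detector additionally applies magnitude thresholds and digit debouncing, which are omitted here:

```python
import math

DTMF_LOW = [697, 770, 852, 941]
DTMF_HIGH = [1209, 1336, 1477, 1633]
# Row-major digit grid matching the table above.
DIGITS = ["1", "2", "3", "A", "4", "5", "6", "B", "7", "8", "9", "C", "*", "0", "#", "D"]

def goertzel_power(samples: list[float], sample_rate: int, freq: float) -> float:
    """Relative power of one target frequency in the window (Goertzel algorithm)."""
    n = len(samples)
    k = round(n * freq / sample_rate)          # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1       # second-order recursion
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def detect_dtmf(samples: list[float], sample_rate: int = 8000) -> str:
    """Pick the strongest low/high pair and map it to a digit (no thresholding)."""
    low = max(DTMF_LOW, key=lambda f: goertzel_power(samples, sample_rate, f))
    high = max(DTMF_HIGH, key=lambda f: goertzel_power(samples, sample_rate, f))
    return DIGITS[DTMF_LOW.index(low) * 4 + DTMF_HIGH.index(high)]
```

Only the eight DTMF frequencies need checking, which is why Goertzel beats a full FFT here.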
### Hold-to-Human Transition
The most critical detection — when a live person picks up after hold music:
```python
def detect_hold_to_human_transition(self) -> bool:
    """
    Check classification history for the pattern:
        MUSIC, MUSIC, MUSIC, ... → LIVE_HUMAN/IVR_PROMPT
    Requires:
        - At least 3 recent MUSIC classifications
        - Followed by 2+ speech classifications
        - Speech has sufficient energy (not just noise)
    """
    recent = self._history[-10:]

    # Find the transition point
    music_count = 0
    speech_count = 0
    for result in recent:
        if result.audio_type == AudioClassification.MUSIC:
            music_count += 1
            speech_count = 0  # reset: speech must directly follow the music run
        elif result.audio_type in (AudioClassification.LIVE_HUMAN, AudioClassification.IVR_PROMPT):
            speech_count += 1
    return music_count >= 3 and speech_count >= 2
```
## Classification Result
Each classification returns:
```python
@dataclass
class ClassificationResult:
    timestamp: float
    audio_type: AudioClassification
    confidence: float   # 0.0 to 1.0
    details: dict       # Feature values, detected frequencies, etc.
```
The `details` dict includes all extracted features, making it available for debugging and analytics:
```python
{
    "rms": 0.0423,
    "spectral_flatness": 0.15,
    "zcr": 0.087,
    "dominant_freq": 440.0,
    "spectral_centroid": 523.7,
    "is_tonal": True
}
```
## Configuration
| Setting | Description | Default |
|---------|-------------|---------|
| `CLASSIFIER_MUSIC_THRESHOLD` | Spectral flatness below this = music | `0.7` |
| `CLASSIFIER_SPEECH_THRESHOLD` | Spectral flatness above this = speech | `0.6` |
| `CLASSIFIER_SILENCE_THRESHOLD` | RMS below this = silence | `0.85` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for each classification | `3.0` |
## Testing
The audio classifier has 18 unit tests covering:
- Silence detection (pure silence, very quiet, empty audio)
- Tone detection (440Hz ringback, 1000Hz test tone)
- DTMF detection (digit 5, digit 0)
- Speech detection (speech-like waveforms)
- Classification history (hold→human transition, IVR non-transition)
- Feature extraction (RMS, ZCR, spectral flatness, dominant frequency)
```bash
pytest tests/test_audio_classifier.py -v
```
> **Known issue:** `test_complex_tone_as_music` is a known edge case where a multi-harmonic synthetic tone is classified as `LIVE_HUMAN` instead of `MUSIC`. This is acceptable — real hold music has different characteristics than synthetic test signals.

docs/call-flows.md
@@ -0,0 +1,233 @@
# Call Flows
Call flows are reusable IVR navigation trees that tell Hold Slayer exactly how to navigate a company's phone menu. Once a flow is learned (manually or via exploration), subsequent calls to the same number skip the LLM analysis and follow the stored steps directly.
## Data Model
### CallFlowStep
A single step in the IVR navigation:
```python
class CallFlowStep(BaseModel):
    id: str                           # Unique step identifier
    type: CallFlowStepType            # DTMF, WAIT, LISTEN, HOLD, SPEAK, TRANSFER
    description: str                  # Human-readable description
    dtmf: Optional[str] = None        # Digits to press (for DTMF steps)
    timeout: float = 10.0             # Max seconds to wait
    next_step: Optional[str] = None   # ID of the next step
    conditions: dict = {}             # Conditional branching rules
    metadata: dict = {}               # Extra data (transcript patterns, etc.)
```
### Step Types
| Type | Purpose | Key Fields |
|------|---------|------------|
| `DTMF` | Press touch-tone digits | `dtmf="3"` |
| `WAIT` | Pause for a duration | `timeout=5.0` |
| `LISTEN` | Record + transcribe + decide | `timeout=15.0`, optional `dtmf` for hardcoded response |
| `HOLD` | Wait on hold, monitor for human | `timeout=7200` (max hold time) |
| `SPEAK` | Play audio to the call | `metadata={"audio_file": "greeting.wav"}` |
| `TRANSFER` | Bridge call to user's device | `metadata={"device": "sip_phone"}` |
### CallFlow
A complete IVR navigation tree:
```python
class CallFlow(BaseModel):
    id: str                           # "chase_bank_main"
    name: str                         # "Chase Bank — Main Menu"
    company: Optional[str]            # "Chase Bank"
    phone_number: Optional[str]       # "+18005551234"
    description: Optional[str]        # "Navigate to disputes department"
    steps: list[CallFlowStep]         # Ordered list of steps
    created_at: datetime
    updated_at: datetime
    version: int = 1
    tags: list[str] = []              # ["banking", "disputes"]
    success_count: int = 0            # Times this flow succeeded
    fail_count: int = 0               # Times this flow failed
```
## Example Call Flow
```json
{
  "id": "chase_bank_disputes",
  "name": "Chase Bank — Disputes",
  "company": "Chase Bank",
  "phone_number": "+18005551234",
  "steps": [
    {
      "id": "wait_greeting",
      "type": "WAIT",
      "description": "Wait for greeting to finish",
      "timeout": 5.0,
      "next_step": "main_menu"
    },
    {
      "id": "main_menu",
      "type": "LISTEN",
      "description": "Listen to main menu options",
      "timeout": 15.0,
      "next_step": "press_3"
    },
    {
      "id": "press_3",
      "type": "DTMF",
      "description": "Press 3 for account services",
      "dtmf": "3",
      "next_step": "sub_menu"
    },
    {
      "id": "sub_menu",
      "type": "LISTEN",
      "description": "Listen to account services sub-menu",
      "timeout": 15.0,
      "next_step": "press_1"
    },
    {
      "id": "press_1",
      "type": "DTMF",
      "description": "Press 1 for disputes",
      "dtmf": "1",
      "next_step": "hold"
    },
    {
      "id": "hold",
      "type": "HOLD",
      "description": "Wait on hold for disputes agent",
      "timeout": 7200,
      "next_step": "transfer"
    },
    {
      "id": "transfer",
      "type": "TRANSFER",
      "description": "Transfer to user's phone"
    }
  ]
}
```
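Execution is a pointer chase over `next_step`. A minimal walker (illustrative; the real runner executes each step type against the live call rather than just visiting it):

```python
def walk_flow(flow: dict) -> list[str]:
    """Follow next_step pointers from the first step; return visited step types."""
    steps = {step["id"]: step for step in flow["steps"]}
    visited: list[str] = []
    current = flow["steps"][0]["id"]
    while current is not None:
        step = steps[current]
        visited.append(step["type"])
        current = step.get("next_step")   # None on the terminal step
    return visited
```

Applied to the example flow above, the visit order is WAIT → LISTEN → DTMF → LISTEN → DTMF → HOLD → TRANSFER.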
## Call Flow Learner (`services/call_flow_learner.py`)
Automatically builds call flows from exploration data.
### How It Works
1. **Exploration mode** records "discoveries" — what the Hold Slayer encountered and did at each step
2. The learner converts discoveries into `CallFlowStep` objects
3. Steps are ordered and linked (`next_step` pointers)
4. The resulting `CallFlow` is saved for future calls
### Discovery Types
| Discovery | Becomes Step |
|-----------|-------------|
| Heard IVR prompt, pressed DTMF | `LISTEN` → `DTMF` |
| Detected hold music | `HOLD` |
| Detected silence (waiting) | `WAIT` |
| Heard speech (human) | `TRANSFER` |
| Sent DTMF digits | `DTMF` |
### Building a Flow
```python
learner = CallFlowLearner()

# After an exploration call completes:
discoveries = [
    {"type": "wait", "duration": 3.0, "description": "Initial silence"},
    {"type": "ivr_menu", "transcript": "Press 1 for billing...", "dtmf_sent": "1"},
    {"type": "ivr_menu", "transcript": "Press 3 for disputes...", "dtmf_sent": "3"},
    {"type": "hold", "duration": 480.0},
    {"type": "human_detected", "transcript": "Thank you for calling..."},
]

flow = learner.build_flow(
    discoveries=discoveries,
    phone_number="+18005551234",
    company="Chase Bank",
    intent="dispute a charge",
)
# Returns a CallFlow with 5 steps: WAIT → LISTEN/DTMF → LISTEN/DTMF → HOLD → TRANSFER
```
### Merging Discoveries
When the same number is called again with exploration, new discoveries can be merged into the existing flow:
```python
updated_flow = learner.merge_discoveries(
    existing_flow=flow,
    new_discoveries=new_discoveries,
)
```
This handles:
- New menu options discovered
- Changed IVR structure
- Updated timing information
- Success/failure tracking
## REST API
### List Call Flows
```
GET /api/call-flows
GET /api/call-flows?company=Chase+Bank
GET /api/call-flows?tag=banking
```
### Get Call Flow
```
GET /api/call-flows/{flow_id}
```
### Create Call Flow
```
POST /api/call-flows
Content-Type: application/json
{
  "name": "Chase Bank — Disputes",
  "company": "Chase Bank",
  "phone_number": "+18005551234",
  "steps": [ ... ]
}
```
### Update Call Flow
```
PUT /api/call-flows/{flow_id}
Content-Type: application/json
{ ... updated flow ... }
```
### Delete Call Flow
```
DELETE /api/call-flows/{flow_id}
```
### Learn Flow from Exploration
```
POST /api/call-flows/learn
Content-Type: application/json
{
  "call_id": "call_abc123",
  "phone_number": "+18005551234",
  "company": "Chase Bank"
}
```
This triggers the Call Flow Learner to build a flow from the call's exploration data.

docs/configuration.md
@@ -0,0 +1,165 @@
# Configuration
All configuration is via environment variables, loaded through Pydantic Settings. Copy `.env.example` to `.env` and edit.
## Environment Variables
### SIP Trunk
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `SIP_TRUNK_HOST` | Your SIP provider hostname | — | Yes |
| `SIP_TRUNK_PORT` | SIP signaling port | `5060` | No |
| `SIP_TRUNK_USERNAME` | SIP auth username | — | Yes |
| `SIP_TRUNK_PASSWORD` | SIP auth password | — | Yes |
| `SIP_TRUNK_DID` | Your phone number (E.164) | — | Yes |
| `SIP_TRUNK_TRANSPORT` | Transport protocol (`udp`, `tcp`, `tls`) | `udp` | No |
### Gateway
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `GATEWAY_SIP_PORT` | Port for device SIP registration | `5080` | No |
| `GATEWAY_RTP_PORT_MIN` | Minimum RTP port | `10000` | No |
| `GATEWAY_RTP_PORT_MAX` | Maximum RTP port | `20000` | No |
| `GATEWAY_HOST` | Bind address | `0.0.0.0` | No |
### LLM
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `LLM_BASE_URL` | OpenAI-compatible API endpoint | `http://localhost:11434/v1` | No |
| `LLM_MODEL` | Model name for IVR analysis | `llama3` | No |
| `LLM_API_KEY` | API key (if required) | `not-needed` | No |
| `LLM_TIMEOUT` | Request timeout in seconds | `30.0` | No |
| `LLM_MAX_TOKENS` | Max tokens per response | `1024` | No |
| `LLM_TEMPERATURE` | Sampling temperature | `0.3` | No |
### Speech-to-Text
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `SPEACHES_URL` | Speaches/Whisper STT endpoint | `http://localhost:22070` | No |
| `SPEACHES_MODEL` | Whisper model name | `whisper-large-v3` | No |
### Database
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `DATABASE_URL` | PostgreSQL or SQLite connection string | `sqlite+aiosqlite:///./hold_slayer.db` | No |
### Notifications
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `NOTIFY_SMS_NUMBER` | Phone number for SMS alerts (E.164) | — | No |
### Audio Classifier
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` | No |
| `CLASSIFIER_SILENCE_THRESHOLD` | RMS below this = silence | `0.85` | No |
| `CLASSIFIER_MUSIC_THRESHOLD` | Spectral flatness below this = music | `0.7` | No |
| `CLASSIFIER_SPEECH_THRESHOLD` | Spectral flatness above this = speech | `0.6` | No |
### Hold Slayer
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold | `7200` | No |
| `HOLD_CHECK_INTERVAL` | Seconds between audio checks | `2.0` | No |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to | `sip_phone` | No |
### Recording
| Variable | Description | Default | Required |
|----------|-------------|---------|----------|
| `RECORDING_DIR` | Directory for WAV recordings | `recordings` | No |
| `RECORDING_MAX_SECONDS` | Maximum recording duration | `7200` | No |
| `RECORDING_SAMPLE_RATE` | Audio sample rate | `16000` | No |
## Settings Architecture
Configuration is managed by Pydantic Settings in `config.py`:
```python
from config import get_settings
settings = get_settings()
settings.sip_trunk_host # "sip.provider.com"
settings.llm.base_url # "http://localhost:11434/v1"
settings.llm.model # "llama3"
settings.speaches_url # "http://localhost:22070"
settings.database_url # "sqlite+aiosqlite:///./hold_slayer.db"
```
LLM settings are nested under `settings.llm` as a `LLMSettings` sub-model.
## Deployment
### Development
```bash
# 1. Clone and install
git clone <repo-url>
cd hold-slayer
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
# 2. Configure
cp .env.example .env
# Edit .env
# 3. Start Ollama (for LLM)
ollama serve
ollama pull llama3
# 4. Start Speaches (for STT)
docker run -p 22070:8000 ghcr.io/speaches-ai/speaches
# 5. Run
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
```
### Production
```bash
# Use PostgreSQL instead of SQLite
DATABASE_URL=postgresql+asyncpg://user:pass@localhost/hold_slayer
# Use vLLM for faster inference
LLM_BASE_URL=http://localhost:8000/v1
LLM_MODEL=meta-llama/Llama-3-8B-Instruct
# Run a single worker (each worker would own its own SIP engine and call state)
uvicorn main:app --host 0.0.0.0 --port 8000 --workers 1
```
Note: Hold Slayer is designed as a single-process application. Multiple workers would each have their own SIP engine and call state. For high availability, run behind a load balancer with sticky sessions.
### Docker
```dockerfile
FROM python:3.13-slim
# Install system dependencies for PJSUA2 and Sippy
RUN apt-get update && apt-get install -y \
build-essential \
libpjproject-dev \
&& rm -rf /var/lib/apt/lists/*
WORKDIR /app
COPY . .
RUN pip install -e .
EXPOSE 8000 5080/udp 10000-20000/udp
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
```
Port mapping:
- `8000` — HTTP API + WebSocket + MCP
- `5080/udp` — SIP device registration
- `10000-20000/udp` — RTP media ports
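The same ports expressed as a minimal `docker-compose.yml` sketch (the service name and `env_file` are illustrative, not from the repository):

```yaml
services:
  hold-slayer:
    build: .
    env_file: .env
    ports:
      - "8000:8000"                      # HTTP API + WebSocket + MCP
      - "5080:5080/udp"                  # SIP device registration
      - "10000-20000:10000-20000/udp"    # RTP media ports
```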

---
`docs/core-engine.md` (new file)
# Core Engine
The core engine provides the foundational infrastructure: SIP call control, media handling, call state management, and event distribution.
## SIP Engine (`core/sip_engine.py` + `core/sippy_engine.py`)
### Abstract Interface
All SIP operations go through the `SIPEngine` abstract base class, which defines the contract:
```python
class SIPEngine(ABC):
async def start(self) -> None: ...
async def stop(self) -> None: ...
async def make_call(self, to_uri: str, from_uri: str | None = None) -> str: ...
async def hangup(self, call_id: str) -> None: ...
async def send_dtmf(self, call_id: str, digits: str) -> None: ...
async def bridge(self, call_id_a: str, call_id_b: str) -> None: ...
async def transfer(self, call_id: str, to_uri: str) -> None: ...
async def register(self, ...) -> bool: ...
async def get_trunk_status(self) -> TrunkStatus: ...
```
This abstraction allows:
- **`SippyEngine`** — Production implementation using Sippy B2BUA
- **`MockSIPEngine`** — Test implementation that simulates calls in memory
### Sippy B2BUA Engine
The `SippyEngine` wraps Sippy B2BUA for SIP signaling:
```python
class SippyEngine(SIPEngine):
"""
Production SIP engine using Sippy B2BUA.
Sippy runs its own event loop in a daemon thread.
All async methods bridge to Sippy via run_in_executor().
"""
```
**Key internals:**
| Class | Purpose |
|-------|---------|
| `SipCallLeg` | Tracks one leg of a call (call-id, state, RTP endpoint, SDP) |
| `SipBridge` | Two bridged call legs (outbound + device) |
| `SippyCallController` | Handles Sippy callbacks (INVITE received, BYE received, DTMF, etc.) |
**Call lifecycle:**
```
make_call("sip:+18005551234@trunk")
├── Create SipCallLeg (state=TRYING)
├── Sippy: send INVITE
├── Sippy callback: 180 Ringing → state=RINGING
├── Sippy callback: 200 OK → state=CONNECTED
│ └── Extract RTP endpoint from SDP
│ └── MediaPipeline.add_stream(rtp_host, rtp_port)
└── Return call_id
send_dtmf(call_id, "1")
└── Sippy: send RFC 2833 DTMF or SIP INFO
bridge(call_id_a, call_id_b)
├── Create SipBridge(leg_a, leg_b)
└── MediaPipeline.bridge_streams(stream_a, stream_b)
hangup(call_id)
├── Sippy: send BYE
├── MediaPipeline.remove_stream()
└── Cleanup SipCallLeg
```
**Graceful fallback:** If Sippy B2BUA is not installed, the engine falls back to mock mode with a warning — useful for development and testing without a SIP stack.
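The fallback can be sketched as a guarded import at engine construction time (the module and class names here are illustrative, not the real import paths):

```python
import logging

logger = logging.getLogger(__name__)

# Illustrative sketch of the guarded-import fallback described above;
# the real engine module and import names may differ.
try:
    import sippy  # noqa: F401  # the real B2BUA stack
    SIPPY_AVAILABLE = True
except ImportError:
    SIPPY_AVAILABLE = False

def create_engine() -> str:
    """Pick the real B2BUA when available, otherwise the in-memory mock."""
    if SIPPY_AVAILABLE:
        return "SippyEngine"
    logger.warning("Sippy B2BUA not installed; falling back to mock mode")
    return "MockSIPEngine"
```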
### Trunk Registration
The engine registers with your SIP trunk provider on startup:
```python
await engine.register(
registrar="sip.yourprovider.com",
username="your_username",
password="your_password",
realm="sip.yourprovider.com",
)
```
Registration is refreshed automatically. `get_trunk_status()` returns the current registration state and health.
## Media Pipeline (`core/media_pipeline.py`)
The media pipeline uses PJSUA2 for all RTP audio handling:
### Key Classes
| Class | Purpose |
|-------|---------|
| `AudioTap` | Extracts audio frames from a stream into an async queue (for classifier/STT) |
| `MediaStream` | Wraps a single RTP stream (transport port, conference slot, optional tap + recording) |
| `MediaPipeline` | Main orchestrator — manages all streams, bridging, recording |
### Operations
```python
# Add a new RTP stream (called when SIP call connects)
stream_id = await pipeline.add_stream(rtp_host, rtp_port, codec="PCMU")
# Tap audio for real-time analysis
tap = await pipeline.tap_stream(stream_id)
async for frame in tap:
classification = classifier.classify(frame)
# Bridge two streams (transfer)
await pipeline.bridge_streams(stream_a, stream_b)
# Record a stream to WAV
await pipeline.start_recording(stream_id, "/path/to/recording.wav")
await pipeline.stop_recording(stream_id)
# Play a tone (e.g., ringback to caller)
await pipeline.play_tone(stream_id, frequency=440, duration_ms=2000)
# Clean up
await pipeline.remove_stream(stream_id)
```
### Conference Bridge
PJSUA2's conference bridge is central to the architecture. Every stream gets a conference slot, and bridging is done by connecting slots:
```
Conference Bridge
├── Slot 0: Outbound call (to company)
├── Slot 1: AudioTap (classifier + STT reads from here)
├── Slot 2: Recording port
├── Slot 3: Device call (your phone, after transfer)
└── Slot 4: Tone generator
Bridge: Slot 0 ↔ Slot 3 (company ↔ your phone)
Tap: Slot 0 → Slot 1 (company audio → classifier)
Record: Slot 0 → Slot 2 (company audio → WAV file)
```
### Null Audio Device
The pipeline uses PJSUA2's null audio device — no sound card required. This is essential for headless server deployment.
## Call Manager (`core/call_manager.py`)
Tracks all active calls and their state:
```python
class CallManager:
async def create_call(self, number, mode, intent, ...) -> ActiveCall
async def get_call(self, call_id) -> Optional[ActiveCall]
async def update_status(self, call_id, status) -> None
async def end_call(self, call_id, reason) -> None
async def add_transcript(self, call_id, text, speaker) -> None
def active_call_count(self) -> int
def get_all_active(self) -> list[ActiveCall]
```
**ActiveCall state:**
```python
@dataclass
class ActiveCall:
call_id: str
number: str
mode: CallMode # direct, hold_slayer, ai_assisted
status: CallStatus # trying, ringing, connected, on_hold, transferring, ended
intent: Optional[str]
device: Optional[str]
call_flow_id: Optional[str]
# Timing
started_at: datetime
connected_at: Optional[datetime]
hold_started_at: Optional[datetime]
ended_at: Optional[datetime]
# Audio classification
current_audio_type: Optional[AudioClassification]
classification_history: list[ClassificationResult]
# Transcript
transcript_chunks: list[TranscriptChunk]
# Services
services: dict[str, bool] # recording, transcription, etc.
```
The CallManager publishes events to the EventBus on every state change.
## Event Bus (`core/event_bus.py`)
Pure asyncio pub/sub connecting all components:
```python
class EventBus:
async def publish(self, event: GatewayEvent) -> None
def subscribe(self, event_types: set[EventType] | None = None) -> EventSubscription
@property
def recent_events(self) -> list[GatewayEvent]
@property
def subscriber_count(self) -> int
```
### EventSubscription
Subscriptions are async iterators:
```python
subscription = event_bus.subscribe(event_types={EventType.HUMAN_DETECTED})
async for event in subscription:
print(f"Human detected on call {event.call_id}!")
# When done:
subscription.close()
```
### How it works
1. Each `subscribe()` creates an `asyncio.Queue` for that subscriber
2. `publish()` does `put_nowait()` on every subscriber's queue
3. Full queues (dead subscribers) are automatically cleaned up
4. Optional type filtering — only receive events you care about
5. Event history (last 1000) for late joiners
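The mechanics above reduce to a few lines of asyncio. This sketch keeps only the per-subscriber queues, the `put_nowait()` fan-out, and the drop-on-full cleanup (no history or type filtering):

```python
import asyncio

# Minimal sketch of the pub/sub mechanics; the real EventBus also keeps
# event history and supports type filtering.

class EventBus:
    def __init__(self, max_queue: int = 100):
        self._queues: list[asyncio.Queue] = []
        self._max_queue = max_queue

    def subscribe(self) -> asyncio.Queue:
        q = asyncio.Queue(maxsize=self._max_queue)
        self._queues.append(q)
        return q

    async def publish(self, event) -> None:
        dead = []
        for q in self._queues:
            try:
                q.put_nowait(event)      # never blocks the publisher
            except asyncio.QueueFull:
                dead.append(q)           # slow or dead subscriber
        for q in dead:
            self._queues.remove(q)       # automatic cleanup

async def demo():
    bus = EventBus()
    sub = bus.subscribe()
    await bus.publish({"type": "CALL_STARTED", "call_id": "c1"})
    return await sub.get()
```

Because `publish()` never awaits a subscriber, a stalled WebSocket client can never back-pressure the SIP engine.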
### Event Types
See [models/events.py](../models/events.py) for the full list. Key categories:
| Category | Events |
|----------|--------|
| Call Lifecycle | `CALL_STARTED`, `CALL_RINGING`, `CALL_CONNECTED`, `CALL_ENDED`, `CALL_FAILED` |
| Hold Slayer | `HOLD_DETECTED`, `HUMAN_DETECTED`, `TRANSFER_STARTED`, `TRANSFER_COMPLETE` |
| IVR Navigation | `IVR_STEP`, `IVR_DTMF_SENT`, `IVR_MENU_DETECTED`, `IVR_EXPLORATION` |
| Audio | `AUDIO_CLASSIFIED`, `TRANSCRIPT_CHUNK`, `RECORDING_STARTED`, `RECORDING_STOPPED` |
| Device | `DEVICE_REGISTERED`, `DEVICE_UNREGISTERED`, `DEVICE_RINGING` |
| System | `GATEWAY_STARTED`, `GATEWAY_STOPPED`, `TRUNK_REGISTERED`, `TRUNK_FAILED` |
## Gateway (`core/gateway.py`)
The top-level orchestrator that owns and wires all components:
```python
class AIPSTNGateway:
def __init__(self, settings: Settings):
self.event_bus = EventBus()
self.call_manager = CallManager(self.event_bus)
self.sip_engine = SippyEngine(settings, self.event_bus)
self.media_pipeline = MediaPipeline(settings)
self.llm_client = LLMClient(...)
self.transcription = TranscriptionService(...)
self.classifier = AudioClassifier()
self.hold_slayer = HoldSlayer(...)
self.recording = RecordingService(...)
self.analytics = CallAnalytics(...)
self.notification = NotificationService(...)
self.call_flow_learner = CallFlowLearner(...)
async def start(self) -> None: ... # Start all services
async def stop(self) -> None: ... # Graceful shutdown
async def make_call(self, ...) -> ActiveCall: ...
async def end_call(self, call_id) -> None: ...
```
The gateway is created once at application startup (in `main.py` lifespan) and injected into FastAPI routes via dependency injection (`api/deps.py`).

---
`docs/development.md` (new file)
# Development
## Setup
### Prerequisites
- Python 3.13+
- Ollama (or any OpenAI-compatible LLM) — for IVR menu analysis
- Speaches or Whisper API — for speech-to-text (optional for dev)
- A SIP trunk account — for making real calls (optional for dev)
### Install
```bash
git clone <repo-url>
cd hold-slayer
python -m venv .venv
source .venv/bin/activate
pip install -e ".[dev]"
```
### Dev Dependencies
The `[dev]` extras include:
- `pytest` — test runner
- `pytest-asyncio` — async test support
- `pytest-cov` — coverage reporting
## Testing
### Run All Tests
```bash
pytest tests/ -v
```
### Run Specific Test Files
```bash
pytest tests/test_audio_classifier.py -v # 18 tests — waveform analysis
pytest tests/test_call_flows.py -v # 10 tests — call flow models
pytest tests/test_hold_slayer.py -v # 20 tests — IVR nav, EventBus, CallManager
pytest tests/test_services.py -v # 27 tests — LLM, notifications, recording,
# analytics, learner, EventBus
```
### Run with Coverage
```bash
pytest tests/ --cov=. --cov-report=term-missing
```
### Test Architecture
Tests are organized by component:
| File | Tests | What's Covered |
|------|-------|----------------|
| `test_audio_classifier.py` | 18 | Silence, tone, DTMF, music, speech detection; feature extraction; classification history |
| `test_call_flows.py` | 10 | CallFlowStep types, CallFlow navigation, serialization roundtrip, create/summary models |
| `test_hold_slayer.py` | 20 | IVR menu navigation (6 intent scenarios), EventBus pub/sub, CallManager lifecycle, MockSIPEngine |
| `test_services.py` | 27 | LLMClient init/stats/chat/JSON/errors/IVR analysis, NotificationService event mapping, RecordingService paths, CallAnalytics summaries, CallFlowLearner build/merge, EventBus integration |
### Known Test Issues
`test_complex_tone_as_music` — A synthetic multi-harmonic tone is classified as `LIVE_HUMAN` instead of `MUSIC`. This is a known edge case. Real hold music has different spectral characteristics than synthetic test signals. This test documents the limitation rather than a bug.
### Writing Tests
All tests use `pytest-asyncio` for async support. The test configuration in `pyproject.toml`:
```toml
[tool.pytest.ini_options]
asyncio_mode = "auto"
```
This means all `async def test_*` functions automatically run in an asyncio event loop.
**Pattern for testing services:**
```python
import pytest
from services.llm_client import LLMClient
class TestLLMClient:
def test_init(self):
client = LLMClient(base_url="http://localhost:11434/v1", model="llama3")
assert client._model == "llama3"
@pytest.mark.asyncio
async def test_chat(self):
# Mock httpx for unit tests
...
```
**Pattern for testing EventBus:**
```python
import asyncio
from core.event_bus import EventBus
from models.events import EventType, GatewayEvent
async def test_publish_receive():
bus = EventBus()
sub = bus.subscribe()
event = GatewayEvent(type=EventType.CALL_STARTED, call_id="test", data={})
await bus.publish(event)
received = await asyncio.wait_for(sub.get(), timeout=1.0)
assert received.type == EventType.CALL_STARTED
```
## Project Conventions
### Code Style
- **Type hints everywhere** — All function signatures have type annotations
- **Pydantic models** — All data structures are Pydantic BaseModel or dataclass
- **Async by default** — All I/O operations are async
- **Logging** — Every module uses `logging.getLogger(__name__)`
- **Docstrings** — Module-level docstrings explain purpose and usage
### File Organization
```
module.py
├── Module docstring (purpose, usage examples)
├── Imports (stdlib → third-party → local)
├── Constants
├── Classes
│ ├── Class docstring
│ ├── __init__
│ ├── Public methods (async)
│ └── Private methods (_prefixed)
└── Module-level functions (if any)
```
### Error Handling
- **Services never crash the call** — All service errors are caught, logged, and return sensible defaults
- **LLM failures** return empty string/dict — the Hold Slayer falls back to waiting
- **SIP errors** publish `CALL_FAILED` events — the user is notified
- **HTTP errors** in the API return structured error responses
### Event-Driven Architecture
All components communicate through the EventBus:
1. **Publishers** — SIP engine, Hold Slayer, classifier, services
2. **Subscribers** — WebSocket handler, MCP server, notification service, analytics
This decouples components and makes the system extensible. Adding a new feature (e.g., Slack notifications) means subscribing to events — no changes to existing code.
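Such an extension can be sketched as a standalone subscriber task that consumes the subscription as an async iterator, as in the EventBus docs. The `sender` callable stands in for a real Slack webhook client (hypothetical, not part of the codebase):

```python
import asyncio

# Hypothetical Slack notifier added without touching existing code;
# `sender` is a stub for a real webhook POST.

async def slack_notifier(subscription, sender):
    async for event in subscription:
        if event["type"] == "HUMAN_DETECTED":
            await sender(f"Human detected on call {event['call_id']}, pick up!")

async def demo():
    async def subscription():
        # Stand-in for event_bus.subscribe(); yields two events then ends
        yield {"type": "CALL_STARTED", "call_id": "c1"}
        yield {"type": "HUMAN_DETECTED", "call_id": "c1"}

    sent = []
    async def sender(text: str) -> None:
        sent.append(text)

    await slack_notifier(subscription(), sender)
    return sent
```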
### Dependency Injection
The `AIPSTNGateway` owns all services and is injected into FastAPI routes via `api/deps.py`:
```python
# api/deps.py
async def get_gateway() -> AIPSTNGateway:
return app.state.gateway
# api/calls.py
@router.post("/outbound")
async def make_call(request: CallRequest, gateway: AIPSTNGateway = Depends(get_gateway)):
...
```
This makes testing easy — swap the gateway for a mock in tests.
## Contributing
1. Create a feature branch
2. Write tests for new functionality
3. Ensure all tests pass: `pytest tests/ -v`
4. Follow existing code conventions
5. Update documentation in `/docs` if adding new features
6. Submit a pull request

---
`docs/dial-plan.md` (new file)
# Hold Slayer Gateway — Dial Plan
## Overview
The gateway accepts calls from registered SIP endpoints and routes them
based on the dialled digits. No trunk-access prefix (no "9") is needed.
All routing is pattern-matched in order; the first match wins.
---
## ⚠️ Emergency Services — 911
> **911 and 9911 are always routed directly to the PSTN trunk.**
> No gateway logic intercepts, records, or delays these calls.
> `9911` is accepted in addition to `911` to catch the common
> mis-dial habit of dialling `9` for an outside line.
>
> **Your SIP trunk provider must support emergency calling on your DID.**
> Verify this with your provider before putting this system in service.
> VoIP emergency calling has location limitations — ensure your
> registered location is correct with your provider.
---
## Extension Ranges
| Range | Purpose |
|-------|--------------------------------|
| 2XX | SIP endpoints (phones/softphones) |
| 5XX | System services |
---
## 2XX — Endpoint Extensions
Extensions are auto-assigned from **221** upward when a SIP device
registers (`SIP REGISTER`) with the gateway or via `POST /api/devices`.
| Extensions | Description | Example |
|-----------|---------------------------------|--------------------------------|
| 221–299 | Auto-assigned to registered devices | `sip:221@gateway.helu.ca` |
### Assignment policy
- First device to register gets **221**, next **222**, and so on.
- Extensions are persisted in the database and survive restarts.
- If a device is removed its extension is freed and may be reassigned.
- `GATEWAY_SIP_DOMAIN` in `.env` sets the domain part of the URI.
---
## 5XX — System Services
| Extension | Service | Notes |
|-----------|----------------------|-----------------------------------------|
| 500 | Auto-attendant | Reserved — not yet implemented |
| 510 | Gateway status | Plays a status announcement |
| 511 | Echo test | Returns audio back to caller |
| 520 | Hold Slayer launch | Prompts for a number to hold-slay |
| 599 | Operator fallback | Transfers to preferred device |
---
## Outbound PSTN
All outbound patterns are routed via the configured SIP trunk
(`SIP_TRUNK_HOST`). No access code prefix is needed.
### Pattern table
| Pattern | Example input | Normalised to | Notes |
|----------------------|--------------------|---------------------|------------------------------------|
| `+1NPANXXXXXX` | `+16135550100` | `+16135550100` | E.164 — pass through as-is |
| `1NPANXXXXXX` | `16135550100` | `+16135550100` | NANP with country code |
| `NPANXXXXXX` | `6135550100` | `+16135550100` | 10-digit NANP — prepend `+1` |
| `011CC…` | `01144201234567` | `+44201234567` | International — strip `011` |
| `00CC…` | `004420…` | `+4420…` | International alt prefix |
| `+CC…` | `+44201234567` | `+44201234567` | E.164 international — pass through |
### Rules
1. E.164 (`+` prefix) is always passed to the trunk unchanged.
2. NANP 11-digit (`1` + 10 digits) is normalised to E.164 by prepending `+`.
3. NANP 10-digit is normalised to E.164 by prepending `+1`.
4. International via `011` or `00` strips the IDD prefix and prepends `+`.
5. 7-digit local dialling is **not supported** — always dial the area code.
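The rules above can be sketched as a single normalisation function (illustrative only; the gateway's actual dial-plan code may differ):

```python
import re

# Sketch of the normalisation rules above; illustrative, not the
# gateway's routing code.

def normalize(dialed: str) -> str:
    digits = dialed.strip()
    if digits in ("911", "9911"):
        return "911"                 # emergency short-circuit; 9911 is a mis-dialled 911
    if digits.startswith("+"):
        return digits                # already E.164, pass through
    if digits.startswith("011"):
        return "+" + digits[3:]      # strip NANP international prefix
    if digits.startswith("00"):
        return "+" + digits[2:]      # strip alternate IDD prefix
    if re.fullmatch(r"1\d{10}", digits):
        return "+" + digits          # NANP 11-digit
    if re.fullmatch(r"\d{10}", digits):
        return "+1" + digits         # NANP 10-digit
    raise ValueError(f"unroutable number: {dialed!r}")
```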
---
## Inbound PSTN
Calls arriving from the trunk on the DID (`SIP_TRUNK_DID`) are routed
to the highest-priority online device. If no device is online the call
is queued or dropped (configurable via `MAX_HOLD_TIME`).
---
## Future
- Named regions / area-code routing
- Least-cost routing across multiple trunks
- Time-of-day routing (business hours vs. after-hours)
- Ring groups across multiple 2XX extensions
- Voicemail (extension 500)

---
`docs/hold-slayer-service.md` (new file)
# Hold Slayer Service
The Hold Slayer (`services/hold_slayer.py`) is the brain of the system. It orchestrates the entire process of navigating IVR menus, detecting hold music, recognizing when a human picks up, and triggering the transfer to your phone.
## Two Operating Modes
### 1. Flow-Guided Mode (`run_with_flow`)
When a stored `CallFlow` exists for the number being called, the Hold Slayer follows it step-by-step:
```python
await hold_slayer.run_with_flow(call_id, call_flow)
```
The call flow is a tree of steps (see [Call Flows](call-flows.md)). The Hold Slayer walks through them:
```
CallFlow: "Chase Bank Main"
├── Step 1: WAIT 3s (wait for greeting)
├── Step 2: LISTEN (transcribe → LLM picks option)
├── Step 3: DTMF "2" (press 2 for account services)
├── Step 4: LISTEN (transcribe → LLM picks option)
├── Step 5: DTMF "1" (press 1 for disputes)
├── Step 6: HOLD (wait for human)
└── Step 7: TRANSFER (bridge to your phone)
```
**Step execution logic:**
| Step Type | What Happens |
|-----------|-------------|
| `DTMF` | Send the specified digits via SIP engine |
| `WAIT` | Sleep for the specified duration |
| `LISTEN` | Record audio, transcribe, then: use hardcoded DTMF if available, otherwise ask LLM to pick the right option |
| `HOLD` | Monitor audio classification, wait for human detection |
| `SPEAK` | Play a WAV file or TTS audio (for interactive prompts) |
| `TRANSFER` | Bridge the call to the user's device |
### 2. Exploration Mode (`run_exploration`)
When no stored call flow exists, the Hold Slayer explores the IVR autonomously:
```python
await hold_slayer.run_exploration(call_id, intent="dispute Amazon charge")
```
**Exploration loop:**
```
┌─→ Classify audio (3-second window)
│ ├── SILENCE → wait, increment silence counter
│ ├── RINGING → wait for answer
│ ├── MUSIC → hold detected, monitor for transition
│ ├── DTMF → ignore (echo detection)
│ ├── IVR_PROMPT/SPEECH →
│ │ ├── Transcribe the audio
│ │ ├── Send transcript + intent to LLM
│ │ ├── LLM returns: { "action": "dtmf", "digits": "2" }
│ │ └── Send DTMF
│ └── LIVE_HUMAN → human detected!
│ └── TRANSFER
└── Loop until: human detected, max hold time, or call ended
```
**Exploration discoveries** are recorded and can be fed into the `CallFlowLearner` to build a reusable flow for next time.
## Human Detection
The critical moment — detecting when a live person picks up after hold:
### Detection Chain
```
AudioClassifier.classify(audio_frame)
├── Feature extraction:
│ ├── RMS energy (loudness)
│ ├── Spectral flatness (noise vs tone)
│ ├── Zero-crossing rate (speech indicator)
│ ├── Dominant frequency
│ └── Spectral centroid
├── Classification: MUSIC, SILENCE, SPEECH, etc.
└── Transition detection:
└── detect_hold_to_human_transition()
├── Check last N classifications
├── Pattern: MUSIC, MUSIC, MUSIC → SPEECH, SPEECH
├── Confidence: speech energy > threshold
└── Result: HUMAN_DETECTED event
```
### What triggers a transfer?
The Hold Slayer considers a human detected when:
1. **Classification history** shows a transition from hold-like audio (MUSIC, SILENCE) to speech-like audio (LIVE_HUMAN, IVR_PROMPT)
2. **Energy threshold** — the speech audio has sufficient RMS energy (not just background noise)
3. **Consecutive speech frames** — at least 2-3 consecutive speech classifications (avoids false positives from hold music announcements like "your call is important to us")
### False Positive Handling
Hold music often includes periodic announcements ("Your estimated wait time is 15 minutes"). These are speech, but not a live human. The Hold Slayer handles this by:
1. **Duration check** — Hold announcements are typically short (5-15 seconds). A live agent conversation continues longer.
2. **Pattern matching** — After speech, if audio returns to MUSIC within a few seconds, it was just an announcement.
3. **Transcript analysis** — If transcription is active, the LLM can analyze whether the speech sounds like a recorded announcement vs. a live greeting.
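A simplified sketch of the transition check over the classification history (labels match the docs; the frame threshold is illustrative):

```python
# Sketch of the hold-to-human transition heuristic; the real classifier
# also weighs RMS energy and confidence scores.

HOLD_LIKE = {"MUSIC", "SILENCE"}
SPEECH_LIKE = {"LIVE_HUMAN", "IVR_PROMPT", "SPEECH"}

def hold_to_human_transition(history: list[str], min_speech_frames: int = 3) -> bool:
    """history is most-recent-last; require hold audio, then a run of speech."""
    if len(history) < min_speech_frames + 1:
        return False
    recent = history[-min_speech_frames:]
    earlier = history[:-min_speech_frames]
    # A lone announcement frame that drops back to MUSIC never yields
    # min_speech_frames consecutive speech classifications at the tail.
    return (all(c in SPEECH_LIKE for c in recent)
            and any(c in HOLD_LIKE for c in earlier))
```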
## LISTEN Step + LLM Fallback
The most interesting step type. When the Hold Slayer encounters a LISTEN step in a call flow:
```python
# Step has hardcoded DTMF? Use it directly.
if step.dtmf:
await sip_engine.send_dtmf(call_id, step.dtmf)
# No hardcoded DTMF? Ask the LLM.
else:
transcript = await transcription.transcribe(audio)
decision = await llm_client.analyze_ivr_menu(
transcript=transcript,
intent=intent,
previous_selections=previous_steps,
)
if decision.get("action") == "dtmf":
await sip_engine.send_dtmf(call_id, decision["digits"])
```
The LLM receives:
- The IVR transcript ("Press 1 for billing, press 2 for technical support...")
- The user's intent ("dispute a charge on my December statement")
- Previous menu selections (to avoid loops)
And returns structured JSON:
```json
{
"action": "dtmf",
"digits": "1",
"reasoning": "Billing is the correct department for charge disputes"
}
```
## Event Publishing
The Hold Slayer publishes events throughout the process:
| Event | When |
|-------|------|
| `IVR_STEP` | Each step in the call flow is executed |
| `IVR_DTMF_SENT` | DTMF digits are sent |
| `IVR_MENU_DETECTED` | An IVR menu prompt is transcribed |
| `HOLD_DETECTED` | Hold music is detected |
| `HUMAN_DETECTED` | Live human speech detected after hold |
| `TRANSFER_STARTED` | Call bridge initiated to user's device |
| `TRANSFER_COMPLETE` | User's device answered, bridge active |
All events flow through the EventBus to WebSocket clients, MCP server, notification service, and analytics.
## Configuration
| Setting | Description | Default |
|---------|-------------|---------|
| `MAX_HOLD_TIME` | Maximum seconds to wait on hold before giving up | `7200` (2 hours) |
| `HOLD_CHECK_INTERVAL` | Seconds between audio classification checks | `2.0` |
| `DEFAULT_TRANSFER_DEVICE` | Device to transfer to when human detected | `sip_phone` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for classification | `3.0` |

---
`docs/mcp-server.md` (new file)
# MCP Server
The MCP (Model Context Protocol) server lets any MCP-compatible AI assistant control the Hold Slayer gateway. Built with [FastMCP](https://github.com/jlowin/fastmcp), it exposes tools and resources over SSE.
## Overview
An AI assistant connects via SSE to the MCP server and gains access to tools for placing calls, checking status, sending DTMF, getting transcripts, and managing call flows. The assistant can orchestrate an entire call through natural language.
## Tools
### make_call
Place an outbound call through the SIP trunk.
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `number` | string | Yes | Phone number to call (E.164 format) |
| `mode` | string | No | Call mode: `direct`, `hold_slayer`, `ai_assisted` (default: `hold_slayer`) |
| `intent` | string | No | What you want to accomplish on the call |
| `call_flow_id` | string | No | ID of a stored call flow to follow |
Returns: Call ID and initial status.
### end_call
Hang up an active call.
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `call_id` | string | Yes | The call to hang up |
### send_dtmf
Send touch-tone digits to an active call (for manual IVR navigation).
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `call_id` | string | Yes | The call to send digits to |
| `digits` | string | Yes | DTMF digits to send (e.g., "1", "3#", "1234") |
### get_call_status
Check the current state of a call.
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `call_id` | string | Yes | The call to check |
Returns: Status, duration, hold time, audio classification, transcript excerpt.
### get_call_transcript
Get the live transcript of a call.
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `call_id` | string | Yes | The call to get transcript for |
Returns: Array of transcript chunks with timestamps and speaker labels.
### get_call_recording
Get recording metadata and file path for a call.
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `call_id` | string | Yes | The call to get recording for |
Returns: Recording path, duration, file size.
### list_active_calls
List all calls currently in progress. No parameters.
Returns: Array of active calls with status, number, duration.
### get_call_summary
Get analytics summary — hold times, success rates, call volume. No parameters.
Returns: Aggregate statistics across all calls.
### search_call_history
Search past calls by number, company, or date range.
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `query` | string | Yes | Search term (phone number, company name) |
| `limit` | int | No | Max results (default: 20) |
### learn_call_flow
Build a reusable call flow from a completed exploration call.
| Param | Type | Required | Description |
|-------|------|----------|-------------|
| `call_id` | string | Yes | The exploration call to learn from |
| `company` | string | No | Company name for the flow |
Returns: The generated CallFlow object.
## Resources
MCP resources provide read-only data that assistants can reference:
| Resource URI | Description |
|-------------|-------------|
| `gateway://status` | Current gateway status — trunk registration, active calls, service health |
| `gateway://calls` | List of all active calls with current status |
| `gateway://calls/{call_id}` | Detailed status for a specific call |
| `gateway://flows` | List of all stored call flows |
| `gateway://analytics` | Call analytics summary |
## Configuration
The MCP server is mounted on the FastAPI app at `/mcp`.
### Connecting an AI Assistant
Add to your MCP client configuration (e.g., Claude Desktop, Cline):
```json
{
"mcpServers": {
"hold-slayer": {
"url": "http://localhost:8000/mcp/sse"
}
}
}
```
## Example Conversation
Here is how an AI assistant would use the MCP tools to handle a complete call:
**User:** "Call Chase Bank and dispute the Amazon charge from December 15th"
**Assistant actions:**
1. Calls `make_call(number="+18005551234", mode="hold_slayer", intent="dispute Amazon charge Dec 15th", call_flow_id="chase-bank-main")`
2. Receives `call_id: "call_abc123"`
3. Polls `get_call_status("call_abc123")` periodically
4. Status progression: `trying` → `ringing` → `connected` → `on_hold`
5. Tells user: "I'm on hold with Chase Bank. Currently 4 minutes in. I'll let you know when someone picks up."
6. Status changes to `transferring` — human detected!
7. Tells user: "A live agent just picked up. I'm transferring the call to your desk phone now. Pick up!"
8. After the call, calls `learn_call_flow("call_abc123", company="Chase Bank")` to save the IVR path for next time.
**User:** "How long was I on hold?"
**Assistant actions:**
1. Calls `get_call_summary()`
2. Reports: "Your Chase Bank call lasted 12 minutes total, with 8 minutes on hold. The disputes department averages 6 minutes hold time on Tuesdays."

---
`docs/services.md` (new file)
# Services
The intelligence layer services that power Hold Slayer's decision-making, transcription, recording, analytics, and notifications.
## LLM Client (`services/llm_client.py`)
Async HTTP client for any OpenAI-compatible chat completion API. No SDK dependency — just httpx.
### Supported Backends
| Backend | URL | Notes |
|---------|-----|-------|
| Ollama | `http://localhost:11434/v1` | Local, free, good for dev |
| LM Studio | `http://localhost:1234/v1` | Local, free, GUI |
| vLLM | `http://localhost:8000/v1` | Local, fast, production |
| OpenAI | `https://api.openai.com/v1` | Cloud, paid, best quality |
### Usage
```python
client = LLMClient(
base_url="http://localhost:11434/v1",
model="llama3",
api_key="not-needed", # Ollama doesn't need a key
timeout=30.0,
max_tokens=1024,
temperature=0.3,
)
# Simple chat
response = await client.chat("What is 2+2?")
# "4"
# Chat with system prompt
response = await client.chat(
"Parse this menu transcript...",
system="You are a phone menu parser. Return JSON.",
)
# Structured JSON response (auto-parses)
result = await client.chat_json(
"Extract menu options from: Press 1 for billing, press 2 for support",
system="Return JSON with 'options' array.",
)
# {"options": [{"digit": "1", "label": "billing"}, {"digit": "2", "label": "support"}]}
```
### IVR Menu Analysis
The primary use case — analyzing IVR transcripts to pick the right menu option:
```python
decision = await client.analyze_ivr_menu(
transcript="Welcome to Chase Bank. Press 1 for account balance, press 2 for recent transactions, press 3 for disputes, press 0 for an agent.",
intent="dispute a charge from Amazon on December 15th",
previous_selections=["main_menu"],
)
# {"action": "dtmf", "digits": "3", "reasoning": "Disputes is the correct department"}
```
### JSON Extraction
The client handles messy LLM output gracefully:
1. Try `json.loads()` on the raw response
2. If that fails, look for ```json ... ``` markdown blocks
3. If that fails, look for `{...}` patterns in the text
4. If all fail, return empty dict (caller handles gracefully)
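The fallback chain can be sketched in a few lines (the real client's regex patterns may differ):

```python
import json
import re

# Sketch of the JSON-extraction fallback chain described above.

def extract_json(raw: str) -> dict:
    # 1. The raw response is already valid JSON
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # 2. A fenced markdown block: three backticks, optional "json" tag
    m = re.search(r"`{3}(?:json)?\s*(\{.*?\})\s*`{3}", raw, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(1))
        except json.JSONDecodeError:
            pass
    # 3. The first {...} span anywhere in the text
    m = re.search(r"\{.*\}", raw, re.DOTALL)
    if m:
        try:
            return json.loads(m.group(0))
        except json.JSONDecodeError:
            pass
    # 4. Give up; callers treat an empty dict as "no decision"
    return {}
```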
### Stats Tracking
```python
stats = client.stats
# {
# "total_requests": 47,
# "total_errors": 2,
# "avg_latency_ms": 234.5,
# "model": "llama3",
# "base_url": "http://localhost:11434/v1"
# }
```
### Error Handling
- HTTP errors return empty string/dict (never crashes the call)
- Timeouts are configurable (default 30s)
- All errors are logged with full context
- Stats track error rates for monitoring
## Transcription Service (`services/transcription.py`)
Real-time speech-to-text using Speaches (a self-hosted Whisper API).
### Architecture
```
Audio frames (from AudioTap)
└── POST /v1/audio/transcriptions
├── model: whisper-large-v3
├── audio: WAV bytes
└── language: en
└── Response: { "text": "Press 1 for billing..." }
```
### Usage
```python
service = TranscriptionService(
speaches_url="http://perseus.helu.ca:22070",
model="whisper-large-v3",
)
# Transcribe audio bytes
text = await service.transcribe(audio_bytes)
# "Welcome to Chase Bank. For English, press 1."
# Transcribe with language hint
text = await service.transcribe(audio_bytes, language="fr")
```
### Integration with Hold Slayer
The transcription service is called when the audio classifier detects speech (IVR_PROMPT or LIVE_HUMAN). The transcript is then:
1. Published as a `TRANSCRIPT_CHUNK` event (→ WebSocket clients)
2. Fed to the LLM for IVR menu analysis
3. Stored in the call's transcript history
4. Used by the Call Flow Learner to build reusable flows
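The four steps above can be sketched as one handler; every collaborator interface here (`event_bus.publish`, `flow_learner.observe`, the `call` attributes) is assumed for illustration:

```python
async def on_speech_detected(call, audio_bytes, transcription, llm,
                             event_bus, flow_learner):
    """Illustrative fan-out of a transcript to the four consumers above."""
    text = await transcription.transcribe(audio_bytes)
    if not text:
        return None
    # 1. Publish for WebSocket clients.
    await event_bus.publish("TRANSCRIPT_CHUNK",
                            {"call_id": call.id, "text": text})
    # 2. Ask the LLM which menu option matches the user's intent.
    decision = await llm.analyze_ivr_menu(
        transcript=text,
        intent=call.intent,
        previous_selections=call.ivr_path,
    )
    # 3. Keep the transcript on the call record.
    call.transcript.append(text)
    # 4. Feed the Call Flow Learner so it can build a reusable flow.
    flow_learner.observe(call.id, text, decision)
    return decision
```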
## Recording Service (`services/recording.py`)
Manages call recordings via the PJSUA2 media pipeline.
### Storage Structure
```
recordings/
├── 2026/
│ ├── 01/
│ │ ├── 15/
│ │ │ ├── call_abc123_outbound.wav
│ │ │ ├── call_abc123_mixed.wav
│ │ │ └── call_def456_outbound.wav
│ │ └── 16/
│ │ └── ...
│ └── 02/
│ └── ...
```
### Recording Types
| Type | Description |
|------|-------------|
| **Outbound** | Audio from the company (IVR, hold music, agent) |
| **Inbound** | Audio from the user's device (after transfer) |
| **Mixed** | Both parties in one file (for review) |
### Usage
```python
service = RecordingService(
storage_dir="recordings",
max_recording_seconds=7200, # 2 hours
sample_rate=16000,
)
# Start recording
session = await service.start_recording(call_id, stream_id)
# session.path = "recordings/2026/01/15/call_abc123_outbound.wav"
# Stop recording
metadata = await service.stop_recording(call_id)
# metadata = { "duration": 847.3, "file_size": 27113600, "path": "..." }
# List recordings for a call
recordings = service.get_recordings(call_id)
```
## Call Analytics (`services/call_analytics.py`)
Tracks call metrics and provides insights for monitoring and optimization.
### Metrics Tracked
| Metric | Description |
|--------|-------------|
| Hold time | Duration spent on hold per call |
| Total call duration | End-to-end call time |
| Success rate | Percentage of calls that reached a human |
| IVR navigation time | Time spent navigating menus |
| Company patterns | Per-company hold time averages |
| Time-of-day trends | When hold times are shortest |
### Usage
```python
analytics = CallAnalytics(max_history=10000)
# Record a completed call
analytics.record_call(
call_id="call_abc123",
number="+18005551234",
company="Chase Bank",
hold_time=780,
total_duration=847,
success=True,
ivr_steps=6,
)
# Get summary
summary = analytics.get_summary()
# {
# "total_calls": 142,
# "success_rate": 0.89,
# "avg_hold_time": 623.4,
# "avg_total_duration": 712.1,
# }
# Per-company stats
stats = analytics.get_company_stats("Chase Bank")
# {
# "total_calls": 23,
# "avg_hold_time": 845.2,
# "best_time": "Tuesday 10:00 AM",
# "success_rate": 0.91,
# }
# Top numbers by call volume
top = analytics.get_top_numbers(limit=10)
# Hold time trends by hour
trends = analytics.get_hold_time_trend()
# [{"hour": 9, "avg_hold": 320}, {"hour": 10, "avg_hold": 480}, ...]
```
## Notification Service (`services/notification.py`)
Sends alerts when important things happen on calls.
### Notification Channels
| Channel | Status | Use Case |
|---------|--------|----------|
| **WebSocket** | ✅ Active | Real-time UI updates (always on) |
| **SMS** | ✅ Active | Critical alerts (human detected, call failed) |
| **Push** | 🔮 Future | Mobile app notifications |
### Notification Priority
| Priority | Events | Delivery |
|----------|--------|----------|
| `CRITICAL` | Human detected, transfer started | WebSocket + SMS |
| `HIGH` | Call failed, call timeout | WebSocket + SMS |
| `NORMAL` | Hold detected, call ended | WebSocket only |
| `LOW` | IVR step, DTMF sent | WebSocket only |
### Event → Notification Mapping
| Event | Notification |
|-------|-------------|
| `HUMAN_DETECTED` | 🚨 "A live person picked up — transferring you now!" |
| `TRANSFER_STARTED` | 📞 "Your call has been connected. Pick up your phone!" |
| `CALL_FAILED` | ❌ "The call couldn't be completed." |
| `HOLD_DETECTED` | ⏳ "You're on hold. We'll notify you when someone picks up." |
| `IVR_STEP` | 📍 "Navigating phone menu..." |
| `IVR_DTMF_SENT` | 📱 "Pressed 3" |
| `CALL_ENDED` | 📴 "The call has ended." |
### Deduplication
The notification service tracks what's been sent per call to avoid spamming:
```python
# Won't send duplicate "on hold" notifications for the same call
self._notified: dict[str, set[str]] # call_id → set of event dedup keys
```
Tracking is cleaned up when a call ends.
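A self-contained sketch of that tracking (the class and method names are illustrative):

```python
class NotificationDeduper:
    """Per-call record of which notifications have already gone out."""

    def __init__(self) -> None:
        self._notified: dict[str, set[str]] = {}  # call_id → sent dedup keys

    def should_send(self, call_id: str, dedup_key: str) -> bool:
        sent = self._notified.setdefault(call_id, set())
        if dedup_key in sent:
            return False  # e.g. a second "on hold" alert for the same call
        sent.add(dedup_key)
        return True

    def cleanup(self, call_id: str) -> None:
        """Called when the call ends so the map doesn't grow unbounded."""
        self._notified.pop(call_id, None)
```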
### SMS Configuration
SMS is sent for `CRITICAL` and `HIGH` priority notifications when `NOTIFY_SMS_NUMBER` is configured:
```env
NOTIFY_SMS_NUMBER=+15559876543
```
The SMS sender is a placeholder — wire up your preferred provider (Twilio, AWS SNS, etc.).