feat: add initial Hold Slayer AI telephony gateway implementation
Complete project scaffolding and core implementation of an AI-powered telephony system that calls companies, navigates IVR menus, waits on hold, and transfers to the user when a human answers.

Key components:
- FastAPI server with REST API, WebSocket, and MCP (SSE) interfaces
- SIP/VoIP call management via PJSUA2 with RTP audio streaming
- LLM-powered IVR navigation using OpenAI/Anthropic with tool calling
- Hold detection service combining audio analysis and silence detection
- Real-time STT (Whisper/Deepgram) and TTS (OpenAI/Piper) pipelines
- Call recording with per-channel and mixed audio capture
- Event bus (asyncio pub/sub) for real-time client updates
- Web dashboard with live call monitoring
- SQLite persistence via SQLAlchemy with call history and analytics
- Notification support (email, SMS, webhook, desktop)
- Docker Compose deployment with Opal VoIP and Opal Media containers
- Comprehensive test suite with unit, integration, and E2E tests
- Simplified .gitignore and full project documentation in README
New file: docs/audio-classifier.md (174 lines)
# Audio Classifier

The Audio Classifier (`services/audio_classifier.py`) performs real-time waveform analysis on phone audio to determine what's happening on the call: silence, ringing, hold music, IVR prompts, DTMF tones, or live human speech.
## Classification Types

```python
from enum import Enum

class AudioClassification(str, Enum):
    SILENCE = "silence"          # No meaningful audio
    MUSIC = "music"              # Hold music
    IVR_PROMPT = "ivr_prompt"    # Recorded voice menu
    LIVE_HUMAN = "live_human"    # Live person speaking
    RINGING = "ringing"          # Ringback tone
    DTMF = "dtmf"                # Touch-tone digits
    UNKNOWN = "unknown"          # Can't classify
```
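Because the enum mixes in `str`, each classification compares equal to its plain string value, which is convenient when labels travel through JSON event payloads. A quick sketch using an abridged copy of the enum above:

```python
from enum import Enum

class AudioClassification(str, Enum):
    # Abridged copy of the enum above, for illustration only
    SILENCE = "silence"
    MUSIC = "music"
    LIVE_HUMAN = "live_human"

# str-mixin members compare equal to their plain string values
print(AudioClassification.MUSIC == "music")                                  # True
print(AudioClassification("live_human") is AudioClassification.LIVE_HUMAN)   # True
```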
## Feature Extraction

Every audio frame (typically 3 seconds of 16kHz PCM) goes through feature extraction:

| Feature | What It Measures | How It's Used |
|---------|-----------------|---------------|
| **RMS Energy** | Loudness (root mean square of samples) | Silence detection — below threshold = silence |
| **Spectral Flatness** | How noise-like vs tonal the audio is (0 = pure tone, 1 = white noise) | Music has low flatness (tonal); speech has higher flatness |
| **Zero-Crossing Rate** | How often the waveform crosses zero | Speech has moderate ZCR; tones have very regular ZCR |
| **Dominant Frequency** | Strongest frequency component (via FFT) | Ringback detection (440Hz), DTMF detection |
| **Spectral Centroid** | "Center of mass" of the frequency spectrum | Speech has a higher centroid than music |
| **Tonality** | Whether the audio is dominated by a single frequency | Tones/DTMF are highly tonal; speech is not |
### Feature Extraction Code

```python
def _extract_features(self, audio: np.ndarray) -> dict:
    rms = np.sqrt(np.mean(audio ** 2))

    # FFT for frequency analysis
    fft = np.fft.rfft(audio)
    magnitude = np.abs(fft)
    freqs = np.fft.rfftfreq(len(audio), 1.0 / self._sample_rate)

    # Spectral flatness: geometric mean / arithmetic mean of magnitude
    spectral_flatness = np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

    # Zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2

    # Dominant frequency
    dominant_freq = freqs[np.argmax(magnitude)]

    # Spectral centroid
    spectral_centroid = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)

    return { ... }
```
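As a standalone illustration of those formulas (a sketch assuming only NumPy, with the class method rewritten as a free function so it can run on its own), here is the extraction applied to a synthetic 440 Hz tone:

```python
import numpy as np

def extract_features(audio: np.ndarray, sample_rate: int = 16000) -> dict:
    """Same math as _extract_features above, as a free function for illustration."""
    rms = np.sqrt(np.mean(audio ** 2))
    magnitude = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), 1.0 / sample_rate)
    spectral_flatness = np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2
    dominant_freq = freqs[np.argmax(magnitude)]
    spectral_centroid = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)
    return {
        "rms": float(rms),
        "spectral_flatness": float(spectral_flatness),
        "zcr": float(zcr),
        "dominant_freq": float(dominant_freq),
        "spectral_centroid": float(spectral_centroid),
    }

# A pure 440 Hz tone: dominant_freq lands on 440 Hz and flatness is near 0
t = np.arange(0, 3.0, 1.0 / 16000)
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
features = extract_features(tone)
```

On speech or music the flatness and ZCR values spread out, which is what the thresholds in the next section discriminate on.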
## Classification Logic

Classification follows a priority chain:

```
1. SILENCE — RMS below threshold?
   └── Yes → SILENCE (confidence based on how quiet)

2. DTMF — Goertzel algorithm detects dual-tone pairs?
   └── Yes → DTMF (with detected digit in details)

3. RINGING — Dominant frequency near 440Hz + tonal?
   └── Yes → RINGING

4. SPEECH vs MUSIC discrimination:
   ├── High spectral flatness + moderate ZCR → LIVE_HUMAN or IVR_PROMPT
   │   └── _looks_like_live_human() checks history for hold→speech transition
   │       ├── Yes → LIVE_HUMAN
   │       └── No → IVR_PROMPT
   │
   └── Low spectral flatness + tonal → MUSIC
```
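The chain above can be sketched as a plain function over the extracted features. Everything here is illustrative: the threshold values and the `is_dtmf` flag (standing in for the Goertzel check) are invented for the sketch, and the final LIVE_HUMAN/IVR_PROMPT split is left out because it needs classification history:

```python
def classify(features: dict, is_dtmf: bool) -> str:
    """Illustrative priority chain; thresholds are made up for this sketch."""
    if features["rms"] < 0.01:            # 1. silence wins first
        return "silence"
    if is_dtmf:                           # 2. dual-tone pair detected
        return "dtmf"
    if features["is_tonal"] and abs(features["dominant_freq"] - 440.0) < 20.0:
        return "ringing"                  # 3. ringback near 440 Hz
    if features["spectral_flatness"] > 0.2:
        return "speech"                   # 4a. LIVE_HUMAN vs IVR_PROMPT needs history
    return "music"                        # 4b. tonal, low-flatness audio
```

The ordering matters: a quiet frame should never reach the speech/music branch, and a detected DTMF pair should short-circuit everything below it.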
### DTMF Detection

Uses the Goertzel algorithm to detect the dual-tone pairs that make up DTMF digits:

|            | 1209 Hz | 1336 Hz | 1477 Hz | 1633 Hz |
|------------|---------|---------|---------|---------|
| **697 Hz** | 1       | 2       | 3       | A       |
| **770 Hz** | 4       | 5       | 6       | B       |
| **852 Hz** | 7       | 8       | 9       | C       |
| **941 Hz** | *       | 0       | #       | D       |

Each DTMF digit is two simultaneous frequencies. The Goertzel algorithm efficiently checks for the presence of each specific frequency without computing a full FFT.
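A minimal sketch of that idea (a textbook Goertzel rendition, not the service's actual implementation): score each row and column frequency, and the strongest pair names the digit.

```python
import numpy as np

ROW_FREQS = (697.0, 770.0, 852.0, 941.0)
COL_FREQS = (1209.0, 1336.0, 1477.0, 1633.0)
DIGITS = [["1", "2", "3", "A"],
          ["4", "5", "6", "B"],
          ["7", "8", "9", "C"],
          ["*", "0", "#", "D"]]

def goertzel_power(samples: np.ndarray, freq: float, sample_rate: int) -> float:
    """Power at one target frequency, without computing a full FFT."""
    n = len(samples)
    k = round(n * freq / sample_rate)            # nearest DFT bin
    coeff = 2.0 * np.cos(2.0 * np.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def detect_digit(samples: np.ndarray, sample_rate: int) -> str:
    """Pick the strongest row and column frequency; the pair names the digit."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROW_FREQS[i], sample_rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, COL_FREQS[i], sample_rate))
    return DIGITS[row][col]

# Digit "5" is 770 Hz + 1336 Hz played together
t = np.arange(800) / 8000.0   # 0.1 s at 8 kHz
tone = np.sin(2 * np.pi * 770.0 * t) + np.sin(2 * np.pi * 1336.0 * t)
print(detect_digit(tone, 8000))   # prints 5
```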
### Hold-to-Human Transition

The most critical detection — when a live person picks up after hold music:

```python
def detect_hold_to_human_transition(self) -> bool:
    """
    Check classification history for the pattern:
    MUSIC, MUSIC, MUSIC, ... → LIVE_HUMAN/IVR_PROMPT

    Requires:
    - At least 3 recent MUSIC classifications
    - Followed by 2+ speech classifications
    - Speech has sufficient energy (not just noise)
    """
    recent = self._history[-10:]

    # Find the transition point
    music_count = 0
    speech_count = 0
    for result in recent:
        if result.audio_type == AudioClassification.MUSIC:
            music_count += 1
            speech_count = 0  # reset
        elif result.audio_type in (AudioClassification.LIVE_HUMAN, AudioClassification.IVR_PROMPT):
            speech_count += 1

    return music_count >= 3 and speech_count >= 2
```
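The counting rule is easier to see on plain label strings. A standalone rendition of the same logic (not the service class itself):

```python
def hold_to_human(labels: list[str]) -> bool:
    """Same rule: 3+ music frames, then 2+ uninterrupted speech frames."""
    music_count = 0
    speech_count = 0
    for label in labels[-10:]:
        if label == "music":
            music_count += 1
            speech_count = 0          # speech must follow the music run
        elif label in ("live_human", "ivr_prompt"):
            speech_count += 1
    return music_count >= 3 and speech_count >= 2

print(hold_to_human(["music"] * 5 + ["live_human", "live_human"]))   # True
print(hold_to_human(["music", "live_human"] * 2))                    # False: run broken up
```

Resetting `speech_count` whenever music reappears is what makes brief recorded interruptions ("your call is important to us") not trip the detector.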
## Classification Result

Each classification returns:

```python
from dataclasses import dataclass

@dataclass
class ClassificationResult:
    timestamp: float
    audio_type: AudioClassification
    confidence: float   # 0.0 to 1.0
    details: dict       # Feature values, detected frequencies, etc.
```

The `details` dict includes all extracted features, making them available for debugging and analytics:

```python
{
    "rms": 0.0423,
    "spectral_flatness": 0.15,
    "zcr": 0.087,
    "dominant_freq": 440.0,
    "spectral_centroid": 523.7,
    "is_tonal": True
}
```
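As a sketch of how downstream code might consume results (hypothetical analytics, with `audio_type` simplified to a plain string), filtering frames by confidence and pulling a feature out of `details`:

```python
from dataclasses import dataclass

@dataclass
class ClassificationResult:
    timestamp: float
    audio_type: str        # simplified stand-in for AudioClassification
    confidence: float      # 0.0 to 1.0
    details: dict

# Hypothetical slice of a call's classification stream
results = [
    ClassificationResult(0.0, "ringing", 0.9, {"dominant_freq": 440.0}),
    ClassificationResult(3.0, "unknown", 0.2, {"dominant_freq": 0.0}),
    ClassificationResult(6.0, "music", 0.8, {"dominant_freq": 523.3}),
]

# Keep only confident frames, then read a feature back out of details
confident = [r for r in results if r.confidence >= 0.5]
ring_freqs = [r.details["dominant_freq"] for r in confident if r.audio_type == "ringing"]
```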
## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `CLASSIFIER_MUSIC_THRESHOLD` | Spectral flatness below this = music | `0.7` |
| `CLASSIFIER_SPEECH_THRESHOLD` | Spectral flatness above this = speech | `0.6` |
| `CLASSIFIER_SILENCE_THRESHOLD` | RMS below this = silence | `0.85` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for each classification | `3.0` |
## Testing

The audio classifier has 18 unit tests covering:

- Silence detection (pure silence, very quiet, empty audio)
- Tone detection (440Hz ringback, 1000Hz test tone)
- DTMF detection (digit 5, digit 0)
- Speech detection (speech-like waveforms)
- Classification history (hold→human transition, IVR non-transition)
- Feature extraction (RMS, ZCR, spectral flatness, dominant frequency)

```bash
pytest tests/test_audio_classifier.py -v
```

> **Known issue:** `test_complex_tone_as_music` covers an edge case where a multi-harmonic synthetic tone is classified as `LIVE_HUMAN` instead of `MUSIC`. This is acceptable: real hold music has different characteristics from synthetic test signals.