
# Audio Classifier

The Audio Classifier (`services/audio_classifier.py`) performs real-time waveform analysis on phone audio to determine what's happening on the call: silence, ringing, hold music, IVR prompts, DTMF tones, or live human speech.

## Classification Types

```python
class AudioClassification(str, Enum):
    SILENCE = "silence"        # No meaningful audio
    MUSIC = "music"            # Hold music
    IVR_PROMPT = "ivr_prompt"  # Recorded voice menu
    LIVE_HUMAN = "live_human"  # Live person speaking
    RINGING = "ringing"        # Ringback tone
    DTMF = "dtmf"              # Touch-tone digits
    UNKNOWN = "unknown"        # Can't classify
```

## Feature Extraction

Every audio frame (typically 3 seconds of 16 kHz PCM) goes through feature extraction:

| Feature | What It Measures | How It's Used |
| --- | --- | --- |
| RMS Energy | Loudness (root mean square of samples) | Silence detection: below threshold = silence |
| Spectral Flatness | How noise-like vs. tonal the audio is (0 = pure tone, 1 = white noise) | Music has low flatness (tonal); speech has higher flatness |
| Zero-Crossing Rate | How often the waveform crosses zero | Speech has moderate ZCR; tones have very regular ZCR |
| Dominant Frequency | Strongest frequency component (via FFT) | Ringback detection (440 Hz), DTMF detection |
| Spectral Centroid | "Center of mass" of the frequency spectrum | Speech has a higher centroid than music |
| Tonality | Whether the audio is dominated by a single frequency | Tones/DTMF are highly tonal; speech is not |

### Feature Extraction Code

```python
def _extract_features(self, audio: np.ndarray) -> dict:
    rms = np.sqrt(np.mean(audio ** 2))

    # FFT for frequency analysis
    fft = np.fft.rfft(audio)
    magnitude = np.abs(fft)
    freqs = np.fft.rfftfreq(len(audio), 1.0 / self._sample_rate)

    # Spectral flatness: geometric mean / arithmetic mean of magnitude
    spectral_flatness = np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

    # Zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2

    # Dominant frequency
    dominant_freq = freqs[np.argmax(magnitude)]

    # Spectral centroid
    spectral_centroid = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)

    return {
        "rms": rms,
        "spectral_flatness": spectral_flatness,
        "zcr": zcr,
        "dominant_freq": dominant_freq,
        "spectral_centroid": spectral_centroid,
    }
```
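To get a feel for how these features separate signal types, the same math can be run as a standalone function on synthetic input. This is an illustrative re-implementation for experimentation, not the class method itself:

```python
import numpy as np

def extract_features(audio: np.ndarray, sample_rate: int = 16000) -> dict:
    """Standalone copy of the feature math above, for experimentation."""
    magnitude = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), 1.0 / sample_rate)
    return {
        "rms": float(np.sqrt(np.mean(audio ** 2))),
        # geometric mean / arithmetic mean of the magnitude spectrum
        "spectral_flatness": float(
            np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)
        ),
        "zcr": float(np.mean(np.abs(np.diff(np.sign(audio)))) / 2),
        "dominant_freq": float(freqs[np.argmax(magnitude)]),
        "spectral_centroid": float(np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)),
    }

# Compare a pure 440 Hz tone against white noise over one 3-second window
t = np.arange(3 * 16000) / 16000.0
tone_features = extract_features(np.sin(2 * np.pi * 440.0 * t))
noise_features = extract_features(np.random.default_rng(0).normal(0.0, 0.5, len(t)))
```

The tone comes out with a dominant frequency of 440 Hz and near-zero spectral flatness, while the noise frame's flatness is high, which is exactly the split the classifier relies on.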

## Classification Logic

Classification follows a priority chain:

```text
1. SILENCE — RMS below threshold?
   └── Yes → SILENCE (confidence based on how quiet)

2. DTMF — Goertzel algorithm detects dual-tone pairs?
   └── Yes → DTMF (with detected digit in details)

3. RINGING — Dominant frequency near 440 Hz + tonal?
   └── Yes → RINGING

4. SPEECH vs MUSIC discrimination:
   ├── High spectral flatness + moderate ZCR → LIVE_HUMAN or IVR_PROMPT
   │   └── _looks_like_live_human() checks history for hold→speech transition
   │       ├── Yes → LIVE_HUMAN
   │       └── No → IVR_PROMPT
   │
   └── Low spectral flatness + tonal → MUSIC
```
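As a rough sketch, the chain above can be expressed as a single function over the extracted features. The function name and every threshold value below are illustrative, not the project's actual code or configuration:

```python
def classify(features: dict, heard_hold_music: bool) -> str:
    """Illustrative priority chain; thresholds are invented for the sketch."""
    if features["rms"] < 0.01:                          # 1. quiet enough to be silence
        return "silence"
    if features.get("dtmf_digit") is not None:          # 2. Goertzel found a dual-tone pair
        return "dtmf"
    if features["is_tonal"] and abs(features["dominant_freq"] - 440.0) < 25.0:
        return "ringing"                                # 3. tonal, near 440 Hz ringback
    if features["spectral_flatness"] > 0.6:             # 4. flat spectrum => speech-like
        # Live vs. recorded: did we just come off hold music?
        return "live_human" if heard_hold_music else "ivr_prompt"
    return "music"                                      # tonal but not flat: hold music
```

Note the ordering matters: a DTMF burst is also tonal, so it must be checked before the ringback test ever sees it.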

## DTMF Detection

Uses the Goertzel algorithm to detect the dual-tone pairs that make up DTMF digits:

|        | 1209 Hz | 1336 Hz | 1477 Hz | 1633 Hz |
| ------ | ------- | ------- | ------- | ------- |
| 697 Hz | 1 | 2 | 3 | A |
| 770 Hz | 4 | 5 | 6 | B |
| 852 Hz | 7 | 8 | 9 | C |
| 941 Hz | * | 0 | # | D |

Each DTMF digit is two simultaneous frequencies. The Goertzel algorithm efficiently checks for the presence of each specific frequency without computing a full FFT.
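A minimal sketch of this approach (not the project's actual implementation): run one Goertzel power probe per DTMF frequency, then pick the strongest row and column tone:

```python
import numpy as np

DTMF_ROWS = [697.0, 770.0, 852.0, 941.0]
DTMF_COLS = [1209.0, 1336.0, 1477.0, 1633.0]
DTMF_KEYPAD = ["123A", "456B", "789C", "*0#D"]

def goertzel_power(samples: np.ndarray, freq: float, sample_rate: int) -> float:
    """Power at one target frequency via the Goertzel recurrence (cheaper than a full FFT)."""
    coeff = 2.0 * np.cos(2.0 * np.pi * freq / sample_rate)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s_prev2, s_prev = s_prev, x + coeff * s_prev - s_prev2
    return s_prev ** 2 + s_prev2 ** 2 - coeff * s_prev * s_prev2

def detect_dtmf_digit(samples: np.ndarray, sample_rate: int = 16000) -> str:
    """Pick the strongest row and column tone; real code would also threshold the powers
    so that non-DTMF audio is rejected instead of mapped to the nearest digit."""
    row = int(np.argmax([goertzel_power(samples, f, sample_rate) for f in DTMF_ROWS]))
    col = int(np.argmax([goertzel_power(samples, f, sample_rate) for f in DTMF_COLS]))
    return DTMF_KEYPAD[row][col]

# Synthesize 50 ms of the digit "5" (770 Hz + 1336 Hz) and decode it
t = np.arange(int(0.05 * 16000)) / 16000.0
digit_5 = np.sin(2 * np.pi * 770.0 * t) + np.sin(2 * np.pi * 1336.0 * t)
```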

## Hold-to-Human Transition

The most critical detection — when a live person picks up after hold music:

```python
def detect_hold_to_human_transition(self) -> bool:
    """
    Check classification history for the pattern:
    MUSIC, MUSIC, MUSIC, ... → LIVE_HUMAN/IVR_PROMPT

    Requires:
    - At least 3 recent MUSIC classifications
    - Followed by 2+ speech classifications
    - Speech has sufficient energy (not just noise)
    """
    recent = self._history[-10:]

    # Find the transition point
    music_count = 0
    speech_count = 0
    for result in recent:
        if result.audio_type == AudioClassification.MUSIC:
            music_count += 1
            speech_count = 0  # reset
        elif result.audio_type in (AudioClassification.LIVE_HUMAN, AudioClassification.IVR_PROMPT):
            speech_count += 1

    return music_count >= 3 and speech_count >= 2
```
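The counting rule can be exercised in isolation. Here is the same logic as an illustrative standalone variant operating on a plain list of labels instead of `ClassificationResult` objects:

```python
def hold_to_human(labels: list[str]) -> bool:
    """Same counting rule as the method above, on a plain list of recent labels."""
    music_count = speech_count = 0
    for label in labels[-10:]:
        if label == "music":
            music_count += 1
            speech_count = 0  # the speech run must come *after* the music
        elif label in ("live_human", "ivr_prompt"):
            speech_count += 1
    return music_count >= 3 and speech_count >= 2
```

Three or more music frames followed by two or more speech frames trigger the transition; a single speech frame after hold music does not, which filters out one-off misclassifications.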

## Classification Result

Each classification returns:

```python
@dataclass
class ClassificationResult:
    timestamp: float
    audio_type: AudioClassification
    confidence: float  # 0.0 to 1.0
    details: dict      # Feature values, detected frequencies, etc.
```

The `details` dict includes all extracted features, making them available for debugging and analytics:

```python
{
    "rms": 0.0423,
    "spectral_flatness": 0.15,
    "zcr": 0.087,
    "dominant_freq": 440.0,
    "spectral_centroid": 523.7,
    "is_tonal": True,
}
```

## Configuration

| Setting | Description | Default |
| --- | --- | --- |
| `CLASSIFIER_MUSIC_THRESHOLD` | Spectral flatness below this = music | 0.7 |
| `CLASSIFIER_SPEECH_THRESHOLD` | Spectral flatness above this = speech | 0.6 |
| `CLASSIFIER_SILENCE_THRESHOLD` | RMS below this = silence | 0.85 |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for each classification (seconds) | 3.0 |
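Assuming these settings are read from environment variables (the project's actual loading mechanism is not shown in this doc), a loader might look like:

```python
import os
from dataclasses import dataclass, field

def _env_float(name: str, default: float) -> float:
    """Read a float setting from the environment, falling back to the table default."""
    return float(os.environ.get(name, default))

@dataclass(frozen=True)
class ClassifierConfig:
    """Hypothetical settings holder; names and defaults come from the table above."""
    music_threshold: float = field(default_factory=lambda: _env_float("CLASSIFIER_MUSIC_THRESHOLD", 0.7))
    speech_threshold: float = field(default_factory=lambda: _env_float("CLASSIFIER_SPEECH_THRESHOLD", 0.6))
    silence_threshold: float = field(default_factory=lambda: _env_float("CLASSIFIER_SILENCE_THRESHOLD", 0.85))
    window_seconds: float = field(default_factory=lambda: _env_float("CLASSIFIER_WINDOW_SECONDS", 3.0))
```

Using `default_factory` (rather than plain defaults) means the environment is consulted each time a config object is constructed, not once at import time.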

## Testing

The audio classifier has 18 unit tests covering:

- Silence detection (pure silence, very quiet, empty audio)
- Tone detection (440 Hz ringback, 1000 Hz test tone)
- DTMF detection (digit 5, digit 0)
- Speech detection (speech-like waveforms)
- Classification history (hold→human transition, IVR non-transition)
- Feature extraction (RMS, ZCR, spectral flatness, dominant frequency)

```bash
pytest tests/test_audio_classifier.py -v
```
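The real suite exercises the classifier class end to end. The snippet below only sketches the flavor of those checks against the raw feature math; the helper functions are invented for illustration:

```python
import numpy as np

def rms(audio: np.ndarray) -> float:
    """Root-mean-square energy of a frame."""
    return float(np.sqrt(np.mean(audio ** 2)))

def dominant_freq(audio: np.ndarray, sample_rate: int = 16000) -> float:
    """Frequency of the strongest FFT bin."""
    magnitude = np.abs(np.fft.rfft(audio))
    return float(np.fft.rfftfreq(len(audio), 1.0 / sample_rate)[np.argmax(magnitude)])

def test_pure_silence_has_zero_energy():
    assert rms(np.zeros(48000)) == 0.0

def test_ringback_tone_peaks_at_440hz():
    t = np.arange(48000) / 16000.0
    assert abs(dominant_freq(np.sin(2 * np.pi * 440.0 * t)) - 440.0) < 1.0
```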

Known issue: `test_complex_tone_as_music` is an edge case where a multi-harmonic synthetic tone is classified as LIVE_HUMAN instead of MUSIC. This is acceptable: real hold music has different characteristics than synthetic test signals.