# Audio Classifier

The Audio Classifier (`services/audio_classifier.py`) performs real-time waveform analysis on phone audio to determine what's happening on the call: silence, ringing, hold music, IVR prompts, DTMF tones, or live human speech.

## Classification Types

```python
class AudioClassification(str, Enum):
    SILENCE = "silence"        # No meaningful audio
    MUSIC = "music"            # Hold music
    IVR_PROMPT = "ivr_prompt"  # Recorded voice menu
    LIVE_HUMAN = "live_human"  # Live person speaking
    RINGING = "ringing"        # Ringback tone
    DTMF = "dtmf"              # Touch-tone digits
    UNKNOWN = "unknown"        # Can't classify
```

## Feature Extraction

Every audio frame (typically 3 seconds of 16kHz PCM) goes through feature extraction:

| Feature | What It Measures | How It's Used |
|---------|-----------------|---------------|
| **RMS Energy** | Loudness (root mean square of samples) | Silence detection — below threshold = silence |
| **Spectral Flatness** | How noise-like vs tonal the audio is (0 = pure tone, 1 = white noise) | Music has low flatness (tonal); speech has higher flatness |
| **Zero-Crossing Rate** | How often the waveform crosses zero | Speech has moderate ZCR; tones have very regular ZCR |
| **Dominant Frequency** | Strongest frequency component (via FFT) | Ringback detection (440 Hz), DTMF detection |
| **Spectral Centroid** | "Center of mass" of the frequency spectrum | Speech has a higher centroid than music |
| **Tonality** | Whether the audio is dominated by a single frequency | Tones/DTMF are highly tonal; speech is not |

### Feature Extraction Code

```python
def _extract_features(self, audio: np.ndarray) -> dict:
    rms = np.sqrt(np.mean(audio ** 2))

    # FFT for frequency analysis
    fft = np.fft.rfft(audio)
    magnitude = np.abs(fft)
    freqs = np.fft.rfftfreq(len(audio), 1.0 / self._sample_rate)

    # Spectral flatness: geometric mean / arithmetic mean of the magnitude spectrum
    spectral_flatness = np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

    # Zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2

    # Dominant frequency
    dominant_freq = freqs[np.argmax(magnitude)]

    # Spectral centroid
    spectral_centroid = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)

    return { ... }
```

## Classification Logic

Classification follows a priority chain:

```
1. SILENCE — RMS below threshold?
   └── Yes → SILENCE (confidence based on how quiet)
2. DTMF — Goertzel algorithm detects dual-tone pairs?
   └── Yes → DTMF (with detected digit in details)
3. RINGING — Dominant frequency near 440 Hz + tonal?
   └── Yes → RINGING
4. SPEECH vs MUSIC discrimination:
   ├── High spectral flatness + moderate ZCR → LIVE_HUMAN or IVR_PROMPT
   │   └── _looks_like_live_human() checks history for hold→speech transition
   │       ├── Yes → LIVE_HUMAN
   │       └── No → IVR_PROMPT
   └── Low spectral flatness + tonal → MUSIC
```

### DTMF Detection

Uses the Goertzel algorithm to detect the dual-tone pairs that make up DTMF digits:

```
         1209 Hz   1336 Hz   1477 Hz   1633 Hz
697 Hz      1         2         3         A
770 Hz      4         5         6         B
852 Hz      7         8         9         C
941 Hz      *         0         #         D
```

Each DTMF digit is two simultaneous frequencies. The Goertzel algorithm efficiently checks for the presence of each specific frequency without computing a full FFT.

### Hold-to-Human Transition

The most critical detection — when a live person picks up after hold music:

```python
def detect_hold_to_human_transition(self) -> bool:
    """
    Check classification history for the pattern:

        MUSIC, MUSIC, MUSIC, ...
        → LIVE_HUMAN/IVR_PROMPT

    Requires:
    - At least 3 recent MUSIC classifications
    - Followed by 2+ speech classifications
    - Speech has sufficient energy (not just noise)
    """
    recent = self._history[-10:]

    # Find the transition point
    music_count = 0
    speech_count = 0
    for result in recent:
        if result.audio_type == AudioClassification.MUSIC:
            music_count += 1
            speech_count = 0  # reset
        elif result.audio_type in (AudioClassification.LIVE_HUMAN,
                                   AudioClassification.IVR_PROMPT):
            speech_count += 1

    return music_count >= 3 and speech_count >= 2
```

## Classification Result

Each classification returns:

```python
@dataclass
class ClassificationResult:
    timestamp: float
    audio_type: AudioClassification
    confidence: float  # 0.0 to 1.0
    details: dict      # Feature values, detected frequencies, etc.
```

The `details` dict includes all extracted features, making them available for debugging and analytics:

```python
{
    "rms": 0.0423,
    "spectral_flatness": 0.15,
    "zcr": 0.087,
    "dominant_freq": 440.0,
    "spectral_centroid": 523.7,
    "is_tonal": True
}
```

## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `CLASSIFIER_MUSIC_THRESHOLD` | Spectral flatness below this = music | `0.7` |
| `CLASSIFIER_SPEECH_THRESHOLD` | Spectral flatness above this = speech | `0.6` |
| `CLASSIFIER_SILENCE_THRESHOLD` | RMS below this = silence | `0.85` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for each classification | `3.0` |

## Testing

The audio classifier has 18 unit tests covering:

- Silence detection (pure silence, very quiet, empty audio)
- Tone detection (440Hz ringback, 1000Hz test tone)
- DTMF detection (digit 5, digit 0)
- Speech detection (speech-like waveforms)
- Classification history (hold→human transition, IVR non-transition)
- Feature extraction (RMS, ZCR, spectral flatness, dominant frequency)

```bash
pytest tests/test_audio_classifier.py -v
```

> **Known issue:** `test_complex_tone_as_music` is an edge case where a multi-harmonic synthetic tone is classified as
> `LIVE_HUMAN` instead of `MUSIC`. This is acceptable — real hold music has different characteristics than synthetic test signals.
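To make the Goertzel-based DTMF detection described above concrete, here is a minimal, self-contained sketch. It is illustrative only: the function names (`goertzel_power`, `detect_dtmf_digit`) and the 0.8 dominance ratio are assumptions for this example, not the service's actual API or thresholds.

```python
import math

# Standard DTMF frequency grid: each digit = one low-band + one high-band tone.
DTMF_LOW = [697, 770, 852, 941]
DTMF_HIGH = [1209, 1336, 1477, 1633]
DTMF_DIGITS = [
    ["1", "2", "3", "A"],
    ["4", "5", "6", "B"],
    ["7", "8", "9", "C"],
    ["*", "0", "#", "D"],
]

def goertzel_power(samples, target_freq, sample_rate):
    """Power of a single frequency bin, without computing a full FFT."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)  # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def detect_dtmf_digit(samples, sample_rate=16000):
    """Return a DTMF digit if one low and one high tone clearly dominate."""
    low_powers = [goertzel_power(samples, f, sample_rate) for f in DTMF_LOW]
    high_powers = [goertzel_power(samples, f, sample_rate) for f in DTMF_HIGH]
    li = max(range(4), key=lambda i: low_powers[i])
    hi = max(range(4), key=lambda i: high_powers[i])
    total = sum(low_powers) + sum(high_powers)
    # Hypothetical dominance check: the two winning bins must carry
    # most of the energy across all eight DTMF bins.
    if total > 0 and (low_powers[li] + high_powers[hi]) / total > 0.8:
        return DTMF_DIGITS[li][hi]
    return None
```

For example, a synthetic 770 Hz + 1336 Hz frame should decode as digit `5`, while a silent frame yields `None`.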
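The spectral flatness feature that drives the music/speech split can be sanity-checked in isolation. This is a hedged sketch: `spectral_flatness` below is a standalone copy of the formula from `_extract_features` (geometric mean over arithmetic mean of the magnitude spectrum), not the service's own function, showing that a pure tone scores near 0 and white noise scores much higher.

```python
import numpy as np

def spectral_flatness(audio: np.ndarray) -> float:
    """Geometric mean / arithmetic mean of the magnitude spectrum (0..1)."""
    magnitude = np.abs(np.fft.rfft(audio))
    return np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)                   # pure 440 Hz tone → tonal
noise = np.random.default_rng(0).standard_normal(sr)   # white noise → noise-like

print(spectral_flatness(tone))   # very small (tonal)
print(spectral_flatness(noise))  # much closer to 1 (noise-like)
```

This is why hold music (harmonic, tonal) lands low on the flatness scale while speech, with its noisy fricative energy, lands higher.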