# Audio Classifier

The Audio Classifier (`services/audio_classifier.py`) performs real-time waveform analysis on phone audio to determine what's happening on the call: silence, ringing, hold music, IVR prompts, DTMF tones, or live human speech.

## Classification Types

```python
class AudioClassification(str, Enum):
    SILENCE = "silence"        # No meaningful audio
    MUSIC = "music"            # Hold music
    IVR_PROMPT = "ivr_prompt"  # Recorded voice menu
    LIVE_HUMAN = "live_human"  # Live person speaking
    RINGING = "ringing"        # Ringback tone
    DTMF = "dtmf"              # Touch-tone digits
    UNKNOWN = "unknown"        # Can't classify
```

## Feature Extraction

Every audio frame (typically 3 seconds of 16kHz PCM) goes through feature extraction:

| Feature | What It Measures | How It's Used |
|---------|-----------------|---------------|
| **RMS Energy** | Loudness (root mean square of samples) | Silence detection — below threshold = silence |
| **Spectral Flatness** | How noise-like vs tonal the audio is (0 = pure tone, 1 = white noise) | Music has low flatness (tonal); speech has higher flatness |
| **Zero-Crossing Rate** | How often the waveform crosses zero | Speech has moderate ZCR; tones have very regular ZCR |
| **Dominant Frequency** | Strongest frequency component (via FFT) | Ringback detection (440 Hz), DTMF detection |
| **Spectral Centroid** | "Center of mass" of the frequency spectrum | Speech has a higher centroid than music |
| **Tonality** | Whether the audio is dominated by a single frequency | Tones/DTMF are highly tonal; speech is not |

### Feature Extraction Code

```python
def _extract_features(self, audio: np.ndarray) -> dict:
    rms = np.sqrt(np.mean(audio ** 2))

    # FFT for frequency analysis
    fft = np.fft.rfft(audio)
    magnitude = np.abs(fft)
    freqs = np.fft.rfftfreq(len(audio), 1.0 / self._sample_rate)

    # Spectral flatness: geometric mean / arithmetic mean of the magnitude spectrum
    spectral_flatness = np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

    # Zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2

    # Dominant frequency
    dominant_freq = freqs[np.argmax(magnitude)]

    # Spectral centroid
    spectral_centroid = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)

    return { ... }
```

## Classification Logic

Classification follows a priority chain:

```
1. SILENCE — RMS below threshold?
   └── Yes → SILENCE (confidence based on how quiet)
2. DTMF — Goertzel algorithm detects dual-tone pairs?
   └── Yes → DTMF (with detected digit in details)
3. RINGING — Dominant frequency near 440 Hz + tonal?
   └── Yes → RINGING
4. SPEECH vs MUSIC discrimination:
   ├── High spectral flatness + moderate ZCR → LIVE_HUMAN or IVR_PROMPT
   │   └── _looks_like_live_human() checks history for hold→speech transition
   │       ├── Yes → LIVE_HUMAN
   │       └── No → IVR_PROMPT
   └── Low spectral flatness + tonal → MUSIC
```

### DTMF Detection

Uses the Goertzel algorithm to detect the dual-tone pairs that make up DTMF digits:

```
         1209 Hz   1336 Hz   1477 Hz   1633 Hz
697 Hz      1         2         3         A
770 Hz      4         5         6         B
852 Hz      7         8         9         C
941 Hz      *         0         #         D
```

Each DTMF digit is two simultaneous frequencies. The Goertzel algorithm efficiently checks for the presence of each specific frequency without computing a full FFT.

### Hold-to-Human Transition

The most critical detection — when a live person picks up after hold music:

```python
def detect_hold_to_human_transition(self) -> bool:
    """
    Check classification history for the pattern:

        MUSIC, MUSIC, MUSIC, ...
        → LIVE_HUMAN/IVR_PROMPT

    Requires:
    - At least 3 recent MUSIC classifications
    - Followed by 2+ speech classifications
    - Speech has sufficient energy (not just noise)
    """
    recent = self._history[-10:]

    # Find the transition point
    music_count = 0
    speech_count = 0
    for result in recent:
        if result.audio_type == AudioClassification.MUSIC:
            music_count += 1
            speech_count = 0  # reset
        elif result.audio_type in (AudioClassification.LIVE_HUMAN,
                                   AudioClassification.IVR_PROMPT):
            speech_count += 1

    return music_count >= 3 and speech_count >= 2
```

## Classification Result

Each classification returns:

```python
@dataclass
class ClassificationResult:
    timestamp: float
    audio_type: AudioClassification
    confidence: float  # 0.0 to 1.0
    details: dict      # Feature values, detected frequencies, etc.
```

The `details` dict includes all extracted features, making them available for debugging and analytics:

```python
{
    "rms": 0.0423,
    "spectral_flatness": 0.15,
    "zcr": 0.087,
    "dominant_freq": 440.0,
    "spectral_centroid": 523.7,
    "is_tonal": True
}
```

## Configuration

| Setting | Description | Default |
|---------|-------------|---------|
| `CLASSIFIER_MUSIC_THRESHOLD` | Spectral flatness below this = music | `0.7` |
| `CLASSIFIER_SPEECH_THRESHOLD` | Spectral flatness above this = speech | `0.6` |
| `CLASSIFIER_SILENCE_THRESHOLD` | RMS below this = silence | `0.85` |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for each classification | `3.0` |

## Testing

The audio classifier has 18 unit tests covering:

- Silence detection (pure silence, very quiet, empty audio)
- Tone detection (440Hz ringback, 1000Hz test tone)
- DTMF detection (digit 5, digit 0)
- Speech detection (speech-like waveforms)
- Classification history (hold→human transition, IVR non-transition)
- Feature extraction (RMS, ZCR, spectral flatness, dominant frequency)

```bash
pytest tests/test_audio_classifier.py -v
```

> **Known issue:** `test_complex_tone_as_music` is an edge case where a multi-harmonic synthetic tone is classified as
> `LIVE_HUMAN` instead of `MUSIC`. This is acceptable — real hold music has different characteristics than synthetic test signals.
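To make the Goertzel-based DTMF detection described above concrete, here is a minimal, self-contained sketch. It is illustrative only: the function names (`goertzel_power`, `detect_dtmf_digit`) and the 0.8 dominance ratio are assumptions for this example, not the service's actual API or thresholds.

```python
import math

# Standard DTMF frequency grid: each digit = one low-band + one high-band tone.
DTMF_LOW = [697, 770, 852, 941]
DTMF_HIGH = [1209, 1336, 1477, 1633]
DTMF_DIGITS = [
    ["1", "2", "3", "A"],
    ["4", "5", "6", "B"],
    ["7", "8", "9", "C"],
    ["*", "0", "#", "D"],
]

def goertzel_power(samples, target_freq, sample_rate):
    """Power of a single frequency bin, without computing a full FFT."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)  # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev, s_prev2 = 0.0, 0.0
    for x in samples:
        s = x + coeff * s_prev - s_prev2
        s_prev2, s_prev = s_prev, s
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

def detect_dtmf_digit(samples, sample_rate=16000):
    """Return a DTMF digit if one low and one high tone clearly dominate."""
    low_powers = [goertzel_power(samples, f, sample_rate) for f in DTMF_LOW]
    high_powers = [goertzel_power(samples, f, sample_rate) for f in DTMF_HIGH]
    li = max(range(4), key=lambda i: low_powers[i])
    hi = max(range(4), key=lambda i: high_powers[i])
    total = sum(low_powers) + sum(high_powers)
    # Hypothetical dominance check: the two winning bins must carry
    # most of the energy across all eight DTMF bins.
    if total > 0 and (low_powers[li] + high_powers[hi]) / total > 0.8:
        return DTMF_DIGITS[li][hi]
    return None
```

For example, a synthetic 770 Hz + 1336 Hz frame should decode as digit `5`, while a silent frame yields `None`.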
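The spectral flatness feature that drives the music/speech split can be sanity-checked in isolation. This is a hedged sketch: `spectral_flatness` below is a standalone copy of the formula from `_extract_features` (geometric mean over arithmetic mean of the magnitude spectrum), not the service's own function, showing that a pure tone scores near 0 and white noise scores much higher.

```python
import numpy as np

def spectral_flatness(audio: np.ndarray) -> float:
    """Geometric mean / arithmetic mean of the magnitude spectrum (0..1)."""
    magnitude = np.abs(np.fft.rfft(audio))
    return np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)                   # pure 440 Hz tone → tonal
noise = np.random.default_rng(0).standard_normal(sr)   # white noise → noise-like

print(spectral_flatness(tone))   # very small (tonal)
print(spectral_flatness(noise))  # much closer to 1 (noise-like)
```

This is why hold music (harmonic, tonal) lands low on the flatness scale while speech, with its noisy fricative energy, lands higher.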