# Audio Classifier
The Audio Classifier (`services/audio_classifier.py`) performs real-time waveform analysis on phone audio to determine what's happening on the call: silence, ringing, hold music, IVR prompts, DTMF tones, or live human speech.
## Classification Types

```python
from enum import Enum

class AudioClassification(str, Enum):
    SILENCE = "silence"          # No meaningful audio
    MUSIC = "music"              # Hold music
    IVR_PROMPT = "ivr_prompt"    # Recorded voice menu
    LIVE_HUMAN = "live_human"    # Live person speaking
    RINGING = "ringing"          # Ringback tone
    DTMF = "dtmf"                # Touch-tone digits
    UNKNOWN = "unknown"          # Can't classify
```
## Feature Extraction
Every audio frame (typically 3 seconds of 16kHz PCM) goes through feature extraction:
| Feature | What It Measures | How It's Used |
|---|---|---|
| RMS Energy | Loudness (root mean square of samples) | Silence detection — below threshold = silence |
| Spectral Flatness | How noise-like vs tonal the audio is (0=pure tone, 1=white noise) | Music has low flatness (tonal), speech has higher flatness |
| Zero-Crossing Rate | How often the waveform crosses zero | Speech has moderate ZCR, tones have very regular ZCR |
| Dominant Frequency | Strongest frequency component (via FFT) | Ringback detection (440Hz), DTMF detection |
| Spectral Centroid | "Center of mass" of the frequency spectrum | Speech has higher centroid than music |
| Tonality | Whether the audio is dominated by a single frequency | Tones/DTMF are highly tonal, speech is not |
### Feature Extraction Code

```python
def _extract_features(self, audio: np.ndarray) -> dict:
    rms = np.sqrt(np.mean(audio ** 2))

    # FFT for frequency analysis
    fft = np.fft.rfft(audio)
    magnitude = np.abs(fft)
    freqs = np.fft.rfftfreq(len(audio), 1.0 / self._sample_rate)

    # Spectral flatness: geometric mean / arithmetic mean of the magnitude spectrum
    spectral_flatness = np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)

    # Zero-crossing rate
    zcr = np.mean(np.abs(np.diff(np.sign(audio)))) / 2

    # Dominant frequency
    dominant_freq = freqs[np.argmax(magnitude)]

    # Spectral centroid
    spectral_centroid = np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)

    return { ... }
```
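The same computations can be tried outside the class. The standalone helper below is a hypothetical re-creation for experimentation (the function name and the 440 Hz demo are illustrative, not part of the project); a pure tone should show near-zero spectral flatness and a dominant frequency on the tone's bin:

```python
import numpy as np

# Hypothetical standalone mirror of the feature extraction shown above.
def extract_features(audio: np.ndarray, sample_rate: int = 16000) -> dict:
    magnitude = np.abs(np.fft.rfft(audio))
    freqs = np.fft.rfftfreq(len(audio), 1.0 / sample_rate)
    return {
        "rms": float(np.sqrt(np.mean(audio ** 2))),
        "spectral_flatness": float(
            np.exp(np.mean(np.log(magnitude + 1e-10))) / (np.mean(magnitude) + 1e-10)
        ),
        "zcr": float(np.mean(np.abs(np.diff(np.sign(audio)))) / 2),
        "dominant_freq": float(freqs[np.argmax(magnitude)]),
        "spectral_centroid": float(np.sum(freqs * magnitude) / (np.sum(magnitude) + 1e-10)),
    }

# One second of a pure 440 Hz tone: tonal, so flatness is near 0 and the
# dominant frequency lands exactly on the 440 Hz bin (1 Hz resolution here).
t = np.arange(16000) / 16000.0
feats = extract_features(0.5 * np.sin(2 * np.pi * 440.0 * t))
```

A speech-like or noisy signal run through the same helper would show much higher spectral flatness, which is exactly the discriminator the classifier relies on.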
## Classification Logic

Classification follows a priority chain:

```
1. SILENCE — RMS below threshold?
   └── Yes → SILENCE (confidence based on how quiet)
2. DTMF — Goertzel algorithm detects dual-tone pairs?
   └── Yes → DTMF (with detected digit in details)
3. RINGING — dominant frequency near 440 Hz + tonal?
   └── Yes → RINGING
4. SPEECH vs MUSIC discrimination:
   ├── High spectral flatness + moderate ZCR → LIVE_HUMAN or IVR_PROMPT
   │   └── _looks_like_live_human() checks history for hold→speech transition
   │       ├── Yes → LIVE_HUMAN
   │       └── No  → IVR_PROMPT
   └── Low spectral flatness + tonal → MUSIC
```
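The chain can be condensed into a few lines. In the sketch below the threshold values, dict keys, and string labels are illustrative assumptions, not the project's actual API (the real service uses configured thresholds, the `AudioClassification` enum, and a ZCR check alongside flatness):

```python
# Minimal sketch of the priority chain; all constants are assumed values.
SILENCE_RMS = 0.01       # assumed silence threshold
SPEECH_FLATNESS = 0.6    # assumed speech/music flatness boundary
RINGBACK_HZ = 440.0

def classify(features: dict, history: list) -> str:
    if features["rms"] < SILENCE_RMS:                        # 1. silence first
        return "silence"
    if features.get("dtmf_digit") is not None:               # 2. DTMF (Goertzel hit)
        return "dtmf"
    if features["is_tonal"] and abs(features["dominant_freq"] - RINGBACK_HZ) < 25:
        return "ringing"                                     # 3. ringback tone
    if features["spectral_flatness"] > SPEECH_FLATNESS:      # 4a. speech-like
        # A recent hold -> speech transition suggests a live human picked up.
        return "live_human" if "music" in history[-3:] else "ivr_prompt"
    if features["is_tonal"]:                                 # 4b. tonal, low flatness
        return "music"
    return "unknown"
```

The ordering matters: silence and tones are cheap, unambiguous checks, so they run before the fuzzier speech/music discrimination.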
## DTMF Detection

Uses the Goertzel algorithm to detect the dual-tone pairs that make up DTMF digits:

|        | 1209 Hz | 1336 Hz | 1477 Hz | 1633 Hz |
|--------|---------|---------|---------|---------|
| 697 Hz | 1 | 2 | 3 | A |
| 770 Hz | 4 | 5 | 6 | B |
| 852 Hz | 7 | 8 | 9 | C |
| 941 Hz | * | 0 | # | D |

Each DTMF digit is two simultaneous frequencies, one row tone plus one column tone. The Goertzel algorithm efficiently checks for the presence of each specific frequency without computing a full FFT.
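For illustration, here is a self-contained Goertzel sketch (the function names and detection harness are hypothetical, not the project's implementation). It computes per-frequency power over a frame and picks the strongest row/column pair:

```python
import math

# Hypothetical Goertzel power detector, written for clarity over speed.
def goertzel_power(samples, sample_rate, target_freq):
    """Power of target_freq in samples, without computing a full FFT."""
    n = len(samples)
    k = round(n * target_freq / sample_rate)     # nearest DFT bin
    coeff = 2.0 * math.cos(2.0 * math.pi * k / n)
    s_prev = s_prev2 = 0.0
    for x in samples:
        s_prev2, s_prev = s_prev, x + coeff * s_prev - s_prev2
    return s_prev2 ** 2 + s_prev ** 2 - coeff * s_prev * s_prev2

ROWS = [697, 770, 852, 941]
COLS = [1209, 1336, 1477, 1633]

def detect_dtmf(samples, sample_rate=8000):
    """Return the (row_hz, col_hz) pair carrying the most energy."""
    row = max(ROWS, key=lambda f: goertzel_power(samples, sample_rate, f))
    col = max(COLS, key=lambda f: goertzel_power(samples, sample_rate, f))
    return row, col

# Synthesize digit "5" (770 Hz + 1336 Hz) for 100 ms at 8 kHz.
fs = 8000
tone = [
    math.sin(2 * math.pi * 770 * i / fs) + math.sin(2 * math.pi * 1336 * i / fs)
    for i in range(fs // 10)
]
```

With this synthetic frame, `detect_dtmf(tone)` returns `(770, 1336)`, which the table above maps to digit 5. Only 8 Goertzel evaluations are needed per frame, versus a full FFT over every bin.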
## Hold-to-Human Transition

The most critical detection: when a live person picks up after hold music.

```python
def detect_hold_to_human_transition(self) -> bool:
    """
    Check classification history for the pattern:
        MUSIC, MUSIC, MUSIC, ... → LIVE_HUMAN/IVR_PROMPT

    Requires:
    - At least 3 recent MUSIC classifications
    - Followed by 2+ speech classifications
    - Speech has sufficient energy (not just noise)
    """
    recent = self._history[-10:]

    # Find the transition point
    music_count = 0
    speech_count = 0
    for result in recent:
        if result.audio_type == AudioClassification.MUSIC:
            music_count += 1
            speech_count = 0  # reset
        elif result.audio_type in (AudioClassification.LIVE_HUMAN, AudioClassification.IVR_PROMPT):
            speech_count += 1

    return music_count >= 3 and speech_count >= 2
```
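The pattern check can be exercised in isolation. The standalone wrapper below is a hypothetical test harness mirroring the method's counting logic over a plain list of classifications:

```python
from enum import Enum

# Harness-local copy of the enum values the check cares about.
class AudioClassification(str, Enum):
    MUSIC = "music"
    IVR_PROMPT = "ivr_prompt"
    LIVE_HUMAN = "live_human"
    SILENCE = "silence"

def transition_detected(history):
    """Hypothetical standalone mirror of detect_hold_to_human_transition."""
    recent = history[-10:]
    music_count = 0
    speech_count = 0
    for audio_type in recent:
        if audio_type == AudioClassification.MUSIC:
            music_count += 1
            speech_count = 0   # speech followed by more hold music doesn't count
        elif audio_type in (AudioClassification.LIVE_HUMAN, AudioClassification.IVR_PROMPT):
            speech_count += 1
    return music_count >= 3 and speech_count >= 2

M, H = AudioClassification.MUSIC, AudioClassification.LIVE_HUMAN
```

A history like `[M, M, M, H, H]` trips the detector, while speech with no preceding hold music (an IVR answering directly) does not; note that the `speech_count = 0` reset means a brief speech blip followed by more music is discarded.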
## Classification Result

Each classification returns:

```python
@dataclass
class ClassificationResult:
    timestamp: float
    audio_type: AudioClassification
    confidence: float   # 0.0 to 1.0
    details: dict       # Feature values, detected frequencies, etc.
```
The `details` dict includes all extracted features, making them available for debugging and analytics:

```python
{
    "rms": 0.0423,
    "spectral_flatness": 0.15,
    "zcr": 0.087,
    "dominant_freq": 440.0,
    "spectral_centroid": 523.7,
    "is_tonal": True
}
```
## Configuration

| Setting | Description | Default |
|---|---|---|
| `CLASSIFIER_MUSIC_THRESHOLD` | Spectral flatness below this = music | 0.7 |
| `CLASSIFIER_SPEECH_THRESHOLD` | Spectral flatness above this = speech | 0.6 |
| `CLASSIFIER_SILENCE_THRESHOLD` | RMS below this = silence | 0.85 |
| `CLASSIFIER_WINDOW_SECONDS` | Audio window size for each classification | 3.0 |
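Assuming these settings are read from environment variables, as the uppercase names suggest (the loading mechanism itself is not documented here), an override fragment for a `.env` file could look like:

```shell
# Hypothetical .env fragment; names from the table above, values are the defaults.
CLASSIFIER_MUSIC_THRESHOLD=0.7
CLASSIFIER_SPEECH_THRESHOLD=0.6
CLASSIFIER_SILENCE_THRESHOLD=0.85
CLASSIFIER_WINDOW_SECONDS=3.0
```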
## Testing

The audio classifier has 18 unit tests covering:

- Silence detection (pure silence, very quiet, empty audio)
- Tone detection (440 Hz ringback, 1000 Hz test tone)
- DTMF detection (digit 5, digit 0)
- Speech detection (speech-like waveforms)
- Classification history (hold→human transition, IVR non-transition)
- Feature extraction (RMS, ZCR, spectral flatness, dominant frequency)

```bash
pytest tests/test_audio_classifier.py -v
```
**Known issue:** `test_complex_tone_as_music` is an edge case where a multi-harmonic synthetic tone is classified as `LIVE_HUMAN` instead of `MUSIC`. This is acceptable: real hold music has different characteristics than synthetic test signals.