# Utterance
Every voice app faces the same problem: it can't tell when you're done talking.
You pause to think, and it cuts you off. You take a breath, and it responds too soon. You want to interrupt, and it keeps going.
Current solutions either detect silence (Silero VAD, ricky0123/vad) without understanding intent, or rely on server-side AI (OpenAI Realtime, AssemblyAI), which adds round-trip latency and per-minute cost.
Utterance is different. It runs a lightweight ML model entirely on the client and distinguishes a thinking pause from a completed turn. No cloud. No round trip. No per-minute fees.
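To make the idea concrete, here is a minimal sketch of what semantic endpointing means in practice. This is not Utterance's actual API; the function name, thresholds, and inputs are illustrative. The point is that silence duration alone is ambiguous, so it is weighed against the model's semantic estimate that the turn is complete:

```typescript
// Illustrative sketch only -- not Utterance's real API.
// Silence alone is ambiguous, so weigh it against a semantic
// "turn completed" probability produced by the model.

type EndpointState = "speaking" | "thinking-pause" | "turn-complete";

function classifyPause(
  silenceMs: number,      // how long the user has been silent
  completionProb: number, // model's 0-1 estimate that the turn is done
): EndpointState {
  if (silenceMs < 200) return "speaking"; // too short to count as a pause
  // A confident semantic signal ends the turn quickly; a weak one is
  // treated as a thinking pause even after a fairly long silence.
  if (completionProb > 0.8 && silenceMs > 300) return "turn-complete";
  if (completionProb < 0.5 || silenceMs < 1500) return "thinking-pause";
  return "turn-complete"; // long silence with middling confidence
}

// "Um, so I was thinking..." + 800 ms pause, low completion score:
console.log(classifyPause(800, 0.2)); // "thinking-pause"
// "That's all, thanks." + 400 ms pause, high completion score:
console.log(classifyPause(400, 0.9)); // "turn-complete"
```

A silence-only VAD collapses both cases above into the same signal; the semantic score is what separates them.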
## Key Features
- Semantic endpointing — understands thinking pauses vs. turn completion
- Interrupt detection — knows when a user wants to interject
- Confidence scoring — returns probability (0–1) for each detection
- Client-side only — no cloud, no latency, no API costs
- Lightweight — model under 5MB, inference under 50ms
- Framework agnostic — works with any voice stack
- Privacy first — audio never leaves the device
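Because every detection carries a confidence score, the app decides how aggressively to react. Below is a hypothetical consumption pattern (the result shape and field names are assumptions, not Utterance's documented API) showing how a 0–1 confidence lets you trade responsiveness against false positives when handling interrupts:

```typescript
// Hypothetical result shape for illustration -- Utterance's real
// event API may differ. Each detection carries a 0-1 confidence,
// so the app chooses its own tolerance for false positives.

interface Detection {
  kind: "endpoint" | "interrupt";
  confidence: number; // 0-1 probability reported by the model
}

// Only stop TTS playback mid-sentence when the model is confident
// the user actually wants to interject, not just breathing.
function shouldStopPlayback(d: Detection, threshold = 0.75): boolean {
  return d.kind === "interrupt" && d.confidence >= threshold;
}

console.log(shouldStopPlayback({ kind: "interrupt", confidence: 0.9 }));  // true
console.log(shouldStopPlayback({ kind: "interrupt", confidence: 0.4 }));  // false
console.log(shouldStopPlayback({ kind: "endpoint", confidence: 0.99 }));  // false
```

A lower threshold makes the assistant feel snappier but risks cutting itself off on a cough; a higher one does the opposite.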
## Comparison
| Feature | Silero VAD | ricky0123/vad | Picovoice Cobra | OpenAI Realtime | Utterance |
|---|---|---|---|---|---|
| Detects speech vs. silence | ✅ | ✅ | ✅ | ✅ | ✅ |
| Semantic pause detection | ❌ | ❌ | ❌ | ✅ | ✅ |
| Interrupt detection | ❌ | ❌ | ❌ | ✅ | ✅ |
| Runs client-side | ✅ | ✅ | ✅ | ❌ | ✅ |
| No API costs | ✅ | ✅ | ❌ | ❌ | ✅ |
| Privacy (audio stays local) | ✅ | ✅ | ✅ | ❌ | ✅ |