Turn-taking, the ability to know exactly when a user has finished speaking, is the invisible force behind natural human conversation. Yet most voice agents today rely on Voice Activity Detection (VAD) or fixed silence timers, leading to premature cut-offs or long, robotic pauses.
We introduce NAMO Turn Detector v1 (NAMO-v1), a state-of-the-art, open-source, ONNX-optimized semantic turn detector that predicts conversational boundaries by understanding meaning, not just silence. NAMO achieves <19 ms inference for specialized single-language models, <29 ms for the multilingual model, and up to 97.3% accuracy, making it a practical drop-in replacement for VAD-based endpointing in real-time voice systems.
1. Why Existing Approaches Break Down
Most deployed voice agents use one of two approaches:
- Silence-based VAD: very fast and lightweight but either interrupts users mid-sentence or waits too long to be sure they’re done.
- ASR endpointing (pause + punctuation): better than raw energy detection, but still a proxy; hesitations and lists often look “finished” when they’re not, and behavior varies wildly across languages.
Both approaches force product teams into a painful latency vs. interruption trade-off: either set a long buffer (safe but robotic) or a short one (fast but rude).
2. NAMO’s Semantic Advantage
NAMO replaces “silence as a proxy” with semantic understanding. The model looks at the text stream from your ASR and predicts whether the thought is complete. This single change brings:
- Lower floor-transfer time (snappier replies) without increasing false cut-offs.
- Multilingual robustness: one model works across 23 languages, with no per-language tuning.
- Production latency: ONNX-quantized models run in <30 ms on CPU or GPU with almost no accuracy loss.
- Observability & tuning: calibrated probabilities let you adjust the threshold along the fast-vs-safe axis (see the inference sketch below).
Namo uses Natural Language Understanding to analyze the semantic meaning and context of speech, distinguishing between:
- Complete utterances (user is done speaking)
- Incomplete utterances (user will continue speaking)
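To make the mechanism concrete, here is a minimal, hypothetical inference sketch. The repo ID, ONNX file name, and label order are assumptions drawn from the naming in this post, not confirmed details; check the model card for the actual values.

```python
# Minimal inference sketch (hypothetical, not the official script).
# ASSUMPTIONS: repo ID, ONNX file name, and label order (index 1 = "complete")
# are guesses based on this post -- verify against the model card.
import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer

REPO_ID = "videosdk-live/Namo-Turn-Detector-v1-Multilingual"  # assumed repo ID
ONNX_FILE = "model_quantized.onnx"                            # assumed file name

tokenizer = AutoTokenizer.from_pretrained(REPO_ID)
session = ort.InferenceSession(hf_hub_download(REPO_ID, ONNX_FILE))

def turn_end_probability(text: str) -> float:
    """Return P(utterance is complete) for an ASR transcript fragment."""
    enc = tokenizer(text, return_tensors="np", truncation=True)
    # Feed only the inputs the exported graph actually declares.
    feed = {i.name: enc[i.name] for i in session.get_inputs() if i.name in enc}
    logits = session.run(None, feed)[0][0]
    probs = np.exp(logits - logits.max())   # softmax over the two classes
    probs /= probs.sum()
    return float(probs[1])                  # assumed: index 1 == "complete"

# "Fast vs. safe" reduces to one threshold on this probability:
THRESHOLD = 0.7  # raise to interrupt less, lower to reply faster
print(turn_end_probability("I want to fly to"))         # low  -> keep listening
print(turn_end_probability("I want to fly to Paris."))  # high -> respond
```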
Key Features
- Semantic Understanding: Analyzes meaning and context, not just silence
- Ultra-Fast Inference: <19ms for specialized models, <29ms for multilingual
- Lightweight: ~135MB (specialized) / ~295MB (multilingual)
- High Accuracy: Up to 97.3% for specialized models, 90.25% average for multilingual
- Production-Ready: ONNX-optimized for real-time, enterprise-grade applications
- Easy Integration: Standalone usage or plug-and-play with VideoSDK Agents SDK
3. Performance Benchmarks
Latency & Throughput
Using ONNX quantization, NAMO's inference time drops from 61 ms to 28 ms (multilingual) and from 38 ms to 14.9 ms (specialized).
- Relative speedup: up to 2.56×
- Throughput: nearly doubled, from 35.6 to 66.8 tokens/sec
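For context, the kind of post-training step that produces numbers like these is ONNX Runtime's dynamic INT8 quantization. This is a hedged sketch with placeholder file names, not Namo's actual build script:

```python
# Dynamic INT8 quantization with ONNX Runtime (sketch; file names assumed).
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="model.onnx",             # exported FP32 model (placeholder name)
    model_output="model_quantized.onnx",  # INT8-weight output (placeholder name)
    weight_type=QuantType.QInt8,          # quantize weights; activations stay float
)
```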
Accuracy Impact
Quantization preserves accuracy: confusion matrices show virtually unchanged true-positive and false-positive rates before and after quantization.
Language Coverage
Average multilingual accuracy: 90.25%.
Best per-language results: ~97.3% for Turkish and Korean, with Japanese, German, and Hindi all above 93% (full per-language tables in Section 5).
4. Impact on Real-Time Voice AI
With NAMO you get:
- Snappier responses without the “one Mississippi” delay.
- Fewer interruptions when users pause mid-thought.
- Consistent UX across markets without tuning for each language.
- Cost-effective scaling — works with any STT and runs efficiently on commodity servers.
5. Model Variants
Namo offers both specialized single-language models and a unified multilingual model.
Variant | Languages / Focus | Model Size | Latency* | Typical Accuracy |
---|---|---|---|---|
Multilingual | 23 languages | ~295 MB | < 29 ms | ~90.25 % (average) |
Language-Specialized | One language per model | ~135 MB | < 19 ms | Up to 97.3 % |
* Latency measured after quantization on target inference hardware.
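Since latency depends on your hardware, it is worth reproducing the numbers locally. A simple percentile micro-benchmark over the ONNX session is enough; this sketch assumes `session` and a tokenized `feed` built as in the earlier inference sketch:

```python
# Latency micro-benchmark sketch: median wall-clock time per inference.
# Assumes `session` and `feed` are constructed as in the inference sketch above.
import time
import numpy as np

def p50_latency_ms(session, feed, warmup=10, runs=200):
    for _ in range(warmup):          # exclude first-run graph/initialization cost
        session.run(None, feed)
    samples = []
    for _ in range(runs):
        t0 = time.perf_counter()
        session.run(None, feed)
        samples.append((time.perf_counter() - t0) * 1e3)
    return float(np.percentile(samples, 50))
```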
Multilingual Model (Recommended):
- Model: Namo-Turn-Detector-v1-Multilingual
- Base: mmBERT
- Languages: All 23 supported languages
- Inference: <29ms
- Size: ~295MB
- Average Accuracy: 90.25%
- Model Link: Namo Turn Detector v1 - MultiLingual
Performance Benchmarks for Multilingual Model
Evaluated on 25,000+ diverse utterances across all supported languages.
Language | Accuracy | Precision | Recall | F1 Score | Samples |
---|---|---|---|---|---|
🇹🇷 Turkish | 97.31% | 0.9611 | 0.9853 | 0.9730 | 966 |
🇰🇷 Korean | 96.85% | 0.9541 | 0.9842 | 0.9690 | 890 |
🇯🇵 Japanese | 94.36% | 0.9099 | 0.9857 | 0.9463 | 834 |
🇩🇪 German | 94.25% | 0.9135 | 0.9772 | 0.9443 | 1,322 |
🇮🇳 Hindi | 93.98% | 0.9276 | 0.9603 | 0.9436 | 1,295 |
🇳🇱 Dutch | 92.79% | 0.8959 | 0.9738 | 0.9332 | 1,401 |
🇳🇴 Norwegian | 91.65% | 0.8717 | 0.9801 | 0.9227 | 1,976 |
🇨🇳 Chinese | 91.64% | 0.8859 | 0.9608 | 0.9219 | 945 |
🇫🇮 Finnish | 91.58% | 0.8746 | 0.9702 | 0.9199 | 1,010 |
🇬🇧 English | 90.86% | 0.8507 | 0.9801 | 0.9108 | 2,845 |
🇵🇱 Polish | 90.68% | 0.8619 | 0.9568 | 0.9069 | 976 |
🇮🇩 Indonesian | 90.22% | 0.8514 | 0.9707 | 0.9071 | 971 |
🇮🇹 Italian | 90.15% | 0.8562 | 0.9640 | 0.9069 | 782 |
🇩🇰 Danish | 89.73% | 0.8517 | 0.9644 | 0.9045 | 779 |
🇵🇹 Portuguese | 89.56% | 0.8410 | 0.9676 | 0.8999 | 1,398 |
🇪🇸 Spanish | 88.88% | 0.8304 | 0.9681 | 0.8940 | 1,295 |
🇮🇳 Marathi | 88.50% | 0.8762 | 0.9008 | 0.8883 | 774 |
🇺🇦 Ukrainian | 87.94% | 0.8164 | 0.9587 | 0.8819 | 929 |
🇷🇺 Russian | 87.48% | 0.8318 | 0.9547 | 0.8890 | 1,470 |
🇻🇳 Vietnamese | 86.45% | 0.8135 | 0.9439 | 0.8738 | 1,004 |
🇸🇦 Arabic | 84.90% | 0.7965 | 0.9439 | 0.8639 | 947 |
🇧🇩 Bengali | 79.40% | 0.7874 | 0.7939 | 0.7907 | 1,000 |
Average Accuracy: 90.25% across all languages
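The per-language numbers above are standard binary-classification metrics; given labeled utterances and thresholded model outputs, they can be reproduced with scikit-learn (toy data shown, not the actual evaluation set):

```python
# Reproducing the table's metrics from labeled data (sketch with toy values).
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = [1, 0, 1, 1, 0]  # 1 = complete turn, 0 = incomplete (toy labels)
y_pred = [1, 0, 1, 0, 0]  # thresholded model outputs (toy predictions)

acc = accuracy_score(y_true, y_pred)
prec, rec, f1, _ = precision_recall_fscore_support(y_true, y_pred, average="binary")
print(f"accuracy={acc:.4f} precision={prec:.4f} recall={rec:.4f} f1={f1:.4f}")
```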
Specialized Single-Language Models
- Architecture: DistilBERT-based
- Inference: <19ms
- Size: ~135MB each
Language | Model Link | Accuracy |
---|---|---|
🇰🇷 Korean | Namo-v1-Korean | 97.3% |
🇹🇷 Turkish | Namo-v1-Turkish | 96.8% |
🇯🇵 Japanese | Namo-v1-Japanese | 93.5% |
🇮🇳 Hindi | Namo-v1-Hindi | 93.1% |
🇩🇪 German | Namo-v1-German | 91.9% |
🇬🇧 English | Namo-v1-English | 91.5% |
🇳🇱 Dutch | Namo-v1-Dutch | 90.0% |
🇮🇳 Marathi | Namo-v1-Marathi | 89.7% |
🇨🇳 Chinese | Namo-v1-Chinese | 88.8% |
🇵🇱 Polish | Namo-v1-Polish | 87.8% |
🇳🇴 Norwegian | Namo-v1-Norwegian | 87.3% |
🇮🇩 Indonesian | Namo-v1-Indonesian | 87.1% |
🇵🇹 Portuguese | Namo-v1-Portuguese | 86.9% |
🇮🇹 Italian | Namo-v1-Italian | 86.8% |
🇪🇸 Spanish | Namo-v1-Spanish | 86.7% |
🇩🇰 Danish | Namo-v1-Danish | 86.5% |
🇻🇳 Vietnamese | Namo-v1-Vietnamese | 86.2% |
🇫🇷 French | Namo-v1-French | 85.0% |
🇫🇮 Finnish | Namo-v1-Finnish | 84.8% |
🇷🇺 Russian | Namo-v1-Russian | 84.1% |
🇺🇦 Ukrainian | Namo-v1-Ukrainian | 82.4% |
🇸🇦 Arabic | Namo-v1-Arabic | 79.7% |
🇧🇩 Bengali | Namo-v1-Bengali | 79.2% |
Try It Yourself!
We’ve provided an inference script to help you quickly test these models. Just plug it in and start experimenting!
- Hugging Face Models: https://huggingface.co/videosdk-live/models
- GitHub Repo: https://github.com/videosdk-live/NAMO-Turn-Detector-v1/tree/main
- Official Documentation: https://docs.videosdk.live/ai_agents/core-components/turn-detection-and-vad
Integration with VideoSDK Agents
For seamless integration into your voice agent pipeline:
```python
from videosdk_agents import NamoTurnDetectorV1, pre_download_namo_turn_v1_model

# Download model files (one-time setup)
# For multilingual (default):
pre_download_namo_turn_v1_model()
# For a specific language:
# pre_download_namo_turn_v1_model(language="en")

# Initialize turn detector
turn_detector = NamoTurnDetectorV1()  # Multilingual
# turn_detector = NamoTurnDetectorV1(language="en")  # English-specific

# Add to your agent pipeline
from videosdk_agents import CascadingPipeline

pipeline = CascadingPipeline(
    stt=your_stt_service,
    llm=your_llm_service,
    tts=your_tts_service,
    turn_detector=turn_detector,  # Namo integration
)
```
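Conceptually, the turn detector gates the hand-off from STT to the LLM in this cascading design: the transcript is only treated as a complete turn, and passed onward, once Namo predicts the user is done. See the official documentation linked above for the exact pipeline semantics.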
6. Training & Testing
Each model includes Colab notebooks for training and testing:
- Training Notebooks: Fine-tune models on your own datasets
- Testing Notebooks: Evaluate model performance on custom data
Visit the individual model pages for notebook links; a minimal sketch of the fine-tuning loop follows below.
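For a feel of what the training notebooks do, here is a hypothetical fine-tuning sketch. The base checkpoint, the `turns.csv` dataset file, and all hyperparameters are illustrative placeholders; the real recipes live in the notebooks.

```python
# Hypothetical fine-tuning sketch for a binary turn detector.
# ASSUMPTIONS: base checkpoint, "turns.csv" (text,label columns; 1 = complete),
# and hyperparameters are placeholders -- use the official notebooks for the
# actual recipe.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

BASE = "distilbert-base-multilingual-cased"  # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(BASE)
model = AutoModelForSequenceClassification.from_pretrained(BASE, num_labels=2)

ds = load_dataset("csv", data_files={"train": "turns.csv"})
ds = ds.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="namo-finetune",
                           num_train_epochs=2,
                           per_device_train_batch_size=32),
    train_dataset=ds["train"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```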
Looking Ahead: Future Directions
- Multi-party turn-taking detection: deciding when one speaker yields to another.
- Hybrid signals: combine semantics with prosody, pitch, silence, etc.
- Adaptive thresholds & confidence models: dynamic sensitivity based on conversation flow.
- Distilled / edge versions for latency-constrained devices.
- Continuous learning / feedback loop: let models adapt to usage patterns over time.
Conclusion
NAMO-v1 turns turn-taking, a long-standing bottleneck, into a solved engineering problem. By combining semantic understanding with real-time speed, it finally lets voice AI systems feel human, fast, and globally scalable.
Citation
```bibtex
@software{namo2025,
  title={Namo Turn Detector v1: Semantic Turn Detection for Conversational AI},
  author={VideoSDK Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/collections/videosdk-live/namo-turn-detector-v1-68d52c0564d2164e9d17ca97}
}
```