Echo Models: Linguistic Turn Detection for Voice Agents

Echo Turn Detection is a real-time timing layer that moves voice agents from rigid, walkie-talkie-like interfaces to natural, human-like dialogue. It analyses linguistic patterns rather than relying on simple silence timeouts. On the English TURNS2K benchmark, Echo reaches up to 96.2% turn-completion accuracy (Echo-Large) and about 25× fewer missed turn endings than the baseline (Echo-Small), allowing users to pause, hesitate, interrupt, and backchannel naturally.

The Turn-Taking Challenge

In a voice agent pipeline, speech recognition gets most of the attention. But knowing what the user said is only half the problem. The harder real-time question is: should the agent speak now, keep listening, ignore a short backchannel, or stop talking?

A naive turn detector relies on a simple silence threshold (e.g., waiting 700 ms after audio stops). This approach creates four classic failure modes:

Responding too early: The user pauses mid-sentence and the agent cuts them off.
Waiting too long: The user is finished, but the agent leaves an awkward silence.
Misreading backchannels: The user says “yeah” or “uh-huh” as a listening cue, and the agent treats it as a new command.
Failing to stop: The user says “stop” or “wait” while the agent is speaking, but the agent keeps monologuing.
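
The first two failure modes follow directly from how a fixed threshold works. The sketch below is a minimal illustration, not any particular product's endpointing logic: the function name is hypothetical and the 700 ms default is simply the example value used above.

```python
def silence_timeout_end_of_turn(silence_ms: float, threshold_ms: float = 700.0) -> bool:
    """Naive endpointing: declare the turn over once silence reaches the threshold."""
    return silence_ms >= threshold_ms


# A 900 ms thinking pause mid-sentence is treated as the end of the turn,
# so the agent cuts the user off.
cut_off_mid_sentence = silence_timeout_end_of_turn(900)

# A user who has clearly finished still gets no response at 400 ms,
# because the detector must wait out the full threshold.
still_waiting = silence_timeout_end_of_turn(400)
```

Lowering the threshold trades the second failure for more of the first, and raising it does the opposite. The rule only sees the duration of silence, never what was said.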

To reach truly natural interaction, a voice agent needs a dedicated timing layer that reasons over the speech transcript and the user's linguistic intent.

Introducing Echo Turn

Echo Turn is a turn-taking detector designed for real-time voice agents. Instead of treating turn detection as a simple silence timeout, Echo predicts the conversational state of the user by analysing the linguistic patterns in the streaming speech transcript:

Linguistic intent: Analysing semantic structure to distinguish between incomplete statements and completed thoughts.
Hesitation and filler patterns: Detecting words like “um”, “uh”, “yeah”, and “wait” to infer speaker state.
Transcript punctuation: Using STT-generated markers for pauses and hesitations.
Real-time constraints: Ultra-low latency inference for production.

By analysing these cues, Echo can tell when a user saying "yeah" means "Yes, that is my answer" (Complete) versus "I am listening, keep going" (Backchannel).

Core Prediction States

Echo classifies streaming transcripts into four distinct conversational states to orchestrate agent behaviour. While identifying completion states is standard practice, Echo's true breakthrough lies in its ability to natively handle complex conversational signals like backchannels and pause requests.

1. Standard Industry States

Complete (complete): The user has finished speaking and is expecting a response.
- Agent Action: Immediately triggers response generation.
- Example: "What's the weather like in New York?"
Incomplete (incomplete): The user has paused mid-thought to think or catch their breath, but is not done.
- Agent Action: Remains silent and continues listening, ignoring the brief silence.
- Example: "I would like to order a..." (pause) "...pepperoni pizza."

2. VideoSDK's Innovations

Unlike traditional turn detectors that only distinguish between complete and incomplete turns, Echo introduces support for human-like conversational dynamics:

Backchannel (backchannel): The user provides short verbal acknowledgments (e.g., "uh-huh", "yeah", "right", "okay okay") to signal they are listening while the agent is speaking.
- Agent Action: The agent continues speaking uninterrupted, rather than treating these verbal cues as interruptions or new command inputs. This solves the classic issue where voice assistants cut themselves off the moment a user shows active listening.
Wait (wait): The user explicitly requests the agent to hold (e.g., "wait a second", "hold on", "give me a moment").
- Agent Action: The agent pauses its response generation or playback and waits patiently, keeping the session active without timing out or interpreting the pause as completion.
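
The four states and their agent actions can be summarised as a small lookup. This is a sketch under assumptions: the state labels (complete, incomplete, backchannel, wait) come from the list above, while the `TurnState` enum, the action names, and the `agent_action` helper are hypothetical and not part of the VideoSDK API.

```python
from enum import Enum


class TurnState(str, Enum):
    """The four conversational states described above."""
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    BACKCHANNEL = "backchannel"
    WAIT = "wait"


# Hypothetical action names, one per state, mirroring the "Agent Action" notes.
ACTIONS = {
    TurnState.COMPLETE: "generate_response",     # user is done: reply now
    TurnState.INCOMPLETE: "keep_listening",      # mid-thought pause: stay silent
    TurnState.BACKCHANNEL: "continue_speaking",  # "uh-huh": do not interrupt yourself
    TurnState.WAIT: "pause_and_hold",            # "hold on": pause, keep session alive
}


def agent_action(label: str) -> str:
    """Translate a predicted state label into the agent's next action."""
    return ACTIONS[TurnState(label)]
```

Keeping the mapping in one table makes it easy to see that only one of the four states should ever trigger response generation.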

Model Family

We offer Echo Turn in two distinct variants so you can tune for your use case:

Echo Small: The default, lowest-latency model optimized for the fastest possible turn detection. Best when responsiveness matters most.
Echo Large: A higher-accuracy model that trades a little latency for better classification. Best when accuracy matters more than raw speed.

Model Architecture

Supported Languages

Echo Turn supports 12 languages across both model sizes: English, French, German, Italian, Spanish, Hindi, Gujarati, Marathi, Tamil, Telugu, Urdu, and Bengali.

Benchmark: TURNS2K Dataset

We evaluated both Echo model variants on the English TURNS2K dataset for turn completion detection (classifying Incomplete vs. Complete turns):

Overall Comparative Metrics

Metric	Echo-Small	Echo-Large	Baseline
Accuracy	93.60%	96.20%	61.13%
Recall (Complete)	97.31%	96.50%	32.83%
Specificity	88.91%	95.81%	96.83%
F1 Score (Complete)	0.9443	0.9659	0.4851
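
For readers who want to reproduce these metrics on their own data, the sketch below assumes the standard binary-classification definitions with Complete as the positive class. The function name is hypothetical and the counts are toy values for illustration only; they are not the TURNS2K confusion matrix.

```python
def binary_metrics(tp: int, fn: int, tn: int, fp: int) -> dict:
    """Compute accuracy, recall, specificity, and F1 with Complete as the positive class."""
    recall = tp / (tp + fn)        # share of completed turns that were detected
    specificity = tn / (tn + fp)   # share of incomplete turns correctly left alone
    precision = tp / (tp + fp)     # share of predicted completions that were real
    return {
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "recall": recall,
        "specificity": specificity,
        "f1": 2 * precision * recall / (precision + recall),
    }


# Toy counts for illustration only; these are NOT the benchmark results.
toy = binary_metrics(tp=90, fn=10, tn=80, fp=20)
```

Read this way, low recall means missed turn endings (awkward silences), while low specificity means the agent responds before the user has finished.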

Echo-Large

On 2,000 English conversational samples, Echo-Large achieved 96.2% accuracy in detecting whether a speaker had finished their turn, substantially outperforming the Baseline under the same conditions. The largest difference was in turn-completion detection: Echo-Large correctly identified 96.5% of completed turns versus 32.8% for the Baseline, resulting in far fewer missed responses.

For every 100 times a user finished speaking, Echo-Large responded correctly about 96.5 times, missing only about 3.5 turn completions; the Baseline responded correctly about 33 times.

Echo-Small

Echo-Small is optimized for responsive voice interactions where recognizing completed speech quickly is critical.

For every 100 times a user finished speaking, Echo-Small responded correctly about 97.3 times versus 32.8 for the Baseline. That is roughly 2.7 missed turn endings against 67.2, or about 25× fewer in this benchmark. Echo-Small is designed for applications where fast conversational turn-taking is a priority, while maintaining strong overall classification performance.
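
The 25× figure can be checked from the recall row of the table. The short calculation below uses only the published recall values; the helper name is hypothetical.

```python
def missed_per_100(recall: float) -> float:
    """Completed turns missed out of every 100, given recall on the Complete class."""
    return 100 * (1 - recall)


baseline_missed = missed_per_100(0.3283)    # Baseline recall from the table
echo_small_missed = missed_per_100(0.9731)  # Echo-Small recall from the table
echo_large_missed = missed_per_100(0.9650)  # Echo-Large recall from the table

# How many times fewer completed turns each Echo variant misses than the Baseline.
small_ratio = baseline_missed / echo_small_missed
large_ratio = baseline_missed / echo_large_missed
```

With these inputs, `small_ratio` comes out just under 25 and `large_ratio` at roughly 19, which reflects Echo-Small's slightly higher recall on completed turns.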

How it is Hosted

Echo Turn is server-hosted on the VideoSDK Inference Gateway and exposed through the TurnV2 class, so there is no model to download or run on your machine.

Integration Guidelines

Echo Turn acts as the real-time timing layer in a voice pipeline. As the user speaks, VAD detects the speech and STT produces a transcript. After each user utterance, the latest transcript is sent to the Inference Gateway, where the selected Echo model (echo-small or echo-large) classifies the turn.
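
The flow above can be sketched as a small handler that receives the latest transcript and decides whether to hand it to the LLM. Everything here is an assumption for illustration: `toy_classify` is a deliberately crude stand-in for the hosted model (it is not how Echo classifies turns), and `should_respond` is a hypothetical helper, not a VideoSDK function.

```python
from typing import Callable

# A classifier takes the latest transcript and returns one of the four state labels.
Classifier = Callable[[str], str]


def toy_classify(transcript: str) -> str:
    """Crude rule-based stand-in for the remote model, for demonstration only."""
    text = transcript.strip().lower()
    if text in {"uh-huh", "right", "okay okay"}:
        return "backchannel"
    if text.startswith(("wait", "hold on", "give me a moment")):
        return "wait"
    if text.endswith("..."):
        return "incomplete"
    return "complete"


def should_respond(transcript: str, classify: Classifier) -> bool:
    """Return True only when the turn is classified as complete."""
    return classify(transcript) == "complete"


# The examples from the state descriptions above.
finished = should_respond("What's the weather like in New York?", toy_classify)
paused = should_respond("I would like to order a...", toy_classify)
```

Passing the classifier in as a callable keeps the handler independent of where classification happens, so the stub can be swapped for the real detector without changing the control flow.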

Python SDK Implementation

The TurnV2 class ships in the videosdk-agents library. Because the model runs on the VideoSDK Inference Gateway, no local model download is required.

To use Echo Turn in your voice pipeline:

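Install the package (version 1.0.18 or higher is required):

pip install "videosdk-agents>=1.0.18"

Set your authentication token:

export VIDEOSDK_AUTH_TOKEN="your-videosdk-auth-token"

Initialise the detector in your Python code:

# Import paths for Pipeline and the plugins are assumed from the
# videosdk-agents plugin layout; adjust them to match your installed version.
from videosdk.agents import Pipeline
from videosdk.agents.inference import TurnV2
from videosdk.plugins.deepgram import DeepgramSTT
from videosdk.plugins.elevenlabs import ElevenLabsTTS
from videosdk.plugins.openai import OpenAILLM
from videosdk.plugins.silero import SileroVAD

# Initialize the default, lowest-latency model
turn_detector = TurnV2.echo_small()

# Or, initialize the higher-accuracy model
# turn_detector = TurnV2.echo_large()

# Wire the detector into the voice pipeline alongside STT, LLM, TTS, and VAD
pipeline = Pipeline(
    stt=DeepgramSTT(),
    llm=OpenAILLM(),
    tts=ElevenLabsTTS(),
    vad=SileroVAD(),
    turn_detector=turn_detector,
)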
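
Because the detector is hosted remotely, a missing VIDEOSDK_AUTH_TOKEN is an easy setup mistake to make. A fail-fast check at startup might look like the sketch below; the helper name is hypothetical and this is plain Python, not part of the videosdk-agents API.

```python
import os
from typing import Mapping, Optional


def require_auth_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the VideoSDK auth token, or raise a clear error if it is missing."""
    env = os.environ if env is None else env
    token = env.get("VIDEOSDK_AUTH_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "VIDEOSDK_AUTH_TOKEN is not set; export it before starting the agent."
        )
    return token
```

Calling `require_auth_token()` before building the pipeline surfaces the problem immediately rather than on the first classification request.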

Impact

Reliable turn detection improves core voice metrics:

Reduced Latency: Agents respond immediately when the user is finished, eliminating fixed timeouts.
Conversation Naturalness: Fewer awkward silences and false interruptions make conversation feel human.
Infrastructure Efficiency: Reduces unnecessary LLM API calls triggered by misclassified partial turns.

Try Echo Turn

Echo Turn is now available across our Agent Cloud and Runtime.

Quick Start Demo: Clone and run our GitHub Quickstart Example to see Echo's real-time turn-taking pipeline in action locally within minutes.
Documentation: Read the Docs to learn more, or check out the Inference Plugin Docs for integration details.
Join the Community: Have questions or want to discuss integration? Join our Discord.