Speech-to-text is the first and most critical step in any voice agent. If transcription is slow or inaccurate, everything downstream, from reasoning to response, suffers. Gladia STT is built for real-time transcription, with strong multilingual support, fast partial results, and automatic handling of code-switching.

In this guide, we’ll walk through how to integrate Gladia STT with the VideoSDK Agents SDK and use it as a reliable input layer for voice-driven applications.

Why Gladia STT?

Many voice agents operate in environments where users switch languages mid-sentence or expect instant feedback while speaking. Gladia is optimized for these scenarios. It provides:

  • Low-latency transcription
  • Support for multiple languages
  • Automatic code-switching
  • Partial transcripts for faster turn detection

This makes it a strong choice for real-time agents, live calls, and interactive voice applications.

Key Features

  • Real-Time Transcription: Gladia streams transcription results as audio is processed, reducing perceived latency in conversations.
  • Multilingual Support: You can specify one or more languages, making it suitable for global or multilingual users.
  • Code-Switching: Gladia can automatically detect and switch languages within the same conversation without manual intervention.
  • Partial Transcripts: With partial transcripts enabled, agents can start reasoning before the user finishes speaking, improving responsiveness.
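
To make the partial-transcript idea concrete, here is a small simulation of how an agent might treat interim and final results differently. The `(text, is_final)` event shape and the `agent_actions` helper are hypothetical and chosen only for illustration; they are not the plugin's actual callback format.

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical event shape: (text, is_final). This is an illustration only,
# not the format emitted by the Gladia plugin.
TranscriptEvent = Tuple[str, bool]


def agent_actions(events: Iterable[TranscriptEvent]) -> Iterator[str]:
    """Yield the action an agent might take for each transcript event."""
    for text, is_final in events:
        if is_final:
            # The user has finished the utterance: commit to a response.
            yield f"respond:{text}"
        else:
            # Interim result: start cheap preparatory work early.
            yield f"prefetch:{text}"


stream = [("book a", False), ("book a table", False), ("book a table for two", True)]
for action in agent_actions(stream):
    print(action)
```

The point of the sketch is that interim results only trigger work that is safe to discard, while the final transcript is the one the agent actually answers.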

Installation

Install the Gladia STT plugin for the VideoSDK Agents SDK:

pip install "videosdk-plugins-gladia"

Authentication

  1. Sign up at Gladia and get an API key.
  2. Sign up at VideoSDK and generate an authentication token.

Then set both as environment variables:

GLADIA_API_KEY=your_api_key_here
VIDEOSDK_AUTH_TOKEN=your_auth_token_here

When using environment variables, you don’t need to pass the API key directly in code; the SDK reads it automatically.
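
A quick startup check can catch a missing credential before the agent tries to connect. The sketch below uses only the standard library; the helper name `missing_env_vars` is an assumption made for this example and is not part of either SDK.

```python
import os

# Variable names taken from the authentication step of this guide.
REQUIRED_VARS = ("GLADIA_API_KEY", "VIDEOSDK_AUTH_TOKEN")


def missing_env_vars(env=None, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]


# Passing a dict makes the helper easy to try without touching the real environment.
print(missing_env_vars({"GLADIA_API_KEY": "abc"}))
```

In an application you would call `missing_env_vars()` with no arguments at startup and stop with a clear error message if the returned list is not empty.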

Importing Gladia STT

from videosdk.plugins.gladia import GladiaSTT

Basic Usage Example

Below is a minimal example showing how to configure Gladia STT and attach it to a cascading pipeline.

from videosdk.plugins.gladia import GladiaSTT
from videosdk.agents import CascadingPipeline

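# Initialize the Gladia STT model
stt = GladiaSTT(
    api_key="your-gladia-api-key",     # can be omitted if GLADIA_API_KEY is set in the environment
    languages=["en"],                  # language codes to detect
    code_switching=True,               # allow automatic language switching
    receive_partial_transcripts=True   # stream interim results
)

# Add the STT model to a cascading pipeline
pipeline = CascadingPipeline(stt=stt)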

This setup enables:

  • Real-time transcription
  • Automatic language switching
  • Partial transcripts for faster downstream processing

Configuration Options

Gladia STT provides fine-grained control over transcription behavior:

  • languages: List of language codes to detect (e.g., ["en", "fr"])
  • code_switching: Enables automatic language switching
  • receive_partial_transcripts: Streams interim results for lower latency
  • model: STT model to use (default: "solaria-1")
  • input_sample_rate: Incoming audio sample rate
  • output_sample_rate: Processing sample rate
  • encoding: Audio encoding format
  • bit_depth: Audio bit depth
  • channels: Number of audio channels (mono or stereo)

These parameters let you tune accuracy, latency, and compatibility with your audio pipeline.
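
One way to keep these settings in one place is a small options object that forwards only the values you set, so the plugin's defaults apply to everything else. This is a sketch: the `GladiaOptions` class is hypothetical, the field names mirror the options listed above, and the example values are assumptions rather than recommended settings.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class GladiaOptions:
    """Hypothetical container for the GladiaSTT options listed in this guide."""
    languages: Optional[List[str]] = None
    code_switching: Optional[bool] = None
    receive_partial_transcripts: Optional[bool] = None
    model: Optional[str] = None
    input_sample_rate: Optional[int] = None
    output_sample_rate: Optional[int] = None
    encoding: Optional[str] = None
    bit_depth: Optional[int] = None
    channels: Optional[int] = None

    def to_kwargs(self) -> dict:
        """Return only the options that were set, after a basic sanity check."""
        if self.channels is not None and self.channels not in (1, 2):
            raise ValueError("channels must be 1 (mono) or 2 (stereo)")
        return {key: value for key, value in asdict(self).items() if value is not None}


options = GladiaOptions(languages=["en", "fr"], code_switching=True, model="solaria-1")
print(options.to_kwargs())
# With the plugin installed, this could be passed on as: GladiaSTT(**options.to_kwargs())
```

Leaving unset options out of the call avoids overriding defaults by accident, which matters most for the audio parameters that must match your pipeline.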

Conclusion

Gladia STT provides a strong foundation for real-time voice agents by combining speed, accuracy, and multilingual flexibility. When integrated with VideoSDK’s agent pipelines, it enables agents to listen effectively even in dynamic, multilingual conversations. A reliable STT layer like Gladia helps ensure that downstream reasoning and responses stay accurate, responsive, and consistent.

Resources and Next Steps