
Outbound Call AI: Build a Real-Time Voice Agent Using VideoSDK and Twilio

Learn how to build a real-time AI voice agent for outbound calls using VideoSDK, Twilio, and FastAPI. Stream, process, and respond to live audio in a SIP-connected call.

The future of outbound voice communication is autonomous. Imagine an AI that can pick up a phone call, understand what the person is saying, and respond intelligently — all in real-time. That’s what an Outbound Call AI Voice Agent does. And with tools like VideoSDK, Twilio, and FastAPI, you can build it yourself.
In this blog, we’ll show you exactly how to create your own AI voice agent using SIP, real-time audio streaming, and Python — powered by VideoSDK.

What is Outbound Call AI?

Outbound Call AI refers to software that makes or answers phone calls autonomously using artificial intelligence — specifically voice AI. These systems are used in:
  • Sales and lead generation
  • Customer support
  • Appointment reminders
  • Follow-ups or surveys
  • Real-time information dispatch
Unlike basic IVRs or prerecorded messages, AI voice agents can have intelligent, dynamic conversations — powered by real-time speech recognition and LLM-based reasoning.

Why Use VideoSDK for AI Voice Agents?

While Twilio can make and receive calls, you need a media layer to intercept and process real-time audio. That’s where VideoSDK’s SIP + audio stream support becomes crucial.
Benefits of using VideoSDK:
  • Join SIP calls and access live audio streams
  • Native Python SDK for meetings and participant handling
  • Event-driven architecture for stream and meeting events
  • Great for plugging in speech-to-text, LLM, and TTS systems
When combined with Twilio and FastAPI, you get a powerful backend that can create calls, process audio, and route the results — all using your own custom logic.

Architecture Overview: AI Voice Agent System

Here’s how the components work together:
  1. Twilio handles telephony and connects phone calls to a SIP URI.
  2. FastAPI receives Twilio webhooks and generates SIP instructions.
  3. VideoSDK creates a meeting (room) for each call and gives us access to audio streams.
  4. AI Agent listens to the audio, processes it, and can respond (optional).
This enables a real-time, full-duplex communication path between a caller and an AI.
Workflow:
  • Caller → Twilio → FastAPI webhook → VideoSDK room → Agent joins room → Agent listens → (Future) Agent speaks back
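Sketched in Python, the webhook side of this workflow is a short orchestration function. In the sketch below, `create_room`, `start_agent`, and `build_twiml` are hypothetical stand-ins for the concrete pieces covered in the walkthrough that follows:

```python
# Hedged sketch of the call flow above; the three callables are
# hypothetical stand-ins for the real implementations.
async def handle_incoming_call(create_room, start_agent, build_twiml) -> str:
    room_id = await create_room()   # 1. create a VideoSDK room for this call
    start_agent(room_id)            # 2. agent joins the room in the background
    return build_twiml(room_id)     # 3. TwiML telling Twilio to dial the SIP URI
```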

Setting Up the Backend – main.py Walkthrough

Let’s break down the Python server that powers our voice agent.

1. FastAPI + CORS + .env config

We initialize our backend with:
```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```
Note that Twilio's webhooks are server-to-server and don't need CORS; this permissive configuration is for any browser-based dashboard or client you pair with the backend.
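The `.env` part of the setup isn't shown above. Here is a minimal stdlib-only sketch (a real project would more likely use `python-dotenv`'s `load_dotenv()`), assuming plain `KEY=VALUE` lines:

```python
import os

def load_env(path: str = ".env") -> None:
    """Minimal .env loader: KEY=VALUE lines, '#' comments, no quoting."""
    if not os.path.exists(path):
        return
    with open(path) as f:
        for line in f:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, _, value = line.partition("=")
                # Existing environment variables win over .env values
                os.environ.setdefault(key.strip(), value.strip())

load_env()
VIDEOSDK_AUTH_TOKEN = os.getenv("VIDEOSDK_AUTH_TOKEN")
TWILIO_AUTH_TOKEN = os.getenv("TWILIO_AUTH_TOKEN")
```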

2. Twilio Webhook to Handle Inbound Calls

When a call is received, Twilio hits /join-agent. This is where we create the room and respond with SIP instructions.
```python
@app.post("/join-agent")
async def handle_twilio_call(request: Request):
    # Validate request
    # Create VideoSDK room
    # Start agent in background
    # Return TwiML to dial SIP
    ...
```
We use the twilio package’s VoiceResponse and RequestValidator to validate requests and generate XML responses.
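Under the hood, `RequestValidator` checks the `X-Twilio-Signature` header. A stdlib sketch of that check (Twilio signs the full URL plus the alphabetically sorted POST parameters with HMAC-SHA1) looks like this:

```python
import base64
import hashlib
import hmac

# Stdlib sketch of Twilio's request-signature check, normally done by
# twilio.request_validator.RequestValidator. Assumes a form-encoded POST.
def is_valid_twilio_signature(auth_token: str, url: str,
                              params: dict, signature: str) -> bool:
    # Twilio signs the full URL plus each POST param, sorted by name
    payload = url + "".join(k + v for k, v in sorted(params.items()))
    digest = hmac.new(auth_token.encode("utf-8"),
                      payload.encode("utf-8"), hashlib.sha1).digest()
    expected = base64.b64encode(digest).decode("utf-8")
    # Constant-time comparison against the X-Twilio-Signature header value
    return hmac.compare_digest(expected, signature)
```

In production, prefer the official validator; this sketch is just to show what the check does.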

3. Create a VideoSDK Room via API

```python
import httpx

async def create_videosdk_room():
    headers = {"Authorization": VIDEOSDK_AUTH_TOKEN}
    async with httpx.AsyncClient() as client:
        response = await client.post("https://api.videosdk.live/v2/rooms", headers=headers)
        return response.json().get("roomId")
```
This call returns a roomId that becomes the SIP address.

4. Generate TwiML to Connect SIP Call

Once the room is created and the agent is started, we return this SIP connection via TwiML:
```python
response = VoiceResponse()
dial = response.dial()
dial.sip(f"sip:{room_id}@sip.videosdk.live", username=..., password=...)
return Response(content=str(response), media_type="application/xml")
```
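For reference, here is roughly the XML that `VoiceResponse` produces, rebuilt with the standard library so you can see the shape of the response Twilio receives (element and attribute names follow Twilio's `<Dial>`/`<Sip>` verbs):

```python
import xml.etree.ElementTree as ET

def dial_sip_twiml(room_id: str, username: str, password: str) -> str:
    # Stdlib equivalent of VoiceResponse().dial().sip(...)
    response = ET.Element("Response")
    dial = ET.SubElement(response, "Dial")
    sip = ET.SubElement(dial, "Sip", username=username, password=password)
    sip.text = f"sip:{room_id}@sip.videosdk.live"
    return ET.tostring(response, encoding="unicode")
```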

Building the Agent – agent.py

This file contains the VideoSDKAgent class which connects to the meeting room and listens to the participants’ audio.

1. Initialize the Agent and Join the Room

```python
class VideoSDKAgent:
    def __init__(self, room_id, videosdk_token):
        self.room_id = room_id
        self.videosdk_token = videosdk_token
        self.meeting = None
        self._initialize_meeting()
```
We use VideoSDK.init_meeting() to create a meeting object and attach event handlers.

2. Handle Meeting and Participant Events

Meeting Events:

```python
class AgentMeetingEventHandler(MeetingEventHandler):
    def on_meeting_joined(self, data): ...
    def on_participant_joined(self, participant): ...
    def on_meeting_left(self, data): ...
```

Participant Events:

```python
class AgentParticipantEventHandler(ParticipantEventHandler):
    def on_stream_enabled(self, stream): ...
    def on_stream_disabled(self, stream): ...
```
This is where you detect when a participant joins and sends an audio stream — which you can intercept and process.

3. Placeholder: Processing the Audio

```python
if stream.kind == "audio" and not self.participant.local:
    # TODO: Start STT + LLM + TTS processing
    logger.info("Received audio stream from participant")
```
Currently, you can log and observe incoming audio events. In a full system, this is where you’d:
  • Convert audio → text (with OpenAI Whisper or Deepgram)
  • Run AI response logic using LLM (GPT, Claude, etc.)
  • Convert response text → audio (TTS from Google, Azure, ElevenLabs)
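Wired together, that loop is just three awaits. In the sketch below, `transcribe`, `generate_reply`, and `synthesize` are hypothetical stand-ins for your STT, LLM, and TTS providers:

```python
# Hedged sketch of the STT -> LLM -> TTS loop described above; the three
# callables are hypothetical provider adapters, not a real API.
async def process_utterance(audio_chunk: bytes,
                            transcribe, generate_reply, synthesize) -> bytes:
    text = await transcribe(audio_chunk)   # speech-to-text (Whisper/Deepgram)
    reply = await generate_reply(text)     # response logic (GPT, Claude, ...)
    return await synthesize(reply)         # text-to-speech (Google/Azure/ElevenLabs)
```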

Sample Response Flow (Future Implementation)

While the current code only listens, here’s how you could add bidirectional audio:
  1. Transcribe with Whisper
  2. Pass to LLM
  3. Generate a TTS audio stream
  4. Inject that audio back into the room using VideoSDK’s custom_microphone_audio_track option
This would enable a fully autonomous, talking AI assistant.

Use Cases for AI Voice Agents

This architecture can power:
  • Sales Assistants: Automated cold calls with dynamic responses
  • Support Bots: Voice-first helpdesks for tier-1 questions
  • Reminder Systems: Appointments, billing, or re-engagement calls
  • Voice Surveys: Dynamic call-based forms powered by AI
You can easily scale this with parallel Twilio calls and individual room agents.

Scaling Your Voice AI

  • Deploy with Railway, Render, or AWS
  • Use Docker and gunicorn for performance
  • Add Redis to manage agent state
  • Monitor audio events with observability tools
  • Add fallback flows for failed SIP or stream connections
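As a starting point for the Docker/gunicorn route, a minimal run command might look like this (assuming the `app` object lives in `main.py`):

```shell
# gunicorn manages the processes; uvicorn workers serve the async FastAPI app
pip install "uvicorn[standard]" gunicorn
gunicorn main:app -k uvicorn.workers.UvicornWorker -w 4 --bind 0.0.0.0:8000
```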


Final Thoughts

Outbound Call AI is no longer futuristic. With VideoSDK + Twilio + FastAPI, you can build your own real-time voice agent that listens, understands, and soon — responds.
Whether you’re a solo developer or an enterprise team, this open architecture gives you everything you need to create intelligent, voice-driven workflows that scale.
