Modern Voice Agent Architectures: A Deep Dive
Voice agents have become increasingly sophisticated, enabling natural human-computer interactions across various applications from virtual assistants to customer service agents. When designing these systems, developers typically choose between two architectural approaches, each with distinct advantages and trade-offs.
The Modular Pipeline: Speech-to-Text (STT) → LLM → Text-to-Speech (TTS): A decomposed system where specialized components handle discrete functions in sequence.
The Unified Approach: Speech-to-Speech Models: An end-to-end approach where a single integrated system processes audio input and generates audio output with minimal intermediate transformations.
Let’s examine each approach in detail.
1. Modular Pipeline (STT → LLM → TTS)
In this architecture, three components operate as a pipeline:
Speech-to-Text (STT): This is the first stage of the pipeline. It captures and converts the user’s audio input into text transcription, often with specialized acoustic and language models.
Large Language Model (LLM): This is the brain of the agent. It processes the transcribed text, performs reasoning, calls tools and external APIs, manages context or memory, and generates appropriate responses.
Text-to-Speech (TTS): This is the final stage that synthesizes the LLM’s text response into spoken audio output.
Most modern implementations stream these components to reduce latency, allowing the agent to begin formulating responses even before the user finishes speaking.
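The toy asyncio sketch below illustrates the idea behind such preemptive generation; every function here is a hypothetical stand-in rather than any vendor's SDK. A reply draft is started from each partial transcript and discarded whenever the transcript grows, so by the time the user stops speaking, a draft for the full utterance is already in flight (conceptually similar to the preemptive_generation option used in the LiveKit example later in this post).

import asyncio
from typing import AsyncIterator, Optional

# Hypothetical stubs standing in for real streaming STT and LLM clients;
# no specific vendor API is implied.
async def stt_partials() -> AsyncIterator[str]:
    # A real STT stream yields growing partial transcripts while the user is still speaking.
    for partial in [
        "I'm looking for tickets",
        "I'm looking for tickets to the concert",
        "I'm looking for tickets to the concert on Friday",
    ]:
        await asyncio.sleep(0.2)  # simulate audio arriving over time
        yield partial

async def draft_reply(transcript: str) -> str:
    await asyncio.sleep(0.5)  # simulate LLM latency
    return f"Reply to: {transcript}"

async def agent_turn() -> str:
    draft: Optional[asyncio.Task] = None
    async for partial in stt_partials():
        if draft is not None:
            draft.cancel()  # the transcript grew, so the in-flight draft is stale
        # Start drafting a reply before the user has finished speaking.
        draft = asyncio.create_task(draft_reply(partial))
    return await draft  # the surviving draft matches the final transcript

print(asyncio.run(agent_turn()))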
Critical Auxiliary Components: VAD and Turn Detection
Beyond these three core components, effective voice agents require two additional systems:
Voice Activity Detection (VAD): This component identifies when a user is speaking versus when there is silence or background noise. VAD is essential for determining when to start and stop processing audio, conserving computational resources and reducing latency. High-quality VAD systems can distinguish between human speech and other sounds, preventing false activations.
Turn Detection: This component determines when a user has completed their thought or utterance, signaling to the agent that it’s time to respond. Effective turn detection is crucial for natural conversation flow and prevents the agent from interrupting users mid-sentence. Turn detection may use a combination of silence duration, prosodic features (intonation patterns), and semantic completeness to identify appropriate response moments.
Consider this scenario for turn detection:
“I’m looking for tickets to the concert … [brief pause] on Friday.”
A naive turn detector might interpret the pause after “concert” as the end of the user’s turn and trigger a response. However, a sophisticated turn detector would recognize that the semantic content may be incomplete or that the intonation suggests more information is coming, and would wait for the complete utterance before prompting the agent to respond.
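To make the trade-off concrete, here is a deliberately crude, rule-based sketch that combines silence duration with a rough proxy for semantic completeness. It is an illustration only: a rule this simple would still fire after "concert" in the example above, which is exactly why production systems rely on trained models (such as the LiveKit turn detector used later) that weigh intonation and context as well.

from dataclasses import dataclass

# Toy heuristic for illustration only: production turn detectors are trained models
# that score prosody, timing, and semantics jointly rather than hand-written rules.
TRAILING_WORDS = {"a", "an", "and", "at", "for", "in", "of", "on", "the", "to", "with"}

@dataclass
class TurnState:
    transcript: str    # best transcript so far for the current user turn
    silence_ms: float  # milliseconds since speech was last detected

def looks_complete(transcript: str) -> bool:
    """Very rough proxy for semantic completeness."""
    words = transcript.strip().rstrip(".?!,").lower().split()
    return bool(words) and words[-1] not in TRAILING_WORDS

def end_of_turn(state: TurnState) -> bool:
    if not looks_complete(state.transcript):
        return state.silence_ms > 2000  # be much more patient after a dangling phrase
    return state.silence_ms > 700       # a normal pause ends an apparently complete turn

print(end_of_turn(TurnState("I'm looking for tickets to the", silence_ms=900)))                    # False
print(end_of_turn(TurnState("I'm looking for tickets to the concert on Friday", silence_ms=900)))  # True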
Newer STT models are increasingly incorporating these capabilities directly. For example, AssemblyAI’s Universal and OpenAI’s realtime models now offer built-in VAD and preliminary turn detection features, simplifying the architecture while potentially improving responsiveness.
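Even when VAD comes bundled with the STT provider, it helps to see what a standalone detector produces. Below is a minimal sketch using the open-source Silero VAD loaded via torch.hub, the same model family that LiveKit's silero plugin wraps; the file name is illustrative, and the exact loading API can vary between silero-vad versions, so treat this as a sketch rather than a definitive recipe.

import torch

# Standalone sketch using the open-source Silero VAD via torch.hub.
# Downloads the model on first run; "user_turn.wav" is an illustrative 16 kHz mono file.
model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, VADIterator, _ = utils

# Offline: locate speech regions in a whole recording.
wav = read_audio("user_turn.wav", sampling_rate=16000)
print(get_speech_timestamps(wav, model, sampling_rate=16000))

# Streaming: score fixed-size chunks as they arrive and gate the rest of the pipeline.
vad = VADIterator(model, sampling_rate=16000)
chunk_size = 512  # 32 ms at 16 kHz
for start in range(0, len(wav) - chunk_size, chunk_size):
    event = vad(wav[start:start + chunk_size], return_seconds=True)
    if event:  # {'start': ...} when speech begins, {'end': ...} when it stops
        print(event)  # only forward audio to STT between these events
vad.reset_states()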
Example Agent
Below is an example agent in LiveKit using the modular pipeline architecture. It uses:
AssemblyAI for Speech-to-Text
OpenAI for LLM
Cartesia for Text-to-Speech
This example is an edited version of LiveKit's starter agent from https://github.com/livekit-examples/agent-starter-python
import logging

from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    JobProcess,
    MetricsCollectedEvent,
    RoomInputOptions,
    WorkerOptions,
    cli,
    inference,
    metrics,
)
from livekit.plugins import noise_cancellation, silero
from livekit.plugins.turn_detector.multilingual import MultilingualModel

logger = logging.getLogger("agent")

load_dotenv(".env")


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant. The user is interacting with you via voice, even if you perceive the conversation as text.
            You eagerly assist users with their questions by providing information from your extensive knowledge.
            Your responses are concise, to the point, and without any complex formatting or punctuation including emojis, asterisks, or other symbols.
            You are curious, friendly, and have a sense of humor.""",
        )


def prewarm(proc: JobProcess):
    proc.userdata["vad"] = silero.VAD.load()


async def entrypoint(ctx: JobContext):
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    # Set up a voice AI pipeline using OpenAI, Cartesia, AssemblyAI, and the LiveKit turn detector
    session = AgentSession(
        stt=inference.STT(model="assemblyai/universal-streaming", language="en"),
        llm=inference.LLM(model="openai/gpt-4.1-mini"),
        tts=inference.TTS(
            model="cartesia/sonic-3", voice="9626c31c-bec5-4cca-baa8-f8ba9e84c8bc"
        ),
        turn_detection=MultilingualModel(),
        vad=ctx.proc.userdata["vad"],
        preemptive_generation=True,
    )

    # Metrics collection, to measure pipeline performance
    usage_collector = metrics.UsageCollector()

    @session.on("metrics_collected")
    def _on_metrics_collected(ev: MetricsCollectedEvent):
        metrics.log_metrics(ev.metrics)
        usage_collector.collect(ev.metrics)

    async def log_usage():
        summary = usage_collector.get_summary()
        logger.info(f"Usage: {summary}")

    ctx.add_shutdown_callback(log_usage)

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # For telephony applications, use `BVCTelephony` for best results
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

    # Join the room and connect to the user
    await ctx.connect()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint, prewarm_fnc=prewarm))

Advantages
Independent Scaling & Optimization: Each component is a separate service and can be scaled or optimized independently. For example, you can scale STT workers with user load and LLM workers with computational intensity.
Flexibility & Vendor Lock-in Mitigation: This is the ultimate “plug-and-play” architecture. You can easily swap an AWS Transcribe STT with a fine-tuned Whisper model or switch from GPT-4 to Claude for the LLM without major system rewrites.
Tool and RAG Integration: Because the LLM operates on plain text between the STT and TTS stages, it can easily integrate with external tools, databases, and APIs (see the sketch after this list).
Explainability and Audit: The architecture inherently produces clean, traceable text transcripts at the STT output, which is critical for logging, compliance, fine-tuning, and downstream analytics.
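As a sketch of that tool integration, the Assistant from the example above could expose a function the LLM is allowed to call. This assumes a recent livekit-agents release with the function_tool decorator; the lookup_concert_tickets helper and its return values are hypothetical.

from livekit.agents import Agent, RunContext, function_tool


class TicketAssistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="Help users find event tickets. Use tools to check availability.",
        )

    @function_tool()
    async def lookup_concert_tickets(
        self, context: RunContext, artist: str, date: str
    ) -> dict:
        """Look up available concert tickets for an artist on a given date."""
        # Hypothetical backend call; in practice this would hit a ticketing API or a RAG store.
        return {"artist": artist, "date": date, "available": 12, "price_usd": 85}

The tool call happens entirely in the text domain: the STT transcript becomes the LLM prompt, the LLM decides to call lookup_concert_tickets, and only the final text answer is sent to TTS.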
Limitations
Latency: Each transition between components in the pipeline introduces some latency, potentially affecting conversation flow and making the interaction feel less spontaneous.
Error Propagation: Errors in earlier components (e.g., STT misrecognition) cascade through the pipeline.
Context Loss: Prosodic information like tone or emphasis may be lost when converting speech to text, obscuring part of the user's intent and limiting the agent's ability to respond with genuine emotional context.
2. Unified Approach: Speech-to-Speech Models
This newer architecture uses end-to-end models that process audio directly and generate audio responses with minimal intermediate steps. While text may still be used internally as a latent representation, the primary pathway is audio-to-audio, eliminating explicit modality conversions.
These models are trained to maintain conversational context and generate responses in real-time, often producing more natural-sounding interactions including non-verbal backchannels (like “mm-hmm” or “uh-huh”).
Below is an example agent in LiveKit using OpenAI’s realtime model.
import logging

from dotenv import load_dotenv
from livekit.agents import (
    Agent,
    AgentSession,
    JobContext,
    MetricsCollectedEvent,
    RoomInputOptions,
    WorkerOptions,
    cli,
    metrics,
)
from livekit.plugins import noise_cancellation, openai

logger = logging.getLogger("agent")

load_dotenv(".env")


class Assistant(Agent):
    def __init__(self) -> None:
        super().__init__(
            instructions="""You are a helpful voice AI assistant. The user is interacting with you via voice, even if you perceive the conversation as text.
            You eagerly assist users with their questions by providing information from your extensive knowledge.
            Your responses are concise, to the point, and without any complex formatting or punctuation including emojis, asterisks, or other symbols.
            You are curious, friendly, and have a sense of humor.""",
        )


async def entrypoint(ctx: JobContext):
    ctx.log_context_fields = {
        "room": ctx.room.name,
    }

    # The realtime model handles speech understanding, VAD, turn detection, and speech
    # generation itself, so no separate STT, TTS, or turn detector plugin is configured
    session = AgentSession(
        llm=openai.realtime.RealtimeModel(voice="marin"),
    )

    # Metrics collection, to measure pipeline performance
    usage_collector = metrics.UsageCollector()

    @session.on("metrics_collected")
    def _on_metrics_collected(ev: MetricsCollectedEvent):
        metrics.log_metrics(ev.metrics)
        usage_collector.collect(ev.metrics)

    async def log_usage():
        summary = usage_collector.get_summary()
        logger.info(f"Usage: {summary}")

    ctx.add_shutdown_callback(log_usage)

    # Start the session, which initializes the voice pipeline and warms up the models
    await session.start(
        agent=Assistant(),
        room=ctx.room,
        room_input_options=RoomInputOptions(
            # For telephony applications, use `BVCTelephony` for best results
            noise_cancellation=noise_cancellation.BVC(),
        ),
    )

    # Join the room and connect to the user
    await ctx.connect()


if __name__ == "__main__":
    cli.run_app(WorkerOptions(entrypoint_fnc=entrypoint))

Advantages
Lower Latency: Direct audio processing enables faster turn-taking and more natural conversation flow.
Preserved Prosody: Maintains acoustic features like tone, emphasis, and rhythm throughout processing. This results in the model:
Responding with appropriate tone, e.g., a sympathetic voice to a frustrated user.
Generating natural backchanneling ("mm-hmm," "uh-huh") precisely timed during the user's speech.
Simplified Deployment: Less operational complexity than coordinating three distinct, streaming components.
Limitations
Black-Box Constraint: This is a major hurdle for enterprise deployments. The lack of an explicit, auditable text transcript makes debugging, explainability, and compliance more challenging. If the agent fails, diagnosing whether it was a “speech understanding” or “reasoning” failure is difficult.
Reasoning Capabilities: These models generally reason less effectively than similarly sized text-based LLM counterparts.
Reduced Flexibility: Hard to swap or fine-tune one aspect (e.g., improving speech recognition).
As a comparison, here are two voice recordings from agent interactions built with the two architectures. The first recording uses an agent built with the modular pipeline approach (STT → LLM → TTS), while the second uses OpenAI's realtime speech model, showcasing the unified approach.
Modular STT -> LLM -> TTS agent
Speech-to-speech agent using OpenAI
Conclusion
The choice between these architectures depends on your specific requirements, resources, and the nature of the voice interactions you’re designing. The modular pipeline approach is more widely used currently and offers flexibility and explainability at the cost of some latency and potential error propagation, while unified speech models provide more natural interactions but with less visibility and flexibility.
As the field evolves, we’ll see the realtime speech models continue to get better at handling complex interactions, supporting multiple languages, and integrating with external systems. Their capabilities will expand while maintaining the conversational fluidity that makes them compelling, potentially narrowing the flexibility gap with modular systems.
For technical teams building voice agents today, understanding these architectural trade-offs is essential for delivering experiences that meet user expectations for both functionality and naturalness.

