When a client came to us needing a voice bot that could handle inbound customer calls for their FAQ support, we knew the standard approaches would not work. IVR trees are brittle. Generic LLM wrappers hallucinate. Off-the-shelf voice platforms are black boxes you cannot tune. So we built the infrastructure ourselves.
This is the story of how we built a production-grade voice agent platform — the architectural decisions, the problems we ran into, and what we learned.
A voice bot sounds simple: transcribe audio, send to LLM, speak the response. But building something that works reliably in production, at low latency, for real phone calls, with verifiable answers, and full observability — that is a different problem entirely.
We needed a system where:
The foundation we chose was Pipecat — an open source Python framework for building real-time audio pipelines. It gave us the transport abstraction, frame-based pipeline model, and service integrations we needed to move fast without reinventing everything.
The core of the system is a linear frame pipeline:
transport.input()
→ STT (Deepgram Nova-2)
→ LLM context aggregator
→ RAG injector
→ LLM (Groq Llama 3.3 70B)
→ TTS (Deepgram Aura-2)
→ transcript processor
→ transport.output()
→ audio buffer (recording)
→ LLM assistant aggregator
Every component is a Pipecat frame processor. Audio frames enter from the transport and flow downstream through transcription, LLM processing, and speech synthesis, and the synthesized audio frames go back out through the transport. The pipeline is synchronous from the frame perspective but async at the I/O level.
The key design decision was keeping the pipeline transport-agnostic. The transport is swapped depending on the channel — SmallWebRTCTransport for browser sessions, FastAPIWebsocketTransport with TwilioFrameSerializer for phone calls. Everything above the transport is identical.
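To make the transport-agnostic idea concrete, here is a minimal sketch in plain Python. It does not use the Pipecat API; the stage labels, the channel names "browser" and "phone", and the function build_stage_list are all illustrative assumptions. It only shows that the middle of the pipeline is the same list regardless of channel.

```python
from typing import List

# Stages shared by every channel, in the order listed above.
# These are labels for illustration, not Pipecat class names.
CORE_STAGES: List[str] = [
    "stt",
    "llm_context_aggregator",
    "rag_injector",
    "llm",
    "tts",
    "transcript_processor",
]

# Hypothetical channel names mapped to the transports named in the text.
TRANSPORTS = {
    "browser": "SmallWebRTCTransport",
    "phone": "FastAPIWebsocketTransport + TwilioFrameSerializer",
}


def build_stage_list(channel: str) -> List[str]:
    """Return the ordered stage labels for a channel; only the transport differs."""
    if channel not in TRANSPORTS:
        raise ValueError(f"unknown channel: {channel!r}")
    transport = TRANSPORTS[channel]
    return [
        f"{transport} (input)",
        *CORE_STAGES,
        f"{transport} (output)",
        "audio_buffer",
        "llm_assistant_aggregator",
    ]
```

Calling build_stage_list("browser") and build_stage_list("phone") yields lists that differ only in the input and output entries.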
We measured end-to-end voice-to-voice latency from when the user stops speaking to when the bot starts playing audio. Our target was under one second.
The pipeline has three sequential latency contributors:
STT TTFB — Deepgram returns the final transcript. This includes endpointing — waiting to confirm the user has finished speaking. With our Smart Turn detection configuration this averaged 480ms.
LLM TTFB — Groq returns the first token. Groq's inference hardware is fast. This averaged 144ms.
TTS TTFB — Deepgram Aura-2 returns the first audio chunk. This averaged 180ms.
Naively you would add these and get 804ms. But Pipecat streams LLM tokens directly into TTS as they arrive — so TTS starts before the LLM finishes. The actual measured total was consistently under 700ms.
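The arithmetic can be written out as a small helper. This is only a sketch: the function name latency_gap is made up for illustration, and the 625 ms figure is the total from the sample session record shown later in this post, used here as one example of a measured turn rather than as an average.

```python
from typing import Dict


def latency_gap(component_ttfb_ms: Dict[str, int], measured_total_ms: int) -> Dict[str, int]:
    """Compare the naive sum of component TTFBs with a measured end-to-end value."""
    naive_sum = sum(component_ttfb_ms.values())
    return {
        "naive_sum_ms": naive_sum,
        "measured_ms": measured_total_ms,
        "gap_ms": naive_sum - measured_total_ms,
    }


# Average component TTFBs quoted above.
averages = {"stt": 480, "llm": 144, "tts": 180}

# One measured turn, taken from the sample session record.
summary = latency_gap(averages, measured_total_ms=625)
```

For that turn the naive sum is 804 ms and the measured total is 179 ms lower.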
We instrumented every session with a MetricsObserver that hooks into Pipecat's frame observer pattern, capturing TTFBMetricsData and ProcessingMetricsData frames from each service and attaching them to the session record per turn.
This was the most technically interesting challenge. The client's FAQ knowledge base needed to drive all answers — the LLM should never answer from its training data.
Our first approach was STATIC mode: embed the full KB into the system prompt at session start. It failed because we were using all-mpnet-base-v2 with a generic query. With general-purpose models, the semantic similarity between a question and a factual answer sentence is low, so we were retrieving irrelevant chunks.
We tested the scores — for "why is my order delayed", the correct chunk "Orders may be delayed during peak hours" scored -0.0022 with COSINE similarity. Random irrelevant chunks scored higher.
The fix was a model switch. multi-qa-mpnet-base-dot-v1 is trained specifically on question-answer pairs. The same query with the new model scored 0.7174 for the correct chunk, with a clear gap to the next-best irrelevant chunk at 0.68.
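A comparison like this is easy to automate. The sketch below is a hypothetical harness, not the code we ran: benchmark takes any embedding function, a list of chunks, and labelled queries, then reports top-1 accuracy and the mean margin between the correct chunk and the best wrong one. The toy_embed function is a keyword-count stand-in so the example runs without dependencies; in practice you would pass a real model's encoding function instead.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def benchmark(
    embed: Callable[[str], Vector],
    chunks: List[str],
    labelled_queries: List[Tuple[str, int]],
) -> Dict[str, float]:
    """Score each query against every chunk; needs at least two chunks."""
    chunk_vecs = [embed(c) for c in chunks]
    hits = 0
    margins = []
    for query, expected_idx in labelled_queries:
        q = embed(query)
        scores = [cosine(q, v) for v in chunk_vecs]
        best_wrong = max(s for i, s in enumerate(scores) if i != expected_idx)
        margins.append(scores[expected_idx] - best_wrong)
        if scores[expected_idx] > best_wrong:
            hits += 1
    return {
        "top1_accuracy": hits / len(labelled_queries),
        "mean_margin": sum(margins) / len(margins),
    }


# Toy stand-in for a real embedding model: counts a few keywords.
VOCAB = ["order", "delayed", "refund", "password"]


def toy_embed(text: str) -> List[float]:
    lowered = text.lower()
    return [float(lowered.count(term)) for term in VOCAB]
```

Running the same labelled queries through two candidate models and comparing the mean margin gives a quick signal of which model separates correct chunks from irrelevant ones on your own data.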
After the model fix we re-embedded the entire collection and moved to per-turn RAG injection. On every user turn:
1. Extract the user's message from context
2. Skip if it is a short conversational token (hello, yes, okay, thanks)
3. Embed the query with the task-specific model
4. Search Milvus with COSINE similarity, threshold 0.72
5. Inject results as a system message before the LLM processes the turn
6. Remove the previous turn's RAG context to prevent accumulation
This approach means the LLM always has exactly the right KB context for each specific question, with no stale context from previous turns inflating the token count.
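The six steps above can be sketched as a single function over an OpenAI-style message list. This is a minimal illustration under stated assumptions, not our production injector: the RAG_TAG marker, the inject_rag name, and the choice to place the context directly before the latest user message are all hypothetical, and the search callable stands in for steps 3 and 4 (embedding the query and querying Milvus), returning (text, score) pairs.

```python
from typing import Callable, Dict, List, Optional, Tuple

Message = Dict[str, str]

# Hypothetical marker used to recognise context injected on earlier turns.
RAG_TAG = "[KB CONTEXT]"
SMALL_TALK = {"hello", "yes", "okay", "thanks"}


def last_user_text(messages: List[Message]) -> Optional[str]:
    """Step 1: find the most recent user message."""
    for message in reversed(messages):
        if message["role"] == "user":
            return message["content"]
    return None


def inject_rag(
    messages: List[Message],
    search: Callable[[str], List[Tuple[str, float]]],
    threshold: float = 0.72,
) -> List[Message]:
    query = last_user_text(messages)
    # Step 2: skip short conversational tokens (and turns with no user text).
    if query is None or query.strip().lower().strip(".,!?") in SMALL_TALK:
        return messages
    # Step 6: drop context injected on previous turns.
    cleaned = [
        m for m in messages
        if not (m["role"] == "system" and m["content"].startswith(RAG_TAG))
    ]
    # Steps 3 and 4: delegated to the search callable, then thresholded.
    hits = [text for text, score in search(query) if score >= threshold]
    if not hits:
        return cleaned
    # Step 5: inject as a system message just before the latest user message.
    rag_message = {"role": "system", "content": RAG_TAG + "\n" + "\n".join(hits)}
    idx = max(i for i, m in enumerate(cleaned) if m["role"] == "user")
    return cleaned[:idx] + [rag_message] + cleaned[idx:]
```

Because the old context is removed before the new one is added, the message list never holds more than one KB context message at a time.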
FAQ bots need structured conversation flows — collect caller name, ask query, answer from KB, offer escalation. We integrated pipecat-flows for this.
The flow definition is a JSON config stored per bot. Each node has a task message that replaces the system prompt, functions the LLM can call to signal transitions, and a silence prompt. The LLM drives transitions naturally — when it has collected the required information it calls the transition function and the flow manager switches to the next node with a context reset.
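The node-and-transition model can be illustrated with a small state machine. To be clear, this is not the pipecat-flows API or its config schema; FlowNode, FlowManager, and all field names are hypothetical and only mirror the three things each node carries in the description above, plus the context reset on transition.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FlowNode:
    task_message: str            # replaces the system prompt while the node is active
    transitions: Dict[str, str]  # function name the LLM can call -> next node id
    silence_prompt: str = ""     # spoken if the caller stays silent


@dataclass
class FlowManager:
    nodes: Dict[str, FlowNode]
    current: str
    context: List[Dict[str, str]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._enter(self.current)

    def _enter(self, node_id: str) -> None:
        """Switch node and reset the context to that node's task message."""
        self.current = node_id
        self.context = [{"role": "system", "content": self.nodes[node_id].task_message}]

    def handle_function_call(self, name: str) -> None:
        """Called when the LLM invokes a transition function."""
        node = self.nodes[self.current]
        if name not in node.transitions:
            raise ValueError(f"{name!r} is not a transition of node {self.current!r}")
        self._enter(node.transitions[name])
```

In this sketch the LLM never picks a node directly. It can only call one of the functions the current node exposes, which keeps the conversation on the defined path.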
The interesting engineering problem here was that flow functions with no parameters were coming back from Groq as "arguments": "null", that is, the JSON literal null encoded as a string. Our handler was failing silently because json.loads("null") returns None, and using None where a dict is expected raises a TypeError. Adding a null guard on args fixed the transition failures that were showing up as repeated name collection loops.
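A null guard of this kind can be very small. The sketch below is an assumption about shape rather than our exact handler: parse_tool_arguments is a hypothetical helper that normalises the raw arguments string from a tool call into a dict, treating a missing value, an empty string, and the string "null" the same way.

```python
import json
from typing import Any, Dict, Optional


def parse_tool_arguments(raw: Optional[str]) -> Dict[str, Any]:
    """Normalise a tool call's raw arguments string into a dict."""
    if raw is None or raw.strip() == "":
        return {}
    parsed = json.loads(raw)
    # json.loads("null") returns None, so guard before treating it as a dict.
    if parsed is None:
        return {}
    if not isinstance(parsed, dict):
        raise TypeError(f"expected a JSON object, got {type(parsed).__name__}")
    return parsed
```

Raising on non-object payloads, instead of swallowing them, makes this class of failure visible in logs rather than showing up later as a stuck conversation.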
Every call produces a structured JSON session record:
{
"session_id": "uuid",
"transcript": [
{"role": "user", "text": "...", "timestamp": "...", "seq": 1},
{"role": "bot", "text": "...", "seq": 2, "latency": {
"total_latency_ms": 625,
"stt_ttfb_ms": 480,
"llm_ttfb_ms": 144,
"tts_ttfb_ms": 180,
"llm_prompt_tokens": 255,
"llm_completion_tokens": 20
}}
],
"turn_count": 7,
"silence_prompts_sent": 1,
"recording_url": "recordings/session_id_conversation.wav"
}
Latency is attached inline to the bot turn that produced it. This makes it trivial to see exactly where time was spent in any specific exchange. The MetricsObserver ties VADUserStoppedSpeakingFrame to BotStartedSpeakingFrame with wall clock times for the end-to-end measure, and reads MetricsFrame emissions from each service for the component-level breakdown.
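The bookkeeping behind that per-turn latency object can be sketched without any framework code. TurnLatencyTracker and its method names are hypothetical and do not reflect Pipecat's observer interface; the three methods simply correspond to the three kinds of events described above, with timestamps passed in as seconds.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class TurnLatencyTracker:
    _user_stopped_at: Optional[float] = None
    _ttfb_ms: Dict[str, int] = field(default_factory=dict)

    def on_user_stopped(self, now: float) -> None:
        """Record the wall clock time at which the user stopped speaking."""
        self._user_stopped_at = now

    def on_ttfb(self, service: str, seconds: float) -> None:
        """Record a component TTFB reported by a service, e.g. 'stt', 'llm', 'tts'."""
        self._ttfb_ms[f"{service}_ttfb_ms"] = round(seconds * 1000)

    def on_bot_started(self, now: float) -> Optional[Dict[str, int]]:
        """Close the turn and return its latency record, or None if no turn is open."""
        if self._user_stopped_at is None:
            return None
        record = {
            "total_latency_ms": round((now - self._user_stopped_at) * 1000),
            **self._ttfb_ms,
        }
        # Reset so the next turn starts clean.
        self._user_stopped_at = None
        self._ttfb_ms = {}
        return record
```

The returned dict has the same keys as the latency object in the session record, so it can be attached to the bot turn as-is.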
Recording uses Pipecat's AudioBufferProcessor which handles timeline alignment and resampling internally, producing a mixed mono WAV per session.
The platform serves multiple client bots from a single server instance. Each bot is defined by a Pydantic BotConfig schema covering every configurable aspect.
When a client connects with their clientid and botid, the platform fetches this config from the backend and assembles the pipeline dynamically. Two different clients connecting simultaneously get completely isolated pipeline instances with no shared state.
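As a rough illustration of per-connection isolation, the sketch below uses stdlib dataclasses as a stand-in for the Pydantic schema so that it runs without dependencies. Every field name, the in-memory store, and the open_session helper are assumptions made for the example; the real BotConfig covers far more than this.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class BotConfig:
    # Hypothetical fields; the real schema is a Pydantic model.
    client_id: str
    bot_id: str
    llm_model: str
    tts_voice: str
    rag_threshold: float = 0.72


@dataclass
class Session:
    config: BotConfig
    # default_factory gives every session its own list, never a shared one.
    messages: List[Dict[str, str]] = field(default_factory=list)


ConfigStore = Dict[Tuple[str, str], BotConfig]


def open_session(client_id: str, bot_id: str, store: ConfigStore) -> Session:
    """Look up the bot's config and build a fresh, isolated session around it."""
    key = (client_id, bot_id)
    if key not in store:
        raise KeyError(f"no bot configured for {key}")
    return Session(config=store[key])
```

The config is frozen and shared read-only, while all mutable state lives on the Session, which is created per connection.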
Conversation history trimming from day one. Context grows with every turn. By turn 10 we were sending 1100+ tokens of history to Groq on every call. A rolling window of 6 turns would have kept costs lower from the start.
Model benchmarking before KB design. We spent time debugging RAG retrieval before realising the embedding model was the wrong tool for the job. Running a quick retrieval benchmark against your actual data before committing to a model would have saved several hours.
Per-node context strategy. The flow engine resets context on node transitions, which is the right behaviour. But the transition between nodes sometimes left residual RAG context from the previous node in the message history. Explicit context cleanup on transition would be cleaner.
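The rolling window from the first point above is only a few lines. This sketch assumes one turn means a user message plus the assistant reply (two messages) and that a leading system prompt should always survive trimming; trim_history is a hypothetical helper, not something we had in place at the time.

```python
from typing import Dict, List

Message = Dict[str, str]


def trim_history(messages: List[Message], max_turns: int = 6) -> List[Message]:
    """Keep a leading system prompt plus the most recent max_turns turns."""
    if max_turns <= 0:
        raise ValueError("max_turns must be positive")
    # Preserve the system prompt if it is the first message.
    head = messages[:1] if messages and messages[0]["role"] == "system" else []
    rest = messages[len(head):]
    # Assumption: one turn is a user message and the reply, so two messages.
    return head + rest[-2 * max_turns:]
```

With the default of 6 turns, the history sent to the LLM is capped at the system prompt plus 12 messages, however long the call runs.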
The result is a voice bot that answers inbound customer calls, handles FAQ queries from a verified knowledge base, follows a structured conversation flow, and produces full observability data per session. It runs on a single FastAPI server, dual-channel across browser and phone, with sub-700ms voice-to-voice latency measured end to end.
The platform is now the foundation for every voice agent we build.