AI · Voice Agents · Python · Engineering

How I Built AI Voice Agents That Handle Real Phone Calls

Huzaifa Athar · February 18, 2026 · 8 min read

Why Voice Agents, and Why Dentists?

I got into voice AI because of a problem I kept seeing at DigitLabs. Clients would ask for chatbots, and those worked fine for text. But some businesses, especially medical and dental offices, still run on phone calls. Their front desk staff are overwhelmed: patients call to book appointments, confirm times, and ask about insurance, and half the time nobody picks up.

So I decided to build an AI voice agent that could actually handle real phone calls. Not a toy demo. A production system that answers the phone, understands what the caller needs, and takes action.

Picking the Right Architecture

The first big decision was the overall pipeline. A voice agent is really four systems stitched together (sketched in code after this list):

  • Speech-to-Text (STT): Converting the caller's audio into text
  • LLM Processing: Understanding intent and generating a response
  • Text-to-Speech (TTS): Converting the response back to natural-sounding audio
  • Telephony Integration: Actually connecting to phone lines
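
To make the shape concrete, here's a minimal sketch of one caller turn flowing through those stages. The stage functions are stubs standing in for the vendor SDKs; their names and signatures are illustrative placeholders, not real APIs:

```python
import asyncio

# Stubbed stage interfaces -- in the real system each wraps a vendor
# SDK (an STT service, an LLM, a TTS service). Names and signatures
# here are illustrative, not vendor APIs.

async def transcribe(audio: bytes) -> str:
    return "I'd like to book a cleaning next week"   # stub transcript

async def generate_reply(history: list[dict]) -> str:
    return "Sure, what day works best for you?"      # stub LLM reply

async def synthesize(text: str) -> bytes:
    return text.encode()                             # stub audio bytes

async def handle_turn(audio: bytes, history: list[dict]) -> bytes:
    """One caller turn through the STT -> LLM -> TTS pipeline."""
    transcript = await transcribe(audio)
    history.append({"role": "user", "content": transcript})
    reply = await generate_reply(history)
    history.append({"role": "assistant", "content": reply})
    return await synthesize(reply)  # the telephony layer plays these bytes

if __name__ == "__main__":
    print(asyncio.run(handle_turn(b"\x00\x01", history=[])))
```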

I evaluated a bunch of STT options. Google Cloud Speech-to-Text was solid but expensive at scale. Whisper (OpenAI's open source model) gave great accuracy, but running it in real time required GPU infrastructure I didn't want to manage early on. I ended up going with Deepgram for STT because their streaming API had consistently low latency, around 300ms, which matters a lot when someone is waiting on the phone.

For TTS, I tested ElevenLabs, Amazon Polly, and Google Cloud TTS. ElevenLabs sounded the most natural by far, but their latency was inconsistent. I went with a hybrid approach: ElevenLabs for the primary voice with Google Cloud TTS as a fallback when latency spiked above 500ms.
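
A minimal sketch of that fallback, assuming each engine is wrapped in an async callable (text in, audio bytes out). For simplicity it treats synthesis as one blocking call and times the whole thing; the production version watches time-to-first-audio on a stream:

```python
import asyncio

LATENCY_BUDGET = 0.5  # seconds; fall back if the primary voice blows past this

async def synthesize_with_fallback(text: str, primary, fallback) -> bytes:
    """Try the primary TTS engine; switch to the fallback when it's too slow.

    `primary` and `fallback` are async callables (text -> audio bytes)
    wrapping the vendor SDKs -- placeholders, not real vendor APIs.
    """
    try:
        return await asyncio.wait_for(primary(text), timeout=LATENCY_BUDGET)
    except asyncio.TimeoutError:  # wait_for cancels the slow primary call
        return await fallback(text)
```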

The LLM layer uses GPT-4 with carefully tuned system prompts. I tried smaller models first to save cost, but the difference in handling ambiguous requests was night and day.

The Hard Parts Nobody Warns You About

Building a demo that handles a scripted conversation is easy. Building something that survives real callers is a completely different game.

Background noise. People call from their cars, from restaurants, with kids screaming. My first version would hallucinate words from background noise and go off the rails. I added a confidence threshold on the STT output. If the transcription confidence drops below 0.7, the agent asks the caller to repeat instead of guessing.
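
The gate itself is a few lines. Here's the idea, assuming the streaming STT result arrives as a dict with a transcript and a 0-to-1 confidence score (field names are illustrative):

```python
CONFIDENCE_THRESHOLD = 0.7

def accept_transcript(result: dict) -> str | None:
    """Gate low-confidence STT output instead of letting the LLM guess.

    Returns the transcript if we trust it, or None, in which case the
    agent plays a canned "Sorry, could you repeat that?" prompt.
    """
    if result.get("confidence", 0.0) < CONFIDENCE_THRESHOLD:
        return None
    return result["transcript"]
```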

Interruptions. Real people don't wait for the AI to finish talking before they start speaking. They interrupt constantly. I had to implement barge-in detection, which means the agent stops its current TTS output when it detects the caller is speaking. This sounds simple, but the timing is tricky. Too sensitive and the agent trips on the echo of its own audio and cuts itself off. Too slow and the caller feels ignored.
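
A sketch of the timing logic, assuming a voice-activity detector (VAD) flags each short audio frame as speech or silence. The 250ms onset requirement is a made-up but representative value for keeping echo of the agent's own audio from triggering a cut:

```python
import asyncio
import time

BARGE_IN_MIN_SPEECH = 0.25  # seconds of sustained caller speech before we cut TTS

class BargeInMonitor:
    """Cancel the agent's TTS playback once the caller has clearly started
    talking. Frames arrive from a VAD every ~20 ms; `playback` is whatever
    task is currently streaming TTS audio out to the call."""

    def __init__(self) -> None:
        self._speech_started: float | None = None

    def on_frame(self, is_speech: bool, playback: "asyncio.Task | None") -> None:
        now = time.monotonic()
        if not is_speech:
            self._speech_started = None        # silence resets the onset clock
            return
        if self._speech_started is None:
            self._speech_started = now         # speech onset detected
        elif now - self._speech_started >= BARGE_IN_MIN_SPEECH:
            if playback is not None and not playback.done():
                playback.cancel()              # stop talking, start listening
```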

Silence handling. Sometimes callers go quiet. Maybe they're looking at their calendar, maybe they got distracted. I built a tiered silence handler: after 5 seconds, a gentle prompt ("Take your time, I'm still here"). After 15 seconds, a check-in ("Are you still on the line?"). After 30 seconds, a graceful goodbye.
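
The tiers reduce to a small table plus a check that fires each tier at most once per stretch of silence. A sketch (the goodbye wording is a placeholder):

```python
# (threshold seconds, action, what the agent says)
SILENCE_TIERS = [
    (5.0,  "prompt",  "Take your time, I'm still here."),
    (15.0, "checkin", "Are you still on the line?"),
    (30.0, "hangup",  "I'll let you go for now. Call back anytime. Goodbye!"),
]

def silence_action(seconds_quiet: float, fired: set[str]):
    """Pick the next silence response; each tier fires at most once.
    The caller of this function resets `fired` whenever speech resumes."""
    for threshold, action, line in SILENCE_TIERS:
        if seconds_quiet >= threshold and action not in fired:
            fired.add(action)
            return action, line
    return None
```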

Call transfers. When the agent can't handle something, it needs to transfer to a human. This required SIP integration with the office's existing phone system, and every office had a slightly different setup. I ended up building an adapter layer that could handle Twilio, Vonage, and direct SIP connections.
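
The adapter layer boils down to one interface per operation the agent needs, with a vendor-specific class behind it. A stripped-down sketch (method names and transfer mechanics are illustrative, not the vendors' actual SDK calls):

```python
from abc import ABC, abstractmethod

class TelephonyAdapter(ABC):
    """Common surface over Twilio, Vonage, and direct SIP backends."""

    @abstractmethod
    def transfer(self, call_id: str, destination: str) -> None:
        """Hand the live call off to a human at `destination`."""

class TwilioAdapter(TelephonyAdapter):
    def transfer(self, call_id: str, destination: str) -> None:
        ...  # e.g. redirect the live call to instructions that dial the front desk

class SipAdapter(TelephonyAdapter):
    def transfer(self, call_id: str, destination: str) -> None:
        ...  # e.g. send a SIP REFER to the office PBX

def handoff(adapter: TelephonyAdapter, call_id: str, human_line: str) -> None:
    adapter.transfer(call_id, human_line)  # one call site for every backend
```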

Calendar Integration Was Its Own Beast

The dental receptionist agent needs to book actual appointments. That means talking to calendar systems in real time during the call.

Most dental offices use practice management software like Dentrix, Eaglesoft, or Open Dental. None of them have modern APIs. Some have ODBC connections, some have proprietary sync tools. I built a middleware service that normalizes calendar operations across different backends into a simple REST API: check availability, create appointment, cancel appointment.
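
Inside the middleware, each practice-management system gets its own adapter behind one small interface, so the REST layer never knows whether it's talking to an ODBC connection or a sync tool. Roughly (method names are my own, not anything from those vendors):

```python
from abc import ABC, abstractmethod
from datetime import datetime

class CalendarBackend(ABC):
    """One adapter per practice-management system; each hides its
    ODBC/sync-tool quirks behind the same three operations the
    REST API exposes."""

    @abstractmethod
    def check_availability(self, day: datetime) -> list[datetime]:
        """Open slots for a given day."""

    @abstractmethod
    def create_appointment(self, slot: datetime, patient_name: str) -> str:
        """Book a slot; returns an appointment id."""

    @abstractmethod
    def cancel_appointment(self, appointment_id: str) -> None:
        """Cancel a previously booked appointment."""
```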

The tricky part is handling conflicts. While the AI is talking to a patient and checking Tuesday at 2pm, the front desk might book that same slot manually. I implemented a 60-second hold on proposed slots, a short-lived lease rather than a hard booking. If the caller confirms within that window, the appointment goes through. If not, the slot releases back to the pool.
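
Redis (which already holds session state) makes the hold almost free: SET with NX and a TTL is atomic, so two concurrent calls can't hold the same slot, and the expiry releases it with no cleanup job. A sketch (the key naming is mine):

```python
import redis

r = redis.Redis()  # same Redis instance that holds session state

HOLD_SECONDS = 60  # how long a proposed slot stays reserved

def hold_slot(slot_id: str, call_id: str) -> bool:
    """Atomically place a 60-second hold on a slot for this call.
    NX means the SET only succeeds if no one else holds the slot; EX
    sets the TTL that auto-releases it if the caller never confirms."""
    return r.set(f"hold:{slot_id}", call_id, nx=True, ex=HOLD_SECONDS) is True

def confirm_slot(slot_id: str, call_id: str) -> bool:
    """Book only if this call still owns the hold."""
    holder = r.get(f"hold:{slot_id}")
    return holder is not None and holder.decode() == call_id
```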

Prompt Engineering for Phone Conversations

Writing prompts for voice is different from writing prompts for chat. Phone conversations are linear. You can't show a list of options on screen. You can't use formatting.

I learned a few things the hard way:

  • Keep responses short. If the agent talks for more than 15 seconds without a pause, callers zone out. I set a hard limit of 3 sentences per turn.
  • Confirm everything. "I have you down for Tuesday, February 18th at 2pm with Dr. Martinez. Does that sound right?" Repeating back details catches errors before they become problems.
  • Use natural fillers. A tiny pause or an "Alright" before a response sounds way more human than an instant reply. I added randomized micro-delays between 200ms and 800ms (sketched after this list).
  • Handle off-topic gracefully. Callers ask random things. "What's your address?" "Do you take Delta Dental?" The agent needs to handle these without losing the conversation thread.
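
Two of those rules, the three-sentence cap and the randomized micro-delay, are mechanical enough to enforce in code. A sketch of the guardrails (the naive sentence split is deliberate; it only needs to be roughly right):

```python
import random
import re

def human_pause_ms() -> int:
    """Randomized micro-delay before each reply: 200-800 ms."""
    return random.randint(200, 800)

def clamp_reply(text: str, max_sentences: int = 3) -> str:
    """Hard limit of three sentences per turn so callers don't zone out."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:max_sentences])
```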

Deployment and Monitoring

The system runs on AWS with the following setup:

  • ECS Fargate for the main voice agent service
  • Redis for session state and slot locking
  • PostgreSQL for call logs and analytics
  • CloudWatch for monitoring, with custom metrics for latency at each pipeline stage
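
For that last bullet, a helper like this publishes each per-stage latency measurement to CloudWatch; the namespace and dimension names are my own choices, not anything standard:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

def emit_stage_latency(stage: str, latency_ms: float) -> None:
    """Publish latency for one pipeline stage (stt / llm / tts / telephony)
    as a CloudWatch custom metric, so each stage can be graphed and
    alarmed on separately."""
    cloudwatch.put_metric_data(
        Namespace="VoiceAgent",
        MetricData=[{
            "MetricName": "StageLatency",
            "Dimensions": [{"Name": "Stage", "Value": stage}],
            "Value": latency_ms,
            "Unit": "Milliseconds",
        }],
    )
```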

I track a few key metrics obsessively:

  • End-to-end latency: Time from when the caller stops speaking to when the agent starts responding. Target is under 1.5 seconds. We average about 1.1 seconds.
  • Task completion rate: Percentage of calls where the caller accomplished what they called for. Currently sitting at 73%, up from 41% in the first month.
  • Handoff rate: How often the agent has to transfer to a human. Started at 45%, now down to 22%.
  • Caller satisfaction: Post-call survey scores. This one surprised me. Most callers don't even realize they're talking to an AI.

What I'd Do Differently

If I started over, I'd invest in better testing infrastructure earlier. I spent too long manually testing by calling the agent myself. I eventually built a test harness that simulates calls with pre-recorded audio and validates the agent's responses. Should have done that from day one.
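
The harness is conceptually simple: pre-recorded caller audio in, assertions on the agent's reply out. A sketch of the shape, where the fixture file names and the `run_call_turn` driver are hypothetical stand-ins for the real pieces:

```python
import wave
from pathlib import Path

# Each case pairs a pre-recorded caller utterance with phrases the
# agent's reply must contain.
TEST_CASES = [
    ("book_cleaning.wav",    ["appointment"]),
    ("ask_insurance.wav",    ["insurance"]),
    ("noisy_background.wav", ["repeat"]),  # low STT confidence -> ask to repeat
]

def load_audio(name: str) -> bytes:
    with wave.open(str(Path("fixtures") / name), "rb") as f:
        return f.readframes(f.getnframes())

def run_suite(run_call_turn) -> None:
    """`run_call_turn` drives one simulated turn: audio bytes in, reply text out."""
    for filename, expected_phrases in TEST_CASES:
        reply = run_call_turn(load_audio(filename))
        for phrase in expected_phrases:
            assert phrase.lower() in reply.lower(), (filename, phrase)
```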

I'd also consider using a purpose-built voice AI framework like Vocode or LiveKit instead of wiring everything together myself. The custom approach gave me more control, but the maintenance overhead is real.

Voice AI is still early. The tools are getting better fast. But building something that works reliably in production, on real phone calls with real people, takes a lot of patient engineering work. There's no shortcut around it.
