Everything I got wrong building a voice agent that worked

I built an AI phone receptionist that works. It took a month, and most of that month was me being wrong.

The talking was never the hard part. On day one it could greet a caller and work out what they wanted. The other thirty days went to everything around the talking, and to one decision I made early that quietly poisoned the rest.

What it was supposed to do

The client is a small business, the kind where a missed call is just a customer who already dialed someone else. They had grown faster than they could hire, the phones never stopped, and the two summer interns they'd thrown at the problem were drowning along with everyone else. They needed a receptionist that didn't take lunch, so I built one from scratch on a voice platform (Bland). The job was short:

  1. Answer every inbound call and figure out who's calling and why.
  2. Put them through to the right person, by phone.
  3. Ping that person in the team chat before their phone rings, so nobody gets ambushed.
  4. Log every call so nothing quietly falls on the floor.

Three of those are talking. One is a decision. Take a guess which one ate the month.

One call, end to end

  1. Inbound call: forwarded from the office line.
  2. Answer + classify: who is this, what do they want.
  3. Decide who: pick the right human.
  4. Heads-up to staff: message before the phone rings.
  5. Connect the call: cold transfer to a direct line.
  6. A human picks up: the only outcome that counts.

Only one of these steps is a decision: "Decide who," the box drawn in red. It is where the project nearly died, and everything below is the story of that one box.

The plumbing was the project, not the AI

The AI parts came up almost insultingly fast. Greeting, intent classification, sounding like a person: basically free. Then I met the integration layer, and the integration layer wanted to fight.

Everything failed silently, which is the worst way for anything to fail. The "ping the staffer before transfer" step was referenced all over the prompts and had never once fired, because I'd told the agent to use a tool I never actually registered. The chat notifications showed up as a wall of raw markdown asterisks. The end-of-call summary had been built against exactly one example call, so every other call type arrived half-empty under a footer cheerfully insisting "empty fields were not collected." They were collected. The template just wasn't looking.

The theme connecting all of it: HTTP 200, green checkmark, nothing actually happened. A status code is a machine telling you it's fine, not that it did the thing.
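One way to stop trusting the green checkmark is to demand a receipt. This is a minimal sketch, not the code from the project: it assumes the downstream service echoes back an ID for the thing it created, and the `message_id` field and the `verifyDelivery` name are hypothetical.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// receipt is a hypothetical response shape: the downstream service is assumed
// to echo back the ID of the message it actually created.
type receipt struct {
	MessageID string `json:"message_id"`
}

// verifyDelivery accepts a response only when the status is 2xx AND the body
// carries evidence that the work happened.
func verifyDelivery(status int, body []byte) error {
	if status < 200 || status > 299 {
		return fmt.Errorf("unexpected status %d", status)
	}
	var r receipt
	if err := json.Unmarshal(body, &r); err != nil {
		return fmt.Errorf("status %d with unreadable body: %w", status, err)
	}
	if r.MessageID == "" {
		return errors.New("status was fine but no message_id came back")
	}
	return nil
}

func main() {
	// A 200 with an empty body is reported as a failure.
	fmt.Println(verifyDelivery(200, []byte(`{}`)))
	// A 200 with a receipt is accepted.
	fmt.Println(verifyDelivery(200, []byte(`{"message_id":"m-1"}`)))
}
```

The exact proof will differ per service. The point is only that the status code and the evidence are checked separately.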

The glue holding all this together was Zapier, and Zapier treated reliability as a stretch goal. Two Zaps did the off-platform work: one fired the pre-transfer chat message, the other logged every finished call to email and a spreadsheet. The logging Zap dropped about nine of every ten rows while reporting success on all of them, because two calls finishing at the same moment would race for the same spreadsheet row and the loser just evaporated. The pre-transfer Zap pointed at an inner webhook URL that had quietly gone 404, so every heads-up message sailed confidently into a dead endpoint. One afternoon the whole thing stopped for thirty minutes and didn't think to mention it.
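The row race is the kind of thing a lock fixes once you own the writer. A minimal sketch, assuming an in-memory slice stands in for the spreadsheet; `callLog` and `appendConcurrently` are illustrative names, not the service's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// callLog stands in for the spreadsheet. The mutex ensures two calls finishing
// at the same moment take turns instead of racing for the same row.
type callLog struct {
	mu   sync.Mutex
	rows []string
}

// Append adds one row and returns the row number it landed on.
func (l *callLog) Append(row string) int {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.rows = append(l.rows, row)
	return len(l.rows)
}

// Len reports how many rows have been written.
func (l *callLog) Len() int {
	l.mu.Lock()
	defer l.mu.Unlock()
	return len(l.rows)
}

// appendConcurrently simulates n calls finishing at once.
func appendConcurrently(l *callLog, n int) {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			l.Append(fmt.Sprintf("call-%d", id))
		}(i)
	}
	wg.Wait()
}

func main() {
	l := &callLog{}
	appendConcurrently(l, 100)
	fmt.Println(l.Len()) // 100: every row survives
}
```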

At some point you stop debugging the no-code tool and accept that the no-code tool is the bug. So I deleted both Zaps and wrote a small Go service to do their jobs: one endpoint to log the call, one to send the heads-up. It runs in a container on a cheap VPS, and its entire reason for existing is that when something fails, it says so out loud. Owning the glue killed a whole genre of silent failure in an afternoon. (It tried to crawl back once. Weeks later a new phone number came up pointed at the old Zapier hook by mistake and ate a full day of call logs before I noticed. Some bugs you have to kill twice.)
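For the shape of that service, here is a rough sketch rather than the real thing. The paths `/log-call` and `/heads-up`, the `callRecord` fields, and the injected `store` and `notify` functions are all assumptions made for illustration. What matters is that a downstream failure becomes a non-2xx response and a log line instead of a silent success.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"log"
	"net/http"
	"net/http/httptest"
	"strings"
)

// callRecord is a hypothetical payload shape.
type callRecord struct {
	CallID string `json:"call_id"`
	Route  string `json:"route"`
}

// errDownstream is a sample failure used to exercise the error path.
var errDownstream = errors.New("chat webhook returned 404")

// handle wraps one unit of work so that any failure is logged and returned
// to the caller as a non-2xx status.
func handle(do func(callRecord) error) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		var rec callRecord
		if err := json.NewDecoder(r.Body).Decode(&rec); err != nil || rec.CallID == "" {
			http.Error(w, "bad request: need call_id", http.StatusBadRequest)
			return
		}
		if err := do(rec); err != nil {
			log.Printf("call %s: %v", rec.CallID, err)
			http.Error(w, err.Error(), http.StatusBadGateway)
			return
		}
		w.WriteHeader(http.StatusNoContent)
	})
}

// newMux wires the two endpoints to whatever does the real work.
func newMux(store, notify func(callRecord) error) *http.ServeMux {
	mux := http.NewServeMux()
	mux.Handle("/log-call", handle(store))
	mux.Handle("/heads-up", handle(notify))
	return mux
}

// statusFor sends one in-process POST and returns the status code.
func statusFor(h http.Handler, path, body string) int {
	req := httptest.NewRequest(http.MethodPost, path, strings.NewReader(body))
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	ok := func(callRecord) error { return nil }
	broken := func(callRecord) error { return errDownstream }
	mux := newMux(ok, broken)
	fmt.Println(statusFor(mux, "/log-call", `{"call_id":"c1","route":"billing"}`)) // 204
	fmt.Println(statusFor(mux, "/heads-up", `{"call_id":"c1","route":"billing"}`)) // 502
	fmt.Println(statusFor(mux, "/log-call", `{}`))                                 // 400
}
```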

The first time the heads-up message actually beat the phone call into the staffer's chat, I called in to test it myself. It transferred me fine. Nobody picked up. Which, it turns out, is also fine: the platform's job ends the instant it dials a human. Whether the human answers is billed to a different department.

Here's the actual conversation graph, redacted. Don't try to read it. Just count the boxes. Every line between them is a decision something has to make, and the platform's default decision-maker has opinions.

Figure 1: the conversation graph, redacted. Dozens of nodes connected by many edges, with names and identifiers blurred out. The point is the number of edges, not the content.

Then real callers showed up

Real callers do not behave like a test suite. The week it went live, it found a fresh way to embarrass me every day.

Monday, I blamed the vendor. An overnight test run came back with 24 of 25 calls failing on the same error: the agent produced no speech at all. I checked the platform's status page like it might confess. Green. I rolled back to a known-good version. Still mute. So I declared an outage and went to bed. It was not an outage. A "persona" layer sitting on top of my graph had an old instruction in it that said do not generate any speech, and an updated model had started taking that completely literally, returning silence on every call before my graph got a turn. The fix was deleting one line, which is the most humiliating kind of fix.

Tuesday, I DoSed my own phone number, which was somehow more embarrassing. The call path went like this: the client's public number (bought from someone else) pointed at a Twilio number I'd bought, which pointed at Bland, which transferred back out to the staff extensions. So my Twilio number was load-bearing for the entire operation. I'd been stress-testing by having my harness dial it over and over, which is a wonderful way to learn that a carrier will quietly brick a number and decline to give it back. Real callers stopped getting through at the exact minute my robots started calling, because the brick landed dead center in the chain. We stood up a new number, wired it in, and held a short funeral for the old one. New rule, in permanent marker: do not let your test harness hammer your own production line.
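The permanent-marker rule can also live in the harness itself. A small sketch, assuming the harness knows the production numbers up front; `guardTarget` and the 555 numbers are made up for illustration. It compares digits only, so a number written without its country code would slip past and would need normalizing first.

```go
package main

import (
	"fmt"
	"strings"
)

// digits strips everything but 0-9 so differently formatted copies of the
// same number compare equal.
func digits(s string) string {
	var b strings.Builder
	for _, r := range s {
		if r >= '0' && r <= '9' {
			b.WriteRune(r)
		}
	}
	return b.String()
}

// guardTarget refuses to let a test harness dial any number on the
// production list.
func guardTarget(target string, production []string) error {
	t := digits(target)
	for _, p := range production {
		if digits(p) == t {
			return fmt.Errorf("refusing to dial %s: it is a production number", target)
		}
	}
	return nil
}

func main() {
	production := []string{"+1 555 555 0100"} // fictional number

	fmt.Println(guardTarget("+1 (555) 555-0199", production)) // <nil>: a test line
	fmt.Println(guardTarget("+1 (555) 555-0100", production)) // an error: production
}
```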

Wednesday onward, the transfers just didn't. The agent would do everything right, gather the details, confirm the matter, then announce "I'm going to connect you now" with the serene confidence of someone who has no intention of connecting you. Then it would say it again. Five times. Once, seventeen. The caller, reasonably, hung up. The numbers: 93 calls one day, around 20% connected. After a dozen patches, a careful count still showed 26.5%, with 42 of 83 calls dying in silence. The front desk got zero transfers that day. Not a strong showing for a receptionist.

I shipped patch after patch. Each one fixed exactly one symptom. The success rate barely moved, because I was prescribing cough syrup to a guy with a broken leg.

Why every fix passed testing and then died in the field

The platform picks which path a conversation takes by handing the model the plain-English label on each branch and letting it choose. Lovely in a demo. In production it means the same sentence routes three different ways depending on the model's mood, and worse, the model will sometimes narrate a transfer it never actually performs, like an actor describing a door instead of walking through it. That narration, with no door behind it, was the dead-air bug.

It also didn't help that the model underneath wasn't a good one. Bland trains its own model in-house instead of renting a frontier one, and you could feel it. It lost to the cheap "mini" models the big labs give away: misreading a clearly labeled branch, forgetting an instruction it had followed one turn earlier, and botching structured tool calls often enough that I stopped believing a tool had fired just because the transcript said it should have. It behaved like a model with plenty of raw pre-training and not much of the post-training that teaches one to follow instructions and call a tool without improvising. Unfortunate, given that the entire job is following instructions and calling tools.

So I did the obvious thing and wrote better labels. Three separate times. Exclusion clauses, a central routing node, new fallback branches with very stern descriptions. Every version sailed through the simulator. Every version got ignored on real calls. One of them broke a working path so thoroughly a test caller sat stuck for eleven minutes before I put him out of his misery.

The line I eventually wrote down, in defeat, had the whole problem in it:

Three different prose mechanisms tried, all failed identically on real calls, and all worked fine in the simulator.

The simulator was a yes-man. Its edge-picker was more deterministic than production, so it kept approving fixes that died the second a real human called. I'd spent days getting validated by a liar.

The actual bug was never my wording. I had handed a routing decision to a language model, and a routing decision has exactly one right answer, which is the one thing a language model will not promise you.
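A cheap way to see the problem without trusting a simulator is to replay one sentence many times and count how many different edges come back. The sketch below is illustrative only: `distinctRoutes` is a hypothetical helper, and the "flaky" picker is a stand-in that cycles through routes rather than a real model call.

```go
package main

import "fmt"

// distinctRoutes feeds the same utterance to pick n times and counts how many
// different routes come back. A router fit for production should return 1.
func distinctRoutes(pick func(string) string, utterance string, n int) int {
	seen := make(map[string]bool)
	for i := 0; i < n; i++ {
		seen[pick(utterance)] = true
	}
	return len(seen)
}

func main() {
	utterance := "I need the paralegal handling my case."

	// Stand-in for a model-driven edge picker: it cycles through routes to
	// mimic call-to-call variation. A real check would call the live agent.
	routes := []string{"paralegal", "voicemail", "main line"}
	i := 0
	flaky := func(string) string {
		r := routes[i%len(routes)]
		i++
		return r
	}
	fixed := func(string) string { return "paralegal" }

	fmt.Println(distinctRoutes(flaky, utterance, 30)) // 3
	fmt.Println(distinctRoutes(fixed, utterance, 30)) // 1
}
```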

Who picks the edge?

Caller, same sentence on every call: “I need the paralegal handling my case.”
Candidate edges: voicemail, paralegal on the case (taken → dial), new-client intake, billing, main line.
Outcome: a human picks up.

Same sentence, different door, call to call. The model is doing precisely what it was built to do, which is produce plausible text, on a problem that has exactly one right answer and where plausible isn't the bar. Take the decision away from it and the same input lands on the same door every time.

The fix: stop letting the model choose

Buried in the webhook nodes was a deterministic option I'd been ignoring: branch on a variable with a real comparison (route == "paralegal"), evaluated by the routing engine instead of the model. My Go server makes the actual call and hands back the variable, and the graph just runs a switch statement. The model goes back to its one genuine talent, working out who the caller wants, and loses its driving privileges.

The bug that survived three prose rewrites died on the first deterministic try. Anticlimactic. I'll take anticlimactic.
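To make the split concrete, here is roughly what the server side of that can look like. Treat it as a sketch under assumptions: the intent strings, the route names other than "paralegal", and the `pickRoute` and `replyJSON` helpers are invented for illustration. The only part taken from the setup above is that the reply carries a `route` field for the graph to compare against.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// routeReply is the JSON the graph reads the route variable from.
type routeReply struct {
	Route string `json:"route"`
}

// pickRoute maps a classified intent to exactly one route. Same input, same
// output, every call. The intent strings here are hypothetical.
func pickRoute(intent string) string {
	switch strings.ToLower(strings.TrimSpace(intent)) {
	case "existing case", "paralegal":
		return "paralegal"
	case "new client":
		return "intake"
	case "billing", "invoice":
		return "billing"
	default:
		return "main_line"
	}
}

// replyJSON builds the webhook response body for a classified intent.
func replyJSON(intent string) string {
	b, err := json.Marshal(routeReply{Route: pickRoute(intent)})
	if err != nil {
		// Marshal of this struct should not fail; fall back to the main line.
		return `{"route":"main_line"}`
	}
	return string(b)
}

func main() {
	fmt.Println(replyJSON("Existing case"))  // {"route":"paralegal"}
	fmt.Println(replyJSON("something else")) // {"route":"main_line"}
}
```

The model still does the fuzzy part, turning speech into an intent. The mapping from intent to door is plain code that can be unit-tested.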

The platform got one more shot in, silently, because of course it did. The schema for reading the server's reply has to be in object form. The tuple form looks equally correct and fails with no error at all, leaving the node to loop on itself like it's stuck buffering:

responseData: { data: "$.route", name: "route", context: "..." }   // reads the variable
responseData: [ "route", "string", "$.route" ]                      // reads nothing, loops forever, says nothing

Then the last indignity. Every transfer runs through a "notify" node that pings the staffer and then advances to the dial, wired to advance when the webhook returns a 200. It does not advance on a 200. It waits for the caller to say something first. So a chatty caller who said "okay, thanks" tripped the next turn and got connected, and a polite caller who went quiet to be transferred got several minutes of dead air instead. The deterministic routing, conveniently, did advance on a 200, so I had the server return advance: "true" and branched the notify node on that. A long-running Spanish dead-end I'd been blaming on the greeting prompt for a week turned out to be this exact bug wearing a disguise.
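A sketch of the reply that unblocks the notify node, with assumptions flagged: the `advance` field with the string value "true" comes from the setup described above, while the `notified` field and the choice to advance even when the chat ping fails are my own illustration, not something the platform requires.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// notifyReply is what the notify webhook returns. Advance is the string
// "true" so the graph can branch on it with a plain comparison.
type notifyReply struct {
	Advance  string `json:"advance"`
	Notified bool   `json:"notified"` // hypothetical: records whether the ping landed
}

// errChatDown is a sample failure used to exercise the error path.
var errChatDown = errors.New("chat API timed out")

// buildNotifyReply always tells the graph to advance, on the assumption that
// a failed heads-up should not strand the caller. The failure is still
// visible in the reply.
func buildNotifyReply(notifyErr error) string {
	b, err := json.Marshal(notifyReply{Advance: "true", Notified: notifyErr == nil})
	if err != nil {
		return `{"advance":"true","notified":false}`
	}
	return string(b)
}

func main() {
	fmt.Println(buildNotifyReply(nil))         // {"advance":"true","notified":true}
	fmt.Println(buildNotifyReply(errChatDown)) // {"advance":"true","notified":false}
}
```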

The next workday I pulled every transfer-intent call. Eleven of twelve connected. The one miss hung up and immediately called back, so even the failure had second thoughts. Median time from answer to transfer, about fifteen seconds.

Pre-fix: 58%. Post-fix: 92%.

Then I made it prove it

Eyeballing a dozen calls and declaring victory is how you end up right back here in two weeks. So I wrote three graders: did the call reach a human, did the agent claim a transfer that never happened, and did any call sit on a notify step for more than thirty seconds. Run against calls from before the fix, they reproduced my hand diagnosis almost exactly, at about seven cents a call, which is cheaper than my hand.
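The three checks reduce to three booleans per call. The sketch below assumes the facts have already been extracted from each transcript, which is the hard and expensive part and is not shown. The `callFacts` fields and the sample data are invented for illustration.

```go
package main

import "fmt"

// callFacts holds what was extracted from one call. All fields are
// hypothetical names for this sketch.
type callFacts struct {
	ReachedHuman    bool // a person picked up
	ClaimedTransfer bool // the agent said it was transferring
	TransferDialed  bool // a transfer was actually attempted
	NotifyWaitSecs  int  // seconds spent sitting on the notify step
}

// verdict is the output of the three graders.
type verdict struct {
	ReachedHuman    bool
	PhantomTransfer bool // claimed a transfer that never happened
	StuckOnNotify   bool // sat on notify for more than thirty seconds
}

// grade applies the three checks to one call.
func grade(c callFacts) verdict {
	return verdict{
		ReachedHuman:    c.ReachedHuman,
		PhantomTransfer: c.ClaimedTransfer && !c.TransferDialed,
		StuckOnNotify:   c.NotifyWaitSecs > 30,
	}
}

// connectRate is the share of calls that reached a human.
func connectRate(calls []callFacts) float64 {
	if len(calls) == 0 {
		return 0
	}
	n := 0
	for _, c := range calls {
		if grade(c).ReachedHuman {
			n++
		}
	}
	return float64(n) / float64(len(calls))
}

func main() {
	// Invented sample: eleven good calls and one that trips both tripwires.
	calls := make([]callFacts, 0, 12)
	for i := 0; i < 11; i++ {
		calls = append(calls, callFacts{ReachedHuman: true, ClaimedTransfer: true, TransferDialed: true, NotifyWaitSecs: 2})
	}
	calls = append(calls, callFacts{ClaimedTransfer: true, NotifyWaitSecs: 45})

	fmt.Printf("%.0f%%\n", connectRate(calls)*100) // 92%
	fmt.Printf("%+v\n", grade(calls[11]))
}
```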

Transfers that actually connected

[Chart, 0 to 100%, one bar per stage: first transcript audit, the bad week, measured baseline, server-side router, auto-advance fix. Highlighted reading: 55%, with ~1 in 3 callers told "I can't transfer you."]

The third grader is the one I actually care about. It's a tripwire: if the worst bug of the whole project ever crawls back, a score drops on a dashboard instead of a customer finding out for me. The same Go server also emails the owner a daily spreadsheet of every call and where it went, so the person paying for this can watch it work without ever learning what a webhook is.
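The daily spreadsheet is the least glamorous part and the easiest to sketch. This assumes a three-column layout that I made up for illustration (`call_id`, `route`, `connected`). The real report's columns and the emailing step are not shown.

```go
package main

import (
	"encoding/csv"
	"fmt"
	"strconv"
	"strings"
)

// callRow is one line of the daily report. The columns are hypothetical.
type callRow struct {
	CallID    string
	Route     string
	Connected bool
}

// dailyCSV renders the day's calls as CSV text with a header row.
func dailyCSV(rows []callRow) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write([]string{"call_id", "route", "connected"}); err != nil {
		return "", err
	}
	for _, r := range rows {
		rec := []string{r.CallID, r.Route, strconv.FormatBool(r.Connected)}
		if err := w.Write(rec); err != nil {
			return "", err
		}
	}
	// Flush pushes buffered rows out; Error reports any write failure.
	w.Flush()
	if err := w.Error(); err != nil {
		return "", err
	}
	return sb.String(), nil
}

func main() {
	out, err := dailyCSV([]callRow{
		{CallID: "c1", Route: "billing", Connected: true},
		{CallID: "c2", Route: "paralegal", Connected: false},
	})
	if err != nil {
		fmt.Println("report failed:", err)
		return
	}
	fmt.Print(out)
}
```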

What was true the whole time

By late May the receptionist had gone from "needs a babysitter" to "runs itself and complains when something's wrong." What's left is a genuine long tail of edge cases (callers who name a staffer in the first half-second, people dialing extensions, the occasional caller who mostly wants to yell at a human) rather than a hole in the design.

The fixes that stuck were mechanical. The fixes that kept leaking were all me trying to talk the model into behaving. That's the whole lesson, and I now apply it to everything I build:

If it has to be reliable, make it mechanical. Prose is for the parts where "close enough" is honestly close enough.

Language models didn't make routing hard. They just made it very tempting to hand a decision with a right answer to something that only does guesses.