Vapi Voice Agent Architecture
Overview
Vapi is a voice AI platform for building and deploying conversational voice agents. It orchestrates the full voice pipeline (STT → LLM → TTS) and layers on tool calling, call transfer, and telephony integration.
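An agent on this pipeline is defined by one config covering all three stages. A minimal sketch, assuming field names that mirror Vapi's public assistant API (`transcriber`/`model`/`voice`); the assistant name and voice ID are placeholders, and the shape should be checked against current docs:

```python
import json

def build_assistant_config(system_prompt: str) -> dict:
    """Sketch of a Vapi assistant definition: one object wires STT, LLM, and TTS.
    Field names follow Vapi's public API but are illustrative, not verified."""
    return {
        "name": "shop-receptionist",  # hypothetical assistant name
        "transcriber": {"provider": "deepgram", "model": "nova-2"},   # STT stage
        "model": {
            "provider": "openai",
            "model": "gpt-4o",
            "messages": [{"role": "system", "content": system_prompt}],
        },  # LLM stage
        "voice": {"provider": "11labs", "voiceId": "example-voice-id"},  # TTS stage
    }

config = build_assistant_config("You are a friendly receptionist for an auto shop.")
print(json.dumps(config, indent=2))
# This object would be POSTed to Vapi's assistant endpoint with an API key.
```

The point is that each pipeline stage is independently swappable in config, which is what makes the per-stage optimization below possible.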
Production Stack (Voice Controller AI)
Voice Controller AI runs ~50,000 calls/month on Vapi, primarily serving automotive shops. The stack:
- Vapi: Voice pipeline orchestration
- n8n: Workflow automation (webhooks, tool execution, CRM integration)
- Vercel: Frontend hosting
- Supabase: Database and auth
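In this stack, tool execution happens outside Vapi: Vapi sends a tool-call webhook, n8n runs the workflow (CRM lookup, booking, etc.), and replies with results. A sketch of that handler, assuming the `message.toolCallList` request shape and `{"results": [...]}` response shape of Vapi's server messages; the `book_appointment` tool is a hypothetical stand-in for the real CRM integration:

```python
def handle_vapi_webhook(payload: dict) -> dict:
    """Handle a Vapi 'tool-calls' server message the way an n8n workflow would.
    Payload and response shapes are assumptions based on Vapi's webhook format;
    verify against current docs before relying on them."""
    message = payload.get("message", {})
    if message.get("type") != "tool-calls":
        return {}  # ignore other server messages (status updates, transcripts, ...)

    results = []
    for call in message.get("toolCallList", []):
        name = call["function"]["name"]
        args = call["function"].get("arguments", {})
        if name == "book_appointment":  # hypothetical CRM tool
            result = f"booked {args.get('service')} for {args.get('customer')}"
        else:
            result = f"unknown tool: {name}"
        # Each result must echo the toolCallId so Vapi can match it up.
        results.append({"toolCallId": call["id"], "result": result})
    return {"results": results}
```

Since the caller is waiting on the line while this runs, webhook latency lands directly in the conversation, which is why tool-calling latency appears in the optimization list below.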
Key Optimization Vectors
The full-stack optimization space for voice agents includes:
- STT (Speech-to-Text): Accuracy, latency, language support. Affects downstream LLM quality.
- LLM (Language Model): Prompt engineering, model selection, response latency. The intelligence layer.
- TTS (Text-to-Speech): Naturalness, latency, voice selection. The user-facing quality layer.
- Prompt Design: System prompts, tool descriptions, conversation flow. Highest-leverage optimization.
- Tool Calling: Latency, reliability, error handling. Determines agent capability.
- Timing: Turn-taking, interruption handling, silence detection. Determines conversational feel.
Autoresearch Approach
The Voice Prompt AutoResearch (VPAR) project applies Karpathy's autoresearch loop to voice prompt optimization:
- Automated prompt variant generation
- Real Vapi call execution for evaluation
- LLM-judge scoring of call quality
- Iterative improvement based on results
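The four steps above can be sketched as one cycle, with the expensive parts injected as callables. All three interfaces (`generate_variants`, `run_call`, `judge`) are hypothetical stand-ins, not VPAR's actual code:

```python
def autoresearch_cycle(base_prompt, generate_variants, run_call, judge, n_variants=3):
    """One VPAR cycle: generate prompt variants, evaluate each with a real call,
    score the transcripts with an LLM judge, and keep the best prompt.

    Injected callables (hypothetical interfaces):
      generate_variants(prompt, n) -> list[str]   # LLM-generated variants
      run_call(prompt) -> transcript              # a real Vapi evaluation call
      judge(transcript) -> float                  # LLM-judge quality score
    """
    candidates = [base_prompt] + list(generate_variants(base_prompt, n_variants))
    scored = [(judge(run_call(prompt)), prompt) for prompt in candidates]
    best_score, best_prompt = max(scored, key=lambda pair: pair[0])
    return best_prompt, best_score
```

Including `base_prompt` among the candidates means a cycle can never regress: if no variant beats the current prompt, the current prompt wins the cycle.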
Budget constraint: ~$3/cycle, with ~$20 total Vapi credits. At 50k calls/month, a 1% improvement ≈ 500 better conversations per month.
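The arithmetic behind those constraints, spelled out:

```python
monthly_calls = 50_000
vapi_credits_usd = 20.0   # ~$20 total Vapi credits
cost_per_cycle_usd = 3.0  # ~$3 per autoresearch cycle

affordable_cycles = int(vapi_credits_usd // cost_per_cycle_usd)  # full cycles on current credits
calls_improved_per_pct = int(monthly_calls * 0.01)               # conversations per 1% gain

print(f"{affordable_cycles} cycles affordable; 1% improvement ≈ {calls_improved_per_pct} calls/month")
```

So the credits cover roughly six cycles, and each percentage point of quality improvement compounds across ~500 monthly conversations.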
Status (2026-04-04): VPAR is paused due to runaway charges (~$90 over two days). A pause enforcement gap was identified: individual experiment scripts bypassed the pause toggle. Fix required before re-enabling.
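One way to close that gap is a single guard that every script must call before any billable action, so no script can spend while paused or over budget. A minimal sketch; the `VPAR_PAUSED` environment variable and the $3 cycle cap are assumptions for illustration, not the actual fix:

```python
import os

class PausedError(RuntimeError):
    """Raised when a billable action is attempted while VPAR is paused."""

def require_active(spent_usd: float, cycle_cap_usd: float = 3.0) -> None:
    """Central pre-spend guard for all experiment scripts.

    Checks a shared pause flag (hypothetical VPAR_PAUSED env var) and the
    per-cycle budget before any Vapi call is placed. The point is one shared
    check at the spend boundary, so individual scripts cannot bypass it.
    """
    if os.environ.get("VPAR_PAUSED") == "1":
        raise PausedError("VPAR is paused; refusing to place calls")
    if spent_usd >= cycle_cap_usd:
        raise PausedError(f"cycle budget ${cycle_cap_usd:.2f} exhausted")
```

Putting the check inside the one function that places calls, rather than in each script's entry point, is what prevents the bypass that caused the runaway charges.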
Related
- Karpathy LLM Wiki Pattern — the autoresearch loop is itself a Karpathy concept