Vapi Voice Agent Architecture

Overview

Vapi is a voice AI platform for building and deploying conversational voice agents. It handles the full voice pipeline: STT → LLM → TTS, with tool calling, call transfers, and telephony integration.
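To make the pipeline concrete, here is a minimal sketch of one conversational turn through STT → LLM → TTS. All three stages are placeholder functions, not real Vapi APIs; in production each stage is a streaming network call.

```python
# Minimal sketch of the STT -> LLM -> TTS turn loop a platform like Vapi
# orchestrates. The three stage functions are placeholders for illustration.

def stt(audio: bytes) -> str:
    """Placeholder speech-to-text: pretend the audio decodes to a string."""
    return audio.decode("utf-8")

def llm(transcript: str, history: list[str]) -> str:
    """Placeholder language model: record the turn and return a canned reply."""
    history.append(transcript)
    return f"You said: {transcript}"

def tts(text: str) -> bytes:
    """Placeholder text-to-speech: pretend we synthesize audio."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes, history: list[str]) -> bytes:
    """One conversational turn through the full pipeline."""
    transcript = stt(audio_in)
    reply = llm(transcript, history)
    return tts(reply)

history: list[str] = []
audio_out = handle_turn(b"book an oil change", history)
print(audio_out.decode("utf-8"))  # -> You said: book an oil change
```

The real loop is streaming and interruptible, but the data flow per turn is exactly this chain.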

Production Stack (Voice Controller AI)

Voice Controller AI runs ~50,000 calls/month on Vapi, primarily serving automotive shops. The stack:

  • Vapi: Voice pipeline orchestration
  • n8n: Workflow automation (webhooks, tool execution, CRM integration)
  • Vercel: Frontend hosting
  • Supabase: Database and auth

Key Optimization Vectors

The full-stack optimization space for voice agents includes:

  1. STT (Speech-to-Text): Accuracy, latency, language support. Affects downstream LLM quality.
  2. LLM (Language Model): Prompt engineering, model selection, response latency. The intelligence layer.
  3. TTS (Text-to-Speech): Naturalness, latency, voice selection. The user-facing quality layer.
  4. Prompt Design: System prompts, tool descriptions, conversation flow. Highest-leverage optimization.
  5. Tool Calling: Latency, reliability, error handling. Determines agent capability.
  6. Timing: Turn-taking, interruption handling, silence detection. Determines conversational feel.
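One reason these vectors interact: the stage latencies add, so STT, LLM, and TTS all compete for the same response-time budget a caller will tolerate. The numbers below are assumptions for the sketch, not measured Vapi figures.

```python
# Illustrative end-to-end latency budget for one turn. All figures are
# assumed for the example; the point is that the stages sum against one
# shared budget, so optimizing any single stage buys headroom for the rest.

BUDGET_MS = 1000  # rough tolerable response latency on a phone call (assumed)

stage_latency_ms = {
    "stt_final_transcript": 300,  # assumed
    "llm_first_token": 400,       # assumed
    "tts_first_audio": 200,       # assumed
}

total = sum(stage_latency_ms.values())
headroom = BUDGET_MS - total
print(f"total={total}ms headroom={headroom}ms")  # -> total=900ms headroom=100ms
```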

Autoresearch Approach

The Voice Prompt AutoResearch (VPAR) project applies Karpathy's autoresearch loop to voice prompt optimization:

  • Automated prompt variant generation
  • Real Vapi call execution for evaluation
  • LLM-judge scoring of call quality
  • Iterative improvement based on results

Budget constraint: ~$3/cycle, with ~$20 total Vapi credits (roughly six cycles). At 50k calls/month, a 1% improvement = ~500 better conversations per month.
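The loop above can be sketched as follows. The generate/call/judge functions are hypothetical stubs standing in for the actual VPAR code; the per-call cost is an assumption. The key structural point is the hard per-cycle budget cap inside the loop.

```python
# Sketch of an autoresearch cycle under the stated constraints: generate
# prompt variants, run calls, score with an LLM judge, keep the best, and
# stop when the cycle budget is spent. All stage functions are stubs.

CYCLE_BUDGET_USD = 3.00
COST_PER_CALL_USD = 0.15  # assumed average cost per test call

def generate_variants(base_prompt: str, n: int) -> list[str]:
    # Stand-in for automated prompt variant generation.
    return [f"{base_prompt} [variant {i}]" for i in range(n)]

def run_call(prompt: str) -> str:
    # Stand-in for a real Vapi call execution returning a transcript.
    return f"transcript for: {prompt}"

def judge_score(transcript: str) -> float:
    # Stand-in for LLM-judge scoring of call quality.
    return float(len(transcript) % 10)

def autoresearch_cycle(base_prompt: str) -> str:
    spent, best, best_score = 0.0, base_prompt, -1.0
    for variant in generate_variants(base_prompt, n=20):
        if spent + COST_PER_CALL_USD > CYCLE_BUDGET_USD:
            break  # hard stop: never exceed the per-cycle budget
        spent += COST_PER_CALL_USD
        score = judge_score(run_call(variant))
        if score > best_score:
            best, best_score = variant, score
    return best
```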

Status (2026-04-04): VPAR is paused due to runaway charges (~$90 over 2 days). A pause-enforcement gap was identified: individual experiment scripts bypassed the pause toggle. A fix is required before re-enabling.
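One way to close the enforcement gap is to route every experiment's call execution through a single gate that checks the pause toggle, so individual scripts cannot bypass it. This is a sketch, not the actual fix; the toggle here is an environment variable, whereas in the real stack it would likely live in a shared store such as Supabase.

```python
# Hypothetical centralized pause gate: all call placement goes through
# execute_call(), which refuses to run while the pause toggle is set.

import os

class ExperimentsPausedError(RuntimeError):
    """Raised when a script attempts a call while VPAR is paused."""

def is_paused() -> bool:
    # Assumed toggle location; real enforcement should read a shared store
    # so every script sees the same state.
    return os.environ.get("VPAR_PAUSED", "0") == "1"

def execute_call(prompt: str) -> str:
    if is_paused():
        raise ExperimentsPausedError("VPAR is paused; refusing to place call")
    return f"placed call with prompt: {prompt}"  # stand-in for the Vapi call
```

The point of the design is that there is exactly one code path that can spend money, and the pause check lives on that path.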