Architecture

VOCALS is organized into three layers: a config panel (frontend), an orchestrator API (backend), and a telephony layer (Twilio SIP). This page describes how they fit together and how audio flows through the system.

System Overview

Twilio SIP ──► Orchestrator API ──► STT Provider (configurable)
                                          |
                                   transcribed text
                                          |
                                          v
                                LLM/Reasoning (configurable)
                                          |
                                    generated text
                                          |
                                          v
                                 TTS Provider (configurable)
                                          |
    ◄─────────────────────────────────────┘
        audio response back to caller

Config Panel ──► Orchestrator API (config endpoints)

Config Panel (Frontend)

A React + TypeScript single-page application served by Nginx. Provides the dashboard for managing agents, providers, SIP configurations, call logs, analytics, webhooks, and API keys. Communicates with the orchestrator API over HTTPS. Authenticated via Supabase Auth (email/password, OAuth).

Orchestrator API (Backend)

A Python FastAPI application that handles two distinct concerns:

  1. REST API -- CRUD operations for agents, providers, calls, webhooks, analytics, and API keys. Mounted at /api/v1/.
  2. WebSocket handler -- Receives real-time audio from Twilio Media Streams. Each active call maintains a WebSocket connection that drives the STT -> LLM -> TTS pipeline.

The backend runs on Uvicorn with uvloop for high-performance async I/O.

Telephony Layer (Twilio)

Twilio SIP handles the PSTN connection. When a call arrives (or is initiated), Twilio opens a WebSocket Media Stream to the orchestrator backend, streaming audio in mulaw/8kHz format. The backend converts this to PCM and feeds it through the pipeline.

Voice Pipeline

Each call flows through a real-time pipeline with four stages -- audio conversion, STT, LLM, and TTS -- with the audio converted back to Twilio's format at the end:

Twilio Media Stream (mulaw/8kHz)
|
v
Audio Conversion (mulaw -> PCM 16kHz 16-bit mono)
|
v
STT Provider (streaming transcription)
|
v
LLM Provider (streaming text generation)
|
v
TTS Provider (streaming speech synthesis)
|
v
Audio Conversion (PCM -> mulaw/8kHz)
|
v
Twilio Media Stream (back to caller)

Audio Conversion

Twilio sends audio as mulaw-encoded 8kHz samples. Most STT and TTS providers expect PCM 16-bit 16kHz mono. The orchestrator performs real-time resampling in both directions:

  • Inbound: mulaw/8kHz -> PCM 16kHz 16-bit mono (for STT)
  • Outbound: PCM 16kHz 16-bit mono -> mulaw/8kHz (for Twilio)
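To make the inbound leg concrete, here is a minimal sketch of G.711 mu-law decoding plus a naive 8 kHz -> 16 kHz upsample. It is illustration only -- the function names are hypothetical, and a production pipeline would use an optimized DSP library with proper anti-aliased resampling rather than pure Python:

```python
import struct

def ulaw_byte_to_pcm16(b: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    b = ~b & 0xFF                          # mu-law bytes are stored complemented
    t = ((b & 0x0F) << 3) + 0x84           # mantissa plus the 0x84 bias
    t <<= (b & 0x70) >> 4                  # apply the 3-bit exponent
    return (0x84 - t) if (b & 0x80) else (t - 0x84)

def ulaw_8k_to_pcm16_16k(data: bytes) -> bytes:
    """Decode mu-law and upsample 8 kHz -> 16 kHz by duplicating each
    sample (crude: a real converter would interpolate and filter)."""
    out = []
    for byte in data:
        s = ulaw_byte_to_pcm16(byte)
        out.extend((s, s))                 # 2x duplication doubles the rate
    return struct.pack(f"<{len(out)}h", *out)
```

The outbound direction is the mirror image: downsample to 8 kHz, then mu-law encode each sample before framing it for the Media Stream.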

Speech-to-Text (STT)

The STT provider receives a streaming async iterator of PCM audio chunks and yields transcribed text fragments. Both partial (interim) and final results are emitted. The orchestrator uses final transcription results to trigger LLM generation.

Large Language Model (LLM)

The LLM provider receives the conversation history (system prompt + all previous turns) and generates a streaming text response. The orchestrator detects sentence boundaries in the stream and sends each complete sentence to TTS immediately, rather than waiting for the full response. This reduces perceived latency.
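One way to implement that boundary detection is a small async generator that buffers tokens and releases each sentence as soon as its closing punctuation arrives. This is a hedged sketch, not the orchestrator's actual code -- the regex and names are illustrative:

```python
import re

# A sentence-final mark (optionally followed by a closing quote/bracket)
# and then whitespace. Deliberately conservative: the tail is flushed
# at end-of-stream rather than guessed at mid-stream.
SENTENCE_END = re.compile(r'[.!?]["\')\]]?\s')

async def sentences(token_stream):
    """Yield complete sentences from a streaming LLM response so TTS
    can start speaking before the full response has been generated."""
    buf = ""
    async for token in token_stream:
        buf += token
        while (m := SENTENCE_END.search(buf)):
            yield buf[:m.end()].strip()
            buf = buf[m.end():]
    if buf.strip():                        # flush whatever trails the last boundary
        yield buf.strip()
```

Each yielded sentence would be handed straight to the TTS provider, overlapping synthesis with the remainder of the LLM stream.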

The system prompt is augmented with call control instructions. The LLM can output special markers:

  • [END_CALL] -- Signals the orchestrator to hang up the call
  • [VOICEMAIL] -- Signals that an answering machine was detected (for outbound calls)
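Handling these markers amounts to stripping them from the speakable text and recording the action they signal. A minimal sketch (the marker strings come from this page; the function and action names are illustrative):

```python
END_CALL = "[END_CALL]"
VOICEMAIL = "[VOICEMAIL]"

def extract_actions(text: str) -> tuple[str, set[str]]:
    """Return the text with control markers removed, plus the set of
    call-control actions the orchestrator should take."""
    actions = set()
    if END_CALL in text:
        actions.add("hangup")              # orchestrator ends the call
        text = text.replace(END_CALL, "")
    if VOICEMAIL in text:
        actions.add("voicemail_detected")  # answering machine on an outbound call
        text = text.replace(VOICEMAIL, "")
    return text.strip(), actions
```

The cleaned text goes to TTS; the actions drive call teardown or voicemail handling after the current utterance.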

Text-to-Speech (TTS)

The TTS provider receives text and yields streaming PCM audio chunks. These are converted back to mulaw and sent to Twilio in real time.

Barge-In

VOCALS supports caller interruption (barge-in). A Voice Activity Detector (VAD) monitors the incoming audio. When speech is detected while the agent is talking, the orchestrator:

  1. Stops the current TTS playback
  2. Cancels any pending TTS chunks
  3. Resumes STT processing for the caller's speech
  4. Routes the new transcript through the LLM

Barge-in sensitivity is configurable per agent (low, medium, high). The welcome message can optionally be made non-interruptible.
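The interruption sequence above maps naturally onto task cancellation in asyncio. A minimal sketch, assuming the TTS playback runs as a cancellable task (the class and method names are hypothetical, not the real orchestrator's internals):

```python
import asyncio

class BargeInController:
    def __init__(self):
        self.tts_task: asyncio.Task | None = None

    def start_speaking(self, playback_coro):
        """Begin agent TTS playback as a task we can cancel later."""
        self.tts_task = asyncio.ensure_future(playback_coro)

    async def on_speech_detected(self):
        """VAD fired while the agent is talking: stop playback so the
        caller's speech can flow through STT -> LLM again."""
        if self.tts_task and not self.tts_task.done():
            self.tts_task.cancel()         # steps 1-2: stop playback, drop pending chunks
            try:
                await self.tts_task
            except asyncio.CancelledError:
                pass
        # Steps 3-4 (resume STT, route the new transcript to the LLM)
        # happen in the main pipeline loop, not shown here.
```

Making the welcome message non-interruptible would then be as simple as not wiring the VAD callback until the welcome playback completes.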

Session Management

Active Sessions (Redis)

While a call is in progress, its state is managed in Redis:

  • Conversation history -- The running list of user/assistant message turns
  • Call metadata -- Call SID, agent ID, provider references, timing data
  • Call queue -- Outbound calls waiting to be initiated (processed at 1 call per second)
  • Rate limit counters -- Per-API-key request counts (RPM and RPD)
  • Config cache -- Twilio credentials and agent settings cached for 60 seconds to avoid DB hits on concurrent requests
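The 60-second config cache behaves like any TTL cache: a hit within the window skips the database, a stale or missing entry falls through to it. An illustrative in-process sketch (the real cache lives in Redis; the class name is hypothetical):

```python
import time

class TTLCache:
    """Time-bounded cache: entries expire after ttl_seconds, forcing a
    fresh DB read for the next request."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._store: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:  # stale: caller falls back to the DB
            del self._store[key]
            return None
        return value

    def set(self, key: str, value):
        self._store[key] = (time.monotonic() + self.ttl, value)
```

Under concurrent calls for the same agent, only the first request after expiry pays the DB round-trip; the rest read the cached credentials and settings.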

Persistence (PostgreSQL)

After a call ends, the orchestrator writes the final call record to PostgreSQL (via Supabase):

  • Call logs -- Duration, status, outcome, transcript, provider usage, latency metrics, recording path, error reason
  • STT events -- Connection lifecycle events (opened, closed, rejected, timeout, error)
  • Audit logs -- Timestamped records of every configuration change (agent, provider, webhook, API key operations)

All data is scoped to a tenant. Multi-tenancy is enforced at the query level with tenant_id filters on every database operation.

Provider Abstraction

Each pipeline stage (STT, LLM, TTS) is defined by an abstract base class with swappable implementations:

from typing import AsyncIterator

class STTProvider(BaseProvider):
    async def transcribe(self, audio_stream) -> AsyncIterator[str]: ...

class LLMProvider(BaseProvider):
    async def generate(self, messages, system_prompt) -> AsyncIterator[str]: ...

class TTSProvider(BaseProvider):
    async def synthesize(self, text) -> AsyncIterator[bytes]: ...

Providers are registered in a central registry and instantiated at runtime based on the agent's configuration. Each provider needs:

  • An API key (encrypted at rest)
  • A model identifier
  • Optional provider-specific configuration

This pattern allows you to:

  • Switch providers per agent without code changes
  • A/B test different provider configurations
  • Add new providers by implementing the base interface
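A registry like this is commonly a mapping from (type, name) to class, filled by a decorator and consulted at call setup. A minimal sketch under that assumption -- the class, key, and model names below are illustrative, not the actual VOCALS code:

```python
REGISTRY: dict[tuple[str, str], type] = {}

def register(kind: str, name: str):
    """Class decorator: make an implementation selectable by config."""
    def wrap(cls):
        REGISTRY[(kind, name)] = cls
        return cls
    return wrap

@register("stt", "deepgram")
class DeepgramSTT:
    def __init__(self, api_key: str, model: str, **options):
        self.api_key, self.model, self.options = api_key, model, options

def build_provider(kind: str, config: dict):
    """Instantiate whichever provider the agent's config points at."""
    cls = REGISTRY[(kind, config["name"])]
    return cls(config["api_key"], config["model"], **config.get("options", {}))
```

Swapping providers is then purely a config change: the agent record names a different registered key, and the next call instantiates a different class.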

Currently Registered Providers

Type    Providers
----    ---------
STT     Deepgram, OpenAI, Whisper, ElevenLabs, Qwen
LLM     OpenAI, Claude, Google Gemini, Kimi
TTS     Deepgram, OpenAI, ElevenLabs, Qwen, Resemble

Error Handling

The orchestrator includes structured failure logging and retry logic:

  • Provider failures are logged with tenant ID, call ID, provider type/name, error message, and whether a retry was attempted.
  • Graceful degradation -- If a provider fails mid-call, the system attempts to continue with an error message to the caller rather than dropping the call silently.
  • Voicemail detection -- For outbound calls, both LLM-based detection (via the [VOICEMAIL] marker) and regex pattern matching on transcripts are used to identify answering machines.