Architecture

VOCALS is organized into three layers: a config panel (frontend), an orchestrator API (backend), and a telephony layer (SIP). This page describes how they fit together and how audio flows through the system.

System Overview

Twilio SIP ───────────── WebSocket Media Streams ──────────┐
                                                           │
Generic SIP Provider ── SIP trunk ── Asterisk              │
                                        │                  │
                                   AudioSocket             │
                                        │                  ▼
                                     Bridge ── WebSocket ──► Orchestrator API
                                                                   │
                                                              STT Provider
                                                                   │
                                                              LLM Provider
                                                                   │
                                                              TTS Provider
                                                                   │
                                                             audio response
                                                             back to caller

Admin Panel ──► Orchestrator API (config endpoints)

VOCALS supports two telephony paths:

  1. Twilio -- Direct WebSocket Media Streams to the orchestrator (mulaw/8kHz)
  2. Generic SIP -- SIP trunk → Asterisk → AudioSocket → Bridge → WebSocket to orchestrator (alaw or ulaw, auto-detected)

Config Panel (Frontend)

A React + TypeScript single-page application served by Nginx. It provides the dashboard for managing agents, providers, SIP configurations, call logs, analytics, webhooks, and API keys, communicates with the orchestrator API over HTTPS, and authenticates users via Supabase Auth (email/password, OAuth).

Orchestrator API (Backend)

A Python FastAPI application that handles two distinct concerns:

  1. REST API -- CRUD operations for agents, providers, calls, webhooks, analytics, and API keys. Mounted at /api/v1/.
  2. WebSocket handler -- Receives real-time audio over WebSocket, from Twilio Media Streams or from the Asterisk bridge. Each active call maintains a WebSocket connection that drives the STT -> LLM -> TTS pipeline.

The backend runs on Uvicorn with uvloop for high-performance async I/O.

Telephony Layer

VOCALS supports two telephony transports. Both feed into the same voice pipeline.

Twilio: Twilio SIP handles the PSTN connection. When a call arrives (or is initiated), Twilio opens a WebSocket Media Stream to the orchestrator backend, streaming audio in mulaw/8kHz format. The backend converts this to PCM and feeds it through the pipeline.

Generic SIP (Asterisk): For non-Twilio providers (Netelip, Zadarma, etc.), VOCALS runs Asterisk as a SIP gateway. Asterisk registers with the SIP provider, handles SIP signaling, and bridges audio via the AudioSocket protocol to a lightweight Python bridge service. The bridge converts audio (alaw/ulaw/slin) and forwards it over WebSocket to the backend, where it enters the same pipeline as Twilio calls.

Key components:

  • Asterisk -- SIP gateway. Handles trunk registration, dialplan routing, RTP media, and answering machine detection (AMD) for outbound calls.
  • Bridge -- Python service that translates between Asterisk's AudioSocket protocol (binary TCP) and the backend's WebSocket protocol. Auto-detects codec from frame size (160 bytes = alaw/ulaw, 320 bytes = slin). Runs on ports 9092 (audio) and 9093 (health check).
  • Shared volume -- Backend generates PJSIP trunk configuration and writes it to a shared Docker volume. Asterisk reads the config via symlink. Live reload is triggered via ARI when SIP configs are created, updated, or deleted in the dashboard.
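The bridge's frame-size codec detection described above can be sketched as a small function. This is an illustrative reconstruction, not the bridge's actual code; the function name and return values are assumptions. The arithmetic follows from 8 kHz audio delivered in 20 ms frames (160 samples): G.711 packs one byte per sample, while signed linear 16-bit packs two.

```python
def detect_codec(first_frame: bytes) -> str:
    """Guess the codec from the size of the first AudioSocket audio frame.

    A 20 ms frame at 8 kHz holds 160 samples: G.711 (alaw/ulaw) encodes
    1 byte per sample (160 bytes), signed linear 16-bit (slin) encodes
    2 bytes per sample (320 bytes).
    """
    if len(first_frame) == 160:
        return "g711"  # alaw or ulaw; which one depends on the trunk
    if len(first_frame) == 320:
        return "slin"
    raise ValueError(f"unexpected frame size: {len(first_frame)} bytes")
```

Note that a 160-byte frame alone cannot distinguish alaw from ulaw; that choice ultimately comes from the trunk's negotiated codec.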

Voice Pipeline

Each call flows through a real-time pipeline with four stages:

Twilio Media Stream              Asterisk AudioSocket
   (mulaw/8kHz)                 (alaw/ulaw/slin 8kHz)
        │                                 │
        ▼                                 ▼
 Audio Conversion                  Bridge Service
(mulaw -> PCM 16kHz)         (codec detect + WebSocket)
        │                                 │
        └────────────────┬────────────────┘
                         ▼
        STT Provider (streaming transcription)
                         │
                         ▼
        LLM Provider (streaming text generation)
                         │
                         ▼
        TTS Provider (streaming speech synthesis)
                         │
                         ▼
         Audio Conversion (PCM -> caller codec)
                         │
                         ▼
                  Back to caller

Audio Conversion

Audio arrives in different formats depending on the telephony path:

  • Twilio: mulaw-encoded 8kHz samples
  • Asterisk: alaw (G.711a, common in Europe), ulaw (G.711u, common in North America), or signed linear 16-bit 8kHz. The bridge auto-detects the codec from the first AudioSocket frame size.

Most STT and TTS providers expect PCM 16-bit 16kHz mono. The orchestrator performs real-time resampling in both directions:

  • Inbound: source codec/8kHz -> PCM 16kHz 16-bit mono (for STT)
  • Outbound: PCM 16kHz 16-bit mono -> source codec/8kHz (back to caller)
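To make the inbound conversion concrete, here is a minimal, dependency-free sketch of the mulaw path: standard G.711 mu-law decoding followed by a naive 2x upsample via linear interpolation. The real orchestrator presumably uses a proper resampler; this only illustrates the transformation's shape (160 mu-law bytes in, 320 PCM samples / 640 bytes out).

```python
import struct

def ulaw_to_pcm16(byte: int) -> int:
    """Decode one G.711 mu-law byte to a signed 16-bit PCM sample."""
    byte = ~byte & 0xFF
    sign = byte & 0x80
    exponent = (byte >> 4) & 0x07
    mantissa = byte & 0x0F
    value = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -value if sign else value

def ulaw8k_to_pcm16k(ulaw: bytes) -> bytes:
    """Decode mu-law/8kHz and upsample to 16 kHz by linear interpolation
    (a naive stand-in for a real resampler)."""
    pcm8k = [ulaw_to_pcm16(b) for b in ulaw]
    out = []
    for i, s in enumerate(pcm8k):
        nxt = pcm8k[i + 1] if i + 1 < len(pcm8k) else s
        out.append(s)            # original 8 kHz sample
        out.append((s + nxt) // 2)  # interpolated midpoint -> 16 kHz
    return struct.pack(f"<{len(out)}h", *out)
```

The outbound direction mirrors this: downsample PCM 16 kHz to 8 kHz, then encode to the caller's codec.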

Speech-to-Text (STT)

The STT provider receives a streaming async iterator of PCM audio chunks and yields transcribed text fragments. Both partial (interim) and final results are emitted. The orchestrator uses final transcription results to trigger LLM generation.
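The async-iterator contract can be illustrated with a toy provider. Everything here is hypothetical scaffolding (the class and function names are not from the codebase); it only shows the shape of the interface: audio chunks in, text fragments out.

```python
import asyncio
from typing import AsyncIterator

class EchoSTT:
    """Toy stand-in for a streaming STT provider: emits one transcript
    per audio chunk (a real provider also emits interim results)."""
    async def transcribe(self, audio_stream: AsyncIterator[bytes]) -> AsyncIterator[str]:
        i = 0
        async for _chunk in audio_stream:
            i += 1
            yield f"utterance-{i}"

async def audio_source() -> AsyncIterator[bytes]:
    for _ in range(3):
        yield b"\x00" * 320  # one 20 ms PCM frame

async def main() -> list[str]:
    finals = []
    async for text in EchoSTT().transcribe(audio_source()):
        finals.append(text)  # each final result would trigger LLM generation
    return finals
```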

Large Language Model (LLM)

The LLM provider receives the conversation history (system prompt + all previous turns) and generates a streaming text response. The orchestrator detects sentence boundaries in the stream and sends each complete sentence to TTS immediately, rather than waiting for the full response. This reduces perceived latency.
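The sentence-boundary splitting can be sketched as a generator that buffers streamed fragments and emits each sentence as soon as its terminator appears. This is an illustrative sketch, not the orchestrator's actual implementation; real boundary detection likely handles abbreviations, numbers, and ellipses more carefully.

```python
import re

_SENTENCE_END = re.compile(r"([.!?])\s")

def iter_sentences(token_stream):
    """Accumulate streamed text fragments and yield each complete
    sentence immediately, so TTS can start before the LLM finishes."""
    buf = ""
    for fragment in token_stream:
        buf += fragment
        while (m := _SENTENCE_END.search(buf)):
            yield buf[: m.end(1)].strip()
            buf = buf[m.end():]
    if buf.strip():
        yield buf.strip()  # flush the trailing partial sentence
```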

The system prompt is augmented with call control instructions. The LLM can output special markers:

  • [END_CALL] -- Signals the orchestrator to hang up the call
  • [VOICEMAIL] -- Signals that an answering machine was detected (for outbound calls)

Text-to-Speech (TTS)

The TTS provider receives text and yields streaming PCM audio chunks. These are converted back to the caller's codec (mulaw for Twilio; alaw/ulaw/slin for Asterisk) and streamed back in real time.

Barge-In

VOCALS supports caller interruption (barge-in). A Voice Activity Detector (VAD) monitors the incoming audio. When speech is detected while the agent is talking, the orchestrator:

  1. Stops the current TTS playback
  2. Cancels any pending TTS chunks
  3. Resumes STT processing for the caller's speech
  4. Routes the new transcript through the LLM

Barge-in sensitivity is configurable per agent (very_low, low, medium, high, very_high). The welcome message can optionally be made non-interruptible.
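The interruption steps above map naturally onto asyncio task cancellation. The sketch below is an assumed implementation shape, not the orchestrator's code: playback runs as a task, and a VAD trigger cancels it, dropping the pending TTS chunks with it.

```python
import asyncio

async def speak(tts_chunks, send):
    """Playback loop: pace one chunk per 20 ms frame interval."""
    for chunk in tts_chunks:
        await send(chunk)
        await asyncio.sleep(0.02)

async def handle_barge_in(playback_task: asyncio.Task) -> None:
    """On VAD speech detection while the agent is talking, cancel
    playback so the caller's speech can flow through STT to the LLM."""
    playback_task.cancel()
    try:
        await playback_task
    except asyncio.CancelledError:
        pass  # pending TTS chunks are discarded along with the task

async def demo() -> bool:
    sent = []
    async def send(c):
        sent.append(c)
    task = asyncio.create_task(speak([b"a"] * 100, send))
    await asyncio.sleep(0.05)      # caller starts talking mid-playback
    await handle_barge_in(task)
    return len(sent) < 100          # playback stopped well before the end
```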

Session Management

Active Sessions (Redis)

While a call is in progress, its state is managed in Redis:

  • Conversation history -- The running list of user/assistant message turns
  • Call metadata -- Call SID, agent ID, provider references, timing data
  • Call queue -- Outbound calls waiting to be initiated (processed at 1 call per second)
  • Rate limit counters -- Per-API-key request counts (RPM and RPD)
  • Config cache -- Twilio credentials and agent settings cached for 60 seconds to avoid DB hits on concurrent requests
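The 60-second config cache behaves like a TTL key-value store. The in-memory sketch below mirrors that behavior for illustration (the real system uses Redis with an equivalent TTL); the class name and injectable clock are assumptions made for testability.

```python
import time

class ConfigCache:
    """In-memory sketch of a TTL cache, standing in for Redis SETEX/GET."""
    def __init__(self, ttl: float = 60.0, clock=time.monotonic):
        self.ttl = ttl
        self.clock = clock
        self._store = {}

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires = entry
        if self.clock() >= expires:
            del self._store[key]  # expired: fall through to the database
            return None
        return value
```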

Persistence (PostgreSQL)

After a call ends, the orchestrator writes the final call record to PostgreSQL (via Supabase):

  • Call logs -- Duration, status, outcome, transcript, provider usage, latency metrics, recording path, error reason
  • STT events -- Connection lifecycle events (opened, closed, rejected, timeout, error)
  • Audit logs -- Timestamped records of every configuration change (agent, provider, webhook, API key operations)

All data is scoped to a tenant. Multi-tenancy is enforced at the query level with tenant_id filters on every database operation.
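Query-level tenant scoping can be illustrated with a helper that makes the tenant_id filter impossible to omit. This is a hypothetical sketch (the function and parameter style are assumptions, shown with psycopg-style `%s` placeholders), not the actual data-access layer.

```python
def scoped_query(table: str, tenant_id: str, filters=None):
    """Build a parameterized SELECT that always includes the tenant_id
    filter, so one tenant can never read another tenant's rows."""
    clauses = ["tenant_id = %s"]
    params = [tenant_id]
    for col, val in (filters or {}).items():
        clauses.append(f"{col} = %s")
        params.append(val)
    sql = f"SELECT * FROM {table} WHERE " + " AND ".join(clauses)
    return sql, params
```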

Provider Abstraction

Each pipeline stage (STT, LLM, TTS) is defined by an abstract base class with swappable implementations:

class STTProvider(BaseProvider):
    async def transcribe(self, audio_stream) -> AsyncIterator[str]: ...

class LLMProvider(BaseProvider):
    async def generate(self, messages, system_prompt) -> AsyncIterator[str]: ...

class TTSProvider(BaseProvider):
    async def synthesize(self, text) -> AsyncIterator[bytes]: ...

Providers are registered in a central registry and instantiated at runtime based on the agent's configuration. Each provider needs:

  • An API key (encrypted at rest)
  • A model identifier
  • Optional provider-specific configuration

This pattern allows you to:

  • Switch providers per agent without code changes
  • A/B test different provider configurations
  • Add new providers by implementing the base interface
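A registry of this kind is commonly built with a class decorator keyed on (type, name). The sketch below is an assumed shape, not the actual registry code; the provider class shown is a placeholder.

```python
PROVIDER_REGISTRY = {}

def register(kind: str, name: str):
    """Class decorator that records a provider under (kind, name)."""
    def wrap(cls):
        PROVIDER_REGISTRY[(kind, name)] = cls
        return cls
    return wrap

@register("stt", "deepgram")
class DeepgramSTT:  # placeholder implementation for illustration
    def __init__(self, api_key: str, model: str, **options):
        self.api_key = api_key
        self.model = model
        self.options = options

def build_provider(kind: str, name: str, config: dict):
    """Instantiate the registered provider from the agent's configuration."""
    cls = PROVIDER_REGISTRY[(kind, name)]
    return cls(**config)
```

Swapping providers per agent then reduces to changing the (kind, name, config) triple stored with the agent.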

Currently Registered Providers

  Type   Providers
  STT    Deepgram, OpenAI, Whisper, ElevenLabs, Qwen, Fish Audio
  LLM    OpenAI, Claude, Google Gemini, Kimi
  TTS    Deepgram, OpenAI, ElevenLabs, Qwen, Resemble, Fish Audio

Error Handling

The orchestrator includes structured failure logging and retry logic:

  • Provider failures are logged with tenant ID, call ID, provider type/name, error message, and whether a retry was attempted.
  • Graceful degradation -- If a provider fails mid-call, the system attempts to continue with an error message to the caller rather than dropping the call silently.
  • Voicemail detection -- For outbound calls, multiple detection layers are used:
    • Asterisk AMD (generic SIP only) -- Analyzes the first few seconds of audio before the call reaches the pipeline. If a machine is detected, Asterisk hangs up immediately.
    • LLM-based detection -- The LLM can output a [VOICEMAIL] marker if it detects voicemail patterns during conversation.
    • Regex pattern matching -- Transcripts are checked against common voicemail greeting patterns.
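The regex layer can be sketched as a small pattern list checked against each transcript. The patterns below are illustrative examples only, not the system's actual configuration.

```python
import re

# Illustrative greeting patterns; the real list is configuration-specific.
VOICEMAIL_PATTERNS = [
    re.compile(r"\bleave\s+(a\s+)?message\b", re.I),
    re.compile(r"\bafter\s+the\s+(tone|beep)\b", re.I),
    re.compile(r"\b(is\s+not|isn't)\s+available\b", re.I),
]

def looks_like_voicemail(transcript: str) -> bool:
    """Return True if any known voicemail-greeting pattern matches."""
    return any(p.search(transcript) for p in VOICEMAIL_PATTERNS)
```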