Core Concepts

Understanding the key building blocks of VOCALS.

Tenants

A tenant is an isolated workspace in VOCALS. Each tenant has its own:

Providers (API keys and configurations)
Agents (AI phone agents)
Phone numbers
Call logs and analytics
API keys and webhooks

Users belong to a tenant via membership. VOCALS supports multiple tenants, each with isolated data and configuration.

Agents

An agent is an AI-powered phone assistant configured with:

System prompt — Instructions defining the agent's personality and behavior
Providers — Which STT, LLM, and TTS services to use
Welcome message — Optional greeting played at the start of each call
Language — The language the agent speaks
Barge-in sensitivity — How sensitive the agent is to the caller interrupting (very_low/low/medium/high/very_high)
Max call duration — Automatic hang-up after a set time

Each phone number is assigned to one agent.

Providers

VOCALS uses a provider abstraction layer — each stage of the voice pipeline (STT, LLM, TTS) can use any supported provider. This allows you to:

Mix providers (e.g., Deepgram for STT + Anthropic for LLM + ElevenLabs for TTS)
Switch providers without code changes
A/B test different provider combinations

Supported providers

Type	Providers
STT	Deepgram, OpenAI Whisper, Alibaba Qwen, Fish Audio
LLM	OpenAI, Anthropic Claude, Google Gemini, Moonshot Kimi
TTS	ElevenLabs, OpenAI TTS, Resemble AI, Fish Audio

The Voice Pipeline

Every call flows through a three-stage pipeline:

STT (Speech-to-Text) — Caller audio is streamed via WebSocket to the STT provider, which returns text transcripts in real-time
LLM (Language Model) — The transcript and conversation history are sent to the LLM, which generates a response as a stream of tokens
TTS (Text-to-Speech) — Response text is synthesized into audio and streamed back to the caller

The pipeline processes each stage in a streaming fashion for minimal latency. Sentence boundaries are detected during LLM streaming so that TTS can start before the full response is generated.

Barge-in

Barge-in allows callers to interrupt the agent mid-sentence. When voice activity is detected during TTS playback, the agent stops speaking and listens to the caller. This is controlled by the agent's barge-in sensitivity setting.

Sessions

Each active call has a session stored in Redis containing:

Call metadata (SID, direction, phone numbers)
Conversation history (user and assistant messages)
Provider usage and timing metrics

When a call ends, the session is persisted to PostgreSQL as a call log for long-term storage and analytics.

Webhooks

VOCALS can notify your application of events via webhooks:

call.started — A call has connected
call.ended — A call has completed
call.failed — A call encountered an error
transcript.ready — A conversation transcript is available

See Webhooks for configuration details.

Knowledge Base

Agents can be augmented with a knowledge base — a collection of text entries that provide reference information the agent can draw on during conversations. This is useful for FAQs, product details, pricing, or any domain-specific content the agent should know without embedding it all in the system prompt.

Entries can be added from raw text, a web URL, or an uploaded document, and are managed programmatically via the Knowledge Base API.

Teams

VOCALS supports team collaboration within a tenant. Tenant owners can invite team members via email, and each member gets access to the shared workspace. Team size limits depend on your subscription tier. See Billing for details on tier limits.

Tenants​

Agents​

Providers​

Supported providers​

The Voice Pipeline​

Barge-in​

Sessions​

Webhooks​

Knowledge Base​

Teams​