Core Concepts

Understanding the key building blocks of VOCALS.

Tenants

A tenant is an isolated workspace in VOCALS. Each tenant has its own:

  • Providers (API keys and configurations)
  • Agents (AI phone agents)
  • Phone numbers
  • Call logs and analytics
  • API keys and webhooks

Users belong to a tenant via membership. A single VOCALS deployment can host multiple tenants, and each tenant's data and configuration are isolated from the others.

Agents

An agent is an AI-powered phone assistant configured with:

  • System prompt — Instructions defining the agent's personality and behavior
  • Providers — Which STT, LLM, and TTS services to use
  • Welcome message — Optional greeting played at the start of each call
  • Language — The language the agent speaks
  • Barge-in sensitivity — How sensitive the agent is to the caller interrupting (low/medium/high)
  • Max call duration — Automatic hang-up after a set time

Each phone number is assigned to one agent.
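
As an illustration of the settings listed above, the following is a minimal sketch of an agent configuration with basic validation. The class name, field names, and default values are assumptions made for this example; they are not taken from the VOCALS API.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical names and defaults throughout; the real VOCALS schema may differ.
BARGE_IN_LEVELS = ("low", "medium", "high")


@dataclass
class AgentConfig:
    system_prompt: str                     # personality and behavior instructions
    stt_provider: str                      # which STT service to use
    llm_provider: str                      # which LLM service to use
    tts_provider: str                      # which TTS service to use
    language: str = "en"                   # language the agent speaks
    welcome_message: Optional[str] = None  # optional greeting
    barge_in_sensitivity: str = "medium"   # low / medium / high
    max_call_duration_s: int = 600         # hang up after this many seconds

    def __post_init__(self) -> None:
        if self.barge_in_sensitivity not in BARGE_IN_LEVELS:
            raise ValueError(f"barge_in_sensitivity must be one of {BARGE_IN_LEVELS}")
        if self.max_call_duration_s <= 0:
            raise ValueError("max_call_duration_s must be positive")


# Each phone number is assigned to one agent, so a plain mapping models it.
number_to_agent: Dict[str, AgentConfig] = {}

support = AgentConfig(
    system_prompt="You are a friendly support agent.",
    stt_provider="deepgram",
    llm_provider="anthropic",
    tts_provider="elevenlabs",
    welcome_message="Hi, how can I help?",
)
number_to_agent["+15550100"] = support
```

Several numbers may map to the same AgentConfig, but a number cannot map to more than one agent, which is why the mapping is keyed by phone number.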

Providers

VOCALS uses a provider abstraction layer — each stage of the voice pipeline (STT, LLM, TTS) can use any supported provider. This allows you to:

  • Mix providers (e.g., Deepgram for STT + Anthropic for LLM + ElevenLabs for TTS)
  • Switch providers without code changes
  • A/B test different provider combinations
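
One way to picture a provider abstraction layer is a registry keyed by pipeline stage and provider name, so that switching providers is a configuration change rather than a code change. The sketch below is only an illustration of that idea: the registry, the decorator, and the toy providers are all hypothetical stand-ins, not VOCALS internals or real provider SDK calls.

```python
from typing import Callable, Dict

# Hypothetical registry: stage -> provider name -> implementation.
REGISTRY: Dict[str, Dict[str, Callable]] = {"stt": {}, "llm": {}, "tts": {}}


def register(stage: str, name: str) -> Callable[[Callable], Callable]:
    """Decorator that files a function under a stage and provider name."""
    def decorator(fn: Callable) -> Callable:
        REGISTRY[stage][name] = fn
        return fn
    return decorator


# Toy providers standing in for real services.
@register("stt", "echo-stt")
def echo_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


@register("llm", "polite-llm")
def polite_llm(transcript: str) -> str:
    return f"You said: {transcript}"


@register("llm", "terse-llm")
def terse_llm(transcript: str) -> str:
    return transcript.upper()


@register("tts", "bytes-tts")
def bytes_tts(text: str) -> bytes:
    return text.encode("utf-8")


def run_turn(audio: bytes, providers: Dict[str, str]) -> bytes:
    """Look up each stage's provider by name and run one conversational turn."""
    stt = REGISTRY["stt"][providers["stt"]]
    llm = REGISTRY["llm"][providers["llm"]]
    tts = REGISTRY["tts"][providers["tts"]]
    return tts(llm(stt(audio)))


config_a = {"stt": "echo-stt", "llm": "polite-llm", "tts": "bytes-tts"}
config_b = {"stt": "echo-stt", "llm": "terse-llm", "tts": "bytes-tts"}
```

Here config_a and config_b differ only in the LLM entry, which is also the shape an A/B test of provider combinations would take.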

Supported providers

Type    Providers
STT     Deepgram, OpenAI Whisper, Alibaba Qwen
LLM     OpenAI, Anthropic Claude, Google Gemini, Moonshot Kimi
TTS     ElevenLabs, OpenAI TTS, Resemble AI

The Voice Pipeline

Every call flows through a three-stage pipeline:

  1. STT (Speech-to-Text) — Caller audio is streamed via WebSocket to the STT provider, which returns text transcripts in real time
  2. LLM (Language Model) — The transcript and conversation history are sent to the LLM, which generates a response as a stream of tokens
  3. TTS (Text-to-Speech) — Response text is synthesized into audio and streamed back to the caller

All three stages stream their output, which keeps latency low. Sentence boundaries are detected while the LLM is still streaming tokens, so TTS can begin synthesizing the first sentence before the full response has been generated.
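
The sentence-boundary step can be sketched as a generator that buffers streamed tokens and yields each sentence as soon as it is complete. This is a simplified illustration: the function name is hypothetical, and the boundary rule (terminal punctuation followed by whitespace) is an assumption that a production detector would need to refine for abbreviations, decimals, and similar cases.

```python
import re
from typing import Iterable, Iterator

# Naive boundary rule (an assumption): ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A single token may complete more than one sentence, so keep scanning.
        while True:
            match = _SENTENCE_END.search(buffer)
            if match is None:
                break
            sentence = buffer[:match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


llm_tokens = ["Hel", "lo there", ". How ", "can I", " help?", " Sure"]
for sentence in sentences_from_tokens(llm_tokens):
    print(sentence)  # each sentence could be handed to TTS at this point
```

Because the generator yields before the token stream is exhausted, the first sentence is available for synthesis while later tokens are still arriving.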

Barge-in

Barge-in allows callers to interrupt the agent mid-sentence. When voice activity is detected during TTS playback, the agent stops speaking and listens to the caller. The agent's barge-in sensitivity setting controls how readily this interruption triggers.
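
The behavior described above can be modeled, under assumptions, as a counter of consecutive voiced audio frames observed while the agent is speaking. How VOCALS actually maps low/medium/high to a detection rule is not stated here; the frame counts and the class name below are hypothetical and chosen only to make the idea concrete.

```python
from dataclasses import dataclass

# Hypothetical mapping: consecutive voiced frames needed before interrupting.
# Higher sensitivity means fewer frames are required.
FRAMES_REQUIRED = {"low": 10, "medium": 5, "high": 2}


@dataclass
class BargeInDetector:
    sensitivity: str = "medium"
    voiced_run: int = 0  # consecutive voiced frames seen during playback

    def should_interrupt(self, agent_speaking: bool, frame_is_voiced: bool) -> bool:
        """Return True when TTS playback should stop for the caller."""
        if not agent_speaking or not frame_is_voiced:
            # Silence, or no playback in progress, resets the run.
            self.voiced_run = 0
            return False
        self.voiced_run += 1
        return self.voiced_run >= FRAMES_REQUIRED[self.sensitivity]


detector = BargeInDetector(sensitivity="high")
first = detector.should_interrupt(agent_speaking=True, frame_is_voiced=True)
second = detector.should_interrupt(agent_speaking=True, frame_is_voiced=True)
```

Requiring a run of voiced frames rather than a single one is a common way to avoid stopping playback on a cough or line noise.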

Sessions

Each active call has a session stored in Redis containing:

  • Call metadata (SID, direction, phone numbers)
  • Conversation history (user and assistant messages)
  • Provider usage and timing metrics

When a call ends, the session is persisted to PostgreSQL as a call log for long-term storage and analytics.
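
The session lifecycle can be sketched as follows. To keep the example self-contained, a dictionary stands in for Redis and a list stands in for the PostgreSQL call-log table; the class, field names, and end_call helper are assumptions for illustration and do not reflect the actual VOCALS storage schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class CallSession:
    # Call metadata (hypothetical field names)
    call_sid: str
    direction: str
    from_number: str
    to_number: str
    # Conversation history and usage/timing metrics
    history: List[Dict[str, str]] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)

    def add_message(self, role: str, content: str) -> None:
        if role not in ("user", "assistant"):
            raise ValueError("role must be 'user' or 'assistant'")
        self.history.append({"role": role, "content": content})

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "CallSession":
        return cls(**json.loads(raw))


live_sessions: Dict[str, str] = {}  # stand-in for Redis (key -> JSON string)
call_logs: List[dict] = []          # stand-in for a PostgreSQL call-log table


def end_call(call_sid: str) -> dict:
    """Remove the live session and persist it as a call-log record."""
    record = json.loads(live_sessions.pop(call_sid))
    call_logs.append(record)
    return record


session = CallSession("CA123", "inbound", "+15550001", "+15550002")
session.add_message("user", "Hi")
session.metrics["stt_ms"] = 120.0
live_sessions[session.call_sid] = session.to_json()
```

Calling end_call("CA123") moves the record from the live store to the log, mirroring the hand-off from short-lived session state to long-term storage.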

Webhooks

VOCALS can notify your application of events via webhooks:

  • call.started — A call has connected
  • call.ended — A call has completed
  • call.failed — A call encountered an error
  • transcript.ready — A conversation transcript is available

See Webhooks for configuration details.
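
On the receiving side, a handler typically parses the request body and dispatches on the event type. The sketch below assumes the payload is JSON with an "event" field naming one of the four events listed above; that payload shape, the "call_sid" field, and the function name are assumptions for illustration, so consult the webhook configuration documentation for the actual format.

```python
import json
from typing import Callable, Dict

KNOWN_EVENTS = {"call.started", "call.ended", "call.failed", "transcript.ready"}


def handle_webhook(body: str, handlers: Dict[str, Callable[[dict], None]]) -> str:
    """Dispatch a webhook body to the handler registered for its event type.

    Assumes (hypothetically) that the body is JSON with an "event" field.
    """
    payload = json.loads(body)
    event = payload.get("event")
    if event not in KNOWN_EVENTS:
        return "ignored"    # unknown or missing event type
    handler = handlers.get(event)
    if handler is None:
        return "unhandled"  # known event, but nothing registered for it
    handler(payload)
    return "handled"


ended_calls = []
handlers = {"call.ended": ended_calls.append}
status = handle_webhook(json.dumps({"event": "call.ended", "call_sid": "CA1"}), handlers)
```

Returning a status for unknown events instead of raising keeps the endpoint tolerant of event types added later.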