Core Concepts
Understanding the key building blocks of VOCALS.
Tenants
A tenant is an isolated workspace in VOCALS. Each tenant has its own:
- Providers (API keys and configurations)
- Agents (AI phone agents)
- Phone numbers
- Call logs and analytics
- API keys and webhooks
Users belong to a tenant via membership. VOCALS supports multiple tenants, each with isolated data and configuration.
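As a rough illustration of this isolation, a tenant can be pictured as a container that owns every resource. The field names below are hypothetical, not the actual VOCALS schema:

```python
from dataclasses import dataclass, field

# Illustrative tenant model; field names are hypothetical, not the
# actual VOCALS schema. Every resource hangs off exactly one tenant.
@dataclass
class Tenant:
    name: str
    providers: dict = field(default_factory=dict)   # API keys and configs
    agents: list = field(default_factory=list)
    phone_numbers: list = field(default_factory=list)
    members: list = field(default_factory=list)     # users join via membership

acme = Tenant(name="acme")
acme.members.append("alice@example.com")
```

Two tenants never share these collections, which is what gives each workspace isolated data and configuration.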
Agents
An agent is an AI-powered phone assistant configured with:
- System prompt — Instructions defining the agent's personality and behavior
- Providers — Which STT, LLM, and TTS services to use
- Welcome message — Optional greeting played at the start of each call
- Language — The language the agent speaks
- Barge-in sensitivity — How sensitive the agent is to the caller interrupting (very_low/low/medium/high/very_high)
- Max call duration — Automatic hang-up after a set time
Each phone number is assigned to one agent.
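Putting the settings above together, an agent's configuration might look something like the sketch below. The field names are illustrative, not the actual VOCALS API schema:

```python
# Hypothetical agent configuration; field names are illustrative,
# not the actual VOCALS API schema.
agent_config = {
    "system_prompt": "You are a friendly booking assistant for Acme Dental.",
    "providers": {"stt": "deepgram", "llm": "anthropic", "tts": "elevenlabs"},
    "welcome_message": "Hi! Thanks for calling Acme Dental.",
    "language": "en",
    "barge_in_sensitivity": "medium",   # very_low / low / medium / high / very_high
    "max_call_duration_seconds": 600,   # automatic hang-up after 10 minutes
}
```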
Providers
VOCALS uses a provider abstraction layer — each stage of the voice pipeline (STT, LLM, TTS) can use any supported provider. This allows you to:
- Mix providers (e.g., Deepgram for STT + Anthropic for LLM + ElevenLabs for TTS)
- Switch providers without code changes
- A/B test different provider combinations
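One way to picture the abstraction layer is as a common interface that every provider implements, so swapping providers is a configuration change rather than a code change. This sketch uses hypothetical names and a structural `Protocol`; it is not the actual VOCALS implementation:

```python
from typing import Protocol

class TTSProvider(Protocol):
    """Common interface every TTS provider adapter implements."""
    def synthesize(self, text: str) -> bytes: ...

# Hypothetical stand-in adapter; a real one would call the provider's API.
class FakeElevenLabs:
    def synthesize(self, text: str) -> bytes:
        return b"audio:" + text.encode()

def speak(tts: TTSProvider, text: str) -> bytes:
    # The pipeline only depends on the interface, never on a concrete provider.
    return tts.synthesize(text)
```

Any object satisfying the protocol can be slotted into the pipeline, which is what makes mixing and A/B testing provider combinations straightforward.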
Supported providers
| Type | Providers |
|---|---|
| STT | Deepgram, OpenAI Whisper, Alibaba Qwen, Fish Audio |
| LLM | OpenAI, Anthropic Claude, Google Gemini, Moonshot Kimi |
| TTS | ElevenLabs, OpenAI TTS, Resemble AI, Fish Audio |
The Voice Pipeline
Every call flows through a three-stage pipeline:
1. STT (Speech-to-Text) — Caller audio is streamed via WebSocket to the STT provider, which returns text transcripts in real time
2. LLM (Language Model) — The transcript and conversation history are sent to the LLM, which generates a response as a stream of tokens
3. TTS (Text-to-Speech) — Response text is synthesized into audio and streamed back to the caller
The pipeline processes each stage in a streaming fashion for minimal latency. Sentence boundaries are detected during LLM streaming so that TTS can start before the full response is generated.
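The sentence-boundary detection described above can be sketched as a small generator over the LLM token stream. This is an illustrative implementation, not the actual VOCALS code:

```python
import re

# A sentence ends at ., !, or ? followed by whitespace (a simplification).
SENTENCE_END = re.compile(r"([.!?])\s")

def stream_sentences(tokens):
    """Yield complete sentences as soon as a boundary appears in the
    token stream, so TTS can start before the LLM finishes responding."""
    buffer = ""
    for token in tokens:
        buffer += token
        while (m := SENTENCE_END.search(buffer)):
            yield buffer[: m.end(1)].strip()   # emit the finished sentence
            buffer = buffer[m.end():]          # keep the remainder
    if buffer.strip():
        yield buffer.strip()                   # flush any trailing text
```

Each yielded sentence can be handed to TTS immediately while later tokens are still arriving, which is where most of the latency savings come from.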
Barge-in
Barge-in allows callers to interrupt the agent mid-sentence. When voice activity is detected during TTS playback, the agent stops speaking and listens to the caller. This is controlled by the agent's barge-in sensitivity setting.
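A minimal sketch of how the sensitivity setting might gate interruption, assuming a voice-activity score in the range 0 to 1 (the threshold values and function names here are hypothetical):

```python
# Hypothetical mapping from barge-in sensitivity to a voice-activity
# threshold: higher sensitivity means a lower threshold, so the caller
# can interrupt more easily. Values are illustrative.
VAD_THRESHOLDS = {
    "very_low": 0.9,
    "low": 0.75,
    "medium": 0.5,
    "high": 0.3,
    "very_high": 0.15,
}

def should_barge_in(voice_energy: float, sensitivity: str) -> bool:
    """Return True when TTS playback should stop and the agent should listen."""
    return voice_energy > VAD_THRESHOLDS[sensitivity]
```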
Sessions
Each active call has a session stored in Redis containing:
- Call metadata (SID, direction, phone numbers)
- Conversation history (user and assistant messages)
- Provider usage and timing metrics
When a call ends, the session is persisted to PostgreSQL as a call log for long-term storage and analytics.
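The session lifecycle can be sketched as follows, with in-memory dictionaries standing in for Redis and PostgreSQL. Key formats and field names are illustrative, not the actual VOCALS schema:

```python
import json

# In-memory stand-ins for Redis (active sessions) and PostgreSQL (call logs).
redis_store: dict[str, str] = {}
call_logs: list[dict] = []

def start_session(call_sid: str, direction: str, from_num: str, to_num: str):
    """Create the session record in Redis when the call connects."""
    session = {
        "sid": call_sid, "direction": direction,
        "from": from_num, "to": to_num,
        "history": [],   # user and assistant messages
        "metrics": {},   # provider usage and timing
    }
    redis_store[f"session:{call_sid}"] = json.dumps(session)

def end_session(call_sid: str):
    """On hang-up, move the session out of Redis into long-term storage."""
    raw = redis_store.pop(f"session:{call_sid}")
    call_logs.append(json.loads(raw))
```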
Webhooks
VOCALS can notify your application of events via webhooks:
- call.started — A call has connected
- call.ended — A call has completed
- call.failed — A call encountered an error
- transcript.ready — A conversation transcript is available
See Webhooks for configuration details.
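On the receiving side, your application might dispatch on the event name like this. The payload field `call_sid` is a hypothetical example; see Webhooks for the actual payload shape:

```python
# Minimal webhook dispatcher sketch. Event names match the list above;
# the payload field "call_sid" is hypothetical, not the documented schema.
def handle_webhook(event: dict) -> str:
    handlers = {
        "call.started": lambda e: f"call {e['call_sid']} connected",
        "call.ended": lambda e: f"call {e['call_sid']} completed",
        "call.failed": lambda e: f"call {e['call_sid']} errored",
        "transcript.ready": lambda e: f"transcript for {e['call_sid']} ready",
    }
    return handlers[event["type"]](event)
```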
Knowledge Base
Agents can be augmented with a knowledge base — a collection of text entries that provide reference information the agent can draw on during conversations. This is useful for FAQs, product details, pricing, or any domain-specific content the agent should know without embedding it all in the system prompt.
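As a toy illustration of the idea (the actual retrieval mechanism VOCALS uses is not specified here), a naive keyword lookup over knowledge-base entries might look like:

```python
# Toy knowledge base: a list of text entries. A naive keyword match is
# used for illustration; the real retrieval mechanism is not specified.
kb_entries = [
    "Pricing: the starter plan is $29/month.",
    "Hours: we are open Monday to Friday, 9am-5pm.",
]

def lookup(query: str) -> list[str]:
    """Return entries containing any term from the caller's query."""
    terms = query.lower().split()
    return [e for e in kb_entries if any(t in e.lower() for t in terms)]
```

Matched entries would then be injected into the conversation context, keeping the system prompt itself short.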
Teams
VOCALS supports team collaboration within a tenant. Tenant owners can invite team members via email, and each member gets access to the shared workspace. Team size limits depend on your subscription tier. See Billing for details on tier limits.