Core Concepts
Understanding the key building blocks of VOCALS.
Tenants
A tenant is an isolated workspace in VOCALS. Each tenant has its own:
- Providers (API keys and configurations)
- Agents (AI phone agents)
- Phone numbers
- Call logs and analytics
- API keys and webhooks
Users belong to a tenant via membership. VOCALS supports multiple tenants, each with isolated data and configuration.
Agents
An agent is an AI-powered phone assistant configured with:
- System prompt — Instructions defining the agent's personality and behavior
- Providers — Which STT, LLM, and TTS services to use
- Welcome message — Optional greeting played at the start of each call
- Language — The language the agent speaks
- Barge-in sensitivity — How sensitive the agent is to the caller interrupting (low/medium/high)
- Max call duration — Automatic hang-up after a set time
Each phone number is assigned to one agent.
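As a rough illustration, the settings above could be modeled as a single configuration object. This is a minimal sketch, not the VOCALS schema: the class name, field names, provider identifier strings, and default values are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Allowed barge-in levels, taken from the list above.
SENSITIVITY_LEVELS = ("low", "medium", "high")


@dataclass
class AgentConfig:
    """Hypothetical agent configuration; field names are illustrative only."""

    system_prompt: str
    stt_provider: str
    llm_provider: str
    tts_provider: str
    language: str = "en"                    # assumed default
    welcome_message: Optional[str] = None   # optional greeting
    barge_in_sensitivity: str = "medium"    # assumed default
    max_call_duration_seconds: int = 600    # assumed default

    def __post_init__(self) -> None:
        # Reject values outside the documented low/medium/high range.
        if self.barge_in_sensitivity not in SENSITIVITY_LEVELS:
            raise ValueError(
                f"barge_in_sensitivity must be one of {SENSITIVITY_LEVELS}"
            )
        if self.max_call_duration_seconds <= 0:
            raise ValueError("max_call_duration_seconds must be positive")


agent = AgentConfig(
    system_prompt="You are a friendly booking assistant.",
    stt_provider="deepgram",
    llm_provider="anthropic",
    tts_provider="elevenlabs",
    welcome_message="Hi, thanks for calling!",
)
```

Validating in `__post_init__` keeps an invalid sensitivity value from reaching the call pipeline in this sketch.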
Providers
VOCALS uses a provider abstraction layer: each stage of the voice pipeline (STT, LLM, TTS) can use any supported provider. This allows you to:
- Mix providers (e.g., Deepgram for STT + Anthropic for LLM + ElevenLabs for TTS)
- Switch providers without code changes
- A/B test different provider combinations
Supported providers
| Type | Providers |
|---|---|
| STT | Deepgram, OpenAI Whisper, Alibaba Qwen |
| LLM | OpenAI, Anthropic Claude, Google Gemini, Moonshot Kimi |
| TTS | ElevenLabs, OpenAI TTS, Resemble AI |
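The idea of a provider abstraction can be sketched with a small registry: each stage is looked up by name, so swapping a provider is a configuration change rather than a code change. The stand-in providers, registry layout, and `build_pipeline` helper below are hypothetical and do not call any real provider API.

```python
from dataclasses import dataclass
from typing import Callable, Dict


# Stand-in providers (hypothetical): real ones would call external services.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm_a(text: str) -> str:
    return f"You said: {text}"


def fake_llm_b(text: str) -> str:
    return text.upper()


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


# One registry per pipeline stage, keyed by a provider name.
REGISTRY: Dict[str, Dict[str, Callable]] = {
    "stt": {"fake-stt": fake_stt},
    "llm": {"fake-llm-a": fake_llm_a, "fake-llm-b": fake_llm_b},
    "tts": {"fake-tts": fake_tts},
}


@dataclass
class Pipeline:
    stt: Callable[[bytes], str]
    llm: Callable[[str], str]
    tts: Callable[[str], bytes]

    def run(self, audio: bytes) -> bytes:
        # Audio in, transcript, response text, audio out.
        return self.tts(self.llm(self.stt(audio)))


def build_pipeline(stt: str, llm: str, tts: str) -> Pipeline:
    """Resolve provider names to implementations; raises KeyError if unknown."""
    return Pipeline(
        stt=REGISTRY["stt"][stt],
        llm=REGISTRY["llm"][llm],
        tts=REGISTRY["tts"][tts],
    )


pipeline_a = build_pipeline("fake-stt", "fake-llm-a", "fake-tts")
pipeline_b = build_pipeline("fake-stt", "fake-llm-b", "fake-tts")
```

Because both pipelines share the same interface, an A/B test in this sketch amounts to routing calls to `pipeline_a` or `pipeline_b`.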
The Voice Pipeline
Every call flows through a three-stage pipeline:
- STT (Speech-to-Text) — Caller audio is streamed via WebSocket to the STT provider, which returns text transcripts in real-time
- LLM (Language Model) — The transcript and conversation history are sent to the LLM, which generates a response as a stream of tokens
- TTS (Text-to-Speech) — Response text is synthesized into audio and streamed back to the caller
Each stage streams its output to the next to keep latency low. Sentence boundaries are detected while the LLM is still streaming tokens, so TTS can begin synthesizing the first sentence before the full response has been generated.
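One way to picture sentence-boundary detection is a buffer that accumulates LLM tokens and emits a sentence as soon as terminal punctuation followed by whitespace appears. This is a simplified sketch under that assumption; the function name and the punctuation rule are illustrative, and the actual VOCALS logic may differ.

```python
import re
from typing import Iterable, Iterator

# A sentence ends at '.', '!' or '?' followed by whitespace. Requiring the
# whitespace avoids splitting inside tokens such as "3.14".
_BOUNDARY = re.compile(r"[.!?]\s")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A single token may complete more than one sentence.
        while True:
            match = _BOUNDARY.search(buffer)
            if match is None:
                break
            sentence = buffer[: match.end()].strip()
            buffer = buffer[match.end():]
            if sentence:
                yield sentence
    # Flush whatever remains once the stream ends.
    tail = buffer.strip()
    if tail:
        yield tail


stream = ["Hel", "lo the", "re. How", " can I", " help? ", "Thanks"]
sentences = list(sentences_from_tokens(stream))
# sentences == ["Hello there.", "How can I help?", "Thanks"]
```

Each yielded sentence could be handed to the TTS stage immediately, which is what lets audio playback begin while later tokens are still arriving.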
Barge-in
Barge-in allows callers to interrupt the agent mid-sentence. When voice activity is detected during TTS playback, the agent stops speaking and listens to the caller. How readily an interruption is triggered depends on the agent's barge-in sensitivity setting (low, medium, or high).
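The sketch below shows one plausible way a sensitivity setting could translate into behavior: a voice-activity probability is compared against a per-level threshold, and an interruption fires only after several consecutive voiced frames during playback. The threshold values, the frame count, and the `BargeInDetector` class are assumptions for illustration, not documented VOCALS internals.

```python
# Hypothetical thresholds: higher sensitivity means a lower bar to interrupt.
THRESHOLDS = {"low": 0.6, "medium": 0.4, "high": 0.2}


class BargeInDetector:
    def __init__(self, sensitivity: str, required_frames: int = 3) -> None:
        self.threshold = THRESHOLDS[sensitivity]
        self.required_frames = required_frames
        self._streak = 0  # consecutive voiced frames seen during playback

    def should_interrupt(self, voice_probability: float, tts_playing: bool) -> bool:
        """Return True once enough consecutive voiced frames arrive during TTS."""
        if not tts_playing:
            # Nothing to interrupt; reset so old frames do not carry over.
            self._streak = 0
            return False
        if voice_probability >= self.threshold:
            self._streak += 1
        else:
            self._streak = 0
        return self._streak >= self.required_frames


frames = [0.3, 0.3, 0.3]  # per-frame voice probabilities from a VAD
high = BargeInDetector("high")
low = BargeInDetector("low")
high_results = [high.should_interrupt(p, tts_playing=True) for p in frames]
low_results = [low.should_interrupt(p, tts_playing=True) for p in frames]
# high_results == [False, False, True]; low_results == [False, False, False]
```

Requiring a short streak of voiced frames is a common way to avoid reacting to a single noisy frame, at the cost of a small delay before the agent stops speaking.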
Sessions
Each active call has a session stored in Redis containing:
- Call metadata (SID, direction, phone numbers)
- Conversation history (user and assistant messages)
- Provider usage and timing metrics
When a call ends, the session is persisted to PostgreSQL as a call log for long-term storage and analytics.
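To make the session lifecycle concrete, here is a minimal sketch of a session object serialized to JSON, the kind of value that could be stored under a Redis key and later written out as a call-log row. The field names and the `session:<SID>` key format are assumptions, and a plain dict stands in for Redis so the example runs without external services.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class CallSession:
    """Hypothetical session shape; field names are illustrative."""

    call_sid: str
    direction: str  # e.g. "inbound" or "outbound"
    from_number: str
    to_number: str
    history: List[Dict[str, str]] = field(default_factory=list)
    metrics: Dict[str, float] = field(default_factory=dict)

    def add_message(self, role: str, content: str) -> None:
        self.history.append({"role": role, "content": content})

    def key(self) -> str:
        # Assumed key format, not a documented VOCALS convention.
        return f"session:{self.call_sid}"

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "CallSession":
        return cls(**json.loads(raw))


fake_redis: Dict[str, str] = {}  # in-memory stand-in for Redis

session = CallSession("CA123", "inbound", "+15550001111", "+15552223333")
session.add_message("user", "I'd like to book a table.")
session.add_message("assistant", "Sure, for how many people?")
session.metrics["llm_latency_ms"] = 420.0
fake_redis[session.key()] = session.to_json()

# When the call ends: load the session, then remove it from the live store.
restored = CallSession.from_json(fake_redis.pop(session.key()))
```

At this point `restored` holds everything a call-log record would need, and the live store no longer contains the finished call.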
Webhooks
VOCALS can notify your application of events via webhooks:
- call.started — A call has connected
- call.ended — A call has completed
- call.failed — A call encountered an error
- transcript.ready — A conversation transcript is available
See Webhooks for configuration details.
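On the receiving side, a webhook consumer typically parses the request body and dispatches on the event name. The sketch below uses the four event names listed above; the payload shape (an `event` field and a `call_sid` field) and the handler return values are assumptions, so check the Webhooks page for the real format.

```python
import json
from typing import Callable, Dict

# Event name -> handler. Event names come from the list above; the handlers
# just return a description so the example stays self-contained.
HANDLERS: Dict[str, Callable[[str], str]] = {
    "call.started": lambda sid: f"mark call {sid} as active",
    "call.ended": lambda sid: f"mark call {sid} as completed",
    "call.failed": lambda sid: f"alert on failed call {sid}",
    "transcript.ready": lambda sid: f"fetch transcript for call {sid}",
}


def handle_webhook(raw_body: str) -> str:
    """Dispatch one webhook delivery; the payload fields are assumed."""
    payload = json.loads(raw_body)
    event = payload.get("event")
    call_sid = payload.get("call_sid", "unknown")
    handler = HANDLERS.get(event)
    if handler is None:
        # Ignore unrecognized events instead of failing the delivery.
        return f"ignored unknown event: {event}"
    return handler(call_sid)


result = handle_webhook('{"event": "call.ended", "call_sid": "CA123"}')
# result == "mark call CA123 as completed"
```

Ignoring unknown event names rather than raising keeps the consumer working if new event types are added later.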