Providers
VOCALS uses a modular provider system for each stage of the voice pipeline: Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS). You can mix and match providers per agent to optimize for latency, accuracy, cost, or language support.
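One way to picture per-agent mixing is as three independent slots, one for each pipeline stage. The sketch below is illustrative only: `ProviderChoice` and `AgentPipeline` are hypothetical names, not part of any VOCALS API, and the model values are taken from the tables later on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderChoice:
    """One provider slot (hypothetical structure for illustration)."""
    service: str  # e.g. "deepgram"
    model: str    # e.g. "nova-2"


@dataclass(frozen=True)
class AgentPipeline:
    """The three stages an agent runs for each conversational turn."""
    stt: ProviderChoice
    llm: ProviderChoice
    tts: ProviderChoice

    def describe(self) -> str:
        # Render the stages in pipeline order: STT, then LLM, then TTS.
        stages = (self.stt, self.llm, self.tts)
        return " -> ".join(f"{p.service}:{p.model}" for p in stages)


# Each agent can combine providers independently of other agents.
low_latency = AgentPipeline(
    stt=ProviderChoice("deepgram", "nova-2"),
    llm=ProviderChoice("openai", "gpt-4o-mini"),
    tts=ProviderChoice("elevenlabs", "eleven_turbo_v2_5"),
)
print(low_latency.describe())
```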
Adding a Provider
- Navigate to Settings > Providers in the dashboard.
- Click Add Provider.
- Select the provider type (STT, LLM, or TTS) and the specific service.
- Enter your API key and configure provider-specific settings.
- Click Save.
Each provider is validated on save: VOCALS makes a lightweight test request to confirm that your API key and configuration are valid.
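This page does not describe the exact request VOCALS sends during validation. As a rough idea of what a lightweight check can look like, the sketch below builds a read-only request against OpenAI's model-listing endpoint, which only succeeds with a valid key. The helper names `build_validation_request` and `classify_status` are hypothetical, and the status mapping is an assumption about reasonable behavior.

```python
import urllib.request

OPENAI_MODELS_URL = "https://api.openai.com/v1/models"


def build_validation_request(api_key: str) -> urllib.request.Request:
    """Build a cheap, read-only request that requires a valid key."""
    return urllib.request.Request(
        OPENAI_MODELS_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        method="GET",
    )


def classify_status(status: int) -> str:
    """Map an HTTP status code to a validation outcome (assumed mapping)."""
    if status == 200:
        return "valid"
    if status in (401, 403):
        return "invalid_credentials"
    return "unknown"
```

To run the check yourself, you could pass the request to `urllib.request.urlopen(req, timeout=5)` and read the status from the response, or from the `code` attribute of the `urllib.error.HTTPError` raised on failure.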
API Key Management
- API keys are encrypted at rest and never displayed in full after initial entry.
- To rotate a key, click the Edit button on the provider and enter the new key.
- If a provider returns authentication errors during a call, the call will fall back gracefully and the error will appear in your call logs.
Create separate API keys for VOCALS rather than reusing keys from other projects. This makes it easier to track usage and rotate credentials without breaking other integrations.
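If you apply the same "never display in full" rule in your own tooling or logs, a masking helper along these lines is usually enough. This is a minimal sketch, not the VOCALS implementation: `mask_api_key` and the sample key are made up for illustration.

```python
def mask_api_key(key: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters of a key."""
    if len(key) <= visible:
        # Too short to reveal anything safely: mask everything.
        return "*" * len(key)
    return "*" * (len(key) - visible) + key[-visible:]


# Fake key used only to demonstrate the masking.
print(mask_api_key("sk-vocals-1234abcd"))
```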
STT Providers
Speech-to-Text providers transcribe the caller's audio into text in real time.
Deepgram
| Setting | Description | Default |
|---|---|---|
| Model | nova-2, nova-2-general, nova-2-phonecall, nova-2-meeting | nova-2 |
| Language | BCP-47 language code (e.g., en-US, es, pt-BR) | en-US |
| Smart Format | Enables punctuation, casing, and number formatting | Enabled |
| Endpointing | Silence duration (ms) before finalizing a phrase | 300 |
| Interim Results | Stream partial transcripts for faster response | Enabled |
Deepgram is the recommended STT provider for most use cases due to its low latency and strong streaming support.
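The settings in the table correspond to query parameters on Deepgram's streaming endpoint. How VOCALS passes them through internally is not documented here, so treat the following as a sketch of the equivalent direct request; `deepgram_listen_url` is a hypothetical helper.

```python
from urllib.parse import urlencode


def deepgram_listen_url(
    model: str = "nova-2",
    language: str = "en-US",
    smart_format: bool = True,
    endpointing_ms: int = 300,
    interim_results: bool = True,
) -> str:
    """Build a streaming URL using the defaults from the table above."""
    params = {
        "model": model,
        "language": language,
        # Booleans are sent as lowercase strings in the query string.
        "smart_format": str(smart_format).lower(),
        "endpointing": endpointing_ms,
        "interim_results": str(interim_results).lower(),
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)


print(deepgram_listen_url(model="nova-2-phonecall"))
```

Lower endpointing values finalize phrases sooner but can cut off callers who pause mid-sentence, so it is worth testing any change against real call audio.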
OpenAI Whisper
| Setting | Description | Default |
|---|---|---|
| Model | whisper-1 | whisper-1 |
| Language | ISO-639-1 code (e.g., en, es, fr) | Auto-detect |
| Temperature | Sampling temperature for decoding (0.0 - 1.0) | 0.0 |
OpenAI Whisper operates in batch mode: audio is buffered and sent in chunks rather than streamed continuously. This adds some latency but can improve accuracy in noisy environments.
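The buffering behavior can be sketched as a simple accumulator that emits fixed-size chunks. This is an illustration of the general technique, not the VOCALS implementation: the `ChunkBuffer` class is hypothetical, and the chunk size in the example assumes 8 kHz, 16-bit mono PCM with two-second chunks.

```python
class ChunkBuffer:
    """Accumulate incoming audio and release it in fixed-size chunks."""

    def __init__(self, chunk_bytes: int) -> None:
        if chunk_bytes <= 0:
            raise ValueError("chunk_bytes must be positive")
        self.chunk_bytes = chunk_bytes
        self._buf = bytearray()

    def feed(self, audio: bytes) -> list[bytes]:
        """Add audio and return every complete chunk now available."""
        self._buf.extend(audio)
        chunks = []
        while len(self._buf) >= self.chunk_bytes:
            chunks.append(bytes(self._buf[: self.chunk_bytes]))
            del self._buf[: self.chunk_bytes]
        return chunks

    def flush(self) -> bytes:
        """Return whatever is left, e.g. when the caller stops speaking."""
        rest = bytes(self._buf)
        self._buf.clear()
        return rest


# Assumed format: 8000 samples/s * 2 bytes/sample * 2 s = 32000 bytes.
buffer = ChunkBuffer(chunk_bytes=8000 * 2 * 2)
```

Each chunk returned by `feed` would be sent as one transcription request, which is where the extra latency compared with a streaming provider comes from.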
Alibaba Qwen
| Setting | Description | Default |
|---|---|---|
| Model | qwen-audio | qwen-audio |
| Language | Language code (e.g., en, zh, ja) | en |
Qwen provides strong multilingual support, particularly for Chinese and East Asian languages.
LLM Providers
LLM providers generate the agent's conversational responses based on the transcript and system prompt.
OpenAI
| Setting | Description | Default |
|---|---|---|
| Model | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo | gpt-4o-mini |
| Temperature | Controls randomness (0.0 - 2.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |
OpenAI models offer a good balance of quality and speed. Use gpt-4o-mini for cost-efficient general conversations and gpt-4o when response quality is critical.
Anthropic Claude
| Setting | Description | Default |
|---|---|---|
| Model | claude-sonnet-4-20250514, claude-haiku-4-20250414 | claude-sonnet-4-20250514 |
| Temperature | Controls randomness (0.0 - 1.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |
Claude excels at following detailed system prompts and maintaining consistent personas. It is a strong choice for agents that must adhere strictly to scripts or compliance requirements.
Google Gemini
| Setting | Description | Default |
|---|---|---|
| Model | gemini-2.0-flash, gemini-2.0-pro | gemini-2.0-flash |
| Temperature | Controls randomness (0.0 - 2.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |
Gemini Flash offers very low latency at a competitive price point. It is a good option for high-volume deployments where speed matters.
Moonshot Kimi
| Setting | Description | Default |
|---|---|---|
| Model | moonshot-v1-8k, moonshot-v1-32k | moonshot-v1-8k |
| Temperature | Controls randomness (0.0 - 1.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |
Kimi provides strong Chinese language support and competitive pricing for Asian-market deployments.
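Note that the temperature range differs between LLM providers (0.0 to 2.0 for OpenAI and Gemini, 0.0 to 1.0 for Claude and Kimi), so a value that is valid for one may be rejected by another when you switch. The sketch below encodes the ranges and defaults from the tables above; the provider keys and the `validate_llm_settings` helper are hypothetical names chosen for illustration.

```python
# Temperature ranges as listed in the LLM provider tables above.
TEMPERATURE_RANGES = {
    "openai": (0.0, 2.0),
    "anthropic": (0.0, 1.0),
    "gemini": (0.0, 2.0),
    "kimi": (0.0, 1.0),
}


def validate_llm_settings(
    provider: str, temperature: float = 0.7, max_tokens: int = 256
) -> dict:
    """Check settings against the documented range for one provider."""
    if provider not in TEMPERATURE_RANGES:
        raise ValueError(f"unknown provider: {provider}")
    low, high = TEMPERATURE_RANGES[provider]
    if not low <= temperature <= high:
        raise ValueError(
            f"{provider} temperature must be between {low} and {high}"
        )
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    return {
        "provider": provider,
        "temperature": temperature,
        "max_tokens": max_tokens,
    }


print(validate_llm_settings("openai", temperature=1.2))
```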
TTS Providers
Text-to-Speech providers convert the LLM's text response into audio that the caller hears.
ElevenLabs
| Setting | Description | Default |
|---|---|---|
| Voice ID | The ID of the voice to use (from your ElevenLabs account) | -- |
| Model | eleven_turbo_v2_5, eleven_turbo_v2, eleven_multilingual_v2 | eleven_turbo_v2_5 |
| Stability | Voice consistency (0.0 - 1.0). Lower values sound more expressive. | 0.5 |
| Similarity Boost | How closely to match the original voice (0.0 - 1.0) | 0.75 |
| Optimize Streaming Latency | Latency optimization level (0 - 4, higher = faster but lower quality) | 3 |
ElevenLabs produces the most natural-sounding voices and supports voice cloning. Use the turbo models for telephony where latency matters.
In your ElevenLabs dashboard, go to Voices, select a voice, and copy the Voice ID from the voice details panel. You can also use the ElevenLabs API to list available voices.
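For reference, the settings above map onto a direct call to the ElevenLabs streaming text-to-speech endpoint roughly as follows. This sketch only builds the URL and JSON body; it does not send anything, and it is an assumption about the equivalent direct request rather than a description of what VOCALS does internally. `elevenlabs_stream_request` is a hypothetical helper, and `YOUR_VOICE_ID` is a placeholder.

```python
import json
from urllib.parse import urlencode


def elevenlabs_stream_request(
    voice_id: str,
    text: str,
    model_id: str = "eleven_turbo_v2_5",
    stability: float = 0.5,
    similarity_boost: float = 0.75,
    optimize_streaming_latency: int = 3,
) -> tuple[str, str]:
    """Return (url, json_body) using the defaults from the table above."""
    for name, value in (
        ("stability", stability),
        ("similarity_boost", similarity_boost),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    if optimize_streaming_latency not in range(5):
        raise ValueError("optimize_streaming_latency must be 0 to 4")

    query = urlencode({"optimize_streaming_latency": optimize_streaming_latency})
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream?{query}"
    body = json.dumps(
        {
            "text": text,
            "model_id": model_id,
            "voice_settings": {
                "stability": stability,
                "similarity_boost": similarity_boost,
            },
        }
    )
    return url, body


url, body = elevenlabs_stream_request("YOUR_VOICE_ID", "Thanks for calling.")
print(url)
```

A real request would be a POST with the key in the `xi-api-key` header. The same header authenticates a GET to `https://api.elevenlabs.io/v1/voices`, which lists the voices and their IDs.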
OpenAI TTS
| Setting | Description | Default |
|---|---|---|
| Model | tts-1, tts-1-hd | tts-1 |
| Voice | alloy, echo, fable, onyx, nova, shimmer | alloy |
| Speed | Playback speed (0.25 - 4.0) | 1.0 |
OpenAI TTS is simple to configure with six built-in voices. Use tts-1 for telephony (lower latency) and tts-1-hd only when audio quality is the top priority.
Resemble AI
| Setting | Description | Default |
|---|---|---|
| Voice UUID | The UUID of the voice from your Resemble project | -- |
| Project UUID | Your Resemble project identifier | -- |
| Sample Rate | Output sample rate in Hz | 8000 |
Resemble AI specializes in custom voice cloning. If you need a branded voice that sounds like a specific person, Resemble is the best choice.
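The 8000 Hz default matches the narrowband audio commonly used on telephone networks. To see what a sample rate means for audio frame sizes, the arithmetic below assumes 16-bit mono PCM and 20 ms frames; neither assumption comes from this page, and `pcm_frame_bytes` is a hypothetical helper.

```python
def pcm_frame_bytes(
    sample_rate_hz: int = 8000,
    frame_ms: int = 20,
    bytes_per_sample: int = 2,  # assumed 16-bit PCM
    channels: int = 1,          # assumed mono
) -> int:
    """Size in bytes of one uncompressed PCM frame."""
    samples_per_frame = sample_rate_hz * frame_ms // 1000
    return samples_per_frame * bytes_per_sample * channels


print(pcm_frame_bytes())       # 8000 Hz: 160 samples * 2 bytes = 320
print(pcm_frame_bytes(16000))  # 16000 Hz: 320 samples * 2 bytes = 640
```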
Provider Recommendations by Use Case
| Use Case | STT | LLM | TTS |
|---|---|---|---|
| General English (low latency) | Deepgram nova-2 | OpenAI gpt-4o-mini | ElevenLabs turbo v2.5 |
| High-quality conversations | Deepgram nova-2 | Anthropic Claude Sonnet | ElevenLabs multilingual v2 |
| Budget-conscious | Deepgram nova-2 | Google Gemini Flash | OpenAI tts-1 |
| Chinese language | Alibaba Qwen | Moonshot Kimi | ElevenLabs multilingual v2 |
| Custom brand voice | Deepgram nova-2 | OpenAI gpt-4o | Resemble AI |
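
If you provision agents from scripts, the table above can be kept as a small lookup so that each use case resolves to one provider combination. This is a sketch only: the dictionary keys and the `recommend` helper are hypothetical, and the values are copied from the table.

```python
# (STT, LLM, TTS) per use case, copied from the recommendations table.
RECOMMENDED = {
    "general_english": ("Deepgram nova-2", "OpenAI gpt-4o-mini", "ElevenLabs turbo v2.5"),
    "high_quality": ("Deepgram nova-2", "Anthropic Claude Sonnet", "ElevenLabs multilingual v2"),
    "budget": ("Deepgram nova-2", "Google Gemini Flash", "OpenAI tts-1"),
    "chinese": ("Alibaba Qwen", "Moonshot Kimi", "ElevenLabs multilingual v2"),
    "custom_brand_voice": ("Deepgram nova-2", "OpenAI gpt-4o", "Resemble AI"),
}


def recommend(use_case: str) -> dict:
    """Return the suggested provider for each stage of one use case."""
    if use_case not in RECOMMENDED:
        raise KeyError(f"no recommendation for use case: {use_case}")
    stt, llm, tts = RECOMMENDED[use_case]
    return {"stt": stt, "llm": llm, "tts": tts}


print(recommend("budget"))
```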