Providers

VOCALS uses a modular provider system for each stage of the voice pipeline: Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS). You can mix and match providers per agent to optimize for latency, accuracy, cost, or language support.

Adding a Provider

1. Navigate to Configuration > Providers in the dashboard.
2. Click Add Provider.
3. Select the provider type (STT, LLM, or TTS) and the specific service.
4. Enter your API key and configure provider-specific settings.
5. Click Save.

Each provider is validated on save: VOCALS sends a lightweight test request to confirm that your API key and configuration are valid.

API Key Management

API keys are encrypted at rest and never displayed in full after initial entry.
To rotate a key, click the Edit button on any provider and enter the new key.
If a provider returns authentication errors during a call, the call will fall back gracefully and the error will appear in your call logs.

Tip: Create separate API keys for VOCALS rather than reusing keys from other projects. This makes it easier to track usage and rotate credentials without breaking other integrations.

STT Providers

Speech-to-Text providers transcribe the caller's audio into text in real time.

Deepgram

Setting	Description	Default
Model	`nova-2`, `nova-2-general`, `nova-2-phonecall`, `nova-2-meeting`	`nova-2`
Language	BCP-47 language code (e.g., `en-US`, `es`, `pt-BR`)	`en-US`
Smart Format	Enables punctuation, casing, and number formatting	Enabled
Endpointing	Silence duration (ms) before finalizing a phrase	`300`
Interim Results	Stream partial transcripts for faster response	Enabled

Deepgram is the recommended STT provider for most use cases due to its low latency and strong streaming support.
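As a rough illustration of what the settings above control, the sketch below maps them onto the query parameters of Deepgram's streaming endpoint (`model`, `language`, `smart_format`, `endpointing`, `interim_results`). How VOCALS builds this request internally is not documented here, so the helper name `deepgram_listen_url` and the one-to-one mapping are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Defaults copied from the Deepgram settings table above.
DEEPGRAM_DEFAULTS = {
    "model": "nova-2",
    "language": "en-US",
    "smart_format": True,
    "endpointing": 300,  # milliseconds of silence before a phrase is finalized
    "interim_results": True,
}


def deepgram_listen_url(**overrides):
    """Build a Deepgram streaming URL from dashboard-style settings.

    Hypothetical helper: VOCALS may assemble the request differently.
    """
    unknown = set(overrides) - set(DEEPGRAM_DEFAULTS)
    if unknown:
        raise ValueError(f"unknown settings: {sorted(unknown)}")
    settings = {**DEEPGRAM_DEFAULTS, **overrides}
    # Booleans are sent as lowercase strings in the query string.
    params = {
        key: str(value).lower() if isinstance(value, bool) else value
        for key, value in settings.items()
    }
    return "wss://api.deepgram.com/v1/listen?" + urlencode(params)


if __name__ == "__main__":
    print(deepgram_listen_url(model="nova-2-phonecall", language="es"))
```

Lowering Endpointing makes the agent respond sooner after the caller stops speaking, at the risk of cutting off callers who pause mid-sentence.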

OpenAI Whisper

Setting	Description	Default
Model	`whisper-1`	`whisper-1`
Language	ISO 639-1 code (e.g., `en`, `es`, `fr`)	Auto-detect
Temperature	Sampling temperature for decoding (0.0 - 1.0)	`0.0`

OpenAI Whisper operates in batch mode -- audio is buffered and sent in chunks rather than streamed continuously. This adds some latency but can improve accuracy in noisy environments.
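To make the batch-mode behavior concrete, here is a minimal sketch of buffering incoming audio frames into fixed-duration chunks before each chunk is sent for transcription. The chunk length (2 seconds), the audio format (8 kHz, 16-bit mono PCM), and the function name `chunk_audio` are assumptions for illustration; the values VOCALS actually uses are not stated in this document.

```python
from typing import Iterable, Iterator


def chunk_audio(
    frames: Iterable[bytes],
    sample_rate: int = 8000,    # assumed telephony sample rate
    bytes_per_sample: int = 2,  # assumed 16-bit mono PCM
    chunk_ms: int = 2000,       # assumed chunk duration
) -> Iterator[bytes]:
    """Group small PCM frames into chunks of chunk_ms each."""
    chunk_bytes = sample_rate * bytes_per_sample * chunk_ms // 1000
    buffer = bytearray()
    for frame in frames:
        buffer.extend(frame)
        # Emit every full chunk that has accumulated so far.
        while len(buffer) >= chunk_bytes:
            yield bytes(buffer[:chunk_bytes])
            del buffer[:chunk_bytes]
    if buffer:
        yield bytes(buffer)  # flush the final partial chunk


if __name__ == "__main__":
    # 250 frames of 20 ms each (320 bytes) is 5 seconds of audio.
    frames = [b"\x00" * 320 for _ in range(250)]
    print([len(chunk) for chunk in chunk_audio(frames)])
```

The added latency comes from waiting for each chunk to fill: with the assumed 2-second chunks, a transcript cannot arrive sooner than 2 seconds after the caller starts speaking.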

Alibaba Qwen

Setting	Description	Default
Model	`qwen-audio`	`qwen-audio`
Language	Language code (e.g., `en`, `zh`, `ja`)	`en`

Qwen provides strong multilingual support, particularly for Chinese and East Asian languages.

Fish Audio

Setting	Description	Default
Model	`transcribe-1`	`transcribe-1`
Language	Language code (optional, auto-detect if empty)	Auto-detect

Fish Audio's `transcribe-1` model operates in batch mode and supports 30+ languages. It is currently in beta.

LLM Providers

LLM providers generate the agent's conversational responses based on the transcript and system prompt.

OpenAI

Setting	Description	Default
Model	`gpt-4o`, `gpt-4o-mini`, `gpt-4-turbo`, `gpt-3.5-turbo`	`gpt-4o-mini`
Temperature	Controls randomness (0.0 - 2.0)	`0.7`
Max Tokens	Maximum tokens in the response	`256`

OpenAI models offer a good balance of quality and speed. Use `gpt-4o-mini` for cost-efficient general conversations and `gpt-4o` when response quality is critical.

Anthropic Claude

Setting	Description	Default
Model	`claude-sonnet-4-20250514`, `claude-haiku-4-20250414`	`claude-sonnet-4-20250514`
Temperature	Controls randomness (0.0 - 1.0)	`0.7`
Max Tokens	Maximum tokens in the response	`256`

Claude excels at following detailed system prompts and maintaining consistent personas. It is a strong choice for agents that need to adhere strictly to scripts or compliance requirements.

Google Gemini

Setting	Description	Default
Model	`gemini-2.5-flash`, `gemini-2.5-pro`	`gemini-2.5-flash`
Temperature	Controls randomness (0.0 - 2.0)	`0.7`
Max Tokens	Maximum tokens in the response	`256`

Gemini Flash offers very low latency at a competitive price point. It is a good option for high-volume deployments where speed matters.

Moonshot Kimi

Setting	Description	Default
Model	`moonshot-v1-8k`, `moonshot-v1-32k`	`moonshot-v1-8k`
Temperature	Controls randomness (0.0 - 1.0)	`0.7`
Max Tokens	Maximum tokens in the response	`256`

Kimi provides strong Chinese language support and competitive pricing for Asian-market deployments.
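The LLM tables above share the same three settings but allow different temperature ranges (up to 2.0 for OpenAI and Gemini, up to 1.0 for Claude and Kimi). The sketch below shows one way to check a configuration against those ranges before saving it. The provider keys (`"openai"`, `"anthropic"`, `"gemini"`, `"moonshot"`) and the `LLMSettings` class are hypothetical names, not part of VOCALS.

```python
from dataclasses import dataclass

# Temperature ranges as listed in the LLM provider tables above.
# The dictionary keys are hypothetical identifiers chosen for this sketch.
TEMPERATURE_RANGE = {
    "openai": (0.0, 2.0),
    "anthropic": (0.0, 1.0),
    "gemini": (0.0, 2.0),
    "moonshot": (0.0, 1.0),
}


@dataclass(frozen=True)
class LLMSettings:
    provider: str
    model: str
    temperature: float = 0.7  # default shared by all four tables
    max_tokens: int = 256     # default shared by all four tables

    def __post_init__(self):
        if self.provider not in TEMPERATURE_RANGE:
            raise ValueError(f"unknown provider: {self.provider!r}")
        low, high = TEMPERATURE_RANGE[self.provider]
        if not low <= self.temperature <= high:
            raise ValueError(
                f"temperature {self.temperature} is outside "
                f"{low}-{high} for {self.provider}"
            )
        if self.max_tokens <= 0:
            raise ValueError("max_tokens must be positive")


if __name__ == "__main__":
    print(LLMSettings("openai", "gpt-4o-mini", temperature=1.5))
```

A temperature of 1.5 is accepted for the `"openai"` entry but raises `ValueError` for `"anthropic"`, which is worth keeping in mind when switching an agent between providers.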

TTS Providers

Text-to-Speech providers convert the LLM's text response into audio that the caller hears.

ElevenLabs

Setting	Description	Default
Voice ID	The ID of the voice to use (from your ElevenLabs account)	--
Model	`eleven_turbo_v2_5`, `eleven_turbo_v2`, `eleven_multilingual_v2`	`eleven_turbo_v2_5`
Stability	Voice consistency (0.0 - 1.0). Lower values sound more expressive.	`0.5`
Similarity Boost	How closely to match the original voice (0.0 - 1.0)	`0.75`
Optimize Streaming Latency	Latency optimization level (0 - 4, higher = faster but lower quality)	`3`

ElevenLabs produces the most natural-sounding voices and supports voice cloning. Use the turbo models for telephony where latency matters.
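For orientation, the settings in the table correspond to fields in ElevenLabs' text-to-speech API: `model_id` and `voice_settings` (`stability`, `similarity_boost`) in the request body, and `optimize_streaming_latency` as a query parameter. The sketch below only assembles the URL and body and does not send anything. The helper name `build_tts_request` is hypothetical, and whether VOCALS uses the streaming endpoint shown here is an assumption.

```python
from urllib.parse import urlencode


def build_tts_request(
    voice_id: str,
    text: str,
    model_id: str = "eleven_turbo_v2_5",
    stability: float = 0.5,
    similarity_boost: float = 0.75,
    optimize_streaming_latency: int = 3,
):
    """Return (url, json_body) for an ElevenLabs streaming TTS call.

    Hypothetical helper; defaults are taken from the settings table above.
    """
    for name, value in (
        ("stability", stability),
        ("similarity_boost", similarity_boost),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    if optimize_streaming_latency not in range(5):
        raise ValueError("optimize_streaming_latency must be 0-4")

    query = urlencode({"optimize_streaming_latency": optimize_streaming_latency})
    url = f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream?{query}"
    body = {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    }
    return url, body


if __name__ == "__main__":
    # "VOICE_ID" is a placeholder, not a real voice.
    print(build_tts_request("VOICE_ID", "Thanks for calling."))
```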

Finding your Voice ID

In your ElevenLabs dashboard, go to Voices, select a voice, and copy the Voice ID from the voice details panel. You can also use the ElevenLabs API to list available voices.
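If you prefer the API route, a minimal sketch using only the Python standard library might look like the following. It calls ElevenLabs' `GET /v1/voices` endpoint with the `xi-api-key` header and maps each voice name to its `voice_id`. The function names are illustrative, and the API key is read from an `ELEVENLABS_API_KEY` environment variable, which is an assumed convention.

```python
import json
import os
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Prepare the authenticated request without sending it."""
    return urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})


def extract_voice_ids(payload: dict) -> dict:
    """Map voice names to voice IDs from a /v1/voices response body."""
    return {
        voice["name"]: voice["voice_id"]
        for voice in payload.get("voices", [])
    }


def list_voice_ids(api_key: str) -> dict:
    """Fetch the voices on the account and return {name: voice_id}."""
    request = build_voices_request(api_key)
    with urllib.request.urlopen(request, timeout=10) as response:
        return extract_voice_ids(json.load(response))


if __name__ == "__main__":
    # Assumes the key is stored in an environment variable.
    for name, voice_id in list_voice_ids(os.environ["ELEVENLABS_API_KEY"]).items():
        print(f"{name}: {voice_id}")
```

Copy the ID for the voice you want into the Voice ID setting in VOCALS.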

OpenAI TTS

Setting	Description	Default
Model	`tts-1`, `tts-1-hd`	`tts-1`
Voice	`alloy`, `echo`, `fable`, `onyx`, `nova`, `shimmer`	`alloy`
Speed	Playback speed (0.25 - 4.0)	`1.0`

OpenAI TTS is simple to configure and offers six built-in voices. Use `tts-1` for telephony (lower latency) and `tts-1-hd` only when audio quality is the top priority.
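The three settings map directly onto the `model`, `voice`, and `speed` fields of OpenAI's speech endpoint (`POST /v1/audio/speech`), alongside the `input` text. The sketch below validates values against the table and builds the request body without sending it. `build_speech_payload` is a hypothetical helper, not a VOCALS or OpenAI function.

```python
# Allowed values copied from the OpenAI TTS settings table above.
OPENAI_TTS_MODELS = ("tts-1", "tts-1-hd")
OPENAI_TTS_VOICES = ("alloy", "echo", "fable", "onyx", "nova", "shimmer")


def build_speech_payload(
    text: str,
    model: str = "tts-1",
    voice: str = "alloy",
    speed: float = 1.0,
) -> dict:
    """Return the JSON body for a /v1/audio/speech request."""
    if model not in OPENAI_TTS_MODELS:
        raise ValueError(f"model must be one of {OPENAI_TTS_MODELS}")
    if voice not in OPENAI_TTS_VOICES:
        raise ValueError(f"voice must be one of {OPENAI_TTS_VOICES}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {"model": model, "voice": voice, "input": text, "speed": speed}


if __name__ == "__main__":
    print(build_speech_payload("Your order has shipped.", voice="nova"))
```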

Resemble AI

Setting	Description	Default
Voice UUID	The UUID of the voice from your Resemble project	--
Project UUID	Your Resemble project identifier	--
Sample Rate	Output sample rate in Hz	`8000`

Resemble AI specializes in custom voice cloning. If you need a branded voice that sounds like a specific person, Resemble is the best choice.

Fish Audio

Setting	Description	Default
Model	`s2`, `s1`, `speech-1.6`, `speech-1.5`	`s2`
Reference ID	Voice reference ID from your Fish Audio account	--
Temperature	Controls expressiveness (0.0 - 1.0)	`0.7`
Top P	Nucleus sampling parameter (0.0 - 1.0)	`0.7`
Speed	Playback speed multiplier	`1.0`
Latency	`low`, `balanced`, `normal`	`balanced`

Fish Audio produces natural-sounding speech with emotion control across 30+ languages. The `s2` model is the latest and most capable, with improved latency and quality over `s1`. For telephony, set Latency to `balanced` for the best tradeoff between speed and quality.

Provider Recommendations by Use Case

Use Case	STT	LLM	TTS
General English (low latency)	Deepgram nova-2	OpenAI gpt-4o-mini	ElevenLabs turbo v2.5
High-quality conversations	Deepgram nova-2	Anthropic Claude Sonnet	ElevenLabs multilingual v2
Budget-conscious	Deepgram nova-2	Google Gemini Flash	OpenAI tts-1
Multilingual (30+ languages)	Fish Audio transcribe-1	Google Gemini Flash	Fish Audio s2
Chinese language	Alibaba Qwen	Moonshot Kimi	ElevenLabs multilingual v2
Custom brand voice	Deepgram nova-2	OpenAI gpt-4o	Resemble AI
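If you provision agents from scripts, the table above can be kept as a small lookup so that each agent starts from a recommended stack. This is only a sketch: the use-case keys and the `recommend` function are hypothetical names, and the values are copied from the table as written.

```python
from typing import NamedTuple


class Stack(NamedTuple):
    stt: str
    llm: str
    tts: str


# Rows copied from the recommendations table; keys are hypothetical slugs.
RECOMMENDED = {
    "general-english": Stack("Deepgram nova-2", "OpenAI gpt-4o-mini", "ElevenLabs turbo v2.5"),
    "high-quality": Stack("Deepgram nova-2", "Anthropic Claude Sonnet", "ElevenLabs multilingual v2"),
    "budget": Stack("Deepgram nova-2", "Google Gemini Flash", "OpenAI tts-1"),
    "multilingual": Stack("Fish Audio transcribe-1", "Google Gemini Flash", "Fish Audio s2"),
    "chinese": Stack("Alibaba Qwen", "Moonshot Kimi", "ElevenLabs multilingual v2"),
    "custom-voice": Stack("Deepgram nova-2", "OpenAI gpt-4o", "Resemble AI"),
}


def recommend(use_case: str) -> Stack:
    """Look up the recommended STT/LLM/TTS stack for a use case."""
    try:
        return RECOMMENDED[use_case]
    except KeyError:
        options = ", ".join(sorted(RECOMMENDED))
        raise ValueError(f"unknown use case {use_case!r}; choose from: {options}") from None


if __name__ == "__main__":
    stack = recommend("budget")
    print(f"STT: {stack.stt} | LLM: {stack.llm} | TTS: {stack.tts}")
```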
