Providers

VOCALS uses a modular provider system for each stage of the voice pipeline: Speech-to-Text (STT), Large Language Model (LLM), and Text-to-Speech (TTS). You can mix and match providers per agent to optimize for latency, accuracy, cost, or language support.
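
As a rough illustration of the mix-and-match idea, the sketch below models one agent's pipeline as three independent stage choices. The class and field names (`StageConfig`, `AgentPipeline`) are hypothetical and exist only for this example; VOCALS itself is configured through the dashboard, and the provider and model strings are taken from the tables on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """One pipeline stage: which provider and which model (hypothetical shape)."""
    provider: str
    model: str


@dataclass(frozen=True)
class AgentPipeline:
    """An agent picks one provider per stage, independently of the others."""
    stt: StageConfig
    llm: StageConfig
    tts: StageConfig

    def describe(self) -> str:
        stages = (self.stt, self.llm, self.tts)
        return " -> ".join(f"{s.provider}/{s.model}" for s in stages)


# A low-latency English agent built from three different vendors.
low_latency = AgentPipeline(
    stt=StageConfig("deepgram", "nova-2"),
    llm=StageConfig("openai", "gpt-4o-mini"),
    tts=StageConfig("elevenlabs", "eleven_turbo_v2_5"),
)
print(low_latency.describe())
```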

Adding a Provider

  1. Navigate to Settings > Providers in the dashboard.
  2. Click Add Provider.
  3. Select the provider type (STT, LLM, or TTS) and the specific service.
  4. Enter your API key and configure provider-specific settings.
  5. Click Save.

Each provider is validated on save: VOCALS makes a lightweight test request to confirm that your API key and configuration are valid.

API Key Management

  • API keys are encrypted at rest and never displayed in full after initial entry.
  • To rotate a key, click the Edit button on any provider and enter the new key.
  • If a provider returns authentication errors during a call, the call will fall back gracefully and the error will appear in your call logs.

Tip: Create separate API keys for VOCALS rather than reusing keys from other projects. This makes it easier to track usage and rotate credentials without breaking other integrations.

STT Providers

Speech-to-Text providers transcribe the caller's audio into text in real time.

Deepgram

| Setting | Description | Default |
| --- | --- | --- |
| Model | nova-2, nova-2-general, nova-2-phonecall, nova-2-meeting | nova-2 |
| Language | BCP-47 language code (e.g., en-US, es, pt-BR) | en-US |
| Smart Format | Enables punctuation, casing, and number formatting | Enabled |
| Endpointing | Silence duration (ms) before finalizing a phrase | 300 |
| Interim Results | Stream partial transcripts for faster response | Enabled |

Deepgram is the recommended STT provider for most use cases due to its low latency and strong streaming support.
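
For readers who want to see how these settings might look on the wire, here is a minimal sketch that turns them into query parameters for Deepgram's streaming endpoint. It assumes VOCALS passes the dashboard settings through to Deepgram more or less directly, which this page does not confirm; the function name `deepgram_stream_url` is hypothetical, while the parameter names (`model`, `language`, `smart_format`, `endpointing`, `interim_results`) follow Deepgram's public streaming API.

```python
from urllib.parse import urlencode

DEEPGRAM_LISTEN_URL = "wss://api.deepgram.com/v1/listen"


def deepgram_stream_url(
    model: str = "nova-2",
    language: str = "en-US",
    smart_format: bool = True,
    endpointing_ms: int = 300,
    interim_results: bool = True,
) -> str:
    """Build a streaming URL using the defaults from the table above."""
    params = {
        "model": model,
        "language": language,
        # Deepgram expects lowercase true/false in the query string.
        "smart_format": str(smart_format).lower(),
        "endpointing": endpointing_ms,
        "interim_results": str(interim_results).lower(),
    }
    return f"{DEEPGRAM_LISTEN_URL}?{urlencode(params)}"


print(deepgram_stream_url(model="nova-2-phonecall", endpointing_ms=500))
```

A longer endpointing value waits for more silence before a phrase is finalized, so the agent interrupts slow speakers less often but responds later.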

OpenAI Whisper

| Setting | Description | Default |
| --- | --- | --- |
| Model | whisper-1 | whisper-1 |
| Language | ISO-639-1 code (e.g., en, es, fr) | Auto-detect |
| Temperature | Sampling temperature for decoding (0.0 - 1.0) | 0.0 |

OpenAI Whisper operates in batch mode: audio is buffered and sent in chunks rather than streamed continuously. This adds some latency but can improve accuracy in noisy environments.
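
The buffering step can be pictured with the sketch below, which groups small incoming audio frames into fixed-size chunks before they would be sent for transcription. This is only an illustration: `buffer_audio` is a hypothetical helper, the chunk size VOCALS actually uses is not documented here, and the example assumes 8 kHz, 16-bit mono PCM arriving in 20 ms frames (320 bytes each).

```python
from typing import Iterable, Iterator


def buffer_audio(frames: Iterable[bytes], chunk_bytes: int) -> Iterator[bytes]:
    """Accumulate frames and yield chunks of exactly chunk_bytes.

    Any audio left over when the input ends is yielded as a final,
    shorter chunk so that nothing is dropped.
    """
    buf = bytearray()
    for frame in frames:
        buf.extend(frame)
        while len(buf) >= chunk_bytes:
            yield bytes(buf[:chunk_bytes])
            del buf[:chunk_bytes]
    if buf:
        yield bytes(buf)


# Assumed format: 8000 samples/s * 2 bytes = 16000 bytes per second of audio.
frames = [b"\x00" * 320] * 120           # 120 frames of 20 ms = 2.4 s
chunks = list(buffer_audio(frames, 16000))
print([len(c) for c in chunks])
```

Each chunk has to fill before it can be transcribed, which is where the extra latency relative to a streaming provider comes from.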

Alibaba Qwen

| Setting | Description | Default |
| --- | --- | --- |
| Model | qwen-audio | qwen-audio |
| Language | Language code (e.g., en, zh, ja) | en |

Qwen provides strong multilingual support, particularly for Chinese and East Asian languages.

LLM Providers

LLM providers generate the agent's conversational responses based on the transcript and system prompt.

OpenAI

| Setting | Description | Default |
| --- | --- | --- |
| Model | gpt-4o, gpt-4o-mini, gpt-4-turbo, gpt-3.5-turbo | gpt-4o-mini |
| Temperature | Controls randomness (0.0 - 2.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |

OpenAI models offer a good balance of quality and speed. Use gpt-4o-mini for cost-efficient general conversations and gpt-4o when response quality is critical.

Anthropic Claude

| Setting | Description | Default |
| --- | --- | --- |
| Model | claude-sonnet-4-20250514, claude-haiku-4-20250414 | claude-sonnet-4-20250514 |
| Temperature | Controls randomness (0.0 - 1.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |

Claude excels at following detailed system prompts and maintaining consistent personas. It is a strong choice for agents that must adhere strictly to scripts or compliance requirements.

Google Gemini

| Setting | Description | Default |
| --- | --- | --- |
| Model | gemini-2.0-flash, gemini-2.0-pro | gemini-2.0-flash |
| Temperature | Controls randomness (0.0 - 2.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |

Gemini Flash offers very low latency at a competitive price point. It is a good option for high-volume deployments where speed matters.

Moonshot Kimi

| Setting | Description | Default |
| --- | --- | --- |
| Model | moonshot-v1-8k, moonshot-v1-32k | moonshot-v1-8k |
| Temperature | Controls randomness (0.0 - 1.0) | 0.7 |
| Max Tokens | Maximum tokens in the response | 256 |

Kimi provides strong Chinese language support and competitive pricing for Asian-market deployments.
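
Note that the temperature range differs between LLM providers: OpenAI and Gemini accept 0.0 - 2.0, while Claude and Kimi accept 0.0 - 1.0. A value that is valid for one provider can therefore be out of range after switching an agent to another. The sketch below shows one way to guard against that; the provider keys and the `clamp_temperature` helper are hypothetical, and only the numeric ranges come from the tables above.

```python
# Temperature ranges as listed in the LLM provider tables.
TEMPERATURE_RANGES = {
    "openai": (0.0, 2.0),
    "anthropic": (0.0, 1.0),
    "gemini": (0.0, 2.0),
    "moonshot": (0.0, 1.0),
}


def clamp_temperature(provider: str, value: float) -> float:
    """Return value limited to the range the given provider accepts."""
    low, high = TEMPERATURE_RANGES[provider]
    return min(max(value, low), high)


# 1.4 is fine for OpenAI but must be reduced for Claude.
print(clamp_temperature("openai", 1.4))
print(clamp_temperature("anthropic", 1.4))
```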

TTS Providers

Text-to-Speech providers convert the LLM's text response into audio that the caller hears.

ElevenLabs

| Setting | Description | Default |
| --- | --- | --- |
| Voice ID | The ID of the voice to use (from your ElevenLabs account) | -- |
| Model | eleven_turbo_v2_5, eleven_turbo_v2, eleven_multilingual_v2 | eleven_turbo_v2_5 |
| Stability | Voice consistency (0.0 - 1.0). Lower values sound more expressive. | 0.5 |
| Similarity Boost | How closely to match the original voice (0.0 - 1.0) | 0.75 |
| Optimize Streaming Latency | Latency optimization level (0 - 4, higher = faster but lower quality) | 3 |

ElevenLabs produces the most natural-sounding voices and supports voice cloning. Use the turbo models for telephony where latency matters.
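
To make the settings concrete, the following sketch assembles the URL and JSON body for an ElevenLabs streaming text-to-speech request using the defaults from the table. It assumes VOCALS forwards these values to the ElevenLabs streaming endpoint, which this page does not state; `elevenlabs_stream_request` and the example voice ID are hypothetical, and the field names follow the public ElevenLabs API as I understand it. The function only builds the request and does not send it.

```python
from typing import Tuple


def elevenlabs_stream_request(
    voice_id: str,
    text: str,
    model_id: str = "eleven_turbo_v2_5",
    stability: float = 0.5,
    similarity_boost: float = 0.75,
    optimize_streaming_latency: int = 3,
) -> Tuple[str, dict]:
    """Validate the settings and return (url, json_body)."""
    for name, value in (("stability", stability),
                        ("similarity_boost", similarity_boost)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0.0 and 1.0")
    if optimize_streaming_latency not in range(5):
        raise ValueError("optimize_streaming_latency must be 0 - 4")

    url = (
        f"https://api.elevenlabs.io/v1/text-to-speech/{voice_id}/stream"
        f"?optimize_streaming_latency={optimize_streaming_latency}"
    )
    body = {
        "text": text,
        "model_id": model_id,
        "voice_settings": {
            "stability": stability,
            "similarity_boost": similarity_boost,
        },
    }
    return url, body


url, body = elevenlabs_stream_request("YOUR_VOICE_ID", "Thanks for calling.")
print(url)
print(body)
```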

Finding your Voice ID

In your ElevenLabs dashboard, go to Voices, select a voice, and copy the Voice ID from the voice details panel. You can also use the ElevenLabs API to list available voices.
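
A minimal sketch of the API route, assuming the public ElevenLabs `GET /v1/voices` endpoint with the `xi-api-key` header and a response containing a `voices` array whose entries carry `name` and `voice_id`. The helper names are hypothetical, and the sample payload is made up for illustration rather than captured from a real account.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def parse_voices(payload: dict) -> dict:
    """Map each voice name to its voice ID."""
    return {v["name"]: v["voice_id"] for v in payload.get("voices", [])}


def list_voices(api_key: str) -> dict:
    """Fetch the voices available to this API key (performs a network call)."""
    req = urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})
    with urllib.request.urlopen(req, timeout=10) as resp:
        return parse_voices(json.load(resp))


# Offline demonstration with an illustrative payload.
sample = {"voices": [{"voice_id": "abc123", "name": "Demo Voice"}]}
print(parse_voices(sample))
```

Copy the ID for the voice you want into the Voice ID setting above.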

OpenAI TTS

| Setting | Description | Default |
| --- | --- | --- |
| Model | tts-1, tts-1-hd | tts-1 |
| Voice | alloy, echo, fable, onyx, nova, shimmer | alloy |
| Speed | Playback speed (0.25 - 4.0) | 1.0 |

OpenAI TTS is simple to configure with six built-in voices. Use tts-1 for telephony (lower latency) and tts-1-hd only when audio quality is the top priority.
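
As a small illustration, the sketch below validates the three settings and builds the JSON body for OpenAI's `POST /v1/audio/speech` endpoint. Whether VOCALS sends exactly this body is an assumption; `openai_tts_payload` is a hypothetical helper, and the allowed values are the ones listed in the table.

```python
OPENAI_SPEECH_URL = "https://api.openai.com/v1/audio/speech"
TTS_MODELS = {"tts-1", "tts-1-hd"}
TTS_VOICES = {"alloy", "echo", "fable", "onyx", "nova", "shimmer"}


def openai_tts_payload(text: str, model: str = "tts-1",
                       voice: str = "alloy", speed: float = 1.0) -> dict:
    """Check the settings against the table and return the request body."""
    if model not in TTS_MODELS:
        raise ValueError(f"unknown model: {model}")
    if voice not in TTS_VOICES:
        raise ValueError(f"unknown voice: {voice}")
    if not 0.25 <= speed <= 4.0:
        raise ValueError("speed must be between 0.25 and 4.0")
    return {"model": model, "voice": voice, "input": text, "speed": speed}


print(openai_tts_payload("Your appointment is confirmed.", voice="nova"))
```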

Resemble AI

| Setting | Description | Default |
| --- | --- | --- |
| Voice UUID | The UUID of the voice from your Resemble project | -- |
| Project UUID | Your Resemble project identifier | -- |
| Sample Rate | Output sample rate in Hz | 8000 |

Resemble AI specializes in custom voice cloning. If you need a branded voice that sounds like a specific person, Resemble is the best choice.

Provider Recommendations by Use Case

| Use Case | STT | LLM | TTS |
| --- | --- | --- | --- |
| General English (low latency) | Deepgram nova-2 | OpenAI gpt-4o-mini | ElevenLabs turbo v2.5 |
| High-quality conversations | Deepgram nova-2 | Anthropic Claude Sonnet | ElevenLabs multilingual v2 |
| Budget-conscious | Deepgram nova-2 | Google Gemini Flash | OpenAI tts-1 |
| Chinese language | Alibaba Qwen | Moonshot Kimi | ElevenLabs multilingual v2 |
| Custom brand voice | Deepgram nova-2 | OpenAI gpt-4o | Resemble AI |