Provider Integration

VOCALS uses a provider abstraction pattern that lets you add custom STT, LLM, and TTS implementations without modifying the core pipeline. This guide walks through the base interfaces, required methods, and registration process.

Base Interfaces

All providers inherit from BaseProvider, which handles API key storage and requires a validate() method. Each pipeline stage has its own abstract class.

BaseProvider

# backend/app/providers/base.py

from abc import ABC, abstractmethod
from typing import Any, AsyncIterator, Dict, List, Optional


class BaseProvider(ABC):
    def __init__(self, api_key: str, config: Optional[Dict[str, Any]] = None):
        self.api_key = api_key
        self.config = config or {}

    @abstractmethod
    async def validate(self) -> bool:
        """Validate the provider configuration and API key.

        Returns True if valid, raises an exception with details otherwise.
        """
        ...

    async def list_models(self) -> List[str]:
        """Return available model IDs. Override in subclasses.

        Returns an empty list by default.
        """
        return []

STTProvider

class STTProvider(BaseProvider):
    @abstractmethod
    async def transcribe(self, audio_stream: AsyncIterator[bytes]) -> AsyncIterator[str]:
        """Transcribe streaming audio to text.

        Args:
            audio_stream: Async iterator of PCM audio chunks (16kHz, 16-bit mono).

        Yields:
            Transcribed text fragments (partial or final results).
        """
        ...
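To make the contract concrete, here is a minimal stand-alone stub. It does not import the real base classes (so it runs on its own), and the transcripts are hard-coded — the point is only to show the async-iterator-in, async-iterator-out shape a real STTProvider must satisfy:

```python
import asyncio
from typing import AsyncIterator, List


class EchoSTTProvider:
    """Toy stand-in for STTProvider: emits one fake fragment per audio chunk."""

    async def validate(self) -> bool:
        return True

    async def transcribe(self, audio_stream: AsyncIterator[bytes]) -> AsyncIterator[str]:
        i = 0
        async for _chunk in audio_stream:
            i += 1
            yield f"fragment-{i}"


async def fake_audio() -> AsyncIterator[bytes]:
    for _ in range(3):
        yield b"\x00" * 640  # 20 ms of 16 kHz, 16-bit mono silence


async def main() -> List[str]:
    stt = EchoSTTProvider()
    # The orchestrator consumes the provider the same way: async for over transcribe()
    return [t async for t in stt.transcribe(fake_audio())]


print(asyncio.run(main()))
```

A real implementation would forward chunks to the vendor's streaming endpoint and yield partial and final results as they arrive.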

LLMProvider

class LLMProvider(BaseProvider):
    @abstractmethod
    async def generate(
        self,
        messages: list,
        system_prompt: Optional[str] = None,
    ) -> AsyncIterator[str]:
        """Generate a streaming response from the LLM.

        Args:
            messages: Conversation history as a list of dicts with 'role' and 'content'.
            system_prompt: Optional system prompt to prepend.

        Yields:
            Text tokens/chunks as they are generated.
        """
        ...
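The messages format and the streaming contract can be exercised with a similar stub — again self-contained, with a canned reply in place of a real model call:

```python
import asyncio
from typing import AsyncIterator, Optional


class CannedLLMProvider:
    """Toy stand-in for LLMProvider: streams a canned reply token by token."""

    async def generate(
        self,
        messages: list,
        system_prompt: Optional[str] = None,
    ) -> AsyncIterator[str]:
        # Build the payload the way a real provider would: system prompt first,
        # then the role/content history. Here we only stream a canned reply.
        payload = [{"role": "system", "content": system_prompt}] if system_prompt else []
        payload += messages
        for token in ["Hello", " ", "world"]:
            yield token


async def main() -> str:
    llm = CannedLLMProvider()
    history = [{"role": "user", "content": "Say hello"}]
    # Downstream stages accumulate the streamed tokens into the full response.
    return "".join([t async for t in llm.generate(history, system_prompt="Be brief.")])


print(asyncio.run(main()))
```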

TTSProvider

class TTSProvider(BaseProvider):
    @abstractmethod
    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Synthesize text to streaming audio.

        Args:
            text: Text to convert to speech.

        Yields:
            Audio data chunks (PCM 16kHz, 16-bit mono).
        """
        ...

Implementing a Custom Provider

Here is a complete example of adding a custom TTS provider.

Step 1: Create the Provider Class

Create a new file in the appropriate subdirectory:

# backend/app/providers/tts/my_tts.py

from typing import Any, AsyncIterator, Dict, List, Optional

import httpx

from app.providers.base import TTSProvider


class MyTTSProvider(TTSProvider):
    """Custom TTS provider implementation."""

    # Hardcoded model list (or fetch from the API in list_models)
    MODELS = ["model-standard", "model-hd"]

    def __init__(self, api_key: str, config: Optional[Dict[str, Any]] = None):
        super().__init__(api_key, config)
        # Read from self.config, which the base __init__ normalizes to {} when
        # config is None -- reading from the raw argument would crash here.
        self.model = self.config.get("model", "model-standard")
        self.voice = self.config.get("voice_id", "default")
        self.base_url = "https://api.my-tts-service.com/v1"

    async def validate(self) -> bool:
        """Test the API key by making a lightweight API call."""
        async with httpx.AsyncClient() as client:
            resp = await client.get(
                f"{self.base_url}/voices",
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=10.0,
            )
            if resp.status_code == 401:
                raise ValueError("Invalid API key")
            resp.raise_for_status()
            return True

    async def list_models(self) -> List[str]:
        """Return available models. Can query the API or return a static list."""
        return self.MODELS

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Stream synthesized audio as PCM 16kHz 16-bit mono chunks."""
        async with httpx.AsyncClient() as client:
            async with client.stream(
                "POST",
                f"{self.base_url}/synthesize",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={
                    "text": text,
                    "model": self.model,
                    "voice": self.voice,
                    "output_format": "pcm_16000",
                },
                timeout=30.0,
            ) as resp:
                resp.raise_for_status()
                async for chunk in resp.aiter_bytes(chunk_size=4096):
                    yield chunk

Step 2: Register the Provider

Add the import and registration call to the provider registry:

# backend/app/providers/registry.py

def _register_all() -> None:
    # ... existing registrations ...

    from app.providers.tts.my_tts import MyTTSProvider
    register_provider("tts", "my_tts", MyTTSProvider)

The register_provider function takes three arguments:

| Argument | Type | Description |
| --- | --- | --- |
| provider_type | string | stt, llm, or tts |
| name | string | Unique identifier used in the API (e.g. my_tts) |
| cls | Type[BaseProvider] | The provider class |

Step 3: Use the Provider

After registration, the provider is immediately available through the API:

# Create a provider configuration
curl -X POST \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "tts",
    "name": "my_tts",
    "api_key": "your-api-key-here",
    "model_id": "model-hd",
    "extra_config": { "voice_id": "custom-voice" }
  }' \
  https://api.usevocals.com/api/v1/providers

# Test the provider
curl -X POST \
  -H "Authorization: Bearer $JWT" \
  https://api.usevocals.com/api/v1/providers/{provider_id}/test

# List available models
curl -H "Authorization: Bearer $JWT" \
  https://api.usevocals.com/api/v1/providers/{provider_id}/models

Then assign the provider to an agent:

curl -X PUT \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{"active_tts_provider_id": "provider-uuid-here"}' \
  https://api.usevocals.com/api/v1/agents/{agent_id}

Provider Registry

The registry (backend/app/providers/registry.py) maintains three dictionaries -- one per provider type -- mapping names to classes:

_stt_providers: Dict[str, Type[STTProvider]] = {}
_llm_providers: Dict[str, Type[LLMProvider]] = {}
_tts_providers: Dict[str, Type[TTSProvider]] = {}

Key functions:

| Function | Description |
| --- | --- |
| register_provider(type, name, cls) | Register a provider class |
| get_provider(type, name, api_key, config) | Instantiate a registered provider |
| list_providers(type) | List registered provider names for a type |

The _register_all() function runs at import time and registers all built-in providers.
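The pattern is simple enough to sketch end to end. This is a simplified, self-contained version (the real module keeps three separate typed dictionaries and ships real provider classes; DummyTTS is a placeholder used only for the demo):

```python
from typing import Dict, List


# One mapping per provider type, name -> class
_providers: Dict[str, Dict[str, type]] = {"stt": {}, "llm": {}, "tts": {}}


def register_provider(provider_type: str, name: str, cls: type) -> None:
    _providers[provider_type][name] = cls


def get_provider(provider_type: str, name: str, api_key: str, config=None):
    """Instantiate a registered provider, with a clear error for unknown names."""
    try:
        cls = _providers[provider_type][name]
    except KeyError:
        raise ValueError(f"Unknown {provider_type} provider: {name}")
    return cls(api_key, config)


def list_providers(provider_type: str) -> List[str]:
    return sorted(_providers[provider_type])


class DummyTTS:
    """Placeholder provider class for the demo."""

    def __init__(self, api_key: str, config=None):
        self.api_key = api_key
        self.config = config or {}


register_provider("tts", "dummy", DummyTTS)
print(list_providers("tts"))
instance = get_provider("tts", "dummy", "key-123")
```

Keeping the mapping module-private and exposing only these three functions is what lets _register_all() stay the single place where built-ins are wired up.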

Currently Registered Providers

STT Providers

| Name | Class | Description |
| --- | --- | --- |
| deepgram | DeepgramSTTProvider | Real-time streaming via Deepgram Nova |
| openai | OpenAISTTProvider | OpenAI Whisper API |
| whisper | WhisperSTTProvider | OpenAI Whisper (local) |
| elevenlabs | ElevenLabsSTTProvider | ElevenLabs STT |
| qwen | QwenSTTProvider | Alibaba Qwen STT |

LLM Providers

| Name | Class | Description |
| --- | --- | --- |
| openai | OpenAILLMProvider | GPT-4o, GPT-4, GPT-3.5 |
| claude | ClaudeLLMProvider | Anthropic Claude models |
| google | GoogleLLMProvider | Google Gemini models |
| kimi | KimiLLMProvider | Moonshot Kimi models |

TTS Providers

| Name | Class | Description |
| --- | --- | --- |
| deepgram | DeepgramTTSProvider | Deepgram Aura TTS |
| openai | OpenAITTSProvider | OpenAI TTS |
| elevenlabs | ElevenLabsTTSProvider | ElevenLabs TTS |
| qwen | QwenTTSProvider | Alibaba Qwen TTS |
| resemble | ResembleTTSProvider | Resemble AI TTS |

Audio Format Requirements

All STT providers receive audio as an AsyncIterator[bytes] of PCM 16kHz, 16-bit mono chunks. The orchestrator handles conversion from Twilio's mulaw/8kHz format before passing audio to the STT provider.

All TTS providers must yield audio as PCM 16kHz, 16-bit mono chunks. The orchestrator converts the output back to mulaw/8kHz for Twilio.

If your provider's API uses a different format, perform the conversion inside your provider implementation.
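As an illustration of such a conversion, here is a naive linear-interpolation resampler for 16-bit mono PCM. It is a sketch only — production code should use a proper DSP library (e.g. scipy or soxr bindings), since nearest-neighbor/linear resampling without a low-pass filter can alias:

```python
import array


def resample_pcm16(data: bytes, src_rate: int, dst_rate: int) -> bytes:
    """Resample 16-bit mono PCM via linear interpolation (illustrative only)."""
    src = array.array("h")          # signed 16-bit samples
    src.frombytes(data)
    if src_rate == dst_rate or not src:
        return bytes(data)
    out = array.array("h")
    n_out = int(len(src) * dst_rate / src_rate)
    for i in range(n_out):
        pos = i * src_rate / dst_rate           # fractional source index
        j = int(pos)
        frac = pos - j
        nxt = src[min(j + 1, len(src) - 1)]     # clamp at the final sample
        out.append(int(src[j] + (nxt - src[j]) * frac))
    return out.tobytes()
```

A provider whose API streams, say, 24kHz PCM would call this on each buffered chunk before yielding, so the orchestrator always sees 16kHz audio.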

Best Practices

  • Streaming: Use streaming APIs where available. The pipeline processes audio in real time, so batch APIs add significant latency.
  • Timeout handling: Set reasonable timeouts on HTTP calls. The orchestrator will retry or fall back gracefully on provider failures.
  • Error messages: Raise ValueError from validate() with a clear message (e.g. "Invalid API key", "Model not found"). These messages are returned to the user through the API.
  • Model listing: Implement list_models() to query the provider API when possible. Fall back to a hardcoded list if the API does not support model enumeration. Return an empty list only as a last resort.
  • Config access: Use self.config to read provider-specific settings passed via extra_config when creating or updating a provider through the API.
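The error-message guidance can be made concrete with the status-code mapping used in validate(). The sketch below factors that mapping into a testable helper; FakeResponse is a stand-in for httpx.Response so the snippet runs without a network call, and the 404 branch is an added illustration, not part of the example provider above:

```python
class FakeResponse:
    """Minimal stand-in for httpx.Response, for demonstration only."""

    def __init__(self, status_code: int):
        self.status_code = status_code

    def raise_for_status(self) -> None:
        if self.status_code >= 400:
            raise RuntimeError(f"HTTP {self.status_code}")


def check_validate_response(resp: FakeResponse) -> bool:
    """Map provider HTTP errors to user-facing ValueError messages."""
    if resp.status_code == 401:
        raise ValueError("Invalid API key")
    if resp.status_code == 404:
        raise ValueError("Model not found")
    resp.raise_for_status()  # any other 4xx/5xx surfaces as a generic failure
    return True
```

Because the API returns these messages verbatim, keeping them short and specific ("Invalid API key" rather than a raw traceback) directly improves what the user sees when a provider test fails.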