# Provider Integration
VOCALS uses a provider abstraction pattern that lets you add custom STT, LLM, and TTS implementations without modifying the core pipeline. This guide walks through the base interfaces, required methods, and registration process.
## Base Interfaces
All providers inherit from BaseProvider, which handles API key storage and requires a validate() method. Each pipeline stage has its own abstract class.
### BaseProvider

```python
# backend/app/providers/base.py
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class BaseProvider(ABC):
    def __init__(self, api_key: str, config: Optional[Dict[str, Any]] = None):
        self.api_key = api_key
        self.config = config or {}

    @abstractmethod
    async def validate(self) -> bool:
        """Validate the provider configuration and API key.

        Returns True if valid, raises an exception with details otherwise.
        """
        ...

    async def list_models(self) -> List[str]:
        """Return available model IDs. Override in subclasses.

        Returns an empty list by default.
        """
        return []
```
### STTProvider

```python
class STTProvider(BaseProvider):
    @abstractmethod
    async def transcribe(self, audio_stream: AsyncIterator[bytes]) -> AsyncIterator[str]:
        """Transcribe streaming audio to text.

        Args:
            audio_stream: Async iterator of PCM audio chunks (16kHz, 16-bit mono).

        Yields:
            Transcribed text fragments (partial or final results).
        """
        ...
```
### LLMProvider

```python
class LLMProvider(BaseProvider):
    @abstractmethod
    async def generate(
        self,
        messages: list,
        system_prompt: Optional[str] = None,
    ) -> AsyncIterator[str]:
        """Generate a streaming response from the LLM.

        Args:
            messages: Conversation history as list of dicts with 'role' and 'content'.
            system_prompt: Optional system prompt to prepend.

        Yields:
            Text tokens/chunks as they are generated.
        """
        ...
```
### TTSProvider

```python
class TTSProvider(BaseProvider):
    @abstractmethod
    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Synthesize text to streaming audio.

        Args:
            text: Text to convert to speech.

        Yields:
            Audio data chunks (PCM 16kHz, 16-bit mono).
        """
        ...
```
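All three interfaces stream through async iterators, so the pipeline can chain them stage to stage. The following sketch shows the shape of that chaining with stub stages; the function names and the accumulate-then-respond flow are illustrative, not the orchestrator's actual implementation:

```python
import asyncio
from typing import AsyncIterator

# Stub stages standing in for real STT/LLM/TTS providers.
async def transcribe(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    async for chunk in audio:
        yield chunk.decode()          # pretend each chunk is a recognized word

async def generate(text: str) -> AsyncIterator[str]:
    for token in f"echo: {text}".split():
        yield token                   # pretend these are LLM tokens

async def synthesize(text: str) -> AsyncIterator[bytes]:
    yield text.encode()               # pretend this is PCM audio

async def pipeline(audio: AsyncIterator[bytes]) -> list:
    # Accumulate the transcript, then stream it through the LLM and TTS stages.
    transcript = " ".join([t async for t in transcribe(audio)])
    out = []
    async for token in generate(transcript):
        async for pcm in synthesize(token):
            out.append(pcm)
    return out

async def mic() -> AsyncIterator[bytes]:
    for word in (b"hello", b"world"):
        yield word

# asyncio.run(pipeline(mic())) -> [b"echo:", b"hello", b"world"]
```

The key property is that every stage both consumes and produces a stream, so audio can start playing before the LLM has finished generating.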
## Implementing a Custom Provider
Here is a complete example of adding a custom TTS provider.
### Step 1: Create the Provider Class
Create a new file in the appropriate subdirectory:
```python
# backend/app/providers/tts/my_tts.py
from typing import Any, AsyncIterator, Dict, List, Optional

import httpx

from app.providers.base import TTSProvider


class MyTTSProvider(TTSProvider):
    """Custom TTS provider implementation."""

    # Hardcoded model list (or fetch from API in list_models)
    MODELS = ["model-standard", "model-hd"]

    def __init__(self, api_key: str, config: Optional[Dict[str, Any]] = None):
        super().__init__(api_key, config)
        # Read from self.config so a missing config dict falls back safely.
        self.model = self.config.get("model", "model-standard")
        self.voice = self.config.get("voice_id", "default")
        self.base_url = "https://api.my-tts-service.com/v1"

    async def validate(self) -> bool:
        """Test the API key by making a lightweight API call."""
        async with httpx.AsyncClient() as client:
            resp = await client.get(
                f"{self.base_url}/voices",
                headers={"Authorization": f"Bearer {self.api_key}"},
                timeout=10.0,
            )
            if resp.status_code == 401:
                raise ValueError("Invalid API key")
            resp.raise_for_status()
        return True

    async def list_models(self) -> List[str]:
        """Return available models. Can query the API or return a static list."""
        return self.MODELS

    async def synthesize(self, text: str) -> AsyncIterator[bytes]:
        """Stream synthesized audio as PCM 16kHz 16-bit mono chunks."""
        async with httpx.AsyncClient() as client:
            async with client.stream(
                "POST",
                f"{self.base_url}/synthesize",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={
                    "text": text,
                    "model": self.model,
                    "voice": self.voice,
                    "output_format": "pcm_16000",
                },
                timeout=30.0,
            ) as resp:
                resp.raise_for_status()
                async for chunk in resp.aiter_bytes(chunk_size=4096):
                    yield chunk
```
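Note that `synthesize` yields whatever chunk sizes the HTTP client happens to deliver. Downstream consumers often want fixed-size frames instead; here is a sketch of a re-chunking helper (the helper name and the tiny frame size are illustrative, and `fake_synthesize` stands in for the real provider so the example runs offline):

```python
import asyncio
from typing import AsyncIterator

async def rechunk(stream: AsyncIterator[bytes], frame_size: int) -> AsyncIterator[bytes]:
    """Re-slice variable-size byte chunks into fixed-size frames."""
    buf = b""
    async for chunk in stream:
        buf += chunk
        while len(buf) >= frame_size:
            yield buf[:frame_size]
            buf = buf[frame_size:]
    if buf:                           # flush the (possibly short) final frame
        yield buf

async def fake_synthesize(text: str) -> AsyncIterator[bytes]:
    # Stand-in for MyTTSProvider.synthesize: deliberately uneven chunk sizes.
    for part in (b"\x00" * 5, b"\x00" * 3, b"\x00" * 6):
        yield part

async def main() -> list:
    return [len(f) async for f in rechunk(fake_synthesize("hi"), 4)]

# asyncio.run(main()) -> [4, 4, 4, 2]
```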
### Step 2: Register the Provider
Add the import and registration call to the provider registry:
```python
# backend/app/providers/registry.py

def _register_all() -> None:
    # ... existing registrations ...
    from app.providers.tts.my_tts import MyTTSProvider
    register_provider("tts", "my_tts", MyTTSProvider)
```
The `register_provider` function takes three arguments:

| Argument | Type | Description |
|---|---|---|
| `provider_type` | string | `stt`, `llm`, or `tts` |
| `name` | string | Unique identifier used in the API (e.g. `my_tts`) |
| `cls` | `Type[BaseProvider]` | The provider class |
### Step 3: Use the Provider
After registration, the provider is immediately available through the API:
```bash
# Create a provider configuration
curl -X POST \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{
    "type": "tts",
    "name": "my_tts",
    "api_key": "your-api-key-here",
    "model_id": "model-hd",
    "extra_config": { "voice_id": "custom-voice" }
  }' \
  https://api.usevocals.com/api/v1/providers

# Test the provider
curl -X POST \
  -H "Authorization: Bearer $JWT" \
  https://api.usevocals.com/api/v1/providers/{provider_id}/test

# List available models
curl -H "Authorization: Bearer $JWT" \
  https://api.usevocals.com/api/v1/providers/{provider_id}/models
```
Then assign the provider to an agent:
```bash
curl -X PUT \
  -H "Authorization: Bearer $JWT" \
  -H "Content-Type: application/json" \
  -d '{"active_tts_provider_id": "provider-uuid-here"}' \
  https://api.usevocals.com/api/v1/agents/{agent_id}
```
## Provider Registry
The registry (backend/app/providers/registry.py) maintains three dictionaries -- one per provider type -- mapping names to classes:
```python
_stt_providers: Dict[str, Type[STTProvider]] = {}
_llm_providers: Dict[str, Type[LLMProvider]] = {}
_tts_providers: Dict[str, Type[TTSProvider]] = {}
```
Key functions:
| Function | Description |
|---|---|
| `register_provider(type, name, cls)` | Register a provider class |
| `get_provider(type, name, api_key, config)` | Instantiate a registered provider |
| `list_providers(type)` | List registered provider names for a type |
The `_register_all()` function runs at import time and registers all built-in providers.
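The functions above boil down to a small amount of dictionary bookkeeping. This is a minimal sketch of the pattern, not the actual registry module; `DummyTTS` is a stand-in class for illustration:

```python
from typing import Any, Dict, Optional

# One dict per provider type, mapping name -> class (sketch of the real registry).
_providers: Dict[str, Dict[str, type]] = {"stt": {}, "llm": {}, "tts": {}}

def register_provider(provider_type: str, name: str, cls: type) -> None:
    if provider_type not in _providers:
        raise ValueError(f"Unknown provider type: {provider_type}")
    _providers[provider_type][name] = cls

def get_provider(provider_type: str, name: str, api_key: str,
                 config: Optional[Dict[str, Any]] = None) -> Any:
    try:
        cls = _providers[provider_type][name]
    except KeyError:
        raise ValueError(f"No {provider_type} provider named {name!r}")
    return cls(api_key, config)

def list_providers(provider_type: str) -> list:
    return sorted(_providers[provider_type])

class DummyTTS:  # stand-in for a real TTSProvider subclass
    def __init__(self, api_key, config=None):
        self.api_key = api_key
        self.config = config or {}

register_provider("tts", "dummy", DummyTTS)
# list_providers("tts") -> ["dummy"]
```

Raising `ValueError` for an unknown name means a typo in an API request surfaces as a clear client error rather than a `KeyError` deep in the pipeline.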
## Currently Registered Providers

### STT Providers

| Name | Class | Description |
|---|---|---|
| `deepgram` | `DeepgramSTTProvider` | Real-time streaming via Deepgram Nova |
| `openai` | `OpenAISTTProvider` | OpenAI Whisper API |
| `whisper` | `WhisperSTTProvider` | OpenAI Whisper (local) |
| `elevenlabs` | `ElevenLabsSTTProvider` | ElevenLabs STT |
| `qwen` | `QwenSTTProvider` | Alibaba Qwen STT |
### LLM Providers

| Name | Class | Description |
|---|---|---|
| `openai` | `OpenAILLMProvider` | GPT-4o, GPT-4, GPT-3.5 |
| `claude` | `ClaudeLLMProvider` | Anthropic Claude models |
| `google` | `GoogleLLMProvider` | Google Gemini models |
| `kimi` | `KimiLLMProvider` | Moonshot Kimi models |
### TTS Providers

| Name | Class | Description |
|---|---|---|
| `deepgram` | `DeepgramTTSProvider` | Deepgram Aura TTS |
| `openai` | `OpenAITTSProvider` | OpenAI TTS |
| `elevenlabs` | `ElevenLabsTTSProvider` | ElevenLabs TTS |
| `qwen` | `QwenTTSProvider` | Alibaba Qwen TTS |
| `resemble` | `ResembleTTSProvider` | Resemble AI TTS |
## Audio Format Requirements
All STT providers receive audio as an AsyncIterator[bytes] of PCM 16kHz, 16-bit mono chunks. The orchestrator handles conversion from Twilio's mulaw/8kHz format before passing audio to the STT provider.
All TTS providers must yield audio as PCM 16kHz, 16-bit mono chunks. The orchestrator converts the output back to mulaw/8kHz for Twilio.
If your provider's API uses a different format, perform the conversion inside your provider implementation.
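For intuition, here is what that conversion involves for Twilio's direction (mulaw/8kHz in, PCM/16kHz out), sketched in pure Python. This is illustrative only: the orchestrator already does this for you, and a production implementation would use a proper resampling filter rather than linear interpolation. The decode follows the standard G.711 mu-law expansion:

```python
def ulaw_to_pcm16(data: bytes) -> list:
    """Decode G.711 mu-law bytes to 16-bit linear PCM samples."""
    out = []
    for b in data:
        b = ~b & 0xFF                 # mu-law bytes are stored complemented
        sign = b & 0x80
        exponent = (b >> 4) & 0x07
        mantissa = b & 0x0F
        # 0x84 is the standard G.711 bias added before compression.
        sample = (((mantissa << 3) + 0x84) << exponent) - 0x84
        out.append(-sample if sign else sample)
    return out

def upsample_2x(samples: list) -> list:
    """Naive 8 kHz -> 16 kHz via linear interpolation between samples."""
    out = []
    for i, s in enumerate(samples):
        out.append(s)
        nxt = samples[i + 1] if i + 1 < len(samples) else s
        out.append((s + nxt) // 2)
    return out

# ulaw_to_pcm16(b"\xff") -> [0]   (0xFF encodes silence in mu-law)
```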
## Best Practices
- Streaming: Use streaming APIs where available. The pipeline processes audio in real time, so batch APIs add significant latency.
- Timeout handling: Set reasonable timeouts on HTTP calls. The orchestrator will retry or fall back gracefully on provider failures.
- Error messages: Raise `ValueError` from `validate()` with a clear message (e.g. "Invalid API key", "Model not found"). These messages are returned to the user through the API.
- Model listing: Implement `list_models()` to query the provider API when possible. Fall back to a hardcoded list if the API does not support model enumeration. Return an empty list only as a last resort.
- Config access: Use `self.config` to read provider-specific settings passed via `extra_config` when creating or updating a provider through the API.
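If a provider talks to a flaky upstream API, a backoff wrapper around individual calls is one common pattern. This is a sketch under the assumption that retrying is appropriate for your provider (the orchestrator's own retry behavior may already cover you); `with_retries` and `flaky` are illustrative names:

```python
import asyncio

async def with_retries(call, attempts: int = 3, base_delay: float = 0.5):
    """Retry an async call with exponential backoff; re-raise on final failure."""
    for attempt in range(attempts):
        try:
            return await call()
        except Exception:
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(base_delay * (2 ** attempt))

# Example: a call that fails twice with transient errors, then succeeds.
state = {"calls": 0}

async def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

# asyncio.run(with_retries(flaky, base_delay=0.01)) -> "ok"
```

Keep the total retry budget well under the pipeline's real-time deadline; retrying a TTS call for ten seconds is worse than failing fast and letting the orchestrator fall back.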