Documentation

Everything you need to build with VoxLabs.

Introduction

VoxLabs is a next-generation voice synthesis engine designed for developers who need emotional control and low-latency performance. Unlike traditional TTS, VoxLabs uses advanced neural vocoders to generate speech that captures the subtleties of human expression.

Emotional Intelligence

Control pitch, speed, and energy to convey specific moods.

Zero-Shot Cloning

Clone voices from as little as 30 seconds of reference audio.

Streaming API

Real-time audio generation for conversational AI.

Privacy First

Voice data is encrypted and isolated.

Installation

Install our official Python SDK to get started immediately.

Terminal

pip install voxlabs-sdk

Quick Start

Generate your first audio file in less than 5 minutes.

python

from voxlabs import VoxClient

client = VoxClient(api_key="your_api_key")

# Generate speech
audio = client.tts.create(
    text="Hello! I can speak with real emotion.",
    emotion="happy",
    voice_id="en-us-1"
)

# Save to file
audio.save("hello.mp3")

Authentication

All API requests must include your API key in the Authorization header.

Authorization: Bearer YOUR_API_KEY
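As a minimal sketch, the header can be built in one place so the key never has to be hardcoded next to each request. The environment variable name VOXLABS_API_KEY and the helper auth_headers are assumptions made for this example, not part of the documented SDK.

```python
import os


def auth_headers(api_key=None):
    """Return the Authorization header, reading the key from the environment if not given."""
    # VOXLABS_API_KEY is a hypothetical variable name chosen for this example.
    key = api_key or os.environ.get("VOXLABS_API_KEY")
    if not key:
        raise ValueError("No API key provided")
    return {"Authorization": f"Bearer {key}"}


print(auth_headers("YOUR_API_KEY"))  # {'Authorization': 'Bearer YOUR_API_KEY'}
```

The returned dictionary can be passed as the headers argument of whichever HTTP client you use.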

Endpoints

POST  /v1/tts            Generate speech from text.

GET   /v1/voices         List available voices.

POST  /v1/voices/clone   Create a custom voice clone.
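For readers calling the endpoints above without the SDK, the following sketch builds (but does not send) a request using only the standard library. The host in BASE_URL is a placeholder, and the JSON body with the fields text, emotion, and voice_id is an assumption carried over from the Quick Start parameters; the actual request schema, including how audio is uploaded to the clone endpoint, is not documented here.

```python
import json
import urllib.request

BASE_URL = "https://api.voxlabs.example"  # hypothetical host, not from the docs

# Method and path for each endpoint listed above.
ENDPOINTS = {
    "tts": ("POST", "/v1/tts"),
    "list_voices": ("GET", "/v1/voices"),
    "clone_voice": ("POST", "/v1/voices/clone"),
}


def make_request(name, api_key, payload=None):
    """Build (but do not send) a request for one of the named endpoints."""
    method, path = ENDPOINTS[name]
    headers = {"Authorization": f"Bearer {api_key}"}
    data = None
    if payload is not None:
        # Assumption: request bodies are JSON.
        data = json.dumps(payload).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        BASE_URL + path, data=data, method=method, headers=headers
    )


req = make_request(
    "tts",
    "YOUR_API_KEY",
    {"text": "Hello!", "emotion": "happy", "voice_id": "en-us-1"},
)
print(req.get_method(), req.full_url)
```

Passing the result to urllib.request.urlopen would send it; that step is left out because the host is a placeholder.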

Error Handling

VoxLabs uses standard HTTP status codes to indicate the success or failure of requests.

400  Bad Request          Invalid input parameters or audio format.

401  Unauthorized         Missing or invalid API key.

429  Rate Limit Exceeded  You have made too many requests. Please wait.
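One way to act on these codes is to retry only the 429 case, since a 400 or 401 will fail again until the request or key is fixed. The sketch below assumes a hypothetical ApiError exception that carries the HTTP status; the SDK's real exception types are not documented here, and the exponential backoff schedule is a common convention rather than a documented VoxLabs requirement.

```python
import time


class ApiError(Exception):
    """Hypothetical error type carrying the HTTP status of a failed request."""

    def __init__(self, status, message=""):
        super().__init__(f"HTTP {status}: {message}")
        self.status = status


def call_with_retry(send, max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Call send(); retry on 429 with exponential backoff, re-raise anything else."""
    for attempt in range(max_attempts):
        try:
            return send()
        except ApiError as err:
            # 400 and 401 will not succeed on retry, so only 429 is retried.
            if err.status != 429 or attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # waits 1 s, 2 s, 4 s, ...
```

Here send is any zero-argument callable that performs the request and raises ApiError on failure; the sleep parameter exists so the delay can be replaced in tests.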

Voice Cloning Guide

Our cloning engine requires high-quality audio samples to create accurate voice models. For best results, upload at least 30 seconds of clean speech without background noise.

Requirements

• Format: WAV or MP3, sample rate 44.1 kHz or higher
• Duration: 30 seconds to 5 minutes
• Content: Natural, flowing speech
• Noise: Minimal background noise/reverb
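The format and duration requirements above can be checked locally before uploading. This sketch uses Python's standard wave module, so it covers WAV files only; MP3 inspection would need a third-party library, and content and noise quality cannot be judged this way. The function name check_wav_sample is chosen for this example.

```python
import wave

MIN_SAMPLE_RATE_HZ = 44_100
MIN_DURATION_S = 30
MAX_DURATION_S = 5 * 60


def check_wav_sample(path):
    """Return a list of problems with a WAV reference sample (empty if none found)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate
    problems = []
    if rate < MIN_SAMPLE_RATE_HZ:
        problems.append(f"sample rate {rate} Hz is below 44.1 kHz")
    if not MIN_DURATION_S <= duration <= MAX_DURATION_S:
        problems.append(f"duration {duration:.1f} s is outside 30 s to 5 min")
    return problems
```

An empty list means the file meets the sample rate and duration limits; any other result lists what to fix before uploading.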

Emotion Control

You can direct the emotional tone of the generated speech using the emotion parameter.

Happy

Sad

Angry

Fearful

Excited

Whisper

Shout

Neutral
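Validating the emotion value on the client side catches typos before a request is sent. The sketch below assumes the parameter takes the lowercase form of each name listed above, as with emotion="happy" in the Quick Start; the exact accepted strings for the other emotions are an assumption.

```python
# Lowercase forms of the emotions listed above (assumed spelling, see lead-in).
EMOTIONS = ("happy", "sad", "angry", "fearful", "excited", "whisper", "shout", "neutral")


def normalize_emotion(name):
    """Lowercase an emotion name and check it against the listed presets."""
    value = name.strip().lower()
    if value not in EMOTIONS:
        raise ValueError(
            f"Unknown emotion {name!r}; expected one of: {', '.join(EMOTIONS)}"
        )
    return value


print(normalize_emotion("Whisper"))  # whisper
```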

Best Practices

Cache Generated Audio

TTS generation is computationally expensive. Always cache results for identical text/voice combinations.
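A simple on-disk cache might look like the sketch below. The synthesize argument is a hypothetical callable that returns audio bytes (for example, a thin wrapper around the SDK call), and the cache key also includes the emotion, on the assumption that it changes the generated audio.

```python
import hashlib
from pathlib import Path


def cached_tts(synthesize, text, voice_id, emotion="neutral", cache_dir="tts_cache"):
    """Return audio bytes for the request, calling synthesize() only on a cache miss."""
    # The key covers every input that changes the audio, including emotion.
    raw_key = "\x1f".join((voice_id, emotion, text)).encode("utf-8")
    path = Path(cache_dir) / f"{hashlib.sha256(raw_key).hexdigest()}.mp3"
    if path.exists():
        return path.read_bytes()
    audio = synthesize(text=text, voice_id=voice_id, emotion=emotion)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_bytes(audio)
    return audio
```

Hashing the inputs keeps file names short and safe regardless of what characters the text contains.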

Streaming for Long Text

For long paragraphs, use the streaming endpoint to play audio immediately as it's generated, reducing perceived latency.
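The streaming endpoint's interface is not described here, so the following is only a sketch of the consuming side: it assumes the response can be iterated as byte chunks and hands each chunk to a sink (a player buffer or file writer) as soon as it arrives. The names stream_to_sink and fake_stream are invented for this example.

```python
def stream_to_sink(chunks, sink):
    """Pass each audio chunk to sink() as soon as it arrives; return total bytes handled."""
    total = 0
    for chunk in chunks:
        if chunk:  # skip empty chunks
            sink(chunk)
            total += len(chunk)
    return total


def fake_stream():
    """Stand-in for a streaming response; a real one would yield chunks over the network."""
    yield b"\x00" * 1024
    yield b""
    yield b"\x00" * 512


received = bytearray()
print(stream_to_sink(fake_stream(), received.extend))  # 1536
```

Because the sink is called per chunk, playback or writing can begin before the full audio has been generated.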