How to use Streaming TTS with websockets

Real-time Text to Speech Synthesis

The WavesStreamingTTS class converts text to speech as a stream of audio chunks, with configurable streaming parameters. It is optimized for low-latency use cases where audio must start playing immediately, such as voice assistants, live narration, and other interactive applications.

Configuration Setup

The streaming TTS system uses a TTSConfig object to manage synthesis parameters:

from smallestai.waves import TTSConfig, WavesStreamingTTS

config = TTSConfig(
    voice_id="aditi",
    api_key="YOUR_SMALLEST_API_KEY",
    sample_rate=24000,
    speed=1.0,
    max_buffer_flush_ms=100
)

streaming_tts = WavesStreamingTTS(config)

Basic Text Synthesis

For straightforward text-to-speech conversion, use the synthesize method:

text = "Hello world, this is a test of the Smallest AI streaming TTS SDK."
audio_chunks = []

for chunk in streaming_tts.synthesize(text):
    audio_chunks.append(chunk)
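
Since the class targets low-latency use, it can be useful to measure how long the first chunk takes to arrive. The following is a minimal sketch, not part of the SDK: timed_chunks is a hypothetical helper that wraps any iterable of bytes, and fake_synthesize is a stand-in for streaming_tts.synthesize(text) so the example runs without an API key.

```python
import time
from typing import Iterable, Iterator, List


def timed_chunks(chunks: Iterable[bytes], arrival_times: List[float]) -> Iterator[bytes]:
    """Yield chunks unchanged and record when each one arrived.

    Times are seconds since iteration began and are appended to arrival_times.
    """
    start = time.perf_counter()
    for chunk in chunks:
        arrival_times.append(time.perf_counter() - start)
        yield chunk


def fake_synthesize() -> Iterator[bytes]:
    # Stand-in for streaming_tts.synthesize(text): three small chunks of silence.
    for _ in range(3):
        time.sleep(0.01)
        yield b"\x00\x00" * 240


arrival_times: List[float] = []
audio_chunks = list(timed_chunks(fake_synthesize(), arrival_times))
print(f"first chunk after {arrival_times[0] * 1000:.1f} ms, "
      f"{len(audio_chunks)} chunks total")
```

With the real SDK, the same wrapper would be applied as timed_chunks(streaming_tts.synthesize(text), arrival_times).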

Streaming Text Input

For real-time applications where text arrives incrementally, use synthesize_streaming:

def text_stream():
    text = "Streaming synthesis with chunked text input for Smallest SDK."
    for word in text.split():
        yield word + " "

audio_chunks = []
for chunk in streaming_tts.synthesize_streaming(text_stream()):
    audio_chunks.append(chunk)
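
When text arrives as very small fragments (for example, token by token from a language model), one option is to group it into sentence-sized pieces before handing it to synthesize_streaming. The sketch below assumes that is desirable; the source does not say how the SDK buffers text internally, so treat it as optional. group_into_sentences is a hypothetical helper, not an SDK function.

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def group_into_sentences(tokens: Iterable[str]) -> Iterator[str]:
    """Accumulate text fragments and yield them one sentence at a time.

    Whitespace is preserved, so joining the output reproduces the input.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer
            buffer = ""
    if buffer:
        # Flush whatever is left when the input ends mid-sentence.
        yield buffer


def word_stream() -> Iterator[str]:
    text = "Hello there. This arrives word by word! Trailing fragment"
    for word in text.split():
        yield word + " "


sentences = list(group_into_sentences(word_stream()))
print(sentences)
```

The grouped generator can then be passed in place of the raw one, as in streaming_tts.synthesize_streaming(group_into_sentences(text_stream())).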

Saving Audio to WAV File

Convert the raw PCM audio chunks to a standard WAV file:

import wave

def save_audio_chunks_to_wav(audio_chunks, filename="output.wav", sample_rate=24000):
    # sample_rate must match the sample_rate set in TTSConfig
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)              # Mono
        wf.setsampwidth(2)              # 16-bit
        wf.setframerate(sample_rate)    # 24 kHz by default
        wf.writeframes(b''.join(audio_chunks))

text = "Your text to synthesize here."
audio_chunks = list(streaming_tts.synthesize(text))
save_audio_chunks_to_wav(audio_chunks, "speech_output.wav")
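
A quick sanity check on the saved file is to compare its duration with the duration implied by the byte count: for 16-bit mono PCM, seconds = total bytes / (sample rate × 2). The sketch below uses only the standard library and generated silence as a stand-in for SDK output; pcm_duration_seconds is a hypothetical helper.

```python
import os
import tempfile
import wave

SAMPLE_RATE = 24000   # Hz, matches the config used in this guide
SAMPLE_WIDTH = 2      # bytes per sample (16-bit)
CHANNELS = 1          # mono


def pcm_duration_seconds(audio_chunks, sample_rate=SAMPLE_RATE,
                         sample_width=SAMPLE_WIDTH, channels=CHANNELS):
    """Duration implied by the total number of raw PCM bytes."""
    total_bytes = sum(len(chunk) for chunk in audio_chunks)
    return total_bytes / (sample_rate * sample_width * channels)


# Stand-in for SDK output: 5 chunks of 2400 silent samples each (0.5 s total).
audio_chunks = [b"\x00\x00" * 2400 for _ in range(5)]
expected = pcm_duration_seconds(audio_chunks)

path = os.path.join(tempfile.mkdtemp(), "check.wav")
with wave.open(path, "wb") as wf:
    wf.setnchannels(CHANNELS)
    wf.setsampwidth(SAMPLE_WIDTH)
    wf.setframerate(SAMPLE_RATE)
    wf.writeframes(b"".join(audio_chunks))

# Read the file back and compute its duration from the header.
with wave.open(path, "rb") as wf:
    from_file = wf.getnframes() / wf.getframerate()

print(expected, from_file)  # 0.5 0.5
```

If the two values disagree, the sample rate, sample width, or channel count written to the header probably does not match the audio the SDK produced.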

Configuration Parameters

voice_id: Voice identifier (e.g., “aditi”, “male-1”, “female-2”)
api_key: Your Smallest AI API key
language: Language code for synthesis (default: “en”)
sample_rate: Audio sample rate in Hz (default: 24000)
speed: Speech speed multiplier (default: 1.0 for normal speed; 0.5 is half speed, 2.0 is double speed)
consistency: Voice consistency parameter (default: 0.5, range: 0.0-1.0)
enhancement: Audio enhancement level (default: 1)
similarity: Voice similarity parameter (default: 0, range: 0.0-1.0)
max_buffer_flush_ms: Maximum buffer time in milliseconds before forcing audio output (default: 0)
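
The documented ranges can be checked before a config is built. The following is a minimal sketch under stated assumptions: validate_tts_params is a hypothetical helper (not part of the SDK), it checks only the ranges listed above, and the positivity checks on speed, sample_rate, and max_buffer_flush_ms are assumptions rather than documented limits.

```python
def validate_tts_params(params: dict) -> list:
    """Return a list of problems found; an empty list means nothing looked wrong."""
    problems = []

    # Documented ranges: consistency and similarity are 0.0-1.0.
    for name in ("consistency", "similarity"):
        value = params.get(name)
        if value is not None and not 0.0 <= value <= 1.0:
            problems.append(f"{name} must be between 0.0 and 1.0, got {value}")

    # Assumed (not documented): these values only make sense when positive.
    for name in ("speed", "sample_rate"):
        value = params.get(name)
        if value is not None and value <= 0:
            problems.append(f"{name} must be positive, got {value}")

    # Assumed (not documented): a flush interval cannot be negative.
    flush = params.get("max_buffer_flush_ms")
    if flush is not None and flush < 0:
        problems.append(f"max_buffer_flush_ms must be >= 0, got {flush}")

    return problems


good = {"voice_id": "aditi", "speed": 1.0, "consistency": 0.5,
        "similarity": 0, "sample_rate": 24000, "max_buffer_flush_ms": 100}
bad = {"voice_id": "aditi", "speed": 0, "consistency": 1.5}

print(validate_tts_params(good))
print(validate_tts_params(bad))
```

A dictionary that passes could then be unpacked into the config, as in TTSConfig(api_key=..., **good), assuming TTSConfig accepts these names as keyword arguments as the setup example suggests.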

Output Format

The streaming TTS returns raw PCM audio data as bytes objects; the WAV example above writes them as 16-bit mono at the configured sample rate. Each chunk represents a portion of the synthesized audio that can be:

Played directly through audio hardware
Saved to audio files (WAV directly; compressed formats such as MP3 require a separate encoder)
Streamed over network protocols
Processed with additional audio effects

Because the audio is raw PCM, chunks need no decoding before use, which keeps latency low and leaves the choice of playback, storage, or transport format to the application.
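
As an illustration of processing chunks directly, the sketch below applies a volume gain to a chunk. It assumes 16-bit signed little-endian mono samples, which is what the WAV example in this guide implies (WAV files store 16-bit PCM little-endian and the chunks are written unmodified). apply_gain is a hypothetical helper, not an SDK function.

```python
import struct


def apply_gain(chunk: bytes, gain: float) -> bytes:
    """Scale 16-bit signed little-endian PCM samples, clipping to the valid range."""
    count = len(chunk) // 2
    # A trailing odd byte (half a sample) is dropped rather than guessed at.
    samples = struct.unpack(f"<{count}h", chunk[:count * 2])
    scaled = [max(-32768, min(32767, int(sample * gain))) for sample in samples]
    return struct.pack(f"<{count}h", *scaled)


# Stand-in for one chunk of SDK output: four samples.
chunk = struct.pack("<4h", 1000, -1000, 30000, -30000)
louder = apply_gain(chunk, 2.0)
print(struct.unpack("<4h", louder))  # (2000, -2000, 32767, -32768)
```

Because each chunk is processed independently, the same function can be applied inside the streaming loop, for example audio_chunks.append(apply_gain(chunk, 0.8)), provided chunk boundaries fall on whole samples.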
