Real-time Text to Speech Synthesis

The WavesStreamingTTS class converts text to speech and returns the audio as a stream of chunks, with configurable streaming parameters. It is intended for low-latency use cases where audio must start playing quickly, such as voice assistants, live narration, and other interactive applications.

Configuration Setup

The streaming TTS system uses a TTSConfig object to manage synthesis parameters:
from smallestai.waves import TTSConfig, WavesStreamingTTS

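config = TTSConfig(
    voice_id="aditi",
    api_key="YOUR_SMALLEST_API_KEY",  # replace with your own key
    sample_rate=24000,                # output sample rate in Hz
    speed=1.0,                        # 1.0 = normal speed
    max_buffer_flush_ms=100           # force audio output after at most 100 ms of buffering
)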

streaming_tts = WavesStreamingTTS(config)

Basic Text Synthesis

For straightforward text-to-speech conversion, use the synthesize method:
text = "Hello world, this is a test of the Smallest AI streaming TTS SDK."
audio_chunks = []

for chunk in streaming_tts.synthesize(text):
    audio_chunks.append(chunk)

Streaming Text Input

For real-time applications where text arrives incrementally, use synthesize_streaming:
def text_stream():
    text = "Streaming synthesis with chunked text input for Smallest SDK."
    for word in text.split():
        yield word + " "

audio_chunks = []
for chunk in streaming_tts.synthesize_streaming(text_stream()):
    audio_chunks.append(chunk)
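
When the text comes from an upstream source such as a language model, fragments often arrive as partial words or tokens. The sketch below groups fragments into sentence-sized pieces before they are passed on. `sentence_chunks` is a hypothetical helper, not an SDK function, and whether sentence-level grouping changes the output of synthesize_streaming compared with word-level input is an untested assumption.

```python
from typing import Iterable, Iterator


def sentence_chunks(tokens: Iterable[str], terminators: str = ".!?") -> Iterator[str]:
    """Group incoming text fragments into sentence-sized pieces.

    Hypothetical helper for illustration; it is not provided by the SDK.
    """
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit once the buffered text ends with a sentence terminator,
        # ignoring any trailing whitespace.
        if buffer.rstrip().endswith(tuple(terminators)):
            yield buffer
            buffer = ""
    # Flush whatever is left when the source ends mid-sentence.
    if buffer.strip():
        yield buffer
```

The generator could then wrap any text source, for example `streaming_tts.synthesize_streaming(sentence_chunks(text_stream()))`.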

Saving Audio to WAV File

Convert the raw PCM audio chunks to a standard WAV file:
import wave

def save_audio_chunks_to_wav(audio_chunks, filename="output.wav", sample_rate=24000):
    # sample_rate must match the sample_rate set in TTSConfig
    with wave.open(filename, 'wb') as wf:
        wf.setnchannels(1)              # Mono
        wf.setsampwidth(2)              # 16-bit samples
        wf.setframerate(sample_rate)    # 24 kHz by default
        wf.writeframes(b''.join(audio_chunks))

text = "Your text to synthesize here."
audio_chunks = list(streaming_tts.synthesize(text))
save_audio_chunks_to_wav(audio_chunks, "speech_output.wav")
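
If the WAV data should be sent over a network or returned from a web handler rather than written to disk, the same container can be built in memory. This is a sketch using only the standard library. `pcm_chunks_to_wav_bytes` is a hypothetical helper, and the 16-bit mono defaults are assumptions carried over from the file example above.

```python
import io
import wave


def pcm_chunks_to_wav_bytes(audio_chunks, sample_rate=24000, channels=1, sample_width=2):
    """Wrap raw PCM chunks in a WAV container held in memory.

    Hypothetical helper; defaults assume 16-bit mono audio at 24 kHz.
    """
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wf:
        wf.setnchannels(channels)
        wf.setsampwidth(sample_width)   # bytes per sample (2 = 16-bit)
        wf.setframerate(sample_rate)
        wf.writeframes(b"".join(audio_chunks))
    # The wave module leaves a caller-supplied buffer open, so it can still be read.
    return buffer.getvalue()
```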

Configuration Parameters

  • voice_id: Voice identifier (e.g., “aditi”, “male-1”, “female-2”)
  • api_key: Your Smallest AI API key
  • language: Language code for synthesis (default: “en”)
  • sample_rate: Audio sample rate in Hz (default: 24000)
  • speed: Speech speed multiplier (default: 1.0; 1.0 = normal speed, 0.5 = half speed, 2.0 = double speed)
  • consistency: Voice consistency parameter (default: 0.5, range: 0.0-1.0)
  • enhancement: Audio enhancement level (default: 1)
  • similarity: Voice similarity parameter (default: 0.0, range: 0.0-1.0)
  • max_buffer_flush_ms: Maximum buffer time in milliseconds before forcing audio output (default: 0)
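
The documented ranges can be checked before a configuration is built. The sketch below is illustrative only: `check_tts_params` is a hypothetical function, the 0.0-1.0 bounds come from the list above, and the rule that speed must be positive is an assumption, since no range for speed is documented here.

```python
def check_tts_params(speed=1.0, consistency=0.5, similarity=0.0):
    """Return a list of problems found in the given parameter values.

    Hypothetical helper; the SDK may perform its own validation.
    """
    problems = []
    # Assumption: a speed multiplier of zero or less is not meaningful.
    if speed <= 0:
        problems.append("speed must be positive")
    # Ranges for these two parameters are taken from the parameter list.
    for name, value in (("consistency", consistency), ("similarity", similarity)):
        if not 0.0 <= value <= 1.0:
            problems.append(f"{name} must be between 0.0 and 1.0")
    return problems
```

An empty list means the values fall inside the ranges listed above.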

Output Format

The streaming TTS returns raw PCM audio data as bytes objects. Each chunk represents a portion of the synthesized audio that can be:
  • Played directly through audio hardware
  • Saved to audio files (WAV directly; compressed formats such as MP3 require a separate encoder)
  • Streamed over network protocols
  • Processed with additional audio effects
Raw PCM needs no decoding before playback or processing, which helps keep latency low. It carries no header, so the consumer must already know the sample rate, sample width, and channel count.
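
Because raw PCM has a fixed byte rate, the playback duration of any number of bytes follows from simple arithmetic: bytes divided by (sample rate × channels × bytes per sample). The sketch below assumes 16-bit mono audio, as in the WAV example in this document. `pcm_duration_seconds` is a hypothetical helper, not an SDK function.

```python
def pcm_duration_seconds(num_bytes, sample_rate=24000, channels=1, sample_width=2):
    """Return the playback duration of raw PCM data in seconds.

    Hypothetical helper; defaults assume 16-bit mono audio at 24 kHz.
    """
    bytes_per_second = sample_rate * channels * sample_width
    return num_bytes / bytes_per_second
```

With those defaults, one second of audio occupies 48000 bytes, so `pcm_duration_seconds(sum(len(c) for c in audio_chunks))` gives the total length of the collected chunks.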