Real-time Text-to-Speech Synthesis
The `WavesStreamingTTS` class provides high-performance text-to-speech conversion with configurable streaming parameters. This implementation is optimized for low-latency applications where immediate audio feedback is critical, such as voice assistants, live narration, or interactive applications.
Configuration Setup
The streaming TTS system uses a `TTSConfig` object to manage synthesis parameters:
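A sketch of setting this up; the import path below is illustrative, and the keyword arguments assume `TTSConfig` mirrors the parameters documented under Configuration Parameters:

```python
# Illustrative import path -- adjust to match the installed SDK.
from waves_tts import TTSConfig, WavesStreamingTTS

config = TTSConfig(
    api_key="YOUR_API_KEY",   # Smallest AI API key
    voice_id="aditi",         # voice identifier
    language="en",            # language code (default "en")
    sample_rate=24000,        # output sample rate in Hz
    speed=1.0,                # 1.0 = normal speed
    consistency=0.5,          # voice consistency, 0.0-1.0
    enhancement=1,            # audio enhancement level
    similarity=0.0,           # voice similarity, 0.0-1.0
    max_buffer_flush_ms=0,    # flush audio buffer immediately
)

tts = WavesStreamingTTS(config)
```

All parameters except `api_key` and `voice_id` have the defaults listed below, so they can usually be omitted.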
Basic Text Synthesis
For straightforward text-to-speech conversion, use the `synthesize` method:
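A hedged sketch, assuming `synthesize` takes a text string and returns the finished utterance as raw PCM bytes (the import path and exact signature may differ in your SDK version):

```python
from waves_tts import TTSConfig, WavesStreamingTTS  # illustrative import path

tts = WavesStreamingTTS(TTSConfig(api_key="YOUR_API_KEY", voice_id="aditi"))

# Assumed to block until the whole utterance has been synthesized.
audio_bytes = tts.synthesize("Welcome to real-time speech synthesis.")

with open("output.pcm", "wb") as f:
    f.write(audio_bytes)  # raw PCM, not yet a playable container
```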
Streaming Text Input
For real-time applications where text arrives incrementally, use `synthesize_streaming`:
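A hedged sketch, assuming `synthesize_streaming` consumes an iterator of text fragments and yields raw PCM chunks as each piece of audio becomes ready, rather than waiting for the full text (import path and signature are illustrative):

```python
from waves_tts import TTSConfig, WavesStreamingTTS  # illustrative import path

tts = WavesStreamingTTS(TTSConfig(api_key="YOUR_API_KEY", voice_id="aditi"))

def incoming_text():
    # Stand-in for text arriving incrementally, e.g. tokens from an LLM.
    yield "Hello, "
    yield "and welcome to "
    yield "streaming synthesis."

pcm_chunks = []
for chunk in tts.synthesize_streaming(incoming_text()):
    pcm_chunks.append(chunk)  # or feed each chunk straight to an audio device
```

In a live system the loop body would write each chunk to the audio output as it arrives, which is what keeps perceived latency low.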
Saving Audio to WAV File
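This conversion can be sketched with Python's standard `wave` module, assuming the chunks are 16-bit mono PCM at the default 24000 Hz sample rate (the silent placeholder bytes below stand in for real TTS output):

```python
import wave

def save_wav(chunks, path, sample_rate=24000):
    """Write raw 16-bit mono PCM chunks to a WAV file."""
    with wave.open(path, "wb") as wf:
        wf.setnchannels(1)       # mono
        wf.setsampwidth(2)       # 16-bit samples = 2 bytes each
        wf.setframerate(sample_rate)
        for chunk in chunks:
            wf.writeframes(chunk)

# Placeholder: ten chunks of 1200 silent samples each.
save_wav([b"\x00\x00" * 1200] * 10, "output.wav")
```

The `sample_rate` passed here must match the one used at synthesis time, or playback will be pitch-shifted.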
Convert the raw PCM audio chunks to a standard WAV file.
Configuration Parameters
- `voice_id`: Voice identifier (e.g., "aditi", "male-1", "female-2")
- `api_key`: Your Smallest AI API key
- `language`: Language code for synthesis (default: "en")
- `sample_rate`: Audio sample rate in Hz (default: 24000)
- `speed`: Speech speed multiplier (default: 1.0 = normal speed, 0.5 = half speed, 2.0 = double speed)
- `consistency`: Voice consistency parameter (default: 0.5, range: 0.0-1.0)
- `enhancement`: Audio enhancement level (default: 1)
- `similarity`: Voice similarity parameter (default: 0, range: 0.0-1.0)
- `max_buffer_flush_ms`: Maximum buffer time in milliseconds before forcing audio output (default: 0)
Output Format
The streaming TTS returns raw PCM audio data as `bytes` objects. Each chunk represents a portion of the synthesized audio that can be:
- Played directly through audio hardware
- Saved to audio files (WAV, MP3, etc.)
- Streamed over network protocols
- Processed with additional audio effects
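As one example of the last point, chunks can be manipulated directly as 16-bit samples. A sketch of a simple gain (volume) adjustment using only the standard library, assuming the chunks are 16-bit little-endian PCM (matching native byte order on most platforms):

```python
import array

def scale_volume(chunk: bytes, gain: float) -> bytes:
    """Apply a gain factor to a chunk of 16-bit PCM audio."""
    samples = array.array("h")   # signed 16-bit, native byte order
    samples.frombytes(chunk)
    for i, s in enumerate(samples):
        # Clamp to the valid 16-bit range to avoid wrap-around distortion.
        samples[i] = max(-32768, min(32767, int(s * gain)))
    return samples.tobytes()

# Halve the volume of two samples (16 and -16 on little-endian platforms).
quiet = scale_volume(b"\x10\x00\xf0\xff", 0.5)
```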