Lightning v3.1

Lightning v3.1 (latest release) is a high-fidelity, low-latency text-to-speech model that produces natural, expressive speech at 44.1 kHz. It is optimized for real-time applications, supports voice cloning, and delivers broadcast-quality audio with conversational characteristics. It now covers 15 languages with automatic language detection and code-switching.

44.1 kHz: native sample rate
100ms: latency at 20 concurrent requests
15 languages: automatic detection and code-switching
3.3x faster than playback (real-time factor 0.3)

Model Overview


Developed by	Smallest AI
Model type	Text-to-Speech / Speech Synthesis
Languages	15 (auto-detection + code-switching)
License	Proprietary
Version	v3.1
Native sample rate	44,100 Hz

Key Capabilities

Real-Time Optimized

Ultra-low latency architecture designed for conversational AI and live streaming.

Voice Cloning

Instant voice cloning with just 5-15 seconds of audio via API and console.

Streaming

HTTP, SSE, and WebSocket support for real-time applications.

Multi-Language

15 languages with automatic detection and code-switching. No restarts or reconnections needed.

High Fidelity

Broadcast-quality 44.1 kHz audio with natural prosody, intonation, and conversational rhythm.

Pronunciation Control

Custom pronunciation dictionaries for specialized vocabulary, brand names, and domain-specific terms.

Performance & Benchmarks

Audio Generation Evaluation

Full-sentence audio generation. The entire text is synthesized in a single pass and then evaluated.

Evaluation: Seed TTS dataset, 1,088 English samples. LLM-as-a-Judge framework.

Lightning v3.1

Category	Metric	Score	Notes
Audio Quality	WVMOS	5.06	Broadcast-quality audio
	Naturalness	4.33	Predominantly human-like
	Overall Quality	4.42	Premium-tier experience
	Native Sample Rate	44.1 kHz	Highest fidelity among Lightning models
Intelligibility	Word Error Rate (WER)	6.3%	93.7% word accuracy
	Character Error Rate (CER)	1.6%	Excellent character-level accuracy
Latency & Speed	Latency	100ms	At 20 concurrent requests
	Real-Time Factor (RTF)	0.3	3.3x faster than playback
	Speed Control	0.5x - 2.0x	Adjustable playback speed
	Max Chunk Size	250 chars	Optimal: 140 characters per request
Prosody	Pronunciation	4.70 / 5.0	Near-perfect articulation
	Intonation	4.71 / 5.0	Highly expressive pitch variation
	Prosody	4.47 / 5.0	Natural conversational rhythm

ElevenLabs Turbo 2.5

Category	Metric	Score
Audio Quality	WVMOS	4.64
	Naturalness	4.36
	Overall Quality	4.50
Intelligibility	Word Error Rate (WER)	5.93%
	Character Error Rate (CER)	1.47%
Latency & Speed	Latency	250-300ms
Prosody	Pronunciation	4.83
	Intonation	4.81
	Prosody	4.5
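
The derived figures in the notes column follow from simple arithmetic. The sketch below assumes the usual definition of real-time factor (synthesis time divided by audio duration) and treats word accuracy as 100% minus WER; the function names are illustrative and not part of any Smallest AI SDK.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Speed relative to playback, assuming RTF = synthesis time / audio duration."""
    if rtf <= 0:
        raise ValueError("RTF must be positive")
    return 1.0 / rtf


def word_accuracy(wer_percent: float) -> float:
    """Word accuracy in percent, taken as the complement of the word error rate."""
    return 100.0 - wer_percent


print(f"{speedup_from_rtf(0.3):.1f}x faster than playback")  # 3.3x faster than playback
print(f"{word_accuracy(6.3):.1f}% word accuracy")            # 93.7% word accuracy
```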

Agent Call Evaluation

Chunk-by-chunk audio generation. Simulates real-world voice agent behavior where text is streamed and synthesized incrementally, as it happens during live calls.

Evaluation: Seed TTS dataset, 1,088 English samples. LLM-as-a-Judge framework.

Lightning v3.1

Category	Metric	Score	Notes
Audio Quality	MOS	3.89	Highest among all models tested
	Audio Quality	3.80	Broadcast-quality audio
	Overall Naturalness	3.33	Most natural-sounding output
	Naturalness	2.67	-
Intelligibility	Word Error Rate (WER)	5.38%	94.6% word accuracy
	Character Error Rate (CER)	1.54%	Excellent character-level accuracy
Prosody	Pronunciation	3.80	Near-perfect articulation
	Intonation	3.33	Expressive pitch variation
	Prosody	3.07	Natural conversational rhythm

OpenAI

Category	Metric	Score
Audio Quality	MOS	3.73
	Audio Quality	3.79
	Overall Naturalness	3.14
	Naturalness	2.45
Intelligibility	Word Error Rate (WER)	6.00%
	Character Error Rate (CER)	1.43%
Prosody	Pronunciation	3.66
	Intonation	3.04
	Prosody	2.76

Cartesia Sonic 3

Category	Metric	Score
Audio Quality	MOS	3.66
	Audio Quality	3.80
	Overall Naturalness	3.25
	Naturalness	2.67
Intelligibility	Word Error Rate (WER)	5.29%
	Character Error Rate (CER)	1.84%
Prosody	Pronunciation	3.73
	Intonation	3.07
	Prosody	3.00

ElevenLabs Turbo 2.5

Category	Metric	Score
Audio Quality	MOS	3.75
	Audio Quality	3.73
	Overall Naturalness	3.20
	Naturalness	2.47
Intelligibility	Word Error Rate (WER)	6.14%
	Character Error Rate (CER)	1.66%
Prosody	Pronunciation	3.67
	Intonation	3.20
	Prosody	2.93

Want to reproduce these results? See the TTS evaluation script to measure TTFB and synthesis quality in your own environment.

Supported Languages

Automatic Language Detection & Code-Switching: Set language to "auto" (default) and Lightning v3.1 will automatically detect the language from input text. The model also supports code-switching within a single session without requiring a restart or reconnection.

Language	Code	Status
English	`en`	Available
Spanish	`es`	Available
Hindi	`hi`	Available
Tamil	`ta`	Available
Kannada	`kn`	Available
Telugu	`te`	Available
Malayalam	`ml`	Available
Marathi	`mr`	Available
Gujarati	`gu`	Available
French	`fr`	Available (Beta)
Italian	`it`	Available (Beta)
Dutch	`nl`	Available (Beta)
Swedish	`sv`	Available (Beta)
Portuguese	`pt`	Available (Beta)
German	`de`	Available (Beta)
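
A client may want to check a language code against the table above before sending a request. The following is a minimal sketch of such a check; the `LANGUAGE_STATUS` mapping and the `language_status` helper are hypothetical names, and the statuses are copied from the table rather than fetched from the API.

```python
# Language codes and statuses copied from the table above (not queried from the API).
LANGUAGE_STATUS = {
    "en": "Available", "es": "Available", "hi": "Available",
    "ta": "Available", "kn": "Available", "te": "Available",
    "ml": "Available", "mr": "Available", "gu": "Available",
    "fr": "Beta", "it": "Beta", "nl": "Beta",
    "sv": "Beta", "pt": "Beta", "de": "Beta",
}


def language_status(code: str = "auto") -> str:
    """Return the status for a language code, or 'auto' when detection is requested."""
    if code == "auto":
        return "auto"
    if code not in LANGUAGE_STATUS:
        raise ValueError(f"unsupported language code: {code!r}")
    return LANGUAGE_STATUS[code]


print(language_status("hi"))  # Available
print(language_status("fr"))  # Beta
```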

Voice Catalog

English (US) — Best Voices

Voice ID	Name	Gender
`quinn`	Quinn	Female
`mia`	Mia	Female
`magnus`	Magnus	Male
`olivia`	Olivia	Female
`daniel`	Daniel	Male
`rachel`	Rachel	Female
`nicole`	Nicole	Female
`elizabeth`	Elizabeth	Female

Hindi / English — Best Voices

Voice ID	Name	Gender
`neel`	Neel	Male
`maithili`	Maithili	Female
`devansh`	Devansh	Male
`sameera`	Sameera	Female
`mihir`	Mihir	Male
`aarush`	Aarush	Male
`sakshi`	Sakshi	Female
`vivaan`	Vivaan	Male
`srishti`	Srishti	Female

Spanish — Best Voices

Voice ID	Name	Gender
`daniella`	Daniella	Female
`sandra`	Sandra	Female
`carlos`	Carlos	Male
`jose`	José	Male
`luis`	Luís	Male
`mariana`	Mariana	Female
`miguel`	Miguel	Male

Other Indian Languages — Best Voices

Language	Voice ID	Name	Gender
Tamil	`jeevan`	Jeevan	Male
Tamil	`rajeshwari`	Rajeshwari	Female
Malayalam	`vaisakh`	Vaisakh	Male
Malayalam	`shibi`	Shibi	Female
Telugu	`srihari`	Srihari	Male
Telugu	`padmaja`	Padmaja	Female
Marathi	`rupali`	Rupali	Female
Marathi	`nilesh`	Nilesh	Male
Gujarati	`niharika`	Niharika	Female
Gujarati	`dhruvit`	Dhruvit	Male
Kannada	`deepashri`	Deepashri	Female
Kannada	`pranav`	Pranav	Male

Voice Cloning

Instant Voice Cloning

Audio required: 5-15 seconds
Self-serve voice cloning available via API and console. Captures core voice characteristics for quick replication.

Try Voice Cloning

Clone a voice from a 5-15 second audio sample directly in the console. No code required.

API Reference

Endpoints

Endpoint	Method	Use Case
`https://api.smallest.ai/waves/v1/lightning-v3.1/get_speech`	POST	Synchronous synthesis
`https://api.smallest.ai/waves/v1/lightning-v3.1/stream`	POST (SSE)	Server-sent events streaming
`wss://api.smallest.ai/waves/v1/lightning-v3.1/get_speech/stream`	WebSocket	Real-time streaming

Request Parameters

Parameter	Type	Required	Default	Description
`text`	string	Yes	—	Text to synthesize
`voice_id`	string	Yes	—	Voice identifier
`sample_rate`	integer	No	44100	Output sample rate (Hz)
`speed`	float	No	1.0	Speech speed (0.5-2.0)
`language`	string	No	`"auto"`	Language code or `"auto"` for automatic detection
`output_format`	string	No	`"pcm"`	Audio format
`pronunciation_dicts`	array	No	—	Custom pronunciation IDs (WebSocket only)
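
The parameters above map naturally onto a JSON request body. The sketch below builds (but does not send) a synchronous synthesis request with the standard library. The endpoint URL, parameter names, defaults, speed range, and sample rates come from this page; the `Authorization: Bearer` header and the JSON body encoding are assumptions, so check the API reference for the actual authentication scheme. `build_speech_request` is a hypothetical helper name.

```python
import json
import urllib.request

GET_SPEECH_URL = "https://api.smallest.ai/waves/v1/lightning-v3.1/get_speech"
SUPPORTED_SAMPLE_RATES = (8000, 16000, 24000, 44100)


def build_speech_request(api_key, text, voice_id, *, sample_rate=44100,
                         speed=1.0, language="auto", output_format="pcm"):
    """Build a POST request for synchronous synthesis without sending it."""
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"sample_rate must be one of {SUPPORTED_SAMPLE_RATES}")
    payload = {
        "text": text,
        "voice_id": voice_id,
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }
    return urllib.request.Request(
        GET_SPEECH_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
            "Content-Type": "application/json",    # assumed body encoding
        },
        method="POST",
    )


request = build_speech_request("YOUR_API_KEY", "Hello from Lightning.", "quinn")
print(request.get_method(), request.full_url)
```

Sending the request is then a matter of passing it to `urllib.request.urlopen` and reading the response body, which this sketch assumes to contain the audio bytes.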

Quickstart

Generate your first audio in under a minute with a single API call.

Technical Specifications

Audio Output

Specification	Details
Native sample rate	44,100 Hz
Supported sample rates	8,000 / 16,000 / 24,000 / 44,100 Hz
Output formats	PCM, MP3, WAV, mulaw
Audio channels	Mono
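
Raw PCM output has no header, so most players need it wrapped in a container first. The sketch below wraps mono PCM bytes in a WAV container using Python's `wave` module. It assumes 16-bit samples, which this page does not state; adjust `sample_width` if the API returns a different sample format. `pcm_to_wav` is a hypothetical helper name.

```python
import io
import wave


def pcm_to_wav(pcm: bytes, sample_rate: int = 44100, sample_width: int = 2) -> bytes:
    """Wrap mono PCM bytes in a WAV container (sample_width=2 assumes 16-bit audio)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)          # mono, per the table above
        wav.setsampwidth(sample_width)
        wav.setframerate(sample_rate)
        wav.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at the native sample rate, as a stand-in for API output.
wav_bytes = pcm_to_wav(b"\x00\x00" * 44100)
print(wav_bytes[:4])  # b'RIFF'
```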

Text Formatting Guidelines

Aspect	Recommendation
Language scripts	Use native script for each language. English/Spanish/French/Italian/Dutch/Swedish/Portuguese/German in Latin script, Hindi/Marathi in Devanagari, Gujarati/Tamil/Kannada/Telugu/Malayalam in their native scripts
Break points	Natural punctuation (`.` `!` `?` `,`)
Mixed language	Avoid transliteration. Use native script for each language
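
A quick client-side check can flag transliterated input before it reaches the model. The sketch below infers the scripts present in a string from Unicode character names and compares them with the script expected for a language code. The `EXPECTED_SCRIPT` map is an illustrative subset, and both helper names are hypothetical; this is a heuristic, not the model's own language detection.

```python
import unicodedata

# Illustrative subset: the script name as it appears in Unicode character names.
EXPECTED_SCRIPT = {
    "hi": "DEVANAGARI", "mr": "DEVANAGARI",
    "ta": "TAMIL", "te": "TELUGU", "kn": "KANNADA", "ml": "MALAYALAM",
    "en": "LATIN", "es": "LATIN", "fr": "LATIN",
}


def scripts_in(text: str) -> set[str]:
    """Collect the script prefix of every letter or combining mark in the text."""
    found = set()
    for ch in text:
        if ch.isalpha() or unicodedata.category(ch).startswith("M"):
            name = unicodedata.name(ch, "")
            if name:
                found.add(name.split()[0])
    return found


def uses_expected_script(text: str, language: str) -> bool:
    """True if the text contains the script expected for the language code."""
    expected = EXPECTED_SCRIPT.get(language)
    if expected is None:
        return True  # no expectation recorded for this code
    return expected in scripts_in(text)


print(uses_expected_script("नमस्ते", "hi"))   # True
print(uses_expected_script("Namaste", "hi"))  # False
```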

Number & Date Handling

Type	Format
Phone numbers	Default 3-4-3 grouping
Dates	DD/MM/YYYY or DD-MM-YYYY
Time	HH:MM or HH:MM:SS
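
When numbers and dates are produced programmatically, it can help to normalize them into the formats above before synthesis. The sketch below formats a `datetime` as DD/MM/YYYY and HH:MM, and shows what a 3-4-3 split of a ten-digit phone number looks like. It assumes that "3-4-3 grouping" means groups of three, four, and three digits; both helper names are hypothetical.

```python
from datetime import datetime


def format_for_tts(moment: datetime) -> tuple[str, str]:
    """Return (date, time) strings in DD/MM/YYYY and HH:MM form."""
    return moment.strftime("%d/%m/%Y"), moment.strftime("%H:%M")


def group_phone_343(number: str) -> str:
    """Split a ten-digit number into groups of 3, 4, and 3 digits."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("expected exactly ten digits")
    return f"{digits[:3]} {digits[3:7]} {digits[7:]}"


print(format_for_tts(datetime(2025, 3, 7, 9, 5)))  # ('07/03/2025', '09:05')
print(group_phone_343("9876543210"))               # 987 6543 210
```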

Compute Infrastructure

Hardware

Recommended GPU: NVIDIA L40S
Recommended VRAM: 48 GB

Software

Server regions (AWS): India (Hyderabad), USA (Oregon)
Automatic geolocation-based routing for lowest latency

Best Practices

Code-Switching

Lightning v3.1 supports real-time language switching within a session via two language groups; a session runs in exactly one of them. Each group shares a unified phoneme space, which allows mid-utterance transitions between member languages without re-initializing the session. Cross-group switching is not supported within a single session.

Language Groups

Indic Group. Optimized for South Asian language pairs with English as the bridging language.

Language	Code
English	`en`
Hindi	`hi`
Tamil	`ta`
Telugu	`te`
Malayalam	`ml`
Kannada	`kn`
Marathi	`mr`
Gujarati	`gu`

Global Group. Optimized for European language pairs with English and Hindi as bridging languages.

Language	Code
English	`en`
Hindi	`hi`
Spanish	`es`
French	`fr`
Italian	`it`
Portuguese	`pt`
German	`de`
Dutch	`nl`
Swedish	`sv`

Intra-group switching is unrestricted. Any language within the same group can be interleaved at the token level. Cross-group switching (e.g., Tamil from Indic + French from Global) is architecturally unsupported and will produce undefined behavior.

`en` and `hi` exist in both groups. All other languages are exclusive to one group. The group is determined at session initialization based on the first non-shared language encountered. Design your session’s language set accordingly.
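
The group rules above can be checked on the client before a session starts. The following is a minimal sketch that takes the language codes planned for a session and returns the group they resolve to, raising an error on a cross-group mix. The group membership is copied from the tables above; `resolve_group` and the group labels `"indic"` and `"global"` are hypothetical names, not API values.

```python
INDIC = frozenset({"en", "hi", "ta", "te", "ml", "kn", "mr", "gu"})
GLOBAL = frozenset({"en", "hi", "es", "fr", "it", "pt", "de", "nl", "sv"})
SHARED = INDIC & GLOBAL  # en and hi belong to both groups


def resolve_group(languages):
    """Return 'indic', 'global', or None when only shared languages are used."""
    group = None
    for code in languages:
        if code in SHARED:
            continue  # shared languages never decide the group
        if code in INDIC:
            candidate = "indic"
        elif code in GLOBAL:
            candidate = "global"
        else:
            raise ValueError(f"unsupported language code: {code!r}")
        if group is None:
            group = candidate  # first non-shared language fixes the group
        elif group != candidate:
            raise ValueError(f"cross-group switching: {code!r} is not in the {group} group")
    return group


print(resolve_group(["hi", "ta", "en"]))  # indic
print(resolve_group(["es", "fr"]))        # global
```

A session that only ever uses `en` and `hi` resolves to `None` here, since neither language decides the group on its own.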

Routing Examples

// Indic group — Hindi ↔ Tamil interleaving
"नमस्ते, வணக்கம், how are you?"  ✅  Valid: all languages within Indic group

// Global group — Spanish ↔ French interleaving
"Hola amigo, comment ça va?"  ✅  Valid: all languages within Global group

// Cross-group — Tamil (Indic) + French (Global)
"வணக்கம், comment ça va?"  ❌  Invalid: cross-group switching unsupported

Voice Cloning

Reference Audio

Environment. Record in a quiet room with no background noise, hiss, or rumble. Ambient sound is captured in the clone and cannot be removed after the fact.
Speaking style. Speak naturally in your normal conversational voice. The model captures timbre, accent, emotional tone, rhythm, and pacing automatically. Do not exaggerate unless a specific tone is intended.
Audio length. Provide 5 to 15 seconds of clean, continuous speech.
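
The length requirement is easy to verify before uploading a sample. The sketch below reads the duration of a WAV recording with Python's `wave` module and checks it against the 5 to 15 second window. It assumes the reference audio is a WAV file, which this page does not specify; both helper names are hypothetical.

```python
import io
import wave


def reference_duration_seconds(wav_bytes: bytes) -> float:
    """Duration of a WAV recording, computed from frame count and frame rate."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def is_valid_reference(wav_bytes: bytes, min_s: float = 5.0, max_s: float = 15.0) -> bool:
    """True if the recording falls within the recommended 5-15 second window."""
    return min_s <= reference_duration_seconds(wav_bytes) <= max_s
```

For a file on disk, pass the result of `open(path, "rb").read()`; the check covers length only and says nothing about noise or speech quality.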

Multi-Lingual Cloning

Language matching. For best results, record reference audio in the same language as your intended output. Cross-lingual cloning is supported (e.g., English reference used for Spanish output), but a language-matched reference produces higher fidelity.
Accent retention. When synthesizing in a different language than the reference, the original accent is preserved. A clone from a South Indian English speaker will retain that accent in Hindi or Tamil output. This is by design: the clone reproduces your voice, including accent characteristics. For accent-neutral output in a specific language, provide reference audio from a native speaker of that language.
Script encoding. Input text must use the native script for each language (Devanagari for Hindi/Marathi, Gujarati script for Gujarati, the respective native scripts for Tamil/Kannada/Telugu/Malayalam, Latin for English and the European languages). Transliterated input degrades synthesis quality.
Group constraint. Cloned voices follow the same language group routing rules. A session initialized in the Indic group cannot switch to Global-exclusive languages, regardless of the voice’s source language.

For detailed recording examples and expressive cloning techniques, see Voice Cloning Best Practices.

Text Formatting

Chunk boundaries. Segment input at natural prosodic boundaries (. ! ? ,). Maximum chunk size is 250 characters; optimal throughput at 140 characters per request.
Script integrity. Avoid transliteration. Use "नमस्ते" not "Namaste" for Hindi; "வணக்கம்" not "Vanakkam" for Tamil. Mixed-script input within a single language token produces unpredictable phoneme mappings.
Numeric normalization. Use standard formats (DD/MM/YYYY, HH:MM). Phone numbers default to 3-4-3 digit grouping.
Lexicon overrides. Use pronunciation dictionaries for domain-specific terms, brand names, and acronyms where default grapheme-to-phoneme conversion is insufficient.
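
The chunking guidance above can be sketched as a small splitter. The version below breaks text after `.`, `!`, `?`, or `,`, packs the pieces into chunks of roughly 140 characters, and never exceeds 250 characters per chunk. The limits come from this page; the packing strategy and the `chunk_text` name are assumptions, and this is one possible approach rather than the chunking logic used by the service.

```python
import re

MAX_CHARS = 250     # maximum chunk size stated above
TARGET_CHARS = 140  # size stated as optimal above


def chunk_text(text: str, target: int = TARGET_CHARS, limit: int = MAX_CHARS) -> list[str]:
    """Split text at punctuation and pack pieces into chunks near the target size."""
    pieces = [p.strip() for p in re.split(r"(?<=[.!?,])\s+", text.strip()) if p.strip()]
    chunks, current = [], ""
    for piece in pieces:
        # A piece with no usable punctuation may exceed the hard limit: cut it at
        # the last space before the limit, or at the limit if there is no space.
        while len(piece) > limit:
            if current:
                chunks.append(current)
                current = ""
            cut = piece.rfind(" ", 0, limit)
            cut = cut if cut > 0 else limit
            chunks.append(piece[:cut].strip())
            piece = piece[cut:].strip()
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > target:
            chunks.append(current)  # adding the piece would overshoot the target
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


for chunk in chunk_text("Hello there, this is a test. " * 12):
    print(len(chunk), chunk[:40])
```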

For comprehensive text formatting rules (numeric handling, date/time, symbols, chunking logic), see TTS Best Practices.

Use Cases

Direct Use

Voice assistants and conversational AI
Interactive chatbots with voice output
Real-time narration and live streaming
Accessibility tools and screen readers
Gaming (dynamic character voices)
Customer service automation

Downstream Use

Multi-turn conversational agents
Audio content generation pipelines
Telephony and IVR systems
Podcast and audiobook generation

Limitations & Safety

Known Limitations

Transliterated text may produce suboptimal results. Each language should be written in its native script: Hindi in Devanagari (e.g., “नमस्ते”), not Latin (e.g., “Namaste”), and English in Latin script, not Devanagari.

Recommendations: Use proper script for each language. Break long text at natural punctuation points. Use pronunciation dictionaries for specialized vocabulary. Test voice selection for your specific use case.

Lightning v3.1 must not be used for impersonation or fraud, generating deceptive audio content (deepfakes), creating content that violates consent or privacy, harassment or abuse, or any illegal or unethical purposes.

Safety & Compliance

Voice cloning requires explicit consent
No retention of synthesized audio
No storage of personal voice data beyond cloning scope
Usage monitoring for policy compliance

For compliance documentation (GDPR, SOC2, HIPAA), contact support@smallest.ai.

Channel	Details
Support	support@smallest.ai
Documentation	waves-docs.smallest.ai
Console	app.smallest.ai
Community	Discord
