44.1 kHz
Native sample rate
100ms
Latency at 20 concurrent requests
15 Languages
Auto-detection + code-switching
3.3x
Real-time factor (faster than playback)
Model Overview
| Attribute | Details |
|---|---|
| Developed by | Smallest AI |
| Model type | Text-to-Speech / Speech Synthesis |
| Languages | 15 (auto-detection + code-switching) |
| License | Proprietary |
| Version | v3.1 |
| Native sample rate | 44,100 Hz |
Key Capabilities
Real-Time Optimized
Ultra-low latency architecture designed for conversational AI and live streaming.
Voice Cloning
Instant voice cloning with just 5-15 seconds of audio via API and console.
Streaming
HTTP, SSE, and WebSocket support for real-time applications.
Multi-Language
15 languages with automatic detection and code-switching. No restarts or reconnections needed.
High Fidelity
Broadcast-quality 44.1 kHz audio with natural prosody, intonation, and conversational rhythm.
Pronunciation Control
Custom pronunciation dictionaries for specialized vocabulary, brand names, and domain-specific terms.
Performance & Benchmarks
Audio Generation Evaluation
Full-sentence audio generation. The entire text is synthesized in a single pass and then evaluated.
Evaluation: Seed TTS dataset, 1,088 English samples. LLM-as-a-Judge framework.
- Lightning v3.1
- ElevenLabs Turbo 2.5
| Category | Metric | Score | Notes |
|---|---|---|---|
| Audio Quality | WVMOS | 5.06 | Broadcast-quality audio |
| Audio Quality | Naturalness | 4.33 | Predominantly human-like |
| Audio Quality | Overall Quality | 4.42 | Premium-tier experience |
| Audio Quality | Native Sample Rate | 44.1 kHz | Highest fidelity among Lightning models |
| Intelligibility | Word Error Rate (WER) | 6.3% | 93.7% word accuracy |
| Intelligibility | Character Error Rate (CER) | 1.6% | Excellent character-level accuracy |
| Latency & Speed | Latency | 100ms | At 20 concurrent requests |
| Latency & Speed | Real-Time Factor (RTF) | 0.3 | 3.3x faster than playback |
| Latency & Speed | Speed Control | 0.5x - 2.0x | Adjustable playback speed |
| Latency & Speed | Max Chunk Size | 250 chars | Optimal: 140 characters per request |
| Prosody | Pronunciation | 4.70 / 5.0 | Near-perfect articulation |
| Prosody | Intonation | 4.71 / 5.0 | Highly expressive pitch variation |
| Prosody | Prosody | 4.47 / 5.0 | Natural conversational rhythm |
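The derived figures in the Notes column follow from simple arithmetic. As a minimal sketch, assuming the usual definition of real-time factor (synthesis time divided by the duration of the audio produced), the helpers below reproduce the "3.3x" and "93.7%" values. The function names are illustrative and not part of any Smallest AI SDK.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent synthesizing / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def speedup_over_playback(rtf: float) -> float:
    """How many times faster than playback a given RTF is (1 / RTF)."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def word_accuracy(wer_percent: float) -> float:
    """Word accuracy as the complement of the word error rate, in percent."""
    return 100.0 - wer_percent


print(round(speedup_over_playback(0.3), 1))  # 3.3
print(round(word_accuracy(6.3), 1))          # 93.7
```

An RTF below 1.0 means audio is generated faster than it plays back, which is the property that matters for streaming use.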
Agent Call Evaluation
Chunk-by-chunk audio generation. Simulates real-world voice agent behavior where text is streamed and synthesized incrementally, as it happens during live calls.
Evaluation: Seed TTS dataset, 1,088 English samples. LLM-as-a-Judge framework.
- Lightning v3.1
- OpenAI
- Cartesia Sonic 3
- ElevenLabs Turbo 2.5
| Category | Metric | Score | Notes |
|---|---|---|---|
| Audio Quality | MOS | 3.89 | Highest among all models tested |
| Audio Quality | Audio Quality | 3.80 | Broadcast-quality audio |
| Audio Quality | Overall Naturalness | 3.33 | Most natural-sounding output |
| Audio Quality | Naturalness | 2.67 | - |
| Intelligibility | Word Error Rate (WER) | 5.38% | 94.6% word accuracy |
| Intelligibility | Character Error Rate (CER) | 1.54% | Excellent character-level accuracy |
| Prosody | Pronunciation | 3.80 | Near-perfect articulation |
| Prosody | Intonation | 3.33 | Expressive pitch variation |
| Prosody | Prosody | 3.07 | Natural conversational rhythm |
Supported Languages
Automatic Language Detection & Code-Switching: Set `language` to `"auto"` (default) and Lightning v3.1 will automatically detect the language from input text. The model also supports code-switching within a single session without requiring a restart or reconnection.
| Language | Code | Status |
|---|---|---|
| English | en | Available |
| Spanish | es | Available |
| Hindi | hi | Available |
| Tamil | ta | Available |
| Kannada | kn | Available |
| Telugu | te | Available |
| Malayalam | ml | Available |
| Marathi | mr | Available |
| Gujarati | gu | Available |
| French | fr | Available (Beta) |
| Italian | it | Available (Beta) |
| Dutch | nl | Available (Beta) |
| Swedish | sv | Available (Beta) |
| Portuguese | pt | Available (Beta) |
| German | de | Available (Beta) |
Voice Catalog
English (US) — Best Voices
| Voice ID | Name | Gender |
|---|---|---|
| quinn | Quinn | Female |
| mia | Mia | Female |
| magnus | Magnus | Male |
| olivia | Olivia | Female |
| daniel | Daniel | Male |
| rachel | Rachel | Female |
| nicole | Nicole | Female |
| elizabeth | Elizabeth | Female |
Hindi / English — Best Voices
| Voice ID | Name | Gender |
|---|---|---|
| neel | Neel | Male |
| maithili | Maithili | Female |
| devansh | Devansh | Male |
| sameera | Sameera | Female |
| mihir | Mihir | Male |
| aarush | Aarush | Male |
| sakshi | Sakshi | Female |
| vivaan | Vivaan | Male |
| srishti | Srishti | Female |
Spanish — Best Voices
| Voice ID | Name | Gender |
|---|---|---|
| daniella | Daniella | Female |
| sandra | Sandra | Female |
| carlos | Carlos | Male |
| jose | José | Male |
| luis | Luís | Male |
| mariana | Mariana | Female |
| miguel | Miguel | Male |
Other Indian Languages — Best Voices
| Language | Voice ID | Name | Gender |
|---|---|---|---|
| Tamil | jeevan | Jeevan | Male |
| Tamil | rajeshwari | Rajeshwari | Female |
| Malayalam | vaisakh | Vaisakh | Male |
| Malayalam | shibi | Shibi | Female |
| Telugu | srihari | Srihari | Male |
| Telugu | padmaja | Padmaja | Female |
| Marathi | rupali | Rupali | Female |
| Marathi | nilesh | Nilesh | Male |
| Gujarati | niharika | Niharika | Female |
| Gujarati | dhruvit | Dhruvit | Male |
| Kannada | deepashri | Deepashri | Female |
| Kannada | pranav | Pranav | Male |
Voice Cloning
Instant Voice Cloning
Audio required: 5-15 seconds
Self-serve voice cloning available via API and console. Captures core voice characteristics for quick replication.
Try Voice Cloning
Clone a voice from a 5-15 second audio sample directly in the console. No code required.
API Reference
Endpoints
| Endpoint | Method | Use Case |
|---|---|---|
| https://api.smallest.ai/waves/v1/lightning-v3.1/get_speech | POST | Synchronous synthesis |
| https://api.smallest.ai/waves/v1/lightning-v3.1/stream | POST (SSE) | Server-sent events streaming |
| wss://api.smallest.ai/waves/v1/lightning-v3.1/get_speech/stream | WebSocket | Real-time streaming |
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | - | Text to synthesize |
| voice_id | string | Yes | - | Voice identifier |
| sample_rate | integer | No | 44100 | Output sample rate (Hz) |
| speed | float | No | 1.0 | Speech speed (0.5-2.0) |
| language | string | No | "auto" | Language code or "auto" for automatic detection |
| output_format | string | No | "pcm" | Audio format |
| pronunciation_dicts | array | No | - | Custom pronunciation IDs (WebSocket only) |
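As an illustration of how the parameters above fit together, here is a minimal Python sketch that validates a request body against the documented defaults and ranges and posts it to the synchronous endpoint. The endpoint URL, parameter names, and sample rates come from this page; the `Authorization: Bearer` header, the JSON request body, and the helper names `build_payload` and `synthesize` are assumptions to verify against the official API reference.

```python
import json
import urllib.request

BASE_URL = "https://api.smallest.ai/waves/v1/lightning-v3.1"
SUPPORTED_SAMPLE_RATES = {8000, 16000, 24000, 44100}


def build_payload(text, voice_id, sample_rate=44100, speed=1.0,
                  language="auto", output_format="pcm"):
    """Check arguments against the documented ranges and return the request body."""
    if not text:
        raise ValueError("text is required")
    if not voice_id:
        raise ValueError("voice_id is required")
    if sample_rate not in SUPPORTED_SAMPLE_RATES:
        raise ValueError(f"unsupported sample_rate: {sample_rate}")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    return {
        "text": text,
        "voice_id": voice_id,
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }


def synthesize(payload, api_key):
    """POST the payload and return the raw response bytes.

    Assumption: bearer-token auth and a JSON body; not confirmed by this page.
    """
    request = urllib.request.Request(
        f"{BASE_URL}/get_speech",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()


payload = build_payload("Hello from Lightning v3.1.", voice_id="quinn")
print(json.dumps(payload, indent=2))
```

Calling `synthesize(payload, api_key)` would perform the network request; it is left out of the example run so the sketch works without credentials.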
Quickstart
Generate your first audio in under a minute with a single API call.
Technical Specifications
Audio Output
| Specification | Details |
|---|---|
| Native sample rate | 44,100 Hz |
| Supported sample rates | 8,000 / 16,000 / 24,000 / 44,100 Hz |
| Output formats | PCM, MP3, WAV, mulaw |
| Audio channels | Mono |
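Raw PCM carries no header, so players need the sample rate and channel count supplied separately. The sketch below wraps mono PCM bytes in a WAV container using Python's standard `wave` module. It assumes 16-bit signed little-endian samples, which this page does not state; adjust `sample_width` if the API returns a different bit depth. The helper name is illustrative.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 44100,
                     sample_width: int = 2) -> bytes:
    """Wrap mono PCM samples in a WAV container and return the file bytes."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)            # output is mono per the table above
        wav_file.setsampwidth(sample_width)  # assumed 2 bytes (16-bit) per sample
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(pcm)
    return buffer.getvalue()


# One second of silence at the native rate, as stand-in data.
silence = b"\x00\x00" * 44100
wav_bytes = pcm_to_wav_bytes(silence)
print(len(wav_bytes))
```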
Text Formatting Guidelines
| Aspect | Recommendation |
|---|---|
| Language scripts | Use native script for each language. English/Spanish/French/Italian/Dutch/Swedish/Portuguese/German in Latin script, Hindi/Marathi in Devanagari, Gujarati/Tamil/Kannada/Telugu/Malayalam in their native scripts |
| Break points | Natural punctuation (. ! ? ,) |
| Mixed language | Avoid transliteration. Use native script for each language |
Number & Date Handling
| Type | Format |
|---|---|
| Phone numbers | Default 3-4-3 grouping |
| Dates | DD/MM/YYYY or DD-MM-YYYY |
| Time | HH:MM or HH:MM:SS |
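To make the formats in the table concrete, the following sketch pre-formats input text on the client side. It only illustrates what "3-4-3 grouping" and the two date patterns look like; how the model normalizes numbers internally is not described here. The helper names are hypothetical, and a 10-digit phone number is assumed.

```python
from datetime import datetime


def group_phone_digits(number: str) -> str:
    """Group a 10-digit phone number as 3-4-3, ignoring separators."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("3-4-3 grouping needs exactly 10 digits")
    return f"{digits[:3]} {digits[3:7]} {digits[7:]}"


def parse_documented_date(value: str) -> datetime:
    """Accept only DD/MM/YYYY or DD-MM-YYYY, the two formats listed above."""
    for fmt in ("%d/%m/%Y", "%d-%m-%Y"):
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"not in DD/MM/YYYY or DD-MM-YYYY form: {value!r}")


print(group_phone_digits("98765-43210"))         # 987 6543 210
print(parse_documented_date("25/12/2024").date())  # 2024-12-25
```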
Compute Infrastructure
Hardware
- Recommended GPU: NVIDIA L40S
- Recommended VRAM: 48 GB
- Server regions (AWS): India (Hyderabad), USA (Oregon)
- Automatic geo-location based routing for lowest latency
Best Practices
Code-Switching
Lightning v3.1 supports real-time intra-session language switching via two mutually exclusive language groups. Each group shares a unified phoneme space, enabling seamless mid-utterance transitions between member languages without session re-initialization. Cross-group switching is not supported within a single session.
Language Groups
Indic Group. Optimized for South Asian language pairs with English as the bridging language.
| Language | Code |
|---|---|
| English | en |
| Hindi | hi |
| Tamil | ta |
| Telugu | te |
| Malayalam | ml |
| Kannada | kn |
| Marathi | mr |
| Gujarati | gu |
Global Group.
| Language | Code |
|---|---|
| English | en |
| Hindi | hi |
| Spanish | es |
| French | fr |
| Italian | it |
| Portuguese | pt |
| German | de |
| Dutch | nl |
| Swedish | sv |
Routing Examples
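As a rough illustration of the routing rule, the sketch below checks which language group can serve a given set of language codes within one session. Group membership is copied from the two tables above; the labels `indic` and `global` and the helper `groups_supporting` are illustrative names, not API values.

```python
INDIC_GROUP = frozenset({"en", "hi", "ta", "te", "ml", "kn", "mr", "gu"})
GLOBAL_GROUP = frozenset({"en", "hi", "es", "fr", "it", "pt", "de", "nl", "sv"})
GROUPS = {"indic": INDIC_GROUP, "global": GLOBAL_GROUP}


def groups_supporting(languages):
    """Return the names of the groups that contain every requested language."""
    requested = set(languages)
    unknown = requested - (INDIC_GROUP | GLOBAL_GROUP)
    if unknown:
        raise ValueError(f"unsupported language codes: {sorted(unknown)}")
    return [name for name, members in GROUPS.items() if requested <= members]


print(groups_supporting(["en", "hi"]))  # ['indic', 'global']
print(groups_supporting(["hi", "ta"]))  # ['indic']
print(groups_supporting(["en", "es"]))  # ['global']
print(groups_supporting(["ta", "es"]))  # []
```

An empty result means the languages span both groups, so they cannot be mixed within a single session.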
Voice Cloning
Reference Audio
- Environment. Record in a quiet room with no background noise, hiss, or rumble. Ambient sound is captured in the clone and cannot be removed after the fact.
- Speaking style. Speak naturally in your normal conversational voice. The model captures timbre, accent, emotional tone, rhythm, and pacing automatically. Do not exaggerate unless a specific tone is intended.
- Audio length. Provide 5 to 15 seconds of clean, continuous speech.
Multi-Lingual Cloning
- Language matching. For best results, record reference audio in the same language as your intended output. Cross-lingual cloning is supported (e.g., English reference used for Spanish output), but a language-matched reference produces higher fidelity.
- Accent retention. When synthesizing in a different language than the reference, the original accent is preserved. A clone from a South Indian English speaker will retain that accent in Hindi or Tamil output. This is by design: the clone reproduces your voice, including accent characteristics. For accent-neutral output in a specific language, provide reference audio from a native speaker of that language.
- Script encoding. Input text must use native script for each language (Devanagari for Hindi/Marathi, Gujarati script for Gujarati, respective Brahmic scripts for Dravidian languages, Latin for European languages). Transliterated input degrades synthesis quality.
- Group constraint. Cloned voices follow the same language group routing rules. A session initialized in the Indic group cannot switch to Global-exclusive languages, regardless of the voice’s source language.
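A simple pre-flight check can catch transliterated input before it reaches the API. The sketch below tests whether text for an Indic language contains at least one character from that language's Unicode block. The block ranges are standard Unicode; the helper name and the choice to skip the check for Latin-script languages are assumptions made for illustration.

```python
# Unicode block ranges (inclusive) for the native script of each language code.
SCRIPT_RANGES = {
    "hi": (0x0900, 0x097F),  # Devanagari
    "mr": (0x0900, 0x097F),  # Devanagari
    "gu": (0x0A80, 0x0AFF),  # Gujarati
    "ta": (0x0B80, 0x0BFF),  # Tamil
    "te": (0x0C00, 0x0C7F),  # Telugu
    "kn": (0x0C80, 0x0CFF),  # Kannada
    "ml": (0x0D00, 0x0D7F),  # Malayalam
}


def uses_native_script(text: str, language: str) -> bool:
    """True if text contains a character from the language's native script.

    Languages without an entry (the Latin-script ones) are not checked.
    """
    if language not in SCRIPT_RANGES:
        return True
    low, high = SCRIPT_RANGES[language]
    return any(low <= ord(ch) <= high for ch in text)


print(uses_native_script("नमस्ते", "hi"))   # True
print(uses_native_script("Namaste", "hi"))  # False
print(uses_native_script("வணக்கம்", "ta"))  # True
```

This is a coarse heuristic: it flags fully transliterated input but does not reject code-switched text that legitimately mixes native script with English.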
Text Formatting
- Chunk boundaries. Segment input at natural prosodic boundaries (`. ! ? ,`). Maximum chunk size is 250 characters; optimal throughput at 140 characters per request.
- Script integrity. Avoid transliteration. Use `"नमस्ते"` not `"Namaste"` for Hindi; `"வணக்கம்"` not `"Vanakkam"` for Tamil. Mixed-script input within a single language token produces unpredictable phoneme mappings.
- Numeric normalization. Use standard formats (`DD/MM/YYYY`, `HH:MM`). Phone numbers default to 3-4-3 digit grouping.
- Lexicon overrides. Use pronunciation dictionaries for domain-specific terms, brand names, and acronyms where default grapheme-to-phoneme conversion is insufficient.
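The chunking guidance can be sketched as a greedy splitter: break the text after `. ! ? ,`, pack the pieces into chunks of up to 140 characters, and never exceed 250. This is one possible client-side approach under those two documented limits, not the service's own segmentation; the function name and the whitespace fallback for over-long pieces are assumptions.

```python
import re

TARGET_CHARS = 140  # documented optimal request size
MAX_CHARS = 250     # documented maximum chunk size


def split_into_chunks(text, target=TARGET_CHARS, limit=MAX_CHARS):
    """Split text at punctuation and pack pieces into chunks no longer than limit."""
    # Split on whitespace that follows . ! ? or , so punctuation stays attached.
    pieces = [p for p in re.split(r"(?<=[.!?,])\s+", text.strip()) if p]
    chunks = []
    current = ""
    for piece in pieces:
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= target:
            current = candidate
            continue
        if current:
            chunks.append(current)
        # A single piece above the hard limit is cut at the last space before it,
        # or at the limit itself when it contains no space.
        while len(piece) > limit:
            cut = piece.rfind(" ", 0, limit)
            if cut <= 0:
                cut = limit
            chunks.append(piece[:cut].rstrip())
            piece = piece[cut:].lstrip()
        current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "Hello there. " * 30
for chunk in split_into_chunks(sample):
    print(len(chunk), chunk[:40])
```

Packing toward the 140-character target keeps requests near the documented optimum, while the 250-character cap is enforced even for text with no punctuation.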
Use Cases
Direct Use
- Voice assistants and conversational AI
- Interactive chatbots with voice output
- Real-time narration and live streaming
- Accessibility tools and screen readers
- Gaming (dynamic character voices)
- Customer service automation
Downstream Use
- Multi-turn conversational agents
- Audio content generation pipelines
- Telephony and IVR systems
- Podcast and audiobook generation
Limitations & Safety
Known Limitations
- Mixed-language text (transliteration) may produce suboptimal results. Hindi text should be in Devanagari script (e.g., “नमस्ते”), not Latin (e.g., “Namaste”). English text should be in Latin script, not Devanagari. Each language should use its native script.
Recommendations: Use proper script for each language. Break long text at natural punctuation points. Use pronunciation dictionaries for specialized vocabulary. Test voice selection for your specific use case.
Safety & Compliance
- Voice cloning requires explicit consent
- No retention of synthesized audio
- No storage of personal voice data beyond cloning scope
- Usage monitoring for policy compliance
| Channel | Details |
|---|---|
| Support | support@smallest.ai |
| Documentation | waves-docs.smallest.ai |
| Console | app.smallest.ai |
| Community | Discord |

