- 44.1 kHz: native sample rate
- 200 ms: latency at 20 concurrent requests
- 4 languages: English, Hindi, Spanish, Tamil
- 3.3x faster than playback (real-time factor of 0.3)
Model Overview
| Attribute | Details |
|---|---|
| Developed by | Smallest AI |
| Model type | Text-to-Speech / Speech Synthesis |
| Languages | English, Hindi, Spanish, Tamil |
| License | Proprietary |
| Version | v3.1 |
| Native sample rate | 44,100 Hz |
Key Capabilities
Real-Time Optimized
Ultra-low latency architecture designed for conversational AI and live streaming.
Voice Cloning
Instant voice cloning with just 5-15 seconds of audio. Professional cloning available on demand.
Streaming
HTTP, SSE, and WebSocket support for real-time applications.
Performance & Benchmarks
In blind listening tests against OpenAI GPT-4o-mini-TTS, Lightning v3.1 was preferred by listeners 76.2% of the time, a 3.4x preference ratio.

Evaluation: Seed TTS dataset, 1,088 samples across English, Hindi, Spanish, and Tamil. LLM-as-a-Judge framework with ASR-based intelligibility testing.
[Chart legend: Lightning v3.1, ElevenLabs Turbo 2.5]
| Category | Metric | Score | Notes |
|---|---|---|---|
| Audio Quality | WVMOS | 5.06 | Broadcast-quality audio |
| Audio Quality | Naturalness | 4.33 | Predominantly human-like |
| Audio Quality | Overall Quality | 4.42 | Premium-tier experience |
| Audio Quality | Native Sample Rate | 44.1 kHz | Highest fidelity among Lightning models |
| Intelligibility | Word Error Rate (WER) | 6.3% | 93.7% word accuracy |
| Intelligibility | Character Error Rate (CER) | 1.6% | Excellent character-level accuracy |
| Latency & Speed | Latency | 200 ms | At 20 concurrent requests |
| Latency & Speed | Real-Time Factor (RTF) | 0.3 | 3.3x faster than playback |
| Latency & Speed | Speed Control | 0.5x - 2.0x | Adjustable playback speed |
| Latency & Speed | Max Chunk Size | 250 chars | Optimal: 140 characters per request |
| Prosody | Pronunciation | 4.70 / 5.0 | Near-perfect articulation |
| Prosody | Intonation | 4.71 / 5.0 | Highly expressive pitch variation |
| Prosody | Prosody | 4.47 / 5.0 | Natural conversational rhythm |
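The speed figures in the table can be related with a little arithmetic. The sketch below assumes the usual definition of real-time factor (synthesis time divided by the duration of the audio produced); the function names are illustrative and not part of any SDK.

```python
def speedup_from_rtf(rtf: float) -> float:
    """Return how many times faster than playback synthesis runs."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return 1.0 / rtf


def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Estimate the compute time needed for audio of a given length."""
    return audio_seconds * rtf


# With the RTF of 0.3 quoted in the table:
print(round(speedup_from_rtf(0.3), 1))          # 3.3
print(round(synthesis_seconds(10.0, 0.3), 2))   # 3.0
```

Under that definition, ten seconds of speech would take roughly three seconds to synthesize, which is where the "3.3x faster than playback" note comes from.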
Supported Languages
Automatic Language Detection & Language Switching: Set language to "auto" (default) and Lightning v3.1 will automatically detect the language from the input text. The model also supports language switching within a single session, so there is no need to restart or reconnect when switching between supported languages.

| Language | Code | Status |
|---|---|---|
| English | en | Available |
| Hindi | hi | Available |
| Spanish | es | Available |
| Tamil | ta | Available |
| Italian | it | Coming soon |
| French | fr | Coming soon |
| Portuguese | pt | Coming soon |
| Swedish | sv | Coming soon |
| Dutch | nl | Coming soon |
| German | de | Coming soon |
| Telugu | te | Coming soon |
| Malayalam | ml | Coming soon |
| Kannada | kn | Coming soon |
| Marathi | mr | Coming soon |
| Gujarati | gu | Coming soon |
Voice Catalog
English Voices
| Voice ID | Name | Gender | Accent | Languages |
|---|---|---|---|---|
| magnus | Magnus | Male | American | English |
| olivia | Olivia | Female | American | English |
| daniel | Daniel | Male | American | English |
| rachel | Rachel | Female | American | English |
| nicole | Nicole | Female | American | English |
| elizabeth | Elizabeth | Female | American | English |
| kyle | Kyle | Male | American | English |
Hindi Voices
| Voice ID | Name | Gender | Accent | Languages |
|---|---|---|---|---|
| aarush | Aarush | Male | Indian | English, Hindi |
| sakshi | Sakshi | Female | Indian | English, Hindi |
| parth | Parth | Male | Indian | English, Hindi |
| sana | Sana | Female | Indian | English, Hindi |
| vivaan | Vivaan | Male | Indian | English, Hindi |
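For picking a voice programmatically, the two catalog tables can be mirrored in a small lookup structure. This is only a sketch: the `Voice` class and `voices_for` helper are hypothetical names, and the data is copied from the tables above rather than fetched from any API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Voice:
    voice_id: str
    gender: str
    accent: str
    languages: Tuple[str, ...]


# Entries copied from the English and Hindi voice tables.
VOICES = [
    Voice("magnus", "Male", "American", ("English",)),
    Voice("olivia", "Female", "American", ("English",)),
    Voice("daniel", "Male", "American", ("English",)),
    Voice("rachel", "Female", "American", ("English",)),
    Voice("nicole", "Female", "American", ("English",)),
    Voice("elizabeth", "Female", "American", ("English",)),
    Voice("kyle", "Male", "American", ("English",)),
    Voice("aarush", "Male", "Indian", ("English", "Hindi")),
    Voice("sakshi", "Female", "Indian", ("English", "Hindi")),
    Voice("parth", "Male", "Indian", ("English", "Hindi")),
    Voice("sana", "Female", "Indian", ("English", "Hindi")),
    Voice("vivaan", "Male", "Indian", ("English", "Hindi")),
]


def voices_for(language: str, gender: Optional[str] = None) -> List[str]:
    """Return voice IDs that list the language, optionally filtered by gender."""
    return [
        v.voice_id
        for v in VOICES
        if language in v.languages and (gender is None or v.gender == gender)
    ]


print(voices_for("Hindi"))
# ['aarush', 'sakshi', 'parth', 'sana', 'vivaan']
```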
Voice Cloning
Instant Voice Cloning
Audio required: 5-15 seconds
Self-serve voice cloning available via API and console. Captures core voice characteristics for quick replication.
Professional Voice Cloning
Audio required: 45+ minutes (high-quality)
Near-perfect voice match capturing intonation, accent, emotions, and vocal nuances. Available on demand; contact support@smallest.ai to get started.
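Before uploading a sample for instant cloning, it can help to check that its length falls in the 5-15 second window. The sketch below assumes the reference recording is an uncompressed WAV file (the accepted upload formats are not stated here) and uses only the Python standard library; the helper names are illustrative.

```python
import wave

# Duration window for instant cloning, taken from the text above.
INSTANT_CLONE_RANGE_S = (5.0, 15.0)


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a WAV file in seconds."""
    with wave.open(path, "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def fits_instant_cloning(path: str) -> bool:
    """True if the recording length is within the instant-cloning window."""
    shortest, longest = INSTANT_CLONE_RANGE_S
    return shortest <= wav_duration_seconds(path) <= longest
```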
API Reference
Endpoints
| Endpoint | Method | Use Case |
|---|---|---|
| /waves/v1/lightning-v3.1/get_speech | POST | Synchronous synthesis |
| /waves/v1/lightning-v3.1/stream | POST (SSE) | Server-sent events streaming |
| /waves/v1/lightning-v3.1/get_speech/stream | WebSocket | Real-time streaming |
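For the SSE endpoint, a client has to split the response stream into events. The sketch below is a generic parser for the standard server-sent events wire format (`event:` and `data:` fields separated by blank lines). The event names and the encoding of the audio payload are not described in this document, so the parser simply hands back raw strings; `iter_sse_events` is an illustrative name.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (event_name, data) pairs from an iterable of SSE text lines."""
    event = "message"          # default event name in the SSE format
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line ends the current event.
            if data:
                yield event, "\n".join(data)
            event, data = "message", []
            continue
        if line.startswith(":"):
            continue           # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event = value
        elif field == "data":
            data.append(value)


sample = ["event: chunk\n", "data: abc\n", "\n", ": keep-alive\n", "data: done\n", "\n"]
print(list(iter_sse_events(sample)))
# [('chunk', 'abc'), ('message', 'done')]
```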
Request Parameters
| Parameter | Type | Required | Default | Description |
|---|---|---|---|---|
| text | string | Yes | — | Text to synthesize |
| voice_id | string | Yes | — | Voice identifier |
| sample_rate | integer | No | 44100 | Output sample rate (Hz) |
| speed | float | No | 1.0 | Speech speed (0.5-2.0) |
| language | string | No | "auto" | Language code (en, hi, es, ta) |
| output_format | string | No | "pcm" | Audio format |
| pronunciation_dicts | array | No | — | Custom pronunciation IDs (WebSocket only) |
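A request body for the synchronous endpoint might be assembled as follows. This is a minimal sketch under several assumptions that the tables do not confirm: the host in `BASE_URL` is a placeholder, the parameters are sent as a JSON body, and authentication uses a bearer token. Only the path, parameter names, defaults, and ranges come from the tables above.

```python
import json
import urllib.request

BASE_URL = "https://api.example.com"  # assumption: placeholder host, not the real one
SYNC_PATH = "/waves/v1/lightning-v3.1/get_speech"
SUPPORTED_RATES = (8000, 16000, 24000, 44100)


def build_payload(text, voice_id, sample_rate=44100, speed=1.0,
                  language="auto", output_format="pcm"):
    """Validate parameters against the documented ranges and return a dict."""
    if not text:
        raise ValueError("text is required")
    if len(text) > 250:
        raise ValueError("text exceeds the 250-character chunk limit")
    if not 0.5 <= speed <= 2.0:
        raise ValueError("speed must be between 0.5 and 2.0")
    if sample_rate not in SUPPORTED_RATES:
        raise ValueError(f"sample_rate must be one of {SUPPORTED_RATES}")
    return {
        "text": text,
        "voice_id": voice_id,
        "sample_rate": sample_rate,
        "speed": speed,
        "language": language,
        "output_format": output_format,
    }


def build_request(payload, api_key):
    """Wrap the payload in a POST request (JSON body and bearer auth assumed)."""
    return urllib.request.Request(
        BASE_URL + SYNC_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",  # assumption: auth scheme
            "Content-Type": "application/json",
        },
        method="POST",
    )


# To send it: urllib.request.urlopen(build_request(payload, key)).read()
```

Validating on the client side keeps obviously invalid requests (out-of-range speed, unsupported sample rate) from reaching the server.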
Quickstart
Get started in minutes with synchronous or streaming synthesis.
Technical Specifications
Audio Output
| Specification | Details |
|---|---|
| Native sample rate | 44,100 Hz |
| Supported sample rates | 8,000 / 16,000 / 24,000 / 44,100 Hz |
| Output formats | PCM, MP3, WAV, mulaw |
| Audio channels | Mono |
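Raw PCM output has no header, so most players need it wrapped in a container first. The sketch below wraps mono PCM bytes in a WAV container with Python's standard `wave` module. It assumes 16-bit samples, which this document does not state; adjust `sample_width` if the actual bit depth differs.

```python
import io
import wave


def pcm_to_wav_bytes(pcm: bytes, sample_rate: int = 44100,
                     sample_width: int = 2) -> bytes:
    """Wrap raw mono PCM in a WAV container (16-bit samples assumed)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)              # mono, per the table above
        wf.setsampwidth(sample_width)   # bytes per sample (assumption)
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


# One second of silence at the native rate: 44,100 samples x 2 bytes each.
wav_bytes = pcm_to_wav_bytes(b"\x00\x00" * 44100)
print(wav_bytes[:4], len(wav_bytes))
```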
Text Formatting Guidelines
| Aspect | Recommendation |
|---|---|
| Language scripts | English and Spanish in Latin script, Hindi in Devanagari |
| Break points | Natural punctuation (. ! ? ,) |
| Mixed language | Avoid transliteration — use native script for each language |
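Long input has to be split before synthesis, ideally at the punctuation marks listed above. The following sketch packs punctuation-delimited pieces greedily toward the 140-character optimum and never exceeds the 250-character maximum quoted in the benchmarks table. The greedy strategy and the name `chunk_text` are illustrative choices, not a documented algorithm.

```python
import re
from typing import List

MAX_CHARS = 250     # hard limit per request
TARGET_CHARS = 140  # recommended size per request


def chunk_text(text: str, target: int = TARGET_CHARS,
               limit: int = MAX_CHARS) -> List[str]:
    """Split text at . ! ? , boundaries into chunks of at most `limit` chars."""
    # Split on whitespace that follows a break-point character.
    pieces = [p for p in re.split(r"(?<=[.!?,])\s+", text.strip()) if p]
    chunks: List[str] = []
    current = ""
    for piece in pieces:
        # A single piece longer than the limit is cut hard (may split a word).
        while len(piece) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(piece[:limit])
            piece = piece[limit:]
        candidate = f"{current} {piece}".strip()
        if current and len(candidate) > target:
            # Adding this piece would overshoot the target: start a new chunk.
            chunks.append(current)
            current = piece
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


for chunk in chunk_text("Hello there. " * 30):
    print(len(chunk), chunk[:24])
```

A chunk can exceed the target when a single sentence is longer than 140 characters, but it stays within the 250-character limit.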
Number & Date Handling
| Type | Format |
|---|---|
| Phone numbers | Default 3-4-3 grouping |
| Dates | DD/MM/YYYY or DD-MM-YYYY |
| Time | HH:MM or HH:MM:SS |
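One reading of this table is that dates and times in the input text should already be written in these formats, and that ten-digit phone numbers are read in groups of three, four, and three digits. The sketch below normalizes values accordingly before they are placed into text; the helper names and the space separator are assumptions made for illustration.

```python
from datetime import date, time


def group_phone(number: str, sep: str = " ") -> str:
    """Group a ten-digit phone number as 3-4-3, ignoring non-digit characters."""
    digits = "".join(ch for ch in number if ch.isdigit())
    if len(digits) != 10:
        raise ValueError("3-4-3 grouping needs exactly 10 digits")
    return sep.join((digits[:3], digits[3:7], digits[7:]))


def format_date(d: date) -> str:
    """Render a date as DD/MM/YYYY."""
    return d.strftime("%d/%m/%Y")


def format_time(t: time, with_seconds: bool = False) -> str:
    """Render a time as HH:MM or HH:MM:SS."""
    return t.strftime("%H:%M:%S" if with_seconds else "%H:%M")


print(group_phone("9876543210"))      # 987 6543 210
print(format_date(date(2024, 3, 5)))  # 05/03/2024
print(format_time(time(9, 5)))        # 09:05
```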
Compute Infrastructure
Hardware
- Recommended GPU: NVIDIA L40S
- Recommended VRAM: 48 GB
- Server regions (AWS): India (Hyderabad), USA (Oregon)
- Automatic geo-location based routing for lowest latency
Use Cases
Direct Use
- Voice assistants and conversational AI
- Interactive chatbots with voice output
- Real-time narration and live streaming
- Accessibility tools and screen readers
- Gaming (dynamic character voices)
- Customer service automation
Downstream Use
- Multi-turn conversational agents
- Audio content generation pipelines
- Telephony and IVR systems
- Podcast and audiobook generation
Limitations & Safety
Known Limitations
- Mixed-language text (transliteration) may produce suboptimal results. Hindi text should be in Devanagari script (e.g., “नमस्ते”), not Latin (e.g., “Namaste”). English text should be in Latin script, not Devanagari.
Recommendations: Use proper script for each language. Break long text at natural punctuation points. Use pronunciation dictionaries for specialized vocabulary. Test voice selection for your specific use case.
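The script recommendation can be checked before sending text. The sketch below classifies characters by Unicode block (Devanagari U+0900 to U+097F, Tamil U+0B80 to U+0BFF, Latin below U+0250) and flags Hindi input that contains no Devanagari at all, which usually indicates transliteration. The helper names are hypothetical and the heuristic is deliberately simple.

```python
from typing import Set


def scripts_in(text: str) -> Set[str]:
    """Return the set of scripts (Latin, Devanagari, Tamil) found in the text."""
    found: Set[str] = set()
    for ch in text:
        cp = ord(ch)
        if 0x0900 <= cp <= 0x097F:
            found.add("Devanagari")
        elif 0x0B80 <= cp <= 0x0BFF:
            found.add("Tamil")
        elif ch.isalpha() and cp < 0x0250:
            found.add("Latin")
    return found


def looks_transliterated(text: str, language: str) -> bool:
    """Heuristic: text labeled Hindi ("hi") with no Devanagari characters."""
    return language == "hi" and "Devanagari" not in scripts_in(text)


print(scripts_in("नमस्ते"))                    # {'Devanagari'}
print(looks_transliterated("Namaste", "hi"))  # True
```

Mixed English and Hindi text passes this check as long as the Hindi portion is in Devanagari, which matches the guidance to use the native script for each language.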
Safety & Compliance
- Voice cloning requires explicit consent
- No retention of synthesized audio
- No storage of personal voice data beyond cloning scope
- Usage monitoring for policy compliance
| Channel | Details |
|---|---|
| Support | support@smallest.ai |
| Documentation | waves-docs.smallest.ai |
| Console | app.smallest.ai |
| Community | Discord |

