Performance

This page provides performance benchmarks for Lightning STT, including latency, accuracy, and throughput metrics.

Latency Metrics

End-to-End Latency

Average Latency: ~64ms
P50 Latency: 60-65ms (median - 50% of requests complete within this time)
P95 Latency: 80-100ms (95% of requests complete within this time)
P99 Latency: 100-150ms (99% of requests complete within this time)

Measured on 16kHz mono PCM audio in English
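To compare your own measurements against the figures above, the mean and percentiles can be computed from a list of per-request latencies. The following is a minimal sketch using only the Python standard library; the helper name `latency_summary` and the sample values are made up for illustration and are not measured results.

```python
import statistics

def latency_summary(samples_ms):
    """Return mean, P50, P95 and P99 for a list of latencies in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # quantiles(n=100) returns the 99 cut points P1..P99 (input is sorted internally)
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {
        "mean": statistics.fmean(samples_ms),
        "p50": cuts[49],
        "p95": cuts[94],
        "p99": cuts[98],
    }

# Illustrative sample data only, not real benchmark output.
samples = [58, 60, 61, 62, 63, 64, 64, 65, 66, 70, 82, 95, 120]
summary = latency_summary(samples)
print(summary)
```

With only a handful of samples the P95 and P99 values are interpolated and unstable, so collect a few hundred requests or more before comparing against the ranges above.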

Time-to-First-Transcript

HTTP POST: ~64ms for the complete transcription

Accuracy Metrics

Word Error Rate (WER)

All models were evaluated on the FLEURS dataset, a standardised multilingual speech benchmark ensuring fair cross-model comparison.

Language	WER
English	5.1%
Italian	4.2%
Spanish	5.4%
Hindi	11.4%
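WER is conventionally defined as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below computes it with a word-level edit distance. It splits on whitespace only; the text normalisation (casing, punctuation) used for the FLEURS evaluation above is not stated here, so treat this as an approximation of the metric rather than the exact evaluation procedure.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)

wer = word_error_rate("the quick brown fox", "the quick brown box")
print(f"{wer:.1%}")  # one substitution in four words: 25.0%
```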

Throughput

Requests Per Second

Audio Length	HTTP POST (requests per second)
Short (< 5s)	50-100
Medium (5-30s)	20-50
Long (30s+)	10-20

Throughput varies based on audio length, format, and server load
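As a rough planning aid, the ranges in the table can be turned into an estimate of how long a batch of requests would take. This sketch assumes the figures are sustained rates available to a single client, which the table does not state; the dictionary keys and the function name are hypothetical.

```python
# Requests-per-second ranges copied from the table above.
THROUGHPUT_RPS = {
    "short": (50, 100),   # < 5s
    "medium": (20, 50),   # 5-30s
    "long": (10, 20),     # 30s+
}

def batch_duration_seconds(num_requests, audio_class):
    """Return (best_case, worst_case) seconds to process a batch."""
    low_rps, high_rps = THROUGHPUT_RPS[audio_class]
    # Best case uses the highest rate, worst case the lowest.
    return num_requests / high_rps, num_requests / low_rps

best, worst = batch_duration_seconds(1000, "short")
print(best, worst)  # 10.0 20.0
```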

Performance by Audio Format

Linear16 (PCM)

Latency: Lowest (~64ms)
Accuracy: Highest
Bandwidth: Highest
Best for: High-quality applications

Opus

Latency: Low (~70-80ms)
Accuracy: High
Bandwidth: Low
Best for: Browser/mobile applications

FLAC

Latency: Medium (~80-90ms)
Accuracy: Highest
Bandwidth: Medium
Best for: Archival/quality-critical use cases

μ-law

Latency: Low (~65-75ms)
Accuracy: Good
Bandwidth: Lowest
Best for: Telephony applications
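The bandwidth ranking for the two uncompressed formats follows from simple arithmetic: bitrate is sample rate times bits per sample times channel count. The sketch below assumes 16kHz for linear16 (as in the latency measurements above) and 8kHz for μ-law, which is the usual telephony rate but is an assumption here. Opus and FLAC are compressed and variable-rate, so their bandwidth cannot be derived this way.

```python
def pcm_bitrate_bps(sample_rate_hz, bits_per_sample, channels=1):
    """Bitrate in bits per second for fixed-size audio samples."""
    return sample_rate_hz * bits_per_sample * channels

linear16_bps = pcm_bitrate_bps(16_000, 16)  # 16-bit samples at 16kHz mono
mulaw_bps = pcm_bitrate_bps(8_000, 8)       # 8-bit samples at 8kHz mono (assumed rate)

print(linear16_bps // 1000, "kbps")  # 256 kbps
print(mulaw_bps // 1000, "kbps")     # 64 kbps
```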

Performance by Language

High-Performance Languages

Italian: 4.2% WER, ~64ms latency
English: 5.1% WER, ~64ms latency
Spanish: 5.4% WER, ~64ms latency
Portuguese: 7.1% WER, ~64ms latency
German: 8.5% WER, ~64ms latency
French: 9.2% WER, ~64ms latency

Regional Variations

Indian Languages: 10-15% WER, ~90-100ms latency
Eastern European Languages: 9-12% WER, ~85-95ms latency

Feature Impact on Performance

Diarization

Latency Impact: +10-20ms
Accuracy Impact: Minimal
Use When: Multiple speakers present

Word Timestamps

Latency Impact: +5-10ms
Accuracy Impact: None
Use When: Timing information needed

Emotion Detection

Latency Impact: +15-25ms
Accuracy Impact: None
Use When: Emotion analysis required

Age/Gender Detection

Latency Impact: +10-15ms
Accuracy Impact: None
Use When: Demographic analysis needed
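To see what a combination of features might cost, the per-feature ranges above can be added to the ~64ms baseline. This is a sketch that assumes the overheads are additive and independent, which this page does not confirm; the feature keys are illustrative labels, not API parameter names.

```python
BASE_LATENCY_MS = 64  # average end-to-end latency quoted on this page

# (min, max) added latency in ms, copied from the sections above.
FEATURE_OVERHEAD_MS = {
    "diarization": (10, 20),
    "word_timestamps": (5, 10),
    "emotion_detection": (15, 25),
    "age_gender_detection": (10, 15),
}

def estimate_latency_ms(features):
    """Return an estimated (low, high) latency, assuming overheads simply add up."""
    low = high = BASE_LATENCY_MS
    for name in features:
        extra_low, extra_high = FEATURE_OVERHEAD_MS[name]
        low += extra_low
        high += extra_high
    return low, high

estimate = estimate_latency_ms(["diarization", "word_timestamps"])
print(estimate)  # (79, 94)
```

Under the same assumption, enabling all four features gives an estimate of roughly 104-134ms.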

Optimization Tips

Use a 16kHz sample rate, the rate used for the latency measurements on this page
Choose the linear16 format for the lowest latency (~64ms)
Enable only the features you need, since each one adds latency
Process audio in batches when latency isn’t critical
