Summary
Best Overall Model: 🏆 smallestai_streaming delivers the lowest overall Word Error Rate (WER) among the models that support English, Hindi, and code-switched audio, while also handling noisy and disfluent speech effectively. It edges out deepgram_nova3_streaming in most categories (Hindi, disfluent, and noisy audio), making it the top choice for multilingual and mixed-context scenarios. Note: AssemblyAI’s streaming ASR does not support Hindi or code-switched transcription; it is English-only in streaming mode.
Test Dataset - Categories Overview
The internal test dataset comprises a diverse set of speech samples designed to evaluate ASR performance across real-world conditions, including language variation, background noise, and natural conversation artifacts.
- Code-Switched: Hindi-English mix within the same conversation.
  Example: “यह movie बहुत अच्छी थी but ending थोड़ी confusing लगी”
  - How it was created: Generated using our in-house TTS on curated code-switched text, ensuring natural alternation between languages in the same sentence or utterance.
- Hindi: Traditional Hindi in Devanagari.
  Example: “हमने उसका जन्मदिन मनाया”
  - How it was created: Recorded and synthesized speech from native Hindi speakers, covering varied topics from casual conversations to descriptive narratives.
- English: Standard English in Latin script.
  Example: “jovial joggers joyfully joined jogging jaunts justifying joyful jolliness”
  - How it was created: Includes tongue twisters, technical/scientific terminology, and diverse domains such as technology, healthcare, and finance to evaluate robustness across vocabularies.
- Disfluency: Audio containing hesitation words, repetitions, and self-corrections.
  Example: “see uh uh i i when i went i thought the food was not good”
  - How it was created: Real recordings of people speaking naturally, sourced from Atoms call recordings to capture genuine in-the-wild speech patterns.
- Noisy: Audio with background interference, low-quality microphones, or multiple speakers.
  - How it was created: Real recordings from Atoms call data with actual customer interactions, ambient sounds, and overlapping speech, replicating real-world ASR deployment conditions.
Model Overview
Model Name | Provider | Type |
---|---|---|
smallestai_streaming | Smallest AI | WebSocket Streaming |
gpt4o_mini_streaming | OpenAI | WebSocket Streaming |
gpt4o_streaming | OpenAI | WebSocket Streaming |
assemblyai_streaming | AssemblyAI | WebSocket Streaming |
deepgram_nova3_streaming | Deepgram | WebSocket Streaming |
Performance Benchmarks
Accuracy Metrics
Rank | Model | English WER | Hindi WER | Code-Switched WER | Disfluency WER | Noisy WER | Overall WER |
---|---|---|---|---|---|---|---|
1 | smallestai_streaming | 2.10% | 22.74% | 12.33% | 9.99% | 15.52% | 12.53% |
2 | deepgram_nova3_streaming | 2.05% | 23.10% | 10.90% | 10.20% | 15.90% | 12.66% |
3 | gpt4o_streaming | 10.19% | 9.93% | 29.58% | 12.00% | 22.06% | 16.75% |
4 | gpt4o_mini_streaming | 11.11% | 12.28% | 36.97% | 15.19% | 20.47% | 19.20% |
5 | assemblyai_streaming | 3.94% | - | - | 14.01% | 14.56% | 10.83% |
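For readers who want to reproduce figures like these on their own audio, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It assumes plain whitespace tokenization and no text normalization; the benchmark's own normalization and aggregation rules are not stated in this post. The `macro_average` helper is hypothetical and assumes that an overall score is an unweighted mean over the categories a model supports, with unsupported categories (shown as "-" in the table) skipped.

```python
from typing import Dict, Optional


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.split()   # assumption: whitespace tokenization only
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def macro_average(per_category: Dict[str, Optional[float]]) -> float:
    """Hypothetical aggregation: unweighted mean over supported categories."""
    scores = [s for s in per_category.values() if s is not None]
    if not scores:
        raise ValueError("at least one category score is required")
    return sum(scores) / len(scores)


# Usage with the disfluency example from the dataset description.
reference = "see uh uh i i when i went i thought the food was not good"
# Illustrative transcript that drops one "uh" and one repeated "i":
# 2 deleted words out of 15 reference words.
hypothesis = "see uh i when i went i thought the food was not good"
print(f"WER: {wer(reference, hypothesis):.2%}")
```

Production evaluations usually lowercase text and strip punctuation before scoring, and code-switched audio additionally needs a consistent script policy (for example, whether "movie" in Latin script and its Devanagari transliteration count as the same word), so raw numbers from this sketch are not directly comparable to the table.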