Summary

Best Overall Model: 🏆 smallestai_streaming delivers the lowest overall Word Error Rate (WER) among the models that support English, Hindi, and code-switched audio, while also handling noisy and disfluent speech effectively. It edges out deepgram_nova3_streaming on Hindi, disfluent, and noisy audio and on overall WER (deepgram_nova3_streaming is marginally ahead on English and code-switched audio), making it the top choice for multilingual and mixed-context scenarios. Note: AssemblyAI’s streaming ASR does not support Hindi or code-switched transcription; it is English-only in streaming mode.

Test Dataset - Categories Overview

The internal test dataset comprises a diverse set of speech samples designed to evaluate ASR performance across real-world conditions, including language variation, background noise, and natural conversation artifacts.
  • Code-Switched: Hindi-English mix within the same conversation.
    Example: “यह movie बहुत अच्छी थी but ending थोड़ी confusing लगी”
    • How it was created: Generated using our in-house TTS on curated code-switched text, ensuring natural alternation between languages in the same sentence or utterance.
  • Hindi: Traditional Hindi in Devanagari.
    Example: “हमने उसका जन्मदिन मनाया”
    • How it was created: Recorded and synthesized from native Hindi speakers using varied topics, from casual conversations to descriptive narratives.
  • English: Standard English in Latin script.
    Example: “jovial joggers joyfully joined jogging jaunts justifying joyful jolliness”
    • How it was created: Includes tongue twisters, technical/scientific terminology, and diverse domains such as technology, healthcare, and finance to evaluate robustness across vocabularies.
  • Disfluency: Audio containing hesitation words, repetitions, and self-corrections.
    Example: “see uh uh i i when i went i thought the food was not good”
    • How it was created: Real recordings of people speaking naturally, sourced from Atoms call recordings to capture genuine in-the-wild speech patterns.
  • Noisy: Audio with background interference, low-quality microphones, or multiple speakers.
    • How it was created: Real recordings from Atoms call data with actual customer interactions, ambient sounds, and overlapping speech, replicating real-world ASR deployment conditions.
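The language categories above are distinguished by script (Devanagari for Hindi, Latin for English, both for code-switched). As a rough illustration only, and not the method used to build the dataset, a transcript could be bucketed by the scripts it contains. The function name `classify_script` and the category labels are hypothetical; Hindi written in Latin script would be labelled "english" by this sketch.

```python
def classify_script(text: str) -> str:
    """Bucket a transcript by the scripts it contains (illustrative sketch)."""
    # Devanagari occupies the Unicode block U+0900 to U+097F.
    has_devanagari = any("\u0900" <= ch <= "\u097F" for ch in text)
    # Treat ASCII letters as Latin script; digits and punctuation are ignored.
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)

    if has_devanagari and has_latin:
        return "code-switched"
    if has_devanagari:
        return "hindi"
    if has_latin:
        return "english"
    return "unknown"


if __name__ == "__main__":
    samples = [
        "यह movie बहुत अच्छी थी but ending थोड़ी confusing लगी",
        "हमने उसका जन्मदिन मनाया",
        "jovial joggers joyfully joined jogging jaunts",
    ]
    for sample in samples:
        print(classify_script(sample), "<-", sample)
```

Applied to the example sentences listed above, this labels the first as code-switched, the second as Hindi, and the third as English.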

Model Overview

Model Name               | Provider    | Type
smallestai_streaming     | Smallest AI | WebSocket Streaming
gpt4o_mini_streaming     | OpenAI      | WebSocket Streaming
gpt4o_streaming          | OpenAI      | WebSocket Streaming
assemblyai_streaming     | AssemblyAI  | WebSocket Streaming
deepgram_nova3_streaming | Deepgram    | WebSocket Streaming

Performance Benchmarks

Accuracy Metrics

Rank | Model                    | English WER | Hindi WER | Code-Switched WER | Disfluency WER | Noisy WER | Overall WER
1    | smallestai_streaming     | 2.10%       | 22.74%    | 12.33%            | 9.99%          | 15.52%    | 12.53%
2    | deepgram_nova3_streaming | 2.05%       | 23.10%    | 10.90%            | 10.20%         | 15.90%    | 12.66%
3    | gpt4o_streaming          | 10.19%      | 9.93%     | 29.58%            | 12.00%         | 22.06%    | 16.75%
4    | gpt4o_mini_streaming     | 11.11%      | 12.28%    | 36.97%            | 15.19%         | 20.47%    | 19.20%
5    | assemblyai_streaming     | 3.94%       | -         | -                 | 14.01%         | 14.56%    | 10.83%
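For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below shows that standard definition; it is not the evaluation code behind the table, and it assumes whitespace tokenization with no text normalization, which a real benchmark would likely add. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    reference = "see uh uh i i when i went i thought the food was not good"
    # Hypothetical model output that drops one "uh" and one repeated "i".
    hypothesis = "see uh i when i went i thought the food was not good"
    print(f"WER: {word_error_rate(reference, hypothesis):.2%}")
```

In this usage example the reference has 15 words and the hypothetical output is missing two of them, so the WER is 2/15, or about 13.33%.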