ASR Benchmarks

Summary

Best Overall Model: 🏆 smallestai_streaming delivers the lowest overall Word Error Rate (WER) among the models that cover English, Hindi, and code-switched audio, while also handling noisy and disfluent speech effectively. It edges out deepgram_nova3_streaming on Hindi, disfluent, and noisy audio and on overall WER, while Deepgram is marginally ahead on English and code-switched audio, making smallestai_streaming the top choice for multilingual and mixed-context scenarios. Note: AssemblyAI’s streaming ASR does not support Hindi or code-switched transcription; it is English-only in streaming mode, so its overall WER covers only the English, disfluency, and noisy categories.

Test Dataset - Categories Overview

The internal test dataset comprises a diverse set of speech samples designed to evaluate ASR performance across real-world conditions, including language variation, background noise, and natural conversation artifacts.

Code-Switched: Hindi-English mix within the same conversation.
Example: “यह movie बहुत अच्छी थी but ending थोड़ी confusing लगी”
- How it was created: Generated using our in-house TTS on curated code-switched text, ensuring natural alternation between languages in the same sentence or utterance.
Hindi: Traditional Hindi in Devanagari.
Example: “हमने उसका जन्मदिन मनाया”
- How it was created: Recorded and synthesized from native Hindi speakers across varied topics, from casual conversations to descriptive narratives.
English: Standard English in Latin script.
Example: “jovial joggers joyfully joined jogging jaunts justifying joyful jolliness”
- How it was created: Includes tongue twisters, technical/scientific terminology, and diverse domains such as technology, healthcare, and finance to evaluate robustness across vocabularies.
Disfluency: Audio containing hesitation words, repetitions, and self-corrections.
Example: “see uh uh i i when i went i thought the food was not good”
- How it was created: Real recordings of people speaking naturally, sourced from Atoms call recordings to capture genuine in-the-wild speech patterns.
Noisy: Audio with background interference, low-quality microphones, or multiple speakers.
- How it was created: Real recordings from Atoms call data with actual customer interactions, ambient sounds, and overlapping speech, replicating real-world ASR deployment conditions.
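The three language categories above are distinguished by script, so a transcript can be bucketed roughly by checking which scripts it contains. The following is a minimal sketch under that assumption; the function name `classify_script` and the returned labels are hypothetical and are not part of the benchmark tooling described here.

```python
def classify_script(text: str) -> str:
    """Roughly bucket a transcript by the scripts it contains.

    Hypothetical helper: assumes Hindi is written in Devanagari
    (Unicode block U+0900 to U+097F) and English in ASCII Latin letters.
    """
    has_devanagari = any("\u0900" <= ch <= "\u097f" for ch in text)
    has_latin = any(ch.isascii() and ch.isalpha() for ch in text)
    if has_devanagari and has_latin:
        return "code-switched"
    if has_devanagari:
        return "hindi"
    if has_latin:
        return "english"
    return "unknown"


# Example utterances taken from the category descriptions above.
print(classify_script("यह movie बहुत अच्छी थी but ending थोड़ी confusing लगी"))
print(classify_script("हमने उसका जन्मदिन मनाया"))
print(classify_script("jovial joggers joyfully joined jogging jaunts"))
```

A script check like this cannot separate the Disfluency or Noisy categories, which depend on the audio rather than the text, and it would miss Hindi words that have been romanized.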

Model Overview

Model Name	Provider	Type
smallestai_streaming	Smallest AI	WebSocket Streaming
gpt4o_mini_streaming	OpenAI	WebSocket Streaming
gpt4o_streaming	OpenAI	WebSocket Streaming
assemblyai_streaming	AssemblyAI	WebSocket Streaming
deepgram_nova3_streaming	Deepgram	WebSocket Streaming

Performance Benchmarks

Accuracy Metrics

Rank	Model	English WER	Hindi WER	Code-Switched WER	Disfluency WER	Noisy WER	Overall WER
1	smallestai_streaming	2.10%	22.74%	12.33%	9.99%	15.52%	12.53%
2	deepgram_nova3_streaming	2.05%	23.10%	10.90%	10.20%	15.90%	12.66%
3	gpt4o_streaming	10.19%	9.93%	29.58%	12.00%	22.06%	16.75%
4	gpt4o_mini_streaming	11.11%	12.28%	36.97%	15.19%	20.47%	19.20%
5	assemblyai_streaming	3.94%	-	-	14.01%	14.56%	10.83%
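For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the reference transcript and the model output, divided by the number of reference words. The sketch below shows that standard definition. It assumes whitespace tokenization and no text normalization, which may differ from the scoring pipeline behind the table above; the function name `wer` is illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance / number of reference words.

    Assumption: words are split on whitespace and compared exactly,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# A model that drops one repeated "uh" and one repeated "i"
# makes 2 deletions against 8 reference words.
print(wer("see uh uh i i when i went", "see uh i when i went"))  # 0.25
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.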
