Our evaluation guide outlines a repeatable process: choose representative audio, generate transcripts, compute WER/CER/latency, and document findings. Use the streamlined steps below (with ready-to-run snippets) to mirror that workflow.

1. Assemble your dataset matrix

  • Collect 50–200 files per use case (support calls, meetings, media, etc.).
  • Produce verified transcripts plus optional speaker labels and timestamps.
  • Track metadata for accent, language, and audio quality so you can pivot metrics later (a loader sketch follows the example below).
dataset = [
    {"audio": "samples/en_agent01.wav", "reference": "Thank you for calling.", "language": "en"},
    {"audio": "samples/es_call02.wav", "reference": "Hola, ¿en qué puedo ayudarte?", "language": "es"},
]
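If you keep that metadata in a CSV manifest, a small loader keeps the matrix reproducible. This is a minimal sketch assuming a hypothetical manifest.csv with audio, reference, language, accent, and audio_quality columns; adjust the column names to whatever you actually track.

import csv

def load_dataset(manifest_path="manifest.csv"):
    # Hypothetical manifest: audio,reference,language,accent,audio_quality
    with open(manifest_path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

# dataset = load_dataset()  # each row carries accent/quality metadata for later pivots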

2. Install the evaluation toolkit

pip install smallestai jiwer whisper-normalizer pandas
  • smallestai → Lightning STT client
  • jiwer → WER/CER computation
  • whisper-normalizer → normalization that matches the official guidance

3. Transcribe + normalize

import os
from jiwer import wer, cer
from whisper_normalizer.english import EnglishTextNormalizer
from smallestai.waves import WavesClient

client = WavesClient(api_key=os.environ["SMALLEST_AI_API_KEY"])
# EnglishTextNormalizer is English-only; for non-English references (like the Spanish
# sample above), whisper_normalizer.basic.BasicTextNormalizer is a common fallback.
normalizer = EnglishTextNormalizer()

def run_sample(sample):
    response = client.transcribe(
        audio_file=sample["audio"],
        language=sample["language"],
        # Read optional flags from the sample so the config comparison in step 6 takes effect
        word_timestamps=sample.get("word_timestamps", True),
        diarize=sample.get("diarize", True)
    )
    ref = normalizer(sample["reference"])
    hyp = normalizer(response.transcription)
    return {
        "path": sample["audio"],
        "wer": wer(ref, hyp),
        "cer": cer(ref, hyp),
        "latency_ms": response.metrics["latency_ms"],
        "rtf": response.metrics["real_time_factor"],
        "transcription": response.transcription
    }
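Before batching, a quick single-sample smoke test confirms credentials and audio paths (this only uses the names defined above):

first = run_sample(dataset[0])
print(f"{first['path']}: WER={first['wer']:.3f}, latency={first['latency_ms']} ms")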

4. Batch evaluation + aggregation

import pandas as pd

results = [run_sample(s) for s in dataset]
df = pd.DataFrame(results)

summary = {
    "samples": len(df),
    "avg_wer": df.wer.mean(),
    "p95_wer": df.wer.quantile(0.95),
    "avg_latency_ms": df.latency_ms.mean(),
    "p95_latency_ms": df.latency_ms.quantile(0.95),
    "avg_rtf": df.rtf.mean()
}
  • WER / CER per use case and language.
  • Time to first result and RTF from response.metrics.
  • Diarization coverage: the percentage of utterance entries that carry a speaker label (see the sketch below for one way to compute these cuts).
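A minimal sketch for those pivots. It assumes run_sample also copies sample metadata (e.g. a "language" key) into its result dict, and that a diarized response exposes a list of utterances with speaker labels; both are assumptions about how you shape the results, not part of the code above.

# Assumes each result dict also carries "language" (and any other metadata you track)
per_language = df.groupby("language")[["wer", "cer"]].mean()

latency_percentiles = df.latency_ms.quantile([0.5, 0.9, 0.95])

# Hypothetical helper: if diarized responses expose utterances with speaker labels
def diarization_coverage(utterances):
    labelled = [u for u in utterances if u.get("speaker") is not None]
    return len(labelled) / len(utterances) if utterances else 0.0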

5. Error analysis

def breakdown(df):
    worst = df.sort_values("wer", ascending=False).head(5)[["path", "wer", "transcription"]]
    return worst.to_dict(orient="records")

outliers = breakdown(df)
  • Classify errors into substitutions, deletions, and insertions (see the sketch below).
  • Highlight audio traits (noise, accent) that correlate with higher WER.
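jiwer can break a comparison down into those error types directly; a minimal sketch using jiwer.process_words (available in recent jiwer releases):

from jiwer import process_words

def error_counts(reference, hypothesis):
    # process_words returns substitution/deletion/insertion/hit counts alongside WER
    out = process_words(reference, hypothesis)
    return {
        "substitutions": out.substitutions,
        "deletions": out.deletions,
        "insertions": out.insertions,
        "hits": out.hits,
    }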

6. Compare configurations

configs = [
    {"language": "en", "word_timestamps": True},
    {"language": "multi", "word_timestamps": True, "diarize": True}
]

def evaluate_config(config):
    return [run_sample({**s, **config}) for s in dataset]

for config in configs:
    cfg_results = pd.DataFrame(evaluate_config(config))
    print(config, cfg_results.wer.mean())
Use this to decide whether to enable diarization, sentence-level timestamps, or enrichment features; the official evaluation doc recommends capturing cost/latency impact alongside accuracy.
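One way to capture latency alongside accuracy in the same loop is to aggregate both per configuration; this sketch only reuses the names defined above.

rows = []
for config in configs:
    cfg_results = pd.DataFrame(evaluate_config(config))
    rows.append({
        "config": str(config),
        "avg_wer": cfg_results.wer.mean(),
        "avg_latency_ms": cfg_results.latency_ms.mean(),
        "p95_latency_ms": cfg_results.latency_ms.quantile(0.95),
    })

comparison = pd.DataFrame(rows)
print(comparison)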

7. Publish the report

Include:
  1. Dataset description + rationale
  2. Metrics table (WER/CER/TTFR/RTF, p50/p90/p95)
  3. Error taxonomy with audio snippets
  4. Configuration recommendation (e.g., language=multi, word_timestamps=true, diarize=true)
  5. Follow-up experiments or model versions to track

Example JSON summary

{
  "dataset": "contact-center-q1",
  "samples": 120,
  "average_wer": 0.064,
  "average_cer": 0.028,
  "average_latency_ms": 61.3,
  "average_rtf": 0.41,
  "p95_latency_ms": 88.2,
  "timestamp": "2025-01-15T10:00:00Z"
}
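A small sketch for writing a summary like the one above from the aggregated DataFrame; the dataset name and output filename are placeholders.

import json
from datetime import datetime, timezone

report = {
    "dataset": "contact-center-q1",  # placeholder name
    "samples": int(len(df)),
    "average_wer": round(float(df.wer.mean()), 3),
    "average_cer": round(float(df.cer.mean()), 3),
    "average_latency_ms": round(float(df.latency_ms.mean()), 1),
    "average_rtf": round(float(df.rtf.mean()), 2),
    "p95_latency_ms": round(float(df.latency_ms.quantile(0.95)), 1),
    "timestamp": datetime.now(timezone.utc).isoformat(),
}

with open("stt_eval_summary.json", "w") as f:  # placeholder path
    json.dump(report, f, indent=2)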
This completes the self-evaluation workflow. With these steps, you can identify the strengths and weaknesses of any STT model.