| Spoken (ASR output) | Written (with ITN) |
|---|---|
| "the total is twenty five dollars" | "the total is $25" |
| "call me at nine one zero five five five twelve thirty four" | "call me at 910-555-1234" |
| "the meeting is on january fifteenth twenty twenty six" | "the meeting is on January 15th, 2026" |
| "it costs three point five percent" | "it costs 3.5%" |
| "send it to john at gmail dot com" | "send it to john@gmail.com" |
| "i live at one two three main street" | "i live at 123 Main Street" |
Recommended Setup for Agentic Use Cases
Since ITN normalizes only finalized transcripts, it is recommended that you control finalization from your end for the best results:
- Set finalize_on_words=false so the server does not finalize internally based on word count.
- Set eou_timeout_ms to match your VAD (Voice Activity Detection) silence threshold.
- When your VAD detects that the user has stopped speaking, send {"type": "finalize"}. This finalizes the entire chunk, and ITN normalizes it as a whole.
Enabling ITN
Pass itn_normalize=true as a query parameter when connecting.
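As a rough sketch of what this can look like in client code, the snippet below appends the flag to a WebSocket URL. The endpoint `wss://asr.example.com/v1/stream` is a placeholder and not the service's real address, and `build_stream_url` is a hypothetical helper; only the `itn_normalize` parameter comes from this page.

```python
from urllib.parse import urlencode

# Placeholder endpoint (assumption): substitute your provider's real WebSocket URL.
BASE_URL = "wss://asr.example.com/v1/stream"


def build_stream_url(base_url: str, **params) -> str:
    """Append query parameters, rendering booleans as lowercase true/false."""
    rendered = {
        key: str(value).lower() if isinstance(value, bool) else str(value)
        for key, value in params.items()
    }
    return f"{base_url}?{urlencode(rendered)}"


url = build_stream_url(BASE_URL, itn_normalize=True)
print(url)  # wss://asr.example.com/v1/stream?itn_normalize=true
```

The same helper accepts any of the other query parameters as extra keyword arguments.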
Parameters
These parameters can be combined with itn_normalize to control transcription behavior:
| Parameter | Type | Default | Description |
|---|---|---|---|
| itn_normalize | boolean | false | Enable inverse text normalization |
| max_words | integer | pipeline default | Max words before forced finalization. Useful for keeping ITN chunks short and accurate |
| eou_timeout_ms | integer | 800 | End-of-utterance silence timeout in milliseconds. Lower values finalize faster |
| finalize_on_words | boolean | true | When false, disables automatic word-count-based finalization. Use this when you want full control over when to finalize via the finalize message |
| word_timestamps | boolean | false | Include per-word timestamps. ITN remaps timestamps when words collapse (e.g., "one two three" → "123" spans all three source timestamps) |
| numerals | string | "auto" | Digit formatting. When ITN is enabled, this is typically left as "auto" since ITN handles number conversion |
Supported Semiotic Classes
ITN covers all standard semiotic classes:
| Class | Example (spoken → written) |
|---|---|
| Cardinal | "one hundred twenty three" → "123" |
| Ordinal | "twenty first" → "21st" |
| Money | "twenty five dollars" → "$25" |
| Telephone | "nine one zero five five five one two three four" → "910-555-1234" |
| Date | "january fifteenth twenty twenty six" → "January 15th, 2026" |
| Time | "three thirty p m" → "3:30 PM" |
| Decimal | "three point one four" → "3.14" |
| Measure | "five kilograms" → "5 kg" |
| Electronic | "john at gmail dot com" → "john@gmail.com" |
| Address | "one two three main street" → "123 Main Street" |
| Verbatim | "a b c" → "ABC" |
Finalize Control
You can force-finalize the current utterance at any time by sending a JSON text message over the WebSocket: {"type": "finalize"}. This is especially useful with finalize_on_words=false, where automatic word-count-based finalization is disabled and you control exactly when each utterance boundary occurs. Common use cases:
- Form filling — Finalize after each form field to get a clean ITN result per field
- Voice commands — Finalize on button release or keyword detection
- Turn-based conversations — Finalize when the other party starts speaking
Responses triggered by a finalize message are marked from_finalize: true. Each finalize request gets such a response, so your request/response indexing stays in sync.
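One way to keep finalize requests and their responses paired on the client is sketched below. The {"type": "finalize"} message and the from_finalize flag come from this page; the `FinalizeTracker` class and the `transcript` field name are illustrative assumptions, so check them against the actual response schema.

```python
import json
from collections import deque


def finalize_message() -> str:
    """JSON text frame that asks the server to finalize the current utterance."""
    return json.dumps({"type": "finalize"})


class FinalizeTracker:
    """Pairs each finalize request with the next response flagged from_finalize."""

    def __init__(self):
        self._pending = deque()

    def request(self, label: str) -> str:
        # Remember what this finalize was for (e.g. a form field name).
        self._pending.append(label)
        return finalize_message()

    def on_response(self, raw: str):
        """Return (label, transcript) for finalize responses, else None."""
        msg = json.loads(raw)
        if not msg.get("from_finalize") or not self._pending:
            return None
        # "transcript" is an assumed field name, not confirmed by this page.
        return self._pending.popleft(), msg.get("transcript", "")


tracker = FinalizeTracker()
frame = tracker.request("phone_number")  # send this string over the WebSocket
result = tracker.on_response(
    '{"is_final": true, "from_finalize": true, "transcript": "910-555-1234"}'
)
```

Keeping the pending labels in a queue suits the form-filling case, where each finalize corresponds to one field.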
End Stream
To close the stream and flush remaining audio, send the end-of-stream message. The server then sends a final response with is_last: true and closes the connection.
Examples
Python — WebSocket with ITN
JavaScript — WebSocket with ITN
Python — Agentic Setup (Recommended for Voice AI)
Disable internal word-count finalization, set eou_timeout_ms to match your VAD, and send {"type": "finalize"} when your VAD detects end-of-speech. This finalizes the entire chunk and ITN normalizes it as a whole.
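The client-side half of this setup might look like the sketch below, which leaves out the network code. It assumes your VAD yields one speech/non-speech decision per fixed-length audio frame; `VadFinalizer`, `VAD_SILENCE_MS`, and the 600 ms value are hypothetical, while the query parameters and the finalize message are the ones described on this page.

```python
import json

VAD_SILENCE_MS = 600  # assumed VAD silence threshold; use your own value

# Query parameters from this page; eou_timeout_ms mirrors the VAD threshold.
AGENTIC_PARAMS = {
    "itn_normalize": "true",
    "finalize_on_words": "false",
    "eou_timeout_ms": str(VAD_SILENCE_MS),
}


class VadFinalizer:
    """Emits one finalize message when speech is followed by enough silence."""

    def __init__(self, silence_ms: int = VAD_SILENCE_MS):
        self.silence_ms = silence_ms
        self._silence_run = 0
        self._heard_speech = False

    def feed(self, is_speech: bool, frame_ms: int):
        """Feed one VAD decision; return a JSON frame to send, or None."""
        if is_speech:
            self._heard_speech = True
            self._silence_run = 0
            return None
        self._silence_run += frame_ms
        if self._heard_speech and self._silence_run >= self.silence_ms:
            # Wait for new speech before finalizing again.
            self._heard_speech = False
            self._silence_run = 0
            return json.dumps({"type": "finalize"})
        return None


finalizer = VadFinalizer()
decisions = [True] * 5 + [False] * 30  # 5 speech frames, then 30 silent 20 ms frames
frames = [finalizer.feed(d, frame_ms=20) for d in decisions]
sent = [f for f in frames if f is not None]  # exactly one finalize message
```

Requiring speech before the silence run avoids sending finalize messages during long stretches of leading or trailing silence.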
Combining ITN with Other Features
ITN works alongside all other post-processing features.
Response Format
When ITN is enabled, final responses contain the normalized transcript, with the following behavior:
- Word timestamps are remapped. When multiple spoken words collapse into one written token (e.g., "twenty five dollars" → "$25"), the output word spans the full time range of all source words and takes the max confidence.
- Punctuation is preserved. Periods, commas, and other punctuation from ASR output are stripped before ITN and reattached to the correct output token afterward.
- Interim responses are not normalized. ITN only runs on finalized transcripts (is_final: true) to avoid unnecessary processing on text that may still change.
- Capitalization is preserved. ITN runs in cased mode, so proper nouns and sentence-initial caps from the ASR model are maintained.
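The timestamp rule above can be illustrated with a short sketch. The `Word` fields (`text`, `start`, `end`, `confidence`) and the sample timings are assumptions made for illustration and may not match the actual response schema; only the merge rule itself (full time span, max confidence) is taken from this page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float       # seconds
    end: float         # seconds
    confidence: float


def merge_words(text: str, sources: list[Word]) -> Word:
    """Collapse several spoken words into one written token."""
    return Word(
        text=text,
        start=min(w.start for w in sources),            # earliest source start
        end=max(w.end for w in sources),                # latest source end
        confidence=max(w.confidence for w in sources),  # max confidence
    )


spoken = [
    Word("twenty", 1.00, 1.30, 0.91),
    Word("five", 1.30, 1.55, 0.97),
    Word("dollars", 1.55, 2.00, 0.88),
]
written = merge_words("$25", spoken)
# written spans 1.00 to 2.00 seconds with confidence 0.97
```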
How It Works
1. End-of-utterance detection: The ASR pipeline detects a natural pause or hits the max_words limit, producing a finalized transcript. Alternatively, you send {"type": "finalize"} to force it.
2. Punctuation stripping: Trailing punctuation (.,!?;:) is stripped from each word before normalization, since the underlying FST engine cannot parse through punctuation.
3. ITN normalization: The text is passed through a Weighted Finite State Transducer (WFST) that converts spoken-form entities to written form with word alignment tracking.
4. Punctuation reattachment: Stripped punctuation is mapped back to the correct output token using the word alignment from step 3.
5. Timestamp remapping: Word-level timestamps from ASR are remapped to the ITN output using the alignment, spanning collapsed words.
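The punctuation stripping and reattachment steps can be sketched as follows. This is not the service's implementation: the WFST is replaced by hand-written output tokens and an alignment, and attaching the punctuation of the last source word in each aligned group is an assumption about how "the correct output token" is chosen.

```python
TRAILING = ".,!?;:"


def strip_trailing(words):
    """Split each word into its core text and its trailing punctuation."""
    cores, puncts = [], []
    for word in words:
        core = word.rstrip(TRAILING)
        cores.append(core)
        puncts.append(word[len(core):])
    return cores, puncts


def reattach(output_tokens, alignment, puncts):
    """alignment[i] lists the source word indices behind output token i."""
    return [
        token + puncts[sources[-1]]  # punctuation of the last aligned source word
        for token, sources in zip(output_tokens, alignment)
    ]


asr_words = ["it", "costs", "twenty", "five", "dollars."]
cores, puncts = strip_trailing(asr_words)

# Stand-in for the WFST output: written tokens plus their source alignment.
itn_tokens = ["it", "costs", "$25"]
itn_alignment = [[0], [1], [2, 3, 4]]

final_tokens = reattach(itn_tokens, itn_alignment, puncts)
print(" ".join(final_tokens))  # it costs $25.
```

The same alignment is what allows timestamps to be remapped across collapsed words.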

