The utterances array aggregates contiguous words into sentence-level segments, providing structured timing information for longer audio chunks. Sentence-level timestamps are supported in both the Pre-Recorded and Real-Time APIs.
Enabling sentence-level timestamps
Pre-Recorded API
For the Pre-Recorded API, set word_timestamps=true in your query parameters. When word timestamps are enabled, the response includes both words and utterances arrays (see the request sketch below).
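For illustration, here is a minimal request sketch, assuming a REST endpoint that accepts an audio file upload. The endpoint URL, authentication header, and upload field name are placeholders invented for this example; substitute the real values from the API reference.

```python
import requests

# Hypothetical endpoint and credentials -- placeholders, not the real API.
API_URL = "https://api.example.com/v1/transcribe"
API_KEY = "YOUR_API_KEY"

with open("meeting.wav", "rb") as audio_file:
    response = requests.post(
        API_URL,
        params={"word_timestamps": "true"},  # enables words and utterances
        headers={"Authorization": f"Bearer {API_KEY}"},
        files={"audio": audio_file},
    )

result = response.json()

# With word timestamps enabled, the response carries both arrays.
for utterance in result.get("utterances", []):
    print(f"{utterance['start']:.2f}-{utterance['end']:.2f}: {utterance['text']}")
```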
Real-Time API (WebSocket)
For the Real-Time WebSocket API, set sentence_timestamps=true as a query parameter when establishing the WebSocket connection, as in the sketch below.
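A minimal streaming sketch under the same caveats: the host and path are placeholders, and the message envelope is assumed to carry the utterances array described below.

```python
import asyncio
import json

import websockets  # pip install websockets

# Hypothetical URL -- the flag is passed as a query parameter at connect time.
WS_URL = "wss://api.example.com/v1/stream?sentence_timestamps=true"

async def stream() -> None:
    async with websockets.connect(WS_URL) as ws:
        # ... send audio frames to the socket here ...
        async for message in ws:
            event = json.loads(message)
            # Sentence-level segments arrive in the utterances array.
            for u in event.get("utterances", []):
                print(f"{u['start']:.2f}-{u['end']:.2f}: {u['text']}")

asyncio.run(stream())
```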
Output format
Each utterances entry contains text, start, and end fields, plus an optional speaker field when diarization is enabled. Use these sentence-level timestamps when you need to display readable captions, synchronize longer spans of audio, or store structured call summaries; the caption example below shows one such use.
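As one concrete use, the sketch below renders an utterances array as SRT-style captions. The field names follow the description above; the sample data is invented.

```python
def to_srt(utterances: list[dict]) -> str:
    """Render utterances (text, start, end in seconds) as SRT caption blocks."""
    def ts(seconds: float) -> str:
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    blocks = []
    for i, u in enumerate(utterances, start=1):
        blocks.append(f"{i}\n{ts(u['start'])} --> {ts(u['end'])}\n{u['text']}\n")
    return "\n".join(blocks)

print(to_srt([
    {"text": "Hello world.", "start": 0.0, "end": 0.9},
    {"text": "How are you today?", "start": 1.2, "end": 2.5},
]))
```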
Sample response
Pre-Recorded API
This response includes the speaker field because diarize was enabled in the query; an illustrative body is sketched below.
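The original sample body is not reproduced here, so the following is an illustrative sketch built from the fields documented above. The top-level envelope (a words array alongside utterances) follows the enabling section, but the exact key names are assumptions.

```json
{
  "words": [
    { "word": "Hello", "start": 0.0, "end": 0.4, "speaker": 0 },
    { "word": "world.", "start": 0.4, "end": 0.9, "speaker": 0 }
  ],
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.9, "speaker": 0 },
    { "text": "How are you today?", "start": 1.2, "end": 2.5, "speaker": 1 }
  ]
}
```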
Real-Time API (WebSocket)

When diarize=true is enabled, the utterances array in Real-Time responses also includes a speaker field (integer ID). For example:

```json
{ "text": "Hello world.", "start": 0.0, "end": 0.9, "speaker": 0 }
```
