Enabling speaker diarization

Pre-Recorded API

Pass diarize=true when calling the Pulse STT POST endpoint. The parameter can be combined with other enrichment options (timestamps, emotions, etc.) without changing your audio payload.
curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/audio.wav"

Real-Time WebSocket API

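Add diarize=true to the query parameters when connecting to the Pulse STT WebSocket API.
// The headers option below requires a Node.js client such as the "ws" package;
// the browser WebSocket constructor cannot set custom headers.
import WebSocket from "ws";

// Read the key from the same environment variable used in the curl examples.
const API_KEY = process.env.SMALLEST_API_KEY;

const url = new URL("wss://waves-api.smallest.ai/api/v1/pulse/get_text");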
url.searchParams.append("language", "en");
url.searchParams.append("encoding", "linear16");
url.searchParams.append("sample_rate", "16000");
url.searchParams.append("diarize", "true");

const ws = new WebSocket(url.toString(), {
  headers: {
    Authorization: `Bearer ${API_KEY}`,
  },
});

Output format & fields of interest

When diarize=true, every entry in words includes a speaker field. The real-time API returns an integer ID (0, 1, …) together with a speaker_confidence score (0.0 to 1.0); the pre-recorded API returns a string label (speaker_0, speaker_1, …) and no confidence score. The utterances array also carries speaker labels, so you can reconstruct conversations, build turn-taking analytics, or display multi-speaker captions.

Pre-Recorded API

Sample request

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&diarize=true" \
  --header "Authorization: Bearer $SMALLEST_API_KEY" \
  --header "Content-Type: audio/wav" \
  --data-binary "@/path/to/two-speaker.wav"

Sample response

Pre-Recorded API Response

{
  "transcription": "Agent: Hello world. Customer: Hi there.",
  "words": [
    { "start": 0.0, "end": 0.4, "speaker": "speaker_0", "word": "Hello" },
    { "start": 0.4, "end": 0.8, "speaker": "speaker_0", "word": "world." },
    { "start": 1.0, "end": 1.2, "speaker": "speaker_1", "word": "Hi" },
    { "start": 1.2, "end": 1.6, "speaker": "speaker_1", "word": "there." }
  ],
  "utterances": [
    { "text": "Hello world.", "start": 0.0, "end": 0.8, "speaker": "speaker_0" },
    { "text": "Hi there.", "start": 1.0, "end": 1.6, "speaker": "speaker_1" }
  ]
}
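
As a rough illustration of reconstructing a conversation from the pre-recorded response, the sketch below prints one line per utterance. It assumes the response has already been parsed from JSON and has the shape of the sample above; formatConversation is a hypothetical helper name, not part of the API.

```javascript
// Sketch only: render a diarized pre-recorded response as one line per utterance.
// Field names (utterances, text, start, speaker) follow the sample response;
// formatConversation is a hypothetical helper, not something the API provides.
function formatConversation(response) {
  // Fall back to an empty list if the response has no utterances array.
  const utterances = response.utterances ?? [];
  return utterances
    .map((u) => `[${u.start.toFixed(1)}s] ${u.speaker}: ${u.text}`)
    .join("\n");
}

// Trimmed copy of the sample response shown above.
const sample = {
  utterances: [
    { text: "Hello world.", start: 0.0, end: 0.8, speaker: "speaker_0" },
    { text: "Hi there.", start: 1.0, end: 1.6, speaker: "speaker_1" },
  ],
};

console.log(formatConversation(sample));
// [0.0s] speaker_0: Hello world.
// [1.0s] speaker_1: Hi there.
```

The labels are only stable identifiers within one response; mapping speaker_0 to a name such as "Agent" would be up to your application.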

Real-Time WebSocket API Response

{
  "session_id": "sess_12345abcde",
  "transcript": "Hello world. Hi there.",
  "is_final": true,
  "is_last": false,
  "language": "en",
  "words": [
    {
      "word": "Hello",
      "start": 0.0,
      "end": 0.4,
      "confidence": 0.98,
      "speaker": 0,
      "speaker_confidence": 0.95
    },
    {
      "word": "world.",
      "start": 0.4,
      "end": 0.8,
      "confidence": 0.97,
      "speaker": 0,
      "speaker_confidence": 0.92
    },
    {
      "word": "Hi",
      "start": 1.0,
      "end": 1.2,
      "confidence": 0.99,
      "speaker": 1,
      "speaker_confidence": 0.88
    },
    {
      "word": "there.",
      "start": 1.2,
      "end": 1.6,
      "confidence": 0.96,
      "speaker": 1,
      "speaker_confidence": 0.91
    }
  ],
  "utterances": [
    {
      "text": "Hello world.",
      "start": 0.0,
      "end": 0.8,
      "speaker": 0
    },
    {
      "text": "Hi there.",
      "start": 1.0,
      "end": 1.6,
      "speaker": 1
    }
  ]
}
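
One possible way to consume the real-time fields is to merge consecutive words from the same speaker into turns and keep the lowest speaker_confidence seen in each turn, so uncertain attributions can be flagged. This is a minimal sketch that assumes every word carries speaker and speaker_confidence as in the sample above; groupWordsBySpeaker and minSpeakerConfidence are hypothetical names.

```javascript
// Sketch only: merge consecutive same-speaker words from a real-time message
// into turns. Assumes each word has speaker and speaker_confidence, as in the
// sample message; groupWordsBySpeaker is a hypothetical helper.
function groupWordsBySpeaker(words) {
  const turns = [];
  for (const w of words) {
    const last = turns[turns.length - 1];
    if (last && last.speaker === w.speaker) {
      // Same speaker as the previous word: extend the current turn.
      last.text += " " + w.word;
      last.end = w.end;
      last.minSpeakerConfidence = Math.min(
        last.minSpeakerConfidence,
        w.speaker_confidence
      );
    } else {
      // Speaker changed (or first word): start a new turn.
      turns.push({
        speaker: w.speaker,
        text: w.word,
        start: w.start,
        end: w.end,
        minSpeakerConfidence: w.speaker_confidence,
      });
    }
  }
  return turns;
}

// Words copied from the sample real-time message (word-level confidence omitted).
const sampleWords = [
  { word: "Hello", start: 0.0, end: 0.4, speaker: 0, speaker_confidence: 0.95 },
  { word: "world.", start: 0.4, end: 0.8, speaker: 0, speaker_confidence: 0.92 },
  { word: "Hi", start: 1.0, end: 1.2, speaker: 1, speaker_confidence: 0.88 },
  { word: "there.", start: 1.2, end: 1.6, speaker: 1, speaker_confidence: 0.91 },
];

for (const turn of groupWordsBySpeaker(sampleWords)) {
  console.log(`speaker ${turn.speaker}: ${turn.text}`);
}
```

In a WebSocket client, a helper like this would typically be called on the words array of each parsed message; what confidence threshold counts as "uncertain" is an application choice.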

Response Fields

| Field | Type | When Included | Description |
| --- | --- | --- | --- |
| speaker | integer (real-time) / string (pre-recorded) | diarize=true | Speaker label. The real-time API uses integer IDs (0, 1, …); the pre-recorded API uses string labels (speaker_0, speaker_1, …). |
| speaker_confidence | number | diarize=true (real-time only) | Confidence score for the speaker assignment (0.0 to 1.0). |
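
Because the speaker field differs in type between the two APIs, code that handles both may want a single representation. The sketch below converts either form to an integer index; speakerIndex is a hypothetical helper, and the speaker_N pattern is assumed from the examples on this page rather than from a formal specification.

```javascript
// Sketch only: normalize a speaker label to an integer index.
// Accepts real-time integer IDs (0, 1, ...) and pre-recorded string labels
// ("speaker_0", "speaker_1", ...). The string pattern is an assumption based
// on the sample responses. Returns null for anything unrecognized.
function speakerIndex(speaker) {
  if (typeof speaker === "number" && Number.isInteger(speaker)) {
    return speaker;
  }
  if (typeof speaker === "string") {
    const match = /^speaker_(\d+)$/.exec(speaker);
    if (match) {
      return Number(match[1]);
    }
  }
  return null;
}

console.log(speakerIndex(0)); // 0
console.log(speakerIndex("speaker_1")); // 1
console.log(speakerIndex("agent")); // null
```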