POST /api/v1/pulse/get_text
Convert speech to text
curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/pulse/get_text \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/octet-stream' \
  --data-binary '@<audio_file>'
{
  "status": "success",
  "transcription": "Hello world.",
  "words": [
    {
      "start": 0,
      "end": 0.5,
      "speaker": "speaker_0",
      "word": "Hello"
    },
    {
      "start": 0.6,
      "end": 0.9,
      "speaker": "speaker_0",
      "word": "world."
    }
  ],
  "utterances": [
    {
      "text": "Hello world.",
      "start": 0,
      "end": 0.9,
      "speaker": "speaker_0"
    }
  ],
  "age": "adult",
  "gender": "male",
  "emotions": {
    "happiness": 0.8,
    "sadness": 0.15,
    "disgust": 0.02,
    "fear": 0.03,
    "anger": 0.05
  },
  "metadata": {
    "filename": "audio.mp3",
    "duration": 1.7,
    "fileSize": 1000000
  }
}
The STT POST API converts speech to text and accepts input in two ways:
  1. Raw Audio Bytes (application/octet-stream) - send raw audio data in the request body, with all parameters as query parameters
  2. Audio URL (application/json) - send a JSON body containing only a URL to an audio file, again with all other parameters as query parameters
Both methods use our Pulse STT model with automatic language detection across 30+ languages.

Authentication

This endpoint requires authentication using a Bearer token in the Authorization header:
Authorization: Bearer YOUR_API_KEY

Input Methods

Choose the input method that best fits your use case:
Method      Content Type               Use Case                                     Parameters
Raw Bytes   application/octet-stream   Streaming audio data, real-time processing   Query parameters
Audio URL   application/json           Remote audio files, webhook processing       Query parameters

Code Examples

Method 1: Raw Audio Bytes (application/octet-stream)

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&word_timestamps=true&diarize=true&age_detection=true&gender_detection=true&emotion_detection=true" \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: audio/wav' \
  --data-binary '@/path/to/your/audio.wav'

Method 2: Audio URL (application/json)

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&word_timestamps=true&diarize=true&age_detection=true&gender_detection=true&emotion_detection=true" \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/audio.mp3"
  }'
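The two curl calls above can be mirrored in Python. Below is a minimal sketch using only the standard library; the token, audio file contents, and extra query parameters are placeholders you supply, and the two `transcribe_*` helpers are illustrative wrappers, not an official client:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://waves-api.smallest.ai/api/v1/pulse/get_text"


def build_url(**params):
    """Attach STT options (model, language, diarize, ...) as query parameters."""
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def transcribe_bytes(token, audio_bytes, content_type="audio/wav", **params):
    """Method 1: POST raw audio bytes; Content-Type names the audio format."""
    req = urllib.request.Request(
        build_url(model="pulse", **params),
        data=audio_bytes,
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def transcribe_url(token, audio_url, **params):
    """Method 2: POST a JSON body containing only the audio file's URL."""
    req = urllib.request.Request(
        build_url(model="pulse", **params),
        data=json.dumps({"url": audio_url}).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Build a request URL without sending anything.
url = build_url(model="pulse", language="en", word_timestamps="true", diarize="true")
```

Note that in both methods every option travels in the query string; only the audio (or its URL) goes in the body.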

Supported Languages

The Pulse STT model supports automatic language detection and transcription across 30+ languages. For the full list of supported languages, please check STT Supported Languages.
Specify the language of the input audio using its ISO 639-1 code. Use multi to enable automatic language detection from the supported list. The default is en (English).
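A small helper can validate the code before building a request. The set below is copied from the options listed under Query Parameters on this page; silently falling back to the documented default ("en") for unknown codes is just one reasonable policy, not API behavior:

```python
# ISO 639-1 codes accepted by the `language` query parameter,
# plus "multi" for automatic language detection.
SUPPORTED_LANGUAGES = {
    "it", "es", "en", "pt", "hi", "de", "fr", "uk", "ru", "kn", "ml",
    "pl", "mr", "gu", "cs", "sk", "te", "or", "nl", "bn", "lv", "et",
    "ro", "pa", "fi", "sv", "bg", "ta", "hu", "da", "lt", "mt", "multi",
}


def resolve_language(code):
    """Return a valid language code, falling back to the default "en"."""
    if code in SUPPORTED_LANGUAGES:
        return code
    return "en"
```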

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <api_key>, where <api_key> is your API key.

Query Parameters

model
enum<string>
required

The ASR model to use for transcription

Available options:
pulse
Example:

"pulse"

language
enum<string>
default:en

Language of the audio file. Use multi for automatic language detection

Available options:
it, es, en, pt, hi, de, fr, uk, ru, kn, ml, pl, mr, gu, cs, sk, te, or, nl, bn, lv, et, ro, pa, fi, sv, bg, ta, hu, da, lt, mt, multi
webhook_url
string<uri>

URL to the webhook to receive the transcription results

Example:

"https://example.com/webhook"

webhook_extra
string

Extra key-value pairs to attach to the transcription result delivered to your webhook; they are added to the webhook request body as a JSON object. Provide them as comma-separated key:value pairs in the query string, e.g. "custom_key:custom_value,custom_key2:custom_value2"

Example:

"custom_key:custom_value,custom_key2:custom_value2"
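The webhook_extra string can be assembled from a dict. A small sketch, assuming the format shown above is plain key:value pairs joined by commas with no escaping for literal ":" or "," inside values:

```python
def build_webhook_extra(extra):
    """Serialize {"k": "v", ...} into the "k:v,k2:v2" webhook_extra format."""
    return ",".join(f"{k}:{v}" for k, v in extra.items())


def parse_webhook_extra(s):
    """Inverse: split "k1:v1,k2:v2" back into a dict (empty string -> {})."""
    return dict(pair.split(":", 1) for pair in s.split(",")) if s else {}


extra = build_webhook_extra({"custom_key": "custom_value", "custom_key2": "custom_value2"})
```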

word_timestamps
boolean
default:false

Whether to include word and utterance level timestamps in the response

diarize
boolean
default:false

Whether to perform speaker diarization

age_detection
enum<string>
default:false

Whether to predict age group of the speaker

Available options:
true,
false
gender_detection
enum<string>
default:false

Whether to predict the gender of the speaker

Available options:
true,
false
emotion_detection
enum<string>
default:false

Whether to predict speaker emotions

Available options:
true,
false

Body

Raw audio bytes. Content-Type header should specify the audio format (e.g., audio/wav, audio/mp3). All parameters are passed as query parameters.

Response

Speech transcribed successfully

status
string

Status of the transcription request

Example:

"success"

transcription
string

The transcribed text from the audio file

Example:

"Hello world."

audio_length
number

Duration of the audio file in seconds

Example:

1.7

words
object[]

Word-level timestamps in seconds.

utterances
object[]

List of utterances with start and end times

age
enum<string>

Predicted age group of the speaker if requested

Available options:
infant,
teenager,
adult,
old
Example:

"adult"

gender
enum<string>

Predicted gender of the speaker if requested

Available options:
male,
female
Example:

"male"

emotions
object

Predicted emotions of the speaker if requested

metadata
object

Metadata about the transcription
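Putting the response fields together, here is a sketch that consumes a parsed response, using the example payload from the top of this page. Since age, gender, emotions, and words only appear when the corresponding detection flags were enabled, the optional fields are read defensively:

```python
# Example response payload from this page.
response = {
    "status": "success",
    "transcription": "Hello world.",
    "words": [
        {"start": 0, "end": 0.5, "speaker": "speaker_0", "word": "Hello"},
        {"start": 0.6, "end": 0.9, "speaker": "speaker_0", "word": "world."},
    ],
    "utterances": [
        {"text": "Hello world.", "start": 0, "end": 0.9, "speaker": "speaker_0"}
    ],
    "age": "adult",
    "gender": "male",
    "emotions": {"happiness": 0.8, "sadness": 0.15, "disgust": 0.02,
                 "fear": 0.03, "anger": 0.05},
    "metadata": {"filename": "audio.mp3", "duration": 1.7, "fileSize": 1000000},
}

text = response["transcription"]

# Highest-scoring emotion, or None if emotion_detection was off.
emotions = response.get("emotions") or {}
dominant_emotion = max(emotions, key=emotions.get) if emotions else None

# Distinct speaker labels from the word-level timestamps (requires diarize).
speakers = {w["speaker"] for w in response.get("words", [])}
```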