POST /api/v1/pulse/get_text
Convert speech to text
curl --request POST \
  --url https://waves-api.smallest.ai/api/v1/pulse/get_text \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/octet-stream' \
  --data-binary '@<audio_file>'
{
  "status": "success",
  "transcription": "Hello world.",
  "words": [
    {
      "start": 0,
      "end": 0.5,
      "speaker": "speaker_0",
      "word": "Hello"
    },
    {
      "start": 0.6,
      "end": 0.9,
      "speaker": "speaker_0",
      "word": "world."
    }
  ],
  "utterances": [
    {
      "text": "Hello world.",
      "start": 0,
      "end": 0.9,
      "speaker": "speaker_0"
    }
  ],
  "age": "adult",
  "gender": "male",
  "emotions": {
    "happiness": 0.8,
    "sadness": 0.15,
    "disgust": 0.02,
    "fear": 0.03,
    "anger": 0.05
  },
  "metadata": {
    "filename": "audio.mp3",
    "duration": 1.7,
    "fileSize": 1000000
  }
}
The STT POST API converts speech to text and accepts input in two ways:
  1. Raw Audio Bytes (application/octet-stream) - send raw audio data in the request body, with all parameters as query parameters
  2. Audio URL (application/json) - send a JSON body containing only a URL to an audio file, again with all other parameters as query parameters
Both methods use our Pulse STT model with automatic language detection across 30+ languages.

Authentication

This endpoint requires authentication using a Bearer token in the Authorization header:
Authorization: Bearer YOUR_API_KEY

Input Methods

Choose the input method that best fits your use case:
Method      Content Type               Use Case                                     Parameters
Raw Bytes   application/octet-stream   Streaming audio data, real-time processing   Query parameters
Audio URL   application/json           Remote audio files, webhook processing       Query parameters

Code Examples

Method 1: Raw Audio Bytes (application/octet-stream)

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&word_timestamps=true&diarize=true&age_detection=true&gender_detection=true&emotion_detection=true" \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: audio/wav' \
  --data-binary '@/path/to/your/audio.wav'

Method 2: Audio URL (application/json)

curl --request POST \
  --url "https://waves-api.smallest.ai/api/v1/pulse/get_text?model=pulse&language=en&word_timestamps=true&diarize=true&age_detection=true&gender_detection=true&emotion_detection=true" \
  --header 'Authorization: Bearer <token>' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://example.com/audio.mp3"
  }'
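The two curl calls above can be mirrored in Python. Below is a minimal sketch using only the standard library; the token, audio file contents, and extra query parameters are placeholders you supply, and the two `transcribe_*` helpers are illustrative wrappers, not an official client:

```python
import json
import urllib.parse
import urllib.request

BASE_URL = "https://waves-api.smallest.ai/api/v1/pulse/get_text"


def build_url(**params):
    """Attach STT options (model, language, diarize, ...) as query parameters."""
    return BASE_URL + "?" + urllib.parse.urlencode(params)


def transcribe_bytes(token, audio_bytes, content_type="audio/wav", **params):
    """Method 1: POST raw audio bytes; Content-Type names the audio format."""
    req = urllib.request.Request(
        build_url(model="pulse", **params),
        data=audio_bytes,
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def transcribe_url(token, audio_url, **params):
    """Method 2: POST a JSON body containing only the audio file's URL."""
    req = urllib.request.Request(
        build_url(model="pulse", **params),
        data=json.dumps({"url": audio_url}).encode(),
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Build a request URL without sending anything.
url = build_url(model="pulse", language="en", word_timestamps="true", diarize="true")
```

Note that in both methods every option travels in the query string; only the audio (or its URL) goes in the body.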

Supported Languages

The Pulse STT model supports automatic language detection and transcription across 30+ languages. For the full list of supported languages, please check STT Supported Languages.
Specify the language of the input audio using its ISO 639-1 code. Use multi to enable automatic language detection from the supported list. The default is en (English).
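A small helper can validate the code before building a request. The set below is copied from the options listed under Query Parameters on this page; silently falling back to the documented default ("en") for unknown codes is just one reasonable policy, not API behavior:

```python
# ISO 639-1 codes accepted by the `language` query parameter,
# plus "multi" for automatic language detection.
SUPPORTED_LANGUAGES = {
    "it", "es", "en", "pt", "hi", "de", "fr", "uk", "ru", "kn", "ml",
    "pl", "mr", "gu", "cs", "sk", "te", "or", "nl", "bn", "lv", "et",
    "ro", "pa", "fi", "sv", "bg", "ta", "hu", "da", "lt", "mt", "multi",
}


def resolve_language(code):
    """Return a valid language code, falling back to the default "en"."""
    if code in SUPPORTED_LANGUAGES:
        return code
    return "en"
```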

Authorizations

Authorization
string
header
required

Bearer authentication header of the form Bearer <api_key>, where <api_key> is your API key.

Query Parameters

model
enum<string>
required

The ASR model to use for transcription

Available options:
pulse
Example:

"pulse"

language
enum<string>
default:en

Language of the audio file. Use multi for automatic language detection

Available options:
it, es, en, pt, hi, de, fr, uk, ru, kn, ml, pl, mr, gu, cs, sk, te, or, nl, bn, lv, et, ro, pa, fi, sv, bg, ta, hu, da, lt, mt, multi
webhook_url
string<uri>

URL to the webhook to receive the transcription results

Example:

"https://example.com/webhook"

webhook_extra
string

Extra key-value pairs to attach to the transcription result delivered to your webhook; they are added to the webhook request body as a JSON object. Provide them as comma-separated key:value pairs in the query string, e.g. "custom_key:custom_value,custom_key2:custom_value2"

Example:

"custom_key:custom_value,custom_key2:custom_value2"
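The webhook_extra string can be assembled from a dict. A small sketch, assuming the format shown above is plain key:value pairs joined by commas with no escaping for literal ":" or "," inside values:

```python
def build_webhook_extra(extra):
    """Serialize {"k": "v", ...} into the "k:v,k2:v2" webhook_extra format."""
    return ",".join(f"{k}:{v}" for k, v in extra.items())


def parse_webhook_extra(s):
    """Inverse: split "k1:v1,k2:v2" back into a dict (empty string -> {})."""
    return dict(pair.split(":", 1) for pair in s.split(",")) if s else {}


extra = build_webhook_extra({"custom_key": "custom_value", "custom_key2": "custom_value2"})
```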

word_timestamps
boolean
default:false

Whether to include word and utterance level timestamps in the response

diarize
boolean
default:false

Whether to perform speaker diarization

age_detection
enum<string>
default:false

Whether to predict age group of the speaker

Available options:
true,
false
gender_detection
enum<string>
default:false

Whether to predict the gender of the speaker

Available options:
true,
false
emotion_detection
enum<string>
default:false

Whether to predict speaker emotions

Available options:
true,
false

Body

Raw audio bytes. Content-Type header should specify the audio format (e.g., audio/wav, audio/mp3). All parameters are passed as query parameters.

Response

Speech transcribed successfully

status
string

Status of the transcription request

Example:

"success"

transcription
string

The transcribed text from the audio file

Example:

"Hello world."

audio_length
number

Duration of the audio file in seconds

Example:

1.7

words
object[]

Word-level timestamps in seconds.

utterances
object[]

List of utterances with start and end times

age
enum<string>

Predicted age group of the speaker if requested

Available options:
infant,
teenager,
adult,
old
Example:

"adult"

gender
enum<string>

Predicted gender of the speaker if requested

Available options:
male,
female
Example:

"male"

emotions
object

Predicted emotions of the speaker if requested

metadata
object

Metadata about the transcription
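Putting the response fields together, here is a sketch that consumes a parsed response, using the example payload from the top of this page. Since age, gender, emotions, and words only appear when the corresponding detection flags were enabled, the optional fields are read defensively:

```python
# Example response payload from this page.
response = {
    "status": "success",
    "transcription": "Hello world.",
    "words": [
        {"start": 0, "end": 0.5, "speaker": "speaker_0", "word": "Hello"},
        {"start": 0.6, "end": 0.9, "speaker": "speaker_0", "word": "world."},
    ],
    "utterances": [
        {"text": "Hello world.", "start": 0, "end": 0.9, "speaker": "speaker_0"}
    ],
    "age": "adult",
    "gender": "male",
    "emotions": {"happiness": 0.8, "sadness": 0.15, "disgust": 0.02,
                 "fear": 0.03, "anger": 0.05},
    "metadata": {"filename": "audio.mp3", "duration": 1.7, "fileSize": 1000000},
}

text = response["transcription"]

# Highest-scoring emotion, or None if emotion_detection was off.
emotions = response.get("emotions") or {}
dominant_emotion = max(emotions, key=emotions.get) if emotions else None

# Distinct speaker labels from the word-level timestamps (requires diarize).
speakers = {w["speaker"] for w in response.get("words", [])}
```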